Video head processing method, apparatus, device, and storage medium

By using a video head processing model for global head reconstruction and dynamic synchronization, the problem of insufficient overall visual coordination and consistency in existing face-swapping technologies is solved, achieving seamless integration of the head and background and high-quality video generation.

CN122243787A · Pending · Publication Date: 2026-06-19 · GUANGZHOU HUYA TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU HUYA TECH CO LTD
Filing Date
2026-01-29
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing face-swapping technology suffers from poor overall visual coordination and insufficient video consistency, resulting in mismatches between facial features and face shape, abnormal hairline edges, and other issues that affect the realism of the video and the viewing experience.

Method used

A video head processing model is adopted, which combines face recognition, image coding, variational autoencoder and video generation model with head mask and facial expression features to achieve global reconstruction and dynamic synchronization of the head region. Denoising is performed using latent spatial representation to ensure seamless connection between the head and the background.

Benefits of technology

It achieves a seamless connection between the head and the background, ensuring the overall visual coordination and consistency of the head in the video, and meeting the needs of high-quality video content creation.



Abstract

This invention relates to the field of video processing technology, and discloses a video head processing method, apparatus, device, and storage medium. The video head processing method includes: acquiring model input data, wherein the model input data includes a target reference head image, a first video frame of target video data, facial key points corresponding to the first video frame, and a second video frame processed by a head mask; inputting the model input data into a pre-trained video head processing model for processing, and outputting a first latent spatial representation of the video frame with the replaced head region; inputting the first latent spatial representation into a decoder of a variational autoencoder for decoding, and obtaining a third video frame with the replaced head region. This invention achieves overall head replacement of human objects in a video, including facial features, face shape, and hairstyle, while maintaining the overall visual coordination and consistency of the dynamic video.

Description

Technical Field

[0001] This invention relates to the field of video processing technology, and in particular to a video head processing method, apparatus, device, and storage medium.

Background Technology

[0002] With the rapid development of digital image processing technology and computer vision, deep learning-based face replacement technology, commonly known as "face swapping," has been widely used in film and television production, short video entertainment, virtual reality, and digital humans. Most existing face swapping technologies learn the facial feature mapping between the source person (target identity) and the target person (original video) through training, and then accurately render the facial features (such as eyes, nose, and mouth) of the source person into the video sequence of the target person.

[0003] However, the existing face-swapping technology process is essentially limited to the reconstruction and fusion of the facial "feature areas," which leads to the following technical shortcomings: (1) Poor overall visual coordination. Existing solutions only replace the facial features of the person and cannot change the face shape, jawline, hairline, or overall hairstyle. This results in mismatches, unnatural boundaries, or visual incongruities between the replaced facial features and the original face shape and hairstyle; (2) Insufficient video consistency. In dynamic videos, existing solutions lack overall dynamic modeling of the complete head, which easily leads to problems such as shaking, misalignment, and abnormal hairline edges, affecting realism and viewing experience. Therefore, in application scenarios that require a complete change in the appearance of the person, the expressiveness and natural realism of existing solutions are clearly insufficient, making it difficult to meet the needs of high-quality video content creation.

Summary of the Invention

[0004] The main objective of this invention is to provide a video head processing method, apparatus, device, and storage medium, aiming to solve the technical problems of poor overall visual coordination and insufficient video consistency in existing face-swapping technologies.

[0005] The first aspect of the present invention provides a video head processing method, comprising: acquiring model input data, which includes a target reference head image, a first video frame of target video data, facial key points corresponding to the first video frame, and a second video frame processed by a head mask; inputting the model input data into a pre-trained video head processing model for processing, and outputting a first latent space representation of the video frame with the replaced head region; and inputting the first latent space representation into the decoder of a variational autoencoder for decoding, to obtain a third video frame with the replaced head region.

[0006] Optionally, in a first implementation of the first aspect of the present invention, the video head processing model includes: a face recognition model, an image encoder, a face encoder, a video generation model, a variational autoencoder, and a video denoising model. The step of inputting the model input data into a pre-trained video head processing model for processing and outputting a first latent spatial representation of the video frame with the replaced head region includes: The target reference head image is processed by the face recognition model, the image encoder, and the face encoder to obtain a head integrated code; the facial key points are processed by the video generation model to obtain facial expression features. The first video frame and the second video frame are encoded by the variational autoencoder to obtain a second latent space representation and a third latent space representation. The second latent space representation after noise injection is then concatenated with the third latent space representation to form a fused latent space representation. Using the head integration encoding and the facial expression features as conditions, the fused latent space representation is denoised by the video denoising model, and the first latent space representation of the video frame with the replaced head region is output.

[0007] Optionally, in a second implementation of the first aspect of the present invention, the step of processing the target reference head image through the face recognition model, the image encoder, and the face encoder to obtain the head integration code includes: After fusing the target reference head image with the head mask corresponding to the first video frame, the images are input into the face recognition model and the image encoder, respectively. The face recognition model is used to perform face recognition on the target reference head image to obtain the face identification information of the reference head image; The image encoder extracts head appearance features from the target reference head image to obtain head structure information of the target reference head image, including head outline, hairstyle and accessory outline, and head posture. The face identification information and the head structure information are input into the face encoder for normalization and integration to obtain a head integrated code that includes face identification and head appearance features.

[0008] Optionally, in a third implementation of the first aspect of the present invention, the step of encoding the first video frame and the second video frame respectively using the variational autoencoder to obtain a second latent space representation and a third latent space representation, and concatenating the noise-injected second latent space representation with the third latent space representation to form a fused latent space representation includes: The first video frame is input into the first encoder of the variational autoencoder for latent space encoding to obtain the second latent space representation of the first video frame. The second video frame is input into the second encoder of the variational autoencoder for latent space encoding to obtain the third latent space representation of the second video frame; Random noise is added to the second latent space representation, and the noise-added second latent space representation is concatenated with the third latent space representation to obtain a fused latent space representation.
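The noise injection and concatenation steps of this implementation can be sketched as follows. This is only an illustration under stated assumptions: latents are stored as channel-first nested lists, the noise is Gaussian with a fixed scale, and concatenation happens along the channel axis. The names `add_noise` and `fuse_latents` and the parameter `sigma` are hypothetical; the source does not specify the noise distribution, its schedule, or the concatenation axis.

```python
import random

def add_noise(latent, sigma, rng):
    """Return a noisy copy of `latent` (a list of channels, each a flat list of floats).

    Assumption: independent Gaussian noise with standard deviation `sigma`."""
    return [[v + rng.gauss(0.0, sigma) for v in channel] for channel in latent]

def fuse_latents(noisy_frame_latent, masked_frame_latent):
    """Concatenate two latents along the channel axis (assumed axis).

    Every channel must have the same number of spatial positions."""
    size = len(noisy_frame_latent[0])
    for channel in noisy_frame_latent + masked_frame_latent:
        if len(channel) != size:
            raise ValueError("latents must have matching spatial size")
    return noisy_frame_latent + masked_frame_latent

rng = random.Random(0)
second_latent = [[0.0] * 4 for _ in range(2)]  # stand-in for the encoded first video frame
third_latent = [[1.0] * 4 for _ in range(2)]   # stand-in for the encoded masked frame
fused = fuse_latents(add_noise(second_latent, 0.5, rng), third_latent)
```

In this sketch the fused representation has twice as many channels as either input, and the masked-frame channels are carried through unchanged, so the denoiser always sees the clean context outside the head region.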

[0009] Optionally, in a fourth implementation of the first aspect of the present invention, the training method used for the video head processing model includes: Acquire training samples, which include a reference head image, multiple first target video frames in time sequence, facial key points corresponding to each first target video frame, and second target video frames processed by head masking. The reference head image is processed using a face recognition model, an image encoder, and a face encoder to obtain a head integrated code; the facial key points are processed using a video generation model to obtain facial expression features. The first target video frame and the second target video frame are encoded by a variational autoencoder to obtain a fourth latent space representation and a fifth latent space representation. The fourth latent space representation after noise injection is then concatenated with the fifth latent space representation to form a fused latent space representation. Using the head integration encoding and the facial expression features as conditions, the fused latent space representation is denoised using a video denoising model to obtain the sixth latent space representation; Based on the fourth latent space representation and the sixth latent space representation, a target loss function is constructed, and the model parameters are updated based on the target loss function to obtain a trained video head processing model.

[0010] Optionally, in a fifth implementation of the first aspect of the present invention, the method for generating the training samples includes: acquiring multiple pieces of original video data containing human heads; performing video slicing and video frame extraction on each piece of original video data to obtain multiple sample video segments; performing facial key point detection on each video frame in each of the sample video segments to obtain the facial key points of the same target person in each video frame; performing pixel-level segmentation on the complete head region of the target person in each video frame to obtain a complete head mask containing the target person's face shape, hairstyle, forehead, and chin; and pairing and storing each video frame with its corresponding facial key points and head mask. During model training, one frame is selected from a video frame sequence as the reference head image, another frame from the same sequence is selected as the first target video frame, and the second target video frame is generated.
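The pairing and selection described in this implementation might look like the sketch below. It assumes each stored frame carries its key points and head mask, and that the reference and target frames are drawn uniformly at random from the same sequence; the source only says that one frame and another frame are selected. `FrameRecord` and `pick_reference_and_target` are hypothetical names.

```python
import random
from dataclasses import dataclass

@dataclass
class FrameRecord:
    """One stored frame with its paired annotations (field names are illustrative)."""
    frame_index: int
    keypoints: list
    head_mask: list

def pick_reference_and_target(records, rng):
    """Choose two distinct frames from the same sequence.

    The first is used as the reference head image, the second as the
    first target video frame. Uniform sampling is an assumption."""
    if len(records) < 2:
        raise ValueError("a sequence needs at least two frames")
    reference, target = rng.sample(records, 2)
    return reference, target

records = [FrameRecord(i, keypoints=[], head_mask=[]) for i in range(81)]
reference, target = pick_reference_and_target(records, random.Random(0))
```

Because both frames come from the same person in the same clip, the clean target frame can serve as supervision for the reconstructed head.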

[0011] Optionally, in a sixth implementation of the first aspect of the present invention, the step of constructing a target loss function based on the fourth latent space representation and the sixth latent space representation, and updating the model parameters based on the target loss function to obtain the trained video head processing model includes: Based on the fourth latent space representation and the sixth latent space representation, a target loss function including reconstruction loss is constructed; Based on the target loss function, the gradient of the current round of model training is calculated using the backpropagation algorithm; Based on the gradient, the trainable parameters in the face encoder, the video generation model, and the video denoising model are updated, wherein the parameters of the variational autoencoder, the face recognition model, and the image encoder are frozen respectively during model training. When the target loss function converges or reaches the preset number of training rounds, the model training is stopped, and the trained video head processing model is obtained.
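The split between frozen and trainable components, and the stopping rule, can be illustrated with a toy sketch. Each component is reduced to a single scalar parameter, the update is plain gradient descent, and convergence is tested by the change in loss between rounds; all of these are simplifying assumptions, and `sgd_step`, `should_stop`, `lr`, and `tol` are hypothetical names that do not come from the source.

```python
# Components named in the text; each is represented here by one scalar parameter.
FROZEN = {"variational_autoencoder", "face_recognition_model", "image_encoder"}
TRAINABLE = {"face_encoder", "video_generation_model", "video_denoising_model"}

def sgd_step(params, grads, lr):
    """Apply one gradient-descent update to trainable components only."""
    return {
        name: value if name in FROZEN else value - lr * grads[name]
        for name, value in params.items()
    }

def should_stop(loss_history, max_rounds, tol=1e-4):
    """Stop when the preset number of rounds is reached or the loss has converged.

    Convergence is approximated as a change smaller than `tol` between rounds."""
    if len(loss_history) >= max_rounds:
        return True
    return len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < tol

params = {name: 1.0 for name in FROZEN | TRAINABLE}
grads = {name: 1.0 for name in params}
updated = sgd_step(params, grads, lr=0.1)
```

In a real framework the same effect is usually obtained by excluding the frozen modules from the optimizer rather than by filtering names at update time.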

[0012] A second aspect of the present invention provides a video head processing apparatus, comprising: an acquisition module, used to acquire model input data, which includes a target reference head image, a first video frame of target video data, facial key points corresponding to the first video frame, and a second video frame processed by a head mask; a replacement module, used to input the model input data into a pre-trained video head processing model for processing, and output a first latent space representation of the video frame with the replaced head region; and a decoding module, used to input the first latent space representation into the decoder of a variational autoencoder for decoding, to obtain a third video frame with the replaced head region.

[0013] Optionally, in a first implementation of the second aspect of the present invention, the video head processing model includes: a face recognition model, an image encoder, a face encoder, a video generation model, a variational autoencoder, and a video denoising model, wherein the replacement module is specifically used for: The target reference head image is processed by the face recognition model, the image encoder, and the face encoder to obtain a head integrated code; the facial key points are processed by the video generation model to obtain facial expression features. The first video frame and the second video frame are encoded by the variational autoencoder to obtain a second latent space representation and a third latent space representation. The second latent space representation after noise injection is then concatenated with the third latent space representation to form a fused latent space representation. Using the head integration encoding and the facial expression features as conditions, the fused latent space representation is denoised by the video denoising model, and the first latent space representation of the video frame with the replaced head region is output.

[0014] Optionally, in a second implementation of the second aspect of the present invention, the replacement module is further configured to: After fusing the target reference head image with the head mask corresponding to the first video frame, the images are input into the face recognition model and the image encoder, respectively. The face recognition model is used to perform face recognition on the target reference head image to obtain the face identification information of the reference head image; The image encoder extracts head appearance features from the target reference head image to obtain head structure information of the target reference head image, including head outline, hairstyle and accessory outline, and head posture. The face identification information and the head structure information are input into the face encoder for normalization and integration to obtain a head integrated code that includes face identification and head appearance features.

[0015] Optionally, in a third implementation of the second aspect of the present invention, the replacement module is further configured to: The first video frame is input into the first encoder of the variational autoencoder for latent space encoding to obtain the second latent space representation of the first video frame. The second video frame is input into the second encoder of the variational autoencoder for latent space encoding to obtain the third latent space representation of the second video frame; Random noise is added to the second latent space representation, and the noise-added second latent space representation is concatenated with the third latent space representation to obtain a fused latent space representation.

[0016] Optionally, in a fourth implementation of the second aspect of the present invention, the video head processing apparatus further includes a training module, which is used to: acquire training samples, which include a reference head image, multiple first target video frames in time sequence, facial key points corresponding to each first target video frame, and second target video frames processed by head masking; process the reference head image using a face recognition model, an image encoder, and a face encoder to obtain a head integrated code; process the facial key points using a video generation model to obtain facial expression features; encode the first and second target video frames using a variational autoencoder to obtain a fourth latent space representation and a fifth latent space representation, and concatenate the noise-injected fourth latent space representation with the fifth latent space representation to form a fused latent space representation; using the head integrated code and the facial expression features as conditions, denoise the fused latent space representation using a video denoising model to obtain a sixth latent space representation; and, based on the fourth and sixth latent space representations, construct a target loss function and update the model parameters based on the target loss function to obtain a trained video head processing model.

[0017] Optionally, in a fifth implementation of the second aspect of the present invention, the training module is further configured to: acquire multiple pieces of original video data containing human heads; perform video slicing and video frame extraction on each piece of original video data to obtain multiple sample video segments; perform facial key point detection on each video frame in each of the sample video segments to obtain the facial key points of the same target person in each video frame; perform pixel-level segmentation on the complete head region of the target person in each video frame to obtain a complete head mask containing the target person's face shape, hairstyle, forehead, and chin; and pair and store each video frame with its corresponding facial key points and head mask. During model training, one frame is selected from a video frame sequence as the reference head image, another frame from the same sequence is selected as the first target video frame, and the second target video frame is generated.

[0018] Optionally, in a sixth implementation of the second aspect of the present invention, the training module is further configured to: Based on the fourth latent space representation and the sixth latent space representation, a target loss function including reconstruction loss is constructed; Based on the target loss function, the gradient of the current round of model training is calculated using the backpropagation algorithm; Based on the gradient, the trainable parameters in the face encoder, the video generation model, and the video denoising model are updated, wherein the parameters of the variational autoencoder, the face recognition model, and the image encoder are frozen respectively during model training. When the target loss function converges or reaches the preset number of training rounds, the model training is stopped, and the trained video head processing model is obtained.

[0019] A third aspect of the present invention provides a computer device, comprising: a memory and at least one processor, wherein the memory stores instructions; the at least one processor invokes the instructions in the memory to cause the computer device to perform the video head processing method described above.

[0020] A fourth aspect of the present invention provides a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the video head processing method described above.

[0021] The core idea of the technical solution provided by this invention lies in shifting from a local perspective of facial replacement to a global perspective of head reconstruction. It does not simply overlay the facial features of the source person onto the target person; instead, within a unified and controlled generation framework, it completely redraws the head for each frame based on the dynamic information of the person in each frame of the target video. This process relies on the video head processing model and its input data employed in the technical solution. The model's network architecture includes: a face recognition model, an image encoder, a face encoder, a variational autoencoder, a video generation model, and a video denoising model. The model input data includes a reference head image, the target video frame, and the corresponding head mask and facial key points for the target video frame.

[0022] (1) The purpose of processing the reference head image through a face recognition model is not only to identify the person's identity, but also to extract a highly condensed and discriminative identity feature vector. This vector ensures that the final generated head maintains a high degree of consistency with the source person in terms of facial feature layout and skeletal structure. This fundamentally solves the problem of incomplete facial contour replacement in traditional methods. The image encoder can capture rich information beyond facial features from the reference head image, such as the strands and contours of the hairstyle, the texture and luster of the skin, the plumpness and contour lines of the face, and even the style of accessories. Through the face encoder, the identity features and appearance features of the person in the reference head image are integrated and encoded to form a unified and information-complete head integrated code.

[0023] (2) The introduction of a head mask is a key inventive feature of this invention. It is no longer limited to the facial area, but precisely outlines the entire head (including the chin, jawline, and even part of the neck). During training, the model clearly understands through this mask that the generation task is limited to the area covered by the mask, and the areas outside (such as the background and body) must remain unchanged. This fundamentally eliminates the problems of seams, blurring, or unnatural blending at the facial edges caused by traditional face-swapping methods.

[0024] (3) The present invention introduces a variational autoencoder, which can compress high-dimensional image information into a low-dimensional latent space. The target video frame is encoded as a first latent space representation, and the head mask is encoded as a second latent space representation. These representations are then concatenated in the latent space to form a fused latent space representation. Concatenating the latent representation of the mask with the noisy latent representation of the target frame effectively defines a clear "working area" for the model before generation begins. During denoising generation, the model naturally "redraws" within this area, while information outside the area is preserved, thus achieving seamless integration of the head with the background and body.

[0025] (4) To ensure visual consistency, so that the replaced head can make expressions and movements that are fully synchronized with the target video and look natural, facial key points are introduced to accurately describe the facial posture and expression changes of the person in each frame of the target video. These key point sequences are processed by the video generation model to extract facial expression features. The resulting feature vector is not static; it is a dynamic instruction containing temporal information.

[0026] (5) In the denoising process of the video denoising model, head integration encoding and facial expression features are injected as dual conditions. Head integration encoding provides global semantic guidance related to identity and appearance for the generation process through a cross-attention mechanism; while facial expression features are injected through another attention mechanism to provide precise geometric constraints related to pose and expression for the generation process. This dual condition control achieves perfect decoupling and re-fusion of identity and action.
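The dual-condition injection can be pictured with a small attention sketch. It assumes single-head scaled dot-product attention without learned projections, residual connections around each pass, and that the expression features are also injected by cross-attention; the source states only that the head integrated code uses cross-attention and that the expression features use "another attention mechanism". The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Scaled dot-product attention of latent tokens over condition tokens.

    Keys and values are the raw context tokens (no learned projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out

def dual_condition_block(latent_tokens, head_code, expression_feats):
    """Two separate attention passes, one per condition, each with a residual add."""
    h = cross_attention(latent_tokens, head_code)
    x = [[a + b for a, b in zip(t, u)] for t, u in zip(latent_tokens, h)]
    e = cross_attention(x, expression_feats)
    return [[a + b for a, b in zip(t, u)] for t, u in zip(x, e)]

result = dual_condition_block(
    latent_tokens=[[0.0, 0.0]],
    head_code=[[1.0, 0.0], [1.0, 0.0]],
    expression_feats=[[0.0, 2.0]],
)
```

Keeping the two conditions in separate passes mirrors the idea that identity and appearance guidance is supplied independently of pose and expression guidance.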

[0027] (6) The loss function used in this invention does not directly compare pixels, but rather compares in the more fundamental latent space. It compares the third latent space representation generated by the model with the true, clean first latent space representation. This supervision method is more stable and can guide the model to learn deeper distribution patterns of the data, rather than just fitting pixel-level details. The objective loss function can be designed as a weighted sum of multiple components, including reconstruction loss and temporal consistency loss. Reconstruction loss ensures the generation quality of a single frame; temporal consistency loss forces the model to generate a smooth, flicker-free video stream by comparing the differences between adjacent frames. This multi-objective collaborative optimization ultimately ensures that the trained model can output high-quality, highly stable, and highly consistent video head-swapping results.
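A weighted latent-space loss of this kind could be written as in the sketch below. It assumes mean squared error for the reconstruction term and, for the temporal term, mean squared error between the frame-to-frame differences of the predicted and clean latents; the source does not give the exact formulas or weights, so `w_rec`, `w_temp`, and the function names are hypothetical. Each latent is a flat list of floats, and a sequence is a list of per-frame latents with at least two frames.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat latents."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def reconstruction_loss(pred, clean):
    """Average per-frame MSE between predicted and clean latents."""
    return sum(mse(p, c) for p, c in zip(pred, clean)) / len(pred)

def temporal_consistency_loss(pred, clean):
    """MSE between adjacent-frame differences of predicted and clean latents."""
    def diffs(seq):
        return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(seq, seq[1:])]
    dp, dc = diffs(pred), diffs(clean)
    return sum(mse(p, c) for p, c in zip(dp, dc)) / len(dp)

def target_loss(pred, clean, w_rec=1.0, w_temp=0.5):
    """Weighted sum of the two terms (weights are illustrative)."""
    return (w_rec * reconstruction_loss(pred, clean)
            + w_temp * temporal_consistency_loss(pred, clean))

clean = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
shifted = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # constant offset, same motion
loss_same = target_loss(clean, clean)
loss_shifted = target_loss(shifted, clean)
```

A constant offset is penalized only by the reconstruction term, because the frame-to-frame motion is unchanged; flicker between frames would raise the temporal term instead.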

[0028] This invention's technical solution achieves comprehensive control over identity and appearance through head integration encoding, precise definition of generation boundaries through head masking, and accurate driving of dynamic temporal sequences through facial expression features. Finally, these capabilities are solidified into the model through training with a multi-objective loss function in the latent space. Therefore, it can achieve overall head replacement including facial features, face shape, and hairstyle, ensuring overall visual coordination and consistency in dynamic videos, thus meeting the needs of high-quality video content creation in special scenarios.

Attached Figure Description

[0029] Figure 1 is a schematic diagram of one embodiment of the video head processing model training method in this invention; Figure 2 is a schematic diagram of one embodiment of the video head processing method in this invention; Figure 3 is a schematic diagram of one embodiment of the video head processing apparatus in this invention; Figure 4 is a schematic diagram of one embodiment of the computer device in this invention.

Detailed Implementation

[0030] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0031] For ease of understanding, the specific process of the embodiments of the present invention is described below. Referring to Figure 1, one embodiment of the video head processing model training method in this invention includes: 101. Obtain training samples, the training samples including a reference head image, multiple first target video frames in time sequence, facial key points corresponding to each first target video frame, and second target video frames processed by head masking. In this embodiment, the selection of training samples depends on the video head processing model. To achieve complete head replacement of characters in the video while maintaining the overall visual coordination and consistency of the dynamic video, the video head processing model in this embodiment is not trained as a single model; instead, a network composed of multiple models is used for model training.

[0032] In one embodiment, the preferred video head processing model includes: a face recognition model, an image encoder, a face encoder, a variational autoencoder, a video generation model, and a video denoising model. Accordingly, the training samples used in this embodiment include: (1) Reference head image: This is the sole carrier of the identity and appearance of the "source" person. Ideally, it is a high-quality, frontal, evenly lit, and naturally expressive high-resolution photograph or video frame of the source person. To enhance the robustness of the model, the training dataset should contain a large number of reference frames of different people, covering different ages, genders, races, skin colors, hairstyles, and face shapes. During training, the reference head image is a random frame selected from the video sequence, but there are no restrictions during inference.

[0033] (2) Multiple first target video frames with time sequence: This is the source of the "target" dynamic information, i.e., the original video in which the person whose head is being replaced is located. In order to capture rich dynamic changes, the target video should contain a variety of head postures (such as pitch, sway, and turn), a variety of facial expressions (such as joy, anger, sorrow, happiness, surprise, etc.), and different lighting environments. The video length should not be too short, and usually needs to contain several seconds to tens of seconds of continuous frames to ensure the integrity of the time sequence information.

[0034] (3) Facial landmarks: For each first target video frame, its corresponding facial landmarks need to be accurately generated. This process is usually completed by a pre-trained automated model. For example, facial landmark detection can use mature industry models, such as a 68-point detector, a 468-point ultra-high precision grid, or a deep learning model.

[0035] (4) Second target video frame: To facilitate the replacement of the head region in the first target video frame with the reference head image, the head region in the first target video frame needs to be masked out in advance. Specifically, a head mask for the first target video frame is generated first, and the generated head mask is then used to paint over the first target video frame, resulting in a second target video frame with the head region removed. Head mask generation is an image segmentation task. A dedicated U-Net model can be trained, or an existing semantic segmentation model can be used. The segmentation model performs pixel-level segmentation of the complete head region in each video frame, generating a complete head mask including face shape, hairstyle, forehead, jaw, etc. The generated head mask can be processed using an edge smoothing algorithm to ensure a natural transition in subsequent fusion.
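Painting over the head region with a mask can be sketched as below. The example assumes a single-channel frame stored as a nested list, a mask with values in [0, 1] where 1 marks the head, and a constant fill value for removed pixels; fractional mask values stand in for the result of edge smoothing. `apply_head_mask` and `fill` are hypothetical names, and the source does not say what the removed region is filled with.

```python
def apply_head_mask(frame, mask, fill=0.0):
    """Blank out the head region of a frame.

    A mask value of 1 removes the pixel, 0 keeps it, and fractional
    values (e.g. from edge smoothing) blend the pixel towards `fill`."""
    if len(frame) != len(mask) or any(len(fr) != len(mr) for fr, mr in zip(frame, mask)):
        raise ValueError("frame and mask must have the same shape")
    return [
        [p * (1.0 - m) + fill * m for p, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]

first_target_frame = [[10.0, 20.0], [30.0, 40.0]]
head_mask = [[0.0, 1.0], [0.5, 0.0]]
second_target_frame = apply_head_mask(first_target_frame, head_mask)
```

Pixels outside the mask pass through unchanged, which is what lets the background and body survive into the second target video frame.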

[0036] The step of generating training samples is included before step 101 described above. In an optional embodiment, the training sample generation method includes: Step 1: Obtain multiple raw video data sets containing the heads of people; To ensure the model's generalization ability, the raw video data must cover the widest possible range of human attributes, shooting environments, and dynamic scenes. Utilizing large-scale video datasets released by academia and industry with usage licenses is the most efficient approach. After acquiring different raw videos, rigorous screening must be performed to eliminate low-quality data.

[0037] In each original video data, key points are first extracted from all video frames using a face key point detection algorithm. Then, using the key point detection results and basic image quality assessment indicators (such as brightness and sharpness scores), video frames that do not contain complete faces or have missing key points are automatically filtered out. At the same time, invalid video segments with severe motion blur, occlusion, or resolution below a set threshold are further removed to ensure the quality of the training set.

[0038] Step 2: Perform video slicing and video frame extraction on each of the original video data to obtain multiple sample video segments; This step involves decomposing the lengthy raw video data into short video segments with independent semantics, suitable for model training, and discretizing these segments into a sequence of image frames. The goal of video slicing is to divide the long video into multiple short, continuous segments, each serving as an independent training sample unit. The sliced video remains a continuous data stream and needs to be further converted into a discrete sequence of image frames.

[0039] The original video data might be at 24fps, 30fps, or even 60fps, with highly similar adjacent frames. To reduce computational load and data redundancy, downsampling can be performed. For example, extracting one frame every four frames is equivalent to reducing the video's frame rate to one-quarter of the original. Using a random starting frame from each original video data set as a baseline, 81 consecutive frames are extracted to obtain a video segment. Through slicing and frame extraction, an original video is transformed into multiple sample video segments, each containing a continuous sequence of image frames. This method can increase data diversity while maintaining the continuity of video action, helping to improve the model's temporal modeling capabilities.
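The downsampling and clip extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function name `sample_clip` and its parameters are hypothetical, and the sketch assumes that the 81 frames are taken from the downsampled sequence (every fourth original frame) rather than from the original frame rate.

```python
import random

def sample_clip(num_frames, clip_len=81, stride=4, rng=None):
    """Return the original-frame indices of one training clip.

    Starting from a random frame, keep every `stride`-th frame until
    `clip_len` frames are collected. Returns None if the video is too short.
    """
    rng = rng or random.Random()
    # Number of original frames spanned by one downsampled clip.
    span = (clip_len - 1) * stride + 1
    if num_frames < span:
        return None
    start = rng.randrange(num_frames - span + 1)
    return [start + i * stride for i in range(clip_len)]

# Example: a 30-second video at 30 fps (900 frames).
indices = sample_clip(num_frames=900, rng=random.Random(0))
```

Calling `sample_clip` several times with different random starts yields multiple overlapping clips from the same source video, which is one way to obtain the data diversity the paragraph mentions.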

[0040] Step 3: Perform facial landmark detection on each video frame in each of the sample video segments to obtain the facial landmarks of the same target person in each video frame; The goal of this step is to extract precise facial geometric information for each frame, which is the core control signal driving subsequent dynamic changes in the head. Facial landmark detection technology is already very mature; choosing a suitable model is key to balancing accuracy and speed. For all frames in each video segment, the facial landmark detection algorithm is called to locate feature points, obtaining accurate facial contours and the coordinates of facial features. The landmark detection model typically outputs a confidence score. A threshold (e.g., 0.8) can be set; if the detection confidence score of a frame is lower than this threshold, the frame is marked as "invalid" and skipped in subsequent training to avoid introducing incorrect control signals.
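The confidence filtering in this step amounts to a simple threshold test. The sketch below assumes the detector returns one confidence score per frame; the function name `mark_valid_frames` is hypothetical.

```python
def mark_valid_frames(confidences, threshold=0.8):
    """Return one boolean per frame: True if the detection is kept."""
    # Frames whose confidence is lower than the threshold are marked invalid.
    return [score >= threshold for score in confidences]

scores = [0.95, 0.79, 0.80, 0.40]
valid = mark_valid_frames(scores)
kept = [i for i, ok in enumerate(valid) if ok]
```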

[0041] Step 4: Perform pixel-level segmentation on the complete head region of the target person in each video frame to obtain a complete head mask containing the target person's face shape, hairstyle, forehead, and chin; This step is the key difference between this embodiment and traditional face-swapping technology. Its goal is to generate an accurate "complete head mask" as a spatial "canvas" for the generation model. Preferably, a segmentation model is used to perform pixel-level segmentation of the complete head region in each video frame, generating a complete head mask that includes face shape, hairstyle, forehead, chin, etc. The generated head mask is then processed by an edge smoothing algorithm to ensure a natural transition in subsequent fusion.
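As a rough illustration of edge smoothing and of how a smoothed mask removes the head region from a frame, the following sketch feathers a binary mask with a box blur and then blanks the masked pixels. The box blur, the blur radius, and the function names `smooth_mask` and `remove_head` are assumptions chosen for brevity; the embodiment does not name a specific smoothing algorithm.

```python
import numpy as np

def smooth_mask(mask, radius=2):
    """Feather a binary (H, W) mask with a (2*radius+1)^2 box blur."""
    k = 2 * radius + 1
    padded = np.pad(mask.astype(np.float32), radius, mode="edge")
    out = np.zeros(mask.shape, dtype=np.float32)
    # Sum the k*k shifted copies of the mask, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out / (k * k)

def remove_head(frame, soft_mask):
    """Blank the head region of an (H, W, 3) frame using an (H, W) soft mask."""
    return frame * (1.0 - soft_mask[..., None])

mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:12, 4:12] = 1                       # toy square "head" region
soft = smooth_mask(mask)
frame = np.ones((16, 16, 3), dtype=np.float32)
masked_frame = remove_head(frame, soft)    # analogue of the second target video frame
```

Values of the soft mask strictly between 0 and 1 occur only near the mask boundary, which is what produces the gradual transition between generated head and original background.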

[0042] Step 5: Pair and store each video frame with its corresponding facial key points and head mask. During model training, select one frame from the same video frame sequence as the reference head image, select another frame as the first target video frame, and generate the second target video frame.

[0043] In this step, all video slices of each original video data are paired and stored with the corresponding facial key point sequences and head masks to form standardized data samples that can be directly used for training the overall head-swapping model.

[0044] 102. The reference head image is processed using a face recognition model, an image encoder, and a face encoder to obtain a head integrated code; the facial key points are processed using a video generation model to obtain facial expression features; This step is mainly used to extract and encode key information from multimodal input, providing precise instructions for the subsequent generation process.

[0045] (i) Head integrated code: The head integrated code is a comprehensive digital description of the source person's identity and appearance features. It is obtained through a face recognition model, an image encoder, and a face encoder.

[0046] (1) Face recognition model processing (identity encoding) Face recognition models significantly enhance the discriminative power of their features by introducing additive angular margin into the loss function, making them highly robust to variations in pose, lighting, and occlusion. A reference head image is input into the pre-trained face recognition model. The model typically consists of a backbone network (such as ResNet-100 or Inception-ResNet-v1) and a fully connected classification layer. The feature vector truncated before the fully connected layer is output; this vector is typically a 512-dimensional or 1024-dimensional floating-point vector. This vector highly condenses the unique identity information of the source person, i.e., the identity feature vector.

[0047] (2) Image encoder processing (appearance encoding) Image encoders not only understand objects themselves, but also their visual style, contextual relationships, and high-level semantics. The same reference head image is input into a pre-trained image encoder. The output is also a high-dimensional feature vector (e.g., ViT-L / 14 outputs 768 dimensions). This vector captures rich appearance information that face recognition models might overlook: the specific style and texture of the hairstyle, the gloss and texture of the skin, the contours of the face, and even makeup and accessories—in other words, an appearance feature vector.

[0048] (3) Fusion into the head integrated code: A face encoder (usually a convolutional neural network, such as ResNet or InceptionNet) integrates facial information and head structure information, allowing the network to learn facial features and hairstyle simultaneously. It concatenates the identity feature vector and the appearance feature vector along the feature dimension to obtain a fused vector.

[0049] In this embodiment, a face recognition model is used to extract identity features, while an image encoder is used to extract appearance features. Finally, a face encoder fuses the two to obtain a complete head integrated code that contains both face information and head structure information.

[0050] In an optional embodiment, generating the head integrated code in step 102 specifically includes: Step 1: After fusing the reference head image with the head mask, input the fused image into the face recognition model and the image encoder respectively; In this step, the purpose of fusing the reference head image with the head mask is to remove the original background interference, allowing the embedding of the reference head image to focus more on the head region. The head mask guides the subsequent encoding model to ignore irrelevant background information, concentrating all computational resources on the head region, thereby extracting cleaner and more representative features. The essence of fusion is to use the mask as a switch applied to the reference image. For example, element-wise multiplication can be used for fusion.

[0051] Step 2: Perform face recognition on the reference head image using the face recognition model to obtain the face identification information of the reference head image; The core task of this step is to extract the most stable and essential identity features from the reference head image. These features must be highly robust to changes in non-identity factors such as pose, lighting, expression, and age. A face recognition model is preferably used to perform face recognition on the reference head image.

[0052] Face recognition models typically use convolutional neural networks (CNNs) as the backbone, such as ResNet-100 or Inception-ResNet-v1. This backbone is responsible for extracting hierarchical features from the input image. The fused image is then fed into the pre-trained face recognition model. Data flows through the various convolutional layers, pooling layers, and residual blocks of the backbone. The final layer in the ArcFace model is usually a fully connected layer for classification. During inference, the output of this classification layer is not used; instead, the feature vector preceding the classification layer is extracted. This feature vector is the face identification information.

[0053] Due to ArcFace's training strategy, this feature vector is insensitive to changes in the pose, lighting, expression, and occlusion of the reference image. Regardless of whether the reference image shows a smiling face or a serious profile, the extracted identifiers should be highly similar. Therefore, this feature vector can serve as the identification information for the target person in the reference head image.

[0054] Step 3: Extract head appearance features from the reference head image using the image encoder to obtain head structure information of the reference head image, including head outline, hairstyle and accessory outline, and head posture; In step 2 above, the identity information of the target person in the reference head image was obtained through a face recognition model. In this step, the head appearance features of the target person in the reference head image are obtained through an image encoder.

[0055] The fused image is input into a pre-trained image encoder (such as ViT-L / 14). After a series of Transformer Blocks, the model outputs a final image feature vector. The vector output by the image encoder contains the head structure information, including: (1) Head contour: The model has seen countless face shapes (round face, square face, oval face) during training, so it encodes these contour information into specific dimensions or directions of the vector.

[0056] (2) Hairstyle and accessory outlines: long hair, short hair, curly hair, straight hair, ponytail, glasses, hats, earrings, etc. These are visual concepts that the image encoder can understand and encode. The vector contains rich information about the texture, length, color and shape of the hairstyle and accessories.

[0057] (3) Head pose: Although the image encoder is not a dedicated pose estimation model, its feature vectors implicitly contain information about head orientation (pitch, yaw, roll) because its training data contains heads in various poses.

[0058] (4) Skin texture and skin color: Skin characteristics such as smooth, rough, fair, and dark will also be encoded.

[0059] (5) Lighting and shadows: The lighting information in the image is also reflected in the vector, which helps to maintain the consistency of lighting and shadows during subsequent generation.

[0060] Step 4: Input the face identification information and the head structure information into the face encoder for normalization and integration to obtain a head integrated code that includes face identification and head appearance features.

[0061] The goal of this step is to fuse two feature vectors from different sources, with different dimensions and semantic focuses, into a unified, coordinated, and comprehensive final code. Using a face encoder, the face identifier information obtained in step 2 (e.g., 512 dimensions) and the head structure information obtained in step 3 (e.g., 768 dimensions) are normalized and integrated, and then the features are mapped to the final target dimension, such as 512 dimensions, forming a unified head integration code.
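The normalization and integration step can be sketched as L2-normalizing each vector, concatenating them, and projecting to the target dimension. Everything below is an assumption made for illustration: the class name `ToyFaceEncoder`, the choice of L2 normalization, and the single random linear projection standing in for the trainable face encoder (which the description says is usually a convolutional network).

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit length so both inputs have comparable magnitude."""
    return v / (np.linalg.norm(v) + eps)

class ToyFaceEncoder:
    """Hypothetical stand-in: normalize, concatenate, linearly project."""

    def __init__(self, id_dim=512, app_dim=768, out_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = id_dim + app_dim
        # Trainable in a real model; random here purely for illustration.
        self.weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
        self.bias = np.zeros(out_dim)

    def __call__(self, id_vec, app_vec):
        fused = np.concatenate([l2_normalize(id_vec), l2_normalize(app_vec)])
        return fused @ self.weight + self.bias

rng = np.random.default_rng(1)
identity = rng.standard_normal(512)      # face identification information
appearance = rng.standard_normal(768)    # head structure information
head_code = ToyFaceEncoder()(identity, appearance)   # shape (512,)
```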

[0062] (ii) Facial expression features Facial expression features encode dynamic information from a target video. The video generation model employs a multi-layer convolutional neural network to extract features from the input guiding information, "facial landmarks," and injects these features into the video denoising model. This ensures consistency between facial expressions and facial landmark sequences in each frame of the generated video. The model's output is a fixed-dimensional facial expression feature vector, which encapsulates the dynamic information of the entire video segment, including expression changes and head movement rhythm.

[0063] 103. The first target video frame and the second target video frame are encoded by a variational autoencoder to obtain a fourth latent space representation and a fifth latent space representation, and the fourth latent space representation after noise injection is concatenated with the fifth latent space representation to form a fused latent space representation; The core objective of this step is to transform the first target video frame and the second target video frame (with the head region removed) from the high-dimensional pixel space into an efficient latent space, and to introduce the noise necessary for training the diffusion model, ultimately forming a fused representation containing both "content" and "location" information, which serves as the input for the subsequent denoising process.

[0064] This embodiment transforms a first target video frame and a second video frame into a structured fusion latent space representation that includes content and location information and is subject to controllable noise. This fusion latent space representation is the core of the entire video head processing model training, providing a clear, specific, and challenging task for the subsequent video denoising model: under given head mask constraints, accurately remove noise from a noisy latent representation and reconstruct the head content that conforms to the dynamics of the target video.

[0065] In an optional embodiment, step 103 above further includes: 1031. Input the first target video frame into the first encoder of the variational autoencoder for latent space encoding to obtain the fourth latent space representation of the target video frame; This embodiment uses a Variational Auto-Encoder (VAE) compatible with mainstream diffusion models. The VAE consists of two encoders and one decoder. The target video frame is input into the VAE encoder. The encoder typically consists of a series of Conv2d, downsampling (e.g., convolutions with a stride of 2), ResNetBlock, and AttentionBlock. It downsamples a (3, H, W) image and encodes it into a low-dimensional latent representation.

[0066] For example, a 512x512 image might be encoded as a (4, 64, 64) tensor. The VAE encoder outputs two tensors: the mean *mu* and the log-variance *log_var*. The fourth latent space representation *z_0* is obtained by sampling with the formula *z_0 = mu + exp(0.5 * log_var) * epsilon* (where *epsilon* is random noise sampled from a standard normal distribution N(0,1)). *z_0* is a clean representation of the first target video frame in the latent space, with no diffusion noise added yet.
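The sampling formula above is the standard reparameterization trick. A minimal sketch, assuming the encoder has already produced `mu` and `log_var` with shape (4, 64, 64); the function name `reparameterize` is illustrative:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + exp(0.5 * log_var) * epsilon with epsilon ~ N(0, 1)."""
    epsilon = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * epsilon

rng = np.random.default_rng(0)
mu = np.zeros((4, 64, 64))
log_var = np.full((4, 64, 64), -20.0)   # very small variance: z stays close to mu
z0 = reparameterize(mu, log_var, rng)
```

Because `exp(0.5 * log_var)` is the standard deviation, a strongly negative `log_var` makes the sample nearly deterministic, while `log_var = 0` gives unit-variance samples around `mu`.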

[0067] 1032. Input the second target video frame into the second encoder of the variational autoencoder for latent space encoding to obtain the fifth latent space representation of the second target video frame; The second target video frame M_t (a single-channel image) is input into another independent VAE encoder. Similarly, the fifth latent space representation m can be obtained through the sampling formula. Its dimensions are the same as those of z_0, for example (4, 64, 64). This m serves as a reference for spatial constraints.

[0068] 1033. Add random noise to the fourth latent space representation, and then concatenate the noisy fourth latent space representation with the fifth latent space representation to obtain a fused latent space representation.

[0069] Training a diffusion model requires defining a noise schedule, which determines how much noise is added at each step t (from 1 to T, e.g., 1000 steps) of the forward diffusion process. Commonly used schedules include linear and cosine schedules. At each training step, a time step t is randomly sampled. Then, the noise-level coefficient alpha_t for that time step is calculated from the noise schedule; here alpha_t denotes the cumulative fraction of signal retained up to step t. A noise epsilon with the same shape as z_0 is sampled from a standard normal distribution. According to the forward diffusion formula z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * epsilon, the noisy latent representation z_t at time step t is calculated.

[0070] The noisy latent representation z_t is concatenated with the latent representation m of the second target video frame along the channel dimension. For example, if z_t is (4, 64, 64) and m is (4, 64, 64), then the concatenated fused latent spatial representation is (8, 64, 64). The purpose of this operation is to initially fuse "what to draw" (the noisy target content) and "where to draw" (mask constraints) before they enter the core generative model, thereby providing strong spatial prior information for the subsequent denoising process.
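The noise injection and channel-wise concatenation can be sketched as below. The linear schedule endpoints (`beta_start`, `beta_end`) are common defaults assumed for illustration, not values from the description, and the function names are hypothetical. The array `alpha_bar` plays the role of alpha_t in the formula, and time steps are 0-indexed in the code.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention coefficients for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * epsilon."""
    epsilon = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * epsilon
    return z_t, epsilon

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))   # stand-in for the clean target latent
m = rng.standard_normal((4, 64, 64))    # stand-in for the masked-frame latent
t = int(rng.integers(0, 1000))          # randomly sampled time step
z_t, epsilon = add_noise(z0, t, alpha_bar, rng)
fused = np.concatenate([z_t, m], axis=0)   # (8, 64, 64): "what" plus "where"
```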

[0071] 104. Using the head integration encoding and the facial expression features as conditions, the fused latent space representation is denoised using a video denoising model to obtain a sixth latent space representation; This step can be viewed as a condition-guided process of "carving" the target head out of noise. The core of this step is the video denoising model (i.e., the diffusion model), whose backbone network is a 3D U-Net or a 2D U-Net with temporal attention layers. The 3D U-Net uses 3D convolutional kernels, enabling simultaneous processing of spatial and temporal dimensions; while the U-Net with temporal attention adds Transformer modules to each or specific layers of the 2D U-Net to facilitate information exchange in the temporal dimension.

[0072] (1) Model input and condition injection The main input to the video denoising model is the fusion latent space representation (8, 64, 64) generated in step 103, and also includes a time-step condition: the current denoising step number t is converted into a time-step vector through a time-step embedding module. This vector is added to the feature map of each layer of the diffusion model to inform the model which stage of denoising it is currently in.
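The description does not specify how the time-step embedding module is built. One widely used choice, shown here purely as an assumption, is a sinusoidal embedding; the function name, dimension, and `max_period` value are illustrative.

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Map an integer time step to a vector of cosines and sines."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to about 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(500)   # shape (128,)
```

In a full model this vector would typically pass through a small learned network before being added to the feature maps of each layer.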

[0073] Injection of the head integrated code is the most crucial global condition in video denoising. It is achieved through a cross-attention mechanism. For example, the head integrated code vector (e.g., 512-dimensional) is projected through linear layers into K (Key) and V (Value) vectors. The feature maps of the intermediate layers of the diffusion model (projected through another linear layer) serve as Q (Query). Attention(Q, K, V) is calculated, and the result is added back to the original feature map. This allows the diffusion model to "query" the identity and appearance information of the source person when generating content at each spatial location, ensuring the accuracy of the generated content. Facial expression feature injection is a dynamic condition in the video denoising process.
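The cross-attention injection can be sketched with plain matrix operations. This is a single-head illustration under stated assumptions: the 512-dimensional head code is reshaped into 4 tokens of 128 dimensions (with a single token, the softmax would be trivially 1), and all weight matrices and sizes are random placeholders rather than values from the description.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, cond_tokens, w_q, w_k, w_v, w_o):
    """features: (N, C) flattened spatial positions; cond_tokens: (M, D)."""
    q = features @ w_q                 # (N, d) queries from the feature map
    k = cond_tokens @ w_k              # (M, d) keys from the head code
    v = cond_tokens @ w_v              # (M, d) values from the head code
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M)
    return features + (attn @ v) @ w_o               # residual add to the feature map

rng = np.random.default_rng(0)
C, D, d, N = 32, 128, 64, 256
features = rng.standard_normal((N, C))
head_tokens = rng.standard_normal(512).reshape(4, D)   # assumed token layout
w_q = rng.standard_normal((C, d)) * 0.1
w_k = rng.standard_normal((D, d)) * 0.1
w_v = rng.standard_normal((D, d)) * 0.1
w_o = rng.standard_normal((d, C)) * 0.1
out = cross_attention(features, head_tokens, w_q, w_k, w_v, w_o)
```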

[0074] (2) Noise reduction process: The diffusion model receives all the model inputs and conditional injections described above. Its task is to predict the noise epsilon added at time step t. The model's output is a predicted noise epsilon_theta with the same shape as the noisy latent representation z_t. Video denoising is an iterative process. It starts with pure noise z_T and repeats T times: the current z_t and conditional inputs are given to the diffusion model to obtain the predicted noise epsilon_theta. Then, z_{t-1} for the previous time step is calculated using the formula provided by the scheduler. After T denoising steps, the resulting z_0 is the final generated sixth latent space representation.

[0075] 105. Based on the fourth latent space representation and the sixth latent space representation, construct a target loss function, update the model parameters based on the target loss function, and obtain a trained video head processing model.

[0076] This step mainly focuses on the learning and optimization of the video head processing model. Its core objective is to define a clear optimization target (loss function), utilize efficient optimization algorithms (backpropagation and parameter update), and adopt specific training strategies (freezing and fine-tuning) to ultimately enable the video head processing model to have the ability to replace heads with high quality.

[0077] In an optional embodiment, step 105 above further includes: 1051. Based on the fourth latent space representation and the sixth latent space representation, construct a target loss function that includes reconstruction loss; The target loss function is typically composed of multiple weighted components to balance different aspects of generation quality. These include reconstruction loss and temporal consistency loss. Considering that the video frame data used in this embodiment of the invention inherently contains spatiotemporal information, reconstruction loss is preferably used as the target loss function.

[0078] The reconstruction loss (L_rec) measures the difference between the model-generated sixth latent space representation z_pred and the true fourth latent space representation z_real; essentially, it directly compares the model's predicted latent representation with the true latent representation. L1 loss (mean absolute error) is typically used because it is more robust to outliers than L2 loss (mean squared error) and produces sharper edges. The formula is: L_rec = ||z_pred - z_real||_1. The role of the reconstruction loss is to provide supervision in the latent space, which is more stable than in the pixel space, and to guide the model in learning the deep structure and distribution of the data.
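A small numerical sketch of the reconstruction loss follows. It uses the mean absolute error, which differs from the L1 norm in the formula only by a constant factor (the number of elements); the function names are illustrative. The single outlier shows why L1 is less sensitive to large individual errors than L2.

```python
import numpy as np

def l1_reconstruction_loss(z_pred, z_real):
    """Mean absolute error between predicted and true latents."""
    return float(np.mean(np.abs(z_pred - z_real)))

def l2_reconstruction_loss(z_pred, z_real):
    """Mean squared error, shown only for comparison."""
    return float(np.mean((z_pred - z_real) ** 2))

z_real = np.zeros((4, 64, 64))
z_pred = z_real.copy()
z_pred[0, 0, 0] = 100.0          # one outlier element
l1 = l1_reconstruction_loss(z_pred, z_real)
l2 = l2_reconstruction_loss(z_pred, z_real)   # grows with the square of the outlier
```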

[0079] 1052. Based on the target loss function, calculate the gradient of the model training in this round using the backpropagation algorithm; This step is central to model learning, used to calculate how to adjust model parameters to improve performance next time based on feedback from the loss function. Backpropagation is an efficient implementation of the chain rule on a computational graph. Starting with the loss function, it calculates the partial derivatives (gradients) of the loss function with respect to each trainable parameter (weight W and bias b) in the model layer by layer. The gradient points in the direction in which the loss function increases most rapidly. The opposite direction is the direction in which the loss function decreases most rapidly. Therefore, by fine-tuning the parameters along the opposite direction of the gradient, the loss can be reduced.

[0080] 1053. Based on the gradient, update the trainable parameters in the face encoder, the video generation model, and the video denoising model, wherein the parameters of the variational autoencoder, the face recognition model, and the image encoder are frozen respectively during model training. This step is the model optimization phase, which uses the gradient calculated in the previous step to actually update the model parameters. Preferably, an optimizer is used to update the model parameters based on the gradient. This embodiment employs a freeze and fine-tuning training strategy.

[0081] The frozen parameters include those of the variational autoencoder, the face recognition model, and the image encoder. These models are pre-trained on massive datasets and have already learned how to efficiently encode images, recognize identities, and understand visual semantics, so relearning is unnecessary. Furthermore, freezing the parameters of these large models means that their gradients do not need to be calculated during backpropagation and their parameters do not need to be modified during parameter updates. This significantly reduces memory usage and computation time, making it possible to train such complex models on consumer-grade GPUs. In addition, fine-tuning these parameters might disrupt the general capabilities learned by the pre-trained models, leading to overfitting on the specific task and decreased generalization ability. The fine-tuned parameters (i.e., the trainable parameters) include those of the face encoder, the video generation model, and the video denoising model.
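The freeze and fine-tune strategy can be illustrated with a toy parameter update that skips the frozen groups. Plain gradient descent is assumed here for simplicity (the description only says "an optimizer"), and the group names and the `sgd_step` helper are hypothetical.

```python
import numpy as np

FROZEN = {"vae", "face_recognition", "image_encoder"}
TRAINABLE = {"face_encoder", "video_generation", "video_denoising"}

def sgd_step(params, grads, lr=1e-4, frozen=FROZEN):
    """Return updated parameters; frozen groups are passed through unchanged."""
    return {
        name: (w if name in frozen else w - lr * grads[name])
        for name, w in params.items()
    }

params = {name: np.ones(3) for name in FROZEN | TRAINABLE}
# Gradients exist only for trainable groups; frozen groups never need them.
grads = {name: np.ones(3) for name in TRAINABLE}
new_params = sgd_step(params, grads, lr=0.1)
```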

[0082] 1054. When the target loss function converges or reaches the preset number of training rounds, stop model training and obtain the trained video head processing model.

[0083] This embodiment uses one of the following two conditions to determine when to stop model training.

[0084] Condition 1: The objective loss function converges Over several consecutive training epochs, the value of L_total no longer decreases significantly, but fluctuates within a small range. This indicates that the model has found a local or global optimum, and further training will not yield much benefit. During training, in addition to calculating the training loss, the validation loss is also periodically calculated on an independent validation set. If the validation loss does not set a new minimum record within N consecutive epochs (e.g., 10 epochs), training can be terminated early. This effectively prevents the model from overfitting on the training set.

[0085] Condition 2: Reach the preset number of training rounds Set a maximum number of training epochs in advance, such as 1000 epochs. Regardless of whether the model converges, training stops once this number of epochs is reached. This is a hard limit that prevents the training process from running indefinitely and also facilitates the planning and reproduction of experiments.
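The two stopping conditions can be combined in a single check, sketched below. It assumes one validation loss is recorded per epoch; the function name `should_stop` and its defaults (patience of 10 epochs, at most 1000 epochs) mirror the examples in the text but are otherwise illustrative.

```python
def should_stop(val_losses, patience=10, max_epochs=1000):
    """Decide whether to stop, given the validation loss of every epoch so far."""
    epoch = len(val_losses)
    if epoch >= max_epochs:            # Condition 2: hard limit reached
        return True
    if epoch <= patience:              # not enough history yet
        return False
    best_before = min(val_losses[:-patience])
    # Condition 1: no new minimum within the last `patience` epochs.
    return min(val_losses[-patience:]) >= best_before

improving = [1.0 / (i + 1) for i in range(50)]
stalled = [1.0, 0.5] + [0.6] * 10
```

With these inputs, `should_stop(improving)` reports that training should continue, while `should_stop(stalled)` triggers early stopping because the best loss (0.5) has not been beaten for 10 epochs.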

[0086] When the termination condition for model training is met, the trained model parameters need to be saved for subsequent inference. These saved model parameters, together with the frozen pre-trained model, constitute a complete, well-trained video head-swapping model, which can be loaded into the inference environment to perform high-quality video head-swapping tasks.

[0087] This embodiment employs a standard deep learning training process. In each iteration, steps 101-104 are executed to complete a forward propagation, calculating L_total. Then, backpropagation is performed to calculate the gradient of L_total with respect to all trainable parameters in the model. The variational autoencoder, the face recognition model, and the image encoder are pre-trained on large-scale data, and their parameters are highly stable. When training the video head processing model, the parameters of these models are frozen: their gradients are not calculated, and their parameters are not updated. The main training targets are the video denoising model, the video generation model, and the face encoder. This fine-tuning strategy can significantly reduce training costs and data requirements while leveraging the powerful prior knowledge of the pre-trained models.

[0088] By repeating steps 101-105 above until the loss function converges or the preset number of training rounds is reached, a well-trained video head processing model capable of high-quality overall head replacement is finally obtained.

[0089] This embodiment achieves comprehensive control over identity and appearance through the head integrated code, precise definition of generation boundaries through the head mask, and accurate driving of dynamic temporal sequences through facial expression features. Finally, by training with a multi-objective loss function in the latent space, these capabilities are solidified into the model. Therefore, it can realize overall head replacement including facial features, face shape, and hairstyle, and ensure the overall visual coordination and consistency in dynamic videos, meeting the needs of high-quality video content creation in special scenarios.

[0090] Referring to Figure 2, one embodiment of the video head-swapping method in this invention includes: 201. Obtain model input data, wherein the model input data includes a target reference head image, a first video frame of the target video data, the facial key points corresponding to the first video frame, and a second video frame after head masking. In this embodiment, the video head-swapping process is completed by a trained video head-swapping model. Before performing the video head-swapping process, the following model input data needs to be obtained: (1) Target reference head image: This image contains a complete head image of the first target person, which is used to replace the complete head area of the second target person in the video. This image is preferably uploaded by the user.

[0091] (2) The first video frame of the target video data: The target video data contains a second target person, and the first video frame contains a complete head image of the second target person. The target video data consists of a time-series sequence of the first video frames.

[0092] (3) Facial key points: Describe in detail the facial expressions, head posture, mouth shape and expression changes of the person in the target video frame. By extracting facial key points for facial expression features and injecting them into the video denoising model, the facial expressions and facial key point sequences of each frame in the generated video result are consistent.

[0093] (4) Second video frame: Before generating the second video frame, a head mask for the first video frame needs to be generated first. The head mask is used to define the area in the first video frame where the head will be replaced, ensuring that the generated content blends seamlessly with the original background. Then, the generated head mask is used to paint over the first video frame to obtain the second video frame with the head area removed.

[0094] In one embodiment, the first video frame of the target video data, as well as the head mask and facial key points corresponding to the first video frame, are preferably generated in the following manner: 1) Obtain the original video data of the target person, including their head; 2) Perform video slicing and video frame extraction on the original target video data to obtain sample video segments; 3) Perform facial landmark detection on each video frame in the sample video segment to obtain the facial landmarks of the same target person in each video frame; 4) Perform pixel-level segmentation on the complete head region of the target person in each video frame to obtain a complete head mask containing the target person's face shape, hairstyle, forehead, and jaw. 5) Pair and store each video frame with its corresponding facial key points and head mask.

[0095] In this optional embodiment, the generation method of the video frame and the corresponding head mask and facial key points is the same as the generation method of the training samples in the above embodiment.

[0096] 202. Input the model input data into a pre-trained video head processing model for processing, and output the first latent space representation of the video frame with the replaced head region; The purpose of this step is to achieve conditional image generation in the latent space using a trained video head processing model. The video head processing model specifically consists of a face recognition model, an image encoder, a face encoder, a video generation model, a variational autoencoder, and a video denoising model. This video head-swapping process includes the following stages: Phase 1: The target reference head image is processed by a face recognition model, an image encoder, and a face encoder to generate a head integration code, which integrates face information and head structure information, so that the model can learn facial features and hairstyle at the same time.

[0097] Phase 2: The facial key points are processed by the video generation model to obtain facial expression features, which are then injected into the video denoising model to ensure that the facial expressions in each frame of the generated video are consistent with the facial key point sequence.

[0098] Phase 3: The first and second video frames are each encoded into the latent space by the variational autoencoder, and the resulting representations are concatenated into a fused latent space representation. The fused latent space representation is then iteratively denoised by the video denoising model, conditioned on the head integrated code and the facial expression features, to generate the latent space representation of the video frames after head swapping.

[0099] During model training, noise is added to the latent space representations of the video frames so that the model learns to predict that noise. During model inference, by contrast, the process starts from pure noise and removes the noise step by step.

[0100] First, sample a random noise tensor with the same shape as the target latent space representation from the standard normal distribution N(0, 1) as the initial z_T (T is the total number of denoising steps, for example 1000).

[0101] Second, run the denoising loop from t=T down to t=1: a. Construct the fused latent space representation: At each step t, the current noise tensor z_t and the latent space representation z_m of the head mask are concatenated along the channel dimension to obtain z_joint_t.

[0102] b. Conditional denoising: Input z_joint_t into the trained video denoising model. At the same time, inject the head integrated code and the facial expression features as conditional inputs.

[0103] c. Predicted noise: The diffusion model outputs a predicted noise epsilon_pred_t.

[0104] d. Update the latent state: Using the scheduler of the diffusion model, compute a slightly cleaner latent representation z_{t-1} for the previous time step from z_t, epsilon_pred_t, and the coefficients associated with time step t.

[0105] e. Loop: Repeat steps a to d until t=1.

[0106] Output: z_0 obtained after the loop ends is the final, clean latent space representation that integrates the source character's identity and the target video's dynamics.
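The loop in steps a to e can be sketched as follows. This is a minimal sketch under stated assumptions: the scheduler is taken to be a deterministic DDIM-style update with a linear beta schedule, latents are flat lists of floats, and list concatenation stands in for concatenation along the channel dimension. The embodiment does not specify the scheduler, and the names `alpha_bars`, `denoise`, `oracle`, and `conditions` are illustrative. The `oracle` predictor is a toy stand-in for the trained video denoising model, used only so that the loop can be checked end to end.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Latent = List[float]


def alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t); index 0 is the clean state (1.0)."""
    bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        bars.append(bars[-1] * (1.0 - beta))
    return bars


def denoise(
    z_m: Sequence[float],
    predict_noise: Callable[[Latent, int, Dict[str, object]], Latent],
    conditions: Dict[str, object],
    num_steps: int,
    rng: random.Random,
) -> Latent:
    bars = alpha_bars(num_steps)
    z_t = [rng.gauss(0.0, 1.0) for _ in z_m]            # initial z_T ~ N(0, 1)
    for t in range(num_steps, 0, -1):                    # t = T, ..., 1
        z_joint_t = z_t + list(z_m)                      # a. fuse z_t with z_m
        eps = predict_noise(z_joint_t, t, conditions)    # b./c. conditional noise prediction
        a_t, a_prev = bars[t], bars[t - 1]
        # d. assumed DDIM-style update: estimate the clean latent, then re-noise to t-1
        x0 = [(z - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for z, e in zip(z_t, eps)]
        z_t = [math.sqrt(a_prev) * x + math.sqrt(1 - a_prev) * e for x, e in zip(x0, eps)]
    return z_t                                           # z_0


NUM_STEPS = 50
clean = [0.5, -1.0, 2.0]       # toy "ground truth" latent known only to the oracle
_bars = alpha_bars(NUM_STEPS)


def oracle(z_joint_t: Latent, t: int, cond: Dict[str, object]) -> Latent:
    """Toy predictor: returns the exact noise for the known clean latent (ignores cond)."""
    z = z_joint_t[: len(clean)]
    return [(zi - math.sqrt(_bars[t]) * c) / math.sqrt(1 - _bars[t]) for zi, c in zip(z, clean)]


conditions = {"head_code": None, "expression_features": None}  # placeholders
z_0 = denoise([0.0, 0.0, 0.0], oracle, conditions, NUM_STEPS, random.Random(0))
```

With a perfect noise predictor the loop recovers the clean latent; a trained model only approximates this, which is why many small steps are used in practice.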

[0107] 203. The first latent space representation is input into the decoder of the variational autoencoder for decoding to obtain the third video frame with the replaced head region.

[0108] In this embodiment, the video head processing model in step 202 replaces the head region of the target person in each video frame of the target video data with a reference head image, thereby achieving overall head replacement including facial features, face shape and hairstyle.

[0109] This embodiment achieves comprehensive control over identity and appearance through the head integrated code, precise definition of the generation boundary through the head mask, and accurate driving of the dynamic timing through the facial expression features. As a result, the video head processing model achieves overall head replacement including facial features, face shape, and hairstyle, and ensures overall visual coordination and consistency in dynamic videos, meeting the needs of high-quality video content creation in special scenarios.

[0110] Referring to Figure 3, one embodiment of the video head processing device in this invention includes: an acquisition module 301, used to acquire model input data, which includes a target reference head image, a first video frame of target video data, facial key points corresponding to the first video frame, and a second video frame processed by a head mask; a replacement module 302, used to input the model input data into a pre-trained video head processing model for processing, and output the first latent space representation of the video frame with the replaced head region; and a decoding module 303, used to input the first latent space representation into the decoder of the variational autoencoder for decoding to obtain the third video frame with the replaced head region.

[0111] Optionally, in one embodiment, the video head processing model includes a face recognition model, an image encoder, a face encoder, a video generation model, a variational autoencoder, and a video denoising model, and the replacement module 302 is specifically used to: process the target reference head image with the face recognition model, the image encoder, and the face encoder to obtain a head integrated code; process the facial key points with the video generation model to obtain facial expression features; encode the first video frame and the second video frame respectively with the variational autoencoder to obtain a second latent space representation and a third latent space representation, and concatenate the noise-injected second latent space representation with the third latent space representation to form a fused latent space representation; and, using the head integrated code and the facial expression features as conditions, denoise the fused latent space representation with the video denoising model and output the first latent space representation of the video frame with the replaced head region.

[0112] Optionally, in one embodiment, the replacement module 302 is further configured to: After fusing the target reference head image with the head mask corresponding to the first video frame, the images are input into the face recognition model and the image encoder, respectively. The face recognition model is used to perform face recognition on the target reference head image to obtain the face identification information of the reference head image; The image encoder extracts head appearance features from the target reference head image to obtain head structure information of the target reference head image, including head outline, hairstyle and accessory outline, and head posture. The face identification information and the head structure information are input into the face encoder for normalization and integration to obtain a head integrated code that includes face identification and head appearance features.
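The normalization and integration step can be pictured with the following minimal sketch. It assumes, purely for illustration, that the face identification information and the head structure information are plain feature vectors, that normalization is L2 normalization, and that integration is concatenation. The embodiment describes the face encoder as a trainable model, so this is a simplified stand-in and not the actual encoder; the names `l2_normalize` and `integrate_head_code` are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def integrate_head_code(identity: Sequence[float], structure: Sequence[float]) -> List[float]:
    """Assumed integration: normalize each feature vector, then concatenate them."""
    return l2_normalize(identity) + l2_normalize(structure)


# Toy feature vectors (assumptions), standing in for the outputs of the
# face recognition model and the image encoder respectively.
face_identity = [3.0, 4.0]
head_structure = [0.0, 5.0]
head_code = integrate_head_code(face_identity, head_structure)
```

Normalizing before combining keeps either feature source from dominating the code simply because its values are on a larger scale.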

[0113] Optionally, in one embodiment, the replacement module 302 is further configured to: The first video frame is input into the first encoder of the variational autoencoder for latent space encoding to obtain the second latent space representation of the first video frame. The second video frame is input into the second encoder of the variational autoencoder for latent space encoding to obtain the third latent space representation of the second video frame; Random noise is added to the second latent space representation, and the noise-added second latent space representation is concatenated with the third latent space representation to obtain a fused latent space representation.
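The noise injection and concatenation can be sketched as follows. This is a minimal sketch, assuming the standard diffusion forward-noising form z_noisy = sqrt(alpha_bar) * z + sqrt(1 - alpha_bar) * noise, which the embodiment does not spell out, and representing each latent as a list of channels with flattened spatial values. The names `add_noise`, `concat_channels`, and `alpha_bar` are illustrative.

```python
import math
import random
from typing import List, Tuple

Channels = List[List[float]]  # latent as [channel][flattened spatial position]


def add_noise(latent: Channels, alpha_bar: float, rng: random.Random) -> Tuple[Channels, Channels]:
    """Mix Gaussian noise into a latent; returns (noisy latent, the noise used)."""
    noise = [[rng.gauss(0.0, 1.0) for _ in ch] for ch in latent]
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [[keep * v + mix * n for v, n in zip(ch, nch)] for ch, nch in zip(latent, noise)]
    return noisy, noise


def concat_channels(first: Channels, second: Channels) -> Channels:
    """Concatenate two latents along the channel dimension."""
    sizes = {len(ch) for ch in first} | {len(ch) for ch in second}
    if len(sizes) != 1:
        raise ValueError("all channels must have the same spatial size")
    return [list(ch) for ch in first] + [list(ch) for ch in second]


rng = random.Random(0)
second_latent = [[0.1, 0.2], [0.3, 0.4]]   # toy latent of the first video frame
third_latent = [[0.5, 0.6], [0.7, 0.8]]    # toy latent of the head-masked second video frame
noisy_second, noise = add_noise(second_latent, alpha_bar=0.5, rng=rng)
fused = concat_channels(noisy_second, third_latent)
```

Only the latent of the first video frame is noised; the latent of the masked frame is concatenated clean so that the background information reaches the denoising model intact.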

[0114] Optionally, in one embodiment, the video head processing device further includes a training module 304, used to: acquire training samples, which include a reference head image, multiple first target video frames in time sequence, facial key points corresponding to each first target video frame, and second target video frames processed by a head mask; process the reference head image using a face recognition model, an image encoder, and a face encoder to obtain a head integrated code; process the facial key points using a video generation model to obtain facial expression features; encode the first and second target video frames respectively using a variational autoencoder to obtain a fourth latent space representation and a fifth latent space representation, and concatenate the noise-injected fourth latent space representation with the fifth latent space representation to form a fused latent space representation; using the head integrated code and the facial expression features as conditions, denoise the fused latent space representation using a video denoising model to obtain a sixth latent space representation; and construct a target loss function based on the fourth and sixth latent space representations, and update the model parameters based on the target loss function to obtain a trained video head processing model.

[0115] Optionally, in one embodiment, the training module 304 is further configured to: Acquire multiple raw video data sets containing human heads; Each of the original video data is sliced and extracted into video frames to obtain multiple sample video segments; Facial landmark detection is performed on each video frame in each of the aforementioned sample video segments to obtain the facial landmarks of the same target person in each video frame; Pixel-level segmentation is performed on the complete head region of the target person in each video frame to obtain a complete head mask containing the target person's face shape, hairstyle, forehead, and chin; Each video frame is paired and stored with its corresponding facial key points and head mask. During model training, one frame is selected from the same video frame sequence as the reference head image, another frame is selected as the first target video frame, and the second target video frame is generated.
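The frame selection described in the last sentence can be sketched as follows. This is a minimal sketch, assuming that the two frames are drawn uniformly at random and must be distinct; the sampling strategy is not specified in the embodiment, and the name `pick_reference_and_target` is hypothetical.

```python
import random
from typing import Tuple


def pick_reference_and_target(num_frames: int, rng: random.Random) -> Tuple[int, int]:
    """Pick two distinct frame indices from the same video frame sequence."""
    if num_frames < 2:
        raise ValueError("need at least two frames to pick a reference and a target")
    reference_idx, target_idx = rng.sample(range(num_frames), 2)
    return reference_idx, target_idx


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
reference_idx, target_idx = pick_reference_and_target(16, rng)
```

Because both frames come from the same person, the target frame serves as ground truth for what the reference head should look like under a different pose and expression.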

[0116] Optionally, in one embodiment, the training module 304 is further configured to: construct a target loss function including a reconstruction loss based on the fourth latent space representation and the sixth latent space representation; calculate the gradient of the current round of model training based on the target loss function using the backpropagation algorithm; update the trainable parameters in the face encoder, the video generation model, and the video denoising model based on the gradient, wherein the parameters of the variational autoencoder, the face recognition model, and the image encoder are kept frozen during model training; and stop the model training when the target loss function converges or the preset number of training rounds is reached, thereby obtaining the trained video head processing model.
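The update rule and stopping condition can be illustrated with a deliberately tiny stand-in. This is a minimal sketch, assuming a mean squared error reconstruction loss and plain gradient descent with an analytic gradient in place of backpropagation. The two scalar parameters are toy substitutes: `frozen_scale` stands in for the frozen components (variational autoencoder, face recognition model, image encoder) and `trainable_weight` for the trainable ones. All names, the learning rate, and the tolerance are assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Reconstruction loss between the predicted and the reference latent."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train(
    inputs: Sequence[float],
    targets: Sequence[float],
    lr: float = 0.02,
    max_rounds: int = 500,
    tol: float = 1e-12,
) -> Tuple[Dict[str, float], float, int]:
    params = {"frozen_scale": 2.0, "trainable_weight": 0.0}
    frozen = {"frozen_scale"}            # never updated during training
    prev_loss = float("inf")
    loss = prev_loss
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        s, w = params["frozen_scale"], params["trainable_weight"]
        preds: List[float] = [s * w * x for x in inputs]
        loss = mse(preds, targets)
        if abs(prev_loss - loss) < tol:  # loss has converged
            break
        prev_loss = loss
        # Analytic gradient of the loss with respect to the trainable weight only.
        grads = {
            "trainable_weight": sum(
                2.0 * (p - t) * s * x for p, t, x in zip(preds, targets, inputs)
            ) / len(inputs)
        }
        for name, grad in grads.items():
            if name not in frozen:
                params[name] -= lr * grad
    return params, loss, rounds


# Toy data: targets = 3 * inputs, so the optimum is frozen_scale * weight = 3.
params, final_loss, rounds_used = train([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
```

Training ends either when the loss stops changing by more than `tol` or when `max_rounds` is reached, mirroring the two stopping conditions named above.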

[0117] Since the embodiments of the device correspond to the embodiments of the above method, reference should be made to the above method embodiments for the description of the video head processing device provided by the present invention, which will not be repeated here. The device has the same beneficial effects as the above video head processing method.

[0118] Figure 4 is a schematic diagram of the structure of a computer device 500 provided in an embodiment of the present invention. The computer device 500 can vary significantly depending on its configuration or performance. It may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) for storing application programs 533 or data 532. The memory 520 and the storage media 530 can be temporary or persistent storage. The program stored in the storage media 530 may include one or more modules (not shown in the figure), each module including a series of instruction operations on the computer device 500. Furthermore, the processor 510 may be configured to communicate with the storage media 530 and execute the series of instruction operations in the storage media 530 on the computer device 500.

[0119] The computer device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will understand that the computer device structure shown in Figure 4 does not constitute a limitation on the computer device, which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

[0120] The present invention also provides a computer device including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the video head processing method in the above embodiments. The present invention also provides a computer-readable storage medium, which can be a non-volatile or a volatile computer-readable storage medium, the computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the steps of the video head processing method.

[0121] Those skilled in the art will clearly understand that, for convenience and brevity, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, devices, and units described above, which will not be repeated here.

[0122] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0123] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A video head processing method, characterized in that the video head processing method includes: acquiring model input data, which includes a target reference head image, a first video frame of target video data, facial key points corresponding to the first video frame, and a second video frame processed by a head mask; inputting the model input data into a pre-trained video head processing model for processing, and outputting a first latent space representation of the video frame with the replaced head region; and inputting the first latent space representation into a decoder of a variational autoencoder for decoding to obtain a third video frame with the replaced head region.

2. The video head processing method according to claim 1, characterized in that the video head processing model includes: a face recognition model, an image encoder, a face encoder, a video generation model, a variational autoencoder, and a video denoising model; and the step of inputting the model input data into the pre-trained video head processing model for processing, and outputting the first latent space representation of the video frame with the replaced head region, includes: processing the target reference head image with the face recognition model, the image encoder, and the face encoder to obtain a head integrated code; processing the facial key points with the video generation model to obtain facial expression features; encoding the first video frame and the second video frame respectively with the variational autoencoder to obtain a second latent space representation and a third latent space representation, and concatenating the noise-injected second latent space representation with the third latent space representation to form a fused latent space representation; and, using the head integrated code and the facial expression features as conditions, denoising the fused latent space representation with the video denoising model and outputting the first latent space representation of the video frame with the replaced head region.

3. The video head processing method according to claim 2, characterized in that the step of processing the target reference head image with the face recognition model, the image encoder, and the face encoder to obtain the head integrated code includes: after fusing the target reference head image with the head mask corresponding to the first video frame, inputting the result into the face recognition model and the image encoder respectively; performing face recognition on the target reference head image with the face recognition model to obtain face identification information of the reference head image; extracting head appearance features from the target reference head image with the image encoder to obtain head structure information of the target reference head image, including head outline, hairstyle and accessory outline, and head posture; and inputting the face identification information and the head structure information into the face encoder for normalization and integration to obtain a head integrated code that includes face identification and head appearance features.

4. The video head processing method according to claim 2, characterized in that the step of encoding the first video frame and the second video frame respectively with the variational autoencoder to obtain a second latent space representation and a third latent space representation, and concatenating the noise-injected second latent space representation with the third latent space representation to form a fused latent space representation, includes: inputting the first video frame into the first encoder of the variational autoencoder for latent space encoding to obtain the second latent space representation of the first video frame; inputting the second video frame into the second encoder of the variational autoencoder for latent space encoding to obtain the third latent space representation of the second video frame; and adding random noise to the second latent space representation, and concatenating the noise-added second latent space representation with the third latent space representation to obtain the fused latent space representation.

5. The video head processing method according to any one of claims 2-4, characterized in that the training method of the video head processing model includes: acquiring training samples, which include a reference head image, multiple first target video frames in time sequence, facial key points corresponding to each first target video frame, and second target video frames processed by a head mask; processing the reference head image using a face recognition model, an image encoder, and a face encoder to obtain a head integrated code; processing the facial key points using a video generation model to obtain facial expression features; encoding the first target video frame and the second target video frame respectively with a variational autoencoder to obtain a fourth latent space representation and a fifth latent space representation, and concatenating the noise-injected fourth latent space representation with the fifth latent space representation to form a fused latent space representation; using the head integrated code and the facial expression features as conditions, denoising the fused latent space representation using a video denoising model to obtain a sixth latent space representation; and constructing a target loss function based on the fourth latent space representation and the sixth latent space representation, and updating the model parameters based on the target loss function to obtain a trained video head processing model.

6. The video head processing method according to claim 5, characterized in that the method for generating the training samples includes: acquiring multiple sets of original video data containing human heads; performing slicing and video frame extraction on each set of original video data to obtain multiple sample video segments; performing facial key point detection on each video frame in each of the sample video segments to obtain the facial key points of the same target person in each video frame; performing pixel-level segmentation on the complete head region of the target person in each video frame to obtain a complete head mask containing the target person's face shape, hairstyle, forehead, and chin; and pairing and storing each video frame with its corresponding facial key points and head mask, wherein, during model training, one frame is selected from the same video frame sequence as the reference head image, another frame is selected as the first target video frame, and the second target video frame is generated.

7. The video head processing method according to claim 5, characterized in that the step of constructing a target loss function based on the fourth and sixth latent space representations, and updating the model parameters based on the target loss function to obtain the trained video head processing model, includes: constructing a target loss function including a reconstruction loss based on the fourth latent space representation and the sixth latent space representation; calculating the gradient of the current round of model training based on the target loss function using the backpropagation algorithm; updating the trainable parameters in the face encoder, the video generation model, and the video denoising model based on the gradient, wherein the parameters of the variational autoencoder, the face recognition model, and the image encoder are kept frozen during model training; and stopping the model training when the target loss function converges or the preset number of training rounds is reached, thereby obtaining the trained video head processing model.

8. A video head processing device, characterized in that the video head processing device includes: an acquisition module, used to acquire model input data, which includes a target reference head image, a first video frame of target video data, facial key points corresponding to the first video frame, and a second video frame processed by a head mask; a replacement module, used to input the model input data into a pre-trained video head processing model for processing, and output the first latent space representation of the video frame with the replaced head region; and a decoding module, used to input the first latent space representation into the decoder of the variational autoencoder for decoding to obtain the third video frame with the replaced head region.

9. A computer device, characterized in that the computer device includes: a memory and at least one processor, wherein the memory stores instructions; and the at least one processor invokes the instructions in the memory to cause the computer device to perform the video head processing method as described in any one of claims 1-7.

10. A computer-readable storage medium storing instructions thereon, characterized in that, when the instructions are executed by a processor, the video head processing method as described in any one of claims 1-7 is implemented.