Real-time augmentation of target faces
The system addresses real-time face swapping challenges by employing GPU parallel processing and predictive tracking to efficiently generate and overlay target faces that match input user expressions and movements, minimizing lag and achieving seamless integration.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- DEEP VOODOO LLC
- Filing Date
- 2024-05-17
- Publication Date
- 2026-06-11
AI Technical Summary
Existing technologies face challenges in manipulating target face images to mimic the expressions and movements of input users in real-time video frames, especially when multiple users are present, due to computational complexity and lag issues.
A system utilizing GPUs for parallel processing and predictive face tracking, combined with pre-generated target face models, to generate and overlay target face representations that match the expressions and movements of input users in real-time, minimizing lag through efficient computation and alignment.
The system achieves real-time face swapping with minimal lag, allowing seamless integration of target faces that mimic the expressions and movements of input users, even with multiple users, by optimizing computational efficiency and using predictive alignment.
Description
【Background Art】
【0001】 This operation typically runs offline, involving a significant amount of calculation to manipulate the first image of the first face to match the expression of the second face in the second image. Therefore, it is difficult to manipulate the target face image so as to mimic the current face image of the input user in a video frame in real time and display the operation without a significant lag in the expression and movement of the current face of the user. Further, the manipulation of the target face image to match the image of the face of the input user is even more complicated when there are multiple input user faces present in the video frame. In particular, since the input user rotates or moves over multiple video frames, it is difficult to manipulate each target user's face image to match the current expression / rotation of the corresponding input user's face.
【Brief Description of the Drawings】
【0002】 The following detailed description and the accompanying drawings disclose various embodiments of the present invention.
【0003】 [Figure 1] A diagram showing an embodiment of a system for performing real-time augmentation of one or more target faces.
【0004】 [Figure 2] A diagram showing an example of a face augmentation system according to some embodiments.
【0005】 [Figure 3] A flowchart showing an embodiment of a process for real-time augmentation of a target face.
【0006】 [Figure 4] A flowchart showing an embodiment of a process for real-time augmentation of multiple target faces.
【0007】 [Figure 5] A diagram illustrating an example of a pipeline that generates a composite video frame containing representations of target faces corresponding to n input user faces detected within the input recorded video frames.
【0008】 [Figure 6] A flowchart illustrating an example of a process for detecting an input user face in a recorded video frame, according to several embodiments.
【0009】 [Figure 7] A flowchart illustrating an example of a process for detecting face landmarks corresponding to input user faces detected within recorded video frames, according to several embodiments.
【0010】 [Figure 8] A flowchart illustrating an example of a process for generating cropped images and alignment information sets corresponding to each input user face in a recorded video frame, according to several embodiments.
【0011】 [Figure 9] A diagram showing several examples of cropped images with aligned input user faces generated from each video frame.
【0012】 [Figure 10] A flowchart illustrating an example of a process for identifying an input user face in a recorded video frame, according to several embodiments.
【0013】 [Figure 11] A flowchart illustrating an example of a process for generating a 2D image of a target face corresponding to an input user face detected from recorded video frames, according to several embodiments.
【0014】 [Figure 12] A diagram showing an example of a target face and an example of a related mask associated with that target face.
【0015】 [Figure 13] A flowchart showing an example of a process for overlaying a 2D image of a target face over an input user face detected within a recorded video frame.
【0016】 [Figure 14A] A diagram showing a first example of an original video frame and a corresponding composite video frame.
【0017】 [Figure 14B] A diagram showing a second example of an original video frame and a corresponding composite video frame.
【0018】 [Figure 15] A diagram showing an example of a pipeline that utilizes predicted face positions to generate a composite video frame that includes representations of target faces corresponding to n input user faces detected in an input recorded video frame.
【0019】 [Figure 16] A flowchart showing an example of a process for generating a cropped image of an input user face from a current video frame using predicted face positions, according to some embodiments.
【MODE FOR CARRYING OUT THE INVENTION】
【0020】 The present invention can be implemented in various forms, including processes, apparatus, systems, compositions of matter, computer program products embodied on computer-readable storage media, and / or processors (such as processors configured to execute instructions stored in and / or provided by memory connected to the processor). In this specification, these embodiments, or any other form the present invention may take, may be referred to as "techniques." Generally, the order of the steps of the disclosed processes may be modified within the scope of the invention. Unless otherwise specified, components such as processors or memory described as configured to perform a task may be implemented as general components temporarily configured to perform the task at a given time, or as specific components manufactured to perform the task. In this specification, the term "processor" refers to one or more devices, circuits, and / or processing cores configured to process data such as computer program instructions.
【0021】 The following provides a detailed description of one or more embodiments of the present invention, with reference to drawings illustrating the principles of the present invention. While the present invention is described in relation to such embodiments, it is not limited to any of these embodiments. The scope of the present invention is limited only by the claims, and the present invention includes many alternatives, variations, and equivalents. The following description includes many specific details to provide a thorough understanding of the present invention. These details are provided as examples, and the present invention can be implemented in accordance with the claims without some or all of these specific details. For simplicity, technical matters well known in the art related to the present invention are not described in detail, so as not to obscure the present invention unnecessarily.
【0022】 Embodiments of real-time augmentation of target faces are described herein. A set of user face features corresponding to the faces of input users in recorded video frames is acquired. In some embodiments, the “input user” is a user whose face is recorded in video and whose facial expressions / movements are mapped to a selected target face. In various embodiments, the “target face” includes faces pre-generated by a model. For example, the target face may be a celebrity, a well-known person, or a portrait of another individual / avatar previously generated by an operable machine learning model. In some embodiments, for each video frame in the recording, the faces of different input users appearing in the video frame are first detected, and for each detected face of a single input user in the video frame, a corresponding representation of the target face corresponding to that input user is generated and overlaid on the recorded video frame. For the faces of input users detected in the recorded video frame, face features including a predetermined set of face landmarks may be determined from the faces of input users in the recorded video frame. For example, landmarks may include the corners of the eyes, corners of the mouth, ends of the eyebrows, and tip of the nose, and may be represented as coordinates in three-dimensional (3D) space. A set of user facial features associated with the input user is used to generate a cropped image containing the input user's face from the original recorded video frame. In some embodiments, the cropped image of the input user's face from the recorded video frame is generated to include standardized user facial dimensions and face alignment. In some embodiments, the cropped image is generated with a set of transformation information describing how to transform the cropped image so that the displayed input user's face can be restored to how it appeared in the recorded video frame. 
In various embodiments, a target face is selected to be manipulated to match the facial expression of the input user. For example, the target face belongs to a target individual (e.g., a known person, a celebrity, a computer avatar). In various embodiments, a target face model (e.g., a machine learning model) is pre-generated using training data including cropped images of the target face, where such cropped images have the same standardized dimensions and alignment as the input user's cropped images. This pre-generated target face model is used to encode at least a portion of the cropped image of the input user face into a plurality of user extrinsic features. In various embodiments, encoding the cropped image of the input user face includes extracting extrinsic features (represented as vectors) from the cropped image (e.g., extrinsic features are features related to facial movement and expression, as opposed to skin color). The target face swap model and the plurality of user extrinsic features are used to generate a representation of the target face. In various embodiments, the extrinsic features of the input user face are transformed into corresponding features of the target face. The transformed features of the target face are then combined to output a representation of the target face, which includes a two-dimensional (2D) image of the target face showing the expression of the input user face in the recorded video frame. This representation of the target face is then overlaid on the position of the input user face in the recorded video frame. As described herein, by processing each detected input user face within each recorded video frame, and overlaying each representation of the target face onto consecutive recorded video frames, an output video of the target face can be obtained that mimics the facial expressions and movements of the input user in the recorded input video.
Thus, the output composite video frame shows the input user's face replaced with that of the corresponding target face.
【0023】 Embodiments of real-time augmentation of multiple target faces are described herein. A first input user face and a second input user face are detected within a recorded video frame. In some embodiments, the first and second input user faces were known before detection. In some embodiments, the first and second input user faces were unknown before detection and were instead selected by an operator within a recorded video frame. If two or more input user faces are detected within a recorded video frame, and each such input user face is mapped to a corresponding target face, it is necessary to track the input user faces as they rotate or move within the recorded video so that the expressions of the input users can be accurately mapped to their respective target faces. After the first and second input user faces are detected, a first identifier is associated with the first input user face, and a second identifier is associated with the second input user face. In some embodiments, each identifier for each input user face is an operator input value. A first mapping between the first identifier and the first target face is stored, and a second mapping between the second identifier and the second target face is stored. For example, the selection of a corresponding target face can be made for each input user face or an associated identifier of an input user face. Then, a corresponding set of user face features is determined separately for each input user face and used to generate a cropped image of the input user face so that, as described above, the recorded video frame contains an overlay of the target face representation corresponding to the input user face identifier.
If the recorded video frame contains two or more input user faces, each such input user face in the recorded video frame is overlaid with the corresponding representation of its mapped target face, and each such representation mimics the facial expression of the corresponding input user face in the recorded video frame. The effect is that the output video frame shows the face of each input user replaced with that of the corresponding target face.
【0024】 As will be detailed later, if two or more input users are detected within a recorded video frame, an efficient processor (e.g., a graphics processing unit (GPU)) can be used to process the representation of each target face in parallel for each input user. Furthermore, in some embodiments, alignment information of input user faces from (recent) historical recorded video frames can be used to predict the position of face features in newly recorded video frames, which ultimately leads to faster computation of the representation of the target face corresponding to the input user face. In particular, such optimizations reduce the lag between recording a video frame and producing the version of that video frame overlaid with the representation of the target face. The result is a system that takes a video recording showing at least one input user face as input and outputs the same video recording in real time (e.g., with minimal lag), except that each input user face is replaced by a target face that mimics the facial movements of that input user face.
【0025】 Figure 1 shows one embodiment of a system for performing real-time augmentation of one or more target faces. As shown in Figure 1, the system 100 includes a camera 102, input users 104 and 106, a face augmentation system 108, and a display 110.
【0026】 Camera 102 is configured to record a video recording that includes at least one video frame showing the face of one or more input users.
Although Figure 1 shows only two input users (104 and 106), in actual embodiments, camera 102 can record video showing a single input user or two or more input users. While the video is being recorded by camera 102, input users 104 and 106 may change their facial expressions, position, and / or head rotation. In some embodiments, camera 102 is configured to produce video frames / images in high resolution (e.g., 1080p, which is 1,920 pixels horizontally by 1,080 pixels vertically) or higher resolution.
【0027】 The face augmentation system 108 is configured to acquire, from camera 102, recorded video frames of a video recording of input user 104 and / or input user 106. In some embodiments, the face augmentation system 108 is configured to receive the recorded video frames while the video recording is in progress. In some embodiments, the face augmentation system 108 runs a specially configured driver customized to acquire image data for each video frame directly from camera 102 and store it directly in the memory of one or more GPUs in a manner that bypasses the central processing unit (CPU) of the face augmentation system 108. In some embodiments, the GPUs of the face augmentation system 108 are configured to compute in parallel, so that the computation completes efficiently, each representation of a target face corresponding to the face of one or more input users appearing in each video frame (e.g., an image of a target face manipulated so that the target face matches the expression / rotation of the corresponding input user face).
【0028】 The face augmentation system 108 is configured to first detect input user faces appearing within recorded video frames of a video recording.
In some embodiments, prior to detection, the face augmentation system 108 receives a selection of the number of input user faces to detect (and ultimately face-swap with each specified target face) within each video frame. The selection may indicate a single input user face or any specified number of input user faces. If the face augmentation system 108 receives a selection indicating the detection of a single input user face, in various embodiments, the face augmentation system 108 is configured to detect the input user face that appears largest (e.g., the input user face enclosed by the largest bounding box) within the video frame if two or more input user faces are present within the video frame. Alternatively, if the face augmentation system 108 receives a selection to detect a specified number of input user faces (two or more), in various embodiments, the face augmentation system 108 is configured to detect up to the specified number of input user faces (e.g., the input user faces enclosed by the largest bounding boxes) in the video frame if there are more input user faces in the video frame than the specified number.
【0029】 Regardless of the number of input user faces detected by the face augmentation system 108 within a video frame, the face augmentation system 108 is configured to process each input user face separately for that video frame in order to perform face swapping (for example, generating a representation of a target face corresponding to the identifier of that input user face). In some embodiments, if two or more input user faces are detected within a video frame, the face augmentation system 108 is configured to process the two or more input user faces within the video frame at least partially in parallel to generate a representation of each of the target faces.
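
The face-count selection described in paragraph 【0028】 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: it assumes bounding boxes arrive as (x_min, y_min, x_max, y_max) pixel tuples and that "largest" means largest box area. The names `box_area` and `select_largest_faces` are hypothetical.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a bounding box; degenerate boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def select_largest_faces(boxes: List[Box], max_faces: int) -> List[int]:
    """Return indices of up to max_faces boxes, largest area first."""
    if max_faces < 1:
        raise ValueError("max_faces must be at least 1")
    order = sorted(range(len(boxes)),
                   key=lambda i: box_area(boxes[i]),
                   reverse=True)
    return order[:max_faces]


# Three detected faces; the operator asked for one face, then for two.
detected = [(10, 10, 60, 60), (100, 40, 300, 280), (400, 50, 480, 150)]
print(select_largest_faces(detected, 1))  # single-face selection
print(select_largest_faces(detected, 2))  # two-face selection
```

Returning indices instead of boxes lets later stages keep any per-detection data (for example, landmark sets) aligned with the selected faces.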
For each input user face detected within a video frame, the face augmentation system 108 is configured to determine the facial features of the input user face within the video frame. Examples of facial features include the coordinates in three-dimensional (3D) space of facial landmarks such as the outer corners of the eyes, the corners of the mouth, the tip of the nose, the bridge of the nose, and the ends of the eyebrows. After the facial features have been determined for the input user face within the video frame, the face augmentation system 108 is configured to determine alignment information from the facial features and generate a cropped image of the input user face based on the alignment information. In some embodiments, “alignment information” describes the position (e.g., rotation and translation around a given center) and / or scale of the input user’s head in 3D space. The input user’s face in the video frame is then oriented in a standardized orientation (e.g., upright), and at least some of the facial features (e.g., eyes) are aligned to a specified position and scaled to a specified dimension within a cropped image (which is generated to have the specified / standardized dimensions). Thus, cropped images of input user faces obtained from different video frames (related to the same or different input users) have standardized dimensions, and each represents a face with the same orientation where the facial features (e.g., eyes) are located in the same place within the cropped image.
【0030】 The face augmentation system 108 is configured to determine a user face identifier (ID) associated with a cropped image of each detected input user face in a video frame. In various embodiments, the face augmentation system 108 is configured to prompt the operator to specify whether the input user face to which the user face ID is assigned was previously "known" to the system.
An input user face that is "known" to the face augmentation system 108 is one for which a corresponding signature (e.g., a mathematical embedding previously generated based on a known image of that user) is stored / available in the face augmentation system 108. Conversely, an input user face that is "unknown" to the face augmentation system 108 is one for which no such signature is stored / available in the face augmentation system 108. If the face augmentation system 108 receives an indication that the input user was previously known, the face augmentation system 108 is configured to identify a cropped image of the input user in the video frame based on a stored signature corresponding to the known input user (e.g., assign a user face ID to the cropped image). Conversely, if the face augmentation system 108 receives an indication that the input user was not previously known, in some embodiments, the face augmentation system 108 is configured to present a cropped image of each detected input user face in the user interface and then receive an operator selection of which detected input user face is associated with which user face ID and should therefore proceed to further processing (e.g., manipulation of the corresponding target face). The face augmentation system 108 then stores the cropped image corresponding to each selected input user face as a reference image to be used to detect the same / similar input user faces in video frames subsequently received from the same video recording. Each user face ID is mapped to a target face ID. In some embodiments, the mapping between each user face ID and the corresponding target face ID may be determined by an operator in the user interface of the face augmentation system 108 or may be pre-stored.
Each target face ID available for mapping to a user face ID is a target face ID for which a corresponding target face swap model has been pre-generated.
【0031】 The face augmentation system 108 is configured to generate a representation of a target face corresponding to each detected input user face in a video frame. In various embodiments, the face augmentation system 108 is configured to determine a face ID corresponding to each input user face. The face augmentation system 108 is then configured to determine the target face ID to which that face ID is mapped and to retrieve the pre-generated target face swap model associated with that target face ID. In various embodiments, the target face swap model includes a model generated based on multiple cropped images of the target individual's face, where each cropped image undergoes the same alignment process used to obtain the cropped image of the input user face. The retrieved target face swap model takes in the cropped image of the input user face and then outputs a representation of a target face that matches the input user face in the cropped image. Specifically, in some embodiments, the output representation of the target face includes a 2D image of the target face that mimics the expression and orientation of the input user face shown in the cropped image, as will be detailed later. In some embodiments, the 2D image includes RGB channels together with an alpha channel mask. The alpha channel mask describes which parts of the manipulated target face's 2D image will be transparent and to what extent.
【0032】 For example, this 2D image of the target face with a mask applied includes only the target face and does not include the hair and body. The face augmentation system 108 is then configured to modify / deform the representation of the target face based on the alignment information (e.g., rotation, translation, and scale) previously used to determine the cropped image of the input user face.
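
As a brief illustration of the alpha channel mask introduced in paragraph 【0031】, the sketch below blends an RGBA face image over an RGB frame using standard "over" compositing. It is a pure-Python toy operating on nested lists, not the GPU implementation the description refers to; the pixel layout, the alpha range of 0 to 1, and the names `blend_pixel` and `overlay` are assumptions.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]  # assumed: alpha in [0, 1], 0 = transparent


def blend_pixel(frame_px: RGB, face_px: RGBA) -> RGB:
    """Standard 'over' blend of one face pixel onto one frame pixel."""
    r, g, b, a = face_px
    return tuple(a * f + (1.0 - a) * bg
                 for f, bg in zip((r, g, b), frame_px))


def overlay(frame: List[List[RGB]], face: List[List[RGBA]],
            top: int, left: int) -> List[List[RGB]]:
    """Return a copy of frame with face blended in at (top, left)."""
    out = [list(row) for row in frame]
    for dy, face_row in enumerate(face):
        for dx, face_px in enumerate(face_row):
            y, x = top + dy, left + dx
            # Skip face pixels that fall outside the frame.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = blend_pixel(out[y][x], face_px)
    return out


# 2x2 black frame; 1x2 white face patch, opaque then half transparent.
frame = [[(0, 0, 0), (0, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
face = [[(255, 255, 255, 1.0), (255, 255, 255, 0.5)]]
composite = overlay(frame, face, top=0, left=0)
print(composite[0])
```

Pixels where the mask is 0 leave the original frame untouched, which is how hair and body from the recorded frame remain visible around the swapped face.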
By modifying / deforming the representation of the target face based on the alignment information, the target face is oriented / scaled to match the orientation / scale of the input user face in the video frame. The face augmentation system 108 is configured to composite the modified representation of the target face with the original video frame. Compositing the modified representation of the target face with the original video frame includes overlaying the modified representation of the target face on the position corresponding to each input user face in the original video frame. The face augmentation system 108 is configured to output the composite video frame containing the modified representation of the target face to the display 110, where the composite video frame is presented. The display 110 (for example, a monitor) includes a device having a screen capable of displaying video and other media.
【0033】 The pipeline described above covers the processing involved in generating a representation of a target face corresponding to each detected input user face within a single video frame, but in some embodiments, the process is repeated for each subsequent video frame of the video recording.
【0034】 In some embodiments, the face augmentation system 108 is configured to speed up the processing of the current video frame in a video recording by predicting / generating an appropriate cropped image corresponding to each input user face in the current video frame using recent historical alignment information from recently processed video frames of the same video recording. This type of accelerated processing is also called “prediction.” As will be detailed later, the “prediction” process is faster than the pipeline described above, which generates a cropped image for each detected input user face in the video frame using detected face features determined for that video frame, because it does not need to wait for face features to be determined for the video frame.
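
One simple way to picture the prediction step is constant-velocity extrapolation of recent alignment values. The description does not specify how the trajectory is predicted, so the extrapolation rule, the fields of `Alignment`, and the function name `predict_alignment` below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Alignment:
    """Hypothetical per-frame alignment of one input user face."""
    cx: float     # face center x, in pixels
    cy: float     # face center y, in pixels
    angle: float  # in-plane rotation, in radians
    scale: float  # scale factor relative to the standardized crop


def predict_alignment(history: Sequence[Alignment]) -> Alignment:
    """Extrapolate the next alignment from the two most recent ones.

    Assumes roughly constant motion between consecutive frames and
    ignores angle wrap-around, which a real system would handle.
    """
    if not history:
        raise ValueError("need at least one past alignment")
    if len(history) == 1:
        return history[-1]  # no motion estimate yet; reuse the last value
    prev, last = history[-2], history[-1]
    return Alignment(
        cx=last.cx + (last.cx - prev.cx),
        cy=last.cy + (last.cy - prev.cy),
        angle=last.angle + (last.angle - prev.angle),
        scale=last.scale + (last.scale - prev.scale),
    )


# Alignment from the two most recently processed frames.
history = [Alignment(100.0, 50.0, 0.0, 1.0),
           Alignment(104.0, 52.0, 0.25, 1.5)]
predicted = predict_alignment(history)
print(predicted)
```

Because the predicted alignment is available before landmark detection finishes on the current frame, cropping can start immediately, which is the source of the latency saving the description attributes to prediction.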
Rather than waiting, the prediction pipeline generates, at least in parallel with the determination of face features for the input user faces in the current video frame, a cropped image of the input user face from the current video frame using a predicted trajectory of the input user face. The trajectory is determined from historical alignment information determined for the input user face when it appeared in one or more recently processed video frames.
【0035】 According to various embodiments described herein, video recording is performed by camera 102, and each video frame is processed in real time by face augmentation system 108. The resulting composite video frame simulates a desired face swap between an input user face and a corresponding target face and is output to display 110. An efficient processing pipeline of video frames, utilizing parallel computing on the GPU and, in some embodiments, prediction of the input user face's trajectory based on recent historical video frames, minimizes the delay between recording of video frames by camera 102 and displaying a composite version of the same video frame on display 110. This delay, also known as "glass-to-glass delay," can be reduced to as little as 50-75 milliseconds, which is virtually imperceptible to the viewer. The low glass-to-glass delay provides the effect that the simulated face swap between the input user face and the selected target face is occurring in real time.
【0036】 Figure 2 shows an example of a face augmentation system according to several embodiments. In the example of Figure 2, the face augmentation system comprises a face signature generation engine 202, a face detection engine 204, a face landmark detection engine 206, an alignment engine 208, a face recognition engine 210, mapping storage 212, a face replacement engine 214, a target face replacement model storage 216, a synthesis engine 218, and a predictive face position engine 220.
Each of the face signature generation engine 202, face detection engine 204, face landmark detection engine 206, alignment engine 208, face recognition engine 210, mapping storage 212, face replacement engine 214, target face replacement model storage 216, synthesis engine 218, and predictive face position engine 220 may be implemented using hardware (e.g., a GPU) and / or software.
【0037】 The face signature generation engine 202 is configured to generate a unique signature corresponding to a given face based on an image of that face. For example, in a pipeline that swaps a target face with an input user face, the input user face is first identified, and its input user identifier is used to determine the corresponding target face. In some embodiments, if the input user face on which such face swapping is performed is known in advance, the input user face may be identified based on a prepared / stored signature. In various embodiments, the “signature” (also called “embedding”) includes a set of vectors (e.g., 512 vectors) generated by a machine learning model (e.g., a neural network) trained to output a vector representation of a face based on images (e.g., 10-20) of the face displayed in different orientations. In some embodiments, each generated face signature is associated with a face ID (e.g., by an operator). For example, the face ID is a human-readable name. In some embodiments, the machine learning model configured to generate signatures corresponding to faces is trained on face images aligned in the same way that input user faces from video frames are aligned. More specifically, as video frames (for example, of an in-progress recording) are processed in real time, the face recognition engine 210 generates a signature from the aligned / cropped image of the input user face and then compares that signature with the stored signatures to determine whether there is a match, thereby identifying the input user face within the video frame.
If the face recognition engine 210 finds a matching face signature, the input user face is assigned the face ID associated with the stored matching face signature.
【0038】 The face detection engine 204 is configured to detect input user faces in each video frame of a video recording. In some embodiments, the video recording may be completed. In some embodiments, the video recording may be in progress. In some embodiments, the face detection engine 204 is configured to run a machine learning model (e.g., a neural network) trained to output a bounding box (or other polygon) around each input user face (excluding the body) recognized in the input video frame. In some embodiments, the face detection engine 204 is configured to detect up to a specified number of input user faces. For example, it may be desirable to perform face swapping only on a single input user face or a specified number of input user faces (rather than on every input user face that may appear in the video frame), and therefore the face detection engine 204 is configured to draw bounding boxes (or other polygons) around recognized input user faces only up to the specified number of faces (even if further input user faces are present in the video frame). In various embodiments, the face detection engine 204 is configured to output the bounding boxes around the input user faces and the video frames to the face landmark detection engine 206 for further processing.
【0039】 The face landmark detection engine 206 is configured to detect the positions of face features (e.g., coordinates in 3D space) within a bounding box around the input user face determined by the face detection engine 204 in each video frame. In some embodiments, the face landmark detection engine 206 is configured to run a machine learning model (e.g., a neural network) trained to output the positions of face features (e.g., face landmarks) within a bounding box around each input user face in the video frame.
Examples of face landmarks include the corners of the eyes, corners of the mouth, ends of the eyebrows, and tip of the nose, and each landmark can be represented as coordinates in three-dimensional space. In various embodiments, the face landmark detection engine 206 is configured to output the 3D coordinates of the face features of each input user face, along with the video frame, to the alignment engine 208 for further processing. 【0040】 The alignment engine 208 is configured to generate a cropped image containing the face of the input user detected in each video frame, based on the face features corresponding to each input user face from the face landmark detection engine 206. In various embodiments, the cropped image of each input user face includes a portion of the original video frame and shows the input user face after being scaled, rotated, and / or translated from its appearance in the original video frame to fit a standardized format. For example, each input user face in the cropped image has eyes aligned according to a specified position / width on the cropped image. In addition to the cropped image of each input user face, the alignment engine 208 is further configured to generate a set of alignment information corresponding to each cropped image of the input user face, where the alignment information describes the rotation, translation, and / or scaling determined to transform the input user face in the video frame into the version displayed in the cropped image. In some embodiments, each set of alignment information is represented by a transformation matrix that describes the scaling, rotation, and / or translation by which the input user face appearing in the video frame was transformed into the version displayed in the cropped image. 
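As an illustration of how such a transformation matrix might be derived, the following minimal sketch computes a 2D similarity transform (scale, rotation, translation) from the two eye positions and also inverts it. It is a simplification under stated assumptions, not the disclosed implementation: it uses only two 2D landmarks rather than the full set of 3D landmarks, and the function names and the canonical eye positions in the crop are hypothetical.

```python
import math

# Hypothetical canonical eye positions (x, y) inside the standardized crop.
CROP_LEFT_EYE = (38.0, 52.0)
CROP_RIGHT_EYE = (74.0, 52.0)


def alignment_matrix(left_eye, right_eye):
    """Return a 2x3 matrix [[a, -b, tx], [b, a, ty]] mapping frame -> crop."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    cdx = CROP_RIGHT_EYE[0] - CROP_LEFT_EYE[0]
    cdy = CROP_RIGHT_EYE[1] - CROP_LEFT_EYE[1]
    eye_dist = math.hypot(dx, dy)
    if eye_dist == 0:
        raise ValueError("eye landmarks coincide")
    scale = math.hypot(cdx, cdy) / eye_dist
    angle = math.atan2(cdy, cdx) - math.atan2(dy, dx)
    a, b = scale * math.cos(angle), scale * math.sin(angle)
    # Translation chosen so the left eye lands on its canonical position.
    tx = CROP_LEFT_EYE[0] - (a * left_eye[0] - b * left_eye[1])
    ty = CROP_LEFT_EYE[1] - (b * left_eye[0] + a * left_eye[1])
    return [[a, -b, tx], [b, a, ty]]


def invert(matrix):
    """Invert the similarity transform, giving the crop -> frame mapping."""
    a, b = matrix[0][0], matrix[1][0]
    tx, ty = matrix[0][2], matrix[1][2]
    det = a * a + b * b
    ia, ib = a / det, -b / det
    return [[ia, -ib, -(ia * tx - ib * ty)], [ib, ia, -(ib * tx + ia * ty)]]


def apply(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
```

In this sketch the forward matrix plays the role of the set of alignment information, while the inverse maps crop coordinates back to frame coordinates, which is the direction needed when a generated target face image is placed back onto the original video frame.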
As will be described later, the set of alignment information associated with the cropped image of the input user face may be used to transform the target face image corresponding to the input user face so that the transformed representation of the target face can be appropriately scaled, rotated, and / or translated relative to the rest of the video frame before being composited / overlaid on the video frame. In various embodiments, the alignment engine 208 is configured to output the cropped image of each input user face, the corresponding set of alignment information, and the video frame to the face recognition engine 210 for further processing. 【0041】 The face recognition engine 210 is configured to determine a face ID corresponding to each input user face for which a cropped image is received from the alignment engine 208. In some embodiments, if stored signatures of known faces (e.g., pre-generated by the face signature generation engine 202) are available, the face recognition engine 210 is configured to generate a signature from each cropped image of an input user face and then compare that signature to each stored signature to determine whether a match can be found. If a match is found, the cropped image and the input user face shown therein are assigned the face ID associated with the matching stored signature. Conversely, if no stored signature matching the signature of the cropped image of an input user face is found, the face recognition engine 210 is configured to present the cropped image in the user interface and prompt the operator to provide the respective face ID for each cropped image. The face recognition engine 210 is configured to store each face ID provided by the operator for the corresponding cropped image, and further to store the cropped image of each distinct input user face as a reference image corresponding to the face ID. 
In some embodiments, upon receiving a cropped image corresponding to an input user face from the same video recording, the face recognition engine 210 can compare the cropped image with the reference image to determine whether a match exists. If a match exists between the cropped image and the stored reference image, the face recognition engine 210 can assign the face ID associated with the matching stored reference image to the input user face in the cropped image. In various embodiments, the face recognition engine 210 is configured to output each cropped image, the corresponding set of alignment information, and the face ID of each input user to the face swapping engine 214 for further processing. 【0042】 The face swapping engine 214 is configured to generate a representation of a target face corresponding to each cropped image of the input user face received from the face recognition engine 210. The face swapping engine 214 is configured to look up the corresponding target face ID from the face ID vs. target face ID mapping stored in the mapping storage 212 using the face ID corresponding to the cropped image of the input user face. The face swapping engine 214 is then configured to find a pre-generated target face swapping model associated with that target face ID from the target face swapping model storage 216 using the looked-up target face ID corresponding to the cropped image of the input user face. In various embodiments, each target face swapping model stored in the target face swapping model storage 216 is trained on images showing various orientations / angles of the corresponding target face. For example, the target faces may include the faces of celebrities, famous people, or other individuals whose facial expressions / movements may be augmented / replaced with those of the input user. 
In some embodiments, each image on which the target face swap model is trained is also a cropped image aligned according to the standardized format associated with the cropped image of the input user face generated by the alignment engine 208. If there are two or more input user faces received by the face swap engine 214, the face swap engine 214 retrieves a separate target face swap model corresponding to each input user face in order to separately generate a different target face representation for each input user face. In some embodiments, each target face swap model stored in the target face swap model storage 216 includes a separate neural network or other machine learning model. In some embodiments, the face swap engine 214 is configured to run each target face swap model to output a 2D image showing a target face that mimics the expression / orientation of the corresponding input user face shown in the input cropped image. Specifically, as will be detailed later, when executing the target face swap model, the face swap engine 214 is configured to encode the input cropped image of the input user face into a set of vectors (e.g., 512 vectors) that mathematically represent various extrinsic features of the input user face (e.g., aspects relating to facial expressions that can be transferred from one face to another, unlike intrinsic features such as eye color and skin color). The target face swap model is then configured to decode the set of vectors by transforming the mathematical model of the target face using the set of vectors in order to ultimately generate a representation (e.g., a 2D image) of the target face that appears to have the same expression and orientation as the corresponding input user face. 
In some embodiments, the 2D image of the target face output by the target face swap model includes an RGB image with an alpha channel that represents the contour of the target face by masks indicating where the output image is transparent (e.g., outside the boundary of the target face) and where the output image is not transparent or semi-transparent (e.g., within the boundary of the target face, where the image is not transparent and the edges of the target face may be semi-transparent). The face swapping engine 214 is configured to send each representation (e.g., a 2D image) of the target face corresponding to each input user face detected from the video frame, along with each set of alignment information and the original video frame, to the synthesis engine 218. 【0043】 The synthesis engine 218 is configured to modify each target face representation (e.g., a 2D image) based on a corresponding set of alignment information associated with the input user face received from the face swapping engine 214, and then overlay the modified representation onto the input user face in the original video frame. By modifying / transforming each target face representation / 2D image with the set of alignment information associated with the input user face, the target face image is given the same scaling / rotation / translation as the corresponding input user face in the original video frame. If there are multiple input user faces / representations of a target face in a single video frame, the resulting synthesized video frame will include, for each input user face, a corresponding overlay of a target face image in which the target face mimics the expression / orientation of the corresponding detected input user face. 
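The overlay itself can be pictured as ordinary per-pixel alpha blending. The sketch below is a minimal illustration, assuming the target face image has already been transformed by the alignment information and is given as rows of (R, G, B, alpha) tuples with alpha in the range 0 to 1. The function name, the nested-list image layout, and the `top` and `left` placement parameters are hypothetical, and a real-time system would presumably perform this step on the GPU.

```python
def overlay_rgba(frame, face_rgba, top, left):
    """Alpha-blend an RGBA patch onto an RGB frame and return a new frame.

    frame:     rows of (r, g, b) tuples (the original video frame)
    face_rgba: rows of (r, g, b, alpha) tuples, alpha in [0.0, 1.0]
    top, left: frame position of the patch's top-left pixel (assumed known)
    """
    out = [row[:] for row in frame]  # leave the input frame untouched
    for dy, patch_row in enumerate(face_rgba):
        for dx, (r, g, b, alpha) in enumerate(patch_row):
            y, x = top + dy, left + dx
            if not (0 <= y < len(out) and 0 <= x < len(out[y])):
                continue  # this patch pixel falls outside the frame
            base_r, base_g, base_b = out[y][x]
            # alpha = 0 keeps the frame pixel, alpha = 1 shows the target face,
            # values in between give the soft edge of the mask.
            out[y][x] = (
                round(alpha * r + (1 - alpha) * base_r),
                round(alpha * g + (1 - alpha) * base_g),
                round(alpha * b + (1 - alpha) * base_b),
            )
    return out
```

Because fully transparent pixels leave the frame unchanged, everything outside the face mask (hair, body, background) is carried over from the original video frame.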
Since each target face image contains only the target face without the target individual's hair or body, the modified representation of each target face, when overlaid on the video frame, inherits the input user's original hair and body, resulting in only the desired face swap. 【0044】 The predictive face position engine 220 is configured to allow the face detection by the face detection engine 204, the face landmark detection by the face landmark detection engine 206, and the alignment by the alignment engine 208 to be skipped for the current video frame by generating a cropped image from the current video frame based on the historical face bounding boxes, face landmark coordinates in 3D space, and / or alignment information determined for recent historical video frames from the same video recording. In particular, when the frame rate used by the camera recording the video frames is high (e.g., 60 frames per second (fps) or higher), the input user face is not expected to change significantly between adjacent video frames. Accordingly, in some embodiments, the predictive face position engine 220 determines the face trajectory over a given number of recent historical video frames using the historical face bounding boxes, face landmark coordinates, and alignment information of the cropped images determined by the face detection engine 204, face landmark detection engine 206, and alignment engine 208, and inputs this trajectory to a machine learning model that outputs / predicts the bounding box, face landmark coordinates, and / or alignment information of the input user face in the current video frame. This predicted information is then used to generate a cropped image of the input user face directly from the current video frame, which is then output to the face recognition engine 210 and proceeds to the rest of the pipeline for generating a composite video frame, as described above. 
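For intuition only, the prediction step can be approximated by a constant-velocity extrapolation over the recent history. This is a deliberately simple stand-in for the trained machine learning model described above, not that model; the function name and the (x, y, width, height) bounding box layout are assumptions.

```python
def predict_box(history):
    """Extrapolate the next (x, y, w, h) box from recent boxes, oldest first.

    Assumption: motion between adjacent frames is roughly constant, which is
    plausible at high frame rates where a face changes little per frame.
    """
    if not history:
        raise ValueError("at least one historical bounding box is required")
    if len(history) == 1:
        return tuple(float(v) for v in history[0])  # no motion estimate yet
    first, last = history[0], history[-1]
    steps = len(history) - 1
    # Average per-frame change over the window, added once to the latest box.
    return tuple(l + (l - f) / steps for f, l in zip(first, last))
```

The same extrapolation could be applied to landmark coordinates or to the entries of an alignment matrix; a learned predictor would replace the averaging line.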
In some embodiments, the current video frame may still be processed by the face detection engine 204, face landmark detection engine 206, and alignment engine 208, and the bounding box, face landmark coordinates, and alignment information of the input user face in the current video frame determined by these engines may be used to evaluate the error of the predicted cropped image. Any such errors may be used to retrain the model associated with the predictive face position engine 220 so that more accurate predicted cropped images can be created from subsequent video frames without waiting for the output from the face detection engine 204, face landmark detection engine 206, and alignment engine 208 for those frames. 【0045】 Figure 3 is a flowchart showing one embodiment of the process for real-time augmentation of a target face. In some embodiments, the process 300 is performed by the face augmentation system 108 of the system 100 in Figure 1. 【0046】 In step 302, a set of user face features corresponding to the input user face in the recorded video frame is obtained. After the input user face is detected in the recorded video frame, a set of face landmarks is determined from the input user face in the video frame. In some embodiments, the detected bounding box around the input user face in the video frame is input to a machine learning model trained to output the 3D space coordinates of the face landmarks on the input user face. 【0047】 In step 304, at least the recorded video frame and the set of user face features are used to generate a cropped image containing the input user face. 
3D space coordinates describing the positions of face landmarks on the input user face within the video frames are used to scale, rotate, and / or translate the input user face within the video frames for the purpose of generating a cropped image (a portion of the video frame) containing the input user face that conforms to normalized orientation, alignment, size, and / or dimensions. 【0048】 In step 306, a target face swap model is used to encode at least a portion of the cropped image into a set of user extrinsic features. A target face is determined that is augmented so that its representation matches the facial expression and / or orientation of the input user face. In some embodiments, the target face corresponding to the input user face is determined based on a face ID determined for the input user face based on at least a portion of the cropped image. A pre-generated target face swap model corresponding to this target face is determined. The cropped image of the input user face is input to the target face swap model, which then encodes the cropped image into a set of vectors that mathematically describe the various extrinsic features of the input user face shown in the cropped image. 【0049】 In step 308, the target face swap model and multiple user extrinsic features are used to generate a representation of the target face. A set of vectors encoded from the cropped image is used by the target face swap model to transform a pre-generated mathematical model so as to yield a 2D image of the target face containing extrinsic features (e.g., facial expressions, orientation) that match those of the input user face. In some embodiments, the 2D image further includes a mask that represents the target face as an opaque object with soft edges against an otherwise transparent background. 【0050】 In step 310, the representation of the target face is overlaid on the recorded video frame. 
The 2D image of the target face is overlaid on the recorded video frame to result in a composite video frame showing the target face having an expression and orientation that matches that of the input user's face. In some embodiments, before being overlaid on the original video frame, the 2D image of the target face is first transformed based on the alignment information used when scaling, orienting, and / or translating the input user's face in the video frame to obtain a cropped image of the input user's face so that the transformed 2D image of the target face matches the scale, orientation, and / or translation of the input user's face covered by the overlaid 2D image of the target face. 【0051】 Figure 4 is a flowchart showing one embodiment of the process for real-time augmentation of multiple target faces. In some embodiments, the process 400 is performed by the face augmentation system 108 of the system 100 in Figure 1. 【0052】 In step 402, the first input user face and the second input user face are detected within the recorded video frame. At least two input user faces are detected within the same recorded video frame. 【0053】 In step 404, a first face identifier is associated with a first input user face. The corresponding face identifier is associated with each of at least two detected input user faces in the video frame. 【0054】 In step 406, a first mapping between a first face identifier and a first target face is stored. Each face ID is assigned to a different input user face, and this face identifier is also mapped to a corresponding target face ID so that the same input user face can be consistently mapped to the same target face, which is augmented to match the extrinsic features of the input user face, across multiple video frames. 
【0055】 In step 408, a first representation of a first target face, generated at least partially based on a portion of a recorded video frame containing the face of a first input user, is overlaid on the recorded video frame using the first mapping. A pre-generated target face swap model corresponding to each target face is configured to take a cropped image of the corresponding input user face from the video frame as input and output a 2D image of the target face containing extrinsic features (e.g., facial expression, orientation) that match those of the input user face. The 2D image of the target face is then overlaid on the corresponding input user face on the video frame to replace the input user face with a version of the target face having the same facial expression / orientation, for example, as described above in process 300. 【0056】 Figure 5 shows an example of a pipeline that generates a composite video frame containing representations of target faces corresponding to n input user faces detected in an input recorded video frame. The source 502 of the face swap pipeline 500 records video frames in the process of creating a video recording. The source can be a camera or storage where the video frames are stored. For example, suppose one or more input users are standing in front of the camera and facing the camera. The recorded video frames are output to the face detector inference 504. The face detector inference 504 runs a first machine learning model (e.g., a neural network) configured to draw bounding boxes around the input user faces detected in the video frame. In some embodiments, the face detector inference 504 is configured to receive an operator-supplied parameter that specifies the maximum number (n) of faces to detect, so that the face detector inference 504 draws bounding boxes around at most n input user faces in the video frame. 
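One simple way to honor the maximum number n is to rank the detected bounding boxes and keep at most n of them. The sketch below ranks by box area on the assumption that larger faces are closer to the camera; that ranking rule, the function name, and the (x, y, width, height) box layout are illustrative assumptions, since the description does not specify how faces are prioritized.

```python
def keep_largest_faces(boxes, max_faces):
    """Return at most max_faces boxes (x, y, w, h), largest area first."""
    if max_faces < 0:
        raise ValueError("max_faces must be non-negative")
    # Assumption: box area is a proxy for how prominent a face is in the frame.
    ranked = sorted(boxes, key=lambda box: box[2] * box[3], reverse=True)
    return ranked[:max_faces]
```

If fewer than n faces are detected, all of them are returned, which matches the "up to n" behavior described for the face detector inference 504.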
The bounding boxes of up to n detected input user faces (described, for example, by coordinates / vectors in 3D space representing the four corners / edges of each bounding box) are output from the face detector inference 504 to the face landmark inference 506, along with the video frame. The face landmark inference 506 is configured to run a second machine learning model (e.g., a neural network) to detect the coordinates in 3D space of a set of face landmarks for each input user face within each bounding box for the video frame. The face landmark inference 506 is configured to output the face landmark coordinates and video frame for each input user face to the face alignment 508. The face alignment 508 is configured to generate a cropped image containing each input user face from the video frame, where the input user faces shown in the cropped image are oriented / aligned / scaled according to normalized / standardized parameters. The face alignment 508 is further configured to generate a set of alignment information corresponding to each cropped image, where the set of alignment information describes the scaling, rotation, and / or translation performed on the input user face shown in the video frame to become the version shown in the cropped image. The face alignment 508 is configured to output the cropped image and the set of alignment information corresponding to each input user face to the face ID inference 510. The face ID inference 510 is configured to run a third machine learning model (e.g., a neural network) to determine the face ID corresponding to each cropped image. As will be detailed later, the face ID inference 510 is configured to determine the face ID corresponding to the input user face in each cropped image in different ways depending on whether the input user face was a known face before the video frame was processed in the pipeline and whether a face signature had been previously generated for it. 
After a face ID is determined for each cropped image, the face ID inference 510 determines the target face ID corresponding to each face ID based on the stored mapping and can output each cropped image and its corresponding face ID to the respective target face swap model associated with the target face ID corresponding to the face ID. In the example in Figure 5, there are n input user faces, and therefore n cropped images, and each cropped image is output to the respective pre-generated target face swap model, where each face swap inference (e.g., face swap inference 1 / n 512 to face swap inference n / n 514) is configured to run the respective machine learning model configured to output the respective 2D representation / image (e.g., "exchanged image") of the associated target face that matches the extrinsic features (e.g., facial expressions, movements) of the respective input user face shown in the corresponding cropped image. Each of the n 2D representations / images of the target face (with masks) is output to the composite 516 by the respective face swap inference. The composite 516 is configured to transform each of the n 2D representations / images of the target face by its respective set of alignment information so that each 2D representation / image of the target face inherits the orientation / scale of the corresponding input user face as it was shown in the video frame before being aligned to the cropped image. The composite 516 is then configured to composite / overlay the n transformed 2D representations / images of the target face onto the original video frame (received from face ID inference 510). The composite 516 is configured to output the composite video frame to the display 518, which outputs the composite video frame containing the overlaid 2D images of the target face to a screen or other type of user interface. The processing pipeline 500 shown in Figure 5 can be iterated over for each of at least some video frames during video recording. 
As shown in the processing pipeline 500, for each video frame input to the pipeline, a corresponding composite video frame is output. 【0057】 Figure 6 is a flowchart showing an example of a process for detecting an input user face in a recorded video frame according to several embodiments. In some embodiments, process 600 is performed by the face augmentation server 108 in Figure 1. In some embodiments, the face detector inference 504 of pipeline 500 in Figure 5 is performed using process 600. 【0058】 In step 602, a video frame is received. 【0059】 In step 604, the number of input user faces to be detected is received. Since more input user faces may be detected in the video frame than desired, the operator can restrict the maximum number of input user faces to be detected for further processing in the face replacement pipeline. For example, it is desirable that only the faces of actors in the foreground of the video frame, and not the faces of actors in the background, be replaced with the target face in the resulting composite video frame. 【0060】 In step 606, one or more input user faces are detected within the video frame according to a specified number. A machine learning model is run to detect bounding boxes around the maximum specified number of input user faces within the video frame. For example, if the specified number is 2, up to two bounding boxes drawn around the input user faces within the video frame are used for further processing in the face swap pipeline. 【0061】 Figure 7 is a flowchart illustrating an example of a process for detecting face landmarks corresponding to input user faces detected within recorded video frames, according to several embodiments. In some embodiments, process 700 is performed on the face augmentation server 108 in Figure 1. In some embodiments, face landmark inference 506 in pipeline 500 in Figure 5 may be performed using process 700. 【0062】 In step 702, the input user face detected (next) in the video frame is received. 
In some embodiments, the detection of each input user face in the video frame includes a bounding box around that input user face. For example, the bounding box may be represented as four coordinates in 3D space corresponding to the four corners of the box. 【0063】 In step 704, multiple face landmarks corresponding to the detected input user face are determined. The 3D coordinates corresponding to a given set of face landmarks are determined for each input user face within the bounding box around the face in the video frame. 【0064】 In step 706, it is determined whether there is at least one more detected input user face within the video frame. If there is at least one more detected input user face within the video frame, control is returned to step 702. Conversely, if there are no more detected input user faces within the video frame, process 700 terminates. While process 700 suggests that face landmarks may be detected sequentially for each input user face, in actual embodiments, face landmarks may be detected at least partially in parallel for two or more input user faces based on their respective bounding boxes. 【0065】 Figure 8 is a flowchart illustrating an example of a process for generating cropped images and alignment information sets corresponding to each input user face in a recorded video frame, according to several embodiments. In some embodiments, process 800 is performed on the face augmentation server 108 in Figure 1. In some embodiments, face alignment 508 in pipeline 500 in Figure 5 may be performed using process 800. 【0066】 In step 802, the face landmarks corresponding to the (next) input user face detected in the video frame are received. 【0067】 In step 804, the face landmarks are used to generate a cropped image containing the input user face detected from the video frame, along with a set of alignment information. In some embodiments, the face landmarks are used to calculate the 3D head position of the input user face. 
This 3D head position is used to determine the translation, rotation, and scale of the input user face in a 3D coordinate system. The translation, rotation, and scale of the input user face are then used to align the input user face within a crop of the original video frame in order to generate a cropped image of the input user face in which the size of the input user face is standardized and the face landmarks of the input user face are aligned to standardized face landmark positions in the cropped image. Furthermore, the cropped image has standardized dimensions (e.g., length and width). The scaling, translation, and / or rotation performed on the version of the input user face shown in the video frame in order to generate the version shown in the corresponding cropped image are collectively referred to as the “set of alignment information” associated with the cropped image. Intuitively speaking, in cropped images of the same input user's face generated from a series of video frames, the center of gravity of this input user's head remains stable within the cropped image, even if the input user is moving around within the video frames. 【0068】 In step 806, it is determined whether there is at least one more input user face detected within the video frame. If there is at least one more input user face detected within the video frame, control is returned to step 802. Conversely, if there are no more input user faces detected within the video frame, process 800 terminates. While process 800 suggests that cropped images may be generated sequentially for each input user face, in actual embodiments, cropped images may be generated at least partially in parallel for two or more input user faces based on their respective face landmarks. 【0069】 Figure 9 shows several examples of aligned crop images of input user faces generated from respective video frames. 
Figure 9 shows eight crop images, each generated from each video frame, based on user face features determined for the input user face. For example, each crop image in Figure 9 was generated using a process such as process 800 in Figure 8. As shown in Figure 9, each crop image shows the respective input user face that is in the same alignment within the frame of the crop image. For example, in all of the crop image examples, the eyes of the input user faces are in approximately the same position within the crop image. Furthermore, in all of the crop image examples, the size of the input user faces is substantially equivalent / standardized. 【0070】 Figure 10 is a flowchart showing an example of a process for identifying an input user face in a recorded video frame, according to several embodiments. In some embodiments, process 1000 is performed by the face augmentation server 108 in Figure 1. In some embodiments, face ID inference 510 in pipeline 500 in Figure 5 may be performed using process 1000. 【0071】 In step 1002, one or more cropped images of the detected input user face are received. 【0072】 In step 1004, it is determined whether a prepared face signature for a known face is available. If a prepared face signature for a known face is available, control proceeds to step 1006. Conversely, if a prepared face signature for a known face is not available, control proceeds to step 1014. In some examples, specific input users appearing in the video recording are known in advance, and therefore, face signatures for such input users can be prepared in advance so that the faces of these users can be later programmatically identified in the face replacement processing pipeline. As described above, the face signature of an input user can be generated based on inputting images of the input user's face (ideally from various angles, e.g.) into an identification network to obtain a resulting value that represents a unique face signature for the input user. 
For example, a face signature can be a set of 512 vectors. In some embodiments, each known face is assigned a corresponding face ID (e.g., a human-readable value). 【0073】 In step 1006, pre-generated face signatures associated with known faces are obtained. The pre-generated face signatures of known input user faces can be retrieved during the face replacement processing pipeline to programmatically identify those input user faces detected in video frames that are actually known faces. 【0074】 In step 1008, a new face signature is generated corresponding to the cropped image of the detected input user face. The cropped image of each detected input user face can be input into the same identification network used to generate the face signatures of known faces in order to obtain a new face signature corresponding to each detected input user face. 【0075】 In step 1010, pre-generated face signatures are compared with new face signatures. Pre-generated face signatures corresponding to known faces are compared with new face signatures generated for detected input user faces to determine whether any matches exist. For example, a similarity value between each pre-generated face signature and each new face signature may be computed, and this similarity value may be compared to a threshold. If the similarity value is greater than the threshold, the detected input user face with the matching new face signature is determined to be the same face as the known face. 【0076】 In step 1012, detected input user faces whose new signature matches a pre-generated face signature are associated with their respective face IDs. Each detected input user face whose face signature matches a stored face signature of a known face is assigned the face ID associated with that known face. 【0077】 In step 1014, it is determined whether a reference crop image is available. If a reference crop image is available, control proceeds to step 1016. 
Conversely, if a reference cropped image is not available, control proceeds to step 1020. If a face signature corresponding to a known face is not available (e.g., none was prepared in advance), a stored reference image (e.g., a cropped image of an input user face generated from a previous video frame of the video recording and labeled by the operator with the corresponding face ID) may be compared to the cropped image of each detected input user face in the current video frame. 【0078】 In step 1016, the cropped image is compared to a reference cropped image. A previously stored reference image, labeled with a face ID, is compared to the cropped image generated for the detected input user face to determine whether any matches exist. For example, a similarity value between each reference image and each new cropped image can be computed and compared to a threshold. If the similarity value is greater than the threshold, the detected input user face in the matching cropped image is assigned the face ID associated with the matching reference image. 【0079】 In step 1018, the cropped images are associated with the face IDs of their matching reference cropped images. 【0080】 In step 1020, the operator submits a face ID corresponding to a cropped image of the input user face. If a reference image is not available, a prompt may be presented at the user interface asking the operator to label each cropped image with the corresponding face ID. 【0081】 In step 1022, the cropped images are stored as reference cropped images. After the cropped images are labeled with their respective face IDs, they are stored as reference images corresponding to those face IDs and can be used to identify the input user face in cropped images generated from subsequent video frames (for example, from the same video recording). 【0082】 In step 1024, the face ID is stored along with the reference cropped image. 
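The signature comparison of steps 1008 to 1012 can be illustrated with a minimal sketch. It assumes that each face signature is a flat list of floating-point values and that the similarity value is cosine similarity; the source names neither the similarity measure nor the threshold, so `cosine_similarity`, `identify_face`, and the default threshold of 0.6 are hypothetical choices made only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_face(new_signature, known_signatures, threshold=0.6):
    """Return the face ID of the best match above the threshold, else None.

    known_signatures maps face ID -> pre-generated signature (step 1006).
    The threshold value is an assumption; the source does not give one.
    """
    best_id, best_score = None, threshold
    for face_id, signature in known_signatures.items():
        score = cosine_similarity(new_signature, signature)
        if score > best_score:  # strictly greater than the threshold
            best_id, best_score = face_id, score
    return best_id


known = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify_face([0.9, 0.1, 0.0], known))  # close to the "alice" signature
print(identify_face([1.0, 1.0, 2.0], known))  # no known face is similar enough
```

A return value of `None` corresponds to the case in which no known face matches, at which point a process such as process 1000 would fall back to reference cropped images or to operator labeling.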
【0083】 Figure 11 is a flowchart illustrating an example of a process for generating a 2D image of a target face corresponding to an input user face detected from recorded video frames, according to several embodiments. In some embodiments, process 1100 is performed by the face augmentation server 108 in Figure 1. In some embodiments, face swap inference 512 or 514 of pipeline 500 in Figure 5 may be performed using process 1100. 【0084】 In step 1102, a face ID corresponding to the (next) cropped image from the video frame is received. The face ID corresponding to the cropped image of each detected input user face is determined using a process such as process 1000 in Figure 10. 【0085】 In step 1104, the target face ID associated with the face ID is determined based on a stored mapping. The mapping between the face ID of the detected input user face and the target face ID is either predetermined or submitted by the operator during the face swap pipeline. 【0086】 In step 1106, a pre-generated target face swap model associated with the target face ID is obtained. A face swap model corresponding to each target face ID is pre-generated and stored. In some embodiments, the face swap model corresponding to a particular target face is generated using cropped images produced with the same type of alignment that is used to obtain a cropped image of the input user face detected from the video frame. 【0087】 In step 1108, the pre-generated target face swap model is used to encode the cropped image into a set of user extrinsic features. The encoder of the target face swap model receives a cropped image of the detected input user face as input and encodes the image into a set of vectors representing the extrinsic features of the input user face. In various embodiments, the extrinsic features describe features of the input user face that can be transferred to another face. 
Examples of extrinsic features include facial expressions or facial movement patterns related to pixel patterns recognized by the target face swap model within the cropped image. For example, the set of user extrinsic features is a compact representation of the input user face in the cropped image and contains 512 vectors. 【0088】 In step 1110, the pre-generated target face swap model is used to decode the set of user extrinsic features to obtain a 2D target face image. The target face swap model includes a mathematical representation of the target face generated based on cropped images of the target face. For example, the decoder of the target face swap model is a neural network that has learned the intrinsic facial features of the target face and renders them as directed by the set of user extrinsic features (e.g., a set of 512 vectors). The resulting matrix forms a 2D image of the target face that shows the target face with extrinsic features matching those of the input user face shown in the cropped image. In some embodiments, the output 2D image includes red, green, and blue (RGB) values along with an alpha value for the target face mask. The mask traces the contour around the target face, excluding the target individual's hair, neck, and body. 【0089】 In step 1112, it is determined whether there is at least one more input user face detected within the video frame. If there is at least one more input user face detected within the video frame, control is returned to step 1102. Conversely, if there are no more input user faces detected within the video frame, process 1100 ends. Although process 1100 suggests that 2D target face images are generated sequentially for each input user face, in actual embodiments 2D target face images may be generated at least partially in parallel for two or more input user faces based on their respective cropped images. 
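The control flow of steps 1104 to 1110, together with the parallel variant mentioned at the end of process 1100, can be sketched as follows. This is only an illustration under stated assumptions: `FaceSwapModel`, `swap_one`, and `swap_all` are hypothetical names, the encoder and decoder are trivial stand-ins rather than the neural networks described above, and a thread pool is merely one possible way to run faces concurrently (the source attributes parallelism to GPUs without giving details).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[float]]  # toy image: rows of pixel values in [0, 1]


@dataclass
class FaceSwapModel:
    """Hypothetical stand-in for a pre-generated target face swap model."""
    encode: Callable[[Image], List[float]]  # crop -> user extrinsic features
    decode: Callable[[List[float]], Image]  # features -> 2D target face image


def swap_one(face_id, crop, id_to_target, models):
    """Steps 1104 to 1110 for a single detected input user face."""
    target_id = id_to_target[face_id]       # step 1104: stored mapping
    model = models[target_id]               # step 1106: look up the model
    features = model.encode(crop)           # step 1108: encode the crop
    return face_id, model.decode(features)  # step 1110: decode to an image


def swap_all(crops: Dict[str, Image], id_to_target, models, max_workers=4):
    """Run swap_one for every detected face, concurrently across faces."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda item: swap_one(item[0], item[1], id_to_target, models),
            crops.items(),
        )
        return dict(results)


# Toy model: "encodes" a crop as its row means and "decodes" each feature
# back into a two-pixel row. A real model would be a trained network.
toy = FaceSwapModel(
    encode=lambda img: [sum(row) / len(row) for row in img],
    decode=lambda feats: [[v, v] for v in feats],
)
crops = {"user_a": [[0.0, 1.0], [0.5, 0.5]], "user_b": [[1.0, 1.0], [0.0, 0.0]]}
outputs = swap_all(crops, {"user_a": "t1", "user_b": "t1"}, {"t1": toy})
print(sorted(outputs))  # one 2D target face image per face ID
```

Keying both the input crops and the outputs by face ID keeps each generated image associated with the input user face it belongs to, which a later compositing stage needs in order to place the image correctly.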
【0090】 Figure 12 shows an example of a target face and an example of an associated mask related to the target face. Cropped images 1202 and 1204 show the aligned target face and can be used to generate a target face swap model that characterizes the shown target face. Cropped image 1206 shows a version of the target face in which the area of the image outside the target face has been blurred. Mask 1208 shows a mask of the target face, which can be output by a target face swap model and which shows only the area of the target face within the softened boundary of the target face. 【0091】 Figure 13 is a flowchart illustrating an example of a process for overlaying a 2D image of a target face onto an input user face detected within a recorded video frame. In some embodiments, process 1300 is performed by the face augmentation server 108 in Figure 1. In some embodiments, the synthesis 516 of pipeline 500 in Figure 5 may be performed using process 1300. 【0092】 In step 1302, a 2D target face image associated with the (next) face ID related to the video frame is received. In some embodiments, the 2D target face image associated with the face ID is generated using a process such as process 1100 in Figure 11. The 2D target face image includes a mask that makes everything in the image except the target face transparent. 【0093】 In step 1304, the 2D target face image is transformed based on a set of alignment information associated with the cropped image linked to the face ID. As described above, the cropped image of the detected input user face from which the 2D target face image was generated is associated with a set of alignment information describing how the input user face in the cropped image relates to its appearance in the original video frame. The inverse of this set of alignment information is then used to transform the 2D target face image to match the scaling, rotation, and / or translation of the detected input user face as it appears in the video frame. 
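The inverse transformation of step 1304 can be illustrated with a short sketch. It assumes that the alignment information is a 2D similarity transform (uniform scale, rotation, and translation) mapping video frame coordinates to cropped image coordinates; the source lists scale, rotation, and translation but does not fix a parameterization, so the `Alignment` type and both function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Alignment:
    """Assumed similarity transform from frame coordinates to crop coordinates."""
    scale: float
    rotation: float  # radians
    tx: float
    ty: float


def frame_to_crop(a, x, y):
    """Forward alignment: rotate and scale a frame point, then translate it."""
    c, s = math.cos(a.rotation), math.sin(a.rotation)
    return (a.scale * (c * x - s * y) + a.tx,
            a.scale * (s * x + c * y) + a.ty)


def crop_to_frame(a, u, v):
    """Inverse alignment: undo the translation, then the scale and rotation."""
    c, s = math.cos(a.rotation), math.sin(a.rotation)
    dx, dy = (u - a.tx) / a.scale, (v - a.ty) / a.scale
    # Multiplying by the transpose of the rotation matrix inverts the rotation.
    return (c * dx + s * dy, -s * dx + c * dy)


align = Alignment(scale=2.0, rotation=math.pi / 6, tx=10.0, ty=-5.0)
u, v = frame_to_crop(align, 3.0, 4.0)
print(crop_to_frame(align, u, v))  # recovers (3.0, 4.0) up to rounding error
```

Applying `crop_to_frame` to each pixel position of the 2D target face image would place the image at the scale, rotation, and position of the detected input user face in the video frame. A practical implementation would more likely warp the whole image in a single operation with an image-processing library.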
【0094】 In step 1306, the colors associated with the 2D target face image are optionally adjusted. The colors associated with the 2D target face image may be adjusted (before or after the transformation using the set of alignment information) to better harmonize with the ambient lighting in the video frame. For example, the color temperature of the video frame can be determined, and then the 2D target face image can be adjusted to match the determined color temperature of the video frame. 【0095】 In step 1308, the 2D target face image is overlaid on the detected input user face associated with the face ID within the video frame. The transformed 2D target face image is composited onto the video frame at the position of the corresponding detected input user face. As described above, the 2D target face image includes only the target face with soft edges, and therefore the overlaid target face within the video frame will appear to have the same hair, neck, and body as the corresponding input user face. Thus, the composite video frame will show the input user having swapped faces with the target individual. 【0096】 In step 1310, it is determined whether there is at least one more detected input user face in the video frame. If there is at least one more detected input user face in the video frame, control is returned to step 1302. Conversely, if there are no more detected input user faces in the video frame, process 1300 ends. Although process 1300 suggests that 2D target face images are composited onto the video frame sequentially for each input user face, in actual embodiments 2D target face images may be composited onto the video frame at least partially in parallel for two or more input user faces. 【0097】 Figure 14A shows a first example of an original video frame and a corresponding composite video frame. Video frame 1402 is the original video frame showing one input user face. 
After processing pipeline 500 in Figure 5 is applied to recorded video frame 1402, a representation of a selected target face is generated and manipulated to inherit the extrinsic features of the input user face. This representation is then overlaid on video frame 1402 to produce composite video frame 1404, which shows the target face overlaid on the input user face, but with extrinsic features (e.g., facial expressions) that match the input user face. 【0098】 Figure 14B shows a second example of an original video frame and a corresponding composite video frame. Video frame 1450 is the original video frame showing one input user face. After processing pipeline 500 in Figure 5 is applied to video frame 1450, a representation of a selected target face is generated and manipulated to inherit the extrinsic features of the input user face. This representation is then overlaid on video frame 1450 to produce composite video frame 1452, which shows the target face overlaid on the input user face, but with extrinsic features (e.g., facial expressions) that match the input user face. As shown in the example in Figure 14B, composite video frame 1452 shows not only the target face overlaid on the input user face, but also the hair, neck, and body of the original input user, as well as the background of the original video frame 1450. 【0099】 Figure 15 shows an example of a pipeline that utilizes predicted face positions to generate a composite video frame containing representations of target faces corresponding to n input user faces detected in an input recorded video frame. 
The prediction processing pipeline 1500 is a modification of the processing pipeline 500 in Figure 5. It uses bounding boxes, face landmarks, and alignment information for up to n input user faces, obtained from one or more recent historical video frames of the same video recording, to predict the updated positions of the face features of those input user faces in the current video frame, and then uses these updated positions to generate new cropped images of the input user faces in the current video frame. As described below, by using the historical positions of the face features of the input user faces, the trajectory and / or rotation of each input user face can be predicted and then used to generate a cropped image of the input user faces from the current video frame without waiting for the precise bounding boxes, face landmarks, and alignment information corresponding to the input user faces to be determined from the current video frame. 【0100】 Source 1502 of the face swap pipeline 1500 records the current video frame in the process of creating a video recording. The source can be a camera or storage where the video frame is stored. For example, suppose one or more input users are standing in front of the camera and facing the camera. The recorded video frame is output to both the predicted face position 1504 and the face detector inference 1506. The predicted face position 1504 is configured to receive historical bounding boxes, face landmarks, and / or alignment information related to the input user face determined from a predetermined number (e.g., 10) of recent historical video frames from the same video recording. In particular, if source 1502 operates at a fast clock and can generate video frames at a high frame rate (e.g., 60 fps) and a high resolution (e.g., 1080p), the relative change between input user face positions from one video frame to the next is expected to be small. 
Therefore, the predicted face position 1504 can accurately predict the new position (e.g., bounding box, face landmark, and / or alignment) of the input user face in the current video frame without waiting for the actual position to be determined, using the historical bounding box, face landmark, and / or alignment information associated with the input user face detected from the most recent historical video frame generated by the source 1502. In some embodiments, the predicted face position 1504 stores the historical bounding box, face landmark, and / or alignment information associated with the input user face detected from a predetermined number (e.g., 10) of the most recent historical video frames, and then uses this recent historical information to generate a cropped image of each detected input user face and a corresponding set of alignment information, which are output to the face ID inference 1512 for further processing. 【0101】 While the predicted face position 1504 generates a cropped image of the input user face from the current video frame, the face detection inference 1506, face landmark inference 1508, and face alignment 1510 are configured to determine bounding boxes, face landmarks, and / or alignment information related to the input user face detected in the current video frame, similar to the operation of the face detector inference 504, face landmark inference 506, and face alignment 508 described above for the pipeline 500 in Figure 5. In some embodiments, the face detector inference 1506, face landmark inference 1508, and face alignment 1510 may also output bounding boxes, face landmarks, and / or alignment information related to the input user face detected in the current video frame to the predicted face position 1504 for use as historical information to predict the position of the input user face in the next video frame. 
In some embodiments, the face alignment 1510 generates a cropped image of the detected input user face in parallel with the generation of a cropped image by the predicted face position 1504. The face alignment 1510 can then transmit the cropped image based on the actually detected input user face position (not the predicted input user face position) so that the discrepancy between the actual cropped image and the predicted cropped image can be determined, for the purpose of deciding whether the predicted face position 1504 needs to update its prediction technique (e.g., a machine learning model for predicting the input user face position). For example, if the discrepancy between the actual cropped image and the predicted cropped image is greater than a predetermined threshold, the machine learning model for predicting the input user face position is retrained to adjust the model's weights. Face ID inference 1512, face swap inference 1 / n 1514, face swap inference n / n 1516, synthesis 1518, and display 1520 are configured to operate similarly to face ID inference 510, face swap inference 1 / n 512, face swap inference n / n 514, synthesis 516, and display 518 described for pipeline 500 in Figure 5. Generally, as long as historical input user face positions from recent video frames are available, the prediction processing pipeline 1500, through predictions made based on such historical position information, can provide faster output of composite video frames (and consequently, lower glass-to-glass latency) than the processing pipeline 500. 【0102】 Figure 16 is a flowchart illustrating an example of a process that generates a cropped image of an input user face from the current video frame using predicted face position, according to several embodiments. In some embodiments, process 1600 is performed by the face augmentation server 108 in Figure 1. 
In some embodiments, the predicted face position 1504 in pipeline 1500 in Figure 15 may be performed using process 1600. 【0103】 In step 1602, historical alignment information corresponding to the input user face detected in a pre-recorded video frame is received. In some embodiments, bounding boxes, face landmarks, and / or alignment information of the input user face detected in a predetermined number of the most recent video frames for the current video frame of the same video recording are retrieved from storage (e.g., memory). 【0104】 In step 1604, historical alignment information and predicted trajectories are used to generate a cropped image from the current video frame with standardized attributes including the input user face. Such historical bounding boxes, face landmarks, and / or alignment information of the input user face detected in a predetermined number of most recent video frames are used to determine the predicted trajectory for each of one or more input user faces, and these trajectories are used to predict the current position (e.g., the user's head, the user's facial features) within the current video frame. The predicted positions are then used to align, orient, and / or scale each input user face within the current video frame to a standard position within the crop of the current video frame, so that a cropped image of the input user face is generated from the current video frame and used for further processing in the face replacement pipeline. 【0105】 Although the embodiments described above have been explained in some detail for the sake of clarity, the present invention is not limited to the details provided. Many alternative methods exist for carrying out the present invention. The disclosed embodiments are illustrative and not intended to be limiting.
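The trajectory prediction of step 1604 can be illustrated with a minimal sketch. It assumes a constant-velocity model applied to a single tracked point, such as the center of a bounding box; this is a deliberately simple stand-in for the prediction technique of the embodiments, which may instead use a machine learning model. The function name `predict_next` and the representation of the history as a list of (x, y) tuples are hypothetical.

```python
def predict_next(history):
    """Predict the next (x, y) position from recent positions, oldest first.

    Uses the average per-frame velocity over the history window and
    extrapolates one frame ahead (an assumed constant-velocity model).
    """
    if not history:
        raise ValueError("history must contain at least one position")
    if len(history) == 1:
        return history[0]  # no motion information: reuse the last position
    (x0, y0), (xn, yn) = history[0], history[-1]
    steps = len(history) - 1
    vx, vy = (xn - x0) / steps, (yn - y0) / steps
    return (xn + vx, yn + vy)


# Bounding-box centers from three recent frames, moving right and down.
recent = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)]
print(predict_next(recent))  # (6.0, 3.0)
```

The same extrapolation could be applied to each face landmark or alignment parameter, so that a cropped image can be generated from the current video frame before detection on that frame has finished.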
Claims
[Claim 1] A system comprising: a memory; and a processor coupled to the memory, wherein the processor is configured to: obtain a set of user face features corresponding to an input user face in a recorded video frame; generate a cropped image including the input user face using at least the recorded video frame and the set of user face features; encode at least a portion of the cropped image into a plurality of user extrinsic features using a target face swap model; generate a representation of a target face using the target face swap model and the plurality of user extrinsic features; and overlay the representation of the target face onto the recorded video frame.
[Claim 2] The system according to claim 1, wherein the processor is further configured to: receive the recorded video frame; receive a specified number of input user faces to detect; and detect one or more input user faces within the recorded video frame according to the specified number.
[Claim 3] The system according to claim 1, wherein the processor is further configured to determine a face identifier (ID) associated with the input user face.
[Claim 4] The system according to claim 3, wherein the processor is further configured to: determine a target face ID that is mapped to the face ID associated with the input user face; and retrieve the target face swap model associated with the target face ID from storage.
[Claim 5] The system according to claim 1, wherein the target face swap model is pre-generated using images of the target face, and the images of the target face comprise cropped images aligned according to standardized parameters.
[Claim 6] The system according to claim 5, wherein the cropped image including the input user face is also aligned according to the standardized parameters. 
[Claim 7] The system according to claim 1, wherein the representation of the target face comprises a two-dimensional (2D) image of the target face.
[Claim 8] The system according to claim 7, wherein the 2D image of the target face comprises an RGB image with a mask.
[Claim 9] The system according to claim 1, wherein the processor is further configured to generate a set of alignment information relating to the input user face, the set of alignment information describing one or more of the following: the scale of the input user face in the cropped image relative to the recorded video frame, the rotation of the input user face, and the translation of the input user face.
[Claim 10] The system according to claim 9, wherein the processor is further configured to modify the representation of the target face using the set of alignment information before overlaying the representation of the target face onto the recorded video frame.
[Claim 11] The system according to claim 10, wherein the processor is further configured to output the recorded video frame, including the overlay of the representation of the target face, to a display.
[Claim 12] The system according to claim 1, wherein the recorded video frame comprises a first recorded video frame and the cropped image comprises a first cropped image, and wherein the processor is further configured to: obtain a second recorded video frame; and predict a second cropped image including the input user face using at least the second recorded video frame and the set of user face features related to the input user face in the first recorded video frame.
[Claim 13] The system according to claim 12, wherein the plurality of user extrinsic features comprises a first plurality of user extrinsic features, and the representation of the target face comprises a first representation of the target face. 
The processor is further configured to: encode at least a portion of the second cropped image into a second plurality of user extrinsic features using the target face swap model; generate a second representation of the target face using the target face swap model and the second plurality of user extrinsic features; and overlay the second representation of the target face onto the second recorded video frame.
[Claim 14] A method comprising: obtaining a set of user face features corresponding to an input user face in a recorded video frame; generating a cropped image including the input user face using at least the recorded video frame and the set of user face features; encoding at least a portion of the cropped image into a plurality of user extrinsic features using a target face swap model; generating a representation of a target face using the target face swap model and the plurality of user extrinsic features; and overlaying the representation of the target face onto the recorded video frame.
[Claim 15] The method according to claim 14, further comprising: receiving the recorded video frame; receiving a specified number of input user faces to detect; and detecting one or more input user faces within the recorded video frame according to the specified number.
[Claim 16] The method according to claim 14, further comprising determining a face identifier (ID) associated with the input user face.
[Claim 17] The method according to claim 16, further comprising: determining a target face ID that is mapped to the face ID associated with the input user face; and retrieving the target face swap model associated with the target face ID from storage. 
[Claim 18] The method according to claim 14, further comprising generating a set of alignment information relating to the input user face, wherein the set of alignment information describes one or more of the following: the scale of the input user face in the cropped image relative to the recorded video frame, the rotation of the input user face, and the translation of the input user face.
[Claim 19] The method according to claim 14, wherein the recorded video frame comprises a first recorded video frame and the cropped image comprises a first cropped image, the method further comprising: obtaining a second recorded video frame; and predicting a second cropped image including the input user face using at least the second recorded video frame and the set of user face features related to the input user face in the first recorded video frame.
[Claim 20] A computer program product embodied in a non-transitory computer-readable storage medium and comprising computer instructions for: obtaining a set of user face features corresponding to an input user face in a recorded video frame; generating a cropped image including the input user face using at least the recorded video frame and the set of user face features; encoding at least a portion of the cropped image into a plurality of user extrinsic features using a target face swap model; generating a representation of a target face using the target face swap model and the plurality of user extrinsic features; and overlaying the representation of the target face onto the recorded video frame.