A screen-based interactive animation AI imaging method and system
By employing high-precision face detection, PuLID and ControlNet generation, InstantID feature transfer, and deep learning color matching, the problems of visual realism and computational efficiency in existing face-swapping technologies have been solved, achieving efficient and natural face-swapping effects.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TOTEM VISION (GUANGZHOU) DIGITAL TECH CO LTD
- Filing Date
- 2026-02-25
- Publication Date
- 2026-06-19
Figure CN122243723A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of robot screen interaction technology, and in particular to a screen-based interactive animation AI imaging method and system.
Background Technology
[0002] With the rapid development of artificial intelligence technology, face-swapping technology in the field of digital image processing has made significant progress and has been widely used in many fields such as entertainment, film and television production, and virtual social networking. Face-swapping technology aims to accurately transfer the features of one face to the target face location in another image or video to achieve a face replacement effect. However, despite the achievements of existing technologies, many challenges and shortcomings still exist in practical applications.
[0003] Traditional face-swapping technology suffers from problems such as lack of visual realism, weak scene adaptability, and low computational efficiency. The root causes are directly related to deficiencies in the underlying algorithm logic, data processing capabilities, and physical modeling. Specifically, visually, face-swapping results often exhibit flaws such as blurred edges, distorted textures, and stiff expressions. For example, "discontinuities" easily appear where the face blends with the background, and the muscle movement logic is flawed during dynamic expressions. In complex scenes, it has poor handling capabilities for changes in lighting, multi-angle movement, or facial occlusion; for instance, in backlit environments, the contrast between light and dark areas in the face-swapping region is obvious. In terms of computational efficiency, traditional models are time-consuming to train and process.
[0004] The core problems of traditional face-swapping techniques in existing technologies stem from several key issues: First, algorithms primarily rely on 2D planar feature matching, failing to construct a 3D facial structure model. For example, using 2D transformations to handle head rotation can lead to perspective errors. Second, the generative model architecture is simple (e.g., early GANs), with a single loss function relying solely on pixel-level loss, making it difficult to capture subtle textures such as pores and wrinkles. Third, training data is mostly aligned frontal faces, lacking multi-angle and multi-lighting scene data; manually annotated feature points also struggle to cover all micro-expressions. Fourth, the technology fails to simulate facial light and shadow reflection characteristics and the physical laws of muscle movement, resulting in unnatural lighting and abrupt expression transitions in the swapped area. Fifth, the model has a large number of parameters and lacks lightweight optimization, leading to high computational complexity and difficulty meeting real-time processing requirements. These problems essentially represent a contradiction between 2D processing logic and 3D reality, as well as a mismatch between model capabilities and complex facial features.
Summary of the Invention
[0005] Therefore, the purpose of this invention is to provide a screen-based interactive animation AI imaging method and system to address the shortcomings of the prior art.
[0006] In a first aspect, the present invention provides a screen-based interactive animation AI imaging method, the method comprising:
uploading a face-swapping template and a target person image, processing the target person image using a high-precision face detection model, and locating the face bounding box;
extracting facial key points based on the located face bounding box, cropping the target face to a standard size based on the key points to generate the facial region of the target person, and dynamically adjusting the facial region;
extracting the FaceID embedding of the target face using a pre-trained face recognition model, converting the FaceID embedding into a conditional control signal through a PuLID adapter, and generating an initial face-swapping image with the template image as the structural guide under the guidance of PuLID and ControlNet;
extracting a facial structure map from the initial face-swapping image, mapping the facial features of the target person in the initial face-swapping image to the template face based on InstantID, adjusting the transition and structure of the target person's facial features in combination with the features of the template face, and outputting a secondary image through InstantID;
adaptively adjusting the color of the secondary image using a deep learning color matching model, in which the skin color statistics of the non-face area in the template image are used as a reference to align the colors of the face-swapping area in the secondary image, to obtain the adjusted secondary image; and
identifying the original size of the face-swapping template, stitching the secondary image based on the original size of the face-swapping template, and softening the stitched area to output the finished face-swapping product.
[0007] Furthermore, the step of processing the target person image using a high-precision face detection model includes: using the YOLOv8-face model to perform standardized preprocessing on the face-swapping template and the facial information of the target person, and mapping the facial information of the target person onto the corresponding area of the face-swapping template.
[0008] Furthermore, the step of dynamically adjusting the facial region includes: analyzing the facial information of the target person with a deep learning model to locate the facial features and generate a contour mask, and calculating the proportion of the target person's facial area in the template face; if the proportion exceeds a preset threshold, cropping the facial area while retaining the facial features; and if the proportion does not exceed the preset threshold, enlarging the facial region according to a preset ratio and processing the facial region by bicubic interpolation.
[0009] Furthermore, prior to the step of generating the initial face-swapping image with the template image as the structural guide under the guidance of PuLID and ControlNet, the method further includes: generating a high-precision mask based on the facial region, and semantically segmenting the facial region and the background region of the template face.
[0010] Furthermore, the step of mapping the facial features of the target person in the initial face-swapping image to the template face based on InstantID, and adjusting the transition and structure of the target person's facial features in combination with the features of the template face, includes: based on the InstantID node, mapping the facial texture, shape, proportion, and details of the target person to the template face using a feature transfer algorithm, and dynamically adjusting the light and shadow transitions and three-dimensional structure of the target person's facial features in combination with the light and shadow and expression features of the template face; and, based on InstantID, detecting the junctions between the target person's facial features and the surrounding area, and processing those junctions by edge feathering and pixel supplementation.
[0011] Furthermore, the step of adaptively adjusting the color of the secondary image using a deep learning color matching model includes: analyzing the overall color distribution of the template face and extracting the dominant hue and color temperature parameters; and, based on the inherent color features of the skin and hair of the target face in the secondary image, together with the dominant hue and color temperature parameters, unifying the colors of the secondary image and the template face through histogram matching and linear transformation.
[0012] Furthermore, the step of identifying the original size of the face-swapping template, stitching the secondary image based on the original size of the face-swapping template, and softening the stitched area includes: identifying the original size and cropping coordinates of the face-swapping template, and stitching the secondary image back to its initial position in the face-swapping template; and softening the seams using an edge processing algorithm.
[0013] In a second aspect, the present invention also provides a screen-based interactive animation AI imaging system, the system comprising:
an upload processing module, used to upload the face-swapping template and the target person image, process the target person image through a high-precision face detection model, and locate the face bounding box;
an extraction and generation module, used to extract facial key points based on the located face bounding box, crop the target face to a standard size based on the key points to generate the facial region of the target person, and dynamically adjust the facial region;
a recognition and conversion module, used to extract the FaceID embedding of the target face using a pre-trained face recognition model, convert the FaceID embedding into a conditional control signal through a PuLID adapter, and generate the initial face-swapping image with the template image as the structural guide under the guidance of PuLID and ControlNet;
an extraction and mapping module, used to extract a facial structure map from the initial face-swapping image, map the facial features of the target person in the initial face-swapping image to the template face based on InstantID, adjust the transition and structure of the target person's facial features in combination with the features of the template face, and output a secondary image through InstantID;
an adaptive color adjustment module, used to adaptively adjust the color of the secondary image using a deep learning color matching model, in which the skin color statistics of the non-face area in the template image are used as a reference to align the colors of the face-swapping area in the secondary image, to obtain the adjusted secondary image; and
an identification and stitching module, used to identify the original size of the face-swapping template, stitch the secondary image based on the original size of the face-swapping template, and soften the stitched area to output the finished face-swapping product.
[0014] In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the screen-based interactive animation AI imaging method as described above.
[0015] In a fourth aspect, the present invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above-described screen-based interactive animation AI imaging method.
[0016] Compared with existing technologies, the beneficial effects of this invention are as follows: By processing the face-swapping template and the target person's facial information and making dynamic adjustments, and then generating the initial face-swapping image through PuLID and FaceID dual face-swapping nodes, the target face and template are accurately mapped and the contour is reconstructed, ensuring that the facial area size is adapted and the contour fits. The InstantID node is used to perform facial feature transfer, edge feathering and pixel compensation to eliminate face-swapping traces. Through color transfer algorithms and AI intelligent color adjustment engines, the color space of the target and template is unified to achieve global coordination of light and shadow and color temperature, and finally output a visually harmonious face-swapping product. Furthermore, by stitching the secondary image, non-facial elements such as the original background and hairstyle can be retained, which can maximize the maintenance of the original image composition.
Attached Figure Description
[0017] Figure 1 is a flowchart of the screen-based interactive animation AI imaging method in the first embodiment of the present invention; Figure 2 is a structural block diagram of the screen-based interactive animation AI imaging system according to the second embodiment of the present invention; Figure 3 is a schematic diagram of the structure of the electronic device in the third embodiment of the present invention.
[0018] Explanation of key component symbols: 10. Upload processing module; 20. Extraction and generation module; 30. Recognition and conversion module; 40. Extraction and mapping module; 50. Adaptive color adjustment module; 60. Recognition and stitching module; 70. Bus; 71. Processor; 72. Memory; 73. Communication interface.
[0019] The following detailed description, in conjunction with the accompanying drawings, will further illustrate the present invention.
Detailed Implementation
[0020] To facilitate understanding of the present invention, a more complete description will be given below with reference to the accompanying drawings. Several embodiments of the invention are illustrated in the drawings. However, the invention can be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
[0021] It should be noted that when a component is said to be "fixed to" another component, it can be directly on the other component or there may be an intervening component. When a component is said to be "connected to" another component, it can be directly connected to the other component or there may be an intervening component. The terms "vertical," "horizontal," "left," "right," and similar expressions used in this document are for illustrative purposes only.
[0022] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0023] Example 1
Referring to Figure 1, which shows a screen-based interactive animation AI imaging method according to the first embodiment of the present invention, the method includes steps S1 to S6:
S1, upload the face-swapping template and the target person image, process the target person image using a high-precision face detection model, and locate the face bounding box.
Specifically, step S1 includes step S11:
S11, the YOLOv8-face model is used to perform standardized preprocessing on the face-swapping template and the facial information of the target person, and the facial information of the target person is mapped to the corresponding area of the face-swapping template.
It will be understood that once the face-swapping template and the target person's facial information are uploaded, image size standardization preprocessing is completed automatically, so that the target person's face can be accurately mapped by the algorithm to the corresponding area of the original template image, laying the foundation for the subsequent face swap. It should be noted that a high-precision face detection model (YOLOv8-face) is used to process the target person image, locate the face bounding box, and extract 68 or 106 facial landmarks to accurately identify the positions of the facial features (eyes, nose, mouth, and contours). Based on the landmarks, the target face is cropped to a standard size (512×512) to generate the target person's facial region (Aligned Target Face).
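The landmark-based crop described in this paragraph can be illustrated with a small sketch. It is a hypothetical example under stated assumptions, not the implementation of the invention: landmarks are assumed to be already available as (x, y) pairs from some detector, the helper names `square_crop_box` and `scale_to_standard` and the 25% margin are illustrative choices, and only the 512×512 standard size comes from the description.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def square_crop_box(landmarks: List[Point], img_w: int, img_h: int,
                    margin: float = 0.25) -> Box:
    """Return a square box around the landmarks, kept inside the image.

    `margin` (an assumed value) pads each side by a fraction of the
    landmark extent so the crop includes the face contour.
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    extent = max(max(xs) - min(xs), max(ys) - min(ys))
    side = min(extent * (1.0 + 2.0 * margin), img_w, img_h)
    half = side / 2.0
    # Shift the box if it would extend past an image border.
    left = min(max(cx - half, 0.0), img_w - side)
    top = min(max(cy - half, 0.0), img_h - side)
    return (int(round(left)), int(round(top)),
            int(round(left + side)), int(round(top + side)))


def scale_to_standard(box: Box, standard: int = 512) -> float:
    """Resize factor that maps the square crop to standard x standard pixels."""
    left, _top, right, _bottom = box
    return standard / float(right - left)


box = square_crop_box([(100, 100), (200, 100), (150, 180)], 640, 480)
factor = scale_to_standard(box)
```

The actual resize to 512×512 would be done by an image library; it is left out here to keep the sketch dependency-free.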
[0024] S2, extract facial key points based on the located face bounding box, crop the target face to a standard size based on the key points to generate the facial region of the target person, and dynamically adjust the facial region.
Specifically, step S2 includes steps S21 to S23:
S21, the facial information of the target person is analyzed by a deep learning model to locate the facial features and generate a contour mask, and the proportion of the target person's facial area in the template face is calculated;
S22, if the proportion exceeds a preset threshold, the facial area is cropped while the facial features are retained;
S23, if the proportion does not exceed the preset threshold, the facial region is enlarged according to a preset ratio and processed by bicubic interpolation.
It will be understood that the AI uses a deep learning model (such as keypoint detection) to automatically analyze the facial features of the target portrait, accurately locate key parts such as the eyes, nose, and mouth, and generate a facial contour mask. This process simultaneously calculates the proportion of the facial area within the entire image. If the proportion of the target portrait's facial area differs from that of the template by more than a threshold (e.g., ±20%), an intelligent scaling mechanism is activated. When the template's facial area is too small, the AI enlarges the target face proportionally, using a bicubic interpolation algorithm to maintain pixel clarity and avoid stretching and distortion. If the template's facial area is too large, the system uses intelligent cropping to shrink the target face while preserving the relative positions of the key facial feature points.
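The proportion check in step S2 can be sketched as follows. This is a minimal illustration under assumptions that are not in the description: masks are plain lists of 0/1 rows, the goal is taken to be that the target face should occupy the same share of the image as the template face, and the function names are hypothetical. Only the ±20% tolerance is taken from the example above. The resampling itself (bicubic interpolation) would be delegated to an image library and is not shown.

```python
import math
from typing import List


def face_area_ratio(mask: List[List[int]]) -> float:
    """Fraction of pixels set to 1 in a binary face mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def scale_factor(target_ratio: float, template_ratio: float,
                 tolerance: float = 0.20) -> float:
    """Linear scale to apply to the target face (assumed policy).

    Returns 1.0 when the two ratios differ by no more than `tolerance`
    (relative to the template). Otherwise returns the factor that makes
    the areas match; area scales with the square of the linear size,
    hence the square root. A value above 1 means enlarge, below 1 shrink.
    """
    relative_diff = (target_ratio - template_ratio) / template_ratio
    if abs(relative_diff) <= tolerance:
        return 1.0
    return math.sqrt(template_ratio / target_ratio)


ratio = face_area_ratio([[0, 1], [1, 1]])
enlarge = scale_factor(0.10, 0.40)   # target face too small
shrink = scale_factor(0.40, 0.10)    # target face too large
keep = scale_factor(0.30, 0.28)      # within tolerance
```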
[0025] It should be noted that face detection and alignment are also performed on the face-swapping template image (i.e., the face position in the target body / scene to be replaced). Based on the pose, scale, and lighting direction of the template face, the geometry of the target facial region is dynamically adjusted (e.g., through latent space editing of the 3DMM model) so that it initially matches the spatial layout of the template face in terms of angle, size, and expression, in order to reduce geometric inconsistencies during subsequent fusion.
[0026] S3, a pre-trained face recognition model is used to extract the FaceID embedding of the target face, the FaceID embedding is converted into a conditional control signal through a PuLID adapter, and the initial face-swapping image is generated with the template image as the structural guide under the guidance of PuLID and ControlNet.
It should be noted that a pre-trained face recognition model (such as InsightFace) is used to extract the FaceID embedding (a 512-dimensional vector) of the target face. The FaceID embedding is then converted by a PuLID adapter into a conditional control signal compatible with the Stable Diffusion UNet. During the denoising process of the diffusion model, this signal is injected into the intermediate layers (the Cross-Attention layers) to guide generation toward preserving the target's identity. The template face is used as the input of ControlNet (using OpenPose and a Depth Map) to control the pose, contour, and layout of the generated image. Under the dual guidance of PuLID (identity) and ControlNet (structure), the diffusion model generates an image that preserves the target face's identity but whose pose, lighting, and background are consistent with the template, i.e., the initial face-swapping image.
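To make the idea of injecting a conditioning signal through a cross-attention layer more concrete, the following toy sketch implements generic scaled dot-product cross-attention in plain Python. It is not PuLID's or Stable Diffusion's actual code: the tiny vector sizes, the function names, and the idea of simply appending identity tokens to the conditioning tokens are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]


def build_context(text_tokens: List[Vector], id_tokens: List[Vector]) -> List[Vector]:
    """Assumed scheme: identity tokens are appended to the other condition tokens."""
    return text_tokens + id_tokens


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """softmax(Q K^T / sqrt(d)) V, computed row by row."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)                       # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# One image-feature query attending over one text token and one identity token.
keys = build_context([[0.0, 1.0]], [[0.0, 1.0]])
values = build_context([[2.0, 0.0]], [[4.0, 2.0]])
mixed = cross_attention([[1.0, 0.0]], keys, values)
```

Because the query scores both tokens equally in this example, the output is the average of the two value vectors, which shows how an added identity token shifts the features the image query receives.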
[0027] Specifically, before step S3 is performed, the method also includes step S031:
S031, a high-precision mask is generated based on the facial region, and the facial region and the background region of the template face are semantically segmented.
It will be understood that after the size adjustment is completed, the system automatically generates a high-precision mask for the facial area of the template and strictly distinguishes facial from non-facial areas through semantic segmentation. This ensures that in the subsequent redrawing process only the facial area is replaced at the pixel level, while non-facial elements such as the background and hairstyle of the original image remain unchanged, maximizing the preservation of the overall composition of the original image.
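The effect of the mask, namely that only facial pixels are replaced while everything else is kept from the template, can be written as a per-pixel blend. The sketch below is a hypothetical single-channel illustration using nested lists; the function name `composite` and the data layout are assumptions, and a real pipeline would operate on image arrays.

```python
from typing import List

Image = List[List[float]]


def composite(template: Image, generated: Image, mask: Image) -> Image:
    """Blend per pixel: out = mask * generated + (1 - mask) * template.

    mask = 1 takes the generated (face) pixel, mask = 0 keeps the
    template pixel, and values in between mix the two.
    """
    blended = []
    for t_row, g_row, m_row in zip(template, generated, mask):
        blended.append([m * g + (1.0 - m) * t
                        for t, g, m in zip(t_row, g_row, m_row)])
    return blended


template = [[10.0, 10.0], [10.0, 10.0]]
generated = [[200.0, 200.0], [200.0, 200.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
result = composite(template, generated, mask)
```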
[0028] It should be noted that the target person's facial region, after being resized and masked, and the template facial region are imported into the PuLID and FaceID dual-face-swapping node processing module. These two nodes ensure that the swapped facial contours strictly match the target person's features. After the dual-face-swapping nodes work together, the system generates the first swapped image, providing the basic framework for subsequent facial feature fusion.
[0029] S4, extract a facial structure map from the initial face-swapping image, map the facial features of the target person in the initial face-swapping image to the template face based on InstantID, adjust the transition and structure of the target person's facial features in combination with the features of the template face, and output a secondary image through InstantID.
Specifically, step S4 includes steps S41 to S42:
S41, based on the InstantID node and through a feature transfer algorithm, the facial texture, shape, proportion, and details of the target person are mapped to the template face, and the light and shadow transitions and three-dimensional structure of the target person's facial features are dynamically adjusted in combination with the light and shadow and expression features of the template face;
S42, based on InstantID, the junctions between the target person's facial features and the surrounding area are detected and processed by edge feathering and pixel supplementation.
It will be understood that the initial face-swapping image enters the InstantID node. Through a feature transfer algorithm, the facial texture, shape, and proportions of the target person are mapped to the template face. At the same time, the brightness transitions and three-dimensional structure of the facial features are dynamically adjusted in combination with the template's original lighting and facial expression features. Furthermore, the InstantID model automatically detects the junctions between the facial features and the surrounding skin and hair. Edge feathering and pixel compensation eliminate face-swapping artifacts and ensure a natural transition between the facial features and the template background, producing a secondary image that closely resembles reality. It is worth noting that the facial structure map is extracted from the initial face-swapping image and used as a conditional input for ControlNet. During the generation process, InstantID maps the target identity features onto the geometric structure of the template face and intelligently adjusts the proportions, spacing, and contours of the facial features so that they not only match the target identity but also naturally fit the template's face shape.
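Edge feathering, as mentioned in step S42, is commonly done by blurring a hard mask so that its border becomes a gradual ramp. The sketch below uses a simple box blur as a stand-in; the description does not specify which feathering algorithm is used, so the choice of box blur, the function name `feather_mask`, and the `radius` parameter are all assumptions.

```python
from typing import List


def feather_mask(mask: List[List[float]], radius: int = 1) -> List[List[float]]:
    """Soften a binary mask by averaging each pixel with its neighbours.

    Each output value is the mean of the (2*radius + 1)^2 window around
    the pixel, clipped to the image bounds, so hard 0/1 edges become
    gradual transitions.
    """
    height, width = len(mask), len(mask[0])
    soft = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += mask[yy][xx]
                        count += 1
            soft[y][x] = total / count
    return soft


hard = [[0.0, 0.0, 1.0, 1.0, 1.0, 1.0]]
soft = feather_mask(hard, radius=1)
```

A feathered mask of this kind can then serve as the blending weight between the generated face and the template, so the seam fades in rather than switching abruptly.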
[0030] S5, use a deep learning color matching model to adaptively adjust the color of the secondary image; using the skin color statistics of the non-face area in the template image as a reference, align the color of the face-swapping area in the secondary image to obtain the adjusted secondary image.
Specifically, step S5 includes steps S51 to S52:
S51, analyze the overall color distribution of the template face and extract the dominant hue and color temperature parameters;
S52, based on the inherent color features of the skin and hair of the target face in the secondary image and the dominant hue and color temperature parameters, unify the colors of the secondary image and the template face through histogram matching and linear transformation.
It will be understood that even though the face has already been replaced in the images produced by the two preceding stages, the target person and the template often still differ in their original color space (such as hue, saturation, and brightness). The system uses a color transfer algorithm to adaptively adjust the color of the swapped facial area: first, it analyzes the overall color distribution of the template image and extracts parameters such as the dominant hue and color temperature; then, based on the inherent color features of the target face's skin and hair, it unifies the color space of the target face with that of the template through histogram matching and linear transformation. At the same time, an AI intelligent color adjustment engine fine-tunes the color transition in the swapped area, ensuring that the face and background remain consistent in lighting and color temperature, eliminating any sense of disharmony, and ultimately outputting a visually harmonious image. It is worth noting that a deep learning color matching model (Deep Photo Style Transfer) is used. The skin tone statistics (mean, variance) of non-face areas (such as the neck and shoulders) in the template image are used as a reference to align the color distribution of the face-swapping area. The texture details of the target face are preserved, and only the overall hue, brightness, and saturation are adjusted. After color transfer and AI fine-tuning, a high-quality face-swapping result with natural skin tone, consistent lighting, and seamless edges is obtained.
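Aligning a region to reference statistics (mean, variance) can be expressed as a linear transformation of each color channel. The sketch below shows this for one channel with plain Python lists. It is a simple statistical transfer offered for illustration, not the deep learning model named in the description; the function name `match_mean_std` and the clipping to the 0 to 255 range are assumptions.

```python
from statistics import fmean, pstdev
from typing import List


def match_mean_std(face_values: List[float], reference_values: List[float],
                   eps: float = 1e-6) -> List[float]:
    """Shift and scale one channel so its mean and spread match the reference.

    out = (v - mean_face) * (std_ref / std_face) + mean_ref
    Relative differences between pixels (texture) are kept; only the
    overall level and contrast of the channel change.
    """
    mean_face, std_face = fmean(face_values), pstdev(face_values)
    mean_ref, std_ref = fmean(reference_values), pstdev(reference_values)
    gain = std_ref / max(std_face, eps)      # eps guards against a flat region
    return [min(255.0, max(0.0, (v - mean_face) * gain + mean_ref))
            for v in face_values]


face_channel = [100.0, 120.0, 140.0]        # pixels from the swapped face
neck_channel = [150.0, 160.0, 170.0]        # reference skin pixels, e.g. the neck
aligned = match_mean_std(face_channel, neck_channel)
```

In practice the same transformation would be applied per channel, often in a perceptual color space, which is a design choice the description leaves open.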
[0031] S6, identify the original size of the face-swapping template, stitch the secondary image based on the original size of the face-swapping template, and soften the stitched area to output the finished face-swapping product.
Specifically, step S6 includes steps S61 to S62:
S61, identify the original size and cropping coordinates of the face-swapping template, and stitch the secondary image back to its initial position in the face-swapping template;
S62, soften the stitched area using an edge processing algorithm.
Since the masking in steps S22, S23, and S031, as well as the subsequently generated images, are all based on the cropped area of the template image, an image restoration operation is ultimately required. The system automatically identifies the original size and cropping coordinates of the original template image and accurately stitches the face-swapped local image back to its initial position. At the same time, an edge blending algorithm softens the stitched areas, eliminating jagged edges or color differences caused by cropping. Thus, from material upload, facial analysis, contour reconstruction, feature fusion, and color coordination to image restoration, the entire AI face-swapping process is fully automated, outputting a complete and natural face-swapped product.
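Stitching the processed crop back at its recorded coordinates can be sketched as below. This is a hypothetical single-channel illustration: the function name `paste_back`, the nested-list image layout, and the optional per-pixel `alpha` weights (standing in for the softened seam) are assumptions, and the patch is assumed to have already been resized back to the size of the original crop.

```python
from typing import List, Optional

Image = List[List[float]]


def paste_back(template: Image, patch: Image, left: int, top: int,
               alpha: Optional[Image] = None) -> Image:
    """Return a copy of `template` with `patch` blended in at (left, top).

    `alpha` holds per-pixel blend weights for the patch (1 = patch only,
    0 = template only). Lower weights near the patch border soften the seam.
    """
    output = [row[:] for row in template]    # leave the input untouched
    for py, patch_row in enumerate(patch):
        for px, value in enumerate(patch_row):
            weight = 1.0 if alpha is None else alpha[py][px]
            y, x = top + py, left + px
            output[y][x] = weight * value + (1.0 - weight) * output[y][x]
    return output


canvas = [[0.0] * 4 for _ in range(4)]
patch = [[100.0, 100.0], [100.0, 100.0]]
alpha = [[1.0, 0.5], [1.0, 0.5]]
stitched = paste_back(canvas, patch, left=1, top=2, alpha=alpha)
```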
[0032] It is worth noting that the above steps complete image face swapping and image fusion in about 40 seconds on an RTX 3090 graphics card.
[0033] In other alternative embodiments, a 50% speed improvement is achieved while image quality loss is kept below 10%, compressing the processing time for a single image to 20-25 seconds. This acceleration relies mainly on a deep fusion of SDXL-Lightning lightweight acceleration technology and the original SDXL large model. Traditional SDXL models typically require 20-30 sampling iterations to complete feature calculation and pixel generation during image generation, while SDXL-Lightning requires only 6-8 iterations. In practical implementation, this is achieved by replacing PuLID, FaceID, and SDXL-Lightning with the SDXL-Lightning model and related parameters. This combination with SDXL-Lightning technology not only significantly shortens sampling time but also maintains core indicators such as facial contour accuracy and natural feature fusion, providing a solution that balances speed and quality for applications with stringent time requirements.
[0034] In summary, the screen-based interactive animation AI imaging method described in the above embodiments of the present invention achieves a single-image face-swapping process of approximately 40-50 seconds on an RTX 3090 graphics card, outputting a high-quality face-swapped image. Using SDXL-Lightning lightweight acceleration technology, the processing time is compressed to 25-35 seconds, a 40% speed increase, while quality loss is controlled within 10%. Through AI facial analysis and dynamic size adjustment, combined with PuLID and FaceID dual face-swapping nodes, accurate mapping and contour reconstruction of the target face and template are achieved, ensuring facial area size adaptation and contour fit. InstantID nodes are used for facial feature transfer, edge feathering, and pixel compensation to eliminate face-swapping traces. Color transfer algorithms and an AI intelligent color adjustment engine unify the color space of the target and template, achieving global coordination of light and shadow and color temperature, ultimately outputting a visually harmonious face-swapped product. From material preprocessing and facial analysis to image restoration, the entire process is fully automated, preserving non-facial elements such as the original background and hairstyle, maximizing the maintenance of the original image composition.
[0035] Example 2
This invention also provides a screen-based interactive animation AI imaging system. Referring to Figure 2, which shows a screen-based interactive animation AI imaging system according to a second embodiment of the present invention, the system includes:
the upload processing module 10, used to upload the face-swapping template and the target person image, process the target person image through a high-precision face detection model, and locate the face bounding box;
the extraction and generation module 20, used to extract facial key points based on the located face bounding box, crop the target face to a standard size based on the key points to generate the facial region of the target person, and dynamically adjust the facial region;
the recognition and conversion module 30, used to extract the FaceID embedding of the target face using a pre-trained face recognition model, convert the FaceID embedding into a conditional control signal through a PuLID adapter, and generate an initial face-swapping image with the template image as the structural guide under the guidance of PuLID and ControlNet;
the extraction and mapping module 40, used to extract a facial structure map from the initial face-swapping image, map the facial features of the target person in the initial face-swapping image to the template face based on InstantID, adjust the transition and structure of the target person's facial features in combination with the features of the template face, and output a secondary image through InstantID;
the adaptive color adjustment module 50, used to adaptively adjust the color of the secondary image using a deep learning color matching model, in which the skin color statistics of the non-face area in the template image are used as a reference to align the colors of the face-swapping area in the secondary image, to obtain the adjusted secondary image; and
the identification and stitching module 60, used to identify the original size of the face-swapping template, stitch the secondary image based on the original size of the face-swapping template, and soften the stitched area to output the finished face-swapping product.
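The six modules form a linear chain in which each module consumes the outputs of the previous ones. The sketch below shows one possible way to wire such a chain; it is a structural illustration only. The stage names mirror the module names, but the `placeholder` stages, the dictionary keys, and the file names are hypothetical and stand in for the real processing.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def run_pipeline(stages: List[Tuple[str, Stage]], context: Context) -> Context:
    """Run each named stage in order and record the execution order."""
    context = dict(context)
    context["trace"] = []
    for name, stage in stages:
        context = stage(context)
        context["trace"].append(name)
    return context


def placeholder(output_key: str) -> Stage:
    """Hypothetical stand-in for a module: it only marks its output as produced."""
    def stage(context: Context) -> Context:
        updated = dict(context)
        updated[output_key] = True
        return updated
    return stage


MODULES: List[Tuple[str, Stage]] = [
    ("upload_processing_10", placeholder("face_box")),
    ("extraction_generation_20", placeholder("facial_region")),
    ("recognition_conversion_30", placeholder("initial_image")),
    ("extraction_mapping_40", placeholder("secondary_image")),
    ("adaptive_color_adjustment_50", placeholder("adjusted_image")),
    ("identification_stitching_60", placeholder("final_image")),
]

result = run_pipeline(MODULES, {"template": "template.png", "target": "person.png"})
```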
[0036] In some alternative embodiments, the upload processing module 10 includes: The processing and mapping unit is used to perform standardized preprocessing on the face-swapping template and the facial information of the target person using the YOLOv8-face model, and to map the facial information of the target person onto the corresponding area of the face-swapping template.
[0037] In some alternative embodiments, the extraction and generation module 20 includes: The analysis and generation unit is used to analyze the facial information of the target person through a deep learning model to locate the facial features, generate a contour mask, and calculate the proportion of the facial area of the target person in the template face. The cropping unit is used to crop the facial region while retaining the facial features if the proportion exceeds a preset threshold. The magnification processing unit is used to magnify the facial region by a preset ratio if the proportion does not exceed the preset threshold, and to process the facial region by bicubic interpolation.
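The proportion test and its two branches can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the image is a single-channel list of rows, the contour mask is binary, the proportion is taken as the mask pixel count divided by a template face area given in pixels, and the names `adjust_face_region` and `resize_bicubic`, the threshold 0.8, and the ratio 1.5 are hypothetical. The enlargement uses the Keys cubic convolution kernel with a = -0.5, a common bicubic choice that the source does not specify.

```python
import math


def cubic_weight(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel; weights over 4 neighbours sum to 1."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def resize_bicubic(img, scale):
    """Resize a grayscale image (list of rows) using a 4x4 neighbourhood."""
    h, w = len(img), len(img[0])
    out_h, out_w = max(1, round(h * scale)), max(1, round(w * scale))
    out = []
    for i in range(out_h):
        sy = (i + 0.5) * h / out_h - 0.5  # source row for this output row
        y0 = math.floor(sy)
        row = []
        for j in range(out_w):
            sx = (j + 0.5) * w / out_w - 0.5
            x0 = math.floor(sx)
            acc = 0.0
            for m in range(-1, 3):
                wy = cubic_weight(sy - (y0 + m))
                yy = min(max(y0 + m, 0), h - 1)  # clamp at image borders
                for n in range(-1, 3):
                    wx = cubic_weight(sx - (x0 + n))
                    xx = min(max(x0 + n, 0), w - 1)
                    acc += wy * wx * img[yy][xx]
            row.append(min(255.0, max(0.0, acc)))
        out.append(row)
    return out


def mask_bbox(mask):
    """Tight bounding box (top, bottom, left, right) of the non-zero mask."""
    rows = [i for i, r in enumerate(mask) if any(r)]
    cols = [j for j in range(len(mask[0])) if any(r[j] for r in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def adjust_face_region(face, mask, template_face_area,
                       threshold=0.8, enlarge_ratio=1.5):
    """Crop to the mask if the face is large enough, otherwise enlarge it."""
    proportion = sum(map(sum, mask)) / template_face_area
    if proportion > threshold:
        top, bottom, left, right = mask_bbox(mask)
        return "crop", [r[left:right] for r in face[top:bottom]]
    return "enlarge", resize_bicubic(face, enlarge_ratio)
```

Cropping to the bounding box of the contour mask is one way to keep the facial features inside the crop; a production system would more likely call a library resize routine than a pure-Python loop.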
[0038] In some alternative embodiments, the recognition and conversion module 30 includes: The generation and segmentation unit is used to generate a high-precision mask based on the facial region and to semantically segment the facial region and the background region of the template face.
[0039] In some alternative embodiments, the extraction and mapping module 40 includes: The mapping and combining unit is used to map the facial texture, shape, proportion and details of the target person to the template face based on the InstantID node through a feature transfer algorithm, and to dynamically adjust the light and shadow transition and three-dimensional structure of the target person's facial features in combination with the light and shadow and expression features of the template face. The detection and processing unit is used to detect, based on InstantID, the connection between the facial features of the target person and the surrounding area, and to process the connection area by edge feathering and pixel supplementation.
[0040] In some alternative embodiments, the adaptive color adjustment module 50 includes: The analysis and extraction unit is used to analyze the overall color distribution of the template face and extract the dominant hue and color temperature parameters; The unification unit is used to unify the colors of the secondary image and the template face through histogram matching and linear transformation, based on the inherent color features of the skin and hair of the target face in the secondary image and on the dominant hue and color temperature parameters.
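The two operations named here, histogram matching and a linear transformation, can be illustrated for a single 8-bit color channel. The sketch below is an assumption-laden example rather than the color matching model of the embodiment: `match_histogram` and `linear_transfer` are hypothetical names, the channel is a flat list of integers in 0..255, and the linear transformation is assumed to be a mean and standard deviation alignment, which the source does not state.

```python
from statistics import fmean, pstdev


def cdf(values, levels=256):
    """Cumulative distribution of integer pixel values in [0, levels)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    out, running = [], 0
    for count in hist:
        running += count
        out.append(running / len(values))
    return out


def match_histogram(source, reference, levels=256):
    """Remap source values so their CDF follows the reference CDF."""
    src_cdf, ref_cdf = cdf(source, levels), cdf(reference, levels)
    lut, r = [], 0
    for s in range(levels):
        # Both CDFs are non-decreasing, so r only ever moves forward.
        while r < levels - 1 and ref_cdf[r] < src_cdf[s]:
            r += 1
        lut.append(r)
    return [lut[v] for v in source]


def linear_transfer(source, reference):
    """Gain/offset transform aligning source mean and spread to reference."""
    mu_s, mu_r = fmean(source), fmean(reference)
    sd_s, sd_r = pstdev(source), pstdev(reference)
    gain = sd_r / sd_s if sd_s > 0 else 1.0
    return [min(255, max(0, round((v - mu_s) * gain + mu_r))) for v in source]
```

In use, `source` would hold one channel of the face-swapping area in the secondary image and `reference` the same channel sampled from the template, with the functions applied once per channel.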
[0041] In some alternative embodiments, the identification and stitching module 60 includes: The identification and splicing unit is used to identify the original size and cropping coordinates of the face-swapping template, and splice the secondary image back to the initial position of the face-swapping template; The softening unit is used to soften the splicing area using an edge processing algorithm.
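Splicing the secondary image back at its cropping coordinates and softening the seam can be sketched as an alpha blend with a feathered border. This is a hedged example only: the source does not say which edge processing algorithm is used, so the linear feathering ramp, the names `feather_mask` and `stitch_back`, and the default feather width are assumptions, and images are single-channel lists of rows.

```python
def feather_mask(h, w, feather):
    """Alpha map: 1.0 in the interior, ramping down linearly toward the edge."""
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            d = min(x, y, w - 1 - x, h - 1 - y)  # distance to nearest edge
            row.append(min(1.0, (d + 1) / (feather + 1)))
        mask.append(row)
    return mask


def stitch_back(template, patch, top, left, feather=2):
    """Blend patch into a copy of template at (top, left) with a soft seam."""
    out = [row[:] for row in template]  # leave the original template intact
    h, w = len(patch), len(patch[0])
    alpha = feather_mask(h, w, feather)
    for y in range(h):
        for x in range(w):
            a = alpha[y][x]
            base = template[top + y][left + x]
            out[top + y][left + x] = a * patch[y][x] + (1.0 - a) * base
    return out
```

Here `top` and `left` stand for the stored cropping coordinates. Pixels near the patch border are mixed with the template so that the transition is gradual, while interior pixels come entirely from the patch.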
[0042] The functions or operation steps implemented by the above modules and units are largely the same as those in the above method embodiments, and will not be repeated here.
[0043] The screen-based interactive animation AI imaging system provided in this embodiment of the invention has the same implementation principle and technical effects as the aforementioned method embodiment. For the sake of brevity, any parts not mentioned in the system embodiment can be referred to the corresponding content in the aforementioned method embodiment.
[0044] Example 3 The present invention also proposes an electronic device. Figure 3 shows an electronic device according to a third embodiment of the present invention.
[0045] The electronic device may include a processor 71 and a memory 72 storing computer program instructions.
[0046] Specifically, the processor 71 may include a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement this application.
[0047] The memory 72 may include a mass storage device for data or instructions. By way of example and not limitation, the memory 72 may include a hard disk drive (HDD), a floppy disk drive, a solid-state drive (SSD), flash memory, an optical disk drive, a magneto-optical disk drive, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory 72 may include removable or non-removable (or fixed) media. Where appropriate, the memory 72 may be internal or external to a data processing device. In a particular embodiment, the memory 72 is non-volatile memory. In a particular embodiment, the memory 72 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), an electrically alterable read-only memory (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), and the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), etc.
[0048] The memory 72 can be used to store or cache various data files that need to be processed and / or communicated, as well as possible computer program instructions executed by the processor 71.
[0049] The processor 71 reads and executes the computer program instructions stored in the memory 72 to implement the screen-based interactive animation AI imaging method of the above embodiment 1.
[0050] In some embodiments, the electronic device may further include a communication interface 73 and a bus 70. As shown in Figure 3, the processor 71, the memory 72, and the communication interface 73 are connected through the bus 70 and communicate with each other over it.
[0051] The communication interface 73 is used to enable communication between the various modules, devices, units, and / or equipment in this application. The communication interface 73 can also enable data communication with other components such as external devices, image / data acquisition devices, databases, external storage, and image / data processing workstations.
[0052] Bus 70 includes hardware, software, or both, that couples the components of the electronic device together. Bus 70 includes, but is not limited to, at least one of the following: data bus, address bus, control bus, expansion bus, and local bus. By way of example and not limitation, bus 70 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or a combination of two or more of these. Where appropriate, bus 70 may include one or more buses. Although this application describes and illustrates a specific bus, this application contemplates any suitable bus or interconnect.
[0053] The electronic device can acquire the screen-based interactive animation AI imaging system and execute the screen-based interactive animation AI imaging method of this embodiment.
[0054] Furthermore, in conjunction with the screen-based interactive animation AI imaging method in Embodiment 1 above, this application can provide a storage medium for implementation. This storage medium stores computer program instructions; when these computer program instructions are executed by a processor, they implement the screen-based interactive animation AI imaging method of Embodiment 1 above.
[0055] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0056] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent should be determined by the appended claims.
Claims
1. A screen-based interactive animation AI imaging method, characterized in that, The method includes:
Uploading a face-swapping template and a target person image, processing the target person image using a high-precision face detection model, and locating the face bounding box;
Extracting facial key points based on the located face bounding box, cropping the target face to a standard size based on the key points to generate the facial region of the target person, and dynamically adjusting the facial region;
Extracting the FaceID embedding of the target face using a pre-trained face recognition model, converting the FaceID embedding into a conditional control signal through a PuLID adapter, and generating an initial face-swapping image with the template image as the structural guide under the guidance of PuLID and ControlNet;
Extracting a facial structure map from the initial face-swapping image, mapping the facial features of the target person in the initial face-swapping image to the template face based on InstantID, adjusting the transition and structure of the target person's facial features by combining the features of the template face, and processing the target person through InstantID to output a secondary image;
Adaptively adjusting the color of the secondary image using a deep learning color matching model, wherein the skin color statistics of the non-face area in the template image are used as a reference and the color of the face-swapping area in the secondary image is aligned to that reference to obtain the adjusted secondary image;
Identifying the original size of the face-swapping template, stitching the secondary image based on the original size of the face-swapping template, and softening the stitching area to output the finished face-swapping product.
2. The screen-based interactive animation AI imaging method according to claim 1, characterized in that, The steps of processing the target person image using a high-precision face detection model include: The YOLOv8-face model is used to perform standardized preprocessing on the face-swapping template and the facial information of the target person, and the facial information of the target person is mapped onto the corresponding area of the face-swapping template.
3. The screen-based interactive animation AI imaging method according to claim 1, characterized in that, The step of dynamically adjusting the facial region includes: The facial information of the target person is analyzed by a deep learning model to locate the facial features, so as to generate a contour mask, and the proportion of the facial area of the target person in the template face is calculated. If the proportion exceeds a preset threshold, the facial region is cropped while the facial features are retained; If the proportion does not exceed the preset threshold, the facial region is enlarged according to a preset ratio, and the facial region is processed by bicubic interpolation.
4. The screen-based interactive animation AI imaging method according to claim 1, characterized in that, Before the step of generating the initial face-swapped image using a template image as a structural guide and guided by PuLID and ControlNet, the method further includes: A high-precision mask is generated based on the facial region, and the facial region and the background region of the template face are semantically segmented.
5. The screen-based interactive animation AI imaging method according to claim 1, characterized in that, The step of mapping the facial features of the target person in the initial face-swapping image to the template face based on InstantID, and adjusting the transition and structure of the target person's facial features based on the features of the template face, includes: Based on the InstantID node, the facial texture, shape, proportion, and details of the target person are mapped to the template face using a feature transfer algorithm. The light and shadow transitions and three-dimensional structure of the target person's facial features are dynamically adjusted in combination with the light and shadow and expression features of the template face. Based on InstantID, the connection between the facial features of the target person and the surrounding area is detected, and the connection area is processed by edge feathering and pixel supplementation.
6. The screen-based interactive animation AI imaging method according to claim 1, characterized in that, The step of adaptively color adjusting the secondary image using a deep learning color matching model includes: The overall color distribution of the template face is analyzed, and the dominant hue and color temperature parameters are extracted; Based on the inherent color features of the skin and hair of the target face in the secondary image, as well as the dominant hue and color temperature parameters, the colors of the secondary image and the template face are unified through histogram matching and linear transformation.
7. The screen-based interactive animation AI imaging method according to claim 1, characterized in that, The steps of identifying the original size of the face-swapping template, stitching the secondary images based on the original size of the face-swapping template, and softening the stitched areas include: Identify the original size and cropping coordinates of the face-swapping template, and stitch the secondary image back to the initial position of the face-swapping template; Edge processing algorithms are used to soften the splicing area.
8. A screen-based interactive animation AI imaging system, characterized in that, The system includes:
The upload processing module, used to upload the face-swapping template and the target person image, process the target person image through a high-precision face detection model, and locate the face bounding box;
The extraction and generation module, used to extract facial key points based on the located face bounding box, crop the target face to a standard size based on the key points to generate the facial region of the target person, and dynamically adjust the facial region;
The recognition and conversion module, used to extract the FaceID embedding of the target face using a pre-trained face recognition model, convert the FaceID embedding into a conditional control signal through a PuLID adapter, and generate an initial face-swapping image with the template image as the structural guide under the guidance of PuLID and ControlNet;
The extraction and mapping module, used to extract a facial structure map from the initial face-swapping image, map the facial features of the target person in the initial face-swapping image to the template face based on InstantID, adjust the transition and structure of the target person's facial features based on the features of the template face, and process the target person through InstantID to output a secondary image;
The adaptive color adjustment module, used to adaptively adjust the color of the secondary image using a deep learning color matching model, wherein the skin color statistics of the non-face area in the template image are taken as a reference and the color of the face-swapping area in the secondary image is aligned to that reference to obtain the adjusted secondary image;
The identification and stitching module, used to identify the original size of the face-swapping template, stitch the secondary image based on the original size of the face-swapping template, and soften the stitching area to output the finished face-swapping product.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the screen-based interactive animation AI imaging method as described in any one of claims 1 to 7.
10. A storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the screen-based interactive animation AI imaging method as described in any one of claims 1 to 7.