Image conversion apparatus and method, and computer-readable recording medium

The image conversion method using an artificial neural network addresses the limitations of conventional facial landmark technologies by accurately separating and transforming facial landmarks, producing natural-looking videos that mimic user expressions.

JP7875924B2 · Active · Publication Date: 2026-06-18 · HYPERCONNECT LLC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: HYPERCONNECT LLC
Filing Date: 2024-10-28
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Conventional facial image analysis and utilization technologies based on facial landmarks fail to consider facial appearance features and emotional characteristics, leading to performance degradation and the need for improved methods to separate and transform facial landmarks accurately.

Method used

An image conversion method utilizing an artificial neural network to receive a static image, obtain an image conversion template, and convert it into a moving image, incorporating texture and landmark information to generate natural-looking videos.

Benefits of technology

The method provides a video image with the same effect as directly captured facial expressions, enabling accurate landmark data separation and generating images that match user images with target image characteristics, enhancing user experience.

✦ Generated by Eureka AI based on patent content.

Abstract

To provide an image conversion device and method capable of converting still images into natural moving images, and a computer-readable recording medium. SOLUTION: An image conversion device that converts images using an artificial neural network includes: an image receiving unit that receives an image from a user; a template acquisition unit that acquires at least one image conversion template; and an image converter that converts a still image into a moving image using the at least one acquired image conversion template. The image converter converts the still image received by the image receiving unit into the moving image. SELECTED DRAWING: Figure 1

Description

[Technical Field]

[0001] Cross-reference of related applications: This application claims priority to Korean Patent Application No. 10-2019-0141723 filed with the Korean Intellectual Property Office on 7 November 2019, Korean Patent Application No. 10-2019-0177946 filed on 30 December 2019, Korean Patent Application No. 10-2019-0179927 filed on 31 December 2019, and Korean Patent Application No. 10-2020-0022795 filed on 25 February 2020, the contents of which are incorporated herein by reference. The present invention relates to an image conversion apparatus and method, and a computer-readable recording medium. More specifically, it relates to an image conversion apparatus and method capable of converting still images into natural-looking moving images, and a computer-readable recording medium. The present invention relates to a landmark data separation apparatus and method, and a computer-readable recording medium. More specifically, it relates to a landmark data separation apparatus and method, and a computer-readable recording medium, that can more accurately separate landmark data from faces contained in an image. The present invention relates to a landmark separation apparatus and method, and a computer-readable recording medium. More specifically, it relates to a landmark separation apparatus and method capable of separating landmarks from one or a few frames, and a computer-readable recording medium. The present invention relates to an image deformation apparatus and method, and a computer-readable recording medium. More specifically, it relates to an image deformation apparatus and method capable of generating images that are naturally deformed according to the characteristics of different images, and a computer-readable recording medium.

[Background technology]

[0002] Most personal mobile devices have built-in cameras that can capture still images and moving images such as videos. When a user of a personal mobile device needs a moving image of a desired facial expression, they must capture it using the camera built into their device. If the video does not capture the desired facial expression, the user needs to repeat the capture process until a satisfactory result is obtained. Therefore, a method is needed that can apply the desired facial expression to a still image input by the user and convert it into a natural-looking video. Technology for analyzing and utilizing images of human faces based on facial landmarks, obtained by extracting the main points of a person's face, is being actively researched. Facial landmarks include the results of extracting the key points of major elements of the face, such as the eyes, eyebrows, nose, mouth, and jawline, or the contour lines drawn by connecting these points. Facial landmarks are mainly used in technologies such as facial expression classification, pose analysis, and face synthesis and deformation. However, conventional facial image analysis and utilization technologies based on facial landmarks suffer from performance degradation because they do not consider facial appearance features and emotional characteristics when processing facial landmarks. Therefore, to improve the performance of facial image analysis and utilization technologies, there is a need to develop technologies that separate facial landmarks, including those related to facial emotions.

[Summary of the Invention]

[Problems that the invention aims to solve]

[0003] The present invention aims to provide an image conversion device and method capable of converting still images into natural-looking moving images, as well as a computer-readable recording medium. The present invention aims to provide a landmark data separation device and method that can more accurately and precisely separate landmark data from faces included in an image, as well as a computer-readable recording medium. The present invention aims to provide a landmark separation device and method that can separate landmarks even in objects with a small amount of data, as well as a computer-readable recording medium. The present invention aims to provide an image transformation apparatus and method, as well as a computer-readable recording medium, that can generate an image that matches a user's image but has the characteristics of the target image, when given a target image to be transformed, by using a user's image that is different from the target image.

[Means for solving the problem]

[0004] An image conversion method according to one embodiment of the present invention is an image conversion method utilizing an artificial neural network, and includes the steps of: receiving a static image from a user; obtaining at least one image conversion template; and converting the static image into a moving image using the obtained image conversion template.

[Effects of the Invention]

[0005] The present invention may also provide an image conversion device and method that provides a video image having the same effect as a video image captured by a user directly while changing their facial expressions, without directly capturing the video image, as well as a computer-readable recording medium. The present invention may also provide an image conversion device and method that provides a user with an enjoyable user experience by converting still images and providing the user with generated moving images, as well as a computer-readable recording medium. The present invention may also provide a landmark data separation device and method that can more accurately and precisely separate landmark data from faces included in an image, as well as a computer-readable recording medium. The present invention may also provide a landmark data separation device and method that can more accurately separate landmark data containing information about facial characteristics and expressions included in an image, as well as a computer-readable recording medium. The present invention may also provide a landmark separation device and method capable of separating landmarks even in objects with a small amount of data, as well as a computer-readable recording medium. The present invention may also provide an image transformation apparatus and method, as well as a computer-readable recording medium, that can generate an image that matches a user's image but has the characteristics of the target image, when given a target image to be transformed, by using a user's image that is different from the target image.

[Brief explanation of the drawing]

[0006]
[Figure 1] Figure 1 is a schematic diagram showing the environment in which the image conversion method according to the present invention is performed.
[Figure 2] Figure 2 is a schematic diagram showing the configuration of an image conversion device according to one embodiment of the present invention.
[Figure 3] Figure 3 is a flowchart illustrating a schematic image conversion method according to one embodiment of the present invention.
[Figure 4] Figure 4 is a diagram illustrating an image conversion template according to one embodiment of the present invention.
[Figure 5a] Figure 5a is a diagram illustrating the process for generating a moving image according to one embodiment of the present invention.
[Figure 5b] Figure 5b is a diagram illustrating a generated video according to one embodiment of the present invention.
[Figure 6a] Figure 6a is a diagram illustrating an exemplary process for generating a moving image according to another embodiment of the present invention.
[Figure 6b] Figure 6b is a diagram illustrating a generated video according to another embodiment of the present invention.
[Figure 7] Figure 7 is a schematic diagram showing the configuration of an image conversion device according to one embodiment of the present invention.
[Figure 8] Figure 8 is a schematic diagram illustrating the environment in which the method for extracting landmark data from faces contained in an image according to the present invention is performed.
[Figure 9] Figure 9 is a schematic diagram showing the configuration of a landmark data separation device according to one embodiment of the present invention.
[Figure 10] Figure 10 illustrates a method for extracting facial landmark data according to one embodiment of the present invention.
[Figure 11] Figure 11 is a flowchart showing a method for extracting various types of landmark data according to one embodiment of the present invention.
[Figure 12] Figure 12 is a diagram illustrating an exemplary process for transforming facial expressions contained in an image according to another embodiment of the present invention.
[Figure 13] Figure 13 is a comparison table illustrating the effect of transforming facial expressions contained in an image using the landmark data separation method according to the present invention.
[Figure 14] Figure 14 is a schematic diagram showing the configuration of a landmark data separation device according to one embodiment of the present invention.
[Figure 15] Figure 15 is a schematic diagram showing the environment in which the landmark separation device according to the present invention operates.
[Figure 16] Figure 16 is a flowchart illustrating a landmark separation method according to one embodiment of the present invention.
[Figure 17] Figure 17 is a schematic diagram illustrating a method for calculating a transformation matrix according to one embodiment of the present invention.
[Figure 18] Figure 18 is a schematic diagram showing the configuration of a landmark separation device according to one embodiment of the present invention.
[Figure 19] Figure 19 is an illustrative diagram showing a method for recreating a face using the present invention.
[Figure 20] Figure 20 is a schematic diagram showing the environment in which the image deformation apparatus and image deformation method according to the present invention operate.
[Figure 21] Figure 21 is a flowchart illustrating a schematic image deformation method according to one embodiment of the present invention.
[Figure 22] Figure 22 is a diagram illustrating the results of performing an image deformation method according to one embodiment of the present invention.
[Figure 23] Figure 23 is a schematic diagram showing the configuration of an image deformation device according to one embodiment of the present invention.
[Figure 24] Figure 24 is a schematic diagram showing the configuration of a landmark acquisition unit according to one embodiment of the present invention.
[Figure 25] Figure 25 is a schematic diagram showing the configuration of a second encoder according to one embodiment of the present invention.
[Figure 26] Figure 26 is a schematic diagram showing the structure of a blender according to one embodiment of the present invention.
[Figure 27] Figure 27 is a schematic diagram showing the structure of a decoder according to one embodiment of the present invention.
[Figure 28a-c] Figures 28a to 28c show examples of identity preservation failures and the improved results produced by the proposed method. Figure 28a shows interference with the driver shape. Figure 28b shows the loss of details in the target identity, and Figure 28c shows the failure of warping for large poses.
[Figure 29] Figure 29 shows the overall structure of MarioNETte.
[Figure 30] Figure 30 shows the structure of the image attention block.
[Figure 31] Figure 31 shows the structure of the target feature alignment.
[Figure 32] Figure 32 shows the structure of the decomposed part of the landmark.
[Figure 33] Figure 33 shows images generated by re-enacting different identities in CelebV under the proposed method, criteria, and single-shot imaging settings.
[Figure 34] Figure 34 shows the evaluation results of the self-reenactment setting for VoxCeleb1.
[Figure 35] Figure 35 shows the evaluation results of reprising different identities in CelebV.
[Figure 36] Figure 36 shows the results of a user study in which users reenacted different identities in CelebV.
[Figure 37] Figure 37 shows a comparison of ablation models for re-enacting different identities in CelebV.
[Figure 38] Figure 38a shows the driver and target images superimposed on the attention map. Figure 38b shows a failure case of +Alignment and the improved result generated by MarioNETte.
[Figure 39] Figure 39 shows an example of a rasterized facial landmark.
[Figure 40] Figure 40 shows a comparison of ablation models for setting up self-reenactment on the VoxCeleb1 dataset.
[Figure 41] Figure 41 shows the inference speed of each component of the model.
[Figure 42] Figure 42 shows the inference speed of the overall model for generating a single image from K target images.
[Figure 43] Figure 43 shows the qualitative results of the one-time acquisition and re-enactment ablation model under different identity settings in CelebV.
[Figure 44] Figure 44 shows the qualitative results of the ablation model for several imaging and re-enactment sessions under different identity settings in CelebV.
[Figure 45] Figure 45 shows the qualitative results of a one-time acquisition self-reenactment setting in VoxCeleb1.
[Figure 46] Figure 46 shows the qualitative results of multiple self-reenactment setups using VoxCeleb1.
[Figure 47] Figure 47 shows the qualitative results of one-time acquisition and re-enactment under different identity settings in VoxCeleb1.
[Figure 48] Figure 48 shows the qualitative results of several imaging and re-enactment sessions under different identity settings in VoxCeleb1.
[Figure 49] Figure 49 shows the qualitative results of a one-time acquisition self-reenactment setting in CelebV.
[Figure 50] Figure 50 shows the qualitative results of multiple self-reenactment setups using CelebV.
[Figure 51] Figure 51 shows the qualitative results of several imaging and reenactment sessions under different identity settings in CelebV.
[Figure 52] Figure 52 shows a failure example formed with MarioNETte+LTd during a single acquisition and re-enactment under different identity settings in VoxCeleb1.

[Modes for carrying out the invention]

[0007] The advantages and features of the present invention, as well as the methods for achieving them, will become clearer by referring to the embodiments described below in detail with the accompanying drawings. In this regard, embodiments of the present invention may take various forms and are not limited to the descriptions herein. Rather, these embodiments will provide a comprehensive understanding of the present disclosure and fully convey the scope of the present disclosure to those skilled in the art, and the present disclosure is defined solely by the accompanying claims. Throughout the specification, the same reference numerals refer to the same components. Although terms such as "first" or "second" may be used to describe various components, these components are not limited to the terms described above. The terms described above may be used to distinguish one component from another. Therefore, the first component described below may be the second component within the technical concept of the present invention. The terms used herein are for illustrative purposes only and do not limit the invention. In this specification, the singular form includes the plural form unless otherwise specified. As used in this specification, “comprises” or “comprising” does not exclude the presence or addition of one or more other components or steps in the components or steps mentioned. Unless otherwise defined, all terms used herein may be interpreted in a way that can be commonly understood by a person of ordinary skill in the art to which the present invention pertains. Furthermore, terms defined in commonly used dictionaries shall not be interpreted ideally or excessively unless explicitly and specifically defined otherwise.

[0008] Figure 1 is a schematic diagram showing an environment in which the image conversion method according to the present invention is performed. Referring to Figure 1, this environment may include a server 10 and a terminal 20 connected to the server 10. For the sake of explanation, only one terminal is shown in Figure 1, but multiple terminals may be included. The description of terminal 20 applies to additional terminals unless specifically noted otherwise. In an embodiment of the present invention, the server 10 may receive an image from the terminal 20, convert the received image into any format, and then transmit the converted image to the terminal 20. Alternatively, the server 10 may function as a platform providing services that the terminal 20 may connect to and use. The terminal 20 may convert an image selected by the user of the terminal 20 and transmit the converted image to the server 10. Server 10 may be connected to a communication network and, through that network, to other external devices. Server 10 may transmit data to, or receive data from, those devices. The communication network connected to server 10 may include a wired communication network, a wireless communication network, or a hybrid communication network. The communication network may include mobile communication networks such as 3G, LTE, or LTE-A. The communication network may include wired or wireless communication networks such as Wi-Fi, UMTS/GPRS, or Ethernet. The communication network may include short-range communication networks such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), Zigbee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or infrared (IR) communication. The communication network may also include local area networks (LAN), metropolitan area networks (MAN), or wide area networks (WAN). Server 10 may be connected to terminal 20 via a communication network. When server 10 and terminal 20 are connected to each other, server 10 may send and receive data to and from terminal 20 via the communication network. Server 10 may use the data received from terminal 20 to perform any calculations and may transmit the calculation results to terminal 20. Terminal 20 may be a desktop computer, laptop computer, smartphone, smart tablet, smartwatch, mobile terminal, digital camera, wearable device, or portable electronic device. Terminal 20 may run a program or application.

[0009] Figure 2 is a schematic diagram showing the configuration of an image conversion device according to one embodiment of the present invention. Referring to Figure 2, an image conversion device 100 according to one embodiment of the present invention includes an image receiving unit 110, a template acquisition unit 120, and an image conversion unit 130. The image conversion device 100 may be configured by a server 10 or terminal 20 as described with reference to Figure 1. Accordingly, each component included in the image conversion device 100 may also be configured by a server 10 or terminal 20. The image receiving unit 110 receives an image from the user. The image may include the user's face and may be a still (static) image. The size of the user's face included in the image may differ from image to image. For example, the face in image 1 may have a pixel size of 100 x 100, and the face in image 2 may have a pixel size of 200 x 200.

[0010] The image receiving unit 110 may extract only the face region from the image received from the user and then provide this to the image conversion unit 130. The image receiving unit 110 may extract the area corresponding to the user's face from the image containing the user's face at a predetermined size. For example, if the predetermined size is 100 x 100 and the area corresponding to the user's face in the image is 200 x 200, the image receiving unit 110 may reduce the 200 x 200 image to 100 x 100 before extracting it. Alternatively, the 200 x 200 image may be extracted first and then converted to a 100 x 100 image. The template acquisition unit 120 acquires at least one image conversion template. This image conversion template can be understood as a tool that can convert an image received by the image receiving unit 110 into a new image of a specific form. For example, if the image received by the image receiving unit 110 contains a user's expressionless face, a specific image conversion template can be used to generate a new image that includes the user's smile. The image conversion template may be predetermined, or it may be selected by the user. The image conversion unit 130 may receive a still image corresponding to the face region from the image receiving unit 110. The image conversion unit 130 may then convert the still image into a moving image using an image conversion template acquired by the template acquisition unit 120.
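The face-region extraction at a predetermined size described in [0010] can be illustrated with a minimal Python sketch. It is not the patented implementation: the grayscale list-of-rows image type, the bounding-box format `(top, left, height, width)`, the nearest-neighbor resampling, and the function names `crop`, `resize_nearest`, and `extract_face` are all assumptions made for illustration. The sketch follows the second variant in the text (extract the region first, then convert it to the predetermined size).

```python
from typing import List, Tuple

# Assumption: a grayscale image stored as rows of integer pixel values.
Image = List[List[int]]


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the rectangular sub-image starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Resize with nearest-neighbor sampling (a stand-in for any resampler)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def extract_face(image: Image, box: Tuple[int, int, int, int],
                 size: int = 100) -> Image:
    """Crop the face box, then scale it to the predetermined size x size."""
    top, left, height, width = box
    return resize_nearest(crop(image, top, left, height, width), size, size)
```

With a 200 x 200 face box and `size=100`, every second pixel in each direction is kept, which matches the 200 x 200 to 100 x 100 example in the text. How the face box itself is detected is outside the scope of this sketch.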

[0011] Figure 3 is a flowchart illustrating a schematic image conversion method according to one embodiment of the present invention. Referring to Figure 3, an image conversion method according to one embodiment of the present invention may include the steps of receiving a still image (S110), obtaining an image conversion template (S120), and generating a moving image (S130). The image conversion method according to the present invention utilizes an artificial neural network. In step S110, a still image is received. The still image may include the user's face and may consist of a single frame. In step S120, at least one image conversion template may be acquired from among the multiple image conversion templates stored in the image conversion device 100. The image conversion template may be selected by the user from among the multiple image conversion templates stored in the image conversion device 100. The image conversion template may be understood as a tool that can convert the image received in step S110 into a new image of a specific form. For example, if the image received in step S110 contains the user's expressionless face, a specific image conversion template may be used to generate a new image that includes the user's smile. In another embodiment, if the image received in step S110 includes the user's smile, a different image conversion template may be used to generate a new image that includes the user's angry face. In some embodiments, at least one reference image may be received from the user in step S120. For example, the reference image may be an image of the user or an image of another person selected by the user. If the user does not select one of the stored templates and instead selects a reference image, the reference image may be acquired as the image conversion template. That is, the reference image may be understood to have the same function as the image conversion template. In step S130, the acquired image conversion template may be used to convert the still image into a moving image. To convert the still image into a moving image, texture information may be extracted from the user's face contained in the still image. The texture information may include the user's face color and visual texture information.
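The template acquisition logic of step S120 (a user-selected stored template, or a user-supplied reference image standing in for a template) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Template` class, the `STORED_TEMPLATES` registry, the landmark coordinates, the `acquire_template` function, and the fallback to a default template when nothing is selected are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Template:
    """Hypothetical template: a name plus landmark points for an expression."""
    name: str
    landmarks: List[Point]


# Assumption: templates stored in the device, keyed by expression name.
# The coordinates are placeholder values, not real landmark data.
STORED_TEMPLATES: Dict[str, Template] = {
    "smile": Template("smile", [(0.3, 0.4), (0.7, 0.4), (0.5, 0.75)]),
    "angry": Template("angry", [(0.3, 0.45), (0.7, 0.45), (0.5, 0.7)]),
}


def acquire_template(selected: Optional[str] = None,
                     reference_landmarks: Optional[List[Point]] = None,
                     default: str = "smile") -> Template:
    """Pick a stored template, or treat a reference image as the template."""
    if selected is not None:
        return STORED_TEMPLATES[selected]
    if reference_landmarks is not None:
        # The reference image plays the same role as a template.
        return Template("reference", list(reference_landmarks))
    # Assumption: fall back to a predetermined template.
    return STORED_TEMPLATES[default]
```

For example, `acquire_template("angry")` returns the stored angry template, while `acquire_template(reference_landmarks=points)` wraps landmarks taken from a reference image so that later steps can treat both cases uniformly.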

[0012] Furthermore, in order to convert a still image into a moving image, landmark information may be extracted from the region corresponding to the person's face included in the image conversion template. The landmark information can be obtained from specific shapes, patterns, colors, or combinations thereof contained in the person's face, based on an image processing algorithm. The image processing algorithm may be, but is not limited to, SIFT (Scale-Invariant Feature Transform), HOG (Histogram of Oriented Gradients), Haar features, Ferns, LBP (Local Binary Pattern), or MCT (Modified Census Transform). The moving image may be generated by combining the texture information and the landmark information. In some embodiments, the moving image may include multiple frames. The moving image may have a frame corresponding to the still image as its first frame and a frame corresponding to the image conversion template as its last frame. For example, the facial expression of the user in the still image and the face in the first frame of the moving image may be the same. Also, when the texture information and landmark information are combined, the facial expression of the user in the still image may be transformed in accordance with the landmark information, and the last frame of the moving image may include a frame corresponding to the transformed face of the user. When a moving image is generated using an artificial neural network, it can gradually change from the user's facial expression contained in the still image to the user's facial expression transformed in accordance with the landmark information. That is, there may be at least one frame between the first and last frames of the moving image, and the facial expression contained in each of these frames may gradually change. In this way, by utilizing artificial neural networks, it is possible to generate a moving image that has the same effect as one captured while the user directly changes their facial expression, without actually capturing such a moving image.
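The gradual change between the first and last frames can be illustrated by interpolating landmark positions. The sketch below is an assumption-laden stand-in: the patent generates the frames with an artificial neural network, whereas this example only shows linear interpolation of 2D landmark coordinates between the still image's landmarks and the template's landmarks. The function name `interpolate_landmarks` and the point format are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_landmarks(source: List[Point], target: List[Point],
                          num_frames: int) -> List[List[Point]]:
    """Return per-frame landmarks moving linearly from source to target.

    Frame 0 equals the source landmarks (the still image) and the last
    frame equals the target landmarks (the template expression).
    """
    if num_frames < 2:
        raise ValueError("need at least a first and a last frame")
    if len(source) != len(target):
        raise ValueError("source and target must have the same landmark count")
    frames: List[List[Point]] = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([
            ((1 - t) * sx + t * tx, (1 - t) * sy + t * ty)
            for (sx, sy), (tx, ty) in zip(source, target)
        ])
    return frames
```

Each intermediate landmark set would then be combined with the texture information from the still image to render one frame. That rendering step is the part performed by the neural network in the source and is deliberately left out here.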

[0013] Figure 4 is a diagram illustrating an image conversion template according to one embodiment of the present invention. The image conversion device 100 may store multiple image conversion templates. Each of the multiple image conversion templates may include outline images corresponding to eyebrows, eyes, and mouth. The multiple image conversion templates may correspond to various facial expressions, such as sad, happy, winking, melancholic, expressionless, surprised, and angry, and each of the multiple image conversion templates may contain information about different facial expressions from one another. The outline images corresponding to each of the various facial expressions are different from each other. Therefore, each of the multiple image conversion templates may contain different outline images from one another. Referring to Figure 2, the image conversion unit 130 may extract landmark information from the outline image included in the image conversion template.

[0014] Figure 5a is a diagram illustrating the process for generating a moving image according to one embodiment of the present invention. Referring to Figures 4 and 5a, a still image 31, an image conversion template 32, and a moving image 33 generated using the still image 31 and the image conversion template 32 are shown. For example, the still image 31 may include the user's smile. The image conversion template 32 may include outline images corresponding to the eyebrows, eyes, and mouth of a face that is winking and smiling. Although the video 33 shown in Figure 5a appears to contain only one frame, it may be understood to represent the last frame of the video generated by the image conversion unit 130 or in step S130. The image conversion device 100 may extract texture information from the still image 31 for the region corresponding to the user's face. The image conversion device 100 may also extract landmark information from the image conversion template 32. The image conversion device 100 may combine the texture information from the still image 31 and the landmark information from the image conversion template 32 to generate a moving image 33. The video 33 is shown as a single image containing the user's winking face. However, the video 33 contains multiple frames, as explained with reference to Figure 5b. Figure 5b is a diagram illustrating a generated video according to one embodiment of the present invention. Referring to Figures 5a and 5b, there may be at least one frame between the first frame 33_1 and the last frame 33_n of the video 33. For example, still image 31 may correspond to the first frame 33_1 of the video 33. Also, the image containing the user's winking face may correspond to the last frame 33_n of the video 33. Each of the frames between the first frame 33_1 and the last frame 33_n of the video 33 may include an image of the user's face in which the eye is gradually closed.

[0015] Figure 6a is a diagram illustrating an exemplary process for generating a moving image according to another embodiment of the present invention. Referring to Figures 4 and 6a, a still image 41, a reference image 42, and a moving image 43 generated using the still image 41 and the reference image 42 are shown. For example, the still image 41 may include the user's smile. The reference image 42 may include a face that is winking and smiling. The face included in the reference image 42 may be the face of a person other than the user. Although the video 43 shown in Figure 6a appears to contain only one frame, it may be understood to represent the last frame of the video generated by the image conversion unit 130 or in step S130. The image conversion device 100 may extract texture information from the still image 41 for the area corresponding to the user's face. The image conversion device 100 may also extract landmark information from the reference image 42. The image conversion device 100 may extract landmark information for the areas corresponding to the eyebrows, eyes, and mouth of the face contained in the reference image 42. The image conversion device 100 may generate a moving image 43 by combining the texture information of the still image 41 and the landmark information of the reference image 42. The video 43 is shown as a single image containing the user's smiling and winking face. However, the video 43 contains multiple frames, as explained with reference to Figure 6b.

[0016] Figure 6b is a diagram illustrating a generated video according to another embodiment of the present invention. Referring to Figures 6a and 6b, there may be at least one frame between the first frame 43_1 and the last frame 43_n of the video 43. For example, still image 41 may correspond to the first frame 43_1 of the video 43. Also, the image containing the user's smiling and winking face may correspond to the last frame 43_n of the video 43. Each of the at least one frames between the first frame 43_1 and the last frame 43_n of the above video 43 may include an image of the user's face in which the eyes are gradually closed and the mouth is opened.
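
As a non-limiting illustration of how the frames between the first frame and the last frame might be obtained, the following minimal sketch linearly interpolates landmark coordinates between the still image and the template or reference image. Linear interpolation is an assumption made here for illustration only; the specification does not state how intermediate frames are produced, and the function name `interpolate_landmarks` and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_landmarks(src: List[Point], dst: List[Point],
                          n_frames: int) -> List[List[Point]]:
    """Return n_frames landmark sets moving linearly from src to dst (inclusive)."""
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of landmarks")
    if n_frames < 2:
        raise ValueError("need at least a first and a last frame")
    frames = []
    for i in range(n_frames):
        w = i / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last frame
        frames.append([((1 - w) * sx + w * dx, (1 - w) * sy + w * dy)
                       for (sx, sy), (dx, dy) in zip(src, dst)])
    return frames


# Hypothetical eyelid landmarks: open eye (still image) and closed eye (template).
open_eye = [(10.0, 20.0), (14.0, 16.0)]
closed_eye = [(10.0, 20.0), (14.0, 20.0)]
frames = interpolate_landmarks(open_eye, closed_eye, n_frames=5)
```

In this sketch the first element of `frames` reproduces the landmarks of the still image and the last element reproduces those of the template, mirroring the roles of the first and last frames described above.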

[0017] Figure 7 is a schematic diagram showing the configuration of an image conversion device according to one embodiment of the present invention. Referring to Figure 7, the image conversion device 200 may include a processor 210 and a memory 220. A person with ordinary skill in the art relating to this embodiment will understand that, in addition to the components shown in Figure 7, other general-purpose components may be included. The image conversion device 200 may be the same as or similar to the image conversion device 100 shown in Figure 2. The image receiving unit 110, the template acquisition unit 120, and the image conversion unit 130 included in the image conversion device 100 may be included in the processor 210. The processor 210 controls the overall operation of the image conversion device 200 and may include at least one processor such as a CPU. The processor 210 may include at least one specialized processor corresponding to each function, or it may be a single integrated processor. The memory 220 may store programs, data, or files related to the artificial neural network. The memory 220 may also store instructions that can be executed by the processor 210. The processor 210 may execute programs stored in the memory 220, read data or files stored in the memory 220, or store new data. The memory 220 may also store program instructions, data files, data structures, etc., individually or in combination.

[0018] The processor 210 may obtain a still image from the input image. The still image may include the user's face and may include a single frame. The processor 210 may read at least one image conversion template from among several image conversion templates stored in memory 220. Alternatively, the processor 210 may read at least one reference image stored in memory 220. For example, at least one reference image may be input by the user. The reference image may be an image of the user or an image of another person selected by the user. If the user does not select one of the multiple templates provided and instead selects a reference image, the reference image may be obtained as the image conversion template. The processor 210 may use the acquired image conversion template to convert a still image into a moving image. To convert a still image into a moving image, texture information may be extracted from the user's face contained in the still image. The texture information may include the user's face color and visual texture information. Furthermore, in order to convert a still image into a moving image, landmark information may be extracted from the region corresponding to the person's face included in the image conversion template. Feature point information can be obtained from specific shapes, patterns, colors, or combinations thereof contained in the person's face, based on an image processing algorithm. The image processing algorithm may be, but is not limited to, SIFT (Scale Invariant Feature Transform), HOG (Histogram of Oriented Gradient), Haar feature, Ferns, LBP (Local Binary Pattern), or MCT (Modified Census Transform).

[0019] The above-mentioned video may be generated by combining the above-mentioned texture information and landmark information. The video may include multiple frames. The video may have a frame corresponding to the still image as its first frame and a frame corresponding to the image conversion template as its last frame. For example, the facial expression of the user included in the still image and that of the face included in the first frame of the video may be the same. Also, when the texture information and landmark information are combined, the facial expression of the user included in the still image may be transformed in accordance with the landmark information, and the last frame of the video may correspond to the transformed face of the user. The video generated by the processor 210 may have the form shown in Figures 5b and 6b. The processor 210 may store the generated video in the memory 220 and output the video so that the user can view it. As explained with reference to Figures 1 to 7, when a user uploads a still image to the user's terminal 20, the image conversion device 200 may convert the still image into a moving image and provide it to the user. Even if the user does not directly capture a moving image, the user may be provided with a moving image that has the same effect as a moving image captured while directly changing facial expressions. Furthermore, the image conversion device 200 can provide an enjoyable user experience by converting the still image into a moving image and providing the resulting moving image to the user.

[0020] Figure 8 is a schematic diagram illustrating the environment in which the method for extracting landmark data from faces contained in an image according to the present invention is performed. Referring to Figure 8, the environment in which the method for extracting landmark data according to the present invention is performed may include a server 10-1 and a terminal 20-1 connected to the server 10-1. For convenience of explanation, only one terminal is shown in Figure 8, but multiple terminals may be included. The description relating to the terminal 20-1 may be applied to additional terminals, except where specifically noted otherwise. In an embodiment of the present invention, the server 10-1 may receive an image from the terminal 20-1, extract landmark data from the faces contained in the received image, calculate necessary data from the extracted landmark data, and then transmit the calculated data to the terminal 20-1. Alternatively, the server 10-1 may function as a platform providing services that the terminal 20-1 may connect to and use. In this case, the terminal 20-1 may extract landmark data from faces contained in the image, calculate necessary data from the extracted landmark data, and then transmit the calculated data to the server 10-1.

[0021] Server 10-1 may be connected to a communication network. Server 10-1 may be connected to other external devices via the above-mentioned communication network. Server 10-1 may transmit data to other connected devices and may receive data from the above-mentioned other devices. The communication network connected to server 10-1 may include a wired communication network, a wireless communication network, or a composite communication network. The communication network may include mobile communication networks such as 3G, LTE, or LTE-A. The communication network may include wired or wireless communication networks such as Wi-Fi, UMTS / GPRS, or Ethernet. The communication network may include local area networks such as Magnetic Secure Transmission (MST), RFID (Radio Frequency Identification), NFC (Near Field Communication), Zigbee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or InfraRed communication (IR). The communication network may also include local area networks (LAN), metropolitan area networks (MAN), or wide area networks (WAN). Server 10-1 may be connected to terminal 20-1 via a communication network. When server 10-1 is connected to terminal 20-1, server 10-1 may send and receive data to and from terminal 20-1 via the communication network. Server 10-1 may use the data received from terminal 20-1 to perform any calculation. Server 10-1 may transmit the calculation result to terminal 20-1. Terminal 20-1 may be a desktop computer, laptop computer, smartphone, smart tablet, smartwatch, mobile terminal, digital camera, wearable device, or portable electronic device. Terminal 20-1 may run a program or application.

[0022] Figure 9 is a schematic diagram showing the configuration of a landmark data separation device according to one embodiment of the present invention. Referring to Figure 9, the landmark data separation device 100-1 according to one embodiment of the present invention may include an image receiving unit 110-1, a landmark data calculation unit 120-1, and a landmark data storage unit 130-1. The landmark data separation device 100-1 may be configured by a server 10-1 or a terminal 20-1 as described with reference to Figure 8. Accordingly, each component included in the landmark data separation device 100-1 may also be configured by a server 10-1 or a terminal 20-1. The image receiving unit 110-1 may receive multiple images from the user. Each of the multiple images may contain only one person. That is, each of the multiple images may contain the face of one person, and the people contained in the multiple images may be different people from each other. The image receiving unit 110-1 may extract only the face region from each of the multiple images and then provide the extracted face region to the landmark data calculation unit 120-1. The landmark data calculation unit 120-1 may calculate face landmark data included in each of the multiple images, average landmark data for all faces included in the multiple images, characteristic landmark data for a specific face included in a specific image among the multiple images, and expression landmark data for a specific face. In some embodiments, the landmark data may be the result of extracting the face key point. The method for extracting landmark data is described with reference to Figure 10.

[0023] Figure 10 illustrates a method for extracting facial landmark data according to one embodiment of the present invention. Landmark data may be obtained by extracting the key points of key facial elements such as the eyes, eyebrows, nose, mouth, and jawline, or by extracting the contour lines drawn by connecting these key points. Landmark data may be used in techniques such as facial expression classification, pose analysis, compositing of faces from different individuals, or facial deformation. Referring again to Figure 9, the landmark data calculation unit 120-1 may calculate average landmark data for the faces included in the multiple images. The average landmark data may be the result of extracting the average shape of human faces. The landmark data calculation unit 120-1 may calculate landmark data from a specific image containing a specific face among the multiple images. More specifically, it may calculate landmark data for the specific face contained in a specific frame among the multiple frames contained in the specific image. Furthermore, the landmark data calculation unit 120-1 may calculate characteristic landmark data for the specific face included in the specific image. The characteristic landmark data may be calculated based on the face landmark data included in each of the multiple frames included in the specific image. Furthermore, the landmark data calculation unit 120-1 may use the average landmark data, the landmark data for the specific frame, and the characteristic landmark data to calculate facial expression landmark data for the specific frame in the specific image. For example, the facial expression landmark data may correspond to a specific facial expression or to movement information of key elements such as the eyes, eyebrows, nose, mouth, and jawline. The landmark data storage unit 130-1 may store data calculated by the landmark data calculation unit 120-1.
For example, the landmark data storage unit 130-1 may store average landmark data, landmark data for a specific frame, characteristic landmark data, and facial expression landmark data, which are calculated by the landmark data calculation unit 120-1.

[0024] Figure 11 is a flowchart showing a method for extracting various types of landmark data according to one embodiment of the present invention. Referring to Figures 9 and 11, in step S1100, the landmark data separation device 100-1 may receive multiple images. Each of the multiple images may contain only one person. That is, each of the multiple images may contain the face of one person, and the people included in the multiple images may be different people from each other. In step S1200, the landmark data separation device 100-1 may calculate the average landmark data I_m. The average landmark data I_m can be expressed as follows:

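$$I_m = \frac{1}{T} \sum_{c} \sum_{t} I(c, t)$$

Here, T denotes the total number of frames contained in the multiple images C.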

[0025] For example, the landmark data I(c, t) of a specific frame may be the key point information of a specific face included in the t-th frame of the c-th image among the multiple images C. That is, the specific image may be the c-th image, and the specific frame may be the t-th frame. In step S1400, the landmark data separation device 100-1 may calculate the characteristic landmark data I_id(c) for the specific face included in the c-th image. The characteristic landmark data I_id(c) can be expressed as follows:

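$$I_{id}(c) = \frac{1}{T_c} \sum_{t=1}^{T_c} I(c, t) - I_m$$

Here, T_c denotes the number of frames contained in the c-th image.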

[0026] More specifically, the landmark data separation device 100-1 may also calculate the facial expression landmark data I_exp(c, t) of the specific face contained in the t-th frame of the c-th image. The facial expression landmark data I_exp(c, t) can be expressed as follows:

$$I_{exp}(c, t) = I(c, t) - I_m - I_{id}(c)$$
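
As a non-limiting illustration, the separation of landmark data into average, characteristic, and facial expression components described above may be sketched as follows. The sketch assumes that each frame's landmark data is a flat list of coordinates and that the average is taken over all frames of all images; the names `separate_landmarks` and `videos` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sub(a: Vector, b: Vector) -> Vector:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def separate_landmarks(
    videos: Dict[str, List[Vector]]
) -> Tuple[Vector, Dict[str, Vector], Dict[str, List[Vector]]]:
    """Split per-frame landmark data into I_m, I_id(c) and I_exp(c, t)."""
    all_frames = [frame for frames in videos.values() for frame in frames]
    i_m = mean(all_frames)  # average over every frame of every image
    # Characteristic data: per-image mean minus the global average.
    i_id = {c: sub(mean(frames), i_m) for c, frames in videos.items()}
    # Expression data: what remains after removing I_m and I_id(c).
    i_exp = {c: [sub(sub(frame, i_m), i_id[c]) for frame in frames]
             for c, frames in videos.items()}
    return i_m, i_id, i_exp


# Hypothetical data: two images (people), two frames each, 2-D landmark vectors.
videos = {
    "c1": [[0.0, 0.0], [2.0, 0.0]],
    "c2": [[4.0, 2.0], [4.0, 6.0]],
}
i_m, i_id, i_exp = separate_landmarks(videos)
```

Because the characteristic data is defined from the per-image mean, the facial expression components of each image sum to zero over its frames in this sketch.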

[0027] Figure 12 is a diagram illustrating an exemplary process for transforming facial expressions contained in an image according to another embodiment of the present invention. Referring to Figures 11 and 12, the server 10-1 or the terminal 20-1 may use the facial expression landmark data I_exp(c, t), the average landmark data I_m, and the characteristic landmark data I_id(c) separated by the landmark data separation device 100-1 to convert the facial expression of the face in the first image 300 into that of the face in the second image 400 while maintaining the overall appearance of the face in the first image 300. For example, the first image 300 may correspond to the t_x-th frame among the multiple frames contained in the c_x-th image among the multiple images. Also, the second image 400 may correspond to the t_y-th frame among the multiple frames contained in the c_y-th image among the multiple images. The c_x-th image and the c_y-th image may be different images from each other. The facial landmark data included in the first image 300 may be separated as follows:

$$I(c_x, t_x) = I_m + I_{id}(c_x) + I_{exp}(c_x, t_x)$$

[0028] The facial landmark data included in the second image 400 may be separated as follows:

$$I(c_y, t_y) = I_m + I_{id}(c_y) + I_{exp}(c_y, t_y)$$
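
As a non-limiting illustration, keeping the appearance of the face in the first image 300 while adopting the expression of the face in the second image 400 suggests recombining the separated components as shown below. The recombination formula is not stated explicitly in this passage, so it is given here as an assumption, and the symbol on the left-hand side is hypothetical notation for the transformed landmark data.

```latex
\hat{I}(c_x \leftarrow c_y) = I_m + I_{id}(c_x) + I_{exp}(c_y, t_y)
```

Only the expression term is taken from the second image; the average term and the characteristic term of the first image are left unchanged.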

[0029] Figure 13 is a comparison table illustrating the effect of transforming facial expressions contained in an image using the landmark data separation method according to the present invention. The MarioNETte model is a model that transforms facial expressions in an image without using the landmark data separation method. When using the MarioNETte model, the degree of naturalness of the transformed image was measured to be 0.147. The MarioNETte+LT model is a model that transforms facial expressions in an image using the landmark data separation method. When using the MarioNETte+LT model, the degree of naturalness of the transformed image was measured to be 0.280. In other words, it was confirmed that the image transformed using the MarioNETte+LT model is about 1.9 times more natural than the image transformed using the MarioNETte model.
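
The stated ratio follows directly from the two measured values reported above:

```latex
\frac{0.280}{0.147} \approx 1.90
```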

[0030] Figure 14 is a schematic diagram showing the configuration of a landmark data separation device according to one embodiment of the present invention. Referring to Figure 14, the landmark data separation device 200-1 may include a processor 210-1 and a memory 220-1. A person with ordinary skill in the art relating to this embodiment will understand that, in addition to the components shown in Figure 14, other general-purpose components may be included. The landmark data separation device 200-1 may be the same as or similar to the landmark data separation device 100-1 shown in Figure 9. The image receiving unit 110-1 and the landmark data calculation unit 120-1 included in the landmark data separation device 100-1 may be included in the processor 210-1. The processor 210-1 controls the overall operation of the landmark data separation device 200-1 and may include at least one processor, such as a CPU. The processor 210-1 may include at least one specialized processor corresponding to each function, or it may be a single integrated processor. The memory 220-1 may store programs, data, or files that control the landmark data separation device 200-1. The memory 220-1 may also store instructions that can be executed by the processor 210-1. The processor 210-1 may execute programs stored in the memory 220-1, read data or files stored in the memory 220-1, or store new data. The memory 220-1 may also store program instructions, data files, data structures, etc., individually or in combination.

[0031] The processor 210-1 may receive multiple images. Each of the multiple images may contain only one person. That is, each of the multiple images may contain the face of one person, and the people contained in the multiple images may be different people from each other. The processor 210-1 may store the multiple received images in the memory 220-1. The processor 210-1 may extract the landmark data I(c, t) of each face contained in the multiple images C. The landmark data separation device 100-1 may also calculate the average value of all extracted landmark data. The calculated average value may correspond to the average landmark data I_m. The processor 210-1 may calculate the landmark data I(c, t) for a specific frame among the multiple frames of a specific image containing a specific face among the multiple images. The landmark data I(c, t) for a specific frame may be the key point information of the specific face contained in the t-th frame of the c-th image among the multiple images C. That is, the specific image may be the c-th image, and the specific frame may be the t-th frame. The processor 210-1 may calculate the characteristic landmark data I_id(c) of the specific face contained in the c-th image. The multiple frames contained in the c-th image contain various facial expressions of the specific face. Therefore, to calculate the characteristic landmark data I_id(c), the processor 210-1 may assume that the average value of the facial expression landmark data I_exp contained in the c-th image is 0. Accordingly, the characteristic landmark data I_id(c) may be calculated without considering the average value of the facial expression landmark data I_exp.
The characteristic landmark data I_id(c) may be defined as the value obtained by calculating the landmark data of each of the multiple frames contained in the c-th image, calculating the average of the landmark data over those frames, and subtracting the average landmark data I_m of the multiple images from the calculated average landmark data of the c-th image.

[0032] The processor 210-1 may also calculate the facial expression landmark data I_exp(c, t) of the specific face contained in the t-th frame of the c-th image. The facial expression landmark data I_exp(c, t) may correspond to the specific facial expression contained in the t-th frame and to motion information of key elements of the specific face, such as the eyes, eyebrows, nose, mouth, and jawline. More specifically, the facial expression landmark data I_exp(c, t) may be defined as the value obtained by subtracting the average landmark data I_m and the characteristic landmark data I_id(c) from the landmark data I(c, t) of the specific frame. The processor 210-1 may store the separated facial expression landmark data I_exp(c, t), average landmark data I_m, and characteristic landmark data I_id(c) in the memory 220-1. As explained with reference to Figures 8 to 14, the landmark data separation devices 100-1 and 200-1 according to one embodiment of the present invention can separate landmark data from faces contained in an image more accurately and precisely. Furthermore, the landmark data separation devices 100-1 and 200-1 can more accurately separate landmark data that contains information about the facial characteristics and expressions included in the image. Furthermore, the server 10-1 or terminal 20-1, which includes the landmark data separation device 100-1 or 200-1, can use the separated facial expression landmark data I_exp(c, t), average landmark data I_m, and characteristic landmark data I_id(c) to naturally convert the facial expression of the first image into the facial expression of the second image while maintaining the facial appearance of the first image.

[0033] Figure 15 is a schematic diagram illustrating the environment in which the landmark separation device according to the present invention operates. Referring to Figure 15, the environment in which the first terminal 2000 and the second terminal 3000 operate may include the server 1000 and the first terminal 2000 and the second terminal 3000, which are connected to each other via the server 1000. For convenience of explanation, only two terminals, namely the first terminal 2000 and the second terminal 3000, are shown in Figure 15, but more than two terminals may be included. The description relating to the first terminal 2000 and the second terminal 3000 may be applied to additional terminals, except where specifically noted otherwise. Server 1000 may be connected to a communication network. Server 1000 may be connected to other external devices via the above-mentioned communication network. Server 1000 may transmit data to the other connected devices, or may receive data from the above-mentioned other devices. The communication network connected to server 1000 may include a wired communication network, a wireless communication network, or a composite communication network. The communication network may include mobile communication networks such as 3G, LTE, or LTE-A. The communication network may include wired or wireless communication networks such as Wi-Fi, UMTS / GPRS, or Ethernet. The communication network may include local area networks such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or Infrared communication (IR). The communication network may also include local area networks (LAN), metropolitan area networks (MAN), or wide area networks (WAN).

[0034] Server 1000 may receive data from at least one of the first terminal 2000 and the second terminal 3000. Server 1000 may perform calculations using the data received from at least one of the first terminal 2000 and the second terminal 3000. Server 1000 may transmit the results of the above calculations to at least one of the first terminal 2000 and the second terminal 3000. Server 1000 may receive a mediation request from at least one of the first terminal 2000 and the second terminal 3000. Server 1000 may select the terminal to which the mediation request is transmitted. For example, Server 1000 may select the first terminal 2000 and the second terminal 3000. Server 1000 may mediate the communication connection between the selected first terminal 2000 and the second terminal 3000. For example, Server 1000 may mediate a video call connection between the first terminal 2000 and the second terminal 3000, or it may mediate a text transmission and reception connection. Server 1000 may transmit connection information regarding the first terminal 2000 to the second terminal 3000, or transmit connection information regarding the second terminal 3000 to the first terminal 2000. Connection information for the first terminal 2000 may include, for example, the IP address and port number of the first terminal 2000. Upon receiving connection information for the second terminal 3000, the first terminal 2000 may use the received connection information to attempt to establish a connection with the second terminal 3000.
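
As a non-limiting illustration, the exchange of connection information by the mediating server described above may be sketched as follows. The matching policy (pairing the first two terminals that are waiting), the class names `ConnectionInfo` and `MediationServer`, the terminal identifiers, and the addresses are all hypothetical; the specification only states that the server may select terminals and transmit each terminal's connection information to the other.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ConnectionInfo:
    ip: str
    port: int


class MediationServer:
    """Pairs waiting terminals and swaps their connection information."""

    def __init__(self) -> None:
        self._waiting: List[Tuple[str, ConnectionInfo]] = []

    def request_mediation(
        self, terminal_id: str, info: ConnectionInfo
    ) -> Optional[Dict[str, ConnectionInfo]]:
        """Queue a terminal; once two are waiting, return what each receives."""
        self._waiting.append((terminal_id, info))
        if len(self._waiting) < 2:
            return None  # no partner available yet
        (id_a, info_a), (id_b, info_b) = self._waiting[:2]
        del self._waiting[:2]
        # Each terminal is sent the other terminal's connection information.
        return {id_a: info_b, id_b: info_a}


server = MediationServer()
first = server.request_mediation("terminal_2000", ConnectionInfo("192.0.2.10", 5000))
match = server.request_mediation("terminal_3000", ConnectionInfo("192.0.2.20", 5001))
```

After the second request, `match` maps each terminal to the connection information of its counterpart, which the terminal could then use to attempt a direct connection.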

[0035] A video call session between the first terminal 2000 and the second terminal 3000 can be established by a successful attempt to connect the first terminal 2000 to the second terminal 3000, or by an attempt to connect the second terminal 3000 to the first terminal 2000. Through the above video call session, the first terminal 2000 may transmit images and sounds to the second terminal 3000. The first terminal 2000 may encode the images and sounds into digital signals and transmit the encoded results to the second terminal 3000. The first terminal 2000 may receive images and sounds encoded as digital signals and decode the received images and sounds. Through the video call session described above, the second terminal 3000 may transmit images and sound to the first terminal 2000. Furthermore, the second terminal 3000 may receive images and sound from the first terminal 2000 through the video call session. This allows the user of the first terminal 2000 and the user of the second terminal 3000 to communicate with each other via video call. The first terminal 2000 and the second terminal 3000 may be, for example, a desktop computer, a laptop computer, a smartphone, a smart tablet, a smartwatch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. The first terminal 2000 and the second terminal 3000 may run a program or an application. The first terminal 2000 and the second terminal 3000 may be the same type of device as each other, or they may be different types of devices as each other.

[0036] Figure 16 is a flowchart illustrating a landmark separation method according to one embodiment of the present invention. Referring to Figure 16, the landmark separation method according to one embodiment of the present invention includes the steps of receiving a facial image and landmark information (S210), estimating a transformation matrix (S220), calculating a representation landmark (S230), and calculating an identity landmark (S240). In step S210, a facial image of a first person and landmark information corresponding to the facial image are received. Here, the landmark may be understood as the facial landmark of the facial image. The landmark may also mean the main elements of the face, such as the eyes, eyebrows, nose, mouth, and jawline. Furthermore, the landmark information may include information regarding the position, size, or shape of the main elements of the face. In addition, the landmark information may include information regarding the color or texture of the main elements of the face.

[0037] The above-mentioned first person refers to any person, and in step S210, a facial image of any person and landmark information corresponding to the facial image are received. The landmark information is obtained using known techniques, and any known method may be used. Furthermore, the present invention is not limited by the method used to acquire the landmark. In step S220, a transformation matrix corresponding to the above landmark information is estimated. The above transformation matrix can be used to construct the above landmark information together with a predetermined unit vector. For example, the first landmark information may be calculated by multiplying the above unit vector and the first transformation matrix. The second landmark information may be calculated by multiplying the above unit vector and the second transformation matrix. The transformation matrix described above is a matrix that transforms high-dimensional landmark information into low-dimensional data, and can be used in Principal Component Analysis (PCA). PCA is a dimensionality reduction method that transforms variables in a high-dimensional space into variables in a low-dimensional space by searching for new mutually orthogonal axes while preserving the variance of the data as much as possible. PCA first finds the hyperplane closest to the data, and then projects the data onto the low-dimensional hyperplane to reduce the dimensionality of the data. Alternatively, high-dimensional data can be transformed into low-dimensional data by defining the i-th axis in PCA as the i-th principal component (PC) and then linearly combining these axes.
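
As a non-limiting illustration of the dimensionality reduction described above, the following minimal sketch finds the first principal component of a small data set and projects each sample onto it. Power iteration is used here purely for brevity and is an assumption of this sketch, not a method named in the specification; the function names and sample data are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(data: List[Vector]) -> Vector:
    """Element-wise mean of the samples."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]


def first_principal_component(data: List[Vector], iterations: int = 200) -> Vector:
    """Unit vector along the direction of greatest variance (power iteration)."""
    mu = mean_vector(data)
    centered = [[x - m for x, m in zip(row, mu)] for row in data]
    dim = len(mu)
    # Covariance matrix (dim x dim) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / len(data) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim  # start vector; assumed not orthogonal to the component
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance
        v = [x / norm for x in w]
    return v


def project(row: Vector, mu: Vector, axis: Vector) -> float:
    """Coordinate of a sample along the principal axis (1-D representation)."""
    return sum((x - m) * a for x, m, a in zip(row, mu, axis))


# Hypothetical 2-D samples lying on the line y = x.
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
axis = first_principal_component(data)
coeffs = [project(row, mean_vector(data), axis) for row in data]
```

Each two-dimensional sample is reduced to a single coefficient along `axis`, which is the sense in which high-dimensional landmark information is mapped to low-dimensional data.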

number

[0038] Therefore, in step S220, the facial image of the first person and the landmark information corresponding to the facial image are received as input, and a transformation matrix is estimated and output from them. Meanwhile, the above-mentioned learning model may be trained to classify landmark information into multiple semantic groups corresponding to the right eye, left eye, nose, and mouth, and to output PCA conversion coefficients corresponding to each of the multiple semantic groups. In this case, the semantic groups are not necessarily classified to correspond to the right eye, left eye, nose, and mouth, but may also be classified to correspond to the eyebrows, eyes, nose, mouth, and jawline, or to the eyebrows, right eye, left eye, nose, mouth, jawline, and ears. In step S220, the landmark information may be classified into finer-grained semantic groups according to the learning model, and PCA conversion coefficients corresponding to the classified semantic groups may be estimated. In step S230, the representation landmark of the first person is calculated using the transformation matrix. The landmark information can be decomposed into multiple sub-landmarks, and in the present invention, the landmark information is represented as follows.

$$l(c, t) = l_m + l_{id}(c) + l_{exp}(c, t)$$

[0039] Here, l(c, t) is the landmark information of the t-th frame of the video containing person c, l_m is the mean facial landmark information of humans, l_id(c) is the identity landmark (facial landmark of identity geometry) information of person c, and l_exp(c, t) is the expression landmark (facial landmark of expression geometry) information of person c in the t-th frame of the video containing person c. In other words, the landmark information of a particular person in a particular frame may be represented by the sum of the average landmark information of all people's faces, the identity landmark information of the particular person, and the facial expression and movement information of the particular person in the particular frame. The above average landmark information can be defined by the following formula and may be calculated based on a large amount of pre-collected video footage.

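$$l_m = \frac{1}{T} \sum_{c} \sum_{t} l(c, t)$$

Here, T denotes the total number of frames in the pre-collected video footage.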

[0040] In other words, b_exp represents the eigenvectors explained previously, and a high-dimensional representation landmark may be defined by a combination of low-dimensional eigenvectors. Also, n_exp refers to the total number of expressions and movements that person c can express using the right eye, left eye, nose, mouth, etc. Therefore, the representation landmark of the first person may be defined as a set of expression information for each of the major parts of the face, namely the right eye, left eye, nose, and mouth. A coefficient α_k(c, t) may exist corresponding to each eigenvector. The aforementioned learning model may be trained to estimate the PCA coefficient α(c, t) by taking a photograph x(c, t) of person c and landmark information l(c, t) as input, as shown in Equation 8. Through such training, the learning model may estimate the PCA coefficient from an image of a specific person and its corresponding landmark information, or it may estimate the low-dimensional eigenvectors. When applying the trained neural network, the photograph x(c', t) of the person c' whose landmarks are to be separated and the landmark information l(c', t) are taken as input to the neural network, and the PCA transformation matrix is estimated. At this time, b_exp uses the values obtained from the training data, and the representation landmark can be estimated from the predicted (estimated) PCA coefficients and b_exp as follows:

[Equation 11]
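The linear combination described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the function name `expression_landmark` is hypothetical, the eigenvectors are passed as plain lists, and any scaling factor that the original formula may contain is omitted because it cannot be recovered from the text.

```python
def expression_landmark(coefficients, eigenvectors):
    """Combine eigenvectors b_exp with per-frame coefficients alpha_k(c, t).

    coefficients: one coefficient per eigenvector (hypothetical values).
    eigenvectors: list of equally long vectors, one per coefficient.
    Returns the weighted sum, a vector in landmark space.
    """
    if len(coefficients) != len(eigenvectors):
        raise ValueError("one coefficient is needed per eigenvector")
    dim = len(eigenvectors[0])
    return [
        sum(a * b[i] for a, b in zip(coefficients, eigenvectors))
        for i in range(dim)
    ]
```

With two coefficients and two three-dimensional eigenvectors, the result is a three-dimensional expression landmark vector.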

[0041] In step S240, the identity landmark of the first person is calculated using the expression landmark described above. As explained with reference to Equation 2, the landmark information may be defined as the sum of the average landmark information, the identity landmark information, and the expression landmark information, and the expression landmark information may be estimated in step S230 using Equation 11. Therefore, the identity landmark can be calculated as follows:

$$l_{id}(c) = l(c, t) - l_m - l_{exp}(c, t) \tag{12}$$
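The subtraction above can be checked with a small numeric sketch. The function names and the toy two-dimensional vectors are hypothetical, and `mean_landmark` assumes that the average landmark is a plain mean over every frame of every person, which the text suggests but does not spell out.

```python
def mean_landmark(videos):
    """Average landmark vector over all frames of all people.

    videos: dict mapping a person id to a list of landmark vectors,
    one vector per frame (assumed layout for this sketch).
    """
    frames = [frame for person_frames in videos.values() for frame in person_frames]
    return [sum(column) / len(frames) for column in zip(*frames)]


def identity_landmark(landmark, mean, expression):
    """l_id(c) = l(c, t) - l_m - l_exp(c, t), element by element."""
    return [l - m - e for l, m, e in zip(landmark, mean, expression)]
```

Adding the mean, identity, and expression vectors back together should return the original landmark vector, which is the sum decomposition the paragraph relies on.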

[0042] Figure 17 is a schematic diagram illustrating a method for calculating a transformation matrix according to one embodiment of the present invention. Referring to Figure 17, the artificial neural network receives an input image of an arbitrary person's face. Any known artificial neural network may be used; in one embodiment, the artificial neural network may be ResNet, which is a type of CNN (Convolutional Neural Network), but the present invention is not limited to a specific type of artificial neural network. An MLP (Multi-Layer Perceptron) is a type of artificial neural network that stacks multiple layers of perceptrons to overcome the limitations of a single-layer perceptron. Referring to Figure 17, the MLP receives as input the output of the artificial neural network and the landmark information corresponding to the face image, and outputs a transformation matrix. In Figure 17, the artificial neural network and the MLP may be understood as together constituting a single trained artificial neural network. Once the transformation matrix is estimated by the trained artificial neural network, the expression landmark information and identity landmark information can be calculated, as explained with reference to Figure 16. The landmark separation method according to the present invention can also be applied when only a very small number of face images are available or when only a single frame of a face image is available.
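The data flow of Figure 17 can be sketched in a few lines. This is a toy, dependency-free illustration rather than the embodiment itself: the image backbone (ResNet in one embodiment) is not implemented and its output is simply passed in as `image_features`, and all weights, layer sizes, and function names are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, bias, values):
    """One fully connected layer: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def mlp(layers, values):
    """Apply each (weights, bias) pair, with ReLU between layers."""
    for index, (weights, bias) in enumerate(layers):
        values = linear(weights, bias, values)
        if index < len(layers) - 1:
            values = relu(values)
    return values


def estimate_coefficients(image_features, landmarks, layers):
    """Concatenate backbone features with landmark information and
    feed the result to the MLP, which outputs the coefficients."""
    return mlp(layers, list(image_features) + list(landmarks))
```

The key design point is that the MLP sees both the image features and the landmark vector, so the estimate can depend on appearance as well as geometry.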

[0043] The artificial neural network described above is trained to estimate low-dimensional eigenvectors and transformation coefficients from a large number of face images and their corresponding landmark information. An artificial neural network trained in this manner can therefore estimate the eigenvectors and transformation coefficients even when only a single frame of a face image is provided. Separating a person's expression landmarks from their identity landmarks in this way can improve the quality of landmark-based facial image processing techniques such as face reenactment, face classification, and face morphing. Face reenactment is a technique that, given a target face and a driver face, synthesizes face images or photographs that mimic the movements of the driver face while possessing the identity of the target face. Face morphing is a technique that, given face images or photographs of person 1 and person 2, synthesizes a face image or photograph of a third person that possesses the characteristics of both. Traditional morphing algorithms first find the key points of the face and then divide the face into non-overlapping triangles or rectangles based on those key points. They then combine the photographs of person 1 and person 2 to synthesize a photograph of a third person. However, because the key points of person 1 and person 2 are in different positions, the resulting image may appear unnatural if the photographs of person 1 and person 2 are matched pixel-wise to generate the photograph of the third person. Known face morphing techniques do not distinguish between the subject's physical characteristics and emotional characteristics such as facial expressions, so the quality of the morphed result may be low.
The landmark separation method according to the present invention can separate expression landmark information and identity landmark information from a single piece of landmark information, thereby helping to improve the results of face image processing techniques that utilize face landmarks. In particular, the landmark separation method according to the present invention is extremely useful because it can separate landmarks even when only a very small amount of face image data is provided.

[0044] Figure 18 is a schematic diagram showing the configuration of a landmark separation device according to one embodiment of the present invention. Referring to Figure 18, the landmark separation device 5000 according to one embodiment of the present invention includes a receiving unit 5100, a transformation matrix estimation unit 5200, and a calculation unit 5300. The receiving unit 5100 receives a facial image of a first person and landmark information corresponding to the facial image. Here, the landmark may be understood as a concept that includes the main elements of the face as the facial landmark, such as the eyes, eyebrows, nose, mouth, and jawline. Furthermore, the landmark information may include information regarding the position, size, or shape of the main elements of the face. In addition, the landmark information may include information regarding the color or texture of the main elements of the face. The above-mentioned first person refers to any person, and the receiving unit 5100 receives a facial image of any person and landmark information corresponding to the facial image. The landmark information is obtained using known techniques, and any known method may be used. Furthermore, the present invention is not limited by the method used to acquire the landmark. The transformation matrix estimation unit 5200 estimates a transformation matrix corresponding to the landmark information. The transformation matrix can be used to construct the landmark information together with a predetermined unit vector. For example, the first landmark information may be calculated by multiplying the unit vector and the first transformation matrix. The second landmark information may be calculated by multiplying the unit vector and the second transformation matrix. The transformation matrix described above is a matrix that transforms high-dimensional landmark information into low-dimensional data, and can be used in Principal Component Analysis (PCA). 
PCA is a dimensionality reduction method that transforms variables in a high-dimensional space into variables in a low-dimensional space by searching for new mutually orthogonal axes while preserving the variance of the data as much as possible. PCA first finds the hyperplane closest to the data, and then projects the data onto the low-dimensional hyperplane to reduce the dimensionality of the data. Alternatively, high-dimensional data can be transformed into low-dimensional data by defining the i-th axis in PCA as the i-th principal component (PC) and then linearly combining these axes.
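The projection step of PCA can be illustrated with a minimal, dependency-free sketch. It finds only the first principal axis, and it uses power iteration on the covariance matrix purely to keep the example short; the source does not say how the axes are computed, and all function names here are hypothetical.

```python
import math


def mean_vector(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]


def first_principal_axis(rows, iterations=100):
    """Return (mean, unit axis) of the direction of largest variance."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    dim = len(mu)
    cov = [[sum(r[i] * r[j] for r in centered) / len(rows)
            for j in range(dim)] for i in range(dim)]
    axis = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance at all; keep the current axis
            break
        axis = [v / norm for v in nxt]
    return mu, axis


def project(point, mu, axis):
    """Low-dimensional coefficient of a point along the axis."""
    return sum((p - m) * a for p, m, a in zip(point, mu, axis))


def reconstruct(coefficient, mu, axis):
    """Map a coefficient back into the original space."""
    return [m + coefficient * a for m, a in zip(mu, axis)]
```

For data that lies exactly on a line, a single coefficient per point is enough to reconstruct the point without loss, which is the sense in which the variance is preserved.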

[0045] As mentioned above, the unit vectors, i.e., principal components, may be predetermined. Therefore, when new landmark information is received, a corresponding transformation matrix can be determined. In this case, there may be multiple transformation matrices corresponding to a single piece of landmark information. On the other hand, the transformation matrix estimation unit 5200 may use a learning model that has been trained to estimate the above transformation matrix. The above learning model may be understood as a model that has been trained to estimate the PCA transformation matrix from an arbitrary face image and landmark information corresponding to the arbitrary face image. The above learning model may be trained to estimate the above transformation matrix from facial images of different people and landmark information corresponding to each facial image. Although there may be multiple transformation matrices corresponding to a single high-dimensional landmark information, the above learning model may be trained to output only one of the multiple transformation matrices. The landmark information used as input to the above-mentioned learning model may be obtained using a known method of extracting landmarks from a face image and visualizing them. Accordingly, the transformation matrix estimation unit 5200 receives the face image of the first person and landmark information corresponding to the face image as input, and then estimates and outputs a transformation matrix. On the other hand, the above-mentioned learning model may be trained to classify landmark information into multiple semantic groups corresponding to the right eye, left eye, nose, and mouth, and to output PCA conversion coefficients corresponding to each of the above multiple semantic groups. 
In this case, the above-mentioned semantic groups are not necessarily classified to correspond to the right eye, left eye, nose, and mouth, but may also be classified to correspond to the eyebrows, eyes, nose, mouth, and jawline, or to correspond to the eyebrows, right eye, left eye, nose, mouth, jawline, ear, etc. The transformation matrix estimation unit 5200 may classify the above-mentioned landmark information into semantic groups of subdivided units according to the above-mentioned learning model, and estimate the PCA transformation coefficients corresponding to the classified semantic groups.
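Splitting landmark information into semantic groups can be sketched as below. The index ranges assume the widely used 68-point facial annotation layout, which the text does not specify, and the group names, function names, and per-group basis are hypothetical.

```python
# Assumed 68-point layout; the source does not fix a landmark scheme.
SEMANTIC_GROUPS = {
    "jawline": range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow": range(22, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}


def split_into_groups(points):
    """Map each group name to the (x, y) points that belong to it."""
    if len(points) != 68:
        raise ValueError("this sketch expects exactly 68 landmark points")
    return {name: [points[i] for i in indices]
            for name, indices in SEMANTIC_GROUPS.items()}


def group_coefficients(group_points, group_mean, basis):
    """Project one group's flattened coordinates onto that group's axes."""
    flat = [coord for point in group_points for coord in point]
    centered = [value - mean for value, mean in zip(flat, group_mean)]
    return [sum(b * c for b, c in zip(axis, centered)) for axis in basis]
```

Keeping a separate basis per group lets, for example, the mouth coefficients change without affecting the eye coefficients.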

[0046] The calculation unit 5300 calculates the expression landmark of the first person using the transformation matrix and calculates the identity landmark of the first person using the expression landmark. The landmark information may be separated into multiple pieces of sub-landmark information, for example, into average landmark information, identity landmark information, and expression landmark information. In other words, the landmark information of a particular person in a particular frame may be represented as the sum of the average landmark information of all people's faces, the identity landmark information of that person, and the facial expression and movement information of that person in that frame. The average landmark information may be calculated from a large amount of pre-collected video footage. The aforementioned learning model may be trained to estimate the PCA coefficients α(c, t) by taking a photograph x(c, t) of person c and the landmark information l(c, t) as input, as shown in Equation 8. Through such training, the learning model may estimate the PCA coefficients from an image of a specific person and its corresponding landmark information, or it may estimate the low-dimensional eigenvectors. When the trained neural network is applied, the photograph x(c', t) of person c' whose landmarks are to be separated and the landmark information l(c', t) are taken as input to the neural network, and the PCA transformation matrix is estimated. At this time, b_exp may be a value obtained from the training data, and the expression landmark can be estimated from the predicted (estimated) PCA coefficients and b_exp as shown in Equation 11.
On the other hand, as explained with reference to Equation 8, landmark information may be defined as the sum of average landmark information, identity landmark information, and expression landmark information, and the expression landmark information may be estimated in step S230 using Equation 11. Therefore, the identity landmark may be calculated as shown in Equation 12. In this way, landmark information may be obtained from a given facial image of any person, and expression landmark information and identity landmark information may be calculated from the facial image and the landmark information.

[0047] Figure 19 is an illustrative diagram showing a method for reenacting a face using the present invention. Referring to Figure 19, a target image 4100 and a driver image 4200 are shown, and the target image 4100 may be reenacted so as to correspond to the driver image 4200. The reenacted image 4300 possesses the characteristics of the target image 4100, but its facial expression corresponds to that of the driver image 4200. In other words, the reenacted image 4300 has the identity landmark of the target image 4100, while its expression landmark corresponds to that of the driver image 4200. Therefore, in order to reenact a natural face, it is important to appropriately separate the identity landmark and the expression landmark from a single piece of landmark information.
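At the landmark level, the combination described for Figure 19 can be sketched as follows. The sketch assumes the sum decomposition of landmark information used elsewhere in this description; the function name and the toy two-dimensional vectors are hypothetical, and generating the actual reenacted image from these landmarks is outside its scope.

```python
def reenacted_landmark(mean, target_landmark, target_expression, driver_expression):
    """Keep the target's identity geometry, swap in the driver's expression.

    All arguments are equally long landmark vectors.
    """
    # Identity part of the target: what remains after removing the mean
    # face and the target's own expression.
    target_identity = [
        l - m - e
        for l, m, e in zip(target_landmark, mean, target_expression)
    ]
    # Recombine with the driver's expression landmark.
    return [
        m + i + e
        for m, i, e in zip(mean, target_identity, driver_expression)
    ]
```

If the driver's expression equals the target's own expression, the function returns the target landmark unchanged, which is a quick sanity check on the separation.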

[0048] Figure 20 is a schematic diagram showing the environment in which the image deformation apparatus and image deformation method according to the present invention operate. Referring to Figure 20, the environment in which the first terminal 6000 and the second terminal 7000 operate may include the server 10000, and the first terminal 6000 and the second terminal 7000 may be connected to each other via the server 10000. For convenience of explanation, only two terminals, namely the first terminal 6000 and the second terminal 7000, are shown in Figure 20, but more than two terminals may be included. Except where specifically noted otherwise, the description relating to the first terminal 6000 and the second terminal 7000 also applies to any additional terminals. The server 10000 may be connected to a communication network, and may be connected to other external devices via that communication network. The server 10000 may transmit data to the other connected devices or receive data from them. The communication network connected to the server 10000 may include a wired communication network, a wireless communication network, or a composite communication network. The communication network may include mobile communication networks such as 3G, LTE, or LTE-A. The communication network may include wired or wireless communication networks such as Wi-Fi, UMTS/GPRS, or Ethernet. The communication network may include short-range communication networks such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or infrared communication (IR). The communication network may also include local area networks (LAN), metropolitan area networks (MAN), or wide area networks (WAN).

[0049] The server 10000 may receive data from at least one of the first terminal 6000 and the second terminal 7000. The server 10000 may perform calculations using the received data and may transmit the results of those calculations to at least one of the first terminal 6000 and the second terminal 7000. The server 10000 may receive a mediation request from at least one of the first terminal 6000 and the second terminal 7000, and may select the terminals that transmitted the mediation request. For example, the server 10000 may select the first terminal 6000 and the second terminal 7000. The server 10000 may mediate a communication connection between the selected first terminal 6000 and second terminal 7000. For example, the server 10000 may mediate a video call connection between the first terminal 6000 and the second terminal 7000, or it may mediate a connection for transmitting and receiving text. The server 10000 may transmit connection information regarding the first terminal 6000 to the second terminal 7000, or transmit connection information regarding the second terminal 7000 to the first terminal 6000. The connection information for the first terminal 6000 may include, for example, the IP address and port number of the first terminal 6000. Upon receiving the connection information for the second terminal 7000, the first terminal 6000 may use it to attempt to establish a connection with the second terminal 7000.
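The mediation flow in this paragraph can be sketched as a small in-memory model. It is an illustration only: the class and method names are hypothetical, no networking is performed, the pairing policy (first waiting terminal is matched) is an assumption, and the addresses come from a range reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionInfo:
    ip: str
    port: int


class MediationServer:
    """Pairs terminals that request mediation and swaps their connection info."""

    def __init__(self):
        self._waiting = {}  # terminal id -> ConnectionInfo

    def request_mediation(self, terminal_id, info):
        """Return None while waiting, or a dict that maps each matched
        terminal id to the connection info of its peer."""
        others = [tid for tid in self._waiting if tid != terminal_id]
        if others:
            peer_id = others[0]
            peer_info = self._waiting.pop(peer_id)
            self._waiting.pop(terminal_id, None)
            return {terminal_id: peer_info, peer_id: info}
        self._waiting[terminal_id] = info
        return None
```

Each terminal receives only its peer's IP address and port, and can then attempt the connection itself, as the paragraph describes.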

[0050] A video call session between the first terminal 6000 and the second terminal 7000 can be established when the first terminal 6000 succeeds in connecting to the second terminal 7000, or when the second terminal 7000 succeeds in connecting to the first terminal 6000. Through the video call session, the first terminal 6000 may transmit images and sounds to the second terminal 7000. The first terminal 6000 may encode the images and sounds into digital signals and transmit the encoded results to the second terminal 7000. Furthermore, the first terminal 6000 may receive images and sounds from the second terminal 7000 via the video call session. The first terminal 6000 may receive images and sounds encoded as digital signals and decode them. Likewise, through the video call session, the second terminal 7000 may transmit images and sounds to the first terminal 6000 and may receive images and sounds from the first terminal 6000. This allows the user of the first terminal 6000 and the user of the second terminal 7000 to communicate with each other by video call. The first terminal 6000 and the second terminal 7000 may each be, for example, a desktop computer, a laptop computer, a smartphone, a smart tablet, a smartwatch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. The first terminal 6000 and the second terminal 7000 may run a program or an application. The first terminal 6000 and the second terminal 7000 may be the same type of device or different types of devices.

[0051] Figure 21 is a flowchart illustrating a schematic image deformation method according to one embodiment of the present invention. Referring to Figure 21, an image deformation method according to one embodiment of the present invention includes the steps of acquiring user face landmark information (S2100), generating a user feature map (S2200), generating a target feature map (S2300), generating a mixed feature map (S2400), and generating a reenacted image (S2500). In step S2100, landmark information is obtained from the user's face image. The landmark refers to a facial feature that is characteristic of the user's face, and may include, for example, the user's eyes, eyebrows, nose, mouth, ears, or jawline. The landmark information may also include information about the position, size, or shape of the main elements of the user's face. Furthermore, the landmark information may include information about the color or texture of the main elements of the user's face. The above-mentioned user may mean any user using a terminal on which the image deformation method according to the present invention is performed. In step S2100, the user's face image is received and landmark information corresponding to the face image is acquired. The landmark information is obtained using known techniques, and any known method may be used. Furthermore, the present invention is not limited by the method used to acquire the landmark information. In step S2200, a transformation matrix corresponding to the above landmark information may be estimated. The above transformation matrix can be used to construct the above landmark information together with a predetermined unit vector. For example, the first landmark information may be calculated by multiplying the above unit vector and the first transformation matrix. The second landmark information may be calculated by multiplying the above unit vector and the second transformation matrix.

[0052] The transformation matrix described above is a matrix that transforms high-dimensional landmark information into low-dimensional data, and can be used in Principal Component Analysis (PCA). PCA is a dimensionality reduction method that transforms variables in a high-dimensional space into variables in a low-dimensional space by searching for new mutually orthogonal axes while preserving the variance of the data as much as possible. PCA first finds the hyperplane closest to the data, and then projects the data onto the low-dimensional hyperplane to reduce the dimensionality of the data. Alternatively, high-dimensional data can be transformed into low-dimensional data by defining the i-th axis in PCA as the i-th principal component (PC) and then linearly combining these axes.

[Equation]

[0053] The above learning model may be trained to estimate the above transformation matrix from facial images of different people and landmark information corresponding to each facial image. Although there may be multiple transformation matrices corresponding to a single high-dimensional landmark information, the above learning model may be trained to output only one of the multiple transformation matrices. The landmark information used as input to the above-mentioned learning model may be obtained using a known method of extracting landmarks from a face image and visualizing them. Therefore, in step S2100, the user's face image and landmark information corresponding to the face image are received as input, and a transformation matrix is estimated and output from there. On the other hand, the above-mentioned learning model may be trained to classify landmark information into multiple semantic groups corresponding to the right eye, left eye, nose, and mouth, and to output PCA conversion coefficients corresponding to each of the above multiple semantic groups. In this case, the above semantic groups are not necessarily classified to correspond to the right eye, left eye, nose, and mouth, but may also be classified to correspond to the eyebrows, eyes, nose, mouth, and jawline, or to correspond to the eyebrows, right eye, left eye, nose, mouth, jawline, and ear. In step S2100, the above landmark information may be classified into semantic groups of subdivided units according to the above learning model, and PCA conversion coefficients corresponding to the classified semantic groups may be estimated.

[0054] On the other hand, the above transformation matrix is used to calculate the user's expression landmark. Landmark information can be decomposed into multiple sub-landmarks, but in this invention, the above landmark information is represented as follows.

$$l(c, t) = l_m + l_{id}(c) + l_{exp}(c, t)$$

[0055] The aforementioned learning model may be trained to estimate the PCA coefficients α(c, t) by taking a photograph x(c, t) of person c and the landmark information l(c, t) as input, as shown in Equation 14. Through such training, the learning model may estimate the PCA coefficients from an image of a specific person and its corresponding landmark information, or it may estimate the low-dimensional eigenvectors. When the trained neural network is applied, the photograph x(c', t) of person c' whose landmarks are to be separated and the landmark information l(c', t) are taken as input to the neural network, and the PCA transformation matrix is estimated. At this time, b_exp may be a value obtained from the training data, and the expression landmark can be estimated from the predicted (estimated) PCA coefficients and b_exp as follows:

[Equation]

[0056] The user feature map generated in step S2200 includes information representing the facial expressions and facial movements of the user. The artificial neural network used in step S2200 may be a Convolutional Neural Network (CNN), but various other types of artificial neural networks may be used. In step S2300, the target's face image is received, and a target feature map and a pose-normalized target feature map are generated from the style information and pose information corresponding to the target's face image. The target refers to the person to be transformed by the present invention; the user and the target may be different people, but this is not necessarily the case. The reenacted image produced by carrying out the present invention is transformed from the target's facial image and may show the target imitating or copying the user's movements and facial expressions. The target feature map includes information representing the characteristics of the target's facial expression and facial movements. The pose-normalized target feature map may correspond to the output obtained when the style information is input to the artificial neural network. Alternatively, the pose-normalized target feature map may include information corresponding to the unique features of the target's face, excluding the target's pose information.

[0057] The artificial neural network used in step S2300 may be a CNN, similar to the artificial neural network used in step S2200, and the structure of the artificial neural network used in step S2200 and the structure of the artificial neural network used in step S2300 may differ from each other. The above style information refers to information that indicates a person's unique features on their face. For example, the above style information may include innate features, the size, shape, and position of landmarks that appear on the target's face. Alternatively, the above style information may include at least one of the following: texture information, color information, and shape information, which correspond to the target's facial image.

[0058] The target feature map may be understood to include data corresponding to the expression landmark information obtained from the target's facial image, and the pose-normalized target feature map may be understood to include data corresponding to the identity landmark information obtained from the target's facial image. Alternatively, the style information may include at least one of texture information, color information, and shape information corresponding to the target's facial image. In step S2400, the mixed feature map may be generated such that the target's landmarks have pose information corresponding to the user's landmarks. The artificial neural network used in step S2400 may be a CNN, like the artificial neural networks used in steps S2200 and S2300, and its structure may differ from the structures of the artificial neural networks used in the previous steps.

[0059] In step S2500, a reenacted image of the target's face image is generated using the mixed feature map and the pose-normalized target feature map. As mentioned above, the pose-normalized target feature map includes data corresponding to the identity landmark information obtained from the target's facial image. This identity landmark information corresponds to the unique characteristics of the person, which are unrelated to the expression information corresponding to the person's movements or facial expressions. While the mixed feature map generated in step S2400 allows the target's movements to naturally follow the user's movements, step S2500 reflects the unique characteristics of the target, achieving an effect similar to the actual target moving and making facial expressions on its own.
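The order in which steps S2100 to S2500 hand data to one another can be sketched as plain function composition. Every stage is passed in as a callable, so the sketch says nothing about how each network is built; the parameter names are hypothetical, and the assumption that step S2400 mixes the user feature map with the target feature map is an inference from the surrounding description.

```python
def reenact(user_image, target_image, *, acquire_landmarks, separate,
            encode_user, encode_target, blend, decode):
    """Wire the stages together; each argument after * is a callable."""
    # S2100: landmark information, split into pose (expression) and
    # style (identity) parts for both images.
    user_pose, _user_style = separate(user_image, acquire_landmarks(user_image))
    target_pose, target_style = separate(target_image,
                                         acquire_landmarks(target_image))
    # S2200: user feature map from the user's pose information.
    user_map = encode_user(user_pose)
    # S2300: target feature map and pose-normalized target feature map.
    target_map, normalized_map = encode_target(target_style, target_pose)
    # S2400: mixed feature map (inputs assumed, see the lead-in).
    mixed_map = blend(user_map, target_map)
    # S2500: reenacted image from the mixed and pose-normalized maps.
    return decode(mixed_map, normalized_map)
```

Passing stub callables that merely tag their inputs is a simple way to confirm that each stage receives the data the description says it should.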

[0060] Figure 22 is an illustrative diagram showing the results of performing an image deformation method according to one embodiment of the present invention. Figure 22 shows a target image, a user image, and a reenacted image, the reenacted image having the facial movements and expressions of the user while maintaining the facial features of the target image. Comparing the target image in Figure 22 with the reenacted image, it can be seen that the two images depict the same person, with only the facial expression being different. The eyes, nose, mouth, and hairstyle of the target image are the same as those of the reenacted image. On the other hand, the facial expression of the person in the re-enacted image is substantially the same as that of the user. For example, if the user has their mouth open in the user's image, the re-enacted image will have an image of the target with their mouth open. Also, if the user turns their head to the right or left in the user's image, the re-enacted image will have an image of the target with their head turned to the right or left. When receiving real-time changing images of a user and generating re-enacted images based on these images, the re-enacted images can change the target image in response to the user's real-time changing movements and facial expressions.

[0061] Figure 23 is a schematic diagram showing the configuration of an image deformation device according to one embodiment of the present invention. Referring to Figure 23, the image deformation device 8000 according to one embodiment of the present invention includes a landmark acquisition unit 8100, a first encoder 8200, a second encoder 8300, a blender 8400, and a decoder 8500. The landmark acquisition unit 8100 receives facial images of the user and the target, and acquires landmark information from each facial image. The landmark refers to a facial feature that is characteristic of the user's face, and may include, for example, the user's eyes, eyebrows, nose, mouth, ears, or jawline. The landmark information may also include information about the position, size, or shape of the main elements of the user's face. Furthermore, the landmark information may include information about the color or texture of the main elements of the user's face. The above-mentioned user may mean any user using a terminal on which the image deformation method according to the present invention is performed. The landmark acquisition unit 8100 receives the face image of the above-mentioned user and acquires landmark information corresponding to the face image. The above-mentioned landmark information can be obtained using known techniques, and any known method may be used. Furthermore, the present invention is not limited by the method used to acquire the above-mentioned landmark information.

[0062] The landmark acquisition unit 8100 may estimate a transformation matrix corresponding to the above landmark information. The above transformation matrix can constitute the above landmark information together with a predetermined unit vector. For example, the first landmark information may be calculated by multiplying the above unit vector and the first transformation matrix. The second landmark information may be calculated by multiplying the above unit vector and the second transformation matrix. The transformation matrix described above is a matrix that transforms high-dimensional landmark information into low-dimensional data, and can be used in Principal Component Analysis (PCA). PCA is a dimensionality reduction method that transforms variables in a high-dimensional space into variables in a low-dimensional space by searching for new mutually orthogonal axes while preserving the variance of the data as much as possible. PCA first finds the hyperplane closest to the data, and then projects the data onto the low-dimensional hyperplane to reduce the dimensionality of the data. Alternatively, high-dimensional data can be transformed into low-dimensional data by defining the i-th axis in PCA as the i-th principal component (PC) and then linearly combining these axes.

[0063] On the other hand, the landmark acquisition unit 8100 may use a learned model that has been trained to estimate the above transformation matrix. The above learned model may be understood as a model that has been trained to estimate the PCA transformation matrix from an arbitrary face image and landmark information corresponding to the arbitrary face image. The above learning model may be trained to estimate the above transformation matrix from facial images of different people and landmark information corresponding to each facial image. Although there may be multiple transformation matrices corresponding to a single high-dimensional landmark information, the above learning model may be trained to output only one of the multiple transformation matrices. The landmark information used as input to the above-mentioned learning model may be obtained using a known method of extracting landmarks from a face image and visualizing them. Accordingly, the landmark acquisition unit 8100 receives the user's face image and landmark information corresponding to the face image as input, and then estimates and outputs a transformation matrix. On the other hand, the above-mentioned learning model may be trained to classify landmark information into multiple semantic groups corresponding to the right eye, left eye, nose, and mouth, and to output PCA conversion coefficients corresponding to each of the above multiple semantic groups. In this case, the above-mentioned semantic groups are not necessarily classified to correspond to the right eye, left eye, nose, and mouth, but may also be classified to correspond to the eyebrows, eyes, nose, mouth, and jawline, or to correspond to the eyebrows, right eye, left eye, nose, mouth, jawline, and ear. 
The landmark acquisition unit 8100 may classify the above-mentioned landmark information into semantic groups of subdivided units according to the above-mentioned learning model, and estimate PCA conversion coefficients corresponding to the classified semantic groups.

[0064] Alternatively, the user's expression landmark may be calculated using the above transformation matrix. Landmark information can be separated into multiple pieces of sub-landmark information; in the present invention, landmark information is defined as the sum of the average face landmark information over all people, the identity landmark information unique to an individual person, and that person's expression landmark information. In other words, the landmark information of a particular person in a particular frame may be represented as the sum of the average landmark information of all people's faces, the identity landmark information of the particular person, and the facial expression and movement information of the particular person in the particular frame. The above-mentioned expression landmark corresponds to the pose information of the user's face image, and the above-mentioned identity landmark corresponds to the style information of the target's face image. In summary, the landmark acquisition unit 8100 may receive the user's face image and the target's face image and generate from them multiple pieces of landmark information, including the expression landmark information and identity landmark information of each. The first encoder 8200 generates a user feature map from the pose information of the user's face image. The pose information corresponds to the expression landmark information and may include motion information and facial expression information of the face image. Alternatively, the first encoder 8200 may input the pose information corresponding to the user's face image into an artificial neural network to generate the user feature map.
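
The additive definition above (average face + identity + expression) can be sketched as follows. The way the terms are estimated here, by averaging over all frames and over each person's frames, is an assumption made for illustration and is not stated in this paragraph; the name `decompose` and the random data are likewise hypothetical.

```python
import numpy as np

def decompose(landmarks_by_person):
    """Split landmarks into mean face, per-person identity, per-frame expression.

    landmarks_by_person: dict mapping a person to an array (n_frames, n_points, 3).
    Assumption: identity is estimated as the person's average face minus the
    global average face; expression is whatever remains in each frame.
    """
    all_frames = np.concatenate(list(landmarks_by_person.values()), axis=0)
    mean_face = all_frames.mean(axis=0)
    identity = {p: f.mean(axis=0) - mean_face
                for p, f in landmarks_by_person.items()}
    expression = {p: f - mean_face - identity[p]
                  for p, f in landmarks_by_person.items()}
    return mean_face, identity, expression

rng = np.random.default_rng(1)
data = {
    "user": rng.normal(size=(4, 68, 3)),    # hypothetical user frames
    "target": rng.normal(size=(6, 68, 3)),  # hypothetical target frames
}
mean_face, identity, expression = decompose(data)

# Landmarks of the target wearing the user's expression from frame 0.
reenacted = mean_face + identity["target"] + expression["user"][0]
```

The last line shows why the separation is useful: swapping only the expression term moves the target's landmarks without importing the user's identity term.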

[0065] The user feature map generated by the first encoder 8200 includes information representing the facial expressions and facial movements of the user. The artificial neural network used in the first encoder 8200 may be a Convolutional Neural Network (CNN), but various types of artificial neural networks may be used. The second encoder 8300 generates a target feature map and a pose-normalized target feature map from the style information and pose information of the target face image. The above-mentioned target refers to a person to be transformed by the present invention, and the above-mentioned user and the above-mentioned target may be different people, but are not necessarily limited to each other. The reenacted image produced as a result of carrying out the present invention may be transformed from the target's facial image and displayed in the form of the target that imitates or copies the user's movements and facial expressions. The target feature map generated by the second encoder 8300 may be understood as data corresponding to the user feature map generated by the first encoder 8200, and includes information representing the facial expression represented by the target and the characteristics of the facial movements of the target. The above pose-normalized target feature map may correspond to the output of the above style information input to the artificial neural network. Alternatively, the above pose-normalized target feature map may include information corresponding to the unique features of the target's face, excluding the target's pose information.

[0066] The artificial neural network used in the second encoder 8300 may be a CNN, similar to the artificial neural network used in the first encoder 8200, and the structure of the artificial neural network used in the first encoder 8200 and the structure of the artificial neural network used in the second encoder 8300 may differ from each other. The above style information refers to information that indicates a person's unique features on their face. For example, the above style information may include innate features, the size, shape, and position of landmarks that appear on the target's face. Alternatively, the above style information may include at least one of the following: texture information, color information, and shape information, which correspond to the target's facial image. The target feature map described above may be understood to include data corresponding to the representational landmark information obtained from the target's facial image, and the pose-normalized target feature map described above may be understood to include data corresponding to the identity landmark information obtained from the target's facial image. Blender 8400 may generate a mixed feature map using the above-mentioned user feature map and the above-mentioned target feature map, input the pose information of the user's face image and the style information of the target's face image into an artificial neural network, and generate the above-mentioned mixed feature map.

[0067] The mixed feature map described above may be generated such that the target landmarks have pose information corresponding to the user landmarks described above. The artificial neural network used in the Blender 8400 may be a CNN, similar to the artificial neural networks used in the first encoder 8200 and the second encoder 8300, and the structure of the artificial neural network used in the Blender 8400 may differ from the structure of the artificial neural network used in the first encoder 8200 or the second encoder 8300. The user feature map and target feature map input to the Blender 8400 each include the user's facial landmark information and the target's facial landmark information, respectively, and generate a target face corresponding to the user's facial movements and expressions. However, the Blender may also perform a matching operation between the user's facial landmark and the target's facial landmark in order to maintain the unique characteristics of the target face. For example, in order to control the facial movements of the target in accordance with the facial movements of the user, landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the user are linked to landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the target.

[0068] Alternatively, in order to control the facial expression of the target in accordance with the facial expression of the user, landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the user may be linked to landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the target. The decoder 8500 uses the mixed feature map and the pose-normalized target feature map to generate a re-enacted image for the target face image. As mentioned above, the pose-normalized target feature map includes data corresponding to identity landmark information obtained from the target's facial image. This identity landmark information refers to information corresponding to the unique characteristics of the person, which are unrelated to the expression information corresponding to the person's movement or facial expression. If the mixed feature map generated by Blender 8400 allows for the target's movement to naturally follow the user's movements, the decoder 8500 can reflect the target's unique characteristics, achieving an effect similar to that of an actual target moving and expressing emotions on its own.

[0069] Figure 24 is a schematic diagram showing the configuration of a landmark acquisition unit according to one embodiment of the present invention. Referring to Figure 24, the landmark acquisition unit according to one embodiment of the present invention may include an artificial neural network, which receives a person's face image (input image) as input. The artificial neural network may be a part of a known artificial neural network, but in one embodiment, the artificial neural network may be ResNet. ResNet is a type of CNN (Convolutional Neural Network), and the present invention is not limited to a specific type of artificial neural network. A Multi-Layer Perceptron (MLP) is a type of artificial neural network that stacks multiple layers of perceptrons to overcome the limitations of a single-layer perceptron. Referring to Figure 24, the MLP receives the output of the artificial neural network and landmark information corresponding to the face image as input. The MLP also outputs a transformation matrix. Alternatively, the artificial neural network and the MLP can be understood as forming a single, trained artificial neural network as a whole. When the transformation matrix is estimated via the learned artificial neural network, expression landmark information and identity landmark information can be calculated as described with reference to FIG. 23. The image deformation device according to the present invention can also be applied when there are only a very small number of face images or when there is only a face image of one frame. The learned artificial neural network is trained to estimate low-dimensional eigenvectors and transformation coefficients from a large number of face images and the corresponding landmark information. The artificial neural network thus trained can estimate the eigenvectors and transformation coefficients even when only a face image of one frame is given. 
By such a method, when the expression landmarks and identity landmarks of an arbitrary person are separated, the quality of face image processing technologies such as facial landmark-based face reenactment, face classification, and face morphing can be improved.

[0070] FIG. 25 is a diagram schematically showing the configuration of a second encoder according to an embodiment of the present invention. Referring to FIG. 25, the second encoder 8300 according to an embodiment of the present invention may adopt the structure of a U-Net. A U-Net is a U-shaped network with a symmetric form that basically performs a segmentation function. Here, f_y denotes the normalization flow map used when normalizing the target feature map, and T denotes the warping function. S_j (j = 1, ..., n_y) denotes the target feature map encoded at each convolutional layer. The second encoder 8300 receives the rendered target landmarks and the target image as inputs and generates the encoded target feature maps and the normalization flow map f_y. A warped target feature map is then generated by applying the warping function T to the generated target feature map S_j and the normalization flow map f_y. The warped target feature map here may be understood as being similar to the pose-normalized target feature map described above. Therefore, the warping function T may be understood as a function that generates data consisting only of the target's style information, i.e., identity landmark information, excluding the target's expression landmark information. Figure 26 is a schematic diagram showing the structure of a blender according to one embodiment of the present invention. As mentioned above, the Blender 8400 generates a mixed feature map from a user feature map and a target feature map, but it may also generate the mixed feature map by inputting pose information from the user's face image and style information from the target's face image into an artificial neural network.

[0071] Figure 26 shows one user feature map and three target feature maps, but there may be one target feature map, or more than two or three. Also, the small areas within each feature map shown in Figure 25 represent information for any landmark, and all of them represent information for the same landmark. The user feature map and target feature map input to the Blender 8400 each include the user's facial landmark information and the target's facial landmark information, respectively, and generate a target face corresponding to the user's facial movements and expressions. However, the Blender may also perform a matching operation between the user's facial landmark and the target's facial landmark in order to maintain the unique characteristics of the target face. For example, in order to control the facial movements of the target in accordance with the facial movements of the user, landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the user are linked to landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the target. Alternatively, in order to control the facial expression of the target in accordance with the facial expression of the user, landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the user may be linked to landmarks such as the eyes, eyebrows, nose, mouth, and jawline of the target. Alternatively, for example, after searching for an eye in the user feature map, an eye may be searched for in the target feature map, and a mixed feature map may be generated such that the eye in the target feature map follows the movement of the eye in the user feature map. The Blender 8400 can perform substantially the same operation for other landmarks as well.

[0072] Figure 27 is a schematic diagram showing the structure of a decoder according to one embodiment of the present invention. Referring to Figure 27, the decoder 8500 according to one embodiment of the present invention takes as input the pose-normalized target feature map generated by the second encoder 8300 and the mixed feature map z_xy generated by the Blender 8400, and applies the user's expression landmark information to the target image. In Figure 27, the data input to each block of the decoder 8500 is the pose-normalized target feature map generated by the second encoder 8300, and f_u denotes a flow map that applies the user's expression landmark information to the pose-normalized target feature map. Furthermore, the warp-alignment block of the decoder 8500 takes the output u of the previous block of the decoder 8500 and the pose-normalized target feature map as input and performs a warping function. The warping function performed by the decoder 8500 is for generating a reenacted image that mimics the user's movements and poses while maintaining the unique characteristics of the target, and is different from the warping function performed by the second encoder 8300. Moving images can be generated based on the embodiments described above with reference to Figures 1 to 27. For example, as described above with reference to Figures 5a to 6b, a moving image can be generated by converting an input still image. Alternatively, the input still image may be converted into a moving image based on an image conversion template. The image conversion template may include multiple frames, each of which may be a still image. For example, the input still image may be applied to each of the multiple frames to generate multiple intermediate images (i.e., multiple still images). The generated intermediate images may then be combined to generate a moving image.

[0073] Alternatively, an input video may be converted to generate a video. In this case, each of the multiple first still images (frames) contained in the input video is converted into a second still image, and these second still images are combined to generate a video. The embodiments described above with reference to Figures 1 to 27 may be realized by the content described below. For example, at least a part of the content described below may be applied to at least one of the embodiments described above with reference to Figures 1 to 27. When the meaning of a term described below is the same as or similar to the meaning of a term described with reference to Figures 1 to 27, the terms may be understood as referring to the same component. Likewise, when the content described below is the same as or similar to the content described above with reference to Figures 1 to 27, it may be understood as being the same. Furthermore, the content described below may be included in the content of the paper "MarioNETte: Few-Shot Face Reenactment Preserving Identity of Unseen Targets". When a mismatch occurs between the target identity and the driver identity, the quality of face reenactment results degrades significantly, especially in few-shot settings. The identity preservation problem, i.e., the loss of target details that leads to defective output, is the most common failure mode. This problem has several potential causes, including leakage of the driver's identity due to identity mismatch and the handling of unseen large poses.

[0074] To overcome these problems, we propose three components: an image attention block, a target feature alignment unit, and a landmark transformation unit. By attending to and warping the relevant features, the proposed architecture, called MarioNETte, reenacts unseen identities with high quality in few-shot settings. Furthermore, the landmark transformation unit separates out the expression geometry by decomposing landmarks, dramatically mitigating the identity preservation problem. Comprehensive experiments are conducted to confirm that the proposed framework can generate highly realistic faces and outperforms all other baselines, even when the facial characteristics of the target and the driver differ significantly. Given a target face and a driver face, face reenactment aims to synthesize an animated face that follows the driver's movements while preserving the target's identity. Methods based on generative adversarial networks (GANs) have been widely used and have achieved great success in image generation. Xu et al. (2017) and Wu et al. (2018) used CycleGAN (Zhu et al., 2017) to obtain high-fidelity face reenactment results. However, CycleGAN-based approaches require at least several minutes of training data for each target and can only reenact predefined identities. This is unappealing in practice, where reenacting unseen targets is unavoidable. Therefore, few-shot face reenactment approaches attempt to reenact unseen targets using adaptive instance normalization (AdaIN) (Zakharov et al., 2019) or warping modules (Wiles, Koepke, and Zisserman 2018; Siarohin et al., 2019). However, current state-of-the-art methods suffer from the identity preservation problem, namely the inability to preserve the target's identity, which leads to defects in the reenacted output. The problem is further exacerbated when the driver's identity and the target's identity differ.

[0075] Examples of flawed face reenactments generated by previous approaches, alongside successful results from the proposed model, are shown in Figures 28A–28C. In most cases, the failures of previous approaches can be divided into three modes. 1. If identity mismatch is not considered, the driver's identity interferes with face synthesis, resulting in a generated face that resembles the driver (Figure 28A). 2. If the compressed vector representation (e.g., the AdaIN layer) used to preserve target identity information has insufficient capacity, the generated face may lack detailed features (Figure 28B). 3. Warping operations create defects when handling large poses (Figure 28C). We propose a framework called MarioNETte, which aims to reenact the face of an unseen target in a few-shot manner while preserving identity without fine-tuning. It uses an image attention block and target feature alignment, which allow MarioNETte to directly inject features from the target during image generation. We also propose a new landmark transformation unit, which adjusts for identity discrepancies in an unsupervised manner, further mitigating the identity preservation problem. Details are described below. We propose a few-shot face reenactment framework called MarioNETte, which preserves the target identity even when the driver's facial features differ significantly from the target's. The proposed method uses image attention blocks, which allow the model to attend to the relevant locations of the target feature map, in combination with target feature alignment, which performs warping at multiple feature levels, to improve the quality of face reenactment when the identities differ. A new landmark transformation method is also introduced that accommodates the diverse facial features of different people. The proposed method adapts the driver's landmarks to the target's landmarks in an unsupervised manner, mitigating the identity preservation problem without the need for separately labeled data. Using the VoxCeleb1 (Nagrani, Chung, and Zisserman 2017) and CelebV (Wu et al., 2018) datasets, the proposed method is compared with state-of-the-art methods when the target identity and driver identity match and when they differ, respectively. The experiments, including user studies, demonstrate that the proposed method surpasses state-of-the-art methods.

MarioNETte structure

[0076] Figure 29 shows the overall structure of the proposed model. The conditional generator G generates a reenacted face given the driver x and the target images [formula], and the discriminator D predicts whether an image is real or not. The generator is composed of the following components.
• The preprocessor P extracts facial keypoints using a 3D landmark detector (Bulat and Tzimiropoulos 2017) and renders them into landmark images, yielding [formula] and [formula] for the driver and target inputs respectively. The proposed landmark transformation unit is included in the preprocessor. Because the landmarks are normalized for size, translation, and rotation before being used in the landmark transformation unit, 3D landmarks are used instead of 2D landmarks.
• The driver encoder [formula] extracts pose and expression information from the driver input and generates the driver feature map z_x.
• The target encoder [formula] adopts the structure of a U-Net to extract style information from the target input and generates the warped target feature map [formula] together with the target feature map z_y.
• The blender [formula] receives the driver feature map z_x and the target feature maps [formula] and generates the mixed feature map z_xy. The proposed image attention block is a fundamental component of the blender.
• The decoder [formula] synthesizes the reenacted image from the warped target feature maps [formula] and the mixed feature map z_xy. The decoder improves the quality of the reenacted image using the proposed target feature alignment.

Image attention block
To transfer target style information to the driver, previous studies encoded the target information into vectors and mixed them with driver features via concatenation or AdaIN layers (Liu et al., 2019; Zakharov et al., 2019). However, encoding targets as spatially independent vectors results in a loss of spatial information. Furthermore, such methods have no dedicated design for multiple target images, so summary statistics (e.g., mean or maximum) are used to handle multiple targets, potentially losing detailed target information.

[0077] To solve the aforementioned problem, we propose an image attention block (Figure 30). The proposed attention block is inspired by the encoder-decoder attention of the Transformer (Vaswani et al. 2017), where the driver feature map acts as the attention query and the target feature maps act as the attention memory. The proposed attention block attends to the appropriate location of each feature (red box in Figure 30) while handling multiple target feature maps (i.e., Z_y). Given the driver feature map [formula] and the target feature maps [formula], the attention is calculated as follows.

[Equation]
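
The query/memory arrangement described above can be sketched with plain scaled dot-product attention. This is an illustrative simplification, not the equation of the present text: learned projection matrices and positional encodings are omitted, and the function name `image_attention` and the array shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def image_attention(z_x, z_y):
    """Attend from a driver feature map to several target feature maps.

    z_x: driver feature map, shape (h, w, c)       -> attention query
    z_y: target feature maps, shape (k, h, w, c)   -> attention memory
    Returns the mixed feature map (h, w, c) and the attention weights.
    Assumption: no learned projections or positional encodings.
    """
    h, w, c = z_x.shape
    query = z_x.reshape(h * w, c)
    memory = z_y.reshape(-1, c)  # all locations of all k target maps
    weights = softmax(query @ memory.T / np.sqrt(c), axis=-1)
    mixed = (weights @ memory).reshape(h, w, c)
    return mixed, weights

rng = np.random.default_rng(2)
z_x = rng.normal(size=(4, 4, 8))       # hypothetical driver features
z_y = rng.normal(size=(3, 4, 4, 8))    # hypothetical features of 3 targets
z_xy, weights = image_attention(z_x, z_y)
```

Because the memory concatenates every spatial location of every target map, each driver location can draw on any of the target images without collapsing them into summary statistics first.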

[0078] Target feature alignment
The fine details of the target identity can be preserved by warping low-level features (Siarohin et al. 2019). Unlike previous approaches that estimate a warping flow map or an affine transformation matrix by computing the differences between the keypoints of the target and the driver (Balakrishnan et al. 2018; Siarohin et al. 2018; Siarohin et al. 2019), the proposed target feature alignment warps the target feature map in two steps: (1) target pose normalization generates a pose-normalized target feature map, and (2) driver pose adaptation aligns the pose-normalized target feature map to the driver's pose (Figure 31). Through this two-step process, the model can better handle structural differences between different identities. The details are as follows.
1. Target pose normalization: The feature maps [formula] encoded by the target encoder E_y are warped using the estimated normalization flow map f_y and the warping function T (part 1 of Figure 31), yielding [formula]. The subsequent warp-alignment block of the decoder treats [formula] as independent of the target pose.
2. Driver pose adaptation: The warp-alignment block of the decoder receives [formula] and the output u of the previous block of the decoder. In few-shot settings, the resolution-compatible feature maps from the different target images are averaged (e.g., [formula]). To adapt the pose-normalized feature map to the driver's pose, an estimated flow map f_u for the driver is generated by a 1x1 convolution that takes u as input. Alignment is then performed by [formula] (part 2 of Figure 31). The result is subsequently concatenated to u and fed into the following residual upsampling block.
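
The two warping steps can be illustrated with a small bilinear warping function. This is a sketch under stated assumptions: the flow maps `f_y` and `f_u` below are hand-written placeholders (in the text they are estimated by the networks), flows are expressed as per-pixel (dy, dx) offsets, and the function name `warp` is hypothetical.

```python
import numpy as np

def warp(feature, flow):
    """Bilinearly sample `feature` at positions shifted by `flow`.

    feature: (h, w, c) feature map
    flow:    (h, w, 2) offsets (dy, dx) in pixels; samples outside the map
             are clamped to the border (an assumption of this sketch).
    """
    h, w, _ = feature.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sy = np.clip(ys + flow[..., 0], 0, h - 1)
    sx = np.clip(xs + flow[..., 1], 0, w - 1)
    y0 = np.floor(sy).astype(int)
    x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (sy - y0)[..., None]
    wx = (sx - x0)[..., None]
    top = feature[y0, x0] * (1 - wx) + feature[y0, x1] * wx
    bottom = feature[y1, x0] * (1 - wx) + feature[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(3)
S = rng.normal(size=(5, 6, 4))             # hypothetical target feature map

f_y = np.zeros((5, 6, 2)); f_y[..., 1] = 1.0   # placeholder normalization flow
f_u = np.zeros((5, 6, 2)); f_u[..., 0] = 0.5   # placeholder driver flow

pose_normalized = warp(S, f_y)        # step 1: target pose normalization
aligned = warp(pose_normalized, f_u)  # step 2: driver pose adaptation
```

Keeping the two flows separate mirrors the description above: the first removes the target's own pose, and the second imposes the driver's pose on the result.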

[0079] Landmark transformation unit
Significant structural differences between two sets of facial landmarks drastically degrade the quality of reenactment. Common approaches to this problem are to learn a transformation for every identity (Wu et al., 2018) or to prepare paired landmark data with similar expressions (Zhang et al., 2019). However, such methods are ill-suited to few-shot settings that deal with unseen identities, and labeled data is difficult to collect. To overcome these difficulties, we propose a novel landmark transformation unit that transfers a driver's facial expression to an arbitrary target identity. The landmark transformation unit uses multiple videos of unlabeled human faces and is trained in an unsupervised manner.

Landmark separation
Given videos of different identities, let x(c,t) denote the t-th frame of the c-th video and l(c,t) its 3D facial landmarks. First, all landmarks are normalized for size, translation, and rotation, converting them into normalized landmarks [formula]. Inspired by the 3D morphable face model (Blanz and Vetter 1999), we assume that normalized landmarks can be separated as follows:

[Equation]

[0080] Landmark decomposition
In few-shot settings, a neural network is introduced that regresses the coefficients of linear bases in order to separate identity from expression geometry. Such an approach has previously been widely used to model the geometry of complex faces (Blanz and Vetter 1999). Expression bases are extracted from the training data by dividing the expression landmarks into semantic groups of the face (e.g., mouth, nose, eyes) and performing PCA on each group.

[Equation]

[Equation]
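
Extracting expression bases per semantic group, as described above, might look like the following sketch. The landmark index ranges in `GROUPS`, the number of bases, and the random input are hypothetical; the text does not specify which landmark indices belong to which group.

```python
import numpy as np

# Hypothetical assignment of landmark indices to semantic groups.
GROUPS = {
    "eyes": slice(0, 12),
    "nose": slice(12, 21),
    "mouth": slice(21, 41),
}

def expression_bases(expression_landmarks, n_bases):
    """Run PCA separately on each semantic group of expression landmarks.

    expression_landmarks: (n_frames, n_points, 3)
    Returns a dict mapping group name to an array (n_bases, group_points * 3)
    whose rows are orthonormal expression bases for that group.
    """
    bases = {}
    n_frames = len(expression_landmarks)
    for name, idx in GROUPS.items():
        flat = expression_landmarks[:, idx].reshape(n_frames, -1)
        flat = flat - flat.mean(axis=0)
        _, _, vt = np.linalg.svd(flat, full_matrices=False)
        bases[name] = vt[:n_bases]
    return bases

rng = np.random.default_rng(4)
expr = rng.normal(size=(50, 41, 3))   # hypothetical expression landmarks
bases = expression_bases(expr, n_bases=4)
```

Running PCA per group keeps, for example, mouth motion from being mixed into the bases that describe the eyes.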

[0081] Experimental settings
Dataset
The model and the baselines were trained using VoxCeleb1 (Nagrani, Chung, and Zisserman 2017), which contains 256×256 videos of 1,251 different identities. The test splits of VoxCeleb1 and CelebV (Wu et al., 2018) were used to evaluate self-reenactment and reenactment under a different identity, respectively. A test set was generated by sampling 2,083 image sets from 100 randomly selected videos in the VoxCeleb1 test split, and 2,000 image sets were uniformly sampled from all identities in CelebV. The CelebV data contains videos of five famous people with widely varying characteristics, and was used to evaluate the performance of the models in reenacting unseen targets, as in actual scenarios. Details of the loss function and training method can be found in Supplementary Materials A3 and A4.

Baselines
MarioNETte variants with and without the landmark transformation unit (MarioNETte+LT and MarioNETte, respectively) are compared with state-of-the-art models for few-shot face reenactment. The details of each baseline are as follows.
• X2Face (Wiles, Koepke, and Zisserman 2018): X2Face uses direct image warping. The pre-trained model provided by the authors, trained on VoxCeleb1, is used.
• Monkey-Net (Siarohin et al. 2019): Monkey-Net employs feature-level warping. The implementation provided by the authors is used. Due to the structure of the method, Monkey-Net can receive only one target image.
• NeuralHead (Zakharov et al. 2019): NeuralHead uses AdaIN layers. Since there is no reference implementation, we made a faithful attempt to reproduce the results. Our implementation is a feedforward version of the model (NeuralHead-FF), in which the meta-training and fine-tuning stages are omitted because a single model is used to handle multiple identities.

[0082] Metrics
To evaluate the quality of the generated images, the models are compared using the following metrics. Structural similarity (SSIM) (Wang et al., 2004) and peak signal-to-noise ratio (PSNR) assess the low-level similarity between the generated image and the ground-truth image. Masked SSIM (M-SSIM) and masked PSNR (M-PSNR), in which the measurement is restricted to the face region, are also reported. When no ground-truth image exists because a different identity drives the target face, the following metrics are more relevant. Identity preservation is assessed using the cosine similarity (CSIM) of embedding vectors generated by a pre-trained face recognition model (Deng et al. 2019). To examine the model's ability to reproduce the driver's pose and expression, PRMSE, the root mean squared error of head pose angles, and AUCON, the ratio of matching facial action unit values between the generated image and the driving image, are calculated. OpenFace (Baltrusaitis et al. 2018) is used to calculate the pose angles and action unit values.

Experimental results
The models were compared under self-reenactment and under reenactment of a different identity, including user studies. Ablation experiments were also performed. All experiments were conducted in both single-image and multi-image settings; one target image was used in the single-image setting, and eight target images were used in the multi-image setting.

Self-reenactment
Figure 34 shows the evaluation results of the models on VoxCeleb1 under the self-reenactment setting. MarioNETte outperforms the other models on all metrics in the multi-image setting, and on all metrics except PSNR in the single-image setting. However, MarioNETte shows the best performance in M-PSNR, which means it performs better in the facial region than the baselines. The low CSIM obtained with NeuralHead-FF is indirect evidence of the limited capacity of AdaIN-based methods.
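
The metrics named above, other than SSIM, reduce to short formulas, sketched here in NumPy. The inputs are assumptions: in the experiments the embedding vectors would come from a pre-trained face recognition model and the pose angles and action unit values from OpenFace, whereas here they are small hand-written arrays, and the function names are hypothetical.

```python
import numpy as np

def psnr(generated, reference, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    diff = generated.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def csim(embedding_a, embedding_b):
    """Cosine similarity between two identity embedding vectors."""
    a = np.asarray(embedding_a, dtype=np.float64)
    b = np.asarray(embedding_b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prmse(pose_a, pose_b):
    """Root mean squared error between head pose angles."""
    a = np.asarray(pose_a, dtype=np.float64)
    b = np.asarray(pose_b, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def aucon(units_a, units_b):
    """Fraction of facial action unit values that match."""
    return float(np.mean(np.asarray(units_a) == np.asarray(units_b)))

# Hypothetical inputs for illustration only.
black = np.zeros((2, 2), dtype=np.uint8)
white = np.full((2, 2), 255, dtype=np.uint8)
pose_error = prmse([0.0, 0.0, 0.0], [3.0, 3.0, 3.0])
unit_agreement = aucon([1, 0, 1, 1], [1, 1, 1, 0])
```

Higher is better for PSNR, CSIM, and AUCON, while lower is better for PRMSE.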

[0083] Reenacting a different identity
Figure 35 shows the evaluation results of reenacting a different identity on CelebV, and Figure 33 shows images generated by the proposed method and the baselines. MarioNETte and MarioNETte+LT adequately preserve the target identity and outperform the other models in CSIM. The proposed method mitigates the identity preservation problem regardless of whether the driver has the same identity as the target. NeuralHead-FF shows slightly better performance than MarioNETte in terms of PRMSE and AUCON, but the low CSIM of NeuralHead-FF means that it failed to preserve the target identity. The landmark transformation unit causes a slight degradation in PRMSE and AUCON, but significantly improves identity preservation. This degradation may be because the PCA bases used for expression decomposition are not diverse enough to span the entire space of expressions. Furthermore, the decomposition of identity and expression is itself a critical issue, especially in the single-image setting.

User study
Two types of user studies were conducted to evaluate the performance of the proposed model.
• Comparative analysis: Given three example images of the target and one image of a driver, two images generated by different models were displayed, and human evaluators were asked to select the higher-quality image. Users were asked to evaluate the image quality in terms of (1) identity preservation, (2) reproduction of the driver's pose and facial expression, and (3) photorealism. The win rate of each baseline model against the proposed model was reported. The scores reported by users are considered to reflect the quality of the different models better than other indirect metrics.
• Realism analysis: Similar to the user study protocol of Zakharov et al. (2019), three photographs of the same person were presented to human evaluators. Of the three photographs, two were taken from a video, and the remaining one was generated by the model. Users were instructed to select, within a time limit of 3 seconds, the image that differed from the other two in terms of identity. The rate at which evaluators were deceived, which reflects identity preservation and photorealism, is reported for each model. In both studies, 150 examples were sampled from CelebV and distributed evenly among 100 different evaluators.

[0084] Figure 36 shows that the proposed model is preferred over the conventional methods and achieves a significantly higher realism score. This indicates that, in terms of human perception, MarioNETte generates realistic reenactments while preserving the target identity. A slight preference for MarioNETte over MarioNETte+LT was observed. This is because, as shown in Figure 35, MarioNETte+LT preserves identity better at the cost of slightly weaker expression transfer. Because MarioNETte+LT achieves a higher realism score than all other models, almost twice MarioNETte's score in the multi-image setting, the slight reduction in expression transfer is not a significant issue.

Ablation experiment
Ablation tests were performed to investigate the effects of the proposed components. The following configurations are compared while reenacting a different identity, with all other aspects kept the same: (1) MarioNETte is the proposed method, in which both the image attention block and target feature alignment are applied. (2) AdaIN is the same model as MarioNETte except that the image attention block is replaced with a residual AdaIN block and target feature alignment is omitted. (3) +Attention is MarioNETte with only the image attention block applied. (4) +Alignment uses only target feature alignment.

Figure 37 shows the results of the ablation test. For identity preservation (i.e., CSIM), AdaIN has difficulty combining style features because it relies only on residual AdaIN blocks. +Attention attends to the proper coordinates and greatly mitigates the problem in both the single-image and multi-image settings. +Alignment shows a higher CSIM than +Attention, but struggles to produce plausible images for unseen poses and expressions, resulting in worse PRMSE and AUCON. MarioNETte leverages both attention and target feature alignment and performs better than +Alignment on all metrics under consideration.

[0085] +Alignment, which relies entirely on target feature alignment for reenactment, is prone to failure when there is a large pose difference between the target and the driver. MarioNETte can overcome this. Given a single driver image along with three target images (Figure 38A), +Alignment shows an artifact on the forehead (indicated by the arrow in Figure 38B). This is because it (1) warps low-level features under a large pose difference and (2) combines features from multiple targets with varying poses. MarioNETte, on the other hand, handles the situation appropriately by attending not only to the correct spatial coordinates within a target image but also to the correct image among the several target images. The attention map, which highlights the area attended to by the image attention block, is shown in white in Figure 38A. MarioNETte attends to the forehead and to the target images whose poses are similar to that of the driver (targets 2 and 3 in Figure 38A).

Related work
A classic approach to face reenactment is to use explicit 3D modeling of human faces (Blanz and Vetter 1999), where the parameters of driver and target 3DMMs are typically computed from a single image and then mixed (Thies et al. 2015; Thies et al. 2016). Image warping is another popular approach, which modifies a target image using estimated flows derived from 3D models (Cao et al. 2013) or sparse landmarks (Averbuch-Elor et al. 2017). Face reenactment research has embraced recent successes of neural networks, exploring various image-to-image translation architectures (Isola et al. 2017), such as the work of Xu et al. (2017) and Wu et al. (2018), which combine the cycle consistency loss (Zhu et al. 2017). A mixture of the two approaches has also been studied. Kim et al. (2018) trained an image translation network to map reenacted renderings of 3D face models to realistic outputs.

[0086] Recently, architectures have been proposed that can fuse target style information with driver spatial information. AdaIN layers (Huang and Belongie 2017; Huang et al. 2018; Liu et al. 2019), attention mechanisms (Zhu et al. 2019; Lathuilière et al. 2019; Park and Lee 2019), deformation operations (Siarohin et al. 2018; Dong et al. 2018), and GAN-based methods (Bao et al. 2018) have all been widely adopted. Similar ideas have been applied to few-shot face reenactment, for example image-level (Wiles, Koepke and Zisserman 2018) and feature-level (Siarohin et al. 2019) warping, and the use of AdaIN layers coupled with meta-learning (Zakharov et al. 2019). The problem of identity mismatch has been studied using methods such as CycleGAN-based landmark transformation (Wu et al. 2018) and landmark swapping (Zhang et al. 2019). While effective, these methods require either an independent model for each person or a dataset of image pairs that is difficult to obtain.

Conclusion
Here, we propose a framework for few-shot face reenactment. The proposed image attention block and target feature alignment, along with the landmark transformation unit, can handle the identity mismatch that occurs when another person's landmarks are used. The proposed method does not require an additional fine-tuning step for identity adaptation, which significantly increases the usefulness of the model when it is deployed in practice. The experiments, including human evaluation, demonstrate the effectiveness of the proposed method. Future research should focus on improving the landmark transformation unit and on handling landmark decomposition more effectively, to make the reenactment even more convincing.

Supplementary materials
Detailed information on the MarioNETte architecture

[0087] Architecture design
Given a driver image x and K target images, the proposed face reenactment framework, called MarioNETte, first generates 2D landmark images for the driver and for each target (r_x and r_y, respectively). A 3D landmark detector (Bulat and Tzimiropoulos 2017) extracts facial keypoints, which carry information about pose and expression, from the driver and target images. The 3D landmarks are then rasterized into landmark images by the rasterizer R. A simple rasterizer is used, which orthogonally projects the 3D landmark points (e.g., (x, y, z)) onto the 2D xy-plane (e.g., (x, y)). The projected landmarks are then grouped into eight categories: left eye, right eye, contour, nose, left eyebrow, right eyebrow, inner mouth, and outer mouth. For each group, lines are drawn between points in a predetermined order using predefined colors (e.g., red, red, green, blue, yellow, yellow, cyan, and cyan, respectively). The result is the rasterized image shown in Figure 39. MarioNETte consists of a conditional image generator G and a projection discriminator D. Taking the rasterized landmarks and the identity c as conditional inputs, the discriminator D determines whether a given image is a real image from the data distribution.
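The simple rasterizer described above can be sketched as follows. The sketch assumes a 68-point landmark layout; the index ranges, the left/right assignment, and the function names are illustrative assumptions and are not stated in the text. Only the orthogonal projection, the eight groups, and the per-group colors come from the description. The output is a list of colored line segments, leaving the actual pixel drawing to any imaging library.

```python
# Group name -> (landmark indices in drawing order, line color).
# The index ranges are an assumed 68-point layout, not taken from the text.
GROUPS = {
    "left_eye": (range(42, 48), "red"),
    "right_eye": (range(36, 42), "red"),
    "contour": (range(0, 17), "green"),
    "nose": (range(27, 36), "blue"),
    "left_eyebrow": (range(22, 27), "yellow"),
    "right_eyebrow": (range(17, 22), "yellow"),
    "inner_mouth": (range(60, 68), "cyan"),
    "outer_mouth": (range(48, 60), "cyan"),
}

def project(landmarks_3d):
    """Orthogonal projection onto the xy-plane: drop the z coordinate."""
    return [(x, y) for x, y, _z in landmarks_3d]

def rasterize_segments(landmarks_3d):
    """Return (start, end, color) segments joining consecutive points per group."""
    points = project(landmarks_3d)
    segments = []
    for indices, color in GROUPS.values():
        ids = list(indices)
        for a, b in zip(ids, ids[1:]):
            segments.append((points[a], points[b], color))
    return segments
```

Each segment can then be drawn onto a blank canvas to obtain a landmark image such as the one referenced in Figure 39.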

[0088] The generator G can be further broken down into four components: the target encoder, the driver encoder, the blender, and the decoder. The target encoder takes a target image and generates the encoded target feature map z_y together with warped target feature maps. The driver encoder receives the driver image and generates the driver feature map z_x. The blender combines the encoded feature maps and generates the mixed feature map z_xy. The decoder generates the reenacted image. The input image y and the landmark image r_y are concatenated channel-wise and fed to the target encoder. The target encoder adopts a U-Net (Ronneberger, Fischer, and Brox 2015) style architecture with five downsampling blocks and four upsampling blocks connected by skip connections. Of the five feature maps generated by the downsampling blocks, the most downsampled one, s_5, is used as the encoded target feature map z_y, and the remaining four are converted to normalized feature maps. Using the normalization flow map f_y, the following warping function converts each of these feature maps into a normalized feature map:

[Equation image not reproduced in the source text]

[0089] Because it is differentiable, a bilinear sampler-based warping function is employed, which is widely used with neural networks (Jaderberg et al. 2015; Balakrishnan et al. 2018; Siarohin et al. 2019). Because each s_j has a different width and height, average pooling is applied to f_y so that its size matches that of s_j. The driver encoder consists of four residual downsampling blocks; it takes the driver's landmark image r_x and generates the driver feature map z_x. The blender mixes the positional information of z_x with the target style feature map z_y to generate the mixed feature map z_xy. The blender is built by stacking three image attention blocks. The decoder consists of four warp-alignment blocks followed by residual upsampling blocks. The last upsampling block is followed by an additional convolutional layer and a hyperbolic tangent activation function. The discriminator consists of five residual downsampling blocks without a self-attention layer. A projection discriminator is employed with a slight modification: the global sum-pooling layer is removed from the original structure. With the global sum-pooling layer removed, the discriminator produces scores for multiple patches, similar to the PatchGAN discriminator (Isola et al. 2017). The network is built from the residual upsampling and downsampling blocks proposed by Brock, Donahue, and Simonyan (2019). All batch normalization layers are replaced with instance normalization, except in the target encoder and the discriminator, which have no normalization layers. ReLU is used as the activation function. The number of output channels is doubled by each downsampling block (or halved by each upsampling block). The minimum number of channels is set to 64 and the maximum to 512 for all layers. The images fed to the target encoder, driver encoder, and discriminator are first projected through a convolutional layer to a channel size of 64.
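Two points from the paragraph above can be illustrated with a minimal sketch: warping a feature map with a bilinear sampler, and the channel-width rule (start at 64, double per downsampling block, cap at 512). The sketch assumes a single-channel feature map stored as a list of rows and a flow map that holds a per-pixel (dx, dy) offset in pixel units with border clamping; these conventions and the function names are assumptions, since the text does not fix them.

```python
import math

def bilinear_sample(grid, x, y):
    """Sample a 2D grid (list of rows) at real-valued (x, y), clamped to the border."""
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets used as interpolation weights
    top = (1 - ax) * grid[y0][x0] + ax * grid[y0][x1]
    bottom = (1 - ax) * grid[y1][x0] + ax * grid[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(feature, flow):
    """Assumed convention: output pixel (x, y) reads from (x + dx, y + dy)."""
    return [
        [bilinear_sample(feature, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(len(feature[0]))]
        for y in range(len(feature))
    ]

def channel_schedule(n_down_blocks, base=64, cap=512):
    """Channel widths after the input projection and each downsampling block."""
    return [min(base * 2 ** i, cap) for i in range(n_down_blocks + 1)]
```

Because the interpolation weights vary smoothly with the flow, gradients can propagate through the sampling step, which is the property the paragraph relies on.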

[0090] Positional encoding
A slightly modified version of the sinusoidal positional encoding introduced by Vaswani et al. (2017) is used. First, the channels of the positional encoding are split in half. One half is used to encode the horizontal coordinate and the other half to encode the vertical coordinate. To encode relative position, the absolute coordinates are normalized by the width and height of the feature map. Thus, given a feature map, the corresponding positional encoding is calculated as follows:

[Equation image not reproduced in the source text]
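Since the exact formula is not available in this text, the following is only a sketch of a positional encoding that follows the verbal description: half of the channels encode the horizontal coordinate, the other half the vertical coordinate, and coordinates are normalized by the feature map size. The base of 10000 follows Vaswani et al. (2017); the `scale` factor, the sine/cosine interleaving, and the function name are assumptions.

```python
import math

def positional_encoding(height, width, channels, scale=256.0):
    """Return a height x width x channels nested list of sinusoidal encodings."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    half = channels // 2  # channels reserved for each axis
    enc = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for row in range(height):
        for col in range(width):
            u = scale * col / width    # relative horizontal position (assumed scale)
            v = scale * row / height   # relative vertical position (assumed scale)
            for k in range(half // 2):
                freq = 1.0 / (10000.0 ** (2.0 * k / half))
                enc[row][col][2 * k] = math.sin(u * freq)             # horizontal half
                enc[row][col][2 * k + 1] = math.cos(u * freq)
                enc[row][col][half + 2 * k] = math.sin(v * freq)      # vertical half
                enc[row][col][half + 2 * k + 1] = math.cos(v * freq)
    return enc
```

Normalizing by width and height means that feature maps of different resolutions receive the same encoding at the same relative location.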

[0091] Loss function
The model is trained adversarially using the projection discriminator D (Miyato and Koyama 2018). The discriminator aims to distinguish a real image of identity c from a synthesized image of c generated by G. Since paired target and driver images of different identities cannot be obtained without explicit annotation, the model is trained using target and driver images extracted from the same video. Therefore, during training, the identities of x and y_i are always the same (e.g., c) for every pair of target and driver images. The discriminator D is optimized using the hinge GAN loss (Lim and Ye 2017) as follows:

[Equation image not reproduced in the source text]
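The hinge GAN loss named above has a standard form, sketched below on plain lists of discriminator scores. How the landmark and identity conditions enter D is not modeled here; the scores are assumed to be already computed, possibly one per patch, and the function names are hypothetical.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Standard hinge loss for D: push real scores above +1 and fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_hinge_loss(fake_scores):
    """Standard adversarial term for G under the hinge formulation: raise D's score."""
    return -sum(fake_scores) / len(fake_scores)
```

For example, `discriminator_hinge_loss([2.0, 0.5], [-2.0, 0.0])` penalizes only the real score below +1 and the fake score above -1.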

[0092] Training details
To stabilize adversarial training, spectral normalization (Miyato et al. 2018) is applied to all layers of the discriminator and the generator. Furthermore, the convex hull of the facial landmarks is used as the face region mask, and the perceptual loss is computed with a threefold weight on the positions inside this mask. The model is trained using the Adam optimizer, with a learning rate of 2 × 10⁻⁴ for the discriminator and 5 × 10⁻⁵ for the generator and the style encoder. Unlike the configuration of Brock, Donahue, and Simonyan (2019), the discriminator is updated only once per generator update during training. λ_P is set to 10, λ_PF to 0.01, λ_FM to 10, and the number of target images K to 4.

Detailed information about the landmark transformation unit

Landmark separation
Formally, the separation of landmarks is calculated as follows:

[Equation image not reproduced in the source text]
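The separation can be illustrated with a minimal sketch that relies on the relation stated in the claims, namely that a facial landmark is the sum of an average landmark, an identity landmark, and an expression landmark. Estimating the identity landmark as the per-person mean minus the average landmark is an assumption made for this sketch, as are the function names; landmarks are flattened into plain lists of floats.

```python
def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(*vectors):
    return [sum(col) for col in zip(*vectors)]

def separate(landmark, person_frames, mean_landmark):
    """Split one landmark into (identity, expression) parts.

    Assumption: identity = mean over this person's frames - average landmark.
    """
    identity = sub(mean_vec(person_frames), mean_landmark)
    expression = sub(sub(landmark, mean_landmark), identity)
    return identity, expression

def transform(driver_landmark, driver_frames, target_frames, mean_landmark):
    """Combine the driver's expression with the target's identity."""
    _, driver_expression = separate(driver_landmark, driver_frames, mean_landmark)
    target_identity = sub(mean_vec(target_frames), mean_landmark)
    return add(mean_landmark, target_identity, driver_expression)
```

By construction, adding the average, identity, and expression parts returned by `separate` recovers the original landmark.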

[0093] Landmark decomposition
The expression bases b_exp are obtained using the expression geometry derived from the VoxCeleb1 training data. To compute them, the landmarks are divided into separate groups (e.g., left eye, right eye, eyebrows, mouth, etc.), and PCA is performed on each group. PCA dimensions of 8, 8, 8, 16, and 8 are used for the respective groups, giving a total of n_exp = 48 expression bases. The landmark decomposition unit is trained separately using the VoxCeleb1 training set. Before training, the parameters of each expression are normalized to follow a standard normal distribution for ease of regression training. ResNet50 pre-trained on ImageNet (He et al. 2016) is used, and features are extracted from the first layer through the last layer immediately before the global average pooling layer. The extracted image features are concatenated with the normalized landmark, obtained by subtracting the average landmark, and fed to a two-layer MLP followed by ReLU activation. The whole network is optimized with the Adam optimizer at a learning rate of 3 × 10⁻⁴, minimizing the MSE loss between the predicted and target expression parameters. During training, gradient clipping with a maximum gradient norm of 1 is used. The expression intensity parameter λ_exp is set to 1.5.
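A few of the numerical steps in this paragraph can be sketched in plain Python: the per-group PCA dimensions adding up to 48, standardizing the expression parameters before regression, clipping a gradient to a maximum norm of 1, and scaling a reconstructed expression by λ_exp. Treating the expression landmark as λ_exp times a linear combination of the bases is an assumption for this sketch, and all function names are hypothetical.

```python
import math

GROUP_DIMS = [8, 8, 8, 16, 8]   # PCA dimensions per landmark group (from the text)
N_EXP = sum(GROUP_DIMS)         # total number of expression bases

def standardize(samples):
    """Rescale each parameter dimension to zero mean and unit variance."""
    cols = list(zip(*samples))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]  # constant columns keep a divisor of 1
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in samples]

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

def expression_from_coeffs(alpha, bases, lam=1.5):
    """Assumed form: expression = lam * sum_k alpha[k] * bases[k]."""
    dim = len(bases[0])
    return [lam * sum(a * b[i] for a, b in zip(alpha, bases)) for i in range(dim)]
```

A value of `lam` above 1 exaggerates the predicted expression, which matches the description of λ_exp as an intensity parameter.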

[0094] Additional ablation experiments

Quantitative results
In Figures 34 and 35, MarioNETte performs better than NeuralHead-FF in PRMSE and AUCON under the VoxCeleb1 self-reenactment setting, but the ranking is reversed when reenacting a different identity on CelebV. This phenomenon is explained through an ablation study. Figure 40 shows the evaluation results of the ablation models under the self-reenactment setting on VoxCeleb1. Unlike the results for reenacting a different identity on CelebV (Figure 37), +Alignment and MarioNETte outperform AdaIN in PRMSE and AUCON. This may be due to the characteristics of the training dataset and to the different inductive biases of the models. VoxCeleb1 consists of short video clips (typically 5-10 seconds long), so the driver and the target show similar poses and expressions. Unlike the AdaIN-based model, which is unaware of spatial information, the proposed image attention block and target feature alignment encode the spatial information of the target image. This is presumed to let the proposed model overfit to same-identity pairs with similar poses and expressions.

[0095] Qualitative results
Figures 43 and 44 show the results of the ablation models reenacting a different identity on CelebV in the single-image and multi-image settings, respectively. AdaIN fails to generate images that resemble the target identity, whereas +Attention successfully preserves the key characteristics of the target. The target feature alignment module adds fine details to the generated image. However, +Alignment is not adept at handling multiple target images with varying poses and expressions, whereas MarioNETte produces more natural images in the multi-image setting.

Inference time
This section reports the inference time of the model. The latency of the proposed method was measured while generating 256×256 images with different numbers of target images K ∈ {1, 8}. Each setting was run 300 times and the average was reported. An Nvidia Titan Xp and PyTorch 1.0.1.post2 were used. As mentioned above, 3D facial landmarks were extracted using the open-source implementation by Bulat and Tzimiropoulos (2017). Figure 41 shows the inference time analysis of the models. The total inference time of the proposed models, MarioNETte+LT and MarioNETte, can be derived as shown in Figure 3. While a reenactment video is generated, z_y and the normalized target feature maps, which make up the target encoding, are generated only once. Therefore, inference is divided into a target encoding part and a driver generation part. Since inference is performed on multiple target images simultaneously, the inference time of the proposed components (e.g., the target encoder and the target landmark transformation unit) scales sublinearly with the number of target images K. On the other hand, the open-source 3D landmark detector processes images sequentially, so its processing time scales linearly.
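The measurement protocol and the split into a one-time target encoding and a per-frame driver generation can be sketched as below. The helper names are hypothetical and the timing loop measures wall-clock time only; on a GPU, the device would need to be synchronized before each clock reading, which this sketch does not do.

```python
import time

def mean_latency(fn, runs=300):
    """Average wall-clock time of fn() over a number of runs (300 in the text)."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

def total_video_time(t_target_encoding, t_driver_generation, n_frames):
    """Target encoding is paid once; driver generation is paid for every frame."""
    return t_target_encoding + n_frames * t_driver_generation
```

Under this split, the one-time target encoding cost is amortized over the whole video, so the per-frame driver generation time dominates for long sequences.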

[0096] Additional examples of generated images
Additional qualitative results are provided for the baseline methods and the proposed models on the VoxCeleb1 and CelebV datasets. Qualitative results are reported for the single-image and multi-image (8 target images) settings, with the exception of Monkey-Net, which is designed to use only a single image. In the multi-image reenactment case, only one target image is displayed due to limited space. Figures 45 and 46 compare different methods for VoxCeleb1 self-reenactment in the single-image and multi-image settings, respectively. Examples of single-image and multi-image reenactment on VoxCeleb1 in which the driver and target identities do not match are shown in Figures 13 and 48. Figures 49, 50, and 51 show the qualitative results for the CelebV dataset. Figures 15 and 50 compare various methods under the self-reenactment setting with single and multiple images. Figure 51 shows the results of reenacting different identities from CelebV in the multi-image setting. Figure 52 shows failure cases produced by MarioNETte+LT while performing single-image reenactment of a different identity on VoxCeleb1. The main cause of the failures appears to be a large pose difference between the driver and the target.

The embodiments described above can also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. The computer-readable medium may be any available medium accessible by a computer, and may include both volatile and non-volatile media, as well as removable and non-removable media. Furthermore, computer-readable media may include computer storage media. Computer storage media may include all volatile and non-volatile, removable and non-removable media implemented by any method or technique for storing information such as computer-readable instructions, data structures, program modules, or other data.

[0097] As shown in Figures 2, 9, 17, 18, 23-27, and 29-32, at least one of the components, members, modules, or units (collectively referred to as "components" in this paragraph) represented by blocks in the drawings may be implemented as various hardware, software, and/or firmware structures that perform the respective functions described above, according to exemplary embodiments. For example, at least one of these components may use a direct circuit structure, such as a memory, a processor, a logic circuit, or a lookup table, that performs the respective function under the control of one or more microprocessors or other control devices. Alternatively, at least one of these components may be concretely implemented as part of a module, program, or code containing one or more executable instructions for performing a particular logic function, and may be executed by one or more microprocessors or other control devices. Furthermore, at least one of these components may include, or be implemented by, a processor such as a central processing unit (CPU) or a microprocessor that performs the respective function. Two or more of these components may be combined into a single component that performs all the operations or functions of the combined components. Furthermore, at least part of the functions of one of these components may be performed by another of these components. Also, although buses are not shown in the block diagrams above, communication between components may occur via a bus. The functional aspects of the above exemplary embodiments may be implemented by algorithms executed on one or more processors. Furthermore, components represented by blocks or processing steps may employ any number of related techniques for electronic configuration, signal processing and/or control, data processing, and the like.
While embodiments of the present invention have been described above with reference to the attached drawings, those with ordinary skill in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without altering its technical idea or essential features. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not limiting.

Claims

1. A method for transforming landmarks, the method comprising: receiving an input image that includes a facial image of a first person and a facial landmark corresponding to the facial image; estimating a transformation matrix corresponding to the facial landmark; and calculating, using the transformation matrix, a facial expression landmark that corresponds to the input image and relates to the facial expression of the first person, and an identity landmark that corresponds to the input image and relates to the unique identity of the first person, wherein the facial landmark is represented as the sum of the facial expression landmark, the identity landmark, and an average landmark associated with the average identity of human faces.

2. The method according to claim 1, wherein, in the estimation, the transformation matrix is estimated using a learning model trained to estimate a principal component analysis (PCA) transformation matrix from an arbitrary face image and a landmark corresponding to the arbitrary face image.

3. The method according to claim 2, wherein the learning model classifies a plurality of landmarks into a plurality of semantic groups and outputs a PCA transformation coefficient corresponding to each of the plurality of semantic groups.

4. The method according to claim 3, wherein the calculation of the facial expression landmark is performed using the transformation matrix and the PCA unit vector.

5. The method according to claim 1, wherein the identity landmark includes at least one of texture information corresponding to the face image, color information corresponding to the face image, and shape information corresponding to the face image.

6. The method according to claim 1, wherein calculating the identity landmark includes subtracting the facial expression landmark and the average landmark from the facial landmark.

7. A program for causing a computer to perform the method described in claim 1.

8. A device comprising at least one processor, wherein the at least one processor is configured to: receive an input image that includes a facial image of a first person and a facial landmark corresponding to the facial image; estimate a transformation matrix corresponding to the facial landmark; and calculate, using the transformation matrix, a facial expression landmark that corresponds to the input image and relates to the facial expression of the first person, and an identity landmark that corresponds to the input image and relates to the unique identity of the first person, wherein the facial landmark is represented as the sum of the facial expression landmark, the identity landmark, and an average landmark associated with the average identity of human faces.

9. The apparatus according to claim 8, wherein the at least one processor estimates the transformation matrix using a learning model trained to estimate a principal component analysis (PCA) transformation matrix from an arbitrary face image and a landmark corresponding to the arbitrary face image.

10. The apparatus according to claim 9, wherein the learning model classifies a plurality of landmarks into a plurality of semantic groups and outputs a PCA transformation coefficient corresponding to each of the plurality of semantic groups.

11. The apparatus according to claim 10, wherein the at least one processor is configured to calculate the transformation matrix and the PCA unit vector in order to calculate the facial landmark corresponding to the face image of the first person.

12. The apparatus according to claim 8, wherein the facial expression landmark includes facial expression information for at least one of the eyes, the nose, the mouth, the facial contour, and the entire face in the face image, and the identity landmark includes at least one of texture information corresponding to the face image, color information corresponding to the face image, and shape information corresponding to the face image.

13. The apparatus according to claim 8, wherein the at least one processor is configured to calculate the identity landmark by subtracting the facial expression landmark and the average landmark from the facial landmark.