Electronic device for generating 3D head and operation method thereof

The model addresses the separation of 3D Gaussians from FLAME meshes by normalizing angles and positions, classifying faces, and adding internal mouth components, resulting in improved 3D head generation with high-quality textures and animations.

WO2026121525A1 · PCT designated stage · Publication Date: 2026-06-11 · KLLEON INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
KLLEON INC
Filing Date
2025-09-30
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Conventional 3D Gaussian splatting-based avatar generation models suffer from 3D Gaussians becoming separated from the movement of the FLAME mesh, which leads to defects when generating movements that were not seen in the training distribution.

Method used

A model that accurately assigns 3D Gaussians to corresponding mesh regions by normalizing the angle and position of the Gaussians, classifying faces into first and second sets based on reconstructible geometric structure, and incorporating internal mouth components and vertex offsets to enhance mesh accuracy.

Benefits of technology

Enables accurate assignment of 3D Gaussians to mesh regions, improving the generation of high-quality textures and animations, particularly in unseen movements, with reduced defects and enhanced geometric structure representation.



Abstract

Disclosed is an electronic device for generating a 3D head. The electronic device for generating a 3D head may comprise a storage unit and a processor. The storage unit may store an adaptive geometric Gaussian splatting model for generating a 3D head. The processor may assign a 3D Gaussian set to each face of a mesh and may optimize the 3D Gaussian set by normalizing the angle between the face and the 3D Gaussian set. According to the present invention, 3D Gaussians can be accurately assigned to corresponding mesh regions.

Description

Electronic device for generating a 3D head and method of operation thereof

[0001] The present invention relates to an artificial intelligence model capable of generating a 3D head having high-quality textures when a new pose and / or a new facial expression is given as input. The head may include a face and hair. Hereinafter, unless strictly distinguished from the head, face, and hair, "face" may be used instead of "head."

[0002]

[0003] Recently, with the advancement of deep learning-based 3D modeling technology, research on avatar generation is actively being conducted. Among them, 3D Gaussian splatting[1]-based avatar generation models are known to significantly reduce training and inference times while providing excellent performance.

[0004] The above model utilizes 3DMM (3D Morphable Model, statistical model, e.g., FLAME[2]), which is widely used in the field of face and body modeling to represent 3D shapes. 3DMM can capture and animate various facial expressions and / or various body movements using template meshes obtained from large scan data sets.

[0005] Conventional models [3, 4] bind 3D Gaussians to FLAME meshes. They can freely place 3D Gaussians near the FLAME mesh by treating the local positions, which represent the distance between the 3D Gaussians and the FLAME mesh, as learnable parameters. However, because they can assign 3D Gaussians to unintended regions of the FLAME mesh, the 3D Gaussians may become separated from the movement of the FLAME mesh. In particular, this can cause serious defects when generating movements that were not seen in the training distribution.

[0006]

[0007] [Non-Patent Literature]

[0008] [1] Bernhard Kerbl et al., 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139-1, 2023.

[0009] [2] Tianye Li et al., Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194-1, 2017.

[0010] [3] Zhijing Shao et al., Splattingavatar: Realistic real-time human avatars with mesh-embedded gaussian splatting. In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition, pages 1606-1616, 2024.

[0011] [4] Shenhan Qian et al., Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition, pages 20299-20309, 2024.

[0012] [5] Jun Xiang et al., Flashavatar: High-fidelity digital avatar rendering at 300fps. arXiv preprint arXiv:2312.02214, 2023.

[0013] [6] Ben Mildenhall et al., Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99-106, 2021.

[0014] [7] Wojciech Zielonka et al., Towards metrical reconstruction of human faces. In European conference on computer vision, pages 250-269. Springer, 2022.

[0015] [8] Tobias Kirschstein et al., Nersemble: Multi-view radiance field reconstruction of human heads. ACM Transactions on Graphics (TOG), 42(4):1-14, 2023.

[0016] [9] Wojciech Zielonka et al., Instant volumetric head avatars. In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition, pages 4574-4584, 2023.

[0017]

[0018] The present invention aims to provide a model that can accurately assign 3D Gaussians to corresponding mesh regions.

[0019]

[0020] An electronic device for generating a 3D head according to one embodiment of the present invention may include a storage unit and a processor. The processor may assign a 3D Gaussian set to each face of a mesh and optimize the 3D Gaussian set by normalizing the angle between the face and the 3D Gaussian set.

[0021] In one embodiment, the processor can optimize the 3D Gaussian set by further normalizing the positions of the 3D Gaussian set.

[0022] In one embodiment, the processor can normalize the angle by classifying each face into either a first face or a second face.

[0023] In one embodiment, the processor can calculate the average distance between a face and a 3D Gaussian set to classify each face as either a first face or a second face.

[0024] In one embodiment, the processor can generate a tooth mesh including molars and add it to the mesh.

[0025] In one embodiment, the processor can generate a mouth roof mesh or a mouth bottom mesh and add it to the mesh.

[0026] In one embodiment, the processor may add a vertex offset to at least one of the tooth mesh, the mouth roof mesh, and the mouth bottom mesh.

[0027]

[0028] According to the present invention, 3D Gaussians can be accurately assigned to corresponding mesh regions. The present invention can be integrated with large language models, Text-to-Speech (TTS) models, and voice-based 3D face animation networks, thereby enabling the creation of digital humans capable of real-time interaction.

[0029]

[0030] The following drawings are according to an embodiment of the present invention.

[0031] Figure 1 is a block diagram of an electronic device for generating a 3D head.

[0032] Figure 2 is a block diagram of a model for generating a 3D head.

[0033] Figure 3 is a framework of a model for generating a 3D head.

[0034] Figure 4 illustrates the distribution of the mean values of the local mean values of a 3D Gaussian set.

[0035] Figure 5 illustrates the internal structure of the mouth generated by the present model and the model of [4], respectively.

[0036] Figure 6 illustrates the distance and angle between a 3D Gaussian and a face.

[0037] Figure 7 is a flowchart illustrating the operation method of an electronic device for generating a 3D head.

[0038] FIG. 8 is the result of qualitatively comparing Comparative Examples 1 to 5 and Examples in each of the self-replay scenario, cross-replay scenario, and novel view synthesis scenario.

[0039] Figure 9 is the result of qualitatively comparing Comparative Example 5 and the Example on a multi-view data set.

[0040] Figure 10 is the result of a qualitative comparison between Comparative Example 5 and Examples 2 to 5.

[0041]

[0042] The present invention will be described in detail below. However, the present invention is not limited to the following embodiments. The purpose and effects of the present invention may be naturally understood or become clearer through the following descriptions, and the purpose and effects of the present invention are not limited solely to the following descriptions. Furthermore, in describing the present invention, if it is determined that a detailed description of known technology related to the present invention may unnecessarily obscure the essence of the present invention, such detailed description will be omitted.

[0043] FIG. 1 is a block diagram of an electronic device for generating a 3D head. Referring to FIG. 1, the electronic device (100) for generating a 3D head (hereinafter, the device) may include a storage unit (110) and a processor (120).

[0044] The storage unit (110) can store various data and various programs. For example, the storage unit (110) can store a model (MD, see FIG. 2) (hereinafter, the model) for generating a 3D head, and can store input data and output data of the model (MD). The storage unit (110) may include at least one of volatile memory and non-volatile memory.

[0045] The processor (120) can control the overall operation of the device (100) and the storage unit (110). The processor (120) may include at least one of a central processing unit (CPU) and a graphics processing unit (GPU).

[0046] FIG. 2 is a block diagram of the present model. Referring to FIG. 2, the present model (MD) is an adaptive geometric Gaussian splatting model and may include a tracker (210), a 3DMM (220), a 3D Gaussian rigging unit (230), an internal mouth component (240), a mouth deformation network (250), a face feature deformation unit (260), a 3D Gaussian renderer (270), and a loss function calculation unit (280). Note that the 3DMM (220), the face feature deformation unit (260), and the 3D Gaussian renderer (270) do not have learnable parameters, whereas the mouth deformation network (250) may have learnable parameters.

[0047] Figures 3(a) and 3(b) are the framework of the present model. The present model (MD) will be explained with reference to Figure 3 and Figure 2. Hereinafter, the "3D Gaussian" will be referred to as the "Gaussian".

[0048] Tracker (210)

[0049] The tracker (210) can receive a frame. The frame may be any one of at least one (e.g., multiple) frames that make up a frame sequence. The frame sequence can be obtained from monocular video using known techniques.

[0050] The tracker (210) can extract a set of parameters from the frame. The set of parameters may include at least one (e.g., multiple) parameters. In one embodiment, the tracker (210) may be a FLAME tracker [7] that extracts a set of FLAME parameters from the frame, such as a shape parameter (β), a facial expression parameter (ψ), a pose parameter (θ), a rotation parameter (R), and a translation parameter (t). The FLAME tracker may further extract a camera projection matrix from the frame.

[0051] 3DMM(220)

[0052] A 3DMM (220) can generate a mesh using a parameter set. The mesh may consist of a vertex set containing multiple vertices and a face set containing multiple faces. In one embodiment, the 3DMM (220) may be a FLAME, and the parameter set may be a FLAME parameter set. The mesh generated by the FLAME may include a face, a neck, and shoulders. Each face of the face set may be a triangle (composed of three vertices and three sides).

[0053] The 3DMM (220) can divide the mesh into multiple (e.g., n) face parts (components) using multiple masks, and can set each face of the face set as one of the face parts (see p_1, p_2, ..., p_n in FIG. 3(a)). In one embodiment, the mask may be a FLAME mask. A face part may be composed of at least one (e.g., multiple) face.

[0054] 3D Gaussian rigging unit (230) (hereinafter, rigging unit (230))

[0055] 1. Acquisition of surface features

[0056] The rigging unit (230) can set a local coordinate system for each face of the face set. The faces include the multiple faces of the mesh generated by the 3DMM (220) and the multiple faces of the internal mouth mesh described later. For clarity, the following description is based on one face. The rigging unit (230) can set the average position of the vertices of the face as the origin (C) of the local space. The rigging unit (230) can obtain a rotation matrix (R) by concatenating the direction vector of any one of the sides, the normal vector of the face, and their cross product as column vectors. The rotation matrix (R) represents the orientation of the face in the global space. The rigging unit (230) can obtain a scale value (S, scalar) of the face by averaging the length of any one of the sides and the length of the perpendicular dropped onto that side.
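The face features described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the patent's implementation: the function name `face_frame`, the choice of the edge v0→v1 as "any one of the sides", and the row-major storage of R are assumptions made for the example.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def unit(a):
    n = norm(a)
    return tuple(x / n for x in a)

def face_frame(v0, v1, v2):
    """Return (C, R, S) for one non-degenerate triangular face."""
    # Origin C: average position of the three vertices.
    origin = tuple((a + b + c) / 3.0 for a, b, c in zip(v0, v1, v2))
    edge = sub(v1, v0)                   # assumed choice of side
    area_vec = cross(edge, sub(v2, v0))  # normal direction, length = 2 * area
    e = unit(edge)
    n = unit(area_vec)
    b = cross(e, n)
    # Rotation R: columns are edge direction, face normal, and their cross product.
    R = [[e[i], n[i], b[i]] for i in range(3)]
    # Scale S: mean of the edge length and the height measured onto that edge.
    height = norm(area_vec) / norm(edge)
    S = 0.5 * (norm(edge) + height)
    return origin, R, S
```

For the triangle (0, 0, 0), (2, 0, 0), (0, 2, 0), the chosen edge and its height both have length 2, so the scale value is 2 and the normal is the +z axis.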

[0057] 2. Assignment of Gaussian Set

[0058] The rigging unit (230) can define a set of Gaussians. The set of Gaussians may include at least one (e.g., multiple) Gaussians. The rigging unit (230) can define Gaussians by defining the parameters of the Gaussians described below.

[0059] The rigging unit (230) can assign a Gaussian set to a face and bind (combine) the face and the Gaussian set. The face refers to any of the multiple faces of the mesh generated by the 3DMM (220) and the multiple faces of the internal mouth mesh described later. The rigging unit (230) can assign the Gaussian set to any location in local space, a preset location, or the origin.

[0060] 3. Initialization / Optimization of Gaussian Sets

[0061] A Gaussian may include at least one (e.g., multiple) parameters. A Gaussian may include a position parameter, a rotation parameter, a scale parameter, a color parameter, and a transparency parameter. The covariance of the Gaussian can be obtained using the rotation parameter and the scale parameter.
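As a sketch of how the covariance can follow from the rotation and scale parameters, the example below uses the construction Σ = R S Sᵀ Rᵀ from non-patent document [1]. Storing the rotation as a unit quaternion (w, x, y, z) is an assumption taken from [1]; the text above does not state the storage format, and the function names are illustrative.

```python
import math

def quat_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (rows)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(q, scale):
    """Sigma = R S S^T R^T, with S = diag(scale)."""
    R = quat_to_matrix(q)
    # M = R @ diag(scale)
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M @ M^T, which is symmetric and positive semi-definite.
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]
```

With the identity quaternion (1, 0, 0, 0) and scale (1, 2, 3), the result is the diagonal matrix with entries 1, 4, and 9.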

[0062] The rigging unit (230) can initialize the position parameter of the Gaussian as 3D coordinates (x, y, z) relative to the origin of the local space. Hereinafter, the position parameter is referred to as the local mean value (μ).

[0063] The rigging unit (230) can transform the local mean value into polar (spherical) coordinates μ = (r', θ, φ) for the normalization described later. r' represents the distance between the local mean value and the origin of the local space, i.e., the radius. θ represents the angle between the local mean value and the positive x-axis. φ represents the angle between the local mean value and the positive z-axis.
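The coordinate change can be illustrated as follows. This sketch assumes the usual spherical convention, with θ measured in the xy-plane from the positive x-axis and φ measured from the positive z-axis; the function names are hypothetical.

```python
import math

def to_spherical(x, y, z):
    """Return (r', theta, phi) for a local mean value (x, y, z)."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)  # angle from the positive x-axis
    # Clamp guards against rounding slightly outside [-1, 1].
    phi = math.acos(max(-1.0, min(1.0, z / r))) if r > 0.0 else 0.0
    return r, theta, phi

def to_cartesian(r, theta, phi):
    """Inverse mapping back to (x, y, z)."""
    return (r * math.sin(phi) * math.cos(theta),
            r * math.sin(phi) * math.sin(theta),
            r * math.cos(phi))
```

Keeping r' and φ as separate quantities is what lets the position term and the angle term of the loss act on them independently.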

[0064] In one embodiment, the rigging unit (230) can initialize the rotation parameter and the scale parameter as a unit vector.

[0065] In one embodiment, the rigging unit (230) can initialize the color parameter and the transparency parameter to any value or a preset value.

[0066] The rigging unit (230) can optimize the Gaussian set by updating each parameter of the Gaussians through the loss function calculation unit (280), so that the loss value of the total loss function, in particular the loss function L_rgb of [Equation 8], is minimized.

[0067] 4. Classification of faces

[0068] FIG. 4 illustrates the distribution of the mean values of the local mean values of a Gaussian set. FIG. 4 illustrates six people with different IDs (identities). Each person has a different hairstyle. Note that the mesh generated by the 3DMM (220) is a rough shape as shown in FIG. 3.

[0069] For example, looking at the distribution of average values for ears, IDs 1, 2, 4, and 6, who have ears not covered with hair, show small average values, while IDs 3 and 5, who have ears covered with hair, show large average values.

[0070] It can be seen that face parts where the correct geometric structure (e.g., ears not covered by hair) can be reconstructed through the mesh show a relatively small average value, while face parts where the correct geometric structure (e.g., ears covered by hair) cannot be reconstructed using only the mesh show a relatively large average value.

[0071] As such, since the local mean value of the Gaussian differs depending on whether the correct geometric structure can be reconstructed through the mesh, it is necessary to classify the faces of the mesh based on whether the correct geometric structure can be reconstructed through the mesh.

[0072] Hereinafter, a set of faces capable of reconstructing the correct geometric structure through a mesh is referred to as the first set of faces. The first set of faces may include at least one (e.g., multiple) first faces (f1). The first face (f1) may be a face with a small local mean value of the Gaussian. This is because the Gaussian combined with the first face (f1) has a small local mean value. Additionally, the first face (f1) may be a face located in a position close to the Gaussian. This is because a small local mean value implies that the distance between the Gaussian and the face is close. The criteria for small / closeness may be a threshold value calculated through [Equation 1].

[0073] Hereinafter, a set of faces for which the correct geometric structure cannot be reconstructed through the mesh is referred to as a second set of faces. The second set of faces may include at least one (e.g., multiple) second faces (f2). The second face (f2) may be a face with a large local mean value of the Gaussian. This is because the Gaussian combined with the second face (f2) has a large local mean value. Additionally, the second face (f2) may be a face where the Gaussian is located at a remote position. This is because a large local mean value implies that the distance between the Gaussian and the face is far. The criteria for large / remote / far may be a threshold value calculated through [Equation 1].

[0074] The rigging unit (230) can calculate the distance value of [Equation 1] and classify each face of the face set into either the first face set or the second face set according to the face part (unsupervised, adaptive). Each face of the face set refers to each face of the mesh generated by the 3DMM (220).

[0075] The distance value in [Equation 1] refers to the average distance between each Gaussian set associated with a specific face part and that face part. That is, the distance value in [Equation 1] is a value obtained by calculating the average distance of the Gaussian sets associated with a specific face and averaging the average distances of the Gaussian sets of each face corresponding to the specific face part. The aforementioned threshold value is the average value calculated by determining the distance value in [Equation 1] for each face part. In [Equation 1], G represents a Gaussian set.

[0076] [Mathematical Formula 1]

[0077]

[0078] The smaller the distance value of [Equation 1], the less flexibility is required to reconstruct the correct geometric structure of a specific face part, so the strict normalization described later is applied. The rigging unit (230) can classify each face of the corresponding face part as a first face (f1) when the distance value of [Equation 1] is smaller than the threshold value.

[0079] The larger the distance value of [Equation 1], the more flexibility is required to reconstruct the correct geometric structure of a specific face part, so the flexible normalization described later is applied. The rigging unit (230) can classify each face of the corresponding face part as a second face (f2) when the distance value of [Equation 1] is greater than the threshold value.

[0080] Since there is no available 3D prior knowledge, namely the classification criteria for the first set of faces and the second set of faces, at the beginning of training, the rigging unit (230) can set all faces excluding the face inside the mouth as the first face (f1), and after performing the optimization described later, the distance value of [Equation 1] can be calculated.
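The classification rule described in this section can be sketched as below. The data layout (a dictionary from face part name to a list of faces, each face holding the radii r' of its bound Gaussians) and the function names are assumptions for illustration. The text does not say how a distance value exactly equal to the threshold is handled; this sketch assigns it to the second set.

```python
def mean(values):
    return sum(values) / len(values)

def part_distance(faces):
    """Average, over the faces of one part, of each face's mean Gaussian radius."""
    return mean([mean(radii) for radii in faces])

def classify_parts(parts):
    """Label each face part 'first' (strict) or 'second' (flexible)."""
    distances = {name: part_distance(faces) for name, faces in parts.items()}
    # Threshold: average of the per-part distance values.
    threshold = mean(list(distances.values()))
    labels = {name: ("first" if d < threshold else "second")
              for name, d in distances.items()}
    return labels, threshold

# Toy input: exposed ears stay close to the mesh, hair sits far from it.
parts = {"ears": [[0.1, 0.3], [0.2]], "hair": [[1.0, 3.0]]}
labels, threshold = classify_parts(parts)
```

Because the threshold is derived from the data itself, no per-identity annotation is needed, which matches the unsupervised and adaptive character described above.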

[0081] Internal mouth component (240) (hereinafter, component (240))

[0082] The component (240) can generate an internal mouth mesh and can add the internal mouth mesh to the mesh generated by the 3DMM (220). The mesh generated by the 3DMM (220) does not have a geometric structure of the inside of the mouth. The internal mouth mesh may include at least one of a tooth mesh including molars, a palate mesh, and a bottom of the mouth mesh. Each mesh may be composed of a plurality of vertices and at least one (e.g., a plurality of) faces. Hereinafter, the set of faces of the internal mouth mesh is referred to as the third set of faces.

[0083] The component (240) can generate a ring-shaped mesh by duplicating the vertices in the trajectory of the lip mesh. The component (240) can generate tooth meshes, for example, an upper front tooth mesh and a lower front tooth mesh, respectively, from the ring-shaped mesh.

[0084] The component (240) can extend the tooth mesh to generate a tooth mesh that includes molars. Vertices in the trajectory of the lip mesh can be located in the xz plane. Since the trajectory of the lip mesh is similar to an ellipse, the component (240) can find the center of the ellipse in the tooth mesh and extend the trajectory of the tooth mesh based on the center of the ellipse to generate a tooth mesh that includes molars.

[0085] The component (240) can generate a mesh for the roof of the mouth and a mesh for the bottom of the mouth, respectively. By adding each mesh to the mesh generated by the 3DMM (220), the component (240) can improve the generation performance of the model by having at least one (e.g., multiple) Gaussian sets inside the mouth move together with the tooth mesh.

[0086] Figure 5 illustrates the internal mouth structure generated by the present model and the model of [4], respectively. Referring to Figure 5, the present model (MD) extends the upper front tooth mesh (5a) and the lower front tooth mesh (5b) to represent molars, generates the palate mesh (5c) and the bottom of the mouth mesh (5d), and adds them to the internal mouth structure.

[0087] Mouth deformation network (250, Φ) (hereinafter, network (250))

[0088] The network (250) can modify the third face set by providing a vertex offset (Δv) to the vertices constituting the third face set. The network (250) can modify the third face set by providing a vertex offset (Δv) to at least one of the tooth mesh including molars, the palate mesh, and the floor of the mouth mesh. Through the network (250), the distance between the lips and teeth can be changed in various animation scenarios, and problems caused by the relative position of the internal mouth structure being fixed to the lip vertices, such as the problem of hindering the reconstruction of correct movement, can be resolved.

[0089] The network (250) can perform segment-by-segment transformation. In one embodiment, the network (250) can divide the vertices inside the mouth into two segments. The first segment is an upper segment (v_upper) including upper tooth vertices and palate vertices, and the second segment is a lower segment (v_lower) including lower tooth vertices and mouth floor vertices.

[0090] The network (250) may include an upper mouth deformation network (Φ_upper) and a lower mouth deformation network (Φ_lower). Each network can be trained individually and can obtain the upper mouth vertex offset (Δv_upper) and the lower mouth vertex offset (Δv_lower), respectively. Each network can utilize facial expression parameters (ψ), pose parameters (θ), and time step (T) as inputs, as shown in [Equation 2]. In [Equation 2], γ represents position encoding.

[0091] [Mathematical Formula 2]

[0092] Δv = Φ(ψ, θ, γ(T))

[0093] Conventional offset-based models transformed vertices or Gaussians individually. Since elements belonging to the same segment maintain a relative geometric structure, there is no need to transform vertices or Gaussians within each segment individually. Therefore, the network (250) coarsely partitions the vertices within the mouth and transforms the vertices belonging to the same segment together. Since the purpose of the network (250) is to accurately generate the internal structure of the mouth by providing marginal vertex offsets, the absence of vertex offsets after training may not degrade the generation performance of the model (MD). Consequently, in the inference phase, the model (MD) can keep Δv = 0.
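The segment-wise deformation can be sketched as follows. Two points are assumptions rather than statements of the patent: the position encoding γ is shown in the sinusoidal form popularized by [6], and each segment is moved by a single shared offset, which is one reading of "transforms the vertices belonging to the same segment together". The learned networks Φ_upper and Φ_lower are not modeled; their outputs are passed in as plain tuples.

```python
import math

def positional_encoding(t, num_freqs=4):
    """Assumed sinusoidal encoding gamma(T): [sin, cos] pairs at doubling frequencies."""
    out = []
    for k in range(num_freqs):
        out.append(math.sin((2 ** k) * math.pi * t))
        out.append(math.cos((2 ** k) * math.pi * t))
    return out

def apply_segment_offset(vertices, offset):
    """Move every vertex of one segment by the same offset (dx, dy, dz)."""
    return [tuple(v + d for v, d in zip(vertex, offset)) for vertex in vertices]

def deform_mouth(v_upper, v_lower, dv_upper=(0.0, 0.0, 0.0), dv_lower=(0.0, 0.0, 0.0)):
    """Deform the upper and lower mouth segments; defaults mimic inference with zero offsets."""
    return (apply_segment_offset(v_upper, dv_upper),
            apply_segment_offset(v_lower, dv_lower))
```

Calling `deform_mouth` with the default zero offsets leaves both segments unchanged, which corresponds to keeping Δv = 0 at inference time.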

[0094] Face feature deformation unit (260) (hereinafter, deformation unit (260))

[0095] The deformation unit (260) can transform the local mean value (μ), scale parameter (s), and rotation parameter (r) of each Gaussian into global space using a transformation function T for the Gaussian rendering described later (see [Equation 3], Local-to-Global Transformation).

[0096] [Mathematical Formula 3]

[0097] T: (μ, s, r) → (S_i R_i μ + C_i, S_i s, R_i r), where i is 1, 2, or 3

[0098] Based on [Equation 3], the deformation unit (260) can transform the local mean value (μ), scale parameter (s), and rotation parameter (r) of each Gaussian bound to the first to third faces (f1, f2, f3) by using the scale value (S1, S2, S3), rotation matrix (R1, R2, R3), and origin (C1, C2, C3) of the corresponding face.
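The local-to-global transformation of [Equation 3] can be written out as a short sketch. Here the rotation parameter r is represented as a 3x3 matrix so that R r is a plain matrix product; that representation, and the function names, are assumptions for the example.

```python
def matvec(M, v):
    return tuple(sum(M[i][k] * v[k] for k in range(3)) for i in range(3))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def local_to_global(mu, s, r, S, R, C):
    """Apply T: (mu, s, r) -> (S R mu + C, S s, R r) for one Gaussian and its face."""
    rotated = matvec(R, mu)
    mu_global = tuple(S * a + c for a, c in zip(rotated, C))
    s_global = tuple(S * a for a in s)  # face scale rescales the Gaussian
    r_global = matmul(R, r)             # face orientation rotates the Gaussian
    return mu_global, s_global, r_global
```

Since S, R, and C are recomputed from the deformed mesh at every frame, a Gaussian whose local parameters stay fixed still follows its face through the animation.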

[0099] 3D Gaussian renderer (270)

[0100] The 3D Gaussian renderer (270) can use the values transformed into global space to generate a frame (Î_T) corresponding to a time step (T) of the frame sequence. In one embodiment, the 3D Gaussian renderer (270) may be the tile-based rasterizer of non-patent document [1]. Non-patent document [1] may be incorporated herein.

[0101] Loss function calculation unit (280) (hereinafter, calculation unit (280))

[0102] The total loss function is defined as in [Equation 4].

[0103] [Mathematical Formula 4]

[0104] L = L_reg(μ) + L_rgb(I, Î)

[0105] The first term (L_reg) is a regularization loss function, i.e., the normalization referred to throughout this description, that preserves geometric consistency between the mesh and the Gaussians (see [Equation 5]).

[0106] [Mathematical Formula 5]

[0107]

[0108] The first term of [Equation 5] (L_p) is defined as in [Equation 6]. Through [Equation 6], the calculation unit (280) normalizes the center of the Gaussian so that the Gaussian is located near the center of the corresponding face of the mesh. By maintaining a rough alignment between the geometric structure of the Gaussians and that of the mesh through [Equation 6], the calculation unit (280) can generate an avatar animation similar to the mesh animation.

[0109] [Mathematical Formula 6]

[0110]

[0111] In [Equation 6], r' is the radius of the local mean value (μ) of the Gaussian. The loss function of [Equation 6] can be calculated for each of the first to third sets of faces. The third set of faces may be a set of faces before or after deformation. In [Equation 6], τ1, τ2, and τ3 are threshold values applied to the first to third sets of faces, respectively, and are hereinafter referred to as the first to third threshold values.

[0112] The calculation unit (280) may set the first threshold and the second threshold differently to flexibly adjust the normalization strength. In one embodiment, the first threshold may be smaller than the second threshold (e.g., τ1 = 0.1, τ2 = 2.0). Since the first threshold is small, normalization for the first set of faces is strict, and since the second threshold is large, normalization for the second set of faces is flexible. The calculation unit (280) may set the first threshold and the third threshold to be the same (e.g., τ3 = τ1 = 0.1). Through this, the calculation unit (280) can apply strict normalization to the third set of faces. This is because the network (250) can reflect the correct geometric structure of the third face in the mesh.
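The effect of per-set thresholds can be illustrated with a small sketch. The exact form of [Equation 6] is not reproduced in this text, so the hinge penalty max(r' - τ, 0) used below is an assumed stand-in chosen only to show how a small threshold acts strictly and a large threshold acts flexibly. The threshold values are the example values given above; the names are hypothetical.

```python
# Example thresholds from the description: tau1 = 0.1, tau2 = 2.0, tau3 = tau1.
THRESHOLDS = {"first": 0.1, "second": 2.0, "third": 0.1}

def position_penalty(radii, tau):
    """Assumed hinge form: only radii beyond tau are penalized."""
    return sum(max(r - tau, 0.0) for r in radii) / len(radii)

def position_loss(radii_by_set, thresholds=THRESHOLDS):
    """Sum the penalty over the first, second, and third face sets."""
    return sum(position_penalty(radii, thresholds[name])
               for name, radii in radii_by_set.items() if radii)
```

Under this stand-in, a Gaussian at radius 1.0 incurs a penalty when bound to a first face but none when bound to a second face, which is the intended difference in normalization strength.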

[0113] The second term of [Equation 5] (L_angle) is defined as in [Equation 7]. Through [Equation 7], the calculation unit (280) can keep the distance (r') between the Gaussian and the target face of the mesh smaller than the distance (r_n) between the Gaussian and a non-target face of the mesh, or can make the distance (r') smaller than the distance (r_n). The non-target face is adjacent to the target face. If the Gaussian is not assigned close to the region of the target face, unintended deformation may occur in the region near the target face.

[0114] [Mathematical Formula 7]

[0115]

[0116] The calculation unit (280) can normalize the angle (φ) of the Gaussian through [Equation 7] only when the distance (r') between the Gaussian and the target face of the mesh is greater than a threshold value. In one embodiment, τ φ θ can be 45° (approx. 0.78 rad). [Equation 7] is explained in detail with reference to FIG. 6. FIG. 6 illustrates the distance and angle between the Gaussian and the face.

[0117] FIG. 6(c) illustrates the case where the face of the mesh is the first face (f1). Since the distance (r') is smaller than the distance (r_n), the relationship r' < r_n can be maintained even if the angle (φ) is large. Therefore, the calculation unit (280) may not normalize the angle (φ) in FIG. 6(c).

[0118] FIG. 6(a) illustrates the case where the face of the mesh is the second face (f2) and the angle (φ) is large. The Gaussian combined with the second face (f2) may have a greater distance (r') than the Gaussian combined with the first face (f1). Accordingly, the second threshold may be greater than the first threshold. The Gaussian illustrated in FIG. 6(a) is more likely to represent the face to the right (non-target face) than the face in the center (target face). Therefore, the calculation unit (280) can normalize the angle (φ). At this time, the value of the distance (r') may be maintained. FIG. 6(b) illustrates the state when the angle (φ) of FIG. 6(a) is normalized.

[0119] The second term of [Equation 4] (L_rgb) is defined as in [Equation 8]. The loss function of [Equation 8] is a rendering loss function that compares the generated image with the original image. For the loss function of [Equation 8], reference may be made to non-patent document [1].

[0120] [Mathematical Formula 8]

[0121] L_rgb = (1 - λ) L_1 + λ L_D-SSIM
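[Equation 8] and [Equation 4] combine as in the sketch below. The D-SSIM term is taken as an already computed number, since computing SSIM is outside the scope of this example. The default λ = 0.2 is the value reported in non-patent document [1]; the patent text does not state which value it uses, so treat it as an assumption.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def rgb_loss(l1, d_ssim, lam=0.2):
    """L_rgb = (1 - lambda) * L_1 + lambda * L_D-SSIM."""
    return (1.0 - lam) * l1 + lam * d_ssim

def total_loss(l_reg, l_rgb):
    """L = L_reg + L_rgb, as in [Equation 4]."""
    return l_reg + l_rgb
```

For example, with an L_1 value of 0.1 and a D-SSIM value of 0.5, `rgb_loss` weights them 0.8 and 0.2 respectively before `total_loss` adds the regularization term.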

[0122] FIG. 7 is a flowchart illustrating a method of operation of the device according to an embodiment of the present invention. FIG. 7 is intended to clearly explain the order of operation of the device (100), and descriptions that overlap with the above descriptions are omitted as much as possible. Referring to FIG. 7, the method of operation of the device (100) may include the following steps.

[0123] S710: A step of generating the mesh;

[0124] S720: A step of generating an internal mouth mesh and adding it to the mesh of step S710, wherein the internal mouth mesh may include at least one of a tooth mesh including molars, a palate mesh, and a floor of the mouth mesh;

[0125] S730: A step of assigning a Gaussian set to each face of a mesh, wherein each face of the mesh may include each face of the mesh of step S710 and each face of the internal mouth mesh of step S720;

[0126] S740: A step of initializing the parameters of each Gaussian set;

[0127] S750: A step of transforming each Gaussian set from local space to global space;

[0128] S760: A step of performing Gaussian rendering;

[0129] S770: A step of calculating the loss functions;

[0130] S780: A step of determining whether the total loss value of step S770, in particular the loss value of the loss function L_rgb of [Equation 8], is a minimum value; and

[0131] S790: A step of optimizing the parameters of each Gaussian set when the L_rgb loss value is not the minimum value.

[0132] The device (100) can repeat steps S750 through S790 until the L_rgb loss value becomes the minimum value. When the L_rgb loss value is at the minimum value, the device (100) can terminate the operation.

[0133] The device (100) can perform step S790 at least once (e.g., multiple times) or a preset number of times. To this end, the device (100) can perform the optimization process of the sequence of steps S790, S750, S760, S770, and S780 at least once (e.g., multiple times) or a preset number of times.

[0134] In one embodiment, the device (100) may perform step S800, which is a step of classifying each face of the mesh of step S710 into a first face set and a second face set, after performing the optimization process multiple times. In one embodiment, after performing step S800, the device (100) may perform the aforementioned optimization process at least once (e.g., multiple times) or a preset number of times. The device (100) may repeat, multiple times, a unit consisting of multiple optimization processes, face classification, and then multiple optimization processes.
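The alternation between optimization and face classification can be written out as an ordered schedule. The sketch below is hypothetical: the labels `"optimize"` and `"classify_faces"`, the function name, and the choice of equal block lengths before and after S800 are assumptions made for illustration only.

```python
def build_schedule(rounds, opt_iters_per_block):
    """Ordered step labels for `rounds` units of optimize -> classify -> optimize."""
    schedule = []
    for _ in range(rounds):
        schedule += ["optimize"] * opt_iters_per_block   # S790, S750 ... S780
        schedule.append("classify_faces")                # S800
        schedule += ["optimize"] * opt_iters_per_block   # optimization resumes
    return schedule
```

For example, `build_schedule(2, 3)` yields 14 entries, with the two classification steps each surrounded by three optimization passes on either side.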

[0135] In one embodiment, the device (100) may perform a step of acquiring features of each face of the mesh of step S710 and each face of the internal mouth mesh of step S720 (hereinafter, the face feature acquisition step) after step S720 and before step S750, for example, after step S720 and before step S730, after step S730 and before step S740, or after step S740 and before step S750. For example, the device (100) may perform the face feature acquisition step after step S720 and before step S730.

[0136] In one embodiment, the device (100) may perform a step of setting each face of the mesh of step S710 as one of the face parts (hereinafter, a face part setting step) after step S710 and before step S750, for example, after step S710 and before step S720, after step S720 and before step S730, after step S730 and before step S740, or after step S740 and before step S750. For example, the device (100) may perform the face part setting step after step S720 and before step S730.

[0137] In one embodiment, the device (100) may perform a face part setting step after performing a face feature acquisition step. In another embodiment, the device (100) may perform a face feature acquisition step after performing a face part setting step.

[0138] In one embodiment, the device (100) may perform a step of transforming the local mean value of each Gaussian set into polar coordinates (hereinafter, polar coordinate transformation step) after step S740 and before step S750. Additionally, the device (100) may perform the polar coordinate transformation step after step S790 and before step S750.
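A minimal sketch of a polar coordinate transformation of a local mean value follows. It assumes that "polar coordinates" here means spherical coordinates (radius, polar angle, azimuth) and that the local z axis is aligned with the face normal, so that the polar angle measures how far the mean tilts away from the normal. Both points are assumptions, and the function names are hypothetical.

```python
import math


def to_spherical(x, y, z):
    """Convert a local mean (x, y, z) to (r, theta, phi)."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0                      # angles are undefined at the origin
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # angle from the assumed normal axis
    phi = math.atan2(y, x)                         # azimuth within the face plane
    return r, theta, phi


def to_cartesian(r, theta, phi):
    """Inverse of to_spherical for r > 0."""
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )
```

Separating the radius from the angles in this way makes it possible to regularize distance and direction independently, which is one plausible reading of normalizing the position and the angle separately.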

[0139] In one embodiment, the device (100) may perform a step of providing a vertex offset to the internal mouth mesh of step S720 (hereinafter, the vertex offset providing step) after step S720 and before step S750, for example, after step S720 and before step S730, after step S730 and before step S740, or after step S740 and before step S750. In one embodiment, the device (100) may perform the polar coordinate transformation step after performing the vertex offset providing step.

[0140] Experimental Example

[0141] 1. Setup

[0142] (1) Datasets: SplattingAvatar[3], NeRSemble[8], and DynamicFace (a dataset developed by the inventors of the present invention, consisting of 2-3 minutes of monocular RGB videos featuring 10 actors. Each actor was instructed to slowly nod their head while making various facial expressions).

[0143] (2) Comparative Examples: Comparative Examples 1 to 5 are, respectively, non-patent literature [9], [1], [3], [5] and [4]. [9] is a NeRF-based model, and [4] is a multi-view video-based model.

[0144] 2. Comparison on monocular video datasets

[0145] FIG. 8 shows the results of a qualitative comparison between Comparative Examples 1 to 5 and the Example in each of the self-reproduction scenario, the cross-reproduction scenario, and the novel view synthesis scenario. Self-reproduction refers to the task of reproducing a person in a ground-truth image, together with that person's animation, exactly as they are, and cross-reproduction refers to the task of applying the animation of a person in a ground-truth image to another person.

[0146] Row 1: When given an animation around the mouth that was unseen in the training distribution, Comparative Examples 1 to 5 produced serious defects, whereas the Example showed clear results without defects.

[0147] Row 2: The Example exhibited the highest resolution in the accessory region through flexible relaxation of the geometric structure via adaptive geometric initialization.

[0148] Rows 3 and 4: Comparative Examples 1 to 5 reproduced the coarse appearance well, but exhibited low-quality texture and low representational accuracy and, in particular, struggled to produce correct animations. In contrast, the Example showed excellent results.

[0149] Rows 5 and 6: Comparative Examples 1 to 5 showed defects and blurred details. In contrast, the Example showed robust results with high-quality texture.

[0150] Table 1 shows the results of a quantitative comparison between Comparative Examples 1 to 5 and the Example in the self-reproduction scenario. In Table 1, each figure represents the mean and standard deviation. For a fair comparison, all 10 subjects from each dataset were used, and the last 350 frames of each video were set as the test sequence. This is the same as the setup in Comparative Examples 1 and 3. The Example outperformed Comparative Examples 1 to 5 on both datasets, with notable differences across the MSE, PSNR, SSIM, and LPIPS metrics.

[0151]
| Dataset | Classification | Comparative Example 1 | Comparative Example 2 | Comparative Example 3 | Comparative Example 4 | Comparative Example 5 | Example |
|---|---|---|---|---|---|---|---|
| SplattingAvatar | MSE ($10^{-3}$)↓ | 2.429±2.02 | 2.164±2.14 | 3.132±0.83 | 2.239±2.63 | 1.679±1.53 | 0.884±0.72 |
| SplattingAvatar | PSNR↑ | 28.083±2.50 | 27.543±2.95 | 25.333±1.19 | 29.306±2.85 | 29.124±3.51 | 32.635±2.88 |
| SplattingAvatar | SSIM↑ | 0.938±0.02 | 0.923±0.03 | 0.933±0.02 | 0.943±0.02 | 0.938±0.03 | 0.965±0.02 |
| SplattingAvatar | LPIPS ($10^{-1}$)↓ | 0.678±0.23 | 1.019±0.42 | 0.588±0.12 | 0.444±0.17 | 0.494±0.28 | 0.367±0.17 |
| DynamicFace | MSE ($10^{-3}$)↓ | 1.545±0.59 | 1.603±0.77 | 1.426±0.52 | 1.811±0.61 | 1.268±0.71 | 0.612±0.35 |
| DynamicFace | PSNR↑ | 28.688±1.96 | 28.547±1.94 | 28.843±1.63 | 27.780±1.36 | 29.641±2.24 | 32.760±1.99 |
| DynamicFace | SSIM↑ | 0.888±0.03 | 0.874±0.03 | 0.869±0.03 | 0.874±0.03 | 0.879±0.04 | 0.919±0.02 |
| DynamicFace | LPIPS ($10^{-1}$)↓ | 1.289±0.28 | 1.699±0.34 | 1.378±0.21 | 0.745±0.10 | 0.801±0.20 | 0.660±0.14 |
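How per-subject MSE and PSNR values might be computed and summarized as a mean and standard deviation can be sketched as follows. The sketch assumes pixel values in [0, 1] (so the PSNR peak value is 1.0) and uses the sample standard deviation; the source does not state which peak value or which form of standard deviation was used, and the function names are hypothetical.

```python
import math
import statistics


def mse(pred, target):
    """Mean squared error between two equally sized pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(mse_value, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given MSE (peak value assumed to be 1.0)."""
    return 10.0 * math.log10(max_value * max_value / mse_value)


def summarize(per_subject_values):
    """Return (mean, standard deviation) over subjects; sample std is an assumption."""
    return statistics.mean(per_subject_values), statistics.stdev(per_subject_values)
```

Averaging per-subject PSNR values is in general not the same as converting the averaged MSE to PSNR, so a PSNR row computed this way need not equal the PSNR of the corresponding MSE row.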

[0152] 3. Comparison on the Multi-view Video Dataset (NeRSemble)

[0153] Since the Example is a monocular video-based model, the inventors trained the Example using only a single-view image. Since Comparative Example 5 is a multi-view video-based model, the inventors trained Comparative Example 5 using 16 views.

[0154] FIG. 9 shows the results of a qualitative comparison between Comparative Example 5 and the Example on a multi-view dataset. Comparative Example 5 generated defects inside the mouth. The Example was able to effectively reduce defects inside the mouth even though it was trained using only a single view. In addition to the mouth, the Example also exhibited sharper textures than Comparative Example 5. This indicates that, although FLAME fitting results are generally known to be more accurate in multi-view settings than in monocular video scenarios, the flexible utilization of the geometric structure of the present invention is useful.

[0155] Table 2 shows the results of a quantitative comparison between Comparative Example 5 and the Example on a multi-view dataset. Although the Example was trained using only a single view, it showed comparable results in MSE and PSNR (gaps of -1.247% and -0.169%), and showed better results than Comparative Example 5 in SSIM and LPIPS. In particular, the Example showed a notable difference of 9.7% in LPIPS. Note that LPIPS is a human-aligned metric, that is, a measure that corresponds to human evaluation.

[0156]
| Classification | Comparative Example 5 (16 views) | Example (1 view) | Gap (%) |
|---|---|---|---|
| MSE ($10^{-3}$)↓ | 2.483±2.16 | 2.514±2.18 | -1.247 |
| PSNR↑ | 27.829±1.59 | 27.782±1.60 | -0.169 |
| SSIM↑ | 0.877±0.04 | 0.882±0.04 | 0.617 |
| LPIPS ($10^{-1}$)↓ | 1.073±0.34 | 0.969±0.32 | 9.711 |
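The Gap (%) column appears to be a relative difference with respect to Comparative Example 5, signed so that a positive value means the Example is better. The sketch below encodes that reading as an assumption; `relative_gap` is a hypothetical helper, and because it is fed the rounded means from Table 2, its results agree with the published gaps only approximately.

```python
def relative_gap(baseline, candidate, lower_is_better):
    """Percent change versus the baseline, oriented so positive means the candidate is better."""
    change = (candidate - baseline) / baseline * 100.0
    return -change if lower_is_better else change


# Rounded means from Table 2: (Comparative Example 5, Example, lower_is_better)
table2 = {
    "MSE": (2.483, 2.514, True),
    "PSNR": (27.829, 27.782, False),
    "SSIM": (0.877, 0.882, False),
    "LPIPS": (1.073, 0.969, True),
}
gaps = {name: relative_gap(*row) for name, row in table2.items()}
```

Under this reading the signs match the table for all four metrics: slightly negative for MSE and PSNR, positive for SSIM and LPIPS.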

[0157] Table 3 shows the results of a quantitative comparison between Comparative Example 5 and Examples 2 to 5. FIG. 10 shows the results of a qualitative comparison between Comparative Example 5 and Examples 2 to 5.

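[0158]
| Configuration | MSE ($10^{-3}$)↓ | PSNR↑ | SSIM↑ | LPIPS ($10^{-1}$)↓ |
|---|---|---|---|---|
| Comparative Example 5 | 0.991 | 30.356 | 0.919 | 0.653 |
| Example 2 | 0.905 | 30.717 | 0.930 | 0.572 |
| Example 3 | 0.802 | 32.273 | 0.941 | 0.548 |
| Example 4 | 0.733 | 32.751 | 0.941 | 0.519 |
| Example 5 | 0.748 | 32.697 | 0.942 | 0.513 |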

[0159] For training Comparative Example 5, all faces of the mesh were set to the second set of faces. Example 2 is a model trained only with adaptive geometric initialization (see [Equation 1]). Example 3 is a model with an internal mouth component added to the model of Example 2. Example 4 is a model with a mouth deformation network added to the model of Example 3. Example 5 is a model with the loss function of [Equation 7] added to the model of Example 4.

[0160] Example 2 exhibited a significantly lower LPIPS than Comparative Example 5. This is understood to be due to effective normalization resulting from the establishment of the first face set. Example 2 generated sharper teeth and glasses than Comparative Example 5. Example 3 improved the quality of tooth generation. Example 4 generated an even more improved internal mouth structure. Example 5 showed accurate results along with robust Gaussian rigging results. In FIG. 10, Example 5 generated teeth without any defects inside the mouth (see row 1) and robustly represented light reflections on the glasses (see row 2).

[0161] The present invention has been described in detail through representative embodiments. However, those skilled in the art will understand that various modifications are possible to the embodiments described above without departing from the scope of the invention. Therefore, the scope of the present invention is not limited to the embodiments described above, but should be defined to include all variations derived from the claims and their equivalents.

Claims

1. An electronic device for generating a 3D head, comprising a storage unit and a processor, wherein the processor assigns a 3D Gaussian set to each face of a mesh and optimizes the 3D Gaussian set by normalizing the angle between the face and the 3D Gaussian set.

2. The electronic device of claim 1, wherein the processor optimizes the 3D Gaussian set by further normalizing the position of the 3D Gaussian set.

3. The electronic device of claim 1, wherein the processor classifies each face as either a first face or a second face in order to normalize the angle.

4. The electronic device of claim 3, wherein the processor calculates the average distance between the face and the 3D Gaussian set and classifies each face as either the first face or the second face.

5. The electronic device of claim 1, wherein the processor generates a tooth mesh including molars and adds it to the mesh.

6. The electronic device of claim 5, wherein the processor generates a palate mesh or a mouth floor mesh and adds it to the mesh.

7. The electronic device of claim 5, wherein the processor adds a vertex offset to at least one of the tooth mesh, a palate mesh, and a mouth floor mesh.