A speech-driven digital human generation method based on neural fields

By improving the FLAME face model and neural field-driven method, and combining semantic information to control the eyes, the problems of head-body inconsistency and eye control in voice-driven digital humans were solved, resulting in high-quality audio-driven digital humans.

CN116825127B · Active · Publication Date: 2026-06-30 · TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
Filing Date
2023-08-14
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing voice-driven digital human methods break down in rendering when the head pose changes substantially, which produces inconsistencies between the head and body. They also cannot effectively control eye opening and closing. Both problems degrade the user experience.

Method used

We employ a neural field-based approach, using an improved FLAME face model for 3D deformation, and combine neural occupancy field and texture field to drive the digital human. We control the opening and closing of the eyes through semantic information, and optimize the rendering results using neural field representation and semantic segmentation methods.

Benefits of technology

It achieves synchronized deformation of the head and torso, improves the realism of digital human rendering and voice synchronization, solves the problems of head-body inconsistency and eye control, and generates high-quality audio-driven digital humans.

✦ Generated by Eureka AI based on patent content.

Abstract

A method for generating a voice-driven digital human based on neural fields includes the following steps: S1, constructing a deformable digital face using a face model; S2, encoding the audio features of a given speech and mapping the audio features to the expression space of the digital face; S3, driving the digital human in the standard space based on the audio features according to neural field representation; wherein, the digital human in the standard space is obtained based on the neural occupancy field and the neural texture field, and for the spatial coordinates in the standard space, the corresponding displacement is output by the neural displacement field according to the audio features. Further, step S3 also uses facial semantics as an explicit control signal to perform eye control based on facial semantics. Compared with traditional methods, this invention can achieve more synchronized face and torso driving and eye opening / closing control, surpassing traditional methods in both image quality and speech synchronization metrics.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and artificial intelligence, and in particular to a method for generating voice-driven digital humans based on neural fields.

Background Technology

[0002] Voice-driven digital humans are a common face-driven technology, widely used in movies, games, and the emerging metaverse. To avoid the uncanny valley effect, digital humans need to be as realistic as possible to further enhance the audience's viewing experience. High realism comes from several aspects: not only should the face rendering be as realistic as possible, but the head posture also needs to change naturally within a certain range. In addition, eye control is also an important part of the interaction. Current face-driven methods often only consider the realism of face rendering, resulting in digital humans that, while realistic, are unnatural.

[0003] Traditional voice-driven digital humans are limited to two dimensions. A common approach is to map speech to an intermediate modality, then map that intermediate modality to lip movements, and finally perform 2D rendering to obtain the digital human. There are many intermediate modalities, including 3D meshes, 2D landmarks, and even latent codes that lack actual physical meaning. However, they all ultimately require 2D rendering to convert them into images. This limitation keeps digital humans at a two-dimensional level. While 2D rendering is sufficient for small head pose changes, it fails when these changes are large. This is because 2D methods lack 3D information, resulting in a lack of 3D consistency in the driving process. This restricts the practical application of voice-driven digital human methods.

[0004] The recently emerging neural radiance field (NeRF) method takes multi-view images as input, models the space with an MLP (multilayer perceptron), renders 2D images by volume rendering, and compares them with the input-view images in order to model a continuous 3D space. Because of the 3D spatial modeling and the volume rendering, neural radiance fields achieve 3D consistency and high-quality rendering results. There have been attempts to use neural radiance fields for voice-driven digital humans, producing digital humans whose head pose can change and overcoming the problems faced by 2D methods. However, voice-driven digital humans based on neural radiance fields also bring new challenges: driving the head and torso separately makes the torso unstable and can even separate the head from the body, leading to jittery and unnatural results. Furthermore, because no correlation is established between speech and eye opening and closing, the blinking of the digital human is uncontrolled, which also affects the user experience.

[0005] It should be noted that the information disclosed in the background section above is only for understanding the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0006] The main objective of this invention is to overcome the shortcomings of the aforementioned background technology and provide a speech-driven digital human generation method based on neural fields.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] A speech-driven digital human generation method based on neural fields includes the following steps:

[0009] S1. Construct deformable digital faces using face models;

[0010] Preferably, a deformable digital face is constructed using an improved FLAME face model, wherein the improved FLAME face model is based on the FLAME model, but the pose basis, expression basis, and linear skinning weights are continuous in three-dimensional space;

[0011] S2. Encode the audio features of the given speech and map the audio features to the expression space of the digital face.

[0012] S3. Based on neural field expression, a digital human in the standard space is driven according to audio features; wherein, the digital human in the standard space is obtained based on the neural occupancy field and the neural texture field, and for the spatial coordinates in the standard space, the corresponding displacement is output by the neural displacement field according to the audio features.

[0013] In some embodiments, the digital face deformation process includes:

[0014] Given an average face mesh $\bar{T}$ with $N$ vertices, the expression coefficients are multiplied by the expression basis to perform non-rigid deformation, and the linear blend skinning (LBS) algorithm is then used to compute the rigid deformation. LBS applies the rotation vectors of the neck, jaw, and eyes, together with the global rotation vector, weighted by the linear blend skinning weights, to rigidly deform the entire face. The calculation is as follows:

[0015] $T_P(\theta, \psi) = \bar{T} + B_P(\theta; \mathcal{P}) + B_E(\psi; \mathcal{E})$

[0016] $M(\theta, \psi) = W(T_P(\theta, \psi), J, \theta, \mathcal{W})$

[0017] where $\theta$ is the pose coefficient, $\mathcal{P}$ is the pose basis, $K$ is the number of joints, $B_P$ is the pose blendshape, $\psi$ is the expression coefficient, $\mathcal{E}$ is the expression basis, $B_E$ is the expression blendshape, $T_P$ is the face mesh after non-rigid deformation, $\mathcal{W}$ is the linear skinning weights, $J$ is the joint coordinates obtained by regression, $W$ is the linear skinning function, and $M$ is the final mesh.

[0018] In some embodiments, in step S2, audio is converted into feature vectors by a pre-trained speech model. The audio features are then passed through four one-dimensional convolutional layers and mapped to expression coefficients and jaw pose. Preferably, a self-attention mechanism is used to output weighted expression features.

[0019] In some embodiments, the neural field expression includes:

[0020] The digital human is represented using two multilayer perceptrons (MLPs): one predicts the occupancy value of the current position to represent the digital human's geometry, and the other predicts the RGB value of the current position to render the digital human. Decoupling the two improves both geometric and rendering quality.

[0021] $\mathrm{occ} = \mathrm{MLP}_{\theta_o}(x_c, A)$

[0022] where $A$ is the currently extracted audio feature, $x_c$ is a coordinate in the standard space, $\mathrm{occ} \in [0,1]$ is the probability that $x_c$ is occupied, and $\theta_o$ denotes the learnable parameters; the audio is additionally fed to the MLP as input to account for topological changes that deformation alone cannot explain; a neural texture MLP is used to assign colors to the digital human, and for $x_c$ it outputs the color of the current coordinate point, as given by the following formula:

[0023]

[0024] where $\eta_d$ is the normal direction at the current position, obtained by normalizing the gradient of the geometric field.

[0025] In some embodiments, in step S3, for a 3D point $x_c$ in the standard space, the corresponding expression basis $\mathcal{E}$, pose basis $\mathcal{P}$, and linear skinning weights $\mathcal{W}$ are predicted:

[0026]

[0027] The process of combining the expression base and the pose base with the expression features obtained from the audio to obtain the point coordinates after non-rigid displacement is expressed as follows:

[0028]

[0029] Subsequently, the linear blend skinning (LBS) weights are combined with the pose of each joint as input, and the output is the point coordinates after rigid transformation.

[0030] The digital human representation obtained after audio driving is as follows:

[0031]

[0032] In some embodiments, in step S3, facial semantics is also used as an explicit control signal to perform eye control based on facial semantics.

[0033] In some embodiments, during the preprocessing stage, a semantic segmentation method is used to obtain the semantic segmentation result of the avatar; the total number of pixels in the eye region is counted to obtain a value called the eye ratio, which is normalized and appended to the expression coefficients; during the training stage, the model learns to treat this eye-ratio dimension as a measure of eye opening, thereby achieving control over the eyes.

[0034] In some embodiments, pixel loss is used as the training objective. Pixel loss measures the L2 distance between generated pixels and real pixels, and is specifically expressed as follows:

[0035]

[0036] where p denotes a pixel whose ray hits the surface of the human body;

[0037] Preferably, an additional mask loss is added, which uses the rays that do not hit the surface of the human body to optimize the modeling results of the digital human;

[0038]

[0039] Here, CE denotes the cross-entropy loss between the actual occupancy value and the predicted occupancy value; the point on the ray closest to the surface is selected, and the occupancy value computed there is compared with the actual occupancy value.

[0040] In some embodiments, a semantic loss function is added using the semantic prior information of the face semantic segmentation map, as specifically expressed below:

[0041]

[0042] For a point belonging to the face in the standard field, find the nearest point in the corresponding face mesh and calculate their distance as the loss function.

[0043] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the neural field-based voice-driven digital human generation method.

[0044] The present invention has the following beneficial effects:

[0045] This invention proposes a voice-driven digital human generation method based on neural fields. Addressing the common problems of face and torso asynchrony and inability to control eyes in voice-driven digital humans, this invention successfully solves these problems through neural deformation fields and semantic-based eye control methods. Compared with traditional methods, it demonstrates more synchronized face and torso driving and eye opening and closing control, and surpasses traditional methods in both image quality and voice synchronization metrics.

[0046] This invention, in a preferred embodiment, combines neural fields and facial semantic information. Utilizing FLAME's facial priors, it simultaneously alters the poses of the face and torso, and then, combined with facial semantic information, controls the opening and closing of the eyes. This invention addresses the shortcomings of traditional methods, including rendering realism, head pose alteration, and eye opening and closing issues. Compared to previous methods, it demonstrates better rendering results and speech synchronization. By leveraging a text-to-speech and speech-driven digital human pipeline, a high-quality digital human capable of realistic and natural interaction can be obtained at extremely low cost.

[0047] Other beneficial effects of the embodiments of the present invention will be further described below.

Description of the Drawings

[0048] Figure 1 is a flowchart of a speech-driven digital human generation method based on neural fields with facial semantic control (Sem-Avatar), according to an embodiment of the present invention.

[0049] Figure 2 is a facial semantic map according to an embodiment of the present invention.

Detailed Implementation

[0050] The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary and not intended to limit the scope and application of the present invention.

[0051] It should be understood that the terms "length", "width", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the present invention.

[0052] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of the present invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0053] Referring to Figure 1, this invention provides a speech-driven digital human generation method based on neural fields, comprising the following steps:

[0054] S1. Construct deformable digital faces using face models;

[0055] Preferably, a deformable digital face is constructed using an improved FLAME face model, wherein the improved FLAME face model is based on the FLAME model, but the pose basis, expression basis, and linear skinning weights are continuous in three-dimensional space;

[0056] S2. Encode the audio features of the given speech and map the audio features to the expression space of the digital face.

[0057] S3. Based on neural field expression, a digital human in the standard space is driven according to audio features; wherein, the digital human in the standard space is obtained based on the neural occupancy field and the neural texture field, and for the spatial coordinates in the standard space, the corresponding displacement is output by the neural displacement field according to the audio features.

[0058] As shown in Figure 1, in a preferred embodiment, in step S3, facial semantics is also used as an explicit control signal to perform eye control based on facial semantics.

[0059] The method of this invention completely solves the problem of inconsistency between the head and body of a digital human voice driven by traditional methods, resulting in high-quality audio-driven character images. Furthermore, this invention proposes using semantic information to guide the eye region of the character image, which was previously uncontrollable in the FLAME model. Extensive experiments have verified the effectiveness of the method, demonstrating that it can synthesize highly realistic audio-driven character images.

[0060] The following describes specific embodiments of the present invention.

[0061] This invention proposes a semantically controlled semi-implicit neural field framework, named Sem-Avatar, for audio-driven digital human synthesis, achieving high-fidelity audio-driven avatars. The framework is shown in Figure 1. To enable audio-driven digital human synthesis, the invention first encodes audio features. It then uses neural implicit fields: an occupancy field and a texture field generate a canonical facial model, and a deformation field transforms the canonical space into the observation space. Finally, the method also introduces semantic eye control.

[0062] Face model

[0063] To deform the digital human, a 3D face model based on FLAME (see Li T, Bolkart T, Black MJ, et al. Learning a model of facial shape and expression from 4D scans[J]. ACM Trans. Graph., 2017, 36(6): 194:1-194:17) is used. The deformation process is as follows: given an average face mesh $\bar{T}$ with $N$ vertices, FLAME first multiplies the expression coefficients by the expression basis to perform non-rigid deformation, and then uses the linear blend skinning (LBS) algorithm to compute the rigid deformation. LBS applies the rotation vectors of four joints (the neck, the jaw, and the two eyes), together with the global rotation vector, weighted by the linear blend skinning weights, to rigidly deform the entire face. The calculation is as follows:

[0064] $T_P(\theta, \psi) = \bar{T} + B_P(\theta; \mathcal{P}) + B_E(\psi; \mathcal{E})$

[0065] $M(\theta, \psi) = W(T_P(\theta, \psi), J, \theta, \mathcal{W})$

[0066] where $\theta$ is the pose coefficient, $\mathcal{P}$ is the pose basis, $K$ is the number of joints, $B_P$ is the pose blendshape, $\psi$ is the expression coefficient, $\mathcal{E}$ is the expression basis, $B_E$ is the expression blendshape, $T_P$ is the face mesh after non-rigid deformation, $\mathcal{W}$ is the linear skinning weights, $J$ is the joint coordinates obtained by regression, $W$ is the linear skinning function, and $M$ is the final mesh. Unlike FLAME, where the pose basis, expression basis, and linear skinning weights are discrete, in this invention these quantities are continuous in three-dimensional space.
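The two-stage deformation described above (expression blendshape, then linear blend skinning) can be illustrated with a minimal sketch. It is deliberately simplified and is not the patent's implementation: it omits the pose-corrective blendshape and the joint hierarchy, and lets each joint rotate about its own pivot independently. All function and parameter names (`rodrigues`, `deform`, `expr_basis`, and so on) are hypothetical.

```python
import math

def rodrigues(rot_vec):
    """Convert a rotation vector (axis * angle) into a 3x3 rotation matrix."""
    angle = math.sqrt(sum(v * v for v in rot_vec))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (v / angle for v in rot_vec)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def deform(template, expr_basis, psi, weights, joints, rot_vecs):
    """Expression blendshape followed by (simplified) linear blend skinning.

    template:   N vertices, each [x, y, z]
    expr_basis: per vertex, one 3D offset per expression coefficient
    psi:        expression coefficients
    weights:    per vertex, one skinning weight per joint (rows sum to 1)
    joints:     K joint positions used as rotation pivots
    rot_vecs:   K rotation vectors (axis * angle)
    """
    rotations = [rodrigues(rv) for rv in rot_vecs]
    deformed = []
    for vertex, basis, w in zip(template, expr_basis, weights):
        # Non-rigid step: add the expression offsets weighted by psi.
        p = [vertex[i] + sum(c * b[i] for c, b in zip(psi, basis)) for i in range(3)]
        # Rigid step: blend the per-joint rigid transforms by the skinning weights.
        q = [0.0, 0.0, 0.0]
        for wk, rot, pivot in zip(w, rotations, joints):
            local = [p[i] - pivot[i] for i in range(3)]
            for r in range(3):
                q[r] += wk * (sum(rot[r][i] * local[i] for i in range(3)) + pivot[r])
        deformed.append(q)
    return deformed

# Toy example: two vertices, one expression coefficient, two joints.
template = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
expr_basis = [[[0.0, 0.0, 0.5]], [[0.0, 0.0, 0.0]]]
weights = [[1.0, 0.0], [0.0, 1.0]]
joints = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
rot_vecs = [[0.0, 0.0, math.pi / 2], [0.0, 0.0, 0.0]]
mesh = deform(template, expr_basis, [1.0], weights, joints, rot_vecs)
```

In the toy example the first vertex is lifted by the expression offset and then rotated a quarter turn about the z axis by the first joint, while the second vertex follows the unrotated second joint and stays in place.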

[0067] Audio-to-Expression Module

[0068] Given driving audio, audio-driven digital human generation is achieved by mapping audio to the expression space. First, a pre-trained speech model converts the audio into feature vectors; specifically, for each 20-millisecond audio clip, the model outputs a 29-dimensional feature vector. For each frame of the training video, the features of the 20 neighboring audio clips are sampled to obtain the audio feature $A$.
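As a rough illustration of the per-frame sampling, the sketch below gathers the features of 20 neighboring audio clips around a given clip index. The function name `audio_window`, the choice to center the window on the clip index, and the clamping at the start and end of the audio are assumptions; the text does not specify how frames are aligned to clips or how boundaries are handled.

```python
def audio_window(features, center_idx, window=20):
    """Return `window` neighboring feature vectors around center_idx.

    features: list of per-clip feature vectors (29-dimensional in the text)
    Indices outside the audio are clamped to the first/last clip (assumption).
    """
    half = window // 2
    last = len(features) - 1
    return [features[min(max(i, 0), last)]
            for i in range(center_idx - half, center_idx + half)]

# Toy features: 100 clips, each a 29-dimensional vector filled with its index.
features = [[float(i)] * 29 for i in range(100)]
middle = audio_window(features, 50)
start = audio_window(features, 0)
```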

[0069] The audio feature $A$ is then passed through four one-dimensional convolutional layers and mapped to the expression coefficients and the jaw pose $\theta_{jaw}$. To smooth the audio features, a self-attention mechanism is used to output weighted expression features. In this way the audio takes contextual information into account, and the expressions output for consecutive frames are smoother and more natural.
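The smoothing idea can be sketched as an attention-weighted average over a short window of per-frame expression features. This is only an illustration under stated assumptions: it uses scaled dot-product scores against the center frame with identity projections, whereas a trained model would learn its own projections or attention weights. The names `softmax` and `attention_smooth` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_smooth(window, center):
    """Attention-weighted average of the feature vectors in `window`.

    Scores are scaled dot products between the center frame (the query)
    and every frame in the window (the keys); no learned projections.
    """
    query = window[center]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in window]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, window))
            for d in range(len(query))]

constant = attention_smooth([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], center=1)
mixed = attention_smooth([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], center=1)
```

Because the weights sum to one, a window of identical frames is returned unchanged, and a frame that differs from its neighbors is pulled toward them rather than replaced.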

[0070] Digital Humans Expressed by Neural Fields

[0071] The digital human is represented using two multilayer perceptrons: one predicts the occupancy value of the current position to represent the digital human's geometry, and the other predicts the RGB value of the current position to render the digital human. This decoupling results in better geometric and rendering performance.

[0072] $\mathrm{occ} = \mathrm{MLP}_{\theta_o}(x_c, A)$

[0073] Here, $A$ is the currently extracted audio feature, $x_c$ is a coordinate in the standard space, $\mathrm{occ} \in [0,1]$ is the probability that $x_c$ is occupied, and $\theta_o$ denotes the learnable parameters. The audio is additionally fed to the MLP as input to account for topological changes that deformation alone cannot explain. A neural texture MLP assigns color to the digital human; for $x_c$, it outputs the color of the current coordinate point, as given by the following formula:

[0074]

[0075] Here, $\eta_d$ is the normal direction at the current position, obtained by normalizing the gradient of the geometric field.
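To make the normal computation concrete, the following sketch estimates a normal by normalizing a finite-difference gradient of a toy occupancy field. The soft-sphere `occupancy` function stands in for the geometry MLP, and negating the gradient so the normal points outward is an assumed sign convention; a trained network would more likely obtain the gradient by automatic differentiation.

```python
import math

def occupancy(p, radius=1.0, sharpness=10.0):
    """Toy occupancy field: a soft sphere, near 1 inside and near 0 outside."""
    dist = math.sqrt(sum(c * c for c in p))
    return 1.0 / (1.0 + math.exp(sharpness * (dist - radius)))

def surface_normal(field, p, eps=1e-4):
    """Normalized central-difference gradient of `field` at point p.

    The gradient is negated so the normal points from occupied toward
    empty space (assumed convention).
    """
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((field(hi) - field(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return [-g / norm for g in grad]

normal = surface_normal(occupancy, [1.0, 0.0, 0.0])
```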

[0076] Neural field-driven digital human

[0077] A digital human in the standard space is obtained from the neural occupancy field and the neural texture field. Next, this digital human is driven by audio: for a spatial coordinate $x_c$ in the standard space, the neural displacement field outputs the corresponding displacement according to the audio.

[0078] First, for a 3D point $x_c$ in the standard space, its corresponding expression basis $\mathcal{E}$, pose basis $\mathcal{P}$, and linear skinning weights $\mathcal{W}$ are predicted:

[0079]

[0080] Among them, the expression base and pose base are combined with the expression features obtained through audio to obtain the coordinates of the points after non-rigid displacement. This process can be represented as:

[0081]

[0082] The LBS weights are then combined with the pose of each joint as input to output the coordinates of the points after rigid transformation.

[0083]

[0084] This is the digital human obtained after being driven by audio.
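For orientation, the three steps in paragraphs [0078] to [0084] can be written compactly in FLAME-style notation. This is a sketch rather than the patent's own formulas, which are not reproduced in this text. The field name $D_{\theta_d}$, the point symbols $x_{nr}$ and $x_d$, and the exact argument lists are assumptions. Here $\theta$ is the pose, $\psi$ the audio-derived expression coefficients, $B_P$ and $B_E$ the pose and expression blendshapes, $J$ the joint coordinates, $W$ the linear blend skinning function, and $\mathcal{E}$, $\mathcal{P}$, $\mathcal{W}$ the predicted expression basis, pose basis, and skinning weights at $x_c$.

```latex
\begin{aligned}
D_{\theta_d}(x_c) &\rightarrow (\mathcal{E}, \mathcal{P}, \mathcal{W})
  && \text{per-point bases and skinning weights} \\
x_{nr} &= x_c + B_P(\theta; \mathcal{P}) + B_E(\psi; \mathcal{E})
  && \text{non-rigid displacement} \\
x_d &= W(x_{nr}, J, \theta, \mathcal{W})
  && \text{rigid transformation by LBS}
\end{aligned}
```

Written this way, the per-point update mirrors the mesh-level FLAME deformation, with the discrete per-vertex quantities replaced by values predicted continuously at each point.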

[0085] Eye control based on facial semantics

[0086] Because the FLAME model does not model eye opening and closing, the neural implicit digital human cannot perform any eye movements. This limitation, inherited from the FLAME model, severely affects the realism of the digital human, causing it to remain with its eyes open throughout the process, appearing highly unrealistic. To address this issue, this invention proposes a novel method that uses facial semantics as an explicit control signal.

[0087] In the preprocessing stage, we use an existing semantic segmentation method to obtain the semantic segmentation result of the avatar, as shown in Figure 2. We count the total number of pixels in the eye region and form a value called the eye ratio. We normalize this eye ratio and add it to the expression coefficients. During training, the deformation module gradually learns to interpret this dimension as a measure of eye opening and achieves control over the eyes.
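The preprocessing step can be sketched as follows. Several details are assumptions, because the text does not fix them: the label ids in `EYE_LABELS` are hypothetical, the ratio is taken relative to the total pixel count of the segmentation map, and normalization is min-max over the frames of the video.

```python
EYE_LABELS = {4, 5}  # hypothetical label ids for the left and right eye

def eye_ratio(seg_map, eye_labels=EYE_LABELS):
    """Fraction of pixels in a 2D segmentation map that are labeled as eye."""
    total = sum(len(row) for row in seg_map)
    eye = sum(1 for row in seg_map for label in row if label in eye_labels)
    return eye / total

def normalize(values):
    """Min-max normalize a list of values to [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def append_eye_dim(expr_coeffs, ratios):
    """Append the normalized eye ratio as one extra expression dimension."""
    return [coeffs + [r] for coeffs, r in zip(expr_coeffs, normalize(ratios))]

# Two toy frames: eyes open (four eye pixels) and nearly closed (one eye pixel).
open_frame = [[0, 4, 4, 0], [0, 5, 5, 0]]
closed_frame = [[0, 0, 0, 0], [0, 4, 0, 0]]
ratios = [eye_ratio(open_frame), eye_ratio(closed_frame)]
extended = append_eye_dim([[0.1, 0.2], [0.3, 0.4]], ratios)
```

At inference time the appended dimension could then be set directly to request a more open or more closed eye, which is the explicit control signal the paragraph describes.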

[0088] Training objectives

[0089] We use pixel loss as the training objective. Pixel loss measures the L2 distance between generated pixels and real pixels, and its specific expression is as follows:

[0090]

[0091] Here, p denotes a pixel whose ray hits the surface of the human body. To further improve the geometric quality and rendering results, we add a mask loss that uses the rays that do not hit the surface to optimize the modeling of the digital human.

[0092]

[0093] Here, CE denotes the cross-entropy loss between the actual occupancy value and the predicted occupancy value. Since these rays do not hit the surface of the human body, we select the point on each ray that is closest to the surface and compare the occupancy value computed there with the actual occupancy value.
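The two objectives can be illustrated with a small sketch. It assumes the pixel loss is the mean Euclidean (L2) distance between predicted and ground-truth RGB values over surface-hitting rays, and that the mask loss is a mean binary cross-entropy over the remaining rays; the averaging, the clamping constant `eps`, and the function names are assumptions, not details taken from the patent's formulas.

```python
import math

def pixel_loss(pred_rgb, true_rgb):
    """Mean L2 distance between predicted and real pixel colors."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, t)))
             for p, t in zip(pred_rgb, true_rgb)]
    return sum(dists) / len(dists)

def mask_loss(pred_occ, true_occ, eps=1e-7):
    """Mean binary cross-entropy between predicted and actual occupancy.

    pred_occ holds the occupancy predicted at the point on each missed ray
    that lies closest to the surface; true_occ holds the mask value.
    """
    total = 0.0
    for o, y in zip(pred_occ, true_occ):
        o = min(max(o, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        total += -(y * math.log(o) + (1.0 - y) * math.log(1.0 - o))
    return total / len(pred_occ)

rgb_error = pixel_loss([[0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0]])
uncertain = mask_loss([0.5], [0.0])
confident = mask_loss([1e-9], [0.0])
```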

[0094] In addition, we added a semantic loss function by utilizing the semantic prior information of the face semantic segmentation map obtained from the previous method. This loss function effectively stabilizes the training process, and its expression is as follows:

[0095]

[0096] For a point belonging to the face in the standard field, we find the nearest point in the corresponding face mesh and calculate their distance as the loss function. It should be noted that this loss function cannot be used in non-face regions, because FLAME models only the face and lacks the hair and torso.
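A minimal sketch of this nearest-point loss is shown below. It approximates "the nearest point in the face mesh" by the nearest mesh vertex and averages the distances over all face-labeled points; both choices, and the name `semantic_loss`, are assumptions for illustration.

```python
import math

def semantic_loss(face_points, mesh_vertices):
    """Mean distance from each face-labeled point to its nearest mesh vertex."""
    total = 0.0
    for point in face_points:
        total += min(math.dist(point, vertex) for vertex in mesh_vertices)
    return total / len(face_points)

points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vertices = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
loss = semantic_loss(points, vertices)
```

The brute-force search is quadratic in the number of points; a spatial index such as a k-d tree would be the usual replacement at realistic mesh sizes.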

[0097] Performance Analysis

[0098] Dataset: Unlike previous studies, our model only requires a 3-5 minute video as training data. Specifically, we selected one publicly released video from each of AD-NeRF and LSP. For each video, we used 80% of the frames as the training set and 20% as the test set.

[0099] Evaluation Metrics: To quantitatively evaluate the results, we selected Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) as metrics for image quality. Additionally, we used Landmark Distance (LMD) and Synchronization Confidence Score (Sync) as metrics for evaluating audio and lip synchronization.
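Two of these metrics are simple enough to sketch directly. The version below computes PSNR from the mean squared error of two flattened images and LMD as the mean Euclidean distance between corresponding landmarks. The 8-bit peak value of 255 and the function names are assumptions; SSIM and the SyncNet-based score are omitted because they need considerably more machinery.

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat images."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def landmark_distance(pred, target):
    """Mean Euclidean distance between corresponding 2D landmarks."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

worst_case = psnr([0.0, 0.0], [255.0, 255.0])
identical = psnr([10.0, 20.0], [10.0, 20.0])
lmd = landmark_distance([[0.0, 0.0]], [[3.0, 4.0]])
```

Higher PSNR indicates closer agreement with the reference frames, while lower LMD indicates better agreement of the mouth landmarks.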

[0100] Implementation details: We implemented our framework using PyTorch. The network uses the Adam optimizer with a learning rate of 0.0002. We trained the model on eight NVIDIA Tesla V100s with a batch size of 8 for 40 epochs. The model was trained at a resolution of 256×256 using a 5-minute dataset, and the convergence time was approximately 40 hours.

[0101] Experimental setup: The method of this invention is compared with the following methods: 1) AudioDVP, a state-of-the-art (SOTA) model-based method for audio-driven digital humans; 2) Wav2Lip, a method that produces state-of-the-art lip-synchronization results; 3) AD-NeRF, a SOTA method for audio-driven digital humans based on an implicit model.

[0102] Quantitative Comparison: We present the quantitative results of this invention in Tables 1 and 2. Since the model-based methods drive only the mouth region with audio and copy the rest from the original video, we crop only the lower half of the face when comparing with them. We compare with AD-NeRF at full resolution because it can also drive the entire digital human.

[0103] Table 1

[0104]

[0105] Table 2

[0106]

[0107] Under the cropping setting, the method of this invention achieved the best results on dataset A in terms of the PSNR, SSIM, and LMD metrics, which shows that it can generate high-fidelity digital humans with fine-grained details. Wav2Lip achieved the highest score on the Sync metric because it uses a pre-trained SyncNet during training. Despite this high lip-synchronization score, the talking digital humans generated by Wav2Lip appear unnatural, with a blurred mouth area and clearly visible boundaries around it. Moreover, Wav2Lip cannot generate 3D-aware digital humans. Under the full-resolution setting, the method of this invention outperforms AD-NeRF on all metrics. Since AD-NeRF does not use semantic information about the eyes, we conducted another experiment in which the semantic control module was disabled for a fair comparison. As shown in Table 2, the method of this invention outperforms AD-NeRF in both settings, demonstrating that it successfully addresses the inconsistency between the head and torso.

[0108] Qualitative Comparison: Qualitative experiments were conducted by comparing keyframes from the rendering results of each method. The comparison shows that AD-NeRF frequently exhibits the head-torso inconsistency described above, and AudioDVP's lip movements are poorly synchronized with the driving audio. Although Wav2Lip achieved the highest score on the SyncNet-based Sync metric, its lip region appears very unnatural. The method of this invention balances lip synchronization and image quality, producing the most realistic talking digital human.

[0109] Ablation analysis

[0110] Audio-to-Expression Module: In this framework, the audio-to-expression module is trained end to end without explicit supervision of the expression coefficients. We therefore conducted ablation experiments in two settings. (1) Expression supervision: we train the audio-to-expression module using the tracked FLAME expression as the ground truth, so that audio features are regressed directly to FLAME expression coefficients. According to Table 3, the lip-sync metric decreases, indicating that this setting is unsuitable for generating lip movements synchronized with the audio. (2) Expression basis supervision: we set the ground truth of the predicted expression basis to the expression basis of the nearest FLAME vertex. This also decreases the lip-sync metric, as shown in Table 3. These two experiments show that although audio can drive a digital human through FLAME-like supervision, this does not achieve optimal results, because both the tracked FLAME expression and the pre-trained FLAME expression basis may be inaccurate, which increases the error. Directly using audio features as the condition yields the best results.

[0111] Table 3

[0112]

[0113] Semantic Loss: We also tested modules without semantic loss to demonstrate the effect of the proposed semantic loss. The results are shown in Table 3, where we found that the semantic loss is beneficial to the final rendering results and audio synchronization. Furthermore, the semantic loss stabilizes the training process, effectively preventing frequent training crashes that occurred before its introduction.

[0114] Semantic Eye Control Module: To demonstrate the effectiveness of the semantic eye control module, we conducted an ablation experiment. We compared the image fidelity of the cropped eye region, and the results are shown in Table 4. By explicitly controlling eye movements, the semantic eye control module achieves better fidelity and greater realism in the driven results.

[0115] Table 4

[0116]

[0117] In summary, this invention proposes a novel framework, Semantic Controlled Neural Field for Audio Driven Avatar (Sem-Avatar), which uses the explicit FLAME model to drive the deformation of the digital human from audio, completely resolving the head-body inconsistency problem and producing high-quality audio-driven avatars. This invention also proposes using semantic information to guide the eye region of the avatar, which was previously uncontrollable in the FLAME model. Extensive experiments have verified the effectiveness of the method, demonstrating that it can synthesize highly realistic audio-driven avatars.

[0118] In summary, this invention provides an end-to-end solution for using audio to drive highly realistic digital humans, successfully solving the problems of head pose variation and head-body inconsistency, and generating highly realistic digital humans. The method of this invention can also control eye opening and closing through facial semantic information, overcoming the problem of unnatural audio-driven digital humans. The driving effect of this invention is superior to similar methods in audio lip-sync and rendering effects, and significantly outperforms traditional solutions in terms of digital human naturalness and rendering realism.

[0119] This invention also provides a storage medium for storing a computer program, which, when executed, performs at least the methods described above.

[0120] This invention also provides a control device, including a processor and a storage medium for storing a computer program, wherein the processor, when executing the computer program, performs at least the method described above.

[0121] This invention also provides a processor that executes a computer program, at least performing the methods described above.

[0122] The storage medium can be implemented by any type of volatile or non-volatile storage device, or a combination thereof. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which serves as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.

[0123] In the several embodiments provided by this invention, it should be understood that the disclosed systems and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.

[0124] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.

[0125] In addition, in the various embodiments of the present invention, each functional unit can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0126] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0127] Alternatively, if the integrated units of this invention are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this invention, or the parts thereof that contribute over the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, server, network device, etc.) to execute all or part of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0128] The methods disclosed in the several method embodiments provided by this invention can be arbitrarily combined without conflict to obtain new method embodiments.

[0129] The features disclosed in the several product embodiments provided by this invention can be arbitrarily combined without conflict to obtain new product embodiments.

[0130] The features disclosed in the several method or device embodiments provided by the present invention can be arbitrarily combined without conflict to obtain new method or device embodiments.

[0131] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various equivalent substitutions or obvious modifications can be made without departing from the concept of the present invention, and all such modifications, achieving the same performance or application, should be considered within the scope of protection of the present invention.

Claims

1. A method for generating speech-driven digital humans based on neural fields, characterized in that it includes the following steps:

S1. Construct a deformable digital face using a face model, wherein a modified FLAME face model is used to construct the deformable digital face, the modified FLAME face model being based on the FLAME model but with the pose basis, expression basis, and linear blend skinning weights continuous in three-dimensional space.

S2. Encode the audio features of the given speech and map the audio features to the expression space of the digital face.

S3. Drive a digital human in a standard space with the audio features according to a neural field representation, wherein the digital human in the standard space is obtained from a neural occupancy field and a neural texture field, and, for spatial coordinates in the standard space, the corresponding displacement is output by a neural displacement field according to the audio features.

In step S3, for a 3D point $x_c$ in the standard-space field, its corresponding expression basis $\mathcal{E}$, pose basis $\mathcal{P}$, and linear blend skinning weights $\mathcal{W}$ are predicted. The expression basis and the pose basis are combined with the expression features obtained from the audio to obtain the point coordinates after non-rigid displacement:

$$x_{nr} = x_c + B_P(\theta; \mathcal{P}) + B_E(\psi; \mathcal{E})$$

Subsequently, the linear blend skinning (LBS) weights and the pose of each joint are taken as input, and the coordinates of the point after rigid transformation are output. The digital human representation obtained after audio driving is:

$$x_d = W(x_{nr}, J, \theta, \mathcal{W})$$

where $\theta$ is the pose coefficient, $B_P$ is the pose blendshape, $\psi$ is the expression coefficient, $B_E$ is the expression blendshape, $J$ regresses the coordinates of the key points (joints), and $W$ is the linear skinning function.

2. The speech-driven digital human generation method based on neural fields as described in claim 1, characterized in that the digital face deformation process includes: given an average surface mesh $\bar{T}$ with $N$ vertices, the expression coefficients are multiplied by the expression basis to perform non-rigid deformation. Then the linear blend skinning (LBS) algorithm is used to calculate the rigid deformation: LBS combines the rotation vectors of the neck, jaw, and eyes, along with the global rotation vector, with the linear blend skinning weights to rigidly deform the entire face. The specific calculation is as follows:

$$T_P(\theta, \psi) = \bar{T} + B_P(\theta; \mathcal{P}) + B_E(\psi; \mathcal{E})$$

$$M(\theta, \psi) = W(T_P(\theta, \psi), J, \theta, \mathcal{W})$$

where $\theta$ is the pose coefficient, $\mathcal{P}$ is the pose basis, with $K$ the number of joints, $B_P$ is the pose blendshape, $\psi$ is the expression coefficient, $\mathcal{E}$ is the expression basis, $B_E$ is the expression blendshape, $T_P$ is the face mesh after non-rigid deformation, $\mathcal{W}$ is the linear skinning weight, $J$ regresses the coordinates of the key points (joints), $W$ is the linear skinning function, and $M$ is the final mesh.
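
The two-stage deformation in this claim (expression offsets first, then linear blend skinning) can be illustrated with a minimal sketch. It is not the patented implementation: the function names `rodrigues`, `non_rigid`, and `lbs` are hypothetical, the pose-corrective blendshape is omitted, and each joint is assumed to rotate points about its own location without a kinematic chain.

```python
import math

def rodrigues(axis_angle):
    """Convert an axis-angle rotation vector to a 3x3 rotation matrix."""
    x, y, z = axis_angle
    angle = math.sqrt(x * x + y * y + z * z)
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = x / angle, y / angle, z / angle
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def non_rigid(template, expr_basis, expr_coeffs):
    """Non-rigid step: vertex + sum_k coeff_k * basis_k[vertex].

    expr_basis[k][v] is the 3D offset of vertex v for basis k.
    """
    out = []
    for v, vertex in enumerate(template):
        offset = [
            sum(c * expr_basis[k][v][d] for k, c in enumerate(expr_coeffs))
            for d in range(3)
        ]
        out.append([vertex[d] + offset[d] for d in range(3)])
    return out

def lbs(vertices, joints, rotations, weights):
    """Rigid step: blend per-joint rotations using the skinning weights.

    Assumption: every joint rotates a point about the joint's own
    location, independently of the other joints (no hierarchy).
    """
    mats = [rodrigues(r) for r in rotations]
    out = []
    for v, p in enumerate(vertices):
        blended = [0.0, 0.0, 0.0]
        for j, (rot, joint) in enumerate(zip(mats, joints)):
            local = [p[d] - joint[d] for d in range(3)]
            rotated = [
                sum(rot[r][d] * local[d] for d in range(3)) + joint[r]
                for r in range(3)
            ]
            for d in range(3):
                blended[d] += weights[v][j] * rotated[d]
        out.append(blended)
    return out
```

In the method described here the bases and weights are predicted continuously in 3D space rather than stored per vertex, so the same two steps would be applied to arbitrary points of the standard space instead of a fixed vertex list.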

3. The speech-driven digital human generation method based on neural fields as described in claim 1 or 2, characterized in that, in step S2, the audio is converted into a feature vector using a pre-trained speech model; the audio features are then passed through four one-dimensional convolutional layers and mapped to the expression coefficients and the jaw pose; and a self-attention mechanism is used to output weighted expression features.
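
The attention-weighting idea in this claim can be sketched as follows. This is a simplified illustration only: it uses a single-query scaled dot-product attention without learned projections, applied over a window of per-frame feature vectors, and the function names `softmax` and `attention_pool` are hypothetical. The convolutional layers and the actual attention design of the method are not reproduced.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(window, center):
    """Weight each frame by its scaled dot-product similarity to the
    center frame and return the weighted sum together with the weights."""
    query = window[center]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale for frame in window]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, window))
        for d in range(len(query))
    ]
    return pooled, weights
```

Weighting neighbouring frames in this way smooths the per-frame expression features over time, which is the usual motivation for attention over an audio window.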

4. The speech-driven digital human generation method based on neural fields as described in any one of claims 1 to 2, characterized in that the neural field representation includes: the digital human is represented using two multilayer perceptrons (MLPs), one of which predicts the occupancy value at the current position to represent the digital human's geometry, while the other predicts the RGB value at the current position to render the digital human. Decoupling the two MLPs improves both geometry and rendering quality. The occupancy MLP, which has learnable parameters, takes a spatial coordinate and the currently extracted audio features as input and outputs the probability that the coordinate is occupied; the audio features are added as an extra input to the MLP to account for topological changes that cannot be explained by deformation. A neural texture MLP is used to assign colors to the digital human: it outputs the color of the current coordinate point, and its inputs include the normal direction at the current position, obtained by normalizing the gradient of the geometry field, the pose coefficient, and the expression coefficient.
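
Obtaining the normal by normalizing the gradient of the geometry field can be sketched as below. The soft sphere `occupancy` is a toy stand-in for the occupancy MLP (it ignores audio features), and the gradient is taken by central differences, whereas a trained network would normally be differentiated automatically. Both function names are hypothetical.

```python
import math

def occupancy(point, radius=1.0, sharpness=10.0):
    """Toy occupancy field: close to 1 inside a sphere, close to 0 outside."""
    dist = math.sqrt(sum(c * c for c in point))
    return 1.0 / (1.0 + math.exp(-sharpness * (radius - dist)))

def normal_from_field(field, point, eps=1e-4):
    """Central-difference gradient of the field, scaled to unit length."""
    grad = []
    for d in range(3):
        hi = list(point)
        lo = list(point)
        hi[d] += eps
        lo[d] -= eps
        grad.append((field(hi) - field(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    if norm < 1e-12:
        return [0.0, 0.0, 0.0]  # flat region: no well-defined direction
    return [g / norm for g in grad]
```

Because occupancy increases toward the inside, the normalized gradient in this sketch points inward; whether the sign is flipped to obtain an outward normal is a convention the source text does not specify.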

5. The method for generating a speech-driven digital human based on a neural field as described in any one of claims 1 to 2, characterized in that, in step S3, facial semantics are also used as an explicit control signal to perform eye control based on facial semantics.

6. The speech-driven digital human generation method based on neural fields as described in claim 5, characterized in that, in the preprocessing stage, a semantic segmentation method is used to obtain the semantic segmentation result of the digital human image; the total number of pixels in the eye region is counted to obtain a value called the eye ratio, which is normalized and appended to the expression coefficients; and, during the training phase, the model learns the eye-ratio dimension as a measure of eye openness and thereby achieves control over the eyes.
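
The preprocessing described in this claim can be sketched as follows. Several details are assumptions, since the text does not fix them: the eye ratio is taken as eye pixels divided by all pixels of the frame, normalization is min-max over the sequence, and the value is appended as one extra dimension of each frame's expression vector. The function names and the label value are hypothetical.

```python
def eye_ratio(seg_mask, eye_label):
    """Fraction of pixels in a 2D label map that carry the eye label."""
    total = sum(len(row) for row in seg_mask)
    eye = sum(1 for row in seg_mask for label in row if label == eye_label)
    return eye / total if total else 0.0

def normalize(values):
    """Min-max normalize per-frame ratios to [0, 1] over the sequence."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def append_eye_dimension(expr_coeffs, eye_values):
    """Append the normalized eye ratio to each frame's expression vector."""
    return [list(coeffs) + [eye] for coeffs, eye in zip(expr_coeffs, eye_values)]
```

At inference time, setting this extra dimension to a value near 0 or near 1 would then request closed or open eyes, which is what makes the signal an explicit control.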

7. The speech-driven digital human generation method based on neural fields as described in any one of claims 1 to 2, characterized in that a pixel loss is used as a training objective. The pixel loss measures the L2 distance between the generated pixels and the real pixels and is computed over the pixels whose rays hit the human body surface. An additional mask loss is added to optimize the modeling of the digital human using the rays that do not hit the human body surface: the mask loss is the cross-entropy (CE) loss between the actual occupancy value and the predicted occupancy value, where the point on the ray closest to the surface is selected and its computed occupancy value is compared with the actual occupancy value. Both losses are optimized with respect to the corresponding learnable parameters.
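
The two training losses in this claim can be sketched as below, under stated assumptions: the pixel loss is taken as the mean squared L2 distance (the text says only "L2 distance"), the mask loss is a binary cross-entropy averaged over rays, and ray casting and the selection of the closest point on each ray are assumed to have happened already. The function names are hypothetical.

```python
import math

def pixel_loss(pred_rgb, true_rgb):
    """Mean squared L2 distance between predicted and real pixel colors,
    intended for pixels whose rays hit the surface."""
    if not pred_rgb:
        return 0.0
    total = 0.0
    for p, t in zip(pred_rgb, true_rgb):
        total += sum((pc - tc) ** 2 for pc, tc in zip(p, t))
    return total / len(pred_rgb)

def mask_loss(pred_occ, true_occ, eps=1e-7):
    """Mean binary cross-entropy between predicted and actual occupancy,
    intended for rays that miss the surface."""
    if not pred_occ:
        return 0.0
    total = 0.0
    for p, t in zip(pred_occ, true_occ):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred_occ)
```

The mask loss gives a gradient to rays that never reach the surface, which the pixel loss alone cannot supervise.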

8. The speech-driven digital human generation method based on neural fields as described in any one of claims 1 to 2, characterized in that, by utilizing the semantic prior information of the face semantic segmentation map, a semantic loss function is added, defined as follows: for a point in the standard field that belongs to the face, the nearest point on the corresponding face mesh is found, and the distance between the two is used as the loss, where the face mesh is the non-rigidly deformed face mesh.
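
The nearest-point distance in this claim can be sketched as follows. As a simplification, the nearest point on the mesh is approximated by the nearest mesh vertex, and the per-point distances are averaged; both choices are assumptions, and the function name `semantic_loss` is hypothetical.

```python
import math

def semantic_loss(face_points, mesh_vertices):
    """Mean Euclidean distance from each face-labeled point to its
    nearest vertex of the non-rigidly deformed face mesh."""
    if not face_points:
        return 0.0
    total = 0.0
    for p in face_points:
        total += min(math.dist(p, v) for v in mesh_vertices)
    return total / len(face_points)
```

This brute-force search costs one distance evaluation per point-vertex pair; a spatial index such as a k-d tree would be the usual replacement for large meshes.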

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the speech-driven digital human generation method based on neural fields as described in any one of claims 1 to 8.