AI digital human intelligent creation management method and system

By extracting speech spectrum features and emotional feature parameters, generating lip movement sequences with a preset model and correcting them through fuzzy logic control, and decomposing rendering tasks for parallel processing, the method addresses stiff lip movements in AI digital humans and improves the realism and vividness of digital human videos.

CN122289480A · Pending · Publication Date: 2026-06-26 · WUHAN BAOJI NEW MEDIA TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WUHAN BAOJI NEW MEDIA TECHNOLOGY CO LTD
Filing Date
2026-04-14
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, AI digital humans cannot adapt their lip movements to the dynamic changes in speech when generating videos, resulting in stiff lip movements and limited expressiveness, which reduces the realism of digital human videos and the user experience.

Method used

By extracting the speech spectrum features and emotional feature parameters of the speech content, an initial lip-sync sequence is generated using a preset model and corrected by fuzzy logic control. Combined with the emotional feature parameters, the digital human image is driven to generate animation, and the rendering task is decomposed and processed in parallel to improve synchronization.

Benefits of technology

It achieves adaptive matching of AI digital human lip movements and voice emotions, improving the realism and vividness of digital human videos and solving the problem of stiff lip movements.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses an AI digital human intelligent creation management method and system, relating to the field of artificial intelligence technology. The method includes: determining the digital human image and speech content based on user-input digital human creation instructions; extracting features from the speech content to obtain speech spectrum features and emotional feature parameters; generating an initial lip-sync sequence based on the speech spectrum features; correcting the initial lip-sync sequence based on the emotional feature parameters to obtain a target lip-sync sequence; driving the digital human image based on the target lip-sync sequence and emotional feature parameters to generate a digital human animation; and creating an AI digital human based on the digital human animation and speech content. This application, by jointly driving the digital human image with the corrected target lip-sync sequence and emotional feature parameters, enables the generated AI digital human to adaptively follow the emotional fluctuations of the speech in both lip-sync dynamics and overall performance, enhancing the realism and vividness of the digital human video.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to an AI digital human intelligent creation management method and system.

Background Technology

[0002] In the field of AI digital human creation, related technologies can already generate basic talking videos from input speech and a selected digital human image. In practical applications, however, the rich variations in tone, speed, and emotional nuance carried by the speech signal are often not adaptively reflected in the digital human's overall performance. For example, when the speech content expresses excitement or sadness, the overall dynamic performance of the digital human, such as the rhythm of lip movements, does not adjust accordingly, resulting in stiff, monotonous movements and a severe disconnect from the vividness of the speech, thus reducing the realism and user experience of the digital human video. Therefore, how to enable the created AI digital human to adapt its lip movements to the dynamic changes in the speech, and thereby improve its vividness, is a technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0003] The main purpose of this application is to provide an AI digital human intelligent creation management method and system, which aims to solve the technical problem of how to enable the created AI digital human to adapt its lip movements to the dynamic changes in the speech.

[0004] To achieve the above objectives, this application provides an AI digital human intelligent creation management method, the method comprising the following steps: determining the digital human image and voice content based on the digital human creation instructions input by the user; performing feature extraction on the voice content to obtain speech spectrum features and emotional feature parameters; generating an initial lip movement sequence based on the speech spectrum features, and correcting the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; and driving the digital human image based on the target lip movement sequence and the emotional feature parameters to generate a digital human animation, and creating an AI digital human based on the digital human animation and the voice content.

[0005] In one embodiment, the step of generating an initial lip movement sequence based on the speech spectrum features includes: inputting the speech spectrum features into a preset model and using the model output as the initial lip movement sequence, where the preset model outputs the lip movement sequence corresponding to the input speech spectrum features. The preset model is obtained by training as follows: acquiring sample pairing data, which includes several speech spectrum features and their corresponding lip movement sequences; constructing an end-to-end model consisting of a convolutional neural network, a long short-term memory network, and a fully connected layer; and training the end-to-end model with the speech spectrum features as model inputs, the lip movement sequences as labels, and mean squared error as the loss function to obtain the preset model.

[0006] In one embodiment, the step of correcting the initial lip movement sequence based on the emotional feature parameters includes: using the emotional intensity, speech rate, and pitch variation rate among the emotional feature parameters as input variables; performing fuzzy inference on the input variables according to a preset fuzzy rule base to obtain the fuzzy membership degree of the lip movement adjustment coefficient; defuzzifying the fuzzy membership degree to obtain the lip movement adjustment coefficient; and correcting the initial lip movement sequence using the lip movement adjustment coefficient to obtain the target lip movement sequence.

[0007] In one embodiment, the step of driving the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters includes: matching, in the preset expression library and the preset action library, the target expression resources and target action resources corresponding to the emotional feature parameters; and driving the digital human image based on the target lip movement sequence, the target expression resources, and the target action resources to generate the digital human animation.

[0008] In one embodiment, the step of creating an AI digital human based on the digital human animation and the voice content includes: decomposing the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation into mutually independent sub-tasks; distributing the sub-tasks to multiple distributed computing nodes for parallel rendering to generate a digital human frame sequence, a background frame sequence, and a special effects frame sequence, respectively; layer-blending the digital human frame sequence, the background frame sequence, and the special effects frame sequence to obtain a fused sequence; and time-synchronizing and encoding the fused sequence with the voice content to generate the AI digital human.
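The decomposition, parallel rendering, and layer blending described in [0008] can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: local threads stand in for distributed computing nodes, each of the three rendering tasks is treated as a single sub-task, `render_layer` is a hypothetical stand-in renderer that emits solid-color frames, and blending uses the standard "over" alpha compositing operator. Time synchronization and encoding are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical frame format: a flat list of (r, g, b, a) pixels, all in [0, 1].
WIDTH, N_FRAMES = 4, 3

def render_layer(name, color, alpha):
    """Stand-in renderer: returns N_FRAMES frames filled with one RGBA color."""
    return name, [[(*color, alpha)] * WIDTH for _ in range(N_FRAMES)]

def over(top, bottom):
    """Standard 'over' alpha compositing of one RGBA pixel onto another."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)

def blend_layers(background, human, effects):
    """Composite effects over the digital human over the background, per frame."""
    fused = []
    for bg_f, hu_f, fx_f in zip(background, human, effects):
        fused.append([over(fx, over(hu, bg)) for bg, hu, fx in zip(bg_f, hu_f, fx_f)])
    return fused

# Three mutually independent sub-tasks (colors and alphas are made-up test data).
tasks = [
    ("background", (0.0, 0.0, 1.0), 1.0),
    ("human", (1.0, 0.0, 0.0), 1.0),
    ("effects", (1.0, 1.0, 1.0), 0.0),  # fully transparent effects layer
]

# Threads stand in for the distributed computing nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    layers = dict(pool.map(lambda t: render_layer(*t), tasks))

fused = blend_layers(layers["background"], layers["human"], layers["effects"])
```

Because the three layers share no state, they can be rendered in any order or on any node; only the blending step needs all three frame sequences at once.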

[0009] In one embodiment, before the step of decomposing the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation into mutually independent sub-tasks, the method further includes: The hardware computing power score and network quality score are determined based on the user's local hardware configuration and network status, and the rendering computing power requirement value is determined based on the user's input creation requirement parameters. Based on the hardware computing power score, the network quality score, and the rendering computing power requirement value, a rendering route is matched and determined. The rendering route includes one of pure local rendering, cloud-edge collaborative rendering, or pure cloud rendering.
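The route matching in [0009] could look like the sketch below. The application does not state the matching rule, so the comparison logic, the 0 to 1 score scale, and the thresholds `good_network` and `local_share` are all assumptions made for illustration.

```python
from enum import Enum

class Route(Enum):
    LOCAL = "pure local rendering"
    CLOUD_EDGE = "cloud-edge collaborative rendering"
    CLOUD = "pure cloud rendering"

def choose_route(hardware_score, network_score, demand,
                 good_network=0.6, local_share=0.5):
    """Pick a rendering route from three scores assumed to lie in [0, 1].

    Assumed rule (not from the application):
    - local hardware covers the demand        -> pure local rendering
    - network is good and local hardware can
      cover at least `local_share` of demand  -> cloud-edge collaborative rendering
    - otherwise                               -> pure cloud rendering
    """
    if hardware_score >= demand:
        return Route.LOCAL
    if network_score >= good_network and hardware_score >= local_share * demand:
        return Route.CLOUD_EDGE
    return Route.CLOUD

route_a = choose_route(hardware_score=0.9, network_score=0.2, demand=0.5)
route_b = choose_route(hardware_score=0.4, network_score=0.8, demand=0.6)
route_c = choose_route(hardware_score=0.1, network_score=0.8, demand=0.6)
```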

[0010] Furthermore, to achieve the above objectives, this application also proposes an AI digital human intelligent creation management system, which includes: an instruction parsing module, used to determine the digital human image and voice content based on the digital human creation instructions input by the user; a feature extraction module, used to extract features from the voice content to obtain speech spectrum features and emotional feature parameters; a correction module, used to generate an initial lip movement sequence based on the speech spectrum features and to correct the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; and a digital human creation module, used to drive the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters, and to create an AI digital human based on the digital human animation and the voice content.

[0011] In addition, to achieve the above objectives, this application also proposes an AI digital human intelligent creation management device, the device comprising: a memory, a processor, and an AI digital human intelligent creation management program stored in the memory and executable on the processor, the AI digital human intelligent creation management program being configured to implement the steps of the AI digital human intelligent creation management method described above.

[0012] In addition, to achieve the above objectives, this application also proposes a storage medium, which is a computer-readable storage medium, storing an AI digital human intelligent creation management program, wherein when the AI digital human intelligent creation management program is executed by a processor, it implements the steps of the AI digital human intelligent creation management method described above.

[0013] In addition, to achieve the above objectives, the present invention also provides a computer program product, which includes an AI digital human intelligent creation management program. When the AI digital human intelligent creation management program is executed by a processor, it implements the steps of the AI digital human intelligent creation management method described above.

[0014] This application determines the digital human image and voice content based on the digital human creation instructions input by the user; extracts features from the voice content to obtain speech spectrum features and emotional feature parameters; generates an initial lip movement sequence based on the speech spectrum features, and corrects the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; drives the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters, and creates an AI digital human based on the digital human animation and the voice content. The method described in this application extracts emotional feature parameters from the voice content and uses these parameters to correct the initial lip movement sequence. This avoids the problem found in related technologies, where lip movements fail to respond to changes in tone, speed, and emotion, and enables the rhythm and amplitude of lip movements to adaptively match the emotional tone of the speech. Furthermore, this application uses the corrected target lip movement sequence and the emotional feature parameters to jointly drive the digital human image, thus avoiding the stiff movements and limited expressiveness caused by a disconnect between the overall performance of the digital human and the emotional tone of the speech. Consequently, the generated AI digital human can adaptively follow the emotional fluctuations of the speech in both lip movement dynamics and overall performance, enhancing the realism and vividness of the digital human video.

Description of the Drawings

[0015] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart illustrating the first embodiment of the AI digital human intelligent creation management method of this application;
Figure 2 is a flowchart illustrating the second embodiment of the AI digital human intelligent creation management method of this application;
Figure 3 is a flowchart illustrating the third embodiment of the AI digital human intelligent creation management method of this application;
Figure 4 is a schematic diagram illustrating the user privacy data collection method of the AI digital human intelligent creation management method in this application;
Figure 5 is a structural block diagram of the first embodiment of the AI digital human intelligent creation management system of this application;
Figure 6 is a schematic diagram of the structure of the AI digital human intelligent creation management device of this application.

[0018] The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0019] It should be understood that the specific embodiments described herein are only used to explain the technical solutions of this application and are not intended to limit this application.

[0020] It should be noted that the executing entity of the embodiments of this application can be a computing service device with data processing, network communication, and program execution functions, such as a smart wearable device, a personal computer, or a mobile phone, or an electronic device capable of realizing the above functions, such as the aforementioned AI digital human intelligent creation management device. The following embodiments will be described using the AI digital human intelligent creation management device as an example.

[0021] This application provides an AI digital human intelligent creation management method. Refer to Figure 1, which is a flowchart illustrating the first embodiment of the AI digital human intelligent creation management method of this application.

[0022] In this embodiment, the AI digital human intelligent creation management method includes the following steps:

Step S10: Determine the digital human image and voice content based on the digital human creation instructions input by the user.

[0023] It should be noted that the aforementioned digital human creation instructions refer to the complete set of operation commands issued by the user through an interactive interface (such as a webpage, application, or mini-program) to initiate and configure the digital human video production process. Digital human creation instructions may include requirements for selecting or customizing the digital human avatar, selecting or inputting voice content, and setting video production parameters (such as resolution, background style, etc.). For example, if a user clicks the "Real Person Clone" mode button, uploads a 10-second front-facing short video, and enters "Hello everyone, today I'm sharing a practical office tool…" in the text box, while selecting "Portrait 9:16, 1080P resolution," all of these operations constitute a single digital human creation instruction. The aforementioned digital human avatar represents the virtual persona entity used to represent the user. The digital human avatar can be a "digital twin" reconstructed based on a real person's photo or video, or a custom-designed "virtual model." The aforementioned voice content refers to the sound information that the digital human will emit in the video. This can be in the form of an audio file directly uploaded by the user, or synthesized speech generated by a speech synthesis engine after the user inputs text.

[0024] In a specific implementation, the AI digital human intelligent creation management system (hereinafter referred to as the System) can receive digital human creation instructions input by users through the user interaction module. When a user selects the "Real Person Clone" mode and uploads a photo or short video, the System calls the 3D face reconstruction algorithm in the digital human customization unit to construct a digital human image based on the facial image provided by the user. When a user selects the virtual model mode, the System matches preset digital human models from the resource classification and retrieval unit according to tags such as gender, age, and occupation. At the same time, the System analyzes the voice source in the instructions: if the user directly uploads an audio file, the audio is used as the voice content; if the user inputs text, the speech synthesis engine in the speech processing unit is called to synthesize the text into a speech waveform.
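The branching in [0024] can be sketched as a small dispatcher. Everything named here is hypothetical: the `CreationInstruction` fields, the mode strings, and the returned action labels are illustrative placeholders for the units the paragraph mentions, not an interface defined by the application.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CreationInstruction:
    mode: str                          # "real_person_clone" or "virtual_model" (assumed labels)
    face_media: Optional[str] = None   # path to the uploaded photo or short video
    tags: dict = field(default_factory=dict)  # e.g. gender, age, occupation
    audio_file: Optional[str] = None   # directly uploaded audio, if any
    text: Optional[str] = None         # text to synthesize, if any

def plan_image(instr):
    """Decide how the digital human image is obtained."""
    if instr.mode == "real_person_clone":
        if not instr.face_media:
            raise ValueError("real person clone mode needs a photo or short video")
        return ("reconstruct_3d_face", instr.face_media)
    if instr.mode == "virtual_model":
        return ("match_preset_model", instr.tags)
    raise ValueError(f"unknown mode: {instr.mode}")

def plan_voice(instr):
    """Decide how the voice content is obtained; uploaded audio takes priority."""
    if instr.audio_file:
        return ("use_uploaded_audio", instr.audio_file)
    if instr.text:
        return ("synthesize_speech", instr.text)
    raise ValueError("instruction has neither audio nor text")

instr = CreationInstruction(
    mode="virtual_model",
    tags={"gender": "female", "occupation": "teacher"},
    text="Hello everyone",
)
image_plan = plan_image(instr)
voice_plan = plan_voice(instr)
```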

[0025] Step S20: Extract features from the speech content to obtain speech spectrum features and emotional feature parameters.

[0026] It should be noted that the aforementioned speech spectral features refer to the feature representation obtained by transforming the time-domain waveform of the speech signal to the frequency domain and then performing nonlinear compression according to the characteristics of human auditory perception (such as the Mel scale). Speech spectral features reflect the energy distribution of speech across different frequency channels and serve as the fundamental input for subsequent lip-sync algorithms. Speech signals can be processed using a Mel filter bank to obtain a Mel spectrum matrix of dimension F×T, where F is the number of Mel filters (e.g., 80) and T is the number of speech frames. The aforementioned emotional feature parameters refer to a set of numerical indicators that quantify the emotional coloring contained in speech, including but not limited to emotional intensity (reflecting how strongly the emotion is expressed), speech rate (the number of words spoken per unit time), and pitch variation rate (reflecting how strongly the pitch fluctuates). For example, for a speech expressing intense emotion, the emotional feature parameters can be extracted as: emotional intensity = 0.85 (range 0 to 1), speech rate = 180 words/minute, and pitch variation rate = 0.72.

[0027] In a specific implementation, the system performs feature extraction on the acquired speech content. Specifically, the system preprocesses the speech signal within the speech content by resampling, pre-emphasis, framing, and applying a Hamming window. Then, it performs a Fast Fourier Transform on each frame to convert the time-domain signal into a frequency-domain energy spectrum. This spectrum is weighted and summed by a bank of Mel filters and then log-compressed to generate the Mel spectral feature matrix (i.e., the aforementioned speech spectral features). Simultaneously, the system calculates emotional feature parameters from the same speech content: by analyzing the fundamental frequency curve, energy envelope, and duration information, it extracts emotional intensity (e.g., quantified through energy fluctuations and fundamental frequency range), speech rate (by counting the number of words pronounced per unit time using phoneme recognition), and pitch variation rate (calculated as the mean absolute value of the fundamental frequency difference).
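The Mel-spectrum part of [0027] can be sketched with NumPy as below. The application names the processing stages but not their parameters, so the 16 kHz sampling rate, 25 ms frames, 10 ms hop, 1024-point FFT, and 0.97 pre-emphasis coefficient are assumed values; resampling is skipped by assuming the input is already at the target rate, and the emotional feature parameters are not computed here.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(signal, sr=16000, n_mels=80, frame_len=400,
                        hop=160, n_fft=1024, pre_emph=0.97):
    """Return a log-Mel matrix of shape (F, T) = (n_mels, number of frames)."""
    # Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing: overlapping frames, each multiplied by a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # FFT per frame (zero-padded to n_fft), then power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # Weighted sum with the Mel filters, then log compression.
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energy + 1e-10).T

# One second of a 220 Hz tone as stand-in speech.
t = np.arange(16000) / 16000.0
mel = log_mel_spectrogram(np.sin(2.0 * np.pi * 220.0 * t))
```

With these assumed settings, `mel` has F = 80 rows and one column per 10 ms hop, matching the F×T layout described in [0026].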

[0028] Step S30: Generate an initial lip movement sequence based on the speech spectrum features, and modify the initial lip movement sequence based on the emotion feature parameters to obtain the target lip movement sequence.

[0029] It should be noted that the aforementioned speech spectrum features refer to the feature representation obtained by transforming the time-domain waveform of the speech signal to the frequency domain and then performing nonlinear compression according to the characteristics of human auditory perception (such as the Mel scale). Speech spectrum features can reflect the energy distribution of speech in different frequency channels. The aforementioned initial lip movement sequence refers to the lip movement data sequence directly generated based on the speech spectrum features, without adjustment for emotional parameters. The initial lip movement sequence is organized frame by frame, with each frame containing the values of the hybrid deformation weights (commonly called blendshape weights) of the digital human face, used to drive the basic movement shape of the digital human lips. For example, for a 5-second speech segment at a frame rate of 30 frames/second, the initial lip movement sequence contains 150 frames of data, each frame being a 52-dimensional vector corresponding to the 52 hybrid deformation weights of the digital human face. The aforementioned target lip movement sequence refers to the final lip movement data sequence obtained after correction by emotional feature parameters. The target lip movement sequence has the same number of frames and dimensional structure as the initial lip movement sequence, but the weight values have been adjusted according to the emotional intensity, speech rate, and pitch variation rate of the speech, thus more accurately reflecting the emotional coloring of the speech.

[0030] It should be understood that when performing the step of generating the initial lip movement sequence based on speech spectral features, the system can invoke a pre-built neural network model. This neural network model employs an end-to-end architecture, taking the extracted speech spectral features (such as the Mel spectrum matrix) as input, passing them through multiple layers of nonlinear transformations, and directly outputting a frame-level hybrid deformation weight sequence. Each frame in this sequence corresponds to the values of 52 hybrid deformation weights for the digital human face, collectively forming the initial lip movement sequence. This neural network model has been pre-trained using a large amount of speech and lip movement pairing data and is capable of learning the mapping relationship between different pronunciations and lip shapes.

[0031] Furthermore, after obtaining the initial lip movement sequence, the system can read the emotional feature parameters of the speech. These parameters have been calculated synchronously during the feature extraction stage, including the emotional intensity value representing the degree of emotion, the speech rate value representing the speed of pronunciation, and the pitch variation rate value representing the amplitude of pitch fluctuation. The system uses these parameters as input to execute a fuzzy logic control process. The fuzzy logic control process first maps each input parameter to preset fuzzy sets. For example, emotional intensity is mapped to three fuzzy sets: "low," "medium," and "high"; speech rate is mapped to three fuzzy sets: "slow," "medium," and "fast"; and pitch variation rate is mapped to three fuzzy sets: "small," "medium," and "large." Each fuzzy set corresponds to a membership function. Subsequently, the system performs rule matching according to the fuzzy rule base. For example, when emotional intensity belongs to the "high" set, speech rate belongs to the "fast" set, and pitch variation rate belongs to the "large" set, the corresponding rule is matched, and the fuzzy membership distributions of the lip opening and closing amplitude adjustment coefficient and the lip movement rhythm adjustment coefficient are output. The system employs the centroid method to convert the fuzzy membership distributions into precise adjustment coefficient values. Finally, the system identifies the 18 weights related to lip movements in the initial lip movement sequence (e.g., the 18 weights corresponding to lip opening and closing, lip bead movement, and mouth-corner changes) and multiplies these weights by the adjustment coefficients to obtain the corrected weight values, while leaving the weights unrelated to lip movements unchanged. After this frame-by-frame processing, the system obtains the target lip movement sequence.
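The last two stages of [0031], centroid defuzzification and frame-by-frame correction, can be sketched as follows. Several details are assumptions: the output fuzzy sets are triangles placed over a 0.8 to 1.5 coefficient range, the rule strengths are example values, `LIP_INDICES` is a hypothetical choice of which 18 of the 52 weights are lip-related, and only the opening and closing amplitude coefficient is applied because the text does not spell out how the rhythm coefficient acts on the weights.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed output fuzzy sets for the amplitude adjustment coefficient.
OUTPUT_SETS = {
    "low": (0.45, 0.80, 1.15),
    "medium": (0.80, 1.15, 1.50),
    "high": (1.15, 1.50, 1.85),
}

def defuzzify_centroid(rule_strengths, lo=0.8, hi=1.5, steps=70):
    """Clip each output set at its rule strength, aggregate by max, take the centroid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    mu = [
        max(min(strength, tri(x, *OUTPUT_SETS[name]))
            for name, strength in rule_strengths.items())
        for x in xs
    ]
    total = sum(mu)
    if total == 0.0:
        return 1.0  # assumption: if no rule fires, leave the sequence unchanged
    return sum(x * m for x, m in zip(xs, mu)) / total

LIP_INDICES = range(18)  # hypothetical: which 18 of the 52 weights are lip-related

def correct_sequence(frames, coeff, lip_indices=LIP_INDICES):
    """Scale lip-related weights by coeff; copy every other weight unchanged."""
    lip = set(lip_indices)
    return [[w * coeff if i in lip else w for i, w in enumerate(frame)]
            for frame in frames]

# Example: the "high" rule fires at 0.8 and the "medium" rule at 0.2.
coeff = defuzzify_centroid({"high": 0.8, "medium": 0.2})
initial = [[0.5] * 52 for _ in range(3)]   # 3 frames of 52 weights
target = correct_sequence(initial, coeff)
```

The target sequence keeps the frame count and the 52-dimensional layout of the initial sequence, as [0029] requires; only the lip-related entries change.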

[0032] Step S40: Drive the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters, and create an AI digital human based on the digital human animation and the voice content.

[0033] It should be noted that the aforementioned digital human animation refers to a sequence of moving images of a digital human figure. The aforementioned AI digital human refers to a finished digital human film that includes a digital human figure, voice content, background images, and special effects elements. The finished digital human film can be a short video or a long video; this embodiment imposes no limitation in this respect.

[0034] In one feasible implementation, a digital human animation engine can be invoked to load a 3D model of the digital human image. This 3D model includes a facial hybrid deformation controller and a body skeletal system. The engine reads the target lip movement sequence frame by frame, assigning the 52 hybrid deformation weights in each frame to the facial controller to synchronize the shape of the digital human's lips with the target lip movement sequence. Simultaneously, the engine reads the emotional feature parameters, determines the overall tone of the digital human's performance based on the emotional intensity value and the pitch variation rate, selects facial expression change patterns and body posture change patterns that match the emotional tone from preset animation resources, converts the expression change patterns into incremental sequences of hybrid deformation weights, and converts the body posture change patterns into incremental sequences of skeletal rotation angles. These incremental sequences are then fused temporally with the lip movement driving sequence to generate a complete digital human animation including lip shape, expression, and movement. Finally, the digital human animation is aligned with the speech content to create the AI digital human.
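One simple reading of the temporal fusion in [0034] is per-frame addition of the expression increments to the lip-driven weights, with the skeletal increments carried alongside. The sketch below assumes that all three sequences already share the same frame count, that fused weights are clamped to the range 0 to 1, and that bone increments are keyed by hypothetical bone names; none of these details is specified in the application.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))

def fuse_animation(lip_frames, expression_deltas, bone_deltas):
    """Fuse per-frame lip weights with expression and body increments.

    lip_frames:        list of 52-element weight lists (target lip movement sequence)
    expression_deltas: list of 52-element increment lists, one per frame
    bone_deltas:       list of {bone_name: rotation increment in degrees}, one per frame
    """
    if not (len(lip_frames) == len(expression_deltas) == len(bone_deltas)):
        raise ValueError("all sequences must have the same number of frames")
    animation = []
    for lip, expr, bones in zip(lip_frames, expression_deltas, bone_deltas):
        # Assumption: increments are additive and the result is clamped to [0, 1].
        weights = [clamp01(w + d) for w, d in zip(lip, expr)]
        animation.append({
            "blendshape_weights": weights,
            "bone_rotation_deltas": dict(bones),
        })
    return animation

lip = [[0.2] * 52, [0.2] * 52]
expr = [[0.1] * 52, [0.9] * 52]          # made-up increments
bones = [{"head": 2.0}, {"head": 3.5}]   # hypothetical bone name
animation = fuse_animation(lip, expr, bones)
```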

[0035] This embodiment determines the digital human image and voice content based on the digital human creation instructions input by the user; extracts features from the voice content to obtain speech spectrum features and emotional feature parameters; generates an initial lip movement sequence based on the speech spectrum features, and corrects the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; drives the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters, and creates an AI digital human based on the digital human animation and the voice content. This embodiment extracts emotional feature parameters from the voice content and uses these parameters to correct the initial lip movement sequence. This avoids the problem seen in related technologies, where lip movements fail to respond to changes in tone, speed, and emotion, and enables the rhythm and amplitude of lip movements to adaptively match the emotional tone of the speech. Furthermore, this embodiment uses the corrected target lip movement sequence and the emotional feature parameters to jointly drive the digital human image. This avoids the stiff movements and limited expressiveness caused by a disconnect between the overall performance of the digital human and the emotional tone of the speech. Consequently, the generated AI digital human can adaptively follow the emotional fluctuations of the speech in both lip movement dynamics and overall performance, enhancing the realism and vividness of the digital human video.

[0036] Refer to Figure 2, which is a flowchart illustrating the second embodiment of the AI digital human intelligent creation management method of this application.

[0037] In one feasible implementation, step S30 may include:

Step S301: Input the speech spectrum features into a preset model, and use the model output as the initial lip movement sequence. The preset model is used to output the corresponding lip movement sequence based on the input speech spectrum features.

[0038] The preset model is obtained after training based on the following steps: Acquire sample pairing data, which includes several speech spectrum features and corresponding lip-shape action sequences; construct an end-to-end model consisting of a convolutional neural network, a long short-term memory network, and a fully connected layer, using the several speech spectrum features as model inputs and the several lip-shape action sequences as labels, and train the end-to-end model with mean squared error as the loss function to obtain the preset model.

[0039] It should be noted that the above end-to-end model consists of a convolutional neural network, a long short-term memory network, and a fully connected layer. The convolutional neural network layer performs convolution operations on the Mel-spectrum matrix to extract the spatial structure features of the speech spectrum across different frequency channels. The long short-term memory network receives the feature sequence output by the convolutional neural network, processes it frame by frame, and retains the temporal context information of the speech, enabling lip-shape prediction in the current frame to reference the speech features of previous and subsequent frames. The fully connected layer maps the temporal feature vector output by the long short-term memory network into 52 hybrid deformation weight values, each weight value corresponding to a control parameter of the digital face.

[0040] Specifically, the training process for the end-to-end model is as follows: The convolutional neural network (CNN) layer receives the Mel-spectrum feature matrix as input and extracts local spatial features of the speech spectrum in the frequency and time dimensions through multiple convolutional and pooling operations, outputting a sequence of feature maps. The long short-term memory (LSTM) network layer receives the feature map sequence output by the CNN layer and processes it frame by frame using memory units and gating mechanisms to capture the long-range dependencies of speech features on the time axis, outputting a sequence of feature vectors incorporating temporal information. The fully connected layer receives the feature vectors output by the LSTM layer and maps them into 52 hybrid deformation weight values through linear transformation and non-linear activation functions. The entire model uses mean squared error as the loss function. During training, the model predicts a sequence of hybrid deformation weights frame by frame from the input Mel-spectrum features. The loss function calculates the error between the predicted weight sequence and the labeled weight sequence, and the weight parameters in the model are updated through backpropagation, gradually decreasing the loss function value. After multiple rounds of iterative training, the model converges to yield the preset model. In the application phase, the system inputs the speech Mel-spectrum features extracted in real time into the trained preset model. The forward propagation process of the preset model is the same as in the training phase: the convolutional neural network layer extracts spatial features, the long short-term memory network layer captures temporal dependencies, and the fully connected layer outputs a hybrid deformation weight sequence. The model output is directly used as the initial lip movement sequence for subsequent emotion correction processing.
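The forward pass and loss described in [0040] can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions rather than the trained preset model: a single 1-D convolution along time (treating the Mel bins as input channels) replaces the multi-layer convolution and pooling stack, the LSTM is a single unidirectional layer, a sigmoid is assumed as the output activation, all layer sizes and random parameters are made up, and no backpropagation is performed. Actual training would normally rely on an automatic differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, C, H, OUT = 80, 20, 16, 32, 52   # Mel bins, frames, conv channels, LSTM units, weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_time(mel, kernels, bias):
    """1-D convolution along time with zero 'same' padding, followed by ReLU.
    mel: (F, T); kernels: (C, F, K); returns (T, C)."""
    n_channels, _, k = kernels.shape
    pad = k // 2
    padded = np.pad(mel, ((0, 0), (pad, pad)))
    out = np.empty((mel.shape[1], n_channels))
    for t in range(mel.shape[1]):
        window = padded[:, t:t + k]                       # (F, K)
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)

def lstm(seq, W, U, b):
    """Single-layer unidirectional LSTM. seq: (T, C); returns hidden states (T, H)."""
    n_hidden = U.shape[1]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in seq:
        z = W @ x + U @ h + b                             # (4H,)
        i, f, g, o = np.split(z, 4)                       # input, forget, cell, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

# Randomly initialized parameters (untrained, for shape illustration only).
kernels, conv_b = rng.normal(0, 0.05, (C, F, 3)), np.zeros(C)
W, U, b = rng.normal(0, 0.1, (4 * H, C)), rng.normal(0, 0.1, (4 * H, H)), np.zeros(4 * H)
Wo, bo = rng.normal(0, 0.1, (OUT, H)), np.zeros(OUT)

mel = rng.normal(size=(F, T))              # stand-in Mel-spectrum features
labels = rng.uniform(0.0, 1.0, (T, OUT))   # stand-in labeled weight sequence

features = conv1d_time(mel, kernels, conv_b)    # (T, C)
hidden = lstm(features, W, U, b)                # (T, H)
pred = sigmoid(hidden @ Wo.T + bo)              # (T, 52) predicted weights
loss = float(np.mean((pred - labels) ** 2))     # mean squared error
```

Each row of `pred` is one frame of 52 predicted weights, so the output has the same frame-level layout as the initial lip movement sequence.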

[0041] Step S302: Using the emotional intensity, speech rate and tone change rate in the emotional feature parameters as input variables, perform fuzzy inference on the input variables according to the preset fuzzy rule base to obtain the fuzzy membership degree of the lip movement adjustment coefficient.

[0042] It should be noted that the aforementioned emotional intensity refers to a numerical indicator used to quantify the intensity of emotion in speech; the aforementioned speech rate refers to the number of words or syllables pronounced per unit time, used to quantify the pace of speech; and the aforementioned tone change rate (that is, the rate of pitch variation) refers to a numerical indicator used to quantify the intensity of pitch fluctuations in speech. The aforementioned preset fuzzy rule base refers to a pre-established knowledge base containing multiple fuzzy conditional statements. Each rule in the preset fuzzy rule base can take the form "if…then…", associating the fuzzy states of the input variables with the fuzzy states of the output variables, so as to simulate expert experience in reasoning and decision-making. For example, a fuzzy rule could be: "If the emotional intensity is high, the speech rate is fast, and the tone change rate is large, then the lip opening / closing amplitude adjustment coefficient is high, and the lip movement rhythm adjustment coefficient is high."

[0043] It should be understood that the aforementioned lip movement adjustment coefficients refer to multiplicative factors used to correct the mixed deformation weights in the initial lip movement sequence. The lip movement adjustment coefficients include an opening / closing amplitude adjustment coefficient and a movement rhythm adjustment coefficient, typically ranging from 0.8 to 1.5. For example, an opening / closing amplitude adjustment coefficient of 1.2 indicates that the mixed deformation weights related to lip opening / closing are amplified by 20%; a movement rhythm adjustment coefficient of 0.9 indicates that the temporal rhythm of the lip movements is compressed by 10%. The aforementioned fuzzy membership degree refers to a numerical value used in fuzzy set theory to describe the degree to which a precise value belongs to a certain fuzzy set. The fuzzy membership degree ranges from 0 to 1, where 0 indicates no membership at all and 1 indicates full membership. For example, when the emotional intensity is 0.85, the membership degree for the "high" fuzzy set is 0.8, for the "medium" fuzzy set is 0.2, and for the "low" fuzzy set is 0.

[0044] Step S303: Defuzzify the fuzzy membership degree to obtain the lip movement adjustment coefficient.

[0045] Step S304: Correct the initial lip movement sequence using the lip movement adjustment coefficient to obtain the target lip movement sequence.

[0046] In one feasible implementation, the system extracts three values—emotional intensity, speech rate, and tone change rate—from the emotional feature parameters as input variables for fuzzy logic control. These three input variables are then fuzzified; that is, the membership degree of each input variable to each fuzzy set is calculated according to a preset triangular membership function. Specifically, emotional intensity is mapped to three fuzzy sets: "low," "medium," and "high"; speech rate is mapped to three fuzzy sets: "slow," "medium," and "fast"; and tone change rate is mapped to three fuzzy sets: "small," "medium," and "large." After fuzzification, each input variable corresponds to a set of membership values. For example, when the emotional intensity is 0.85, the membership degree to the "high" set is 0.8, and the membership degree to the "medium" set is 0.2.
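
The fuzzification step can be sketched as follows. This is a minimal illustration, assuming the emotional intensity is normalized to [0, 1] and assuming evenly spaced triangle breakpoints; neither the normalization nor the breakpoints are given in the application. With these assumed breakpoints the memberships for 0.85 come out differently from the 0.8 / 0.2 example in the text, which would require differently placed triangles.

```python
from typing import Dict, Tuple


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet at a and c and peak at b.

    Setting a == b or b == c gives a left or right shoulder.
    """
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical breakpoints for emotional intensity on a 0..1 scale.
INTENSITY_SETS: Dict[str, Tuple[float, float, float]] = {
    "low": (0.0, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.0),
}


def fuzzify(x: float, sets: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    """Return the membership degree of x in every fuzzy set."""
    return {name: triangular(x, *abc) for name, abc in sets.items()}


memberships = fuzzify(0.85, INTENSITY_SETS)
```

Speech rate and tone change rate would be fuzzified the same way, each with its own three sets and its own breakpoints.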

[0047] Then, fuzzy inference is performed based on the preset fuzzy rule base. This base contains 27 fuzzy rules, covering all combinations of the fuzzy sets of the three input variables (3 × 3 × 3). Each rule takes the form "If the emotional intensity is a certain fuzzy set, the speech rate is a certain fuzzy set, and the tone change rate is a certain fuzzy set, then the lip opening / closing amplitude adjustment coefficient is a certain fuzzy set, and the lip movement rhythm adjustment coefficient is a certain fuzzy set." For a given set of input membership degrees, the system matches all rules, calculates the trigger strength of each rule (usually the minimum of the input membership degrees in its condition), and applies the trigger strength to the output fuzzy set in the rule's conclusion, obtaining the fuzzy membership degree distribution of the output variables. For example, when the input triggers the rule "high emotional intensity, fast speech rate, and large tone change rate," the output fuzzy sets "high lip opening / closing amplitude adjustment coefficient" and "high rhythm adjustment coefficient" are activated.
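
The inference step can be sketched as below for a single output variable. The 27 rule conclusions are not listed in the application, so build_rule_base fills them with a placeholder mapping (the output level follows the average input level); that mapping, the generic low / medium / high labels, and the use of max to aggregate rules that share an output set are all assumptions. Only the min-based trigger strength comes from the text.

```python
from itertools import product
from typing import Dict, Tuple

# Generic labels standing in for slow/medium/fast and small/medium/large.
LEVELS = ("low", "medium", "high")


def build_rule_base() -> Dict[Tuple[str, str, str], str]:
    """Build 3 x 3 x 3 = 27 rules with placeholder conclusions."""
    rules = {}
    for combo in product(range(3), repeat=3):
        out_index = round(sum(combo) / 3)  # placeholder consequent, not from the source
        rules[tuple(LEVELS[i] for i in combo)] = LEVELS[out_index]
    return rules


def infer(intensity: Dict[str, float], rate: Dict[str, float],
          tone: Dict[str, float],
          rules: Dict[Tuple[str, str, str], str]) -> Dict[str, float]:
    """Trigger strength = min of the input memberships; aggregate with max."""
    activation = {level: 0.0 for level in LEVELS}
    for (i, r, t), out in rules.items():
        strength = min(intensity[i], rate[r], tone[t])
        activation[out] = max(activation[out], strength)
    return activation


rules = build_rule_base()
activation = infer(
    {"low": 0.0, "medium": 0.2, "high": 0.8},   # emotional intensity
    {"low": 0.0, "medium": 0.4, "high": 0.6},   # speech rate
    {"low": 0.0, "medium": 0.3, "high": 0.7},   # tone change rate
    rules,
)
```

The same procedure would be run a second time with the rule conclusions for the rhythm adjustment coefficient.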

[0048] Next, the centroid method is used to calculate the precise value of the output variable: the abscissa value corresponding to the centroid of the area under the fuzzy membership function curve is taken as the defuzzification result. For the opening / closing amplitude adjustment coefficient, whose value ranges from 0.8 to 1.5, the centroid method computes the adjustment coefficient as the integral of the abscissa multiplied by the membership function, divided by the integral of the membership function, with both integrals taken over that range. After defuzzification, the lip movement adjustment coefficient, including the opening / closing amplitude adjustment coefficient and the rhythm adjustment coefficient, is obtained.
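
A numerical version of the centroid calculation might look like the following sketch. The output fuzzy sets, the clipping of each set at its activation level, the midpoint-rule integration, and the neutral fallback value of 1.0 are assumptions; only the 0.8 to 1.5 range and the ratio of the two integrals come from the text.

```python
from typing import Dict, Tuple

COEFF_MIN, COEFF_MAX = 0.8, 1.5  # value range given in the text

# Hypothetical output sets over the coefficient range, as (a, b, c) triangles.
OUTPUT_SETS: Dict[str, Tuple[float, float, float]] = {
    "low": (0.8, 0.8, 1.15),
    "medium": (0.8, 1.15, 1.5),
    "high": (1.15, 1.5, 1.5),
}


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet at a and c and peak at b."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def centroid_defuzzify(activation: Dict[str, float], steps: int = 700) -> float:
    """Approximate integral(x * mu(x)) / integral(mu(x)) with a midpoint sum."""
    dx = (COEFF_MAX - COEFF_MIN) / steps
    numerator = denominator = 0.0
    for k in range(steps):
        x = COEFF_MIN + (k + 0.5) * dx
        # Aggregated membership: each output set clipped at its activation level.
        mu = max(min(activation.get(name, 0.0), triangular(x, *abc))
                 for name, abc in OUTPUT_SETS.items())
        numerator += x * mu * dx
        denominator += mu * dx
    if denominator == 0.0:
        return 1.0  # assumption: no rule fired, so leave the sequence unscaled
    return numerator / denominator


coefficient = centroid_defuzzify({"low": 0.0, "medium": 0.3, "high": 0.6})
neutral = centroid_defuzzify({"medium": 1.0})
```

When only the symmetric "medium" set is active the result sits at the middle of the range, and stronger activation of "high" pulls the coefficient toward 1.5.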

[0049] Finally, the system corrects the initial lip movement sequence using the lip movement adjustment coefficients. The system identifies the 18 weights related to lip movements in the initial lip movement sequence, primarily the weights corresponding to lip opening and closing, lip bead movement, and changes at the corners of the mouth. The system multiplies these weights by the opening / closing amplitude adjustment coefficient and linearly scales the time axis of the weight sequence according to the rhythm adjustment coefficient, so that the amplitude and rhythm of the lip movements match the emotional characteristics of the speech. Mixed deformation weights not related to lip movements keep their original values. After this frame-by-frame processing, the system obtains the corrected target lip movement sequence.
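
One possible reading of this correction step is sketched below. It assumes that a rhythm coefficient below 1 makes the lip motion play faster within the same number of frames (holding the last value once the track is exhausted), and that weights are clamped to [0, 1]; both are interpretations, not statements from the application. The toy frames hold two weights rather than 52, and lip_indices is a hypothetical stand-in for the 18 lip-related weights.

```python
from typing import List, Sequence


def sample(track: Sequence[float], pos: float) -> float:
    """Linearly interpolate a per-frame track at a fractional frame position."""
    pos = max(0.0, min(pos, len(track) - 1))
    lo = int(pos)
    hi = min(lo + 1, len(track) - 1)
    frac = pos - lo
    return track[lo] * (1.0 - frac) + track[hi] * frac


def correct_sequence(frames: List[List[float]], lip_indices: Sequence[int],
                     amplitude: float, rhythm: float) -> List[List[float]]:
    """Scale lip weights by amplitude and rescale their time axis by rhythm."""
    if rhythm <= 0:
        raise ValueError("rhythm coefficient must be positive")
    corrected = [list(frame) for frame in frames]  # non-lip weights are kept as-is
    for idx in lip_indices:
        track = [frame[idx] * amplitude for frame in frames]
        for t in range(len(frames)):
            value = sample(track, t / rhythm)
            corrected[t][idx] = min(1.0, max(0.0, value))  # assumed clamp to [0, 1]
    return corrected


# Toy sequence: weight 0 is lip-related, weight 1 is not.
frames = [[0.0, 0.5], [0.2, 0.5], [0.4, 0.5], [0.6, 0.5], [0.8, 0.5]]
amplified = correct_sequence(frames, lip_indices=[0], amplitude=1.2, rhythm=1.0)
faster = correct_sequence(frames, lip_indices=[0], amplitude=1.2, rhythm=0.5)
```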

[0050] This embodiment inputs the speech spectrum features into a preset model and uses the model output as the initial lip-shape sequence. The preset model is used to output the corresponding lip-shape sequence based on the input speech spectrum features. The preset model is obtained after training based on the following steps: acquiring sample pairing data, which includes several speech spectrum features and corresponding lip-shape sequences; constructing an end-to-end model composed of a convolutional neural network, a long short-term memory network, and a fully connected layer; using the several speech spectrum features as model input and the several lip-shape sequences as labels, training the end-to-end model with mean squared error as the loss function to obtain the preset model; using the emotion intensity, speech rate, and pitch change rate among the emotion feature parameters as input variables, performing fuzzy inference on the input variables according to a preset fuzzy rule base to obtain the fuzzy membership degree of the lip-shape adjustment coefficient; defuzzifying the fuzzy membership degree to obtain the lip-shape adjustment coefficient; and using the lip-shape adjustment coefficient to correct the initial lip-shape sequence to obtain the target lip-shape sequence. In this embodiment, the above-described method constructs an end-to-end model composed of a convolutional neural network, a long short-term memory network, and a fully connected layer. The model is trained using mean squared error as the loss function to obtain a preset model, thereby enabling the accurate generation of a temporally matched initial lip movement sequence using speech spectral features. Based on this, by using emotional intensity, speech rate, and pitch change rate as input variables, lip movement adjustment coefficients are obtained through fuzzy rule reasoning and the initial lip movement sequence is corrected. 
This avoids the problem in related technologies where lip movements cannot respond to changes in speech rate, pitch, and emotion. The opening and closing amplitude and rhythm of lip movements can be adaptively adjusted according to the emotional fluctuations of speech, thereby improving the vividness and realism of digital human lip performance.

[0051] Referring to Figure 3, Figure 3 is a flowchart of the third embodiment of the AI digital human intelligent creation management method of this application.

[0052] In one feasible implementation, step S40 may include: Step S401: In the preset expression library and preset action library, match the target expression resource and target action resource corresponding to the emotional feature parameters.

[0053] It should be noted that the aforementioned preset expression library refers to a pre-built collection of expression resources that stores various facial expression data. Each expression resource in the preset expression library exists in the form of a mixed deformation weight sequence or an expression animation clip, used to control the muscle movements of the digital human's face to express different emotional states. The aforementioned preset motion library refers to a pre-built collection of motion resources that stores various limb motion data. Each motion resource in the preset motion library exists in the form of a skeletal animation sequence, used to control the movement posture of the digital human's body.

[0054] Step S402: Drive the digital human image to generate digital human animation based on the target lip movement sequence, the target facial expression resource, and the target motion resource.

[0055] In one implementation, for each frame of the digital human animation, the engine reads the blending deformation weights of the lip region from the target lip movement sequence and the blending deformation weights of the eyebrow and eye region from the target facial expression resource. These weights are superimposed and assigned to the facial controller, so that the digital human's lip shape and eyebrow / eye expression are presented together in the same frame. At the same time, the engine reads the bone rotation angle sequence from the target motion resource and assigns it frame by frame to the skeletal system of the digital human model, so that the digital human's limb movements and facial expressions are synchronized in time. Because the target lip movement sequence, target facial expression resource, and target motion resource may differ in duration, the engine adjusts them using linear interpolation or loop adaptation so that the start and end times of each motion channel are aligned. After this frame-by-frame processing, a complete digital human animation is generated.
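
The alignment and superposition described above can be sketched as follows. Frames are represented as dictionaries from channel name to weight; the channel names jawOpen and browUp, the additive superposition, and the clamp at 1.0 are assumptions for illustration only.

```python
from typing import Dict, List

Frame = Dict[str, float]


def loop_to_length(frames: List[Frame], length: int) -> List[Frame]:
    """Loop adaptation: repeat a clip until it covers `length` frames."""
    if not frames:
        raise ValueError("clip must contain at least one frame")
    return [dict(frames[t % len(frames)]) for t in range(length)]


def stretch_to_length(frames: List[Frame], length: int) -> List[Frame]:
    """Linear interpolation: resample a clip so that it spans `length` frames."""
    if not frames:
        raise ValueError("clip must contain at least one frame")
    if length <= 1 or len(frames) == 1:
        return [dict(frames[0]) for _ in range(max(length, 0))]
    scale = (len(frames) - 1) / (length - 1)
    out = []
    for t in range(length):
        pos = t * scale
        lo = min(int(pos), len(frames) - 1)
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        out.append({k: frames[lo][k] * (1 - frac) + frames[hi][k] * frac
                    for k in frames[lo]})
    return out


def blend(lip_frames: List[Frame], expr_frames: List[Frame]) -> List[Frame]:
    """Superimpose lip and expression weights frame by frame."""
    animation = []
    for lip, expr in zip(lip_frames, expr_frames):
        merged = dict(expr)
        for name, weight in lip.items():
            merged[name] = min(1.0, merged.get(name, 0.0) + weight)  # assumed clamp
        animation.append(merged)
    return animation


lip = [{"jawOpen": 0.0}, {"jawOpen": 0.2}, {"jawOpen": 0.4}, {"jawOpen": 0.6}]
expression = [{"browUp": 0.0}, {"browUp": 1.0}]  # shorter clip

stretched = stretch_to_length(expression, len(lip))
looped = loop_to_length(expression, len(lip))
animation = blend(lip, stretched)
```

Stretching suits one-shot expressions that should span the whole utterance, while looping suits cyclic motions such as idle gestures; which one applies to a given resource is not specified in the text.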

[0056] Step S403: Decompose the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation into mutually independent sub-tasks.

[0057] It should be noted that the aforementioned digital human rendering task involves calculating the geometric shape, texture mapping, lighting effects, and animation weights of the digital human image frame by frame, and finally outputting a sequence of digital human frames with transparency information; the aforementioned background rendering task involves performing corresponding rendering processing according to the type of background material (static image, dynamic video, or 3D virtual scene), and finally outputting a sequence of background frames; the aforementioned special effects rendering task involves calculating the special effects elements frame by frame according to their type, position, display duration, and animation effects, and finally outputting a sequence of special effects frames with transparency information.

[0058] Step S404: Distribute the subtasks to multiple distributed computing nodes for parallel rendering, generating digital human frame sequences, background frame sequences, and special effects frame sequences respectively.

[0059] Step S405: Perform layer fusion on the digital human frame sequence, the background frame sequence, and the special effects frame sequence to obtain a fused sequence.

[0060] Step S406: Perform time-series synchronization and encoding encapsulation on the fused sequence and the voice content to generate an AI digital human.

[0061] In one feasible implementation, the decomposed subtasks are distributed across multiple distributed computing nodes for parallel rendering. Among these nodes, the master node is responsible for task scheduling and result collection, while the slave nodes perform the actual rendering calculations. The master node can then perform pixel-level fusion frame by frame according to a preset layer stacking order (background layer at the bottom, digital human layer in the middle, and special effects layer at the top). Finally, the fused sequence and the voice content are time-synchronized, encoded, and encapsulated to generate the AI digital human.
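
The scheduling and fusion flow can be sketched on a single machine as below, with a thread pool standing in for the distributed slave nodes and the calling code playing the role of the master node. The render_layer placeholder, the flat-colour frames, and the use of the standard "over" compositing operator with straight alpha are assumptions; the application only specifies the layer order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Pixel = Tuple[float, float, float, float]  # RGBA in [0, 1], straight alpha
Frame = List[Pixel]                        # a frame flattened to a pixel list

LAYER_ORDER = ["background", "digital_human", "effects"]  # bottom to top


def render_layer(name: str, num_frames: int, num_pixels: int) -> List[Frame]:
    """Placeholder renderer: returns flat-colour frames for the named layer."""
    colours = {
        "background": (0.0, 0.0, 1.0, 1.0),     # opaque blue
        "digital_human": (1.0, 0.0, 0.0, 0.5),  # half-transparent red
        "effects": (0.0, 1.0, 0.0, 0.0),        # fully transparent
    }
    return [[colours[name]] * num_pixels for _ in range(num_frames)]


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Composite `top` over `bottom` for straight (non-premultiplied) alpha."""
    ta, ba = top[3], bottom[3]
    out_a = ta + ba * (1 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)
    rgb = tuple((t * ta + b * ba * (1 - ta)) / out_a
                for t, b in zip(top[:3], bottom[:3]))
    return (rgb[0], rgb[1], rgb[2], out_a)


def fuse(layers_bottom_to_top: List[List[Frame]]) -> List[Frame]:
    """Pixel-level fusion, frame by frame, following the stacking order."""
    fused = []
    for frames in zip(*layers_bottom_to_top):
        result = frames[0]
        for layer in frames[1:]:
            result = [over(t, b) for t, b in zip(layer, result)]
        fused.append(result)
    return fused


# "Master": dispatch the three independent subtasks, then collect the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(render_layer, name, 2, 4) for name in LAYER_ORDER]
    layers = [future.result() for future in futures]

fused = fuse(layers)
```

Because the three subtasks share no state, they can run in any order or on different machines; only the fusion step needs all three results for a given frame.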

[0062] In one feasible implementation, prior to step S503, the following may also be included: Step S4a: Determine the hardware computing power score and network quality score based on the user's local hardware configuration and network status, and determine the rendering computing power requirement value based on the user's input creation requirement parameters.

[0063] It should be noted that the above-mentioned user local hardware configuration refers to the physical hardware specifications of the user's terminal device used to run the system, the above-mentioned network status refers to the network connection status between the user's terminal device and the system server, and the above-mentioned creation requirement parameters refer to the video production specification parameters set by the user when initiating the production command.

[0064] Step S4b: Based on the hardware computing power score, the network quality score, and the rendering computing power requirement value, match and determine the rendering route, which includes one of pure local rendering, cloud-edge collaborative rendering, or pure cloud rendering.

[0065] It should be noted that the above-mentioned pure local rendering refers to a rendering route in which all rendering tasks are completed on the user's local terminal device and do not rely on cloud computing power; the above-mentioned cloud-edge collaborative rendering refers to a rendering route in which the rendering task is split into a local execution part and a cloud execution part, which are jointly completed by the local device and the cloud edge node; the above-mentioned pure cloud rendering refers to a rendering route in which all rendering tasks are completed on the distributed computing nodes in the cloud, and the local device is only responsible for sending operation instructions and downloading the finished product.

[0066] In one feasible implementation, after the user initiates a production command, the system automatically calls a detection program to read the hardware information of the user's terminal device, including graphics card model, video memory capacity, number of CPU cores, and RAM capacity, and generates a hardware computing power score according to preset hardware scoring rules. Simultaneously, the system uses a network testing module to detect the user terminal's uplink bandwidth, downlink bandwidth, and network latency, generating a network quality score based on the combined bandwidth and latency data. Then, the system parses the user-input creative requirement parameters, including the user-selected rendering mode (2D or 3D), video resolution, frame rate, and total duration, and calculates the rendering computing power requirement value based on these parameters. Finally, based on the hardware computing power score, network quality score, and rendering computing power requirement value, the system matches and determines the rendering route. 
The matching rules can be as follows: When the hardware computing power score is greater than or equal to the first threshold (e.g., 60) and the rendering computing power requirement is less than or equal to the second threshold (e.g., 50), the local device is deemed capable of independently completing the rendering, and the pure local rendering route is selected; when the hardware computing power score is between the third and first thresholds (e.g., between 30 and 60) and the network quality score is greater than or equal to the fourth threshold (e.g., 50), the local computing power is deemed insufficient to independently complete all tasks, but the network transmission conditions are good, and the cloud-edge collaborative rendering route is selected, allocating computing power-intensive sub-tasks (e.g., 3D digital human animation rendering) to cloud edge nodes, while lightweight sub-tasks (e.g., background rendering) are completed locally; when the hardware computing power score is less than the third threshold (e.g., 30) or the rendering computing power requirement is greater than the fifth threshold (e.g., 80), the local device is deemed unable to handle the rendering tasks, and the pure cloud rendering route is selected, where all rendering tasks are completed by cloud distributed computing nodes, and the local device is only responsible for sending operation instructions and downloading the finished product.
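
The matching rules above can be sketched as a small function using the example threshold values from the text. The rules as stated overlap in some cases and leave others uncovered (for example, a hardware score of 70 with a requirement value of 60), so the evaluation order, the boundary handling for "between", and the fallback to pure cloud rendering are assumptions rather than part of the application.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "pure local rendering"
    COLLABORATIVE = "cloud-edge collaborative rendering"
    CLOUD = "pure cloud rendering"


# Example threshold values quoted in the text.
FIRST_THRESHOLD = 60   # minimum hardware score for local rendering
SECOND_THRESHOLD = 50  # maximum requirement value for local rendering
THIRD_THRESHOLD = 30   # minimum hardware score for collaborative rendering
FOURTH_THRESHOLD = 50  # minimum network score for collaborative rendering
FIFTH_THRESHOLD = 80   # requirement value above which cloud rendering is used


def select_route(hardware_score: float, network_score: float,
                 requirement: float) -> Route:
    """Apply the matching rules in the order they are listed in the text."""
    if hardware_score >= FIRST_THRESHOLD and requirement <= SECOND_THRESHOLD:
        return Route.LOCAL
    # Assumed boundaries: third threshold inclusive, first threshold exclusive.
    if (THIRD_THRESHOLD <= hardware_score < FIRST_THRESHOLD
            and network_score >= FOURTH_THRESHOLD):
        return Route.COLLABORATIVE
    if hardware_score < THIRD_THRESHOLD or requirement > FIFTH_THRESHOLD:
        return Route.CLOUD
    # Assumption: combinations not covered by any rule fall back to the cloud.
    return Route.CLOUD


strong_device = select_route(hardware_score=70, network_score=10, requirement=40)
mid_device = select_route(hardware_score=45, network_score=60, requirement=70)
weak_device = select_route(hardware_score=20, network_score=90, requirement=10)
uncovered = select_route(hardware_score=70, network_score=90, requirement=60)
```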

[0067] In this embodiment, target expression resources and target action resources corresponding to the emotional feature parameters are matched in a preset expression library and a preset action library; the digital human image is driven to generate a digital human animation based on the target lip movement sequence, the target expression resources, and the target action resources; the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation are decomposed into mutually independent sub-tasks; the sub-tasks are distributed to multiple distributed computing nodes for parallel rendering, generating digital human frame sequences, background frame sequences, and special effects frame sequences respectively; the digital human frame sequences, background frame sequences, and special effects frame sequences are layer-fused to obtain a fused sequence; the fused sequence and the voice content are time-synchronized and encoded to generate an AI digital human; hardware computing power scores and network quality scores are determined based on the user's local hardware configuration and network status, and rendering computing power requirements are determined based on the user's input creation requirements parameters; based on the hardware computing power scores, network quality scores, and rendering computing power requirements, a rendering route is matched and determined, and the rendering route includes one of pure local rendering, cloud-edge collaborative rendering, or pure cloud rendering. In this embodiment, the method matches corresponding facial expression and motion resources based on emotional feature parameters, and uses these to drive the digital human image. 
This avoids the problem of relying solely on lip movements while ignoring the overall facial expressions and body movements that change with emotions, ensuring that the digital human's facial expressions and body postures are consistent with the emotional tone of the speech. Simultaneously, by decomposing the rendering task into independent sub-tasks and distributing them to multiple distributed nodes for parallel rendering, and dynamically matching purely local, cloud-edge collaborative, or purely cloud-based rendering routes based on local hardware configuration, network status, and rendering requirements, the method avoids the problems of low rendering efficiency of a single node or insufficient local computing power leading to generation delays and excessively high hardware requirements. This enables the efficient generation of vivid digital human videos where expressions, movements, and speech emotions are coordinated, effectively solving the technical problems of stiff and monotonous digital human movements in related technologies.

[0068] Furthermore, all user-related data involved in this application (e.g., digital human image, voice content, voice spectrum characteristics, user's local hardware configuration, etc.) were obtained with the user's permission or consent; that is, when this application is applied to specific products or technologies, user permission is required to obtain and process all user-related data, and the processing of all user-related data must comply with the relevant laws, regulations and regulatory standards of the relevant countries and regions.

[0069] For example, referring to Figure 4, Figure 4 is a schematic diagram of the collection of user privacy data in the AI digital human intelligent creation management method of this application. As shown in Figure 4, when it is necessary to obtain a user's voice content, a voice acquisition prompt can be displayed on the user's terminal. After receiving the user's confirmation of the voice acquisition prompt, the terminal can obtain the user's voice content.

[0070] Referring to Figure 5, Figure 5 is a structural block diagram of the first embodiment of the AI digital human intelligent creation management system of this application.

[0071] As shown in Figure 5, the AI digital human intelligent creation management system provided by the embodiments of this application includes: an instruction parsing module 501, used to determine the digital human image and voice content based on the digital human creation instructions input by the user; a feature extraction module 502, used to extract features from the voice content to obtain speech spectrum features and emotional feature parameters; a correction module 503, used to generate an initial lip movement sequence based on the speech spectrum features, and to correct the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; and a digital human creation module 504, used to drive the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters, and to create an AI digital human based on the digital human animation and the voice content.

[0072] This embodiment determines the digital human image and voice content based on the user's input digital human creation instructions; extracts features from the voice content to obtain voice spectrum features and emotional feature parameters; generates an initial lip-sync sequence based on the voice spectrum features, and corrects the initial lip-sync sequence based on the emotional feature parameters to obtain a target lip-sync sequence; drives the digital human image to generate a digital human animation based on the target lip-sync sequence and the emotional feature parameters, and creates an AI digital human based on the digital human animation and the voice content. This embodiment extracts emotional feature parameters from the speech content and uses these parameters to correct the initial lip movement sequence. This avoids the problem of lip movements failing to respond to changes in tone, speed, and emotion, as seen in related technologies. The method enables the rhythm and amplitude of lip movements to adaptively match the emotional tone of the speech. Furthermore, this embodiment uses the corrected target lip movement sequence and emotional feature parameters to jointly drive the digital human image. This avoids the problem of stiff movements and limited expressiveness caused by a disconnect between the overall performance of the digital human and the emotional tone of the speech. Consequently, the generated AI digital human can adaptively follow the emotional fluctuations of the speech in both lip movement dynamics and overall performance, enhancing the realism and vividness of the digital human video.

[0073] Based on the first embodiment of the AI digital human intelligent creation management system described above, a second embodiment of the AI digital human intelligent creation management system of this application is proposed.

[0074] In this embodiment, the correction module 503 is further configured to input the speech spectrum features into a preset model and use the model output as an initial lip-shape sequence. The preset model is configured to output a corresponding lip-shape sequence based on the input speech spectrum features. The preset model is obtained after training based on the following steps: acquiring sample pairing data, which includes several speech spectrum features and several corresponding lip-shape sequences; constructing an end-to-end model composed of a convolutional neural network, a long short-term memory network, and a fully connected layer; using the several speech spectrum features as model input and the several lip-shape sequences as labels; and training the end-to-end model with mean squared error as the loss function to obtain the preset model.

[0075] Furthermore, the correction module 503 is also used to take the emotional intensity, speech rate and tone change rate in the emotional feature parameters as input variables, perform fuzzy inference on the input variables according to the preset fuzzy rule base to obtain the fuzzy membership degree of the lip movement adjustment coefficient; defuzzify the fuzzy membership degree to obtain the lip movement adjustment coefficient; and use the lip movement adjustment coefficient to correct the initial lip movement sequence to obtain the target lip movement sequence.

[0076] Furthermore, the digital human creation module 504 is also used to match target expression resources and target action resources corresponding to the emotional feature parameters in the preset expression library and preset action library; and drive the digital human image to generate digital human animation based on the target lip movement sequence, the target expression resources and the target action resources.

[0077] Furthermore, the digital human creation module 504 is also used to decompose the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation into mutually independent sub-tasks; allocate the sub-tasks to multiple distributed computing nodes for parallel rendering, generating digital human frame sequences, background frame sequences, and special effects frame sequences respectively; perform layer fusion on the digital human frame sequences, the background frame sequences, and the special effects frame sequences to obtain a fused sequence; and perform time synchronization and encoding encapsulation on the fused sequence and the voice content to generate an AI digital human.

[0078] Furthermore, the digital human creation module 504 is also used to determine the hardware computing power score and network quality score based on the user's local hardware configuration and network status, and to determine the rendering computing power requirement value based on the creation requirement parameters input by the user; and to match and determine the rendering route based on the hardware computing power score, the network quality score and the rendering computing power requirement value, wherein the rendering route includes one of pure local rendering, cloud-edge collaborative rendering or pure cloud rendering.

[0079] Other embodiments or specific implementations of the AI digital human intelligent creation management system of this application can be found in the above-described method embodiments, and will not be repeated here.

[0080] This application provides an AI digital human intelligent creation management device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the AI digital human intelligent creation management method in the above embodiment 1.

[0081] Figure 6 shows a structural schematic diagram of an AI digital human intelligent creation management device suitable for implementing embodiments of this application. The AI digital human intelligent creation management device in the embodiments of this application may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The AI digital human intelligent creation management device shown in Figure 6 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0082] As shown in Figure 6, the AI digital human intelligent creation management device may include a processing unit 1001 (e.g., a central processing unit, a graphics processing unit, etc.), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1002 or a program loaded from a storage device 1003 into a random access memory 1004. The random access memory 1004 also stores various programs and data required for the operation of the AI digital human intelligent creation management device. The processing unit 1001, the read-only memory 1002, and the random access memory 1004 are interconnected via a bus 1005. An input / output interface 1006 is also connected to the bus 1005. Typically, the following devices can be connected to the input / output interface 1006: input devices 1007 including, for example, a touchscreen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; output devices 1008 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 1003 including, for example, magnetic tape, hard disk, etc.; and a communication device 1009. The communication device 1009 allows the AI digital human intelligent creation management device to communicate wirelessly or by wire with other devices to exchange data. Although the figure shows an AI digital human intelligent creation management device with various components, it should be understood that it is not required to implement or include all of the components shown; more or fewer components may be implemented instead.

[0083] Specifically, according to the embodiments disclosed in this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 1009, or installed from the storage device 1003, or installed from the read-only memory 1002. When the computer program is executed by the processing unit 1001, it performs the functions defined in the methods of the embodiments disclosed in this application.

[0084] The AI digital human intelligent creation management device provided in this application, employing the AI digital human intelligent creation management method in the above embodiments, can solve the technical problem of how to enable the created AI digital human to adaptively respond to the dynamic changes of lips in speech. Compared with the prior art, the beneficial effects of the AI digital human intelligent creation management device provided in this application are the same as those of the AI digital human intelligent creation management method provided in the above embodiments, and the other technical features of this device are the same as those disclosed in the method embodiments above, and will not be repeated here.

[0085] It should be understood that the various parts disclosed in this application can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more embodiments or examples.

[0086] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

[0087] This application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon, which are used to execute the AI digital human intelligent creation management method in the above embodiments.

[0088] The computer-readable storage medium provided in this application may be, for example, a USB flash drive. More generally, it may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this embodiment, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (Radio Frequency), etc., or any suitable combination thereof.

[0089] The aforementioned computer-readable storage medium may be included in the AI digital human intelligent creation management device; or it may exist independently and not be assembled into the AI digital human intelligent creation management device.

[0090] The aforementioned computer-readable storage medium carries one or more programs that, when executed by the AI digital human intelligent creation management device, cause the device to perform the operations of this application. The computer program code for performing these operations may be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (e.g., via the Internet using an Internet service provider).

[0091] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0092] The modules described in the embodiments of this application can be implemented in software or hardware. The name of a module does not, in itself, limit the functionality of that module.

[0093] The readable storage medium provided in this application is a computer-readable storage medium that stores computer-readable program instructions (i.e., computer programs) for executing the above-described AI digital human intelligent creation management method. This solves the technical problem of how to enable the created AI digital human to adaptively respond to dynamic changes in lip movements during speech. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided in this application are the same as those of the AI digital human intelligent creation management method provided in the above embodiments, and will not be repeated here.

[0094] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the AI digital human intelligent creation management method described above.

[0095] The computer program product provided in this application can solve the technical problem of how to enable the created AI digital human to adaptively respond to dynamic changes in lip movements during speech. Compared with the prior art, the beneficial effects of the computer program product provided in this application are the same as those of the AI digital human intelligent creation management method provided in the above embodiments, and will not be repeated here.

[0096] The above description is only a part of the embodiments of this application and does not limit the scope of protection of this application. All equivalent structural transformations made under the technical concept of this application and using the content of this application specification and drawings, or direct / indirect applications in other related technical fields, are included in the scope of protection of this application.

Claims

1. An AI digital human intelligent creation management method, characterized in that the method includes the following steps: determining a digital human image and speech content based on a digital human creation instruction input by a user; performing feature extraction on the speech content to obtain speech spectrum features and emotional feature parameters; generating an initial lip movement sequence based on the speech spectrum features, and correcting the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; and driving the digital human image based on the target lip movement sequence and the emotional feature parameters to generate a digital human animation, and creating an AI digital human based on the digital human animation and the speech content.

2. The AI digital human intelligent creation management method as described in claim 1, characterized in that the step of generating an initial lip movement sequence based on the speech spectrum features includes: inputting the speech spectrum features into a preset model and using the model output as the initial lip movement sequence, wherein the preset model is used to output the corresponding lip movement sequence based on the input speech spectrum features; and the preset model is obtained by training based on the following steps: acquiring sample pairing data, which includes several speech spectrum features and corresponding lip movement sequences; constructing an end-to-end model consisting of a convolutional neural network, a long short-term memory network, and a fully connected layer; and, using the speech spectrum features as model inputs and the lip movement sequences as labels, training the end-to-end model with mean squared error as the loss function to obtain the preset model.
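
The training objective named in claim 2 can be written out as follows. This is an illustrative formulation only: the symbols (N sample pairs, T frames per sequence, D lip parameters per frame, model parameters θ) are notation assumed here and are not defined in the application.

```latex
% Assumed notation: X^{(n)} is the n-th speech spectrum feature sequence,
% y^{(n)}_t \in \mathbb{R}^{D} is the labelled lip movement vector at frame t.
\hat{Y}^{(n)} = f_\theta\!\left(X^{(n)}\right)
             = \mathrm{FC}\!\left(\mathrm{LSTM}\!\left(\mathrm{CNN}\!\left(X^{(n)}\right)\right)\right),
\qquad
\mathcal{L}_{\mathrm{MSE}}(\theta)
  = \frac{1}{N\,T\,D} \sum_{n=1}^{N} \sum_{t=1}^{T}
    \left\lVert \hat{y}^{(n)}_t - y^{(n)}_t \right\rVert_2^{2}
```

Minimizing this loss over θ yields the preset model; at inference time the output of f_θ for new speech spectrum features serves as the initial lip movement sequence.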

3. The AI digital human intelligent creation management method as described in claim 1, characterized in that the step of correcting the initial lip movement sequence based on the emotional feature parameters to obtain the target lip movement sequence includes: using the emotional intensity, speech rate, and tone change rate among the emotional feature parameters as input variables; performing fuzzy inference on the input variables according to a preset fuzzy rule base to obtain the fuzzy membership degree of a lip movement adjustment coefficient; defuzzifying the fuzzy membership degree to obtain the lip movement adjustment coefficient; and correcting the initial lip movement sequence using the lip movement adjustment coefficient to obtain the target lip movement sequence.
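
The fuzzy correction in claim 3 can be sketched as below. This is a minimal illustration under stated assumptions, not the applicant's implementation: the inputs are assumed to be normalized to [0, 1], the triangular membership sets, the 27-rule base generated by `output_label`, the output centers in `CENTERS`, the height-method defuzzification, and the multiplicative correction in `correct_sequence` are all hypothetical choices made for the example.

```python
from typing import Dict, List

LEVELS = ("low", "mid", "high")
# Assumed triangular fuzzy sets (left foot, peak, right foot) on inputs normalized to [0, 1].
SETS = {"low": (-0.5, 0.0, 0.5), "mid": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}
# Assumed representative values of the lip movement adjustment coefficient.
CENTERS = {"small": 0.8, "medium": 1.0, "large": 1.3}


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership degree of x in a triangle with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def fuzzify(x: float) -> Dict[str, float]:
    """Map a crisp input (clamped to [0, 1]) to membership degrees."""
    x = min(max(x, 0.0), 1.0)
    return {name: triangular(x, *SETS[name]) for name in LEVELS}


def output_label(intensity: str, rate: str, tone: str) -> str:
    """Hypothetical rule base: stronger overall levels call for larger lip movement."""
    score = LEVELS.index(intensity) + LEVELS.index(rate) + LEVELS.index(tone)
    if score <= 1:
        return "small"
    return "medium" if score <= 4 else "large"


def adjustment_coefficient(intensity: float, speech_rate: float, tone_change_rate: float) -> float:
    """Fuzzy inference (min for AND, max to aggregate) followed by defuzzification."""
    mi, mr, mt = fuzzify(intensity), fuzzify(speech_rate), fuzzify(tone_change_rate)
    strength = {label: 0.0 for label in CENTERS}
    for i in LEVELS:
        for r in LEVELS:
            for t in LEVELS:
                fire = min(mi[i], mr[r], mt[t])
                label = output_label(i, r, t)
                strength[label] = max(strength[label], fire)
    total = sum(strength.values())
    if total == 0.0:
        return 1.0  # no rule fired: leave the sequence unchanged
    # Height-method defuzzification: weighted average of the output centers.
    return sum(strength[label] * CENTERS[label] for label in CENTERS) / total


def correct_sequence(sequence: List[List[float]], coefficient: float) -> List[List[float]]:
    """Scale every lip parameter of every frame by the coefficient (assumed correction rule)."""
    return [[value * coefficient for value in frame] for frame in sequence]
```

For example, `correct_sequence(initial, adjustment_coefficient(0.9, 0.6, 0.8))` would enlarge the lip movements of an `initial` sequence for emotionally intense speech, while low values on all three inputs would shrink them.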

4. The AI digital human intelligent creation management method as described in claim 1, characterized in that the step of driving the digital human image to generate a digital human animation based on the target lip movement sequence and the emotional feature parameters includes: matching, in a preset expression library and a preset action library, a target expression resource and a target action resource corresponding to the emotional feature parameters; and driving the digital human image based on the target lip movement sequence, the target expression resource, and the target action resource to generate the digital human animation.
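
One way to picture the library matching in claim 4 is a nearest-prototype lookup. The sketch below assumes that each library entry carries a prototype vector of (emotional intensity, speech rate, tone change rate) normalized to [0, 1] and that matching means choosing the closest entry by Euclidean distance. The resource names, prototype values, and the distance criterion are all hypothetical; the application does not specify how the match is computed.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Params = Tuple[float, float, float]  # (emotional intensity, speech rate, tone change rate)


@dataclass(frozen=True)
class Resource:
    name: str          # hypothetical resource identifier
    prototype: Params  # hypothetical emotional profile the resource suits best


# Hypothetical preset expression library.
EXPRESSION_LIBRARY = [
    Resource("neutral_face", (0.1, 0.5, 0.1)),
    Resource("smile", (0.5, 0.5, 0.4)),
    Resource("excited_face", (0.9, 0.8, 0.8)),
]

# Hypothetical preset action library.
ACTION_LIBRARY = [
    Resource("idle_stance", (0.1, 0.4, 0.1)),
    Resource("open_hand_gesture", (0.5, 0.5, 0.5)),
    Resource("emphatic_gesture", (0.9, 0.8, 0.9)),
]


def match_resource(params: Params, library: Sequence[Resource]) -> Resource:
    """Return the library entry whose prototype is closest to the given parameters."""
    return min(library, key=lambda resource: math.dist(resource.prototype, params))


def match_targets(params: Params) -> Tuple[Resource, Resource]:
    """Pick the target expression resource and target action resource together."""
    return match_resource(params, EXPRESSION_LIBRARY), match_resource(params, ACTION_LIBRARY)
```

The pair returned by `match_targets` would then be combined with the target lip movement sequence to drive the digital human image.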

5. The AI digital human intelligent creation management method as described in claim 1, characterized in that the step of creating an AI digital human based on the digital human animation and the speech content includes: decomposing the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation into mutually independent sub-tasks; distributing the sub-tasks to multiple distributed computing nodes for parallel rendering to generate a digital human frame sequence, a background frame sequence, and a special effects frame sequence, respectively; layer-blending the digital human frame sequence, the background frame sequence, and the special effects frame sequence to obtain a fused sequence; and time-synchronizing and encoding the fused sequence and the speech content to generate the AI digital human.
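
The decomposition, parallel rendering, and layer blending in claim 5 can be sketched as follows. This is a toy illustration under stated assumptions: a local thread pool stands in for the distributed computing nodes, each frame is a short list of one-character "pixels" with `None` meaning transparent, and the three render functions are hypothetical placeholders. Time synchronization with the audio and encoding are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

Pixel = Optional[str]          # None marks a transparent pixel
Frame = List[Pixel]
FrameSequence = List[Frame]


def render_background(n_frames: int, width: int) -> FrameSequence:
    """Placeholder background task: every pixel is opaque."""
    return [["B"] * width for _ in range(n_frames)]


def render_human(n_frames: int, width: int) -> FrameSequence:
    """Placeholder digital human task: the figure occupies the middle pixel."""
    return [["H" if x == width // 2 else None for x in range(width)] for _ in range(n_frames)]


def render_effects(n_frames: int, width: int) -> FrameSequence:
    """Placeholder special effects task: an effect at pixel 0 on even frames."""
    return [["E" if x == 0 and i % 2 == 0 else None for x in range(width)] for i in range(n_frames)]


def blend_layers(layers: List[FrameSequence]) -> FrameSequence:
    """Blend layers ordered bottom to top: the topmost non-transparent pixel wins."""
    fused = []
    for frame_layers in zip(*layers):
        frame = []
        for pixels in zip(*frame_layers):
            top = None
            for pixel in pixels:
                if pixel is not None:
                    top = pixel
            frame.append(top)
        fused.append(frame)
    return fused


def render_parallel(n_frames: int, width: int) -> FrameSequence:
    """Run the three independent sub-tasks concurrently, then blend their outputs."""
    tasks = {"background": render_background, "human": render_human, "effects": render_effects}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(task, n_frames, width) for name, task in tasks.items()}
        layers = {name: future.result() for name, future in futures.items()}
    return blend_layers([layers["background"], layers["human"], layers["effects"]])
```

Because the three sub-tasks share no state, they can be dispatched in any order and to any worker; only the blending step needs all three frame sequences to be complete.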

6. The AI digital human intelligent creation management method as described in claim 5, characterized in that, before the step of decomposing the digital human rendering task, background rendering task, and special effects rendering task of the digital human animation into mutually independent sub-tasks, the method further includes: determining a hardware computing power score and a network quality score based on the user's local hardware configuration and network status, and determining a rendering computing power requirement value based on creation requirement parameters input by the user; and matching and determining a rendering route based on the hardware computing power score, the network quality score, and the rendering computing power requirement value, wherein the rendering route is one of pure local rendering, cloud-edge collaborative rendering, or pure cloud rendering.
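
The route matching in claim 6 could take the form of a simple threshold rule, sketched below. The application does not state how the three values are compared, so everything here is an assumption for illustration: the scores and the requirement value are taken to share a common [0, 1] scale, and the `min_network` and `edge_share` thresholds, as well as the fallback to local rendering on a poor network, are hypothetical.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "pure local rendering"
    CLOUD_EDGE = "cloud-edge collaborative rendering"
    CLOUD = "pure cloud rendering"


def select_route(
    hardware_score: float,
    network_score: float,
    demand: float,
    min_network: float = 0.6,  # assumed minimum network quality for offloading
    edge_share: float = 0.5,   # assumed share of the demand the local device must cover
) -> Route:
    """Pick a rendering route from the two scores and the computing power requirement."""
    if hardware_score >= demand:
        return Route.LOCAL          # local hardware is sufficient on its own
    if network_score < min_network:
        return Route.LOCAL          # offloading is unreliable, so fall back to local
    if hardware_score >= edge_share * demand:
        return Route.CLOUD_EDGE     # split the work between the device and the cloud
    return Route.CLOUD              # local hardware is too weak to contribute
```

A practical system would likely weight these inputs rather than apply hard cut-offs; the sketch only shows how three values can deterministically select one of the three routes.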

7. An AI digital human intelligent creation management system, characterized in that the AI digital human intelligent creation management system includes: an instruction parsing module, used to determine a digital human image and speech content based on a digital human creation instruction input by a user; a feature extraction module, used to perform feature extraction on the speech content to obtain speech spectrum features and emotional feature parameters; a correction module, used to generate an initial lip movement sequence based on the speech spectrum features, and to correct the initial lip movement sequence based on the emotional feature parameters to obtain a target lip movement sequence; and a digital human creation module, used to drive the digital human image based on the target lip movement sequence and the emotional feature parameters to generate a digital human animation, and to create an AI digital human based on the digital human animation and the speech content.

8. An AI digital human intelligent creation management device, characterized in that the device includes: a memory, a processor, and an AI digital human intelligent creation management program stored in the memory and executable on the processor, the AI digital human intelligent creation management program being configured to implement the steps of the AI digital human intelligent creation management method as described in any one of claims 1 to 6.

9. A storage medium, characterized in that the storage medium is a computer-readable storage medium, and the storage medium stores an AI digital human intelligent creation management program which, when executed by a processor, implements the steps of the AI digital human intelligent creation management method as described in any one of claims 1 to 6.

10. A computer program product, characterized in that the computer program product includes an AI digital human intelligent creation management program, which, when executed by a processor, implements the steps of the AI digital human intelligent creation management method as described in any one of claims 1 to 6.