A machine learning-based method for synchronously generating digital avatar voice emotions
By constructing an emotion delay modeling framework and introducing an NDDE model, the method addresses the dynamic evolution of voice emotion in digital avatars in the continuous time domain. It synchronously generates the voice output stream, the facial expression parameter sequence, and the emotion expression parameter sequence, thereby improving the continuity and naturalness of emotional expression in digital avatars.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIANGSU ELECTRIC POWER INFORMATION TECH
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-12
Figure CN122201353A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of virtual human emotional expression modeling, and in particular to a method for synchronously generating voice emotions of digital avatars based on machine learning.
Background Technology
[0002] With the development of voice interaction and virtual avatar technologies, voice-driven digital avatar generation is increasingly being applied to scenarios such as intelligent customer service, virtual anchors, and digital human interaction. Existing technologies typically extract features from voice signals and then drive speech synthesis and facial animation generation separately to achieve the voice output and visual representation of digital avatars.
[0003] However, existing methods mostly model on discrete time frames, and their processing of speech emotion often stops at static classification or short-term features, making it difficult to accurately depict the dynamic evolution of emotion in the continuous time domain. When changes in speech emotion are lagged, cumulative, or gradual, existing technologies lack effective means to model the delay in emotional response and the effect of emotional memory. As a result, the emotional expression of the generated digital avatar can change abruptly, lose coherence, or become inconsistent with the speech.
[0004] Furthermore, in existing technologies, voice output, facial expression parameters, and emotional expression parameters are often generated separately by different modules, lacking a unified emotional state driving mechanism. This easily leads to temporal asynchrony and emotional inconsistency between multimodal outputs, thus affecting the realism and naturalness of the digital avatar. In complex voice scenarios, it is particularly difficult to guarantee the precise correspondence between changes in voice prosody and facial expressions and emotional expressions.
[0005] Therefore, how to provide a method for synchronously generating digital avatar voice emotions based on machine learning is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0006] One objective of this invention is to propose a machine learning-based method for synchronously generating digital avatar voice emotions. This invention uses the temporal features of voice emotion as input to construct an emotion delay modeling framework. Within this framework, an NDDE model is introduced to model the evolution of the emotion component state set in the continuous time domain. Through a voice-driven delay generation mechanism and a dual-timescale delay evolution structure, a target emotion state sequence that characterizes the continuous change of emotion over time is generated. This target emotion state sequence serves as a unified driver to synchronously generate the voice output stream, facial expression parameter sequence, and emotion expression parameter sequence, thereby achieving synchronous generation of digital avatar voice emotions. This invention can characterize the delay effect and cumulative characteristics of voice emotion, ensuring the consistency of multimodal output in terms of temporal structure and emotion state, and improving the continuity and naturalness of digital avatar emotional expression.
[0007] A method for synchronously generating digital avatar voice emotions based on machine learning according to an embodiment of the present invention includes the following steps:
Obtain the original speech data corresponding to the digital avatar to be generated, perform data preprocessing, and extract the speech emotion temporal feature representation;
Construct an emotion delay modeling framework based on the speech emotion temporal feature representation, introduce the NDDE model, and perform initialization configuration on the emotion delay modeling framework to generate the framework configuration;
Based on the framework configuration, perform emotion representation mapping on the state variables in the NDDE model to determine the set of state variables representing emotion evolution, perform structural reconstruction to split the set of state variables into an emotional component state set, and configure delayed state channels to generate a multi-channel delayed state structure;
Construct a speech-driven delay generation mechanism based on the multi-channel delayed state structure, calculate a delay time function for each state component in the emotional component state set, and write the resulting set of delay time functions into the multi-channel delayed state structure;
Introduce a dual-timescale delayed evolution structure into the NDDE model, divide the delayed states in the multi-channel delayed state structure into short-term emotional response delayed states and long-term emotional memory delayed states, and perform joint feedback updates on the emotional component state set to generate an emotional evolution state set;
Map the emotional evolution state set to a target emotional state sequence;
Input the target emotional state sequence into the digital avatar generation module, perform synchronous generation processing, and generate the digital avatar voice emotion synchronous generation result.
[0008] Optionally, the extraction of the temporal feature representation of speech emotion includes: The raw speech data is processed with a uniform sampling rate and a uniform quantization bit width to obtain standardized raw speech data. Frame segmentation is performed on the standardized raw speech data to divide it into a continuous speech frame sequence. Perform endpoint detection processing on the speech frame sequence to generate a valid speech frame sequence; Time alignment processing is performed based on the effective speech frame sequence to generate a temporally continuous speech frame representation; Based on the time-continuous speech frame representation, the periodic features of each speech frame are calculated to generate a fundamental frequency sequence, the amplitude distribution of each speech frame is statistically analyzed to generate an energy sequence, the duration change of corresponding phonemes in adjacent speech frames is calculated to generate a speech rate sequence, the phoneme boundaries in the speech frame sequence are located, the duration of each phoneme is calculated, and a phoneme duration sequence is generated. The fundamental frequency sequence, energy sequence, speech rate sequence, and phoneme duration sequence are combined along a unified time axis to generate a temporal feature representation of speech emotion.
[0009] Optionally, the generation of the framework configuration includes: Based on the temporal distribution characteristics and time span information of speech emotion temporal feature representation, an emotion delay modeling framework that carries the processing of emotional state evolution is constructed; In the emotion delay modeling framework, the NDDE model is introduced, and the temporal feature representation of speech emotion is connected to the state variable input channel of the NDDE model as a continuous time driving signal. Based on the start time information represented by the temporal features of speech emotion, the initial state of the NDDE model is configured to generate the initial state configuration. Based on the temporal span information represented by speech emotion temporal features, the range of delay time values for the NDDE model is configured. Write the initial state configuration and the range of delay time values into the emotion delay modeling framework to generate the framework configuration.
[0010] Optionally, the generation of the multi-channel delay state structure includes: Based on the framework configuration, the state variables in the NDDE model are read, and the time-continuous change characteristics and numerical distribution characteristics of the state variables are analyzed to determine the state variables that can carry emotional evolution information and generate a set of state variables for emotional representation. Based on the temporal variation characteristics of speech emotion temporal feature representation, emotion representation mapping processing is performed on the set of state variables; Based on the emotional representation mapping processing results, the set of state variables is structurally reconstructed and split into an emotional component state set, including emotional polarity state components, emotional intensity state components, and emotional change rate state components. For the emotional polarity state component, the emotional intensity state component, and the emotional change rate state component, corresponding delayed state channels are configured respectively. The set of emotional component states configured with delayed state channels is integrated into a multi-channel delayed state structure.
[0011] Optionally, the construction and use of the voice-driven delay generation mechanism includes: Based on the multi-channel delay state structure, a voice-driven delay generation mechanism is constructed, and an input interface for receiving the speech emotion temporal feature representation is set; Through the dynamic mapping structure in the delay generation mechanism, the speech emotion temporal feature representation undergoes feature-to-delay mapping in the continuous time domain to generate a delay time function; For the emotional polarity state component, emotional intensity state component, and emotional change rate state component, independent dynamic mapping paths are constructed in the delay generation mechanism to generate a corresponding delay time function for each state component, forming a set of delay time functions; In the process of generating the set of delay time functions, upper and lower limit constraints are applied to each delay time function; The set of delay time functions is written into the multi-channel delay state structure according to the correspondence between each delay time function and its emotional state component.
[0012] Optionally, the generation of the set of emotional evolution states includes: The delay states in the multi-channel delay state structure are divided according to time scale. The delay state that describes the response to local changes in speech emotion is defined as the short-term emotional response delay state, and the delay state that describes the cumulative change of emotion over time is defined as the long-term emotional memory delay state; The short-term emotional response delay state and the long-term emotional memory delay state are simultaneously incorporated into the state evolution calculation process of the NDDE model, modulating the immediate response to local changes and the cumulative response to long-term changes, respectively; During the state evolution of the NDDE model, joint feedback update processing is performed on the emotional component state set; In the NDDE model, based on the emotional component state set updated by joint feedback, state evolution calculation is performed along the continuous time axis to generate an emotional evolution state set.
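The structure summarized in [0010] to [0012] can be written compactly as a delay differential equation. The form below is only an illustrative reading under stated assumptions, not the formulation claimed in the application: it assumes that NDDE denotes a neural delay differential equation, and the symbols $z_k$ (state component), $u$ (speech emotion temporal feature representation as driving signal), $f_\theta$ (learned right-hand side), $\tau_k^{s}$ and $\tau_k^{l}$ (short-term and long-term delay time functions), and $g_k^{s}$, $g_k^{l}$ (dynamic mapping paths) are all notation introduced here.

```latex
\begin{aligned}
\frac{\mathrm{d}z_k(t)}{\mathrm{d}t}
  &= f_\theta\bigl(z(t),\; z_k(t-\tau_k^{s}(t)),\; z_k(t-\tau_k^{l}(t)),\; u(t)\bigr),
  \qquad k \in \{\text{polarity},\ \text{intensity},\ \text{change rate}\}, \\
\tau_k^{s}(t) &= g_k^{s}\bigl(u(t)\bigr), \qquad
\tau_k^{l}(t) = g_k^{l}\bigl(u(t)\bigr), \qquad
\tau_{\min} \le \tau_k^{s}(t),\ \tau_k^{l}(t) \le \tau_{\max}, \\
z(t) &= z_0 \quad \text{for } t \le t_0 .
\end{aligned}
```

In this reading, the joint feedback update corresponds to both delayed terms entering the same right-hand side, and the upper and lower limit constraints of [0011] correspond to the bounds $\tau_{\min}$ and $\tau_{\max}$.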
[0013] Optionally, the generation of the target emotional state sequence includes: The set of emotional evolution states is sampled in the continuous time domain at a preset time resolution to obtain a set of sampled emotional states along the continuous time axis; State smoothing is performed on the sampled emotional state set; Based on the smoothed sampled emotional state set, time axis alignment is performed to map the sampled emotional state set onto the time axis of the speech emotion temporal feature representation, generating emotional state representations with a consistent time order; The time-aligned emotional state representations are arranged in chronological order to generate the target emotional state sequence.
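The sampling, smoothing, and alignment steps in [0013] can be sketched as follows. This is a minimal illustration for a single state component, assuming linear interpolation for sampling and alignment and a centered moving average for smoothing; the application does not name specific techniques, and all function names and parameters here are hypothetical.

```python
from bisect import bisect_right

def interpolate(times, values, query):
    """Linearly interpolate a trajectory at `query`, clamping outside its range."""
    if query <= times[0]:
        return values[0]
    if query >= times[-1]:
        return values[-1]
    i = bisect_right(times, query)
    w = (query - times[i - 1]) / (times[i] - times[i - 1])
    return (1 - w) * values[i - 1] + w * values[i]

def resample(times, values, resolution):
    """Sample the trajectory on a regular grid with the preset time resolution."""
    count = int(round((times[-1] - times[0]) / resolution)) + 1
    grid = [times[0] + i * resolution for i in range(count)]
    return grid, [interpolate(times, values, t) for t in grid]

def smooth(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def to_target_sequence(times, values, resolution, feature_times, window=3):
    """Sample, smooth, align to the feature time axis, and order by time."""
    grid, sampled = resample(times, values, resolution)
    smoothed = smooth(sampled, window)
    return [(t, interpolate(grid, smoothed, t)) for t in sorted(feature_times)]
```

Calling `to_target_sequence` once per state component, with the same `feature_times`, yields component sequences that share one time axis.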
[0014] Optionally, the generation of the digital avatar voice emotion synchronous generation result includes: Input the target emotional state sequence into the digital avatar generation module, establish a unified driving relationship between the target emotional state sequence and the digital avatar generation sequence, and set the target emotional state sequence as the only emotional control input to drive the generation of the speech output stream, facial expression parameter sequence, and emotional expression parameter sequence; Based on the emotional state values corresponding to each time point in the target emotional state sequence, perform speech emotion modulation generation processing in the digital avatar generation module to generate a speech output stream; Based on the same target emotional state sequence, perform facial expression parameter generation processing to generate a facial expression parameter sequence; Based on the same target emotional state sequence, perform emotional expression parameter generation processing to generate an emotional expression parameter sequence; The speech output stream, facial expression parameter sequence, and emotional expression parameter sequence, all driven by the same target emotional state sequence, are integrated in time synchronization to generate the digital avatar voice emotion synchronous generation result.
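The unified driving relationship in [0014] can be illustrated with a toy sketch in which one target emotional state sequence is the only emotional input to three generators, so the three outputs share timestamps by construction. The output fields (`pitch_scale`, `gain`, `mouth_corner`, `brow_raise`, `label`, `strength`) and the linear mappings are placeholders invented for this example; the application does not specify the generator internals.

```python
def generate_synchronized(target_sequence):
    """Drive three output streams from the same emotional state sequence.

    Each element of target_sequence is a dict with keys
    "t", "polarity" (-1..1), "intensity" (0..1) and "change_rate".
    """
    speech, face, emotion = [], [], []
    for state in target_sequence:
        t = state["t"]
        polarity, intensity = state["polarity"], state["intensity"]
        # Speech emotion modulation (placeholder prosody controls).
        speech.append({
            "t": t,
            "pitch_scale": 1.0 + 0.2 * polarity * intensity,
            "gain": 1.0 + 0.5 * intensity,
        })
        # Facial expression parameters (placeholder controls).
        face.append({
            "t": t,
            "mouth_corner": polarity * intensity,
            "brow_raise": abs(state["change_rate"]),
        })
        # Emotional expression parameters (placeholder label and strength).
        emotion.append({
            "t": t,
            "label": "positive" if polarity >= 0 else "negative",
            "strength": intensity,
        })
    return {"speech": speech, "face": face, "emotion": emotion}
```

Because every stream is built inside the same loop over the same sequence, time synchronization reduces to merging records that carry identical `t` values.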
[0015] The beneficial effects of this invention are: First, by constructing an emotion delay modeling framework and introducing the NDDE model, the evolution process of speech emotion temporal feature representation in the continuous time domain is modeled, so that the emotion state is no longer limited to discrete frames or static results, but can reflect the dynamic characteristics of emotion changing over time. This effectively portrays the lag and gradualness of speech emotion in actual expression, and improves the authenticity and continuity of emotion modeling.
[0016] Secondly, by splitting the state variable into a set of emotional component states and combining a speech-driven delayed generation mechanism with a dual-timescale delayed evolution structure, different emotional components can participate in the state evolution calculation collaboratively at two timescales: short-term response and long-term accumulation. This avoids abrupt changes in emotional states during the change process, thereby enhancing the smoothness and stability of the emotional evolution state set and providing a reliable foundation for the subsequent generation of target emotional state sequences.
[0017] Furthermore, by using the target emotional state sequence as a unified driver, synchronous generation processing is performed on the speech output stream, facial expression parameter sequence, and emotional expression parameter sequence, ensuring that the multimodal generation results are consistent in terms of temporal structure and emotional state. This solves the problem of asynchronous speech, facial expression, and emotional expression in existing technologies, thereby significantly improving the naturalness, consistency, and overall performance of the digital avatar voice emotion synchronous generation result.
Attached Figure Description
[0018] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
Figure 1 is an overall flowchart of the machine learning-based method for synchronously generating digital avatar voice emotions proposed in this invention;
Figure 2 is a schematic diagram of the emotion delay modeling framework and multi-channel delay state structure in this invention;
Figure 3 is a schematic diagram of the dual-timescale delayed evolution structure and synchronous generation mechanism in this invention.
Detailed Implementation
[0019] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0020] Referring to Figures 1-3, a method for synchronously generating digital avatar voice emotions based on machine learning includes the following steps: The original speech data corresponding to the digital avatar to be generated is obtained, and data preprocessing is performed on the original speech data, including framing, endpoint detection, and time alignment. The speech emotion temporal feature representation is extracted, which includes a fundamental frequency sequence, an energy sequence, a speech rate sequence, and a phoneme duration sequence. An emotion delay modeling framework is constructed based on the speech emotion temporal feature representation. The NDDE model is introduced into the emotion delay modeling framework. The framework configuration is generated by performing initialization configuration processing based on the speech emotion temporal feature representation. Based on the framework configuration, the state variables in the NDDE model undergo emotion representation mapping to determine the set of state variables used to represent emotion evolution. The set of state variables then undergoes structural reconstruction, which splits it into an emotional component state set. The emotional component state set includes emotional polarity state components, emotional intensity state components, and emotional change rate state components. Delayed state channels are configured for the emotional component state set to generate a multi-channel delayed state structure. A speech-driven delay generation mechanism is constructed based on the multi-channel delayed state structure. The delay generation mechanism receives the speech emotion temporal feature representation and calculates a delay time function for each state component in the emotional component state set, forming the delay time function set. 
The delay time function set is written into the multi-channel delay state structure so that the delay time function set participates in the state evolution calculation of the NDDE model. In the NDDE model, a dual-timescale delayed evolution structure is introduced, which divides the delayed states in the multi-channel delayed state structure into short-term emotional response delayed states and long-term emotional memory delayed states. Based on the short-term emotional response delayed states and long-term emotional memory delayed states, a joint feedback update is performed on the emotional component state set to generate an emotional evolution state set. Based on the set of emotional evolution states, by performing continuous time sampling, state smoothing and time axis alignment on the set of emotional evolution states, the set of emotional evolution states is mapped into a target emotional state sequence arranged in time order. The target emotional state sequence is used to characterize the emotional change trajectory of the digital image in the continuous time domain. The target emotional state sequence is input into the digital avatar generation module. Based on the target emotional state sequence, the digital avatar's voice output stream, facial expression parameter sequence, and emotional expression parameter sequence are synchronously generated to produce the digital avatar's voice and emotion synchronous generation result.
[0021] In this embodiment, the extraction of the temporal feature representation of speech emotion includes: The raw speech data is processed with a uniform sampling rate and a uniform quantization bit width to eliminate the differences in temporal resolution and amplitude scale of speech data under different acquisition conditions, thus obtaining standardized raw speech data; Frame segmentation processing is performed based on standardized raw speech data. The standardized raw speech data is divided into a continuous speech frame sequence according to a preset frame length and a preset frame shift. Each speech frame carries corresponding timestamp information to represent the positional relationship of the speech on the time axis. Endpoint detection processing is performed on the speech frame sequence. The speech start position and speech end position are determined based on the speech frame energy change and zero crossing rate change. Non-speech intervals are removed to generate a valid speech frame sequence. Time alignment processing is performed based on the effective speech frame sequence, and the timestamps of the effective speech frame sequence are mapped to a unified time axis to generate a time-continuous speech frame representation. The fundamental frequency sequence is extracted based on the time-continuous speech frame representation. By calculating the periodic features of each speech frame, a fundamental frequency sequence arranged in time order is generated. Energy sequences are extracted based on time-continuous speech frame representation. By statistically analyzing the amplitude distribution of each speech frame, an energy sequence arranged in chronological order is generated. Speech rate sequences are extracted based on time-continuous speech frame representation. By calculating the duration changes of corresponding phonemes in adjacent speech frames, speech rate sequences arranged in chronological order are generated. 
Based on the temporally continuous speech frame representation, the phoneme duration sequence is extracted. By locating the phoneme boundaries in the speech frame sequence, the duration corresponding to each phoneme is calculated, and a phoneme duration sequence arranged in chronological order is generated. By combining the fundamental frequency sequence, energy sequence, speech rate sequence, and phoneme duration sequence along a unified time axis, a temporal feature representation of speech emotion is generated, which is used to characterize the emotional changes of the original speech data in the continuous time domain.
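The preprocessing chain in [0021] (framing with timestamps, endpoint detection, time alignment, then per-frame features) can be sketched as below. This is a simplified illustration under several assumptions: endpoint detection uses an energy threshold only (the text also uses the zero crossing rate), the fundamental frequency is estimated by a plain autocorrelation peak search, and the speech rate and phoneme duration sequences are omitted because they require phoneme boundaries from an aligner that is outside this sketch. All names, defaults, and thresholds are hypothetical.

```python
import math

def frame_signal(samples, sample_rate, frame_len, frame_shift):
    """Split samples into overlapping frames, each paired with its start time in seconds."""
    return [
        (start / sample_rate, samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, frame_shift)
    ]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def detect_endpoints(frames, energy_threshold):
    """Keep the frames between the first and last frame whose energy exceeds the threshold."""
    voiced = [i for i, (_, f) in enumerate(frames) if frame_energy(f) > energy_threshold]
    if not voiced:
        return []
    return frames[voiced[0]:voiced[-1] + 1]

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Pick the autocorrelation peak inside the plausible pitch-lag range."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0

def extract_features(samples, sample_rate, frame_len=400, frame_shift=200,
                     energy_threshold=0.1):
    """Return per-frame features on a time axis that starts at the first speech frame."""
    frames = detect_endpoints(
        frame_signal(samples, sample_rate, frame_len, frame_shift), energy_threshold)
    if not frames:
        return []
    t0 = frames[0][0]  # time alignment: shift timestamps to a unified axis
    return [
        {"t": t - t0, "f0": estimate_f0(f, sample_rate), "energy": frame_energy(f)}
        for t, f in frames
    ]

# Demo input: 0.2 s silence, 0.5 s of a 200 Hz tone, 0.2 s silence at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(4000)]
features = extract_features([0.0] * 1600 + tone + [0.0] * 1600, rate)
```

Each record in `features` holds one time point of the combined representation; a fuller version would add `speech_rate` and `phoneme_duration` fields on the same axis.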
[0022] In this embodiment, the generation of the framework configuration includes: Based on the temporal distribution characteristics and time span information of speech emotion temporal feature representation, an emotion delay modeling framework is constructed to carry out emotion state evolution processing, so that the emotion delay modeling framework has the ability to model emotion in continuous time. In the emotion delay modeling framework, the NDDE model is introduced, and the speech emotion temporal feature representation is connected to the state variable input channel of the NDDE model, so that the speech emotion temporal feature representation can participate in the state evolution calculation of the NDDE model as a continuous time driving signal. Based on the start time information of the speech emotion temporal feature representation, the initial state of the NDDE model is configured to generate an initial state configuration corresponding to the start time of the speech emotion temporal feature representation. Based on the time span information represented by the temporal features of speech emotion, the range of delay time values in the NDDE model is configured so that the range of delay time values covers the lag interval of emotional response generated during the process of speech emotion change. The initial state configuration and the range of delay time values are written into the emotion delay modeling framework to generate a framework configuration for subsequent emotion state mapping and state structure reconstruction.
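The framework configuration described in [0022] amounts to deriving an initial state and a delay range from the feature time axis. A minimal sketch follows, assuming a zero initial state and a maximum delay set to a fixed fraction of the time span; the application gives no concrete rule, so `max_delay_fraction`, `tau_min`, and the `FrameworkConfig` fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrameworkConfig:
    t_start: float        # start time of the feature representation
    t_end: float          # end time of the feature representation
    initial_state: tuple  # initial NDDE state at t_start
    tau_min: float        # lower bound of the delay time range
    tau_max: float        # upper bound of the delay time range

def build_framework_config(timestamps, state_dim=3, max_delay_fraction=0.25,
                           tau_min=0.02):
    """Derive the initial state configuration and delay range from the time axis."""
    t_start, t_end = timestamps[0], timestamps[-1]
    span = t_end - t_start
    # Assumption: the largest admissible lag scales with the utterance length.
    tau_max = max(tau_min, span * max_delay_fraction)
    return FrameworkConfig(t_start, t_end, (0.0,) * state_dim, tau_min, tau_max)
```

For a 2-second utterance, `build_framework_config([0.0, 0.5, 1.0, 2.0])` yields a delay range of 0.02 s to 0.5 s under these assumed defaults.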
[0023] In this embodiment, the generation of the multi-channel delay state structure includes: Based on the framework configuration, the state variables used to describe the system evolution in the NDDE model are read, the time continuous change characteristics and numerical distribution characteristics of the state variables are analyzed, the state variables that can be used to carry emotional evolution information are determined, and a set of state variables for emotional representation is generated. Based on the temporal variation characteristics of fundamental frequency sequence, energy sequence, speech rate sequence, and phoneme duration sequence in speech emotion temporal feature representation, emotion representation mapping processing is performed on the set of state variables to establish a correspondence between the set of state variables and speech emotion change patterns, so that the set of state variables can represent the change process of emotion over time. The emotion representation mapping process is based on the changing trend relationship of each feature in the speech emotion temporal feature representation within the continuous time domain. Specifically, for the fundamental frequency sequence, energy sequence, speech rate sequence, and phoneme duration sequence, the direction, amplitude, and rate of change between adjacent time points are analyzed. Using these changing features as the mapping basis, corresponding modulation relationships are applied to the state variables in the NDDE model, ensuring that the direction of change of the state variables' values is consistent with the changing trend of speech emotion, that the amplitude of change of the state variables reflects the strength of emotional expression, and that the rate of change of the state variables reflects the speed of emotional state evolution over time. 
Through emotion representation mapping, the state variables are transformed into state representations that can characterize emotional polarity, emotional intensity, and the rate of emotional change during continuous time evolution, thereby establishing a correspondence between state variables and speech emotion evolution features. Based on the emotional representation mapping processing results, the set of state variables is restructured and split into an emotional component state set according to the emotional representation dimension. The emotional component state set is composed of emotional polarity state components, emotional intensity state components, and emotional change rate state components. For the emotional polarity state component, emotional intensity state component, and emotional change rate state component, corresponding delayed state channels are configured respectively, so that each state component has an independent delayed response capability in the state evolution process of the NDDE model. The set of emotional component states configured with delayed state channels is integrated into a multi-channel delayed state structure, which is used for the subsequent construction of the speech-driven delay generation mechanism and for dual-timescale delay evolution processing.
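One way to picture the multi-channel delayed state structure is as one history buffer per state component, each able to return its own value at a delayed time. The sketch below assumes linear interpolation between recorded samples and a constant value before the first sample; the class and channel names are hypothetical and not taken from the application.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class DelayChannel:
    """History of one emotional state component with delayed lookup."""
    name: str
    times: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def record(self, t, value):
        """Append the component value reached at time t."""
        self.times.append(t)
        self.values.append(value)

    def delayed(self, t, tau):
        """Value at time t - tau, interpolated linearly and clamped at both ends."""
        query = t - tau
        if query <= self.times[0]:
            return self.values[0]
        if query >= self.times[-1]:
            return self.values[-1]
        i = bisect_right(self.times, query)
        w = (query - self.times[i - 1]) / (self.times[i] - self.times[i - 1])
        return (1 - w) * self.values[i - 1] + w * self.values[i]

def build_multichannel_structure():
    """One independent delayed state channel per emotional state component."""
    return {name: DelayChannel(name)
            for name in ("polarity", "intensity", "change_rate")}
```

Keeping the channels separate is what lets each component use its own delay time function during state evolution.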
[0024] In this embodiment, the construction and use of the voice-driven delay generation mechanism includes: Based on the multi-channel delayed state structure, a speech-driven delayed generation mechanism is constructed. In the delayed generation mechanism, an input interface for receiving speech emotion temporal feature representation is set, and a delayed generation correlation relationship is established between the speech emotion temporal feature representation and the emotion component state set. The speech-driven delay generation mechanism specifically refers to a delay generation structure with a dynamic mapping structure embedded in the emotion delay modeling framework as its core. This delay generation structure takes the speech emotion temporal feature representation as the only input and takes the emotion polarity state component, emotion intensity state component, and emotion change rate state component in the emotion component state set as the delay action object. By performing mapping operations on the change features of the speech emotion temporal feature representation in the continuous time domain, it generates a delay time function that corresponds one-to-one with each emotion component state component. The generated delay time function is written into the multi-channel delay state structure in real time, so that the delay time function is dynamically updated with the change of the speech emotion temporal feature representation and directly participates in the state evolution calculation of the NDDE model. The speech emotion temporal feature representation is input into the delay generation mechanism. Through the dynamic mapping structure in the delay generation mechanism, the speech emotion temporal feature representation is processed to perform feature-to-delay mapping in the continuous time domain, generating a delay time function that changes continuously with time. 
The dynamic mapping structure consists of an input feature mapping unit, a time continuity constraint unit, and a delay value constraint unit. Each unit performs a different function in the process of generating the delay time function and works together in a fixed order. The input feature mapping unit is used to receive the speech emotion temporal feature representation, analyze the change information of the fundamental frequency sequence, energy sequence, speech rate sequence and phoneme duration sequence in the continuous time domain, extract the feature quantity that can reflect the change trend of speech emotion, and map the feature quantity into the delay modulation intermediate quantity, so that the change trend in the speech emotion temporal feature representation can be transformed into a mapping result that drives the change of delay time. The time continuity constraint unit is used to perform continuity constraint processing on the delay modulation intermediate quantity output by the input feature mapping unit. By performing continuity verification and transition adjustment on the delay modulation intermediate quantity corresponding to adjacent time points, the generated delay change process is kept smooth and continuous on the time axis, avoiding abrupt changes in delay time between adjacent time points, thereby ensuring the stable evolution of the delay time function in the continuous time domain. The delay value constraint unit is used to impose a range restriction on the delay change result after time continuity constraint processing. By restricting the delay change result to the preset minimum delay time and maximum delay time range, a delay time function that satisfies the upper and lower limit constraints is generated, so that the delay time function is always within the effective delay interval when participating in the state evolution calculation of the NDDE model. 
Through the combined action of the input feature mapping unit, the time continuity constraint unit, and the delay value constraint unit, the speech emotion temporal feature representation is stably mapped into a delay time function that changes continuously with time and satisfies the value constraints. The delay time function is written into the multi-channel delay state structure as the delay control input for the state evolution calculation of the NDDE model. For the emotional polarity state component, emotional intensity state component, and emotional change rate state component in the emotional component state set, independent dynamic mapping paths are constructed in the delay generation mechanism to generate a corresponding delay time function for each state component, forming a set of delay time functions. In the process of generating the set of delay time functions, upper and lower limit constraints are applied to each delay time function so that it changes continuously with time within the range between the preset minimum delay time and the preset maximum delay time, avoiding discontinuous jumps in the delay value. The set of delay time functions is written into the multi-channel delayed state structure according to the correspondence between each delay time function and its state component, so that each state component uses its own delay time function when participating in the state evolution calculation of the NDDE model. The set of delay time functions is dynamically updated during the continuous-time evolution of the NDDE model as the speech emotion temporal features change, thereby completing the construction and operation of the speech-driven delay generation mechanism.
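The three units of the dynamic mapping structure can be sketched as a small pipeline: a feature mapping, a slope limiter, and a clamp. This is an illustration under assumptions that the application does not state: the mapping is a weighted magnitude of feature change passed through `tanh`, faster feature change is assumed to shorten the delay, and the continuity constraint is modeled as a bound on the change of delay per second. The weights, names, and parameters are hypothetical.

```python
import math

def modulation(prev_feat, feat, dt, weights):
    """Input feature mapping unit: weighted magnitude of feature change per second."""
    return sum(w * abs(feat[k] - prev_feat[k]) / dt for k, w in weights.items())

def delay_series(features, times, weights, tau_min, tau_max, max_slope):
    """Delay time function sampled on the feature time axis for one state component."""
    taus = [tau_max]
    for i in range(1, len(features)):
        dt = times[i] - times[i - 1]
        m = modulation(features[i - 1], features[i], dt, weights)
        # Assumption: faster feature change maps to a shorter delay.
        target = tau_max - (tau_max - tau_min) * math.tanh(m)
        # Time continuity constraint unit: bound the delay change per second.
        limit = max_slope * dt
        tau = taus[-1] + max(-limit, min(limit, target - taus[-1]))
        # Delay value constraint unit: keep the delay inside [tau_min, tau_max].
        taus.append(min(tau_max, max(tau_min, tau)))
    return taus

def delay_function_set(features, times, paths, tau_min, tau_max, max_slope):
    """Independent mapping path (its own weights) for each state component."""
    return {name: delay_series(features, times, weights, tau_min, tau_max, max_slope)
            for name, weights in paths.items()}
```

A call such as `delay_function_set(features, times, {"polarity": {"f0": 0.01}, "intensity": {"energy": 1.0}, "change_rate": {"f0": 0.005, "energy": 0.5}}, 0.02, 0.5, 1.0)` returns one bounded, slope-limited delay series per component.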
[0025] In this embodiment, the generation of the set of emotional evolution states includes: Based on the multi-channel delay state structure, the delay states in the multi-channel delay state structure are divided according to the time scale. The delay state used to characterize the local change response of speech emotion is determined as the short-term emotion response delay state, and the delay state used to characterize the cumulative change of emotion over time is determined as the long-term emotion memory delay state. The short-term emotional response delay state and the long-term emotional memory delay state are simultaneously incorporated into the state evolution calculation process of the NDDE model. This allows the short-term emotional response delay state to be used to modulate the immediate response of the emotional component state set to local changes in the speech emotional temporal feature representation, and the long-term emotional memory delay state to be used to modulate the cumulative response of the emotional component state set to long-term changes in the speech emotional temporal feature representation. During the state evolution of the NDDE model, joint feedback update processing is performed on the set of emotional component states, so that the short-term emotional response delay state and the long-term emotional memory delay state participate in feedback modulation together in the same emotional state update process, avoiding the independent action of delay states at each time scale. In the NDDE model, based on the emotional component state set updated by joint feedback, state evolution calculation is performed along the continuous time axis. The evolution results of the emotional component state set in the continuous time domain are collected to generate an emotional evolution state set that changes continuously over time. 
The emotional evolution state set is used to characterize the overall state result of the continuous evolution of emotion over time and is used for the subsequent generation and processing of the target emotional state sequence. The process of collecting the evolution results of the emotional component state set in the continuous time domain specifically includes: on the continuous time axis, taking the state evolution calculation time of the NDDE model as the reference time point, synchronously reading the state values of the emotional polarity state component, emotional intensity state component, and emotional change rate state component at each time point; at the same time point, arranging and storing the state values of each emotional component state component according to the preset emotional state representation structure to form the emotional state representation corresponding to that time point; repeating the above state reading and state organization process sequentially along the continuous time axis for each time point, and storing the emotional state representations formed at each time point in chronological order to obtain the emotional evolution state set covering the continuous time domain.
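The joint participation of the two delay states in a single update can be sketched as follows. This is a hedged illustration only: the right-hand side is a hand-written linear stand-in rather than a trained NDDE vector field, the integration is plain fixed-step Euler with a constant history before the start time, and the names `simulate_dual_delay`, `collect_evolution_states`, `w_short`, and `w_long` are hypothetical.

```python
def simulate_dual_delay(drive, dt, tau_short, tau_long,
                        w_short=0.3, w_long=0.2, x0=0.0):
    """Euler integration of dx/dt = -x(t) + w_short*x(t - tau_short)
    + w_long*x(t - tau_long) + u(t) for one emotional component."""
    n_short = max(1, round(tau_short / dt))  # short delay in steps
    n_long = max(1, round(tau_long / dt))    # long delay in steps
    states = [x0]                            # states[k] is x at time k*dt
    for k, u in enumerate(drive):
        x = states[-1]
        # Before enough history exists, fall back to the initial state.
        x_short = states[k - n_short] if k >= n_short else x0
        x_long = states[k - n_long] if k >= n_long else x0
        # Joint feedback: both delayed states enter the same update.
        dx = -x + w_short * x_short + w_long * x_long + u
        states.append(x + dt * dx)
    return states

def collect_evolution_states(polarity, intensity, change_rate, dt):
    """Read the three components at each time point and store them in
    chronological order as one emotional state representation per point."""
    return [
        {"t": k * dt, "polarity": p, "intensity": i, "change_rate": r}
        for k, (p, i, r) in enumerate(zip(polarity, intensity, change_rate))
    ]

dt = 0.01
drive = [1.0] * 500  # hypothetical constant driving signal over 5 seconds
polarity = simulate_dual_delay(drive, dt, tau_short=0.1, tau_long=1.0)
intensity = simulate_dual_delay(drive, dt, tau_short=0.05, tau_long=2.0)
change_rate = [0.0] + [(b - a) / dt for a, b in zip(intensity, intensity[1:])]
evolution_states = collect_evolution_states(polarity, intensity, change_rate, dt)
```

In this sketch the short delay lets recent states influence the update almost immediately, while the long delay feeds back states from much earlier, which is one simple way to picture the immediate and cumulative responses described above.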
[0026] In this embodiment, the generation of the target emotional state sequence includes: receiving the set of emotional evolution states and performing continuous time sampling on it in the continuous time domain at a preset time resolution to obtain a sampled emotional state set corresponding to the continuous time axis; performing state smoothing on the sampled emotional state set in the continuous time domain so that the emotional state changes at adjacent time points satisfy the continuity constraint and abrupt changes in emotional state on the time axis are suppressed; performing time axis alignment on the smoothed sampled emotional state set to map it onto a time axis consistent with the speech emotion temporal feature representation, generating a temporally consistent emotional state representation; and organizing the temporally consistent emotional state representation sequentially according to the chronological relationship of adjacent time points on the continuous time axis, writing the emotional state corresponding to each time point into the emotional state sequence structure in order, and verifying the consistency of the time interval during writing, thereby generating a target emotional state sequence with complete temporal order constraints in the continuous time domain. The target emotional state sequence serves as the emotional evolution trajectory in the continuous time domain, represents how the emotional state of the digital avatar changes over time, and is used in the subsequent synchronous generation of digital avatar voice emotion.
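The sampling, smoothing, alignment, and interval verification steps can be sketched for a single emotional component as below. The specific techniques are assumptions made for illustration (linear interpolation for sampling and alignment, a centered moving average for smoothing); the embodiment does not name them, and `build_target_sequence` is a hypothetical helper.

```python
from bisect import bisect_right

def interpolate(times, values, t):
    """Linear interpolation at time t, clamped at both ends of the axis."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)  # times[i-1] <= t < times[i]
    w = (t - times[i - 1]) / (times[i] - times[i - 1])
    return values[i - 1] + w * (values[i] - values[i - 1])

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the boundaries."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def build_target_sequence(times, values, resolution, feature_times, window=3):
    # 1. Sample the evolution states at the preset time resolution.
    n = int((times[-1] - times[0]) / resolution + 1e-9) + 1
    sample_times = [times[0] + k * resolution for k in range(n)]
    sampled = [interpolate(times, values, t) for t in sample_times]
    # 2. Smooth to suppress abrupt changes between adjacent time points.
    smoothed = moving_average(sampled, window)
    # 3. Align to the time axis of the speech emotion temporal features.
    aligned = [interpolate(sample_times, smoothed, t) for t in feature_times]
    # 4. Verify that the time interval is consistent before writing.
    gaps = [b - a for a, b in zip(feature_times, feature_times[1:])]
    if any(abs(g - gaps[0]) > 1e-9 for g in gaps):
        raise ValueError("inconsistent time interval on the feature time axis")
    return list(zip(feature_times, aligned))

sequence = build_target_sequence(
    times=[0.0, 0.5, 1.0], values=[0.0, 1.0, 0.0],
    resolution=0.25, feature_times=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

In this example the sharp peak of 1.0 at t = 0.5 is reduced to about 0.67 after smoothing, which illustrates how the smoothing step limits abrupt changes while keeping the overall trajectory.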
[0027] In this embodiment, the generation of the digital avatar voice emotion synchronous generation result includes: The target emotional state sequence is input into the digital avatar generation module. A unified driving relationship between the target emotional state sequence and the digital avatar generation sequence is established in the digital avatar generation module. The target emotional state sequence is set as the only emotional control input driving the generation of the speech output stream, the facial expression parameter sequence, and the emotional expression parameter sequence. Based on the emotional state values corresponding to each time point in the target emotional state sequence, speech emotion modulation generation processing is performed in the digital avatar generation module. According to the target emotional state sequence, the changes in intonation, energy, and speech rate during speech synthesis are controlled in time synchronization to generate a speech output stream whose time structure is consistent with the target emotional state sequence. Based on the same target emotional state sequence, facial expression parameter generation processing is performed in the digital avatar generation module. The emotional state corresponding to each time point in the target emotional state sequence is mapped to a change in the facial expression parameters of the digital avatar, generating a facial expression parameter sequence that is consistent with the speech output stream on the time axis. Based on the same target emotional state sequence, emotional expression parameter generation processing is performed in the digital avatar generation module. The emotional expression parameters of the digital avatar are modeled continuously over time according to the target emotional state sequence, generating an emotional expression parameter sequence whose time structure is consistent with the speech output stream and the facial expression parameter sequence.
The speech output stream, the facial expression parameter sequence, and the emotional expression parameter sequence, all driven by the same target emotional state sequence, are integrated in time synchronization to generate the digital avatar voice emotion synchronous generation result.
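The idea of a single emotional control input driving all three outputs can be sketched as follows. Every mapping and parameter name here (`pitch_scale`, `smile`, `brow_lower`, `valence`, `arousal`, and the numeric coefficients) is a hypothetical placeholder; the embodiment does not specify the concrete mappings. The sketch only shows that all outputs are derived from the same sequence and therefore share one time axis.

```python
def drive_outputs(target_sequence):
    """Derive speech, facial expression, and emotional expression parameters
    from one shared target emotional state sequence."""
    speech, face, expression = [], [], []
    for state in target_sequence:
        t = state["t"]
        p = state["polarity"]   # assumed range [-1, 1]
        i = state["intensity"]  # assumed range [0, 1]
        # Speech emotion modulation: intonation, energy, and speech rate.
        speech.append({
            "t": t,
            "pitch_scale": 1.0 + 0.2 * p * i,
            "energy_scale": 1.0 + 0.3 * i,
            "rate_scale": 1.0 + 0.1 * i,
        })
        # Facial expression parameters (placeholder blend weights).
        face.append({
            "t": t,
            "smile": max(0.0, p) * i,
            "brow_lower": max(0.0, -p) * i,
        })
        # Emotional expression parameters (placeholder representation).
        expression.append({"t": t, "valence": p, "arousal": i})
    return speech, face, expression

target = [
    {"t": 0.0, "polarity": 0.5, "intensity": 1.0},
    {"t": 0.1, "polarity": -0.5, "intensity": 0.5},
]
speech, face, expression = drive_outputs(target)
```

Because each output record copies its timestamp from the same state, integrating the three streams in time synchronization reduces to pairing records with equal `t`.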
[0028] Example 1: To verify the feasibility of this invention in practice, it was applied to a scenario of digital avatar voice-driven emotion expression generation. The study investigated issues such as asynchrony between voice and visual emotions, abrupt emotional changes, and unnatural emotional transitions in long-duration speech during voice broadcasting, virtual explanations, and interactive expressions. In this scenario, the digital avatar needs to output a voice output stream, facial expression parameter sequence, and emotional expression parameter sequence in real time, driven by continuous voice input, consistent with changes in voice emotion. This ensures that auditory and visual emotional expressions remain consistent in temporal structure, thereby improving the naturalness and credibility of the overall performance.
[0029] In this application scenario, the system continuously receives raw speech data generated during digital image broadcasting or interaction. This raw speech data includes a speech stream formed by continuous speech segments. By performing uniform sampling rate processing, time alignment processing, and frame-level analysis on the raw speech data, a temporal feature representation of speech emotion is constructed. This representation covers temporal features directly related to emotional expression, such as fundamental frequency variation, energy variation, speech rate variation, and phoneme duration variation. Based on this temporal feature representation of speech emotion, an emotion delay modeling framework is constructed and an NDDE model is introduced, enabling speech emotion features to drive the evolution of emotional states in the form of continuous time signals. Within the emotion delay modeling framework, emotion representation mapping and structural reconstruction processing are performed on the state variables within the NDDE model. The emotion evolution process is decomposed into emotion polarity state components, emotion intensity state components, and emotion change rate state components, and delayed state channels are configured for each, forming a multi-channel delayed state structure.
[0030] During continuous speech input, the system, through a speech-driven delay generation mechanism, dynamically generates a set of delay time functions corresponding one-to-one with each emotional component's state component, based on the changes in the temporal characteristics of speech emotion within the continuous time domain. These delay time functions are then written into the multi-channel delay state structure in real time, ensuring that the evolution of emotional states reflects the response lag caused by changes in speech emotion. Furthermore, a dual-timescale delay evolution structure is introduced into the NDDE model, dividing the delay state into short-term emotional response delay states and long-term emotional memory delay states. This allows the emotional state to respond to both local emotional fluctuations in speech and the cumulative changes in emotion over long periods of speech. Through a joint feedback update mechanism, the short-term emotional response delay state and the long-term emotional memory delay state jointly participate in modulation during the same emotional state update process, generating a set of stable emotional evolution states within the continuous time domain.
[0031] Based on a set of emotional evolution states, the system generates a target emotional state sequence through continuous time sampling, state smoothing, and time axis alignment. This sequence constitutes a complete emotional change trajectory in the continuous time domain and possesses smoothness constraints to avoid abrupt changes in emotional states along the time axis. Subsequently, the target emotional state sequence is introduced as the sole emotional control input into the digital avatar generation module. A unified, synchronous generation process is performed on the speech output stream, facial expression parameter sequence, and emotional expression parameter sequence, ensuring that the digital avatar maintains temporal structural consistency in speech output, facial expression changes, and overall emotional expression, thereby achieving synchronous generation of speech and visual emotions.
[0032] In the experiment, multiple continuous speech segments ranging from 30 seconds to 2 minutes in length were selected as input. The differences in the synchronous generation effect of digital image speech emotion between the method of this invention and the traditional method without emotion delay modeling and dual-timescale processing were compared. The experimental evaluation indicators included the temporal synchronization error of speech and facial expression, emotion continuity score, long-term speech emotion stability, and overall subjective naturalness score. Through statistical analysis of the results of multiple rounds of experiments, the following comparative data were obtained.
[0033] Table 1 Comparison Results of Synchronized Generation of Digital Image Voice Emotion
[0034] As shown in Table 1, the average error of the proposed method for speech and facial expression synchronization is controlled at around 120 milliseconds, a significant reduction compared to the 180 milliseconds of the traditional method. This indicates that the speech-driven delay generation mechanism and multi-channel delay state structure can more accurately model the temporal characteristics of speech emotion changes, making the visual emotion output closer to speech changes on the time axis. Regarding emotion continuity scoring, the proposed method achieves a score of 80, an improvement of approximately 8 points compared to the traditional method. This demonstrates that introducing a dual-timescale delay evolution structure in the continuous time domain results in smoother transitions between adjacent time periods, reducing inconsistencies in emotional expression.
[0035] In terms of long-term speech emotion stability index, the method of this invention achieves 0.76, while the traditional method achieves 0.68. This difference reflects the role of the long-term emotional memory delay state in joint feedback updates, enabling the emotional state to maintain a consistent evolutionary trend over a longer period and avoiding frequent shifts in the overall emotional direction due to local speech fluctuations. Simultaneously, the incidence of emotional abrupt changes decreases from 1.9 times per minute in the traditional method to 1.2 times per minute, further validating the practical effectiveness of smoothing constraints and continuous-time modeling in suppressing emotional abrupt changes.
[0036] Regarding subjective naturalness rating, the test participants gave an average score of 4.1 for the overall performance of the digital avatar generated using the method of this invention, while the traditional method scored 3.6. This result indicates that when the speech output stream, facial expression parameter sequence, and emotional expression parameter sequence are uniformly driven and synchronously generated by the same target emotional state sequence, the digital avatar's expression of emotion is more in line with human perception habits of natural communication.
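For readers who want the relative size of the differences reported in paragraphs [0034] to [0036], the short calculation below derives percentage changes from the figures stated in the text. The emotion continuity score is omitted because the traditional method's value is given only approximately; the dictionary keys and the helper `relative_change` are naming assumptions.

```python
# (traditional method, method of this invention), as stated in the description.
reported = {
    "sync_error_ms": (180.0, 120.0),
    "long_term_stability": (0.68, 0.76),
    "abrupt_changes_per_min": (1.9, 1.2),
    "subjective_naturalness": (3.6, 4.1),
}

def relative_change(baseline, proposed):
    """Percentage change of the proposed value relative to the baseline."""
    return (proposed - baseline) / baseline * 100.0

changes = {name: round(relative_change(b, p), 1) for name, (b, p) in reported.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Under these figures, the synchronization error and the abrupt change rate fall by roughly a third, while the stability index and the naturalness score rise by about 12% and 14% respectively.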
[0037] This invention, without pursuing excessive complexity, effectively improves the time mismatch and emotional instability problems in the synchronous generation of digital image voice emotions through emotion delay modeling, dual time-scale delay evolution, and a unified driven synchronous generation mechanism, and has good practical application value.
[0038] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A method for synchronously generating digital avatar voice emotions based on machine learning, characterized in that it includes the following steps: obtaining the original speech data corresponding to the digital avatar to be generated, performing data preprocessing, and extracting a speech emotion temporal feature representation; constructing an emotion delay modeling framework based on the speech emotion temporal feature representation, introducing the NDDE model, and initializing and configuring the emotion delay modeling framework to generate the framework configuration; based on the framework configuration, performing emotion representation mapping on the state variables in the NDDE model to determine the set of state variables representing emotional evolution, then performing structural reconstruction to split that set into an emotional component state set, and configuring delay state channels to generate a multi-channel delay state structure; constructing a speech-driven delay generation mechanism based on the multi-channel delay state structure, calculating a set of delay time functions for the emotional component state set, and writing it into the multi-channel delay state structure; introducing a dual-timescale delay evolution structure into the NDDE model, which divides the delay states in the multi-channel delay state structure into short-term emotional response delay states and long-term emotional memory delay states, and performing joint feedback updates on the emotional component state set to generate a set of emotional evolution states; mapping the set of emotional evolution states to a target emotional state sequence; and inputting the target emotional state sequence into the digital avatar generation module, performing synchronous generation processing, and generating the digital avatar voice emotion synchronous generation result.
2. The method for synchronously generating digital avatar voice emotions based on machine learning according to claim 1, characterized in that the extraction of the speech emotion temporal feature representation includes: processing the original speech data to a uniform sampling rate and a uniform quantization bit width to obtain standardized original speech data; performing frame segmentation on the standardized original speech data to divide it into a continuous speech frame sequence; performing endpoint detection on the speech frame sequence to generate a valid speech frame sequence; performing time alignment based on the valid speech frame sequence to generate a temporally continuous speech frame representation; based on the temporally continuous speech frame representation, calculating the periodic features of each speech frame to generate a fundamental frequency sequence, statistically analyzing the amplitude distribution of each speech frame to generate an energy sequence, calculating the duration change of corresponding phonemes in adjacent speech frames to generate a speech rate sequence, and locating the phoneme boundaries in the speech frame sequence and calculating the duration of each phoneme to generate a phoneme duration sequence; and combining the fundamental frequency sequence, energy sequence, speech rate sequence, and phoneme duration sequence along a unified time axis to generate the speech emotion temporal feature representation.
3. The method for synchronously generating digital image voice emotion based on machine learning according to claim 1, characterized in that, The generation of the framework configuration includes: Based on the temporal distribution characteristics and time span information of speech emotion temporal feature representation, an emotion delay modeling framework that carries the processing of emotional state evolution is constructed; In the emotion delay modeling framework, the NDDE model is introduced, and the temporal feature representation of speech emotion is connected to the state variable input channel of the NDDE model as a continuous time driving signal. Based on the start time information represented by the temporal features of speech emotion, the initial state of the NDDE model is configured to generate the initial state configuration. Based on the temporal span information represented by speech emotion temporal features, the range of delay time values for the NDDE model is configured. Write the initial state configuration and the range of delay time values into the emotion delay modeling framework to generate the framework configuration.
4. The method for synchronously generating digital image voice emotion based on machine learning according to claim 1, characterized in that, The generation of the multi-channel delay state structure includes: Based on the framework configuration, the state variables in the NDDE model are read, and the time-continuous change characteristics and numerical distribution characteristics of the state variables are analyzed to determine the state variables that can carry emotional evolution information and generate a set of state variables for emotional representation. Based on the temporal variation characteristics of speech emotion temporal feature representation, emotion representation mapping processing is performed on the set of state variables; Based on the emotional representation mapping processing results, the set of state variables is structurally reconstructed and split into an emotional component state set, including emotional polarity state components, emotional intensity state components, and emotional change rate state components. For the emotional polarity state component, the emotional intensity state component, and the emotional change rate state component, corresponding delayed state channels are configured respectively. The set of emotional component states configured with delayed state channels is integrated into a multi-channel delayed state structure.
5. The method for synchronously generating digital avatar voice emotions based on machine learning according to claim 1, characterized in that the construction and use of the speech-driven delay generation mechanism includes: constructing the speech-driven delay generation mechanism based on the multi-channel delay state structure, and setting an input interface for receiving the speech emotion temporal feature representation; using the dynamic mapping structure in the delay generation mechanism to perform delay mapping on the speech emotion temporal feature representation in the continuous time domain to generate a delay time function; for the emotional polarity state component, the emotional intensity state component, and the emotional change rate state component, constructing independent dynamic mapping paths in the delay generation mechanism to generate a corresponding delay time function for each emotional component state component, forming a set of delay time functions; applying upper and lower limit constraints to each delay time function while the set of delay time functions is being generated; and writing the set of delay time functions into the multi-channel delay state structure according to the correspondence between each delay time function and its emotional component state component.
6. The method for synchronously generating digital avatar voice emotions based on machine learning according to claim 1, characterized in that the generation of the set of emotional evolution states includes: dividing the delay states in the multi-channel delay state structure according to time scale, wherein the delay state that characterizes the local change response of speech emotion is determined as the short-term emotional response delay state, and the delay state that characterizes the cumulative change of emotion over time is determined as the long-term emotional memory delay state; simultaneously incorporating the short-term emotional response delay state and the long-term emotional memory delay state into the state evolution calculation process of the NDDE model, where they respectively modulate the immediate response to local changes and the cumulative response to long-term changes; performing joint feedback update processing on the emotional component state set during the state evolution of the NDDE model; and, in the NDDE model, performing state evolution calculation along the continuous time axis based on the emotional component state set updated by joint feedback, to generate the set of emotional evolution states.
7. The method for synchronously generating digital avatar voice emotions based on machine learning according to claim 1, characterized in that the generation of the target emotional state sequence includes: performing continuous time sampling on the set of emotional evolution states in the continuous time domain according to a preset time resolution to obtain a sampled emotional state set corresponding to the continuous time axis; performing state smoothing on the sampled emotional state set in the continuous time domain; performing time axis alignment on the smoothed sampled emotional state set to map it onto a time axis consistent with the speech emotion temporal feature representation, generating a temporally consistent emotional state representation; and organizing the temporally consistent emotional state representation in chronological order to generate the target emotional state sequence.
8. The method for synchronously generating digital avatar voice emotions based on machine learning according to claim 1, characterized in that the generation of the digital avatar voice emotion synchronous generation result includes: inputting the target emotional state sequence into the digital avatar generation module, establishing a unified driving relationship between the target emotional state sequence and the digital avatar generation sequence, and setting the target emotional state sequence as the only emotional control input driving the generation of the speech output stream, the facial expression parameter sequence, and the emotional expression parameter sequence; based on the emotional state values corresponding to each time point in the target emotional state sequence, performing speech emotion modulation generation processing in the digital avatar generation module to generate the speech output stream; based on the same target emotional state sequence, performing facial expression parameter generation processing to generate the facial expression parameter sequence; based on the same target emotional state sequence, performing emotional expression parameter generation processing to generate the emotional expression parameter sequence; and integrating, in time synchronization, the speech output stream, the facial expression parameter sequence, and the emotional expression parameter sequence, all driven by the same target emotional state sequence, to generate the digital avatar voice emotion synchronous generation result.