A method and device for generating pictures based on object emotion recognition

By employing multimodal data fusion and reinforcement learning optimization methods, this study addresses the shortcomings of existing emotion recognition technologies, such as insufficient multimodal fusion performance and lack of personalization in dynamic image generation. It achieves high-precision emotion recognition and dynamic image generation, thereby enhancing the naturalness and personalization of the user experience.

CN122196473A · Pending · Publication Date: 2026-06-12 · SHANGHAI GUIXU ELECTRONICS TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI GUIXU ELECTRONICS TECH CO LTD
Filing Date
2026-03-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing emotion recognition technologies have limitations in multimodal fusion, complex emotional state recognition, and dynamic adjustment, failing to meet the demands for high accuracy and flexibility. Single-modal emotional information is easily affected by noise or individual differences, fixed emotion classification models struggle to handle complex emotions, and dynamic image generation lacks personalization and real-time performance.

Method used

Employing multimodal data perception and weighted summation feature fusion techniques, combined with generative adversarial networks and reinforcement learning algorithms, this method acquires emotional features from speech, facial expressions, and body posture data, performs emotion recognition using a deep learning model, generates dynamic images through generative adversarial networks, and optimizes the generation strategy using reinforcement learning.

Benefits of technology

It improves the accuracy and robustness of emotion recognition, accurately identifies complex emotional states, achieves a high degree of matching between dynamic images and user emotions, and enhances the naturalness and personalization of the user experience.


Abstract

The application relates to the technical field of affective computing and human-computer interaction, and discloses a method and device for generating interactive pictures based on object emotion recognition. The generation method comprises the following steps: acquiring multimodal data, including voice, facial expression, and body posture data; extracting emotion features from the multimodal data; fusing the extracted multimodal features by weighted summation to obtain fused emotion features; and performing emotion recognition on the fused emotion features. Multimodal data perception and weighted-summation feature fusion make emotion recognition more accurate and comprehensive. A weighted combination of basic emotions into composite emotions expresses the user's multi-level emotional state. In addition, dynamic pictures are generated by a generative adversarial network so that the picture closely matches the recognized emotion, and a reinforcement learning mechanism improves the naturalness and personalization of the emotional interaction, so that each user obtains a more customized interactive experience.

Description

Technical Field

[0001] This invention relates to the fields of affective computing and human-computer interaction, specifically to a method and device for generating interactive visuals based on object emotion recognition.

Background Technology

[0002] With the advancement of artificial intelligence technology, emotion recognition technology is widely used in intelligent interaction systems. Existing emotion recognition methods typically rely on single-modal data, such as speech or facial expressions, to identify a user's emotional state. However, single-modal emotional information cannot fully reflect a user's emotions. For example, emotional information in speech is easily affected by background noise, while facial expressions are greatly affected by individual differences, making single-modal emotion recognition insufficient in accuracy and robustness.

[0003] While multimodal emotion recognition combines multiple modalities such as speech, facial expressions, and body posture, most methods employ simple feature fusion strategies, such as direct concatenation or weighted summation. This approach fails to adequately consider the differences between different modalities, resulting in less than ideal fusion effects. Furthermore, the weights often rely on manual settings or experience, limiting the accuracy of emotion recognition.

[0004] Most existing emotion recognition systems rely on fixed emotion classification models, typically dividing emotional states into basic emotion categories. While this approach can handle some simple emotion recognition tasks, it cannot cope with more complex emotional states, especially in the case of compound emotions, where existing technologies struggle to accurately identify the superposition of multiple emotions.

[0005] In terms of dynamic image generation, existing technologies typically generate images based on preset emotion tags and fixed templates, lacking personalization and flexibility. These systems fail to adaptively adjust to the user's real-time emotional changes, resulting in generated images that do not realistically reflect the user's emotional shifts and lack a sense of naturalness and immersion.

[0006] Therefore, existing emotion recognition and image generation technologies have limitations in multimodal fusion, complex emotional state recognition, and dynamic adjustment, and cannot meet the needs of high accuracy and flexibility in practical applications.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, this invention provides a method and device for generating interactive images based on object emotion recognition, which solves the problems of insufficient multimodal data fusion, difficulty in recognizing complex emotional states, and lack of personalization and real-time performance in existing emotion recognition technologies.

[0008] To achieve the above objectives, the present invention provides the following technical solution: a method for generating interactive visuals based on object emotion recognition, comprising the following steps:

[0009] S1. Acquire multimodal data, including speech, facial expressions, and body posture data;

[0010] S2. Extract the emotion features from the multimodal data;

[0011] S3. Fuse the extracted multimodal features to obtain the fused emotion features. The fusion step is performed by weighted summation, wherein the weight of each modal feature is automatically optimized through the training process;

[0012] S4. Perform emotion recognition on the fused emotion features. The recognition steps are carried out using an emotion state recognition model.

[0013] S5. Generate dynamic images that match the identified emotional state, and use a generative network to adjust the image elements during the generation process;

[0014] S6. Optimize the generated dynamic images and user emotional interaction using reinforcement learning algorithms, and adjust the generation strategy accordingly.

[0015] Preferably, the emotion state recognition model in step S4 is a deep learning-based model, which is a convolutional neural network, a long short-term memory network, or a Transformer network.

[0016] Preferably, the emotion state recognition model in step S4 includes a multilayer perceptron model, which is used to map the fused emotion features to emotion categories, including basic emotions and complex emotions.

[0017] Preferably, the emotion fusion in step S3 is represented by weighted summation of multiple basic emotions, wherein the weight of each basic emotion is determined through the training process, and the generated composite emotion is a weighted combination of basic emotions.

[0018] Preferably, the dynamic image generation in step S5 is achieved through a generative adversarial network (GAN). The GAN generates image elements based on emotional states, and the discriminator of the GAN determines the difference between the generated image and the real image and optimizes it accordingly.

[0019] Preferably, the reinforcement learning algorithm in step S6 optimizes the strategy through a reward function. The reward function evaluates the interactive effect of the generated image based on user feedback, and the optimization goal is to maximize the matching degree between the user's emotional experience and the generated image.

[0020] Preferably, the reinforcement learning algorithm in step S6 employs the policy gradient method, which optimizes image generation by dynamically adjusting the policy.

[0021] Preferably, the emotion state recognition model in step S4 captures the temporal series features of emotions by performing spatiotemporal modeling of emotion features at multiple times and using a long short-term memory network or a Transformer network.
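The reward-driven policy gradient optimization described in paragraphs [0019] and [0020] can be illustrated with a minimal REINFORCE sketch. Everything concrete here is an assumption made for illustration: the policy is a softmax over three hypothetical image styles, `simulated_user_feedback` stands in for real user feedback, and the learning rate and step count are arbitrary. It is not the generation strategy of the invention, only an instance of the general technique.

```python
import math
import random

def softmax(logits):
    """Convert policy logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def simulated_user_feedback(action):
    """Hypothetical reward: 1.0 when the chosen image style suits the user, else 0.0."""
    return 1.0 if action == 2 else 0.0

def train_policy(num_actions=3, steps=500, lr=0.5, seed=0):
    """REINFORCE on a one-step task: raise the probability of rewarded actions."""
    rng = random.Random(seed)
    theta = [0.0] * num_actions  # policy logits, one per candidate image style
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(num_actions), weights=probs)[0]
        reward = simulated_user_feedback(action)
        # For a softmax policy, d log pi(action) / d theta_k = 1[k == action] - probs[k].
        for k in range(num_actions):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_pi
    return softmax(theta)

policy = train_policy()
```

After training, `policy` concentrates most of its probability on the style that the simulated feedback rewards. A practical system would replace the fixed reward with a measured matching degree between the generated image and the user's emotional response.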

[0022] A device for generating interactive visuals based on object emotion recognition, comprising:

[0023] The multimodal data acquisition module is used to collect users' multimodal data, including voice, facial expressions, and body posture data;

[0024] The emotion feature extraction module is used to extract emotion features from the multimodal data, including voice features, facial expression features, and body posture features;

[0025] The emotional feature fusion module is used to fuse the emotional features extracted by the feature extraction module to obtain fused emotional features, and to perform emotional recognition on the fused features to identify the user's emotional state.

[0026] The image generation module is used to generate dynamic images that match the emotional state identified by the feature fusion module.

[0027] The optimization module is used to improve the matching degree between the images generated by the image generation module and the user's emotions through reinforcement learning, and to adjust the generation strategy.

[0028] Preferably, the feature fusion module fuses features from multiple modalities by weighted summation, wherein the weights are automatically adjusted during the training process in the emotion recognition task.

[0029] This invention provides a method and device for generating interactive visuals based on object emotion recognition. It has the following beneficial effects:

[0030] 1. This invention employs multimodal data perception and weighted summation feature fusion technology to integrate various emotional information such as voice, facial expressions, and body posture. This achieves improved accuracy and robustness in emotion recognition. Compared to existing emotion recognition methods that rely on single modality or simple fusion, this invention comprehensively considers features from multiple modalities, overcoming the limitation of single-modal emotion recognition in fully reflecting the user's emotional state, thus making emotion recognition more accurate and comprehensive.

[0031] 2. This invention employs a weighted combination of complex emotions, representing the user's complex emotional states through a weighted sum of various basic emotions. This achieves a more accurate expression of the user's multi-layered emotions. Compared to traditional emotion recognition methods that rely solely on single classification methods based on basic emotions, this invention overcomes the inability to handle complex emotional combinations, accurately identifying and expressing the superposition and changes of multiple emotions.

[0032] 3. This invention utilizes Generative Adversarial Networks (GANs) combined with user emotional states to dynamically generate visuals, achieving a high degree of matching between emotion and visuals. This results in real-time dynamic adjustment of visual elements and an improved user experience. Compared to fixed or preset visual generation methods in existing technologies, the dynamic generation technology of this invention overcomes the low adaptability and limitations of traditional methods in emotional interaction, enabling each user to receive more personalized and real-time visual feedback during interaction.

[0033] 4. This invention introduces a reinforcement learning mechanism to optimize the generation strategy of dynamic images. It achieves the effect of adjusting the generation strategy in real time based on user feedback, further improving the naturalness and personalization of emotional interaction. Compared to traditional emotion generation models that cannot optimize in real time according to changes in user emotions, this invention achieves self-learning and optimization of the system through reinforcement learning, enabling it to better adapt to changes in user emotions and enhance user immersion and emotional resonance.

Attached Figure Description

[0034] Figure 1 is a flowchart illustrating the steps of the interactive image generation method based on object emotion recognition according to the present invention;

[0035] Figure 2 is a flowchart of the emotion recognition module of the present invention;

[0036] Figure 3 is a flowchart of the dynamic image generation process of the present invention;

[0037] Figure 4 is a flowchart of the closed-loop optimization and reinforcement learning process of the present invention;

[0038] Figure 5 is a schematic diagram of the reinforcement learning parameter update of the present invention;

[0039] Figure 6 is a system module diagram of the interactive image generation device of the present invention.

Detailed Implementation

[0040] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0041] Example:

[0042] Referring to Figures 1 to 5, this invention provides a method for generating interactive visuals based on object emotion recognition, including:

[0043] S1. Acquire multimodal data, including speech, facial expressions, and body posture data;

[0044] Referring to Figure 1, implementing this invention first requires collecting multimodal data from users through various devices. A user's emotional state is typically expressed through multiple signals such as voice, facial expressions, and body posture. Therefore, three data sources are employed to acquire the user's emotional information.

[0045] Speech: The user's speech signal is acquired through a microphone, and acoustic features such as Mel-frequency cepstral coefficients (MFCC) are used to analyze the emotional information in the speech. MFCC features can effectively extract emotion-related features from audio signals and identify emotional changes in speech.

[0046] Facial expressions: The system captures images of the user's face using a camera and employs a convolutional neural network (CNN) to extract facial features. CNNs can automatically recognize facial expressions, such as smiles and frowns, from facial images. These expressions are closely related to emotional states.

[0047] Body posture: By using posture recognition sensors to acquire the user's body movement status, their body language can be analyzed. Body movements can reflect the user's emotions, such as anxiety, pleasure, or anger.

[0048] After preprocessing, features are extracted from the data of each modality and represented as $F_{\text{speech}}$, $F_{\text{face}}$, and $F_{\text{pose}}$, respectively. These feature matrices, each with different dimensions and contents depending on the type of signal acquired, are fed into the feature fusion module.

[0049] S2. Extract emotional features from the multimodal data. In this embodiment, to achieve high-precision emotion recognition, the user's emotional state is first perceived in a preliminary way through a multimodal data perception layer. Emotion recognition is a complex process, and a single data source often cannot fully reflect the user's true emotions. Therefore, this invention uses multiple modalities such as voice, facial expressions, and body posture to comprehensively capture the user's emotional changes, and integrates the data from each modality to construct a complete emotional feature representation.

[0050] Multimodal data perception: The multimodal data perception module is responsible for collecting different types of data from the user from multiple sensors, including voice, facial expressions, and body posture data. Each data type corresponds to different perceptual information and has different characteristics. Therefore, this invention employs a feature extraction method tailored to each data source.

[0051] Speech Data Acquisition and Feature Extraction: In the speech signal processing, Mel-frequency cepstral coefficients (MFCC) are used as the speech feature extraction method. MFCC is a feature widely used in speech recognition and emotion recognition, which can effectively capture the frequency components in the speech signal and reflect the emotional information of the speech. The process first processes the audio signal through short-time Fourier transform (STFT) to obtain a spectrogram. Then, a Mel filter bank is used to transform the spectrogram to a Mel scale, and finally, the cepstral coefficients within each time window are calculated. The specific formula is as follows:

[0052] $$\mathrm{MFCC}(t) = \sum_{f=1}^{N_f} \log\bigl(\lvert X(t,f) \rvert\bigr)\, H(f)$$

[0053] where $\mathrm{MFCC}(t)$ denotes the Mel-frequency cepstral coefficients at time $t$, $X(t,f)$ denotes the Fourier transform result of the audio signal at time $t$ and frequency $f$, $H(f)$ is the Mel filter bank, and $N_f$ is the number of frequency components. In this way, MFCC extracts frequency components related to speech emotion, providing strong support for subsequent emotion recognition.
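The speech feature pipeline described above (short-time Fourier transform, Mel filter bank, cepstral coefficients per time window) can be sketched for a single frame as follows. This is a minimal illustration of the conventional MFCC recipe using only the Python standard library. The Hamming window, the triangular filter shape, the final DCT-II step, and all parameter values (`num_filters`, `num_coeffs`, the 8 kHz sample rate) are assumptions that the source does not specify.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window one frame and return |X(t, f)|^2 for bins 0..n/2 (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))) ** 2
            for k in range(n // 2 + 1)]

def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the Mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + (high - low) * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge of the triangle
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank

def mfcc_frame(frame, sample_rate, num_filters=10, num_coeffs=5):
    """Log Mel filter-bank energies of one frame followed by a DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filter_bank(num_filters, len(frame), sample_rate)
    log_energy = [math.log(max(sum(h * p for h, p in zip(filt, spec)), 1e-10))
                  for filt in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energy))
            for c in range(num_coeffs)]

sample_rate = 8000  # Hz, illustrative
frame = [math.sin(2.0 * math.pi * 440.0 * i / sample_rate) for i in range(64)]
coeffs = mfcc_frame(frame, sample_rate)
```

Applying `mfcc_frame` to successive overlapping frames yields one coefficient vector per time window, which is the form of speech feature the paragraph describes. A production system would normally use an FFT and a tested audio library rather than this naive DFT.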

[0054] Facial Expression Data Acquisition and Feature Extraction: Facial expressions can intuitively reflect an individual's emotional state, especially when used in conjunction with speech and body language, enabling a more comprehensive understanding of emotional changes. Therefore, convolutional neural networks (CNNs) are used to extract emotional features from facial expression images.

[0055] In practice, facial images captured by a camera are input into a CNN network for processing. The CNN automatically extracts features from the facial images, such as the movement of areas like the eyes, eyebrows, and mouth. Each convolutional layer extracts different features from the image, from low-level edge information to high-level facial expression features. The output of the CNN forms a high-dimensional representation of emotional features. This process can be represented as:

[0056] $$F_{\text{face}} = \sigma(W * X + b)$$

[0057] where $W$ is the convolution kernel, $X$ is the input facial image, $b$ is the bias term, $\sigma$ is the activation function, and $F_{\text{face}}$ is the feature map after the convolution operation. Through the combination of convolutional layers and pooling layers, the network can learn various detailed features in facial expressions, such as smiling, frowning, and changes in eye contact.
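A single convolution step followed by pooling, as described above, can be sketched in plain Python. The ReLU activation, the 2x2 max pooling, the hand-written edge kernel, and the toy 4x4 "image" are assumptions chosen for illustration. A real facial expression network would learn many kernels across several layers.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Valid 2D convolution (cross-correlation, as in most CNN libraries) plus ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(max(0.0, acc))  # ReLU activation
        feature_map.append(row)
    return feature_map

def max_pool2x2(feature_map):
    """Non-overlapping 2x2 max pooling; a trailing odd row or column is dropped."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[r][c], feature_map[r][c + 1],
                           feature_map[r + 1][c], feature_map[r + 1][c + 1]))
        pooled.append(row)
    return pooled

# Toy image with a dark-to-bright vertical edge, and a kernel that responds to it.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edge_kernel = [[-1.0, 1.0], [-1.0, 1.0]]
feature_map = conv2d_relu(image, edge_kernel)
pooled = max_pool2x2(feature_map)
```

The feature map is non-zero only in the column where the edge sits, which is the sense in which early convolutional layers extract low-level edge information.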

[0058] Body posture data acquisition and feature extraction: Body language is an important way for users to express their emotions, and it can work with facial expressions to support emotion recognition. In this invention, the user's body posture is captured by sensors or cameras, and then the features of the body posture are extracted.

[0059] Body posture extraction is achieved through human keypoint detection algorithms. Using posture recognition algorithms such as OpenPose, the positions of various key parts of the human body can be detected, resulting in a representation of the human skeleton. By calculating the coordinate changes of each keypoint, body features related to the user's emotions can be obtained. For example, leaning forward may indicate anxiety, while crossing arms may indicate defensiveness. The feature extraction process of body posture can be represented as follows:

[0060] $$F_{\text{pose}} = P(S, K)$$

[0061] where $F_{\text{pose}}$ denotes the body posture features, $S$ is the human skeleton information, $K$ is the coordinates of the key points, and $P(\cdot)$ is the pose estimation algorithm. The formula shows that the features of body pose are composed of the skeletal information and key point positions of the human body. By processing this information through a pose estimation algorithm, the final emotional features are obtained.
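To make the keypoint-based posture features concrete, the sketch below derives three simple features from two consecutive frames of 2D keypoints. The keypoint names, the choice of features (mean keypoint motion, torso lean angle, wrist separation relative to shoulder width), and the coordinate values are all hypothetical. They do not reproduce the output format of OpenPose or the feature set of the invention.

```python
import math

def posture_features(prev_kp, curr_kp):
    """Return [mean keypoint motion, torso lean in degrees, wrist gap / shoulder width]."""
    # Coordinate change of each keypoint between the two frames.
    motion = [math.dist(prev_kp[name], curr_kp[name]) for name in curr_kp]
    mean_motion = sum(motion) / len(motion)

    # Angle of the neck-to-hip line away from vertical (image y axis points down).
    neck, hip = curr_kp["neck"], curr_kp["hip"]
    lean_deg = math.degrees(math.atan2(abs(neck[0] - hip[0]), abs(hip[1] - neck[1])))

    # Wrist separation normalized by shoulder width; small values suggest closed arms.
    shoulder_width = math.dist(curr_kp["l_shoulder"], curr_kp["r_shoulder"])
    wrist_gap = math.dist(curr_kp["l_wrist"], curr_kp["r_wrist"]) / shoulder_width
    return [mean_motion, lean_deg, wrist_gap]

prev_frame = {"neck": (0, 0), "hip": (0, 10),
              "l_shoulder": (-2, 0), "r_shoulder": (2, 0),
              "l_wrist": (-3, 5), "r_wrist": (3, 5)}
curr_frame = {"neck": (10, 0), "hip": (0, 10),
              "l_shoulder": (-2, 0), "r_shoulder": (2, 0),
              "l_wrist": (-1, 5), "r_wrist": (1, 5)}
features = posture_features(prev_frame, curr_frame)
```

Whether a given lean angle or arm position corresponds to anxiety or defensiveness is something a downstream model would have to learn from labeled data. The sketch only shows how keypoint coordinates turn into numeric features.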

[0062] Emotional Feature Fusion: After features are extracted from speech, facial expressions, and body posture data, these features are fused into a unified emotional feature representation. Multimodal data fusion is a key step in this invention, aiming to effectively combine emotional features from different sources, thereby improving the accuracy and robustness of emotion recognition.

[0063] In this embodiment, a weighted summation method is used for multimodal feature fusion. Specifically, the extracted emotion features $F_i$ of each modality are combined by weighted summation to obtain the fused features. The fused emotion features are represented as follows:

[0064] $$F_{\text{fused}} = \sum_{i=1}^{N} w_i F_i$$

[0065] where $F_{\text{fused}}$ denotes the fused emotion features, $F_i$ is the emotion feature of the $i$-th modality, $w_i$ is the weight of that modality, and $N$ is the total number of modalities. The weights are automatically optimized during training based on the emotion recognition task, and the weight values determine the degree to which each modality contributes to the final emotion features. This weight optimization can be achieved using common optimization algorithms such as gradient descent.

[0066] During training, the system automatically adjusts the weights $w_i$ based on the importance of different modalities in emotion recognition, so as to achieve the best feature fusion effect.
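The weighted-summation fusion and its gradient-descent weight update can be sketched as below. Two assumptions are made that the source does not state: the per-modality features have already been projected to a common dimension (a weighted sum is otherwise undefined), and a squared-error objective against a toy target vector stands in for the real emotion recognition loss. The function names and numeric values are illustrative.

```python
def fuse(features, weights):
    """Weighted sum of same-length modality feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def train_fusion_weights(features, target, lr=0.5, steps=50):
    """Gradient descent on 0.5 * ||fuse(features, w) - target||^2 with respect to w."""
    weights = [1.0 / len(features)] * len(features)  # start from equal weights
    for _ in range(steps):
        fused = fuse(features, weights)
        residual = [p - t for p, t in zip(fused, target)]
        # dLoss/dw_i is the dot product of the residual with modality i's feature.
        grads = [sum(r * x for r, x in zip(residual, f)) for f in features]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Toy features for speech, face, and posture, already in a shared 3-dimensional space.
modal_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [0.6, 0.3, 0.1]
learned = train_fusion_weights(modal_features, target)
fused = fuse(modal_features, learned)
```

In this toy setup the learned weights converge to the values that reproduce the target, so the modality that matters most for the objective ends up with the largest weight. In a full system the gradient would instead flow back from the classification loss through the recognition model.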

[0067] Through the above-described multimodal data perception and emotion feature extraction process, the system ultimately obtains a comprehensive emotion feature. This feature represents the user's multidimensional emotional state and serves as input to the subsequent emotion recognition model, for use by the downstream emotion state classification and image generation modules.

[0068] In summary, this invention, through multimodal data perception and feature extraction technology, can comprehensively acquire and process users' emotional information, providing reliable support for emotion recognition and dynamic image generation. Through this step, the system can not only accurately capture users' facial expressions, voice changes, body movements, and other emotional signals, but also effectively fuse these signals, providing complete emotional feature data for subsequent emotion recognition and image generation.

[0069] S3. The extracted multimodal features are fused to obtain the fused sentiment features. The fusion step is performed by weighted summation, wherein the weight of each modal feature is automatically optimized through the training process.

[0070] In the preceding steps, the system successfully extracted rich emotional feature data from multiple sensory sources (voice, facial expressions, and body posture) through the multimodal data perception module. Next, the system will fuse this multimodal data and perform further emotion recognition processing. Multimodal feature fusion not only enhances the accuracy of emotion analysis but also ensures the complementarity between different modalities. By organically integrating data from different modalities, the system can provide a more comprehensive and accurate assessment of the user's emotional state.

[0071] In this embodiment, the multimodal feature fusion and emotion recognition steps operate on the extracted features $F_{\text{speech}}$, $F_{\text{face}}$, and $F_{\text{pose}}$. In the previous stage, features of speech, facial expressions, and body posture were extracted using MFCC, convolutional neural networks (CNN), and pose estimation algorithms, respectively. Each modality's features are assigned corresponding weights, and after fusion, a comprehensive emotion feature is generated. These features serve as input to the emotion recognition model, supporting subsequent emotion classification tasks. The following is a detailed description of multimodal feature fusion and emotion recognition.

[0072] Multimodal feature fusion: The goal of modal feature fusion is to combine features extracted from various modalities into a unified representation, ensuring that the system can understand the user's emotional state from multiple perspectives. Generally, data from modalities such as speech, facial expressions, and body posture contain different emotional information. By fusing this information, the overall emotional state of the user can be better expressed.

[0073] In this embodiment, a weighted summation method is used for feature fusion. The feature $F_i$ of each modality is multiplied by a weight $w_i$, and the products are added together to obtain the fused emotion features. The mathematical formula for this process is:

[0074] $$F_{\text{fused}} = \sum_{i=1}^{N} w_i F_i$$

[0075] where $F_{\text{fused}}$ denotes the fused emotion features, $F_i$ denotes the emotion feature of the $i$-th modality, $w_i$ is the weight of that modality, and $N$ is the number of modalities. This weighted summation allows the system to weight the contribution of each modality in emotion analysis, ensuring that modal features that play a crucial role in emotion recognition have a greater proportion in the fusion process.

[0076] In this embodiment, the weight $w_i$ is automatically optimized during training based on feedback from the emotion recognition task. By optimizing these weights, the system can find the optimal fusion method, so that the final emotion feature $F_{\text{fused}}$ accurately reflects the user's emotional state.

[0077] As an option, the weight $w_i$ can be adjusted dynamically, using reinforcement learning algorithms to adjust it under different emotional scenarios, thereby optimizing the feature fusion under different emotional states.

[0078] S4. Perform emotion recognition on the fused emotion features. The recognition steps are carried out using an emotion state recognition model.

[0079] Emotion recognition model (see Figure 2): After completing the fusion of multimodal features, the system inputs the fused emotion features $F_{\text{fused}}$ into the emotion recognition model. The task of the emotion recognition model is to convert the fused features into specific emotional states. Generally, emotional states can be divided into basic emotions and complex emotions. Basic emotions include pleasure, anger, sadness, etc., while complex emotions refer to a combination of emotions composed of multiple basic emotions, such as the simultaneous presence of anger and anxiety.

[0080] To accurately identify emotional states, the emotion recognition model in this embodiment employs deep learning techniques. Specifically, the emotion recognition model can use models such as Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), or Transformers. These models can capture the spatial and temporal dependencies in emotional features, effectively improving the accuracy of emotion classification. Specifically, LSTM can model the temporal dependencies of emotional feature sequences, while Transformers can simultaneously focus on the correlations at different positions in the input sequence.

[0081] The task of identifying emotional states can be performed using a multilayer perceptron (MLP) model, which maps the fused features $F_{\text{fused}}$ onto an emotion category space. Emotion categories include basic emotions such as joy, anger, and sadness, as well as complex emotions formed by their combinations. The output of the MLP model is the probability of each emotion category, and the system determines the user's final emotional state based on these probabilities. This process can be represented as:

[0082] $$\hat{y} = f(F_{\text{fused}}; \theta)$$

[0083] where $\hat{y}$ is the predicted emotion category result, $F_{\text{fused}}$ is the fused features, $\theta$ is the parameters of the model, and $f(\cdot)$ is the function of the emotion recognition model. Its output $\hat{y}$ can be the probability distribution over the emotion categories.
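The mapping from fused features to emotion category probabilities can be illustrated with a tiny MLP forward pass. The layer sizes, the ReLU hidden activation, the softmax output, the hand-picked parameter values, and the three-label set are assumptions for the sake of a runnable example. A trained model would learn its parameters from labeled data.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, params):
    """One hidden layer with ReLU, then a softmax output layer."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(params["W2"], params["b2"])]
    return softmax(logits)

LABELS = ["joy", "anger", "sadness"]  # illustrative subset of basic emotions
params = {
    "W1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "W2": [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]], "b2": [0.0, 0.0, 0.0],
}
fused_features = [1.0, 0.0]  # stand-in for the fused emotion feature vector
probs = mlp_forward(fused_features, params)
predicted = LABELS[probs.index(max(probs))]
```

Here `probs` plays the role of the model output, a probability per emotion category, and `predicted` is the category with the highest probability.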

[0084] During training, the sentiment recognition model minimizes the gap between predicted and actual sentiment labels by optimizing the cross-entropy loss function. The cross-entropy loss function is defined as follows:

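[0085] $$\mathcal{L} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

[0086] where $\mathcal{L}$ is the loss function, $C$ is the number of emotion categories, $y_i$ is the true label for category $i$, and $\hat{y}_i$ is the probability of emotion category $i$ predicted by the model.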
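The cross-entropy loss can be checked numerically with a few lines of Python. The example labels and predicted probabilities are made up, and the small `eps` clamp is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative sum over categories of true label times log predicted probability."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

true_label = [0.0, 1.0, 0.0]  # one-hot: the second category is correct
confident = cross_entropy(true_label, [0.2, 0.7, 0.1])
mistaken = cross_entropy(true_label, [0.7, 0.2, 0.1])
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the correct category, so `confident` is smaller than `mistaken`. Minimizing this quantity pushes the model toward the labeled emotion.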

[0087] Through the aforementioned emotion recognition model, the system can obtain the user's basic and complex emotions. Basic emotions are identified by judging the model's output probabilities, while complex emotions are modeled by a weighted combination of basic emotions. As mentioned earlier, complex emotions are composed of multiple basic emotions; therefore, in this embodiment, the calculation of complex emotions uses a weighted summation method. Specifically, if the user's emotional state includes multiple emotions, such as anger and anxiety, the system weights these emotions by the coefficients learned during training to calculate the complex emotion.

[0088] The formula for expressing complex emotions is:

[0089] $$E_{\text{complex}} = \sum_{i=1}^{n} \alpha_i E_i$$

[0090] where $E_{\text{complex}}$ is the complex emotion, $E_i$ is the $i$-th basic emotion, $\alpha_i$ is the weighting coefficient representing the degree of influence of basic emotion $E_i$ on the complex emotion, and $n$ is the number of basic emotions. By optimizing the weights of each basic emotion, the system can accurately identify complex emotions.
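The weighted combination of basic emotions can be made concrete with a small sketch. The source does not say how a basic emotion is represented numerically, so this example assumes each one is a two-dimensional (valence, arousal) vector with made-up coordinates. The weights passed in are likewise illustrative rather than learned.

```python
# Hypothetical (valence, arousal) coordinates for each basic emotion.
BASIC_EMOTIONS = {
    "joy": (0.8, 0.5),
    "anger": (-0.6, 0.8),
    "sadness": (-0.7, -0.4),
    "fear": (-0.7, 0.7),
}

def composite_emotion(alphas):
    """Weighted sum of basic-emotion vectors; emotions not listed get weight 0."""
    return tuple(
        sum(alphas.get(name, 0.0) * vec[d] for name, vec in BASIC_EMOTIONS.items())
        for d in range(2)
    )

# A state that is mostly anger with a smaller share of fear.
mixed = composite_emotion({"anger": 0.6, "fear": 0.4})
```

The resulting vector lies between the two contributing emotions and closer to the one with the larger weight, which is how the weights express each basic emotion's degree of influence on the composite state.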

[0091] The emotion recognition model in this embodiment not only relies on traditional deep learning methods but can also be further optimized by combining reinforcement learning. Through reinforcement learning, the system can dynamically adjust the modality fusion weights and the basic emotion weighting coefficients according to the user's emotional changes. This allows the system to better capture and identify emotional changes in different emotional scenarios.

[0092] Through these steps, this invention can efficiently extract emotional features from multimodal data and identify emotional states using a combination of deep learning and reinforcement learning. The system can handle complex emotional states, such as the identification of basic and complex emotions, providing accurate data support for subsequent image generation and user interaction.

[0093] In multimodal emotion recognition systems, the emotional features of a single modality are often insufficient to comprehensively and accurately reflect a user's emotional state. Therefore, this invention employs a weighted combination of multimodal data, which, by fusing emotional features from different modalities (such as voice, facial expressions, and body posture), yields a more comprehensive and accurate representation of the emotional state. This weighted combination of multimodal data not only allows information from each modality to be complementary but also maximizes the contribution of each modality to emotion recognition.

[0094] In the preceding steps, the system first extracted sentiment features from multiple modal data sources and then fused these features. Through feature fusion, a sentiment feature containing multimodal information was obtained. Next, the system needs to further process these emotional features, weighting and combining multiple basic emotions (such as joy, anger, sadness, etc.) to form the final composite emotional representation. This step is one of the core components of the emotion recognition process in this invention, capable of handling complex emotional states and transforming them into features suitable for subsequent processing.

[0095] In this embodiment, a composite emotion is represented by a weighted combination of basic emotions. The contribution of each basic emotion to the composite emotion is represented by a weighting coefficient, which is learned automatically through the training process. During training, the system adjusts the weight of each basic emotion based on the user's emotional features and the corresponding emotion labels so as to represent complex emotions as accurately as possible.

[0096] Definitions of basic and complex emotions: Basic emotions are emotion categories that can be identified individually and have clear emotional characteristics, typically including pleasure, anger, sadness, and fear. Complex emotions are combinations of multiple basic emotions and can represent more nuanced emotional states. For example, the combination of anger and anxiety can represent "anxious anger," while the combination of pleasure and surprise can represent "pleasant surprise."

[0097] Weighted combination of complex emotions: In this embodiment, the weighted combination of complex emotions is achieved by combining the various basic emotions through a weighted summation, as shown in the following formula:

[0098] $E_c = \sum_{i=1}^{n} w_i E_i$;

[0099] where $E_c$ denotes the complex emotion, $E_i$ is the $i$-th basic emotion, $w_i$ is the weighting coefficient of the $i$-th basic emotion, and $n$ is the number of basic emotions. The complex emotion $E_c$ is the weighted sum of the basic emotions $E_i$, and the magnitude of the weight $w_i$ reflects the contribution of each basic emotion to the complex emotion. The weights $w_i$ are optimized through training to ensure that the final composite emotion accurately expresses the user's emotional state.
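As a small illustration of the weighted summation of basic emotions, the sketch below treats each basic emotion as a vector and combines them with fixed weights. The one-hot vectors and the weights 0.5, 0.3, and 0.2 are hypothetical values chosen for readability; in this embodiment the weights would be learned during training.

```python
def composite_emotion(basic_emotions, weights):
    """Return the weighted sum of equally sized basic-emotion vectors."""
    if len(basic_emotions) != len(weights):
        raise ValueError("one weight is required per basic emotion")
    dim = len(basic_emotions[0])
    return [
        sum(w * emotion[k] for w, emotion in zip(weights, basic_emotions))
        for k in range(dim)
    ]

# Hypothetical one-hot representations of three basic emotions.
anger = [1.0, 0.0, 0.0]
anxiety = [0.0, 1.0, 0.0]
tension = [0.0, 0.0, 1.0]

# Hypothetical learned weights; a larger weight means a larger contribution.
weights = [0.5, 0.3, 0.2]
anxious_anger = composite_emotion([anger, anxiety, tension], weights)
```

With one-hot inputs the composite vector simply reproduces the weights, which makes each basic emotion's contribution easy to read off.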

[0100] Training and optimization of weight coefficients: During training, the system adjusts the weight coefficients by optimizing the loss function. Specifically, the emotion recognition model adjusts the weight of each basic emotion based on the gap between the actual emotion label and the emotional state predicted by the model, thereby obtaining the optimal composite emotional expression.

[0101] The loss function is typically the cross-entropy loss, expressed as:

[0102] $L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$;

[0103] where $L$ is the loss function, $C$ is the total number of emotion categories, $y_i$ is the true emotion label for category $i$, and $\hat{y}_i$ is the probability that the model predicts for category $i$. By minimizing the loss function, the system can effectively adjust the weight coefficients to optimize the recognition of complex emotions.
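To make the cross-entropy loss concrete, the following sketch evaluates it for a one-hot label and two hypothetical predictions. The function name `cross_entropy`, the clamping constant `eps`, and the probability values are assumptions introduced only for illustration.

```python
import math

def cross_entropy(true_label, predicted_probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(
        y * math.log(max(p, eps)) for y, p in zip(true_label, predicted_probs)
    )

# Hypothetical example with three emotion categories; the second is correct.
true_label = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]
uncertain = [0.40, 0.30, 0.30]

loss_confident = cross_entropy(true_label, confident)
loss_uncertain = cross_entropy(true_label, uncertain)
```

The confident prediction gives a loss of about 0.105 and the uncertain one about 1.204, so minimizing the loss pushes probability mass toward the labeled category.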

[0104] The relationship between multimodal features and complex emotions: The emotion recognition system of this invention can organically combine emotional information from speech, facial expressions, and body posture through weighted combination of multimodal features. The weight coefficients are optimized through the training process. The system can dynamically adjust the weight of basic emotions based on the contribution of different modalities to the emotional state. This allows the system to handle more complex emotional states and provide more accurate emotional representations for subsequent tasks such as image generation.

[0105] For example, if the voice signal indicates anger, the facial expression indicates anxiety, and the body posture indicates tension, the system applies the weights learned during training to combine the three basic emotions of anger, anxiety, and tension into an accurate composite emotion. This weighted combination enhances the system's ability to recognize complex emotional states, thus providing more precise emotional features for subsequent image generation.

[0106] System Optimization and Enhancement: In some embodiments, reinforcement learning techniques can be employed to further optimize the emotion recognition process and improve the accuracy of the weighted combination of emotional states. Through reinforcement learning, the system can adjust the weight of each basic emotion based on real-time user feedback, achieving dynamic optimization. In this way, the system adapts flexibly to changes in the user's emotions, improving the real-time performance and personalization of emotion recognition.

[0107] By combining a weighted combination of multimodal features with a deep learning model, this invention provides an accurate and dynamic emotional state recognition solution. The weighted combination of complex emotions provides rich emotional information for subsequent image generation and emotional interaction, enabling the system to achieve a more natural and immersive user experience.

[0108] S5. Generate dynamic images that match the identified emotional state, and use a generative network to adjust the image elements during the generation process;

[0109] Referring to Figure 3 of the accompanying drawings, dynamic image generation is one of the key steps of the emotion recognition system in this invention. Through the aforementioned multimodal feature extraction and emotional state recognition, the system has obtained a representation of the user's emotional state. At this point, the system generates dynamic images that match the user's current emotional state, thereby achieving a high degree of matching between emotion and visual experience.

[0110] The generation of weighted combinations of emotional states (i.e., complex emotions) provides an important basis for subsequent image generation. Specifically, the system dynamically adjusts visual elements in the image, such as color, lighting, and particle effects, based on different components of the complex emotion (such as anger, pleasure, and anxiety). Through this process, the user's emotions and interactive experience can be presented more vividly and personally.

[0111] In this embodiment, the dynamic image generation process employs Generative Adversarial Network (GAN) technology. GAN generates high-quality dynamic images that match the user's emotional state through a game-like process between a generator and a discriminator. Specifically, the generator generates image elements that match the emotional state input, while the discriminator optimizes the generator based on the differences between the generated image and the real image, ensuring that the generated image is both realistic and consistent with the emotional state.

[0112] The generator and discriminator are the core components of a generative adversarial network (GAN). The generator's task is to generate images based on the input emotional state, while the discriminator's task is to judge the differences between the generated images and the real images, and feed this information back to the generator to optimize the generation process.

[0113] In this embodiment, the generator receives the fused emotional features. This feature incorporates emotional information from multimodal data such as speech, facial expressions, and body posture. Based on these emotional features, the generator produces matching visual elements, such as background color, particle movement, and facial expressions. During the generation process, the system references preset rules for different emotional states, automatically adjusting the generated visual features according to the emotional category (such as joy, anger, sadness, etc.).

[0114] The discriminator evaluates the generated images to determine whether they conform to the style and emotional requirements of the real images. By comparing the generated images with the real images, the discriminator provides a score and feedback, which is then passed to the generator. This allows the generator to continuously optimize and ultimately generate images that perfectly match the emotional state.
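The adversarial interplay between generator and discriminator can be written as an objective. This embodiment does not state a specific loss, so the block below is the standard conditional GAN objective, offered only as one plausible formalization. The symbols are assumptions: $x$ is a real image, $z$ is a random noise vector, $e$ is the fused emotional feature used as the condition, $G$ is the generator, and $D$ is the discriminator.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x \mid e)\big]
+ \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z \mid e) \mid e)\big)\big]
```

Under this reading, the discriminator is trained to score real images high and generated images low for the same emotional condition, while the generator is trained to raise the discriminator's score on its outputs.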

[0115] Generation and adjustment of visual elements: In this embodiment, visual elements include multiple aspects such as color, lighting, texture, background, and particle effects. The system dynamically adjusts these elements according to different emotional states.

[0116] For example, when the emotional state indicates anger or anxiety, the system may choose more intense hues, such as red or black, and use more tense and rapid particle effects to simulate a tense atmosphere. Conversely, when the user's emotional state is pleasant or relaxed, the system may generate softer blues or greens and use slower, gentler animation effects to bring the user a more tranquil and comfortable feeling.

[0117] Specifically, based on the emotional features, the generator produces the following image elements via the network:

[0118] Color: The system adjusts the hue and brightness of the image based on the emotional state. For example, it may use a more intense red for anger and a warm yellow or green for happiness.

[0119] Lighting and Shadows: The system adjusts the position and intensity of light sources in the image according to the emotional state. For anger, the image may tend toward high contrast and dark tones, while for happiness it uses soft light and a bright background.

[0120] Particle effects: Particle effects vary depending on the emotional state. For anger or tension, particles may exhibit rapid explosions or vibrations, while for joy or relaxation, light and slow particle movements are used.

[0121] Background elements: The choice of background is also influenced by emotional state. Tension or anger may generate dull background colors or confrontational scenes, while joy and tranquility may generate natural landscapes or bright cityscapes.
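The mapping from emotional state to visual elements listed above can be sketched as a blend of per-emotion presets. This is a deliberately simple, rule-based stand-in for the learned generator, intended only to show how a composite emotion could steer color, brightness, and particle speed. The preset table, its values, and the function name `blend_visuals` are hypothetical.

```python
# Hypothetical presets: RGB color, brightness in [0, 1], particle speed in [0, 1].
PRESETS = {
    "anger": {"rgb": (220, 30, 30), "brightness": 0.3, "particle_speed": 0.9},
    "joy": {"rgb": (250, 210, 60), "brightness": 0.9, "particle_speed": 0.3},
    "calm": {"rgb": (70, 130, 220), "brightness": 0.7, "particle_speed": 0.1},
}

def blend_visuals(emotion_weights):
    """Blend preset visual parameters in proportion to each emotion's weight."""
    total = sum(emotion_weights.values())
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    rgb = [0.0, 0.0, 0.0]
    brightness = 0.0
    particle_speed = 0.0
    for emotion, weight in emotion_weights.items():
        share = weight / total  # normalize so the shares sum to 1
        preset = PRESETS[emotion]
        for k in range(3):
            rgb[k] += share * preset["rgb"][k]
        brightness += share * preset["brightness"]
        particle_speed += share * preset["particle_speed"]
    return {
        "rgb": tuple(round(channel) for channel in rgb),
        "brightness": brightness,
        "particle_speed": particle_speed,
    }

mostly_angry = blend_visuals({"anger": 0.8, "joy": 0.2})
mostly_joyful = blend_visuals({"anger": 0.2, "joy": 0.8})
pure_anger = blend_visuals({"anger": 1.0})
```

Here the mostly angry blend is darker and has faster particles than the mostly joyful one, matching the qualitative behavior described for each element.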

[0122] The generation of these visual elements is accomplished through a deep neural network. The generator encodes the emotional features of the input through the network, thereby generating visual elements that meet emotional needs. The system adjusts these elements based on the user's emotional feedback and real-time interaction, making the visual presentation more in line with the user's emotional changes.

[0123] Reinforcement Learning Optimization: To enhance the interactivity and personalization of dynamic image generation, this embodiment also introduces a reinforcement learning optimization mechanism. Through reinforcement learning, the system can adjust the generation strategy based on the user's real-time emotional feedback, improving the quality of the generated images and the user experience.

[0124] During reinforcement learning, the system evaluates the generated images using a reward function and adjusts its strategy based on the user's emotional feedback. Specifically, the system rewards the generated images based on their emotional relevance; if the generated images better match the user's emotional state, the system will award a higher reward, thus incentivizing the generator to produce images that better meet emotional needs.

[0125] In this way, the generator continuously optimizes its generation strategy, ensuring that each generated image better reflects the user's emotional changes. This real-time optimization capability enhances the system's interactivity, making each user's experience more personalized and responsive.
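The reward-driven adjustment of the generation strategy can be sketched with a policy gradient update, the family of methods named in claim 7. The sketch makes strong simplifying assumptions: the strategy is a softmax policy over three hypothetical rendering styles, each style has a fixed reward standing in for the measured emotional match, and the update uses the exact gradient of the expected reward rather than sampled interactions.

```python
import math

def softmax(prefs):
    """Convert strategy preferences into selection probabilities."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(prefs, rewards, learning_rate):
    """One gradient-ascent step on the expected reward of a softmax policy."""
    probs = softmax(prefs)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy: d(expected)/d(prefs[k]) = probs[k] * (rewards[k] - expected)
    return [
        h + learning_rate * p * (r - expected)
        for h, p, r in zip(prefs, probs, rewards)
    ]

# Hypothetical strategies and their emotional-match rewards in [0, 1].
strategies = ["soft_blue", "warm_yellow", "intense_red"]
match_rewards = [0.2, 0.9, 0.4]

prefs = [0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(500):
    prefs = policy_gradient_step(prefs, match_rewards, learning_rate=0.5)

probs = softmax(prefs)
best = strategies[probs.index(max(probs))]
```

Strategies whose reward exceeds the current expected reward gain probability and the others lose it, so the policy concentrates on the style with the highest match. A deployed system would estimate this gradient from sampled user feedback instead of known rewards.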

[0126] Output of Dynamic Visuals: Finally, after multiple iterations and optimizations, the system will generate the final dynamic visuals. These visuals not only reflect the user's emotional state but also dynamically adjust according to changes in emotion. In some embodiments, the generated visuals can also be combined with other elements such as sound effects and text prompts to further enhance the richness and diversity of emotional expression.

[0127] In summary, the dynamic image generation step in this embodiment achieves highly personalized emotion and visual matching by combining generative adversarial networks and reinforcement learning. Through real-time adjustments to visual elements such as color, lighting, and particle effects, the system can provide users with a more immersive emotional interaction experience. This innovative method will have wide applications in fields such as virtual reality, smart homes, and affective computing.

[0128] S6. Optimize the generated dynamic images and user emotional interaction using reinforcement learning algorithms, and adjust the generation strategy accordingly.

[0129] Referring to Figures 4 and 5 of the accompanying drawings, reinforcement learning plays a crucial role in this invention as part of closed-loop optimization. By utilizing reinforcement learning, the system can dynamically adjust its generation strategy based on real-time user feedback, making emotion recognition and dynamic image generation more adaptive and personalized. This closed-loop optimization process not only improves the naturalness of emotional interaction but also provides customized experiences for different users and contexts.

[0130] In this embodiment, the core of closed-loop optimization is establishing a feedback mechanism between the generated visuals and the user's emotional state. Specifically, the system adjusts the parameters of the emotion recognition model based on the user's real-time emotional responses, thereby optimizing the generated dynamic visuals. Through continuous adjustment and updating of the feedback signals, the system can gradually improve the accuracy of emotion recognition and visual generation. Closed-loop optimization relies on reinforcement learning algorithms; after each interaction, the system evaluates the matching degree between the generated visuals and the emotional state based on the reward value, thereby optimizing the model.

[0131] In one possible implementation, the system evaluates whether the currently generated image accurately reflects the user's emotional state using a reward function derived from reinforcement learning. The reward function can be set based on the following parameters:

[0132] $R = f(A_{\text{emo}}, Q_{\text{img}})$;

[0133] where $R$ denotes the reward value, $A_{\text{emo}}$ denotes the accuracy of emotion recognition, and $Q_{\text{img}}$ denotes the quality of the generated image. The emotion recognition accuracy $A_{\text{emo}}$ is measured by comparing the difference between the emotional state predicted by the system and the user's actual emotional response. The quality of the generated image $Q_{\text{img}}$ is determined by the user's subjective rating or by how well the image matches the user's emotional state. The system uses the value of the reward function $R$ to adjust parameters and thereby optimize the emotion recognition and dynamic image generation models.
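One possible way to compute such a reward is sketched below. How the two terms are combined is not specified here, so the sketch assumes a convex combination controlled by a hypothetical `accuracy_weight`. It further assumes that accuracy is one minus half the L1 distance between the predicted and observed emotion distributions, and that image quality comes from a 1-to-5 user rating rescaled to [0, 1].

```python
def recognition_accuracy(predicted, observed):
    """1 minus half the L1 distance between two emotion probability distributions."""
    distance = sum(abs(p - o) for p, o in zip(predicted, observed))
    return 1.0 - 0.5 * distance

def image_quality(user_rating, lowest=1, highest=5):
    """Rescale a subjective rating to the range [0, 1]."""
    return (user_rating - lowest) / (highest - lowest)

def reward(accuracy, quality, accuracy_weight=0.5):
    """Hypothetical reward: convex combination of accuracy and image quality."""
    if not 0.0 <= accuracy_weight <= 1.0:
        raise ValueError("accuracy_weight must lie in [0, 1]")
    return accuracy_weight * accuracy + (1.0 - accuracy_weight) * quality

# Hypothetical interaction: predicted vs. observed emotion distribution, rating 4 of 5.
accuracy = recognition_accuracy([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
quality = image_quality(4)
value = reward(accuracy, quality)
```

In this example the accuracy is 0.9, the quality is 0.75, and the reward is 0.825. Raising `accuracy_weight` would make the optimization favor recognition accuracy over perceived image quality.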

[0134] Specifically, the closed-loop optimization steps include: First, the system performs emotion recognition using multimodal data and generates corresponding images; then, it obtains the matching degree between the generated images and the emotional state through user feedback or the emotion assessment module; and finally, it uses reinforcement learning algorithms to update the parameters of the emotion recognition model and the image generation model, thereby improving the accuracy and user experience in subsequent interactions.

[0135] In some embodiments, the reinforcement learning process is performed using the Q-learning algorithm, specifically optimized using the following update formula:

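[0136] $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$;

[0137] where $s$ is the current state, $a$ is the current action, $R$ is the reward value, $\gamma$ is the discount factor, and $\alpha$ is the learning rate; $s'$ is the next state and $a'$ ranges over the actions available in that state. By continuously updating the Q-values, the system can gradually improve the effectiveness of emotion recognition and image generation.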
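The standard tabular Q-learning update can be sketched in a few lines. The state names, action names, reward value, and hyperparameters below are hypothetical; this embodiment does not specify how states and actions are encoded.

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, actions,
             learning_rate=0.1, discount=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best action available in the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + discount * best_next
    q_table[(state, action)] += learning_rate * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

# Hypothetical actions: ways of adjusting the generated visuals.
actions = ["warmer_palette", "cooler_palette"]
q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# The user was anxious, the system cooled the palette, and feedback was positive.
first = q_update(q_table, "anxious", "cooler_palette", 1.0, "calm", actions)
second = q_update(q_table, "anxious", "cooler_palette", 1.0, "calm", actions)
```

After the two identical positive interactions, the Q-value for cooling the palette in the anxious state rises from 0.0 to 0.1 and then to 0.19, so that action becomes increasingly preferred.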

[0138] In general, the system can collect the user's emotional feedback after each interaction and adjust the model based on this feedback to ensure that the system performs better in emotion recognition and image generation quality in subsequent interactions.

[0139] Alternatively, this invention can be combined with other reinforcement learning algorithms, such as Deep Q-Networks (DQN), to further enhance the system's adaptability. Specifically, DQN uses deep neural networks to approximate the Q-value function, enabling the system to learn and optimize efficiently in complex environments.

[0140] Through this series of closed-loop optimization processes, the present invention achieves continuous self-improvement of emotion recognition and image generation technology, solves the shortcomings of existing technologies that are difficult to adapt to users' emotional changes and needs in real time, and provides users with a smoother and more personalized emotional interaction experience.

[0141] The device for generating interactive visuals based on object emotion recognition described below and the method for generating interactive visuals based on object emotion recognition described above may be read with reference to each other.

[0142] Referring to Figure 6 of the accompanying drawings, a device for generating interactive visuals based on object emotion recognition comprises:

[0143] The multimodal data acquisition module is used to collect users' multimodal data, including voice, facial expressions, and body posture data;

[0144] The emotion feature extraction module is used to extract emotion features from the multimodal data, including voice features, facial expression features, and body posture features;

[0145] The emotional feature fusion module is used to fuse the emotional features extracted by the feature extraction module to obtain fused emotional features, and to perform emotional recognition on the fused features to identify the user's emotional state.

[0146] The image generation module is used to generate dynamic images that match the emotional state identified by the feature fusion module.

[0147] The optimization module is used to improve the matching degree between the images generated by the image generation module and the user's emotions through reinforcement learning, and to adjust the generation strategy.

[0148] The feature fusion module fuses features from multiple modalities using a weighted summation method, where the weights are automatically adjusted during the training process in the emotion recognition task.

[0149] The device in this embodiment can be used to execute the above method embodiments, and its principle and technical effects are similar, so they will not be described again here.

[0150] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for generating interactive visuals based on object emotion recognition, characterized in that it includes the following steps: S1. Acquire multimodal data, including speech, facial expression, and body posture data; S2. Extract emotion features from the multimodal data; S3. Fuse the extracted multimodal features to obtain fused emotion features, wherein the fusion is performed by weighted summation and the weight of each modal feature is automatically optimized through the training process; S4. Perform emotion recognition on the fused emotion features using an emotion state recognition model; S5. Generate dynamic images that match the identified emotional state, and use a generative network to adjust the image elements during the generation process; S6. Optimize the generated dynamic images and user emotional interaction using reinforcement learning algorithms, and adjust the generation strategy accordingly.

2. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the emotion state recognition model in step S4 is a deep learning-based model, which can be a convolutional neural network, a long short-term memory network, or a Transformer network.

3. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the emotion state recognition model in step S4 includes a multilayer perceptron model, which is used to map the fused emotion features to emotion categories, including basic emotions and complex emotions.

4. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the emotion fusion in step S3 is represented by weighted summation of multiple basic emotions, where the weight of each basic emotion is determined through the training process, and the generated composite emotion is a weighted combination of basic emotions.

5. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the dynamic image generation in step S5 is achieved through a generative adversarial network, in which the generator generates image elements based on emotional states, and the discriminator judges the difference between the generated image and the real image and optimizes the generator accordingly.

6. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the reinforcement learning algorithm in step S6 optimizes the strategy through a reward function, the reward function evaluates the interactive effect of the generated image based on user feedback, and the optimization goal is to maximize the matching degree between the user's emotional experience and the generated image.

7. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the reinforcement learning algorithm in step S6 employs the policy gradient method, which optimizes image generation by dynamically adjusting the policy.

8. The method for generating interactive visuals based on object emotion recognition according to claim 1, characterized in that the emotion state recognition model in step S4 captures the time-series features of emotions by performing spatiotemporal modeling of emotion features at multiple time points using a long short-term memory network or a Transformer network.

9. A device for generating interactive visuals based on object emotion recognition, characterized in that it is used to implement the method for generating interactive visuals based on object emotion recognition according to any one of claims 1-8, and comprises: a multimodal data acquisition module, used to collect users' multimodal data, including voice, facial expression, and body posture data; an emotion feature extraction module, used to extract emotion features from the multimodal data, including voice features, facial expression features, and body posture features; an emotional feature fusion module, used to fuse the emotional features extracted by the feature extraction module to obtain fused emotional features, and to perform emotion recognition on the fused features to identify the user's emotional state; an image generation module, used to generate dynamic images that match the emotional state identified by the feature fusion module; and an optimization module, used to improve the matching degree between the images generated by the image generation module and the user's emotions through reinforcement learning, and to adjust the generation strategy.

10. The device for generating interactive visuals based on object emotion recognition according to claim 9, characterized in that the feature fusion module fuses features from multiple modalities using a weighted summation method, where the weights are automatically adjusted during the training process in the emotion recognition task.