Vehicle-mounted emotion interaction method and device based on multi-dimensional recognition

By integrating facial expressions, voice emotions, driving behavior, and physiological state features into a multi-dimensional recognition model, and combining external environmental data and hazard levels, an adaptive interaction strategy is designed and optimized through online learning. This solves the problem of insufficient emotion recognition and interaction strategies in in-vehicle emotion interaction, and improves the level of intelligence and user experience.

CN120951079B | Active | Publication Date: 2026-06-26 | SHENZHEN ZHI HUI LIN NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN ZHI HUI LIN NETWORK TECH CO LTD
Filing Date: 2025-07-18
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing in-vehicle emotion interaction methods lack a systematic approach to emotion recognition, making it difficult to integrate multi-dimensional features such as facial expressions, voice emotions, driving behavior, and physiological states. Furthermore, the interaction strategies fail to comprehensively consider the external environment and the level of danger in the scenario, and lack an adaptive interaction mechanism, resulting in unsatisfactory interaction effects.

Method used

By collecting and integrating camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from vehicle terminals, a multi-dimensional recognition model is constructed, including a long short-term memory neural network and a multimodal feature fusion network. Combined with external environmental data and hazard levels, an adaptive interaction strategy is designed, and the model is optimized through online learning using an interaction effect evaluation model.

Benefits of technology

It enables accurate judgment of driver emotions and dynamic adjustment of personalized interactive content, improving the intelligence level of in-vehicle emotional interaction and user experience, and ensuring the adaptability and effectiveness of interaction strategies in complex driving scenarios.

✦ Generated by Eureka AI based on patent content.

Figure CN120951079B_ABST

Abstract

The embodiments of this application provide an in-vehicle emotion interaction method and device based on multi-dimensional recognition. An emotion fusion recognition model is constructed that integrates facial expression, speech emotion, driving behavior, and physiological state features to accurately judge the driver's emotion. A scene-based adaptive interaction strategy is designed that combines external environment data and danger level to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, and an online learning module continuously optimizes the interaction strategy model to dynamically adjust personalized interaction content. The method effectively addresses the deficiencies of traditional technology in emotion recognition, interaction strategy, and effect evaluation, and significantly improves the intelligence level and user experience of in-vehicle emotion interaction.

Description

Technical Field

[0001] This application relates to the field of data processing, specifically to an in-vehicle emotion interaction method and device based on multi-dimensional recognition.

Background Technology

[0002] Existing in-vehicle emotion interaction methods have significant shortcomings. Traditional systems lack a systematic approach to emotion recognition, making it difficult to effectively integrate multi-dimensional features such as facial expressions, voice emotions, driving behavior, and physiological states, thus affecting the accuracy of emotion judgment.

[0003] Furthermore, existing technologies suffer from bottlenecks in interaction strategies. Most systems fail to comprehensively consider the external environment and the level of danger in the scenario, and lack adaptive interaction mechanisms based on emotion intensity, resulting in less than ideal interaction effects.

[0004] Existing systems have technical shortcomings in effectiveness evaluation. They lack the ability to dynamically evaluate interaction effects and struggle to continuously optimize strategy models through online learning, thus impacting the user experience. Addressing these issues is crucial for improving the level of in-vehicle emotional interaction.

Summary of the Invention

[0005] To address the problems in existing technologies, this application provides a vehicle-mounted emotion interaction method and device based on multi-dimensional recognition, which can effectively solve the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, and significantly improve the intelligence level and user experience of vehicle-mounted emotion interaction.

[0006] To solve at least one of the above problems, this application provides the following technical solution:

[0007] Firstly, this application provides a vehicle-mounted emotion interaction method based on multi-dimensional recognition, including:

[0008] The system collects video data from a vehicle-mounted terminal camera, voice data from a microphone, data from vehicle sensors, and data from physiological sensors. Facial expression features are extracted from the video data, voice emotion feature vectors are extracted from the voice data, driving behavior feature vectors are extracted from the vehicle sensor data, and physiological state feature vectors are extracted from the physiological sensor data. A temporal feature extraction model is constructed based on a long short-term memory neural network. The facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors are input into the temporal feature extraction model to generate emotion change trend features. The emotion change trend features are input into an emotion fusion recognition model, and the emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.

[0009] External environmental data is acquired, and weather and road condition features are extracted. A scene recognition model is constructed, which outputs the scene danger level, and an interaction trigger threshold is set based on the scene danger level. An emotion interaction strategy model is trained; it generates interaction strategy parameters based on the emotion state label, the emotion intensity value, and the interaction trigger threshold. Based on the interaction strategy parameters, the corresponding speech model and interaction material are selected from the speech synthesis model library and the interaction content library.

[0010] An interaction effect evaluation model is constructed. The interaction effect evaluation model calculates the interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score. The optimized parameters are written into the strategy model library. Interaction content is generated based on the interaction strategy parameters and sent to the vehicle terminal.

[0011] Furthermore, it also includes: collecting camera video data and microphone voice data through the sensor interface of the vehicle terminal; extracting frame sequences from the camera video data; locating the face region in each frame image using a face detection algorithm; extracting the coordinates of 68 facial feature points based on a facial key point localization algorithm; constructing a facial geometric feature descriptor; inputting the facial geometric feature descriptor into a pre-trained expression classification neural network to extract facial expression features; performing frame segmentation and pre-emphasis processing on the microphone voice data; calculating the Mel frequency cepstral coefficients and pitch period; and constructing a voice emotion feature vector by combining acoustic features such as sound intensity and pitch.

[0012] Steering wheel angle data, vehicle speed data, and braking depth data are acquired from the vehicle sensor data interface. Electrocardiogram (ECG) data, skin conductance data, and respiratory rate data are collected from physiological sensors. The steering wheel angle data, vehicle speed data, and braking depth data are normalized and divided into driving behavior feature vectors according to time windows. The ECG data, skin conductance data, and respiratory rate data are subjected to noise filtering and signal smoothing. Heart rate variability index and skin conductance level index are extracted. The heart rate variability index and skin conductance level index are combined to construct a physiological state feature vector.

[0013] Furthermore, it also includes: constructing a long short-term memory neural network structure, where the network input layer dimension corresponds to the feature vector dimension, the hidden layer contains multiple memory units, and each memory unit consists of an input gate, a forget gate, and an output gate. The input gate controls the importance of the new input information at the current moment, the forget gate controls the forgetting ratio of historical information, and the output gate controls the output degree of information. The labeled historical data is input into the long short-term memory neural network for training to obtain a temporal feature extraction model. The facial expression features, the voice emotion feature vector, the driving behavior feature vector, and the physiological state feature vector are input into the temporal feature extraction model according to the time sequence to generate trend features reflecting the driver's emotions changing over time.
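The gating behavior described above can be illustrated with a single LSTM time step in plain Python. This is a minimal sketch, not the trained model from the application: the dictionary-based weight layout, the gate names, and the all-zero weights in the toy run are assumptions chosen only to make the arithmetic easy to follow. Note that in the standard formulation the forget gate value is the fraction of the previous cell state that is kept.

```python
from math import exp, tanh

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W[gate] holds one weight row per hidden unit over concat(x, h_prev)."""
    z = list(x) + list(h_prev)

    def affine(gate):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W[gate], b[gate])]

    i = [sigmoid(v) for v in affine("input")]      # how much new information to admit
    f = [sigmoid(v) for v in affine("forget")]     # how much of the old cell state to keep
    o = [sigmoid(v) for v in affine("output")]     # how much of the cell state to expose
    g = [tanh(v) for v in affine("candidate")]     # candidate cell content
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy run: 2 input features, 1 hidden unit, all-zero (untrained) weights.
gates = ("input", "forget", "output", "candidate")
W = {name: [[0.0, 0.0, 0.0]] for name in gates}
b = {name: [0.0] for name in gates}
h, c = [0.0], [1.0]
for x in ([0.2, 0.7], [0.4, 0.1]):
    h, c = lstm_step(x, h, c, W, b)
```

With zero weights every gate evaluates to 0.5, so the cell state simply halves at each step; trained weights would make the gates depend on the input features.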

[0014] A multimodal feature fusion network based on an attention mechanism is constructed as an emotion fusion recognition model. The attention weights of each time step in the trend features are calculated, and the features are weighted and summed according to the attention weights to obtain the fused feature representation. The fused feature representation is input into a fully connected layer, and the probability distribution of the emotion state label is output through a softmax classifier. The category with the highest probability is selected as the final emotion state label, and the fused feature representation is mapped to a value between 0 and 1 through a sigmoid function to obtain the emotion intensity value.
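The fusion and output steps of paragraph [0014] can be sketched as follows. This is an illustrative forward pass only: the dot-product scoring vector `score_w`, the classifier and intensity weights, and the two-label set are hypothetical placeholders, and a real model would learn them during training.

```python
from math import exp

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)                       # subtract the max for numerical stability
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def fuse_and_classify(trend, score_w, class_W, class_b, intensity_w, intensity_b, labels):
    """trend: one feature vector per time step (the emotion change trend features)."""
    weights = softmax([dot(score_w, step) for step in trend])        # attention weight per time step
    fused = [sum(a * step[k] for a, step in zip(weights, trend))     # weighted sum over time
             for k in range(len(trend[0]))]
    probs = softmax([dot(row, fused) + bias for row, bias in zip(class_W, class_b)])
    label = labels[probs.index(max(probs))]                          # most probable emotion state
    intensity = sigmoid(dot(intensity_w, fused) + intensity_b)       # value in (0, 1)
    return label, intensity, weights

label, intensity, weights = fuse_and_classify(
    trend=[[1.0, 0.0], [0.0, 1.0]],
    score_w=[0.0, 0.0],
    class_W=[[1.0, 0.0], [0.0, 2.0]], class_b=[0.0, 0.0],
    intensity_w=[0.0, 0.0], intensity_b=0.0,
    labels=["happiness", "anger"],
)
```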

[0015] Furthermore, it also includes: collecting rain and snow weather data, visibility data, and temperature and humidity data through the environmental perception sensor of the vehicle terminal; obtaining real-time road condition data, traffic flow data, and road surface condition data from the vehicle communication module; performing feature extraction and numerical processing on the rain and snow weather data, the visibility data, and the temperature and humidity data to obtain weather condition features; performing feature extraction and numerical processing on the real-time road condition data, the traffic flow data, and the road surface condition data to obtain road condition features; and normalizing the weather condition features and the road condition features.

[0016] A multilayer perceptron neural network is constructed as a scene recognition model. The network input layer receives the weather condition features and the road condition features, the hidden layer uses the ReLU activation function, and the output layer uses the softmax function to output the probability distribution of the scene hazard level. The scene recognition model is trained based on historical labeled data, and the scene hazard level is mapped to a preset interval to obtain the interaction trigger benchmark value. The interaction trigger threshold under different scenarios is set according to the interaction trigger benchmark value.
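A minimal sketch of the scene recognition forward pass and the mapping to a trigger benchmark value follows. The application does not state the preset interval or whether a higher hazard level raises or lowers the benchmark, so the interval `(0.3, 0.9)` and the increasing linear mapping used here are assumptions, as are the toy weights.

```python
from math import exp

def dense(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(xs):
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def scene_hazard_probs(features, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a softmax over hazard levels."""
    return softmax(dense(W2, b2, relu(dense(W1, b1, features))))

def trigger_benchmark(probs, interval=(0.3, 0.9)):
    """Map the most probable hazard level linearly onto a preset interval (assumed mapping)."""
    level = probs.index(max(probs))
    lo, hi = interval
    return level, lo + (hi - lo) * level / (len(probs) - 1)

# Toy weights: 2 normalized input features (weather, road), 3 hazard levels.
probs = scene_hazard_probs(
    [0.2, 0.8],
    W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    W2=[[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]], b2=[0.0, 0.0, 0.0],
)
level, benchmark = trigger_benchmark(probs)
```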

[0017] Furthermore, it also includes: constructing an emotion interaction strategy model based on deep reinforcement learning, combining the emotion state label, the emotion intensity value, and the interaction trigger threshold to construct a state vector, using the interaction tone, interaction content type, and interaction timing as the action space, designing an interaction effect scoring function as the reward signal, and using a deep Q-network to train the emotion interaction strategy model. During the training process, the interaction action is selected based on the exploration-exploitation strategy, and the selected interaction action is decoded into interaction strategy parameters. The interaction strategy parameters include speech synthesis parameters, content matching parameters, and push timing parameters.
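The action selection and decoding steps can be sketched as below. The Q-values would come from the deep Q-network; here they are a plain list standing in for the network output. The specific tone, content type, and timing options are hypothetical examples, and epsilon-greedy is assumed as the exploration-exploitation strategy because the text does not name one.

```python
import random
from itertools import product

# Hypothetical action space: every combination of tone, content type, and timing.
TONES = ["soothing", "neutral", "alerting"]
CONTENT_TYPES = ["music", "voice_prompt", "text_tip"]
TIMINGS = ["immediate", "next_stop", "deferred"]
ACTIONS = list(product(TONES, CONTENT_TYPES, TIMINGS))

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decode_action(index):
    """Turn an action index into the three groups of interaction strategy parameters."""
    tone, content_type, timing = ACTIONS[index]
    return {
        "speech_synthesis": {"tone": tone},
        "content_matching": {"type": content_type},
        "push_timing": {"when": timing},
    }

q_values = [0.0] * len(ACTIONS)
q_values[5] = 1.0                                  # pretend the network prefers action 5
chosen = select_action(q_values, epsilon=0.0, rng=random.Random(0))
params = decode_action(chosen)
```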

[0018] Based on the speech synthesis parameters, a corresponding speech clone model is selected from the speech synthesis model library. Based on the content matching parameters, the similarity score of the materials in the interactive content library is calculated. The audio, text, and image with the highest similarity score are selected as interactive materials. Based on the push timing parameters, the optimal interaction time is determined. The speech clone model is applied to the text content in the interactive materials to generate personalized voice broadcast content.
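One way to picture the material selection is a similarity search over embedded library items, keeping the best item of each media type. The text only says "similarity score", so cosine similarity, the embedding vectors, and the item fields (`id`, `kind`, `embedding`) are assumptions made for this sketch.

```python
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_materials(query, library):
    """Return the id of the highest-scoring item for each kind (audio, text, image)."""
    best = {}
    for item in library:
        score = cosine_similarity(query, item["embedding"])
        kind = item["kind"]
        if kind not in best or score > best[kind][0]:
            best[kind] = (score, item["id"])
    return {kind: item_id for kind, (score, item_id) in best.items()}

library = [
    {"id": "a1", "kind": "audio", "embedding": [1.0, 0.0]},
    {"id": "a2", "kind": "audio", "embedding": [0.0, 1.0]},
    {"id": "t1", "kind": "text", "embedding": [1.0, 1.0]},
    {"id": "i1", "kind": "image", "embedding": [0.0, 1.0]},
]
selected = select_materials([1.0, 0.1], library)
```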

[0019] Furthermore, it also includes: constructing an interaction effect evaluation model based on behavior sequence analysis, collecting the driver's click operation data, voice command data, and interruption behavior data on interactive content, performing time-series encoding on the click operation data, the voice command data, and the interruption behavior data to obtain an interaction behavior sequence, inputting the interaction behavior sequence into a bidirectional gated recurrent unit network to extract interaction feedback features, constructing a scoring function based on the interaction feedback features to calculate the interaction satisfaction score, and forming an evaluation sample pair from the interaction scenario information and the interaction satisfaction score.

[0020] An online learning module based on gradient boosting trees is constructed. The evaluation sample pairs are input into the online learning module in chronological order. The recent sample set is maintained using a sliding window method. The importance scores of each parameter in the emotion interaction strategy model are calculated based on the sample set. The parameter update step size is adjusted according to the importance scores. The network weights of the emotion interaction strategy model are updated online using the stochastic gradient descent method. The updated network weights are saved to the strategy model library.
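The sliding-window bookkeeping and the importance-scaled update can be sketched as follows. This is a deliberately simplified stand-in: a linear model replaces the strategy network, and the importance score is the mean absolute feature value over the window rather than a score derived from gradient boosting trees. The class and method names are hypothetical.

```python
from collections import deque

class OnlineUpdater:
    """Keeps the most recent samples and applies importance-scaled SGD steps."""

    def __init__(self, weights, window=50, base_lr=0.1):
        self.weights = list(weights)
        self.samples = deque(maxlen=window)     # old samples fall out automatically
        self.base_lr = base_lr

    def add_sample(self, features, satisfaction):
        self.samples.append((features, satisfaction))

    def importance(self):
        """Stand-in importance: mean absolute feature value, normalized to sum to 1."""
        n = len(self.weights)
        raw = [sum(abs(f[k]) for f, _ in self.samples) / len(self.samples) for k in range(n)]
        total = sum(raw)
        return [r / total for r in raw] if total else [1.0 / n] * n

    def sgd_step(self):
        """One squared-error SGD step on the newest sample, step size scaled per parameter."""
        features, target = self.samples[-1]
        error = sum(w * x for w, x in zip(self.weights, features)) - target
        scale = self.importance()
        self.weights = [w - self.base_lr * s * error * x
                        for w, s, x in zip(self.weights, scale, features)]

updater = OnlineUpdater([0.0, 0.0], window=2)
for features, score in ([[1.0, 0.0], 1.0], [[1.0, 0.0], 1.0], [[1.0, 1.0], 1.0]):
    updater.add_sample(features, score)
updater.sgd_step()
```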

[0021] Furthermore, it also includes: constructing a version control-based strategy model update mechanism, packaging the optimized network weights and model structure information into a model snapshot file, calculating the integrity check code of the model snapshot file, writing the model snapshot file and the integrity check code into the temporary storage area of the strategy model library, decompressing and verifying the model snapshot file, transferring it to the formal storage area after successful verification, updating the version identifier of the strategy model, deleting expired historical version files, and generating a model update log.
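The stage-verify-promote flow of paragraph [0021] might look like the sketch below. The file naming, the JSON-plus-zlib snapshot format, and SHA-256 as the integrity check code are assumptions; the application does not specify any of them. Deleting expired versions and writing the update log are omitted for brevity.

```python
import hashlib
import json
import os
import shutil
import tempfile
import zlib

def write_snapshot(weights, structure, version, staging_dir):
    """Package weights and structure, then store the snapshot and its checksum in staging."""
    payload = json.dumps({"version": version, "structure": structure, "weights": weights},
                         sort_keys=True).encode("utf-8")
    blob = zlib.compress(payload)
    path = os.path.join(staging_dir, f"policy_v{version}.snap")
    with open(path, "wb") as f:
        f.write(blob)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(blob).hexdigest())
    return path

def promote_snapshot(path, formal_dir):
    """Verify the checksum, decompress, and move the snapshot to the formal storage area."""
    with open(path, "rb") as f:
        blob = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(blob).hexdigest() != expected:
        raise ValueError("integrity check failed; snapshot not promoted")
    snapshot = json.loads(zlib.decompress(blob))
    target = os.path.join(formal_dir, os.path.basename(path))
    shutil.move(path, target)
    os.remove(path + ".sha256")
    return snapshot["version"], target

staging, formal = tempfile.mkdtemp(), tempfile.mkdtemp()
snapshot_path = write_snapshot([0.1, 0.2], {"layers": [4, 2]}, 3, staging)
version, target = promote_snapshot(snapshot_path, formal)
```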

[0022] Based on the interaction strategy parameters, the interactive materials are combined and processed, and the voice broadcast content, text prompt content, and image prompt content are encapsulated into a unified message format. The encapsulated message is then compressed and encrypted. The encrypted message is sent to the display control module and the audio control module through the message queue service of the vehicle terminal. The display control module displays the text and image content in a designated area of the vehicle terminal display screen, and the audio control module plays the voice content through the vehicle terminal speaker.

[0023] Secondly, this application provides an in-vehicle emotion interaction device based on multi-dimensional recognition, comprising:

[0024] The data preprocessing module is used to collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. It extracts facial expression features from the camera video data, voice emotion feature vectors from the voice data, driving behavior feature vectors from the vehicle sensor data, and physiological state feature vectors from the physiological sensor data. It constructs a temporal feature extraction model based on a long short-term memory neural network, inputs the facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors into the temporal feature extraction model to generate emotion change trend features, inputs the emotion change trend features into an emotion fusion recognition model, and the emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.

[0025] The emotion interaction module is used to acquire external environmental data, extract weather and road condition features, construct a scene recognition model, output a scene danger level, set an interaction trigger threshold based on the scene danger level, train an emotion interaction strategy model, generate interaction strategy parameters based on the emotion state label, the emotion intensity value, and the interaction trigger threshold, and select corresponding speech models and interaction materials from the speech synthesis model library and the interaction content library based on the interaction strategy parameters.

[0026] An interaction evaluation module is used to construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score, writes the optimized parameters into a strategy model library, generates interaction content based on the interaction strategy parameters, and sends it to the vehicle terminal.

[0027] Thirdly, this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the in-vehicle emotion interaction method based on multi-dimensional recognition.

[0028] Fourthly, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the in-vehicle emotion interaction method based on multi-dimensional recognition.

[0029] Fifthly, this application provides a computer program product, including a computer program / instructions, which, when executed by a processor, implement the steps of the in-vehicle emotion interaction method based on multi-dimensional recognition.

[0030] As described above, this application provides a method and device for in-vehicle emotion interaction based on multi-dimensional recognition. It innovatively constructs an emotion fusion recognition model, integrating facial expressions, voice emotions, driving behavior, and physiological state features to achieve accurate judgment of the driver's emotions. A scenario-based adaptive interaction strategy is designed, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively addresses the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a flowchart illustrating the in-vehicle emotion interaction method based on multi-dimensional recognition in the embodiments of this application;

[0033] Figure 2 is a structural diagram of the in-vehicle emotion interaction device based on multi-dimensional recognition in the embodiments of this application;

[0034] Figure 3 is a schematic diagram of the structure of the electronic device in the embodiments of this application.

[0035] Reference numerals in the figures:

[0036] Electronic device 9600, central processing unit 9100, memory 9140, communication module 9110, input unit 9120, audio processor 9130, display 9160, power supply 9170, buffer memory 9141, application / function storage unit 9142, data storage unit 9143, driver storage unit 9144, antenna 9111, speaker 9131, microphone 9132.

Detailed Implementation

[0037] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0038] The acquisition, storage, use, and processing of data in this application all comply with the relevant provisions of national laws and regulations.

[0039] To address the shortcomings of existing technologies, this application provides a multi-dimensional recognition-based in-vehicle emotion interaction method and device. By innovatively constructing an emotion fusion recognition model, it integrates facial expressions, voice emotions, driving behavior, and physiological state features to achieve accurate judgment of the driver's emotions. A scenario-based adaptive interaction strategy is designed, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively solves the deficiencies of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

[0040] To effectively address the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, and to significantly improve the intelligence level and user experience of in-vehicle emotion interaction, this application provides an embodiment of an in-vehicle emotion interaction method based on multi-dimensional recognition. Referring to Figure 1, the in-vehicle emotion interaction method based on multi-dimensional recognition specifically includes the following:

[0041] Step S101: Collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. Extract facial expression features from the camera video data, extract voice emotion feature vectors from the voice data, extract driving behavior feature vectors from the vehicle sensor data, and extract physiological state feature vectors from the physiological sensor data. Construct a temporal feature extraction model based on a long short-term memory neural network. Input the facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors into the temporal feature extraction model to generate emotion change trend features. Input the emotion change trend features into an emotion fusion recognition model. The emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.

[0042] Optionally, this embodiment addresses the problems of incomplete feature extraction, weak temporal correlation, and insufficient modal fusion in traditional in-vehicle emotion recognition by innovatively designing an emotion recognition scheme based on multi-dimensional perception. This embodiment first collects driver behavior data through a multi-source sensor system in the vehicle terminal, including facial video streams captured by the in-vehicle front-facing high-definition camera, voice signals captured by a noise-canceling microphone array, driving operation data collected by the vehicle's CAN bus, and physiological signals collected by wearable devices. A distributed data acquisition framework is adopted, establishing a synchronization mechanism for data with different sampling frequencies: Sync_Time = Base_Time + Offset, where Base_Time is the system's base time and Offset is the time deviation of each data source, ensuring the temporal consistency of multimodal data. For example, the video frame rate captured by the camera is 30fps, the voice sampling frequency is 16kHz, and the physiological signal sampling frequency is 100Hz; data synchronization is achieved through timestamp alignment.
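The timestamp alignment can be sketched in a few lines. The sketch assumes that sample i of a stream is stamped at Base_Time + Offset + i / rate and that alignment means picking, for each video frame, the physiological sample closest in time; the function names and the offset values are hypothetical.

```python
from bisect import bisect_left

def sample_times(base_time, offset, rate_hz, n):
    """Sync_Time = Base_Time + Offset for the stream start, then one sample every 1 / rate."""
    start = base_time + offset
    return [start + i / rate_hz for i in range(n)]

def nearest_index(times, t):
    """Index of the entry in sorted `times` that is closest to t."""
    pos = bisect_left(times, t)
    if pos == 0:
        return 0
    if pos == len(times):
        return len(times) - 1
    return pos if times[pos] - t < t - times[pos - 1] else pos - 1

video = sample_times(1000.0, 0.012, 30.0, 30)      # one second of 30 fps frames
physio = sample_times(1000.0, 0.005, 100.0, 100)   # one second of 100 Hz samples
aligned = [nearest_index(physio, t) for t in video]
```

Each entry of `aligned` is the index of the physiological sample to pair with the corresponding video frame; the 16 kHz audio stream could be aligned the same way at the frame level.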

[0043] This embodiment deeply optimizes the visual feature extraction mechanism. For camera video data, an improved face detection network is used to locate the face region in real time, and 68 facial feature points are extracted using a key point localization algorithm. The system constructs a geometric feature descriptor based on the feature point coordinates, including micro-expression features such as the distance between eyebrows, the curvature of the corners of the mouth, and the opening and closing of the eyelids. Special attention is paid to changes in lighting and head posture in driving scenarios, and the robustness of feature extraction is improved through adaptive brightness compensation and posture normalization. The system inputs the geometric feature descriptor into a pre-trained emotion recognition convolutional neural network. This network uses a residual structure to enhance feature extraction capabilities and adapts to the characteristics of in-vehicle scenarios through transfer learning. The network output includes probability distributions for multiple emotion categories such as happiness, anger, fatigue, and anxiety, providing visual-dimensional emotional features for subsequent multimodal fusion.
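A geometric descriptor of the kind described above could be computed as in the following sketch. The landmark indices assume the widely used 68-point annotation layout (jaw 0 to 16, eyebrows 17 to 26, eyes 36 to 47, mouth 48 to 67); the application does not name a specific scheme, so the indices, the face-width normalization, and the eye-aspect-ratio formula are assumptions.

```python
from math import dist

RIGHT_EYE = range(36, 42)
LEFT_EYE = range(42, 48)

def eye_openness(pts, idx):
    """Eye aspect ratio: mean vertical lid distance divided by horizontal eye width."""
    p1, p2, p3, p4, p5, p6 = (pts[i] for i in idx)
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

def geometric_descriptor(pts):
    """pts: 68 (x, y) landmarks in image coordinates, with y growing downward."""
    face_width = dist(pts[0], pts[16])                   # scale normalization
    brow_gap = dist(pts[21], pts[22]) / face_width       # distance between inner brow ends
    lip_mid_y = (pts[51][1] + pts[57][1]) / 2.0
    corner_y = (pts[48][1] + pts[54][1]) / 2.0
    mouth_curve = (lip_mid_y - corner_y) / face_width    # positive when corners sit above the lip midpoint
    return [brow_gap, eye_openness(pts, RIGHT_EYE), eye_openness(pts, LEFT_EYE), mouth_curve]

# Synthetic landmarks: only the points used by the descriptor are set.
pts = [(0.0, 0.0)] * 68
pts[0], pts[16] = (0.0, 0.0), (100.0, 0.0)
pts[21], pts[22] = (45.0, 20.0), (55.0, 20.0)
pts[36:42] = [(20.0, 30.0), (24.0, 28.0), (28.0, 28.0), (32.0, 30.0), (28.0, 32.0), (24.0, 32.0)]
pts[42:48] = [(68.0, 30.0), (72.0, 28.0), (76.0, 28.0), (80.0, 30.0), (76.0, 32.0), (72.0, 32.0)]
pts[48], pts[54] = (40.0, 70.0), (60.0, 70.0)
pts[51], pts[57] = (50.0, 68.0), (50.0, 76.0)
descriptor = geometric_descriptor(pts)
```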

[0044] This embodiment innovatively implements a speech emotion feature extraction strategy. The speech signal acquired by the microphone is preprocessed, including noise reduction, framing, and pre-emphasis. The system calculates acoustic features based on Mel-frequency, extracting a set of acoustic parameters including fundamental frequency, formants, and energy. Time-frequency analysis captures the dynamic features of the speech signal, paying particular attention to the patterns of speech rate, pitch, and intensity. These acoustic features have a direct physiological correlation with human emotional expression; for example, anger is typically manifested as a higher pitch and faster speech rate, while fatigue is manifested as a slower speech rate and lower intensity. The system uses a deep neural network to map the acoustic features to an emotion feature space, generating feature vectors reflecting the speaker's emotional state.
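The pre-emphasis and framing steps of the preprocessing stage can be sketched as below. The pre-emphasis coefficient 0.97 and the 25 ms frame with 10 ms hop are common defaults assumed here, not values stated in the application; the per-frame energy is included as one simple example of an intensity feature.

```python
def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before spectral analysis."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop_len):
    """Split into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

sample_rate = 16000                       # 16 kHz, as in the embodiment
frame_len = sample_rate * 25 // 1000      # 25 ms -> 400 samples (assumed frame size)
hop_len = sample_rate * 10 // 1000        # 10 ms -> 160 samples (assumed hop size)

signal = [1.0] * sample_rate              # one second of a constant test signal
emphasized = pre_emphasis(signal)
frames = frame_signal(emphasized, frame_len, hop_len)
energies = [frame_energy(f) for f in frames]
```

Mel-frequency cepstral coefficients and pitch would then be computed per frame, typically with a signal processing library.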

[0045] This embodiment deeply optimizes the driving behavior analysis mechanism. A driving behavior feature vector is constructed based on onboard sensor data, including operational features such as steering wheel angle, accelerator pedal depth, and braking force. The system employs a sliding time window technique to segment the sensor data, extracting statistical features and temporal patterns. Driving behavior pattern analysis identifies abnormal operations such as rapid acceleration, sharp turns, and frequent braking, which often reflect the driver's emotional state. For example, frequent rapid acceleration and braking may indicate that the driver is anxious or angry. The system normalizes the extracted behavioral features to construct a standardized feature vector, providing a behavioral dimension basis for emotional state recognition.
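The windowing and normalization described above can be sketched as follows. The normalization ranges, the window length, the hard-braking threshold of 0.8, and the particular statistics placed in each vector are assumptions made for illustration.

```python
from statistics import mean, pstdev

def min_max_normalize(values, lo, hi):
    """Scale values into [0, 1] given an assumed physical range, clipping outliers."""
    span = hi - lo
    return [min(1.0, max(0.0, (v - lo) / span)) for v in values]

def driving_feature_vectors(steering, speed, brake, window=5, step=5, hard_brake=0.8):
    """One vector per time window: steering mean/std, speed mean/std, peak brake, hard-brake count."""
    vectors = []
    for a in range(0, len(speed) - window + 1, step):
        b = a + window
        vectors.append([
            mean(steering[a:b]), pstdev(steering[a:b]),
            mean(speed[a:b]), pstdev(speed[a:b]),
            max(brake[a:b]),
            sum(1 for x in brake[a:b] if x >= hard_brake),
        ])
    return vectors

steering = min_max_normalize([0, 5, -5, 10, -10, 0, 0, 0, 0, 0], -540.0, 540.0)  # degrees
speed = min_max_normalize([60] * 5 + [30] * 5, 0.0, 200.0)                        # km/h
brake = [0.0, 0.0, 0.9, 0.85, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]                       # already 0..1
vectors = driving_feature_vectors(steering, speed, brake)
```

In this toy data the first window contains two hard-braking samples, which is the kind of pattern the paragraph associates with anxiety or anger.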

[0046] This embodiment innovatively designs a physiological feature extraction scheme. It collects physiological signals such as electrocardiogram (ECG), skin conductance, and respiration using wearable devices, and employs biomedical signal processing methods for preprocessing and feature extraction. The system focuses on heart rate variability (HRV) indicators, extracting features reflecting autonomic nervous system activity through time-domain and frequency-domain analysis. Combined with the trend of skin conductance levels, it assesses the driver's stress level and emotional activation. These physiological indicators exhibit a stable correlation with emotional states, providing objective physiological evidence for emotion recognition.
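Two common time-domain heart rate variability indices, SDNN and RMSSD, together with a mean skin conductance level, give a compact example of such a physiological state vector. The application does not say which HRV indices it uses, so this choice is an assumption; the inputs are assumed to be already filtered RR intervals in milliseconds and skin conductance samples in microsiemens.

```python
from math import sqrt
from statistics import mean, stdev

def physiological_state_vector(rr_ms, skin_conductance_us):
    """Return [SDNN, RMSSD, mean skin conductance level] for one analysis window."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]     # successive RR differences
    sdnn = stdev(rr_ms)                                    # overall variability (sample std)
    rmssd = sqrt(mean([d * d for d in diffs]))             # short-term, beat-to-beat variability
    scl = mean(skin_conductance_us)                        # tonic skin conductance level
    return [sdnn, rmssd, scl]

def mean_heart_rate_bpm(rr_ms):
    return 60000.0 / mean(rr_ms)

rr = [800, 810, 790, 800]
vector = physiological_state_vector(rr, [2.0, 2.2, 2.4])
heart_rate = mean_heart_rate_bpm(rr)
```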

[0047] This embodiment achieves temporal fusion of multimodal features using a Long Short-Term Memory (LSTM) network. A multi-branch LSTM network structure is constructed, with each branch processing a feature sequence of one modality. The network learns feature dependencies at different time scales through a gating mechanism, capturing the gradual process and abrupt changes in emotional states. An attention mechanism is introduced on top of the LSTM to adaptively adjust the importance weights of different modal features. Through deep fusion of multimodal features, a temporal feature representation comprehensively reflecting the driver's emotional state is generated. Finally, the system outputs emotional state labels and intensity values through an emotion fusion recognition model, providing a decision-making basis for subsequent interaction strategy generation.

[0048] This embodiment's innovative design solves the feature extraction and modality fusion problems of traditional methods, establishing a continuously optimizing emotion recognition framework. Through deep fusion of multi-dimensional data and extraction of temporal features, the system can accurately identify changes in the driver's emotional state, providing a reliable perceptual foundation for intelligent interaction. This recognition mechanism based on multimodal analysis ensures that the system maintains stable recognition performance even in complex driving scenarios. In particular, the introduction of physiological features significantly improves the objectivity and accuracy of emotion recognition.

[0049] Step S102: Acquire external environment data and extract weather and road condition features; construct a scene recognition model that outputs a scene danger level; set an interaction trigger threshold based on the scene danger level; train an emotion interaction strategy model that generates interaction strategy parameters based on the emotion state label, the emotion intensity value, and the interaction trigger threshold; and select the corresponding speech model and interaction material from the speech synthesis model library and the interaction content library based on the interaction strategy parameters.

[0050] Optionally, this embodiment addresses the problems of insufficient scene perception, single interaction strategies, and rigid triggering mechanisms in traditional in-vehicle interaction systems by innovatively designing an adaptive interaction scheme based on environment perception. This embodiment first collects multi-dimensional external environmental data through an in-vehicle environment perception system, including temperature, humidity, and visibility data collected by in-vehicle weather sensors, road condition data detected by millimeter-wave radar, and real-time traffic flow information obtained through a V2X communication network. The system adopts a hierarchical data processing architecture, standardizing different types of environmental data: Environment_Score = w1 * Weather + w2 * Traffic + w3 * Road, where Weather, Traffic, and Road represent the scores for weather conditions, traffic conditions, and road conditions, respectively, and w1, w2, and w3 are corresponding weighting coefficients determined through expert experience and data analysis.
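The weighted combination above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `environment_score`, the default weights, and the assumption that each sub-score is already scaled to the range 0 to 1 are all hypothetical.

```python
def environment_score(weather, traffic, road, weights=(0.4, 0.3, 0.3)):
    """Environment_Score = w1 * Weather + w2 * Traffic + w3 * Road.

    The three sub-scores are assumed to be pre-scaled to [0, 1];
    the default weights are placeholders, not values from the source.
    """
    w1, w2, w3 = weights
    return w1 * weather + w2 * traffic + w3 * road


# Example: poor weather, moderate traffic, good road surface.
example_score = environment_score(weather=0.8, traffic=0.5, road=0.2)
```

With weights that sum to 1, the combined score stays in the same 0 to 1 range as its inputs, which keeps it comparable across scenes.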

[0051] This embodiment deeply optimizes the weather feature extraction mechanism. A complete feature description system is established for different weather conditions. The system measures visibility levels using optical sensors and assesses the risk of road icing by combining temperature and humidity data. For rainy and snowy weather, precipitation is measured using precipitation intensity sensors to assess the impact on driving safety. These weather factors are directly related to driving risks; for example, fog reduces visibility, increasing the risk of rear-end collisions; rain and snow reduce tire adhesion, increasing the risk of skidding. The system uses fuzzy inference methods to map multi-dimensional weather features to a risk level space, providing a meteorological basis for scene hazard assessment.

[0052] This embodiment innovatively implements a road condition analysis strategy. Based on the fusion perception of vehicle-mounted millimeter-wave radar and cameras, it detects road conditions and traffic situations in real time. The system analyzes road images through a deep learning network to identify abnormal conditions such as water accumulation, bumps, and cracks. It combines traffic flow data to analyze road congestion levels and assesses traffic flow stability through vehicle density and average speed. It pays special attention to abrupt changes in road conditions, such as sudden braking ahead or sudden lane narrowing, which are high-risk scenarios. The system comprehensively analyzes the extracted road condition features through a spatiotemporal fusion network to generate a state vector reflecting the degree of danger in the current driving environment.

[0053] This embodiment deeply optimizes the scene recognition model design. A multilayer perceptron is used to construct the scene recognition network. The input layer receives a fused vector of weather and road condition features. The hidden layer extracts high-order feature representations using the ReLU activation function to capture the non-linear correlations between features. The network output layer uses the softmax function to generate a probability distribution of scene hazard levels, dividing the hazard levels into low, medium, and high risk. The system dynamically adjusts the interaction trigger threshold according to the hazard level: Threshold = Base_Value * Risk_Factor, where Base_Value is the baseline threshold and Risk_Factor is the adjustment factor corresponding to the hazard level. This adaptive triggering mechanism ensures more timely emotional intervention in high-risk scenarios.
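As a rough sketch of the output stage described above, the following code turns three output-layer logits into a hazard probability distribution and then applies Threshold = Base_Value * Risk_Factor. The logits are assumed to come from a trained multilayer perceptron that is not shown; the `RISK_FACTOR` values and the default `base_value` are illustrative assumptions chosen so that a higher hazard level lowers the trigger threshold.

```python
import math

LEVELS = ("low", "medium", "high")
# Assumed adjustment factors: a smaller factor means an earlier trigger.
RISK_FACTOR = {"low": 1.0, "medium": 0.8, "high": 0.6}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trigger_threshold(logits, base_value=0.5):
    """Pick the most probable hazard level and scale the base threshold."""
    probs = softmax(logits)
    level = LEVELS[probs.index(max(probs))]
    return level, base_value * RISK_FACTOR[level]


example_level, example_threshold = trigger_threshold([0.1, 0.2, 2.0])
```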

[0054] This embodiment innovatively designs an interaction strategy generation scheme. A strategy model based on deep reinforcement learning is constructed, using emotion state labels, emotion intensity values, and interaction trigger thresholds as inputs to the state space. The model's action space includes dimensions such as interaction tone, content type, and trigger timing. By designing a reasonable reward function, the model is guided to learn the optimal interaction strategy. For example, when a driver experiences anxiety, the system will select an appropriate reassuring tone and prompt content based on the level of danger in the scenario; when the driver is fatigued, a more warning-oriented interaction method will be used.

[0055] This embodiment achieves intelligent scheduling of interactive resources through deep reinforcement learning technology. The system maintains a rich library of speech synthesis models and interactive content, including speech models of different genders, ages, and tones, as well as multimodal interactive materials such as text, images, and audio. Based on interaction strategy parameters, the system selects the most matching speech model and interactive materials from the resource library. Through emotional speech synthesis technology, the selected text content is converted into emotionally charged speech output, achieving a personalized interactive experience.

[0056] This embodiment's innovative design solves the problems of scene perception and policy generation in traditional methods, establishing a continuously optimizing interactive decision-making framework. Through environmental perception and risk assessment, the system can accurately grasp the timing of interaction, providing scene support for emotional intervention. This reinforcement learning-based policy generation mechanism ensures that the system maintains appropriate interaction strategies when facing complex driving scenarios. In particular, the intelligent scheduling of multimodal resources significantly improves the naturalness and acceptability of the interaction.

[0057] Step S103: Construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score. The optimized parameters are written into the strategy model library. Interaction content is generated based on the interaction strategy parameters and sent to the vehicle terminal.

[0058] Optionally, this embodiment addresses the problems of inaccurate feedback evaluation, untimely model optimization, and unsatisfactory interaction effects in traditional in-vehicle interaction systems by innovatively designing an adaptive optimization scheme based on real-time feedback. This embodiment first constructs a multi-dimensional interaction feedback collection mechanism, comprehensively collecting driver feedback behavior data through devices such as the in-vehicle terminal's touchscreen, voice recognition module, and camera. The system designs a behavior scoring function: Feedback_Score = w1 * Click + w2 * Voice + w3 * Gesture + w4 * Interrupt, where Click, Voice, Gesture, and Interrupt represent click responsiveness, voice command compliance, body language friendliness, and interaction interruption frequency, respectively. A weighted combination is used to calculate the comprehensive feedback score. This multi-dimensional feedback evaluation mechanism can comprehensively capture the driver's acceptance of the interactive content.
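The behavior scoring function can be illustrated with the short sketch below. The source does not state the weight values or their signs; giving the interruption term a negative weight, so that frequent interruptions lower the score, is an assumption made here for illustration, as are the function name and the 0 to 1 input range.

```python
def feedback_score(click, voice, gesture, interrupt,
                   weights=(0.3, 0.3, 0.2, -0.2)):
    """Feedback_Score = w1*Click + w2*Voice + w3*Gesture + w4*Interrupt.

    Inputs are assumed to be scaled to [0, 1]. The negative w4 is an
    assumption: more interruptions should reduce the score.
    """
    w1, w2, w3, w4 = weights
    return w1 * click + w2 * voice + w3 * gesture + w4 * interrupt


accepted = feedback_score(click=1.0, voice=1.0, gesture=1.0, interrupt=0.0)
interrupted = feedback_score(click=1.0, voice=1.0, gesture=1.0, interrupt=1.0)
```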

[0059] This embodiment deeply optimizes the behavior sequence analysis mechanism. A temporal behavior encoding model is constructed for the driver's interactive feedback behavior. The system uses a bidirectional gated recurrent unit network (Bi-GRU) to process the behavior sequence, capturing the temporal dependencies of behavior patterns through forward and backward hidden states. Particular attention is paid to the continuity characteristics of behavior; for example, continuous positive feedback usually indicates that the interaction strategy meets the driver's expectations, while frequent interruptions suggest that the interaction strategy needs adjustment. The system extracts high-order features of the behavior sequence through a deep learning network, providing a behavioral dimension basis for satisfaction assessment.

[0060] This embodiment innovatively implements a satisfaction assessment strategy. Based on extracted behavioral features, a hierarchical scoring model is constructed. The system considers the specificities of the interaction scenarios and adopts differentiated scoring standards for different emotional states and risk levels. For example, in high-risk scenarios, greater emphasis is placed on the driver's responsiveness to safety prompts; in emotion regulation scenarios, greater attention is paid to the improvement of emotional state. A fuzzy inference method is used to map multidimensional scores to a unified satisfaction space, generating quantitative indicators reflecting the interaction effect.

[0061] This embodiment deeply optimizes the online learning mechanism. It employs an online learning framework based on gradient boosting trees, maintaining the latest evaluation sample set through a sliding window. The system includes a feature importance analysis module to evaluate the impact of each parameter in the strategy model on the interaction effect. The learning rate is dynamically adjusted based on parameter importance: Learning_Rate = Base_Rate * Importance_Factor, where Base_Rate is the base learning rate and Importance_Factor is the parameter importance factor. This adaptive learning mechanism ensures that the model can quickly respond to user feedback and continuously optimize the interaction strategy.
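Two pieces of this mechanism lend themselves to a small sketch: the sliding window of recent evaluation samples and the rule Learning_Rate = Base_Rate * Importance_Factor. The code below does not train gradient boosting trees. It only shows the bookkeeping, and it assumes that the importance factor is each parameter's importance divided by the mean importance. The parameter names `tone` and `timing` are hypothetical.

```python
from collections import deque


def adaptive_rates(base_rate, importances):
    """Scale the base learning rate by a per-parameter importance factor.

    Assumption: Importance_Factor = importance / mean importance, so an
    average parameter keeps exactly the base rate.
    """
    mean_importance = sum(importances.values()) / len(importances)
    return {name: base_rate * (value / mean_importance)
            for name, value in importances.items()}


# Sliding window that keeps only the three most recent satisfaction scores.
recent_scores = deque(maxlen=3)
for score in (0.2, 0.4, 0.6, 0.8):
    recent_scores.append(score)

example_rates = adaptive_rates(0.1, {"tone": 3.0, "timing": 1.0})
```

A bounded `deque` discards the oldest sample automatically, which matches the idea of always evaluating against the latest feedback.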

[0062] This embodiment innovatively designs a model update strategy. A version-controlled parameter update mechanism is constructed to package and verify the optimized model parameters. The system ensures the reliability of parameter updates through integrity checks and uses incremental updates to reduce storage overhead. A rollback mechanism is specifically designed to quickly restore to a stable version when parameter updates cause performance degradation. The optimization process is recorded through model update logs, providing analytical basis for subsequent strategy improvements.
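A version-controlled update with integrity checking and rollback could look roughly like the sketch below. The class name `ParamStore`, the JSON serialization, and the SHA-256 digest are illustrative assumptions. For brevity the sketch stores full snapshots rather than the incremental updates mentioned above.

```python
import hashlib
import json


class ParamStore:
    """Keeps an ordered history of parameter snapshots with digests."""

    def __init__(self):
        self.versions = []  # list of (sha256 hex digest, JSON payload)

    def commit(self, params):
        """Package a snapshot and record its digest; returns the version index."""
        payload = json.dumps(params, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.versions.append((digest, payload))
        return len(self.versions) - 1

    def load(self, version=-1):
        """Verify integrity before handing parameters to the strategy model."""
        digest, payload = self.versions[version]
        if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
            raise ValueError("integrity check failed")
        return json.loads(payload)

    def rollback(self):
        """Drop the newest version (if an older one exists) and load the previous one."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.load()


store = ParamStore()
store.commit({"tone": 0.2})
store.commit({"tone": 0.9})
restored = store.rollback()
```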

[0063] This embodiment utilizes deep learning technology to achieve intelligent generation of interactive content. Based on optimized strategy parameters, the system selects and combines suitable materials from the interactive content library. For voice content, natural voice prompts are generated using emotion-based speech synthesis technology; for visual content, the display position and method are dynamically adjusted according to the driving scenario. The system employs a message queue service to ensure real-time content delivery and protects the security of interactive content through data compression and encryption.

[0064] This embodiment's innovative design solves the feedback evaluation and strategy optimization problems in traditional methods, establishing a continuously evolving interactive system. Through real-time feedback analysis and online model optimization, the system can continuously improve the effectiveness of interaction strategies, providing drivers with more intelligent and human-centered emotional support. This deep learning-based optimization mechanism ensures the system maintains good adaptability when facing different drivers and scenarios. In particular, through multi-dimensional feedback analysis, the personalization level of the interaction strategy and user acceptance are significantly improved.

[0065] This embodiment achieves continuous evolution of the interactive system by establishing a complete evaluation-optimization-update closed loop. The system can dynamically adjust strategy parameters based on actual interaction effects, avoiding the limitations of traditional fixed strategies. Through real-time feedback and rapid optimization, the accuracy and timeliness of emotional interaction are significantly improved, providing strong emotional support for driving safety. This adaptive interaction mechanism demonstrates strong scenario adaptability and user satisfaction in practical applications.

[0066] As described above, the in-vehicle emotion interaction method based on multi-dimensional recognition provided in this application can accurately judge the driver's emotions by innovatively constructing an emotion fusion recognition model and integrating facial expressions, voice emotions, driving behavior, and physiological state features. It designs a scenario-based adaptive interaction strategy, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively solves the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

[0067] In one embodiment of the in-vehicle emotion interaction method based on multi-dimensional recognition in this application, it may further include the following:

[0068] Step S201: Collect camera video data and microphone voice data through the sensor interface of the vehicle terminal, extract frame sequences from the camera video data, locate the face region in each frame image using a face detection algorithm, extract the coordinates of 68 facial feature points based on the facial key point localization algorithm, construct a facial geometric feature descriptor, input the facial geometric feature descriptor into a pre-trained expression classification neural network to extract facial expression features, perform frame segmentation and pre-emphasis processing on the microphone voice data, calculate the Mel frequency cepstral coefficient and pitch period, and construct a voice emotion feature vector by combining acoustic features such as sound intensity and pitch.

[0069] Step S202: Obtain steering wheel angle data, vehicle speed data, and braking depth data from the vehicle sensor data interface; collect electrocardiogram (ECG) data, skin conductance data, and respiratory rate data from physiological sensors; normalize the steering wheel angle data, vehicle speed data, and braking depth data and generate driving behavior feature vectors according to time windows; perform noise filtering and signal smoothing on the ECG data, skin conductance data, and respiratory rate data; extract heart rate variability index and skin conductance level index; and combine the heart rate variability index and skin conductance level index to construct a physiological state feature vector.

[0070] Optionally, in view of the problems existing in traditional driver state monitoring, such as incomplete feature extraction, unstable data quality, and insufficient physiological index analysis, this embodiment innovatively designs a state recognition scheme based on multi-dimensional perception. First, this embodiment constructs a multi-modal data acquisition network through the sensor system of the vehicle-mounted terminal, including a high-resolution infrared camera, a microphone array, an in-vehicle CAN bus, and wearable physiological sensors. The system adopts a synchronous acquisition mechanism to achieve the temporal consistency of multi-source data through timestamp alignment: Time_Align = max(|Ti - Tref|) < Threshold, where Ti is the acquisition time of each data source, Tref is the reference time benchmark, and Threshold is the synchronous tolerance threshold, ensuring the temporal correlation of feature extraction.
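The synchronization condition max(|Ti - Tref|) < Threshold can be checked directly, as in this minimal sketch. The function name and the use of seconds as the time unit are assumptions.

```python
def is_synchronized(timestamps, t_ref, threshold):
    """True when every source lies within `threshold` of the reference time."""
    return max(abs(t - t_ref) for t in timestamps) < threshold


# Camera, microphone, and CAN bus samples around a 10.000 s reference.
aligned = is_synchronized([10.001, 9.998, 10.004], t_ref=10.0, threshold=0.005)
drifted = is_synchronized([10.001, 10.020], t_ref=10.0, threshold=0.005)
```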

[0071] This embodiment deeply optimizes the visual feature extraction mechanism. For the video stream collected by the camera, an adaptive frame rate control strategy is adopted to avoid redundant calculations while ensuring real-time performance. The system uses an improved MTCNN algorithm for face detection, which improves the robustness of detection through a multi-scale pyramid structure. Particular attention is paid to the pose changes and lighting changes in the driving scenario, and the detection effect is improved through pose estimation and lighting compensation. In the face key point localization link, a cascade regression method is used to extract the coordinates of 68 feature points, including the eyebrows, eyes, nose, mouth and other key facial regions. A geometric feature descriptor is constructed based on the feature point coordinates, including micro-expression features such as the distance between the eyebrows, the curvature of the mouth corners, and the opening and closing degree of the eyelids. These features have a direct physiological association with the emotional state.

[0072] This embodiment innovatively implements a speech feature analysis strategy. The speech signal collected by the microphone is preprocessed, including frame segmentation, pre-emphasis, and endpoint detection. The system uses an improved MFCC algorithm to extract acoustic features, better simulating the auditory characteristics of the human ear through the Mel frequency scale. The fundamental period is calculated to reflect the pitch change of the speaker, and combined with time-domain features such as short-time energy and zero-crossing rate, a complete acoustic feature set is constructed. Particular attention is paid to the prosodic features of emotional speech, such as speech rate, pause, stress, etc., which often carry rich emotional information. The system maps the acoustic features to the emotional feature space through a deep neural network to generate a feature vector reflecting the emotional state of the speaker.
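The preprocessing and time-domain features named above can be sketched in a few lines. The pre-emphasis filter y[n] = x[n] - alpha * x[n-1] with alpha = 0.97 is a common choice and is assumed here, since the source gives no coefficient. The function names and frame sizes are also illustrative. MFCC extraction and endpoint detection are omitted.

```python
def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - alpha * signal[i - 1]
                          for i in range(1, len(signal))]


def split_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


frames = split_frames(list(range(10)), frame_len=4, hop=2)
```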

[0073] This embodiment deeply optimizes the driving behavior analysis mechanism. It acquires driving operation data such as steering wheel angle, vehicle speed, and braking depth in real time via the vehicle's CAN bus. The system uses a sliding time window technique to process time-series data, with the window length dynamically adjusted according to the time scale of the behavioral characteristics. The raw data is normalized: Normalized_Value = (Value - Min) / (Max - Min), where Value is the raw data, and Min and Max are the statistical ranges of historical data. Driving behavior characteristics are extracted through time window analysis, including steering smoothness, speed stability, and braking characteristics. These behavioral characteristics reflect the driver's operating habits and current state, providing important evidence for emotion recognition.
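A minimal sketch of the normalization and windowing follows. It uses fixed, non-overlapping windows and summarizes each window by its mean and standard deviation. Treating the standard deviation as a proxy for steering smoothness or speed stability is an assumption made here, and the source's dynamically adjusted window length is not modeled.

```python
from statistics import mean, pstdev


def min_max(value, lo, hi):
    """Normalized_Value = (Value - Min) / (Max - Min)."""
    return (value - lo) / (hi - lo)


def window_features(samples, lo, hi, window):
    """Return (mean, standard deviation) of normalized samples per window."""
    features = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = [min_max(v, lo, hi) for v in samples[start:start + window]]
        features.append((mean(chunk), pstdev(chunk)))
    return features


# Vehicle speed in km/h with an assumed historical range of 0 to 100.
speed_features = window_features([0, 50, 100, 100], lo=0, hi=100, window=2)
```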

[0074] This embodiment innovatively designs a physiological signal processing scheme. It collects physiological signals such as electrocardiogram (ECG), skin conductance, and respiration using wearable devices. The system employs wavelet transform for signal denoising and uses an adaptive thresholding method to identify and remove motion artifacts. QRS complex detection is performed on the ECG signals, and heart rate variability indices are calculated, including SDNN (standard deviation of normal-to-normal RR intervals) and LF/HF (low-frequency to high-frequency power ratio), which reflect the activity state of the autonomic nervous system. Combined with dynamic changes in skin conductance levels, the driver's stress level and emotional activation degree are assessed. The system fuses physiological features through a multilayer perceptron to generate a feature vector reflecting the driver's physiological state.
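The time-domain part of the heart rate variability calculation can be sketched as below, assuming R-peak times have already been produced by a QRS detector that is not shown. SDNN is computed here as the sample standard deviation of the RR intervals in milliseconds. The LF/HF ratio needs a spectral estimate of the RR series and is left out of this sketch.

```python
from statistics import stdev


def rr_intervals(r_peak_times_s):
    """Convert successive R-peak times (seconds) to RR intervals (milliseconds)."""
    return [(later - earlier) * 1000.0
            for earlier, later in zip(r_peak_times_s, r_peak_times_s[1:])]


def sdnn(rr_ms):
    """Standard deviation of the RR intervals, in milliseconds."""
    return stdev(rr_ms)


steady_rr = rr_intervals([0.0, 0.8, 1.6, 2.4])  # perfectly regular rhythm
variable_sdnn = sdnn([800.0, 900.0])
```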

[0075] This embodiment achieves comprehensive perception of the driver's state through multi-dimensional feature fusion technology. It demonstrates particularly strong feature extraction capabilities and state recognition performance when handling complex driving scenarios. Through the application of deep learning models, the system can accurately capture the driver's behavioral characteristics and physiological state, providing a reliable data foundation for emotion recognition.

[0076] This embodiment's innovative design not only solves the feature extraction and data quality problems of traditional methods but also establishes a continuously optimizable state monitoring framework. Through continuous optimization of feature extraction strategies and improvements in signal processing methods, the system can continuously enhance its ability to perceive the driver's state, providing strong support for intelligent interaction. This monitoring mechanism based on multi-dimensional perception ensures that the system maintains efficient feature extraction capabilities and reliable state recognition effects when facing complex driving scenarios. In particular, the in-depth analysis of physiological characteristics significantly improves the accuracy and reliability of state recognition.

[0077] In one embodiment of the in-vehicle emotion interaction method based on multi-dimensional recognition in this application, it may further include the following:

[0078] Step S301: Construct a long short-term memory neural network structure. The input layer dimension corresponds to the feature vector dimension. The hidden layer contains multiple memory units. Each memory unit consists of an input gate, a forget gate, and an output gate. The input gate controls the importance of the new input information at the current moment. The forget gate controls the forgetting ratio of historical information. The output gate controls the output degree of information. The labeled historical data is input into the long short-term memory neural network for training to obtain a temporal feature extraction model. The facial expression features, the voice emotion feature vector, the driving behavior feature vector, and the physiological state feature vector are input into the temporal feature extraction model according to the time sequence to generate trend features reflecting the driver's emotions over time.

[0079] Step S302: Construct a multimodal feature fusion network based on attention mechanism as an emotion fusion recognition model. Calculate the attention weights of each time step in the trend features. Sum the features according to the attention weights to obtain the fused feature representation. Input the fused feature representation into a fully connected layer. Output the probability distribution of the emotion state label through a softmax classifier. Select the category with the highest probability as the final emotion state label. Map the fused feature representation to between 0 and 1 using the sigmoid function to obtain the emotion intensity value.

[0080] Optionally, this embodiment addresses the problems of insufficient temporal dependency capture, inadequate multimodal feature fusion, and inaccurate emotion intensity quantification in traditional emotion recognition by innovatively designing a deep learning-based emotion recognition scheme. This embodiment first constructs a multi-layer LSTM network structure, designing corresponding input layers for features of different modalities. Each LSTM unit includes an input gate, a forget gate, and an output gate, controlling the information flow through a gating mechanism: C_t = f_t * C_{t-1} + i_t * C̃_t, where C_t is the updated cell state, f_t is the forget gate output, i_t is the input gate output, C_{t-1} is the cell state at the previous time step, and C̃_t is the current candidate state. This design enables the network to effectively handle long temporal dependencies and capture the gradual change process of emotional states.
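To make the gating formula concrete, the sketch below performs one LSTM step for a single scalar unit using the standard LSTM equations. The parameter layout (one weight for the input, one for the previous hidden state, and one bias per gate) and the gate names are illustrative. A practical network would use vectors and matrices from a deep learning library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f_t = sigmoid(pre_activation("forget"))      # share of old state kept
    i_t = sigmoid(pre_activation("input"))       # share of candidate admitted
    o_t = sigmoid(pre_activation("output"))      # share of state exposed
    c_candidate = math.tanh(pre_activation("candidate"))
    c_t = f_t * c_prev + i_t * c_candidate       # cell state update
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


zero_params = {gate: (0.0, 0.0, 0.0)
               for gate in ("forget", "input", "output", "candidate")}
h_example, c_example = lstm_step(1.0, 0.0, 2.0, zero_params)
```

With all parameters at zero, every gate outputs 0.5 and the candidate is 0, so the example halves the previous cell state.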

[0081] This embodiment deeply optimizes the training mechanism of the LSTM network. A hierarchical memory unit structure is designed to address the characteristics of driving scenarios. The bottom-level units primarily handle rapidly changing facial expressions and speech features, employing shorter time dependencies; the middle-level units focus on changing patterns in driving behavior, with relatively longer time windows; and the top-level units handle slowly changing physiological indicators, possessing the longest memory period. This hierarchical design aligns with the physiological characteristics of human emotional changes; for example, facial expressions can change instantaneously, while physiological indicators such as heart rate have a longer adjustment period. The system optimizes network parameters using the backpropagation algorithm, employs gradient clipping to prevent gradient explosion, and uses the Dropout mechanism to avoid overfitting.

[0082] This embodiment innovatively achieves temporal fusion of multimodal features. The system designs parallel LSTM branch networks, with each branch independently processing the feature sequence of one modality. The feature weights at different time steps are dynamically adjusted through an attention mechanism: Attention_Weight = softmax(tanh(W_h * h_t + b_h)), where h_t is the hidden state at time step t, W_h and b_h are learnable parameters, and the softmax is taken over the time steps. This design can adaptively focus on important time periods, such as giving higher attention weights to moments of abrupt emotional change. The outputs of each branch network are combined through a feature fusion layer to generate trend features reflecting dynamic changes in emotion.
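The attention weighting can be sketched as follows for a short sequence of hidden states. Here the learnable weight is assumed to be a vector `w` that reduces each hidden state to one scalar score, with the softmax taken across time steps. This is one common reading of the formula and not necessarily the exact form intended in the source.

```python
import math


def attention_pool(hidden_states, w, b):
    """Weight each time step by softmax(tanh(w . h_t + b)) and sum the states."""
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)
              for h in hidden_states]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    fused = [sum(a * h[d] for a, h in zip(weights, hidden_states))
             for d in range(dim)]
    return weights, fused


# With zero parameters every step scores the same, so pooling is a plain mean.
uniform_weights, mean_state = attention_pool([[1.0, 0.0], [3.0, 2.0]],
                                             w=[0.0, 0.0], b=0.0)
```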

[0083] This embodiment deeply optimizes the design of the attention mechanism. A multi-head attention structure is constructed, with each attention head independently calculating the correlations across different feature dimensions. The system calculates the similarity between features using a scaled dot product attention mechanism and introduces positional encoding to preserve temporal information. Particular attention is paid to the interaction relationships between modalities, such as the collaborative patterns between facial expression changes and speech features, and the correlation between abnormal driving behavior and physiological indicators. This multi-dimensional attention mechanism can comprehensively capture all aspects of emotional expression, improving recognition accuracy.

[0084] This embodiment innovatively designs an emotion state classifier. A multilayer perceptron is used to construct the classification network, with the input layer receiving the fused feature representations. Higher-order features are extracted using the ReLU activation function, and the final layer uses the softmax function to output the probability distribution of emotion categories. The system classifies emotion states into several basic categories, including calm, joy, anxiety, and anger. Probability threshold filtering ensures the reliability of the classification results. Simultaneously, a sigmoid function is used to map features to the 0-1 interval, quantifying the intensity of the emotion and providing accurate judgment criteria for subsequent interaction strategies.
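The output stage described above can be illustrated with the sketch below, which takes class logits and a single intensity logit as given. The minimum probability of 0.5 for accepting a label, the return value of `None` for a rejected classification, and the use of a separate scalar logit for intensity are assumptions, since the source does not specify them.

```python
import math

LABELS = ("calm", "joy", "anxiety", "anger")


def classify_emotion(logits, intensity_logit, min_prob=0.5):
    """Return (label or None, intensity in the open interval 0 to 1)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    # Probability threshold filtering: reject low-confidence predictions.
    label = LABELS[best] if probs[best] >= min_prob else None
    intensity = 1.0 / (1.0 + math.exp(-intensity_logit))
    return label, intensity


confident = classify_emotion([0.0, 0.0, 3.0, 0.0], intensity_logit=0.0)
uncertain = classify_emotion([0.0, 0.0, 0.0, 0.0], intensity_logit=2.0)
```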

[0085] This embodiment achieves accurate recognition of driver emotions through deep learning technology. It demonstrates particularly strong temporal modeling capabilities and state recognition performance when handling complex emotional changes. By combining an LSTM network and an attention mechanism, the system can accurately capture the dynamic characteristics of emotional changes, providing a reliable perceptual foundation for intelligent interaction.

[0086] This embodiment's innovative design not only solves the temporal dependency and feature fusion problems of traditional methods but also establishes a continuously optimizing emotion recognition framework. Through continuous optimization of the deep learning model and improvements to the attention mechanism, the system can continuously enhance its understanding of emotional states, providing strong support for intelligent interaction. This recognition mechanism based on multimodal analysis ensures that the system maintains high recognition efficiency and reliable prediction performance even in complex driving scenarios. In particular, the precise quantification of emotion intensity significantly improves the targeting and adaptability of interaction strategies.

[0087] In one embodiment of the in-vehicle emotion interaction method based on multi-dimensional recognition in this application, it may further include the following:

[0088] Step S401: Collect rain and snow weather data, visibility data, and temperature and humidity data through the environmental perception sensor of the vehicle terminal; obtain real-time road condition data, traffic flow data, and road surface condition data from the vehicle communication module; perform feature extraction and numerical processing on the rain and snow weather data, visibility data, and temperature and humidity data to obtain weather condition features; perform feature extraction and numerical processing on the real-time road condition data, traffic flow data, and road surface condition data to obtain road condition features; and normalize the weather condition features and the road condition features.

[0089] Step S402: Construct a multilayer perceptron neural network as a scene recognition model. The network input layer receives the weather condition features and the road condition features. The hidden layer uses the ReLU activation function, and the output layer uses the softmax function to output the probability distribution of the scene danger level. The scene recognition model is trained based on historical labeled data. The scene danger level is mapped to a preset interval to obtain the interaction trigger benchmark value. The interaction trigger threshold under different scenarios is set according to the interaction trigger benchmark value.

[0090] Optionally, this embodiment addresses the problems of incomplete feature extraction, inaccurate hazard level assessment, and inflexible triggering mechanisms in traditional vehicle-mounted environmental perception by innovatively designing a scene recognition scheme based on deep learning. This embodiment first constructs a multi-source environmental perception network, collecting comprehensive environmental data through an onboard sensor system. The system integrates environmental information from different sources using a data synchronization mechanism and processes continuously changing environmental parameters using time window analysis technology: Environment_Score = w1 * Weather + w2 * Traffic + w3 * Road, where Weather, Traffic, and Road represent the scoring indicators for weather, traffic, and road conditions, respectively, and w1, w2, and w3 are dynamically adjusted weighting coefficients, achieving a comprehensive assessment of environmental risks.

[0091] This embodiment deeply optimizes the weather feature extraction mechanism. Differentiated processing strategies are designed for different types of weather data. The system measures visibility levels using optical sensors and assesses visibility conditions using image processing technology. For rainy or snowy weather, a comprehensive index is constructed using precipitation intensity sensors and temperature and humidity sensors to assess the risk of slippery roads. Special attention is paid to the combined effects of temperature and humidity; for example, low temperatures and high humidity easily lead to road icing, while high temperatures and high humidity affect the driver's physiological state. The system uses fuzzy inference methods to map multi-dimensional weather features to a risk space, establishing a correlation model between weather conditions and driving safety.

[0092] This embodiment innovatively implements a road condition analysis strategy. Real-time traffic situation information, including road congestion index, traffic density, and average speed, is acquired through the vehicle communication module. The system employs spatiotemporal data mining methods to analyze traffic flow characteristics and identify potential congestion risks and emergencies. For road surface conditions, multi-sensor fusion technology is used to assess physical characteristics such as road surface adhesion coefficient and smoothness. Special attention is paid to abrupt changes in road conditions, such as accidents ahead or road construction. These road condition characteristics directly affect driving safety and require timely triggering of corresponding interactive prompts.

[0093] This embodiment deeply optimizes the feature normalization mechanism. An adaptive normalization strategy is designed for environmental features of different dimensions. The system adopts a dynamic range adjustment method: Normalized_Value = (Value - Min) / (Max - Min), where Value is the original feature value, and Min and Max are the dynamic boundary values of the feature. The normalization parameters are updated using a sliding window technique to ensure the stability of the feature distribution. For outliers, a robust normalization method is used to avoid data distortion and improve the reliability of feature representation.
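A minimal sketch of the sliding-window min-max normalization described above. The class name `SlidingMinMax` and the window length are hypothetical, and the robust outlier handling mentioned in the text is not modeled here.

```python
from collections import deque


class SlidingMinMax:
    """Min-max normalization whose Min and Max track a sliding window."""

    def __init__(self, window=100):
        # Only the most recent `window` values define the dynamic range.
        self.values = deque(maxlen=window)

    def normalize(self, value):
        self.values.append(value)
        lo, hi = min(self.values), max(self.values)
        if hi == lo:
            return 0.0  # degenerate range: avoid division by zero
        return (value - lo) / (hi - lo)
```

Because the current value is added to the window before the range is computed, the result always falls in [0, 1].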

[0094] This embodiment innovatively designs a scene recognition model. A deep neural network is constructed using a multilayer perceptron, with the input layer dimension corresponding to the normalized environmental feature dimension. The hidden layers extract nonlinear combinations of features through the ReLU activation function, capturing the complex relationships between environmental factors. The system pays particular attention to the interactive effects of features, such as the combined risks of rain and congestion, and the coupled impact of visibility and traffic density. Through multilayer mapping of the deep network, accurate conversion of environmental features to hazard levels is achieved.
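The forward pass of such a multilayer perceptron can be sketched in plain Python as follows. The layer sizes (6 input features, 8 hidden units, 3 danger levels) and the random, untrained weights are assumptions made purely for illustration; the source does not specify them, and training on historical labeled data is not shown.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]


def dense(x, weights, bias):
    # weights holds one row of input coefficients per output unit
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def scene_danger_probs(features, layers):
    """ReLU hidden layers followed by a softmax output layer."""
    h = features
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(h, weights, bias))


rng = random.Random(0)
layers = [init_layer(6, 8, rng), init_layer(8, 3, rng)]  # untrained
probs = scene_danger_probs([0.2, 0.9, 0.1, 0.7, 0.5, 0.3], layers)
```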

[0095] This embodiment innovatively implements a trigger threshold generation strategy. Based on the hazard level probability distribution output by the scene recognition model, the system designs an adaptive threshold mapping mechanism. The hazard level is mapped to a preset trigger interval: Threshold = Base_Value * Risk_Factor, where Base_Value is the baseline threshold and Risk_Factor is the adjustment factor corresponding to the hazard level. Through this dynamic threshold mechanism, the system can adjust the sensitivity of the interaction according to the level of scene risk, lowering the trigger threshold in high-risk scenarios to provide early warnings, and raising the threshold in low-risk scenarios to avoid excessive disturbance.
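A small sketch of the threshold mapping Threshold = Base_Value * Risk_Factor. The three danger levels, the factor table, and the base value are hypothetical; the sketch also assumes the most probable level from the softmax output is used, which the source does not state explicitly.

```python
# Hypothetical factors: below 1 lowers the threshold (earlier warnings),
# above 1 raises it (fewer prompts in low-risk scenes).
RISK_FACTORS = {"low": 1.2, "medium": 1.0, "high": 0.7}


def trigger_threshold(level_probs, base_value=0.6):
    """Return the most probable danger level and its trigger threshold."""
    level = max(level_probs, key=level_probs.get)
    return level, base_value * RISK_FACTORS[level]


level, threshold = trigger_threshold({"low": 0.1, "medium": 0.2, "high": 0.7})
```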

[0096] This embodiment achieves accurate recognition of driving scenarios through deep learning technology. It demonstrates particularly strong feature extraction capabilities and risk assessment performance when handling complex environmental conditions. Through the application of a multilayer perceptron model, the system can accurately understand the correlations between environmental features, providing reliable scenario support for emotional interaction.

[0097] This embodiment's innovative design not only solves the feature extraction and risk assessment problems of traditional methods but also establishes a continuously optimizing scene recognition framework. Through continuous optimization of feature processing strategies and improvement of the recognition model, the system can continuously enhance its understanding of the driving environment, providing strong support for intelligent interaction. This deep learning-based recognition mechanism ensures that the system maintains efficient analytical capabilities and reliable prediction results when facing complex driving scenarios. In particular, the introduction of a dynamic threshold mechanism significantly improves the accuracy and adaptability of interaction triggering.

[0098] In one embodiment of the in-vehicle emotion interaction method based on multi-dimensional recognition in this application, it may further include the following:

[0099] Step S501: Construct an emotion interaction strategy model based on deep reinforcement learning. Combine the emotion state label, the emotion intensity value, and the interaction trigger threshold to construct a state vector. Use the interaction tone, interaction content type, and interaction timing as the action space. Design an interaction effect scoring function as the reward signal. Use a deep Q-network to train the emotion interaction strategy model. During the training process, select interaction actions based on the exploration-exploitation strategy. Decode the selected interaction actions into interaction strategy parameters. The interaction strategy parameters include speech synthesis parameters, content matching parameters, and push timing parameters.

[0100] Step S502: Select the corresponding voice clone model from the voice synthesis model library according to the voice synthesis parameters, calculate the similarity score of the materials in the interactive content library based on the content matching parameters, select the audio, text and image with the highest similarity score as interactive materials, determine the best interaction time according to the push timing parameters, and apply the voice clone model to the text content in the interactive materials to generate personalized voice broadcast content.

[0101] Optionally, this embodiment addresses the problems of mechanical strategy selection, inaccurate timing of interaction, and insufficient personalization in traditional in-vehicle interaction systems by innovatively designing an intelligent interaction scheme based on deep reinforcement learning. This embodiment first constructs a complete state space representation, fusing the driver's emotional state label, emotional intensity value, and scene trigger threshold: State_Vector = Concat(Emotion_Label, Intensity, Threshold), where each component represents a discrete emotion category, a continuous intensity value, and a dynamic trigger threshold, respectively. This multi-dimensional state representation ensures that the model can comprehensively perceive the current interaction environment.

[0102] This embodiment deeply optimizes the action space design. Targeting the characteristics of in-vehicle interaction, it constructs multi-dimensional action representations, including interaction tone (such as soothing, reminders, warnings, etc.), content type (such as audio, text, images, etc.), and trigger timing. The system maps discrete action selections to a continuous parameter space through an action encoding mechanism: Action_Params = Decoder(Action_Choice), where Action_Choice is the selected action combination, and Action_Params is the decoded policy parameter vector. This design enables the model to generate fine-grained interaction strategies, adapting to the needs of different scenarios. For example, when the driver experiences mild anxiety, a gentle tone and soothing music might be chosen; while when strong anger is detected, a more direct warning tone is used.
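The state vector and the discrete action space described in the two preceding paragraphs might be represented as below. The emotion categories and timing options are hypothetical placeholders, and the lookup-table decoder is a simplification of the Decoder in the text, which the source describes as mapping into a continuous parameter space.

```python
from itertools import product

EMOTIONS = ["calm", "anxious", "angry", "tired"]  # hypothetical labels
TONES = ["soothing", "reminder", "warning"]
CONTENT_TYPES = ["audio", "text", "image"]
TIMINGS = ["immediate", "delayed"]  # hypothetical timing options

# Every (tone, content type, timing) combination is one discrete action.
ACTIONS = list(product(TONES, CONTENT_TYPES, TIMINGS))


def build_state(emotion, intensity, threshold):
    """State_Vector = Concat(Emotion_Label, Intensity, Threshold)."""
    one_hot = [1.0 if e == emotion else 0.0 for e in EMOTIONS]
    return one_hot + [intensity, threshold]


def decode_action(index):
    """Map a Q-network action index to named strategy parameters."""
    tone, content_type, timing = ACTIONS[index]
    return {"tone": tone, "content_type": content_type, "timing": timing}


state = build_state("angry", 0.8, 0.42)
```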

[0103] This embodiment innovatively implements a reward function design. A multi-objective scoring mechanism is constructed, comprehensively considering the timeliness, appropriateness, and effectiveness of the interaction. The system combines different evaluation indicators with weights: Reward = w1 * Timing + w2 * Appropriateness + w3 * Effectiveness, where the components represent the timing, content suitability, and emotional improvement, respectively. Special attention is paid to the impact of the interaction strategy on driving safety, incorporating a safety weight factor into the scoring. Through this multi-dimensional reward design, the model is guided to learn the optimal interaction strategy.
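One possible reading of this reward, sketched below, applies the safety weight factor as a multiplier on the weighted sum. The multiplicative form, the default weights, and the [0, 1] scaling of each component are assumptions; the source does not specify how the safety factor enters the score.

```python
def interaction_reward(timing, appropriateness, effectiveness,
                       safety_factor=1.0, weights=(0.3, 0.3, 0.4)):
    """Reward = w1 * Timing + w2 * Appropriateness + w3 * Effectiveness.

    safety_factor scales the whole reward (an assumed form); a value
    below 1 penalizes interactions judged to distract the driver.
    """
    w1, w2, w3 = weights
    base = w1 * timing + w2 * appropriateness + w3 * effectiveness
    return safety_factor * base
```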

[0104] This embodiment deeply optimizes the training mechanism of deep Q-networks. A dual-network structure is employed, including a policy network and a target network, and experience replay mitigates sample correlation. The system implements a prioritized experience replay mechanism, weighting samples based on the importance of interaction effects. During training, an ε-greedy policy balances exploration and exploitation: P(random) = max(ε_min, ε_0 * decay_rate), where ε_min is the minimum exploration probability, ε_0 is the initial exploration probability, and decay_rate is the decay rate. This training strategy ensures that the model can fully explore different interaction methods while gradually converging to the optimal policy.
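A hedged sketch of ε-greedy action selection with a decaying exploration probability. The formula in the text gives ε_0 * decay_rate without a step index; the sketch assumes the decay is applied once per training step (decay_rate raised to the power of the step count), which is a common convention but is not confirmed by the source. The default values are placeholders.

```python
import random


def epsilon(step, eps_0=1.0, eps_min=0.05, decay_rate=0.995):
    """Exploration probability, floored at eps_min.

    Applying the decay per step is an assumption, not the source's formula.
    """
    return max(eps_min, eps_0 * decay_rate ** step)


def select_action(q_values, step, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=q_values.__getitem__)
```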

[0105] This embodiment innovatively designs a strategy parameter decoding mechanism. It converts the action selection output by the deep Q-network into specific interaction parameters. Speech synthesis parameters include dimensions such as timbre features, speech rate, and intonation; content matching parameters define the preference weights for material selection; and push timing parameters specify the optimal interaction time. The system uses a parameter decoder to transform abstract strategy decisions into executable control commands, guiding subsequent interaction execution.

[0106] This embodiment achieves personalized speech synthesis using deep learning technology. Based on a selected speech cloning model, the system can generate speech content with natural emotional nuances. Particular attention is paid to matching vocal features with emotional states; for example, a gentle and calm tone is used in soothing scenarios, while a clear and forceful tone is adopted in warning scenarios. Through content similarity calculation: Similarity = cosine(Content_Vector, Template_Vector), the most matching interactive material is selected to ensure the relevance and appropriateness of the content.
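The content similarity step can be illustrated with a plain cosine similarity over embedding vectors. How the content and template vectors are produced is not described in the source, so the two-dimensional vectors and material names below are made up.

```python
import math


def cosine(a, b):
    """Similarity = cosine(Content_Vector, Template_Vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_material(template_vector, library):
    """Return the name of the library entry most similar to the template."""
    return max(library, key=lambda name: cosine(library[name], template_vector))


library = {"calm_music": [0.9, 0.2], "alert_tone": [0.0, 1.0]}  # made-up vectors
choice = best_material([1.0, 0.1], library)
```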

[0107] This embodiment's innovative design not only solves the problems of strategy selection and timing in traditional methods, but also establishes a continuously optimizing interactive decision-making framework. Through continuous training and policy improvement using deep reinforcement learning, the system can continuously enhance its understanding of interactive scenarios, providing strong support for intelligent interaction. This learning-based decision-making mechanism ensures that the system maintains efficient strategy generation capabilities and reliable interaction effects when facing complex driving scenarios. In particular, the introduction of personalized voice synthesis significantly improves the naturalness and acceptability of the interaction.

[0108] This embodiment achieves an intelligent upgrade of the interactive system by establishing a complete closed loop of state perception, decision learning, and execution control. The system can dynamically adjust its interaction strategy based on real-time state, avoiding the limitations of traditional fixed rules. Through deep reinforcement learning and personalized synthesis, the accuracy and naturalness of emotional interaction are significantly improved, providing strong emotional support for driving safety. This intelligent interaction mechanism demonstrates strong scene adaptability and user satisfaction in practical applications.

[0109] In one embodiment of the in-vehicle emotion interaction method based on multi-dimensional recognition in this application, it may further include the following:

[0110] Step S601: Construct an interaction effect evaluation model based on behavior sequence analysis. Collect driver's click operation data, voice command data, and interruption behavior data on the interactive content. Perform time-series encoding on the click operation data, voice command data, and interruption behavior data to obtain an interaction behavior sequence. Input the interaction behavior sequence into a bidirectional gated recurrent unit network to extract interaction feedback features. Construct a scoring function based on the interaction feedback features to calculate the interaction satisfaction score. Combine the interaction scenario information with the interaction satisfaction score to form an evaluation sample pair.

[0111] Step S602: Construct an online learning module based on gradient boosting tree. Input the evaluation sample pairs into the online learning module in chronological order. Maintain the most recent sample set using a sliding window method. Calculate the importance score of each parameter in the emotion interaction strategy model based on the sample set. Adjust the parameter update step size according to the importance score. Update the network weights of the emotion interaction strategy model online using stochastic gradient descent. Save the updated network weights to the strategy model library.

[0112] Optionally, this embodiment addresses the problems of incomplete feedback evaluation, untimely parameter optimization, and unsatisfactory learning effects in traditional interactive systems by designing an adaptive optimization scheme based on behavior analysis. This embodiment first constructs a multi-dimensional interactive feedback collection mechanism, collecting user feedback data through the multimodal interface of the in-vehicle terminal. The system designs a behavior scoring function: Feedback_Score = w1 * Click + w2 * Voice + w3 * Interrupt, where Click, Voice, and Interrupt represent click responsiveness, voice cooperation, and interruption frequency, respectively, and w1, w2, and w3 are dynamically adjusted weighting coefficients. This multi-dimensional feedback collection mechanism ensures a comprehensive evaluation of the interaction effect.

[0113] This embodiment deeply optimizes the behavior sequence analysis mechanism. A unified temporal encoding framework is designed for different types of feedback behaviors. The system uses event encoding technology to convert discrete behaviors into continuous temporal features: Event_Vector = Encoder(Action_Type, Time_Stamp, Duration), where each parameter represents the behavior type, occurrence time, and duration, respectively. Special attention is paid to the continuity of behaviors; for example, continuous positive feedback indicates that the interaction strategy meets user expectations, while frequent interruptions suggest that the strategy needs adjustment. The encoded behavior sequence is processed through a bidirectional GRU network to capture long- and short-term dependencies.
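As a rough illustration of Event_Vector = Encoder(Action_Type, Time_Stamp, Duration), the sketch below uses a one-hot behavior type followed by the time since session start and the duration. This hand-written encoding is an assumption; the source does not describe the encoder's internals, and the bidirectional GRU that consumes the sequence is not shown.

```python
ACTION_TYPES = ["click", "voice", "interrupt"]


def encode_event(action_type, timestamp, duration, session_start):
    """One-hot behavior type + seconds since session start + duration."""
    one_hot = [1.0 if t == action_type else 0.0 for t in ACTION_TYPES]
    return one_hot + [timestamp - session_start, duration]


def encode_sequence(events, session_start):
    """Sort (type, timestamp, duration) tuples by time and encode each."""
    ordered = sorted(events, key=lambda e: e[1])
    return [encode_event(t, ts, d, session_start) for t, ts, d in ordered]


sequence = encode_sequence(
    [("interrupt", 12.0, 0.5), ("click", 10.0, 0.2)], session_start=10.0
)
```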

[0114] This embodiment innovatively implements an interactive feedback feature extraction strategy. A bidirectional gated recurrent unit network is constructed, and the temporal patterns of behavioral sequences are analyzed through forward and backward hidden layer states. The system pays particular attention to the contextual relationships of behaviors; for example, feedback patterns in similar scenarios often exhibit consistency. An attention mechanism dynamically adjusts the feature weights at different time steps to highlight the impact of key behaviors. This deep learning-based feature extraction method can accurately capture the user's true attitude towards the interaction strategy.

[0115] This embodiment deeply optimizes the satisfaction assessment mechanism. Based on extracted interactive feedback features, a multi-level scoring model is constructed. The system considers the influence of scene features and adopts differentiated scoring standards for different emotional states and risk levels. A fuzzy inference method is used to map multi-dimensional scores to a unified satisfaction space: Satisfaction = f(Feedback, Context), where Feedback is the feedback feature, Context is the scene feature, and f is a non-linear mapping function. This scene-aware assessment mechanism ensures the accuracy of satisfaction calculation.

[0116] This embodiment innovatively designs an online learning framework. A gradient boosting tree model is used to construct a parameter importance analyzer, identifying key parameters by ranking feature importance. The system maintains a dynamic sample window to ensure the learning process reflects the latest interaction effects. Special attention is paid to the timeliness of parameters, with more recent samples having higher learning weights. The model parameters are continuously optimized through online learning methods to adapt to dynamic changes in user preferences.

[0117] This embodiment achieves adaptive parameter optimization through gradient boosting. The system dynamically adjusts the learning rate based on parameter importance: Learning_Rate = Base_Rate * Importance_Score, where Base_Rate is the base learning rate and Importance_Score is the parameter importance score. Stochastic gradient descent is used to update network weights, and batch training improves optimization efficiency. A parameter update verification mechanism is specifically designed to ensure the stability of the optimization process.
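The importance-scaled update can be sketched as a per-parameter SGD step. The importance scores are taken as given inputs here (in the text they come from the gradient boosting tree analyzer, which is not modeled), and the numeric values are placeholders.

```python
def scaled_sgd_step(weights, grads, importance, base_rate=0.01):
    """Apply w <- w - (Base_Rate * Importance_Score) * gradient per parameter."""
    return [w - base_rate * imp * g
            for w, g, imp in zip(weights, grads, importance)]


# A parameter with importance 0 is left unchanged; importance 1 gets the
# full base learning rate.
updated = scaled_sgd_step([1.0, 1.0], [2.0, 2.0], [1.0, 0.0])
```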

[0118] This embodiment's innovative design not only solves the feedback evaluation and parameter optimization problems in traditional methods but also establishes a continuously evolving learning framework. Through the combination of behavioral analysis and online learning, the system can accurately understand user needs and continuously optimize interaction strategies. This deep learning-based optimization mechanism ensures that the system maintains high learning efficiency and reliable optimization results even in complex scenarios.

[0119] This embodiment achieves continuous evolution of the interactive system by establishing a complete evaluation-optimization-update closed loop. The system can dynamically adjust strategy parameters based on actual interaction effects, avoiding the limitations of traditional fixed strategies. Through real-time feedback and rapid optimization, the accuracy and adaptability of emotional interaction are significantly improved, providing more reliable emotional support for driving safety. This adaptive learning mechanism demonstrates strong scenario adaptability and user satisfaction in practical applications.

[0120] In one embodiment of the in-vehicle emotion interaction method based on multi-dimensional recognition in this application, it may further include the following:

[0121] Step S701: Construct a version control-based strategy model update mechanism, package the optimized network weights and model structure information into a model snapshot file, calculate the integrity check code of the model snapshot file, write the model snapshot file and the integrity check code into the temporary storage area of the strategy model library, decompress and verify the model snapshot file, and after successful verification, transfer it to the formal storage area, update the version identifier of the strategy model, delete expired historical version files, and generate a model update log;

[0122] Step S702: Based on the interaction strategy parameters, the interactive materials are combined and processed, and the voice broadcast content, text prompt content, and image prompt content are encapsulated into a unified message format. The encapsulated message is compressed and encrypted, and the encrypted message is sent to the display control module and the audio control module through the message queue service of the vehicle terminal. The display control module displays the text and image content in a designated area of the vehicle terminal display screen, and the audio control module plays the voice content through the vehicle terminal speaker.

[0123] Optionally, this embodiment addresses the problems of unreliable model updates, non-standard content organization, and imprecise push control in traditional in-vehicle interaction systems by innovatively designing a version control-based model deployment and content distribution scheme. This embodiment first constructs a complete model version management framework, employing an incremental update strategy to record model changes. The system designs a model integrity verification mechanism: Checksum = Hash(Model_Weights + Structure_Info), where Model_Weights is the network weight, Structure_Info is the model structure information, and Hash is a secure hash function. This mechanism ensures the reliability of model updates and prevents model files from being corrupted during transmission and storage.

[0124] This embodiment deeply optimizes the model snapshot management mechanism. Differential compression technology is used to reduce storage overhead for model parameters of different versions. The system implements a two-stage version verification process: first, integrity verification is performed in a temporary storage area; only after successful verification is the model transferred to the official storage area. Special attention is paid to model rollback requirements, preserving complete snapshots of critical version nodes. A version chain management strategy is used: Version_ID = Base_Version + Increment_ID, where Base_Version is the base version number and Increment_ID is the incremental update identifier, achieving orderly management of model versions. This design ensures the traceability and reversibility of the model update process.
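The integrity check and the two-stage verification described in the two preceding paragraphs might look like the sketch below. SHA-256 is an assumed choice of secure hash, and the directory layout, file naming, and function names are hypothetical; differential compression and rollback are not modeled.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def snapshot_checksum(weights_bytes, structure_info):
    """Checksum = Hash(Model_Weights + Structure_Info), using SHA-256."""
    digest = hashlib.sha256()
    digest.update(weights_bytes)
    digest.update(json.dumps(structure_info, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def publish_snapshot(weights_bytes, structure_info, expected,
                     staging, formal, version_id):
    """Stage the snapshot, verify it, and promote it only on success."""
    staged = Path(staging) / f"{version_id}.bin"
    staged.write_bytes(weights_bytes)
    actual = snapshot_checksum(staged.read_bytes(), structure_info)
    if actual != expected:
        staged.unlink()  # discard the corrupted snapshot
        return False
    os.replace(staged, Path(formal) / f"{version_id}.bin")
    return True


with tempfile.TemporaryDirectory() as root:
    staging, formal = Path(root, "staging"), Path(root, "formal")
    staging.mkdir()
    formal.mkdir()
    weights = b"\x00\x01\x02\x03"              # stand-in for real weights
    structure = {"layers": [6, 8, 3]}          # stand-in structure info
    checksum = snapshot_checksum(weights, structure)
    ok = publish_snapshot(weights, structure, checksum,
                          staging, formal, "v1.0")
    rejected = publish_snapshot(weights, structure, "bad-checksum",
                                staging, formal, "v1.1")
    published = sorted(p.name for p in formal.iterdir())
```

Moving the file with `os.replace` only after verification keeps a failed update from ever appearing in the formal storage area.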

[0125] This embodiment innovatively implements an interactive content organization strategy. Based on interaction strategy parameters, the system adopts a modular design to organize multimodal interactive materials. A unified message encapsulation format is designed for different types of content, including a message header (version information, timestamp, priority, etc.) and a message body (multimodal content and its metadata). Special consideration is given to the unique characteristics of in-vehicle scenarios, such as reducing the complexity of visual content while driving at high speeds and displaying richer interactive information while waiting at a stop. The display strategy is dynamically adjusted through a content adaptation engine to ensure both interactive effectiveness and driving safety.

[0126] This embodiment deeply optimizes the content security transmission mechanism. It employs a multi-layered data protection strategy, including compression encoding and encryption. The system uses efficient data compression algorithms to reduce transmission load while ensuring data security through strong encryption. Special attention is paid to real-time requirements, and a lightweight encryption / decryption scheme is designed to reduce processing latency while maintaining security. Reliable content transmission is achieved through a message queue service, supporting message priority management and failure retransmission mechanisms.
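A minimal sketch of the unified message format (header plus body) with compression, using JSON and zlib as assumed encodings. The field names are hypothetical. Encryption is deliberately left out: it should come from a vetted cryptography library rather than an ad hoc scheme, and the source does not name an algorithm.

```python
import json
import time
import zlib


def pack_message(voice_text, prompt_text, image_ref, priority, version="1"):
    """Wrap multimodal content in a header/body envelope and compress it."""
    message = {
        "header": {
            "version": version,
            "timestamp": time.time(),
            "priority": priority,
        },
        "body": {"voice": voice_text, "text": prompt_text, "image": image_ref},
    }
    return zlib.compress(json.dumps(message).encode("utf-8"))


def unpack_message(payload):
    """Reverse of pack_message: decompress and parse the envelope."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


payload = pack_message("Please take a short break.", "Rest area in 2 km",
                       "icons/rest.png", priority=1)
restored = unpack_message(payload)
```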

[0127] This embodiment innovatively designs a display control scheme. An adaptive display layout system is constructed, dynamically adjusting the content layout based on the hardware characteristics of the in-vehicle terminal. The system employs a zoned display strategy, placing important information in positions easily accessible to the driver. An attention prediction model is used to assess the visual salience of the content, optimizing the display position of key information. Dynamic changes in driving scenarios are specifically considered, such as automatically adjusting display brightness when lighting conditions change and increasing audio volume in noisy environments.

[0128] This embodiment achieves synchronized audio and video control through multi-threading technology. The system employs a precise timing control mechanism to ensure consistency between voice announcements and visual cues. Different playback strategies are used for different types of interactive content. For example, emergency alerts are played interruptively, while routine information is played in a queued manner. A feedback monitoring mechanism evaluates playback performance in real time and supports dynamic adjustment of playback parameters.
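The difference between interruptive and queued playback can be approximated with a two-level priority queue, as sketched below. This models only the ordering of pending items; pre-empting audio that is already playing, and the multi-threaded synchronization mentioned in the text, are outside the sketch. The class name is hypothetical.

```python
import heapq
import itertools


class PlaybackQueue:
    """Emergency alerts jump ahead; routine items keep arrival order."""

    EMERGENCY, ROUTINE = 0, 1

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO order

    def push(self, content, emergency=False):
        priority = self.EMERGENCY if emergency else self.ROUTINE
        heapq.heappush(self._heap, (priority, next(self._counter), content))

    def pop(self):
        """Return the next item to play, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = PlaybackQueue()
queue.push("traffic update")
queue.push("music suggestion")
queue.push("collision warning", emergency=True)
order = [queue.pop(), queue.pop(), queue.pop(), queue.pop()]
```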

[0129] This embodiment's innovative design not only solves the model deployment and content distribution problems of traditional methods but also establishes a reliable interactive execution framework. Through version control and secure transmission mechanisms, the system can reliably update interaction strategies, providing solid technical support for intelligent interaction. This multi-layered protection-based deployment mechanism ensures that the system maintains stable operation even in complex in-vehicle environments.

[0130] This embodiment achieves reliable operation of the interactive system by establishing a complete control chain for model deployment, content distribution, and display. The system can safely and efficiently update interaction strategies and precisely control content display, providing drivers with a smooth and natural interactive experience. Through multiple protection mechanisms and intelligent control strategies, the reliability and user experience of the in-vehicle interactive system are significantly improved. This comprehensive control mechanism demonstrates strong environmental adaptability and operational stability in practical applications.

[0131] To effectively address the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, and to significantly improve the intelligence level and user experience of in-vehicle emotion interaction, this application provides an embodiment of an in-vehicle emotion interaction device based on multi-dimensional recognition, which implements all or part of the aforementioned in-vehicle emotion interaction method based on multi-dimensional recognition. Referring to Figure 2, the in-vehicle emotion interaction device based on multi-dimensional recognition specifically includes the following:

[0132] The data preprocessing module 10 is used to collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. It extracts facial expression features from the camera video data, voice emotion feature vectors from the voice data, driving behavior feature vectors from the vehicle sensor data, and physiological state feature vectors from the physiological sensor data. It constructs a temporal feature extraction model based on a long short-term memory neural network, inputs the facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors into the temporal feature extraction model, generates emotion change trend features, inputs the emotion change trend features into an emotion fusion recognition model, and the emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.

[0133] The emotion interaction module 20 is used to acquire external environmental data, extract weather and road condition features, construct a scene recognition model, output a scene danger level, set an interaction trigger threshold based on the scene danger level, train an emotion interaction strategy model, generate interaction strategy parameters based on the emotion state label, the emotion intensity value, and the interaction trigger threshold, and select corresponding speech models and interaction materials from the speech synthesis model library and the interaction content library based on the interaction strategy parameters.

[0134] The interaction evaluation module 30 is used to construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into the online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score, writes the optimized parameters into the strategy model library, generates interaction content based on the interaction strategy parameters, and sends it to the vehicle terminal.

[0135] As described above, the in-vehicle emotion interaction device based on multi-dimensional recognition provided in this application can accurately judge the driver's emotions by innovatively constructing an emotion fusion recognition model and integrating facial expressions, voice emotions, driving behavior, and physiological state features. It designs a scenario-based adaptive interaction strategy, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively solves the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

[0136] From a hardware perspective, in order to effectively address the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, and significantly improve the intelligence level and user experience of in-vehicle emotion interaction, this application provides an embodiment of an electronic device for implementing all or part of the in-vehicle emotion interaction method based on multi-dimensional recognition. The electronic device specifically includes the following components:

[0137] The system comprises a processor, memory, a communications interface, and a bus, wherein the processor, memory, and communications interface communicate with each other via the bus. The communications interface is used to realize information transmission between the multi-dimensional recognition-based in-vehicle emotion interaction device and core business systems, user terminals, related databases, and other related devices. The logic controller can be a desktop computer, tablet computer, mobile terminal, or the like, and this embodiment is not limited thereto. In this embodiment, the logic controller can be implemented with reference to the embodiments of the multi-dimensional recognition-based in-vehicle emotion interaction method and the multi-dimensional recognition-based in-vehicle emotion interaction device, the content of which is incorporated herein; repeated details are not described again.

[0138] It is understood that the user terminal may include smartphones, tablet computers, network set-top boxes, portable computers, desktop computers, personal digital assistants (PDAs), in-vehicle devices, smart wearable devices, etc. Among these, the smart wearable devices may include smart glasses, smartwatches, smart bracelets, etc.

[0139] In practical applications, parts of the in-vehicle emotion interaction method based on multi-dimensional recognition can be executed on the electronic device side as described above, or all operations can be completed in the client device. The choice can be made based on the processing power of the client device and the limitations of the user's usage scenario. This application does not impose any limitations on this. If all operations are completed in the client device, the client device may further include a processor.

[0140] The aforementioned client device may have a communication module (i.e., a communication unit) that can communicate with a remote server to achieve data transmission. The server may include a server on the task scheduling center side; in other implementation scenarios, it may also include a server on an intermediate platform, such as a server on a third-party server platform that has a communication link with the task scheduling center server. The server may include a single computer device, a server cluster consisting of multiple servers, or a distributed server structure.

[0141] Figure 3 is a schematic block diagram illustrating the system configuration of the electronic device 9600 according to an embodiment of this application. As shown in Figure 3, the electronic device 9600 may include a central processing unit 9100 and a memory 9140; the memory 9140 is coupled to the central processing unit 9100. It is worth noting that Figure 3 is an example; other types of structures can also be used to supplement or replace this structure to achieve telecommunications functions or other functions.

[0142] In one embodiment, the in-vehicle emotion interaction method based on multi-dimensional recognition can be integrated into the central processing unit 9100. The central processing unit 9100 can be configured to perform the following control:

[0143] Step S101: Collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. Extract facial expression features from the camera video data, extract voice emotion feature vectors from the voice data, extract driving behavior feature vectors from the vehicle sensor data, and extract physiological state feature vectors from the physiological sensor data. Construct a temporal feature extraction model based on a long short-term memory neural network. Input the facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors into the temporal feature extraction model to generate emotion change trend features. Input the emotion change trend features into an emotion fusion recognition model. The emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.
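The temporal feature extraction in step S101 can be illustrated with a minimal pure-Python sketch of a single long short-term memory step. This is only an illustration under stated assumptions: the parameter layout (`params`, one `(W, U, b)` triple per gate), the helper `zero_params`, and the idea that each input frame is the concatenation of the four per-modality feature vectors are hypothetical and are not specified in this application.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def zero_params(hidden_size, input_size):
    """Hypothetical helper: all-zero parameters, one (W, U, b) triple per gate."""
    def gate():
        W = [[0.0] * input_size for _ in range(hidden_size)]
        U = [[0.0] * hidden_size for _ in range(hidden_size)]
        b = [0.0] * hidden_size
        return W, U, b
    return {name: gate() for name in ("input", "forget", "output", "candidate")}


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step on plain lists of floats."""
    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[j], x))
            + sum(u * hi for u, hi in zip(U[j], h_prev))
            + b[j]
            for j in range(len(b))
        ]

    i = [sigmoid(v) for v in affine("input")]        # weight given to new input
    f = [sigmoid(v) for v in affine("forget")]       # share of old cell state kept
    o = [sigmoid(v) for v in affine("output")]       # share of cell state exposed
    g = [math.tanh(v) for v in affine("candidate")]  # candidate cell content
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def run_sequence(frames, hidden_size, params):
    """Feed time-ordered feature frames through the cell; return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in frames:
        h, c = lstm_step(x, h, c, params)
    return h
```

In practice the parameters would be learned from labeled historical data; the zero-initialized parameters here only make the sketch self-contained.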

[0144] Step S102: Acquire external environment data and extract weather and road condition features. Construct a scene recognition model, which outputs a scene danger level, and set an interaction trigger threshold based on the scene danger level. Train an emotion interaction strategy model, which generates interaction strategy parameters based on the emotion state label, emotion intensity value, and interaction trigger threshold. Based on the interaction strategy parameters, select the corresponding speech model and interaction material from the speech synthesis model library and the interaction content library.
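The relationship in step S102 between the scene danger level and the interaction trigger threshold can be sketched as follows. The sketch assumes a linear mapping onto a preset interval and assumes that a more dangerous scene raises the threshold so that the driver is interrupted less often; the interval bounds `low` and `high`, the direction of the mapping, and all function names are hypothetical and are not stated in this application.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hazard_level(probabilities):
    """Index of the most probable danger level (0 = least dangerous)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def trigger_threshold(level, num_levels, low=0.3, high=0.9):
    """Map a danger level linearly onto [low, high] (assumed mapping)."""
    if num_levels < 2:
        return low
    return low + (high - low) * level / (num_levels - 1)


def should_interact(emotion_intensity, threshold):
    """Trigger an interaction only when the emotion intensity reaches the threshold."""
    return emotion_intensity >= threshold
```

For example, with three danger levels, scene recognition outputs of `[0.1, 2.0, 0.5]` select the middle level, and an emotion intensity of 0.7 would trigger an interaction while 0.4 would not.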

[0145] Step S103: Construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score. The optimized parameters are written into the strategy model library. Interaction content is generated based on the interaction strategy parameters and sent to the vehicle terminal.
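As a rough illustration of step S103, the sketch below turns a list of driver feedback events into a satisfaction score in the range 0 to 1 and nudges a stored strategy value toward that score. It deliberately substitutes a plain weighted sum for a learned evaluation model; the event names, the weights in `FEEDBACK_WEIGHTS`, and the update rule are assumptions made only for illustration.

```python
import math

# Hypothetical weights: engagement raises the score, interruptions lower it.
FEEDBACK_WEIGHTS = {"click": 1.0, "voice_command": 1.5, "interrupt": -2.0}


def satisfaction_score(events, weights=FEEDBACK_WEIGHTS):
    """Squash the weighted sum of feedback events into (0, 1) with a sigmoid."""
    raw = sum(weights.get(event, 0.0) for event in events)
    return 1.0 / (1.0 + math.exp(-raw))


def update_strategy_value(value, score, learning_rate=0.1):
    """Move a stored strategy value a small step toward the observed score."""
    return value + learning_rate * (score - value)
```

With no feedback events the score is neutral (0.5); an interruption pushes it below 0.5 and a voice command pushes it above.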

[0146] As described above, the electronic device provided in this application innovatively constructs an emotion fusion recognition model, integrating facial expressions, voice emotions, driving behavior, and physiological state features to accurately judge the driver's emotions. It designs a scenario-based adaptive interaction strategy, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively solves the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

[0147] In another embodiment, the in-vehicle emotion interaction device based on multi-dimensional recognition can be configured separately from the central processing unit 9100. For example, the in-vehicle emotion interaction device based on multi-dimensional recognition can be configured as a chip connected to the central processing unit 9100, and the in-vehicle emotion interaction method function based on multi-dimensional recognition can be implemented through the control of the central processing unit.

[0148] As shown in Figure 3, the electronic device 9600 may further include a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is worth noting that the electronic device 9600 does not necessarily need to include all the components shown in Figure 3; in addition, the electronic device 9600 may also include components not shown in Figure 3, for which reference may be made to existing technologies.

[0149] As shown in Figure 3, the central processing unit 9100, sometimes also referred to as a controller or operating control, may include a microprocessor or other processor device and / or logic device. The central processing unit 9100 receives inputs and controls the operation of the various components of the electronic device 9600.

[0150] The memory 9140 may be, for example, one or more of a cache, flash memory, hard drive, removable media, volatile memory, non-volatile memory, or other suitable devices. It may store the aforementioned failure-related information, and also store a program for executing that information. The central processing unit 9100 may execute the program stored in the memory 9140 to perform information storage or processing, etc.

[0151] The input unit 9120 provides input to the central processing unit 9100. The input unit 9120 may be, for example, a keypad or touch input device. The power supply 9170 provides power to the electronic device 9600. The display 9160 displays images and text, and may be, for example, an LCD display, but is not limited thereto.

[0152] The memory 9140 can be a solid-state memory, such as a read-only memory (ROM), random access memory (RAM), a SIM card, etc. It can also be a memory that retains information even when power is off, can be selectively erased, and can be provided with more data; an example of this type of memory is sometimes referred to as an EPROM. The memory 9140 can also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application / function storage unit 9142 for storing application programs and function programs, or processes for executing the operation of the electronic device 9600 via the central processing unit 9100.

[0153] The memory 9140 may also include a data storage unit 9143 for storing data, such as contacts, digital data, pictures, sounds, and / or any other data used by the electronic device. The driver storage unit 9144 of the memory 9140 may include various drivers for the electronic device's communication functions and / or for performing other functions of the electronic device (such as messaging applications, address book applications, etc.).

[0154] The communication module 9110 is a transmitter / receiver that sends and receives signals via the antenna 9111. The communication module 9110 (transmitter / receiver) is coupled to the central processing unit 9100 to provide input signals and receive output signals, which is the same as in a conventional mobile communication terminal.

[0155] Based on different communication technologies, multiple communication modules 9110 can be configured in the same electronic device, such as cellular network modules, Bluetooth modules, and / or wireless LAN modules. The communication module 9110 (transmitter / receiver) is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby realizing typical telecommunications functions. The audio processor 9130 may include any suitable buffer, decoder, amplifier, etc. Additionally, the audio processor 9130 is coupled to a central processing unit 9100, enabling on-device recording via the microphone 9132 and on-device playback of stored audio via the speaker 9131.

[0156] Embodiments of this application also provide a computer-readable storage medium capable of implementing all steps of the in-vehicle emotion interaction method based on multi-dimensional recognition, where the execution subject is a server or client, as described in the above embodiments. The computer-readable storage medium stores a computer program that, when executed by a processor, implements all steps of the in-vehicle emotion interaction method based on multi-dimensional recognition, where the execution subject is a server or client, as described in the above embodiments. For example, when the processor executes the computer program, it implements the following steps:

[0157] Step S101: Collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. Extract facial expression features from the camera video data, extract voice emotion feature vectors from the voice data, extract driving behavior feature vectors from the vehicle sensor data, and extract physiological state feature vectors from the physiological sensor data. Construct a temporal feature extraction model based on a long short-term memory neural network. Input the facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors into the temporal feature extraction model to generate emotion change trend features. Input the emotion change trend features into an emotion fusion recognition model. The emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.
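The emotion fusion recognition at the end of step S101 can be sketched as attention-weighted pooling over time steps, followed by a softmax over emotion labels and a sigmoid for intensity. This is a minimal sketch, not the applicant's implementation: the dot-product scoring against a `query` vector, the single linear layers, and the example labels are assumptions.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(trend_features, query):
    """Weight each time step by softmax(query . step) and sum the steps.

    trend_features: list of T equal-length vectors (one per time step).
    query: hypothetical learned scoring vector of the same length.
    """
    scores = [sum(q * v for q, v in zip(query, step)) for step in trend_features]
    weights = softmax(scores)
    dim = len(trend_features[0])
    fused = [
        sum(w * step[d] for w, step in zip(weights, trend_features))
        for d in range(dim)
    ]
    return fused, weights


def classify(fused, class_weights, class_bias, labels):
    """Linear layer plus softmax; returns the most probable label and all probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(class_weights, class_bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


def intensity(fused, weights, bias):
    """Map the fused representation to an emotion intensity in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

All weights would be obtained by training; fixed values are passed in here only so the functions can be exercised in isolation.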

[0158] Step S102: Acquire external environment data and extract weather and road condition features. Construct a scene recognition model, which outputs a scene danger level, and set an interaction trigger threshold based on the scene danger level. Train an emotion interaction strategy model, which generates interaction strategy parameters based on the emotion state label, emotion intensity value, and interaction trigger threshold. Based on the interaction strategy parameters, select the corresponding speech model and interaction material from the speech synthesis model library and the interaction content library.

[0159] Step S103: Construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score. The optimized parameters are written into the strategy model library. Interaction content is generated based on the interaction strategy parameters and sent to the vehicle terminal.

[0160] As described above, the computer-readable storage medium provided in this application innovatively constructs an emotion fusion recognition model, integrating facial expressions, voice emotions, driving behavior, and physiological state features to achieve accurate judgment of the driver's emotions. It designs a scenario-based adaptive interaction strategy, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively solves the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

[0161] Embodiments of this application also provide a computer program product capable of implementing all steps of the in-vehicle emotion interaction method based on multi-dimensional recognition, where the execution subject is a server or client, as described in the above embodiments. When executed by a processor, this computer program / instruction implements the steps of the in-vehicle emotion interaction method based on multi-dimensional recognition. For example, the computer program / instruction implements the following steps:

[0162] Step S101: Collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. Extract facial expression features from the camera video data, extract voice emotion feature vectors from the voice data, extract driving behavior feature vectors from the vehicle sensor data, and extract physiological state feature vectors from the physiological sensor data. Construct a temporal feature extraction model based on a long short-term memory neural network. Input the facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors into the temporal feature extraction model to generate emotion change trend features. Input the emotion change trend features into an emotion fusion recognition model. The emotion fusion recognition model outputs the driver's emotion state label and emotion intensity value.

[0163] Step S102: Acquire external environment data and extract weather and road condition features. Construct a scene recognition model, which outputs a scene danger level, and set an interaction trigger threshold based on the scene danger level. Train an emotion interaction strategy model, which generates interaction strategy parameters based on the emotion state label, emotion intensity value, and interaction trigger threshold. Based on the interaction strategy parameters, select the corresponding speech model and interaction material from the speech synthesis model library and the interaction content library.

[0164] Step S103: Construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score. The optimized parameters are written into the strategy model library. Interaction content is generated based on the interaction strategy parameters and sent to the vehicle terminal.

[0165] As described above, the computer program product provided in this application innovatively constructs an emotion fusion recognition model, integrating facial expressions, voice emotions, driving behavior, and physiological state features to accurately judge the driver's emotions. It designs a scenario-based adaptive interaction strategy, combining external environmental data and hazard levels to establish interaction trigger thresholds for intelligent matching. An interaction effect evaluation mechanism is introduced, continuously optimizing the interaction strategy model through an online learning module to achieve dynamic adjustment of personalized interaction content. This method effectively solves the shortcomings of traditional technologies in emotion recognition, interaction strategies, and effect evaluation, significantly improving the intelligence level and user experience of in-vehicle emotion interaction.

[0166] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0167] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each process and / or block of the flowchart illustrations and / or block diagrams, and combinations of processes and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0168] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0169] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0170] Specific embodiments have been used to illustrate the principles and implementation methods of this invention. The descriptions of the embodiments above are only for the purpose of helping to understand the method and core ideas of this invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this invention. Therefore, the content of this specification should not be construed as a limitation of this invention.

Claims

1. A method for in-vehicle emotion interaction based on multi-dimensional recognition, characterized in that the method includes: Camera video data, microphone voice data, vehicle sensor data, and physiological sensor data are collected from a vehicle terminal. Facial expression features are extracted from the camera video data, voice emotion feature vectors are extracted from the voice data, driving behavior feature vectors are extracted from the vehicle sensor data, and physiological state feature vectors are extracted from the physiological sensor data. A temporal feature extraction model is constructed based on a long short-term memory neural network. The facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors are input into the temporal feature extraction model to generate emotion change trend features. These emotion change trend features are then input into an emotion fusion recognition model, which outputs the driver's emotion state label and emotion intensity value. This step includes: A long short-term memory neural network structure is constructed, in which the network input layer dimension corresponds to the feature vector dimension and the hidden layer contains multiple memory units. Each memory unit consists of an input gate, a forget gate, and an output gate: the input gate controls the importance of newly input information at the current moment, the forget gate controls the proportion of historical information that is forgotten, and the output gate controls the degree to which information is output. Labeled historical data is input into the long short-term memory neural network for training to obtain the temporal feature extraction model.
The facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors are input into the temporal feature extraction model in time order to generate trend features reflecting how the driver's emotions change over time. A multimodal feature fusion network based on an attention mechanism is constructed as the emotion fusion recognition model. The attention weights at each time step in the trend features are calculated, and the features are weighted and summed according to the attention weights to obtain a fused feature representation. The fused feature representation is input into a fully connected layer, and a softmax classifier outputs the probability distribution of the emotion state label. The category with the highest probability is selected as the final emotion state label, and the sigmoid function maps the fused feature representation to a range of 0 to 1 to obtain the emotion intensity value.
External environmental data is acquired, weather and road condition features are extracted, and a scene recognition model is constructed; the scene recognition model outputs a scene hazard level, and an interaction trigger threshold is set based on the scene hazard level. This step includes: collecting rain / snow weather data, visibility data, and temperature / humidity data through an environmental perception sensor of the vehicle terminal; acquiring real-time road condition data, traffic flow data, and road surface condition data from an onboard communication module; performing feature extraction and numerical processing on the rain / snow weather data, visibility data, and temperature / humidity data to obtain weather condition features; performing feature extraction and numerical processing on the real-time road condition data, traffic flow data, and road surface condition data to obtain road condition features; and normalizing the weather condition features and road condition features. A multilayer perceptron neural network is constructed as the scene recognition model. The network input layer receives the weather condition features and road condition features, the hidden layer uses the ReLU activation function, and the output layer uses the softmax function to output the probability distribution of the scene hazard level. The scene recognition model is trained based on historical labeled data, and the scene hazard level is mapped to a preset interval to obtain an interaction trigger baseline value. Interaction trigger thresholds are then set for different scenarios based on the interaction trigger baseline value.
An emotion interaction strategy model is trained. This step includes: constructing an emotion interaction strategy model based on deep reinforcement learning; combining the emotion state label, emotion intensity value, and interaction trigger threshold to construct a state vector; using interaction tone, interaction content type, and interaction timing as the action space; designing an interaction effect scoring function as the reward signal; training the emotion interaction strategy model using a deep Q-network; selecting interaction actions based on an exploration-exploitation strategy during training; decoding the selected interaction actions into interaction strategy parameters, which include speech synthesis parameters, content matching parameters, and push timing parameters; selecting a corresponding speech clone model from the speech synthesis model library based on the speech synthesis parameters; calculating the similarity score of materials in the interaction content library based on the content matching parameters; selecting the audio, text, and image with the highest similarity score as interaction materials; determining the optimal interaction time based on the push timing parameters; and applying the speech clone model to the text content in the interaction materials to generate personalized speech broadcast content.
An interaction effect evaluation model is constructed. The interaction effect evaluation model calculates the interaction satisfaction score based on the driver's feedback behavior to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score. The optimized parameters are written into the strategy model library. Interaction content is generated based on the interaction strategy parameters and sent to the vehicle terminal.
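By way of non-limiting illustration, the exploration-exploitation selection over an action space of interaction tone, content type, and timing, and the decoding of a selected action into interaction strategy parameters, might be sketched as below. The concrete tone, content, and timing values are hypothetical placeholders, epsilon-greedy is assumed as the exploration-exploitation strategy, and `q_update` shows a tabular temporal-difference target rather than a deep Q-network gradient step.

```python
import itertools
import random

# Hypothetical discrete action space: tone x content type x timing.
TONES = ("soothing", "neutral", "cheerful")
CONTENT_TYPES = ("music", "voice_prompt", "rest_reminder")
TIMINGS = ("immediate", "delayed")
ACTIONS = list(itertools.product(TONES, CONTENT_TYPES, TIMINGS))


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decode_action(index):
    """Decode an action index into the three groups of strategy parameters."""
    tone, content_type, timing = ACTIONS[index]
    return {
        "speech_synthesis": {"tone": tone},
        "content_matching": {"type": content_type},
        "push_timing": {"mode": timing},
    }


def q_update(q_values, action, reward, next_q_values, alpha=0.1, gamma=0.9):
    """One temporal-difference update toward reward + gamma * max(next Q)."""
    target = reward + gamma * max(next_q_values)
    q_values[action] += alpha * (target - q_values[action])
    return q_values
```

In a deep Q-network the list `q_values` would be the network output for the current state vector, and the same target would drive a gradient step on the network weights.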

2. The in-vehicle emotion interaction method based on multi-dimensional recognition according to claim 1, characterized in that the step of collecting camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal, extracting facial expression features from the camera video data, extracting voice emotion feature vectors from the voice data, extracting driving behavior feature vectors from the vehicle sensor data, and extracting physiological state feature vectors from the physiological sensor data includes: Camera video data and microphone voice data are collected through the sensor interface of the vehicle terminal. Frame sequence extraction is performed on the camera video data. A face detection algorithm is used to locate the face region in each frame image. Based on a face key point localization algorithm, the coordinates of 68 facial feature points are extracted to construct a facial geometric feature descriptor. The facial geometric feature descriptor is input into a pre-trained expression classification neural network to extract facial expression features. The microphone voice data is segmented and pre-emphasized. Mel frequency cepstral coefficients and pitch period are calculated and combined with acoustic features such as sound intensity and pitch to construct a voice emotion feature vector. Steering wheel angle data, vehicle speed data, and braking depth data are acquired from the vehicle sensor data interface. Electrocardiogram (ECG) data, skin conductance data, and respiratory rate data are collected from physiological sensors. The steering wheel angle data, vehicle speed data, and braking depth data are normalized and segmented by time window to form driving behavior feature vectors. The ECG data, skin conductance data, and respiratory rate data are subjected to noise filtering and signal smoothing.
A heart rate variability index and a skin conductance level index are extracted. The heart rate variability index and skin conductance level index are combined to construct a physiological state feature vector.
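As one possible illustration of the physiological state feature vector, the sketch below derives two common heart rate variability indices (SDNN and RMSSD) from R-peak timestamps and combines them with a mean skin conductance level. The choice of SDNN and RMSSD, the use of a simple mean for skin conductance, and the function names are assumptions; the claim does not specify which indices are used.

```python
import math
import statistics


def rr_intervals_ms(r_peak_times_s):
    """Intervals between successive R peaks, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]


def sdnn(rr):
    """Sample standard deviation of the RR intervals."""
    return statistics.stdev(rr)


def rmssd(rr):
    """Root mean square of successive RR-interval differences."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def physiological_vector(r_peak_times_s, skin_conductance_us):
    """Combine HRV indices and mean skin conductance into one feature vector."""
    rr = rr_intervals_ms(r_peak_times_s)
    return [sdnn(rr), rmssd(rr), statistics.fmean(skin_conductance_us)]
```

R-peak detection, noise filtering, and smoothing are assumed to have been done upstream and are outside this sketch.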

3. The in-vehicle emotion interaction method based on multi-dimensional recognition according to claim 1, characterized in that the step of constructing an interaction effect evaluation model, in which the interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior to the interaction content, the interaction satisfaction score is input into an online learning module, and the online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score, includes: An interaction effect evaluation model based on behavior sequence analysis is constructed. The driver's click operation data, voice command data, and interruption behavior data are collected. The click operation data, voice command data, and interruption behavior data are time-series encoded to obtain the interaction behavior sequence. The interaction behavior sequence is input into a bidirectional gated recurrent unit network to extract interaction feedback features. A scoring function is constructed based on the interaction feedback features to calculate the interaction satisfaction score. The interaction scenario information and the interaction satisfaction score are combined to form an evaluation sample pair. An online learning module based on gradient boosting trees is constructed. The evaluation sample pairs are input into the online learning module in chronological order. A set of recent samples is maintained using a sliding window. The importance scores of each parameter in the emotion interaction strategy model are calculated based on the sample set. The parameter update step size is adjusted according to the importance scores. The network weights of the emotion interaction strategy model are updated online using the stochastic gradient descent method. The updated network weights are saved to the strategy model library.
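The sliding-window sample set and the importance-scaled update step can be illustrated with the short sketch below. It does not implement the gradient boosting tree; the importance scores are simply taken as an input, and the rule that a parameter's step size is proportional to its normalized importance is an assumption, as are the class and function names.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the most recent evaluation sample pairs."""

    def __init__(self, max_size):
        self.samples = deque(maxlen=max_size)

    def add(self, scenario, score):
        # Once the window is full, the oldest pair is dropped automatically.
        self.samples.append((scenario, score))

    def mean_score(self):
        return sum(score for _, score in self.samples) / len(self.samples)


def scaled_sgd_step(weights, gradients, importance, base_lr=0.01):
    """Gradient step whose per-parameter size is scaled by normalized importance."""
    total = sum(importance) or 1.0
    return [
        w - base_lr * (imp / total) * g
        for w, g, imp in zip(weights, gradients, importance)
    ]
```

For instance, with importance scores of 3 and 1, the first parameter receives three quarters of the base step and the second receives one quarter.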

4. The in-vehicle emotion interaction method based on multi-dimensional recognition according to claim 1, characterized in that the step of writing the optimized parameters into the strategy model library, generating interaction content based on the interaction strategy parameters, and sending it to the vehicle terminal includes: A version control-based strategy model update mechanism is constructed. The optimized network weights and model structure information are packaged into a model snapshot file. The integrity check code of the model snapshot file is calculated. The model snapshot file and the integrity check code are written into the temporary storage area of the strategy model library. The model snapshot file is decompressed and verified. After the verification is successful, it is transferred to the formal storage area. The version identifier of the strategy model is updated, expired historical version files are deleted, and a model update log is generated. Based on the interaction strategy parameters, the interaction materials are combined and processed, and the voice broadcast content, text prompt content, and image prompt content are encapsulated into a unified message format. The encapsulated message is then compressed and encrypted. The encrypted message is sent to the display control module and the audio control module through the message queue service of the vehicle terminal. The display control module displays the text and image content in a designated area of the vehicle terminal display screen, and the audio control module plays the voice content through the vehicle terminal speaker.
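The snapshot packaging, integrity check, and promotion from temporary to formal storage can be sketched as follows. SHA-256 as the integrity check code, zlib compression, JSON serialization, and the file naming scheme are all assumptions chosen for illustration; the claim does not name specific algorithms or formats.

```python
import hashlib
import json
import os
import zlib


def pack_snapshot(weights, structure):
    """Serialize and compress a model snapshot; return the bytes and their checksum."""
    payload = json.dumps(
        {"weights": weights, "structure": structure}, sort_keys=True
    ).encode("utf-8")
    blob = zlib.compress(payload)
    return blob, hashlib.sha256(blob).hexdigest()


def publish_snapshot(blob, checksum, staging_dir, formal_dir, version):
    """Write to the staging area, verify, then move to the formal area."""
    staged = os.path.join(staging_dir, f"model_v{version}.bin")
    with open(staged, "wb") as fh:
        fh.write(blob)
    with open(staged, "rb") as fh:
        data = fh.read()
    if hashlib.sha256(data).hexdigest() != checksum:
        os.remove(staged)
        raise ValueError("integrity check failed")
    json.loads(zlib.decompress(data))  # confirm the snapshot decompresses and parses
    final = os.path.join(formal_dir, f"model_v{version}.bin")
    os.replace(staged, final)  # assumes both areas are on the same filesystem
    return final
```

Deleting expired versions and writing the update log would follow the same pattern and are omitted here for brevity.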

5. A vehicle-mounted emotion interaction device based on multi-dimensional recognition, characterized in that, The device includes: The data preprocessing module is used to collect camera video data, microphone voice data, vehicle sensor data, and physiological sensor data from the vehicle terminal. It extracts facial expression features from the camera video data, voice emotion feature vectors from the voice data, driving behavior feature vectors from the vehicle sensor data, and physiological state feature vectors from the physiological sensor data. A temporal feature extraction model is constructed based on a long short-term memory neural network. The facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors are input into the temporal feature extraction model to generate emotion change trend features. These emotion change trend features are then input into an emotion fusion recognition model, which outputs the driver's emotion state label and emotion intensity value. The module includes: constructing a long short-term memory neural network structure, where the network input layer dimension corresponds to the feature vector dimension, and the hidden layer contains multiple memory units. Each memory unit consists of an input gate, a forget gate, and an output gate. The input gate... The importance of newly input information at the current moment is determined by the forgetting gate, which controls the forgetting ratio of historical information, and the output gate, which controls the output degree of information. Labeled historical data is input into the Long Short-Term Memory Neural Network for training to obtain a temporal feature extraction model. 
The facial expression features, voice emotion feature vectors, driving behavior feature vectors, and physiological state feature vectors are input into the temporal feature extraction model in time order to generate trend features reflecting how the driver's emotions change over time. A multimodal feature fusion network based on an attention mechanism is constructed as the emotion fusion recognition model. The attention weight of each time step in the trend features is calculated, and the features are weighted and summed according to the attention weights to obtain a fused feature representation. The fused feature representation is input into a fully connected layer, and a softmax classifier outputs the probability distribution over emotion state labels. The category with the highest probability is selected as the final emotion state label, and a sigmoid function maps the fused feature representation to the range 0 to 1 to obtain the emotion intensity value.
An emotion interaction module, used to acquire external environmental data, extract weather and road condition features, and construct a scene recognition model. The scene recognition model outputs a scene hazard level, and an interaction trigger threshold is set based on the scene hazard level. This includes: collecting rain/snow weather data, visibility data, and temperature/humidity data through the vehicle terminal's environmental perception sensors; acquiring real-time road condition data, traffic flow data, and road surface condition data from the vehicle communication module; performing feature extraction and numerical processing on the rain/snow weather data, visibility data, and temperature/humidity data to obtain weather condition features; and performing feature extraction and numerical processing on the real-time road condition data, traffic flow data, and road surface condition data to obtain road condition features. 
The weather condition features and the road condition features are normalized. A multilayer perceptron neural network is constructed as the scene recognition model. The network input layer receives the weather condition features and the road condition features, the hidden layer uses the ReLU activation function, and the output layer uses the softmax function to output the probability distribution over scene hazard levels. The scene recognition model is trained on historical labeled data. The scene hazard level is mapped to a preset interval to obtain an interaction trigger benchmark value, and the interaction trigger threshold for each scenario is set according to the interaction trigger benchmark value.
The training of an emotion interaction strategy model includes: constructing an emotion interaction strategy model based on deep reinforcement learning; combining the emotion state label, emotion intensity value, and interaction trigger threshold to construct a state vector; using interaction tone, interaction content type, and interaction timing as the action space; designing an interaction effect scoring function as the reward signal; training the emotion interaction strategy model using a deep Q-network; selecting interaction actions based on an exploration-exploitation strategy during training; decoding the selected interaction actions into interaction strategy parameters, which include speech synthesis parameters, content matching parameters, and push timing parameters; selecting a corresponding speech clone model from the speech synthesis model library based on the speech synthesis parameters; calculating the similarity score of materials in the interaction content library based on the content matching parameters; selecting the audio, text, and image with the highest similarity score as interaction materials; determining the optimal interaction time based on the push timing parameters; and applying the speech clone model to the text content in the interaction materials to generate personalized speech broadcast content.
An interaction evaluation module, used to construct an interaction effect evaluation model. The interaction effect evaluation model calculates an interaction satisfaction score based on the driver's feedback behavior in response to the interaction content. The interaction satisfaction score is input into an online learning module. The online learning module optimizes the parameters of the emotion interaction strategy model based on the interaction satisfaction score, writes the optimized parameters into a strategy model library, generates interaction content based on the interaction strategy parameters, and sends it to the vehicle terminal.
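
For illustration only, the attention-based fusion step recited in claim 5 (attention weights per time step, weighted sum, softmax label, sigmoid intensity) might be sketched as below. This is a minimal sketch, not the patented network: the claim does not specify how attention scores are computed, so dot-product scoring against a learned query vector is an assumption, as are the label set `EMOTIONS`, the linear projection applied before the sigmoid, and every function and parameter name.

```python
import math
from typing import List, Sequence, Tuple

EMOTIONS = ("calm", "happy", "angry", "anxious")  # hypothetical label set


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_and_classify(
    trend: Sequence[Sequence[float]],          # one trend feature vector per time step
    attn_query: Sequence[float],               # assumed learned attention query
    class_weights: Sequence[Sequence[float]],  # fully connected layer, one row per label
    class_bias: Sequence[float],
    intensity_weights: Sequence[float],        # assumed projection before the sigmoid
    intensity_bias: float,
) -> Tuple[str, float, List[float]]:
    # Attention weight for each time step (dot-product scoring is an assumption).
    weights = softmax([dot(step, attn_query) for step in trend])

    # Weighted sum over time steps gives the fused feature representation.
    dim = len(trend[0])
    fused = [sum(w * step[i] for w, step in zip(weights, trend)) for i in range(dim)]

    # Fully connected layer + softmax gives the label probability distribution.
    logits = [dot(row, fused) + b for row, b in zip(class_weights, class_bias)]
    probs = softmax(logits)
    label = EMOTIONS[max(range(len(probs)), key=probs.__getitem__)]

    # Sigmoid maps a scalar projection of the fused features into (0, 1).
    intensity = 1.0 / (1.0 + math.exp(-(dot(intensity_weights, fused) + intensity_bias)))
    return label, intensity, weights


if __name__ == "__main__":
    trend = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
    label, intensity, weights = fuse_and_classify(
        trend,
        attn_query=[0.0, 1.0],
        class_weights=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
        class_bias=[0.0, 0.0, 0.0, 0.0],
        intensity_weights=[0.0, 1.0],
        intensity_bias=0.0,
    )
    print(label, round(intensity, 3), [round(w, 3) for w in weights])
```

In a trained system the query, layer weights, and biases would be learned jointly with the temporal feature extraction model; the fixed numbers in the demo only exercise the arithmetic.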

6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the in-vehicle emotion interaction method based on multi-dimensional recognition as described in any one of claims 1 to 4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the in-vehicle emotion interaction method based on multi-dimensional recognition as described in any one of claims 1 to 4.