Emotion perception robot control method and device based on LLVM core

By constructing a control method for emotion-aware robots based on multimodal feature processing and neural networks with LLVM core, the shortcomings of emotion recognition and expression are solved, and efficient interaction and coordinated expression of emotion-aware robots are realized.

CN122008253B · Active · Publication Date: 2026-06-12 · SHENZHEN ZHI HUI LIN NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN ZHI HUI LIN NETWORK TECH CO LTD
Filing Date
2026-04-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing methods for controlling emotion-aware robots are inadequate in terms of multimodal perception and feature fusion, lacking a sound emotional evolution mechanism and attention regulation strategy, resulting in inaccurate emotional state recognition and affecting interaction performance.

Method used

By employing a method based on the LLVM core, multimodal feature extraction and temporal alignment processing are performed using cameras and microphones to construct an emotion feature mapper and an emotion recognition neural network. Combined with an emotion evolution predictor and a language model modulator, this approach achieves accurate identification and coordinated expression of emotional states.

Benefits of technology

It effectively solves the shortcomings of traditional technologies in emotion recognition, language processing and emotion expression, realizes efficient interaction of emotion-aware robots, and ensures the coordination and reliability of emotion expression.

Abstract

The embodiments of this application provide an emotion-aware robot control method and device based on the LLVM core, which achieve effective emotion perception through multimodal features and neural networks. A language processing mechanism is constructed that combines emotion regulation and dialogue understanding to establish a reliable interaction strategy. Expression control is introduced to ensure coordinated emotional expression through action planning and speech synthesis. The method addresses the deficiencies of traditional technology in emotion recognition, language processing, and emotional expression, and provides technical support for emotion-aware robots.

Description

Technical Field

[0001] This application relates to the field of embodied robots, specifically to a control method and device for emotion-aware robots based on the LLVM core.

Background Technology

[0002] Existing methods for controlling emotion-aware robots have significant shortcomings. Traditional systems perform poorly in multimodal perception and feature fusion, failing to accurately identify emotional states and thus impacting interaction effectiveness.

[0003] Furthermore, existing technologies face bottlenecks in emotion modeling and language processing. Most systems lack robust emotion evolution mechanisms and attention regulation strategies, resulting in unnatural dialogue comprehension.

[0004] Existing systems have technical shortcomings in emotional expression. The lack of in-depth coordination between action and speech makes it difficult to achieve efficient emotional transmission through multimodal output, thus impacting the interactive experience. Solving these problems is crucial for improving the capabilities of emotion-aware robots.

Summary of the Invention

[0005] To address the problems in existing technologies, this application provides an emotion-aware robot control method and device based on the LLVM core, which can effectively solve the shortcomings of traditional technologies in emotion recognition, language processing and emotion expression, and provide technical support for emotion-aware robots.

[0006] To solve at least one of the above problems, this application provides the following technical solution:

[0007] Firstly, this application provides a control method for emotion-aware robots based on the LLVM core, including:

[0008] Multimodal feature vectors are obtained by extracting features from the image data stream captured by the camera and the voice data stream captured by the microphone. The multimodal feature vectors are then subjected to temporal alignment to generate a standardized feature matrix. An emotion feature mapper is constructed based on the standardized feature matrix to obtain a low-dimensional representation vector. The low-dimensional representation vector is then input into an emotion recognition neural network to train an emotion evaluation model. An emotion state vector is calculated based on the emotion evaluation model. Finally, an emotion evolution predictor is obtained by establishing a memory cache pool based on the emotion state vector.

[0009] The emotional state vector and the output of the emotional evolution predictor are input into the language model regulator to adjust the attention parameters to obtain an emotional enhancement language model. The user input information and the emotional state vector are fused to obtain an emotional perception input vector. The emotional perception input vector is input into the dialogue intent understanding module to generate a dialogue history vector. An interaction strategy vector is generated based on the dialogue history vector and the emotional state vector. The interaction strategy vector is adjusted for emotional intensity to obtain an emotional expression control vector.

[0010] The emotion expression control vector is parsed into actuator driving parameters and speech synthesis parameters. Collision detection is performed on the actuator driving parameters to obtain a safe trajectory. The robot actuator is controlled to output actions according to the safe trajectory. The speech synthesis parameters are input into the speech prosody modulator to generate a speech waveform. Based on the speech waveform and the output action, the robot's collaborative emotion expression is realized.

[0011] Furthermore, it also includes: performing edge detection and region segmentation on the image data stream to obtain image feature regions; extracting local descriptive operators from the image feature regions to generate an image feature description set; converting the speech data stream into a spectrogram sequence and performing time-frequency analysis to obtain an acoustic feature description set; performing feature vector quantization on the image feature description set and the acoustic feature description set to generate a multimodal feature coding matrix; and constructing a cross-modal alignment network based on the multimodal feature coding matrix to obtain an alignment feature mapper.

[0012] The multimodal feature coding matrix is input into the alignment feature mapper for temporal alignment processing to obtain a standardized feature sequence. Principal component analysis is performed on the standardized feature sequence according to a preset dimensionality reduction rule to generate a feature projection matrix. A feature dimensionality reduction network is trained based on the feature projection matrix to obtain an emotion feature mapper. The standardized feature sequence received in real time is input into the emotion feature mapper to generate a low-dimensional representation vector.

[0013] Furthermore, it also includes: dividing the low-dimensional representation vector into training sequences and validation sequences according to the time-series segmentation rules; performing data augmentation processing on the training sequences to generate a training sample set; constructing a multilayer perceptron network structure based on the training sample set to obtain an emotion recognition model prototype; iteratively optimizing and training the emotion recognition model prototype according to a preset loss function to obtain an emotion evaluation model; and inputting the validation sequence into the emotion evaluation model for cross-validation to obtain a model evaluation index.

[0014] The real-time low-dimensional representation vector is input into the emotion assessment model for forward computation to obtain the emotion state vector. The emotion state vector is then subjected to temporal sliding sampling to generate an emotion state sequence. A recurrent neural network is constructed based on the emotion state sequence to obtain an emotion evolution predictor. The emotion evolution predictor is deployed as a memory cache pool for real-time state maintenance and prediction updates.

[0015] Furthermore, it also includes: concatenating the emotional state vector with the output of the emotional evolution predictor to obtain an emotional regulation vector; normalizing the emotional regulation vector to generate a weight distribution matrix; constructing an attention regulation network based on the weight distribution matrix to obtain a parameter regulation model; and remapping the output of the parameter regulation model with a pre-trained language model to obtain an emotional enhancement language model.

[0016] User input information is converted into a text sequence and segmented to obtain a word sequence. Multi-head attention is then computed over the word sequence and the emotion state vector to generate a fusion feature matrix. The fusion feature matrix is then input into a bidirectional encoder to obtain an emotion perception input vector. An intent recognition classifier is then constructed based on the emotion perception input vector to generate a dialogue history vector.

[0017] Furthermore, it also includes: concatenating the dialogue history vector and the emotion state vector to obtain a multimodal state vector; performing hierarchical encoding on the multimodal state vector to generate a state representation matrix; constructing a policy generation network based on the state representation matrix to obtain an interaction decision model; and optimizing the interaction decision model using Monte Carlo tree search to obtain an interaction policy vector.

[0018] The interaction strategy vector is decoupled and decomposed according to emotion type to obtain an intensity parameter set. An adaptive regulator is constructed based on the intensity parameter set to generate an adjustment coefficient matrix. The adjustment coefficient matrix and the interaction strategy vector are component concatenated to obtain an emotion expression control vector.

[0019] Furthermore, it also includes: performing dimensional decomposition on the emotion expression control vector according to a preset parsing rule to obtain an action parameter matrix; constructing a kinematic mapping network based on the action parameter matrix to generate a joint space mapper; performing constraint optimization on the output of the joint space mapper to obtain actuator driving parameters; and performing interpolation smoothing on the actuator driving parameters to generate an initial trajectory sequence.

[0020] The initial trajectory sequence is input into the collision detection module for spatial interference analysis to obtain an obstacle avoidance path set. Based on the obstacle avoidance path set, a trajectory optimizer is constructed to generate a safe trajectory. The safe trajectory is then inversely solved according to the robot's kinematics model to obtain joint drive commands. The actuator is controlled to complete the action output according to the joint drive commands.

[0021] Furthermore, it also includes: decomposing the speech synthesis parameters according to the phoneme structure to obtain a prosodic feature set; performing emotion mapping transformation on the prosodic feature set to generate a prosodic control vector; constructing an acoustic parameter generation network based on the prosodic control vector to obtain a speech synthesis model; and processing the output of the speech synthesis model through a vocoder to obtain a speech waveform.

[0022] The speech waveform is subjected to time-series analysis to obtain a speech timestamp sequence. The speech timestamp sequence is synchronized and aligned with the action execution time sequence to generate a collaborative control command. Based on the collaborative control command, the robot actuator and the speech player are scheduled to complete multimodal emotional expression.

[0023] Secondly, this application provides an emotion-aware robot control device based on the LLVM core, comprising:

[0024] The model building module is used to extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain multimodal feature vectors, perform temporal alignment processing on the multimodal feature vectors to generate a standardized feature matrix, construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector, input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model, calculate the emotion state vector based on the emotion evaluation model, and establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor.

[0025] The emotion perception module is used to input the emotion state vector and the output of the emotion evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotion-enhanced language model, perform feature fusion on the user input information and the emotion state vector to obtain an emotion perception input vector, input the emotion perception input vector into the dialogue intent understanding module to generate a dialogue history vector, and generate an interaction strategy vector based on the dialogue history vector and the emotion state vector, and adjust the emotion intensity of the interaction strategy vector to obtain an emotion expression control vector.

[0026] The robot control module is used to parse the emotion expression control vector into actuator driving parameters and speech synthesis parameters, perform collision detection on the actuator driving parameters to obtain a safe trajectory, control the robot actuator to output actions according to the safe trajectory, input the speech synthesis parameters into a speech prosody modulator to generate a speech waveform, and realize the robot's collaborative emotion expression based on the speech waveform and output actions.

[0027] Thirdly, this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the LLVM-based emotion-aware robot control method.

[0028] Fourthly, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the described LLVM-based emotion-aware robot control method.

[0029] Fifthly, this application provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the described LLVM-based emotion-aware robot control method.

[0030] As can be seen from the above technical solution, this application provides an emotion-aware robot control method and device based on the LLVM core, which achieves effective emotion perception through multimodal features and neural networks. A language processing mechanism is constructed, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. Expression control is introduced, ensuring the coordination of emotion expression through action planning and speech synthesis. This method effectively solves the shortcomings of traditional technologies in emotion recognition, language processing, and emotion expression, providing technical support for emotion-aware robots.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a flowchart of the emotion-aware robot control method based on the LLVM core in the embodiments of this application;

[0033] Figure 2 is a structural diagram of the emotion-aware robot control device based on the LLVM core in the embodiments of this application;

[0034] Figure 3 is a schematic diagram of the structure of the electronic device in the embodiments of this application.

[0035] Figure label:

[0036] Electronic device 9600, central processing unit 9100, memory 9140, communication module 9110, input unit 9120, audio processor 9130, display 9160, power supply 9170, buffer memory 9141, application / function storage unit 9142, data storage unit 9143, driver storage unit 9144, antenna 9111, speaker 9131, microphone 9132.

Detailed Implementation

[0037] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0038] The acquisition, storage, use, and processing of data in this application comply with relevant laws and regulations.

[0039] To address the shortcomings of existing technologies, this application provides a control method and device for emotion-aware robots based on the LLVM core. This method achieves effective emotion perception through multimodal features and neural networks. A language processing mechanism is constructed, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. Expression control is introduced, ensuring the coordination of emotional expression through action planning and speech synthesis. This method effectively solves the deficiencies of traditional technologies in emotion recognition, language processing, and emotional expression, providing technical support for emotion-aware robots.

[0040] To effectively address the shortcomings of traditional technologies in emotion recognition, language processing, and emotion expression, and to provide technical support for emotion-aware robots, this application provides an embodiment of an emotion-aware robot control method based on the LLVM core. Referring to Figure 1, the emotion-aware robot control method based on the LLVM core specifically includes the following:

[0041] Step S101: Extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain a multimodal feature vector. Perform temporal alignment processing on the multimodal feature vector to generate a standardized feature matrix. Construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector. Input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model. Calculate the emotion state vector based on the emotion evaluation model. Establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor.

[0042] First, the camera image data stream and microphone audio data stream are simultaneously sampled and processed. Based on timestamp comparison, the two data streams are aligned and frame extraction is performed, and missing segments are interpolated using nearby time windows. On the image side, candidate regions are obtained through edge detection and region segmentation, and local descriptive operators are extracted within these regions to form an image feature description set. On the audio side, continuous audio is segmented into fixed-length segments and converted into a spectrogram sequence, which is then combined with time-frequency analysis to obtain an acoustic feature description set. These two types of descriptions are uniformly encoded into multimodal feature vectors of consistent length and arranged in the sampling order to form an initial multimodal sequence, which serves as the input for subsequent temporal alignment.

[0043] Next, cross-modal temporal alignment is performed based on the initial multimodal sequence. A sliding time window is used to dynamically register image vectors and acoustic vectors within the same window, outputting a standardized feature matrix. To handle cross-modal delay, alignment offsets and interpolation ratios are recorded at the window boundaries, and outlier-frame removal markers are written to metadata columns in the same row. The standardized feature matrix is organized with a fixed step size and uniform dimensions, preserving both temporal indexes and confidence information so that the influence of uncertain samples on subsequent model training can be controlled.
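
The window-based registration in [0043] can be illustrated with a minimal sketch. It assumes each stream is a list of (timestamp, feature list) pairs sorted by time; the function name `align_streams`, the 40 ms window, and the linear confidence rule are illustrative assumptions, not details given in the application.

```python
from bisect import bisect_left

def align_streams(image_frames, audio_frames, window=0.04):
    """Pair each image frame with the nearest audio frame inside a time window.

    Each input is a list of (timestamp_seconds, feature_list) sorted by time.
    Returns one row per image frame: time index, fused features, alignment
    offset, confidence, and an outlier marker (hypothetical row layout).
    """
    audio_times = [t for t, _ in audio_frames]
    rows = []
    for t_img, img_feat in image_frames:
        i = bisect_left(audio_times, t_img)
        # Candidates: the audio frames just before and just after t_img.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(audio_frames)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(audio_times[k] - t_img))
        offset = audio_times[j] - t_img
        if abs(offset) > window:
            # No audio frame inside the window: keep the row but mark it.
            rows.append({"t": t_img, "features": None, "offset": offset,
                         "confidence": 0.0, "outlier": True})
            continue
        # Assumed rule: confidence falls linearly with the alignment offset.
        rows.append({"t": t_img, "features": img_feat + audio_frames[j][1],
                     "offset": offset,
                     "confidence": 1.0 - abs(offset) / window,
                     "outlier": False})
    return rows
```

Rows marked as outliers carry zero confidence, so a downstream trainer could skip or down-weight them.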

[0044] The standardized feature matrix is then input into the construction process of the emotion feature mapper. This process first compresses collinear dimensions using linear projection and outputs transition features, then learns the patterns of joint variation between the image and acoustic features using nonlinear embedding, ultimately generating a low-dimensional representation vector. This low-dimensional representation vector is cached at each time step and indexed with the aforementioned alignment offset for sample weighting in subsequent training phases. To avoid insufficient representation due to dimensionality collapse, the mapper introduces an early stopping strategy during training, monitoring the reconstruction consistency index on the validation set to ensure that the compressed components can still distinguish the main emotion categories.
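
The early stopping strategy mentioned in [0044] can be sketched as a patience rule over a per-epoch validation score. This is a generic illustration: the function name `early_stopping`, the `patience` and `min_delta` parameters, and the convention that a higher score is better are assumptions, since the application does not define the reconstruction consistency index.

```python
def early_stopping(val_scores, patience=3, min_delta=0.0):
    """Return (best_epoch, stopped_at) for a list of validation scores.

    Training stops once the score has failed to improve by more than
    min_delta for `patience` consecutive epochs. Higher is assumed better.
    """
    best_score = float("-inf")
    best_epoch = -1
    stopped_at = -1
    wait = 0
    for epoch, score in enumerate(val_scores):
        stopped_at = epoch
        if score > best_score + min_delta:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, stopped_at
```

In practice the mapper weights from `best_epoch` would be restored after stopping.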

[0045] Based on the aforementioned low-dimensional representation vector, an emotion recognition neural network, also referred to as an emotion recognizer, is constructed and trained. Training data comes from historical segments labeled with emotions, and diverse samples are generated through speech-rate and illumination perturbations. Sample weights depend on the aforementioned alignment offset and interpolation ratio; samples with lower weights have a weaker impact on parameter updates. To clarify the training objective and stability constraints, a cost function is set:

[0046] Q = a1×U + a2×V,

[0047] Where Q is the training objective; U is the classification loss computed from the emotion labels and the model output; V is the regularization term for the smoothness of changes in the low-dimensional representation vector; a1 and a2 are non-negative weights, selected within a preset range based on performance on the validation set. The emotion evaluation model is obtained by minimizing Q. During the inference phase, this model outputs an emotion state vector at each time step, carrying a time index.
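
As a worked illustration of Q = a1×U + a2×V, the sketch below assumes U is a sample-weighted cross-entropy and V is the mean squared difference between consecutive low-dimensional vectors. The application names only a "classification loss" and a "smoothness" term, so both concrete choices, and the default weights, are assumptions.

```python
import math

def classification_loss(probs, labels, weights):
    """U: sample-weighted mean cross-entropy (assumed form of the loss)."""
    total = sum(w * -math.log(max(p[y], 1e-12))
                for p, y, w in zip(probs, labels, weights))
    return total / max(sum(weights), 1e-12)

def smoothness_penalty(vectors):
    """V: mean squared change between consecutive low-dimensional vectors."""
    if len(vectors) < 2:
        return 0.0
    diffs = [sum((b - a) ** 2 for a, b in zip(u, v))
             for u, v in zip(vectors, vectors[1:])]
    return sum(diffs) / len(diffs)

def objective(probs, labels, weights, vectors, a1=1.0, a2=0.1):
    """Q = a1*U + a2*V with non-negative a1 and a2."""
    return (a1 * classification_loss(probs, labels, weights)
            + a2 * smoothness_penalty(vectors))
```

The `weights` argument is where the sample weights derived from the alignment offset and interpolation ratio would enter.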

[0048] After the emotion state vector is output, a memory cache pool is established, and an emotion evolution predictor is constructed on top of it. The cache pool maintains an active window over the most recent time steps, storing the state components, source confidence, and time index for each time step. When a new state is written, an out-of-bounds check is performed on the difference between adjacent entries, and abrupt changes are marked for suppression. The emotion evolution predictor uses a recurrent structure to read the active window of the cache pool, provides a trend estimate for the next time step based on the context of the most recent states, and packages the trend and the current state into a state entry that is written back to the cache pool.
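
A minimal sketch of the cache pool in [0048] follows. The class name `EmotionCachePool`, the `max_jump` threshold, and the entry layout are hypothetical. The trend estimate here is a plain average slope over the active window, used only as a stand-in for the recurrent predictor the application describes.

```python
from collections import deque

class EmotionCachePool:
    """Fixed-size active window of emotion state entries (illustrative)."""

    def __init__(self, window=5, max_jump=0.5):
        self.entries = deque(maxlen=window)  # oldest entries drop out
        self.max_jump = max_jump             # assumed out-of-bounds threshold

    def _trend(self, state):
        # Placeholder for the recurrent predictor: average per-step change
        # between the oldest state in the window and the new state.
        history = [e["state"] for e in self.entries] + [list(state)]
        if len(history) < 2:
            return [0.0] * len(state)
        steps = len(history) - 1
        return [(last - first) / steps
                for first, last in zip(history[0], history[-1])]

    def write(self, t, state, confidence):
        suppress = False
        if self.entries:
            prev = self.entries[-1]["state"]
            # Out-of-bounds check on the difference between adjacent entries.
            suppress = max(abs(a - b) for a, b in zip(state, prev)) > self.max_jump
        entry = {"t": t, "state": list(state), "confidence": confidence,
                 "trend": self._trend(state), "suppress": suppress}
        self.entries.append(entry)
        return entry
```

Because the deque has a fixed `maxlen`, writing a new entry automatically evicts the oldest one once the window is full.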

[0049] Preferably, after the state entries are formed, two types of reading methods are provided for subsequent inter-layer calls. The first is batch retrieval by time window, which allows the dialogue intent understanding module to read continuous state trajectories when constructing dialogue history vectors; the second is single-point retrieval by the latest moment, which allows the language model regulator to generate attention adjustment weights in real time. The confidence flags in the state entries are also referenced by the control link to limit the upper limit of action and speech intensity, thereby suppressing overreaction when data is uncertain.

[0050] Finally, the emotion assessment model and the emotion evolution predictor are registered as a common input interface between the upper and lower layers. The upper-layer language model regulator reads the emotion state vector and trend estimate, constructs an emotion regulation vector, and participates in the adjustment of attention parameters; the lower-layer execution control reads the confidence marker before generating action parameters and speech parameters, deciding whether to reduce intensity or delay output. This connection ensures that the product of step S101 is continuously used in subsequent steps, forming a stable link from multimodal perception to emotion-driven control.

[0051] Step S102: Input the emotional state vector and the output of the emotional evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotional enhancement language model. Perform feature fusion on the user input information and the emotional state vector to obtain an emotional perception input vector. Input the emotional perception input vector into the dialogue intent understanding module to generate a dialogue history vector. Generate an interaction strategy vector based on the dialogue history vector and the emotional state vector. Adjust the emotional intensity of the interaction strategy vector to obtain an emotional expression control vector.

[0052] First, the emotion state vector output from step S101 and the trend estimate from the emotion evolution predictor are read and concatenated at the same time index to form an emotion regulation vector. Based on this emotion regulation vector, an input batch for the language model regulator is constructed, organized into a fixed-length sequence according to window order, and the weights of entries with low confidence labels are reduced. Subsequently, a parameter mapping unit is inserted before the attention layer to map the emotion components to scaling factors and bias terms of multi-head attention, outputting attention adjustment parameters that can be directly read by the pre-trained language model, thus obtaining the emotion-enhanced language model.

[0053] Next, the user input information is transcribed into a text sequence, which is segmented into a word sequence and positionally encoded to form a language-side basic vector. The emotion state vector and the language-side basic vector are aligned by time index, and the emotion component is used as an additional key-value encoding on the query side in multi-head attention to generate a fusion feature matrix. To suppress extreme bias caused by abrupt emotion changes, the aforementioned confidence marker is used during the fusion process, and an upper limit gate is set on the attention weights to prevent a small number of abnormal samples from dominating the output.
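
One way to read the upper limit gate in [0053] is as a cap on each normalized attention weight, with the excess redistributed to the remaining positions. The sketch below is an assumption about how such a gate could behave; the function name `cap_attention` and the redistribution rule are not taken from the application, and the input is assumed to already sum to 1.

```python
def cap_attention(weights, cap):
    """Limit every attention weight to `cap` and keep the total at 1.

    Weights above the cap are clipped; the removed mass is shared among
    the uncapped positions in proportion to their current weights.
    """
    n = len(weights)
    if cap * n < 1.0:
        raise ValueError("cap is too small for the weights to sum to 1")
    w = list(weights)
    capped = [False] * n
    while True:
        over = [i for i in range(n) if not capped[i] and w[i] > cap]
        if not over:
            return w
        for i in over:
            w[i], capped[i] = cap, True
        free = [i for i in range(n) if not capped[i]]
        if not free:
            return w
        target = 1.0 - cap * sum(capped)   # mass left for uncapped positions
        free_mass = sum(w[i] for i in free)
        for i in free:
            w[i] = w[i] * target / free_mass if free_mass > 0 else target / len(free)
```

A lower `cap` could be chosen when the confidence marker is low, so that uncertain emotion entries cannot dominate the fused output.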

[0054] The fusion feature matrix is then input into the dialogue intent understanding module, also known as the intent understander. The intent understander consists of a bidirectional encoder and a classification head; the former extracts cross-sentence dependencies, while the latter provides the distribution over the intent category space. At the output, the model retains intermediate states aligned to the input sequence and aggregates the encoded states from several consecutive steps in chronological order into a dialogue history vector. This dialogue history vector is registered together with its source time index to provide context for subsequent policy generation.
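
The aggregation of consecutive encoded states into a dialogue history vector is not specified further in [0054]. As one possible reading, the sketch below uses a recency-weighted mean over the last `k` states; the function name, `k`, and the `decay` factor are all illustrative assumptions.

```python
def dialogue_history_vector(encoded_states, k=3, decay=0.5):
    """Aggregate the last k encoded states with recency weighting.

    encoded_states is a chronological list of equal-length vectors.
    The newest state has weight 1, the one before it `decay`, and so on.
    """
    if not encoded_states:
        raise ValueError("at least one encoded state is required")
    recent = encoded_states[-k:]
    weights = [decay ** (len(recent) - 1 - i) for i in range(len(recent))]
    total = sum(weights)
    dim = len(recent[0])
    return [sum(w * s[d] for w, s in zip(weights, recent)) / total
            for d in range(dim)]
```

A learned pooling layer could replace the fixed decay; the fixed form is used here only to keep the example self-contained.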

[0055] Based on the dialogue history vector, an interaction decision input set is assembled and concatenated with the emotion state vector to obtain a multimodal state vector. This vector is fed into a policy generation network, also known as an interaction policy generator, which outputs an interaction policy vector after hierarchical encoding. To keep the policy output consistent with the emotion, the interaction policy generator adds a constraint term to the loss that enforces consistency with the emotion component, and applies the aforementioned trend estimate during inference to impose a smoothing requirement on the response intensity of future segments, reducing oscillations during periods of rapid emotion change.

[0056] Based on the aforementioned interaction strategy vector, an emotion intensity modulation process is triggered. This process first decouples the strategy components according to emotion type, obtaining two sets: action-related components and language-related components. Then, it generates modulation coefficients based on the emotion components and confidence markers. To clarify the modulation rules, a one-time expression is defined:

[0057] Z = b1×C + b2×D - b3×G.

[0058] In the formula, Z is the intensity index used for adjustment; C is the weighted aggregation of the current intensity components from the emotion state vector; D is the aggregation of the short-term rise rate from the trend estimate; G is the uncertainty obtained by combining the alignment offset and the interpolation ratio; b1, b2, and b3 are non-negative weights, whose values are selected within a range verified offline. Z is used to calculate the component-level adjustment coefficients, which scale the two sets of policy components respectively to form the emotion expression control vector.
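
The intensity index can be sketched numerically as follows. The sketch assumes the uncertainty term G is subtracted, which matches the stated goal of suppressing overreaction when data is uncertain; that sign, the default weights, and the clamping of Z into a coefficient between 0 and 1 are assumptions rather than details from the application.

```python
def intensity_index(C, D, G, b1=0.6, b2=0.3, b3=0.4):
    """Z from current intensity C, short-term rise rate D, uncertainty G.

    The uncertainty term is subtracted here (assumed sign).
    """
    return b1 * C + b2 * D - b3 * G

def modulate(action_components, speech_components, Z, z_min=0.0, z_max=1.0):
    """Scale both policy component sets by a coefficient derived from Z."""
    k = min(max(Z, z_min), z_max)  # clamp Z into an assumed valid range
    return [k * a for a in action_components] + [k * s for s in speech_components]
```

For example, with C = 0.8, D = 0.2, and G = 0.5 under the default weights, Z is about 0.34, so both component sets are scaled to roughly a third of their original magnitude.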

[0059] After the emotion expression control vector is generated, the vector and time index are written back to the shared buffer, providing a reading entry point for the downstream control chain. The upstream emotion enhancement language model retains a snapshot of the parameters of this attention adjustment for easy reuse in subsequent rounds; the downstream action and speech generation read the control vector and confidence flags simultaneously before entering parsing and synthesis, and set the upper limits of action amplitude and speech prosody accordingly to ensure consistent execution of emotion constraints in subsequent steps.

[0060] Step S103: The emotion expression control vector is parsed into actuator driving parameters and speech synthesis parameters. Collision detection is performed on the actuator driving parameters to obtain a safe trajectory. The robot actuator is controlled to output actions according to the safe trajectory. The speech synthesis parameters are input into the speech prosody modulator to generate a speech waveform. Based on the speech waveform and the output action, the robot's collaborative emotion expression is realized.

[0061] First, the emotion expression control vector output in step S102 is read and accessed through the parsing process according to its time index. Based on preset parsing rules, the control vector is decomposed dimensionally into two sets: action-related components and speech-related components, retaining the confidence markers and trend estimation indices from upstream. The action-related components are mapped to a target set of end-effector pose, velocity, and compliance parameters; the speech-related components are mapped to a control set of phoneme duration, fundamental frequency envelope, and energy envelope. The two sets are then fed into two parallel pipelines, kinematic and acoustic, respectively, and are subsequently kept consistent with timestamps.

[0062] Next, the action-related components are input into a kinematic mapping network. Candidate joint configurations are first obtained through geometric inverse kinematics, and velocity and acceleration constraints are then applied in joint space to generate a sequence of driving parameters that has not yet been constrained by collision checking. To suppress velocity jumps caused by abrupt upstream emotion changes, this sequence is subjected to temporal interpolation and cubic spline smoothing, and weight labels corresponding to the confidence markers are backfilled at each sampling point. The aforementioned driving parameter sequence is output at a fixed period as input for collision detection.

[0063] Then, collision detection is performed on the driving parameter sequence. Specifically, the robot body and the environment model are checked for spatial interference under a unified occupancy representation. An obstacle avoidance candidate set is generated for segments with potential intrusion, and the corresponding time windows are recorded. Based on the candidate set, a trajectory cost is constructed that takes into account path length, joint change amplitude, and boundary violation penalties; a trajectory corrector is then invoked to replan within a local time window, resulting in a safe trajectory that satisfies geometric and dynamic constraints. The safe trajectory corresponds one-to-one with the original time index and includes an executable marker for the actuator to read.
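By way of example only, a trajectory cost combining the three factors named above might be computed as in the following sketch. The weights `w_len`, `w_change`, and `w_bound`, the use of Euclidean path length, and the linear boundary penalty are assumptions of this illustration, not features stated in the application.

```python
from typing import List, Sequence


def trajectory_cost(traj: List[Sequence[float]],
                    lower: Sequence[float],
                    upper: Sequence[float],
                    w_len: float = 1.0,
                    w_change: float = 0.5,
                    w_bound: float = 10.0) -> float:
    """Weighted sum of path length, largest joint change, and boundary violation."""
    path_len = 0.0
    max_change = 0.0
    for prev, cur in zip(traj, traj[1:]):
        deltas = [abs(c - p) for p, c in zip(prev, cur)]
        path_len += sum(d * d for d in deltas) ** 0.5   # joint-space segment length
        max_change = max(max_change, max(deltas))       # largest single-joint step
    violation = 0.0
    for point in traj:
        for q, lo, hi in zip(point, lower, upper):
            # Penalize only the amount by which a joint leaves [lo, hi].
            violation += max(0.0, lo - q) + max(0.0, q - hi)
    return w_len * path_len + w_change * max_change + w_bound * violation
```

A trajectory corrector could compare this cost across the obstacle avoidance candidates of one local time window and keep the cheapest candidate with zero violation.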

[0064] The established safe trajectory is then distributed to each actuator according to the control cycle. The servo side performs position or torque closed-loop control according to the joint drive command, and the compliance parameter acts as an external adjustment quantity on the impedance model to limit the amplitude of movement when the emotional intensity is high. To cope with deviations in sensor feedback during operation, the execution layer performs online fine-tuning of the safe trajectory but must not exceed the trajectory constraint boundary; once a risk of exceeding the boundary is detected, the movement intensity is reduced proportionally, and the event is recorded as an anomaly under the same time index for upstream reading in the next round.

[0065] Based on the aforementioned speech-related components, the speech pipeline is initiated. Phoneme duration, fundamental frequency envelope, and energy envelope are mapped to prosodic control vectors, which are then input into the speech prosody modulator to generate an acoustic parameter sequence. This sequence is fed into the acoustic generation network to output the spectrum, which is then synthesized into a speech waveform via a vocoder. To ensure consistency with the action side, the speech pipeline uses the same time index as the safe trajectory; when the upstream confidence marker is low, the prosody modulator sets an upper limit on the rate of change of fundamental frequency and energy to avoid excessive jitter when the emotion is uncertain.
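The confidence-dependent limit on the rate of change can be illustrated with the following minimal sketch. The two step caps, the confidence threshold, and the frame-wise clamping scheme are assumed values and choices for this example; the application only states that an upper limit is set when confidence is low.

```python
from typing import List


def limit_rate(values: List[float], confidence: float,
               max_step_high: float = 20.0, max_step_low: float = 5.0,
               conf_threshold: float = 0.5) -> List[float]:
    """Clamp the frame-to-frame change of an envelope (e.g. F0 or energy).

    A tighter cap is used when the upstream confidence is below the threshold.
    """
    if not values:
        return []
    max_step = max_step_low if confidence < conf_threshold else max_step_high
    out = [values[0]]
    for v in values[1:]:
        # Move toward the requested value, but by at most max_step per frame.
        step = max(-max_step, min(max_step, v - out[-1]))
        out.append(out[-1] + step)
    return out
```

For instance, with low confidence a requested jump from 100 to 130 is followed only up to 105 in the next frame, which keeps the contour from jittering while the emotion estimate is uncertain.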

[0066] Then, timestamps are extracted from the speech waveform to form a speech event sequence, which is then synchronized with the key frame moments of the safe trajectory. The alignment strategy uses key postures of the action as anchor points, aligning stressed segments in the speech to the vicinity of the posture acceleration peak; if any segment has an alignment offset exceeding the threshold condition, the tempo is adjusted on the speech side first so that the safety boundary of the action is maintained. After alignment, a cooperative control command is generated, including the actuator target, speech playback command, and synchronization marker.

[0067] Under the coordinated control command, the robot actuator and the voice player operate on the same timeline. During execution, the controller continuously monitors tactile and positional feedback, which, together with the voice amplitude envelope, participate in a synchronization check. If a lag in actual movement is detected, the voice playback rate is reduced within a limited range, and the correction ratio is recorded. This ratio, along with an anomaly flag, is written back to the shared buffer for the language model regulator in step S102 to read during the next round of attention parameter generation.
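As one possible illustration of the limited-range rate correction described above, the following sketch stretches a speech segment so that it absorbs a measured motion lag. The lower bound `min_rate` and the definition of the correction ratio as `1 - rate` are assumptions of this example.

```python
from typing import Tuple


def corrected_playback_rate(motion_lag_s: float, segment_s: float,
                            min_rate: float = 0.85) -> Tuple[float, float]:
    """Return (playback_rate, correction_ratio) for one speech segment.

    The rate is lowered so the segment lasts `segment_s + motion_lag_s`
    seconds, but never below `min_rate`.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    if motion_lag_s <= 0:
        return 1.0, 0.0                      # motion is not lagging: no correction
    ideal = segment_s / (segment_s + motion_lag_s)
    rate = max(min_rate, ideal)              # stay within the permitted range
    return rate, 1.0 - rate
```

The returned correction ratio is the quantity that would be written back to the shared buffer together with the anomaly flag.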

[0068] Finally, the execution results of the collaborative emotion expression are packaged into receipt entries by time index, including the executed safety trajectory segment, the corresponding speech waveform segment, and synchronization error statistics. These receipt entries are sent to the memory cache pool of the emotion evolution predictor to update the short-term state trend, and are read by the action parsing unit to revise the speed and compliance parameter ranges for the next round. Thus, the product and feedback of step S103 form a closed loop upstream and downstream, enabling subsequent rounds to output stably under existing emotion constraints.

[0069] As described above, the emotion-aware robot control method based on the LLVM core provided in this application can effectively perceive emotions through multimodal features and neural networks. It constructs a language processing mechanism, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. It introduces expression control, ensuring the coordination of emotional expression through action planning and speech synthesis. This method effectively addresses the shortcomings of traditional technologies in emotion recognition, language processing, and emotional expression, providing technical support for emotion-aware robots.

[0070] In one embodiment of the emotion-aware robot control method based on the LLVM core of this application, it may further include the following:

[0071] Step S201: Perform edge detection and region segmentation on the image data stream to obtain image feature regions, extract local descriptive operators from the image feature regions to generate an image feature description set, convert the speech data stream into a spectrogram sequence and perform time-frequency analysis to obtain an acoustic feature description set, perform feature vector quantization on the image feature description set and the acoustic feature description set to generate a multimodal feature coding matrix, and construct a cross-modal alignment network based on the multimodal feature coding matrix to obtain an alignment feature mapper;

[0072] Step S202: Input the multimodal feature encoding matrix into the alignment feature mapper for temporal alignment processing to obtain a standardized feature sequence. Perform principal component analysis on the standardized feature sequence according to the preset dimensionality reduction rule to generate a feature projection matrix. Train a feature dimensionality reduction network based on the feature projection matrix to obtain an emotion feature mapper. Input the real-time input standardized feature sequence into the emotion feature mapper to generate a low-dimensional representation vector.

[0073] First, the image data stream is input and preprocessed at a fixed step size. After candidate contours are obtained through edge detection, region segmentation is performed, outputting image feature regions covering highly correlated areas such as the face, hands, and upper limbs. Based on these image feature regions, three types of local descriptors are extracted (keypoint locations, directional gradients, and texture histograms), forming an image feature description set. Frame-level occlusion and motion blur are then attached to the corresponding description entries as tags. Simultaneously, the speech data stream is segmented into continuous subframes and converted into a spectrogram sequence. Time-frequency analysis is performed using short-time windows and bandpass filtering to extract components such as fundamental frequency, formants, and energy dynamics, forming an acoustic feature description set. The time-domain labels for silent segments and abrupt transition segments are also recorded.

[0074] Feature vector quantization is performed based on the aforementioned image feature description set and acoustic feature description set. Specifically, an updatable codebook is used to perform vector quantization on both types of descriptions, generating index and residual pairs. The quantized vectors from both paths are then concatenated into a segment code of uniform length under the same time index. All segment codes are stacked in chronological order to form a multimodal feature coding matrix. The rows of the matrix correspond to time segments, and the columns consist of both image quantization vectors and acoustic quantization vectors. Three confidence marker columns—occlusion, silence, and abrupt change—are retained to provide a basis for subsequent alignment and weighting.
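For illustration, the nearest-codeword quantization and segment code construction described above can be sketched as follows. The fixed (non-updating) codebooks, the squared Euclidean distance, and the function names are assumptions of this example; codebook updating is omitted for brevity.

```python
from typing import List, Sequence, Tuple


def quantize(vec: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index of the nearest codeword, residual = vec - codeword)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))
    residual = [x - c for x, c in zip(vec, codebook[idx])]
    return idx, residual


def segment_code(image_vec, audio_vec, image_cb, audio_cb, flags) -> List[float]:
    """Concatenate both quantized vectors and the three confidence flags
    (occlusion, silence, abrupt change) into one fixed-length row."""
    i_idx, _ = quantize(image_vec, image_cb)
    a_idx, _ = quantize(audio_vec, audio_cb)
    return list(image_cb[i_idx]) + list(audio_cb[a_idx]) + list(flags)
```

Stacking the rows returned by `segment_code` in time order yields a matrix whose rows correspond to time segments, as in the multimodal feature coding matrix described above.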

[0075] Next, a cross-modal alignment network, also known as a cross-modal aligner, is constructed based on the multimodal feature encoding matrix. The cross-modal aligner uses dual-channel temporal encoding as its backbone, processing the image and acoustic channels separately. A learnable delay offset encoding is then introduced in the fusion layer to explicitly model the relative time delays across modalities. During the training phase, the synchronicity of similar events is used as a weak supervision signal to minimize the temporal differences in cross-channel representations, and the weights of samples with lower confidence labels are reduced. The trained cross-modal aligner acts as an alignment feature mapper, providing the transformation from fragment encoding to aligned representation.

[0076] The multimodal feature encoding matrix is then input into the alignment feature mapper, which outputs an aligned frame-level representation and resamples it with a fixed step size to obtain a standardized feature sequence. This sequence retains the time index and alignment offset, and adds an interpolation ratio to the interpolated segments for subsequent dimensionality reduction and sample weighting during the training phase. To avoid the amplification effect of outlier segments, the sequence undergoes amplitude truncation and mean backfilling before output, ensuring that all dimensions fall within a uniform scale range.

[0077] A dimensionality reduction process is performed based on the standardized feature sequence. First, principal component analysis is conducted according to a preset dimensionality reduction rule to obtain an energy-dominant feature projection matrix, and the variance contribution ratio and source label of each principal component are recorded. Then, using the feature projection matrix as initialization, a feature dimensionality reduction network, also known as an emotion feature mapper, is trained. This network performs secondary compression of the principal component space using shallow nonlinear units, learning a nonlinear structure coupled across modalities. During training, gradient weights are reduced for samples with large alignment offsets to minimize the interference of time lag on the low-dimensional representation. After the network converges, the parameters are fixed, and it serves as a mapper for online inference.

[0078] Based on this, the aforementioned emotion feature mapper transforms the real-time input standardized feature sequence frame by frame into a low-dimensional representation vector, which is then output along with the time index. To ensure the robustness and controllability of the dimensionality reduction process, this embodiment introduces a single objective, used only during the training period: M = c1×P + c2×S.

[0079] In the formula, M represents the overall training objective; P is a metric based on reconstruction error, evaluating the consistency between projection and inverse projection; S is a regularization term based on temporal smoothing, constraining the rate of change of low-dimensional vectors in adjacent frames; c1 and c2 are non-negative weights, selected based on validation set performance. M is minimized during the dimensionality reduction training phase, and the output mapping parameters remain unchanged during the inference phase. The low-dimensional representation vector is then sent to the emotion recognition neural network as input, participating in the training and inference of the emotion assessment model, and continuing to serve as a direct data source for generating emotion state vectors in subsequent steps.
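A minimal numerical sketch of the objective M = c1×P + c2×S is given below. Taking P as the mean squared reconstruction error and S as the mean squared difference between adjacent low-dimensional vectors is an assumption of this example; the application only characterizes P and S qualitatively.

```python
from typing import List, Sequence


def objective_m(frames: List[Sequence[float]],
                reconstructed: List[Sequence[float]],
                low_dim: List[Sequence[float]],
                c1: float = 1.0, c2: float = 0.1) -> float:
    """Compute M = c1 * P + c2 * S for one sequence.

    P: mean squared error between each frame and its inverse projection.
    S: mean squared change between low-dimensional vectors of adjacent frames.
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    p = sum(mse(f, r) for f, r in zip(frames, reconstructed)) / len(frames)
    pairs = list(zip(low_dim, low_dim[1:]))
    s = sum(mse(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    return c1 * p + c2 * s
```

Raising c2 relative to c1 favors smoother low-dimensional trajectories at the expense of reconstruction fidelity, which is the trade-off the two weights are meant to control.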

[0080] In one embodiment of the emotion-aware robot control method based on the LLVM core of this application, it may further include the following:

[0081] Step S301: Divide the low-dimensional representation vector into training sequence and validation sequence according to the temporal segmentation rule, perform data augmentation processing on the training sequence to generate training sample set, construct a multilayer perceptron network structure based on the training sample set to obtain an emotion recognition model prototype, perform iterative optimization training on the emotion recognition model prototype according to the preset loss function to obtain an emotion evaluation model, and input the validation sequence into the emotion evaluation model for cross-validation to obtain the model evaluation index.

[0082] Step S302: Input the real-time low-dimensional representation vector into the emotion assessment model for forward computation to obtain the emotion state vector, perform temporal sliding sampling on the emotion state vector to generate an emotion state sequence, construct a recurrent neural network based on the emotion state sequence to obtain an emotion evolution predictor, and deploy the emotion evolution predictor as a memory cache pool for real-time state maintenance and prediction updates.

[0083] First, after the low-dimensional representation vector output in step S202 arrives at the training end, a temporal segmentation rule is set according to the acquisition time axis and event boundaries to divide continuous segments into training sequences and validation sequences. During segmentation, the preceding alignment offset and interpolation ratio are read, and abnormal segments are placed in the low-confidence partition and their proportion on the validation side is limited to avoid biased evaluation. Subsequently, data augmentation is performed on the training sequences, including three types of operations: time-shift perturbation, amplitude scaling, and noise injection. The augmented segments inherit the original time index and confidence markers to form a training sample set, which serves as the input to the prototype of the emotion recognition model.
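The temporal segmentation and the three augmentation operations can be illustrated as follows. The 80/20 split, the shift range, the scaling range, and the noise level are assumed values for this sketch, and the handling of low-confidence partitions is omitted.

```python
import random
from typing import List, Tuple


def temporal_split(frames: List[list], train_fraction: float = 0.8) -> Tuple[list, list]:
    """Split by time order: earlier frames for training, later ones for validation."""
    cut = int(len(frames) * train_fraction)
    return frames[:cut], frames[cut:]


def augment(seq: List[list], rng: random.Random,
            max_shift: int = 2, scale_range=(0.9, 1.1),
            noise_std: float = 0.01) -> List[list]:
    """Apply time-shift perturbation, amplitude scaling, and noise injection."""
    n = len(seq)
    shift = rng.randint(-max_shift, max_shift)
    # Shift by index with edge clamping so the sequence length is preserved.
    shifted = [seq[min(n - 1, max(0, i + shift))] for i in range(n)]
    scale = rng.uniform(*scale_range)
    return [[x * scale + rng.gauss(0.0, noise_std) for x in frame]
            for frame in shifted]
```

Splitting along the time axis rather than at random keeps validation frames strictly later than training frames, which avoids leakage between adjacent, highly correlated frames.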

[0084] Next, a multilayer perceptron network structure, referred to as the emotion recognition model prototype, is constructed based on the training sample set. This structure takes time slices of low-dimensional representation vectors as input, employs frame-by-frame forward propagation, and provides two output heads: an emotion category head for classification and a confidence regression head for adaptive adjustment relative to the input confidence. During training, samples are organized in batches, and temporal consistency is maintained within each batch to facilitate subsequent concatenation with the state sequence. To avoid overfitting, dropout is applied to the hidden layers, and norm constraints are imposed on the weights.

[0085] Then, the emotion recognition model prototype is iteratively optimized according to a preset loss function to obtain the emotion evaluation model. To clearly define the training objective and robustness terms, this embodiment uses a single objective function:

[0086] Y = d1×L + d2×K,

[0087] Where Y is the overall training objective; L is the cross-entropy term based on the labeled data and classification output; K is the bias constraint term based on confidence regression and input confidence labels; d1 and d2 are non-negative weights, selected based on validation performance. Y is minimized until convergence, and the converged parameters are fixed to produce the emotion evaluation model for subsequent forward inference. This formula is only defined and used in this section; subsequent references will refer to it as "the result of minimizing Y".
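For a single frame, the objective Y = d1×L + d2×K may be evaluated as in the sketch below. Modeling K as the squared gap between the regressed confidence and the input confidence label is an assumption of this example; the application does not give the exact form of the bias constraint.

```python
import math
from typing import Sequence


def objective_y(probs: Sequence[float], label: int,
                predicted_conf: float, input_conf: float,
                d1: float = 1.0, d2: float = 0.5) -> float:
    """Compute Y = d1 * L + d2 * K for one frame.

    L: cross-entropy of the labeled class under the classification output.
    K: squared deviation of the regressed confidence from the input confidence.
    """
    l_term = -math.log(max(probs[label], 1e-12))   # guard against log(0)
    k_term = (predicted_conf - input_conf) ** 2
    return d1 * l_term + d2 * k_term
```

In a full training loop this value would be averaged over a batch and minimized with respect to the network parameters.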

[0088] After the emotion evaluation model is trained, the validation sequence is input into the model to perform cross-validation, outputting three types of model evaluation metrics: accuracy, confusion level, and robustness statistics. These metrics use the same time index as the segmentation rules, making it possible to trace failures back to specific segments. If the robustness statistics are weak in low-confidence partitions, the process returns to the previous step to update the augmentation strategy by increasing the types of perturbations applied to that partition, and training is repeated until the evaluation metrics stabilize.

[0089] Based on the aforementioned training convergence results, the process proceeds to the online phase. Real-time low-dimensional representation vectors are input into the sentiment assessment model in arrival order for forward computation, yielding a sentiment state vector carrying a time index, and the model-side confidence is output simultaneously. When an input segment contains low-confidence markers from alignment offsets and interpolation, the model-side confidence is suppressed, serving as the gating basis for the downstream sampling strategy. This state vector, as a time-level output in a unified format, is provided to both sliding sampling and memory maintenance branches.

[0090] Temporal sliding sampling is performed based on the sentiment state vector. The sliding window advances with a fixed step size, and within each window, a statistical summary of the state components is calculated to form a sentiment state sequence. This sequence retains the start and end indices and the average confidence value of each window for subsequent temporal modeling. If the average confidence value of consecutive windows is lower than a set range, a "weak evidence" label is added to the sequence to limit its impact on the update of temporal parameters.
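The sliding sampling and the "weak evidence" labeling can be sketched as follows. The window size, step, and confidence bound are assumed values, and labeling a window only when it and its immediate predecessor both fall below the bound is one possible reading of "consecutive windows".

```python
from statistics import mean
from typing import List, Sequence


def sliding_windows(states: List[Sequence[float]], confidences: List[float],
                    size: int = 4, step: int = 2,
                    weak_below: float = 0.4) -> List[dict]:
    """Summarize emotion state vectors per window and flag weak evidence."""
    out = []
    prev_low = False
    for start in range(0, len(states) - size + 1, step):
        end = start + size
        # Per-component mean over the window serves as the statistical summary.
        summary = [mean(col) for col in zip(*states[start:end])]
        avg_conf = mean(confidences[start:end])
        low = avg_conf < weak_below
        out.append({"start": start, "end": end - 1, "summary": summary,
                    "avg_conf": avg_conf, "weak_evidence": low and prev_low})
        prev_low = low
    return out
```

Each returned entry keeps the start and end indices and the average confidence of its window, so a downstream recurrent model can reduce its update gain for flagged entries.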

[0091] Then, using the emotional state sequence as input, a recurrent neural network, referred to as the emotional evolution predictor, is constructed. This predictor reads the state context of adjacent windows, outputs an estimate of the emotional trend for the next window, and maintains short-term memory internally. The memory unit reduces the state transition gain for windows marked with "weak evidence," avoiding erroneous cumulative drift when evidence is insufficient. The predictor is trained using historical sequences as input and the true states of delayed windows as targets, with parameters optimized by unrolling the network through time.

[0092] After the sentiment evolution predictor converges, it is deployed as a memory cache pool. The cache pool maintains an active window queue, storing the latest sentiment state vectors, trend estimates, and confidence records. When a new entry is written, the oldest entry enters the history area and triggers a consistency check. If the difference between two consecutive active entries exceeds a threshold condition, a suppression flag is generated in the active area for the upper-layer language model regulator and the lower-layer control link to read. The cache pool supports two types of interfaces: batch queries by time index and single-point queries by the latest time.
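As a simplified illustration of the cache pool interfaces described above, consider the following sketch. The class name, the entry fields, the maximum-component difference used for the consistency check, and running that check on every write are assumptions of this example.

```python
from collections import deque
from typing import List, Optional, Sequence


class EmotionCachePool:
    """Active window queue plus history area with a simple consistency check."""

    def __init__(self, active_size: int = 3, jump_threshold: float = 0.5):
        self.active = deque()
        self.history: List[dict] = []
        self.active_size = active_size
        self.jump_threshold = jump_threshold
        self.suppressed = False

    def write(self, time_index: int, state: Sequence[float],
              trend: float, confidence: float) -> None:
        entry = {"t": time_index, "state": list(state),
                 "trend": trend, "conf": confidence}
        if self.active:
            prev = self.active[-1]["state"]
            jump = max(abs(a - b) for a, b in zip(prev, entry["state"]))
            # Raise the suppression flag when consecutive entries differ too much.
            self.suppressed = jump > self.jump_threshold
        self.active.append(entry)
        if len(self.active) > self.active_size:
            self.history.append(self.active.popleft())  # oldest entry to history

    def latest(self) -> Optional[dict]:
        """Single-point query by the latest time."""
        return self.active[-1] if self.active else None

    def query(self, t_start: int, t_end: int) -> List[dict]:
        """Batch query by time index over history and active entries."""
        return [e for e in self.history + list(self.active)
                if t_start <= e["t"] <= t_end]
```

The `suppressed` attribute stands in for the suppression flag that the language model regulator and the control link would read.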

[0093] Finally, the emotional state vector and the trend estimate of the emotional evolution predictor are synchronously exposed as an upstream interface for direct reading by the language model regulator in step S102. Simultaneously, the confidence records and suppression markers in the cache pool are invoked by the action parsing and speech prosody regulator in downstream step S103 to limit the upper limits of amplitude, tempo, and speed. Thus, the outputs of steps S301 to S302 form a traceable temporal dependency in subsequent language generation and execution control stages.

[0094] In one embodiment of the emotion-aware robot control method based on the LLVM core of this application, it may further include the following:

[0095] Step S401: The emotional state vector and the output of the emotional evolution predictor are concatenated to obtain the emotional regulation vector. The emotional regulation vector is normalized to generate a weight distribution matrix. An attention regulation network is constructed based on the weight distribution matrix to obtain a parameter regulation model. The output of the parameter regulation model is remapped with the pre-trained language model to obtain an emotional enhancement language model.

[0096] Step S402: Convert the user input information into a text sequence and perform word segmentation to obtain a word sequence. Perform multi-head attention calculation on the word sequence and the sentiment state vector to generate a fusion feature matrix. Input the fusion feature matrix into a bidirectional encoder to obtain a sentiment perception input vector. Based on the sentiment perception input vector, construct an intent recognition classifier to generate a dialogue history vector.

[0097] First, the sentiment state vector output from step S302 and the trend estimate from the sentiment evolution predictor are read and concatenated under the same time index to form a sentiment regulation vector. To avoid weight bias caused by different units, the sentiment regulation vector is normalized according to component range and historical distribution, outputting a weight distribution matrix, while retaining confidence records and suppression labels from upstream. The weight distribution matrix is cached in chronological order to provide direct input for subsequent fine-grained adjustment by head and by level in the attention layer.

[0098] Next, an attention modulator network, also known as an attention modulator, is constructed based on the weight distribution matrix. The attention modulator uses a hierarchical gating structure to generate scaling factors and biases for multi-head attention. The input is the weight distribution matrix and suppression label at the current time index, and the output is the scaling coefficient and bias vector for each attention head. To suppress excessive amplification caused by aberrant emotional shifts, the attention modulator imposes an upper limit constraint on the components with suppression labels during the generation phase, and increases the constraint strength as the upstream confidence record decreases, ensuring that the modulation amplitude is consistent with the strength of the evidence.

[0099] Then, the output of the attention modulator is remapped with the parameters of the pre-trained language model. Specifically, mapping units are inserted before the self-attention and cross-attention layers, and scaling coefficients and bias vectors are injected layer by layer. A slight offset is made to the target statistics of the layer normalization, so that the scope of attention is guided by the sentiment modulation vector without disrupting the original semantic structure. After injection, a sentiment-enhanced language model is obtained, and a modulation snapshot is retained in the current conversation context for reuse and rollback in subsequent rounds.

[0100] Based on the aforementioned sentiment-enhanced language model, the input-side fusion process begins. User input is transcribed into a text sequence and segmented into words, generating a word sequence and positional encoding. To maintain temporal consistency with the sentiment side, sentiment state vectors are extracted at the same time index and their correlation with the word sequence is jointly calculated using multi-head attention to obtain a fusion feature matrix. For time points with "suppression markers," gating is applied before attention weight normalization, ensuring that sentiment components only function within local contexts and avoiding unnecessary impact on long-distance dependencies. The fusion results are organized according to sequence step size, providing stable input for subsequent encoding and classification.

[0101] The fused feature matrix is then input into a bidirectional encoder, also referred to as the bidirectional interpreter. The bidirectional interpreter reads the preceding and following context and outputs step-level hidden representations. It aggregates the representations from several consecutive steps according to conversation rounds to form an emotion-aware input vector. This vector carries both a time index and a source marker, indicating the emotion component referenced during fusion and the modulation snapshot number. If the upstream confidence record is low, the weight of the corresponding step is reduced during aggregation to maintain robustness in intent determination.

[0102] Then, using the aforementioned emotion perception input vector as input, an intent recognition classifier is constructed. The intent classifier outputs two results: intent category distribution and key trigger representation. It then writes back the high-weighted trigger representations to the explanation field under the current time index. To clarify the numerical relationship between regulation and classification in this section, a one-time expression is introduced here:

[0103] R = e1×S + e2×T − e3×U.

[0104] In the formula, R is the internal index used to control the intensity of sentiment injection; S is the attention aggregation of sentiment-related channels in the fusion feature matrix; T is the context consistency measure of the bidirectional interpreter output; U is the uncertainty composed of the upstream suppression label and low-confidence record; e1, e2, and e3 are non-negative weights, the range of which is set in the offline verification stage. R is used to adjust the activation threshold of the last layer before classification, so that the sentiment signal is enhanced when the evidence is sufficient and the context is consistent, and weakened when the evidence is insufficient.
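The role of R can be illustrated with the following sketch, which assumes that the uncertainty term U enters with a negative sign, consistent with the statement that the sentiment signal is weakened when evidence is insufficient. The weight values, the linear threshold mapping, and the clamping range are likewise assumptions of this example.

```python
def injection_index(s: float, t: float, u: float,
                    e1: float = 0.5, e2: float = 0.3, e3: float = 0.4) -> float:
    """R = e1*S + e2*T - e3*U (the minus sign on U is an assumption)."""
    return e1 * s + e2 * t - e3 * u


def adjusted_threshold(base: float, r: float, gain: float = 0.1,
                       lo: float = 0.1, hi: float = 0.9) -> float:
    """Lower the activation threshold as R grows, clamped to [lo, hi].

    A lower threshold lets the sentiment signal pass more easily; a higher
    one weakens it when R is small or negative.
    """
    return max(lo, min(hi, base - gain * r))
```

With these assumed weights, raising U from 0 to 1 at S = T = 1 reduces R from 0.8 to 0.4, and the threshold rises accordingly.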

[0105] Finally, the intent classifier outputs a dialogue history vector at the end of each conversation round. This vector consists of three parts (intent category distribution, key trigger representation, and time index) together with a statistical summary of the current round's R. The vector is sent to the policy generation stage as the upstream input for producing the interaction policy vector; at the same time, its time index and modulation snapshot number are written to a shared buffer for the sentiment-enhanced language model to read in the next round of inference, closing the loop from sentiment regulation through semantic understanding to historical accumulation.

[0106] In one embodiment of the emotion-aware robot control method based on the LLVM core of this application, it may further include the following:

[0107] Step S501: Concatenate the dialogue history vector and the emotion state vector to obtain a multimodal state vector. Perform hierarchical encoding on the multimodal state vector to generate a state representation matrix. Construct a policy generation network based on the state representation matrix to obtain an interaction decision model. Optimize the interaction decision model using Monte Carlo tree search to obtain an interaction policy vector.

[0108] Step S502: Decouple and decompose the interaction strategy vector according to the emotion type to obtain the intensity parameter set, construct an adaptive regulator based on the intensity parameter set to generate the adjustment coefficient matrix, and concatenate the adjustment coefficient matrix with the interaction strategy vector to obtain the emotion expression control vector.

[0109] First, the dialogue history vector output in step S402 and the sentiment state vector output in step S302 are read and concatenated under the same time index to form a multimodal state vector. To reduce bias caused by different units, the multimodal state vector is normalized and detrended according to channel range and historical distribution, and the source confidence and suppression labels are retained as parallel auxiliary channels. The multimodal state vectors processed above are packaged according to the conversation rounds and used as unified inputs on the policy side.

[0110] Next, hierarchical encoding is performed on the multimodal state vector. The first encoding layer extracts local context, capturing short-range dependencies between sentiment intensity and phrase triggering; the second encoding layer aggregates across time, establishing an intention evolution representation at the dialogue turn level; the third encoding layer fuses auxiliary channels, transforming source confidence and suppression labels into coefficient fields of suppression gates to suppress the contribution of anomalous mutations. The outputs of the three layers are concatenated along the channel dimension and subjected to dimensionality reduction mapping to generate a state representation matrix. This matrix records the policy-related representation and corresponding time index at each time step, which is used for subsequent reading by the decision network.

[0111] Then, a policy generation network, referred to as the interaction decision model, is constructed based on the state representation matrix. This model employs two output heads to generate response action candidates and gesture/phrase selections, respectively, and internally shares a value estimation branch for evaluating long-term gains. During training, historical task completion rates and user feedback ratings are used as weak supervision signals, while a contrastive constraint based on sentiment consistency is introduced to ensure the policy distribution remains separable under different sentiment components. The model's forward pass outputs policy probabilities and value assessments, providing initial values for search optimization.

[0112] Based on the aforementioned interactive decision-making model, a Monte Carlo tree-based search optimization is initiated. Using the current dialogue state as the root node, simulations are performed in the candidate action space. During node expansion, policy probabilities are read as priors, and value assessments are used as feedback gains. To reflect the constraint of sentiment on policy exploration, an exploration upper bound is applied to paths with suppression labels during the search process, and the sampling frequency of high-confidence sentiment components is increased. After the search, the visit counts are normalized by temperature, and the component with the highest expected reward is extracted to obtain the interaction policy vector. The key statistics and time index of the search trajectory are then output together.
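The temperature normalization of visit counts at the end of the search can be illustrated as follows. The rule used here, in which each probability is proportional to the visit count raised to the power 1/temperature, is a common convention in Monte Carlo tree search and is assumed for this sketch; the application does not state the exact formula.

```python
from typing import List, Sequence


def visit_policy(visit_counts: Sequence[float],
                 temperature: float = 1.0) -> List[float]:
    """Turn root visit counts into a distribution: pi_i ~ N_i ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    if total == 0:
        raise ValueError("at least one visit count must be positive")
    return [p / total for p in powered]


def best_action(visit_counts: Sequence[float], temperature: float = 1.0) -> int:
    """Index of the candidate with the largest normalized probability."""
    pi = visit_policy(visit_counts, temperature)
    return max(range(len(pi)), key=lambda i: pi[i])
```

A temperature below 1 sharpens the distribution toward the most visited candidate, while a temperature above 1 flattens it and keeps more exploration in the resulting policy vector.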

[0113] Based on the interaction strategy vector, a decoupling process is initiated. First, the strategy components are divided into action-oriented and language-oriented groups according to sentiment type. The action-oriented group focuses on gesture amplitude and velocity, while the language-oriented group focuses on tone, pauses, and stress. To avoid interference from cross-group coupling, the two groups share only the time index and source confidence; other parameters are processed independently. Then, the baseline intensity and rate of change of each group are extracted to form an intensity parameter set, and the receipt summary from the previous round is retained for comparison with the current components.

[0114] Then, an adaptive modulator is constructed based on the intensity parameter set. The adaptive modulator reads the current sentiment intensity, short-term rise rate, and source confidence, outputs a component-level modulation coefficient matrix, and sets a modulation upper limit for time points where suppression markers are present. To clarify the internal compositional relationships, a one-time expression is introduced here:

[0115] W = f1×A + f2×B − f3×C.

[0116] In the formula, W is the intensity score within the modulator; A is the weighted aggregation of the current emotional intensity components; B is the weighted aggregation of the intensity change rate; C is the combined penalty formed from the inverse measure of source confidence and the suppression label; f1, f2, and f3 are non-negative weights whose ranges are set during the offline verification stage. W is mapped to the component-level scaling coefficients, which then form the modulation coefficient matrix.
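As a minimal sketch of how the intensity score could be evaluated and mapped to a scaling coefficient, assuming the penalty term C is subtracted and that the mapping is a simple clamp: the default weights, the clamp bounds, and both function names are hypothetical and are not specified in the text.

```python
def intensity_score(a, b, c, f1=0.6, f2=0.3, f3=0.4):
    """W = f1*A + f2*B - f3*C with non-negative weights (assumed sign on C)."""
    if min(f1, f2, f3) < 0:
        raise ValueError("weights must be non-negative")
    return f1 * a + f2 * b - f3 * c


def scaling_coefficient(w, suppressed=False,
                        lower=0.0, upper=1.0, suppressed_upper=0.5):
    """Clamp W into a scaling range; suppressed time points get a lower cap.

    The clamp and its bounds are illustrative assumptions.
    """
    limit = suppressed_upper if suppressed else upper
    return max(lower, min(w, limit))
```

For example, a score of 0.8 passes through unchanged at an ordinary time point but is capped at 0.5 where a suppression marker is present.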

[0117] The modulation coefficient matrix is then applied to the interaction strategy vector component by component, and the scaled components are concatenated to obtain the emotion expression control vector. For the action-oriented components, the scaled amplitude and speed are indirectly constrained by the aforementioned W to avoid excessive posture jumps during uncertain periods; for the language-oriented components, the scaled tone intensity and rhythm change rate are limited to an executable range to maintain consistency with upstream emotion evidence. All scaled components are aligned by time index and backfilled with the current round's search statistics and modulation scores to form traceable control entries.

[0118] Finally, the emotion expression control vector, along with the time index, is written into a shared buffer, providing a direct read entry point for the action parsing and speech prosody pipelines in step S103. The action parsing unit reads the action components, generates driving parameters, and proceeds to collision detection, while the speech prosody modulator reads the language components, generates acoustic parameters, and synthesizes a speech waveform; both refer to the current round's modulation score to limit amplitude and rate of change. Thus, the outputs of steps S501 and S502 are closely linked to subsequent execution, realizing a continuous link from policy generation to multimodal expression.

[0119] In one embodiment of the emotion-aware robot control method based on the LLVM core of this application, it may further include the following:

[0120] Step S601: Decompose the emotion expression control vector into a motion parameter matrix according to preset parsing rules, construct a kinematic mapping network based on the motion parameter matrix to generate a joint space mapper, perform constraint optimization on the output of the joint space mapper to obtain actuator driving parameters, and perform interpolation smoothing on the actuator driving parameters to generate an initial trajectory sequence.

[0121] Step S602: Input the initial trajectory sequence into the collision detection module to perform spatial interference analysis to obtain an obstacle avoidance path set. Based on the obstacle avoidance path set, construct a trajectory optimizer to generate a safe trajectory. Perform inverse kinematics analysis on the safe trajectory according to the robot kinematics model to obtain joint drive commands. Control the actuator to complete the action output according to the joint drive commands.

[0122] First, the emotion expression control vector output from step S502 is read, and its dimensionality is decomposed at the same time index according to preset parsing rules to obtain a set of action-related components. This set is then rearranged into three channels: end-effector pose, velocity expectation, and compliance parameters, and organized into a motion parameter matrix. Source confidence and suppression labels are retained as auxiliary columns for subsequent constraint strength adjustment. The aforementioned motion parameter matrix serves as the unified input format for this kinematic pipeline segment.

[0123] Next, a kinematic mapping network, named the joint space mapper, is constructed based on the motion parameter matrix. This mapper takes the end-effector pose and velocity expectations as primary inputs, and compliance parameters and auxiliary columns as adjustment inputs. Through forward propagation, it outputs a sequence of candidate joint vectors and their corresponding feasibility scores. To improve the handling of mechanism redundancy, a multi-branch inverse kinematics unit is incorporated within the mapper, outputting several candidate solutions and sorting them according to their feasibility scores. The sorting results, along with the time index, are output for use in the next constraint optimization step.

[0124] The output of the joint space mapper is then input into the constraint optimization module. Constraint optimization selects several candidate solutions at each time step as initial points, establishes an objective function including joint position boundaries, velocity and acceleration limits, adjacent step smoothing terms, and compliance parameter upper limits, and solves for the actuator driving parameters. To ensure the constraint strength is consistent with the evidence, constraint optimization reads the source confidence and suppression flags in the auxiliary columns, increasing the weight of the smoothing terms at time steps with low confidence or suppression flags to restrict abrupt transitions. After optimization, the joint targets and torque upper limits are output in time order, forming a sequence of actuator driving parameters.

[0125] Based on the aforementioned actuator driving parameter sequence, interpolation and smoothing are performed before trajectory generation. Specifically, cubic spline interpolation is applied to the joint targets of adjacent time steps so that velocity and acceleration are continuous within each segment; for segments with suppression markers, the ease-in and ease-out duration during switching is increased to reduce the peak instantaneous acceleration. After interpolation, amplitude clipping is performed on the entire sequence to ensure that no joint exceeds its boundary, finally obtaining the initial trajectory sequence, which maintains a one-to-one correspondence with the original time index for the collision detection module to read.
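The interpolation and boundary-limiting steps for a single joint can be illustrated with a natural cubic spline followed by a clamp. This is a sketch under stated assumptions: the text does not specify the spline's end conditions (natural ends are assumed here), the function names are hypothetical, and the lengthened ease-in and ease-out for suppressed segments is not modelled.

```python
from bisect import bisect_right


def natural_spline_moments(t, y):
    """Second derivatives at the knots of a natural cubic spline."""
    n = len(t) - 1                      # number of segments
    h = [t[i + 1] - t[i] for i in range(n)]
    m = [0.0] * (n + 1)                 # natural ends: m[0] = m[n] = 0
    size = n - 1                        # number of interior knots
    if size < 1:
        return m
    diag = [2.0 * (h[k] + h[k + 1]) for k in range(size)]
    rhs = [6.0 * ((y[k + 2] - y[k + 1]) / h[k + 1] - (y[k + 1] - y[k]) / h[k])
           for k in range(size)]
    # Thomas algorithm; the off-diagonal entry between rows k-1 and k is h[k].
    for k in range(1, size):
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]
    m[size] = rhs[size - 1] / diag[size - 1]
    for k in range(size - 2, -1, -1):
        m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]
    return m


def sample_trajectory(t, y, times, lower, upper):
    """Evaluate the spline at the given times and clip to joint limits."""
    m = natural_spline_moments(t, y)
    out = []
    for x in times:
        i = min(max(bisect_right(t, x) - 1, 0), len(t) - 2)
        h = t[i + 1] - t[i]
        a, b = t[i + 1] - x, x - t[i]
        value = ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h)
                 + (y[i] / h - m[i] * h / 6.0) * a
                 + (y[i + 1] / h - m[i + 1] * h / 6.0) * b)
        out.append(min(max(value, lower), upper))
    return out
```

Note that a hard clamp flattens the curve wherever the limit is hit, so a practical system would more likely rescale the joint targets before interpolation when clipping occurs often.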

[0126] Next, the initial trajectory sequence is input into the collision detection module for spatial interference analysis. Collision detection performs a sample-by-sample comparison of the robot body, end effector, and known environment under a unified occupancy model. Local time windows are formed around potentially intruding sampling points, and an obstacle avoidance path set is generated. Each path candidate is accompanied by a cost term, which consists of the path length, the deviation from the initial trajectory, and a penalty for approaching the boundary. Simultaneously, source confidence is backfilled to adjust the cost weights, making low-confidence periods more biased towards conservative rewriting.
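The cost term attached to each candidate path can be sketched as a weighted sum in which low source confidence raises the weight of the boundary penalty. The specific rule used here (scaling the boundary weight by `2 - confidence`), the default weights, and the function names are assumptions made for illustration only.

```python
def path_cost(length, deviation, boundary_penalty, confidence,
              w_len=1.0, w_dev=1.0, w_bound=1.0):
    """Weighted candidate cost; lower confidence penalizes boundary proximity more."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Hypothetical rule: the boundary weight doubles as confidence drops to 0.
    bound_weight = w_bound * (2.0 - confidence)
    return w_len * length + w_dev * deviation + bound_weight * boundary_penalty


def cheapest_path(candidates, confidence):
    """candidates: dict name -> (length, deviation, boundary_penalty)."""
    return min(candidates,
               key=lambda name: path_cost(*candidates[name], confidence))
```

With one short path that skirts the boundary and one longer path that stays clear of it, the short path can win at full confidence while the longer one wins when confidence is low.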

[0127] Then, a trajectory optimizer, also known as a safe trajectory solver, is constructed based on the obstacle avoidance path set. The safe trajectory solver uses the initial trajectory as a reference and performs local replanning within the local time window that triggers the interference, minimizing the comprehensive cost term while adhering to joint dynamics constraints to generate a safe trajectory. To clarify the internal composition relationship, a one-time expression is introduced here:

[0128] N = g1×J + g2×K − g3×H.

[0129] In the formula, N is the internal evaluation quantity used for path selection; J is a weighted term for deviation from the initial trajectory, limiting excessive deviation; K is a weighted term for obstacle avoidance margin, encouraging a safe distance from obstacles; H is a risk compensation term obtained by combining source confidence and suppression labels, increasing conservatism when evidence is insufficient; g1, g2, and g3 are non-negative weights whose ranges are set offline. N is minimized when comparing candidate paths, and the path with the minimum value is written into the safe trajectory.

[0130] After the safe trajectory is determined, the inverse kinematics model is used to generate discrete joint drive commands. During the inverse kinematics phase, the target position, velocity, and necessary torque upper limit are calculated for each control cycle and aligned with the actuator's sampling cycle. To ensure execution consistency, a limit check is performed on the inverse kinematics results. If the limits are exceeded, the system reverts to a suboptimal path or reduces the velocity target, and a degradation marker is recorded at this time for later backtracking.
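The limit check and fallback can be sketched as a scan over ranked candidates. This is an illustrative assumption about the control flow: the candidate representation, the function name, and the final clamping branch used when no candidate passes are hypothetical and are not described in the text.

```python
def select_command(candidates, pos_min, pos_max, vel_max):
    """Pick the first candidate within limits.

    candidates: list of (position, velocity) tuples ranked best first.
    Returns (position, velocity, degraded), where degraded is the marker
    recorded whenever the best candidate could not be used unchanged.
    """
    for rank, (pos, vel) in enumerate(candidates):
        if pos_min <= pos <= pos_max and abs(vel) <= vel_max:
            return pos, vel, rank > 0
    # No candidate passes: reduce the velocity target of the best one and
    # clamp its position (the position clamp is an assumption).
    pos, vel = candidates[0]
    pos = min(max(pos, pos_min), pos_max)
    vel = min(max(vel, -vel_max), vel_max)
    return pos, vel, True
```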

[0131] Based on the aforementioned joint drive commands, the controller executes a position and torque closed loop within the servo circuit, using compliance parameters as external inputs to the impedance model to adjust the interaction stiffness and damping. During execution, feedback joint errors and contact signals are collected as online monitoring quantities; once continuous error accumulation is detected, the controller triggers micro-amplitude timing stretching to fine-tune the command phase for subsequent cycles, while ensuring that the safe trajectory boundary is not breached. Relevant adjustment ratios and degradation flags are synchronously written back to the shared buffer.

[0132] Finally, the executed safe trajectory segments and controller receipts are packaged by time index to form a synchronous record on the action side, which is then aligned with the voice timestamps on the speech side in step S103. The aligned collaborative control command drives the actuator to output actions in synchronization with voice playback, achieving posture and rhythm consistent with emotional expression. The safe trajectory and joint drive commands produced in this step are continuously read by the downstream state management module to update short-term trends and provide boundary references for the next round of control.

[0133] In one embodiment of the emotion-aware robot control method based on the LLVM core of this application, it may further include the following:

[0134] Step S701: Decompose the speech synthesis parameters according to the phoneme structure to obtain a prosodic feature set, perform emotion mapping transformation on the prosodic feature set to generate a prosodic control vector, construct an acoustic parameter generation network based on the prosodic control vector to obtain a speech synthesis model, and process the output of the speech synthesis model through a vocoder to obtain a speech waveform.

[0135] Step S702: Perform time-series analysis on the speech waveform to obtain a speech timestamp sequence, synchronize and align the speech timestamp sequence with the action execution sequence to generate a collaborative control command, and schedule the robot actuator and the speech player to complete multimodal emotional expression based on the collaborative control command.

[0136] First, the speech-related components in the emotion expression control vector generated in step S502 are read and mapped to a phoneme-level sequence according to a pre-defined phoneme dictionary. Phoneme boundary segmentation and initial duration estimation are performed on this sequence to form a prosodic feature set containing phoneme category, initial duration, fundamental frequency direction, and energy envelope. To ensure consistency with upstream evidence, conservative upper limits are set for the initial duration and energy change rate at time points with suppression markers and low-confidence records. These limits are then attached to the corresponding phoneme entries as markers for subsequent model reading.

[0137] Next, an emotion mapping transformation is performed based on the prosodic feature set. Specifically, the emotion state vectors at the same time index are read, and component-level gains and offsets are generated according to the mapping table of emotion dimension and phoneme category. These are then applied to the fundamental frequency and energy channels to output a prosodic control vector. To suppress jumps caused by abrupt changes in emotion, cross-boundary smoothing is performed on the fundamental frequency endpoints and energy zero-crossing points of adjacent phonemes, while retaining the smoothing intensity parameter. The prosodic control vector and smoothing parameter together serve as the control input on the acoustic side, and the upstream suppression and confidence markers are retained in the entries.
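A table-driven version of the emotion mapping transformation might look like the following sketch. The table layout (an F0 gain, an F0 offset, and an energy gain per emotion dimension and phoneme category), the dictionary keys, and the function name are assumptions; cross-boundary smoothing is omitted.

```python
def prosody_control(phonemes, emotion_dim, table, default=(1.0, 0.0, 1.0)):
    """Apply per-category gains and offsets to the F0 and energy channels.

    table maps (emotion_dim, phoneme category) to a tuple
    (f0_gain, f0_offset_hz, energy_gain); unknown pairs use the default,
    which leaves the phoneme unchanged.
    """
    controlled = []
    for p in phonemes:
        key = (emotion_dim, p["category"])
        f0_gain, f0_offset, energy_gain = table.get(key, default)
        controlled.append({
            **p,  # keep suppression and confidence markers untouched
            "f0": p["f0"] * f0_gain + f0_offset,
            "energy": p["energy"] * energy_gain,
        })
    return controlled
```

Because unmatched categories fall back to the identity mapping, upstream markers carried in each phoneme entry pass through unchanged.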

[0138] Then, an acoustic parameter generation network, called an acoustic generator, is constructed based on the prosodic control vector. The acoustic generator takes phoneme category, duration, fundamental frequency, and energy channel as its main inputs, and suppression markers and smoothing parameters as adjustment inputs. It outputs a sequence of acoustic parameters on a time-frequency raster, including formant trajectories and bandwidth estimates. During the training phase, paired corpus segments and target spectra are used as supervision signals. Gradient weights are reduced for samples with suppression markers to avoid overfitting in regions with insufficient evidence. During online inference, the acoustic generator outputs a parameter raster frame-by-frame, aligned with the time index.

[0139] After the acoustic parameter sequence is generated, a vocoder is invoked for waveform synthesis. The vocoder reads the formant trajectory, fundamental frequency curve, and energy envelope, outputs a speech waveform consistent with the control period, and writes the speech frame energy and period consistency as quality-side statistics back to a record with the same time index. If the detected period consistency is lower than the set range, the vocoder reduces the fundamental frequency swing and extends the duration of the phoneme tail, forming a soft degradation to avoid hoarseness and abruptness, while recording a degradation marker for reference during subsequent alignment.

[0140] Based on the aforementioned speech waveform, temporal analysis is performed to obtain a speech timestamp sequence. Specifically, this involves extracting event time points such as phoneme boundaries, stress peaks, pauses, and energy peaks from the waveform, merging them into a timestamp sequence, and establishing a one-to-one correspondence with the upstream time index. To address rhythm drift caused by degradation at the synthesis end, the offset from the target beat is calculated on the timestamp sequence, and the offset value, along with the degradation marker, is written into the same entry, providing a basis for cross-modal alignment.
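Merging the extracted events into one timestamp sequence and computing the offset from the target beat can be sketched as below. The event representation, the assumption that events and target beats correspond one to one in time order, and the function name are illustrative choices rather than details from the text.

```python
def beat_offsets(events, target_beats):
    """Merge events by time and compute each event's offset from its beat.

    events: list of (time_s, label) tuples in any order.
    target_beats: list of target times, one per event, in time order.
    """
    ordered = sorted(events)
    if len(ordered) != len(target_beats):
        raise ValueError("each event needs exactly one target beat")
    return [
        {"index": i, "label": label, "time": t, "offset": t - beat}
        for i, ((t, label), beat) in enumerate(zip(ordered, target_beats))
    ]
```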

[0141] Then, the voice timestamp sequence is synchronized and aligned with the action execution timing. The alignment process uses the safe trajectory keyframes output in step S602 as anchor points, giving priority to matching stress peaks to the vicinity of posture acceleration peaks and to matching pauses to posture stillness windows. For entries with offsets, a small compression / extension is first allowed on the voice side to absorb beat differences; if the offset still exceeds the allowable range, adjacent relaxed ease-in / ease-out segments are selected on the action side for fine-tuning, while ensuring that the trajectory boundaries are not breached. After alignment, a collaborative control command containing playback instructions, actuator keyframe indices, and synchronization markers is generated.
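The two-stage absorption of a beat offset, first on the voice side and then on the action side, can be sketched as follows. The limits, the function name, and the returned "unresolved" remainder are hypothetical; the text does not say what happens to any offset that neither side can absorb.

```python
def split_offset(offset, voice_limit, action_limit):
    """Distribute a timing offset between voice and action adjustments.

    The voice side absorbs as much as its limit allows; the remainder goes
    to the action side, again bounded; anything left over is returned as
    unresolved so the caller can record it.
    """
    voice = max(-voice_limit, min(offset, voice_limit))
    residual = offset - voice
    action = max(-action_limit, min(residual, action_limit))
    return voice, action, residual - action
```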

[0142] Driven by the coordinated control commands, the robot actuator and the voice player operate collaboratively along a unified timeline. The actuator performs position and torque control based on keyframe interpolation, while the voice player triggers segment playback and beat control based on synchronization markers. When sensor feedback indicates a lag in actual movement, the playback rate is reduced within a limited range to maintain consistency between movement and speech. The synchronization loop calculates the residual once per control cycle and writes the residual statistics and corresponding suppression markers back to the shared buffer for reference by the upstream policy generator and language model regulator in the next round.
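The bounded playback-rate reduction can be sketched as a proportional rule with a floor. The proportional form, the gain, the minimum rate, and the function name are assumptions; the text only states that the rate is reduced within a limited range when movement lags.

```python
def playback_rate(lag_s, nominal=1.0, gain=0.5, min_rate=0.9):
    """Reduce the playback rate in proportion to movement lag, with a floor.

    lag_s is the measured lag of the actuator behind the audio, in seconds;
    zero or negative lag leaves the nominal rate unchanged.
    """
    if lag_s <= 0:
        return nominal
    return max(min_rate, nominal - gain * lag_s)
```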

[0143] Finally, the played audio waveform segments and executed keyframe indices are packaged chronologically to form a collaborative receipt between the audio and action sides. This collaborative receipt carries three types of fields: beat offset, degradation marker, and residual statistics. These fields are read by the emotion evolution predictor's memory cache to revise short-term trends; simultaneously, they are read by the parameter management unit of the control link to update the allowed duration compression / expansion and action ease-in / ease-out ranges for the next round. Through this closed loop, the audio and action maintain a rhythm and amplitude consistent with the emotional state in subsequent rounds, achieving stable implementation of multimodal emotional expression.

[0144] To effectively address the shortcomings of traditional technologies in emotion recognition, language processing, and emotion expression, and to provide technical support for emotion-aware robots, this application provides an embodiment of an LLVM-based emotion-aware robot control device for implementing all or part of the aforementioned LLVM-based emotion-aware robot control method. Referring to Figure 2, the emotion-aware robot control device based on the LLVM core specifically includes the following components:

[0145] The model building module 10 is used to extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain a multimodal feature vector, perform temporal alignment processing on the multimodal feature vector to generate a standardized feature matrix, construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector, input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model, calculate the emotion state vector according to the emotion evaluation model, and establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor.

[0146] The emotion perception module 20 is used to input the emotion state vector and the output result of the emotion evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotion-enhanced language model, perform feature fusion on the user input information and the emotion state vector to obtain an emotion perception input vector, input the emotion perception input vector into the dialogue intent understanding module to generate a dialogue history vector, and generate an interaction strategy vector based on the dialogue history vector and the emotion state vector, and adjust the emotion intensity of the interaction strategy vector to obtain an emotion expression control vector.

[0147] The robot control module 30 is used to parse the emotion expression control vector into actuator driving parameters and speech synthesis parameters, perform collision detection on the actuator driving parameters to obtain a safe trajectory, control the robot actuator to output actions according to the safe trajectory, input the speech synthesis parameters into the speech prosody modulator to generate a speech waveform, and realize the robot's collaborative emotion expression based on the speech waveform and output actions.

[0148] As described above, the emotion-perceiving robot control device based on the LLVM core provided in this application can effectively perceive emotions through multimodal features and neural networks. It constructs a language processing mechanism, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. It introduces expression control, ensuring the coordination of emotional expression through action planning and speech synthesis. This method effectively addresses the shortcomings of traditional technologies in emotion recognition, language processing, and emotional expression, providing technical support for emotion-perceiving robots.

[0149] From a hardware perspective, in order to effectively address the shortcomings of traditional technologies in emotion recognition, language processing, and emotion expression, and to provide technical support for emotion-aware robots, this application provides an embodiment of an electronic device for implementing all or part of the aforementioned LLVM-based emotion-aware robot control method. The electronic device specifically includes the following components:

[0150] The system comprises a processor, memory, a communication interface, and a bus; wherein the processor, memory, and communication interface communicate with each other via the bus; the communication interface is used to realize information transmission between the LLVM-based emotion-sensing robot control device and core business systems, user terminals, and related databases and other related devices; the logic controller can be a desktop computer, tablet computer, or mobile terminal, etc., and this embodiment is not limited to these. In this embodiment, the logic controller can be implemented with reference to the embodiments of the LLVM-based emotion-sensing robot control method and the LLVM-based emotion-sensing robot control device described in the embodiments, the content of which is incorporated herein, and repeated details will not be described again.

[0151] It is understood that the user terminal may include smartphones, tablet computers, network set-top boxes, portable computers, desktop computers, personal digital assistants (PDAs), in-vehicle devices, smart wearable devices, etc. Among these, the smart wearable devices may include smart glasses, smartwatches, smart bracelets, etc.

[0152] In practical applications, parts of the emotion-aware robot control method based on the LLVM core can be executed on the electronic device side as described above, or all operations can be completed in the client device. The choice can be made based on the processing power of the client device and the limitations of the user's usage scenario. This application does not impose any limitations on this. If all operations are completed in the client device, the client device may further include a processor.

[0153] The aforementioned client device may have a communication module (i.e., a communication unit) that can communicate with a remote server to achieve data transmission with the server. The server may include a server on the task scheduling center side; in other implementation scenarios, it may also include a server on an intermediate platform, such as a server on a third-party server platform that has a communication link with the task scheduling center server. The server may include a single computer device, a server cluster consisting of multiple servers, or a distributed server structure.

[0154] Figure 3 is a schematic block diagram illustrating the system configuration of the electronic device 9600 according to an embodiment of this application. As shown in Figure 3, the electronic device 9600 may include a central processing unit 9100 and a memory 9140; the memory 9140 is coupled to the central processing unit 9100. It is worth noting that Figure 3 is an example; other types of structures can also be used to supplement or replace this structure to achieve telecommunications functions or other functions.

[0155] In one embodiment, the emotion-aware robot control method based on the LLVM core can be integrated into the central processing unit 9100. The central processing unit 9100 can be configured to perform the following control:

[0156] Step S101: Extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain a multimodal feature vector. Perform temporal alignment processing on the multimodal feature vector to generate a standardized feature matrix. Construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector. Input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model. Calculate the emotion state vector based on the emotion evaluation model. Establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor.

[0157] Step S102: Input the emotional state vector and the output of the emotional evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotional enhancement language model. Perform feature fusion on the user input information and the emotional state vector to obtain an emotional perception input vector. Input the emotional perception input vector into the dialogue intent understanding module to generate a dialogue history vector. Generate an interaction strategy vector based on the dialogue history vector and the emotional state vector. Adjust the emotional intensity of the interaction strategy vector to obtain an emotional expression control vector.

[0158] Step S103: The emotion expression control vector is parsed into actuator driving parameters and speech synthesis parameters. Collision detection is performed on the actuator driving parameters to obtain a safe trajectory. The robot actuator is controlled to output actions according to the safe trajectory. The speech synthesis parameters are input into the speech prosody modulator to generate a speech waveform. Based on the speech waveform and the output action, the robot's collaborative emotion expression is realized.

[0159] As described above, the electronic device provided in this application embodiment achieves effective emotion perception through multimodal features and neural networks. It constructs a language processing mechanism, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. Expression control is introduced, ensuring the coordination of emotional expression through action planning and speech synthesis. This method effectively addresses the shortcomings of traditional technologies in emotion recognition, language processing, and emotional expression, providing technical support for emotion-perceiving robots.

[0160] In another embodiment, the emotion-sensing robot control device based on the LLVM core can be configured separately from the central processing unit 9100. For example, the emotion-sensing robot control device based on the LLVM core can be configured as a chip connected to the central processing unit 9100, and the function of the emotion-sensing robot control method based on the LLVM core can be realized through the control of the central processing unit.

[0161] As shown in Figure 3, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is worth noting that the electronic device 9600 does not necessarily need to include all the components shown in Figure 3; in addition, the electronic device 9600 may also include components not shown in Figure 3, for which reference may be made to existing technologies.

[0162] As shown in Figure 3, the central processing unit 9100, sometimes also referred to as a controller or operating control, may include a microprocessor or other processor device and / or logic device, which receives inputs and controls the operation of various components of the electronic device 9600.

[0163] The memory 9140 may be, for example, one or more of a cache, flash memory, hard drive, removable media, volatile memory, non-volatile memory, or other suitable devices. It may store the aforementioned failure-related information, and also store a program for executing that information. The central processing unit 9100 may execute the program stored in the memory 9140 to perform information storage or processing, etc.

[0164] The input unit 9120 provides input to the central processing unit 9100. The input unit 9120 may be, for example, a keypad or touch input device. The power supply 9170 provides power to the electronic device 9600. The display 9160 displays images and text. The display 9160 may be, for example, an LCD display, but is not limited thereto.

[0165] The memory 9140 can be a solid-state memory, such as a read-only memory (ROM), random access memory (RAM), a SIM card, etc. It can also be a memory that retains information even when power is off, can be selectively erased, and contains more data; examples of this type of memory are sometimes referred to as EPROMs. The memory 9140 can also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application / function storage unit 9142 for storing application programs and function programs or processes for executing the operation of the electronic device 9600 via the central processing unit 9100.

[0166] The memory 9140 may also include a data storage unit 9143 for storing data, such as contacts, digital data, pictures, sounds, and / or any other data used by the electronic device. The driver storage unit 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and / or for performing other functions of the electronic device (such as messaging applications, address book applications, etc.).

[0167] The communication module 9110 is a transmitter / receiver that sends and receives signals via the antenna 9111. The communication module 9110 (transmitter / receiver) is coupled to the central processing unit 9100 to provide input signals and receive output signals, which is the same as in a conventional mobile communication terminal.

[0168] Based on different communication technologies, multiple communication modules 9110 can be configured in the same electronic device, such as cellular network modules, Bluetooth modules, and / or wireless LAN modules. The communication module 9110 (transmitter / receiver) is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby realizing typical telecommunications functions. The audio processor 9130 may include any suitable buffer, decoder, amplifier, etc. Additionally, the audio processor 9130 is coupled to a central processing unit 9100, enabling on-device recording via the microphone 9132 and on-device playback of stored audio via the speaker 9131.

[0169] Embodiments of this application also provide a computer-readable storage medium capable of implementing all steps of the LLVM-based emotion-sensing robot control method with a server or client as the execution subject in the above embodiments. The computer-readable storage medium stores a computer program that, when executed by a processor, implements all steps of the LLVM-based emotion-sensing robot control method with a server or client as the execution subject in the above embodiments. For example, when the processor executes the computer program, it implements the following steps:

[0170] Step S101: Extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain a multimodal feature vector. Perform temporal alignment processing on the multimodal feature vector to generate a standardized feature matrix. Construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector. Input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model. Calculate the emotion state vector based on the emotion evaluation model. Establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor.

[0171] Step S102: Input the emotional state vector and the output of the emotional evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotional enhancement language model. Perform feature fusion on the user input information and the emotional state vector to obtain an emotional perception input vector. Input the emotional perception input vector into the dialogue intent understanding module to generate a dialogue history vector. Generate an interaction strategy vector based on the dialogue history vector and the emotional state vector. Adjust the emotional intensity of the interaction strategy vector to obtain an emotional expression control vector.

[0172] Step S103: The emotion expression control vector is parsed into actuator driving parameters and speech synthesis parameters. Collision detection is performed on the actuator driving parameters to obtain a safe trajectory. The robot actuator is controlled to output actions according to the safe trajectory. The speech synthesis parameters are input into the speech prosody modulator to generate a speech waveform. Based on the speech waveform and the output action, the robot's collaborative emotion expression is realized.
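The temporal alignment in Step S101 can be illustrated with a minimal sketch. It assumes that image and audio features arrive as timestamped lists and that alignment means pairing each image frame with the nearest audio frame; the function name `align_nearest` and the `max_gap` tolerance are hypothetical and do not come from this application.

```python
from bisect import bisect_left


def align_nearest(image_feats, audio_feats, max_gap=0.05):
    """Pair each image feature with the audio feature closest in time.

    image_feats / audio_feats: lists of (timestamp_seconds, feature_list),
    both sorted by timestamp. Returns rows of [t, *image, *audio]. Image
    frames with no audio frame within max_gap seconds are dropped.
    """
    audio_times = [t for t, _ in audio_feats]
    rows = []
    for t, img in image_feats:
        i = bisect_left(audio_times, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(audio_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(audio_times[k] - t))
        if abs(audio_times[j] - t) <= max_gap:
            rows.append([t] + list(img) + list(audio_feats[j][1]))
    return rows
```

Each returned row is one time step of a combined feature matrix. A real system would more likely resample both streams to a common rate, so this nearest-neighbour pairing is only one possible reading of the step.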

[0173] As described above, the computer-readable storage medium provided in this application embodiment achieves effective emotion perception through multimodal features and neural networks. It constructs a language processing mechanism, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. Expression control is introduced, ensuring the coordination of emotional expression through action planning and speech synthesis. This method effectively addresses the shortcomings of traditional technologies in emotion recognition, language processing, and emotional expression, providing technical support for emotion-perceiving robots.
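The memory cache pool and emotion evolution predictor of Step S101 might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: it substitutes simple exponential smoothing for the recurrent neural network named in the claims, and the class name `EmotionMemoryPool` and the parameters `capacity` and `alpha` are hypothetical.

```python
from collections import deque


class EmotionMemoryPool:
    """Fixed-size cache of recent emotion state vectors with a naive forecast."""

    def __init__(self, capacity=8, alpha=0.5):
        self.states = deque(maxlen=capacity)  # oldest entries fall out automatically
        self.alpha = alpha                    # weight given to each newer state

    def update(self, state):
        """Store the latest emotion state vector."""
        self.states.append(list(state))

    def predict_next(self):
        """Return an exponentially smoothed estimate of the next state."""
        if not self.states:
            raise ValueError("no emotion states cached yet")
        cached = list(self.states)
        estimate = cached[0]
        for state in cached[1:]:
            estimate = [self.alpha * s + (1 - self.alpha) * e
                        for s, e in zip(state, estimate)]
        return estimate
```

The bounded `deque` shows the state-maintenance role of the cache pool; a learned sequence model would replace `predict_next` in practice.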

[0174] Embodiments of this application also provide a computer program product capable of implementing all steps in the LLVM-based emotion-sensing robot control method, where the execution subject is a server or client, as described in the above embodiments. When executed by a processor, this computer program / instruction implements the steps of the LLVM-based emotion-sensing robot control method. For example, the computer program / instruction implements the following steps:

[0175] Step S101: Extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain a multimodal feature vector. Perform temporal alignment processing on the multimodal feature vector to generate a standardized feature matrix. Construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector. Input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model. Calculate the emotion state vector based on the emotion evaluation model. Establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor.

[0176] Step S102: Input the emotional state vector and the output of the emotional evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotional enhancement language model. Perform feature fusion on the user input information and the emotional state vector to obtain an emotional perception input vector. Input the emotional perception input vector into the dialogue intent understanding module to generate a dialogue history vector. Generate an interaction strategy vector based on the dialogue history vector and the emotional state vector. Adjust the emotional intensity of the interaction strategy vector to obtain an emotional expression control vector.

[0177] Step S103: The emotion expression control vector is parsed into actuator driving parameters and speech synthesis parameters. Collision detection is performed on the actuator driving parameters to obtain a safe trajectory. The robot actuator is controlled to output actions according to the safe trajectory. The speech synthesis parameters are input into the speech prosody modulator to generate a speech waveform. Based on the speech waveform and the output action, the robot's collaborative emotion expression is realized.
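The collision detection in Step S103 can be sketched in a few lines. The sketch makes simplifying assumptions that are not in this application: trajectory points are treated as Cartesian positions, obstacles as spheres given by a center and radius, and interpolation is linear. The names `interpolate` and `first_collision` are hypothetical.

```python
def interpolate(waypoints, steps_per_segment=4):
    """Linearly interpolate between consecutive waypoints."""
    if not waypoints:
        return []
    trajectory = []
    for start, end in zip(waypoints, waypoints[1:]):
        for k in range(steps_per_segment):
            u = k / steps_per_segment
            trajectory.append([a + u * (b - a) for a, b in zip(start, end)])
    trajectory.append(list(waypoints[-1]))
    return trajectory


def first_collision(trajectory, obstacles, clearance=0.0):
    """Return the index of the first point inside any sphere, else None.

    obstacles: list of (center, radius) pairs; clearance pads each radius.
    """
    for idx, point in enumerate(trajectory):
        for center, radius in obstacles:
            dist_sq = sum((p - c) ** 2 for p, c in zip(point, center))
            if dist_sq < (radius + clearance) ** 2:
                return idx
    return None
```

A trajectory for which `first_collision` returns `None` can be treated as safe under these assumptions; otherwise the returned index marks where replanning would have to begin.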

[0178] As described above, the computer program product provided in this application achieves effective emotion perception through multimodal features and neural networks. It constructs a language processing mechanism, combining emotion regulation and dialogue understanding to establish a reliable interaction strategy. It introduces expression control, ensuring the coordination of emotional expression through action planning and speech synthesis. This method effectively addresses the shortcomings of traditional technologies in emotion recognition, language processing, and emotional expression, providing technical support for emotion-perceiving robots.
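The emotional intensity adjustment at the end of Step S102 could look like the sketch below. The gain rule is an assumption made purely for illustration: intensity is taken as the largest absolute component of the emotional state vector, and the result is clamped to a fixed range. The function name `adjust_intensity` and the parameters `max_gain` and `limit` are hypothetical.

```python
def adjust_intensity(strategy, emotion_state, max_gain=1.5, limit=1.0):
    """Scale a strategy vector by a gain derived from emotion intensity, then clamp."""
    if not emotion_state:
        raise ValueError("emotion_state must not be empty")
    # Assumed intensity measure: largest absolute emotion component, capped at 1.
    intensity = min(1.0, max(abs(e) for e in emotion_state))
    # Gain runs from 1.0 (neutral state) up to max_gain (strongest state).
    gain = 1.0 + (max_gain - 1.0) * intensity
    return [max(-limit, min(limit, gain * s)) for s in strategy]
```

Clamping keeps the resulting control vector inside a range that downstream actuator and speech parameters can accept, whatever gain is chosen.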

[0179] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0180] This invention is described with reference to flowcharts and / or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowcharts and / or block diagrams, and combinations of flows and / or blocks in the flowcharts and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0181] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device, which implements the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0182] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0183] Specific embodiments have been used to illustrate the principles and implementation methods of this invention. The descriptions of the embodiments above are only for the purpose of helping to understand the method and core ideas of this invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this invention. Therefore, the content of this specification should not be construed as a limitation of this invention.

Claims

1. A control method for emotion-aware robots based on the LLVM core, characterized in that the method includes: Multimodal feature vectors are obtained by extracting features from the image data stream captured by the camera and the voice data stream captured by the microphone. The multimodal feature vectors are then subjected to temporal alignment to generate a standardized feature matrix. An emotion feature mapper is constructed based on the standardized feature matrix to obtain a low-dimensional representation vector. The low-dimensional representation vector is then input into an emotion recognition neural network to train an emotion evaluation model. An emotion state vector is calculated based on the emotion evaluation model. Finally, an emotion evolution predictor is obtained by establishing a memory cache pool based on the emotion state vector. The emotional state vector and the output of the emotional evolution predictor are input into the language model regulator to adjust the attention parameters to obtain an emotional enhancement language model. The user input information and the emotional state vector are fused to obtain an emotional perception input vector. The emotional perception input vector is input into the dialogue intent understanding module to generate a dialogue history vector. An interaction strategy vector is generated based on the dialogue history vector and the emotional state vector. The interaction strategy vector is adjusted for emotional intensity to obtain an emotional expression control vector. The emotion expression control vector is parsed into actuator driving parameters and speech synthesis parameters. Collision detection is performed on the actuator driving parameters to obtain a safe trajectory. The robot actuator is controlled to output actions according to the safe trajectory. The speech synthesis parameters are input into the speech prosody modulator to generate a speech waveform. Based on the speech waveform and the output action, the robot's collaborative emotion expression is realized.

2. The emotion-sensing robot control method based on LLVM core according to claim 1, characterized in that the process of extracting features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain a multimodal feature vector, performing temporal alignment processing on the multimodal feature vector to generate a standardized feature matrix, and constructing an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector includes: Edge detection and region segmentation are performed on the image data stream to obtain image feature regions. Local descriptive operators are extracted from the image feature regions to generate an image feature description set. The speech data stream is converted into a spectrogram sequence and time-frequency analysis is performed to obtain an acoustic feature description set. Feature vector quantization is performed on the image feature description set and the acoustic feature description set to generate a multimodal feature coding matrix. A cross-modal alignment network is constructed based on the multimodal feature coding matrix to obtain an alignment feature mapper. The multimodal feature coding matrix is input into the alignment feature mapper for temporal alignment processing to obtain a standardized feature sequence. Principal component analysis is performed on the standardized feature sequence according to a preset dimensionality reduction rule to generate a feature projection matrix. A feature dimensionality reduction network is trained based on the feature projection matrix to obtain an emotion feature mapper. The standardized feature sequence received in real time is input into the emotion feature mapper to generate a low-dimensional representation vector.

3. The emotion-sensing robot control method based on LLVM core according to claim 1, characterized in that, The process of inputting the low-dimensional representation vector into an emotion recognition neural network to train an emotion evaluation model, calculating an emotion state vector based on the emotion evaluation model, and establishing a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor includes: The low-dimensional representation vector is divided into training sequence and validation sequence according to the time sequence segmentation rule. The training sequence is subjected to data augmentation processing to generate a training sample set. A multilayer perceptron network structure is constructed based on the training sample set to obtain an emotion recognition model prototype. The emotion recognition model prototype is iteratively optimized and trained according to a preset loss function to obtain an emotion evaluation model. The validation sequence is input into the emotion evaluation model for cross-validation to obtain the model evaluation index. The real-time low-dimensional representation vector is input into the emotion assessment model for forward computation to obtain the emotion state vector. The emotion state vector is then subjected to temporal sliding sampling to generate an emotion state sequence. A recurrent neural network is constructed based on the emotion state sequence to obtain an emotion evolution predictor. The emotion evolution predictor is deployed as a memory cache pool for real-time state maintenance and prediction updates.

4. The emotion-sensing robot control method based on LLVM core according to claim 1, characterized in that the process of inputting the emotional state vector and the output of the emotional evolution predictor into a language model regulator to adjust attention parameters, thereby obtaining an emotional enhancement language model, fusing user input information with the emotional state vector to obtain an emotional perception input vector, and inputting the emotional perception input vector into a dialogue intent understanding module to generate a dialogue history vector includes: The emotional state vector and the output of the emotional evolution predictor are concatenated to obtain the emotional regulation vector. The emotional regulation vector is normalized to generate a weight distribution matrix. An attention regulation network is constructed based on the weight distribution matrix to obtain a parameter regulation model. The output of the parameter regulation model is remapped with the pre-trained language model to obtain an emotional enhancement language model. User input information is converted into a text sequence and segmented to obtain a word sequence. Multi-head attention is then performed on the word sequence and the emotional state vector to generate a fusion feature matrix. The fusion feature matrix is then input into a bidirectional encoder to obtain an emotional perception input vector. An intent recognition classifier is then constructed based on the emotional perception input vector to generate a dialogue history vector.

5. The emotion-sensing robot control method based on LLVM core according to claim 1, characterized in that, The step of generating an interaction strategy vector based on the dialogue history vector and the emotional state vector, and adjusting the emotional intensity of the interaction strategy vector to obtain an emotional expression control vector, includes: The dialogue history vector and the emotion state vector are concatenated to obtain a multimodal state vector. The multimodal state vector is hierarchically encoded to generate a state representation matrix. A policy generation network is constructed based on the state representation matrix to obtain an interaction decision model. The interaction decision model is optimized by Monte Carlo tree search to obtain an interaction policy vector. The interaction strategy vector is decoupled and decomposed according to emotion type to obtain an intensity parameter set. An adaptive regulator is constructed based on the intensity parameter set to generate an adjustment coefficient matrix. The adjustment coefficient matrix and the interaction strategy vector are component concatenated to obtain an emotion expression control vector.

6. The emotion-sensing robot control method based on LLVM core according to claim 1, characterized in that, The process of parsing the emotion expression control vector into actuator driving parameters and speech synthesis parameters, performing collision detection on the actuator driving parameters to obtain a safe trajectory, and controlling the robot actuator to output actions based on the safe trajectory includes: The emotion expression control vector is decomposed into motion parameter matrix according to a preset parsing rule. A kinematic mapping network is constructed based on the motion parameter matrix to generate a joint space mapper. The output of the joint space mapper is constrained and optimized to obtain actuator driving parameters. The actuator driving parameters are interpolated and smoothed to generate an initial trajectory sequence. The initial trajectory sequence is input into the collision detection module for spatial interference analysis to obtain an obstacle avoidance path set. Based on the obstacle avoidance path set, a trajectory optimizer is constructed to generate a safe trajectory. The safe trajectory is then inversely solved according to the robot's kinematics model to obtain joint drive commands. The actuator is controlled to complete the action output according to the joint drive commands.

7. The emotion-sensing robot control method based on LLVM core according to claim 1, characterized in that, The step of inputting the speech synthesis parameters into a speech prosody modulator to generate a speech waveform, and realizing the robot's collaborative emotional expression based on the speech waveform and output actions, includes: The speech synthesis parameters are decomposed according to the phoneme structure to obtain the prosodic feature set. The prosodic feature set is then subjected to emotion mapping transformation to generate a prosodic control vector. An acoustic parameter generation network is constructed based on the prosodic control vector to obtain a speech synthesis model. The output of the speech synthesis model is then processed by a vocoder to obtain a speech waveform. The speech waveform is subjected to time-series analysis to obtain a speech timestamp sequence. The speech timestamp sequence is synchronized and aligned with the action execution time sequence to generate a collaborative control command. Based on the collaborative control command, the robot actuator and the speech player are scheduled to complete multimodal emotional expression.

8. A control device for emotion-sensing robots based on the LLVM core, characterized in that the device includes: The model building module is used to extract features from the image data stream captured by the camera and the voice data stream captured by the microphone to obtain multimodal feature vectors, perform temporal alignment processing on the multimodal feature vectors to generate a standardized feature matrix, construct an emotion feature mapper based on the standardized feature matrix to obtain a low-dimensional representation vector, input the low-dimensional representation vector into the emotion recognition neural network to train an emotion evaluation model, calculate the emotion state vector based on the emotion evaluation model, and establish a memory cache pool based on the emotion state vector to obtain an emotion evolution predictor. The emotion perception module is used to input the emotion state vector and the output of the emotion evolution predictor into the language model regulator to adjust the attention parameters to obtain an emotion-enhanced language model, perform feature fusion on the user input information and the emotion state vector to obtain an emotion perception input vector, input the emotion perception input vector into the dialogue intent understanding module to generate a dialogue history vector, generate an interaction strategy vector based on the dialogue history vector and the emotion state vector, and adjust the emotion intensity of the interaction strategy vector to obtain an emotion expression control vector. The robot control module is used to parse the emotion expression control vector into actuator driving parameters and speech synthesis parameters, perform collision detection on the actuator driving parameters to obtain a safe trajectory, control the robot actuator to output actions according to the safe trajectory, input the speech synthesis parameters into a speech prosody modulator to generate a speech waveform, and realize the robot's collaborative emotion expression based on the speech waveform and output actions.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the emotion-aware robot control method based on the LLVM core as described in any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the computer program implements the steps of the emotion-aware robot control method based on the LLVM core as described in any one of claims 1 to 7.
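As an illustration of the synchronization recited in claim 7, where the speech timestamp sequence is aligned with the action execution time sequence, the following sketch merges two timed event lists into one ordered command list. It is a minimal example under stated assumptions: the event format, the function name `build_schedule`, and the `action_offset` parameter are hypothetical and not part of the claims.

```python
import heapq


def build_schedule(speech_events, action_events, action_offset=0.0):
    """Merge speech and action events into one time-ordered command list.

    Each input is a list of (start_time_seconds, label) sorted by time.
    action_offset shifts every action, for example to start a gesture
    slightly before the word it accompanies.
    """
    speech = [(t, "speech", label) for t, label in speech_events]
    actions = [(t + action_offset, "action", label) for t, label in action_events]
    # heapq.merge expects sorted inputs; tuples sort by time, then kind, then label.
    return list(heapq.merge(speech, actions))
```

A scheduler could walk the returned list in order and dispatch each entry to the speech player or the actuator controller according to its kind.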