Multimodal interaction methods, apparatuses, devices, vehicles, media, and program products
By acquiring multimodal data in the smart cockpit, determining modal weights based on modal confidence, environmental adaptability, and user dependence, and resolving conflicts using cross-modal feature fusion and a Transformer model, the method addresses response delay and misjudgment in multimodal interaction, improving recognition accuracy and user satisfaction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU AUTOMOBILE GROUP CO LTD
- Filing Date
- 2026-03-02
- Publication Date
- 2026-06-26
AI Technical Summary
Existing smart cockpits lack a unified modal fusion framework for multimodal interaction, resulting in high response delays and misjudgment rates in multimodal interaction.
By acquiring multimodal data, modal weight coefficients are determined based on modal confidence, environmental adaptability, and user dependence. Data fusion is performed using a cross-modal feature fusion network, and instruction conflicts are resolved by combining a Transformer model to finally determine the target interaction instruction.
The method improves the recognition accuracy of multimodal interaction commands, reduces response latency, and increases user interaction satisfaction.
Figure CN122284809A_ABST
Description
Technical Field
[0001] This application relates to the field of human-computer interaction technology, and more specifically, to a multimodal interaction method, apparatus, device, vehicle, medium, and program product.
Background Technology
[0002] Intelligent cockpits in vehicles can support multimodal interactions. For example, users can interact with the intelligent cockpit through voice, gestures, and touch. Currently, intelligent cockpits process data from each modality independently and lack a unified modal fusion framework. For instance, when a user simultaneously inputs interactive commands via voice and gestures, the intelligent cockpit cannot integrate the voice and gesture data, resulting in high latency and high misjudgment rates in multimodal interaction responses.
Summary of the Invention
[0003] This application provides a multimodal interaction method, apparatus, device, vehicle, medium, and program product to solve the aforementioned technical problems.
[0004] In a first aspect, embodiments of this application provide a multimodal interaction method, the method comprising: acquiring multimodal data, the multimodal data including at least two of voice modal data, gesture modal data, touch modal data, and visual modal data; determining modal weight coefficients for each modality based on modal confidence, environmental adaptability, and user dependence, wherein the modal confidence is used to quantify modal recognition accuracy, the environmental adaptability is used to quantify the degree of interference of the current environment on the modality, and the user dependence is used to quantify the user's modal preference; fusing the multimodal data according to the modal weight coefficients to obtain fused data; determining a target interaction instruction based on the fused data; and executing the target interaction instruction.
[0005] In a second aspect, embodiments of this application provide a multimodal interaction device, comprising: a data acquisition module for acquiring multimodal data, wherein the multimodal data includes at least two of voice modal data, gesture modal data, touch modal data, and visual modal data; a weight determination module for determining modal weight coefficients for each modality based on modal confidence, environmental adaptability, and user dependence; a data fusion module for fusing the multimodal data based on the modal weight coefficients to obtain fused data; an instruction determination module for determining a target interaction instruction based on the fused data; and an instruction execution module for executing the target interaction instruction.
[0006] In a third aspect, embodiments of this application provide an electronic device, which includes a memory and a processor. The memory stores an application program that, when invoked by the processor, causes the processor to execute the method provided in the embodiments of this application.
[0007] In a fourth aspect, embodiments of this application provide a vehicle that includes the electronic device described in the third aspect.
[0008] In a fifth aspect, embodiments of this application provide a computer-readable storage medium storing program code, which, when invoked by a processor, causes the processor to execute the method provided in embodiments of this application.
[0009] In a sixth aspect, embodiments of this application provide a computer program product, which, when invoked by a processor, causes the processor to execute the method provided in embodiments of this application.
[0010] The multimodal interaction method, apparatus, device, vehicle, medium, and program product provided in this application have the following technical effects: Based on the attention mechanism, the modal weight coefficients of each modality are determined according to the modal confidence, environmental adaptability, and user dependence of each modality, and multimodal data is fused according to the modal weight coefficients of each modality. The target interaction command is determined based on the fused data, which can improve the accuracy of recognizing multimodal interaction commands and reduce the response latency of multimodal interaction.
Attached Figure Description
[0011] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments and drawings obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0012] Figure 1 is a flowchart of a multimodal interaction method provided in an embodiment of this application;
Figure 2 is a flowchart of step S120 provided in an embodiment of this application;
Figure 3 is a flowchart of step S130 provided in an embodiment of this application;
Figure 4 is a flowchart of step S140 provided in an embodiment of this application;
Figure 5 is a flowchart of step S143 provided in an embodiment of this application;
Figure 6 is a partial flowchart of a multimodal interaction method provided in another embodiment of this application;
Figure 7 is a flowchart of a multimodal interaction method provided in another embodiment of this application;
Figure 8 is a structural block diagram of a multimodal interaction device provided in an embodiment of this application;
Figure 9 is a structural block diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0013] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
[0014] The multimodal interaction method in this application can be applied to multimodal interaction devices, electronic devices, or vehicles. The multimodal interaction device can be deployed in an electronic device or vehicle. The vehicle includes a smart cockpit, microphone, camera, touchscreen, and driver monitoring system (DMS). Users can interact with the smart cockpit through various methods such as voice, gesture, touch, and vision. The vehicle can include, but is not limited to, gasoline vehicles or new energy vehicles. New energy vehicles can include electric vehicles, which can include, but are not limited to, pure electric vehicles, hybrid electric vehicles, or fuel cell vehicles. The multimodal interaction method of this application will be described below using a vehicle as an example.
[0015] Referring to Figure 1, which is a flowchart of a multimodal interaction method provided in an embodiment of this application, the multimodal interaction method may include steps S110 to S150.
[0016] Step S110: Acquire multimodal data, which includes at least two of the following: voice modal data, gesture modal data, touch modal data, and visual modal data.
[0017] The intelligent cockpit can collect multi-modal data from inside the vehicle, such as at least two of the following: voice modality data, gesture modality data, touch modality data, and visual modality data. After collecting the multi-modal data, each modality data can be preprocessed to obtain the feature matrix or feature vector of each modality.
[0018] For the speech modality: Speech signals can be acquired using a microphone array, and Mel-Frequency Cepstral Coefficients (MFCCs) can be extracted from the speech signals to obtain a speech feature matrix whose two dimensions are the number of time steps and the number of feature dimensions.
[0019] For the gesture modality: Gesture images can be captured using an in-vehicle camera, normalized to 224×224 pixels, and denoised using Gaussian filtering (σ=1.0). The coordinates of 21 hand joints are extracted using the MediaPipe hand keypoint detection model, and the x and y coordinates of these joints are mapped to the [0,1] interval to construct a gesture feature vector.
[0020] For the touch modality: Touch coordinates are collected via the touchscreen, and the touch coordinates (x, y) are normalized by the screen resolution to the range [0,1]. The press duration is limited to a preset range of [0 seconds, 3 seconds]. A touch feature vector is constructed from the normalized coordinates and the press duration.
[0021] For the visual modality: Images of the driver are acquired via the DMS, and the facial region is detected using a Multi-Task Convolutional Neural Network (MTCNN). The facial region images are then cropped to 112×112 pixels, and expression features are extracted using a Residual Network (ResNet). In the preprocessing stage, as shown in expression (1), the three-sigma (3σ) criterion can also be used to remove outliers from the multimodal data. Outliers are modal data outside a reasonable range, such as touch inputs whose press duration exceeds the preset press duration, touch coordinates that exceed the screen resolution of the touchscreen, and gesture joint coordinates that deviate by more than 3σ.
[0022] The mean μ and standard deviation σ of the 3σ criterion can be calculated from a real-time sliding window of the most recent 100 frames. If a frame of data falls outside [μ-3σ, μ+3σ], it is replaced with the valid data from the previous frame to avoid data loss.
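The gesture and touch preprocessing and the 3σ replacement rule described above can be illustrated with a minimal sketch. The function and class names are hypothetical, joints are assumed to arrive as pixel coordinates in the 224×224 image, and outlier frames are assumed not to enter the sliding window; none of these details are stated in the application.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, List, Optional, Tuple


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return min(max(value, low), high)


def gesture_feature_vector(joints_px: List[Tuple[float, float]],
                           image_size: int = 224) -> List[float]:
    """Flatten 21 (x, y) joint positions into 42 values in [0, 1].

    Assumption: normalization divides by the image side length.
    """
    if len(joints_px) != 21:
        raise ValueError("expected 21 hand joints")
    return [clamp(c / image_size, 0.0, 1.0) for joint in joints_px for c in joint]


def touch_feature_vector(x_px: float, y_px: float, press_seconds: float,
                         screen_w: int, screen_h: int,
                         max_press_seconds: float = 3.0) -> List[float]:
    """Normalized touch position plus press duration limited to [0 s, 3 s]."""
    return [clamp(x_px / screen_w, 0.0, 1.0),
            clamp(y_px / screen_h, 0.0, 1.0),
            clamp(press_seconds, 0.0, max_press_seconds)]


class ThreeSigmaFilter:
    """Replace frames outside [mu - 3*sigma, mu + 3*sigma] with the last valid frame."""

    def __init__(self, window_size: int = 100) -> None:
        self.window: Deque[float] = deque(maxlen=window_size)
        self.last_valid: Optional[float] = None

    def update(self, value: float) -> float:
        # With fewer than two frames there is no spread to compare against.
        if len(self.window) >= 2 and self.last_valid is not None:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if abs(value - mu) > 3 * sigma:
                # Outlier: reuse the previous valid frame instead of dropping data.
                return self.last_valid
        self.window.append(value)
        self.last_valid = value
        return value
```

The filter works on one scalar channel; a per-coordinate instance would be needed for vector-valued frames such as the 42 joint coordinates.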
[0023] Step S120: Determine the modality weight coefficient of each modality based on the modality confidence, environment adaptability, and user dependence of each modality. The modality confidence is used to quantify the modality recognition accuracy, the environment adaptability is used to quantify the degree of interference of the current environment on the modality, and the user dependence is used to quantify the user's modality preference.
[0024] To quantify the real-time confidence and scenario adaptability of each modality, this application constructs a modal weight coefficient for each modality. The modal weight coefficient lies in the range [0,1] and is obtained through a three-factor weighted calculation over the modal confidence, the environmental adaptability, and the user dependence. Referring to Figure 2, which is a flowchart of step S120 provided in an embodiment of this application, step S120 includes steps S121 to S124.
[0025] Step S121: Determine the modality confidence level of each modality based on the recognition accuracy of each modality.
[0026] The modal confidence is a real-time quantized value based on the recognition accuracy of the modality.
[0027] For speech modalities: the confidence level output by the Automatic Speech Recognition (ASR) module can be directly obtained. The ASR module converts human speech into text or commands that a computer can process and outputs the speech recognition confidence level.
[0028] For gesture modalities: the output probability of the classifier for gesture modalities can be obtained, as shown in expression (2). Based on the output probability of the classifier for gesture modalities, the modal confidence of gesture modalities can be calculated.
[0029] In expression (2), the coefficients are preset, and the probability is the output probability of the classifier for the modality in question.
[0030] For touch mode: the output probability of the classifier for touch mode can be obtained, as shown in expression (2). Based on the output probability of the classifier for touch mode, the mode confidence of touch mode can be calculated.
[0031] For visual modalities: the output probability of the classifier for the visual modalities can be obtained, as shown in expression (2). Based on the output probability of the classifier for the visual modalities, the modal confidence of the visual modalities can be calculated.
[0032] Step S122: Determine the environmental adaptability of each mode based on the degree of interference of the current environment on each mode.
[0033] Environmental adaptability is a quantitative value of the degree of interference of the current cockpit environment on each mode.
[0034] For the speech modality: The speech modality is affected by noise. As shown in expression (3), the environmental adaptability of the speech modality can be calculated from the current signal-to-noise ratio and the signal-to-noise ratio when there is no interference.
[0035] For the gesture modality: The gesture modality is affected by illumination. As shown in expression (4), the environmental adaptability of the gesture modality can be calculated from the current illumination intensity and the optimal illumination intensity.
[0036] For the touch modality: The touch modality is affected by temperature and humidity. Optionally, as shown in expression (5), the environmental adaptability of the touch modality can be calculated from the current ambient temperature and the optimal ambient temperature; or, as shown in expression (6), from the current device humidity and the device humidity when there is no interference; or, as shown in expression (7), from the current ambient temperature, the optimal ambient temperature, the current device humidity, and the device humidity when there is no interference.
[0037] In these expressions, the weight coefficients can be dynamically adapted to the scene.
[0038] For the visual modality: The visual modality is affected by illumination. As shown in expression (4), the environmental adaptability of the visual modality can be calculated from the current illumination intensity and the optimal illumination intensity.
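As an illustration only, the environmental adaptability terms can be sketched as below. Expressions (3) to (7) are not reproduced in this text, so the functional forms (a clipped ratio for the signal-to-noise ratio, a clipped relative deviation for illumination and temperature, and an equal-weight mix for touch) are assumptions, as are all function names.

```python
def ratio_adaptability(current: float, reference: float) -> float:
    """Adaptability as current / reference, clipped to [0, 1] (assumed form).

    Example use: current SNR versus the SNR when there is no interference.
    """
    if reference <= 0:
        raise ValueError("reference must be positive")
    return min(max(current / reference, 0.0), 1.0)


def deviation_adaptability(current: float, optimal: float) -> float:
    """Adaptability as 1 - |current - optimal| / optimal, clipped to [0, 1] (assumed form).

    Example use: current illumination intensity versus the optimal intensity.
    """
    if optimal <= 0:
        raise ValueError("optimal must be positive")
    return min(max(1.0 - abs(current - optimal) / optimal, 0.0), 1.0)


def touch_adaptability(temperature_term: float, humidity_term: float,
                       temperature_weight: float = 0.5) -> float:
    """Weighted mix of the temperature and humidity terms (assumed weights)."""
    return (temperature_weight * temperature_term
            + (1.0 - temperature_weight) * humidity_term)
```

Each function returns a value in [0, 1], so the results can be fed directly into the three-factor weighting of step S124.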
[0039] Step S123: Determine the user dependency of each modality based on the historical usage frequency of each modality.
[0040] User dependence is a quantified value of a user's preference for, or frequency of use of, a modality. If the modalities include voice, gesture, touch, and vision, then, as shown in expression (8), the usage frequency of each modality can be computed from the user's historical interaction data to determine the user dependence of each modality.
[0041] In expression (8), the count term represents the number of times modality m was used in the user's historical interaction data.
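A minimal sketch of the usage-frequency statistic follows, assuming that user dependence is the share of each modality in the total usage count. Expression (8) is not reproduced in this text, and the function name and the uniform fallback for an empty history are hypothetical.

```python
from typing import Dict

MODALITIES = ("voice", "gesture", "touch", "vision")


def user_dependence(usage_counts: Dict[str, int]) -> Dict[str, float]:
    """Return each modality's share of the user's historical interactions."""
    total = sum(usage_counts.get(m, 0) for m in MODALITIES)
    if total == 0:
        # No history yet: assume no preference between modalities.
        return {m: 1.0 / len(MODALITIES) for m in MODALITIES}
    return {m: usage_counts.get(m, 0) / total for m in MODALITIES}
```

For example, a history of 6 voice, 2 gesture, and 2 touch interactions gives a user dependence of 0.6 for voice and 0.0 for vision.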
[0042] Step S124: Perform weighted calculations on the modality confidence, environment adaptability, and user dependence of each modality to obtain the modality weight coefficients for each modality.
[0043] As shown in expression (9), the modal confidence, environmental adaptability, and user dependence of each modality can be weighted to obtain the modal weight coefficient of that modality.
[0044] In expression (9), the three weighting coefficients can be dynamically adapted to different scenarios.
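The three-factor weighting can be sketched as a convex combination. This is an assumption: expression (9) is not reproduced in this text, and the default coefficient values and the requirement that they sum to 1 are illustrative choices that keep the result in [0,1].

```python
def modal_weight(confidence: float, adaptability: float, dependence: float,
                 alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> float:
    """Weighted sum of modal confidence, environmental adaptability, and user dependence.

    alpha, beta, gamma are hypothetical weighting coefficients; they are
    required to sum to 1 so that inputs in [0, 1] give a weight in [0, 1].
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    weight = alpha * confidence + beta * adaptability + gamma * dependence
    return min(max(weight, 0.0), 1.0)
```

A scene-dependent policy could pass different coefficients, for example a larger beta in a noisy cabin so that environmental adaptability dominates.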
[0045] Steps S121 to S124 have the following technical effects: calculating the modal weight coefficients based on the attention mechanism can improve the accuracy of recognizing multimodal commands.
[0046] Step S130: Based on the modal weight coefficients of each modality, fuse the multimodal data to obtain fused data.
[0047] This application uses a Cross-Modal Feature Fusion Network (CMFN) with a three-stage fusion architecture of "attention weighting → feature concatenation → dimensionality compression" to fuse multimodal features into a unified interaction feature vector. Referring to Figure 3, which is a flowchart of step S130 provided in an embodiment of this application, step S130 includes steps S131 to S133.
[0048] Step S131: Use the modal weight coefficients of each modality to perform weighted calculations on the data of each modality.
[0049] The modal weight coefficient of each modality can be used to apply attention weighting to the features of that modality. Specifically, as shown in expression (10), each modal feature is weighted by its modal weight coefficient to obtain the weighted modal feature, highlighting the contribution of high-confidence modalities.
[0050] Step S132: Concatenate the weighted modal data to obtain concatenated data.
[0051] Cross-modal feature concatenation can be performed on the weighted modal data. Specifically, as shown in expression (11), the weighted modal features can be concatenated into a high-dimensional feature matrix.
[0052] In expression (11), the dimension of this matrix is the feature dimension after concatenation.
[0053] Step S133: Compress the concatenated data to obtain the fused data.
[0054] Dimensionality compression and fusion can be performed on the concatenated data. Specifically, as shown in expressions (12) and (13), the dimensionality of the concatenated data can be compressed using a 2-layer fully connected (FC) network and a rectified linear unit (ReLU) function to output the fused features.
[0055] In expressions (12) and (13), each fully connected layer has a weight matrix and a bias term, and the output is the final fused feature.
[0056] Steps S131 to S133 have the following technical effects: The three-stage fusion architecture of "attention weighting → feature concatenation → dimensionality compression" based on the CMFN fuses multimodal features, which can significantly improve the accuracy of multimodal instruction recognition.
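Steps S131 and S132 can be sketched as follows, under the assumption that each modality has already been reduced to a flat feature vector (for example by pooling the speech feature matrix over time). The function name and the alphabetical ordering of modalities are hypothetical.

```python
from typing import Dict, List


def weight_and_concatenate(features: Dict[str, List[float]],
                           weights: Dict[str, float]) -> List[float]:
    """Scale each modality's features by its weight, then concatenate them."""
    concatenated: List[float] = []
    # A fixed modality order keeps the layout of the concatenated vector stable.
    for modality in sorted(features):
        w = weights[modality]
        concatenated.extend(w * value for value in features[modality])
    return concatenated
```

The length of the result is the sum of the per-modality feature lengths, which corresponds to the feature dimension after concatenation.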
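A pure-Python sketch of the compression stage is given below. It assumes that ReLU is applied after the first fully connected layer only and that the second layer is linear; the application does not state where the activation sits, and the weights would in practice be learned rather than supplied by hand.

```python
from typing import List

Matrix = List[List[float]]


def relu(vector: List[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vector]


def linear(weights: Matrix, bias: List[float], x: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row, plus its bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def compress(x: List[float], w1: Matrix, b1: List[float],
             w2: Matrix, b2: List[float]) -> List[float]:
    """Two FC layers with a ReLU in between (assumed placement)."""
    hidden = relu(linear(w1, b1, x))
    return linear(w2, b2, hidden)
```

Choosing fewer rows in `w1` and `w2` than the input length gives the dimensionality reduction described in step S133.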
[0057] Step S140: Determine the target interaction command based on the fused data.
[0058] This application uses the Transformer model to classify the interaction commands corresponding to each modality based on the fused data. In the event of command conflicts, a conflict resolution mechanism is first used to resolve the conflicts, and then the final target interaction command is determined based on the conflict resolution results. Referring to Figure 4, which is a flowchart of step S140 provided in an embodiment of this application, step S140 includes steps S141 to S144.
[0059] Step S141: Identify the interaction commands corresponding to each modality based on the fused data.
[0060] The Transformer model can be used to identify the interaction commands corresponding to each modality based on the fused data. Specifically, as shown in expression (14), the fused features can be input into a Transformer model to obtain the interaction commands and command classification probabilities corresponding to each modality output by the Transformer model. The output layer of the Transformer model uses the softmax activation function.
[0061] In expression (14), the output layer has a weight and a bias corresponding to each interaction command, and K represents the total number of cockpit interaction command categories.
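Only the softmax output layer is sketched here; the Transformer encoder that would produce the input representation is omitted. The function names and the command labels are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over K command categories."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused: List[float], weights: List[List[float]],
             biases: List[float], labels: List[str]) -> Tuple[str, float]:
    """Return the most probable command and its classification probability."""
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

The returned probability can serve as the command classification probability that later steps use alongside the recognized command.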
[0062] Step S142: Detect whether there are command conflicts between the interaction commands corresponding to each modality.
[0063] A command conflict refers to two contradictory commands. For example, "turn on the air conditioner" and "turn off the air conditioner" conflict. Similarly, "use navigation voice" and "play music" conflict. Command conflicts need to be resolved; otherwise, they can cause errors in the execution of the commands and degrade the user experience.
[0064] Based on the execution device and content of the interaction commands, it is possible to detect, pair by pair, whether there are command conflicts between the interaction commands corresponding to each modality. Specifically, for any two interaction commands, it is first determined whether the execution devices of the two commands are the same. If the execution devices are different, it is determined that there is no command conflict between the two commands (for example, "turn off the air conditioner" and "play music" do not conflict). If the execution devices are the same, it is further determined whether the command content of the two commands is contradictory. If the command content is contradictory, it is determined that there is a command conflict between the two commands (for example, "turn on the air conditioner" and "turn off the air conditioner" conflict). If the command content is not contradictory, it is determined that there is no command conflict between the two commands (for example, "turn on the air conditioner" and "set the air conditioner temperature to 25 degrees" do not conflict).
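The two-stage check (same execution device first, then contradictory content) can be sketched as below. The `Command` type, the action strings, and the `CONTRADICTORY` lookup table are hypothetical; how contradictory content is actually recognized is not specified in the application.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Command:
    device: str  # execution device, e.g. "air_conditioner"
    action: str  # command content, e.g. "turn_on"


# Hypothetical table of action pairs that contradict each other on one device.
CONTRADICTORY: Set[FrozenSet[str]] = {frozenset({"turn_on", "turn_off"})}


def conflicts(a: Command, b: Command) -> bool:
    """Two commands conflict only if they target the same device with contradictory content."""
    if a.device != b.device:
        return False
    return frozenset({a.action, b.action}) in CONTRADICTORY
```

Applying `conflicts` to every pair of recognized commands yields the conflict relationships that step S143 groups and resolves.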
[0065] If there is a conflict between the interactive instructions corresponding to each modality, the conflict is resolved, and the target interactive instruction is determined based on the conflict resolution result (step S143).
[0066] If there is no conflict between the interaction instructions corresponding to each modality, then all interaction instructions are determined as the target interaction instructions (step S144).
[0067] Step S143: Resolve the instruction conflict and determine the target interaction instruction based on the instruction conflict resolution result.
[0068] Referring to Figure 5, which is a flowchart of step S143 provided in an embodiment of this application, step S143 includes steps S1431 to S1433.
[0069] Step S1431: Determine the command conflict situation between the interactive commands corresponding to each modality.
[0070] When detecting whether there are command conflicts between the interactive commands corresponding to each modality, if a command conflict is detected, the interactive commands with command conflict relationships can be marked in order to determine the command conflict status of all interactive commands.
[0071] Assuming that the multimodal interaction in the smart cockpit includes the voice, gesture, touch, and visual modalities, all interaction commands include the interaction command corresponding to the voice modality (①), the interaction command corresponding to the gesture modality (②), the interaction command corresponding to the touch modality (③), and the interaction command corresponding to the visual modality (④). As an example, the command conflict situation may be one of the cases shown in Table 1: (1) the interaction commands form two groups, and the commands within each group conflict with each other; (2) all interaction commands conflict with one another in pairs; (3) three of the interaction commands conflict with one another in pairs, and the remaining interaction command has no conflict; (4) two interaction commands conflict with each other, and the remaining two interaction commands have no conflict.
[0072] Table 1
Step S1432: If the instruction conflict situation is that all interactive instructions have instruction conflicts, then resolve the instruction conflicts and determine the instruction conflict resolution result as the target interactive instruction.
[0073] Step S1433: If the instruction conflict situation is that some interactive instructions have instruction conflicts and other interactive instructions do not, then resolve the instruction conflicts and determine both the instruction conflict resolution result and the interactive instructions that do not have instruction conflicts as the target interactive instructions.
[0074] In this embodiment of the application, resolving instruction conflicts includes: for each interactive instruction in the interactive instructions that have the instruction conflict, determining the conflict cost of each interactive instruction; determining interactive instructions with instruction conflict relationship as a group, and selecting the interactive instruction with the minimum conflict cost from the interactive instructions in the same group as the instruction conflict resolution result.
[0075] As an example, if the instruction conflict situation is that all interactive instructions conflict, such as the case in Table 1 where "①, ②, ③, and ④ conflict with each other", then the conflict cost of each interactive instruction ①, ②, ③, and ④ can be calculated separately. The interactive instructions ①, ②, ③, and ④ with instruction conflict relationship are determined as a group. The interactive instruction ① with the smallest conflict cost in the same group is selected as the instruction conflict resolution result. That is, interactive instruction ① is the target interactive instruction.
[0076] As an example, if the instruction conflict situation is that all interactive instructions conflict, such as the case shown in Table 1 where "① and ② conflict, and ③ and ④ conflict", then the conflict cost of each of the interactive instructions ①, ②, ③, and ④ can be calculated separately. Interactive instructions ① and ② with instruction conflict relationship are determined as one group, and interactive instructions ③ and ④ with instruction conflict relationship are determined as another group. From the interactive instructions in the same group, interactive instructions ① and ③ with the smallest conflict cost are selected as the instruction conflict resolution result. That is, interactive instructions ① and ③ are the target interactive instructions.
[0077] As an example, if the instruction conflict situation involves some interactive instructions conflicting while others do not, such as the case shown in Table 1 where "①, ②, and ③ all have instruction conflicts in pairs, while ④ does not have an instruction conflict," then the conflict costs of the interactive instructions ①, ②, and ③ that have instruction conflicts can be calculated separately. The interactive instructions ①, ②, and ③ that have instruction conflict relationships are grouped together. From the interactive instructions in the same group, the interactive instruction ① with the lowest conflict cost is selected as the instruction conflict resolution result. The instruction conflict resolution result ① and the interactive instruction ④ that does not have an instruction conflict are both determined as the target interactive instructions.
[0078] As an example, if the instruction conflict situation involves some interactive instructions conflicting while others do not, such as the case shown in Table 1 where "① and ② conflict, while ③ and ④ do not conflict", then the conflict costs of the conflicting interactive instructions ① and ② can be calculated separately. The interactive instructions ① and ② with the conflict relationship are grouped together. From the interactive instructions in the same group, the interactive instruction ① with the lowest conflict cost is selected as the instruction conflict resolution result. The instruction conflict resolution result ①, as well as the interactive instructions ③ and ④ without conflict, are all determined as the target interactive instructions.
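The grouping and minimum-cost selection in the four examples above can be sketched as follows. Treating each group as a connected component of the conflict relation is an assumption that matches the examples, where every group is fully pairwise conflicting; the function name and the string command identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def resolve_conflicts(commands: List[str],
                      conflict_pairs: List[Tuple[str, str]],
                      cost: Dict[str, float]) -> List[str]:
    """Keep the cheapest command of each conflict group and every conflict-free command."""
    neighbours: Dict[str, Set[str]] = {c: set() for c in commands}
    for a, b in conflict_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    targets: List[str] = []
    seen: Set[str] = set()
    for command in commands:
        if command in seen:
            continue
        # Collect every command reachable through conflict relationships.
        group: List[str] = []
        stack = [command]
        seen.add(command)
        while stack:
            current = stack.pop()
            group.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        if len(group) > 1:
            # Conflict group: only the minimum-cost command survives.
            targets.append(min(group, key=cost.__getitem__))
        else:
            # No conflict: the command is a target as-is, no cost needed.
            targets.append(command)
    return targets
```

Conflict costs are only looked up for commands that belong to a conflict group, which mirrors the text: costs are calculated only for the interactive instructions that have instruction conflicts.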
[0079] In this embodiment of the application, for each interactive instruction with an instruction conflict, determining the conflict cost of each interactive instruction includes: determining the conflict cost of each interactive instruction based on the user cost, safety cost, and system cost of each interactive instruction, wherein the user cost is used to quantify the deviation between the interactive instruction and user habits, the safety cost is used to quantify the impact of the interactive instruction on vehicle driving safety, and the system cost is used to quantify the system overhead of executing the interactive instruction.
[0080] In this embodiment of the application, determining the conflict cost of each conflicting interactive instruction based on its user cost, safety cost, and system cost includes the following. As shown in expression (15), the user cost of each interactive instruction is determined based on the modal weight coefficient corresponding to that instruction. As shown in expression (16), the safety cost of each interactive instruction is determined based on the operation time of the modality corresponding to that instruction. As shown in expression (17), the system cost of each interactive instruction is determined based on the energy consumption requirement corresponding to that instruction. As shown in expression (18), the user cost, safety cost, and system cost of each interactive instruction are weighted to obtain the conflict cost of that instruction.
[0081] Expression (15) uses an indicator function over the user's preferred modality; for example, if the user frequently uses voice, the preferred modality is voice. The preferred modality can be determined based on the user's historical interaction data, so expression (15) calculates the deviation between conflicting instructions and user habits based on the user's historical interaction preferences. Expression (16) uses the operation time of modality m and the maximum time threshold for safe operation. Expression (17) uses the difference in energy consumption for executing the conflicting instructions and the maximum energy consumption for a single operation. In expression (18), the three weighting coefficients used to calculate the conflict cost can be dynamically adapted to the scenario (for example, a safety-first scenario).
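The cost terms can be illustrated with a sketch that uses only the quantities named above. Expressions (15) to (18) are not reproduced in this text, so every functional form here is an assumption: the user cost is taken as 1 minus the modal weight when the modality is the preferred one (and 1 otherwise), the safety and system costs as clipped ratios, and the default cost weights as a safety-first choice.

```python
from typing import Tuple


def conflict_cost(modality: str, preferred_modality: str, modal_weight: float,
                  operation_seconds: float, max_safe_seconds: float,
                  energy_delta: float, max_energy: float,
                  cost_weights: Tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    """Weighted sum of user, safety, and system cost (assumed forms)."""
    # Indicator function: 1 if the command uses the user's preferred modality.
    indicator = 1.0 if modality == preferred_modality else 0.0
    user_cost = 1.0 - modal_weight * indicator
    # Longer operations relative to the safe threshold are more costly.
    safety_cost = min(operation_seconds / max_safe_seconds, 1.0)
    # Energy difference relative to the maximum for a single operation.
    system_cost = min(abs(energy_delta) / max_energy, 1.0)
    w_user, w_safety, w_system = cost_weights
    return w_user * user_cost + w_safety * safety_cost + w_system * system_cost
```

With these assumed forms, a short voice command from a voice-preferring user costs less than a long touch operation, so the voice command would win the conflict.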
[0082] This application quantifies the deviation between interaction commands and user habits through the user cost, so that interaction commands are selected according to user preferences as far as possible, thereby improving user satisfaction. It quantifies the impact of interaction commands on vehicle driving safety through the safety cost (e.g., touch operation is more dangerous than voice operation in high-speed scenarios), reducing the incidence of danger in specific scenarios (e.g., high-speed scenarios) and ensuring driving safety. It quantifies the system overhead of executing interaction commands through the system cost (e.g., the energy consumption of frequent air conditioning start-stop), saving system overhead. Therefore, by determining the conflict cost of each interaction command based on its user cost, safety cost, and system cost, the application can balance user preferences, vehicle safety, and vehicle system overhead, improving the user experience. In experiments in which the inventors applied the multimodal interaction method of this application, the safety cost quantification reduced the incidence of dangerous operations (such as complex touch controls) in high-speed scenarios by up to 70%.
[0083] Step S144: Determine all interactive instructions as target interactive instructions.
[0084] Steps S141 to S144 have the following technical effects: by detecting whether there is a conflict between interactive commands, and resolving the conflict first and then determining the target interactive command based on the conflict resolution result, the occurrence rate of command conflicts between multimodal interactions can be reduced and user interaction satisfaction can be improved.
[0085] Step S150: Execute the target interactive instruction.
[0086] The intelligent cockpit can control the corresponding execution device to execute the target interaction command based on the target interaction command. For example, if the target interaction command is "turn on the air conditioner", the intelligent cockpit can control the air conditioner to turn on.
[0087] Steps S110 to S150 have the following technical effects: Based on the attention mechanism, the modal weight coefficients of each modality are determined according to the modal confidence, environmental adaptability and user dependence of each modality, and multimodal data fusion is performed according to the modal weight coefficients of each modality. The target interaction command is determined based on the fused data, which can improve the accuracy of recognizing multimodal interaction commands and reduce the response latency of multimodal interaction.
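The weighting, splicing, and compression of modal data can be illustrated with a deliberately simplified sketch. The fusion network in this application is a learned network; the fixed projection matrix below is only a hypothetical stand-in for its dimensionality compression, and the function name `fuse_modalities` and all numeric values are assumptions for illustration.

```python
def fuse_modalities(features, weights, projection):
    """Weight each modality's features, concatenate them, then compress.

    features:   dict modality -> list of floats
    weights:    dict modality -> modal weight coefficient
    projection: list of rows; each row has one entry per concatenated feature
    """
    # Steps 1 and 2: weight each modality and concatenate in a fixed order.
    weighted = []
    for modality in sorted(features):
        w = weights[modality]
        weighted.extend(w * x for x in features[modality])

    if any(len(row) != len(weighted) for row in projection):
        raise ValueError("each projection row must match the concatenated length")

    # Step 3: compress with a linear projection (assumed stand-in for the network).
    return [sum(p * x for p, x in zip(row, weighted)) for row in projection]
```

A modality with a low weight coefficient contributes proportionally less to every component of the fused vector, which is the intended effect of weighting before concatenation.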
[0088] Figure 6 is a partial flowchart of a multimodal interaction method provided in another embodiment of this application. In some embodiments, in addition to steps S110 to S150 described above, the multimodal interaction method may also include steps S210 to S220.
[0089] Step S210: After executing the target interaction command, collect interaction feedback information, including the response delay of this interaction, the number of times the user corrected the command, and the user satisfaction.
[0090] This application provides a personalized adaptation module based on reinforcement learning. With user interaction satisfaction as the objective, the module optimizes the modal response strategy through reinforcement learning and learns user preference parameters θ (e.g., the gesture recognition threshold for left-handed users, or the priority of voice commands in commuting scenarios).
[0091] Step S220: Update the modal weight coefficients of each modality based on the interactive feedback information.
[0092] Rewards can be calculated based on the interactive feedback information; based on the rewards, the modal weight coefficients of each modality are updated using the Proximal Policy Optimization (PPO) algorithm. This application updates the modal weight coefficients using the PPO algorithm, ensuring a moderate update magnitude and avoiding both excessively large updates that lead to policy instability and excessively small updates that fail to induce policy change.
[0093] As shown in expression (18), the state space S involved in the PPO algorithm includes the cabin environment state (noise SNR, illumination Illum, ambient temperature, equipment humidity), the user state (driving time, expression exp), and the modal state (modal weights).
[0094] As shown in expression (19), the action space A involved in the PPO algorithm includes modal response strategy adjustment (e.g., increasing speech weights and optimizing gesture recognition thresholds).
[0095] Here, the two components of the action are the adjustment amount for the modal weight coefficients and the adjustment amount for the modality recognition threshold.
[0096] As shown in expression (20), the reward function R involved in the PPO algorithm takes the number of times the user corrects the instruction, c (the fewer the better), and the response delay of this interaction (the shorter the better) as penalty terms, and the user's satisfaction rating s (a user-initiated rating) as the reward term.
[0097] Here, the response delay is the total time consumed in the process of "modal data acquisition + CMFN network feature fusion + instruction execution feedback"; its calculation formula is shown in expression (21).
[0098] Here, the terms of expression (21) are the data collection time, the CMFN fusion time, and the execution response time; the control target is for the response delay to be less than 0.5 seconds.
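The reward and response delay described above can be sketched as follows. The additive delay follows the three stages named in the text; the linear form of the reward and the coefficients `k_s`, `k_c`, and `k_t` are assumptions for illustration, since the exact form of expression (20) is not reproduced here.

```python
def response_delay(t_collect_s, t_fusion_s, t_execute_s):
    # Total delay: data collection + feature fusion + execution response.
    return t_collect_s + t_fusion_s + t_execute_s


def meets_delay_target(delay_s, target_s=0.5):
    # The text states a target of less than 0.5 seconds.
    return delay_s < target_s


def reward(satisfaction, corrections, delay_s, k_s=1.0, k_c=1.0, k_t=1.0):
    # Satisfaction is the reward term; corrections and delay are penalties.
    # The coefficients k_s, k_c, k_t are hypothetical.
    return k_s * satisfaction - k_c * corrections - k_t * delay_s
```

For instance, with hypothetical stage times of 0.10 s, 0.15 s, and 0.10 s the delay is 0.35 s, which meets the target, and an uncorrected interaction rated s=5 yields a higher reward than the same interaction with one correction.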
[0099] This application updates the policy network parameters θ using the PPO algorithm. The objective function of the PPO algorithm is expression (22), and the goal of reinforcement learning is to optimize the policy so as to maximize the cumulative reward R.
[0100] Here, $\pi_\theta$ is the new strategy (current strategy); $\pi_{\theta_{\text{old}}}$ is the old strategy; $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the strategy ratio, that is, the ratio of the new strategy to the old strategy, indicating the degree of strategy change; $\hat{A}_t$ is the advantage function; and $\epsilon$ is the clip (clipping) parameter. The strategy ratio is limited to the range $[1-\epsilon, 1+\epsilon]$: if the strategy ratio falls outside this clipping range, it is penalized to keep the update magnitude moderate, avoiding both excessively large updates that make the updated strategy unstable and excessively small updates that leave the strategy unchanged.
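The clipping behaviour described above corresponds to the standard clipped surrogate objective of PPO. The following minimal sketch computes it from log-probabilities in plain Python; the function name and the default clip parameter of 0.2 are illustrative assumptions and are not values taken from this application.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (to be maximised)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Strategy ratio of the new policy to the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Limit the ratio to [1 - epsilon, 1 + epsilon].
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum removes any gain from leaving the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because of the minimum, a ratio of 1.5 with a positive advantage contributes only as much as a ratio of 1.2 would, so the optimizer has no incentive to move the policy further in a single update.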
[0101] Specifically, the PPO algorithm update strategy process includes steps S1 to S6.
[0102] Step S1: After executing the target interaction instruction using the old strategy, collect the state space, action space, and reward.
[0103] Step S2: Use Generalized Advantage Estimation (GAE) to calculate the advantage function (advantage value) for this interaction based on the reward, and calculate the target reward based on the reward; use the advantage function to guide policy improvement.
[0104] Step S3: Calculate the policy ratio of the improved new policy to the old policy.
[0105] Step S4: If the policy ratio is outside the clipping range, it is penalized to keep the update magnitude moderate, avoiding both excessively large updates that could lead to policy instability and excessively small updates that could result in no policy change.
[0106] Step S5: Update the value function by minimizing the mean squared error between the current state value function and the target reward, where the target reward (cumulative reward) is calculated based on the reward.
Step S6: Repeat steps S1 to S5 for each interaction to gradually optimize the strategy until the strategy converges (the average cumulative reward no longer increases significantly, or the KL divergence of the strategy remains in a small range, or the total number of iterations reaches the iteration threshold).
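Steps S2 and S5 can be illustrated with the usual Generalized Advantage Estimation recursion and a mean-squared-error value loss. This is a minimal sketch: the discount factor `gamma`, the GAE parameter `lam`, and the function names are assumptions, and the value estimates would in practice come from a learned value function.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Return (advantages, value_targets).

    rewards: list of per-step rewards, length T
    values:  list of state-value estimates, length T + 1 (last is bootstrap)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    # Value targets: advantage plus the current value estimate.
    targets = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, targets


def value_loss(predicted, targets):
    # Mean squared error between value predictions and targets.
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)
```

The advantages feed the policy update of steps S3 and S4, while the targets are what the value function is regressed towards in step S5.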
[0107] Steps S210 to S220 have the following technical effects: Based on the response delay of this interaction, the number of times the user corrects the command, and the user satisfaction, updating the modal weight coefficients of each modality can learn user preferences, improve the accuracy of recognizing multimodal commands and reduce interaction response delay. For example, it can improve the accuracy of gesture recognition for left-handed users and reduce voice response delay in commuting scenarios.
[0108] Figure 7 is a flowchart of a multimodal interaction method provided in another embodiment of this application. The multimodal interaction method includes steps S310 to S3110.
[0109] Step S310: Collect multimodal data using a microphone, camera, touchscreen, and DMS.
[0110] Step S320: Preprocess the multimodal data by using the 3σ rule to remove outliers and obtain multimodal features.
[0111] Step S330: Based on the attention mechanism, the modality confidence, environment adaptability, and user dependence of each modality are weighted and calculated to obtain the modality weight coefficient of each modality.
[0112] Step S340: Through the three-order fusion architecture of "attention weighting - feature concatenation - dimensionality compression" of the CMFN network, multimodal features are fused into a unified interactive feature vector.
[0113] By calculating the modal confidence of each modality based on the attention mechanism using an attention weight model and fusing features across modalities through a CMFN network, the accuracy of multimodal feature fusion and multimodal command recognition can be improved. In experiments, the inventors applied the multimodal interaction method of this application, and the accuracy of multimodal command recognition was improved by ≥25% (from 70% in the prior art to over 95%).
[0114] Step S350: Input the unified interaction feature vector into the Transformer model to obtain the interaction instructions corresponding to each modality output by the Transformer model.
[0115] Step S360: Detect whether there are command conflicts in the interaction commands corresponding to each modality.
[0116] If there is an instruction conflict between the interaction instructions corresponding to each modality, the instruction conflict is resolved and the target interaction instruction is determined based on the instruction conflict resolution result (step S370).
[0117] If there is no instruction conflict between the interaction instructions corresponding to each modality, then the interaction instructions corresponding to each modality are all determined as the target interaction instructions (step S380).
[0118] Step S370: Resolve instruction conflicts and determine the target interactive instruction based on the result of the instruction conflict.
[0119] This application uses a conflict cost model to resolve modal command conflicts, which can reduce the occurrence rate of modal conflicts and improve user interaction satisfaction. In experiments, the inventors applied the multimodal interaction method of this application, reducing the occurrence rate of modal conflicts by ≤80% and improving user interaction satisfaction by ≥20%.
[0120] Step S380: Determine the interaction instructions corresponding to each modality as the target interaction instructions.
[0121] Step S390: Execute the target interactive instruction.
[0122] Step S3100: Collect interaction feedback information, including the response delay of this interaction, the number of times the user corrected the command, and the user satisfaction.
[0123] Step S3110: Based on the interactive feedback information, update the user preference parameters through the PPO algorithm and optimize the modal weights and response strategies for the next round.
[0124] This application aims to improve user interaction satisfaction by using the PPO algorithm to optimize modal response strategies through reinforcement learning. It learns user preference parameters θ (e.g., gesture recognition threshold for left-handed users, voice command priority in commuting scenarios), enabling personalized adaptation of multimodal interactions. This makes multimodal interactions more user-preferred, thereby increasing user satisfaction. In experiments, the inventors applied this multimodal interaction method, achieving a ≥30% improvement in gesture recognition accuracy for left-handed users and a ≤40% reduction in voice response latency in commuting scenarios.
[0125] For a detailed description of steps S310 to S3110, please refer to the relevant sections above. They are not repeated in this embodiment.
[0126] Steps S310 to S3110 have the following technical effects: by quantifying modal value → dynamically fusing features → resolving interaction conflicts → learning user preferences, a multimodal interaction full-process optimization method is formed, which can completely solve the core pain points of existing technologies such as "shallow fusion, low accuracy, chaotic conflicts, and lack of personalization". It can be widely applied to the intelligent cockpit human-machine interaction system of vehicles and has significant industrial value and feasibility of implementation.
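The outlier removal in step S320 uses the 3σ rule. A minimal sketch for a one-dimensional series of samples is shown below; treating each signal as a flat list of floats, using the population standard deviation, and the function name `remove_outliers_3sigma` are assumptions for illustration.

```python
from statistics import mean, pstdev


def remove_outliers_3sigma(samples):
    """Drop samples further than three standard deviations from the mean."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        # All samples are identical, so there is nothing to remove.
        return list(samples)
    return [x for x in samples if abs(x - mu) <= 3 * sigma]
```

Note that with very few samples no point can lie more than three standard deviations from the mean, so the rule only becomes effective on reasonably long windows of sensor data.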
[0127] As an example, consider a high-speed driving scenario in which the user simultaneously issues the voice command "Set the air conditioner to 24℃" and the gesture "Increase the fan speed by 1 level". The multimodal interaction method of this application then executes as follows:
1. Data acquisition: speech features are extracted from the speech signal using MFCC, and gesture key point coordinates are extracted via MediaPipe.
[0128] 2. Modal weight coefficient calculation: the noise in the high-speed scenario is SNR=20dB, so the environmental adaptability of the voice modality is 0.5; the user used voice 60 times in the last 100 interactions, so the user dependence is 0.6; the ASR recognition confidence is 0.9, so the modality confidence is 0.89; taking α=0.2, β=0.5, γ=0.3, the voice modality weight coefficient is calculated as 0.608; the gesture modality weight coefficient is similarly calculated as 0.35.
[0129] 3. Feature fusion and instruction recognition: the weighted voice and gesture features are spliced, and the fused features are output through the CMFN. The Transformer model recognizes the interaction command as "Air conditioner 24℃, fan speed +1".
[0130] 4. Conflict-free execution: Interactive commands are conflict-free. The system executes the interactive command "Air conditioner 24℃, fan speed +1" and returns interactive feedback information.
[0131] 5. Strategy update: the user does not correct the instruction and gives a satisfaction rating of s=5, so a reward is obtained. The PPO algorithm updates the strategy, increasing the speech weight in high-speed scenarios to 0.65.
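The voice weight in step 2 of this example can be checked numerically. The sketch below assumes that α multiplies the modality confidence, β the environmental adaptability, and γ the user dependence; this pairing is inferred from the numbers in the example, which it reproduces, and the function names are hypothetical.

```python
def user_dependence(modality_uses, total_interactions):
    # Share of recent interactions in which the modality was used.
    return modality_uses / total_interactions


def modality_weight(confidence, env_adaptability, dependence,
                    alpha, beta, gamma):
    # Weighted sum of the three factors (assumed pairing of coefficients).
    return alpha * confidence + beta * env_adaptability + gamma * dependence


w_voice = modality_weight(
    confidence=0.89,                      # modality confidence from the example
    env_adaptability=0.5,                 # value given for SNR = 20 dB
    dependence=user_dependence(60, 100),  # 60 voice uses in 100 interactions
    alpha=0.2, beta=0.5, gamma=0.3,
)
print(round(w_voice, 3))  # 0.608
```

The three contributions are 0.178, 0.25, and 0.18, so in this high-noise scenario the environmental term carries the largest share of the voice weight.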
[0132] As another example, in an urban commuting scenario (SNR=30dB, Illum=500lux), a left-handed user simultaneously issues the voice command "play music" and the gesture "switch navigation route".
[0133] 1. Data acquisition: speech features are extracted from the speech signal using MFCC, and gesture key point coordinates are extracted via MediaPipe.
[0134] 2. Calculation of modal weight coefficients. Speech modality: the user used voice 40 times in the last 100 interactions, so the user dependence is 0.4; taking α=0.4, β=0.2, γ=0.4, the voice modality weight coefficient is calculated from the modality confidence, environmental adaptability, and user dependence.
[0135] Gesture modality: the gesture recognition confidence is 0.85 (left-handed users saw a 15% improvement after personalized training), so the modality confidence is 0.83; the user used gestures 30 times in the last 100 interactions, so the user dependence is 0.3; taking α=0.4, β=0.2, γ=0.4, the gesture modality weight coefficient is calculated in the same way.
[0136] 3. Feature fusion and instruction recognition: the weighted voice and gesture features are spliced, and the fused features are output through the CMFN. The Transformer model recognizes the interaction command as "play music, switch navigation routes".
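[0137] 4. Conflict cost calculation and instruction execution. For the interaction command "Play music", corresponding to the voice modality: the user prefers voice; the operation time is 0.5 s; the user cost, safety cost, and system cost are combined to obtain the conflict cost of this command.
[0138] For the interaction command "Switch navigation route", corresponding to the gesture modality, the user cost, safety cost, system cost, and conflict cost are calculated in the same way.
[0139] The conflict cost of "Play music" is smaller than that of "Switch navigation route", so the system executes the interaction command "Play music" and returns interaction feedback information.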
[0140] 5. Policy update: the user did not correct the instruction (c=0), the response delay is 0.35 s, and the satisfaction is s=5, from which the reward is calculated. After the PPO algorithm updates its strategy, the gesture weight for left-handed users is increased to 0.59 in the next round, further improving the gesture recognition accuracy to 88%.
[0141] Figure 8 is a structural block diagram of a multimodal interaction device provided in an embodiment of this application. The multimodal interaction device 100 includes a data acquisition module 110, a weight determination module 120, a data fusion module 130, an instruction determination module 140, and an instruction execution module 150. The data acquisition module 110 is used to acquire multimodal data, which includes at least two of the following: voice modality data, gesture modality data, touch modality data, and visual modality data. The weight determination module 120 is used to determine the modality weight coefficient of each modality based on the modality confidence, environmental adaptability, and user dependence of each modality. The data fusion module 130 is used to fuse the multimodal data according to the modality weight coefficients of each modality to obtain fused data. The instruction determination module 140 is used to determine the target interaction instruction based on the fused data.
[0142] The instruction execution module 150 is used to execute target interactive instructions.
[0143] In some embodiments, the weight determination module 120 is further configured to determine the modality confidence of each modality based on the recognition accuracy of each modality; determine the environmental adaptability of each modality based on the degree of interference of the current environment on each modality; determine the user dependence of each modality based on the historical usage frequency of each modality; and perform weighted calculation on the modality confidence, environmental adaptability and user dependence of each modality to obtain the modality weight coefficient of each modality.
[0144] In some embodiments, the data fusion module 130 is further configured to use the modal weight coefficients of each modality to perform weighted calculations on each modal data respectively; to splice the weighted modal data to obtain spliced data; and to compress the spliced data to obtain fused data.
[0145] In some embodiments, the instruction determination module 140 is further configured to identify the interaction instructions corresponding to each modality based on the fused data; detect whether there is an instruction conflict between the interaction instructions corresponding to each modality; if there is an instruction conflict between the interaction instructions corresponding to each modality, resolve the instruction conflict and determine the target interaction instruction based on the instruction conflict resolution result; if there is no instruction conflict between the interaction instructions corresponding to each modality, determine all interaction instructions as the target interaction instruction.
[0146] In some embodiments, the instruction determination module 140 is further configured to determine the instruction conflict situation between the interactive instructions corresponding to each modality; if the instruction conflict situation is that all interactive instructions have the instruction conflict, then the instruction conflict is resolved, and the instruction conflict resolution result is determined as the target interactive instruction; if the instruction conflict situation is that some interactive instructions have the instruction conflict and another part of the interactive instructions do not have the instruction conflict, then the instruction conflict is eliminated, and the instruction conflict resolution result and the interactive instructions that do not have the instruction conflict are both determined as the target interactive instructions.
[0147] In some embodiments, the instruction determination module 140 is further configured to determine the conflict cost of each interactive instruction for each interactive instruction in which the instruction conflict exists; determine the interactive instructions with instruction conflict relationship as a group, and select the interactive instruction with the smallest conflict cost from the interactive instructions in the same group as the instruction conflict resolution result.
[0148] In some embodiments, the instruction determination module 140 is further configured to determine the conflict cost of each interaction instruction based on the user cost, safety cost, and system cost of each interaction instruction, wherein the user cost is used to quantify the deviation between the interaction instruction and user habits, the safety cost is used to quantify the impact of the interaction instruction on vehicle driving safety, and the system cost is used to quantify the system overhead of executing the interaction instruction.
[0149] In some embodiments, the instruction determination module 140 is further configured to determine the user cost of each interaction instruction based on the modal weight coefficient corresponding to each interaction instruction; determine the safety cost of each interaction instruction based on the operation time of the modality corresponding to each interaction instruction; determine the system cost of each interaction instruction based on the energy consumption requirement corresponding to each interaction instruction; and perform a weighted calculation on the user cost, safety cost, and system cost of each interaction instruction to obtain the conflict cost of each interaction instruction.
[0150] In some embodiments, the multimodal interaction device 100 may further include a reinforcement learning module. The reinforcement learning module is used to collect interaction feedback information after executing a target interaction instruction, the interaction feedback information including the response delay of the current interaction, the number of times the user corrected the instruction, and user satisfaction; and to update the modal weight coefficients of each modality based on the interaction feedback information.
[0151] In some embodiments, the reinforcement learning module is further configured to calculate a reward based on the interaction feedback information; and update the modality weight coefficients of each modality using a proximal policy optimization algorithm based on the reward. Those skilled in the art will clearly understand that the apparatus provided in the embodiments of this application can implement the methods provided in the embodiments of this application. The specific working process of the described apparatus and modules can be found in the corresponding processes of the methods in the embodiments of this application, and will not be repeated here.
[0152] In the embodiments provided in this application, the coupling, direct coupling, or communication connection between the modules shown or discussed may be indirect coupling or communication coupling through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms. The embodiments of this application do not impose specific limitations on this.
[0153] Furthermore, the functional modules in the embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules described above can be implemented in hardware or as software functional modules.
[0154] Figure 9 is a structural block diagram of an electronic device provided in an embodiment of this application. The electronic device 200 may include a memory 210 and a processor 220. The memory 210 stores an application program, which is configured to cause the processor 220 to execute the method provided in the embodiment of this application when invoked by the processor 220.
[0155] The memory 210 may include random access memory (RAM) or read-only memory (ROM). The memory 210 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 210 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing the various method embodiments described above, etc. The data storage area may store data created by the electronic device 200 during use.
[0156] Processor 220 may include one or more processing cores. Processor 220 uses various interfaces and lines to connect to various parts of the entire electronic device 200, and is used to run or execute instructions, programs, code sets or instruction sets stored in memory 210, as well as to call and run or execute data stored in memory 210, and perform various functions of electronic device 200 and process data.
[0157] The processor 220 can be implemented using at least one of the following hardware forms: Digital Signal Processor (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 220 can integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem can also be implemented separately as a communication chip, without being integrated into the processor 220.
[0158] This application provides a vehicle that includes the electronic device 200.
[0159] This application provides a computer-readable storage medium. The computer-readable storage medium stores program code configured to cause a processor to execute the methods provided in this application when invoked by the processor. The computer-readable storage medium may be an electronic storage device such as flash memory, electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), hard disk, or ROM. In some embodiments, the computer-readable storage medium includes a non-transitory computer-readable storage medium (Non-TCRSM). The computer-readable storage medium has storage space for program code that performs any of the method steps described above. This program code can be read from or written to one or more computer program products. The program code may be compressed in an appropriate form.
[0160] This application also provides a computer program product, which includes a computer program that, when invoked by a processor, causes the processor to execute the method provided in the embodiments of this application.
[0161] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A multimodal interaction method, characterized in that it comprises: acquiring multimodal data, wherein the multimodal data includes at least two of the following: voice modal data, gesture modal data, touch modal data, and visual modal data; determining the modality weight coefficient of each modality based on the modality confidence, environmental adaptability, and user dependence of each modality, wherein the modality confidence is used to quantify the modality recognition accuracy, the environmental adaptability is used to quantify the degree of interference of the current environment on the modality, and the user dependence is used to quantify the user's modality preference; fusing the multimodal data based on the modality weight coefficients of each modality to obtain fused data; determining the target interaction command based on the fused data; and executing the target interaction command.
2. The method according to claim 1, characterized in that, The determination of the modality weight coefficient for each modality based on its modality confidence, environmental adaptability, and user dependence includes: Based on the recognition accuracy of each modality, determine the modality confidence level of each modality; Determine the environmental adaptability of each mode based on the degree of interference of the current environment on each mode; Determine the user dependency on each modality based on its historical usage frequency; The modality confidence, environmental adaptability, and user dependence of each modality are weighted and calculated to obtain the modality weight coefficient of each modality.
3. The method according to claim 1, characterized in that, The process of fusing the multimodal data based on the modality weight coefficients of each modality to obtain fused data includes: The modal weighting coefficients of each modality are used to perform weighted calculations on the data of each modality. The weighted modal data are then concatenated to obtain the concatenated data. The spliced data is compressed to obtain the merged data.
4. The method according to claim 1, characterized in that, The step of determining the target interaction command based on the fused data includes: Based on the fused data, identify the interaction commands corresponding to each modality; Detect whether there are command conflicts between the interaction commands corresponding to each modality; If there are command conflicts among the interaction commands corresponding to each modality, the command conflicts are resolved, and the target interaction command is determined based on the command conflict resolution result. If there is no conflict between the interaction instructions corresponding to each modality, then all interaction instructions are determined as the target interaction instruction.
5. The method according to claim 4, characterized in that, The process of resolving the instruction conflict and determining the target interaction instruction based on the instruction conflict resolution result includes: Determine the command conflicts between the interaction commands corresponding to each modality; If the instruction conflict occurs in all interactive instructions, then the instruction conflict is resolved, and the result of the instruction conflict resolution is determined as the target interactive instruction. If the instruction conflict situation is that some interactive instructions have the instruction conflict and other interactive instructions do not have the instruction conflict, then the instruction conflict is eliminated, and the result of the instruction conflict resolution and the interactive instructions without the instruction conflict are both determined as the target interactive instructions.
6. The method according to claim 5, characterized in that, The resolution of the instruction conflict includes: For each interactive instruction among those with conflicting instructions, determine the conflict cost of each interactive instruction; Interactive instructions with conflicting relationships are grouped together, and the interactive instruction with the lowest conflict cost is selected from the interactive instructions in the same group as the result of instruction conflict resolution.
7. The method according to claim 6, characterized in that determining the conflict cost of each interaction instruction includes: determining the conflict cost of each interaction instruction based on the user cost, safety cost, and system cost of each interaction instruction, wherein the user cost is used to quantify the deviation between the interaction instruction and user habits, the safety cost is used to quantify the impact of the interaction instruction on vehicle driving safety, and the system cost is used to quantify the system overhead of executing the interaction instruction.
8. The method according to claim 7, characterized in that the step of determining the conflict cost of each interaction instruction based on the user cost, safety cost, and system cost of each interaction instruction includes: determining the user cost of each interaction instruction based on the modal weight coefficient corresponding to each interaction instruction; determining the safety cost of each interaction instruction based on the operation time of the modality corresponding to each interaction instruction; determining the system cost of each interaction instruction based on the energy consumption requirement corresponding to each interaction instruction; and performing a weighted calculation on the user cost, safety cost, and system cost of each interaction instruction to obtain the conflict cost of each interaction instruction.
9. The method according to claim 1, characterized in that, The method further includes: After executing the target interaction command, collect interaction feedback information, which includes the response delay of this interaction, the number of times the user corrected the command, and the user satisfaction. Based on the interactive feedback information, update the modal weight coefficients of each modality.
10. The method according to claim 9, characterized in that, The step of updating the modal weight coefficients of each modality based on the interactive feedback information includes: Calculate the reward based on the interactive feedback information; Based on the reward, the modality weight coefficients of each modality are updated using a proximal policy optimization algorithm.
11. A multimodal interaction device, characterized in that it comprises: a data acquisition module, used to acquire multimodal data, wherein the multimodal data includes at least two of the following: voice modal data, gesture modal data, touch modal data, and visual modal data; a weight determination module, used to determine the modality weight coefficient of each modality based on the modality confidence, environmental adaptability, and user dependence of each modality; a data fusion module, used to fuse the multimodal data according to the modality weight coefficients of each modality to obtain fused data; an instruction determination module, used to determine the target interaction instruction based on the fused data; and an instruction execution module, used to execute the target interaction instruction.
12. An electronic device, characterized in that it comprises: a memory and a processor, wherein the memory stores an application program that, when invoked by the processor, causes the processor to perform the method as described in any one of claims 1-5.
13. A vehicle, characterized in that it includes the electronic device as described in claim 12.
14. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores program code that, when invoked by a processor, causes the processor to perform the method as described in any one of claims 1-10.
15. A computer program product, characterized in that, The computer program product includes a computer program that, when invoked by a processor, causes the processor to perform the method as described in any one of claims 1-10.