Vehicle-mounted intelligent cockpit dialogue method and system based on multi-modal interaction

By collecting driver eye and gesture data and combining it with real-time traffic information, precise dialogue guidance is generated, which solves the problem of inappropriate dialogue guidance in complex driving scenarios in existing technologies, and improves driving safety and natural interaction.

CN122232648A · Pending · Publication Date: 2026-06-19 · BEIJING DAFANG YUNTU TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING DAFANG YUNTU TECH CO LTD
Filing Date
2026-03-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to provide accurate and timely dialogue guidance in complex driving scenarios, potentially increasing safety risks.

Method used

The system collects driver eye and gesture data through sensors in the in-vehicle smart cockpit, generates gaze direction vectors and driving behavior characteristics, and combines them with real-time road condition information for matching and evaluation to generate guiding dialogue content related to the current road conditions.

Benefits of technology

It enables precise triggering and content customization of dialogue guidance in complex driving environments, improving driving safety and natural interaction.


Abstract

This application provides a method and system for dialogue in an in-vehicle intelligent cockpit based on multimodal interaction, relating to the field of intelligent cockpit interaction technology. This application processes eye data to track gaze and generate a gaze direction vector; this vector, along with gesture data, is converted into driving behavior features with semantic labels, and the temporal change patterns of these features are analyzed to generate dynamic risk indicators; simultaneously, structured road condition information of the road ahead is received and parsed in real time through the vehicle communication module; the dynamic risk indicators are matched and evaluated with the current driving behavior features and real-time road condition information; based on the evaluation results, when the dynamic risk indicators are too high and the driver's behavior does not meet road condition expectations, guiding dialogue content related to the current road condition is automatically generated and broadcast. This allows for accurate voice guidance in complex driving scenarios by integrating driver status and real-time road conditions, improving driving safety and the naturalness of human-machine interaction.
Description

Technical Field

[0001] This application relates to the field of intelligent cockpit interaction technology, and in particular to an in-vehicle intelligent cockpit dialogue method and system based on multimodal interaction.

Background Technology

[0002] In-vehicle intelligent cockpit dialogue methods improve the efficiency of information acquisition and function control during driving through natural human-machine interaction, and have the potential to improve driving experience and safety.

[0003] In existing technologies, in-vehicle cameras are used to capture the driver's head turning angle, and voice recognition is combined to infer the driver's operating intentions in order to initiate dialogue, which to some extent achieves interaction without manual operation.

[0004] However, when vehicles face complex and rapidly changing road conditions, drivers are in a state of high tension and focus. This method struggles to accurately determine the driver's true focus of attention and cognitive load at this time, and the initiated dialogue may become disruptive due to inappropriate timing or content, even increasing safety risks. Therefore, existing technologies suffer from the technical problem of failing to provide accurate and timely dialogue guidance in complex driving scenarios.

Summary of the Invention

[0005] The purpose of this application is to provide a method and system for dialogue in an in-vehicle intelligent cockpit based on multimodal interaction, so as to solve the problem that existing technologies are unable to provide accurate and timely dialogue guidance in complex driving scenarios.

[0006] To address the aforementioned technical problems, in a first aspect, this application provides a vehicle-mounted intelligent cockpit dialogue method based on multimodal interaction, comprising:

[0007] The driver's eye and gesture data are collected through the sensors in the in-vehicle smart cockpit.

[0008] The eye data is subjected to anti-interference processing and eye tracking to generate a gaze direction vector;

[0009] The gaze direction vector and the gesture data are converted into driving behavior features with semantic labels, and dynamic risk indicators are generated based on the temporal patterns of the driving behavior features.

[0010] The vehicle communication module receives and parses structured road condition information of the road ahead in real time. The road condition information describes road events or states that are about to occur within a specific distance range and that the driver needs to anticipate.

[0011] The dynamic risk indicators are matched and evaluated against the currently active driving behavior characteristics and the road condition information to obtain the evaluation results;

[0012] Based on the evaluation results, when the dynamic risk index exceeds a preset threshold and the driving behavior characteristics do not meet the expected behavior corresponding to the road condition information, guiding dialogue content associated with the current road condition is generated and output.

[0013] Optionally, the eye data is subjected to anti-interference processing and eye tracking to generate a gaze direction vector, including:

[0014] A facial image sequence is acquired using a facial camera facing the driver, and an eye region image containing the pupil region and corneal reflection point is extracted from the facial image sequence.

[0015] Based on the image of the eye region, the geometric center point of the pupil region is determined as the pupil center point, and the corneal reflection point formed on the cornea by the near-infrared light source in the vehicle intelligent cockpit is located.

[0016] Based on the relative positional relationship between the pupil center point and the corneal reflection point, the driver's relative gaze direction with respect to the facial camera is calculated;

[0017] Acquire head posture data from the inertial measurement unit in the in-vehicle intelligent cockpit, the head posture data including the rotation angle of the head;

[0018] Based on the head posture data, the relative gaze direction is transformed from a coordinate system based on the facial camera to a global coordinate system with the vehicle as the origin, resulting in a gaze direction vector.

[0019] Optionally, the gaze direction vector and the gesture data are converted into driving behavior features with semantic labels, and a dynamic risk indicator is generated based on the temporal pattern of the driving behavior features, including:

[0020] The gaze direction vector is matched with a preset orientation interval, and each matched orientation interval is assigned a semantic label describing spatial orientation to generate a visual gaze pattern with semantic labels.

[0021] The gesture data is categorized into predefined gesture intent categories, and each gesture intent category is assigned a semantic label describing the operation intent, so as to generate an operation intent pattern with semantic labels;

[0022] The visual gaze pattern and the operational intention pattern at the same moment are combined to form driving behavior characteristics that describe the driver's comprehensive behavior at the corresponding moment.

[0023] Within a set time window, multiple consecutively occurring driving behavior features are arranged in chronological order to form a behavior feature sequence, and the changing patterns between different features in the behavior feature sequence are analyzed.

[0024] Based on the aforementioned pattern of change, a dynamic risk index is calculated to describe the degree of concentration and dispersion of the driver's attention to the road environment within the stated time window.
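The labeling and combination steps in [0020] to [0022] can be sketched as follows. This is only an illustrative sketch: the yaw intervals, the label names, and the axis convention (x to the right, y up, z forward in the vehicle frame) are assumptions that the application does not specify.

```python
import math

# Hypothetical orientation intervals in degrees of yaw, with hypothetical labels.
ORIENTATION_INTERVALS = [
    (-90.0, -20.0, "left"),
    (-20.0, 20.0, "ahead"),
    (20.0, 90.0, "right"),
]


def label_gaze(gaze):
    """Map a gaze vector (x right, y up, z forward) to an orientation label."""
    x, _, z = gaze
    yaw_deg = math.degrees(math.atan2(x, z))
    for low, high, label in ORIENTATION_INTERVALS:
        if low <= yaw_deg < high:
            return label
    return "unknown"


def behavior_feature(gaze, gesture_label):
    """Combine the visual gaze pattern and the operation intent pattern
    observed at the same moment into one driving behavior feature."""
    return (label_gaze(gaze), gesture_label)
```

For instance, a gaze vector pointing straight ahead combined with a hypothetical gesture label `"hands_on_wheel"` yields the feature `("ahead", "hands_on_wheel")`.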

[0025] Optionally, within a set time window, multiple consecutively occurring driving behavior features are arranged in chronological order to form a behavior feature sequence, and the changing patterns between different features in the behavior feature sequence are analyzed, including:

[0026] Set a fixed-duration sliding time window to continuously receive driving behavior features generated in chronological order;

[0027] The driving behavior features that enter the sliding time window are arranged in chronological order to form a behavior feature sequence;

[0028] The number of occurrences of each driving behavior feature in the behavior feature sequence is counted within the sliding time window, and the occurrence frequency of each driving behavior feature in the sliding time window is calculated based on the number of occurrences.

[0029] The number of times the feature type changes between two adjacent driving behavior features is counted within the sliding time window;

[0030] The variation pattern of the behavioral feature sequence is determined based on the occurrence frequency and the number of changes.
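A minimal sketch of the sliding-window statistics in [0026] to [0030] is given here. The class name, the window length, and in particular the `risk_index` formula (an equal-weight mix of off-road share and change rate) are illustrative assumptions; the application does not disclose how the dynamic risk indicator is actually computed.

```python
from collections import Counter, deque


class BehaviorWindow:
    """Fixed-duration sliding window over timestamped behavior features."""

    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.items = deque()  # (timestamp, feature), in chronological order

    def push(self, timestamp, feature):
        self.items.append((timestamp, feature))
        # Drop features that have fallen out of the window.
        while timestamp - self.items[0][0] > self.duration_s:
            self.items.popleft()

    def frequencies(self):
        """Occurrence frequency of each feature within the window."""
        total = len(self.items)
        counts = Counter(feature for _, feature in self.items)
        return {f: c / total for f, c in counts.items()} if total else {}

    def change_count(self):
        """Number of feature-type changes between adjacent features."""
        feats = [feature for _, feature in self.items]
        return sum(1 for a, b in zip(feats, feats[1:]) if a != b)


def risk_index(window, on_road_feature):
    """Hypothetical risk score in [0, 1]; NOT the formula of the application."""
    n = len(window.items)
    off_road_share = 1.0 - window.frequencies().get(on_road_feature, 0.0)
    change_rate = window.change_count() / (n - 1) if n > 1 else 0.0
    return 0.5 * off_road_share + 0.5 * change_rate
```

A deque keeps eviction of expired features cheap, since features only ever leave the window from the oldest end.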

[0031] Optionally, the vehicle communication module receives and parses structured road condition information of the road ahead in real time. This road condition information describes road events or states that are about to occur within a specific distance range and require the driver's prediction, including:

[0032] The vehicle communication module receives raw traffic data from roadside communication devices or cloud-based traffic service broadcasts.

[0033] Extract road description information related to the vehicle's current lane from the original road condition data;

[0034] From the road description information, filter out the road condition information ahead where the distance to the event location is less than a preset distance threshold;

[0035] The road condition information ahead is arranged in order of relative distance from near to far to form structured road condition information.
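The filtering and ordering in [0033] to [0035] might look like the following sketch. The `RoadEvent` fields and the function name are hypothetical; the application does not define a message format for the raw traffic data.

```python
from dataclasses import dataclass


@dataclass
class RoadEvent:
    # Hypothetical fields; the real broadcast format is not specified.
    event_type: str
    lane_id: int
    distance_m: float        # distance from the vehicle to the event location
    expected_behavior: str
    attention_area: str


def structure_road_info(raw_events, current_lane, max_distance_m):
    """Keep events in the current lane that lie ahead and closer than the
    preset distance threshold, ordered from near to far."""
    relevant = [
        e for e in raw_events
        if e.lane_id == current_lane and 0.0 <= e.distance_m < max_distance_m
    ]
    return sorted(relevant, key=lambda e: e.distance_m)
```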

[0036] Optionally, the dynamic risk indicators are matched and evaluated with the currently active driving behavior characteristics and the road condition information to obtain evaluation results, including:

[0037] The system extracts the description information of the event closest to the vehicle that has not yet been passed from the structured road condition information as the target road condition information, and extracts the expected driving behavior and suggested areas of focus from the target road condition information.

[0038] The directional range of the visual gaze pattern in the driving behavior characteristics at the current moment is compared with the directional range of the suggested attention area to obtain a first judgment result;

[0039] The operation intention of the operation intention mode in the current driving behavior characteristics is compared with the operation type of the expected driving behavior to obtain a second judgment result;

[0040] The value of the dynamic risk indicator is compared with a preset attention distraction threshold to obtain the comparison result;

[0041] The first judgment result, the second judgment result, and the comparison result are combined to generate an evaluation result.
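As a rough sketch of [0037] to [0041], the evaluation can be expressed as three independent comparisons collected into one result. The dictionary keys and the use of simple label equality are assumptions made for illustration; an actual implementation would compare directional ranges rather than labels.

```python
def nearest_upcoming(events):
    """Return the closest event that has not yet been passed, or None."""
    upcoming = [e for e in events if e["distance_m"] > 0.0]
    return min(upcoming, key=lambda e: e["distance_m"]) if upcoming else None


def evaluate(gaze_label, intent_label, risk_index, target, distraction_threshold):
    """Combine the first judgment, second judgment, and threshold comparison."""
    return {
        # First judgment: does the gaze cover the suggested attention area?
        "gaze_covers_area": gaze_label == target["attention_area"],
        # Second judgment: does the operation intent match the expected behavior?
        "intent_matches": intent_label == target["expected_behavior"],
        # Comparison result: is the dynamic risk indicator above the threshold?
        "risk_exceeds_threshold": risk_index > distraction_threshold,
    }
```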

[0042] Optionally, based on the evaluation results, when the dynamic risk indicator exceeds a preset threshold and the driving behavior characteristics do not meet the expected behavior corresponding to the road condition information, guiding dialogue content associated with the current road condition is generated and output, including:

[0043] When the evaluation results indicate that the dynamic risk index exceeds the preset attention distraction threshold, and the visual gaze pattern in the driving behavior characteristics does not cover the suggested attention area, and the operation intention pattern in the driving behavior characteristics does not conform to the expected driving behavior, a dialogue generation is triggered.

[0044] Based on the event type in the target road condition information, select the corresponding guidance statement framework from the preset voice template library;

[0045] The distance to the location of the incident, the expected driving behavior, and the suggested area of attention are filled into the guidance statement framework to generate complete guidance voice dialogue content, which is then broadcast through the car audio system.
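The trigger condition of [0043] and the template filling of [0044] and [0045] can be illustrated with the sketch below. The template texts, event type names, and function names are invented for illustration only; the content of the preset voice template library is not disclosed.

```python
# Hypothetical template library keyed by event type.
TEMPLATES = {
    "construction": "Construction zone in {distance} meters. "
                    "Please {behavior} and watch the {area}.",
    "merge": "Merging traffic in {distance} meters. "
             "Please {behavior} and watch the {area}.",
}
DEFAULT_TEMPLATE = ("Attention: {event} in {distance} meters. "
                    "Please {behavior} and watch the {area}.")


def should_trigger(risk_exceeds_threshold, gaze_covers_area, intent_matches):
    """Dialogue is triggered only when all three conditions of [0043] hold."""
    return risk_exceeds_threshold and not gaze_covers_area and not intent_matches


def build_prompt(event_type, distance_m, behavior, area):
    """Select the statement framework by event type and fill in the slots."""
    template = TEMPLATES.get(event_type, DEFAULT_TEMPLATE)
    return template.format(event=event_type, distance=round(distance_m),
                           behavior=behavior, area=area)
```

Requiring all three conditions keeps the system silent when the driver is already looking at the right area or already reacting, which is the behavior the method aims for.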

[0046] In a second aspect, this application provides an in-vehicle intelligent cockpit dialogue system based on multimodal interaction, including:

[0047] The data acquisition module is used to collect the driver's eye and gesture data through the sensing devices of the in-vehicle smart cockpit;

[0048] The generation module is used to perform anti-interference processing and eye tracking on the eye data to generate a gaze direction vector;

[0049] The conversion module is used to convert the gaze direction vector and the gesture data into driving behavior features with semantic labels, and generate dynamic risk indicators based on the temporal pattern of the driving behavior features.

[0050] The parsing module is used to receive and parse structured road condition information of the road ahead of the vehicle in real time through the vehicle communication module. The road condition information describes road events or states that are about to occur within a specific distance range and that the driver needs to predict.

[0051] The evaluation module is used to match and evaluate the dynamic risk indicators with the characteristics of currently active driving behaviors and the road condition information to obtain the evaluation results.

[0052] The output module is used to generate and output guiding dialogue content related to the current road conditions based on the evaluation results, when the dynamic risk index exceeds a preset threshold and the driving behavior characteristics do not meet the expected behavior corresponding to the road condition information.

[0053] In a third aspect, this application provides an electronic device, comprising:

[0054] Memory, used to store computer programs;

[0055] A processor, configured to execute the computer program to implement the steps of the in-vehicle intelligent cockpit dialogue method based on multimodal interaction as described in the first aspect above.

[0056] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, can implement the steps of the in-vehicle intelligent cockpit dialogue method based on multimodal interaction as described in the first aspect above.

[0057] The in-vehicle intelligent cockpit dialogue method based on multimodal interaction provided in this application collects driver eye and gesture data through the sensing devices of the in-vehicle intelligent cockpit, enabling the simultaneous acquisition of raw information reflecting the driver's attention and operational intentions. By performing anti-interference processing and eye tracking on the eye data to generate a gaze direction vector, the method can more accurately determine the driver's actual gaze direction in the external space. By converting the gaze direction vector and gesture data into driving behavior features with semantic labels and generating dynamic risk indicators based on their temporal patterns, the method can quantitatively assess the driver's real-time attention to the road environment and the risk of distraction. By receiving and parsing structured road condition information ahead in real time through the vehicle communication module, the method can anticipate upcoming road events that require the driver's response. By matching and evaluating the dynamic risk indicators, current driving behavior features, and road condition information, the method can intelligently determine whether the driver's current state and behavior match the requirements of the road conditions ahead. Based on the evaluation results, when the risk is too high and the behavior is mismatched, the method generates and outputs guiding dialogue content, which can proactively provide accurate and timely voice prompts when necessary to assist the driver in safely dealing with complex road conditions.

[0058] Furthermore, gaze direction and gestures are categorized and assigned semantic labels to form understandable visual gaze patterns and operational intention patterns. The two patterns observed at the same moment are then combined to form a comprehensive driving behavior characteristic. Sequences of multiple such characteristics within a time window are analyzed statistically for their occurrence frequencies and changes, from which a dynamic risk indicator reflecting the driver's level of concentration and distraction is calculated. By transforming raw gaze and gesture data into a semantically clear, temporally sequential description of driver behavior, a refined and quantitative assessment of the driver's cognitive load and situational awareness under complex road conditions can be achieved, providing a core basis for subsequent risk assessment and dialogue decision-making.

Attached Figure Description

[0059] To more clearly illustrate the technical solutions of the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0060] Figure 1 is a flowchart of an in-vehicle intelligent cockpit dialogue method based on multimodal interaction provided in an embodiment of this application;

[0061] Figure 2 is a schematic diagram of a specific implementation of an in-vehicle intelligent cockpit dialogue method based on multimodal interaction provided in an embodiment of this application;

[0062] Figure 3 is a schematic diagram of another specific implementation of an in-vehicle intelligent cockpit dialogue system based on multimodal interaction provided in an embodiment of this application.

Detailed Implementation

[0063] When faced with complex road conditions, existing technologies struggle to deeply understand the driver's real-time attention allocation and cognitive load. Consequently, the cockpit dialogues triggered by these technologies may become a source of interference rather than assistance due to inappropriate timing or content. This results in insufficient accuracy and contextual relevance in practical applications, failing to meet the urgent need for safe and efficient human-machine interaction under high-load driving scenarios.

[0064] To address the aforementioned issues, this application proposes a novel in-vehicle intelligent cockpit dialogue method. The core of this method involves simultaneously analyzing the driver's gaze direction, gestures, and real-time structured road condition information from the vehicle network to comprehensively calculate a dynamic risk index reflecting the driver's current level of attention. Furthermore, it assesses in real-time whether the driver's behavior aligns with the expected requirements of the road ahead. Based on this assessment, guided dialogue strongly relevant to the current scenario is proactively generated and output only when the driver may not be fully paying attention to critical road conditions and their risk of distraction is high. Therefore, this solution achieves precise triggering and content customization of dialogue guidance in complex driving environments, effectively solving the problem of ineffective or disruptive dialogue caused by the lack of a deep understanding of the driver's state and the external environment in existing technologies. This enhances the naturalness of interaction while effectively ensuring driving safety.

[0065] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0066] The core of this application is to provide a dialogue method for an in-vehicle intelligent cockpit based on multimodal interaction. A flowchart of one specific implementation is shown in Figure 1. The method includes:

[0067] S101. Collect driver's eye and gesture data through the sensing devices in the in-vehicle smart cockpit.

[0068] In one specific implementation, the sensing devices of the in-vehicle smart cockpit are first powered on and their driver software is initialized, assuming that the infrared camera and the 3D sensing device start up and operate normally. The infrared camera continuously captures a video stream containing the driver's face, while the 3D sensing device continuously emits signals toward the areas where the driver's hands commonly move and receives the reflections, generating a point cloud data stream describing the spatial position of object surfaces.

[0069] Secondly, using target detection and tracking algorithms in computer vision, the real-time video stream from the infrared camera is analyzed frame by frame. By locating the face region in the video frames, key feature points such as the eyes and pupils are further precisely located from the face region. This allows for the real-time calculation of the degree of eye opening and closing, gaze direction, and head posture angle, forming structured eye data. The eye data refers to comprehensive information reflecting the driver's eye opening and closing state, gaze direction, pupil size, and head position relative to the steering wheel. This information is used to determine whether the driver is fatigued, distracted, or lacking concentration. The data is obtained by continuously capturing video of the driver's face using an in-vehicle infrared camera and analyzing the video stream using computer vision algorithms.

[0070] Next, point cloud processing and segmentation algorithms are used to process the point cloud data generated in real time by the 3D sensing device. By separating and clustering the point sets belonging to the driver's hands from the complex cockpit environment point cloud, and then using a gesture recognition model, such as a three-dimensional convolutional neural network, to analyze these hand point sets, the orientation of the palm, the extension state of the fingers, and the three-dimensional coordinates of the hand in the air are identified, forming structured gesture data. Here, gesture data refers to information reflecting the spatial position, posture, movement trajectory, and static gesture shape of the driver's hands in a specific area of the cockpit, which is used to identify the driver's control intentions. The point cloud data containing depth information is calculated by emitting invisible light pulses and receiving reflected signals through the 3D sensing device, and then processed by a specific algorithm.

[0071] S102. Perform anti-interference processing and eye tracking on the eye data to generate a gaze direction vector.

[0072] Here, anti-interference processing refers mainly to measures taken to improve data quality before image data is used to calculate the gaze direction, such as filtering out invalid or low-quality image frames caused by blinking, rapid head rotation, or sudden changes in lighting. Eye tracking is a collective term for techniques that continuously calculate the direction of a person's gaze by analyzing the physiological characteristics of the eyes. The gaze direction vector is the data that describes the direction in which the driver's gaze points in the real space outside the vehicle.

[0073] In this embodiment of the application, S102 specifically includes:

[0074] S1021. Acquire a facial image sequence using a facial camera facing the driver, and extract an eye region image containing the pupil region and corneal reflection point from the facial image sequence.

[0075] Here, the facial image sequence is a series of facial images captured in chronological order by an in-vehicle camera facing the driver. It provides the raw visual material for analyzing the driver's eye state and head movements, and is obtained by the camera capturing images at a fixed frequency and sending them to the processor. The corneal reflection point is a small, bright spot of light formed on the driver's cornea. It serves as a stable reference point for calculating eye movement, and is obtained by illuminating the eye with an in-vehicle near-infrared light source and locating the bright spot in the captured image.

[0076] In this embodiment, an image preprocessing algorithm is used to perform anti-interference processing on the facial video stream: histogram equalization is applied to each frame of facial image to enhance the contrast between the eye region and the surrounding skin; simultaneously, a Kalman filter algorithm is used to predict and calibrate the possible positions of the eyes in the current frame based on historical data of the eye positions in previous frames, in order to smooth out target loss or jumps caused by rapid head shaking, thereby obtaining a clearer and more stable sequence of facial images; through image recognition technology, a rectangular region containing the driver's eyes is automatically selected in each frame of facial image, and the image of this region is cropped to obtain the required eye region image.
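The Kalman-filter smoothing of the eye position mentioned above can be illustrated with a deliberately simplified sketch. It assumes a scalar random-walk model applied to one coordinate at a time, with hypothetical noise variances; the application does not state which state model or parameters are used.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model."""

    def __init__(self, initial, process_var=1.0, measurement_var=4.0):
        self.x = initial          # current estimate of the eye coordinate (pixels)
        self.p = 1.0              # variance of the estimate
        self.q = process_var      # assumed frame-to-frame motion variance
        self.r = measurement_var  # assumed detector noise variance

    def update(self, measurement):
        # Predict: the position is assumed unchanged, uncertainty grows.
        self.p += self.q
        if measurement is None:
            # Detection lost in this frame: keep the predicted position.
            return self.x
        # Correct: blend prediction and measurement by the Kalman gain.
        gain = self.p / (self.p + self.r)
        self.x += gain * (measurement - self.x)
        self.p *= 1.0 - gain
        return self.x
```

Feeding `None` for a frame in which the eye detector fails lets the filter bridge short target losses, while a sudden jump in the measurement is only partially followed, which smooths out the effect of rapid head shaking.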

[0077] S1022. Based on the image of the eye region, determine the geometric center point of the pupil region as the pupil center point, and locate the corneal reflection point formed on the cornea by the near-infrared light source in the vehicle-mounted intelligent cockpit.

[0078] In this embodiment, firstly, by employing image segmentation algorithms, such as threshold segmentation, region growing, or edge detection-based methods, the pupil region, which is darker and approximately circular and differs significantly from the white of the eye, iris, and skin in grayscale or color, is distinguished and obtained. Then, by calculating the average horizontal and vertical coordinates of all pixels in the pupil region, the geometric center of the pupil region, i.e., the pupil center point, is obtained. Simultaneously, in the eye region image, due to the illumination of the vehicle-mounted near-infrared light source, a very bright, high-contrast small spot is formed on the corneal surface. By using spot detection or brightness peak search algorithms in image processing, the center of this brightest and relatively fixed small region in the image is quickly located, obtaining the corneal reflection point.
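A minimal sketch of this localization step follows, using threshold segmentation for the pupil and a brightness-peak search for the corneal reflection. It assumes a grayscale image given as a list of rows and a hypothetical threshold value, and it simplifies the reflection to the single brightest pixel rather than the center of a bright region.

```python
def pupil_center(image, dark_threshold):
    """Mean (x, y) of all pixels darker than the threshold, or None."""
    xs, ys = [], []
    for row_idx, row in enumerate(image):
        for col_idx, value in enumerate(row):
            if value < dark_threshold:
                xs.append(col_idx)
                ys.append(row_idx)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def corneal_reflection(image):
    """(x, y) of the brightest pixel, used as the corneal reflection point."""
    best_value, best_xy = -1, None
    for row_idx, row in enumerate(image):
        for col_idx, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (col_idx, row_idx)
    return best_xy
```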

[0079] S1023. Based on the relative positional relationship between the pupil center point and the corneal reflection point, calculate the driver's relative gaze direction with respect to the facial camera.

[0080] The relative gaze direction is the angle information describing the direction of the driver's gaze relative to the camera facing the driver's face. It gives an initial indication of whether the eyeball is turned up, down, left, or right, and is obtained by calculating the direction of the line connecting the pupil center point and the corneal reflection point in the same eye image.

[0081] In this embodiment, the relative gaze direction is obtained by using the pupil-corneal reflection vector method, based on the two-dimensional pixel coordinate offset between the pupil center point and the corneal reflection point determined in a single frame image, combined with the pre-calibrated internal parameters of the camera, and by using a geometric projection model.

[0082] S1024. Obtain head posture data from the inertial measurement unit in the vehicle's intelligent cockpit, wherein the head posture data includes the rotation angle of the head.

[0083] The inertial measurement unit (IMU) is a miniature electronic sensor module installed near the vehicle or seat headrest. It measures the rotation angle, tilt angle, and other motion states of the driver's head. It obtains head posture data by sensing the angular velocity and acceleration of the head and performing internal calculations.

[0084] In this embodiment, the angular velocity of the head is measured by accessing the real-time data stream of the inertial measurement unit synchronized with the facial camera data. By integrating the angular velocity signal, the change in the rotation angle of the head relative to the initial moment can be calculated. By processing these data through sensor fusion algorithms, such as Kalman filtering, stable and accurate head posture data can be output in real time.
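The integration of the angular velocity signal can be sketched as a simple cumulative sum. The function name, the units (degrees per second), and the fixed sampling interval are assumptions; the sensor fusion stage described above is omitted.

```python
def integrate_yaw(angular_velocities_dps, dt_s, initial_deg=0.0):
    """Accumulate yaw angle (degrees) from angular velocity samples (deg/s)
    taken at a fixed interval dt_s, relative to the initial moment."""
    angle = initial_deg
    angles = []
    for omega in angular_velocities_dps:
        angle += omega * dt_s
        angles.append(angle)
    return angles
```

Plain integration accumulates sensor bias over time, which is one reason a fusion step such as the Kalman filtering mentioned above is needed in practice.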

[0085] S1025. Based on the head posture data, the relative gaze direction is transformed from the coordinate system based on the facial camera to the global coordinate system with the vehicle as the origin, to obtain the gaze direction vector.

[0086] In this embodiment, a three-dimensional coordinate transformation is applied using a rotation matrix: the relative gaze direction is rotated to compensate for the head rotation angle described by the head posture data. This is equivalent to transforming the gaze direction from a coordinate system that rotates with the head to a global coordinate system that is fixed to the vehicle and does not change with head rotation. Finally, a gaze direction vector describing the absolute direction of the driver's gaze in the real world outside the vehicle is output.

[0087] As an example, step 1021 first extracts a close-up image of the driver's eyes from the camera video stream.

[0088] Next, in step 1022, the coordinates of the pupil center point $(x_p, y_p)$ (in pixels) and the coordinates of the nearby corneal reflection point $(x_c, y_c)$ (in pixels) are accurately located in the close-up image of the eye.

[0089] Next, in step 1023, the offset vector $\mathbf{d}$ between these two points is calculated:

[0090] $\mathbf{d} = (d_x, d_y) = (x_p - x_c,\ y_p - y_c)$ pixels

[0091] There is a specific mapping relationship between the offset vector $\mathbf{d}$ and the angle of eye rotation. Through a pre-calibrated mapping function $f$, the pixel offset can be converted into a gaze direction. For example, the mapping function $f$ can be constructed based on a linear model with normalization, assuming that the scaling factors $k_x$ and $k_y$ in the horizontal and vertical directions are determined through the calibration process. That is, the first two components of the gaze vector, $g_x$ and $g_y$, are proportional to the offset:

[0092] $g_x = k_x d_x, \quad g_y = k_y d_y$

[0093] where $d_x$ and $d_y$ are the components of the offset vector $\mathbf{d}$. To obtain a unit vector, the third component $g_z$ is calculated as follows:

[0094] $g_z = \sqrt{1 - g_x^2 - g_y^2}$

[0095] Substituting $\mathbf{d}$ into the above expressions:

[0096] $g_x = k_x d_x, \quad g_y = k_y d_y, \quad g_z = \sqrt{1 - (k_x d_x)^2 - (k_y d_y)^2}$

[0097] Therefore, the unit vector of the gaze direction relative to the camera coordinate system is obtained: $\mathbf{g}_{cam} = (g_x, g_y, g_z)^T$.

[0098] Then, in step 1024, head posture data is obtained, assuming that the driver's head has rotated 15 degrees to the left around the vertical $Y$ axis, that is, the head posture angle is $\theta = 15^\circ$.

[0099] Finally, in step 1025, the relative gaze direction $\mathbf{g}_{cam}$ is transformed to a global coordinate system with the vehicle as the origin by means of a rotation matrix $R_y(\theta)$ constructed from the head pose angle $\theta$. The conversion formula is:

[0100] $\mathbf{g}_{veh} = R_y(\theta)\,\mathbf{g}_{cam}$

[0101] The rotation matrix around the Y-axis is:

[0102] $R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}$

[0103] Substituting $\theta$ and $\mathbf{g}_{cam}$ into the calculation:

[0104] $\mathbf{g}_{veh} = \begin{bmatrix} g_x\cos\theta + g_z\sin\theta \\ g_y \\ -g_x\sin\theta + g_z\cos\theta \end{bmatrix}$

[0105] The resulting $\mathbf{g}_{veh}$ is the gaze direction vector.

[0106] The above examples and formulas are merely schematic representations of the principles in this application; actual calculations may involve more complex models and calibration processes.
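As an illustrative sketch of steps 1023 to 1025 (not the applicant's implementation), the following Python code maps a pupil-to-reflection pixel offset to a unit gaze vector with a linear model and then rotates it about the vertical axis. The function names, the scale factors `kx` and `ky`, the sample pixel coordinates, the offset direction (pupil minus reflection spot), and the sign convention of the head angle are all assumptions introduced for illustration.

```python
import math

def gaze_in_camera_frame(pupil, glint, kx, ky):
    """Map a pupil-to-glint pixel offset to a unit gaze vector (linear model)."""
    dx = pupil[0] - glint[0]
    dy = pupil[1] - glint[1]
    vx = kx * dx
    vy = ky * dy
    squared = vx * vx + vy * vy
    if squared > 1.0:
        raise ValueError("offset outside the calibrated range")
    vz = math.sqrt(1.0 - squared)  # third component gives the vector unit length
    return (vx, vy, vz)

def rotate_about_y(v, angle_deg):
    """Apply the standard rotation matrix about the Y axis to vector v."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    x, y, z = v
    return (c * x + s * z, y, -s * x + c * z)

# Hypothetical inputs: offset of (10, -5) pixels, 0.01 per pixel, head angle 15 degrees.
v_cam = gaze_in_camera_frame((320.0, 240.0), (310.0, 245.0), kx=0.01, ky=0.01)
v_global = rotate_about_y(v_cam, 15.0)
```

Because a rotation preserves length, `v_global` remains a unit vector, which is a convenient sanity check on any calibration values substituted here.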

[0107] This application lays a reliable data foundation for accurately determining the driver's visual focus by transforming the original eye image into a stable, accurate gaze direction vector with clear spatial physical meaning.

[0108] S103. Convert the gaze direction vector and the gesture data into driving behavior features with semantic labels, and generate dynamic risk indicators based on the temporal pattern of the driving behavior features.

[0109] Among them, driving behavior characteristics are symbolic descriptions of a driver's comprehensive behavior at a specific moment; the behavior characteristic sequence is a series of driving behavior characteristics arranged in chronological order to observe continuous changes in behavior; and the dynamic risk index is a quantitative value used to reflect in real time whether a driver's attention to the road environment is focused or scattered over a period of time, as well as the stability of the behavior pattern.

[0110] In the embodiments of this application, as shown in Figure 2, S103 specifically includes:

[0111] S1031. Match the gaze direction vector with a preset orientation interval, and assign a semantic label describing the spatial orientation to each matched orientation interval to generate a visual gaze pattern with semantic labels.

[0112] Among them, the preset orientation intervals refer to several meaningful fan-shaped areas that are pre-divided into the external space in front of the vehicle, such as the road directly in front, the left rearview mirror area, the central control screen, and the right window; semantic labels are textual descriptions that are easy for humans to understand, such as looking at the road, looking to the left, and operating the central control; visual gaze pattern is a symbolic result that describes which preset area the driver's eyes are looking at.

[0113] In this embodiment of the application, the calculated gaze direction vector is geometrically compared with several predefined azimuth intervals. Each azimuth interval corresponds to a spatial angle range in the vehicle coordinate system. For example, the front of the vehicle is defined as 0 degrees, and 30 degrees to the left and right are defined as "the road in front". It is determined which spatial angle range the gaze direction vector falls into, and the corresponding text description, i.e. semantic label, is assigned to the successfully matched azimuth interval, thereby forming the visual gaze pattern at the current moment.
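The interval matching of S1031 could be sketched as follows. This is a minimal illustration under stated assumptions: the vehicle frame is taken to have z pointing forward and x pointing left, angles are positive to the left, and the interval table and the label strings are hypothetical apart from the "road in front" range of 30 degrees to either side, which comes from the text above.

```python
import math

# Hypothetical interval table: (low_deg, high_deg, semantic label).
# 0 degrees is straight ahead; positive angles are to the left (assumption).
ORIENTATION_INTERVALS = [
    (-30.0, 30.0, "looking at the road"),
    (30.0, 90.0, "looking to the left"),
    (-90.0, -30.0, "looking to the right"),
]

def azimuth_deg(gaze):
    """Horizontal angle of a gaze vector (x, y, z) relative to straight ahead."""
    x, _, z = gaze
    return math.degrees(math.atan2(x, z))

def visual_gaze_pattern(gaze):
    """Return the semantic label of the interval the gaze vector falls into."""
    az = azimuth_deg(gaze)
    for low, high, label in ORIENTATION_INTERVALS:
        if low <= az < high:
            return label
    return "unclassified"
```

For instance, `visual_gaze_pattern((0.0, 0.0, 1.0))` yields the "looking at the road" label, since a vector pointing straight ahead has an azimuth of 0 degrees.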

[0114] S1032. The gesture data is categorized into predefined gesture intent categories, and each gesture intent category is assigned a semantic label describing the operation intent, so as to generate an operation intent pattern with semantic labels.

[0115] Among them, the predefined gesture intent category is a set of several standard gesture actions that the system has learned or set in advance. Each set represents a possible operation intent, such as clicking, swiping, rotating, or holding the steering wheel; the operation intent pattern is a symbolic result that describes the driver's hand action intent.

[0116] In this embodiment, gesture data describing the trajectory and shape of hand movements are continuously received. These gesture data are matched with several predefined gesture intent categories in a library using a gesture recognition classification model. The gesture recognition model can be trained based on a large amount of gesture sample data and can identify discrete gesture actions from continuous sensor data. When a gesture is identified, such as a leftward swipe, it is classified into the corresponding intent category, such as swipe. Finally, a descriptive semantic label is assigned to the category, such as adjusting the air conditioner, thus forming the current operation intent pattern.

[0117] S1033. Combine the visual gaze pattern and the operation intention pattern at the same moment to form driving behavior features that describe the driver's comprehensive behavior at the corresponding moment.

[0118] In this embodiment of the application, a feature fusion method is used to pair and merge visual gaze patterns and operation intention patterns generated within the same timestamp, such as the same millisecond. For example, "viewing the central control" and "clicking the screen" are merged into "driving behavior feature: operating the central control screen". This driving behavior feature comprehensively represents the driver's complete interaction state at that moment.
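A minimal sketch of the pairing in S1033, assuming both pattern streams are available as dictionaries keyed by a millisecond timestamp. The function name and the simple concatenation of labels are illustrative assumptions; the embodiment merges the pair into a single semantic description such as "operating the central control screen", which would need an additional lookup not shown here.

```python
def fuse_patterns(gaze_patterns, intent_patterns):
    """Pair gaze and intent labels that share a timestamp into behavior features."""
    features = []
    # Only timestamps present in both streams can be paired.
    for ts in sorted(gaze_patterns.keys() & intent_patterns.keys()):
        features.append((ts, f"{gaze_patterns[ts]} + {intent_patterns[ts]}"))
    return features

gaze = {1000: "viewing the central control", 1001: "looking at the road"}
intent = {1000: "clicking the screen"}
fused = fuse_patterns(gaze, intent)
```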

[0119] S1034. Within a set time window, arrange multiple consecutively occurring driving behavior features in chronological order to form a behavior feature sequence, and analyze the changing patterns between different features in the behavior feature sequence.

[0120] Specifically, S1034 may include:

[0121] A fixed-duration sliding time window is set to continuously receive driving behavior features generated in chronological order; the driving behavior features entering the sliding time window are arranged in chronological order to form a behavior feature sequence; the number of times each driving behavior feature appears in the behavior feature sequence within the sliding time window is counted, and the frequency of occurrence of each driving behavior feature in the sliding time window is calculated based on the number of occurrences; the number of times the feature type changes between two adjacent driving behavior features within the sliding time window is counted; and the change pattern of the behavior feature sequence is determined based on the frequency of occurrence and the number of changes.

[0122] In this embodiment, a fixed-duration sliding time window is set to continuously add newly generated driving behavior features while removing old features that fall outside the window, thus maintaining an up-to-date, time-ordered behavior feature sequence. This sequence is then analyzed by counting the number of times each specific driving behavior feature appears within the window and calculating its proportion of the total number of features, i.e., the frequency of occurrence. The number of times the driving behavior feature changes between adjacent moments within the window, i.e., the number of changes, is also counted. Both a high frequency of non-driving-related behavior features, such as frequent operation of the central control system, and frequent switching between behavior features, such as rapidly shifting gaze between the road, rearview mirror, and central control screen, indicate distraction. The frequencies of occurrence and the number of changes are used to quantitatively capture the change pattern of the behavior sequence.
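The sliding-window statistics of S1034 could be sketched as below. The class name, the use of seconds as the time unit, and the short feature names in the usage lines are assumptions made for illustration only.

```python
from collections import Counter, deque

class BehaviorWindow:
    """Fixed-duration sliding window over timestamped behavior features."""

    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.items = deque()  # (timestamp_s, feature) in arrival order

    def add(self, timestamp_s, feature):
        self.items.append((timestamp_s, feature))
        # Drop features that are older than the window duration.
        while self.items and timestamp_s - self.items[0][0] > self.duration_s:
            self.items.popleft()

    def frequencies(self):
        """Share of each feature type among all features in the window."""
        total = len(self.items)
        counts = Counter(f for _, f in self.items)
        return {f: n / total for f, n in counts.items()} if total else {}

    def change_count(self):
        """Number of adjacent pairs whose feature type differs."""
        feats = [f for _, f in self.items]
        return sum(1 for a, b in zip(feats, feats[1:]) if a != b)

window = BehaviorWindow(duration_s=10.0)
for t, f in [(0, "safe"), (2, "safe"), (4, "distract"), (6, "distract"), (8, "observe")]:
    window.add(t, f)
```

A deque keeps both appending new features and evicting expired ones cheap, which suits a stream that is updated continuously.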

[0123] S1035. Based on the aforementioned change pattern, a dynamic risk index is calculated to describe the degree of concentration and dispersion of the driver's attention to the road environment within the stated time window.

[0124] In this embodiment, a comprehensive evaluation algorithm is used to calculate a dynamic risk index based on the changing patterns using a predefined mathematical rule or function. The core idea is that the higher the frequency of behavioral features strongly related to safe driving, the lower the risk index should be; conversely, the higher the frequency of behavioral features unrelated to driving or distracting, the higher the risk index should be; and the more changes the behavioral feature sequence has, the more frequently the driver's attention switches and the less focused they are, and the higher the risk index should be accordingly. The frequency and the number of changes are combined into a single calculation formula to output a single, time-varying dynamic risk index. The higher the dynamic risk index value, the more distracted the driver's attention is towards the road environment and the worse their situational awareness is within the current time window.

[0125] As an example, step 1031 first determines that the current gaze direction vector points to the central control screen area, thus generating a visual gaze pattern of viewing the central control.

[0126] Next, in step 1032, the gesture recognition model identifies the gesture as a swiping action, classifies it as an intention to adjust the volume, and generates an operation intention pattern of adjusting the volume.

[0127] Next, in step 1033, the visual gaze pattern of viewing the central control and the operation intention pattern of adjusting the volume are combined, so that the current driving behavior feature is "viewing the central control and adjusting the volume".

[0128] Then, in step 1034, it is assumed that the behavior feature sequence collected within the sliding time window of the most recent 10 seconds is: looking at the road and holding the steering wheel, looking at the road and holding the steering wheel, looking at the central control and adjusting the volume, looking at the central control and adjusting the volume, looking to the left and holding the steering wheel.

[0129] For analysis purposes, the different feature types are numbered, assuming there are three feature types:

[0130] Type 1, safe driving: looking at the road and holding the steering wheel.

[0131] Type 2, distracting operation: looking at the central control and adjusting the volume.

[0132] Type 3, observing the environment: looking to the left and holding the steering wheel.

[0133] The sequence can be represented as: $[1, 1, 2, 2, 3]$.

[0134] Then count the occurrences: type 1 occurs $n_1 = 2$ times; type 2 occurs $n_2 = 2$ times; type 3 occurs $n_3 = 1$ time. The total number of features is $N = 5$.

[0135] Calculate the frequency $p_i = n_i / N$ of each type within the window:

[0136] $p_1 = 2/5 = 0.4$, $p_2 = 2/5 = 0.4$, $p_3 = 1/5 = 0.2$.

[0137] Count the number of changes by traversing the sequence. Adjacent feature changes occur at: the 2nd to 3rd feature ($1 \to 2$), and the 4th to 5th feature ($2 \to 3$).

[0138] Therefore, the number of feature type changes is $C = 2$.

[0139] Finally, in step 1035, the dynamic risk indicator is calculated based on the change pattern. An exemplary calculation formula can combine the disorder of the frequency distribution with the degree of change:

[0140] The disorder of the frequency distribution can be measured by the information entropy $H$. The higher the value of $H$, the more dispersed the attention is among different behaviors. The information entropy $H$ is expressed as follows:

[0141] $$H = -\sum_{i} p_i \log p_i$$

[0142] Substituting the frequency of occurrence $p_i$ of each type within the window, as calculated above:

[0143] $$H = -\left(0.4 \log 0.4 + 0.4 \log 0.4 + 0.2 \log 0.2\right)$$

[0144] The degree of change can be measured directly by the number of changes $C$. It can also be normalized, here by dividing $C$ by its upper limit $C_{\max}$: $\tilde{C} = C / C_{\max}$.

[0145] Using a weighted summation, the disorder of the frequency distribution and the degree of change are combined to obtain the dynamic risk indicator $D$. The specific expression is as follows:

[0146] $$D = w_1 H + w_2 \tilde{C}$$

[0147] where $w_1$ and $w_2$ are preset weighting coefficients used to adjust the relative influence of the two parts. Given assumed values of $w_1$ and $w_2$, the resulting $D$ is the calculated dynamic risk indicator. Taking a baseline value of 0 to represent complete focus, this higher value quantitatively reflects that the driver's attention has deviated significantly from the main driving task within these 10 seconds: a high proportion of central control operation, together with changes in behavior pattern, results in a higher risk of distraction.

[0148] The examples and formulas described above are merely schematic representations of the principles employed in this application, illustrating how to extract quantitative patterns from behavioral sequences and calculate risk indicators. Actual risk calculation models may be more complex and integrate more factors; this application does not impose any limitations on them.
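The exemplary calculation of step 1035, information entropy plus a normalized change count combined by weighted summation, could be sketched as follows. Several choices here are assumptions rather than values taken from this application: the logarithm base (2), the upper limit on the number of changes (the window length minus one), and the weights (0.5 each).

```python
import math

def shannon_entropy(freqs, base=2.0):
    """Information entropy of a frequency distribution (base is an assumption)."""
    return -sum(p * math.log(p, base) for p in freqs if p > 0)

def dynamic_risk(freqs, changes, total, w_entropy=0.5, w_change=0.5):
    """Weighted sum of entropy and normalized change count (weights assumed)."""
    max_changes = max(total - 1, 1)  # assumed upper limit: every adjacent pair differs
    normalized_changes = changes / max_changes
    return w_entropy * shannon_entropy(freqs) + w_change * normalized_changes

# Frequencies and change count from the 10-second example window.
risk = dynamic_risk([0.4, 0.4, 0.2], changes=2, total=5)
```

A window containing a single feature type gives an entropy of 0 and no changes, so the indicator falls to the baseline of 0 that represents complete focus.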

[0149] This application creatively transforms raw gaze and gesture signals into semantically clear behavioral units, and through statistical analysis of these behavioral units over time, it achieves a quantitative assessment of the driver's attention state and behavioral stability, providing a key and objective decision-making basis for intelligent cockpits to determine when intervention is needed.

[0150] S104. The vehicle communication module receives and parses the structured road condition information of the road ahead of the vehicle in real time.

[0151] Among them, vehicle communication module refers to vehicle-mounted wireless communication device, such as cellular network communication unit or dedicated short-range communication module, used to exchange data with external information sources; structured traffic information refers to a standardized and organized information that clearly describes events on the road ahead, their precise locations, and the expected requirements for the driver in a specific format.

[0152] In this embodiment of the application, S104 specifically includes:

[0153] S1041. Receive raw traffic data from roadside communication equipment or cloud-based traffic service broadcasts via the vehicle communication module.

[0154] Among them, roadside communication equipment is a communication device deployed beside the road that can broadcast traffic information in the vicinity; cloud traffic services are centralized information services provided through the Internet that collect and integrate traffic data over a wide range; raw traffic data is usually a description of events on the road ahead over multiple consecutive distance segments with reference to the vehicle's current position.

[0155] In this embodiment, by activating the vehicle's communication module, continuous monitoring of wireless broadcast signals and data streams from external information sources is initiated. Signals pushed by nearby roadside communication devices are received via dedicated short-range communication technology, and data packets broadcast by cloud-based traffic services are received via cellular network. These signals and data streams carry encoded raw traffic data. This raw traffic data from different sources is parsed and converted from wireless signal or network data packet format into a standardized data format that the system can process internally. This data typically contains descriptions of various traffic events originating from the vehicle and extending to multiple road segments ahead.

[0156] S1042. Extract road description information related to the vehicle's current driving lane from the original road condition data.

[0157] The road description information is a detailed description of a specific event, including the event type, distance from the event location, expected driving behavior, and suggested areas of focus.

[0158] In this embodiment, data parsing and matching technology is used to process the standardized original road condition data. The vehicle's real-time high-precision positioning information and current driving lane information are read, and the following fields are traversed in the original data: the event type (e.g., construction, congestion, accident), the specific lane where the event occurred, the event location (e.g., 500 meters away from the vehicle), the expected driving behavior of the driver (e.g., changing lanes to the left or slowing down), and the suggested areas of focus for the driver (e.g., the left lane or the construction area ahead). The precise location of the vehicle and the number of the lane it is driving in are determined through the vehicle positioning system and map data. The lane affected by each event is compared with the vehicle's current lane number, and only events that coincide with or are closely related to the vehicle's current driving lane are retained. For each retained event, the event type, the exact distance between the event and the vehicle, the suggested response behavior for the driver, and the suggested area of focus are extracted as road description information.

[0159] S1043. From the road description information, filter out the road condition information ahead where the distance to the event location is less than a preset distance threshold.

[0160] The preset distance threshold is a value pre-set based on driving reaction time and system design requirements, such as 200 meters or 300 meters, used to focus on nearby events that require the driver to react immediately or in the near future.

[0161] In this embodiment, the importance of all extracted road description information is filtered by comparison and judgment logic. The value of the event location distance field is read and compared with a preset distance threshold. Only event description information whose event location distance is less than the preset distance threshold is retained and can be considered as road condition information that needs immediate attention. Events that are far away are temporarily ignored. This ensures that the road condition information processed later is all near-term events that the driver will face in a short time and needs to anticipate and prepare for. It filters out information that is temporarily irrelevant from a distance, improving the timeliness and targeting of the processing.

[0162] S1044. Arrange the road condition information ahead in order of relative distance from near to far to form structured road condition information.

[0163] In this embodiment of the application, a sorting algorithm is used to process the filtered road condition information ahead. The event location distance value in each piece of information is read, and a sorting algorithm, such as quicksort or bubble sort, is used to arrange the event location distance values in ascending order. The event closest to the vehicle is placed at the front, and events farther away are placed behind it, thus forming a list arranged in order of relative distance from near to far. This ordered list is then output as the final structured road condition information.

[0164] As an example, in step 1041, the vehicle communication module first receives a piece of raw traffic data from the cloud service. The data includes: 800 meters ahead, all lanes are congested, the expected behavior is to slow down, and the suggested area of focus is ahead; 350 meters ahead, the second lane from the left has debris, the expected behavior is to change lanes to the right, and the suggested area of focus is the left front and the right lane.

[0165] Next, in step 1042, it is identified that the vehicle is traveling in the second lane from the left. After comparison, the second event description, namely the debris event, matches the current lane exactly and is extracted. Because the congestion affects all lanes, the first event description is also extracted.

[0166] Then, in step 1043, assuming the preset distance threshold is 500 meters, the event location distances of the two pieces of information are compared: 800 meters > 500 meters, 350 meters < 500 meters. Therefore, the congestion event at 800 meters is filtered out, and only the debris event at 350 meters is kept as the road condition information ahead.

[0167] Finally, in step 1044, since there is only one valid piece of information at this point, it is directly output as structured traffic information. If there are multiple events, such as an entrance merging event 100 meters away, they are sorted into a sequential list of [event at 100 meters, event at 350 meters]. The above example is only one example of this application. In actual applications, the distance threshold and data processing logic can be set according to requirements, and this application does not limit them.
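Steps S1042 to S1044 (lane matching, distance filtering, and near-to-far sorting) can be sketched in a few lines. The `RoadEvent` fields and the convention that `lane=None` means "all lanes" are assumptions introduced for this illustration; the sample events reproduce the example above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RoadEvent:
    event_type: str
    distance_m: float
    lane: Optional[int]  # lane index from the left; None means all lanes (assumption)
    expected_behavior: str
    attention_area: str

def structure_road_info(events: List[RoadEvent], current_lane: int,
                        threshold_m: float) -> List[RoadEvent]:
    # S1042: keep events that affect the current lane or all lanes.
    relevant = [e for e in events if e.lane is None or e.lane == current_lane]
    # S1043: keep events closer than the preset distance threshold.
    nearby = [e for e in relevant if e.distance_m < threshold_m]
    # S1044: order from near to far.
    return sorted(nearby, key=lambda e: e.distance_m)

raw = [
    RoadEvent("congestion", 800.0, None, "slow down", "ahead"),
    RoadEvent("debris", 350.0, 2, "change lanes to the right", "left front and right lane"),
]
structured = structure_road_info(raw, current_lane=2, threshold_m=500.0)
```

Python's built-in `sorted` is used here in place of the quicksort or bubble sort mentioned in the embodiment; any stable ascending sort on the distance field produces the same ordered list.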

[0168] Through the above steps, this application achieves the automatic acquisition, filtering, and organization of information on road events ahead that are close to and highly relevant to the vehicle, providing a clear, orderly external environmental context that is closely related to the current state of the vehicle. This is a prerequisite for the subsequent intelligent matching and evaluation of driver behavior against road conditions.

[0169] S105. The dynamic risk indicators are matched and evaluated with the current active driving behavior characteristics and the road condition information to obtain the evaluation results.

[0170] Among them, the currently active driving behavior feature refers to the newly generated feature unit that represents the driver's overall behavioral state at this moment.

[0171] In this embodiment of the application, S105 specifically includes:

[0172] S1051. Obtain the description information of the event closest to the vehicle that has not yet been passed from the structured road condition information as the target road condition information, and extract the expected driving behavior and suggested attention area from the target road condition information.

[0173] Among them, target road condition information is a description of the event closest to the vehicle and not yet passed by the driver, which is selected from structured road condition information; expected driving behavior is the type of operation that the driver is clearly advised to take in the target road condition information, such as deceleration or lane change; suggested attention area is the spatial range that the driver is advised to focus on in the target road condition information, such as the left lane or construction area.

[0174] In this embodiment, a list reading and location judgment logic is used to access structured road condition information that has been sorted by distance. Starting from the front of the list, i.e. the closest location, each event is checked sequentially to see if it has been marked as passed. The basis for determining whether it has been passed can be vehicle positioning data, such as GPS coordinates, showing that the vehicle has passed the location where the event occurred, or the valid time contained in the event description has expired. The first event description information in the list that has not yet been marked as passed is selected and determined as the current target road condition information. The two key fields of expected driving behavior and suggested attention area are read from this target road condition information to extract the specific content. For example, the target road condition information is: there is an obstacle in the left lane 200 meters ahead, the expected driving behavior is to change lanes to the right, and the suggested attention area is the left lane and the right rear.
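The selection of the target road condition information in S1051 could look like the sketch below. The dictionary keys (`position_m` as a position along the route, `valid_until_s` as an expiry time) and the function name are hypothetical; they stand in for the positioning data and validity time mentioned above.

```python
def select_target(structured_events, vehicle_position_m, now_s):
    """Return the nearest event that is neither passed nor expired, else None."""
    for event in structured_events:  # list is already sorted from near to far
        passed = event["position_m"] <= vehicle_position_m
        expired = event["valid_until_s"] < now_s
        if not passed and not expired:
            return event
    return None

events = [
    {"position_m": 1200.0, "valid_until_s": 500.0,
     "expected_behavior": "slow down", "attention_area": "right side"},
    {"position_m": 1400.0, "valid_until_s": 900.0,
     "expected_behavior": "change lanes to the right",
     "attention_area": "left lane and right rear"},
]
target = select_target(events, vehicle_position_m=1250.0, now_s=600.0)
```

In this usage the first event is skipped because the vehicle has already passed its position, so the second event becomes the target and its two fields can be read directly.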

[0175] S1052. Compare the directional range of the visual gaze pattern in the driving behavior characteristics at the current moment with the directional range of the suggested attention area to obtain a first judgment result.

[0176] The first judgment result is used to characterize whether the driver's line of sight has covered the suggested area of attention.

[0177] In this embodiment, spatial region comparison logic is used. The visual gaze pattern, a semantic label representing the driver's current line of sight such as looking forward or looking to the left, is extracted from the driving behavior characteristics at the current moment. Simultaneously, the specific directional range is parsed from the suggested attention area; for example, the left lane corresponds to a spatial range from 30 degrees to the left front to 30 degrees to the left rear of the vehicle. A preset mapping table is used to map the semantic label of the visual gaze pattern to a specific directional angle range, for example, mapping looking to the left to [30 degrees, 90 degrees], with the front of the vehicle as 0 degrees. It is then determined whether the mapped directional range intersects with the directional range described by the suggested attention area. If there is an intersection, the first judgment result is that the driver's line of sight has covered the suggested attention area; otherwise, the first judgment result is that the driver's line of sight has not covered the suggested attention area.
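The range intersection test of S1052 can be sketched as an interval overlap check. The mapping table below is hypothetical except for the [30, 90] degree range for looking to the left; the forward range of plus or minus 15 degrees and the signed-angle convention (positive to the left, so the right side is negative) are assumptions for illustration.

```python
# Hypothetical mapping from gaze labels to angle ranges in degrees
# (0 is straight ahead, positive to the left).
GAZE_LABEL_RANGES = {
    "looking forward": (-15.0, 15.0),
    "looking to the left": (30.0, 90.0),
}

def ranges_intersect(a, b):
    """True if closed intervals a and b share at least one angle."""
    return max(a[0], b[0]) <= min(a[1], b[1])

def gaze_covers_area(gaze_label, area_range):
    """First judgment result: does the gaze range overlap the suggested area?"""
    gaze_range = GAZE_LABEL_RANGES.get(gaze_label)
    return gaze_range is not None and ranges_intersect(gaze_range, area_range)
```

With these values, `gaze_covers_area("looking forward", (-90.0, -30.0))` is false, because a forward gaze does not overlap an attention area on the right side.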

[0178] S1053. Compare the operation intention of the operation intention mode in the driving behavior characteristics at the current moment with the operation type of the expected driving behavior to obtain a second judgment result.

[0179] The second judgment result is used to characterize whether the driver's operating intention is consistent with the expected driving behavior.

[0180] In this embodiment, semantic matching or rule matching techniques are used to process the operation intention pattern within the same driving behavior feature. The operation intention pattern, representing the driver's hand operation intention (e.g., holding the steering wheel or operating the central control), is extracted from the driving behavior feature at the current moment. Simultaneously, the specific operation type semantics are parsed from the expected driving behavior, such as the operation intention of preparing to change lanes to the right. Through another set of preset rules or mapping relationships, it is determined whether the operation intention pattern matches the operation type required by the expected driving behavior. For example, holding the steering wheel or preparing to change lanes can match the expected driving behavior of changing lanes to the right, generating a second judgment result that the operation intention is consistent. Operating the central control or using the mobile phone is considered a mismatch, generating a second judgment result that the operation intention is inconsistent.

[0181] S1054. Compare the value of the dynamic risk indicator with the preset attention distraction threshold to obtain the comparison result.

[0182] The attention distraction threshold is a pre-set numerical limit used to determine whether the driver's attention distraction level has reached a level that requires attention; the comparison result is used to characterize whether the driver's attention distraction level exceeds the allowable range.

[0183] In this embodiment of the application, numerical comparison logic reads the current value of the dynamic risk indicator calculated in real time and compares it with a preset attention distraction threshold, for example, a threshold of 1.5. If the value of the dynamic risk indicator is greater than or equal to this threshold, it is determined that the driver's attention distraction exceeds the allowable range, and a comparison result of attention distraction exceeding the limit is generated; if it is less than this threshold, it is determined to be within the allowable range, and a comparison result of attention distraction not exceeding the limit is generated.

[0184] S1055. The first judgment result, the second judgment result, and the comparison result are combined to generate an evaluation result.

[0185] The evaluation results are used to characterize the degree to which the driver's current driving behavior characteristics conform to the driving behavior required by the target road condition information.

[0186] In this embodiment, a rule engine or decision tree is used for comprehensive judgment. The first judgment result, the second judgment result, and the comparison result are input into the decision logic. For example, a satisfactory evaluation result is generated only when the first judgment result is that the line of sight covers the suggested area, the second judgment result is that the operation intention is in line with expectations, and the comparison result is that attention distraction does not exceed the limit. In any other case, such as when the line of sight does not cover the suggested area, the operation intention is inconsistent, or attention distraction exceeds the limit, a non-satisfactory evaluation result is generated. This evaluation result is a comprehensive judgment used to characterize the degree of conformity between the driver's current behavior and state and the best response required by the road conditions ahead.

[0187] As an example, first, in step 1051, assume that the structured road condition information list contains two events that have not been passed: "100 meters ahead, construction, it is recommended to slow down and pay attention to the right" and "300 meters ahead, congestion, it is recommended to slow down". The first, nearest event is selected, so the target road condition information is "100 meters ahead, construction, it is recommended to slow down and pay attention to the right", the expected driving behavior is to slow down, and the suggested area of attention is the right side.

[0188] Secondly, in step 1052, assuming that the visual gaze pattern in the current driving behavior characteristics is looking forward, it is determined that the directional range corresponding to looking forward (e.g., ±15 degrees directly in front of the vehicle) and the directional range corresponding to focusing on the right side (e.g., 30 to 90 degrees to the right of the vehicle) do not overlap. Therefore, the first judgment result is negative, that is, the line of sight does not cover the suggested focus area.

[0189] Next, in step 1053, assuming that the operation intention mode in the current driving behavior characteristics is to keep the steering wheel, it is determined that keeping the steering wheel is compatible with the expected driving behavior of deceleration. Deceleration does not require special hand operation, and keeping the steering wheel is sufficient. Therefore, the second judgment result is obtained as yes, that is, the operation intention is in line with expectations.

[0190] Then, in step 1054, assume that the current value of the dynamic risk indicator is $D$ and the preset attention distraction threshold is $D_{\text{th}}$. Because $D \geq D_{\text{th}}$, the comparison result is that the degree of distraction exceeds the allowable range.

[0191] Finally, in step 1055, the first judgment result (no, the line of sight does not cover the suggested area), the second judgment result (yes, the operation intention is consistent), and the comparison result (yes, the distraction exceeds the limit) are combined. Based on the simple logic that the overall result is non-satisfactory if any one of the three conditions is not met, a non-satisfactory evaluation result is generated. This indicates that although the driver's operation intention was correct, their gaze was not on the area that required attention and their attention was generally scattered, which is inconsistent with the requirements of the construction ahead. The above example is only one example of logical judgment in this application; the actual comprehensive logic may be more complex, and this application does not limit it.
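The comparison of S1054 and the combination logic of S1055 can be sketched together. The function name, the returned dictionary keys, and the risk value of 1.6 in the usage line are assumptions for illustration; the threshold of 1.5 is the example value given for S1054.

```python
def evaluate(gaze_covers_area, intent_matches, risk, threshold):
    """Combine the three sub-judgments into one evaluation result."""
    distraction_exceeded = risk >= threshold  # comparison result (S1054)
    # Satisfactory only if all three conditions are met (S1055).
    satisfied = gaze_covers_area and intent_matches and not distraction_exceeded
    return {
        "gaze_covers_area": gaze_covers_area,
        "intent_matches": intent_matches,
        "distraction_exceeded": distraction_exceeded,
        "satisfied": satisfied,
    }

# Example: gaze not on the suggested area, intent consistent, risk above threshold.
result = evaluate(gaze_covers_area=False, intent_matches=True, risk=1.6, threshold=1.5)
```

Keeping the three sub-judgments in the result, rather than only the overall flag, lets later steps inspect which condition failed.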

[0192] This application establishes an evaluation mechanism that matches the driver's internal state with external road demands in real time and with fine precision. This mechanism can accurately determine whether the driver is prepared to deal with complex road conditions ahead, thus providing precise decision-making triggers for subsequent intelligent dialogue intervention.

[0193] S106. Based on the evaluation results, when the dynamic risk index exceeds a preset threshold and the driving behavior characteristics do not meet the expected behavior corresponding to the road condition information, generate and output guiding dialogue content related to the current road condition.

[0194] In this embodiment of the application, S106 specifically includes:

[0195] S1061. When the evaluation results indicate that the dynamic risk index exceeds the preset attention distraction threshold, the visual gaze pattern in the driving behavior characteristics does not cover the suggested attention area, and the operation intention pattern in the driving behavior characteristics does not conform to the expected driving behavior, dialogue generation is triggered.

[0196] In this embodiment, the evaluation results are analyzed in detail through a decision logic module. The evaluation results include three independent sub-judgments: the comparison result of dynamic risk indicators, the matching result of visual gaze patterns (i.e., the first judgment result), and the matching result of operational intention patterns (i.e., the second judgment result). A preset logical trigger condition is that intervention is determined to be necessary only when the comparison result indicates that the dynamic risk indicators exceed the attention distraction threshold, the first judgment result indicates that the visual gaze pattern does not cover the suggested attention area, and the second judgment result indicates that the operational intention pattern does not conform to the expected driving behavior. A dialogue generation command signal is then generated. This logic ensures that voice guidance is only initiated in the most dangerous or mismatched situation where the driver's attention is distracted and they are neither looking at the right place nor doing the right thing.

[0197] S1062. Based on the event type in the target road condition information, select the corresponding guidance statement framework from the preset voice template library.

[0198] The preset voice template library is a collection of pre-stored text templates with fixed sentence structures and variable slots. Each template corresponds to a specific type of road event and dialogue intent. The guiding statement framework is a specific template selected from the voice template library. It contains the core structure of the dialogue and the information positions or slots to be filled.

[0199] In this embodiment, query matching is used. After dialogue generation is triggered, the target road condition information that caused the trigger is obtained immediately, and its event type field is parsed, for example construction, congestion, or slow-moving vehicles ahead. This event type is used as a keyword to search the preset voice template library. The library can be understood as a dictionary or database in which each record associates a specific event type with one or more guidance statement frameworks. The query uses the current event type as the key to retrieve and select the best-matching guidance statement framework. For example, for the event type of construction, a selectable framework is: "Please note that there is (event type) ahead (distance), it is recommended that you (expected driving behavior), and pay attention to (suggested area of attention)."
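
As a rough illustration of the lookup in S1062, the template library can be modeled as a dictionary keyed by event type. Everything named below (`VOICE_TEMPLATES`, `select_frame`, the slot names, and the wording of the congestion entry) is an assumption made for this sketch; only the construction frame follows the example given above.

```python
from typing import Optional

# Hypothetical template library: event type -> guidance statement framework.
# Slots use Python format-field syntax in place of the parenthesized positions.
VOICE_TEMPLATES: dict[str, str] = {
    "construction": (
        "Please note that there is {event_type} ahead {distance}, "
        "it is recommended that you {expected_behavior}, "
        "and pay attention to {attention_area}."
    ),
    # Invented entry, included only to show a second key.
    "congestion": (
        "There is {event_type} ahead {distance}, "
        "please {expected_behavior} and pay attention to {attention_area}."
    ),
}


def select_frame(event_type: str) -> Optional[str]:
    """Return the framework registered for the event type, or None if absent."""
    return VOICE_TEMPLATES.get(event_type)


frame = select_frame("construction")
```

Returning `None` for an unknown event type leaves the fallback policy (stay silent, or use a generic sentence) to the caller; this application does not specify one.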

[0200] S1063. Fill the distance to the event location, the expected driving behavior, and the suggested area of attention into the guidance statement framework to generate complete guidance voice dialogue content, and broadcast the guidance voice dialogue content through the car audio system.

[0201] In this embodiment, text filling and speech synthesis technology are used to generate the final speech content. The distance to the event location, expected driving behavior, and suggested attention area extracted from the target road condition information are used to perform text replacement operations on the guidance statement framework. For example, "100 meters," "decelerate," and "right side" are filled into the reserved positions in the guidance statement framework, namely, distance, expected driving behavior, and suggested attention area. After the filling is completed, a grammatically complete and specific text sentence is generated, which is the guiding voice dialogue content. The vehicle's text-to-speech module is called to convert this text content into an audio signal, which is then played through the vehicle's audio system, thereby completing the voice guidance for the driver.
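
The text replacement and broadcast described above might be sketched as follows. `FRAME`, `fill_frame`, and `broadcast` are hypothetical names, and the `speak` callback stands in for the vehicle's text-to-speech module, whose real interface this application does not describe.

```python
import string
from typing import Callable, Mapping

# Assumed framework with named slots for the reserved positions.
FRAME = (
    "There is {event_type} ahead {distance}, please {expected_behavior}, "
    "and pay attention to {attention_area}."
)


def fill_frame(frame: str, slots: Mapping[str, str]) -> str:
    """Replace every slot in the frame; raise if a slot value is missing."""
    needed = {name for _, name, _, _ in string.Formatter().parse(frame) if name}
    missing = needed - set(slots)
    if missing:
        raise KeyError(f"missing slot values: {sorted(missing)}")
    return frame.format(**slots)


def broadcast(text: str, speak: Callable[[str], None]) -> None:
    """Hand the finished sentence to a text-to-speech callback."""
    speak(text)


text = fill_frame(
    FRAME,
    {
        "event_type": "construction",
        "distance": "100 meters",
        "expected_behavior": "decelerate",
        "attention_area": "the right side",
    },
)
spoken: list[str] = []          # stand-in for the audio system
broadcast(text, spoken.append)
```

Checking for missing slots before formatting keeps a half-filled sentence from ever reaching the speaker.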

[0202] As an example, first, in step 1061, assume the received evaluation result is as follows: the dynamic risk index value is 2.0, which exceeds the threshold of 1.5; the first judgment result is no, meaning the suggested attention area on the right side is not covered by the driver's gaze; and the second judgment result is yes, meaning the operation intention of holding the steering wheel conforms to the expected behavior of deceleration. The threshold condition and the gaze condition are satisfied, but the operation intention condition is not, because the intention conforms to expectations. Since all three conditions must be satisfied simultaneously, dialogue generation is not triggered in this case. Now assume a different situation: the driver is operating the screen, so the operation intention does not conform to the expected behavior. All three conditions are then satisfied, and the dialogue generation instruction is triggered.
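
The two situations in this example can be checked with a small sketch of the trigger condition. The function name and signature are assumptions for illustration; the numeric values 2.0 and 1.5 are those of the example.

```python
def should_trigger_dialogue(
    risk_index: float,
    threshold: float,
    gaze_covers_area: bool,
    intent_matches: bool,
) -> bool:
    """All three conditions must hold: distracted, not looking, not acting."""
    return risk_index > threshold and not gaze_covers_area and not intent_matches


# Situation 1: holding the steering wheel conforms to decelerating.
case_1 = should_trigger_dialogue(2.0, 1.5, gaze_covers_area=False, intent_matches=True)

# Situation 2: operating the screen does not conform to the expected behavior.
case_2 = should_trigger_dialogue(2.0, 1.5, gaze_covers_area=False, intent_matches=False)

print(case_1, case_2)  # False True
```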

[0203] Then, in step 1062, assuming that the event type of the target road condition information corresponding to the trigger is construction, the guidance statement framework corresponding to construction is queried in the voice template library, and a suitable framework is found: "There is (event type) ahead (distance), please (expected driving behavior), and pay attention to (suggested area)."

[0204] Finally, through step 1063, the specific parameters are obtained: distance is 100 meters, expected driving behavior is to slow down, and suggested attention area is the right side. These parameters are filled into the corresponding positions in the frame to generate complete guiding voice dialogue text: "Construction ahead 100 meters, please slow down and pay attention to the right." This text is then converted into speech and broadcast through the car audio system. The above example is only one example of logic and content generation in this application. The actual triggering logic and statement template can be designed in a more complex way according to safety policies and user experience, and this application does not limit them.

[0205] Through the above steps, this application enables the automatic generation and output of highly context-dependent personalized voice guidance in accurately determined high-risk or mismatched scenarios. This transforms the conclusions of intelligent perception and assessment into direct, clear, and timely safety assistance operations for the driver, thereby effectively improving the safety and collaborative efficiency of human-machine co-driving.

[0206] Figure 3 is a schematic diagram of a specific implementation of an in-vehicle intelligent cockpit dialogue system based on multimodal interaction provided by this application. Referring to Figure 3, the system may include:

[0207] The data acquisition module 31 is used to collect the driver's eye data and gesture data through the sensing device of the in-vehicle smart cockpit.

[0208] The generation module 32 is used to perform anti-interference processing and eye tracking on the eye data to generate a gaze direction vector;

[0209] The conversion module 33 is used to convert the gaze direction vector and the gesture data into driving behavior features with semantic labels, and generate dynamic risk indicators based on the temporal pattern of the driving behavior features.

[0210] The parsing module 34 is used to receive and parse the structured road condition information of the road ahead of the vehicle in real time through the vehicle communication module. The road condition information describes road events or states that are about to occur within a specific distance range and that the driver needs to anticipate.

[0211] The evaluation module 35 is used to match and evaluate the dynamic risk indicators with the currently active driving behavior characteristics and the road condition information to obtain evaluation results.

[0212] The output module 36 is used to generate and output guiding dialogue content related to the current road condition based on the evaluation results, when the dynamic risk index exceeds a preset threshold and the driving behavior characteristics do not meet the expected behavior corresponding to the road condition information.
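
One possible way to picture how modules 31 to 36 hand data to one another is the skeleton below. It is a sketch under stated assumptions: the class name, field names, and call signatures are invented for illustration, and the lambdas are stubs rather than real sensing, tracking, or communication code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CockpitDialogueSystem:
    """Chains modules 31-36; each field is a placeholder callable."""
    acquire: Callable[[], tuple[Any, Any]]            # module 31: (eye data, gesture data)
    generate_gaze: Callable[[Any], Any]               # module 32: eye data -> gaze vector
    convert: Callable[[Any, Any], tuple[Any, float]]  # module 33: -> (features, risk index)
    parse_road: Callable[[], Any]                     # module 34: -> road condition info
    evaluate: Callable[[float, Any, Any], Any]        # module 35: -> evaluation result
    output: Callable[[Any, Any], Optional[str]]       # module 36: -> dialogue text or None

    def step(self) -> Optional[str]:
        """Run one pass through the six modules in order."""
        eye, gesture = self.acquire()
        gaze = self.generate_gaze(eye)
        features, risk = self.convert(gaze, gesture)
        road = self.parse_road()
        result = self.evaluate(risk, features, road)
        return self.output(result, road)


# Stub wiring, for illustration only.
system = CockpitDialogueSystem(
    acquire=lambda: ("eye-frame", "touching-screen"),
    generate_gaze=lambda eye: (0.0, 0.0, 1.0),
    convert=lambda gaze, gesture: ({"gaze": "front", "intent": gesture}, 2.0),
    parse_road=lambda: {"event_type": "construction"},
    evaluate=lambda risk, features, road: risk > 1.5,
    output=lambda result, road: f"{road['event_type']} ahead" if result else None,
)
message = system.step()
```

Passing each module in as a callable keeps the wiring independent of whether a given function is realized in hardware or software.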

[0213] The in-vehicle intelligent cockpit dialogue system based on multimodal interaction in this embodiment is used to implement the aforementioned in-vehicle intelligent cockpit dialogue method based on multimodal interaction. For its specific implementation, refer to the description of the corresponding method embodiments above, which will not be repeated here.

[0214] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of the above-described multimodal interaction-based in-vehicle intelligent cockpit dialogue method.

[0215] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described multimodal interaction-based in-vehicle intelligent cockpit dialogue methods.

[0216] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as USB flash drives, read-only memory, random access memory, portable hard drives, magnetic disks, or optical disks.

[0217] Embodiments of the present invention also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the embodiments of the in-vehicle intelligent cockpit dialogue method based on multimodal interaction.

[0218] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0219] The foregoing has provided a detailed description of a multimodal interaction-based in-vehicle intelligent cockpit dialogue method and system. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of this application.

Claims

1. A dialogue method for an in-vehicle intelligent cockpit based on multimodal interaction, characterized in that the method includes: the driver's eye data and gesture data are collected through the sensing devices of the in-vehicle intelligent cockpit; the eye data is subjected to anti-interference processing and eye tracking to generate a gaze direction vector; the gaze direction vector and the gesture data are converted into driving behavior features with semantic labels, and dynamic risk indicators are generated based on the temporal patterns of the driving behavior features; structured road condition information of the road ahead is received and parsed in real time through the vehicle communication module, the road condition information describing road events or states that are about to occur within a specific distance range and that the driver needs to anticipate; the dynamic risk indicators are matched and evaluated with the currently active driving behavior features and the road condition information to obtain evaluation results; and, based on the evaluation results, when the dynamic risk index exceeds a preset threshold and the driving behavior features do not meet the expected behavior corresponding to the road condition information, guiding dialogue content associated with the current road condition is generated and output.

2. The method according to claim 1, characterized in that subjecting the eye data to anti-interference processing and eye tracking to generate a gaze direction vector includes: a facial image sequence is acquired using a facial camera facing the driver, and an eye region image containing the pupil region and the corneal reflection point is extracted from the facial image sequence; based on the eye region image, the geometric center point of the pupil region is determined as the pupil center point, and the corneal reflection point formed on the cornea by the near-infrared light source in the in-vehicle intelligent cockpit is located; based on the relative positional relationship between the pupil center point and the corneal reflection point, the relative gaze direction of the driver with respect to the facial camera is calculated; head posture data is acquired from the inertial measurement unit in the in-vehicle intelligent cockpit, the head posture data including the rotation angle of the head; and, based on the head posture data, the relative gaze direction is transformed from a coordinate system based on the facial camera to a global coordinate system with the vehicle as the origin, resulting in the gaze direction vector.

3. The method according to claim 1, characterized in that converting the gaze direction vector and the gesture data into driving behavior features with semantic labels, and generating dynamic risk indicators based on the temporal patterns of the driving behavior features, includes: the gaze direction vector is matched with preset orientation intervals, and each matched orientation interval is assigned a semantic label describing spatial orientation to generate a visual gaze pattern with semantic labels; the gesture data is categorized into predefined gesture intent categories, and each gesture intent category is assigned a semantic label describing the operation intent, so as to generate an operation intention pattern with semantic labels; the visual gaze pattern and the operation intention pattern at the same moment are combined to form a driving behavior feature that describes the driver's comprehensive behavior at the corresponding moment; within a set time window, multiple consecutively occurring driving behavior features are arranged in chronological order to form a behavior feature sequence, and the change patterns between different features in the behavior feature sequence are analyzed; and, based on the change patterns, a dynamic risk index is calculated to describe the degree of concentration or dispersion of the driver's attention to the road environment within the set time window.

4. The method according to claim 3, characterized in that arranging, within a set time window, multiple consecutively occurring driving behavior features in chronological order to form a behavior feature sequence, and analyzing the change patterns between different features in the behavior feature sequence, includes: a fixed-duration sliding time window is set to continuously receive driving behavior features generated in chronological order; the driving behavior features that enter the sliding time window are arranged in chronological order to form the behavior feature sequence; the number of occurrences of each driving behavior feature in the behavior feature sequence is counted within the sliding time window, and the occurrence frequency of each driving behavior feature in the sliding time window is calculated based on the number of occurrences; the number of times the feature type changes between two adjacent driving behavior features is counted within the sliding time window; and the change pattern of the behavior feature sequence is determined based on the occurrence frequency and the number of changes.

5. The method according to claim 1, characterized in that receiving and parsing, in real time through the vehicle communication module, structured road condition information of the road ahead of the vehicle, the road condition information describing road events or states that are about to occur within a specific distance range and that the driver needs to anticipate, includes: raw road condition data broadcast by roadside communication devices or cloud-based traffic services is received through the vehicle communication module; road description information related to the vehicle's current lane is extracted from the raw road condition data; road condition information ahead, for which the distance to the event location is less than a preset distance threshold, is filtered out from the road description information; and the road condition information ahead is arranged in order of relative distance from near to far to form the structured road condition information.

6. The method according to claim 1, characterized in that matching and evaluating the dynamic risk indicators with the currently active driving behavior features and the road condition information to obtain the evaluation results includes: the description information of the event that is closest to the vehicle and has not yet been passed is extracted from the structured road condition information as the target road condition information, and the expected driving behavior and the suggested attention area are extracted from the target road condition information; the directional range of the visual gaze pattern in the driving behavior features at the current moment is compared with the directional range of the suggested attention area to obtain a first judgment result; the operation intention of the operation intention pattern in the current driving behavior features is compared with the operation type of the expected driving behavior to obtain a second judgment result; the value of the dynamic risk indicator is compared with a preset attention distraction threshold to obtain a comparison result; and the first judgment result, the second judgment result, and the comparison result are combined to generate the evaluation result.

7. The method according to claim 1, characterized in that generating and outputting, based on the evaluation results, guiding dialogue content associated with the current road condition when the dynamic risk index exceeds a preset threshold and the driving behavior features do not meet the expected behavior corresponding to the road condition information, includes: when the evaluation results indicate that the dynamic risk index exceeds the preset attention distraction threshold, the visual gaze pattern in the driving behavior features does not cover the suggested attention area, and the operation intention pattern in the driving behavior features does not conform to the expected driving behavior, dialogue generation is triggered; based on the event type in the target road condition information, the corresponding guidance statement framework is selected from the preset voice template library; and the distance to the event location, the expected driving behavior, and the suggested area of attention are filled into the guidance statement framework to generate complete guidance voice dialogue content, which is then broadcast through the car audio system.

8. A vehicle-mounted intelligent cockpit dialogue system based on multimodal interaction, characterized in that the system includes: a data acquisition module, used to collect the driver's eye data and gesture data through the sensing devices of the in-vehicle intelligent cockpit; a generation module, used to perform anti-interference processing and eye tracking on the eye data to generate a gaze direction vector; a conversion module, used to convert the gaze direction vector and the gesture data into driving behavior features with semantic labels, and to generate dynamic risk indicators based on the temporal patterns of the driving behavior features; a parsing module, used to receive and parse structured road condition information of the road ahead of the vehicle in real time through the vehicle communication module, the road condition information describing road events or states that are about to occur within a specific distance range and that the driver needs to anticipate; an evaluation module, used to match and evaluate the dynamic risk indicators with the currently active driving behavior features and the road condition information to obtain evaluation results; and an output module, used to generate and output, based on the evaluation results, guiding dialogue content related to the current road condition when the dynamic risk index exceeds a preset threshold and the driving behavior features do not meet the expected behavior corresponding to the road condition information.

9. An electronic device, characterized in that it includes: a memory, used to store a computer program; and a processor, configured to implement the steps of the in-vehicle intelligent cockpit dialogue method based on multimodal interaction as described in any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the in-vehicle intelligent cockpit dialogue method based on multimodal interaction as described in any one of claims 1 to 7.