Conditional autonomous driving takeover prompting method and system
By performing frame-level time alignment and expert risk perception scanning strategy evaluation on the multi-source data streams collected by the autonomous driving takeover system, a driver object-level gaze sequence is generated, which solves the problems of redundant alarms and risk perception blind spots in the existing system, and achieves more efficient takeover prompts and improved safety.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-19
AI Technical Summary
Existing autonomous driving takeover systems rely on context-driven threshold triggers, which struggle to reflect the driver's real-time risk perception status, tend to generate redundant alarms, and cannot specifically compensate for blind spots in risk perception, resulting in takeover failures and insufficient safety.
By collecting multi-source data streams and performing frame-level temporal alignment, a driver object-level gaze sequence is generated. Then, using an interpretable expert risk perception scanning strategy and a Markov decision process, the driver's risk perception coverage is assessed, and targeted augmented reality prompts are output.
It reduces redundant alarms, improves takeover success rate and safety margin, reduces driver cognitive load, and enhances system real-time performance and interpretability.
Figure CN122244842A_ABST
Description
Technical Field
[0001] This invention relates to the field of intelligent connected vehicles and autonomous driving safety technology, and in particular to a conditional autonomous driving takeover prompting method and system.
Background Technology
[0002] The development of autonomous driving technology enables vehicles to undertake some or all of the dynamic driving tasks within a specific Operational Design Domain (ODD). Especially in conditional autonomous driving mode, vehicles can stably perform longitudinal and lateral control in scenarios such as highways and traffic congestion, significantly reducing the driver's workload. Compared with traditional manual driving, conditional autonomous driving has advantages in environmental perception, control stability, and execution consistency, and is expected to improve traffic safety and efficiency. However, limited by factors such as perception capability boundaries, road traffic complexity, and long-tail extreme scenarios, conditional autonomous driving systems still need to issue a takeover request (TOR) to the driver when reaching their capability boundaries, requiring the driver to regain control and make the correct decision within a short period. The takeover process is highly time-sensitive and risky. If the driver fails to complete sufficient risk perception and control takeover within the critical time window, problems such as insufficient emergency braking and delayed steering decisions can easily follow, increasing the risk of accidents.
[0003] Currently, prompts and guidance for takeover requests primarily rely on context-driven takeover prompt strategies: the system determines whether to issue a prompt, as well as the prompt's intensity, frequency, and target, based on external environmental risk indicators (such as time to collision (TTC), relative speed, distance threshold, road geometry, or rule-triggered conditions). While this approach is relatively straightforward in engineering implementation, it still has significant limitations in actual takeover processes, mainly in the following aspects: First, there is a significant problem of redundant alarms and misaligned prompts. Context-driven prompts often only provide alerts based on the magnitude of environmental risk, without assessing whether the driver has already paid attention to the risky target. Continuing to provide prompts when the driver has already made an effective observation can easily lead to redundant prompts, passive interruptions to attention, increased cognitive load, and even "alarm fatigue," thereby reducing the quality of takeover.
[0004] Secondly, the system neglects the driver's visual perception state and blind spots in risk coverage. Successful takeover depends not only on the presence of external hazards but also on the driver's accurate perception of key risk sources within the critical takeover window. Most existing systems do not incorporate the driver's "what they have seen and what they haven't seen" into their triggering logic, lacking assessment of the driver's visual search trajectory and the extent of object-level risk coverage. In scenarios with multiple risk sources, obstructions, and potential risks, the system's suggested targets may not match the driver's actual missing risk information, resulting in prompts that fail to compensate for the driver's perceptual blind spots.
[0005] Third, takeover failure is significantly correlated with risk perception bias. Extensive research and testing experience show that takeover failure often stems from risk perception bias in the initial stages of takeover: delayed or missed attention to key targets by the driver leads to delayed hazard identification and decision-making, thereby compressing safety margins and reducing the success rate of takeover. Particularly within the short critical time window after the takeover request (TOR) is triggered, whether the driver's visual search covers key risk objects has a decisive impact on subsequent control decisions. Alert mechanisms triggered solely by external risk thresholds are insufficient to specifically correct this type of perception bias.
[0006] Furthermore, from an engineering implementation perspective, the takeover alert system must balance real-time performance with deployment costs: it must achieve millisecond-level response under onboard computing power constraints while avoiding the introduction of complex models that could cause additional latency and instability. While some existing deep learning-based driver state assessment methods offer some predictive capabilities, they often suffer from insufficient interpretability, high computational overhead, and difficulty in achieving closed-loop coordination with alert strategies, limiting their reliable application within the critical takeover window.
[0007] Therefore, there is an urgent need to design an autonomous driving takeover prompt method that can reduce redundant alarms, reduce cognitive load, and improve takeover success rate and safety margin within the critical window after a takeover request (TOR), thereby improving the overall safety and availability of conditional autonomous driving systems.
Summary of the Invention
[0008] The purpose of this invention is to overcome the shortcomings of existing autonomous driving takeover systems, such as reliance on context-driven threshold triggering, difficulty in reflecting the driver's real-time risk perception status, easy generation of redundant alarms, and difficulty in specifically compensating for risk perception blind spots. This invention provides a conditional autonomous driving takeover prompting method and system that can quantitatively evaluate the driver's risk perception ability in real time within the critical takeover window, and trigger augmented reality prompts for key risk objects when the evaluation results are insufficient or there are gaps in risk perception coverage. This reduces cognitive load, reduces redundant alarms, and improves takeover safety and safety margin.
[0009] The objective of this invention can be achieved through the following technical solutions: According to a first aspect of the present invention, a conditional autonomous driving takeover prompting method is provided, comprising:
The multi-source data streams collected during the takeover process are aligned at the frame level. The multi-source data streams include eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions. Instance segmentation is performed on the first-person driving foreground visual images to obtain instance regions of risk-related objects. The gaze points are then mapped to the instance regions to generate a driver object-level gaze sequence.
The expert visual scanning process is modeled as a partially observable Markov decision process, and an interpretable expert risk perception scanning strategy is adopted to generate multiple expert object-level reference gaze sequences based on the context of the takeover scenario.
After receiving the driver's object-level gaze sequence and multiple expert object-level reference gaze sequences, object-level scanning path matching is performed, and the risk perception coverage gap judgment result is output based on the soft voting mechanism. When a coverage gap is determined, the uncovered key risk objects are used as the prompt target objects.
After the prompt target object is received, it is overlaid and rendered on the first-person driving foreground visual image and output to the display terminal. When it is determined that the coverage is sufficient, the prompt is suppressed or canceled.
[0010] Preferably, the driver object-level gaze sequence includes multiple gaze events, each gaze event including at least the gaze start and end time, gaze point coordinates and the corresponding risk-related object identifier, wherein gaze points that hit the same object consecutively are merged into a gaze event.
[0011] Preferably, the expert visual scanning process is modeled as a partially observable Markov decision process, wherein the state variables $s_t$ in its state space include the features of the currently viewed object, the features of neighboring objects, the features of vehicles approaching from behind, and the state of the vehicle itself; the action variables $a_t$ in its action space include maintaining the current gaze target, switching to a neighboring object to the left, switching to a neighboring object to the right, and pointing towards the rear view area; the observed variables $o_t$ in its observation space include first-person visual input, the set of traffic objects and their kinematic characteristics, and the vehicle's state; and the output strategy is $\pi_\theta(a_t \mid o_t, z_t)$, where $z_t$ is a latent variable and $\theta$ denotes the strategy parameters.
[0012] Preferably, the interpretable expert risk perception scanning strategy is constructed using a latent variable behavior cloning algorithm based on expectation maximization (EM). In the E-step, the latent variables corresponding to the action sequence are sampled based on particle filtering and weighted and resampled according to the observed likelihood. In the M-step, an enumeration search is performed in the preset strategy sketch space to generate interpretable rules containing logical expressions and real parameters.
[0013] Preferably, during the training of the interpretable expert risk perception scanning strategy, the EM training objective is to minimize the cross-entropy loss plus a complexity penalty, and the calculation expression is:
$$\mathcal{L}(\theta) = -\sum_{t} \mathbb{E}_{z_t \sim q(z_t)}\left[\log \pi_\theta(a_t \mid o_t, z_t)\right] + \lambda\,\Omega(\pi_\theta)$$
In the formula: $q(z_t)$ is the approximate posterior distribution of the latent variable $z_t$; $\Omega(\pi_\theta)$ is the complexity penalty term; $\theta$ denotes the strategy parameters; $\lambda$ is the weighting coefficient; $\pi_\theta(a_t \mid o_t, z_t)$ is the gaze strategy output given the observed variable $o_t$ and the latent variable $z_t$; $a_t$ is the action variable; and $t$ is the time step.
[0014] Preferably, after the takeover request is triggered, the moment at which the driver's gaze is detected returning to the forward road is taken as the perception re-entry point. From the perception re-entry point, a risk perception window of a preset duration is started to assess the risk perception capability.
[0015] Preferably, after receiving the driver's object-level gaze sequence and the expert's object-level reference gaze sequence, a similarity score is calculated based on the substitution matrix scoring function and dynamic programming sequence alignment, and object-level scanning path matching is performed, specifically including: the object instance region acquired in each frame of the sequence is taken as the instance-level dynamic region of interest; a substitution matrix scoring function is constructed, and the scoring elements in the substitution matrix are calculated, using the Sigmoid decay function, from the pixel spatial distance between the instance-level dynamic regions of interest of the different objects corresponding to the same frame in the two sequences; the substitution matrix scoring function satisfies the following: when the two regions of interest correspond to the same object, a first fixed high score is assigned; when they correspond to different objects, the score is calculated from the pixel distance between the regions of interest of the different objects using the Sigmoid decay function, with a distance inflection point threshold set; after the optimal alignment path in the substitution matrix is found using a dynamic programming algorithm, the score elements of all frames on the optimal alignment path are summed and the gap penalty is subtracted to obtain the first score; and the first score is normalized by the sum of the self-scores of the driver's object-level gaze sequence and the expert's object-level reference gaze sequence under ideal perfect matching, to obtain the similarity score.
[0016] Preferably, the driver risk perception coverage gap assessment result is output based on a soft voting mechanism, specifically including: calculating a similarity score against each of the multiple expert object-level reference gaze sequences; calculating the proportion of expert object-level reference gaze sequences whose similarity score exceeds a preset threshold; when the proportion is lower than the preset proportion, determining that there is a coverage gap and triggering a prompt; otherwise, determining that there is no coverage gap and suppressing the prompt.
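The soft voting decision described above can be illustrated with a minimal Python sketch. The function name `has_coverage_gap`, its parameter names, and the numeric thresholds used in the usage note are hypothetical; the text only states that a preset score threshold and a preset proportion are used.

```python
def has_coverage_gap(similarity_scores, score_threshold, min_agree_ratio):
    """Soft vote over the expert reference sequences.

    similarity_scores: one driver-vs-expert similarity score per reference.
    Returns True (coverage gap, prompt) when the share of references whose
    score exceeds score_threshold is below min_agree_ratio.
    """
    if not similarity_scores:
        raise ValueError("at least one expert reference score is required")
    agree = sum(1 for s in similarity_scores if s > score_threshold)
    ratio = agree / len(similarity_scores)
    return ratio < min_agree_ratio
```

For instance, with four reference scores `[0.9, 0.8, 0.2, 0.1]`, a score threshold of 0.5, and a required proportion of 0.6, only half of the references agree with the driver, so a gap would be reported and a prompt triggered.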
[0017] Preferably, the step of overlaying and rendering the received prompt target object onto the first-person driving foreground visual image and then outputting it to the display terminal specifically includes: After receiving the prompt target object, spatial registration and overlay rendering are performed on the prompt target object. The spatial registration is based on the two-dimensional image coordinates of the prompt target object in the first-person driving foreground visual image and the preset field of view mapping relationship. The prompt element is rendered on the display terminal so that the prompt element and the real risk object in the driver's forward physical field of view can achieve visual perspective alignment and coverage. The duration, prominence, and / or trigger threshold of the prompts are dynamically adjusted based on the risk perception coverage gap determination results.
[0018] According to a second aspect of the present invention, a conditional autonomous driving takeover prompting system is provided, the system comprising:
The multimodal perception data synchronization module is used to perform frame-level time alignment on the multi-source data streams collected during the takeover process. The multi-source data streams include eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions. The module performs instance segmentation on the first-person driving foreground visual images to obtain instance regions of risk-related objects, and maps the gaze point to the instance regions to generate a driver object-level gaze sequence.
The expert risk perception strategy modeling module is used to model the expert visual scanning process as a partially observable Markov decision process. It adopts an interpretable expert risk perception scanning strategy to generate multiple expert object-level reference gaze sequences based on the context of the takeover scenario.
The real-time assessment module for driver risk perception ability is used to receive the driver's object-level gaze sequence and multiple expert object-level reference gaze sequences, perform object-level scanning path matching, and output the risk perception coverage gap judgment result based on the soft voting mechanism. When a coverage gap is determined, the uncovered key risk objects are used as the prompt target objects.
The prompt rendering and output module receives the prompt target object, overlays it onto the first-person driving foreground visual image, and outputs it to the display terminal. When sufficient coverage is determined, the prompt is suppressed or canceled.
Compared with the prior art, the present invention has the following beneficial effects: 1) The prompt triggering is changed from being driven by the context threshold to being driven by the risk perception ability assessment: Based on the driver's risk perception ability assessment results, prompts can be suppressed when the driver has already covered key risk information, and targeted guidance can be output when there is a coverage gap, which significantly reduces redundant alarms and alarm fatigue.
[0019] 2) More accurate gap location and more targeted prompts: By matching the scanning paths of the driver's object-level gaze sequence with the expert reference gaze sequence, key risk objects not covered by the driver can be identified, making the prompt target consistent with the driver's actual perceived blind spot, improving the efficiency of risk situation establishment and the quality of takeover.
[0020] 3) Real-time performance and deployment friendliness: It adopts lightweight computing mechanisms such as instance-level dynamic regions of interest, substitution matrix and dynamic programming alignment, which can achieve low-latency closed-loop evaluation and prompt output under the constraints of on-board computing power, and meet the real-time requirements of taking over critical windows.
[0021] 4) Reduced cognitive load and increased safety margin: By using augmented reality presentation that prompts only when needed, it reduces interference from irrelevant information, lowers the cognitive load on the driver, improves the success rate of takeover, and increases safety margin indicators such as time to collision (TTC).
[0022] 5) High interpretability, easy verification and parameter tuning: The expert risk perception strategy modeling process adopts interpretable discrete gaze actions and rule expressions, which is conducive to engineering verification, threshold setting and safety review, and improves system usability and credibility.
Attached Figure Description
[0023] Figure 1 is a flowchart of the method of the present invention.
[0024] Figure 2 is a schematic diagram of the system architecture of the present invention.
[0025] Figure 3 is a block diagram illustrating the data flow implementation of the system architecture of the present invention.
[0026] Figure 4 is a diagram of the expert risk perception strategy modeling module.
[0027] Figure 5 is a diagram of the real-time assessment module for driver risk perception ability.
[0028] Figure 6 is a design diagram of the autonomous driving risk experiment scenarios.
[0029] Figure 7 is an example image showing the experimental platform scene with the AR-HUD prompt overlay.
Detailed Implementation
[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0031] Example: This embodiment provides a conditional autonomous driving takeover prompting method. The method extracts instance regions of interactive traffic entities from the first-person scene and accurately maps the gaze point into this semantic space, making explicit which specific risk entities the driver is actually attending to. This is a key premise for revealing the inherent risk perception and scanning patterns of human drivers. It removes the interference of purely geometric coordinates and provides a data foundation with real traffic interaction logic for the subsequent modeling of expert strategies, supporting the soundness and interpretability of the system's evaluation and intervention mechanism.
[0032] As shown in Figure 1, the conditional autonomous driving takeover prompting method in this embodiment specifically includes: S1. Perform frame-level time alignment on the multi-source data streams collected during the takeover process. The multi-source data streams include eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions. Perform instance segmentation on the first-person driving foreground visual images to obtain instance regions of risk-related objects, and map the gaze point to the instance regions to generate a driver object-level gaze sequence.
[0033] Specifically, it includes the following sub-steps: S11, Frame-level time alignment: Eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions are aligned to the same timeline (e.g., 60 Hz frame rate) using a unified timestamp mechanism to obtain, for each time $t$, the set of objects $O_t$ and the gaze point $g_t = (x_t, y_t)$. The gaze point coordinates are two-dimensional, and the set of objects can be obtained from the simulation environment or the output of vehicle perception.
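The alignment in S11 can be illustrated with a minimal Python sketch. It assumes each stream is a time-sorted list of (timestamp, value) samples and that every frame on the shared timeline takes the nearest sample from each stream; the function names `nearest_sample` and `align_streams` and the nearest-neighbor rule are illustrative assumptions, since the text only specifies a unified timestamp mechanism.

```python
from bisect import bisect_left

def nearest_sample(samples, t):
    """Return the value whose timestamp is closest to t.

    samples: non-empty list of (timestamp_seconds, value), sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_left(times, t)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    before, after = samples[i - 1], samples[i]
    # On a tie the earlier sample is used.
    return before[1] if t - before[0] <= after[0] - t else after[1]

def align_streams(streams, t_start, t_end, rate_hz=60.0):
    """Resample every stream onto one shared timeline at rate_hz.

    streams: dict mapping a stream name (e.g. "gaze") to its sample list.
    Returns one dict per frame with key "t" plus one key per stream.
    """
    n_frames = int(round((t_end - t_start) * rate_hz)) + 1
    frames = []
    for k in range(n_frames):
        t = t_start + k / rate_hz
        frame = {"t": t}
        for name, samples in streams.items():
            frame[name] = nearest_sample(samples, t)
        frames.append(frame)
    return frames
```

Rebuilding the timestamp list on every lookup keeps the sketch short; a real-time implementation would more likely keep a moving index per stream.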
[0034] S12. Semantic Space Representation and Instance Segmentation: Perform instance segmentation on first-person scene images to obtain instance masks or detection boxes of interactive traffic entities (such as surrounding vehicles, pedestrians, etc.) and related static targets, and extract the coordinates of the object center point in the image plane. This forms a semantic space representation, providing a spatial prior for subsequent gaze mapping and semantic-level attention analysis.
[0035] S13. Gaze Mapping and Driver Object-Level Gaze Sequence Generation: After projecting the gaze point onto the image plane, a hit determination is performed against the instance region (ROI) of each object: if the gaze point falls within the instance mask or detection box of object $o_i$, the object being gazed at at that moment is recorded as $o_i$; if no object is hit, it is recorded as a road area / background area / unclassified area. Gaze points that consecutively hit the same object are merged into a gaze event, forming a driver object-level gaze sequence $F^{D} = (f_1, f_2, \dots, f_N)$, where $f_k$ indicates the object identifier (including object category and instance ID) corresponding to the $k$-th gaze event, and $N$ represents the number of gaze events within the risk perception window.
[0036] S2. The expert visual scanning process is modeled as a partially observable Markov decision process (POMDP), and an interpretable expert risk perception scanning strategy is adopted to generate multiple expert object-level reference gaze sequences based on the context of the takeover scenario.
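The hit determination and merging in S13 might look like the following Python sketch. It assumes detection boxes rather than instance masks, and the names `GazeEvent`, `hit_object`, `build_gaze_sequence`, and the label `"background"` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class GazeEvent:
    object_id: str   # e.g. "car#1", or "background" when nothing is hit
    t_start: float
    t_end: float

def hit_object(gaze_xy, boxes):
    """Return the id of the first box containing the gaze point, else "background".

    boxes: dict mapping object_id -> (x_min, y_min, x_max, y_max) in pixels.
    """
    x, y = gaze_xy
    for object_id, (x0, y0, x1, y1) in boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return object_id
    return "background"

def build_gaze_sequence(frames):
    """Merge consecutive frames that hit the same object into gaze events.

    frames: iterable of (timestamp, gaze_xy, boxes) tuples in time order.
    """
    events = []
    for t, gaze_xy, boxes in frames:
        object_id = hit_object(gaze_xy, boxes)
        if events and events[-1].object_id == object_id:
            events[-1].t_end = t  # extend the current gaze event
        else:
            events.append(GazeEvent(object_id, t, t))
    return events
```

When boxes overlap, this sketch simply returns the first match in dictionary order; choosing among overlapping instances (for example by depth or mask membership) is left open by the text.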
[0037] Specifically, it includes the following sub-steps: S21, Discrete gaze action space and policy output.
[0038] To achieve an interpretable and low-overhead expression of expert scanning logic, this embodiment constructs a discrete gaze action space: $\mathcal{A} = \{\text{STAY}, \text{LEFT}, \text{RIGHT}, \text{BACK}\}$, where STAY indicates keeping the gaze on the current object; LEFT / RIGHT indicates switching to the left / right adjacent object; and BACK indicates pointing to the rear view area (such as the rearview mirror area or a vehicle approaching from behind).
[0039] The strategy $\pi_\theta$ output by expert risk perception strategy modeling can further output the next optimal gaze target, which is used to provide candidate prompt targets when there are coverage gaps.
[0040] S22, POMDP modeling and latent variable behavior cloning training.
[0041] The expert visual scanning process is modeled as a partially observable Markov decision process: the state variables $s_t$ in the state space include at least: features of the currently viewed object (category, relative distance, relative speed, TTC, etc.), features of the set of neighboring objects, features of vehicles approaching from behind, and the vehicle's status (speed, acceleration, lane position, autonomous driving mode status, TOR trigger information, etc.); the action space is derived from the discrete gaze action space, i.e., its action variables $a_t$ include maintaining the current gaze object, switching to a neighboring object to the left, switching to a neighboring object to the right, and pointing to the rear view area; the observation variables $o_t$ in the observation space include at least: a first-person driving foreground visual image, a set of traffic objects and their kinematic features, and the vehicle's status. To characterize the uncertainty that experts may have different risk concerns / scanning intentions under the same observations, this embodiment introduces latent variables $z_t$ in policy learning. These characterize potential factors related to attentional decisions, including but not limited to: risk profile level, type of attentional intent, priority of key objects, or scanning phase. The latent variable $z_t$ can be a discrete variable or a continuous variable.
[0042] Therefore, the expert gaze strategy can be represented as a conditional strategy $\pi_\theta(a_t \mid o_t, z_t)$, where $\theta$ denotes the strategy parameters or rule parameters.
[0043] Training employs latent variable behavior cloning within the EM framework, with the core objective of using the expert demonstration data $\mathcal{D}$ to improve the predicted probability of expert actions. This objective can be expressed as maximizing the (weighted) log-likelihood, or equivalently as minimizing the (weighted) cross-entropy loss (i.e., the negative log-likelihood), with a complexity penalty added to suppress rule inflation and overfitting:
$$\mathcal{L}(\theta) = \sum_{t} \mathbb{E}_{z_t \sim q(z_t)}\left[\mathrm{CE}\big(a_t, \pi_\theta(\cdot \mid o_t, z_t)\big)\right] + \lambda\,\Omega(\pi_\theta)$$
where: $q(z_t)$ is the approximate posterior distribution of the latent variable $z_t$; $\Omega(\pi_\theta)$ is the complexity penalty term; $\lambda$ is the weighting coefficient; $\pi_\theta(\cdot \mid o_t, z_t)$ is the expert gaze strategy given the observed variable $o_t$ and the latent variable $z_t$; and $\mathrm{CE}(\cdot)$ denotes the cross-entropy, which for an observed expert action $a_t$ equals $-\log \pi_\theta(a_t \mid o_t, z_t)$.
[0044] The synthesized strategy expression $\pi_\theta$ is viewed as a logic syntax tree, and the complexity penalty is calculated as:
$$\Omega(\pi_\theta) = \alpha N_{\text{op}} + \beta N_{\text{feat}}$$
where: $N_{\text{op}}$ indicates the total number of logical connectors (such as disjunction ∨, conjunction ∧, negation ¬) used in the strategy; $N_{\text{feat}}$ indicates the total number of driving environment features activated in the strategy (such as the distance and speed of the currently viewed target, the vehicle's yaw angle, etc.); and $\alpha$, $\beta$ are preset weight coefficients, usually set to 1, which means that the logical depth and feature dimension are penalized equally.
[0045] Specific application example: Suppose the model synthesizes a gaze transfer rule for a driver in an emergency takeover scenario (such as a collision risk with the vehicle in front). In this example, the number of feature variables is 3 (relative yaw angle, brake pedal opening, and the previous moment's action state), and the number of logical operators is 2 (two logical connectors). According to the formula above, the complexity penalty term for this strategy is 3 + 2 = 5.
[0046] During training, by adjusting the weight coefficient λ (in this embodiment, $\lambda_{\text{complexity}} = 0.02$), the model can effectively balance the minimization of the cross-entropy loss against logical simplicity, thereby avoiding the generation of overly complex "black box" rules and ensuring that the strategy remains highly interpretable to experts.
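The counting behind this worked example can be sketched in Python. The tuple-based rule encoding, the function names, and the concrete rule with its thresholds are all hypothetical; they only reproduce the counts stated in the text (three features, two connectors), with both preset weight coefficients set to 1.

```python
def count_nodes(rule):
    """Return (n_operators, feature_names) for a rule syntax tree.

    A rule is either a leaf ("feat", name, comparator, threshold) or an
    operator node (op, child, ...) with op in {"and", "or", "not"}.
    """
    if rule[0] == "feat":
        return 0, {rule[1]}
    n_ops, feats = 1, set()
    for child in rule[1:]:
        child_ops, child_feats = count_nodes(child)
        n_ops += child_ops
        feats |= child_feats
    return n_ops, feats

def complexity_penalty(rule, op_weight=1.0, feat_weight=1.0):
    """Weighted sum of logical connectors and distinct activated features."""
    n_ops, feats = count_nodes(rule)
    return op_weight * n_ops + feat_weight * len(feats)

# Hypothetical rule: (yaw > 0.1 AND brake > 0.3) AND previous action == STAY
example_rule = (
    "and",
    ("and",
     ("feat", "relative_yaw_angle", ">", 0.1),
     ("feat", "brake_pedal_opening", ">", 0.3)),
    ("feat", "previous_action", "==", "STAY"),
)
```

With this encoding, `complexity_penalty(example_rule)` evaluates to 5.0, matching the 3 + 2 = 5 result above. Features are counted once each even if they appear in several leaves, which is one possible reading of "features activated in the strategy".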
[0047] E-step (latent variable posterior inference / sampling): given the current parameters $\theta^{(k)}$, posterior inference of the latent variables is performed based on the observation sequence and the expert action sequence to obtain $q(z_{1:T}) \approx p(z_{1:T} \mid o_{1:T}, a_{1:T}; \theta^{(k)})$, where $z_{1:T}$ is the latent variable sequence given the observation sequence $o_{1:T}$ and the action sequence $a_{1:T}$, and $\theta^{(k)}$ denotes the strategy parameters at the current iteration round $k$. In some implementations, particle filtering is used to sample the latent variable sequence, and particles are weighted and resampled based on the observation likelihood; symbolic rules / interpretable priors are used to constrain / modulate the feasible values, transitions, or weights of the latent variables, eliminating implausible intention transitions and improving interpretability and stability.
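The weighting and resampling part of the E-step can be sketched as follows. This is a generic particle-filter step, not the patent's specific procedure: it assumes each particle carries one candidate latent value and that its likelihood is the probability the current strategy assigns to the observed expert action under that latent value. Systematic resampling and the names `normalize_weights` and `systematic_resample` are choices made for this sketch.

```python
def normalize_weights(likelihoods):
    """Turn per-particle likelihoods into weights that sum to 1."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one particle must have positive likelihood")
    return [value / total for value in likelihoods]

def systematic_resample(particles, weights, offset=0.5):
    """Resample particles in proportion to their weights.

    offset in [0, 1) places the first sampling point inside the first of
    n equal strata; in practice it would be drawn uniformly at random.
    """
    n = len(particles)
    positions = [(offset + i) / n for i in range(n)]
    resampled = []
    j = 0
    cumulative = weights[0]
    for p in positions:
        # Advance to the particle whose cumulative weight covers p.
        while p > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        resampled.append(particles[j])
    return resampled
```

Particles with high likelihood are duplicated and those with low likelihood tend to be dropped, so the surviving latent values concentrate on intentions that explain the expert's observed gaze actions.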
[0048] M-step (Behavior Cloning Update / Rule Enumeration Optimization): Based on the posterior weights or samples obtained in the E-step, the policy is updated according to the behavior cloning criterion to increase its prediction probability of expert actions under latent variable conditions. In some implementations, candidate rule skeletons are enumerated and searched within a preset policy sketch space to obtain the rule structure; and the real-valued parameters in the rules are numerically optimized to optimize the weighted cross-entropy (or equivalent weighted log-likelihood). At the same time, complexity penalty constraints are combined to avoid overfitting and rule inflation, thereby obtaining an interpretable "rule + parameter" policy expression.
[0049] Through the above EM iterations, when the convergence criterion is met (e.g., the objective function change is below a threshold or the cross-entropy of the validation set no longer decreases), the converged strategy $\pi_{\theta^*}$ is obtained. During the online phase, sampling or inference can be performed in combination with the current scene observations $o_t$, and multiple expert object-level reference gaze sequences are generated accordingly for subsequent risk perception capability evaluation and coverage gap identification.
[0050] S23. Generate expert object-level reference gaze sequence.
[0051] During the online phase, after a takeover request is triggered, at least one expert strategy model is selected from the expert strategy library based on the current scenario context, and $M$ expert object-level reference gaze sequences $\{F^{E}_m\}_{m=1}^{M}$ are generated, where $M$, the size of the expert reference sequence set, can be related to the number of policy models participating in matching in the expert policy library (e.g., one reference sequence generated for each policy model), or to the number of reference sequences generated by multiple random samplings of the same policy model; in some implementations, $M$ can be set to 20 or more to cover a reasonable diversity of expert scanning strategies.
[0052] S3. After receiving the driver's object-level gaze sequence and multiple expert object-level reference gaze sequences, perform object-level scanning path matching, and output the risk perception coverage gap judgment result based on the soft voting mechanism. When a coverage gap is determined, the uncovered key risk objects are used as the prompt target objects.
[0053] S31. Define the risk perception window.
[0054] This embodiment uses the moment when the driver's gaze returns to the forward road as the perception re-entry point. From this moment, a risk perception window (e.g., 2 seconds or other duration sufficient to cover the initial situation reconstruction phase of the takeover) is started to capture the key visual search process during the initial takeover phase, as shown in Figure 6.
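A minimal Python sketch of locating this window follows. It assumes gaze samples have already been labeled with the region they hit and that a label such as `"forward_road"` marks the forward road; the function name, the label, and the choice to return `None` when the gaze never returns are assumptions of this sketch.

```python
def risk_perception_window(tor_time, gaze_samples, window_s=2.0,
                           road_labels=("forward_road",)):
    """Find the risk perception window after a takeover request.

    gaze_samples: list of (timestamp_seconds, region_label) in time order.
    Returns (start, end), where start is the first sample at or after
    tor_time whose label is in road_labels (the perception re-entry point),
    or None if the gaze never returns to the forward road.
    """
    for t, label in gaze_samples:
        if t >= tor_time and label in road_labels:
            return t, t + window_s
    return None
```

For example, if the takeover request fires at 1.0 s while the driver is looking at the center console and the gaze first lands on the forward road at 1.5 s, the window spans 1.5 s to 3.5 s.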
[0055] S32, Instance-level dynamic ROI and object-level scan path.
[0056] Unlike fixed-grid area-of-interest (AOI) methods, this embodiment employs an instance-level dynamic region of interest (ROI): each frame uses the detected and tracked object instance region as the ROI, thereby ensuring that the matching evaluation is consistent with traffic risk semantics and adapts to object movement, occlusion, and scene changes, as shown in Figure 5.
[0057] S33. Substitution matrix scoring function and Sigmoid similarity.
[0058] A substitution matrix scoring function is constructed between Regions of Interest (ROIs) to evaluate the degree of matching between driver sequence elements and expert sequence elements. It satisfies the following: when two ROIs correspond to the same object, a first fixed high score is assigned; when they correspond to different objects, the score is calculated with the Sigmoid decay function based on the pixel distance between the ROIs of the different objects, and a distance inflection point threshold is set.
[0059] The specific mathematical expression is defined in terms of the following quantities: the regions of interest (ROIs) of the two objects being compared; the pixel distance between them; and the attenuation parameter.
[0060] This function ensures that matching of the same object yields a significantly high score, while matching of different objects attenuates as spatial distance increases, thus suppressing irrelevant alignment.
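The behavior described above can be sketched as follows. The exact functional form and all numeric values here are assumptions made for illustration: `same_score`, `cross_max`, `inflection_px`, and `decay_px` are hypothetical parameters, and the logistic form is one common way to realize a Sigmoid decay with an inflection point.

```python
import math


def substitution_score(obj_a: str, obj_b: str, pixel_distance: float,
                       same_score: float = 10.0, cross_max: float = 5.0,
                       inflection_px: float = 200.0, decay_px: float = 50.0) -> float:
    """Score one driver ROI against one expert ROI.

    Same object: fixed high score. Different objects: a logistic decay in
    the pixel distance, equal to cross_max / 2 at the inflection distance.
    """
    if obj_a == obj_b:
        return same_score
    # Clamp the exponent so math.exp cannot overflow for huge distances.
    z = min((pixel_distance - inflection_px) / decay_px, 700.0)
    return cross_max / (1.0 + math.exp(z))


near = substitution_score("lead", "left", 50.0)
far = substitution_score("lead", "left", 400.0)
```

Capping the different-object score below the same-object score (here `cross_max < same_score`) is a design choice of this sketch, made so that a nearby but different object never scores as high as a true hit.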
[0061] In the dynamic programming alignment process, a gap penalty (e.g., 3 or another value) can be set to constrain spurious alignments caused by excessive skipping. When dynamic programming searches for the matching path with the globally maximum similarity, leaving the skipping of sequence elements unrestricted (i.e., inserting gaps at no cost) would let the algorithm skip a large number of mismatched intermediate gaze events and piece together a few discrete high-scoring hits, producing an artificially high similarity score. With a gap penalty, every non-contiguous match or omitted event between the sequences incurs a definite deduction. This forces the matching process to respect the temporal order and continuity of visual scanning and prevents, at the algorithmic level, a driver's chaotic or scattered gazes with severe omissions from being misjudged as highly consistent with the expert strategy, thereby safeguarding the objectivity of the real-time risk perception assessment results.
[0062] S34. Dynamic programming sequence alignment and similarity normalization.
[0063] After receiving the driver object-level gaze sequence and the expert object-level reference gaze sequence, the object instance region obtained in each frame of the sequences is taken as the instance-level dynamic region of interest. The substitution matrix scoring function is constructed, and the scoring elements in the substitution matrix are calculated, using the Sigmoid decay function, based on the pixel spatial distance between the instance-level dynamic regions of interest of different objects corresponding to the same frame in the two sequences. After the optimal alignment path in the substitution matrix is found using a dynamic programming algorithm, the score elements of all frames on the optimal alignment path are summed and the gap penalty is subtracted to calculate the first score. The first score is then normalized based on the sum of the self-scores of the driver object-level gaze sequence and the expert object-level reference gaze sequence under ideal perfect matching, yielding the similarity score.
[0064] In this embodiment, the Needleman-Wunsch algorithm is used to align the two sequences across time with a certain tolerance, with the goal of capturing saccade patterns. For example, an expert may complete a front-left-front gaze action within 0.5-2 seconds, while a subject completes the same front-left-front gaze action within 1-2.5 seconds. The Needleman-Wunsch algorithm compares the consistency of the gaze regions of the two sequences along the globally optimal matching path, allowing gaps to be inserted so that a shift in the scan start time or the absence of a single gaze event is tolerated in the comparison.
[0065] As shown in Figure 5, this embodiment uses a dynamic programming alignment algorithm to calculate the maximum matching score between the driver object-level gaze sequence and each expert object-level reference gaze sequence, and normalizes it to obtain the similarity score.
[0066] This yields a set of similarity scores, one for each expert object-level reference gaze sequence.
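The alignment and normalization can be sketched with a small Needleman-Wunsch implementation. Several details are assumptions for illustration only: each sequence element is reduced to `(object id, ROI centre)`, the scoring parameters are hypothetical, and the normalization divides twice the raw score by the sum of the two self-alignment scores and clamps negative results to zero, which is one plausible reading of the normalization described above.

```python
import math
from typing import List, Tuple

Fixation = Tuple[str, Tuple[float, float]]  # (object id, ROI centre in pixels)


def score(a: Fixation, b: Fixation, same: float = 10.0, cross_max: float = 5.0,
          inflection: float = 200.0, decay: float = 50.0) -> float:
    """Substitution score: fixed high score for the same object, otherwise
    a logistic decay in the pixel distance between ROI centres."""
    if a[0] == b[0]:
        return same
    z = min((math.dist(a[1], b[1]) - inflection) / decay, 700.0)
    return cross_max / (1.0 + math.exp(z))


def nw_score(driver: List[Fixation], expert: List[Fixation], gap: float = 3.0) -> float:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    n, m = len(driver), len(expert)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -gap * i
    for j in range(1, m + 1):
        dp[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(driver[i - 1], expert[j - 1]),  # match
                dp[i - 1][j] - gap,  # driver fixation left unmatched
                dp[i][j - 1] - gap,  # expert fixation skipped by the driver
            )
    return dp[n][m]


def similarity(driver: List[Fixation], expert: List[Fixation], gap: float = 3.0) -> float:
    """Normalized similarity in [0, 1]; 1.0 for identical sequences."""
    if not driver or not expert:
        return 0.0
    self_sum = nw_score(driver, driver, gap) + nw_score(expert, expert, gap)
    return max(0.0, 2.0 * nw_score(driver, expert, gap) / self_sum)


front = ("front", (640.0, 360.0))
left = ("left", (200.0, 360.0))
expert_seq = [front, left, front]
sim_same = similarity([front, left, front], expert_seq)
sim_front_only = similarity([front, front, front], expert_seq)
sim_single = similarity([front], expert_seq)
```

In this example, a driver who only ever looks ahead scores lower than one who reproduces the front-left-front scan, and a single fixation scores lower still because two expert fixations must be bridged with gaps.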
[0067] S35. Soft voting mechanism and output of the risk perception coverage gap determination result.
[0068] This embodiment uses soft voting to output the risk perception capability assessment result and the coverage gap determination: when the proportion of similarity scores exceeding a preset threshold (e.g., 0.5) reaches a preset proportion (e.g., 50%), the risk perception coverage is determined to be sufficient, and a prompt suppression signal is output; otherwise, a coverage gap is determined, a prompt trigger signal is output, and the prompt target object is identified.
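The voting rule can be sketched in a few lines. The function name `coverage_sufficient` and its parameters are hypothetical; treating an empty similarity set as insufficient coverage is an assumption of this sketch, and the boundary case (a proportion exactly equal to the preset proportion counts as sufficient) follows the wording of claim 8.

```python
from typing import Sequence


def coverage_sufficient(similarities: Sequence[float],
                        score_threshold: float = 0.5,
                        vote_ratio: float = 0.5) -> bool:
    """Each expert reference sequence votes 'covered' when its similarity
    exceeds score_threshold; coverage is sufficient when the share of such
    votes is at least vote_ratio."""
    if not similarities:
        return False  # assumption: no reference available means no evidence of coverage
    votes = sum(1 for s in similarities if s > score_threshold)
    return votes / len(similarities) >= vote_ratio


suppress_prompt = coverage_sufficient([0.9, 0.8, 0.2, 0.1])
trigger_prompt = not coverage_sufficient([0.9, 0.2, 0.1, 0.3])
```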
[0069] The methods for determining the prompt target object may include: 1) selecting the expert sequence with the highest similarity and extracting the key objects in it that the driver has not covered; 2) having the expert risk perception strategy modeling module output the next optimal fixation target, which serves as the prompt target when it has not been covered by the driver; 3) when multiple targets are concurrent, arbitrating based on the risk priority of each target (e.g., TTC, relative speed, lateral intrusion probability) and its degree of non-coverage to determine one or more prompt targets.
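A simplified sketch of methods 1) and 3) follows. It assumes that the uncovered objects are those present in the best-matching expert sequence but absent from the driver sequence, and that risk priority is ranked by time to collision (TTC) alone; `RiskObject` and `select_prompt_targets` are hypothetical names, and relative speed and lateral intrusion probability are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RiskObject:
    object_id: str
    ttc_s: float  # time to collision in seconds; smaller means more urgent


def select_prompt_targets(expert_ids: Iterable[str], driver_ids: Iterable[str],
                          objects: List[RiskObject], max_targets: int = 1) -> List[str]:
    """Return up to max_targets uncovered object ids, most urgent first."""
    uncovered = set(expert_ids) - set(driver_ids)
    candidates = [o for o in objects if o.object_id in uncovered]
    candidates.sort(key=lambda o: o.ttc_s)
    return [o.object_id for o in candidates[:max_targets]]


objects = [RiskObject("front", 4.0), RiskObject("left", 2.5), RiskObject("rear", 1.8)]
targets = select_prompt_targets(["front", "left", "rear"], ["front"], objects, max_targets=2)
```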
[0070] S4. The received prompt target object is overlaid and rendered on the first-person driving foreground visual image and then output to the display terminal. When the risk perception coverage gap determination result determines that the coverage is sufficient, the prompt is suppressed or canceled.
[0071] When the driver's risk perception ability real-time assessment module outputs a prompt trigger signal, the prompt target is spatially registered and overlaid on the first-person view, and then output to the display terminal, as shown in Figure 7. The specific process includes:
S41. Spatial registration: Obtain the two-dimensional pixel coordinates and bounding box of the prompt target in the first-person visual image. Based strictly on the preset field of view (FOV) mapping relationship between the camera and the HUD display interface, the two-dimensional prompt element is proportionally perspective-mapped onto the transparent display plane of the HUD, thereby forming a prompt box, highlight area, or guide mark that is visually aligned with the physical object ahead.
S42. Prompt style: The prompt can take the form of a bounding frame, semi-transparent highlight, arrow guidance, etc.
S43. Dynamic adjustment: Based on the risk perception capability assessment results, coverage gap, similarity gap, or remaining safety margin (e.g., TTC), dynamically adjust the prominence, duration, flashing frequency, or trigger threshold of the prompt.
S44. Suppression / cancellation mechanism: When the evaluation results meet the preset requirements and / or the coverage is sufficient, a suppression or cancellation signal is output so that the prompt is not shown or is withdrawn, avoiding redundant alarms that could lead to attentional interference and increased cognitive load.
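The proportional mapping in S41 can be sketched as below. It assumes the preset FOV mapping has already been calibrated into a rectangle of the camera image that corresponds to the HUD's display area, and it ignores parallax from the driver's eye position; `camera_box_to_hud` and its parameters are hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def camera_box_to_hud(box: Box, hud_region_in_camera: Box,
                      hud_size: Tuple[float, float]) -> Optional[Box]:
    """Map a bounding box from camera pixels to HUD canvas pixels.

    hud_region_in_camera is the (assumed, pre-calibrated) rectangle of the
    camera image covered by the HUD. Returns None when the box lies
    entirely outside the HUD area.
    """
    rx0, ry0, rx1, ry1 = hud_region_in_camera
    hud_w, hud_h = hud_size
    sx = hud_w / (rx1 - rx0)
    sy = hud_h / (ry1 - ry0)
    x0, y0, x1, y1 = box
    # Shift into the HUD region, scale, then clip to the canvas.
    mx0 = min(max((x0 - rx0) * sx, 0.0), hud_w)
    my0 = min(max((y0 - ry0) * sy, 0.0), hud_h)
    mx1 = min(max((x1 - rx0) * sx, 0.0), hud_w)
    my1 = min(max((y1 - ry0) * sy, 0.0), hud_h)
    if mx0 >= mx1 or my0 >= my1:
        return None
    return (mx0, my0, mx1, my1)


hud_box = camera_box_to_hud((520, 380, 720, 480), (320, 180, 1600, 900), (640, 360))
```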
[0072] This embodiment provides a conditional autonomous driving takeover prompting system, the overall architecture of which is shown in Figure 2 and Figure 3. The system uses the driver's risk perception ability assessment results within the critical takeover window as the trigger basis, and constructs a closed-loop intervention link consisting of multimodal perception data synchronization, expert risk perception strategy modeling, real-time assessment of driver risk perception ability, and prompt rendering and output. This enables conditional takeover guidance, which prompts only when the evaluation is insufficient or there is a coverage gap, and suppresses prompts when the evaluation meets the requirements.
[0073] The system may include the following hardware / software units (which may be implemented by an onboard domain controller, a driving simulation platform, or a combination of both):
1) Environmental perception unit: camera, millimeter-wave radar / LiDAR, and fusion perception and target tracking module, used to output the set of surrounding risk-related objects and their pose, velocity, acceleration, heading angle, etc.;
2) Driver monitoring unit: driver face camera / eye tracker and gaze event detection module, used to output gaze point coordinates, gaze duration, pupil diameter, etc.;
3) Computing and communication unit: onboard computing platform / domain controller, used to perform multimodal data alignment, expert strategy reasoning, real-time evaluation of risk perception capabilities, and prompt decision-making;
4) Prompt presentation unit: display terminal, or windshield projection, instrument panel, central control screen, etc. as alternative display terminals, used to achieve spatial registration and overlay display.
[0074] As shown in Figure 3, the system in this embodiment includes: a multimodal perception data synchronization module, an expert risk perception strategy modeling module, a driver risk perception ability real-time assessment module, and a prompt rendering and output module. The data interaction relationships between the modules are as follows:
(1) Multimodal perception data synchronization module.
[0075] It receives eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions. It performs unified timestamp alignment on each data stream and generates a driver object-level gaze sequence through instance segmentation and gaze mapping, which is then sent to the real-time assessment module for driver risk perception ability.
[0076] (2) Expert risk perception strategy modeling module.
[0077] As shown in Figure 4, an interpretable expert risk perception scanning strategy is obtained through offline training using latent variable behavior cloning based on expert driver data. In the online phase, multiple expert object-level reference gaze sequences are generated based on the context of the takeover scenario and sent to the real-time assessment module for driver risk perception ability.
[0078] (3) Real-time assessment module for driver risk perception ability.
[0079] After receiving the driver's object-level gaze sequence and multiple expert object-level reference gaze sequences, object-level scanning path matching is performed. Based on the soft voting mechanism, the risk perception coverage gap judgment result is output. When a coverage gap is determined, the uncovered key risk objects are used as the prompt target objects.
[0080] (4) Prompt rendering and output module.
[0081] In this embodiment, the prompt rendering and output module uses AR-HUD (Augmented Reality Head-Up Display) prompt rendering.
[0082] After receiving the target object and display parameters, augmented reality overlay rendering is performed on the first-person perspective view and output to the display terminal; when a suppress prompt signal is received, the prompt is canceled or not rendered to reduce redundant alarms.
[0083] In addition, the prompt rendering and output module can be deployed in the rendering engine of the driving simulation scene and directly overlaid when rendering the first-person perspective screen; it can also provide a display interface to output the prompt screen or prompt element data to an external hardware display screen.
[0084] When using the closed-loop system of this invention to provide takeover prompts, the following steps can be performed:
Step 1, Takeover Request Trigger and Window Initialization: After the autonomous driving system triggers a takeover request (TOR), the moment when the driver's gaze returns to the forward road is detected as the perception re-entry point, and the risk perception window is started. In this embodiment, the preset duration is set to 2 seconds. This duration is configured to cover the complete initial perception phase from visual information acquisition to decision formation, and the time window is set to accommodate a single effective visual saccade lasting 200 to 350 milliseconds, ensuring that the extracted driver object-level gaze sequence meets the minimum eye-tracking data density requirements for constructing the risk assessment scanning path.
[0085] Step 2, Multimodal Data Synchronization and Driver Object-Level Gaze Sequence Generation: Frame-level alignment is performed on eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions; instance segmentation is performed on the first-person scene images, gaze points are mapped to the Regions of Interest (ROIs) of the objects, gaze events are extracted, and a driver object-level gaze sequence is formed.
[0086] Step 3, Expert Object-Level Reference Gaze Sequence Generation: The expert risk perception strategy modeling module reads the current scene context, generates multiple expert object-level reference gaze sequences, and also provides the next optimal gaze target candidate.
[0087] Step 4, Real-time Assessment of Risk Perception Capability and Determination of Coverage Gap: The substitution matrix scoring function and gap penalty parameters are constructed, dynamic programming alignment is performed between the driver object-level gaze sequence and each expert object-level reference gaze sequence, and the similarity set is calculated. The evaluation results and coverage gap determination are output through soft voting.
[0088] Step 5, Target Selection and AR-HUD Rendering Output: If the evaluation results are insufficient and / or there are coverage gaps, the uncovered key risk objects are identified as the prompt targets, and spatial registration and overlay display are performed through the prompt rendering and output module; if the evaluation results meet the requirements and / or the coverage is sufficient, the prompt is suppressed or canceled.
[0089] Step 6, Closed-loop update: The system continuously receives updated eye-tracking and environmental perception data and updates the matching evaluation results on a rolling basis; when the evaluation changes from insufficient to meeting the requirements, the prompt is withdrawn, thus forming a closed-loop intervention.
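The rolling update in Step 6 can be pictured as a small state machine over successive evaluation results. This is only a sketch under the assumption that each rolling evaluation yields a single boolean (coverage sufficient or not); the function name `prompt_actions` and the action labels are hypothetical.

```python
from typing import Iterable, List


def prompt_actions(coverage_sufficient: Iterable[bool]) -> List[str]:
    """Turn rolling coverage results into prompt actions.

    'show' starts a prompt, 'keep' leaves it on, 'withdraw' removes it once
    coverage becomes sufficient, and 'none' means nothing is displayed.
    """
    actions: List[str] = []
    active = False
    for sufficient in coverage_sufficient:
        if sufficient:
            actions.append("withdraw" if active else "none")
            active = False
        else:
            actions.append("keep" if active else "show")
            active = True
    return actions


timeline = prompt_actions([False, False, True, True, False])
```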
[0090] As shown in Figure 6, in a typical risk experiment scenario, there are multiple risk-related objects on the road (e.g., vehicles driving side-by-side, vehicles potentially merging from the side, vehicles approaching quickly from behind, etc.). If the driver only focuses on the vehicles in front and fails to cover the key risk objects to the side and rear during the initial takeover, the driver's risk perception ability real-time assessment module outputs a low similarity score and determines that there is a coverage gap. The prompt rendering and output module then highlights the uncovered objects (see Figure 7), and the system guides the driver to make additional observations. Once the driver has fixated on the object, the similarity score increases and the system outputs a sufficient-coverage result, so the prompt is automatically suppressed or canceled, thereby reducing redundant alarms and improving the quality of takeover decisions and safety margins.
[0091] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A conditional automated driving takeover prompting method, characterized in that it comprises: performing frame-level time alignment on the multi-source data streams collected during the takeover process, the multi-source data streams including eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions; performing instance segmentation on the first-person driving foreground visual images to obtain instance regions of risk-related objects, and mapping the gaze points to the instance regions to generate a driver object-level gaze sequence; modeling the expert visual scanning process as a partially observable Markov decision process, and adopting an interpretable expert risk perception scanning strategy to generate multiple expert object-level reference gaze sequences based on the context of the takeover scenario; after receiving the driver's object-level gaze sequence and the multiple expert object-level reference gaze sequences, performing object-level scanning path matching and outputting the risk perception coverage gap judgment result based on the soft voting mechanism, wherein, when a coverage gap is determined, the uncovered key risk objects are used as the prompt target objects; and overlaying and rendering the received prompt target object on the first-person driving foreground visual image and then outputting it to the display terminal, wherein, when sufficient coverage is determined, the prompt is suppressed or canceled.
2. The conditional automated driving takeover prompting method according to claim 1, characterized in that, The driver object-level gaze sequence includes multiple gaze events. Each gaze event includes at least the gaze start and end time, gaze point coordinates, and the corresponding risk-related object identifier. Gaze points that hit the same object consecutively are merged into a gaze event.
3. The conditional automated driving takeover prompting method according to claim 1, characterized in that, the process of expert visual scanning is modeled as a partially observable Markov decision process, in which: the state variables in its state space include the features of the currently viewed object, the features of neighboring objects, the features of vehicles approaching from behind, and the state of the ego vehicle; the action variables in its action space include maintaining the current gaze target, switching to a neighboring object on the left, switching to a neighboring object on the right, and directing the gaze towards the rear-view area; the observed variables in its observation space include first-person visual input, the set of traffic objects and their kinematic characteristics, and the vehicle's state; and its output is the gaze strategy.
4. The conditional automated driving takeover prompting method according to claim 1, characterized in that, The interpretable expert risk perception scanning strategy is constructed using a latent variable behavior cloning algorithm based on expectation maximization (EM). In the E-step, the latent variables corresponding to the action sequence are sampled based on particle filtering and weighted and resampled according to the observed likelihood. In the M-step, an enumeration search is performed in the preset strategy sketch space to generate interpretable rules containing logical expressions and real parameters.
5. The conditional automated driving takeover prompting method according to claim 4, characterized in that, during the training of the interpretable expert risk perception scanning strategy, the EM training objective is to minimize a cross-entropy loss plus a complexity penalty, and the calculation expression involves the following quantities: the approximate posterior distribution of the latent variables; the complexity penalty term; the strategy parameters; the weighting coefficients; the gaze strategy output given the observed variables and the latent variables; the action variables; and the time step.
6. The conditional automated driving takeover prompting method according to claim 1, characterized in that, after a takeover request is triggered, the moment when the driver's gaze returns to the forward road is taken as the perception re-entry point, and, starting from the perception re-entry point, a risk perception window of a preset duration is started to assess the risk perception capability.
7. The conditional automated driving takeover prompting method according to claim 1, characterized in that, after receiving the driver's object-level gaze sequence and the expert's object-level reference gaze sequence, a similarity score is calculated based on the substitution matrix scoring function and dynamic programming sequence alignment, and object-level scanning path matching is performed, specifically including: taking the object instance region acquired in each frame of the sequences as the instance-level dynamic region of interest; constructing a substitution matrix scoring function, wherein the scoring elements in the substitution matrix are calculated, using the Sigmoid decay function, based on the pixel spatial distance between the instance-level dynamic regions of interest of different objects corresponding to the same frame in the two sequences, and the substitution matrix scoring function satisfies the following: when the two regions of interest correspond to the same object, a first fixed high score is assigned; when they correspond to different objects, the score is calculated based on the pixel distance between the regions of interest of the different objects using the Sigmoid decay function, and a distance inflection point threshold is set; after finding the optimal alignment path in the substitution matrix using a dynamic programming algorithm, summing the score elements of all frames on the optimal alignment path and subtracting the gap penalty to calculate the first score; and normalizing the first score based on the sum of the self-scores of the driver's object-level gaze sequence and the expert's object-level reference gaze sequence under ideal perfect matching, to obtain the similarity score.
8. The conditional automated driving takeover prompting method according to claim 1, characterized in that, the driver risk perception coverage gap judgment result is output based on a soft voting mechanism, specifically including: calculating the similarity scores between the driver's object-level gaze sequence and the multiple expert object-level reference gaze sequences; and calculating the proportion of the multiple expert object-level reference gaze sequences whose similarity scores exceed a preset threshold, wherein, when the proportion is lower than the preset proportion, it is determined that there is a coverage gap and a prompt is triggered; otherwise, it is determined that there is no coverage gap and the prompt is suppressed.
9. The conditional automated driving takeover prompting method according to claim 1, characterized in that, The step of overlaying and rendering the received prompt target object onto the first-person driving foreground visual image and then outputting it to the display terminal specifically includes: After receiving the prompt target object, spatial registration and overlay rendering are performed on the prompt target object. The spatial registration is based on the two-dimensional image coordinates of the prompt target object in the first-person driving foreground visual image and the preset field of view mapping relationship. The prompt element is rendered on the display terminal so that the prompt element and the real risk object in the driver's forward physical field of view can achieve visual perspective alignment and coverage. The duration, prominence, and / or trigger threshold of the prompts are dynamically adjusted based on the risk perception coverage gap determination results.
10. A system employing the conditional automated driving takeover prompting method according to claim 1, characterized in that, The system includes: The multimodal perception data synchronization module is used to perform frame-level time alignment on the multi-source data streams collected during the takeover process. The multi-source data streams include eye-tracking signals, traffic environment status, first-person driving foreground visual images, and driving control actions. The module performs instance segmentation on the first-person driving foreground visual images to obtain instance regions of risk-related objects, and maps the gaze point to the instance regions to generate a driver object-level gaze sequence. The expert risk perception strategy modeling module is used to model the expert visual scanning process as a partially observable Markov decision process. It adopts an interpretable expert risk perception scanning strategy to generate multiple expert object-level reference gaze sequences based on the context of the takeover scenario. The real-time assessment module for driver risk perception ability is used to receive the driver's object-level gaze sequence and multiple expert object-level reference gaze sequences, perform object-level scanning path matching, and output the risk perception coverage gap judgment result based on the soft voting mechanism. When a coverage gap is determined, the uncovered key risk objects are used as the prompt target objects. The prompt rendering and output module is used to receive the prompt target object, overlay it onto the first-person driving foreground visual image, and output it to the display terminal. When it is determined that the coverage is sufficient, the prompt is suppressed or canceled.