A driving risk prediction system and method based on a multi-modal large language model

By combining a multimodal large language model with multi-source information acquisition and deep learning algorithms, the problem of insufficient utilization of visual information in driving risk prediction models is solved, and real-time identification of abnormal driver behavior and interpretable risk warnings are achieved.

CN122264221A · Pending · Publication Date: 2026-06-23 · TONGJI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TONGJI UNIV
Filing Date
2026-04-01
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing driving risk prediction models struggle to effectively utilize visual information about the driver and vehicle status, and their outputs lack interpretability that aligns with the driver's cognition.

Method used

The method employs a multimodal large language model. Through multi-source driving information collection, dataset construction, model fine-tuning, and feature extraction, combined with a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, it predicts and interprets driving risks.

Benefits of technology

It significantly enhances the system's ability to understand complex traffic scenarios, enabling it to identify abnormal driver behavior in real time and provide interpretable risk warnings.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a driving risk prediction system and method based on a multi-modal large language model, and relates to the technical fields of intelligent transportation and artificial intelligence. The system comprises: a multi-source driving information acquisition module for acquiring high-quality following event samples and driving element information; a multi-source driving data set construction module for constructing a high-quality traffic accident and driving behavior data set suitable for fine-tuning training of a multi-modal large model; a large model fine-tuning training module for optimizing parameters of a multi-modal large language model; a driving feature semantic extraction module for extracting visual semantic features in multi-perspective driving videos; a time series risk prediction module for deeply fusing physical data of a vehicle-mounted sensor and visual semantic features to construct a heterogeneous vector, and realizing prediction of a future risk trend based on time series reasoning; and a risk grading and explanation module for establishing a risk quantification grading threshold, and realizing quantitative attribution results of the prediction results.

Description

Technical Field

[0001] This invention relates to the fields of intelligent transportation and artificial intelligence technology, and in particular to a driving risk prediction system and method based on a multimodal large language model.

Background Technology

[0002] Against the backdrop of the rapid development of intelligent transportation systems and autonomous driving technology, the incidence of traffic accidents is also increasing year by year. In this context, driving risk prediction has become a core issue in the field of active safety. Its core essence lies in establishing and improving advanced driver assistance systems and integrating multi-source information (such as vehicle information, environmental data, driver status, etc.) to predict driving risks using scientifically advanced algorithms. However, traditional risk prediction models heavily rely on structured trajectory data acquired by onboard sensors, such as longitudinal speed, instantaneous acceleration, following distance, and traffic flow density. Although these physical indicators can be accurately measured by machines, they only represent geometric relationships in physical space and are insufficient to encompass higher-order semantic information during the driving process. For example, the driver's physiological and psychological state, details of hand operations, and road surface semantic features often provide warnings before an accident occurs, but traditional sensors struggle to capture these "soft" features, thus failing to provide timely responses and warnings about driving risks. Furthermore, most machine learning algorithms exhibit significant "black box" characteristics when processing this data; while they can provide prediction results, they cannot inform the driver or the system of the underlying risk-causing logic, which greatly limits the widespread deployment of prediction models in intelligent driving systems.

[0003] Introducing artificial intelligence, and in particular Multimodal Large Language Models (MLLMs), into real-time driving risk prediction therefore offers a new technical path for this field, owing to their strong visual-structure cognition, semantic reasoning, generalization, and adaptability. By transforming driving videos or image sequences into structured semantic descriptions, such models can simulate the perceptual logic of human drivers and identify unstructured risk factors. However, most current research data focuses on external factors, such as traffic flow information and environmental conditions, and rarely incorporates driver and vehicle status. Moreover, much current research uses large language models only to provide drivers with easily understandable voice prompts or warnings.

Summary of the Invention

[0004] To overcome the shortcomings of existing technologies, the purpose of this invention is to provide a driving risk prediction system and method based on a multimodal large language model. This invention solves the problems of existing driving risk prediction models having limited data utilization dimensions, often ignoring visual information such as road environment and driver state, and lacking interpretability of output results that are consistent with driver cognition.

[0005] To achieve the above objectives, the present invention provides the following solution. A driving risk prediction system based on a multimodal large language model includes: a multi-source driving information acquisition module, used to collect multi-source driving information, wherein the multi-source driving information includes natural driving video and vehicle trajectory data; a multi-source driving dataset construction module, used to construct a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels from the multi-source driving information; a large model fine-tuning training module, used to optimize the parameters of the multimodal large language model using the multimodal training dataset and to enhance its feature extraction capability through low-rank adaptation, so as to obtain the fine-tuned multimodal large language model; a driving feature semantic extraction module, used to input the natural driving video into the fine-tuned multimodal large language model to extract visual semantic features; a temporal risk prediction module, used to deeply fuse the vehicle trajectory data and the visual semantic features to construct a heterogeneous vector and to predict future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, thereby obtaining temporal inference prediction results; and a risk grading and interpretation module, used to quantify and grade risk based on relative reciprocal collision time thresholds and to introduce contribution algorithm analysis to achieve quantitative attribution and technical interpretation of the temporal inference prediction results.

[0006] A driving risk prediction method based on a multimodal large language model includes: collecting multi-source driving information, including natural driving video and vehicle trajectory data; constructing a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels from the multi-source driving information; optimizing the parameters of the multimodal large language model using the multimodal training dataset and enhancing its feature extraction capability through the low-rank adaptation method, so as to obtain a fine-tuned multimodal large language model; inputting the natural driving video into the fine-tuned multimodal large language model to extract visual semantic features; deeply fusing the vehicle trajectory data and the visual semantic features to construct a heterogeneous vector, and predicting future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, thereby obtaining the temporal inference prediction result; and setting risk quantification and classification thresholds based on the relative reciprocal collision time and introducing contribution algorithm analysis to achieve quantitative attribution and technical explanation of the temporal inference prediction results.

[0007] The present invention discloses the following technical effects: This invention provides a driving risk prediction system and method based on a multimodal large language model. This invention introduces a multimodal large language model into the field of real-time driving risk prediction. This technology transforms unstructured multi-channel video streams into high-order feature labels containing driver physiological state, road semantic environment, and the intentions of vehicles ahead, achieving a dimensional upscaling from "geometric spatial data" to "human-perceived semantics" and significantly enhancing the system's ability to understand complex traffic scenarios. This invention also proposes a hierarchical triggering recognition logic. The system performs video-level dynamic scanning in 5-second windows to capture continuous actions (such as the driver looking down or using a mobile phone). Once abnormal behavior is detected, it automatically backtracks and initiates high-frequency single-frame image traversal to accurately pinpoint the start and end times of the risky action.

Description of the Drawings

[0008] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0009] Figure 1 is a schematic diagram of a driving risk prediction system based on a multimodal large language model provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the multimodal driving training set provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the fine-tuning training process of the detection model provided in an embodiment of the present invention; Figure 4 is a schematic diagram of the large-model driving feature semantic reading method provided in an embodiment of the present invention.

Reference numerals: 1 - multi-source driving information acquisition module; 2 - multi-source driving dataset construction module; 3 - large model fine-tuning training module; 4 - driving feature semantic extraction module; 5 - temporal risk prediction module; 6 - risk classification and interpretation module.

Detailed Implementation

[0011] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0012] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0013] As shown in Figure 1, this invention provides a driving risk prediction system based on a multimodal large language model, comprising: a multi-source driving information acquisition module 1, used to acquire multi-source driving information, wherein the multi-source driving information includes natural driving video and vehicle trajectory data; a multi-source driving dataset construction module 2, used to construct a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels from the multi-source driving information; a large model fine-tuning training module 3, used to optimize the parameters of the multimodal large language model using the multimodal training dataset and to enhance its feature extraction capability through the low-rank adaptation method, so as to obtain the fine-tuned multimodal large language model; a driving feature semantic extraction module 4, used to input the natural driving video into the fine-tuned multimodal large language model to extract visual semantic features; a temporal risk prediction module 5, used to deeply fuse the vehicle trajectory data and the visual semantic features to construct a heterogeneous vector and to predict future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, so as to obtain the temporal inference prediction result; and a risk classification and interpretation module 6, used to set risk quantification and classification thresholds based on the relative reciprocal collision time and to introduce contribution algorithm analysis to achieve quantitative attribution and technical interpretation of the temporal inference prediction results.

[0014] Specifically, the multi-source driving information acquisition module collects standard car-following event samples and key driving element information that satisfy spatiotemporal continuity and consistency from the natural driving video and vehicle trajectory data captured by the multi-sensor system. The multi-source driving dataset construction module builds, through unified cleaning, structured organization, and element labeling of the collected multi-source driving information, a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels for subsequent large model fine-tuning training. The large model fine-tuning training module uses low-rank adaptation to make the model more sensitive and accurate in recognizing driving scene elements, significantly enhancing its ability to recognize complex driving behaviors and potential risks, and generates interpretable text output to support the subsequent feature extraction module. The driving feature semantic extraction module extracts, through a hybrid logic of "video recognition prediction + image recognition localization," visual semantic features reflecting the road environment, preceding vehicle behavior, and driver status from multi-view driving videos, providing multi-source data input for the prediction model. The temporal risk prediction module deeply fuses onboard sensor physical data and visual semantic features to construct heterogeneous vectors, and performs temporal inference prediction of future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm; once trained, the model outputs the highest risk level within the future time window from the input multimodal risk feature sequence. The risk classification and interpretation module establishes risk quantification classification thresholds based on the relative reciprocal collision time and introduces contribution algorithm analysis to provide quantitative attribution and technical interpretation of the prediction results, enhancing the interpretability of the model's predictions.

[0015] Furthermore, the multi-source driving information acquisition module 1 includes: The video acquisition submodule is used to acquire real road environment data containing multi-view driving footage to obtain the natural driving video. The trajectory acquisition submodule is used to acquire vehicle dynamics characteristics and driving position sequences to obtain the vehicle trajectory data.

[0016] Specifically, Naturalistic Driving Study (NDS) is a method for continuously and non-invasively collecting driving behavior data in real-world road environments to reveal driver behavior patterns and accident causes. Data recording in these experiments is conducted without interfering with the driver's normal driving behavior, aiming to obtain data that closely reflects natural driving behavior. The method involves installing radar, cameras, sensors, and other equipment on the vehicle to collect data on vehicle dynamics, driving trajectory, driver behavior, and driving audio and video.

[0017] Using the natural driving dataset, we performed detailed semantic labeling on three core driving scene image types: front view image, driver's face view image, and driver's hand view image. The specific labeling included multi-dimensional content such as the driving state of the vehicle in front, the driver's distraction state, and driving environment features, which effectively reduced the probability of misclassification in the subsequent model feature extraction process.

[0018] Furthermore, the multi-source driving dataset construction module 2 includes: The semantic labeling submodule is used to perform multi-dimensional manual annotation on the natural driving video to obtain the behavioral semantic tags; The feature combination submodule is used to extract driving environment and vehicle status from the natural driving video as visual information, extract physical features from the vehicle trajectory data as sensor information, and combine the visual information, the sensor information and the behavioral semantic labels to construct the multimodal training dataset.

[0019] Specifically, during the dataset annotation phase, an expert annotation model was adopted, with five experienced domain experts performing detailed semantic annotations. Because specific driving features needed to be extracted from the video of the vehicle in front, and because the dataset collected under normal driving conditions suffered from imbalance, 2000 images were carefully selected and annotated for each feature in this view. The other two views (facial video and hand video) each contained 8000 images for annotation.

[0020] Furthermore, the large model fine-tuning training module 3 includes: a parameter freezing submodule, used to load the pre-trained base model and set the original weight matrix to a frozen state; a bypass construction submodule, used to introduce two consecutive small matrices into the Transformer layer to form a bypass, and to initialize the two small matrices with random Gaussian initialization and zero matrix initialization respectively, constructing the underlying architecture of the low-rank adaptation method; a parameter optimization submodule, used to feed the input vector of the multimodal training dataset into the original branch and the low-rank bypass branch simultaneously for parameter optimization during training iterations; and a feature fusion submodule, used to sum and fuse the outputs of the original branch and the low-rank bypass branch to obtain the fine-tuned multimodal large language model.

[0021] Specifically, the low-rank adaptation method enables the model to learn to recognize driving scene elements sensitively and accurately from a small amount of labeled driving scene data, thereby enhancing its driving feature extraction capability. First, the system defines the fine-tuning task as a conditional probability prediction problem given a driving context. Let the training set be:

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$$

where $x_i$ is the combination of visual input and prompt, and $y_i$ is the structured label of driving elements. The goal of fine-tuning is to minimize the negative log-likelihood loss, and its objective function is defined as:

$$\mathcal{L}(\Delta\theta) = -\sum_{i=1}^{N} \log P(y_i \mid x_i;\, \theta_0, \Delta\theta)$$

where $\theta_0$ denotes the frozen pre-training parameters and $\Delta\theta$ denotes the set of trainable parameters that the current module optimizes. This loss function ensures that the semantic features output by the model are highly consistent with the driving elements calibrated by experts. The implementation steps of low-rank adaptation are as follows:
1) Pre-trained parameter freezing: load the pre-trained multimodal large language model as the base model and set its original weight matrix $W_0$ to a frozen state, so that it receives no gradient updates during fine-tuning.
2) Constructing a low-rank bypass matrix: introduce two consecutive small matrices in the Transformer layer of the model to form a bypass, significantly reducing the number of parameters that must be trained.
3) Matrix initialization: to ensure stability in the early stages of training, the two small matrices are initialized with random Gaussian initialization and zero matrix initialization respectively. This ensures that at the moment fine-tuning begins, the bypass output is zero and the model output is equivalent to the original pre-training output.
4) Forward computation and fusion: during training iterations, the input vector enters both the original branch and the low-rank bypass branch. The final output of the model is obtained by summing these two parts:

$$h = W_0 x + B A x$$

where $h$ is the output vector of the large model after fine-tuning, $x$ is the input vector of the model, and $B$ and $A$ are the low-rank matrices learned during the fine-tuning process.
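The four low-rank adaptation steps above can be illustrated with a minimal sketch in plain Python. It is not the applicant's implementation: the matrix sizes, the helper names `matvec` and `lora_forward`, and the `scale` factor (LoRA implementations commonly use α / rank) are assumptions chosen only for illustration.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(w0, a, b, x, scale=1.0):
    """Sum the frozen branch W0 x and the low-rank bypass B (A x).

    `scale` is an assumed extra factor; it is not specified in the text.
    """
    base = matvec(w0, x)               # original (frozen) branch
    bypass = matvec(b, matvec(a, x))   # low-rank bypass branch
    return [p + scale * q for p, q in zip(base, bypass)]

random.seed(0)
d_in, d_out, rank = 6, 4, 2            # toy sizes, assumed for the example

# Step 1: frozen pre-trained weights (never updated in this sketch).
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Steps 2 and 3: bypass matrices, A with Gaussian init and B with zero init.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

x = [random.gauss(0, 1) for _ in range(d_in)]
# Step 4: forward computation and fusion.
h = lora_forward(w0, a, b, x)

trainable = rank * (d_in + d_out)      # parameters in A and B
frozen = d_in * d_out                  # parameters in W0
```

Because `b` starts as a zero matrix, `h` equals the frozen model's output before any training step, which is the stability property described in step 3.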

[0022] Specifically, during training, the labeled images and their text labels are used as input, and prompts guide the model to focus on specific driving questions, thereby improving its recognition accuracy and focus. Fine-tuning training uses the LLaMA-Factory open-source framework on a hardware platform with dual NVIDIA RTX 4090 GPUs (24 GB of VRAM each). The training hyperparameters are set as follows: batch size 2; 10 iterations; mean squared error (MSE) loss; Adam optimizer; initial learning rate 1×10⁻⁵ with cosine annealing scheduling; LoRA rank 8, LoRA_α 16, and dropout rate 0.05.

[0023] Furthermore, the driving feature semantic extraction module 4 includes: The static image extraction submodule is used to segment the natural driving video into single-frame images at a specified frequency and input them into the fine-tuned multimodal large language model to extract the visual semantic features; The dynamic video extraction submodule is used to package multiple consecutive images into a video sequence and input it into the fine-tuned multimodal large language model for overall action semantic determination, so as to extract the visual semantic features related to the driver's state.

[0024] Furthermore, 1) Forward-looking scene video extraction: The continuous video captured by the vehicle's front camera is divided into image frames at 10 frames / second, matching the sampling frequency of the onboard data. Each frame is input into the LLM in chronological order, and the model answers the following questions in sequence based on prompts: road type (urban road, highway, or residential road), presence of pedestrians or other interfering factors in front, and whether the brake lights of the vehicle in front are on. For example, for a forward-looking image prompt: "The image shows a following scene. Please answer: 1. What is the road surface type? 2. Are there pedestrians interfering between the vehicle in front and behind? 3. Are the brake lights of the vehicle in front on?" The model generates text answers based on the input image content and parses the answers into corresponding feature values. The forward-looking features obtained in this process are: road type (3 subvariables), brake light status of the vehicle in front (on / off), and pedestrian interference (present / absent).

[0025] 2) Driver Video Extraction: Due to the sequential nature of the driver's facial and hand movements, a "video + image" recognition method was adopted. Two sets of features were extracted: driver facial centering (yes / no) and risky hand movements (yes / no).

[0026] 3) Prompt Design and Input Format: For the forward-looking, face, and hand perspectives, the prompt design is consistent with the instruction style used during model fine-tuning, so as to guide the model to focus on the required features. The data input to the LLM includes the instruction text and the image / video to be analyzed, and the output is the feature answer that meets the requirements. In this way, the multimodal LLM converts visual scene information into structured semantic features. Ultimately, this module obtains five risk features: road type, brake light status of the vehicle in front, pedestrian interference, whether the driver's face is off-center, and driver hand risk. These visual features, together with four structured features collected by onboard sensors (vehicle speed, acceleration, following distance, and speed difference), provide multi-source input for the risk prediction model.
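As a rough illustration of how the five visual features and four sensor features might be combined into one heterogeneous vector per time step, consider the sketch below. The dictionary keys, the one-hot encoding of the three road types, the feature ordering, and the units are all assumptions made for the example; the text only states which features are used.

```python
# Assumed category names for the three road-type subvariables.
ROAD_TYPES = ("urban", "highway", "residential")

VISUAL_FLAGS = ("brake_light_on", "pedestrian_interference",
                "face_off_center", "hand_risk")
SENSOR_KEYS = ("speed", "acceleration", "following_distance",
               "speed_difference")

def build_heterogeneous_vector(visual, sensor):
    """Concatenate visual semantic features and sensor features.

    visual: dict with 'road_type' (one of ROAD_TYPES) and four booleans.
    sensor: dict with four floats (units are an assumption, e.g. SI).
    Returns a flat list of 3 + 4 + 4 = 11 floats.
    """
    if visual["road_type"] not in ROAD_TYPES:
        raise ValueError("unknown road type: %r" % visual["road_type"])
    road = [1.0 if visual["road_type"] == r else 0.0 for r in ROAD_TYPES]
    flags = [float(bool(visual[k])) for k in VISUAL_FLAGS]
    physical = [float(sensor[k]) for k in SENSOR_KEYS]
    return road + flags + physical

vec = build_heterogeneous_vector(
    {"road_type": "highway", "brake_light_on": True,
     "pedestrian_interference": False, "face_off_center": False,
     "hand_risk": True},
    {"speed": 22.5, "acceleration": -1.2,
     "following_distance": 18.0, "speed_difference": 3.4},
)
```

One such vector per 0.1-second sample would then form the input sequence for the temporal risk prediction module.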

[0027] Specifically, the driving feature semantic extraction module 4 incorporates a hybrid recognition logic of "video recognition prediction + image recognition localization," and the implementation steps are as follows: 1) Image recognition mode (for static environment): For relatively stable features in the forward view, such as road type (urban road, elevated road, residential area) and pedestrian interference, the system performs parallel traversal recognition of single-frame images at a frequency of 10 Hz. 2) Video recognition mode (for dynamic behavior): Driver facial and hand movements (such as looking down, using a mobile phone, or turning around to talk) are strongly correlated over time, so single-frame recognition can easily cause the results to jump. The system therefore packages every 50 frames (approximately 5 seconds) into a video sequence and inputs it into the MLLM, which performs an overall semantic determination of the action. 3) Precise boundary localization logic: If the video mode determines that a "distracting action" occurs within the current 5 seconds, the system automatically backtracks and starts frame-by-frame image scanning within that time window to accurately locate the start (onset) and end (offset) times of the risky action, with the error controlled within 0.1 seconds.
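The hierarchical trigger described above (coarse 50-frame video scan, then frame-by-frame backtracking) can be sketched as follows. The two callables `video_flagger` and `frame_flagger` are hypothetical stand-ins for the MLLM video-level and image-level queries; in the demo they are replaced by trivial functions over synthetic boolean frames.

```python
def locate_risky_actions(frames, video_flagger, frame_flagger,
                         window=50, hz=10):
    """Return (onset_s, offset_s) pairs for windows flagged as risky.

    frames:        sequence of per-frame inputs sampled at `hz`.
    video_flagger: stand-in for the video-level MLLM judgement on a clip.
    frame_flagger: stand-in for the image-level MLLM judgement on a frame.
    """
    events = []
    for start in range(0, len(frames), window):
        clip = frames[start:start + window]
        if not video_flagger(clip):        # coarse 5-second scan
            continue
        # Backtrack: scan the flagged window frame by frame.
        flagged = [start + i for i, f in enumerate(clip) if frame_flagger(f)]
        if flagged:
            onset = flagged[0] / hz
            offset = (flagged[-1] + 1) / hz  # end of the last flagged frame
            events.append((onset, offset))
    return events

# Synthetic demo: 12 s of frames with a distraction from 6.2 s to 8.1 s.
frames = [False] * 120
for i in range(62, 81):
    frames[i] = True

events = locate_risky_actions(frames, video_flagger=any, frame_flagger=bool)
```

This simplified version reports one interval per flagged window, so an action that crosses a window boundary would appear as two adjacent intervals; merging them is left out for brevity.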

[0028] Furthermore, the time-series risk prediction module 5 includes: The encoding submodule, consisting of a first-layer long short-term memory recurrent neural network, is used to receive the heterogeneous vector constructed by deep fusion of the vehicle trajectory data and the visual semantic features within the observation window, and to compress the heterogeneous vector into a fixed-length semantic hidden vector. The decoding submodule, consisting of a second-layer long short-term memory recurrent neural network and a fully connected output layer, is used to receive the semantic hidden vector as the initial stimulus and recursively generate a risk value sequence containing the prediction of the future risk trend. The prediction submodule is used to perform inference calculations on the risk numerical sequence using the sliding time window algorithm to obtain the time-series inference prediction result.

[0029] Furthermore, a two-layer LSTM Seq2Seq prediction model was designed for risk time series prediction. A sliding window method was employed for training and prediction: the 10 Hz data is traversed with a sliding step of 0.1 seconds, using a historical observation window as input to predict the risk level over a future horizon (generally 1 second or longer). This sliding window design ensures high-frequency updates of risk information, meeting real-time warning requirements while leaving drivers sufficient reaction time. During training, a dataset of 12,077 car-following events was used, divided into training and validation sets in an 8:2 ratio; the training batch size was set to 16, the maximum number of iterations to 100, and the learning rate to 1×10⁻³ (Adam optimizer, MSE loss). After the model is trained, it outputs the highest risk level within a future time window based on the input multimodal risk feature sequence, for real-time risk warning.
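A minimal sketch of how sliding-window training samples could be cut from one 10 Hz car-following event is given below. The observation length of 30 samples (3 s) is an assumption, since the text does not state it; the 10-sample (1 s) prediction horizon and the step of one sample (0.1 s) follow the description, as does using the highest future risk level as the target.

```python
def sliding_windows(features, risk_levels, obs_len=30, pred_len=10, step=1):
    """Build (history, target) pairs from one event.

    features:    list of per-sample feature vectors (10 Hz).
    risk_levels: list of integer risk levels aligned with `features`.
    obs_len:     observation window length in samples (assumed value).
    pred_len:    prediction horizon in samples (10 samples = 1 s).
    step:        sliding step in samples (1 sample = 0.1 s).
    The target is the highest risk level inside the future window.
    """
    samples = []
    last_start = len(features) - obs_len - pred_len
    for t in range(0, last_start + 1, step):
        history = features[t:t + obs_len]
        future = risk_levels[t + obs_len:t + obs_len + pred_len]
        samples.append((history, max(future)))
    return samples

# Synthetic 6-second event with a level-2 risk between 4.5 s and 5.0 s.
features = [[float(i)] for i in range(60)]
risk_levels = [0] * 45 + [2] * 5 + [0] * 10

samples = sliding_windows(features, risk_levels)
targets = [target for _, target in samples]
```

Consecutive windows overlap in all but one sample, which is what allows the warning to be refreshed every 0.1 seconds at inference time.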

[0030] Specifically, a sequence-to-sequence (Seq2Seq) architecture and a long short-term memory recurrent neural network (LSTM) are introduced to achieve "multi-step forward prediction." This architecture includes: 1) Encoder: composed of a first-layer LSTM, responsible for receiving historical data sequences within the observation window. Its core function is to compress a multi-dimensional input lasting several seconds into a fixed-length "semantic hidden vector"; 2) Decoder: composed of a second-layer LSTM and a fully connected output layer. It receives the final state of the encoder as the initial stimulus and recursively generates a sequence of risk values within the future prediction window; 3) Sliding time window: to meet the needs of real-time early warning, the model adopts a highly overlapping sliding window mechanism in the inference phase.

[0031] Furthermore, the risk classification and interpretation module 6 includes: The risk quantification and classification submodule is used to set the risk quantification and classification threshold based on the relative reciprocal collision time, and to classify the risk level of the time series inference prediction results according to the risk quantification and classification threshold. The quantitative attribution module is used to analyze and calculate the marginal contribution change of each input feature across all possible feature subsets using the contribution algorithm, and outputs the interaction effect between paired features as the quantitative attribution and technical explanation.

[0032] Furthermore, the risk quantification and classification submodule sets the risk quantification and classification thresholds as four levels of early warning thresholds: no risk, low risk, medium risk, and high risk.
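The four-level grading can be sketched as a simple threshold function. The cut points 0.20 and 0.33 are taken from the embodiment in paragraph [0033]; treating rTTC ≤ 0 as "no risk" and 0 < rTTC < 0.20 as "low risk" is an assumption inferred from the neighbouring bounds. The helper `inverse_ttc` uses the common definition of inverse time-to-collision (closing speed divided by gap), which is also an assumption because the text does not define rTTC explicitly.

```python
def inverse_ttc(gap_m, closing_speed_mps):
    """Assumed rTTC definition: closing speed / gap, in 1/s.

    A negative or zero value means the gap is not shrinking.
    """
    if gap_m <= 0:
        raise ValueError("gap must be positive")
    return closing_speed_mps / gap_m

def risk_level(rttc):
    """Map an rTTC value to one of four warning levels."""
    if rttc <= 0:
        return "no risk"      # assumed bound
    if rttc < 0.20:
        return "low risk"     # assumed upper bound, from the medium range
    if rttc < 0.33:
        return "medium risk"
    return "high risk"

# Example: 20 m gap closing at 5 m/s gives rTTC = 0.25 1/s.
example = risk_level(inverse_ttc(20.0, 5.0))
```

In this example a 20 m gap closing at 5 m/s falls in the medium-risk band.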

[0033] Specifically, the relative reciprocal collision time (rTTC) is used as the benchmark. Based on existing research and real-vehicle testing, four levels of risk warning thresholds were determined: No risk: rTTC ≤ 0; Low risk: 0 < rTTC < 0.20; Medium risk: 0.20 ≤ rTTC < 0.33; High risk: rTTC ≥ 0.33. To address the "black box" concerns surrounding artificial intelligence, a game theory-based SHAP explanatory model is introduced, and the interaction effects between paired features are analyzed. It provides a quantitative explanation for each risk prediction result by calculating the marginal contribution of each input feature across all possible feature subsets. The calculation formula is as follows:

$$\Phi_{i,j} = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(M-|S|-2)!}{2\,(M-1)!}\, \nabla_{ij}(S), \qquad \nabla_{ij}(S) = f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)$$

The formula for calculating the global total importance of joint features is as follows: [formula missing in source]. Here, $\Phi_{i,j}$ is the SHAP interaction value of a single sample, $N$ is the set of all $M$ input features, and $\nabla_{ij}(S)$ is the change in marginal contribution when features $i$ and $j$ are both included in the subset $S$.
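To make the idea of "marginal contribution across all possible subsets" concrete, the sketch below computes exact Shapley values and a pairwise Shapley interaction value by brute-force subset enumeration. It does not use the `shap` library, and the value function `toy_risk` with its three feature names is a made-up example rather than the patent's model. The pairwise weight |S|!(M-|S|-2)! / (2(M-1)!) is the standard SHAP interaction weighting.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley value of each feature by enumerating all subsets."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

def shap_interaction(value_fn, features, i, j):
    """Pairwise Shapley interaction value for features i and j."""
    m = len(features)
    rest = [f for f in features if f not in (i, j)]
    total = 0.0
    for k in range(len(rest) + 1):
        weight = factorial(k) * factorial(m - k - 2) / (2 * factorial(m - 1))
        for subset in combinations(rest, k):
            s = frozenset(subset)
            delta = (value_fn(s | {i, j}) - value_fn(s | {i})
                     - value_fn(s | {j}) + value_fn(s))
            total += weight * delta
    return total

def toy_risk(active):
    """Hypothetical risk score over the set of 'active' features."""
    score = 0.0
    if "brake_light" in active:
        score += 0.3
    if "hand_risk" in active:
        score += 0.2
    if "speed" in active:
        score += 0.1
    if "brake_light" in active and "hand_risk" in active:
        score += 0.4          # joint effect of the two features
    return score

FEATURES = ["brake_light", "hand_risk", "speed"]
phi = shapley_values(toy_risk, FEATURES)
pair = shap_interaction(toy_risk, FEATURES, "brake_light", "hand_risk")
```

In this toy case the 0.4 joint effect is split evenly between the two interacting features in their Shapley values, and the attributions sum to the score of the full feature set. Enumeration costs grow exponentially with the number of features, so it is only practical for small feature sets such as the nine used here.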

[0034] Furthermore, as shown in Figures 2 to 4, Figure 2 illustrates the multimodal driving training set of the driving risk prediction system and method based on a multimodal large language model described in this embodiment. The training set is constructed in a "user instruction-model input-model output" format. First comes the user instruction, a prompt that lets the large model understand the user's needs and complete tasks in a specific domain as instructed, which improves the accuracy of its recognition and analysis. In this study, the instruction provided to the large model is: "You are an experienced driving risk monitor; you need to accurately determine the risk factors in video image data." Next comes the user input. The visual language model used in this study takes input in three formats: text, image, and video (mp4). The text is what the user supplies to the large model, including contextual knowledge and the questions to be analyzed. Finally comes the large model output: following the user instruction, the large model analyzes and reasons over the user's text question and the video or image information, and outputs the final answer.

Figure 3 illustrates the fine-tuning training process of the driving risk prediction system and method based on a multimodal large language model in this embodiment. The process fine-tunes the model on the multimodal training set. Input data is divided into two information sources: driving visual data and driving feature labels. The driving visual data primarily includes road scene images or video frame sequences collected during vehicle operation. This data first enters the visual encoding module, where a visual encoding network extracts features from the original image information, converting high-dimensional pixel information into structured visual feature vectors that represent the road environment, vehicle position relationships, traffic participants, and traffic scene semantics. In parallel, the driving feature label data undergoes visual encoding or feature encoding, which transforms the original driving behavior labels, risk level information, or traffic event annotations into feature representations that the model can process. The visual feature representation and the driving semantic feature representation are then input into the text-image fusion processing layer. In this fusion layer, cross-modal feature alignment and information fusion mechanisms model the visual scene information and the driving semantic label information jointly, so that the model establishes the association between visual content and driving behavior semantics and generates a unified multimodal semantic feature representation. The fused multimodal features are then input into the large language model decoding module, where the sequence modeling capabilities of the large language model are used to perform semantic reasoning and text generation on the fused features. The decoding process outputs the corresponding driving behavior explanation, risk assessment, or traffic event description.
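One training sample in the "user instruction-model input-model output" format might be serialized as follows. This is a minimal sketch: the instruction string is quoted from the text above, but the field names, file names, question, and answer are hypothetical placeholders, and the JSON Lines layout is an assumption rather than the format used in this embodiment.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Instruction quoted from the description of Figure 2.
SYSTEM_INSTRUCTION = (
    "You are an experienced driving risk monitor; you need to accurately "
    "determine the risk factors in video image data."
)


@dataclass
class TrainingSample:
    """One sample: user instruction, model input, model output."""
    instruction: str
    text: str                                         # context and question
    images: List[str] = field(default_factory=list)   # image file paths
    videos: List[str] = field(default_factory=list)   # mp4 file paths
    output: str = ""                                  # annotated target answer


def to_jsonl_line(sample: TrainingSample) -> str:
    """Serialize a sample as one line of a JSON Lines file."""
    return json.dumps(asdict(sample), ensure_ascii=False)


if __name__ == "__main__":
    sample = TrainingSample(
        instruction=SYSTEM_INSTRUCTION,
        text="Is the driver following the lead vehicle too closely?",  # hypothetical
        videos=["clip_0001.mp4"],                                      # hypothetical
        output="Yes. The gap to the lead vehicle keeps shrinking.",    # hypothetical
    )
    print(to_jsonl_line(sample))
```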

[0035] Figure 4 illustrates the method by which the large model reads driving feature semantics in the driving risk prediction system and method based on a multimodal large language model described in this embodiment. The method first preprocesses the acquired raw driving video data, segmenting the continuous video stream into frames according to a preset time interval or a keyframe extraction strategy to obtain a sequence of consecutive image frames. Each frame reflects the road traffic environment of the vehicle at a specific moment. The segmented video frames are then input into the large model image recognition module, where a deep learning-based large visual model automatically identifies and semantically interprets the image content. This recognition process detects and analyzes key elements of the road scene, including vehicles, pedestrians, lane lines, traffic signs, traffic lights, and the relative positions of vehicles, and combines scene context information to judge driving behavior comprehensively. After image feature extraction and semantic recognition are complete, the system determines risk behavior from the recognition results: using preset rules for recognizing risky driving behavior or a risk assessment model, it automatically judges whether dangerous driving behaviors exist in the current driving scenario, such as rapid acceleration, rapid deceleration, abnormal lane changes, following too closely, or potential collision risks. When risky driving behavior is identified, the system outputs the corresponding risk identification result and related scenario information for subsequent driving risk prediction; when no risky behavior is detected, the system outputs a normal driving status result.
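The preprocessing and output routing described above can be sketched as follows. This is a minimal sketch under stated assumptions: only the fixed-interval strategy is shown (not keyframe extraction), the recognition model itself is omitted, and the function names, the label strings, and the result dictionary are hypothetical.

```python
from typing import Dict, Iterable, List

# Risky behaviours named in the description; the exact label strings are assumed.
RISKY_BEHAVIOURS = {
    "rapid acceleration",
    "rapid deceleration",
    "abnormal lane change",
    "following too closely",
    "potential collision",
}


def sample_frame_indices(total_frames: int, fps: float, interval_s: float) -> List[int]:
    """Indices of the frames kept when sampling one frame every interval_s seconds."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


def pack_clips(frame_indices: List[int], clip_len: int) -> List[List[int]]:
    """Group consecutive sampled frames into non-overlapping clips of clip_len frames."""
    last_start = len(frame_indices) - clip_len
    return [frame_indices[k:k + clip_len] for k in range(0, last_start + 1, clip_len)]


def summarize(detected: Iterable[str]) -> Dict[str, object]:
    """Route recognition labels to a risk result or a normal-driving result."""
    risky = sorted(set(detected) & RISKY_BEHAVIOURS)
    return {"status": "risk" if risky else "normal", "behaviours": risky}


if __name__ == "__main__":
    frames = sample_frame_indices(total_frames=100, fps=25.0, interval_s=1.0)
    print(frames, pack_clips(frames, clip_len=2))
    print(summarize(["following too closely", "lane keeping"]))
```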

[0036] This embodiment also provides a driving risk prediction method based on a multimodal large language model, including: collecting multi-source driving information, including natural driving video and vehicle trajectory data; constructing, from the multi-source driving information, a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels; optimizing the parameters of the multimodal large language model using the multimodal training dataset, and enhancing the feature extraction capability of the multimodal large language model by the low-rank adaptation method, to obtain a fine-tuned multimodal large language model; inputting the natural driving video into the fine-tuned multimodal large language model to extract visual semantic features; deeply fusing the vehicle trajectory data and the visual semantic features to construct a heterogeneous vector, and predicting future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, to obtain a time-series inference prediction result; and applying risk quantification and classification thresholds based on the relative reciprocal collision time, together with contribution algorithm analysis, to achieve quantitative attribution and technical explanation of the time-series inference prediction result.
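The low-rank adaptation step named above can be illustrated on a single linear layer. This is a minimal pure-Python sketch under stated assumptions, not the fine-tuning code of this embodiment: it follows the common LoRA convention in which the down-projection matrix is Gaussian-initialized and the up-projection matrix is zero-initialized, and the class name, the standard deviation, and the optional `scale` factor are assumptions.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank bypass B @ A."""

    def __init__(self, weight: Matrix, rank: int, scale: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weights, kept frozen
        # Down-projection A (rank x d_in): random Gaussian initialization.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # Up-projection B (d_out x rank): zero initialization, so the bypass
        # contributes nothing before training starts.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale    # assumed optional scaling of the bypass

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)               # original branch
        bypass = matvec(self.B, matvec(self.A, x))  # low-rank bypass branch
        # The two branch outputs are summed to form the layer output.
        return [b + self.scale * p for b, p in zip(base, bypass)]


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
    print(layer.forward([1.0, 1.0]))  # equals the frozen layer's output at init
```

Because B starts at zero, the adapted layer reproduces the pre-trained layer exactly at the first training step; only A and B would be updated during fine-tuning, which keeps the number of trainable parameters small.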

[0037] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0038] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A driving risk prediction system based on a multimodal large language model, characterized in that it comprises: a multi-source driving information acquisition module, used to collect multi-source driving information, wherein the multi-source driving information includes natural driving video and vehicle trajectory data; a multi-source driving dataset construction module, used to construct, from the multi-source driving information, a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels; a large model fine-tuning training module, used to optimize the parameters of the multimodal large language model using the multimodal training dataset and to enhance the feature extraction capability of the multimodal large language model through a low-rank adaptation method, so as to obtain a fine-tuned multimodal large language model; a driving feature semantic extraction module, used to input the natural driving video into the fine-tuned multimodal large language model to extract visual semantic features; a time-series risk prediction module, used to deeply fuse the vehicle trajectory data and the visual semantic features to construct a heterogeneous vector, and to predict future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, thereby obtaining a time-series inference prediction result; and a risk classification and interpretation module, used to quantify and classify risk using thresholds based on the relative reciprocal collision time, and to apply contribution algorithm analysis to achieve quantitative attribution and technical explanation of the time-series inference prediction result.

2. The driving risk prediction system based on a multimodal large language model according to claim 1, characterized in that the multi-source driving information acquisition module includes: a video acquisition submodule, used to acquire real road environment data containing multi-view driving footage to obtain the natural driving video; and a trajectory acquisition submodule, used to acquire vehicle dynamics characteristics and driving position sequences to obtain the vehicle trajectory data.

3. The driving risk prediction system based on a multimodal large language model according to claim 1, characterized in that the multi-source driving dataset construction module includes: a semantic labeling submodule, used to perform multi-dimensional manual annotation on the natural driving video to obtain the behavioral semantic labels; and a feature combination submodule, used to extract the driving environment and vehicle status from the natural driving video as the visual information, to extract physical features from the vehicle trajectory data as the sensor information, and to combine the visual information, the sensor information, and the behavioral semantic labels to construct the multimodal training dataset.

4. The driving risk prediction system based on a multimodal large language model according to claim 1, characterized in that the large model fine-tuning training module includes: a parameter freezing submodule, used to load the pre-trained base model and set the original weight matrix to a frozen state; a bypass construction submodule, used to introduce two consecutive small matrices into the Transformer layer to form a bypass, and to initialize the two consecutive small matrices with random Gaussian initialization and zero matrix initialization respectively, so as to construct the underlying architecture of the low-rank adaptation method; a parameter optimization submodule, used to feed the input vector of the multimodal training dataset simultaneously into the original branch and the low-rank bypass branch for parameter optimization during training iterations; and a feature fusion submodule, used to sum and fuse the outputs of the original branch and the low-rank bypass branch to obtain the fine-tuned multimodal large language model.

5. The driving risk prediction system based on a multimodal large language model according to claim 1, characterized in that the driving feature semantic extraction module includes: a static image extraction submodule, used to segment the natural driving video into single-frame images at a specified frequency and to input them into the fine-tuned multimodal large language model to extract the visual semantic features; and a dynamic video extraction submodule, used to package multiple consecutive images into a video sequence and to input it into the fine-tuned multimodal large language model for overall action semantic determination, so as to extract the visual semantic features related to the driver's state.

6. The driving risk prediction system based on a multimodal large language model according to claim 1, characterized in that the time-series risk prediction module includes: an encoding submodule, consisting of a first-layer long short-term memory recurrent neural network, used to receive the heterogeneous vector constructed by deep fusion of the vehicle trajectory data and the visual semantic features within the observation window, and to compress the heterogeneous vector into a fixed-length semantic hidden vector; a decoding submodule, consisting of a second-layer long short-term memory recurrent neural network and a fully connected output layer, used to receive the semantic hidden vector as its initial state and to recursively generate a risk value sequence containing the prediction of the future risk trend; and a prediction submodule, used to perform inference on the risk value sequence using the sliding time window algorithm to obtain the time-series inference prediction result.

7. The driving risk prediction system based on a multimodal large language model according to claim 1, characterized in that the risk classification and interpretation module includes: a risk quantification and classification submodule, used to set the risk quantification and classification thresholds based on the relative reciprocal collision time and to classify the risk level of the time-series inference prediction result according to those thresholds; and a quantitative attribution module, used to calculate, with the contribution algorithm, the change in marginal contribution of each input feature across all possible feature subsets, and to output the interaction effects between paired features as the quantitative attribution and technical explanation.

8. The driving risk prediction system based on a multimodal large language model according to claim 7, characterized in that the risk quantification and classification submodule sets the risk quantification and classification thresholds as four levels of early warning thresholds: no risk, low risk, medium risk, and high risk.

9. A driving risk prediction method based on a multimodal large language model, characterized in that it comprises: collecting multi-source driving information, including natural driving video and vehicle trajectory data; constructing, from the multi-source driving information, a multimodal training dataset containing visual information, sensor information, and behavioral semantic labels; optimizing the parameters of the multimodal large language model using the multimodal training dataset, and enhancing the feature extraction capability of the multimodal large language model by a low-rank adaptation method, to obtain a fine-tuned multimodal large language model; inputting the natural driving video into the fine-tuned multimodal large language model to extract visual semantic features; deeply fusing the vehicle trajectory data and the visual semantic features to construct a heterogeneous vector, and predicting future risk trends based on a two-layer sequence-to-sequence long short-term memory recurrent network and a sliding time window algorithm, to obtain a time-series inference prediction result; and applying risk quantification and classification thresholds based on the relative reciprocal collision time, together with contribution algorithm analysis, to achieve quantitative attribution and technical explanation of the time-series inference prediction result.