Networked automatic driving trajectory prediction method based on large language model thinking chain distillation

By employing a thought chain distillation method based on a large language model, combined with a two-stream coding architecture and a multi-task loss function, the interpretability and reasoning ability issues of trajectory prediction for connected autonomous driving are addressed. This approach achieves efficient trajectory prediction and natural language interpretation, thereby improving the system's credibility and resource utilization efficiency.

CN122264142A · Pending · Publication Date: 2026-06-23 · BEIJING UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING UNIV OF TECH
Filing Date: 2026-03-16
Publication Date: 2026-06-23

IPC: G06N5/045; G06N5/04; G06F18/213; G06F18/25; G06F40/284; G06F40/30; G06N3/045; G06N3/096; G06N3/098

AI Tagging

Application Domain: Semantic analysis; Biological models
Technology Topics: Semantic alignment; Linguistic model

AI Technical Summary

Technical Problem

Existing methods for predicting the trajectory of connected autonomous vehicles lack interpretability and deep reasoning capabilities, and general-purpose large language models lack native perception capabilities of high-precision map vector information, leading to erroneous decisions and wasted resources.

Method used

We employ a thought chain distillation method based on a large language model. The teacher model generates thought chain explanation text, and a two-stream coding architecture is used to extract temporal and topological features. The student model is trained by combining a multi-task joint loss function to achieve end-to-end trajectory prediction and interpretive output.

Benefits of technology

It improves the interpretability and logical reasoning ability of trajectory prediction, reduces computational and storage overhead, ensures stable operation of the model on a resource-constrained vehicle platform, and outputs natural language explanations containing causal logic.

Abstract

The application relates to the technical field of automatic driving, and particularly discloses a networked automatic driving trajectory prediction method based on large language model thinking chain distillation, which comprises the following steps: inputting original driving data into a pre-trained teacher model to generate a thinking chain explanation text; extracting double-flow space-time features based on a double-flow coding architecture; taking the geometric driving intention of a target vehicle as strong prior knowledge, and performing semantic alignment processing on the double-flow space-time features to generate a prefix prompt word vector; in the training stage, inputting the prefix prompt word vector and the thinking chain explanation text into a student model, adopting a multi-task joint loss function to perform end-to-end training on the student model, and enabling the student model to imitate the thinking chain reasoning logic of the teacher model through a distillation mechanism; and in the online reasoning stage, outputting the future driving trajectory and the explanatory text of the target vehicle based on the trained student model. The application can improve the explainability and deep reasoning capability of the driving trajectory prediction process.

Description

Technical Field

[0001] This invention relates to the field of autonomous driving technology, and more specifically to a method for predicting the trajectory of connected autonomous driving based on chain-of-thought distillation from a large language model.

Background Technology

[0002] As autonomous driving technology moves from closed testing environments to commercial deployment on open urban roads, the accuracy and reliability of vehicle trajectory prediction (VTP) for connected and automated vehicles determine the safety boundaries of driving. In new intelligent connected hybrid traffic flows, autonomous vehicles not only need to accurately perceive the physical state of surrounding obstacles, but also need to deeply understand the long-tail game logic and potential driving intentions among traffic participants. New hybrid traffic flows where human-driven vehicles (HV), connected vehicles (CV), and connected and automated vehicles (CAV) coexist will persist for a long time. High-precision, interpretable, and reliable vehicle trajectory prediction has become crucial for achieving collaborative control of new intelligent connected hybrid traffic flows, and has significant theoretical and application prospects for improving road safety, traffic efficiency, energy conservation and emission reduction, and system reliability.

[0003] Currently, most mainstream VTP methods are based on discriminative architectures built using deep learning (DL). While they achieve high levels of prediction accuracy, they are essentially data-driven black-box systems. These models struggle to explicitly model causal reasoning processes in driving scenarios and cannot provide reliable decision-making basis for the planning and control module. With the continuous development of big data and artificial intelligence, Large Language Models (LLMs), with their powerful semantic understanding and logical reasoning capabilities, offer a new technological paradigm for cognitive autonomous driving decision-making.

[0004] Currently, DL-based VTP decision-making methods lack interpretability, while LLM-based VTP decision-making methods suffer from high inference latency and heavy consumption of computational resources. In addition, general-purpose LLMs have no native ability to perceive high-precision map vector information, which can easily lead to erroneous decisions that violate road geometric constraints.

[0005] Therefore, how to improve the interpretability and deep reasoning ability of the driving trajectory prediction process is a technical problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0006] In view of the above problems, this invention proposes a method for predicting the trajectory of connected autonomous driving based on chain-of-thought distillation from a large language model, so as to overcome the above problems or at least partially solve them.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] A method for predicting the trajectory of connected autonomous driving based on chain-of-thought distillation from a large language model includes the following steps: inputting raw driving data into a pre-trained teacher model to generate a chain-of-thought explanation text that covers driving intention, multi-vehicle interaction behavior reasoning, and active obstacle avoidance safety constraints; extracting, based on a pre-built dual-stream coding architecture, target vehicle trajectory features containing temporal dependencies and map topology features around the target vehicle, and fusing the two to obtain dual-stream spatiotemporal features; taking the geometric driving intention of the target vehicle as strong prior knowledge and semantically aligning it with the dual-stream spatiotemporal features to generate prefix prompt word vectors; during the training phase, inputting the prefix prompt word vectors and the chain-of-thought explanation text into the student model, training the student model end-to-end with a multi-task joint loss function, and using a distillation mechanism so that the student model imitates the chain-of-thought reasoning logic of the teacher model; and during the online inference phase, acquiring real-time driving data of the target vehicle, extracting the dual-stream spatiotemporal features and prefix prompt word vectors from that data, and inputting them into the trained student model, which outputs the future driving trajectory of the target vehicle and an explanatory text.

[0009] Furthermore, the process by which the pre-trained teacher model generates the chain-of-thought explanation text includes: using a semantic serialization method based on predefined templates to transform the trajectory set $X_t$ of the target vehicle and surrounding vehicles, together with the local road topology vector of the area where the target vehicle is located, into a natural language text description, where the local road topology vector is the local high-definition map around the target vehicle and consists of the center lines of the lanes; and using the natural language text description as the input context of the teacher model, which generates the optimal chain-of-thought explanation text through greedy decoding.

[0010] Furthermore, the process of extracting the target vehicle trajectory features includes: decoupling the historical trajectory set of the target vehicle over the past observation period into a spatial component containing position coordinates and a motion component containing dynamic data; inputting the spatial component into a multilayer perceptron (MLP) for feature mapping to obtain a basic geometric vector representing the spatial attributes of the vehicle's position; inputting the motion component into a one-dimensional convolutional neural network (1D-CNN) for local temporal feature extraction to obtain a high-order motion vector representing the vehicle's motion trend; fusing, with the hyperbolic tangent function as the activation function, the basic geometric vector and the high-order motion vector at each time step of the historical trajectory set to obtain a fused single-frame trajectory feature; and concatenating the single-frame trajectory features of all time steps into a time-step sequence, which is input into a Transformer encoder to obtain the target vehicle trajectory features containing temporal dependencies.

[0011] Furthermore, the process of extracting the map topological features includes: establishing a lane line feature extraction enhancement strategy comprising a TOP-N filtering operation and a random permutation operation, where the TOP-N filtering operation selects, based on Euclidean distance, the set of candidate lanes closest to the target vehicle from the local road topology vector, and the random permutation operation generates a permutation index vector and randomly shuffles the lane order of the candidate lane set accordingly to generate the map features; for any lane in the map features, performing feature mapping independently on each sampling point of that lane using a multilayer perceptron (MLP); using a gated recurrent unit (GRU) to read, in sequence along the lane direction, the point feature sequence produced by the MLP, and outputting the hidden state of the last time step as the aggregated geometric feature of that lane; and feeding the geometric features of all lanes in the map features into a multi-layer dynamic graph attention network to model the topological dependencies between lanes and generate map topological features that include lane geometric attributes and global road network topology information.

[0012] Furthermore, the fusion process of the spatiotemporal features of the two streams includes: Using the target vehicle trajectory features as the query vector and map topological features as the key and value, a multi-head cross-attention mechanism is employed to dynamically embed map information into the trajectory features, resulting in dual-stream spatiotemporal features. The definition of dual-stream spatiotemporal features is as follows:

[0013]

[0014] where the two inputs are the target vehicle trajectory features and the map topological features; the attention weight of the $h$-th head in the multi-head cross-attention network gives the relevance of the target vehicle to all candidate lanes, reflecting the degree of attention the vehicle pays to each lane; $H$ denotes the number of attention heads; Softmax denotes the activation function; each attention head $h$ has its own query weight matrix, key weight matrix, and value weight matrix; the scaling term is the dimension of the key vector; a further parameter matrix aggregates the multi-head attention values; and LayerNorm denotes the layer normalization operation.
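The fusion described in [0012] to [0014] follows the standard multi-head cross-attention pattern, which can be sketched as below. This is an illustrative reading rather than the patent's own formula: the symbol names ($F_{traj}$ for the trajectory features, $F_{map}$ for the map topological features, $\alpha_h$, $W_h^{Q}$, $W_h^{K}$, $W_h^{V}$, $W^{O}$, $d_k$, $F_{st}$) are assumed, and so is the residual connection inside the layer normalization.

```latex
\begin{align}
\alpha_h &= \mathrm{Softmax}\!\left(\frac{\left(F_{traj} W_h^{Q}\right)\left(F_{map} W_h^{K}\right)^{\top}}{\sqrt{d_k}}\right), \qquad h = 1, \dots, H \\
\mathrm{head}_h &= \alpha_h \left(F_{map} W_h^{V}\right) \\
F_{st} &= \mathrm{LayerNorm}\!\left(F_{traj} + \mathrm{Concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_H\right) W^{O}\right)
\end{align}
```

Under this reading, each row of $\alpha_h$ sums to one over the candidate lanes, which is what allows it to be interpreted as the degree of attention paid to each lane.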

[0015] Furthermore, the calculation of the target vehicle's geometric driving intention includes: applying global average pooling to the dual-stream spatiotemporal features to obtain a trajectory summary feature of the target vehicle; and injecting the trajectory summary feature into each lane vector of the map topological features and modeling self-attention interactions with a Transformer encoder layer to obtain enhanced lane interaction features. The calculation formula is as follows:

[0016] where the input is the interaction feature of the $n$-th lane in the map feature matrix and the output is the enhanced interaction feature of the $n$-th lane. The unnormalized propensity score with which the target vehicle selects the $n$-th lane is then calculated as:

[0017] where the scoring function is a multilayer perceptron used for feature dimensionality reduction and score mapping. The probability of the target vehicle entering each lane at the current intersection is then calculated as:

[0018] where the result is the probability of the target vehicle entering the $n$-th lane direction; exp denotes the exponential function whose base is the natural constant $e$; $u$ denotes the traversal index over the candidate lanes in the normalization sum; and $U$ denotes the total number of valid candidate lanes around the target vehicle in the current scenario. Using the probabilities of the target vehicle entering each lane at the current intersection as weights, the interaction features of all candidate lanes in the map topological features are weighted and aggregated, and then mapped into an intent cue through a linear projection layer, which serves as the geometric driving intent of the target vehicle. The calculation formula is:

[0019] where Linear denotes a linear projection layer; the summand is the interaction feature of the $n$-th lane in the map topological features; and $N$ denotes the number of lanes in the map topological features.
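The lane-selection steps in [0016] to [0019] amount to a softmax over per-lane scores followed by a probability-weighted sum of lane features. The following is a minimal sketch of that pattern in plain Python. The function names, the toy feature values, and the hand-written scores (standing in for the output of the scoring MLP) are all hypothetical, and the final linear projection layer is omitted for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of unnormalized scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_intent(lane_features, lane_scores):
    """Weight each lane's interaction feature by its entry probability and sum."""
    probs = softmax(lane_scores)
    dim = len(lane_features[0])
    pooled = [sum(p * feat[d] for p, feat in zip(probs, lane_features))
              for d in range(dim)]
    return probs, pooled

# Hypothetical toy inputs: three candidate lanes with 2-D interaction features.
lane_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lane_scores = [2.0, 0.5, 0.5]  # stand-ins for the MLP propensity scores
probs, intent = aggregate_intent(lane_features, lane_scores)
```

Because the weights sum to one, the pooled vector stays on the same scale as the individual lane features regardless of how many candidate lanes are present.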

[0020] Furthermore, the expression for the prefix cue word vector is:

[0021]

[0022] where the terms denote, in order: the system prompt word vectors that define the role of the student model; the dual-stream spatiotemporal features; the joint features constructed by the sequence concatenation operation; Concat, the concatenation operation; LayerNorm, the normalization used in the linear adapter; the learnable weight matrix of the linear adapter; the learnable bias vector of the linear adapter; a learnable scaling factor; and the resulting prefix prompt word vector.

[0023] Furthermore, during the training phase, all original pre-trained weight parameters of the student model are frozen, and a trainable bypass branch adaptation layer is connected in parallel with each original linear transformation layer; during inference, the effective inference weight of each linear transformation layer in the student model is obtained by adding the frozen original pre-trained weight to the low-rank update weight of the bypass branch adaptation layer.
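The weight merge described in [0023] can be illustrated with a small sketch in which the low-rank update is the product of two thin matrices. The names `w0`, `b_mat`, `a_mat`, and `scale` are hypothetical, and the optional scaling factor is an assumption borrowed from common low-rank adaptation practice rather than something stated in the text.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_low_rank(w0, b_mat, a_mat, scale=1.0):
    """Return w0 + scale * (b_mat @ a_mat) without modifying the frozen w0."""
    delta = matmul(b_mat, a_mat)
    return [[w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]

# Hypothetical toy layer: a frozen 3x3 weight with a rank-1 update.
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_mat = [[1.0], [0.0], [2.0]]   # 3x1 trainable factor
a_mat = [[0.5, 0.0, -1.0]]      # 1x3 trainable factor
w_eff = merge_low_rank(w0, b_mat, a_mat)
```

In this toy case only 6 values are trainable instead of 9; the saving grows quickly with layer size, which is the motivation for keeping the backbone frozen.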

[0024] Furthermore, the teacher model is deployed on a cloud server to mine the deep causal logic of massive amounts of offline driving data and generate thought chain explanation text; the student model is deployed on in-vehicle edge computing nodes to achieve distillation training and real-time online inference.

[0025] Furthermore, the expression for the multi-task joint loss function is:

[0026] where the four loss terms are: the Laplace negative log-likelihood loss; the trajectory mode classification loss; the text generation cross-entropy loss, computed between the explanatory text generated by the student model and the chain-of-thought explanatory text provided by the teacher model; and the lane selection auxiliary classification loss, used to supervise the lane selection probability. The three remaining symbols denote training parameters.
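One plausible shape for the joint loss in [0025] and [0026] is a weighted sum of the four terms, sketched below. This is an assumption for illustration: the text names the four losses and three training parameters but does not show how they combine, so the placement of the weights (`w_mode`, `w_text`, `w_lane`), the function names, and all toy values here are hypothetical.

```python
import math

def laplace_nll(y_true, mu, b):
    """Mean Laplace negative log-likelihood: log(2b) + |y - mu| / b per coordinate."""
    terms = [math.log(2.0 * bi) + abs(yi - mi) / bi
             for yi, mi, bi in zip(y_true, mu, b)]
    return sum(terms) / len(terms)

def cross_entropy(probs, target_index):
    """Cross-entropy of a predicted distribution against a one-hot target."""
    return -math.log(probs[target_index])

def joint_loss(l_reg, l_mode, l_text, l_lane, w_mode=1.0, w_text=1.0, w_lane=1.0):
    """Weighted sum of the four task losses (weight placement is an assumption)."""
    return l_reg + w_mode * l_mode + w_text * l_text + w_lane * l_lane

# Hypothetical toy values for a single training sample.
l_reg = laplace_nll(y_true=[1.0, 2.0], mu=[1.2, 1.9], b=[0.5, 0.5])
l_mode = cross_entropy([0.1, 0.7, 0.2], target_index=1)  # trajectory mode
l_text = cross_entropy([0.25, 0.75], target_index=1)     # one text token
l_lane = cross_entropy([0.6, 0.3, 0.1], target_index=0)  # lane selection
total = joint_loss(l_reg, l_mode, l_text, l_lane,
                   w_mode=0.5, w_text=1.0, w_lane=0.2)
```

The Laplace term penalizes absolute rather than squared error, so it is less sensitive to occasional large trajectory deviations than a Gaussian likelihood would be.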

[0027] As can be seen from the above technical solution, compared with the prior art, the present invention has the following beneficial effects: 1. This invention utilizes a cloud-based knowledge distillation mechanism based on a thought chain to guide the training of the on-board student model, leveraging the strong reasoning capabilities of the cloud-based teacher model. This enables the on-board student model to output future trajectories and natural language explanations containing causal logic in parallel. This method not only endows the system with interpretability but also significantly enhances the logical reasoning ability of the on-board model.

[0028] 2. This invention designs a parallel dual-stream coding mechanism. In the dynamic stream, it extracts target vehicle trajectory features containing temporal dependencies. In the static stream, it models lane vector directions and aggregates road network topology features using a dynamic graph attention network. Simultaneously, combined with a lane disorder enhancement strategy, it effectively solves the problem of excessive model dependence on absolute position and significantly improves the model's feature extraction capability when facing complex topological scenarios.

[0029] 3. This invention deploys a large-scale model in the cloud as the teacher model and a lightweight model on the vehicle as the student model. The thought process chain of the large-scale cloud model serves as a transferable knowledge carrier. By designing a specific multi-task joint loss function, the vehicle-side student model learns and imitates the deep reasoning process of the cloud-based teacher model in the feature space. This allows the vehicle-side model to simultaneously generate natural language explanations containing causal logic while outputting the trajectory. This method not only improves the system's logical reasoning ability but also further enhances the system's interpretability and credibility.

[0030] 4. This invention maps dual-stream spatiotemporal features to a unified semantic space while requiring only a few fine-tuning parameters to adapt to specific driving scenarios. This method significantly reduces the computational and storage overhead of the model on edge devices, enabling complex inference tasks to run stably on in-vehicle platforms with limited resources, thereby promoting the practical application of large models in autonomous driving systems.

[0031] 5. This invention constructs a unified parallel multi-task decoding architecture that shares a high-dimensional semantic space constructed by a dual-stream coding architecture and a student model, and synchronously outputs driving intention, future trajectory, and semantic explanation text.

Attached Figure Description

[0032] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0033] Figure 1 is a flowchart of the connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation provided in an embodiment of the present invention;
Figure 2 is an overall framework diagram of the connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of the urban road network at location A provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of the urban road network at location B provided in an embodiment of the present invention;
Figure 5 is a comparison diagram of core performance indicators of different VTP methods when the number of modes K = 5, provided in an embodiment of the present invention;
Figure 6 is a comparison diagram of core performance indicators of different VTP methods when the number of modes K = 10, provided in an embodiment of the present invention;
Figure 7 is a schematic diagram of the ablation experiment results when the number of modes K = 5, provided in an embodiment of the present invention;
Figure 8 is a schematic diagram of the ablation experiment results when the number of modes K = 10, provided in an embodiment of the present invention;
Figure 9 is a schematic diagram of the multi-step prediction error of different VTP methods provided in an embodiment of the present invention.

Detailed Implementation

[0034] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0035] As shown in Figure 1, this invention discloses a method for predicting the trajectory of connected autonomous driving based on chain-of-thought distillation from a large language model, comprising the following steps:
S1. Input the raw driving data into the pre-trained teacher model to generate a chain-of-thought explanation text that covers driving intention, multi-vehicle interaction behavior reasoning, and active obstacle avoidance safety constraints. Based on a pre-built dual-stream coding architecture, extract target vehicle trajectory features containing temporal dependencies and map topology features around the target vehicle, and fuse the two to obtain dual-stream spatiotemporal features.
S2. Take the geometric driving intention of the target vehicle as strong prior knowledge and semantically align it with the dual-stream spatiotemporal features to generate prefix prompt word vectors.
S3. During the training phase, input the prefix prompt word vectors and the chain-of-thought explanation text into the student model, train the student model end-to-end with a multi-task joint loss function, and use a distillation mechanism so that the student model imitates the chain-of-thought reasoning logic of the teacher model.
S4. In the online inference stage, obtain real-time driving data of the target vehicle, extract the dual-stream spatiotemporal features and prefix prompt word vectors from that data, and input them into the trained student model, which outputs the future driving trajectory of the target vehicle and an explanatory text.

[0036] This invention mainly comprises four parts: vectorized scene perception and encoding (corresponding to S1), multimodal semantic alignment (corresponding to S2), knowledge-driven reasoning (corresponding to S3), and parallel multi-task decoding (corresponding to S4).

[0037] In the vectorized scene perception and encoding part, DeepSeek-V3.2 is used as the teacher model (also known as the teacher LLM) to extract features and enhance knowledge from the raw driving data, outputting a chain-of-thought (CoT) explanatory text that includes driving intention, multi-vehicle interaction behavior reasoning, and active obstacle avoidance safety constraints. A dual-stream spatiotemporal-topology coupled encoding architecture is constructed, including a temporal encoder and a map encoder, to extract dynamic spatiotemporal features of the target vehicle and its surrounding environment. In the map encoder, a lane enhancement strategy is established. By shuffling the lane order, the model avoids relying on the fixed numbering / arrangement of lanes (such as the prior positions of left lane 1, left lane 2), forcing the model to learn features from the actual associations between lanes rather than memorizing positional order, which effectively improves the model's generalization ability under different road layout scenarios.

[0038] In the multimodal semantic alignment part, the extracted spatiotemporal features are aligned with the semantic space through a linear adapter to generate contextual features that contain traffic context.

[0039] In the knowledge-driven reasoning part, Qwen2.5-0.5B is used as the student model (also known as the student LLM) for driving intention reasoning and causal logic analysis. The model is trained quickly with a strategy that freezes the backbone parameters and fine-tunes only low-rank adaptation parameters.

[0040] In the parallel multi-task decoding part, the hidden state output by the student model is combined with the distilled large model thought chain to simultaneously output interpretable trajectory prediction causal logic reasoning text and future vehicle trajectory that conforms to kinematic constraints.

[0041] The specific implementation methods for each of the above steps will be further explained below.

[0042] S1. Collect the motion states of the target vehicle and surrounding vehicles, as well as the local road topology; perform multi-source information fusion and formalization on these data; and output a high-dimensional feature vector. This specifically includes:

S11. Multi-source data vectorization.

[0043] Vehicle trajectory prediction uses the historical trajectory data of all observed vehicles (the target vehicle and surrounding vehicles) at each time step, together with the vectorized local road topology around the target vehicle, to predict the trajectory of target vehicle $i$ over the future time steps, as given by formulas (1) to (3), where: $X_t$ denotes the set of historical trajectories of all observed vehicles at time step $t$; $X_{j,t}$ denotes the set of historical trajectories of the $j$-th observed vehicle at time step $t$; $j$ is the index of the observed vehicle (the target vehicle or a surrounding vehicle), with $j = i$ denoting the target vehicle and $j \neq i$ denoting a surrounding vehicle; a further symbol denotes the total number of observed vehicles in the current scene; the local road topology vector is the local high-definition map around the target vehicle and consists of the lane center lines, where $L_n$ denotes the center line of the $n$-th lane; $Y_{i,t}$ denotes the set of candidate trajectories of target vehicle $i$ over the future time steps, and its optimal counterpart denotes the set of best predicted trajectories for the target vehicle; the conditional probability distribution of generating the future trajectory sequence $Y_{i,t}$, given the historical trajectory set $X_t$ of all observed vehicles in the scene and the local high-definition map, is given in step 4.2, formulas (33)-(34); and the maximization operator selects the future trajectory set $Y_{i,t}$ with the maximum conditional probability.

[0044] $X_{j,t}$, $Y_{i,t}$, and $L_n$ are given by formulas (4) to (8), where: the state vector of the $j$-th observed vehicle at a given time step consists of its position coordinates, velocity vector, acceleration vector, and heading angle; $T_{obs}$ denotes the total number of observed historical time steps; $T_{fut}$ denotes the total number of predicted future time steps; the state vector of target vehicle $i$ at a future time step contains only the position coordinates of the target vehicle; in formulas (6) and (7), the superscript T denotes the matrix transpose operation; the coordinate system takes the current position of the target vehicle as the origin, the heading direction of the vehicle as the vertical axis, and the direction perpendicular to the heading as the horizontal axis; each discrete sampling point on the center line of the $n$-th lane is a two-dimensional coordinate vector; and $\mathbb{R}$ denotes the set of real numbers.
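The target-centered coordinate system in [0044] can be illustrated with a small transform from a global frame into the vehicle frame. This is a sketch under stated assumptions: the heading is taken to be measured counterclockwise from the global x-axis in radians, the horizontal axis is taken to point to the vehicle's right, and the function name `to_ego_frame` is hypothetical.

```python
import math

def to_ego_frame(point, origin, heading_rad):
    """Map a global (x, y) point into a frame centered on the target vehicle.

    The vertical (y) axis follows the vehicle heading; the horizontal (x) axis
    is perpendicular to it and assumed to point to the vehicle's right.
    """
    dx = point[0] - origin[0]
    dy = point[1] - origin[1]
    x_ego = dx * math.sin(heading_rad) - dy * math.cos(heading_rad)
    y_ego = dx * math.cos(heading_rad) + dy * math.sin(heading_rad)
    return (x_ego, y_ego)

# A point 2 m straight ahead of a vehicle at (1, 2) heading along global +x.
ahead = to_ego_frame((3.0, 2.0), (1.0, 2.0), 0.0)
# A point 1 m to the left of the same vehicle.
left = to_ego_frame((1.0, 3.0), (1.0, 2.0), 0.0)
```

Expressing every trajectory and lane point in this frame removes the dependence on absolute map position, so scenes from different locations become directly comparable.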

[0045] S12. To compensate for the lack of explicit causal logic in traditional data-driven trajectory prediction methods, this invention introduces a pre-trained teacher model to generate the chain-of-thought explanation text. The specific process includes: using a semantic serialization method based on predefined templates to transform the trajectory set $X_t$ of the target vehicle and surrounding vehicles, together with the local road topology vector of the area where the target vehicle is located, into a natural language text description, where the local road topology vector is the local high-definition map around the target vehicle and consists of the center lines of the lanes; and using the natural language text description as the input context of the teacher model, which generates the optimal chain-of-thought explanation text through greedy decoding.
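A template-based serialization of the kind described in S12 might look like the sketch below. The template wording, the function name `serialize_scene`, and the data layout (a dictionary of position lists and a list of center-line point lists) are all hypothetical; the text does not disclose the actual templates.

```python
def serialize_scene(target_id, trajectories, lane_centerlines):
    """Render vehicle trajectories and lane center lines as a text prompt."""
    lines = [f"Target vehicle: {target_id}."]
    for vid, states in trajectories.items():
        role = "target" if vid == target_id else "surrounding"
        pts = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in states)
        lines.append(f"Vehicle {vid} ({role}) past positions: {pts}.")
    for n, pts in enumerate(lane_centerlines, start=1):
        path = " -> ".join(f"({x:.1f}, {y:.1f})" for x, y in pts)
        lines.append(f"Lane {n} center line: {path}.")
    lines.append("Explain the target vehicle's driving intention, its interactions "
                 "with surrounding vehicles, and the obstacle avoidance safety constraints.")
    return "\n".join(lines)

# Hypothetical toy scene with one surrounding vehicle and one lane.
prompt = serialize_scene(
    "v0",
    {"v0": [(0.0, 0.0), (0.0, 1.5)], "v1": [(3.5, 4.0), (3.5, 5.0)]},
    [[(0.0, 0.0), (0.0, 10.0)]],
)
```

A fixed template keeps the prompt structure identical across scenes, so differences in the teacher's output reflect the scene content rather than the phrasing of the input.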

[0046] The above process is expressed by formulas (9) and (10), where: each element of the sequence denotes the token at a given position, and the preceding context is empty at the first position; a further symbol denotes the total length of the chain-of-thought text sequence; the single-step probability is the probability with which the teacher model predicts the next token given the preceding tokens and the prompt; $\prod$ denotes the product operation; the joint probability is the probability with which the teacher model generates the chain-of-thought explanation text sequence given the natural language text description; and the teacher model's output for that input is a set of reasoning results comprising textual descriptions of scene intent analysis, multi-vehicle interaction behavior reasoning, and active obstacle avoidance safety constraints.
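The description in [0046] corresponds to the usual autoregressive factorization of a text sequence, which can be sketched as follows. The symbol names are assumed for illustration only: $S$ for the natural language text description, $C = (c_1, \dots, c_{|C|})$ for the chain-of-thought token sequence, and $c_{<u}$ for the tokens before position $u$ (empty when $u = 1$).

```latex
\begin{equation}
P(C \mid S) = \prod_{u=1}^{|C|} P\left(c_u \mid c_{<u},\, S\right)
\end{equation}
```

Each factor is the single-step probability of one token, so the joint probability of the whole explanation is the product of the per-token probabilities.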

[0047] Greedy decoding is implemented as follows. The target thought chain text sequence consists of a series of discrete tokens. At the current sequence position u, the teacher model computes the probability distribution over all candidate tokens in the vocabulary and selects the token with the highest probability as the current output, as given by formula (11), where the vocabulary is the set of all tokens that the teacher model can output. The iteration starts from the first position; each selected token is appended to the context used to predict the next token, and the process continues until the selected token is the end-of-sequence token, thereby obtaining a deterministic text sequence composed of the highest-probability token at each step.
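
The greedy decoding loop described above can be sketched as follows. This is a minimal illustration, not the teacher model itself: `next_token_probs` is a hypothetical stand-in that returns hard-coded toy distributions, and the names `EOS`, `greedy_decode` and `max_len` are assumptions introduced here.

```python
# Toy greedy decoder. `next_token_probs` is a hypothetical stand-in for the
# teacher model: it maps (prompt, tokens so far) to a probability per token.
EOS = "<eos>"

def next_token_probs(prompt, generated):
    # Hard-coded toy distributions, selected by the number of tokens so far.
    table = [
        {"vehicle": 0.7, "lane": 0.2, EOS: 0.1},
        {"yields": 0.6, "turns": 0.3, EOS: 0.1},
        {EOS: 0.8, "left": 0.2},
    ]
    return table[min(len(generated), len(table) - 1)]

def greedy_decode(prompt, max_len=16):
    generated = []
    joint_prob = 1.0  # running product of the single-step probabilities
    for _ in range(max_len):
        probs = next_token_probs(prompt, generated)
        token = max(probs, key=probs.get)  # arg max over the vocabulary
        joint_prob *= probs[token]
        if token == EOS:                   # stop at the terminator
            break
        generated.append(token)            # splice into the context
    return generated, joint_prob

tokens, prob = greedy_decode("scene description")
```

Because the arg max is taken one position at a time, the result is deterministic for a fixed prompt; `joint_prob` is the product of the selected single-step probabilities.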

[0048] S13. Introduce a lane line feature extraction enhancement strategy.

[0049] To eliminate the student model's overfitting to the input order of map lanes, this invention establishes a lane line feature extraction enhancement strategy comprising a TOP-N filtering operation and a random permutation operation. The TOP-N filtering operation selects, from the local road topology vector and based on Euclidean distance, the set of N candidate lanes closest to the target vehicle. The random permutation operation generates a permutation index vector and randomly shuffles the lane order of the candidate lane set accordingly to produce the map features. The entire process is expressed by formulas (12)-(14), where the candidate lanes are indexed n = 1, 2, ..., N in order of increasing distance, so that n = 1 denotes the lane closest to the target vehicle; the permutation set is the set of all possible permutations of the indices; the random permutation index vector is obtained by sampling from a uniform distribution over that set; each element of the vector is one index value of the random permutation; and the corresponding lane feature vector is extracted from the ordered candidate set according to that index value.
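
A minimal sketch of the TOP-N filtering and random permutation steps is given below. It assumes that lanes are lists of (x, y) center line points in the target-vehicle-centered frame, and that the distance from a lane to the vehicle is the minimum point-to-origin distance; the source does not specify the exact distance definition, and the function names are hypothetical.

```python
import math
import random

def top_n_lanes(lanes, n):
    """Keep the n lanes closest to the origin (the target vehicle), nearest first.
    Assumption: lane distance = minimum distance of any center line point."""
    def dist(lane):
        return min(math.hypot(x, y) for x, y in lane)
    return sorted(lanes, key=dist)[:n]

def shuffle_lanes(candidates, rng):
    """Draw a uniformly random permutation index vector and reorder the lanes."""
    perm = list(range(len(candidates)))
    rng.shuffle(perm)
    return [candidates[i] for i in perm], perm

lanes = [
    [(10.0, 0.0), (12.0, 0.0)],
    [(1.0, 0.0), (1.0, 5.0)],
    [(-3.0, 0.0), (-3.0, 5.0)],
    [(30.0, 4.0), (35.0, 4.0)],
]
candidates = top_n_lanes(lanes, n=3)                  # ordered by distance
map_features, perm = shuffle_lanes(candidates, random.Random(0))
```

The shuffled list contains exactly the same lanes as the ordered candidate set; only their positions change, which is what removes the order cue from the student model's input.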

[0050] The map features constructed through the above steps and the historical trajectory set are then input to the dual-stream coding architecture, which extracts lane geometry features and temporal features.

[0051] S14. Constructing a dual-stream coding architecture: Traditional trajectory encoders typically concatenate position, velocity, and acceleration data and feed them directly into a Long Short-Term Memory (LSTM) network. This mixed encoding is susceptible to noise, and high-order dynamic quantities such as velocity and acceleration fluctuate sharply, which easily leads to gradient explosion. Furthermore, existing methods lack hierarchical modeling of lane-level vector maps and fail to explicitly distinguish lane geometry from topological connections, which limits the model's ability to understand structured road networks. Therefore, this invention proposes a dual-stream coding architecture. In the dynamic stream, it captures vehicle motion states by using a multilayer perceptron (MLP) to extract positional geometric features, a 1D-CNN to smoothly extract dynamic features such as velocity and acceleration, and a Transformer to extract long-term temporal dependency features. In the static stream, it constructs a hierarchical large-scale vector map encoder: at the micro level, an MLP and a gated recurrent unit (GRU) extract the geometric shape and driving direction features of a single lane; at the macro level, a dynamic graph attention network models the topological connections between lanes, thereby achieving a multi-scale representation of the road structure. The two streams are fused by a multi-head cross-attention mechanism, which deeply combines vehicle temporal motion features with road spatial topological features to jointly support accurate vehicle trajectory prediction and behavior understanding.

[0052] S141. Target vehicle trajectory feature extraction: The historical trajectory set of the target vehicle within the past observation period is decoupled into a spatial component containing the position coordinates and a motion component containing the dynamic data. The spatial component is input to a multilayer perceptron (MLP) for feature mapping, yielding the basic geometric vectors that represent the spatial attributes of the vehicle's position; the motion component is input to a 1D-CNN for local temporal feature extraction, yielding the high-order motion vectors that represent the vehicle's motion trend. The calculation is given by formula (15), where the basic geometric vectors and the high-order motion vectors each contain a feature sequence over T_obs time steps, and T_obs denotes the total number of observed historical time steps.

[0053] Subsequently, to enhance training stability and suppress outliers, the hyperbolic tangent (tanh) function is used as the activation function to fuse, at each time step of the historical trajectory set, the basic geometric vector with the high-order motion vector. The fused single-frame trajectory feature is defined by formula (16), whose parameters are the fusion weight and the fusion bias, and in which the two vectors are joined by a concatenation operation.
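
The fusion step can be illustrated with a short sketch. It assumes the common form tanh(W · [g ; m] + b), where g is the basic geometric vector, m is the high-order motion vector and [ ; ] is concatenation; the exact form of formula (16) is not recoverable from the text, and all names and numbers below are illustrative.

```python
import math

def fuse(geom_vec, motion_vec, weight, bias):
    """Single-frame fusion: tanh(W @ concat(geom, motion) + b)."""
    joint = geom_vec + motion_vec  # list concatenation = splicing operation
    return [
        math.tanh(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(weight, bias)
    ]

geom = [0.5, -1.0]
motion = [2.0, 0.0, 30.0]          # the last entry is a deliberate outlier
W = [
    [1.0, 0.0, 0.0, 0.0, 0.0],     # picks the first geometric entry
    [0.0, 0.0, 0.0, 0.0, 1.0],     # picks the outlier
]
b = [0.0, 0.0]
fused = fuse(geom, motion, W, b)
```

Because tanh saturates, the outlier value 30.0 is mapped to a fused entry bounded by 1, which is the outlier suppression effect the paragraph refers to.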

[0054] Finally, to capture the long-range temporal dependencies of the trajectory sequence, the single-frame trajectory features at all time steps are concatenated into a time-step sequence and input to a Transformer encoder, which aggregates information over the entire sequence through a self-attention mechanism to obtain vehicle trajectory features containing temporal dependencies. The calculation is given by formula (17).

S142. Map topological feature encoding: The map encoder receives the map features. Because the lane order in the map features has been randomly shuffled, the map encoder of this invention does not rely on the absolute index position of a lane in the sequence when extracting map features. Instead, it extracts local map topological connectivity features using a multilayer perceptron (MLP) structure. At the same time, to capture the inherent geometric properties of lanes, such as curvature and traffic flow direction, this invention employs a gated recurrent unit (GRU) for sequence modeling and uses the lane geometric features aggregated by the GRU as initial node features, which are input to an L-layer dynamic graph attention network to capture the map topological connections between lanes. The calculation is given by formulas (18) and (19), where the n-th lane is the n-th lane extracted after shuffling; the MLP maps each sampling point of any lane independently to a point feature; and the GRU reads the MLP-processed point feature sequence in order along the lane direction and outputs the hidden state of the last time step as the aggregated geometric feature of the n-th lane.

[0055] The specific process is as follows. Given the MLP-processed point feature sequence of a lane, the GRU processes the points in order from the start of the lane to its end. Starting from the first point, it computes the hidden state of the first time step from the initial hidden state. Subsequently, when processing each later point, the GRU takes the feature of the current point and the hidden state of the previous time step together as input and updates the hidden state of the current step through its internal gating mechanism. Through this chained propagation, the geometric information of the lane is gradually accumulated in the hidden state.

[0056] After the last point has been processed, the final output hidden state contains the complete geometric topology information of the entire lane and is taken as the geometric feature of that lane.
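
The chained GRU aggregation over lane points can be sketched as below. This is a minimal GRU cell written out by hand under one common gate convention (update gate z, reset gate r, new state = (1 - z) · h + z · candidate); the patent does not state which convention it uses, and the parameter names and toy values are assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gate(act, w, u, b, x, h):
    """act(W x + U h + b), applied element-wise."""
    return [act(p + q + c) for p, q, c in zip(matvec(w, x), matvec(u, h), b)]

def gru_step(x, h, p):
    z = gate(sigmoid, p["Wz"], p["Uz"], p["bz"], x, h)        # update gate
    r = gate(sigmoid, p["Wr"], p["Ur"], p["br"], x, h)        # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = gate(math.tanh, p["Wh"], p["Uh"], p["bh"], x, rh)  # candidate state
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def encode_lane(point_feats, p, hidden_size=2):
    h = [0.0] * hidden_size      # initial hidden state
    for x in point_feats:        # from the start of the lane to its end
        h = gru_step(x, h, p)
    return h                     # last hidden state = lane geometric feature

half_eye = [[0.5, 0.0], [0.0, 0.5]]
params = {k: half_eye for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
params.update({"bz": [0.0, 0.0], "br": [0.0, 0.0], "bh": [0.0, 0.0]})

points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]   # toy point features
forward = encode_lane(points, params)
backward = encode_lane(points[::-1], params)
```

Reading the same points in reverse order yields a different final hidden state, which is how the recurrent aggregation encodes the driving direction of the lane.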

[0057] Finally, the set of geometric features of the N lanes in the map features is fed into the multi-layer dynamic graph attention network to model the topological dependencies between lanes and to generate map topological features that include both lane geometric attributes and global road network topology information.

[0058] The output features of the l-th layer of the dynamic graph attention network are defined by formulas (20) and (21), whose terms are as follows: the spatial attention score produced by the h-th attention head of the l-th layer, which measures the degree of correlation between different lanes in the topological space; the lane feature matrix output by the previous layer, which for l = 1 is the set of lane geometric features; the query, key and value parameter matrices of the h-th attention head of the l-th layer; the matrix transpose operation; the dimension of the key; the parameter matrix that aggregates the multi-head attention values; and the vector concatenation operation. The concatenated result is then aggregated through residual connections over L layers to generate the final map topological features.

[0059] S143. Cross-modal feature interaction: To enable the vehicle trajectory to perceive the surrounding road topology, the target vehicle trajectory features are used as the query, and the map topological features are used as the key and value. A multi-head cross-attention mechanism dynamically embeds map information into the trajectory features, yielding the dual-stream spatiotemporal features defined by formulas (22) and (23), whose terms are as follows: the target vehicle trajectory features; the map topological features; the relevance weights of the h-th attention head between the target vehicle and all candidate lanes, which reflect how much attention the vehicle pays to each lane; H, the number of attention heads; Softmax, the normalizing activation function; the query, key and value weight matrices of the h-th attention head; the dimension of the key vector, used to scale the dot product so as to prevent vanishing gradients; the parameter matrix that aggregates the multi-head attention values; and the layer normalization operation. The final output not only preserves the vehicle's historical motion trend but also integrates the geometric and topological information of the surrounding lanes, providing complete scene context for subsequent large-model inference.
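
The cross-attention step can be illustrated with a single-head sketch in which trajectory features act as queries and lane features act as keys and values. For brevity it assumes identity query, key and value projections and omits the multi-head concatenation, the residual connection and layer normalization; the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) applied to values."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)               # attention of this time step to each lane
        out = [sum(wi * v[c] for wi, v in zip(w, values))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

traj_feats = [[1.0, 0.0], [0.0, 1.0]]               # two time steps
lane_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]   # three candidate lanes
fused, attn = cross_attention(traj_feats, lane_feats, lane_feats)
```

Each row of `attn` is a probability distribution over the candidate lanes, so the fused feature of a time step is a convex combination of lane features dominated by the best-matching lane.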

[0060] S2. Multimodal semantic alignment: This step receives the vehicle trajectory features and the map topological features. Using lane selection intent as strong prior knowledge, it constructs a linear adapter to align lane geometry features with the LLM semantic space, and then uses a student model for causal logical reasoning about high-level driving intent. Specifically:

S21. Lane perception intent prediction and prior guidance: Traditional vehicle trajectory prediction (VTP) methods cannot choose clearly between turning left and going straight at an intersection and output only an intermediate path. To address this, this invention employs an attention mechanism that explicitly calculates the matching degree between the vehicle and the center line of each lane and thereby pre-locks high-probability target lanes. This explicit geometric intent is encoded as a strong prior feature and injected into the student LLM, enabling it to generate compliant trajectories that conform to road geometric constraints.

[0061] The geometric driving intent of the target vehicle is calculated as follows: 1) Global average pooling is applied to the dual-stream spatiotemporal features to obtain the target vehicle trajectory summary feature. This feature compresses the vehicle's dynamic trajectory sequence over time into a fixed-length global semantic vector that summarizes the overall motion state and driving trend of the target vehicle over the period, and it is used for similarity matching with the candidate lanes in the feature space. The trajectory summary feature is injected into each lane vector of the map topological features, and a Transformer encoder layer models the self-attention interactions to produce enhanced lane interaction features. The calculation is given by formula (24), where the input is the interaction feature of the n-th lane in the map topological feature matrix and the output is the enhanced interaction feature of the n-th lane.

[0062] 2) Calculate the unnormalized propensity score with which the target vehicle selects the n-th lane, as given by formula (25), where MLP denotes a multilayer perceptron used for feature dimensionality reduction and score mapping.

[0063] 3) Calculate the probability that the target vehicle enters each lane at the current intersection, as given by formula (26), where the result is the probability that the target vehicle enters the n-th lane; exp denotes the exponential function with the natural constant e as its base; u denotes the index that traverses the candidate lanes in the summation used for probability normalization; and U denotes the total number of valid candidate lanes around the target vehicle in the current scenario.

[0064] 4) Using the probability of the target vehicle entering each lane at the current intersection as a weight, the interaction features of all candidate lanes in the map topological features are weighted and aggregated, and then mapped into an intent cue through a linear projection layer. This intent cue serves as the geometric driving intent of the target vehicle and is injected as prior knowledge into the input sequence of the student model. The calculation is given by formula (27), where Linear denotes a linear projection layer, the summand is the interaction feature of the n-th lane in the map topological features, and N denotes the number of lanes in the map topological features.
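
Steps 3) and 4) above can be sketched together: a softmax turns the unnormalized lane scores into entry probabilities, and those probabilities weight the lane interaction features before a linear projection. The scores, features and projection parameters below are toy values, and the function names are hypothetical.

```python
import math

def lane_probabilities(scores):
    """Softmax over the unnormalized propensity scores of the candidate lanes."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def intent_prompt(probs, lane_feats, proj_w, proj_b):
    """Probability-weighted sum of lane features followed by a linear projection."""
    dim = len(lane_feats[0])
    pooled = [sum(p * f[c] for p, f in zip(probs, lane_feats)) for c in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(proj_w, proj_b)]

scores = [2.0, 0.0, -1.0]                            # one score per candidate lane
lane_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # enhanced interaction features
proj_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # toy 2 -> 3 projection
proj_b = [0.0, 0.0, 0.0]

probs = lane_probabilities(scores)
prompt = intent_prompt(probs, lane_feats, proj_w, proj_b)
```

Because the weights sum to one, the pooled vector stays within the span of the lane features, with the highest-scoring lane contributing most to the intent cue.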

[0065] S22. Geometric semantic space alignment: To address the inconsistency between the lane geometric vector space and the pre-trained semantic space of the student model, this invention constructs a linear adapter that maps the obtained geometric driving intent and the dual-stream spatiotemporal features into the semantic space of the student model, generating the prefix prompt word vectors defined by formulas (28) and (29). The terms are as follows: the system prompt word vectors that define the role of the student model; the dual-stream spatiotemporal features; the joint features constructed by sequence concatenation; Concat, the concatenation operation; LayerNorm, the layer normalization operation in the linear adapter; the learnable weight matrix, learnable bias vector and learnable scaling factor of the linear adapter; and the resulting prefix prompt word vectors, whose shape is determined by B, the training batch size, L, the sequence length of the prompt word vectors, and D_llm, the hidden layer dimension of the large language model.

[0066] S3. Knowledge-driven reasoning: Existing large-model trajectory prediction methods suffer from significant bottlenecks, including high inference latency, high computational resource consumption, and waste caused by redundant general-purpose capabilities. To address these, this invention establishes a lightweight inference mechanism based on low-rank adaptation (LoRA). It employs a pre-trained student model as the inference foundation to generate hidden states that encode an understanding of vehicle spatiotemporal interaction intent. The specific process is as follows. First, a complete input sequence is constructed by concatenating two parts in sequence order: the prefix prompt word vectors obtained through geometric semantic alignment, and optional historical dialogue records or the vectors of the thought chain explanation text generated by the teacher model.

[0067] Then, during training, all original pre-trained weight parameters of the student model are frozen, and a trainable bypass adaptation layer is connected in parallel with each original linear transformation layer. During inference, the effective weight of each linear transformation layer in the student model is obtained by adding the low-rank update of the bypass adaptation layer to the frozen original pre-trained weight, as given by formula (30), whose terms are the original pre-trained weight matrix, the dimensionality-reduction and dimensionality-increase matrices of the bypass adaptation layer, the scaling factor, and the rank. According to formula (30), the output of each layer of the student model is the sum of the output of the original pre-trained weights and the output of the low-rank bypass branch. In this way, the model can be updated by fine-tuning only a very small number of low-rank parameters.
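
The weight merge described above can be sketched as follows. It assumes the widely used LoRA form W_eff = W_0 + (alpha / r) · B · A, with A the r × d_in dimensionality-reduction matrix and B the d_out × r dimensionality-increase matrix; the exact scaling in formula (30) is not recoverable from the text, and the matrices below are toy values.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def lora_effective_weight(w0, down, up, alpha):
    """W_eff = W0 + (alpha / r) * up @ down, where r = number of rows of `down`."""
    rank = len(down)
    delta = matmul(up, down)              # low-rank update, d_out x d_in
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
down = [[1.0, 1.0, 0.0]]                                  # r x d_in  (r = 1)
up = [[0.5], [0.0], [-0.5]]                               # d_out x r
alpha = 2.0

x = [1.0, 2.0, 3.0]
merged = apply(lora_effective_weight(w0, down, up, alpha), x)

# Equivalent two-branch form: frozen output plus scaled bypass output.
bypass = apply(up, apply(down, x))
branch = [a + (alpha / len(down)) * b for a, b in zip(apply(w0, x), bypass)]
```

Both forms give the same output, which is why the bypass can be trained as a separate branch and then folded into the frozen weight for inference.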

[0068] Finally, the constructed input sequence is fed into the LoRA-adapted student model for feature extraction and deep inference, and the state feature at the last sequence position of the student model's last-layer output is extracted, as defined by formula (31). This state feature is a high-dimensional semantic vector representing the vehicle's historical motion trend, the local map topology, and the potential interaction intent. As the final encoding result of the student model, it is fed into the downstream decoders for trajectory coordinate regression and intent classification.

[0069] S4. Parallel multi-task decoding: Vehicle trajectory prediction systems often require separately deployed intent classification, trajectory regression, and text generation models, which consumes computational resources redundantly and prevents information exchange between modules. To address this, this invention establishes a unified parallel multi-task decoding architecture that uses the dual-stream coding architecture and the student model as a shared backbone. By optimizing a multi-task joint loss function, it decodes explanatory text, driving intent, and probabilistic trajectories in parallel within a unified feature space, achieving efficient reuse of features and computing power. Specifically:

S41. Natural language explanation generation based on the large model.

[0070] To endow the autonomous driving system with interpretable decision-making logic, this invention uses a text generation head to map the hidden state sequence output by the student model back to the natural language space. The text generation head consists of an MLP and a LayerNorm, and the probability distribution over the vocabulary output by the model at the t-th time step is calculated by formula (32), whose terms are as follows: the hidden state vector output by the last layer of the student model at the t-th time step, with dimension D; the first fully connected layer, which maps the input dimension D to an intermediate feature dimension for feature transformation and dimensionality reduction; the Gaussian error linear unit (GELU) activation function, which introduces nonlinearity; and the second fully connected layer, which maps the normalized features to the vocabulary size. Subsequently, during the inference phase, a greedy search strategy generates, token by token, the predicted text sequence describing driving intent and environmental risks.

[0071] S42. Trajectory generation based on a Laplace mixture model: To accurately predict the future trajectory of vehicles, this invention abandons traditional mean squared error regression and adopts a Laplace mixture model as the trajectory decoder. The decoder takes the deep semantic hidden state as input and outputs in parallel, for each of the K modes at each time step, the distribution parameters: the location mean, the scale parameter, and the modal confidence. Based on these parameters, the conditional probability density function shown in formula (1) is calculated by formulas (33) and (34), whose terms are as follows: the two-dimensional position vector defined by formula (7); the single-point conditional probability of the target vehicle being at that position at a given time step; the L1 norm; the confidence of the k-th predicted mode; K, the total number of predicted multimodal trajectories; T_fut, the total number of predicted future time steps; and the index of a future time step.
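
A Laplace mixture density of this kind can be sketched as below. The sketch assumes one shared scale b per mode and time step for both coordinates, so that the single-point density is exp(-||y - mu||_1 / b) / (2b)^2, and it assumes the mixture sums confidence-weighted products over time steps; the exact forms of formulas (33) and (34) are not recoverable from the text.

```python
import math

def laplace_point_density(y, mu, b):
    """2-D Laplace density with a shared scale b, built on the L1 norm."""
    l1 = abs(y[0] - mu[0]) + abs(y[1] - mu[1])
    return math.exp(-l1 / b) / (2.0 * b) ** 2

def mixture_density(traj, modes):
    """Sum over modes of confidence * product over time of point densities."""
    total = 0.0
    for conf, means, scales in modes:
        likelihood = 1.0
        for y, mu, b in zip(traj, means, scales):
            likelihood *= laplace_point_density(y, mu, b)
        total += conf * likelihood
    return total

traj = [(0.0, 1.0), (0.0, 2.0)]                        # future positions
modes = [
    (0.8, [(0.0, 1.0), (0.0, 2.0)], [0.5, 0.5]),       # mode on the trajectory
    (0.2, [(1.0, 1.0), (2.0, 2.0)], [0.5, 0.5]),       # mode drifting away
]
density = mixture_density(traj, modes)
```

With b = 0.5 the normalizer (2b)^2 equals 1, so the first mode contributes 0.8 and the second contributes 0.2 · exp(-6), which makes the heavier penalty on the drifting mode easy to read off.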

[0072] S43. During the model training phase, this invention employs a multi-task joint loss function for end-to-end optimization. The multi-task joint loss function is defined by formulas (35)-(39), whose terms are as follows. The first term is the Laplace negative log-likelihood loss, computed from the ground-truth trajectory of the target vehicle over the future time steps, which contains only the position coordinates of the target vehicle, and from the location mean and scale parameter output by the trajectory decoder at time step t. The second term is the trajectory modality classification loss, which characterizes the error of the student model in assessing the confidence of the multimodal trajectories and has its own training weight. It uses an indicator function that takes the value 1 when k equals the index of the optimal mode and 0 otherwise, where the optimal mode is the one with the smallest Euclidean distance to the ground-truth trajectory. The third term is the text generation cross-entropy loss, calculated as the cross-entropy between the explanatory text sequence generated by the vehicle-side student model and the thought chain explanation text provided by the teacher model, and it has its own training weight; U denotes the total length of the thought chain explanation text sequence, and each factor is the probability with which the student model autoregressively generates a token given the prefix features and the previously generated tokens. The fourth term is the auxiliary lane selection classification loss, which supervises the lane selection probability and has its own training weight; N denotes the total number of candidate lanes in the current scenario, and the supervised quantity is the probability predicted by the student model that the target vehicle enters the n-th candidate lane. The ground-truth lane is the lane whose index minimizes a distance function defined from the number of discrete points contained in the i-th candidate lane, the coordinates of the j-th discrete point of the i-th candidate lane, and the true two-dimensional coordinates of the target vehicle at the future predicted time step, where the distance is the Euclidean distance between two coordinate points.
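
The way the four loss terms combine can be sketched as follows. Several details are assumptions made for illustration only: the optimal mode is chosen by the summed per-step Euclidean distance, the Laplace term uses one shared scale per time step, the text loss is averaged over tokens, and the total is a weighted sum with the regression term unweighted. All function and argument names are hypothetical.

```python
import math

def best_mode(gt, mode_means):
    """Index of the mode whose means are closest to the ground truth."""
    def total_dist(means):
        return sum(math.dist(y, mu) for y, mu in zip(gt, means))
    return min(range(len(mode_means)), key=lambda k: total_dist(mode_means[k]))

def laplace_nll(gt, means, scales):
    """Negative log of the 2-D Laplace density, summed over time steps."""
    return sum(2.0 * math.log(2.0 * b)
               + (abs(y[0] - mu[0]) + abs(y[1] - mu[1])) / b
               for y, mu, b in zip(gt, means, scales))

def joint_loss(gt, mode_means, mode_scales, mode_conf,
               token_probs, lane_probs, gt_lane, weights):
    k = best_mode(gt, mode_means)
    reg = laplace_nll(gt, mode_means[k], mode_scales[k])   # regression term
    cls = -math.log(mode_conf[k])                          # mode classification
    # token_probs: probability the student gives to each teacher token
    text = sum(-math.log(p) for p in token_probs) / len(token_probs)
    lane = -math.log(lane_probs[gt_lane])                  # lane selection
    w_cls, w_text, w_lane = weights
    return reg + w_cls * cls + w_text * text + w_lane * lane, k

gt = [(0.0, 1.0), (0.0, 2.0)]
mode_means = [[(1.0, 1.0), (2.0, 2.0)], [(0.0, 1.0), (0.0, 2.0)]]
mode_scales = [[0.5, 0.5], [0.5, 0.5]]
loss, k_best = joint_loss(gt, mode_means, mode_scales, [0.5, 0.5],
                          [0.5, 0.5], [0.25, 0.5, 0.25], 1, (1.0, 1.0, 1.0))
```

In this toy case the second mode matches the ground truth exactly, so the regression term is zero and each of the three classification-style terms contributes log 2.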

[0073] In one embodiment, as shown in Figure 2, this invention is logically divided into a cloud-based knowledge production layer and a vehicle-side fine-tuning inference layer. In the cloud-based knowledge production layer, a teacher model with hundreds of billions of parameters is deployed on a cloud server; it mines the deep causal logic of massive offline driving data, generates high-quality thought chain explanation text, and constructs a knowledge-enhanced dataset rich in logical labels. In the vehicle-side fine-tuning inference layer, a lightweight student large language model (LLM) with parameter-efficient fine-tuning is deployed on the vehicle's edge computing node. This layer comprises two stages. In the model distillation training stage, the vehicle-side student model receives the logic-enhanced dataset from the cloud-based teacher model for fine-tuning and, through a distillation mechanism, imitates the thought chain inference logic of the cloud-based teacher model. In the real-time online inference stage, the vehicle-side student model runs in inference mode during vehicle operation: it receives the historical trajectory streams and local map data collected by sensors in real time, applies the learned thought chain inference logic, and outputs the vehicle's future trajectory, driving intent, and natural language explanation in parallel.

[0074] The overall training process is as follows: a) Set the total number of training epochs E = 30, the mini-batch size B = 64, and the number of modes K to 5 and 10, and set the rank of the LoRA low-rank matrices, the scaling factor, the initial learning rate, the data sampling interval in seconds, the observation horizon in seconds, and the prediction horizon in seconds.

[0075] b) The teacher model on the cloud server receives offline raw driving data containing historical trajectories and high-definition maps, and constructs the vehicle trajectory sets and the local road topology vector according to formulas (2)-(8). Next, the teacher model uses the mapping function and formulas (9)-(11) to perform causal reasoning and generate thought chain explanation text labels. Then, formulas (12)-(14) are applied to the local high-definition map to shuffle the lane order and generate the map features. Finally, the logic-enhanced dataset containing these items is constructed and distributed to the vehicle-side student model.

[0076] c) The vehicle-side edge node receives the logic-enhanced dataset sent from the cloud, and the vehicle-side student model undergoes logic distillation training. Taking the historical trajectory set and the map features as input, dual-stream feature encoding is performed in parallel. On the one hand, an MLP extracts positional geometric features and a 1D-CNN extracts dynamic temporal features; these features are fused according to formulas (15)-(17) to generate vehicle trajectory features with temporal dependencies. On the other hand, map encoding is performed according to formulas (18)-(21) to obtain map topological features containing global road topology information. The vehicle trajectory features and map topological features are then aggregated by the attention mechanism of formulas (22)-(23), yielding vehicle trajectory features that incorporate road topology. Next, according to formulas (24)-(29), the joint features combined with the lane intent prior are mapped into the semantic space of the pre-trained student LLM, yielding prefix prompt word vectors adapted to that semantic space.

[0077] d) The LoRA-adapted student model receives the prefix prompt word vectors and produces the hidden state containing a deep understanding of intent. Next, according to formulas (30)-(32), the obtained hidden state is decoded in parallel into the predicted text and the conditional probability distribution of the target vehicle's trajectory at future time steps.

[0078] e) Based on formulas (33)-(39), the vehicle-side edge computing node calculates the cross-entropy loss between the predicted text and the thought chain explanation text generated by the cloud-based teacher model, combines it with the regression loss between the predicted trajectory and the ground-truth trajectory, the trajectory modality classification loss, and the auxiliary lane selection classification loss, and then performs backpropagation. Through this step, the vehicle-side student model inherits and optimizes the cloud-side inference logic during parameter updates.

[0079] f) Repeat steps b)-e) until the total number of training epochs E is reached. The vehicle-side edge computing node then stores the parameters of the trained student LLM.

[0080] g) During vehicle operation, the student model deployed on the vehicle-side edge node runs in inference mode, receives the historical trajectory streams and local map data collected by sensors in real time, and quickly extracts the vehicle trajectory features, the map topological features, and the lane intent prompt.

[0081] h) The system prompt, lane intent prompt, map features, and trajectory features are aggregated and input to the student model, which outputs the hidden state at the last position of the sequence. This high-level hidden state is then input to the Laplace trajectory decoding head and the text generation head, respectively, to obtain the vehicle's future driving trajectory and the explanatory text.

[0082] To evaluate the performance of the proposed method (referred to as CoTDLA-MVTP), it was compared with several other methods: a multi-head attention-based joint modeling method (MHA-JAM), a Transformer-based autoregressive prediction method (Autobot), a graph neural network-based prediction method (PGP), a Transformer method incorporating lane perception constraints (LAformer), and a trajectory prediction method based on a pre-trained large language model (Traj-LLM). Experiments were conducted on the nuScenes large-scale autonomous driving dataset. As shown in Figure 3 and Figure 4, this dataset contains 1000 driving scenarios collected from real urban roads in locations A and B, covering complex road topologies and dense traffic interactions. The experiments were verified under multimodal settings of K. K = 5 evaluates the accuracy of the top 5 high-confidence trajectories generated by the model, reflecting the reliability of the decision; K = 10 evaluates how well the top 10 trajectories cover the potential intents, verifying the completeness and safety of the model in complex long-tail scenarios.

[0083] The experimental results are shown in Figure 5 and Figure 6. Under the different numbers of trajectory modes K, the CoTDLA-MVTP method proposed in this invention significantly outperforms the other compared methods on key indicators such as minimum average displacement error (minADE), minimum final displacement error (minFDE), and false negative rate. This verifies that the CoTDLA-MVTP method, through its lane perception intent-guided geometric semantic alignment mechanism and cloud-based knowledge distillation mechanism, enables the lightweight edge-side model to learn high-dimensional scene reasoning capabilities more easily. Meanwhile, the parallel multi-task joint decoding strategy achieves ultra-low inference latency, meeting real-time requirements.
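
For reference, the two displacement metrics named above are conventionally computed over the K predicted modes as sketched below. This follows the usual definitions (mean per-step Euclidean error and final-step Euclidean error, each minimized over modes) and is not taken from the patent text; the trajectories are toy values and the function name is hypothetical.

```python
import math

def min_ade_fde(pred_modes, gt):
    """Return (minADE, minFDE) over the predicted modes for one agent."""
    ades, fdes = [], []
    for traj in pred_modes:
        dists = [math.dist(p, g) for p, g in zip(traj, gt)]
        ades.append(sum(dists) / len(dists))  # average displacement error
        fdes.append(dists[-1])                # final displacement error
    return min(ades), min(fdes)

gt = [(0.0, 0.0), (0.0, 1.0)]
pred_modes = [
    [(0.0, 0.0), (0.0, 3.0)],   # exact start, endpoint 2 m off
    [(3.0, 0.0), (0.0, 1.0)],   # start 3 m off, exact endpoint
]
min_ade, min_fde = min_ade_fde(pred_modes, gt)
```

The two minima are taken independently, so they can come from different modes, as in this example where the first mode gives the best average error and the second gives the best endpoint error.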

[0084] Next, ablation experiments were conducted. As shown in Figure 7 and Figure 8, without low-rank adaptation (LoRA) to fine-tune the parameters of the large language model, the model cannot effectively align the continuous physical trajectory space with the discrete semantic space of the large model, resulting in suboptimal error metrics. Furthermore, removing the thought chain (CoT) knowledge distillation based on the cloud-based teacher model significantly degrades the false negative rate in complex interaction scenarios, which indicates that the purely data-driven variant cannot deeply understand multi-vehicle game relationships. When the lane perception intent is removed, the model's minimum final displacement error (minFDE) and false negative rate increase significantly, which demonstrates that without explicit trajectory-topology coupling prior guidance, the model is prone to generating invalid trajectories that do not conform to physical road constraints. The complete model of this invention achieves the best results in minimum average displacement error (minADE), minFDE, and false negative rate, proving the necessity and effectiveness of the modules proposed in this invention for improving prediction performance.

[0085] As shown in Figure 9, in the early stage of prediction (T = 1 s to T = 2 s) the error differences among the models are small, but the model proposed in this invention still maintains the lowest displacement error. This is attributed to the dynamic graph attention feature extraction mechanism of this invention, which captures real-time traffic conditions with high fidelity. As the prediction horizon increases (T = 3 s to T = 6 s), the baseline models all exhibit severe error divergence owing to their architectural limitations. Traditional models such as MHA-JAM suffer from severe error accumulation because of their autoregressive, step-by-step prediction mechanism. The Autobot model adopts a parallel decoding mechanism, but it relies on data-driven latent variable sampling and lacks both explicit lane geometry and topology constraints and high-level semantic reasoning, so its predicted trajectories tend to deviate from physically plausible paths over long horizons. Even the comparison scheme that also introduces a large language model (Traj-LLM) shows logical discontinuities in long-sequence prediction, because it uses the large language model only as a black-box feature extractor, lacks an explicit chain-of-thought causal reasoning mechanism, and does not perform deep dynamic coupling of heterogeneous spatial coordinates and high-dimensional dynamic features. This invention combines a dual-stream encoding mechanism, a lane-aware intent mechanism, a chain-of-thought explanation text mechanism, and a parallel multi-task decoding architecture. Together, these mechanisms give this invention the most gradual error growth curve: even at the longest prediction horizon (T = 6 s), the cumulative error is only 1.96 meters, significantly better than all comparison models. This verifies the effective synergy among the core modules and their robustness over long prediction horizons.

[0086] The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the others. For parts that are the same or similar across embodiments, the embodiments may be cross-referenced.

[0087] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation, characterized in that it comprises the following steps: raw driving data is input into a pre-trained teacher model to generate chain-of-thought explanation text covering driving intention, reasoning about multi-vehicle interaction behavior, and active obstacle avoidance safety constraints; based on a pre-built dual-stream encoding architecture, target vehicle trajectory features containing temporal dependencies and map topological features around the target vehicle are extracted, and the target vehicle trajectory features and the map topological features are fused to obtain dual-stream spatiotemporal features; the geometric driving intention of the target vehicle is used as strong prior knowledge and semantically aligned with the dual-stream spatiotemporal features to generate prefix prompt vectors; during the training phase, the prefix prompt vectors and the chain-of-thought explanation text are input into a student model, the student model is trained end-to-end with a multi-task joint loss function, and a distillation mechanism makes the student model imitate the chain-of-thought reasoning logic of the teacher model; during the online inference phase, real-time driving data of the target vehicle is acquired, the dual-stream spatiotemporal features and the prefix prompt vectors are extracted from the real-time driving data and input into the trained student model, and the student model outputs the future driving trajectory of the target vehicle together with explanation text.

2. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the process by which the pre-trained teacher model generates the chain-of-thought explanation text includes: using a semantic serialization method based on predefined templates to transform the trajectory set of the target vehicle and surrounding vehicles, together with the local road topology vector of the area where the target vehicle is located, into a natural language text description, wherein the local road topology vector is taken from the high-definition map of the local area where the target vehicle is located and consists of the centerlines of the lanes; and using the natural language text description as the input context of the teacher model and generating the optimal chain-of-thought explanation text through greedy decoding.
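A template-based serialization of the kind described in claim 2 might look like the minimal sketch below. The template wording, the coordinate formatting, and the function names `format_points` and `serialize_scene` are hypothetical; the document does not disclose the actual templates.

```python
def format_points(points):
    """Render a list of (x, y) points as compact text, one decimal place each."""
    return " ".join(f"({x:.1f},{y:.1f})" for x, y in points)


def serialize_scene(target_track, neighbor_tracks, lane_centerlines):
    """Turn trajectories and lane centerlines into a text prompt.

    The sentence templates below are illustrative assumptions only.
    """
    lines = [f"Target vehicle history: {format_points(target_track)}."]
    for i, track in enumerate(neighbor_tracks, start=1):
        lines.append(f"Surrounding vehicle {i} history: {format_points(track)}.")
    for j, centerline in enumerate(lane_centerlines, start=1):
        lines.append(f"Lane {j} centerline: {format_points(centerline)}.")
    # Closing instruction that asks for the three elements named in claim 1.
    lines.append(
        "Explain the driving intention, the multi-vehicle interactions, "
        "and the obstacle avoidance safety constraints."
    )
    return "\n".join(lines)
```

The resulting string would serve as the input context of the teacher model, which then decodes greedily, that is, by picking the most probable token at every step.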

3. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the process of extracting the target vehicle trajectory features includes: decoupling the historical trajectory set of the target vehicle over a past time window of a given number of seconds into a spatial component containing position coordinates and a motion component containing dynamic data; inputting the spatial component into a multilayer perceptron (MLP) for feature mapping to obtain a basic geometric vector representing the spatial attributes of the vehicle position; inputting the motion component into a one-dimensional convolutional neural network (1D-CNN) for local temporal feature extraction to obtain a high-order motion vector representing the vehicle's motion trend; using the hyperbolic tangent function as the activation function, fusing the basic geometric vector and the high-order motion vector at each time step of the historical trajectory set to obtain fused single-frame trajectory features; and concatenating the single-frame trajectory features of all time steps into a time-step sequence, which is input into a Transformer encoder to obtain the target vehicle trajectory features containing temporal dependencies.

4. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the process of extracting the map topological features includes: establishing a lane line feature extraction enhancement strategy comprising a TOP-N filtering operation and a random permutation operation, wherein the TOP-N filtering operation selects, from the local road topology vector and based on Euclidean distance, the set of candidate lanes closest to the target vehicle, and the random permutation operation generates a permutation index vector and randomly shuffles the lane order of the candidate lane set accordingly to generate the map features; for any lane in the map features, performing feature mapping independently on each sampling point of the lane using a multilayer perceptron (MLP); using a gated recurrent unit (GRU) to read, in sequence along the lane direction, the point feature sequence of the lane after MLP processing, and outputting the hidden state of the last time step as the aggregated geometric feature of the lane; and feeding the set of geometric features of the lanes in the map features into a multi-layer dynamic graph attention network to model the topological dependencies between lanes and generate map topological features that include lane geometric attributes and global road network topology information.
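The TOP-N filtering and random permutation operations of claim 4 can be sketched as follows. This is a minimal illustration that assumes each lane is a list of sampled centerline points and that the distance from the vehicle to a lane is the distance to its closest sampled point; the claim states only that Euclidean distance is used. The function names are hypothetical.

```python
import math
import random


def lane_distance(lane, position):
    """Distance from the vehicle to the closest sampled point of a lane
    (an assumed definition of vehicle-to-lane distance)."""
    return min(math.hypot(x - position[0], y - position[1]) for x, y in lane)


def select_and_shuffle(lanes, position, top_n, rng):
    """Keep the top_n nearest lanes, then shuffle their order.

    Returns the shuffled candidate lanes and the permutation index vector
    that was applied to the distance-sorted candidates.
    """
    nearest = sorted(lanes, key=lambda lane: lane_distance(lane, position))[:top_n]
    order = list(range(len(nearest)))
    rng.shuffle(order)  # permutation index vector
    return [nearest[i] for i in order], order
```

Shuffling the lane order presumably discourages the downstream network from relying on the position of a lane in the input list, which fits the later use of a graph attention network over the lane set.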

5. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the fusion process for the dual-stream spatiotemporal features includes: using the target vehicle trajectory features as the query and the map topological features as the key and value, a multi-head cross-attention mechanism dynamically embeds map information into the trajectory features to obtain the dual-stream spatiotemporal features; the dual-stream spatiotemporal features are defined by a formula whose terms denote, respectively: the target vehicle trajectory features; the map topological features; the relevance weights of the h-th attention head of the multi-head cross-attention network between the target vehicle and all candidate lanes, reflecting how much attention the vehicle pays to each lane; H, the number of attention heads; Softmax, the activation function; the query weight matrix of the h-th attention head; the key weight matrix of the h-th attention head; the dimension of the key vectors; the value weight matrix of the h-th attention head; the parameter matrix that aggregates the multi-head attention values; and the layer normalization operation.
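The terms listed in claim 5 match the standard multi-head cross-attention formulation, so the definition plausibly has the following shape. This is a sketch under assumptions: all symbol names (F_traj, F_map, A_h, W_h^Q, W_h^K, W_h^V, W^O, d_k, F_ds) are chosen here for illustration, and the residual connection inside the layer normalization is assumed rather than stated in the text.

```latex
% Attention weights of head h: trajectory features query the lane features.
A_h = \mathrm{Softmax}\!\left(
        \frac{\left(F_{\mathrm{traj}} W_h^{Q}\right)\left(F_{\mathrm{map}} W_h^{K}\right)^{\top}}{\sqrt{d_k}}
      \right), \qquad h = 1, \dots, H

% Dual-stream features: heads are concatenated, projected, and normalized.
F_{\mathrm{ds}} = \mathrm{LayerNorm}\!\left(
        F_{\mathrm{traj}} +
        \mathrm{Concat}\!\left(A_1 F_{\mathrm{map}} W_1^{V}, \dots, A_H F_{\mathrm{map}} W_H^{V}\right) W^{O}
      \right)
```

Each row of A_h sums to one, so it can be read as a distribution over candidate lanes, which is consistent with the claim's description of the weights as the degree of attention a vehicle pays to each lane.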

6. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the calculation process for the geometric driving intention of the target vehicle includes: applying global average pooling to the dual-stream spatiotemporal features to obtain target vehicle trajectory summary features; injecting the target vehicle trajectory summary features into each lane vector of the map topological features and modeling self-attention interactions with a Transformer encoder layer to obtain enhanced lane interaction features, wherein the corresponding formula maps the interaction feature of the n-th lane in the map feature matrix to the enhanced interaction feature of the n-th lane; calculating an unnormalized propensity score for the target vehicle selecting the n-th lane, wherein the corresponding formula uses a multilayer perceptron for feature dimensionality reduction and score mapping; calculating the probability of the target vehicle entering each lane at the current intersection, wherein the corresponding formula gives the probability that the target vehicle enters the direction of the n-th lane, exp denotes the exponential function with the natural constant e as its base, u is the index that traverses the candidate lanes in the normalization sum, and U is the total number of valid candidate lanes around the target vehicle in the current scenario; and, using the probability of the target vehicle entering each lane at the current intersection as a weight, performing a weighted aggregation of the interaction features of all candidate lanes in the map topological features and mapping the result through a linear projection layer into an intent prompt that serves as the geometric driving intention of the target vehicle, wherein, in the corresponding formula, Linear denotes the linear projection layer, the weighted terms are the interaction features of the n-th lane in the map topological features, and N is the number of lanes in the map topological features.
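The last two steps of claim 6, normalizing the lane scores into probabilities and aggregating the lane features with those probabilities, can be illustrated with the minimal sketch below. It assumes the unnormalized scores have already been produced by the scoring MLP and uses plain Python lists in place of tensors; the function names and the weight and bias arguments of the projection are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into probabilities; subtracting the maximum
    keeps exp() numerically stable without changing the result."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_lane_summary(lane_features, scores):
    """Return lane probabilities and the probability-weighted feature sum."""
    probs = softmax(scores)
    dim = len(lane_features[0])
    summary = [
        sum(p * feat[d] for p, feat in zip(probs, lane_features))
        for d in range(dim)
    ]
    return probs, summary


def linear(vector, weight, bias):
    """Linear projection: one output per weight row, plus its bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]
```

Passing the summary through `linear` yields the intent prompt. Because the aggregation is a soft, probability-weighted sum rather than a hard choice of one lane, the whole step stays differentiable and can be supervised by the auxiliary lane selection loss named in claim 10.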

7. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 6, characterized in that the prefix prompt vectors are given by an expression whose terms denote, respectively: the system prompt vectors that define the role of the student model; the dual-stream spatiotemporal features; the joint features constructed by a sequence concatenation operation; Concat, the concatenation operation; LayerNorm, the layer normalization operation; the learnable weight matrix of the linear adapter; the learnable bias vector of the linear adapter; the learnable scaling factor; and the resulting prefix prompt vectors.

8. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that, during the training phase, all original pre-trained weight parameters of the student model are frozen and a trainable side-branch adaptation layer is connected in parallel with each original linear transformation layer; and, during inference, the effective inference weights of each linear transformation layer in the student model are obtained by adding the frozen original pre-trained weights to the low-rank update weights of the side-branch adaptation layer.
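The weight merging described in claim 8 corresponds to the usual low-rank adaptation (LoRA) arrangement, in which the update is the product of two small matrices. A minimal sketch follows, assuming the update is factored as an up-projection `b_up` times a down-projection `a_down`; these names and the optional `scale` factor are assumptions, since the claim states only that the frozen weights and the low-rank update are added.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w0, b_up, a_down, scale=1.0):
    """Effective weights = frozen weights + scale * (b_up @ a_down).

    w0:     frozen pre-trained weights, shape (out, in)
    b_up:   trainable factor, shape (out, r)
    a_down: trainable factor, shape (r, in), with rank r much smaller
            than out and in
    scale:  assumed scaling factor; 1.0 reproduces a plain sum
    """
    delta = matmul(b_up, a_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]
```

Merging once before deployment means the edge-side model runs a single matrix multiplication per layer at inference time, so the side branch adds no latency.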

9. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the teacher model is deployed on a cloud server to mine the deep causal logic in massive amounts of offline driving data and generate the chain-of-thought explanation text; and the student model is deployed on an in-vehicle edge computing node to perform distillation training and real-time online inference.

10. The connected autonomous driving trajectory prediction method based on large language model chain-of-thought distillation as described in claim 1, characterized in that the multi-task joint loss function is given by an expression whose terms denote, respectively: the Laplace negative log-likelihood loss; the trajectory mode classification loss; the text generation cross-entropy loss, calculated as the cross-entropy between the explanation text generated by the student model and the chain-of-thought explanation text provided by the teacher model; the lane selection auxiliary classification loss, used to supervise the lane selection probability; and three training parameters.
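One plausible reading of claim 10 is a weighted sum of the four losses, with the three training parameters acting as weights on the last three terms. The sketch below follows that reading. The placement of the weights, the per-coordinate form of the Laplace loss, and all function and parameter names are assumptions, because the original expression is not reproduced in this text.

```python
import math


def laplace_nll(pred, target, scale):
    """Mean Laplace negative log-likelihood: log(2b) + |x - mu| / b."""
    return sum(
        math.log(2.0 * scale) + abs(p - t) / scale for p, t in zip(pred, target)
    ) / len(pred)


def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against one true label.

    The same form applies to the mode, text token, and lane selection losses.
    """
    return -math.log(probs[label])


def joint_loss(l_reg, l_mode, l_text, l_lane, w_mode=1.0, w_text=1.0, w_lane=1.0):
    """Assumed combination: regression loss plus three weighted auxiliary losses."""
    return l_reg + w_mode * l_mode + w_text * l_text + w_lane * l_lane
```

For the text term, `cross_entropy` would be averaged over the tokens of the teacher's explanation, with the student's predicted token distribution at each position as `probs`.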