Autonomous Driving Planning Method and Device Based on Visual Language Action Unified Modeling
By constructing a behavioral vocabulary based on a non-uniform discretization strategy using kinematic parameters and employing a large language model with a Transformer architecture, the contradiction between efficiency and accuracy in autonomous driving technology is resolved, enabling efficient and accurate autonomous driving planning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- COWA TECHNOLOGY CO LTD
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-30
AI Technical Summary
Existing autonomous driving technologies face a trade-off between model inference accuracy and real-time performance. Methods that directly predict trajectory points are accurate but inefficient, while methods that predict action tokens instead of trajectory points are efficient but lack sufficient planning accuracy.
A vocabulary construction method based on non-uniform discretization of kinematic parameters is adopted to combine the high-level semantic understanding capability of visual language action models with the efficient reasoning characteristics of discrete actions. A behavioral vocabulary is constructed through a non-uniform discretization strategy, and a large language model based on the Transformer architecture is used for reasoning and decoding to generate future planning trajectories.
It achieves high-precision autonomous driving planning while improving efficiency, with an 8-fold increase in inference speed, enhanced adaptability to key scenarios, and generated trajectories that conform to vehicle dynamics characteristics.
Figure CN122300547A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of autonomous driving technology, and in particular to an autonomous driving planning method and apparatus based on unified modeling of visual language and actions.
Background Technology
[0002] Autonomous driving technology is currently in a phase of rapid development, and Vision Language Action (VLA) models have become a core research direction for end-to-end autonomous driving systems. These models significantly improve the decision-making capabilities of autonomous driving systems in complex scenarios by unifying visual perception, semantic understanding, and action execution. With the widespread application of large language models in the field of autonomous driving, balancing model inference accuracy and real-time performance has become a critical issue that the industry urgently needs to address.
[0003] Currently, the existing technology adopts the following two solutions:
[0004] (1) OpenDriveVLA: This model aligns visual and linguistic information hierarchically, enabling the language model to effectively understand the target and map information in the scene, and finally predicts the continuous (x, y) coordinates of future trajectory points through autoregression.
[0005] (2) AutoVLA: This model clusters trajectory actions in large-scale real driving data to generate an action codebook, discretizing the continuous action space into a finite number of learnable and generable tokens. During training and inference, AutoVLA maps real or predicted trajectories to the nearest codebook entry, thereby achieving physical action tokenization.
[0006] Solution (1) directly predicts trajectory points, which has the advantage of high accuracy but sacrifices efficiency; solution (2) predicts action tokens instead of trajectory points, which improves efficiency but sacrifices planning accuracy (because it is impossible to cluster or enumerate all driving scenarios).
Summary of the Invention
[0007] To address the aforementioned technical problems, this invention cleverly combines the advanced semantic understanding capabilities of VLA models with the efficient reasoning characteristics of discrete actions through an innovative vocabulary construction method based on non-uniform discretization of kinematic parameters, thus resolving the core contradiction between efficiency and accuracy in existing technologies.
[0008] To achieve the above objectives, the technical solution of this invention provides an autonomous driving planning method based on unified visual language action modeling, which includes the following steps: S1 Multimodal perception: Acquire multimodal perception data of the vehicle at the current moment and encode the multimodal data into unified BEV environment features; S2 Behavioral parameter encoding: Acquire the vehicle's historical trajectory sequence and, based on a predefined behavioral vocabulary, encode the historical trajectory sequence into a historical behavior index sequence through behavioral parameterization encoding, wherein the predefined behavioral vocabulary is constructed based on vehicle kinematic parameters through a non-uniform discretization strategy, and each behavior index corresponds to a unique parameter combination consisting of turning radius and speed; S3 VLA inference: Input the BEV environment features, user prompt text information, and historical behavior index sequence into a pre-trained visual language action VLA inference module to infer the vehicle's future behavior index sequence; S4 Kinematic decoding: Based on the future behavior index sequence and the predefined behavioral vocabulary, obtain the vehicle's future planned trajectory through kinematic decoding.
[0009] Furthermore, the VLA inference module is a large language model (LLM) based on the Transformer architecture and is pre-trained according to the following steps: F1: Encode and fuse multimodal perception data into unified BEV environment features; F2: Encode the vehicle's historical trajectory sequence and future trajectory sequence into historical behavior indexes and future behavior indexes respectively through behavior parameterization encoding; F3: Construct question-answer (QA) pairs for supervised fine-tuning (SFT) of the large language model, constructing the questions from user prompt text information, BEV environment features, and historical behavior indexes, and using the vehicle's future behavior indexes as the answers; F4: Use the question-answer pairs obtained in step F3 to perform supervised fine-tuning of the VLA inference module.
[0010] Furthermore, the predefined behavior vocabulary is constructed through the following steps: T1: Construct a turning radius vocabulary (the first sub-vocabulary) based on the turning radius parameter, wherein non-uniform segmented sampling is performed according to the turning radius size; T2: Construct a speed vocabulary (the second sub-vocabulary) based on the speed parameter, wherein non-uniform segmented sampling is performed according to the speed size; T3: Combine the first sub-vocabulary and the second sub-vocabulary by performing a Cartesian product to generate a complete behavior vocabulary containing multiple unique behavior indexes, each behavior index corresponding to a unique combination of turning radius and speed.
[0011] Further, step T1 specifically includes: T11: setting a minimum turning radius constraint; T12: for turning radii within a first preset range, using non-linear intervals to divide them into positive and negative directions to generate a first number of sampling levels; T13: for turning radii within a second preset range, using exponentially increasing sampling to generate a second number of sampling levels, wherein the turning radius within the second preset range is greater than the turning radius within the first preset range, and the second number is less than the first number; T14: for turning radii within a third preset range, retaining a third number of representative values, wherein the turning radius within the third preset range is greater than the turning radius within the second preset range, and the third number is less than the second number; T15: merging all sampled values to form the turning radius vocabulary sub-table.
[0012] Furthermore, in step T2, the speed is divided into different sampling ranges, with higher-speed ranges using larger sampling intervals.
[0013] Further, in steps S2 and F2, the behavior parameterization encoding specifically includes the following steps: obtaining the distance and heading angle change between adjacent points through differential calculation; determining the state based on the distance and heading angle change, where the turning radius inherits from the previous moment when the state is stationary, and the turning radius is set to the maximum value when the state is straight; calculating the radius of curvature and velocity based on the distance and heading angle change; verifying the lateral acceleration and velocity to eliminate invalid parameters; and mapping the calculated radius of curvature and velocity to the nearest neighbor turning radius and velocity entries in the predefined behavior vocabulary to generate a unique behavior index.
[0014] Further, in step S4, kinematic decoding specifically includes the following steps: performing parameter lookup to obtain the corresponding turning radius and velocity through behavior index; obtaining angular velocity, change of heading angle per unit time, and displacement vector in local coordinate system through kinematic calculation; transforming the displacement vector in local coordinate system to global coordinate system; and performing trajectory accumulation to generate a physically feasible continuous trajectory by accumulating the displacement vector and the change of heading angle.
[0015] Furthermore, after step S4, the method further includes: S5 dynamic verification: performing a static state determination, when the distance between adjacent trajectory points is less than a preset value, forcing the speed to zero, ensuring that the generated trajectory meets the lateral acceleration constraint and the longitudinal acceleration constraint, and performing physical feasibility verification to reject trajectories that violate vehicle dynamics.
[0016] The technical solution of this invention also provides an autonomous driving planning device based on unified modeling of visual language and action, which includes the following modules: a multimodal perception module: acquiring multimodal perception data of the vehicle at the current moment and encoding the multimodal data into unified BEV environment features; a behavior parameter encoding module: acquiring the vehicle's historical trajectory sequence and encoding the historical trajectory sequence into a historical behavior index sequence based on a predefined behavior vocabulary through behavior parameterization encoding, wherein the predefined behavior vocabulary is constructed based on vehicle kinematic parameters through a non-uniform discretization strategy, and each behavior index corresponds to a unique parameter combination consisting of turning radius and speed; a VLA inference module: inputting BEV environment features, user prompt text information, and historical behavior index sequence into a pre-trained visual language and action VLA inference module to infer the vehicle's future behavior index sequence; and a kinematic decoding module: obtaining the vehicle's future planned trajectory through kinematic decoding based on the future behavior index sequence and the predefined behavior vocabulary.
[0017] The present invention also provides a computer-readable storage medium containing a computer program, which, when executed by one or more processors, performs the autonomous driving planning method based on unified modeling of visual language and action as described above.
Description of the Drawings
[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a system architecture diagram of the present invention;
[0020] Figure 2 is a schematic diagram of the encoding process of the present invention;
[0021] Figure 3 is a schematic diagram of the decoding process of the present invention.
Detailed Implementation
[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0023] As shown in Figure 1, the technical solution of the present invention includes four core modules:
[0024] (1) Multimodal perception module: processes multi-source sensor data (camera / LiDAR), outputs BEV features, which provide environmental information representation, and then inputs them into the VLA inference module.
[0025] (2) VLA inference module: It is based on the Transformer architecture LLM (Large Language Model), accepts user prompt words (such as system prompts, navigation instructions, user instructions, vehicle history status, etc.) and BEV features, and outputs discrete behavior index sequences (rather than continuous trajectory points).
[0026] (3) Behavior parameterization encoding module: converts continuous trajectory points into discrete behavior indices (each discrete behavior index corresponds to a special token in the vocabulary during language model training or inference). This series of special tokens converted from the trajectory is used as the ground truth of the vehicle's future trajectory during the training of the VLA inference module for supervision.
[0027] (4) Kinematic decoding module: This module is the inverse process of the behavior parameterization encoding module, that is, in the inference stage, the behavior index output by the VLA inference module is converted into a physically feasible continuous trajectory.
[0028] The implementation process of the technical solution of the present invention includes:
[0029] Training process:
[0030] Step 1: Obtain environmental feature information. First, the multimodal perception module receives multimodal data (camera / LiDAR) and encodes and fuses it into a unified BEV feature, which includes semantic information about the vehicle's surrounding environment.
[0031] Step 2: Encode the vehicle's historical trajectory and future trajectory into behavior indices using the behavior parameterization encoding module. The historical behavior indices serve as input to the VLA inference module, representing the vehicle's historical state. The future behavior indices serve as the answers (supervision targets) for supervised fine-tuning (SFT) of the large language model.
[0032] Step 3: Construct QA (question-answer pairs) for SFT training (supervised fine-tuning training) of the large language model. Use the textual information of user prompts (such as system prompts, navigation instructions, user commands, vehicle history status, etc.) and the environmental feature information (i.e., BEV features) obtained in Step 1 to construct the questions, and use the vehicle's future behavior index as the answer.
[0033] Step 4: Use the question-answer pairs obtained in Step 3 to supervise and fine-tune the VLA inference module.
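The construction of one question-answer pair in Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the token spelling `<BEH_i>`, the `<BEV>` placeholder, and the function names are all hypothetical, since the source only states that each behavior index corresponds to a special token and that BEV features form part of the question.

```python
from typing import Dict, List


def behavior_token(index: int) -> str:
    # Hypothetical token spelling; the source only says each behavior index
    # maps to one special token in the language model vocabulary.
    return f"<BEH_{index}>"


def build_sft_sample(prompt_text: str, history: List[int], future: List[int]) -> Dict[str, str]:
    # "<BEV>" is a hypothetical placeholder marking where the BEV feature
    # embeddings would be spliced into the model input.
    history_tokens = " ".join(behavior_token(i) for i in history)
    question = f"{prompt_text}\n<BEV>\nHistory: {history_tokens}"
    # The answer is the future behavior index sequence, one token per step.
    answer = " ".join(behavior_token(i) for i in future)
    return {"question": question, "answer": answer}


sample = build_sft_sample("Turn left at the next intersection.", [3021, 3021], [3022, 2962])
```

Each future trajectory step contributes exactly one answer token, which is the property the efficiency claim relies on.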
[0034] Inference process:
[0035] Step 1: Obtain environmental feature information. First, the multimodal perception module receives multimodal data (camera / LiDAR) and encodes and fuses it into a unified BEV feature, which includes semantic information about the vehicle's surrounding environment.
[0036] Step 2: Use the behavior parameterization encoding module to encode the vehicle's historical trajectory. The resulting historical behavior indices are used as input to the VLA inference module, i.e., as the vehicle's historical state.
[0037] Step 3: Construct a question using the text information of user prompts (such as system prompts, navigation instructions, user instructions, vehicle history status, etc.) and the environmental feature information (i.e. BEV features) obtained in Step 1 as input to the VLA inference module to infer the vehicle's future behavior index.
[0038] Step 4: Use the kinematic decoding module to decode the vehicle's future behavior index obtained in Step 3 into the vehicle's future planned trajectory.
[0039] This invention proposes a non-uniform discretization strategy based on vehicle kinematic parameters to construct a two-parameter behavioral vocabulary:
[0040] 1. Discretization of turning radius (R)
[0041] Physical constraints: Minimum turning radius ≥ 2 meters (meeting the typical turning capability of passenger vehicles);
[0042] Non-uniform segmented sampling:
[0043] 0–10 m: Employs non-linear fine division (such as logarithmic or square root intervals), with a total of 58 levels, covering high curvature scenarios such as U-turns and narrow road turns;
[0044] 10–1000 m: Using exponential growth sampling, with a total of 38 levels, suitable for urban curves;
[0045] ≥1000 m: Considered as a straight line; only 5 representative values are retained (such as 1000 m or 10,000 m).
[0046] Symmetrical expansion: Positive and negative values (left / right turn) are merged, giving a final turning radius vocabulary of 101 entries.
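One way to realize a 101-entry turning radius vocabulary is sketched below. The source does not say how the 58 / 38 / 5 levels split between left and right turns, nor which spacing functions or representative values are used, so this sketch makes several labeled assumptions: 29 square-root-spaced magnitudes in [2, 10) m, 19 geometrically spaced magnitudes in [10, 1000) m, two far magnitudes (1000 m and 5000 m), all mirrored to both signs, plus a single unsigned 10,000 m "straight" entry.

```python
import math


def build_radius_vocab():
    """Illustrative 101-entry signed turning-radius table (metres)."""
    r_min, r_mid, r_far = 2.0, 10.0, 1000.0

    # Assumption: square-root spacing gives finer steps near the 2 m minimum.
    lo, hi = math.sqrt(r_min), math.sqrt(r_mid)
    tight = [(lo + k * (hi - lo) / 29) ** 2 for k in range(29)]

    # Assumption: geometric (exponential-growth) spacing from 10 m up to 1000 m.
    ratio = r_far / r_mid
    medium = [r_mid * ratio ** (k / 19) for k in range(19)]

    # Assumption: two mirrored near-straight magnitudes plus one straight entry.
    far = [1000.0, 5000.0]

    magnitudes = tight + medium + far            # 29 + 19 + 2 = 50
    vocab = [-r for r in magnitudes] + magnitudes + [10000.0]
    return sorted(vocab)                         # 50 * 2 + 1 = 101


radius_vocab = build_radius_vocab()
```

Under these assumptions the mirrored counts are 58, 38, and 4, and the straight entry brings the far band to 5, matching the stated totals; other splits are equally compatible with the text.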
[0047] 2. Speed discretization (v)
[0048] The maximum speed is set at 80 km/h (in accordance with urban road speed limits).
[0049] Non-uniform segmentation strategy (based on driving behavior sensitivity):
[0050] 0–10 km/h: 0.5 km/h increments → 20 levels (ultra-low speed, parking / starting);
[0051] 10–30 km/h: 1 km/h increments → 20 levels (low speed, intersection / residential area);
[0052] 30–60 km/h: 2 km/h increments → 15 levels (medium speed, main road);
[0053] 60–80 km/h: 5 km/h increments → 5 levels (highway, expressway);
[0054] Total speed vocabulary: 60 entries.
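The speed table above can be generated in a few lines. The sketch below assumes each band is half-open (its upper bound belongs to the next band) and that the 80 km/h maximum is appended as the final entry; the source gives only the band limits, increments, and level counts, so this endpoint handling is an assumption chosen because it reproduces the stated 20 + 20 + 15 + 5 = 60 entries.

```python
def build_speed_vocab():
    """Illustrative 60-entry speed table in km/h."""
    # (lower bound, upper bound, increment) for each band, from the text above.
    bands = [(0.0, 10.0, 0.5), (10.0, 30.0, 1.0), (30.0, 60.0, 2.0), (60.0, 80.0, 5.0)]
    speeds = []
    for lo, hi, step in bands:
        n = round((hi - lo) / step)
        # Assumption: half-open bands, so the upper bound is left to the next band.
        speeds.extend(lo + k * step for k in range(n))
    # Assumption: the 80 km/h maximum closes the table.
    speeds.append(80.0)
    return speeds


speed_vocab = build_speed_vocab()
```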
[0055] 3. Vocabulary integration
[0056] $\text{index} = i \times N_v + j$
[0057] where $i$ is the index of the discretized turning radius (lateral), $j$ is the index of the discretized speed (longitudinal), and $N_v = 60$ is the length of the speed (longitudinal) vocabulary;
[0058] Generate a combination of behavioral parameters through Cartesian product:
[0059] Vocabulary size: 101 (turning radius) × 60 (speed) = 6060 discrete behaviors.
[0060] Key advantage: Each action corresponds to a unique combination of (turning radius, speed) and has a clear physical meaning.
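The mapping between a (turning radius, speed) pair of sub-indices and a single behavior index can be sketched as below. It assumes the combined index is laid out as i × 60 + j, which is what the stated speed vocabulary length suggests; the function names are hypothetical.

```python
N_RADIUS = 101  # turning radius vocabulary size
N_SPEED = 60    # speed vocabulary size


def to_behavior_index(i: int, j: int) -> int:
    """Combine radius sub-index i and speed sub-index j into one behavior index."""
    if not (0 <= i < N_RADIUS and 0 <= j < N_SPEED):
        raise ValueError("sub-index out of range")
    return i * N_SPEED + j


def from_behavior_index(index: int) -> tuple:
    """Recover (i, j) from a behavior index; the inverse of to_behavior_index."""
    if not 0 <= index < N_RADIUS * N_SPEED:
        raise ValueError("behavior index out of range")
    return divmod(index, N_SPEED)
```

Because the mapping is a bijection onto 0..6059, every behavior index decodes to exactly one parameter pair.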
[0061] This invention proposes a trajectory encoding and decoding mechanism, as follows:
[0062] 1. Encoding process (continuous trajectory → behavior index)
[0063] As shown in Figure 2, the following encoding operations are performed when a continuous trajectory point sequence is input:
[0064] Input: a continuous trajectory coordinate sequence. Steps:
[0065] a. Difference calculation: calculate the distance $\Delta s$ between adjacent points and the change in heading angle $\Delta\theta$;
[0066] b. State determination:
[0067] if $\Delta s$ is below the stationary threshold → stationary: speed = 0, turning radius inherited from the previous moment;
[0068] if $\Delta\theta \approx 0$ → straight: the turning radius is set to the maximum value (e.g., 10,000 m).
[0069] c. Parameter calculation:
[0070] radius of curvature $R = \dfrac{\Delta s}{2\sin(\Delta\theta / 2)}$ (chord length formula);
[0071] speed $v = \Delta s / \Delta t$, with $\Delta t = 0.1$ s (corresponding to 10 Hz).
[0072] d. Physical verification:
[0073] lateral acceleration $a_{\text{lat}} = v^2 / |R| \le 0.1g$;
[0074] speed $0 \le v \le 80$ km/h;
[0075] e. Index mapping: map $R$ and $v$ to their nearest-neighbor vocabulary entries respectively to generate a unique behavior index.
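A single encoding step (a to c and e) might look like the sketch below. It is illustrative only: the chord-length relation R = Δs / (2 sin(Δθ / 2)), the 0.1 m stationary threshold, the 1e-4 rad straight-line threshold, nearest-neighbor search in linear space, and the i × len(speed_vocab) + j index layout are assumptions, and the physical verification of step d is left out for brevity.

```python
import math
from bisect import bisect_left

DT = 0.1               # seconds per step (10 Hz)
R_MAX = 10_000.0       # radius used for straight driving, metres
STATIONARY_DS = 0.1    # assumed stationary threshold, metres
STRAIGHT_DTHETA = 1e-4 # assumed straight-line threshold, radians


def nearest(sorted_vals, x):
    """Index of the entry in an ascending list that is closest to x."""
    k = bisect_left(sorted_vals, x)
    candidates = [c for c in (k - 1, k) if 0 <= c < len(sorted_vals)]
    return min(candidates, key=lambda c: abs(sorted_vals[c] - x))


def encode_step(p0, p1, prev_radius, radius_vocab, speed_vocab_kmh):
    """Encode one pose pair (x, y, heading) into (behavior index, radius)."""
    (x0, y0, h0), (x1, y1, h1) = p0, p1
    ds = math.hypot(x1 - x0, y1 - y0)
    # Wrap the heading difference into (-pi, pi].
    dtheta = math.atan2(math.sin(h1 - h0), math.cos(h1 - h0))

    if ds < STATIONARY_DS:                 # stationary: inherit radius
        radius, speed = prev_radius, 0.0
    elif abs(dtheta) < STRAIGHT_DTHETA:    # straight: maximum radius
        radius, speed = R_MAX, ds / DT
    else:                                  # turning: chord length formula
        radius, speed = ds / (2.0 * math.sin(dtheta / 2.0)), ds / DT

    i = nearest(radius_vocab, radius)
    j = nearest(speed_vocab_kmh, speed * 3.6)  # m/s -> km/h
    return i * len(speed_vocab_kmh) + j, radius
```

Returning the radius alongside the index lets the caller pass it back in as `prev_radius` for the next step, which is how the stationary case inherits its value.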
[0076] 2. Decoding process (behavior index → continuous trajectory)
[0077] As shown in Figure 3, the process of generating continuous trajectory points based on the behavior index is as follows:
[0078] a. Parameter lookup table:
[0079] The corresponding turning radius and speed parameters are obtained by looking up the behavior index in the table.
[0080] b. Kinematic calculations:
[0081] (1) Calculate the angular velocity (speed divided by the turning radius).
[0082] By the arc length formula for circular motion, arc length = turning radius × angle turned, where the angle turned = angular velocity × unit time: $s = R\,\omega\,\Delta t$.
[0083] Distance traveled by the vehicle = speed × unit time: $d = v\,\Delta t$.
[0084] Over a very short time interval the arc length can be taken as equal to the distance traveled, so $R\,\omega\,\Delta t = v\,\Delta t$.
[0085] Therefore, angular velocity = speed / turning radius: $\omega = v / R$.
[0086] (2) Calculate the change in heading angle per unit time.
[0087] Change in heading angle = angular velocity × time interval = speed / turning radius × time interval: $\Delta\theta = \omega\,\Delta t = (v / R)\,\Delta t$.
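[0088] (3) Calculate the displacement vector in the local coordinate system, where $R$ is the turning radius and $\Delta\theta$ is the change in heading angle:
[0089] $\Delta x_{\text{local}} = R \sin(\Delta\theta)$
[0090] $\Delta y_{\text{local}} = R\,(1 - \cos(\Delta\theta))$
[0091] c. Transform to the global coordinate system, where $\theta_{t-1}$ is the previous heading angle:
[0092] $\Delta x_{\text{global}} = \Delta x_{\text{local}} \cos\theta_{t-1} - \Delta y_{\text{local}} \sin\theta_{t-1}$
[0093] $\Delta y_{\text{global}} = \Delta x_{\text{local}} \sin\theta_{t-1} + \Delta y_{\text{local}} \cos\theta_{t-1}$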
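[0094] d. Trajectory accumulation:
[0095] The current position equals the previous position plus the displacement vector in the global coordinate system, $(\Delta x_{\text{global}}, \Delta y_{\text{global}})$, and the current heading angle equals the previous heading angle plus the change in heading angle $\Delta\theta$. Repeat the above steps to generate a complete trajectory sequence.
[0096] $x_t = x_{t-1} + \Delta x_{\text{global}}$
[0097] $y_t = y_{t-1} + \Delta y_{\text{global}}$
[0098] $\theta_t = \theta_{t-1} + \Delta\theta$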
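The decoding steps a to d can be sketched as a short loop. This is a minimal illustration under stated assumptions: the local displacement over one step is taken to follow a constant-curvature arc, (R sin Δθ, R (1 − cos Δθ)), a positive radius is taken to mean a left turn, and the input is a list of already looked-up (turning radius in m, speed in km/h) pairs rather than raw behavior indices.

```python
import math


def decode_trajectory(params, x=0.0, y=0.0, heading=0.0, dt=0.1):
    """Integrate (radius_m, speed_kmh) pairs into a list of (x, y, heading) poses."""
    trajectory = []
    for radius, speed_kmh in params:
        v = speed_kmh / 3.6                 # km/h -> m/s
        dtheta = v * dt / radius            # heading change = (v / R) * dt

        # Assumed constant-curvature arc displacement in the vehicle frame.
        dx_local = radius * math.sin(dtheta)
        dy_local = radius * (1.0 - math.cos(dtheta))

        # Rotate by the previous heading to get the global displacement.
        dx_global = dx_local * math.cos(heading) - dy_local * math.sin(heading)
        dy_global = dx_local * math.sin(heading) + dy_local * math.cos(heading)

        # Accumulate position and heading.
        x, y, heading = x + dx_global, y + dy_global, heading + dtheta
        trajectory.append((x, y, heading))
    return trajectory


# Ten steps at 36 km/h on a 10 m left-hand arc.
arc = decode_trajectory([(10.0, 36.0)] * 10)
```

With the arc model, repeated steps at constant radius stay exactly on the circle, so the decoded points do not drift outward the way a straight-segment approximation would.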
[0099] The technical solution of the present invention, when implemented, also includes the following dynamic verification and optimization mechanism:
[0100] (1) Stationary state determination: when the distance between adjacent trajectory points is less than 0.1 meters, the velocity is forced to zero;
[0101] (2) Lateral acceleration constraint: ensure that the lateral acceleration does not exceed 0.98 m/s² (0.1 times the gravitational acceleration);
[0102] (3) Longitudinal acceleration constraint: according to comfort requirements, the longitudinal acceleration (the rate of change of speed) is limited to the range of 2.5–8 m/s²;
[0103] (4) Physical feasibility verification: reject combinations of behaviors that violate vehicle dynamics (such as high speed and small radius).
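Checks (1), (2), and (4) can be sketched as two small helpers. The function names are hypothetical, g is taken as 9.8 m/s² so that 0.1 g matches the 0.98 m/s² stated above, and the longitudinal constraint (3) is omitted because it needs consecutive speeds rather than a single parameter pair.

```python
G = 9.8                      # m/s^2, so that 0.1 g = 0.98 m/s^2 as stated above
MAX_LATERAL_ACC = 0.1 * G    # lateral acceleration limit, m/s^2


def is_feasible(radius_m: float, speed_kmh: float) -> bool:
    """True if the (radius, speed) pair respects the lateral acceleration limit."""
    v = speed_kmh / 3.6                      # km/h -> m/s
    return v * v / abs(radius_m) <= MAX_LATERAL_ACC


def clamp_stationary(step_distance_m: float, speed_kmh: float, threshold_m: float = 0.1) -> float:
    """Force the speed to zero when adjacent points are closer than the threshold."""
    return 0.0 if step_distance_m < threshold_m else speed_kmh
```

For example, 60 km/h on a 5 m radius gives a lateral acceleration of roughly 55 m/s² and is rejected, while the same speed on a 10,000 m radius is accepted.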
[0104] In an embodiment of the present invention, an autonomous driving planning method based on unified visual language action modeling is provided, comprising the following steps: S1 Multimodal perception: acquiring multimodal perception data of the vehicle at the current moment and encoding the multimodal data into unified BEV environment features; S2 Behavioral parameter encoding: acquiring the vehicle's historical trajectory sequence and encoding the historical trajectory sequence into a historical behavior index sequence based on a predefined behavior vocabulary through behavior parameterization encoding, wherein the predefined behavior vocabulary is constructed based on vehicle kinematic parameters through a non-uniform discretization strategy, and each behavior index corresponds to a unique parameter combination consisting of turning radius and speed; S3 VLA inference: inputting the BEV environment features, user prompt text information, and historical behavior index sequence into a pre-trained visual language action VLA inference module to infer the vehicle's future behavior index sequence; S4 Kinematic decoding: obtaining the vehicle's future planned trajectory through kinematic decoding based on the future behavior index sequence and the predefined behavior vocabulary.
[0105] Furthermore, the VLA inference module is a large language model (LLM) based on the Transformer architecture and is pre-trained according to the following steps: F1: Encode and fuse multimodal perception data into unified BEV environment features; F2: Encode the vehicle's historical trajectory sequence and future trajectory sequence into historical behavior indexes and future behavior indexes respectively through behavior parameterization encoding; F3: Construct question-answer (QA) pairs for supervised fine-tuning (SFT) of the large language model, constructing the questions from user prompt text information, BEV environment features, and historical behavior indexes, and using the vehicle's future behavior indexes as the answers; F4: Use the question-answer pairs obtained in step F3 to perform supervised fine-tuning of the VLA inference module.
[0106] Furthermore, the predefined behavior vocabulary is constructed through the following steps: T1: Construct a turning radius vocabulary (the first sub-vocabulary) based on the turning radius parameter, wherein non-uniform segmented sampling is performed according to the turning radius size; T2: Construct a speed vocabulary (the second sub-vocabulary) based on the speed parameter, wherein non-uniform segmented sampling is performed according to the speed size; T3: Combine the first sub-vocabulary and the second sub-vocabulary by performing a Cartesian product to generate a complete behavior vocabulary containing multiple unique behavior indexes, each behavior index corresponding to a unique combination of turning radius and speed.
[0107] Further, step T1 specifically includes: T11: setting a minimum turning radius constraint; T12: for turning radii within a first preset range, using non-linear intervals to divide them into positive and negative directions to generate a first number of sampling levels; T13: for turning radii within a second preset range, using exponentially increasing sampling to generate a second number of sampling levels, wherein the turning radius within the second preset range is greater than the turning radius within the first preset range, and the second number is less than the first number; T14: for turning radii within a third preset range, retaining a third number of representative values, wherein the turning radius within the third preset range is greater than the turning radius within the second preset range, and the third number is less than the second number; T15: merging all sampled values to form the turning radius vocabulary sub-table.
[0108] Furthermore, in step T2, the speed is divided into different sampling ranges, with higher-speed ranges using larger sampling intervals.
[0109] Further, in steps S2 and F2, the behavior parameterization encoding specifically includes the following steps: obtaining the distance and heading angle change between adjacent points through differential calculation; determining the state based on the distance and heading angle change, where the turning radius inherits from the previous moment when the state is stationary, and the turning radius is set to the maximum value when the state is straight; calculating the radius of curvature and velocity based on the distance and heading angle change; verifying the lateral acceleration and velocity to eliminate invalid parameters; and mapping the calculated radius of curvature and velocity to the nearest neighbor turning radius and velocity entries in the predefined behavior vocabulary to generate a unique behavior index.
[0110] Further, in step S4, kinematic decoding specifically includes the following steps: performing parameter lookup to obtain the corresponding turning radius and velocity through behavior index; obtaining angular velocity, change of heading angle per unit time, and displacement vector in local coordinate system through kinematic calculation; transforming the displacement vector in local coordinate system to global coordinate system; and performing trajectory accumulation to generate a physically feasible continuous trajectory by accumulating the displacement vector and the change of heading angle.
[0111] Furthermore, after step S4, the method further includes: S5 dynamic verification: performing a static state determination, when the distance between adjacent trajectory points is less than a preset value, forcing the speed to zero, ensuring that the generated trajectory meets the lateral acceleration constraint and the longitudinal acceleration constraint, and performing physical feasibility verification to reject trajectories that violate vehicle dynamics.
[0112] In another embodiment of the present invention, an autonomous driving planning device based on unified visual language action modeling is also provided, comprising the following modules: a multimodal perception module: acquiring multimodal perception data of the vehicle at the current moment and encoding the multimodal data into unified BEV environment features; a behavior parameter encoding module: acquiring the vehicle's historical trajectory sequence and encoding the historical trajectory sequence into a historical behavior index sequence based on a predefined behavior vocabulary through behavior parameterization encoding, wherein the predefined behavior vocabulary is constructed based on vehicle kinematic parameters through a non-uniform discretization strategy, and each behavior index corresponds to a unique parameter combination consisting of turning radius and speed; a VLA inference module: inputting BEV environment features, user prompt text information, and historical behavior index sequence into a pre-trained visual language action VLA inference module to infer the vehicle's future behavior index sequence; and a kinematic decoding module: obtaining the vehicle's future planned trajectory through kinematic decoding based on the future behavior index sequence and the predefined behavior vocabulary.
[0113] In other embodiments of the present invention, a computer-readable storage medium containing a computer program is also provided, which, when executed by one or more processors, performs the autonomous driving planning method based on visual language action unified modeling as described above.
[0114] The beneficial technical effects of the technical solution of the present invention are as follows:
[0115] Significantly improved inference efficiency: Each behavioral parameter only requires 1 token, while the traditional method requires 9 tokens per trajectory point. The number of tokens for 10 trajectory points is reduced from 90 to 10, and the inference speed is increased by 8 times.
[0116] Enhanced accuracy in key scenarios: High-resolution discretization is used in low-speed areas (0–20 km / h) and small-radius turns (<10 m) to enhance the adaptability and planning accuracy of key scenarios;
[0117] The vocabulary size is controllable: a total of 6060 entries (101 turning radii × 60 speeds), which is much smaller than the exhaustive search space, and supports efficient table lookup and index mapping;
[0118] Physical feasibility assurance: Through lateral acceleration verification (≤0.1g) and velocity change rate constraint (≤4m / s²), the generated trajectory is ensured to conform to the vehicle dynamics characteristics.
[0119] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. An autonomous driving planning method based on visual language action unified modeling, characterized in that it includes the following steps: S1 Multimodal Perception: Acquire multimodal perception data of the vehicle at the current moment and encode the multimodal data into unified BEV environment features; S2 Behavior Parameter Encoding: Obtain the vehicle's historical trajectory sequence and, based on a predefined behavior vocabulary, encode the historical trajectory sequence into a historical behavior index sequence through behavior parameterization encoding, wherein the predefined behavior vocabulary is constructed based on the vehicle's kinematic parameters using a non-uniform discretization strategy, and each behavior index corresponds to a unique parameter combination consisting of the turning radius and speed; S3 VLA Inference: Input the BEV environment features, user prompt text information, and historical behavior index sequence into a pre-trained visual language action VLA inference module to infer the vehicle's future behavior index sequence; S4 Kinematic Decoding: Based on the future behavior index sequence and the predefined behavior vocabulary, obtain the future planned trajectory of the vehicle through kinematic decoding.
2. The method according to claim 1, characterized in that the VLA inference module is based on the Transformer architecture and is pre-trained according to the following steps: F1: encoding and fusing multimodal perception data into unified BEV environment features; F2: encoding the vehicle's historical trajectory sequence and future trajectory sequence into a historical behavior index sequence and a future behavior index sequence, respectively, through behavior parameterization encoding; F3: constructing question-answer (QA) pairs in the supervised fine-tuning (SFT) format used for training large language models, with the user prompt text information, the BEV environment features, and the historical behavior index sequence as the question, and the vehicle's future behavior index sequence as the answer; F4: using the question-answer pairs obtained in step F3 to perform supervised fine-tuning of the VLA inference module.
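By way of non-limiting illustration, step F3 of claim 2 could be sketched as follows. The token strings `<bev>` and `<act_N>`, the function name `make_sft_sample`, and the question layout are all hypothetical choices made for this sketch; the source text does not specify how the QA pair is serialized.

```python
def make_sft_sample(prompt, num_bev_tokens, history_idx, future_idx):
    """Build one question-answer pair for supervised fine-tuning (step F3)."""
    # Hypothetical placeholder tokens standing in for BEV feature embeddings.
    bev = "<bev>" * num_bev_tokens
    # One hypothetical token per behavior index, e.g. <act_52>.
    history = "".join(f"<act_{i}>" for i in history_idx)
    answer = "".join(f"<act_{i}>" for i in future_idx)
    question = f"{prompt}\nScene: {bev}\nHistory: {history}"
    return {"question": question, "answer": answer}


sample = make_sft_sample("Keep lane and follow traffic.", 4, [52, 52, 38], [38, 38])
print(sample["question"])
print(sample["answer"])
```

Because the answer consists only of behavior index tokens, the fine-tuning target in this sketch is a short discrete sequence rather than a list of trajectory coordinates.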
3. The method according to claim 2, characterized in that the predefined behavior vocabulary is constructed through the following steps: T1: constructing a turning radius vocabulary sub-table based on the turning radius parameter, wherein non-uniform segmented sampling is performed according to the magnitude of the turning radius; T2: constructing a speed vocabulary sub-table based on the speed parameter, wherein non-uniform segmented sampling is performed according to the magnitude of the speed; T3: combining the turning radius vocabulary sub-table and the speed vocabulary sub-table by Cartesian product to generate a complete behavior vocabulary containing multiple unique behavior indexes, each behavior index corresponding to a unique combination of turning radius and speed.
4. The method according to claim 3, characterized in that step T1 specifically includes: T11: setting a minimum turning radius constraint; T12: for turning radii within a first preset range, sampling at non-linear intervals in both the positive and negative directions to generate a first number of sampling levels; T13: for turning radii within a second preset range, using exponentially increasing sampling to generate a second number of sampling levels, wherein the turning radii within the second preset range are greater than those within the first preset range, and the second number is less than the first number; T14: for turning radii within a third preset range, retaining a third number of representative values, wherein the turning radii within the third preset range are greater than those within the second preset range, and the third number is less than the second number; T15: merging all sampled values to form the turning radius vocabulary sub-table.
5. The method according to claim 4, characterized in that, in step T2, the speed is divided into different sampling ranges according to its magnitude, and the greater the speed, the larger the sampling interval.
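By way of non-limiting illustration, the vocabulary construction of claims 3 to 5 could be sketched as follows. Every numeric range, level count, and step size below is a hypothetical value chosen for readability, so the resulting table does not reproduce the 6363 entries of the described embodiment. Mirroring all radius levels into both signs is likewise an assumption of this sketch.

```python
import itertools

import numpy as np

R_MIN = 5.0  # T11: minimum turning radius in metres (hypothetical value)


def radius_table():
    """T12-T15: non-uniform turning-radius levels, signed for turn direction."""
    near = np.geomspace(R_MIN, 50.0, num=8)  # T12: dense levels for tight turns
    mid = 50.0 * 2.0 ** np.arange(1, 5)      # T13: 100, 200, 400, 800 (exponential)
    far = np.array([2000.0, 10000.0])        # T14: few near-straight representatives
    magnitudes = np.concatenate([near, mid, far])
    # T15: merge all levels; both signs are kept for every level (assumption).
    return np.sort(np.concatenate([-magnitudes, magnitudes]))


def speed_table():
    """T2: the sampling interval grows with speed (hypothetical ranges, m/s)."""
    low = np.arange(0.0, 5.0, 0.25)
    mid = np.arange(5.0, 15.0, 0.5)
    high = np.arange(15.0, 31.0, 1.0)
    return np.concatenate([low, mid, high])


def build_vocabulary():
    """T3: Cartesian product; the list position serves as the behavior index."""
    radii, speeds = radius_table(), speed_table()
    return list(itertools.product(radii.tolist(), speeds.tolist()))


vocab = build_vocabulary()
print(len(vocab), vocab[0], vocab[-1])
```

In this layout the behavior index of the pair at radius position `i` and speed position `j` equals `i * len(speed_table()) + j`, so lookup in either direction is a constant-time operation.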
6. The method according to any one of claims 1-5, characterized in that, in steps S2 and F2, the behavior parameterization encoding specifically includes the following steps: obtaining the distance between adjacent trajectory points and the change in heading angle through differential calculation; determining the motion state based on the distance and the change in heading angle, wherein, when the state is stationary, the turning radius is inherited from the previous moment, and when the state is straight-line driving, the turning radius is set to the maximum value; calculating the radius of curvature and the speed based on the distance and the change in heading angle; verifying the lateral acceleration and the speed to eliminate invalid parameters; and mapping the calculated radius of curvature and speed to the nearest turning radius and speed entries in the predefined behavior vocabulary to generate a unique behavior index.
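By way of non-limiting illustration, the encoding steps of claim 6 could be sketched as follows. The small `RADII` and `SPEEDS` tables, the thresholds `STOP_DIST` and `STRAIGHT_DYAW`, and the choice to mark an invalid step with `None` are hypothetical; only the 0.1 g lateral acceleration limit is taken from the description. The radius is approximated as distance divided by heading change, which assumes a short arc per time step.

```python
import math

# Hypothetical sub-tables; the sign of the radius encodes the turn direction.
RADII = [-1000.0, -100.0, -20.0, -8.0, 8.0, 20.0, 100.0, 1000.0]  # metres
SPEEDS = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]                     # m/s
R_MAX = max(RADII)
A_LAT_MAX = 0.1 * 9.81   # lateral acceleration limit from the description
STOP_DIST = 0.05         # hypothetical stationary threshold (m)
STRAIGHT_DYAW = 1e-3     # hypothetical straight-line threshold (rad)


def nearest(table, value):
    """Position of the table entry closest to value."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))


def encode(points, yaws, dt):
    """Map a trajectory of (x, y) points and headings to behavior indexes."""
    indices, prev_radius = [], R_MAX
    for k in range(1, len(points)):
        # Differential calculation: distance and wrapped heading change.
        dist = math.hypot(points[k][0] - points[k - 1][0],
                          points[k][1] - points[k - 1][1])
        dyaw = (yaws[k] - yaws[k - 1] + math.pi) % (2 * math.pi) - math.pi
        if dist < STOP_DIST:                 # stationary: inherit the radius
            radius, speed = prev_radius, 0.0
        elif abs(dyaw) < STRAIGHT_DYAW:      # straight: use the maximum radius
            radius, speed = R_MAX, dist / dt
        else:                                # turning: short-arc approximation
            radius, speed = dist / dyaw, dist / dt
        # Verification: reject steps that break the lateral or speed limits.
        if speed ** 2 / abs(radius) > A_LAT_MAX or speed > SPEEDS[-1]:
            indices.append(None)             # assumption: invalid steps are marked
            continue
        prev_radius = radius
        indices.append(nearest(RADII, radius) * len(SPEEDS) + nearest(SPEEDS, speed))
    return indices


print(encode([(0, 0), (1, 0), (2, 0)], [0.0, 0.0, 0.0], dt=0.5))
```

Nearest-neighbor matching is done directly on the radius value here, as the claim states; a practical system might prefer matching on curvature so that very large radii are not over-weighted, but that is outside what the source describes.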
7. The method according to claim 6, characterized in that, in step S4, the kinematic decoding specifically includes the following steps: performing a parameter table lookup to obtain the turning radius and speed corresponding to each behavior index; obtaining, through kinematic calculation, the angular velocity, the change in heading angle per unit time, and the displacement vector in the local coordinate system; transforming the displacement vector from the local coordinate system to the global coordinate system; and performing trajectory accumulation by accumulating the displacement vectors and the changes in heading angle to generate a physically feasible continuous trajectory.
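By way of non-limiting illustration, the decoding steps of claim 7 could be sketched as follows, assuming each time step is a constant-radius arc. The `RADII` and `SPEEDS` tables, the index layout (radius position times table width plus speed position), and the convention that a positive radius means a counter-clockwise turn are hypothetical.

```python
import math

# Hypothetical sub-tables; index = radius_position * len(SPEEDS) + speed_position.
RADII = [-1000.0, -100.0, -20.0, -8.0, 8.0, 20.0, 100.0, 1000.0]  # metres
SPEEDS = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]                     # m/s


def decode(indices, dt, x=0.0, y=0.0, yaw=0.0):
    """Turn a behavior index sequence into a list of (x, y, yaw) poses."""
    trajectory = [(x, y, yaw)]
    for idx in indices:
        # Table lookup: recover turning radius and speed from the index.
        radius = RADII[idx // len(SPEEDS)]
        speed = SPEEDS[idx % len(SPEEDS)]
        # Kinematics: angular velocity is speed / radius.
        dyaw = speed / radius * dt
        # Arc displacement in the vehicle's local frame (x forward, y left).
        dx = radius * math.sin(dyaw)
        dy = 2.0 * radius * math.sin(dyaw / 2.0) ** 2  # equals R * (1 - cos(dyaw))
        # Rotate into the global frame and accumulate.
        x += dx * math.cos(yaw) - dy * math.sin(yaw)
        y += dx * math.sin(yaw) + dy * math.cos(yaw)
        yaw += dyaw
        trajectory.append((x, y, yaw))
    return trajectory


for pose in decode([38, 38, 52], dt=0.5):
    print(pose)
```

Because each step is integrated as an exact arc, a constant index traces a circle of the tabulated radius, and consecutive poses remain continuous in both position and heading.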
8. The method according to any one of claims 1-5, characterized in that, after step S4, the method further includes: S5, dynamic verification: performing stationary state determination, wherein the speed is forced to zero when the distance between adjacent trajectory points is less than a preset value; ensuring that the generated trajectory satisfies the lateral acceleration constraint and the longitudinal acceleration constraint; and performing physical feasibility verification to reject trajectories that violate vehicle dynamics.
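By way of non-limiting illustration, the dynamic verification of claim 8 could be sketched as follows. The limits of 0.1 g and 4 m/s² are those stated in the description; the threshold `STOP_DIST`, the function name `verify`, and the input layout (one radius per trajectory segment) are hypothetical.

```python
import math

A_LAT_MAX = 0.1 * 9.81  # lateral acceleration limit from the description (m/s^2)
A_LON_MAX = 4.0         # speed change rate limit from the description (m/s^2)
STOP_DIST = 0.05        # hypothetical stationary threshold (m)


def verify(points, radii, dt):
    """Return (is_feasible, speeds) for a planned trajectory.

    points: (x, y) positions; radii: turning radius of each segment.
    """
    speeds = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        # Stationary determination: tiny displacements are treated as a stop.
        speeds.append(0.0 if dist < STOP_DIST else dist / dt)
    # Lateral acceleration on an arc is v^2 / |R|.
    lateral_ok = all(v * v / abs(r) <= A_LAT_MAX for v, r in zip(speeds, radii))
    # Longitudinal acceleration is the speed change between adjacent segments.
    longitudinal_ok = all(abs(b - a) / dt <= A_LON_MAX
                          for a, b in zip(speeds, speeds[1:]))
    return lateral_ok and longitudinal_ok, speeds


print(verify([(0, 0), (1, 0), (1.01, 0)], [1000.0, 1000.0], dt=1.0))
```

A trajectory for which the first returned value is false would be rejected; how a rejected trajectory is replaced is not specified in the source and is left open here.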
9. An autonomous driving planning device based on unified modeling of visual language and actions, characterized in that it comprises the following modules: a multimodal perception module, configured to acquire multimodal perception data of the vehicle at the current moment and encode the multimodal perception data into unified BEV environment features; a behavior parameter encoding module, configured to acquire the vehicle's historical trajectory sequence and, based on a predefined behavior vocabulary, encode the historical trajectory sequence into a historical behavior index sequence through behavior parameterization encoding, wherein the predefined behavior vocabulary is constructed from the vehicle's kinematic parameters using a non-uniform discretization strategy, and each behavior index corresponds to a unique parameter combination consisting of a turning radius and a speed; a VLA inference module, which is a pre-trained visual language action (VLA) inference module configured to receive the BEV environment features, user prompt text information, and the historical behavior index sequence and to infer the vehicle's future behavior index sequence; and a kinematic decoding module, configured to obtain the vehicle's future planned trajectory through kinematic decoding, based on the future behavior index sequence and the predefined behavior vocabulary.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by one or more processors, it performs the autonomous driving planning method based on unified modeling of visual language and actions as described in any one of claims 1-8.