Trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios

By constructing semantic embedding and temporal feature alignment, dynamic-static decoupling latent modeling, trajectory conditional diffusion prediction, and dynamic feature alignment and self-distillation optimization, the problem of insufficient characterization of the coupling relationship between vehicle motion and scene in autonomous driving environment modeling is solved, and efficient and semantically consistent future occupancy prediction is achieved.

CN122222005A · Pending · Publication Date: 2026-06-16 · CHINA UNIV OF MINING & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA UNIV OF MINING & TECH
Filing Date
2026-03-13
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing methods for autonomous driving environment modeling and future scenario prediction struggle to depict the coupling relationship between vehicle motion and scene in complex and dynamic traffic environments, lack trajectory consistency, and suffer from low computational efficiency and insufficient semantic expression capabilities.

Method used

By constructing semantic embedding and temporal feature alignment, dynamic-static decoupling latent modeling, trajectory conditional diffusion prediction, and dynamic feature alignment and self-distillation optimization mechanisms, four-dimensional occupancy prediction is achieved, thereby improving the training stability and semantic expressive ability of the model.

Benefits of technology

It enables the generation of future occupancy results consistent with a given trajectory in complex and dynamic traffic environments, improving prediction accuracy and robustness, and enhancing the model's generalization ability and training efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios, comprising the following steps: semantic embedding construction and temporal feature alignment; occupancy semantic prior construction based on a teacher network; dynamic-static decoupled temporal latent prior modeling and fusion; and construction of a trajectory-conditioned four-dimensional occupancy prediction prior based on a diffusion model. The method takes three-dimensional occupancy sequence data of the vehicle's surrounding environment as input and directly outputs a future four-dimensional occupancy result consistent with the historical trajectory constraint, thereby realizing end-to-end trajectory-condition-driven four-dimensional occupancy prediction. The application enables controllable diffusion generation of the future four-dimensional occupancy sequence in the latent space, so that the scene evolution remains consistent with the trajectory of the ego vehicle, improving the prediction accuracy and robustness of the autonomous driving system in complex dynamic environments.

Description

Technical Field

[0001] This invention relates to a trajectory prediction method, specifically a trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios, and belongs to the technical field of autonomous driving environment modeling and future scenario prediction.

Background Technology

[0002] With the rapid development of autonomous driving technology, intelligent environmental perception systems, and 3D occupancy modeling methods, the problem of environmental understanding and future scenario prediction for vehicles in complex and dynamic traffic environments has received widespread attention. In real traffic scenarios, the environmental state not only includes the dynamic changes of multiple traffic participants but is also affected by the vehicle's trajectory, exhibiting obvious spatiotemporal coupling characteristics. Traditional prediction methods based on static scene representation or short-term motion extrapolation are insufficient to depict the continuous evolution of the scene in the time dimension, especially when vehicle motion changes are involved, as the scene structure will be significantly reconstructed with the motion. Therefore, how to model the coupling relationship between historical occupancy information and vehicle trajectory within a unified spatiotemporal framework and generate a future four-dimensional occupancy sequence consistent with the trajectory has become one of the core research problems in the field of autonomous driving environment prediction.

[0003] Existing environmental modeling and future prediction methods mainly include rule-based reasoning, dynamic models, and end-to-end prediction methods based on deep learning. Rule-based and dynamic methods typically rely on manually set physical assumptions and motion constraints, exhibiting some stability in simple or weakly interactive scenarios, but they struggle to cover the uncertainties arising from multi-agent interactions and semantic changes in complex traffic environments. In recent years, deep learning-based temporal prediction methods have emerged, encoding historical observations and predicting future states through neural networks, thus improving expressive power to some extent. However, most of these methods focus on trajectory or local target prediction, lacking a unified modeling framework for the overall three-dimensional occupancy structure and its temporal evolution, making it difficult to form a complete four-dimensional environmental representation.

[0004] With the development of generative modeling techniques, diffusion models have been introduced into video prediction and time series generation tasks because of their strengths in high-dimensional data generation and uncertainty modeling, where they characterize the uncertainty of future states and generate diverse results. However, existing diffusion-based scene or occupancy prediction methods still have significant limitations. First, the diffusion process usually uses only historical scene features as conditions and lacks an explicit conditional embedding of the vehicle trajectory, making it difficult to ensure that the generated future occupancy sequence remains aligned with the given trajectory. Second, diffusion modeling is mostly performed directly in the original space, without structured compression and semantic reorganization of the highly sparse 3D occupancy data, resulting in low computational efficiency and difficulty in fully exploiting the semantic correlations between voxels. Third, the training phase relies mainly on numerical losses such as the denoising mean square error and lacks external semantic feature guidance and a stable self-distillation mechanism, leading to slow convergence in the early training phase, insufficient semantic expressive ability, and limited generalization ability in complex scenes.

[0005] Meanwhile, in 3D occupancy modeling tasks, the original occupancy data is highly sparse, with a large number of voxels representing air regions. Direct end-to-end modeling of this data leads to low computational efficiency and redundant feature representation. Furthermore, existing methods do not adequately mine the spatial semantic relationships within the occupancy data and fail to fully utilize the semantic correlations between voxels at the same height layer, thus limiting the model's ability to express scene structure.

[0006] In summary, current environmental modeling and future occupancy prediction in autonomous driving scenarios still face multiple technical challenges in complex and dynamic traffic environments. On the one hand, existing 3D occupancy or scene prediction methods mostly rely on extrapolation modeling based on historical observations, failing to adequately characterize the coupling relationship between vehicle motion and scene evolution, making it difficult to generate future occupancy results consistent with a given driving trajectory. In real traffic scenarios, the environmental structure is dynamically reconstructed as the vehicle's direction and speed change. Without explicit modeling of trajectory information, prediction results are prone to inconsistencies with actual movement trends, thus affecting the system's controllability and coordination. On the other hand, while some deep learning-based occupancy prediction methods can learn temporal change patterns from historical data, they primarily focus on numerical-level reconstruction and optimization, paying insufficient attention to the high-level semantic structure of the scene, resulting in deficiencies in semantic consistency and structural integrity in the generated results.

Summary of the Invention

[0007] To address the aforementioned technical shortcomings, the purpose of this invention is to provide a trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios. This method enables controllable diffusion generation of future four-dimensional occupancy sequences in the latent space, ensuring that the scene evolution remains consistent with the vehicle's trajectory. Furthermore, it enhances training stability and semantic expression capabilities without increasing inference complexity, thereby improving the prediction accuracy and robustness of autonomous driving systems in complex dynamic environments.

[0008] To achieve the above objectives, this invention provides a trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios, comprising the following steps:

S1. Semantic embedding construction and temporal feature alignment. Based on the 3D occupancy representation of the vehicle's surrounding environment, the space is divided into a 3D voxel grid, and each voxel grid cell is assigned a corresponding discrete semantic label to obtain the occupancy semantic features. Learnable semantic embedding vectors are introduced to map the discrete semantic labels to continuous semantic embedding vectors, and the semantic features are then reshaped into a tensor. For occupancy sequence data consisting of consecutive time frames, the embeddings of each frame are stacked and aligned along the time axis to form a unified BEV feature tensor.

S2. Construction of the occupancy semantic prior based on a teacher network. Empty voxels are removed from the occupancy semantic features obtained in step S1. The remaining voxels, which are strongly correlated with one another, are sorted by category, weighted, and aggregated, and are finally mapped to a two-dimensional feature map in the bird's-eye view. Principal component analysis (PCA) is then used to reduce the dimensionality of the two-dimensional feature map to obtain dimension-reduced features. The dimension-reduced features are input to a feature extractor to obtain prior features, which serve as the semantic prior of the occupancy scene.

S3. Dynamic-static decoupled temporal latent prior modeling and fusion. Based on the prior features constructed in step S2, the dynamic and static regions of the BEV feature tensor are determined, and a dynamic mask and a static mask are constructed. The dynamic-region features and the static-region features are then input to independent encoding modules, with the encoders kept temporally consistent during sampling, to obtain a dynamic latent feature representation and a static latent feature representation according to equation (1). Next, the dynamic and static latent feature representations are combined by weighted summation using static weights to obtain the fused feature according to equation (2). The fused feature is then split along the channel dimension into two components, and a variance increment is introduced for the dynamic region to obtain the latent variables according to equation (3). Finally, random perturbation sampling is performed on the latent variables to obtain the input of the diffusion model according to equation (4), in which the perturbation term denotes random noise.

S4. Construction of the four-dimensional occupancy prediction prior based on the diffusion model. Based on the diffusion model input obtained in step S3, a trajectory conditional embedding representation is first established, and a spatiotemporal diffusion model is then established. Random noise is denoised and reconstructed step by step to recover a structured latent representation of future occupancy, which is decoded to generate the future occupancy in its original form, thereby producing a prediction of the future scene structure.

[0009] In step S4, the trajectory conditional embedding representation and the spatiotemporal diffusion model are established as follows:

S41. Trajectory conditional embedding representation. The historical trajectory is introduced as a conditional constraint signal. It is projected into a high-dimensional embedding space, fed into a trained trajectory encoder, and then fused with the time step embedding to form the trajectory conditional embedding representation according to equation (5), in which the two functions are the time step embedding function and the trajectory encoding function, respectively.

S42. Spatiotemporal diffusion model. The diffusion model input is divided into a historical occupancy sequence and a future occupancy sequence. First, with the historical occupancy sequence and the trajectory conditional embedding representation obtained in step S41 serving as joint control conditions, noise perturbation is applied step by step to the future occupancy sequence. The noisy latent representation obtained after perturbation is then input into the DiT diffusion transformer, where a parameterized spatiotemporal denoising network estimates and outputs the predicted noise according to equation (6), whose arguments include the noisy latent representation and the diffusion time step. Subsequently, under the constraints of the historical occupancy sequence, the trajectory conditional embedding representation obtained in step S41, and the diffusion time step, the spatiotemporal diffusion model removes noise step by step through reverse iteration to obtain the future occupancy sequence consistent with the trajectory. Finally, the resulting latent representation is restored to the original occupancy resolution by a decoder composed of 3D deconvolution layers.

[0010] This invention also includes training optimization of the conditional diffusion model based on dynamic feature alignment and self-distillation. After steps S1, S2, S3, and S4 are completed, a two-stage supervision strategy is introduced, consisting of an external visual prior stage and a diffusion self-distillation stage. Scheduling factors provide a smooth transition between external semantic guidance and model self-supervision, thereby completing the optimization of the spatiotemporal diffusion model. The scheduling factors are defined by equation (7) in terms of the current training round and a predefined transition round that controls the switch from external supervision to self-supervision.

[0011] External visual prior: An external teacher model is introduced to impose semantic alignment constraints on the intermediate feature representations of the spatiotemporal diffusion model, and a feature alignment loss is adopted to guide the internal representation of the spatiotemporal diffusion model toward semantic features. The output of the diffusion transformer (DiT) at the n-th layer is mapped by a trainable projection layer into the same feature space as the prior features obtained in step S2, and the two are aligned by similarity according to equation (8). The symbols in equation (8) denote, respectively, the diffusion transformer (DiT), the variational autoencoder (VAE), and the trainable projection layer; the output of the DiT at the selected layer; the parameters of the trainable projection layer; the cosine similarity; and the number of image patches.

Diffusion self-distillation: The parameters of a teacher branch that has the same structure as the student model are updated using the exponential moving average (EMA). Self-distillation then minimizes the difference between the student and teacher features, giving the diffusion self-distillation loss of equation (9), which is computed with a predefined distance function.

[0012] Compared with existing technologies, this invention proposes a trajectory-condition-driven four-dimensional occupancy prediction method for spatiotemporal diffusion in intelligent driving scenarios. By constructing a spatiotemporal diffusion prediction framework that incorporates semantic embedding representation, a dynamic-static decoupling latent modeling mechanism, and trajectory condition control, it achieves four-dimensional occupancy prediction with temporal consistency and trajectory controllability in complex dynamic traffic scenarios. This method constructs a unified end-to-end modeling framework that includes a semantic embedding construction and temporal feature alignment module, a teacher network-based occupancy semantic prior construction module, a dynamic-static decoupling latent modeling and fusion module, a trajectory condition diffusion prediction module, and a dynamic feature alignment and self-distillation optimization mechanism. This framework can generate structured and semantically consistent future occupancy sequences under historical occupancy and trajectory condition constraints.

[0013] The semantic embedding and temporal feature alignment mechanism constructed in this invention maps three-dimensional discrete voxel semantic labels into learnable continuous embedding vectors and stacks and aligns them in the temporal dimension to form a unified BEV temporal feature tensor, which effectively solves the problem that discrete occupancy representations are difficult to model directly. This method realizes a smooth transition from discrete semantic space to continuous feature space, providing a stable and structured input representation for subsequent latent space modeling and diffusion prediction.
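The mapping from discrete labels to a stacked BEV tensor described above can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: the function name embed_occupancy, the class count, the embedding width, the grid size, and the choice of folding the height axis into the channel axis are all hypothetical, and a fixed random table stands in for the learnable embedding.

```python
import numpy as np

def embed_occupancy(labels, table):
    """Map integer labels of shape (T, X, Y, Z) to a BEV tensor (T, Z*C, X, Y).

    `table` has shape (num_classes, C); in a real model it would be a
    learnable parameter, here it is a fixed array (assumption).
    """
    T, X, Y, Z = labels.shape
    C = table.shape[1]
    emb = table[labels]                      # lookup -> (T, X, Y, Z, C)
    bev = emb.reshape(T, X, Y, Z * C)        # fold height into channels
    return np.transpose(bev, (0, 3, 1, 2))   # frames stay stacked on axis 0

rng = np.random.default_rng(0)
num_classes, C = 18, 8                       # arbitrary illustrative sizes
table = rng.normal(size=(num_classes, C))
labels = rng.integers(0, num_classes, size=(4, 20, 20, 2))  # 4 frames
bev = embed_occupancy(labels, table)
print(bev.shape)
```

Folding the height axis into channels is one common way to obtain a 2D BEV layout from a voxel grid; the source text does not say which reshaping the invention uses.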

[0014] Building upon this foundation, this invention introduces a teacher network-based occupancy semantic prior construction mechanism. By removing empty voxels, sorting and weighting voxels with the same height using semantic correlation, and combining PCA dimensionality reduction and focc feature extractor, spatial prior features with high semantic expressive power are obtained. These spatial prior features serve as semantic guidance signals during end-to-end training, aligning the intermediate representations of the diffusion model, thereby significantly enhancing the semantic consistency and expressive power of latent features.
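As a rough illustration of how such a prior input could be prepared, the sketch below masks out empty voxels, aggregates the remaining embeddings along the height axis with per-layer weights, and reduces the channel dimension with PCA computed via SVD. It is a simplification: the category sorting step is omitted, fixed weights stand in for the correlation-based weighting, and the name bev_prior, the empty label value, and all sizes are hypothetical. The feature extractor applied afterwards is not modeled.

```python
import numpy as np

def bev_prior(emb, labels, empty_label, height_w, k):
    """emb: (X, Y, Z, C) voxel embeddings; labels: (X, Y, Z) class ids.

    Returns an (X, Y, k) BEV map after PCA to k components.
    """
    occupied = (labels != empty_label)[..., None]      # drop empty voxels
    w = height_w[None, None, :, None] * occupied       # zero weight when empty
    bev = (emb * w).sum(axis=2) / np.maximum(w.sum(axis=2), 1e-8)  # (X, Y, C)
    flat = bev.reshape(-1, bev.shape[-1])
    flat = flat - flat.mean(axis=0, keepdims=True)     # center before PCA
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:k].T                          # project on top-k axes
    return reduced.reshape(bev.shape[0], bev.shape[1], k)

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 6, 4, 5))
labels = rng.integers(0, 3, size=(6, 6, 4))            # label 0 means "empty" here
height_w = np.array([0.4, 0.3, 0.2, 0.1])              # assumed fixed weights
prior_input = bev_prior(emb, labels, empty_label=0, height_w=height_w, k=3)
print(prior_input.shape)
```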

[0015] Furthermore, this invention proposes a dynamic-static decoupled latent modeling and fusion mechanism. By constructing dynamic and static masks, the dynamic and static regions are independently encoded and modeled, and fused using learnable weights. At the same time, variance increments are introduced into the dynamic region to enhance modeling flexibility. It fully considers the differences in the temporal evolution of dynamic entities and static structures, improves the latent space's ability to express motion changes, and thus more accurately characterizes the evolution process of dynamic objects in future occupancy prediction tasks.

[0016] In the prediction phase, this invention employs a trajectory-driven spatiotemporal diffusion model for four-dimensional occupancy prediction. Using the historical occupancy sequence and the embedded historical trajectory as joint control conditions, it models the latent representation of future occupancy by progressively adding noise and then removing it in the reverse process, thereby achieving generative prediction of the future scene structure. This enables the model to generate future occupancy results consistent with the trajectory under given trajectory constraints, achieving controllability and structural consistency in the prediction process. At the same time, the diffusion model models uncertainty in the latent space, improving the stability and robustness of long-term prediction.

[0017] Furthermore, this invention proposes a two-stage training optimization mechanism combining dynamic feature alignment and diffusion self-distillation. In the early stages of training, an external visual prior is introduced, and a feature alignment loss guides the internal representation of the diffusion model to converge toward the semantic space. In the later stages, training transitions gradually to self-distillation supervision: the teacher branch is updated through EMA, and the feature differences between the student and the teacher are minimized to achieve consistent optimization within the model. This smooth transition mechanism accelerates training convergence and improves the model's generalization ability and prediction stability without increasing inference costs.

[0018] Compared with the prior art, the present invention has the following significant advantages:

1. This invention can achieve true end-to-end joint optimization of four-dimensional occupancy modeling, breaking through the information bottleneck problem caused by traditional multi-stage or pre-trained VAEs;

2. This invention improves the accuracy and temporal consistency of expressing dynamic scene evolution through a dynamic-static decoupling modeling mechanism;

3. This invention achieves unified modeling of the controllability and uncertainty of future occupancy generation through trajectory condition embedding and spatiotemporal diffusion modeling;

4. This invention significantly improves training efficiency, semantic consistency, and prediction stability through a dynamic distillation mechanism that combines external semantic alignment with self-distillation;

5. The overall framework of this invention exhibits stronger generalization ability and robustness in complex dynamic traffic environments, providing a stable and reliable foundation for world model construction and downstream planning decisions in autonomous driving systems.

Attached Figure Description

[0019] Figure 1 is an overall flowchart of the present invention; Figure 2 is a schematic diagram of the overall structure of the present invention; Figure 3 is a flowchart of the dynamic-static decoupled temporal latent prior modeling and fusion module in this invention; Figure 4 is a schematic diagram comparing E2EOcc and mainstream baselines for the method of this invention under two input settings: ground-truth 3D occupancy data, and occupancy predicted by the FBOCC module from camera input.

Detailed Implementation

[0020] It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used in this description have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used in this section is for the purpose of describing particular embodiments only and is not intended to limit the scope of protection of the invention. The invention will now be further described in conjunction with the accompanying drawings.

[0021] To achieve unified modeling and high-precision prediction of the temporal evolution of the 3D environment in complex dynamic traffic scenarios, this invention proposes a trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios. It takes the 3D occupancy sequence data of the vehicle's surrounding environment as input and directly outputs the future four-dimensional occupancy result consistent with historical trajectory constraints, realizing end-to-end trajectory-condition-driven four-dimensional occupancy prediction. As shown in Figure 2, the overall structure is a unified end-to-end modeling framework comprising a semantic embedding construction and temporal feature alignment module, a teacher network-based occupancy semantic prior construction module, a dynamic-static decoupled latent modeling and fusion module, a trajectory conditional diffusion prediction module, and a dynamic feature alignment and self-distillation optimization mechanism. Based on multi-sensor input data, the system jointly performs semantic feature learning and spatiotemporal diffusion generation modeling in the latent space. End-to-end optimization is achieved through a cross-branch feature alignment mechanism and a representation consistency loss, ultimately outputting a future occupancy prediction result with spatiotemporal consistency.

[0022] As shown in Figure 1 and Figure 2, this invention provides a trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios, comprising the following steps:

S1. Semantic embedding construction and temporal feature alignment. Based on the 3D occupancy representation of the vehicle's surrounding environment, the space is divided into a 3D voxel grid, and each voxel grid cell is assigned a corresponding discrete semantic label to obtain the occupancy semantic features. Learnable semantic embedding vectors are introduced to map the discrete semantic labels to continuous semantic embedding vectors, and the semantic features are then reshaped into a tensor. For occupancy sequence data consisting of consecutive time frames, the embeddings of each frame are stacked and aligned along the time axis to form a unified BEV feature tensor.

S2. Construction of the occupancy semantic prior based on a teacher network. Empty voxels are removed from the occupancy semantic features obtained in step S1. The remaining voxels, which are strongly correlated with one another, are sorted by category, weighted, and aggregated, and are finally mapped to a two-dimensional feature map in the bird's-eye view. Principal component analysis (PCA) is then used to reduce the dimensionality of the two-dimensional feature map to obtain dimension-reduced features. The dimension-reduced features are input to a feature extractor to obtain prior features, which serve as the semantic prior of the occupancy scene.

S3. Dynamic-static decoupled temporal latent prior modeling and fusion. Based on the prior features constructed in step S2, the dynamic and static regions of the BEV feature tensor are determined, and a dynamic mask and a static mask are constructed. The dynamic-region features and the static-region features are then input to independent encoding modules, with the encoders kept temporally consistent during sampling, to obtain a dynamic latent feature representation and a static latent feature representation according to equation (1). Next, the dynamic and static latent feature representations are combined by weighted summation using static weights to obtain the fused feature according to equation (2). The fused feature is then split along the channel dimension into two components that represent the mean and the standard deviation, respectively, and a variance increment is introduced for the dynamic region to obtain the latent variables according to equation (3). Finally, random perturbation sampling is performed on the latent variables to obtain the input of the diffusion model according to equation (4), in which the perturbation term denotes random noise. When randomly perturbing the latent variables, this invention introduces the variance increment only for the features of the dynamic region and keeps the variance of the static-region features unchanged, thereby enhancing the features of the dynamic region.
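Step S3 can be sketched as follows. The exact forms of equations (1) to (4) are not legible in this text, so everything below is an assumption made for illustration: the fusion is taken to be a convex combination controlled by a scalar static weight, the second half of the fused channels is interpreted as a log standard deviation, the increment is added to the standard deviation in dynamic cells only, and sampling uses the standard reparameterization z = mu + sigma * eps. The name fuse_and_sample and all shapes are hypothetical.

```python
import numpy as np

def fuse_and_sample(z_dyn, z_sta, w_static, dyn_mask, delta_sigma, rng):
    """z_dyn, z_sta: (2*C, H, W) encoder outputs; dyn_mask: (H, W) in {0, 1}."""
    # Assumed fusion: weighted sum of static and dynamic latents.
    fused = w_static * z_sta + (1.0 - w_static) * z_dyn
    # Split channels into mean and (assumed) log standard deviation.
    mu, log_sigma = np.split(fused, 2, axis=0)
    sigma = np.exp(log_sigma)
    # Increase the spread only where the dynamic mask is set.
    sigma = sigma + dyn_mask[None] * delta_sigma
    eps = rng.standard_normal(mu.shape)       # random noise
    return mu + sigma * eps, mu, sigma

rng = np.random.default_rng(0)
z_dyn = rng.normal(size=(8, 5, 5))
z_sta = rng.normal(size=(8, 5, 5))
dyn_mask = np.zeros((5, 5))
dyn_mask[1:3, 1:3] = 1.0                      # a small "moving object" region
z, mu, sigma = fuse_and_sample(z_dyn, z_sta, 0.6, dyn_mask, 0.5, rng)
print(z.shape)
```

With this choice, static cells keep the spread produced by the encoders, while dynamic cells are sampled with a larger spread, which matches the stated intent of perturbing only the dynamic region more strongly.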

[0023] S4. Construction of the four-dimensional occupancy prediction prior based on the diffusion model. Based on the diffusion model input obtained in step S3, a trajectory conditional embedding representation is first established, and a spatiotemporal diffusion model is then established. Random noise is denoised and reconstructed step by step to recover a structured latent representation of future occupancy, which is decoded to generate the future occupancy in its original form, thereby producing a prediction of the future scene structure. In step S4, the trajectory conditional embedding representation and the spatiotemporal diffusion model are established as follows:

S41. Trajectory conditional embedding representation. The historical trajectory is introduced as a conditional constraint signal. It is projected into a high-dimensional embedding space, fed into a trained trajectory encoder, and then fused with the time step embedding to form the trajectory conditional embedding representation according to equation (5), in which the two functions are the time step embedding function and the trajectory encoding function, respectively. In this invention, the historical trajectory is the historical motion trajectory of the vehicle, and the trajectory encoder is a pre-trained deep neural network that encodes the historical trajectory and extracts its high-dimensional embedding features.
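A minimal sketch of a trajectory conditional embedding is given below. Equation (5) is not legible in this text, so the fusion by addition, the sinusoidal time step embedding, and the toy two-layer MLP standing in for the pre-trained trajectory encoder are all assumptions; the names timestep_embedding, trajectory_encoder, and condition_embedding are hypothetical.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion step (a common choice)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def trajectory_encoder(traj, w1, w2):
    """Toy MLP over a flattened (N, 2) array of past ego waypoints."""
    h = np.maximum(traj.reshape(-1) @ w1, 0.0)   # ReLU hidden layer
    return h @ w2

def condition_embedding(t, traj, w1, w2, dim):
    # Assumed fusion: element-wise sum of the two embeddings.
    return timestep_embedding(t, dim) + trajectory_encoder(traj, w1, w2)

rng = np.random.default_rng(0)
dim, n_points = 8, 6
w1 = rng.normal(size=(n_points * 2, 16)) * 0.1   # stand-in "pre-trained" weights
w2 = rng.normal(size=(16, dim)) * 0.1
traj = np.cumsum(rng.normal(size=(n_points, 2)), axis=0)  # synthetic ego path
cond = condition_embedding(t=10, traj=traj, w1=w1, w2=w2, dim=dim)
print(cond.shape)
```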

[0024] S42. Spatiotemporal diffusion model. The diffusion model input is divided into a historical occupancy sequence and a future occupancy sequence. First, with the historical occupancy sequence and the trajectory conditional embedding representation obtained in step S41 serving as joint control conditions, noise perturbation is applied step by step to the future occupancy sequence. The noisy latent representation obtained after perturbation is then input into the DiT diffusion transformer, where a parameterized spatiotemporal denoising network estimates and outputs the predicted noise according to equation (6), whose arguments include the noisy latent representation and the diffusion time step. Subsequently, under the constraints of the historical occupancy sequence, the trajectory conditional embedding representation obtained in step S41, and the diffusion time step, the spatiotemporal diffusion model removes noise step by step through reverse iteration to obtain the future occupancy sequence consistent with the trajectory. Finally, the resulting latent representation is restored to the original occupancy resolution by a decoder composed of 3D deconvolution layers.
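The noising and noise-prediction steps can be illustrated with a standard DDPM-style formulation. This is a sketch under assumptions rather than the patent's equation (6): the linear beta schedule, the closed-form forward sample, and the mean squared error objective are the usual DDPM choices, and the denoiser argument is a hypothetical callable standing in for the DiT conditioned on the historical latents and the trajectory embedding.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, eps):
    """Closed-form forward noising of clean latents z0 at step t."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(denoiser, z_future, z_hist, cond, t, alpha_bar, rng):
    """MSE between the true noise and the denoiser's estimate."""
    eps = rng.standard_normal(z_future.shape)
    z_t = q_sample(z_future, t, alpha_bar, eps)
    eps_hat = denoiser(z_t, t, z_hist, cond)   # conditioned on history + trajectory
    return np.mean((eps_hat - eps) ** 2)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
z_hist = rng.normal(size=(2, 4, 5, 5))         # 2 past latent frames
z_future = rng.normal(size=(3, 4, 5, 5))       # 3 future latent frames
cond = rng.normal(size=(8,))

def dummy_denoiser(z_t, t, z_hist, cond):
    # Placeholder that ignores its conditions and predicts zero noise.
    return np.zeros_like(z_t)

loss = noise_prediction_loss(dummy_denoiser, z_future, z_hist, cond, 500, alpha_bar, rng)
print(float(loss))
```

Only the future frames are noised here, while the history and the trajectory embedding are passed through unchanged as conditions, which mirrors the split described in S42.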

[0025] This invention also includes training optimization of the conditional diffusion model based on dynamic feature alignment and self-distillation. After steps S1, S2, S3, and S4 are completed, a two-stage supervision strategy is introduced, consisting of an external visual prior stage and a diffusion self-distillation stage. Scheduling factors provide a smooth transition between external semantic guidance and model self-supervision, thereby completing the optimization of the spatiotemporal diffusion model. The scheduling factors are defined by equation (7) in terms of the current training round and a predefined transition round that controls the switch from external supervision to self-supervision.
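One simple way to realize such a transition is a linear ramp, sketched below. The actual form of equation (7) is not legible in this text, so the linear schedule, the complementary pair of weights, and the way the three loss terms are combined are assumptions; the names schedule and total_loss are hypothetical.

```python
def schedule(epoch, transition_epoch):
    """Return (external weight, self-distillation weight) for a training round.

    Assumed form: the external weight decays linearly to zero at the
    transition round, and the self-distillation weight is its complement.
    """
    lam_ext = max(0.0, 1.0 - epoch / transition_epoch)
    return lam_ext, 1.0 - lam_ext

def total_loss(l_diff, l_align, l_self, epoch, transition_epoch):
    """Denoising loss plus the two scheduled auxiliary terms."""
    lam_ext, lam_self = schedule(epoch, transition_epoch)
    return l_diff + lam_ext * l_align + lam_self * l_self

for epoch in (0, 5, 10, 20):
    print(epoch, schedule(epoch, transition_epoch=10))
```

A cosine or sigmoid ramp would serve the same purpose; the only property taken from the text is that supervision moves smoothly from the external prior to self-distillation around a predefined round.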

[0026] External visual prior: An external teacher model is introduced to impose semantic alignment constraints on the intermediate feature representations of the spatiotemporal diffusion model, and a feature alignment loss is adopted to guide the internal representation of the spatiotemporal diffusion model toward semantic features. The output of the diffusion transformer (DiT) at the n-th layer is mapped by a trainable projection layer into the same feature space as the prior features obtained in step S2, and the two are aligned by similarity according to equation (8). The symbols in equation (8) denote, respectively, the diffusion transformer (DiT), the variational autoencoder (VAE), and the trainable projection layer; the output of the DiT at the selected layer; the parameters of the trainable projection layer; the cosine similarity; and the number of image patches. The feature alignment loss is constructed by calculating the cosine similarity between the intermediate output of the DiT and the prior features of step S2, thereby guiding the intermediate features of the diffusion model to remain consistent with the scene semantic prior.
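A cosine-similarity alignment loss of this kind can be sketched as below. The precise form of equation (8) is not legible in this text; the sketch assumes the loss is the negative mean cosine similarity over patches between projected DiT tokens and teacher prior features, with a single linear map standing in for the trainable projection layer. The name alignment_loss and all sizes are hypothetical.

```python
import numpy as np

def alignment_loss(hidden, prior, proj):
    """Negative mean cosine similarity between projected tokens and priors.

    hidden: (N, D_h) intermediate DiT tokens, one per patch
    prior:  (N, D_p) teacher prior features for the same patches
    proj:   (D_h, D_p) linear stand-in for the trainable projection layer
    """
    mapped = hidden @ proj
    num = np.sum(mapped * prior, axis=1)
    den = np.linalg.norm(mapped, axis=1) * np.linalg.norm(prior, axis=1) + 1e-8
    return -np.mean(num / den)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 12))     # 16 patches, 12-dim tokens
prior = rng.normal(size=(16, 6))
proj = rng.normal(size=(12, 6)) * 0.1
print(float(alignment_loss(hidden, prior, proj)))
```

The loss lies in [-1, 1] and reaches its minimum when every projected token points in the same direction as its prior feature.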

[0027] Diffusion self-distillation: A teacher branch with the same structure as the student model is maintained, and its parameters are updated by an exponential moving average (EMA) of the student parameters. Self-distillation then minimizes the difference between the student features and the teacher features, giving the diffusion self-distillation loss of formula (9), in which the difference is measured by a predefined distance function.

[0028] EMA is the exponential moving average algorithm, used to update the model parameters of the teacher branch at every training step so that the teacher branch provides a smoother and more stable feature representation than the student model. The diffusion self-distillation loss is constructed by calculating the difference between the student features and the teacher features with the predefined distance function.
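The EMA teacher update and the self-distillation loss can be sketched as follows. The decay value, the dictionary-of-arrays parameter layout, and the choice of mean squared error as the predefined distance function are assumptions made for illustration only.

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """Return new teacher parameters: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }

def self_distillation_loss(student_feat, teacher_feat):
    """Mean squared error, used here as the ASSUMED distance function.

    The teacher features are treated as constants; only the student is trained.
    """
    return float(np.mean((student_feat - teacher_feat) ** 2))

# Toy parameters: the teacher drifts slowly toward the student.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
for _ in range(10):
    teacher = ema_update(teacher, student, decay=0.9)
gap = self_distillation_loss(student["w"], teacher["w"])
```

Because the teacher is an average over past student states, it changes more slowly than the student, which is what makes it a stable distillation target.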

[0029] This training process introduces external semantic alignment and self-distillation consistency constraints to provide structured guidance for the latent feature representation during diffusion denoising training. This improves the model's convergence speed, semantic consistency, and prediction stability, thereby enhancing the accuracy and robustness of the generated future occupancy results.

[0030] Embodiment: To verify the trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method proposed in this invention, a complete experimental system was constructed for functional verification and performance evaluation, and systematic experiments and comparative analysis were conducted on the public nuScenes dataset.

[0031] Figures 1-3 show the overall structure of the trajectory-condition-driven end-to-end spatiotemporal diffusion four-dimensional occupancy prediction method of this invention. The method constructs a unified end-to-end modeling framework comprising a semantic embedding construction and temporal feature alignment module, an occupancy semantic prior construction module based on a teacher network, a dynamic-static decoupled latent modeling and fusion module, a trajectory-conditional diffusion prediction module, and a dynamic feature alignment and self-distillation optimization mechanism. Within the end-to-end training framework, the method models the VAE encoder and the spatiotemporal diffusion prediction module jointly. The encoder not only compresses the occupancy features but also serves as the bridge between the input observation and the downstream prediction module, preserving geometric structure and semantic information in the latent space. The diffusion module generates and models future 4D occupancy within the unified latent space. At the same time, a dynamic distillation loss mechanism provides cross-stage consistency optimization, thereby improving prediction stability and expressive power.

[0032] To verify the feasibility and effectiveness of the method of this invention, the experiments were conducted on nuScenes, a publicly available large-scale autonomous driving benchmark dataset. The dataset contains 700 occupancy sequences for training and 150 sequences for validation. Each sequence contains approximately 40 frames sampled at 2 Hz. The spatial resolution of the occupancy data in each frame is [0.4, 0.4, 0.4] meters, the perception range is [-40 m, -40 m, -1 m, 40 m, 40 m, 5.4 m], and the corresponding occupancy grid size is [200, 200, 16]. Each voxel is assigned one of 17 semantic labels. IoU and mIoU were used to assess occupancy reconstruction quality and 4D occupancy prediction performance at the voxel level. Higher IoU and mIoU indicate better information retention during latent compression and future prediction, reflecting both reconstruction accuracy and the ability to model the evolution of a dynamic environment.
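The grid size follows directly from the perception range and the voxel resolution, and the two metrics can be computed per voxel. The sketch below checks the arithmetic and gives one common way to compute IoU (occupied versus empty) and mIoU (mean over semantic classes); the use of label 17 for empty voxels and class ids 0 to 16 for the semantic classes is an assumption, as are the function names.

```python
import numpy as np

def grid_shape(range_min, range_max, voxel_size):
    """Number of voxels per axis for a given perception range and resolution."""
    extent = np.asarray(range_max, dtype=float) - np.asarray(range_min, dtype=float)
    return tuple(int(n) for n in np.round(extent / np.asarray(voxel_size, dtype=float)))

def occupancy_iou(pred, target, empty_label):
    """Binary IoU of occupied versus empty voxels."""
    p, t = pred != empty_label, target != empty_label
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum() / union) if union else float("nan")

def mean_iou(pred, target, class_labels):
    """Mean of per-class IoU over classes present in the prediction or the target."""
    ious = []
    for c in class_labels:
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

print(grid_shape([-40, -40, -1], [40, 40, 5.4], [0.4, 0.4, 0.4]))  # (200, 200, 16)

EMPTY = 17  # ASSUMED id for empty voxels; semantic classes are taken as 0..16
pred = np.array([0, 1, EMPTY, EMPTY])
target = np.array([0, 2, EMPTY, 1])
iou = occupancy_iou(pred, target, EMPTY)
miou = mean_iou(pred, target, range(17))
```

In the toy example, two of the three voxels occupied in either array overlap, and only one of the three classes present is predicted correctly.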

[0033] To comprehensively evaluate the performance of the method of this invention, occupancy reconstruction experiments and 4D occupancy prediction experiments were conducted. In the occupancy reconstruction task, the method was compared with representative methods based on occupancy tokenization; it achieves significant improvements in both IoU and mIoU, verifying the advantages of E2EOcc-VAE in geometric structure preservation and semantic representation. In the 4D occupancy prediction task, the occupancy state for the next 3 seconds was predicted from 2 seconds of historical frames, and comparisons were made under two input settings: ground-truth occupancy input (-O), and occupancy input predicted from camera images by the FB-OCC module (-F). The overall evaluation yields an IoU of 32.36 and an mIoU of 23.10. As shown in Figure 4, E2EOcc consistently achieves the best performance, demonstrating that the present invention has stronger spatiotemporal modeling capability and prediction accuracy in complex dynamic environments.

[0034] During implementation, the E2EOcc-VAE encoder and the spatiotemporal diffusion prediction module were jointly trained within a unified end-to-end framework and optimized using the dynamic distillation loss mechanism. Training used a 2-second historical sequence as input to predict the 4D occupancy state for the next 3 seconds. The model performed generative modeling in the latent space through the diffusion mechanism, while the feature alignment loss and the self-distillation loss were combined into the dynamic distillation loss to enhance cross-stage representation consistency and improve training stability.

[0035] To further analyze the role of key modules, ablation experiments were conducted to study the impact of different training strategies and loss function combinations on performance. First, three VAE training methods were compared: unrestricted end-to-end tuning (mIoU 21.06, IoU 29.72), end-to-end training combined with the diffusion loss (mIoU 15.21, IoU 22.03), and end-to-end training combined with the dynamic distillation loss (mIoU 23.92, IoU 32.60). The results show that using the diffusion loss alone weakens the latent representation capacity, whereas introducing the dynamic distillation mechanism raises mIoU from 21.06 to 23.92 and IoU from 29.72 to 32.60 relative to unrestricted end-to-end tuning, improving both reconstruction and generation performance.

[0036] In the ablation experiment for the prediction phase, the two loss terms that make up the dynamic distillation loss were analyzed in combination. Traditional two-stage training, which uses neither loss term, obtained only mIoU 13.28 and IoU 25.10 and required 5M training steps to converge. Using only one of the two loss terms achieved mIoU 23.27 and IoU 28.16 with approximately 800k steps to convergence; using only the other achieved mIoU 22.26 and IoU 31.28 with approximately 1M steps to convergence. Using both terms together as the dynamic distillation loss achieved the best results, with mIoU 23.92 and IoU 32.60, and reduced the number of training steps to 200k. These results show that the joint loss not only improves prediction performance but also cuts the training steps needed for convergence from 5M to 200k.

[0037] Furthermore, generalization experiments were conducted on a challenging subset containing complex behaviors, and the results showed that the method of the present invention still maintains stable performance in complex dynamic scenarios, further verifying its adaptability to diverse motion patterns.

[0038] In summary, this embodiment verifies the feasibility and advancement of the present invention for 4D occupancy modeling and future state prediction in dynamic traffic scenarios. Compared with existing technologies, the present invention significantly improves occupancy reconstruction accuracy, prediction accuracy, and training efficiency in complex environments through an end-to-end unified diffusion modeling mechanism and a dynamic distillation training strategy. This method has good engineering feasibility and is suitable for environmental understanding and dynamic world modeling tasks in autonomous driving systems.

[0039] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the claimed invention.

Claims

1. A trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios, characterized in that it comprises the following steps:
S1. Semantic embedding construction and temporal feature alignment: Based on the 3D occupancy representation of the vehicle's surrounding environment, the space is divided into a 3D voxel grid, and each voxel is assigned a corresponding discrete semantic label to obtain occupancy semantic features. Learnable semantic embedding vectors are introduced to map the discrete semantic labels to continuous semantic embedding vectors, and the semantic features are further reshaped into a tensor. For occupancy sequence data consisting of consecutive time frames, the embeddings of each frame are stacked and aligned along the time axis to form a unified BEV feature tensor;
S2. Construction of the occupancy semantic prior based on a teacher network: Empty voxels are removed from the occupancy semantic features obtained in step S1; the remaining voxels are sorted by category in combination with their correlation strength and aggregated with weights, and are finally mapped to a two-dimensional feature map in the bird's-eye view. Principal component analysis (PCA) is then used to reduce the dimensionality of the two-dimensional feature map to obtain dimensionality-reduced features. The dimensionality-reduced features are input to a feature extractor to obtain prior features, which serve as the semantic prior of the occupancy scene;
S3. Dynamic-static decoupled temporal latent prior modeling and fusion: Based on the prior features constructed in step S2, the dynamic and static regions of the BEV feature tensor are determined, and a dynamic mask and a static mask are constructed. The dynamic region features and the static region features are then input into two independent encoding modules, respectively; the encoders maintain temporal consistency during sampling and yield a dynamic latent feature representation and a static latent feature representation according to formula (1). Using static weights, the dynamic latent feature representation and the static latent feature representation are combined by weighted summation to obtain the fused feature according to formula (2). The fused feature is then decomposed along the channel dimension into two parameters representing the mean and the standard deviation, respectively, and a variance increment introduced for the dynamic region is applied to obtain the latent variable according to formula (3). Random perturbation sampling is then performed on the latent variable with random noise to obtain the diffusion model input according to formula (4);
S4. Construction of the four-dimensional occupancy prediction prior based on the diffusion model: Based on the diffusion model input obtained in step S3, a trajectory conditional embedding representation is first established and a spatiotemporal diffusion model is then established; random noise is gradually denoised and reconstructed to recover a structured latent representation of future occupancy. After decoding, the future occupancy at the original resolution is generated, thereby producing a prediction of the future scene structure.

2. The trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios as described in claim 1, characterized in that the establishment of the trajectory conditional embedding representation and of the spatiotemporal diffusion model in step S4 is specifically as follows:
S41: Trajectory conditional embedding representation: The historical trajectory is introduced as a conditional constraint signal. The historical trajectory is projected into a high-dimensional embedding space, fed into a trained trajectory encoder, and then fused with the temporal embedding to form the trajectory conditional embedding representation according to formula (5), which involves the temporal embedding and the trajectory encoding function;
S42: Spatiotemporal diffusion model: The diffusion model input is divided into a historical occupancy sequence and a future occupancy sequence. First, with the historical occupancy sequence and the trajectory conditional embedding representation obtained in step S41 serving as joint conditions, noise perturbation is applied step by step to the future occupancy sequence. The noised latent representation is then input into the diffusion transformer (DiT), where a parameterized spatiotemporal denoising network estimates and outputs the predicted noise according to formula (6), whose arguments include the noisy latent representation and the diffusion time step. Subsequently, under the constraints of the historical occupancy sequence, the trajectory conditional embedding representation obtained in step S41, and the diffusion time step, the spatiotemporal diffusion model removes the noise step by step through reverse iteration to obtain a future occupancy sequence consistent with the trajectory. Finally, the resulting latent representation is passed through a decoder composed of 3D deconvolution layers to reconstruct the original occupancy resolution.

3. The trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios as described in claim 2, characterized in that it further comprises training optimization of the conditional diffusion model based on dynamic feature alignment and self-distillation: After steps S1, S2, S3, and S4 are completed, a two-stage supervision strategy is introduced, consisting of an external visual prior stage and a diffusion self-distillation stage. Two scheduling factors provide a smooth transition between external semantic guidance and model self-supervision, thereby completing the optimization of the spatiotemporal diffusion model. The scheduling factors are defined by formula (7) in terms of the current training round and a predefined transition round that controls the switch from external supervision to self-supervision.

4. The trajectory-condition-driven spatiotemporal diffusion four-dimensional occupancy prediction method for intelligent driving scenarios as described in claim 3, characterized in that:
External visual prior: An external teacher model is introduced to impose semantic alignment constraints on the intermediate feature representations of the spatiotemporal diffusion model, and a feature alignment loss is adopted to guide the internal representation of the spatiotemporal diffusion model toward semantic features. The output of the diffusion transformer (DiT) at the n-th layer is mapped by a trainable projection layer into the same feature space as the prior features obtained in step S2, and similarity alignment is performed between the two according to formula (8). The quantities appearing in formula (8) are the diffusion transformer (DiT), the variational autoencoder (VAE), and the trainable projection layer; the output of the DiT at the n-th layer; the parameters of the trainable projection layer; the cosine similarity; and the number of image patches;
Diffusion self-distillation: A teacher branch with the same structure as the student model is maintained, and its parameters are updated by an exponential moving average (EMA) of the student parameters. Self-distillation then minimizes the difference between the student features and the teacher features, giving the diffusion self-distillation loss of formula (9), in which the difference is measured by a predefined distance function.