A self-supervised street pedestrian three-dimensional reconstruction method based on weakly supervised semantic recognition
By employing weakly supervised semantic recognition and self-supervised optimization, this technology addresses the issues of 3D annotation dependence, low semantic separation accuracy, and insufficient motion modeling in existing pedestrian reconstruction techniques. It achieves efficient and accurate 3D reconstruction of pedestrians on streets, making it suitable for autonomous driving systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUZHOU UNIVERSITY
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
AI Technical Summary
Existing 3DGS-based street scene reconstruction methods suffer from problems such as high dependence on 3D annotations, low accuracy of pedestrian semantic separation, insufficient non-rigid motion modeling, and poor temporal consistency when reconstructing pedestrians, making it difficult to meet the high-quality 3D reconstruction requirements of autonomous driving systems.
A self-supervised method based on weakly supervised semantic recognition is adopted. Three-dimensional Gaussian primitives are initialized by sparse point clouds. Combined with a multi-resolution six-plane spatiotemporal encoder and a multi-head Gaussian decoder, a self-attention mechanism and residual Gaussian deformation prediction are used to apply pedestrian-specific motion constraints to achieve end-to-end optimization.
It eliminates the need for 3D bounding box annotation, significantly reducing manual annotation costs, improving pedestrian region recognition accuracy and motion coordination, enhancing the temporal coherence and geometric accuracy of reconstruction results, and meeting the real-time reconstruction needs of autonomous driving scenarios.
Figure CN122244277A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition.
Background Technology
[0002] The safety verification of autonomous driving technology heavily relies on high-quality 3D reconstruction of complex street scenes. As the most critical vulnerable road users, pedestrians' accurate 3D reconstruction not only provides necessary spatial perception information for autonomous driving systems, but can also be used to construct high-fidelity simulation test scenarios, significantly reducing the cost and risk of real-vehicle testing.
[0003] 3D Gaussian Splatting (3DGS) uses anisotropic 3D Gaussian ellipsoids as scene representation primitives and achieves end-to-end optimization through a differentiable rasterization rendering pipeline. It strikes a good balance between real-time rendering efficiency and reconstruction quality, providing an effective technical foundation for dynamic scene reconstruction.
[0004] However, existing 3DGS-based street scene reconstruction methods have the following main drawbacks when applied to pedestrian reconstruction:
[0005] First, there is a heavy reliance on 3D annotation. Existing methods such as StreetGaussians and DrivingGaussians generally rely on manually annotated 3D bounding boxes to separate static backgrounds from dynamic targets, resulting in high annotation costs and difficulty in meeting the automated processing needs of large-scale urban scenes.
[0006] Second, the semantic separation capability between pedestrians and vehicles is insufficient. Although existing methods such as HUGS incorporate semantic information, they mainly model rigid targets such as vehicles and lack a fine recognition mechanism specifically for pedestrians, resulting in limited accuracy in pedestrian region recognition.
[0007] Third, the ability to model non-rigid motion of pedestrians is insufficient. Existing methods usually use a uniform spatiotemporal deformation field to model all dynamic targets in the scene without differentiation, lacking specific constraints for the motion characteristics of pedestrian limbs; while pedestrian motion is non-rigid and has the characteristics of multi-joint coupling, which makes it impossible for the uniform deformation field to effectively model the motion coordination relationship between limbs, ultimately resulting in obvious jitter and non-physical abrupt changes in the reconstruction results in the temporal dimension.
[0008] Fourth, poor temporal consistency. Purely data-driven deformation fields lack explicit constraints on the motion coherence between adjacent frames of pedestrians, and cannot effectively suppress non-cooperative deformation of local areas of pedestrian limbs.
[0009] Therefore, there is an urgent need for a method that can perform fine semantic recognition of pedestrians in street scenes and achieve high-quality self-supervised 3D reconstruction without the need for 3D bounding box annotation.
Summary of the Invention
[0010] This invention provides a self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition to address the problems existing in the prior art.
[0011] The technical solutions adopted in this invention are as follows:
[0012] A self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition includes the following steps:
[0013] S1: Extract sparse point clouds from a multi-frame RGB image sequence, initialize a set of three-dimensional Gaussian primitives in the canonical space based on the sparse point clouds, and assign a learnable semantic attribute vector to each three-dimensional Gaussian primitive.
[0014] S2: Perform frame-by-frame semantic segmentation of the RGB image sequence using a pre-trained two-dimensional semantic segmentation network to generate pixel-level pseudo-labels containing three categories: pedestrians, vehicles, and background. Drive the three-dimensional Gaussian primitives to learn their semantic assignment to pedestrians, vehicles, and background through a confidence-weighted cross-entropy loss and a semantic entropy regularization term.
[0015] S3: Input the canonical-space position coordinates and timestamp of each three-dimensional Gaussian primitive into the multi-resolution six-plane spatiotemporal encoder. After extracting the features of each resolution level by bilinear interpolation, use the self-attention mechanism to perform cross-scale feature fusion, and calculate the temporal difference features of adjacent time points to obtain the fused spatiotemporal feature vector.
[0016] S4: Input the fused spatiotemporal feature vector into the multi-head Gaussian decoder, and predict the position offset, rotation increment, scaling adjustment and opacity change through four parallel multilayer perceptron branches respectively. Superimpose these predictions as residuals onto the position, rotation, scaling and opacity parameters of the three-dimensional Gaussian primitives in the canonical space to obtain the deformed three-dimensional Gaussian primitive parameters at each time step.
[0017] S5: Apply temporal smoothing constraints and local consistency constraints only to the three-dimensional Gaussian primitives of the pedestrian category. The temporal smoothing constraints suppress non-physical abrupt changes in pedestrian motion, and the local consistency constraints maintain the local motion coordination of pedestrians.
[0018] S6: Render RGB images, depth maps, and semantic segmentation maps through a differentiable Gaussian splatting rendering pipeline, and perform end-to-end joint optimization of all parameters based on a multi-task loss function.
[0019] Furthermore, in S1, sparse point clouds are extracted from the multi-frame RGB image sequence using the structure-from-motion (SfM) method.
[0020] Further, in S1, the semantic attribute vector is a C-dimensional semantic logit vector, which is normalized by softmax to obtain the category probability distribution. The number of semantic categories C=3 corresponds to the three categories of pedestrians, vehicles, and background. After training convergence, all three-dimensional Gaussian elements are divided into background three-dimensional Gaussian elements, vehicle three-dimensional Gaussian elements, and pedestrian three-dimensional Gaussian elements according to the category corresponding to the maximum semantic probability of each three-dimensional Gaussian element.
[0021] Furthermore, in S2, the confidence-weighted cross-entropy loss uses the maximum probability of each pixel category predicted by the two-dimensional semantic segmentation network as the confidence weight.
[0022] Furthermore, in S3, the multi-resolution six-plane spatiotemporal encoder projects four-dimensional spatiotemporal coordinates onto three spatial planes and three spatiotemporal planes, for a total of six two-dimensional feature planes; the three spatial planes are xy, xz, and yz planes, and the three spatiotemporal planes are xt, yt, and zt planes.
[0023] Furthermore, in S3, the multi-resolution six-plane spatiotemporal encoder contains $L$ resolution levels, each with a feature dimension of $D$. The resolution $R_l$ of the $l$-th level feature plane increases with $l$ according to the rule [formula missing], where $R_0$ is the base resolution, with value [value missing].
[0024] Furthermore, in S4, the update rules for residual methods are as follows: position updates use additive residuals, rotation updates use quaternion multiplication residuals, scaling updates use exponential mapping residuals, and opacity updates use sigmoid residuals.
[0025] Furthermore, in S5, the temporal smoothing constraint is achieved by penalizing the second-order difference of the pedestrian position deformation across adjacent frames, and the local consistency constraint is achieved by penalizing the difference in deformation between neighboring three-dimensional Gaussian primitives in the canonical space that satisfy the preset nearest neighbor rule.
[0026] Furthermore, the preset nearest neighbor rule is a set of k nearest neighbors that is pre-computed using the KD-Tree algorithm and kept fixed during training.
[0027] Furthermore, in S6, the multi-task loss function includes photometric consistency loss, depth supervision loss, optical flow consistency loss, semantic segmentation loss, and pedestrian motion constraint loss in S5, and adopts a three-stage progressive weight adjustment strategy for end-to-end joint optimization.
[0028] The present invention has the following beneficial effects:
[0029] (1) No three-dimensional bounding box annotation is required. Two-dimensional semantic segmentation pseudo-labels are used as weak supervision signals, which greatly reduces the cost of manual annotation and adapts to the automated processing needs of large-scale urban scenarios.
[0030] (2) By using confidence weighting and semantic entropy regularization, fine semantic separation of pedestrians, vehicles and background is achieved, which significantly improves the adaptability to occlusion and semantically blurred regions. The pedestrian region recognition accuracy is better than that of traditional hard bounding box segmentation methods.
[0031] (3) By using multi-resolution six-plane spatiotemporal coding and residual Gaussian deformation prediction, the complex non-rigid motion features of pedestrians are accurately captured, and the detail preservation, structural similarity and geometric accuracy of the reconstructed image are better than existing baseline methods.
[0032] (4) By applying temporal smoothing constraints and local consistency constraints only to pedestrian categories, the motion coordination relationship of pedestrian limbs is accurately modeled, effectively suppressing inter-frame jitter and non-physical deformation of reconstruction results, and greatly improving the temporal coherence and inter-frame stability of pedestrian motion.
[0033] (5) Because optimization is end to end through the differentiable Gaussian splatting rendering pipeline, training and inference are efficient, which can meet the online real-time reconstruction needs of autonomous driving scenarios.
Attached Figure Description
[0034] Figure 1 is a schematic diagram of the overall framework of the present invention.
[0035] Figure 2 is a schematic diagram of the weakly supervised semantic rendering process of the present invention.
[0036] Figure 3 is a schematic diagram of the pedestrian motion constraints of the present invention.
Detailed Implementation
[0037] The self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition is described in further detail below with reference to the accompanying drawings.
[0038] The method of this invention is an end-to-end self-supervised optimization framework built on 3D Gaussian splatting. It takes a multi-frame RGB image sequence collected in an autonomous driving scenario as the core input and uses a differentiable Gaussian splatting rendering pipeline to complete high-precision 3D reconstruction of pedestrians on the street. It addresses the problems of existing 3DGS-based reconstruction methods: dependence on 3D annotation, low accuracy of pedestrian semantic separation, insufficient non-rigid motion modeling, and poor temporal consistency.
[0039] The overall end-to-end framework of the method is shown in Figure 1. It takes the initial 4D observation data as input, where the 4D data corresponds to the canonical-space coordinates and timestamps of the three-dimensional Gaussian primitives. The framework includes five core modules: a multi-resolution six-plane spatiotemporal encoder, a semantic recognition module, a multi-head Gaussian decoder, a self-supervised optimization module, and a rendering layer. The modules work together to form a complete optimization loop through end-to-end backpropagation and parameter updates, and the framework finally outputs the three-dimensional reconstruction results of pedestrians on the street.
[0040] First, initialization is performed from the input multi-frame RGB image sequence: the sparse point cloud of the scene is extracted from the sequence using the structure-from-motion (SfM) method. This method relies on feature matching between images and spatial geometric constraints, so it can quickly reconstruct the sparse 3D structure of the scene and provide the spatial locations needed for the subsequent initialization of the 3D Gaussian primitives.
[0041] In the canonical space, the set of three-dimensional Gaussian primitives is initialized based on the extracted sparse point cloud. At the same time, each three-dimensional Gaussian primitive is assigned a learnable semantic attribute vector, which is a C-dimensional semantic logit vector, where the number of semantic categories C=3, corresponding to the three target categories of pedestrians, vehicles and background.
[0042] After softmax normalization, the C-dimensional semantic logit vector yields the category probability distribution of each 3D Gaussian primitive. Once model training converges, all 3D Gaussian primitives are classified as background, vehicle, or pedestrian primitives according to the category with the maximum semantic probability. This achieves a preliminary separation of static and dynamic targets in the scene and provides the classification needed to apply specific motion constraints only to pedestrians. The initialized 3D Gaussian primitives, together with their canonical-space positions and timestamps, constitute the initial 4D observation data shown in Figure 1, which serves as the core input of the entire process.
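The classification step described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the class index order, the function names, and the example logits are assumptions that do not come from the patent text.

```python
import numpy as np

# Assumed index order of the C = 3 categories (not specified in the source).
CLASS_NAMES = ("pedestrian", "vehicle", "background")

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def split_by_class(logits: np.ndarray) -> dict:
    """Group primitive indices by the category with maximum probability."""
    labels = softmax(logits).argmax(axis=-1)
    return {name: np.flatnonzero(labels == c) for c, name in enumerate(CLASS_NAMES)}

# Hypothetical semantic logits for four Gaussian primitives.
logits = np.array([[2.0, 0.1, -1.0],
                   [0.0, 3.0, 0.5],
                   [-1.0, 0.2, 1.5],
                   [1.2, 1.0, 0.9]])
groups = split_by_class(logits)
```

Only the indices in `groups["pedestrian"]` would later receive the pedestrian-specific motion constraints.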
[0043] After initializing the 3D Gaussian primitives, weakly supervised pedestrian region recognition is performed. The semantic recognition module, shown in the yellow dashed box in Figure 1, adopts a Mask2Former-guided weakly supervised semantic probability allocation mechanism. It uses a pre-trained Mask2Former 2D semantic segmentation network to segment the input multi-frame RGB image sequence frame by frame, generating pixel-level pseudo-labels for pedestrians, vehicles, and background. These pixel-level pseudo-labels serve as weak supervision signals for the subsequent semantic supervision. The entire process does not rely on manually annotated 3D bounding boxes, which greatly reduces the cost and workload of manual annotation.
[0044] The weakly supervised semantic rendering process in this step is shown in Figure 2. The semantic attributes of the 3D Gaussian primitives are projected onto the 2D image plane by alpha blending along each ray, which accumulates the semantic probabilities. The rendered semantic probability at pixel $u$ in the $t$-th time step is calculated as:
[0045] $$\hat{S}_t(u) = \sum_{i \in \mathcal{N}(u)} T_i \, \alpha_i \, \mathbf{p}_i,$$
[0046] where $\mathcal{N}(u)$ is the set of visible Gaussian primitives affecting pixel $u$; $T_i$ is the cumulative transmittance, representing the probability that the ray from the camera to Gaussian primitive $i$ is not occluded; $\alpha_i$ is the opacity contribution of Gaussian primitive $i$ at pixel $u$; and $\mathbf{p}_i$ is the $C$-dimensional semantic category probability vector of Gaussian primitive $i$.
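As a minimal sketch of the alpha blending just described, the following NumPy snippet accumulates semantic probabilities for a single pixel. It assumes the primitives are already sorted front to back and uses the standard transmittance $T_i = \prod_{j<i}(1-\alpha_j)$; the function name and the sample values are hypothetical.

```python
import numpy as np

def render_semantics(alphas: np.ndarray, probs: np.ndarray):
    """Alpha-blend per-primitive class probabilities for one pixel.

    alphas: (K,) opacity contributions, sorted front to back (assumed).
    probs:  (K, C) semantic probability vectors of the same primitives.
    """
    # Transmittance before primitive i: product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return weights @ probs, weights

# Three primitives, each fully confident in a different class.
alphas = np.array([0.5, 0.5, 1.0])
probs = np.eye(3)
rendered, weights = render_semantics(alphas, probs)
```

With these inputs the blending weights are 0.5, 0.25, and 0.25, so the nearest primitive dominates the rendered distribution.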
[0047] During semantic supervision, the maximum category probability predicted by the two-dimensional semantic segmentation network at each pixel is used as the confidence weight to construct a confidence-weighted cross-entropy loss:
[0048] $$\mathcal{L}_{ce} = -\sum_{t=1}^{T} \sum_{u \in \Omega} w_u \sum_{c=1}^{C} y_{t,u,c} \log \hat{S}_{t,u,c},$$
[0049] where $T$ is the number of training frames; $\Omega$ is the set of image pixels; $c$ is the semantic category index; $w_u$ is the confidence weight of pixel $u$, given by the maximum predicted probability; $y_{t,u,c}$ is the pseudo-label value of pixel $u$ in the $t$-th frame for category $c$; and $\hat{S}_{t,u,c}$ is the corresponding rendered semantic probability.
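A confidence-weighted cross-entropy of this kind might look as follows in NumPy. The sketch treats the pseudo-label as the arg-max class of the 2D network output and averages over pixels; both the averaging and the `eps` stabilizer are assumptions, since the source formula was not recoverable in full.

```python
import numpy as np

def confidence_weighted_ce(rendered: np.ndarray, teacher_probs: np.ndarray,
                           eps: float = 1e-8) -> float:
    """rendered: (P, C) rendered semantic probabilities for P pixels.
    teacher_probs: (P, C) class probabilities from the 2D segmentation network.
    """
    conf = teacher_probs.max(axis=-1)        # confidence weight per pixel
    labels = teacher_probs.argmax(axis=-1)   # hard pseudo-label per pixel
    picked = rendered[np.arange(len(labels)), labels]
    return float(np.mean(-conf * np.log(picked + eps)))

# Two pixels with the same rendering but different teacher confidence.
rendered = np.array([[0.5, 0.25, 0.25],
                     [0.5, 0.25, 0.25]])
teacher = np.array([[0.9, 0.05, 0.05],
                    [0.4, 0.3, 0.3]])
loss = confidence_weighted_ce(rendered, teacher)
```

The second pixel has a lower teacher confidence (0.4 versus 0.9), so it contributes less to the loss even though the rendered probability is identical.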
[0050] Simultaneously, a semantic entropy regularization term is introduced:
[0051] $$\mathcal{L}_{ent} = -\sum_{i} \sum_{c=1}^{C} p_{i,c} \log p_{i,c},$$
[0052] where $p_{i,c}$ is the semantic probability of Gaussian primitive $i$ for category $c$.
[0053] By jointly driving the confidence-weighted cross-entropy loss and the semantic entropy regularization term, the 3D Gaussian units autonomously learn the semantic classification of pedestrians, vehicles and background. The semantic entropy regularization term can promote the semantic distribution of each Gaussian unit to tend towards single-class determinism, avoid semantic classification ambiguity, improve the recognition adaptability of occluded and semantically ambiguous regions, and finally output the semantic supervision signal to the self-supervised optimization module.
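To illustrate why the entropy term pushes each primitive toward a single class, here is a small sketch that evaluates the mean Shannon entropy of per-primitive class distributions. Averaging over primitives and the `eps` stabilizer are assumptions made for this example.

```python
import numpy as np

def semantic_entropy(probs: np.ndarray, eps: float = 1e-8) -> float:
    """Mean Shannon entropy of (N, C) per-primitive class distributions."""
    return float(np.mean(-np.sum(probs * np.log(probs + eps), axis=-1)))

ambiguous = np.full((1, 3), 1.0 / 3.0)    # maximally uncertain primitive
decided = np.array([[1.0, 0.0, 0.0]])     # single-class primitive
h_ambiguous = semantic_entropy(ambiguous)
h_decided = semantic_entropy(decided)
```

The ambiguous primitive has entropy close to ln 3, while the decided one has entropy close to zero, so minimizing this term favors unambiguous assignments.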
[0054] Next, multi-resolution spatiotemporal feature encoding is performed, corresponding to the multi-resolution six-plane spatiotemporal encoder shown in the blue dashed box in Figure 1. For each three-dimensional Gaussian primitive, the position coordinates in canonical space and the corresponding timestamp are extracted, and the resulting four-dimensional spatiotemporal coordinates $(x, y, z, t)$ are input into the multi-resolution six-plane spatiotemporal encoder for spatiotemporal feature extraction and encoding.
[0055] This multi-resolution six-plane spatiotemporal encoder projects four-dimensional spatiotemporal coordinates onto three spatial planes and three spatiotemporal planes, for a total of six two-dimensional feature planes. The three spatial planes are specifically the xy plane, xz plane, and yz plane, and the three spatiotemporal planes are specifically the xt plane, yt plane, and zt plane. By projecting onto multiple planes, it achieves comprehensive capture of spatiotemporal information.
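The six-plane lookup can be sketched for a single resolution level as follows. The sketch assumes coordinates normalized to [0, 1], square grids of shape (R, R, D), and aggregation of the six plane features by element-wise product as the description specifies; the function names and grid sizes are illustrative, not taken from the patent.

```python
import numpy as np

# Axis pairs for the xy, xz, yz, xt, yt, zt planes of a point (x, y, z, t).
PLANES = ((0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3))

def bilinear(grid: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly interpolate a (R, R, D) grid at normalized (u, v) in [0, 1]."""
    r = grid.shape[0]
    fu, fv = u * (r - 1), v * (r - 1)
    u0, v0 = int(np.floor(fu)), int(np.floor(fv))
    u1, v1 = min(u0 + 1, r - 1), min(v0 + 1, r - 1)
    a, b = fu - u0, fv - v0
    return ((1 - a) * (1 - b) * grid[u0, v0] + a * (1 - b) * grid[u1, v0]
            + (1 - a) * b * grid[u0, v1] + a * b * grid[u1, v1])

def hexplane_feature(grids, xyzt: np.ndarray) -> np.ndarray:
    """Query six planes and combine their features by element-wise product."""
    feat = np.ones(grids[0].shape[-1])
    for grid, (i, j) in zip(grids, PLANES):
        feat = feat * bilinear(grid, xyzt[i], xyzt[j])
    return feat

rng = np.random.default_rng(0)
grids = [rng.normal(size=(8, 8, 4)) for _ in PLANES]   # R = 8, D = 4 (assumed)
feature = hexplane_feature(grids, np.array([0.2, 0.5, 0.7, 0.1]))
```

A multi-resolution encoder would repeat this lookup on grids of increasing resolution and pass the per-level features on for fusion.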
[0056] The encoder contains $L$ resolution levels, each with a feature dimension of $D$. The resolution $R_l$ of the $l$-th level feature plane increases with $l$ according to the rule [formula missing], where $R_0$ is the base resolution, with value [value missing].
[0057] For Gaussian primitive $i$ at time $t$, features are queried from the six two-dimensional feature grids by bilinear interpolation and aggregated by element-wise product to obtain the $l$-th level feature $\mathbf{f}_i^{l}(t)$. The features from all levels are then input into a single-head self-attention module for cross-scale feature fusion:
[0058] $$\mathbf{f}_i^{fuse}(t) = \mathrm{SA}\left(\mathbf{f}_i^{1}(t) \oplus \mathbf{f}_i^{2}(t) \oplus \cdots \oplus \mathbf{f}_i^{L}(t)\right),$$
[0059] where $\oplus$ denotes concatenation along the feature dimension and $\mathrm{SA}(\cdot)$ is a single-head self-attention module. The self-attention mechanism allows the network to dynamically adjust the weights of features at each scale according to the type of motion. For periodic motions such as walking, it emphasizes global low-frequency features, while for fast movements such as turning and waving, it emphasizes local high-frequency features.
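One plausible reading of the cross-scale fusion is sketched below: each resolution level is treated as one token, a single attention head mixes the tokens, and the attended tokens are concatenated. The token layout, the projection matrices `wq`, `wk`, `wv`, and the sizes are assumptions for illustration; the patent does not spell out these details.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_levels(level_feats, wq, wk, wv):
    """Single-head self-attention over L level tokens of dimension D.

    level_feats: (L, D) per-level features of one Gaussian primitive.
    Returns the concatenated attended features (L * D,) and the (L, L) weights.
    """
    q, k, v = level_feats @ wq, level_feats @ wk, level_feats @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return (attn @ v).reshape(-1), attn

rng = np.random.default_rng(0)
L, D = 4, 8                                  # assumed sizes
feats = rng.normal(size=(L, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
fused, attn = fuse_levels(feats, wq, wk, wv)
```

Each row of `attn` shows how strongly one scale draws on the others, which is the mechanism the description credits with balancing low-frequency and high-frequency features.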
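[0060] In addition, temporal difference features between adjacent time points are calculated to capture the rate of change of motion:
[0061] $$\Delta\mathbf{f}_i(t) = \mathbf{f}^{fuse}(\mathbf{x}_i, t + \Delta t) - \mathbf{f}^{fuse}(\mathbf{x}_i, t),$$
[0062] where $\mathbf{x}_i$ is the canonical-space position of Gaussian primitive $i$, $\mathbf{f}^{fuse}$ is the cross-scale fused feature, and $\Delta\mathbf{f}_i(t)$ is the temporal difference feature. The fused feature and the temporal difference feature are concatenated along the feature dimension and jointly input into the multi-head Gaussian decoder to provide richer spatiotemporal context.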
[0063] For the features output from each resolution layer, feature extraction is first performed using bilinear interpolation. Then, a self-attention mechanism is used to fuse the extracted features across resolution layers across scales. This self-attention mechanism dynamically adjusts the weights of features at each scale according to the motion characteristics of the scene, achieving an effective combination of global low-frequency features and local high-frequency features. After completing the cross-scale feature fusion, the temporal difference features between adjacent time steps are further calculated. The cross-scale fused features and the temporal difference features are then concatenated along the feature dimension to obtain a fused spatiotemporal feature vector. This vector fully integrates spatial location information, temporal series information, and motion change rate information, providing a rich and accurate feature foundation for subsequent Gaussian deformation prediction, and is output to the multi-head Gaussian decoder.
[0064] After obtaining the fused spatiotemporal feature vectors, Gaussian deformation prediction is performed, corresponding to the multi-head Gaussian decoder shown in the green dashed box in Figure 1. The decoder takes the fused spatiotemporal feature vector as input and has four parallel multilayer perceptron branches, which predict the position offset, rotation increment, scaling adjustment, and opacity change of each three-dimensional Gaussian primitive.
[0065] Each predicted deformation increment is superimposed as a residual onto the corresponding position, rotation, scaling, or opacity parameter of the 3D Gaussian primitive in the canonical space. This yields the deformed 3D Gaussian primitive parameters at each time step and thereby models the dynamic deformation of the primitives over time.
[0066] The specific update rules of the residual method are as follows. The position update uses an additive residual, directly adding the predicted position offset to the canonical position:
[0067] $$\mathbf{x}_i^{t} = \mathbf{x}_i + \Delta\mathbf{x}_i^{t}.$$
[0068] The rotation update uses a quaternion multiplication residual, which composes the rotation increment with the canonical rotation through quaternion multiplication:
[0069] $$\mathbf{q}_i^{t} = \Delta\mathbf{q}_i^{t} \otimes \mathbf{q}_i,$$
[0070] where $\otimes$ denotes quaternion multiplication, which ensures closure of the rotation group. The scaling update uses an exponential mapping residual, in which the scaling adjustment is passed through an exponential map and then combined with the canonical scaling:
[0071] $$\mathbf{s}_i^{t} = \mathbf{s}_i \odot \exp\left(\Delta\mathbf{s}_i^{t}\right),$$
[0072] where $\odot$ denotes element-wise multiplication; the exponential map ensures that the scaling parameter is always positive.
[0073] The opacity update uses a sigmoid residual, in which the opacity change is normalized by a sigmoid:
[0074] $$o_i^{t} = \sigma\left(o_i + \Delta o_i^{t}\right),$$
[0075] which keeps the opacity parameter within the range $(0, 1)$. This targeted residual update significantly reduces the learning difficulty of the network, improves the stability and accuracy of Gaussian deformation prediction, and outputs the deformed 3D Gaussian primitive parameters to the self-supervised optimization module.
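The four residual update rules can be sketched together in NumPy. The sketch assumes quaternions in (w, x, y, z) order, left-multiplication of the increment, and an opacity residual applied in logit space before the sigmoid; these conventions and all names are assumptions, since the source formulas did not survive extraction.

```python
import numpy as np

def quat_mul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def apply_residuals(x, q, s, o, dx, dq, ds, do):
    """Deform one canonical primitive with the four predicted residuals."""
    dq = dq / np.linalg.norm(dq)            # keep the increment a unit quaternion
    q_new = quat_mul(dq, q)
    q_new = q_new / np.linalg.norm(q_new)
    return x + dx, q_new, s * np.exp(ds), sigmoid(logit(o) + do)

half = np.sqrt(0.5)
rot_z90 = np.array([half, 0.0, 0.0, half])  # 90 degrees about the z axis
x_new, q_new, s_new, o_new = apply_residuals(
    x=np.zeros(3), q=rot_z90, s=np.ones(3), o=0.5,
    dx=np.array([0.1, 0.0, 0.0]), dq=rot_z90,
    ds=np.array([-2.0, 0.0, 2.0]), do=1.0)
```

Composing two 90 degree rotations about z gives a 180 degree rotation, the scale stays positive even for a negative adjustment, and the opacity stays inside (0, 1).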
[0076] After obtaining the deformed three-dimensional Gaussian primitive parameters at each time step, pedestrian-specific motion constraints are applied. The constraint design in this step is shown in Figure 3. These constraints apply only to 3D Gaussian primitives of the pedestrian category; background and vehicle primitives do not participate in their calculation. This enables specialized modeling of the non-rigid motion of pedestrians and corresponds to the (Motion constraint) loss term in the self-supervised optimization module of Figure 1.
[0077] The temporal smoothing constraint is achieved by penalizing the second-order difference of pedestrian position deformation across adjacent frames, and its loss function is calculated as:
[0078] $$\mathcal{L}_{temp} = \frac{1}{N_p (T-2)} \sum_{i=1}^{N_p} \sum_{t=2}^{T-1} \left\| \Delta\mathbf{x}_i^{t+1} - 2\Delta\mathbf{x}_i^{t} + \Delta\mathbf{x}_i^{t-1} \right\|_2^2,$$
[0079] where $N_p$ is the number of pedestrian-category Gaussian primitives, $T$ is the total number of frames in the sequence, and $\Delta\mathbf{x}_i^{t}$ is the position deformation of Gaussian primitive $i$ at time $t$. This constraint encourages uniform or low-acceleration motion, making the trajectory smoother for periodic movements such as walking and suppressing false jitter for stationary pedestrians.
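A second-order difference penalty of this kind is short to write down. The sketch below assumes the position offsets of the pedestrian primitives are stored as a (T, N, 3) array and that the penalty is averaged over frames and primitives; the array layout and the averaging are assumptions.

```python
import numpy as np

def temporal_smoothness(disp: np.ndarray) -> float:
    """Mean squared second-order difference of position offsets.

    disp: (T, N, 3) offsets of N pedestrian primitives over T frames.
    """
    accel = disp[2:] - 2.0 * disp[1:-1] + disp[:-2]   # discrete acceleration
    return float(np.mean(np.sum(accel ** 2, axis=-1)))

frames = np.arange(5, dtype=float)
# Constant-velocity motion: offsets grow linearly with time.
linear = frames[:, None, None] * np.ones((1, 2, 3))
# Accelerating motion along x: offsets grow quadratically with time.
quadratic = (frames ** 2)[:, None, None] * np.array([1.0, 0.0, 0.0])
loss_linear = temporal_smoothness(linear)
loss_quadratic = temporal_smoothness(quadratic)
```

Constant-velocity motion incurs no penalty, while the quadratic trajectory has a discrete acceleration of 2 along x and is penalized accordingly.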
[0080] The local consistency constraint is achieved by penalizing the difference in deformation between nearest-neighbor 3D Gaussian primitives in the canonical space, and its loss function at time $t$ is expressed as:
[0081] $$\mathcal{L}_{local} = \frac{1}{N_p} \sum_{i=1}^{N_p} \sum_{j \in \mathcal{N}_k(i)} \left\| \Delta\mathbf{x}_i^{t} - \Delta\mathbf{x}_j^{t} \right\|_2^2,$$
[0082] where $N_p$ is the number of pedestrian-category Gaussian primitives, $\Delta\mathbf{x}_i^{t}$ is the position deformation of Gaussian primitive $i$ at time $t$, and $\mathcal{N}_k(i)$ is the set of $k$ nearest neighbors of Gaussian primitive $i$ in the canonical space, with $k$ set to 5. This nearest neighbor set is pre-computed with the KD-Tree algorithm before training begins and remains fixed throughout training. The constraint maintains the motion coordination of local limb regions while preserving the motion independence of different limb parts, so it avoids over-restricting relative motion at the joints and makes the non-rigid motion modeling of pedestrians more consistent with actual motion characteristics.
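The local consistency penalty can be sketched as below. For self-containment the neighbor search uses a brute-force distance matrix instead of the KD-Tree named in the text; a KD-Tree would return the same neighbor sets more efficiently on large point sets. The array layout, the averaging, and the toy points are assumptions.

```python
import numpy as np

def knn_indices(points: np.ndarray, k: int) -> np.ndarray:
    """Brute-force k nearest neighbors (stand-in for a KD-Tree query)."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)              # exclude each point itself
    return np.argsort(d2, axis=1)[:, :k]

def local_consistency(disp: np.ndarray, neighbors: np.ndarray) -> float:
    """Mean squared offset difference between each primitive and its neighbors.

    disp: (T, N, 3) position offsets; neighbors: (N, k) fixed index table.
    """
    diff = disp[:, :, None, :] - disp[:, neighbors, :]   # (T, N, k, 3)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# Canonical positions of four primitives along a line (hypothetical).
canonical = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.5, 0, 0], [10.0, 0, 0]])
neighbors = knn_indices(canonical, k=1)       # computed once, kept fixed

rigid = np.ones((3, 4, 3))                    # every primitive moves together
outlier = np.zeros((3, 4, 3))
outlier[:, 3, 0] = 1.0                        # only the last primitive moves
loss_rigid = local_consistency(rigid, neighbors)
loss_outlier = local_consistency(outlier, neighbors)
```

Because the neighbor table is built in canonical space and never updated, the penalty ties together primitives that start close to each other, regardless of how far they later move.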
[0083] Finally, multi-task joint optimization is performed, corresponding to the self-supervised optimization module shown in the purple dashed box in Figure 1. Based on a differentiable Gaussian splatting rendering pipeline, it uses the deformed 3D Gaussian primitive parameters at each time step to render RGB images, depth maps, and semantic segmentation maps. The rendered outputs are compared with the real RGB images, the measured LiDAR depth, and the dense optical flow estimated by a pre-trained optical flow network, and a multi-task loss function is constructed to jointly optimize all model parameters end to end.
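[0084] The total loss function is calculated as:
[0085] $$\mathcal{L} = \lambda_{rgb}\mathcal{L}_{rgb} + \lambda_{depth}\mathcal{L}_{depth} + \lambda_{flow}\mathcal{L}_{flow} + \lambda_{sem}\mathcal{L}_{sem} + \lambda_{temp}\mathcal{L}_{temp} + \lambda_{local}\mathcal{L}_{local},$$
[0086] where $\mathcal{L}_{rgb}$ is the photometric consistency loss, corresponding to the (RGB rendering) supervision term in Figure 1. It combines an L1 loss with D-SSIM and is calculated as:
[0087] $$\mathcal{L}_{rgb} = (1 - \lambda_{ssim}) \left\| I_t - \hat{I}_t \right\|_1 + \lambda_{ssim}\left(1 - \mathrm{SSIM}(I_t, \hat{I}_t)\right),$$
[0088] where $I_t$ is the real image of the $t$-th frame, $\hat{I}_t$ is the rendered image of the $t$-th frame, and $\mathrm{SSIM}$ is the structural similarity index. In the experiments, $\lambda_{ssim}$ is set to [value missing].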
[0089] The depth supervision loss $\mathcal{L}_{depth}$ corresponds to the (LiDAR depth) supervision term in Figure 1. It computes the L1 norm of the difference between the rendered depth and the measured depth only at valid projection locations of the LiDAR point cloud; locations without a LiDAR projection are excluded from the depth loss.
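The masking described above amounts to an L1 loss restricted to pixels that received a LiDAR return. A minimal sketch, assuming a boolean validity mask and a mean over valid pixels (both assumptions), is:

```python
import numpy as np

def lidar_depth_loss(rendered: np.ndarray, lidar: np.ndarray,
                     valid: np.ndarray) -> float:
    """Mean absolute depth error over pixels with a valid LiDAR projection."""
    if not valid.any():
        return 0.0                      # no LiDAR return in this frame
    return float(np.mean(np.abs(rendered[valid] - lidar[valid])))

rendered_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
lidar_depth = np.array([[1.5, 0.0], [0.0, 3.0]])      # 0.0 marks "no return"
valid_mask = np.array([[True, False], [False, True]])
depth_loss = lidar_depth_loss(rendered_depth, lidar_depth, valid_mask)
```

Only the two valid pixels contribute, so the large apparent errors at the empty LiDAR locations do not distort the loss.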
[0090] The optical flow consistency loss $\mathcal{L}_{flow}$ corresponds to the (Optical flow alignment) supervision term in Figure 1. Dense optical flow is estimated from adjacent RGB images by a pre-trained optical flow estimation network and used as the supervision signal. The rendered optical flow is computed from the difference between the two-dimensional projected positions of the Gaussian primitives at adjacent time steps $t$ and $t+1$, and the L1 norm of the difference between the estimated and rendered flow constitutes the optical flow consistency loss.
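The flow computation can be illustrated per primitive with a pinhole camera. This is a simplification: the sketch compares the projected displacement of each primitive center directly with a target flow vector, whereas a full renderer would alpha-blend these displacements into a dense flow map. The intrinsics, names, and sample values are hypothetical.

```python
import numpy as np

def project(points_cam: np.ndarray, fx: float, fy: float,
            cx: float, cy: float) -> np.ndarray:
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)

def flow_l1(centers_t, centers_t1, target_flow, intrinsics):
    """L1 gap between projected center displacement and a target flow."""
    rendered = project(centers_t1, *intrinsics) - project(centers_t, *intrinsics)
    return float(np.mean(np.abs(rendered - target_flow))), rendered

intrinsics = (100.0, 100.0, 320.0, 240.0)     # fx, fy, cx, cy (assumed)
centers_t = np.array([[0.0, 0.0, 2.0]])
centers_t1 = np.array([[0.1, 0.0, 2.0]])      # moved 10 cm to the right
flow_loss, rendered_flow = flow_l1(centers_t, centers_t1,
                                   np.array([[4.0, 0.0]]), intrinsics)
```

A 10 cm lateral move at 2 m depth with a focal length of 100 pixels projects to a 5 pixel horizontal flow, so a target of 4 pixels leaves a residual of 1 pixel on that axis.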
[0091] The semantic segmentation loss is the loss term resulting from the combination of the confidence-weighted cross-entropy loss and the semantic entropy regularization term in the aforementioned steps.
[0092] $\mathcal{L}_{temp}$ and $\mathcal{L}_{local}$ are the temporal smoothing constraint loss and the local consistency constraint loss, respectively, corresponding to the (Motion constraint) supervision term in Figure 1. The coefficients $\lambda$ are the weights of the individual loss terms.
[0093] During model training, a three-stage progressive weight adjustment strategy is used to optimize the multi-task loss function. The first stage, covering the first 30% of iterations, focuses on establishing the scene's geometric structure and semantic separation, with weights [values missing].
[0094] The second stage, covering the middle 40% of iterations, introduces optical flow supervision, with weights [values missing].
[0095] The third stage, covering the last 30% of iterations, further reduces the motion constraint weights and focuses on improving rendering quality.
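The three-stage schedule can be sketched as a lookup from iteration to a weight set. The stage boundaries (30%, 40%, 30%) come from the description; every numeric weight below is a hypothetical placeholder, because the actual values are missing from the source.

```python
# Hypothetical weights for illustration only; the source does not give them.
STAGE_WEIGHTS = {
    1: {"rgb": 1.0, "depth": 1.0, "sem": 1.0, "flow": 0.0, "temp": 0.1, "local": 0.1},
    2: {"rgb": 1.0, "depth": 1.0, "sem": 1.0, "flow": 0.5, "temp": 0.1, "local": 0.1},
    3: {"rgb": 1.0, "depth": 1.0, "sem": 1.0, "flow": 0.5, "temp": 0.05, "local": 0.05},
}

def stage_of(iteration: int, total: int) -> int:
    """Map an iteration to stage 1 (first 30%), 2 (middle 40%) or 3 (last 30%)."""
    if iteration * 10 < 3 * total:      # integer arithmetic avoids float rounding
        return 1
    if iteration * 10 < 7 * total:
        return 2
    return 3

def total_loss(losses: dict, iteration: int, total: int) -> float:
    """Weighted sum of the individual loss terms for the current stage."""
    weights = STAGE_WEIGHTS[stage_of(iteration, total)]
    return sum(weights[name] * losses[name] for name in weights)

unit_losses = {name: 1.0 for name in STAGE_WEIGHTS[1]}
early = total_loss(unit_losses, iteration=0, total=10_000)
late = total_loss(unit_losses, iteration=9_999, total=10_000)
```

In this sketch the flow term is switched off in stage 1 and the motion constraint weights are halved in stage 3, mirroring the qualitative behavior the text describes.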
[0096] The model is optimized with the AdamW optimizer, with momentum parameters of 0.9 and 0.999 and a weight decay coefficient of [value missing]. The learning rate of each parameter decays exponentially to [value missing] of its initial value. During training, adaptive density control is executed once every 100 iterations. Newly added 3D Gaussian primitives inherit the semantic attributes of their parent primitives, which keeps the semantic separation consistent. This allows the density of the 3D Gaussian primitives to be adjusted dynamically according to the complexity of the scene, further improving the detail accuracy of the 3D reconstruction of pedestrians on the street.
[0097] The optimized 3D Gaussian primitive parameters are input to the rendering layer shown in the black box in Figure 1, which renders and composites the RGB image, depth map, and semantic segmentation map and outputs the final 3D reconstruction of pedestrians on the street. The end-to-end backpropagation and parameter updates indicated by the red arrows in Figure 1 close the optimization loop over the whole pipeline, iteratively improving reconstruction accuracy.
[0098] The validation was performed on the KITTI dataset and the Waymo Open Dataset. The experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU. The input RGB images were uniformly scaled to 800×600 resolution, and the LiDAR point clouds were preprocessed with ground filtering and outlier removal.
[0099] Quantitative Results: On the KITTI dataset, the proposed method achieves a PSNR of 28.6 dB, SSIM of 0.912, LPIPS of 0.089, Chamfer distance of 4.2 cm, pedestrian segmentation mIoU of 85.7%, and temporal consistency error of 0.87. Compared to the S3Gaussian baseline method, the PSNR is improved by 2.4 dB, the pedestrian segmentation mIoU is improved by 14.4 percentage points, and the temporal consistency error is reduced by 62.8%. On the Waymo dataset, the PSNR reaches 27.8 dB, and the pedestrian segmentation mIoU is 83.2%, with all indicators reaching the best level among similar methods.
[0100] Ablation experiments: The weakly supervised pedestrian recognition module contributed the most to the pedestrian segmentation accuracy, increasing mIoU from 71.3% to 81.4% (an improvement of 10.1 percentage points) when added alone; the enhanced spatiotemporal coding module mainly improved rendering quality, increasing PSNR from 26.2dB to 27.4dB when added alone; the pedestrian motion constraint module significantly improved temporal consistency, reducing the temporal consistency error from 2.34 to 1.15 when added alone, a reduction of 50.9%; the synergistic effect of the three modules enabled the complete model to achieve optimal performance on all evaluation metrics, with a PSNR of 28.6dB, a pedestrian segmentation mIoU of 85.7%, and a temporal consistency error of 0.87.
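The relative improvements quoted in the quantitative and ablation paragraphs can be cross-checked with a few lines of arithmetic. The snippet uses only the figures reported above; the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0

# Temporal consistency error: baseline 2.34, full model 0.87, motion module alone 1.15.
full_model_reduction = round(relative_reduction(2.34, 0.87), 1)
motion_only_reduction = round(relative_reduction(2.34, 1.15), 1)

# Pedestrian segmentation mIoU gains in percentage points.
full_model_miou_gain = round(85.7 - 71.3, 1)
semantic_only_miou_gain = round(81.4 - 71.3, 1)

# Baseline PSNR implied by the reported 2.4 dB improvement.
implied_baseline_psnr = round(28.6 - 2.4, 1)
```

The results agree with the text: reductions of 62.8% and 50.9%, mIoU gains of 14.4 and 10.1 percentage points, and an implied baseline PSNR of 26.2 dB, which matches the ablation starting point.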
[0101] The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements without departing from the principle of the present invention, and these improvements should also be considered within the scope of protection of the present invention.
Claims
1. A self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition, characterized in that it includes the following steps:
S1: Extract sparse point clouds from a multi-frame RGB image sequence, initialize a set of three-dimensional Gaussian primitives in the canonical space based on the sparse point clouds, and assign a learnable semantic attribute vector to each three-dimensional Gaussian primitive.
S2: Semantically segment the RGB image sequence frame by frame using a pre-trained two-dimensional semantic segmentation network to generate pixel-level pseudo-labels containing three categories: pedestrians, vehicles, and background. Drive the three-dimensional Gaussian primitives to learn their semantic attribution to pedestrians, vehicles, and background through a confidence-weighted cross-entropy loss and a semantic entropy regularization term.
S3: Input the normalized spatial position coordinates and timestamp of each three-dimensional Gaussian primitive into the multi-resolution six-plane spatiotemporal encoder. After extracting the features of each resolution level by bilinear interpolation, use the self-attention mechanism to perform cross-scale feature fusion and calculate the temporal difference features of adjacent time points to obtain the fused spatiotemporal feature vector.
S4: Input the fused spatiotemporal feature vector into the multi-head Gaussian decoder, and predict the position offset, rotation increment, scaling adjustment, and opacity change through four parallel multilayer perceptron branches, respectively. Superimpose these residuals onto the position, rotation, scaling, and opacity parameters of the three-dimensional Gaussian primitive in the canonical space to obtain the deformed three-dimensional Gaussian primitive parameters at each time step.
S5: Apply temporal smoothing constraints and local consistency constraints only to the three-dimensional Gaussian primitives of the pedestrian category; the temporal smoothing constraints suppress non-physical abrupt changes in pedestrian motion, and the local consistency constraints maintain the local motion coordination of pedestrians.
S6: Render RGB images, depth maps, and semantic segmentation maps through a differentiable Gaussian splatting rendering pipeline, and perform end-to-end joint optimization of all parameters based on a multi-task loss function.
2. The self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S1, sparse point clouds are extracted from a multi-frame RGB image sequence using the structure-from-motion (SfM) method.
3. The self-supervised street pedestrian 3D reconstruction method based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S1, the semantic attribute vector is a C-dimensional semantic logit vector, which is normalized by softmax to obtain the category probability distribution. The number of semantic categories C=3 corresponds to the three categories of pedestrians, vehicles and background. After training converges, all 3D Gaussian ...
4. The self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S2, the confidence-weighted cross-entropy loss uses, for each pixel, the maximum category probability predicted by the two-dimensional semantic segmentation network as the confidence weight.
5. The self-supervised street pedestrian 3D reconstruction method based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S3, the multi-resolution six-plane spatiotemporal encoder projects four-dimensional spatiotemporal coordinates onto three spatial planes and three spatiotemporal planes, for a total of six two-dimensional feature planes; the three spatial planes are xy, xz, and yz planes, and the three spatiotemporal planes are xt, yt, and zt planes.
6. The self-supervised street pedestrian 3D reconstruction method based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S3, the multi-resolution six-plane spatiotemporal encoder contains L resolution levels, each with a feature dimension of D. The resolution of the feature plane at each level increases according to the rule [formula not reproduced], where [symbol not reproduced] denotes the base resolution.
7. The self-supervised street pedestrian 3D reconstruction method based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S4, the residuals are superimposed according to the following update rules: position updates use additive residuals, rotation updates use quaternion multiplication residuals, scaling updates use exponential mapping residuals, and opacity updates use sigmoid residuals.
8. The self-supervised street pedestrian 3D reconstruction method based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S5, the temporal smoothing constraint is achieved by penalizing the second-order difference of the pedestrian position deformation across adjacent frames, and the local consistency constraint is achieved by penalizing the difference in deformation between neighboring three-dimensional Gaussian primitives in the canonical space that satisfy the preset nearest neighbor rule.
9. The self-supervised street pedestrian 3D reconstruction method based on weakly supervised semantic recognition as described in claim 8, characterized in that: The preset nearest neighbor rule is a set of k nearest neighbors that is pre-computed using the KD-Tree algorithm and kept fixed during training.
10. The self-supervised 3D reconstruction method for pedestrians on streets based on weakly supervised semantic recognition as described in claim 1, characterized in that: In S6, the multi-task loss function includes photometric consistency loss, depth supervision loss, optical flow consistency loss, semantic segmentation loss, and pedestrian motion constraint loss in S5, and adopts a three-stage progressive weight adjustment strategy for end-to-end joint optimization.
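As an illustration of the residual updates named in claim 7 and the pedestrian motion constraints named in claims 8 and 9, the following is a minimal sketch and not the claimed implementation. The claims do not give the formulas, so the exact forms are assumptions: left-multiplication by a normalized quaternion increment, scale multiplied by the exponential of its residual, opacity updated in logit space and passed through a sigmoid, and unweighted squared-error penalties. All function names and dictionary keys are hypothetical, and a brute-force neighbor search stands in for the KD-Tree of claim 9 to keep the example dependency-free.

```python
import math

def quat_mul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithm finite
    return math.log(p / (1.0 - p))

def apply_residuals(canonical, residual):
    """Deform one canonical Gaussian with predicted residuals (assumed forms)."""
    return {
        # additive position residual
        "pos": tuple(p + d for p, d in zip(canonical["pos"], residual["d_pos"])),
        # quaternion-multiplication rotation residual
        "rot": normalize(quat_mul(normalize(residual["d_rot"]), canonical["rot"])),
        # exponential-mapping scale residual
        "scale": tuple(s * math.exp(d) for s, d in zip(canonical["scale"], residual["d_scale"])),
        # sigmoid opacity residual, applied in logit space
        "opacity": sigmoid(logit(canonical["opacity"]) + residual["d_opacity"]),
    }

def temporal_smoothness(offsets):
    """Sum of squared second-order differences; offsets[t][i] is the 3D
    position offset of pedestrian Gaussian i at frame t."""
    loss = 0.0
    for t in range(1, len(offsets) - 1):
        for prev, cur, nxt in zip(offsets[t - 1], offsets[t], offsets[t + 1]):
            loss += sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return loss

def knn_indices(points, k):
    """Brute-force k nearest neighbors in canonical space, computed once
    and kept fixed (stand-in for the KD-Tree of claim 9)."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append([j for _, j in ranked[:k]])
    return result

def local_consistency(offsets_t, neighbors):
    """Sum of squared offset differences between each Gaussian and its neighbors."""
    loss = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            loss += sum((a - b) ** 2 for a, b in zip(offsets_t[i], offsets_t[j]))
    return loss
```

In this sketch, zero residuals (with an identity quaternion increment) leave the canonical parameters unchanged, constant-velocity motion incurs no temporal penalty, and a rigidly translating neighborhood incurs no local-consistency penalty; per claim 1, both penalties would be evaluated only over Gaussians classified as pedestrians.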