A method for 3D semantic scene completion based on efficient fusion of multiple perspectives
The method for efficient fusion of multi-view 3D semantic scene completion utilizes multiple deformable convolutional networks and a Transformer encoder to fuse 3D scene feature maps across different viewpoints. This solves the problem of inflexible capture of object context relationships in existing technologies and achieves more accurate semantic scene completion.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TIANJIN UNIV
- Filing Date
- 2023-12-14
- Publication Date
- 2026-06-30
AI Technical Summary
Existing 3D semantic scene completion methods lack flexibility in capturing semantic relationships between objects, struggle to effectively capture the contextual relationships of occluded objects, and are not suitable for merging object relationships between multiple views.
A 3D semantic scene completion method based on efficient fusion of multiple perspectives is adopted. Through multiple deformable convolutional network layers and Transformer encoder, 3D scene feature maps are fused across perspectives. Cross attention is used to enhance the contextual information between objects, thereby achieving efficient fusion of multi-perspective information.
It improves the richness of semantic features of objects in the scene, achieves more accurate semantic scene completion, and enhances the accuracy of 3D semantic scene completion.
Figure CN118038218B_ABST
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence and computer vision and relates to semantic scene completion technology, specifically a three-dimensional semantic scene completion method based on efficient fusion of multiple perspectives.
Background Technology
[0002] 3D semantic scene completion aims to infer the category label of each voxel within the field of view, which plays a crucial role in many computer vision applications, such as autonomous driving, scene reconstruction, and robot navigation. In recent years, with the continuous development of deep learning technology, more and more deep learning-based semantic scene completion methods have been proposed, and their performance on large-scale datasets has been significantly improved. To enhance the accuracy of semantic category inference for occluded objects, some studies consider cross-view object contextual information, thereby better enhancing semantic information.
[0003] Contextual information propagation studies how relationships between objects in space are learned and represented as voxel features. Popular 3D semantic scene completion methods use 3D convolutional neural networks to learn feature representations from voxelized objects. Current state-of-the-art context propagation methods use spatial pyramid pooling, whose convolutional kernels have receptive fields at multiple scales so that the network can capture object relationships at different ranges. However, the receptive fields of the pyramid convolutional kernels are manually predefined, and they therefore lack the flexibility to exclude irrelevant voxels.
[0004] Contextual information enhancement occurs during the propagation of object contextual information across multiple viewpoints. It captures the relationships between objects from different viewpoints to further enhance the contextual information of occluded objects. Current research on semantic scene completion fuses sequences of contextual information captured at different times from a single viewpoint to obtain enhanced contextual information. However, these methods are not well suited to considering object relationships between multiple views, which provide crucial object context for semantic scene completion.
Summary of the Invention
[0005] This invention provides a 3D semantic scene completion method based on efficient multi-view fusion. It utilizes efficient viewpoint compression representation to guide the fusion of contextual information from multiple perspectives, thereby effectively increasing the efficiency of scene contextual information fusion. Simultaneously, the contextual information from multiple perspectives allows objects in the scene to acquire richer interaction information with their surroundings, enabling voxels in the scene to acquire richer semantic features and achieve more accurate semantic scene completion.
[0006] A method for 3D semantic scene completion based on efficient fusion of multiple perspectives includes the following steps:
[0007] The first step is to extract feature maps from the input color image and depth image to obtain a color feature map and a TSDF depth feature map. In the obtained color feature map, each voxel records a vector of C-class semantic segmentation confidence. In the obtained TSDF depth feature map, each voxel stores the signed distance value from the voxel to its nearest surface.
[0008] The second step is to use a 3D convolutional neural network encoder to extract the initial 3D scene feature map;
[0009] The third step involves using multiple deformable convolutional network layers to calculate multiple 3D scene feature maps containing potential information from different perspectives.
[0010] The fourth step involves fusing the 3D scene feature maps containing potential information from different perspectives to obtain an enhanced 3D scene feature map. This includes: using a Transformer encoder to compress the 3D scene feature maps containing potential information from different perspectives into lower-dimensional view tokens; and using cross-attention, with the view tokens from each perspective as information carriers, to enhance the 3D scene feature maps from multiple perspectives and fuse them across perspectives to obtain an enhanced 3D scene feature map.
[0011] The fifth step involves using neural network calculations to obtain the 3D semantic scene completion prediction results.
[0012] The sixth step is to calculate the cross-entropy loss between the 3D semantic scene completion prediction results and the semantic completion ground truth of the input image;
[0013] Step 7: Based on the cross-entropy loss, the network is optimized using gradient descent. After training the network model, a 3D semantic scene completion model is obtained.
[0014] Furthermore, the method for the first step is as follows:
[0015] For color images, a two-dimensional semantic segmentation network is selected to extract two-dimensional semantic labels for each pixel in the color image; using reprojection technology, the two-dimensional semantic labels are reprojected to obtain a color feature map, and each voxel records a vector of C-class semantic segmentation confidence.
[0016] For a depth image, each pixel is reprojected onto a view frustum to reconstruct a TSDF depth feature map, where each voxel stores the signed distance value from that voxel to its nearest surface.
[0017] Furthermore, the selected two-dimensional semantic segmentation network is DeepLabV3.
[0018] Furthermore, the second step is as follows: the color feature map and the TSDF depth feature map are converted to the same dimensional space, summed, and then input into the 3D convolutional neural network encoder; the encoder consists of two identical network layers, each containing 4 DDR modules; the last DDR module of each network layer downsamples the resolution of the input feature map to half of its original value, and the 3D convolutional neural network encoder outputs the initial 3D scene feature map.
[0019] Furthermore, the third step is as follows: randomly initialize I deformable convolutional network layers; using the initial 3D scene feature map as input, the I deformable convolutional network layers output I 3D scene feature maps containing potential information from different perspectives.
[0020] Furthermore, in the fourth step, the method of using a Transformer encoder to compress the 3D scene feature maps containing potential information from different viewpoints into lower-dimensional view tokens is as follows: for any viewpoint $i$ ($i = 1, \ldots, I$), the 3D scene feature map $F_i$ of this viewpoint, the learnable view token $E_i$, and the learnable positional encoding $P_i$ are input into the $i$-th Transformer encoder, which outputs a lower-dimensional view token $E'_i \in \mathbb{R}^{N \times C}$, where $N$ represents the resolution of the view token.
[0021] Furthermore, in the fourth step, cross-attention uses the view tokens of each viewpoint as information carriers to enhance the 3D scene feature maps from multiple perspectives. The method for obtaining the enhanced 3D scene feature maps through cross-view fusion is as follows: all view tokens $E'_i$ are concatenated to obtain a global view token $\varepsilon$, which stores cross-view fused information; a cross-attention operation is established between the global view token $\varepsilon$ and the 3D scene feature map $F_i$ of a single viewpoint to enhance $F_i$; the cross-attention operation described above is applied to the 3D scene feature maps from all viewpoints to obtain the enhanced 3D scene feature maps $\{F'_i \in \mathbb{R}^{H \times W \times D \times C} \mid i = 1, \ldots, I\}$.
[0022] Furthermore, the fifth step is as follows: the obtained multi-view enhanced 3D scene feature maps are summed and passed to the 3D convolutional neural network decoder; the 3D convolutional neural network decoder uses two deconvolutional layers to upsample the 3D scene feature maps, which are then processed by a multilayer perceptron for classification to obtain the 3D semantic scene completion prediction result.
[0023] The beneficial effects of the technical solution provided by this invention are as follows: Existing semantic scene completion methods all use convolution with a fixed receptive field to capture the semantic relationships between objects. This not only lacks the flexibility to exclude irrelevant voxels, but also makes it difficult to capture the contextual relationships of occluded objects. This invention employs a 3D semantic scene completion method based on efficient multi-view fusion to learn the semantic and geometric features of a 3D scene, capturing rich contextual information between objects. By utilizing the rich contextual information between objects within the scene, more accurate semantic scene completion results are achieved.
Attached Figure Description
[0024] Figure 1 is a flowchart of the 3D semantic scene completion method based on efficient fusion of multiple perspectives;
[0025] Figure 2 is a comparison of the experimental results of the method of the present invention with the best existing method;
Detailed Implementation
[0026] The technical solutions of this invention will now be clearly and completely described with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art based on the technical solutions of this invention without inventive effort are within the scope of protection of this invention.
[0027] First, some concepts of this invention are explained. TSDF (truncated signed distance function) is used to encode depth: each voxel stores the distance from that voxel to its nearest surface, and the sign of the value indicates whether the voxel lies in visible or invisible space. DDR (Dimensional Decomposition Residual) modules are a variant of the basic 3D convolutional module, forming a novel residual network structure by decomposing the basic 3D convolution into three consecutive layers, one along each dimension. The Transformer encoder contains multiple identical layers, each consisting of two sub-layers: a self-attention mechanism and a fully connected feedforward neural network. Through the combination of these layers, the Transformer module enables the model to capture global dependencies when processing data and facilitates parallel computation during training, improving training efficiency.
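As a rough illustration of the TSDF encoding described above, the following sketch computes truncated signed distances for voxel centres along a single camera ray. It uses a projective simplification (distance measured along the ray rather than to the true nearest surface). The function name `tsdf_along_ray`, the truncation distance, and the sign convention (positive in front of the surface, negative behind it) are assumptions made for illustration and are not taken from the patent.

```python
def tsdf_along_ray(voxel_depths, surface_depth, truncation=0.24):
    """Projective TSDF for voxel centres lying on one camera ray.

    voxel_depths  : depths (metres) of voxel centres along the ray
    surface_depth : observed depth of the first surface on that ray
    truncation    : assumed truncation distance in metres
    """
    values = []
    for z in voxel_depths:
        signed = surface_depth - z                      # > 0 visible side, < 0 occluded side
        clipped = max(-truncation, min(truncation, signed))
        values.append(clipped / truncation)             # normalise to [-1, 1]
    return values


depths = [0.5, 1.0, 1.5, 2.0, 2.5]
tsdf = tsdf_along_ray(depths, surface_depth=1.5, truncation=1.0)
print(tsdf)  # [1.0, 0.5, 0.0, -0.5, -1.0]
```

Voxels in front of the observed surface receive positive values and voxels behind it receive negative values, which is how the sign can separate visible from invisible space.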
[0028] The datasets used for experimental feasibility verification in this invention are the NYU datasets. The NYU dataset provides 1449 pairs of color and depth images, of which 795 pairs are training data and 654 pairs are test data. The NYU dataset contains a total of 11 semantic categories.
[0029] The specific implementation method consists of the following steps:
[0030] The first step is to extract feature maps from the input color and depth images. Specifically, the lightweight 2D semantic segmentation network DeepLabV3 is selected to extract two-dimensional semantic labels for each pixel in the color image. Then, using reprojection techniques, these two-dimensional semantic labels are reprojected to obtain a color feature map with dimensions of 60×36×60×C, where each voxel records a vector of C-class semantic segmentation confidence. For the depth image, each pixel is reprojected onto a view frustum to reconstruct a TSDF depth feature map with dimensions of 60×36×60×1, where each voxel stores the signed distance value from that voxel to its nearest surface.
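The reprojection of 2D semantic confidences into the voxel grid can be sketched as follows. This is a minimal illustration under several assumptions that the patent does not state: a pinhole camera model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a voxel grid aligned with the camera axes, a hypothetical voxel size of 0.08 m, and a hypothetical grid `origin`. Only voxels hit by a depth measurement are filled.

```python
import numpy as np


def backproject_confidences(depth, conf, fx, fy, cx, cy,
                            grid_shape=(60, 36, 60), voxel_size=0.08,
                            origin=(0.0, 0.0, 0.0)):
    """Scatter per-pixel class confidences into a voxel grid.

    depth : (H, W) depth in metres, 0 marks missing measurements
    conf  : (H, W, C) per-pixel class confidences from the 2D network
    Returns a (X, Y, Z, C) grid; voxels without a measurement stay all-zero.
    """
    h, w = depth.shape
    grid = np.zeros(grid_shape + (conf.shape[-1],), dtype=np.float32)
    v, u = np.mgrid[0:h, 0:w]                      # pixel rows and columns
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx                   # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1) - np.asarray(origin)
    idx = np.floor(points / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, vals = idx[inside], conf[valid][inside]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = vals   # last write wins per voxel
    return grid


# Toy input: a flat wall 1 m away, every pixel confidently labelled class 1.
depth = np.full((4, 4), 1.0, dtype=np.float32)
conf = np.zeros((4, 4, 3), dtype=np.float32)
conf[..., 1] = 1.0
grid = backproject_confidences(depth, conf, fx=2.0, fy=2.0, cx=2.0, cy=2.0,
                               origin=(-2.4, -1.44, 0.0))
print(grid.shape)  # (60, 36, 60, 3)
```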
[0031] The second step involves using a 3D convolutional neural network encoder to extract the initial 3D scene feature map. Specifically, the color feature map and the TSDF depth feature map are transformed to the same dimensional space and then summed before being input into the 3D convolutional neural network encoder. This encoder consists of two identical network layers, each containing four DDR modules. The last DDR module of each network layer downsamples the resolution of the 3D scene feature map to half its original value. Therefore, the 3D convolutional neural network encoder outputs a 3D scene feature map with a resolution of 15×9×15×256.
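The resolution arithmetic of the encoder, and the reason a dimensionally decomposed convolution is cheaper than a dense 3D convolution, can be checked with a short sketch. The function names and the channel count of 64 are hypothetical, and `ddr_params` counts only the three decomposed kernels, ignoring biases and any other layers a full DDR module may contain.

```python
def encoder_output_resolution(resolution, num_stages=2):
    """Halve each spatial dimension once per encoder stage."""
    for _ in range(num_stages):
        resolution = tuple(d // 2 for d in resolution)
    return resolution


def conv3d_params(c_in, c_out, k=3):
    """Weights in a dense k x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * k ** 3


def ddr_params(c_in, c_out, k=3):
    """Weights in three consecutive 1D-kernel convolutions, one per axis.

    Assumes the first layer maps c_in -> c_out and the other two keep c_out.
    """
    return c_in * c_out * k + 2 * c_out * c_out * k


print(encoder_output_resolution((60, 36, 60)))     # (15, 9, 15)
print(conv3d_params(64, 64), ddr_params(64, 64))   # 110592 36864
```

Two halvings take the 60×36×60 input grid to the 15×9×15 grid stated above, and for a 3×3×3 kernel the decomposed form needs one third of the weights of the dense kernel.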
[0032] The third step involves using multiple sets of deformable convolutional network layers to calculate multiple sets of 3D scene feature maps containing potential information from different viewpoints. Specifically, multiple sets of deformable convolutional network layers are first randomly initialized. Using a 3D scene feature map with a resolution of 15×9×15×256 as input, the multiple sets of deformable convolutional network layers output a set of 3D scene feature maps containing potential information from different viewpoints. Because the convolutional kernels of the deformable convolutional network layers adaptively change shape according to the shape of different objects and their adjacency relationships, the multiple sets of 3D scene feature maps extracted by multiple sets of different deformable convolutional network layers contain richer interaction information between objects from different viewpoints.
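The idea of a deformable kernel can be illustrated in one dimension. The sketch below is a simplification, not the patent's 3D layers: the sampling offsets are passed in directly (in a deformable convolution they are normally predicted from the input by an extra convolution), fractional positions are resolved by linear interpolation, and the name `deformable_conv1d` and the branch count `num_views` are hypothetical.

```python
import numpy as np


def deformable_conv1d(x, weights, offsets):
    """1D deformable convolution with a 3-tap kernel.

    x       : (L,) input signal
    weights : (3,) kernel weights
    offsets : (L, 3) fractional offsets added to the regular taps -1, 0, +1
    Fractional sampling positions are resolved by linear interpolation.
    """
    length = x.shape[0]
    grid = np.arange(length)
    base = np.array([-1.0, 0.0, 1.0])
    out = np.zeros(length)
    for p in range(length):
        positions = p + base + offsets[p]          # deformed sampling points
        samples = np.interp(positions, grid, x)    # clamps at the borders
        out[p] = np.dot(weights, samples)
    return out


rng = np.random.default_rng(0)
x = np.arange(8, dtype=float)
num_views = 3                                      # plays the role of I in the text
branches = [(rng.normal(size=3), rng.normal(scale=0.5, size=(8, 3)))
            for _ in range(num_views)]
view_features = [deformable_conv1d(x, w, off) for w, off in branches]
```

Each independently initialized branch samples the same input at different positions, so the branches produce different feature maps from one shared input. With all offsets set to zero the function reduces to an ordinary 3-tap convolution.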
[0033] The fourth step involves fusing the 3D scene feature maps from multiple perspectives to obtain an enhanced 3D scene feature map. A cross-view fusion network layer is employed to complementarily enhance the 3D scene feature maps from different perspectives. To improve the efficiency of multi-view feature fusion, this cross-view fusion network layer does not directly fuse the original high-resolution 3D scene feature maps. Instead, it calculates an individual view token for each 3D scene feature map before fusing them. Each view token has a low dimension and contains complete information about the 3D scene feature map of the corresponding perspective. The cross-view fusion network layer uses these view tokens as information carriers to enhance the 3D scene feature maps from each perspective, thereby outputting a set of enhanced 3D scene feature maps from multiple perspectives. The specific method is as follows:
[0034] (1) Use a Transformer encoder to compress the 3D scene feature map of each perspective. For any perspective $i$ ($i = 1, \ldots, I$), input the 3D scene feature map $F_i \in \mathbb{R}^{H \times W \times D \times C}$, the learnable view token $E_i \in \mathbb{R}^{N \times C}$, and the learnable position encoding $P_i \in \mathbb{R}^{(N + H \times W \times D) \times C}$ into the $i$-th Transformer encoder. Here, $N$ represents the resolution of the view token. The Transformer encoder outputs a view token $E'_i \in \mathbb{R}^{N \times C}$ that represents the entire, higher-dimensional 3D scene feature map $F_i$.
[0035] The Transformer encoder also uses a 3D convolutional network layer to transform the 3D scene feature map $F_i$, and the transformed 3D scene feature map is injected into the view token $E'_i$. This step is crucial for semantic scene completion on a voxelized structure, because it helps the view token $E'_i$ capture the potential information of occluded voxels. The spatial dimension of the view token $E_i$ is lower than that of the original 3D scene feature map $F_i$ (i.e., $N < H \times W \times D$). Because its spatial dimension is compressed, $E_i$ focuses on the global semantic information of $F_i$, which is injected into the view token $E'_i$.
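The compression of a feature map into a few view tokens can be sketched with a single self-attention layer. This is a simplified stand-in for the Transformer encoder described above: it omits the feed-forward sub-layer, normalization, multiple heads, and the 3D convolutional transform of $F_i$. The function name `compress_to_view_tokens`, the projection matrices `w_q`, `w_k`, `w_v`, the scaling by the square root of $C$, and the toy sizes are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_to_view_tokens(feature_map, view_tokens, pos_encoding, w_q, w_k, w_v):
    """One self-attention layer over [view tokens ; voxel features].

    feature_map  : (H, W, D, C) scene features F_i
    view_tokens  : (N, C) learnable tokens E_i
    pos_encoding : (N + H*W*D, C) learnable positions P_i
    Returns the updated tokens E'_i with shape (N, C).
    """
    n, c = view_tokens.shape
    voxels = feature_map.reshape(-1, c)                    # (H*W*D, C)
    seq = np.concatenate([view_tokens, voxels], axis=0) + pos_encoding
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)          # each row sums to 1
    out = seq + attn @ v                                   # residual connection
    return out[:n]                                         # keep only the token rows


rng = np.random.default_rng(0)
H, W, D, C, N = 4, 3, 4, 8, 2
F_i = rng.normal(size=(H, W, D, C))
E_i = rng.normal(size=(N, C))
P_i = rng.normal(size=(N + H * W * D, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
E_prime = compress_to_view_tokens(F_i, E_i, P_i, w_q, w_k, w_v)
print(E_prime.shape)  # (2, 8)
```

Because the token rows attend to every voxel row, the $N$ output tokens summarize all $H \times W \times D$ voxels while remaining much smaller than the feature map.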
[0036] The position encoding $P_i$ is learned jointly with the view token $E_i$. Therefore, $P_i$ can be regarded as a complementary structure that stores richer geometric information than the view token $E_i$. The cross-view fusion network layer uses the view tokens from the various perspectives as information carriers to enhance the 3D scene feature maps from multiple perspectives.
[0038] First, concatenate the view tokens of all perspectives to obtain a global view token $\varepsilon \in \mathbb{R}^{N \times I \times C}$, which stores cross-view fused information. Next, a cross-attention operation is established between the global view token $\varepsilon$ and the 3D scene feature map $F_i$ of each individual viewpoint. The query $q$, key $k$, and value $v$ used in the cross-attention operation are calculated as follows.
[0039] $q = \mathrm{convolution}(F_i)$
[0040] $k = \mathrm{convolution}(\varepsilon)$
[0041] $v = \mathrm{convolution}(\varepsilon)$
[0042] Next, the query $q$, key $k$, and value $v$ are used to calculate the cross-attention.
[0043] $A = v^{T} \cdot \mathrm{softmax}(k \cdot q^{T})$
[0044] To simplify the notation above, the subscripts of the intermediate variables $k, v \in \mathbb{R}^{N \times I \times C}$ and $q, A \in \mathbb{R}^{H \times W \times D \times C}$ are omitted. Using the cross-view fused information stored in the global view token $\varepsilon$, the cross-attention result is applied to enhance $F_i$, and the enhanced 3D scene feature map $F'_i \in \mathbb{R}^{H \times W \times D \times C}$ is calculated.
[0045] $F'_i = F_i + \mathrm{convolution}(A^{T})$
[0046] The cross-attention operation described above is applied to the 3D scene feature maps from all viewpoints to obtain the enhanced 3D scene feature maps from multiple viewpoints, $\{F'_i \in \mathbb{R}^{H \times W \times D \times C} \mid i = 1, \ldots, I\}$.
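The cross-attention formulas above can be traced with a small sketch. It makes several assumptions the text does not specify: each `convolution` is modelled as a pointwise linear map (a `(C, C)` matrix, equivalent to a 1×1×1 convolution), spatial dimensions are flattened so the global view token has shape `(N*I, C)`, and the softmax is taken over the token axis so that each voxel's weights over the tokens sum to 1. The function name `cross_view_enhance` and the toy sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_enhance(feature_map, global_tokens, w_q, w_k, w_v, w_o):
    """Enhance F_i with the global view token via cross-attention.

    feature_map   : (H, W, D, C) scene features F_i
    global_tokens : (N*I, C) concatenated view tokens of all viewpoints
    w_*           : (C, C) matrices standing in for 1x1x1 convolutions
    """
    shape = feature_map.shape
    f = feature_map.reshape(-1, shape[-1])     # (H*W*D, C)
    q = f @ w_q                                # q = convolution(F_i)
    k = global_tokens @ w_k                    # k = convolution(eps)
    v = global_tokens @ w_v                    # v = convolution(eps)
    weights = softmax(k @ q.T, axis=0)         # (N*I, H*W*D), normalised over tokens
    a = v.T @ weights                          # A = v^T softmax(k q^T), (C, H*W*D)
    enhanced = f + a.T @ w_o                   # F'_i = F_i + convolution(A^T)
    return enhanced.reshape(shape)


rng = np.random.default_rng(0)
H, W, D, C, N, I = 4, 3, 4, 8, 2, 3
features = [rng.normal(size=(H, W, D, C)) for _ in range(I)]
tokens = [rng.normal(size=(N, C)) for _ in range(I)]
eps = np.concatenate(tokens, axis=0)           # global view token, (N*I, C)
mats = [rng.normal(size=(C, C)) * 0.1 for _ in range(4)]
enhanced = [cross_view_enhance(f, eps, *mats) for f in features]
```

The attention matrix has only `N*I` rows per voxel rather than `H*W*D*I`, which is where the efficiency of fusing through compact tokens comes from.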
[0047] The fifth step involves the neural network calculating the completion result. The cross-view fusion network layer outputs a set of enhanced 3D scene feature maps from multiple perspectives, $\{F'_i \in \mathbb{R}^{H \times W \times D \times C} \mid i = 1, \ldots, I\}$. These feature maps are summed and passed to a 3D convolutional neural network decoder. This decoder uses two deconvolutional layers to upsample the 3D scene feature map to the same resolution as the original input (60×36×60×256). The upsampled 3D scene feature map is then processed by a multilayer perceptron for classification to obtain the 3D semantic scene completion prediction result.
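The decoder's resolution arithmetic can be verified with the standard output-size formula for a transposed convolution (dilation 1). The kernel size of 2, stride of 2, and zero padding used here are assumptions chosen so that each layer doubles the resolution; the patent does not state these hyperparameters.

```python
def deconv_output_size(size, kernel=2, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def decoder_resolution(resolution, num_layers=2):
    """Apply the same transposed convolution to every spatial axis, layer by layer."""
    for _ in range(num_layers):
        resolution = tuple(deconv_output_size(d) for d in resolution)
    return resolution


print(decoder_resolution((15, 9, 15)))  # (60, 36, 60)
```

Two doubling layers take the 15×9×15 encoder output back to the 60×36×60 input grid.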
[0048] The sixth step is to calculate the cross-entropy loss based on the 3D semantic scene completion prediction results and the ground truth values of the semantic completion of the input image.
[0049] Step 7: Optimize the entire neural network using gradient descent based on the cross-entropy loss.
[0050] Step 8: Repeat steps 1 through 7 for 500 rounds to complete the training of the entire semantic scene completion model.
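The loss and update rule of steps six to eight can be illustrated on a toy problem. In this sketch gradient descent is applied directly to per-voxel logits rather than to network weights, which is a deliberate simplification. The voxel count, class count, learning rate, and number of iterations are hypothetical.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy over voxels; logits (V, K), labels (V,) integer classes."""
    probs = softmax(logits)
    picked = probs[np.arange(labels.shape[0]), labels]
    return float(-np.log(picked + 1e-12).mean())


def cross_entropy_grad(logits, labels):
    """Gradient of the mean cross-entropy with respect to the logits."""
    grad = softmax(logits)
    grad[np.arange(labels.shape[0]), labels] -= 1.0
    return grad / labels.shape[0]


rng = np.random.default_rng(0)
num_voxels, num_classes = 100, 5           # toy sizes, not the NYU configuration
logits = rng.normal(size=(num_voxels, num_classes))
labels = rng.integers(0, num_classes, size=num_voxels)

losses = []
for _ in range(50):                        # the text repeats training for 500 rounds
    losses.append(cross_entropy(logits, labels))
    logits -= 10.0 * cross_entropy_grad(logits, labels)   # plain gradient-descent step
```

In a real training run the gradient would be propagated back through the decoder, fusion layer, and encoder to update their parameters; the loss computation itself is the same.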
[0051] The ninth step involves inputting the color and depth images from the test set into the trained semantic scene completion model to predict the probability distribution over semantic categories for each voxel in the 3D scene. For each voxel, the category with the maximum probability is taken as the final semantic scene completion result.
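Selecting the final label per voxel amounts to an argmax over the class axis, as in this small sketch. The function name `predict_labels` and the toy probability volume are hypothetical.

```python
import numpy as np


def predict_labels(probabilities):
    """Pick the most probable class for every voxel.

    probabilities : (X, Y, Z, K) per-voxel class distribution
    Returns an (X, Y, Z) integer label volume.
    """
    return np.argmax(probabilities, axis=-1)


probs = np.array([[[[0.1, 0.7, 0.2],
                    [0.6, 0.3, 0.1]]]])    # shape (1, 1, 2, 3)
labels = predict_labels(probs)
print(labels.tolist())  # [[[1, 0]]]
```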
[0052] The feasibility of the method of the present invention is verified below with specific examples, as detailed in the following description:
[0053] We conducted comparative experiments on the publicly available NYU dataset. The NYU dataset contains 11 semantic categories, 795 pairs of training images, and 654 pairs of test images. Each pair of images consists of one color image and one depth image.
[0054] The NYU dataset primarily focuses on indoor scenes. We report the accuracy of Scene Completion (SC) and Semantic Scene Completion (SSC) on this dataset. We select recall, precision, and Intersection over Union (IoU) to measure the accuracy of SC. We select the mean IoU (mIoU) calculated across all semantic categories to measure the accuracy of SSC.
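The metrics named above can be computed as in the following sketch. It assumes that label 0 denotes empty space and that the evaluation covers every voxel passed in; any restriction of the evaluation region is omitted. The function names and the toy label arrays are hypothetical.

```python
import numpy as np


def scene_completion_metrics(pred_occupied, true_occupied):
    """Precision, recall, and IoU for binary occupancy (boolean arrays)."""
    tp = np.sum(pred_occupied & true_occupied)
    fp = np.sum(pred_occupied & ~true_occupied)
    fn = np.sum(~pred_occupied & true_occupied)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return float(precision), float(recall), float(iou)


def mean_iou(pred, target, class_ids):
    """Average per-class IoU over the classes that occur in pred or target."""
    ious = []
    for c in class_ids:
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


# Toy volume flattened to six voxels; label 0 is treated as empty space.
target = np.array([0, 1, 1, 2, 2, 0])
pred = np.array([0, 1, 2, 2, 2, 1])
print(scene_completion_metrics(pred > 0, target > 0))   # (0.8, 1.0, 0.8)
print(mean_iou(pred, target, class_ids=[1, 2]))         # approximately 0.5
```

SC treats the task as binary occupancy, so it ignores which class was predicted, while the SSC mIoU penalizes the voxel that was labelled 2 instead of 1.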
[0055] According to Figure 2, experimental results on different datasets demonstrate that the semantic scene completion accuracy of the proposed method is higher than that of the best existing method. Therefore, the feasibility and superiority of the proposed method are evident.
Claims
1. A method for 3D semantic scene completion based on efficient fusion of multiple perspectives, comprising the following steps:
The first step is to extract feature maps from the input color image and depth image to obtain a color feature map and a TSDF depth feature map; in the obtained color feature map, each voxel records a vector of C-class semantic segmentation confidence; in the obtained TSDF depth feature map, each voxel stores the signed distance value from the voxel to its nearest surface;
The second step is to use a 3D convolutional neural network encoder to extract the initial 3D scene feature map;
The third step involves using multiple deformable convolutional network layers to calculate multiple 3D scene feature maps containing potential information from different perspectives;
The fourth step involves fusing the 3D scene feature maps containing potential information from different perspectives to obtain an enhanced 3D scene feature map, which includes: using a Transformer encoder to compress the 3D scene feature maps containing potential information from different perspectives into lower-dimensional view tokens; and using cross-attention, with the view tokens from each perspective as information carriers, to enhance the 3D scene feature maps from multiple perspectives and fuse them across perspectives to obtain an enhanced 3D scene feature map;
The fifth step involves using neural network calculations to obtain the 3D semantic scene completion prediction results;
The sixth step is to calculate the cross-entropy loss between the 3D semantic scene completion prediction results and the semantic completion ground truth of the input image;
Step 7: Based on the cross-entropy loss, the network is optimized using gradient descent. After training the network model, a 3D semantic scene completion model is obtained.
2. The three-dimensional semantic scene completion method according to claim 1, characterized in that the first step is as follows: For color images, a two-dimensional semantic segmentation network is selected to extract two-dimensional semantic labels for each pixel in the color image; using reprojection technology, the two-dimensional semantic labels are reprojected to obtain a color feature map, and each voxel records a vector of C-class semantic segmentation confidence. For a depth image, each pixel is reprojected onto a view frustum to reconstruct a TSDF depth feature map, where each voxel stores the signed distance value from that voxel to its nearest surface.
3. The three-dimensional semantic scene completion method according to claim 2, characterized in that, The selected two-dimensional semantic segmentation network is DeepLabV3.
4. The three-dimensional semantic scene completion method according to claim 1, characterized in that the second step is as follows: the color feature map and the TSDF depth feature map are converted to the same dimensional space, summed, and then input into the 3D convolutional neural network encoder. The encoder consists of two identical network layers, each containing 4 DDR modules. The last DDR module of each network layer downsamples the resolution of the input feature map to half of its original value, and the 3D convolutional neural network encoder outputs the initial 3D scene feature map.
5. The three-dimensional semantic scene completion method according to claim 1, characterized in that, The third step is as follows: randomly initialize I deformable convolutional network layers; take the initial 3D scene feature map as input, and the I deformable convolutional network layers output I 3D scene feature maps containing potential information from different perspectives.
6. The three-dimensional semantic scene completion method according to claim 5, characterized in that, in the fourth step, the Transformer encoder is used to compress the 3D scene feature maps containing potential information from different viewpoints into lower-dimensional view tokens. The method is as follows: for any viewpoint $i$ ($i = 1, \ldots, I$), the 3D scene feature map $F_i$ of this viewpoint, the learnable view token $E_i$, and the learnable positional encoding $P_i$ are input into the $i$-th Transformer encoder, which outputs a lower-dimensional view token $E'_i \in \mathbb{R}^{N \times C}$, where $N$ represents the resolution of the view token.
7. The three-dimensional semantic scene completion method according to claim 6, characterized in that, in the fourth step, cross-attention uses the view tokens of each viewpoint as information carriers to enhance the 3D scene feature maps from multiple perspectives. The method for obtaining the enhanced 3D scene feature maps through cross-view fusion is as follows: all view tokens $E'_i$ are concatenated to obtain a global view token $\varepsilon$, which stores cross-view fused information; a cross-attention operation is established between the global view token $\varepsilon$ and the 3D scene feature map $F_i$ of a single viewpoint to enhance $F_i$; the cross-attention operation described above is applied to the 3D scene feature maps from all viewpoints to obtain the enhanced 3D scene feature maps $\{F'_i \in \mathbb{R}^{H \times W \times D \times C} \mid i = 1, \ldots, I\}$.
8. The three-dimensional semantic scene completion method according to claim 1, characterized in that, The fifth step is as follows: the obtained multi-view enhanced 3D scene feature maps are summed and passed to the 3D convolutional neural network decoder; the 3D convolutional neural network decoder uses two deconvolutional layers to upsample the 3D scene feature maps, which are then processed by a multilayer perceptron for classification to obtain the 3D semantic scene completion prediction result.