A robot environment high-fidelity three-dimensional reconstruction method, device and medium
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 湖南工商大学
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
Figure CN122244340A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to a method, apparatus and medium for high-fidelity 3D reconstruction of robot environments.
Background Technology
[0002] In the fields of computer vision, autonomous robot perception, and digital twin technology, high-fidelity reconstruction of 3D environments is the core foundation for realizing autonomous navigation, obstacle avoidance planning, and human-robot interaction for mobile robots. As intelligent robot systems are increasingly applied in unstructured and complex scenarios, acquiring real-time 3D environment models with complete geometric structures, clear surface details, and consistent topology from continuous video stream observations has become a key technological bottleneck for improving the robot's environmental adaptability. Especially under non-ideal observation conditions such as dynamic blurring, weak textures, or high-brightness reflections caused by rapid robot movement, the robustness and accuracy of the reconstruction algorithm directly determine the reliability of the robot's environmental understanding.
[0003] However, existing 3D reconstruction methods are insufficient in generating geometric details, and their temporal fusion mechanisms lack robustness under dynamic observation conditions.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to address the above-mentioned shortcomings of the prior art by providing a high-fidelity 3D reconstruction method, device and medium for robot environments, so as to solve the problems of insufficient geometric detail generation capability of existing 3D reconstruction methods and lack of robustness of temporal fusion mechanism under dynamic observation conditions.
[0005] In a first aspect, the present invention provides a high-fidelity 3D reconstruction method for robot environments, comprising: acquiring a time-series observation sequence for robot environmental perception; extracting multi-scale two-dimensional features from the current frame image in the time-series observation sequence using a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and back-projecting the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level; generating, by the sparse voxel generation and screening module, a sparse candidate voxel set for the initial level, and outputting, based on the initial 3D feature volume and the sparse candidate voxel set of the initial level, sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels; constructing, by the conditional diffusion denoising and completion module, a condition set based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, and denoising and structurally completing the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure; and updating, by the adaptive temporal probability fusion module, the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism, and obtaining a cascaded probabilistic TSDF 3D reconstruction network model through iteration over each level.
[0006] Furthermore, the acquisition of the time-series observation sequence for robot environmental perception specifically includes: The robot acquires RGB-D image sequences and camera pose data for environmental perception, and preprocesses the RGB-D image sequences and camera pose data to obtain the time-series observation sequence.
[0007] Furthermore, the step of extracting multi-scale two-dimensional features from the current frame image in the time-series observation sequence through a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and back-projecting the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level, specifically includes: based on the scene size and reconstruction requirements, defining the parameters of a cascaded, coarse-to-fine multi-resolution voxel grid pyramid; inputting the current frame image into the image backbone network to extract multi-scale two-dimensional features, and fusing the multi-scale two-dimensional features in combination with the feature pyramid to obtain a feature pyramid set; for each level in the feature pyramid set, establishing a back-projection mapping from pixels to voxels based on the intrinsic parameters of the level, the extrinsic parameters, and the estimated depth; based on the back-projection mapping from pixels to voxels, calculating a projection weight for each pixel-voxel pair and normalizing it; and calculating the initial 3D feature volume corresponding to the level based on the normalized projection weights.
[0008] Further, the step of generating an initial level of sparse candidate voxels through the sparse voxel generation and screening module, and outputting other levels of sparse candidate voxel sets containing geometric coverage information and corresponding masks based on the initial 3D feature volume and the initial level of sparse candidate voxels, specifically includes: For the initial level, all low-resolution voxels within the initialized view frustum are used as the sparse candidate voxel set for the initial level. For the other levels, upsampling is performed based on the occupancy prediction results of the previous level to generate a sparse candidate voxel set for the other levels. Historical information is initially fused using a gating mechanism. Based on the initial 3D feature volume and the sparse candidate voxel sets of other levels, a sparse candidate voxel set containing geometric coverage information and a corresponding mask for the other levels are output.
[0009] Furthermore, the conditional diffusion denoising and completion module constructs a condition set based on a sparse candidate voxel set containing geometric coverage information and its corresponding mask, and performs denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure, specifically including: Based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, a condition set is constructed in conjunction with the initial 3D feature volume and the temporal context. The condition set is input into the conditional diffusion model to perform denoising and structured completion, and the output is a geometrically enhanced feature representation and the corresponding observation uncertainty metric.
[0010] Furthermore, the step of updating, by the adaptive temporal probability fusion module, the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism, and obtaining a cascaded probabilistic TSDF 3D reconstruction network model through iteration over each level, specifically includes: based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, combined with the global historical hidden state of the previous time step, updating the global TSDF estimate using the adaptive gated recurrent unit and the confidence weighting mechanism; and constructing the cascaded probabilistic TSDF 3D reconstruction network model through coarse-to-fine iteration over each level. Here, the iteration refers to the level-by-level optimization performed by the conditional diffusion denoising and completion module and the adaptive temporal probability fusion module.
[0011] Furthermore, after the step of updating, by the adaptive temporal probability fusion module, the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism, and obtaining a cascaded probabilistic TSDF 3D reconstruction network model through iteration over each level, the method further includes: sequentially inputting the data of the test set into the reconstruction network; taking the global fused TSDF estimate of the finest level and the accumulated uncertainty information as input; filtering artifacts in high-uncertainty regions using an adaptive zero-crossing threshold; and extracting isosurfaces with the marching cubes algorithm to obtain a high-fidelity, topologically consistent three-dimensional triangular mesh model.
[0012] In a second aspect, the present invention provides a high-fidelity three-dimensional reconstruction device for robot environments, comprising: an acquisition module, used to acquire a time-series observation sequence for robot environmental perception; an extraction and back-projection module, connected to the acquisition module and used to extract multi-scale two-dimensional features from the current frame image in the time-series observation sequence through a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and to back-project the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level; a generation and output module, connected to the extraction and back-projection module and used to generate a sparse candidate voxel set for the initial level through the sparse voxel generation and screening module, and to output, based on the initial 3D feature volume and the sparse candidate voxel set of the initial level, sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels; a construction module, connected to the generation and output module and used to construct, through the conditional diffusion denoising and completion module, a condition set based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, and to denoise and structurally complete the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure; and an update module, connected to the construction module and used to update, through the adaptive temporal probability fusion module, the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism, and to obtain the cascaded probabilistic TSDF three-dimensional reconstruction network model through iteration over each level.
[0013] In a third aspect, the present invention provides a high-fidelity 3D reconstruction device for a robot environment, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to implement the high-fidelity 3D reconstruction method for a robot environment described in the first aspect.
[0014] In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the high-fidelity three-dimensional reconstruction method for robot environments described in the first aspect.
[0015] The present invention provides a high-fidelity 3D reconstruction method, apparatus, and medium for robot environments. First, a time-series observation sequence for robot environmental perception is acquired. Then, multi-scale two-dimensional features are extracted from the current frame image in the time-series observation sequence using a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and these features are back-projected into 3D space to obtain the initial 3D feature volume corresponding to each level. Next, a sparse voxel generation and screening module generates a sparse candidate voxel set for the initial level and, based on the initial 3D feature volumes and the initial level's sparse candidate voxel set, outputs sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels. Then, through a conditional diffusion denoising and completion module, a condition set is constructed based on the sparse candidate voxel set containing geometric coverage information and its corresponding mask, and the condition set is denoised and structurally completed to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure. Finally, an adaptive temporal probability fusion module updates the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism. Through iteration over each level, a cascaded probabilistic TSDF 3D reconstruction network model is obtained. By using the multi-resolution voxel grid pyramid network and the multi-scale feature back-projection module, the invention performs computation only within the effective geometric region, which significantly reduces memory usage while maintaining the real-time performance of the robot's perception system. Through the conditional diffusion denoising and completion module, generative priors are introduced to constrain the geometric structure, overcoming the geometric blurring and over-smoothing caused by traditional discriminative models and significantly improving high-fidelity recovery of thin structures, sharp edges, and occluded blind spots. Furthermore, the adaptive temporal probability fusion module uses gating mechanisms and confidence weighting to dynamically suppress the interference of low-quality observation frames, such as those affected by motion blur or positioning drift, on the global map, effectively preventing the accumulation of errors across scales. As a result, the method achieves higher reconstruction accuracy, stronger structural integrity, and better topological consistency in complex dynamic robot scenarios and under long-sequence observation conditions, and has good application value. It thereby addresses the insufficient geometric detail generation of existing 3D reconstruction methods and the lack of robustness of their temporal fusion mechanisms under dynamic observation conditions.
Attached Figure Description
[0016] To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention. Those skilled in the art can obtain other drawings based on the structures shown in these drawings without creative effort.
[0017] Figure 1 is a flowchart of a high-fidelity 3D reconstruction method for a robot environment according to Embodiment 1 of the present invention;
Figure 2 is a flowchart of another high-fidelity 3D reconstruction method for a robot environment according to Embodiment 1 of the present invention;
Figure 3 is a model diagram of the sparse voxel generation and screening module according to an embodiment of the present invention;
Figure 4 is a model diagram of the conditional diffusion denoising and completion module according to an embodiment of the present invention;
Figure 5 is a model diagram of the adaptive temporal probability fusion module according to an embodiment of the present invention;
Figure 6a is a first real-world 3D model diagram according to an embodiment of the present invention;
Figure 6b is a three-dimensional model diagram of the first test result according to an embodiment of the present invention;
Figure 7a is a second real-world scene diagram according to an embodiment of the present invention;
Figure 7b is a detailed diagram of the second test result model according to an embodiment of the present invention;
Figure 8 is a schematic diagram of the structure of a high-fidelity 3D reconstruction device for a robot environment according to Embodiment 2 of the present invention;
Figure 9 is a schematic diagram of the structure of a high-fidelity 3D reconstruction device for a robot environment according to Embodiment 3 of the present invention.
Detailed Implementation
[0018] To enable those skilled in the art to better understand the technical solution of the present invention, the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.
[0019] It is understood that the specific embodiments and accompanying drawings described herein are merely for explaining the invention and are not intended to limit the invention.
[0020] It is understood that, without conflict, the various embodiments and features in the embodiments of the present invention can be combined with each other.
[0021] It is understood that, for ease of description, only the parts related to the present invention are shown in the accompanying drawings, while the parts unrelated to the present invention are not shown in the drawings.
[0022] It is understood that each unit or module involved in the embodiments of the present invention may correspond to only one entity structure, or may be composed of multiple entity structures, or multiple units or modules may be integrated into one entity structure.
[0023] It is understood that, without conflict, the functions and steps marked in the flowcharts and block diagrams of this invention may occur in a different order than that marked in the accompanying drawings.
[0024] It is understood that the flowcharts and block diagrams of this invention illustrate the possible architecture, functions, and operations of systems, apparatuses, devices, and methods according to various embodiments of this invention. Each block in the flowchart or block diagram may represent a unit, module, program segment, or code, containing executable instructions for implementing the specified function. Furthermore, each block or combination of blocks in the block diagram and flowchart can be implemented using a hardware-based system to achieve the specified function, or using a combination of hardware and computer instructions.
[0025] It is understood that the units and modules involved in the embodiments of the present invention can be implemented by software or by hardware. For example, the units and modules can be located in a processor.
Application Overview
[0026] In the fields of computer vision, autonomous robot perception, and digital twin technology, high-fidelity reconstruction of 3D environments is the core foundation for realizing autonomous navigation, obstacle avoidance planning, and human-robot interaction for mobile robots. As intelligent robot systems are increasingly applied in unstructured and complex scenarios, acquiring real-time 3D environment models with complete geometric structures, clear surface details, and consistent topology from continuous video stream observations has become a key technological bottleneck for improving the robot's environmental adaptability. Especially under non-ideal observation conditions such as dynamic blurring, weak textures, or high-brightness reflections caused by rapid robot movement, the robustness and accuracy of the reconstruction algorithm directly determine the reliability of the robot's environmental understanding.
[0027] Existing 3D reconstruction methods fall broadly into two categories. The first category comprises traditional methods that rely on dense depth maps acquired by depth sensors for geometric fusion. These methods generally employ the Truncated Signed Distance Function (TSDF) as the standard volumetric representation. TSDF discretizes 3D space into a voxel grid and stores, for each voxel, the signed distance to the nearest surface, thereby representing 3D geometry as an implicit surface and supporting weighted fusion of multi-view depth data. To describe geometric surfaces more rigorously under non-ideal observation conditions, existing technologies have introduced the concept of probabilistic TSDF. Unlike standard TSDF, which stores only the mean distance, probabilistic TSDF treats the geometric surface within a voxel as a probability distribution that includes not only a distance estimate but also an explicit variance or uncertainty parameter describing observation reliability, so that multi-view fusion can, to some extent, be weighted by reliability. However, these methods depend heavily on the quality of the input depth map and lack geometric priors. They struggle to complete the geometric structure under weak textures or noise interference, which easily leads to holes or fragmentation in the reconstructed model. To address this deficiency, the second category consists of 3D reconstruction methods based on deep neural networks that have emerged in recent years. Such methods typically use 3D convolutional networks to regress 3D TSDF values directly from 2D image features, and employ neural networks to construct temporal feature fusion modules, attempting to address the noise and occlusion issues of single-frame observations by learning temporal context information.
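The weighted fusion of multi-view depth data mentioned above is commonly implemented as a per-voxel running weighted average. The following is a minimal sketch of that standard update, not the fusion mechanism of this application; the function name `fuse_tsdf`, the truncation distance `trunc`, and the per-voxel weight `w_obs` are assumptions introduced for illustration.

```python
import numpy as np

def fuse_tsdf(tsdf, weight, sdf_obs, w_obs, trunc=0.12):
    """Fuse one observation into a voxel grid by a running weighted average.

    tsdf, weight : current per-voxel TSDF values and accumulated weights
    sdf_obs      : observed signed distances for the same voxels (metres)
    w_obs        : per-voxel observation weights (0 means "not observed")
    trunc        : truncation distance (hypothetical default)
    """
    # Truncate the observed distance and normalize it to [-1, 1].
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)
    new_weight = weight + w_obs
    # Guard against division by zero for voxels that were never observed.
    safe = np.where(new_weight > 0, new_weight, 1.0)
    fused = np.where(new_weight > 0, (weight * tsdf + w_obs * d) / safe, tsdf)
    return fused, new_weight
```

A probabilistic variant would additionally carry a variance per voxel and derive `w_obs` from it; that extension is not shown here.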
[0028] However, the existing 3D reconstruction methods still have significant technical limitations when applied to robot perception of complex dynamic environments. First, in terms of geometric detail recovery, existing deep learning reconstruction networks are mostly based on discriminative models, typically using L1 or L2 norms as loss functions to regress the TSDF mean. This mean regression characteristic leads the network to tend to output overly smooth geometric predictions, making it difficult to effectively recover high-frequency geometric details such as thin structures and sharp edges. Furthermore, when facing areas with occlusion, due to the lack of effective generative inference capabilities, the model often cannot perform reasonable structural completion, resulting in holes or geometric gaps in the reconstructed model. Second, in terms of temporal fusion mechanisms, existing temporal loop structures often lack explicit modeling and adaptive filtering capabilities for the confidence of input observations. When the robot encounters image blurring or positioning drift caused by violent movement, this non-adaptive fusion mechanism cannot effectively distinguish between high-quality and low-quality observation frames, causing low-quality data to interfere with the global map. This easily leads to ghosting, tearing, and error accumulation on the reconstructed surface, severely damaging the geometric accuracy and topological consistency of the environmental map.
[0029] In summary, current TSDF reconstruction methods (3D reconstruction methods) are insufficient in generating geometric details and lack robustness of temporal fusion mechanisms under dynamic observation conditions.
[0030] To address the aforementioned technical problems, this application provides a high-fidelity 3D reconstruction method, apparatus, and medium for robot environments. Through a multi-resolution voxel pyramid network and a multi-scale feature back-projection module, computation is performed only within the effective geometric region, significantly reducing memory usage while ensuring the real-time performance of the robot's perception system. By employing a conditional diffusion denoising and completion module, generative priors are introduced to constrain the geometric structure, effectively overcoming the geometric fuzziness and smoothing problems caused by traditional discriminative models. This significantly improves the high-fidelity recovery capability for thin structures, sharp edges, and occluded blind spots. Furthermore, an adaptive temporal probabilistic fusion module is adopted, using gating mechanisms and confidence weighting to dynamically suppress interference from low-quality observation frames such as motion blur or positioning drift on the global map, effectively preventing the accumulation of errors in multi-scale space. Thus, under complex dynamic robot scenes and long-sequence observation conditions, it exhibits higher reconstruction accuracy, stronger structural integrity, and excellent topological consistency, demonstrating good application value and promising prospects. This approach at least addresses the shortcomings of existing 3D reconstruction methods in generating geometric details and the lack of robustness of temporal fusion mechanisms under dynamic observation conditions.
[0031] After introducing the basic principles of this application, various non-limiting embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0032] Embodiment 1: This embodiment provides a high-fidelity 3D reconstruction method for robot environments. As shown in Figure 1, the method includes: Step S101: Obtain the time-series observation sequence for robot environmental perception.
[0033] It should be noted that a time-series observation sequence refers to a collection of multiple frames of continuous observation data acquired by a camera at different consecutive times, arranged in chronological order.
[0034] In one optional embodiment, acquiring the temporal observation sequence for robot environmental perception specifically includes: The robot acquires RGB-D image sequences and camera pose data for environmental perception, and preprocesses the RGB-D image sequences and camera pose data to obtain the time-series observation sequence.
[0035] Specifically, the RGB-D image sequences and camera pose data collected from the robot's environmental perception are preprocessed to construct a unified world coordinate system, and then strictly sorted and organized into a time-series observation sequence according to timestamps.
[0036] Specifically, this embodiment can use standard datasets such as ScanNet, or data collected by real robots, to obtain the RGB-D image sequences and camera pose data for robot environmental perception. First, the pre-calibrated camera intrinsic parameters and distortion parameters are loaded, together with the extrinsic parameters (camera pose data) corresponding to each frame in the RGB-D image sequence, and the data are uniformly transformed into the world coordinate system. Then, the RGB-D images are aligned with the pose sequence, strictly ordered by timestamp, to simulate a real-time data stream.
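The preprocessing described above can be sketched as follows. This is an illustrative outline under stated assumptions: the `Frame` container, the function `build_sequence`, the camera-to-world pose convention, and the choice of the first camera as the world origin are all hypothetical. Lens undistortion is omitted.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float
    rgb: np.ndarray          # (H, W, 3) color image
    depth: np.ndarray        # (H, W) depth map in metres
    T_world_cam: np.ndarray  # (4, 4) camera-to-world pose (assumed convention)

def build_sequence(frames, T_world_ref=None):
    """Sort frames by timestamp and express all poses in one world frame."""
    ordered = sorted(frames, key=lambda f: f.timestamp)
    if T_world_ref is None:
        # Assumption: the first camera in time defines the world origin.
        T_world_ref = np.linalg.inv(ordered[0].T_world_cam)
    return [
        Frame(f.timestamp, f.rgb, f.depth, T_world_ref @ f.T_world_cam)
        for f in ordered
    ]
```

Iterating over the returned list in order then mimics the real-time data stream that the frame-by-frame reconstruction consumes.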
[0037] Step S102: Extract multi-scale two-dimensional features from the current frame image in the time-series observation sequence through a multi-resolution voxel grid pyramid network and a multi-scale feature backprojection module, and backproject the multi-scale two-dimensional features to three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level.
[0038] It should be noted that the preferred number of layers is 3. The initial 3D feature volume corresponding to each layer refers to the set of layer-specific voxel 3D features that are matched one-to-one with each resolution layer and have not been subjected to sparse filtering or temporal optimization.
[0039] Specifically, the time-series observation sequence is processed frame by frame in a recursive manner: following timestamp order, each frame in the sequence is taken in turn as the current time step, and its image is used as the current frame input. For the current frame image, a cascaded multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module are constructed to extract its multi-scale two-dimensional features. Based on the projection geometry, the multi-scale two-dimensional features are back-projected into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level. (Note: after the feature extraction for the current frame and the subsequent update of the three-dimensional reconstruction network model are complete, the algorithm reads the next frame in the sequence as the new current frame and repeats the above operations until the entire time-series observation sequence has been traversed.)
[0040] In an optional embodiment, the step of extracting multi-scale two-dimensional features from the current frame image in the time-series observation sequence using a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and back-projecting the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level, specifically includes: based on the scene size and reconstruction requirements, defining the parameters of a cascaded, coarse-to-fine multi-resolution voxel grid pyramid; inputting the current frame image into the image backbone network to extract multi-scale two-dimensional features, and fusing the multi-scale two-dimensional features in combination with the feature pyramid to obtain a feature pyramid set; for each level in the feature pyramid set, establishing a back-projection mapping from pixels to voxels based on the intrinsic parameters of the level, the extrinsic parameters, and the estimated depth; based on the back-projection mapping from pixels to voxels, calculating a projection weight for each pixel-voxel pair and normalizing it; and calculating the initial 3D feature volume corresponding to the level based on the normalized projection weights.
[0041] Specifically, based on the scene size and reconstruction requirements (for example, by setting the total number of levels), the parameters of a cascaded, coarse-to-fine multi-resolution voxel grid pyramid are defined. The pyramid parameters include the voxel size, spatial boundary, and grid resolution of each level; their function is to establish a physical-scale reference coordinate system and a basic architecture for the subsequent discretization of 3D space and the pixel-to-voxel back-projection. Subsequently, an image backbone network is used to extract multi-level two-dimensional features (i.e., multi-scale two-dimensional features), which are fused across scales by a feature pyramid (i.e., a network structure module that fuses high-level semantic features with low-level high-resolution features through a top-down path), resulting in a feature pyramid set. The resulting feature maps, with resolutions ranging from low to high, correspond to the coarse, medium, and fine levels of voxel space, respectively, and capture geometrically relevant cues such as texture, edges, and normals.
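As an illustration of what such pyramid parameters might look like, the sketch below builds one record per level holding voxel size, spatial boundary origin, and grid resolution. The names `LevelSpec` and `build_pyramid`, the default finest voxel size, and the factor of two between adjacent levels are assumptions made for this example; the source does not state concrete values.

```python
import numpy as np
from dataclasses import dataclass

@dataclass(frozen=True)
class LevelSpec:
    level: int          # 1 = coarsest (initial) level
    voxel_size: float   # voxel edge length in metres
    origin: tuple       # minimum corner of the spatial boundary
    grid_shape: tuple   # number of voxels along x, y, z

def build_pyramid(bounds_min, bounds_max, finest_voxel=0.04, num_levels=3):
    """Define coarse-to-fine voxel grids that cover the same spatial boundary."""
    lo = np.asarray(bounds_min, dtype=float)
    hi = np.asarray(bounds_max, dtype=float)
    specs = []
    for level in range(1, num_levels + 1):
        # Assumption: voxel edge length halves at every finer level.
        size = finest_voxel * 2 ** (num_levels - level)
        # Small epsilon keeps exact multiples from rounding up by one voxel.
        shape = np.ceil((hi - lo) / size - 1e-9).astype(int)
        specs.append(LevelSpec(level, size, tuple(lo.tolist()),
                               tuple(int(s) for s in shape)))
    return specs
```

For example, `build_pyramid((0, 0, 0), (4, 4, 2), finest_voxel=0.0625)` yields three levels with voxel sizes 0.25, 0.125, and 0.0625 m over the same volume.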
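[0042] Furthermore, for each level $l$, based on the intrinsic parameters $\mathbf{K}^{(l)}$ of that level, the extrinsic parameters $(\mathbf{R}, \mathbf{t})$, and the estimated depth $d^{(l)}$, a back-projection mapping from pixels to voxels is established. Its transformation equation is:
$$\mathbf{P}_w^{(l)}(\mathbf{u}) = \mathbf{R}^{\top}\left(d^{(l)}(\mathbf{u})\,\left(\mathbf{K}^{(l)}\right)^{-1}\tilde{\mathbf{u}} - \mathbf{t}\right)$$
where $\mathbf{P}_w^{(l)}(\mathbf{u})$ denotes the three-dimensional point obtained by back-projecting pixel $\mathbf{u}$ at level $l$ into the world coordinate system; $d^{(l)}(\mathbf{u})$ represents the depth value of pixel $\mathbf{u}$ at the corresponding level; $\mathbf{K}^{(l)}$ is the camera intrinsic parameter matrix of the corresponding level; $\mathbf{R}$ and $\mathbf{t}$ are the camera's extrinsic rotation matrix and translation vector, respectively; $\tilde{\mathbf{u}}$ represents the homogeneous coordinate vector of the pixel; the superscript $(l)$ indicates the current resolution level; $\left(\mathbf{K}^{(l)}\right)^{-1}$ is the inverse of the camera intrinsic parameter matrix of the corresponding level; and $\mathbf{R}^{\top}$ is the transpose of the camera extrinsic rotation matrix.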
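The back-projection step can be sketched in NumPy as below. The sketch assumes that the extrinsics follow the world-to-camera convention $\mathbf{x}_c = \mathbf{R}\,\mathbf{x}_w + \mathbf{t}$, so that a pixel with depth $d$ lifts to $\mathbf{R}^{\top}(d\,\mathbf{K}^{-1}\tilde{\mathbf{u}} - \mathbf{t})$. The function name `backproject` is hypothetical.

```python
import numpy as np

def backproject(depth, K, R, t):
    """Lift every pixel of a depth map to a 3D point in world coordinates.

    depth : (H, W) depth values d(u)
    K     : (3, 3) intrinsic matrix of this pyramid level
    R, t  : world-to-camera rotation (3, 3) and translation (3,) (assumed convention)
    """
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))   # pixel columns and rows
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                     # K^-1 u~ for every pixel
    cam = rays * depth.reshape(-1, 1)                   # points in the camera frame
    world = (cam - t.reshape(1, 3)) @ R                 # row-vector form of R^T (x - t)
    return world.reshape(H, W, 3)
```

Each returned point can then be assigned to the voxel that contains it at the given level, which yields the pixel-to-voxel mapping.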
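[0043] Furthermore, for each pixel-voxel pair $(p, v)$, the projection weight $w^{(l)}_{p,v}$ is calculated and normalized:
$$\tilde{w}^{(l)}_{p,v} = \cos\theta_{p,v}\,\exp\!\left(-\frac{\big(z_v - d^{(l)}(p)\big)^{2}}{2\sigma_d^{2}}\right),\qquad w^{(l)}_{p,v} = \frac{\tilde{w}^{(l)}_{p,v}}{\sum_{q\in\Omega_v}\tilde{w}^{(l)}_{q,v}};$$
where $\theta_{p,v}$ denotes the angle between the line-of-sight vector and the voxel surface normal vector; $z_v$ denotes the depth of the center of voxel $v$ in the camera coordinate system; $d^{(l)}(p)$ is the depth value of pixel $p$ at the corresponding level; $\sigma_d$ is the scale hyperparameter that controls the sensitivity to depth consistency; $\Omega_v$ is the set of effective pixels onto which voxel $v$ projects in the image plane; and $q$ is the pixel index that traverses the set $\Omega_v$.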
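[0044] Subsequently, sparse aggregation is performed only on the candidate voxel set $\mathcal{C}^{(l)}$ of the current level to obtain the voxel three-dimensional features $F^{(l)}_{3D}$ (i.e., the initial three-dimensional feature volume):
$$F^{(l)}_{3D}(v) = \sum_{p\in\Omega_v} w^{(l)}_{p,v}\,F^{(l)}_{2D}(p),\qquad v\in\mathcal{C}^{(l)};$$
where $F^{(l)}_{3D}(v)$ denotes the aggregated three-dimensional feature vector of voxel $v$ at level $l$; $F^{(l)}_{2D}$ denotes the two-dimensional image features corresponding to level $l$; $\Omega_v$ is the set of effective pixels onto which voxel $v$ projects; and $w^{(l)}_{p,v}$ is the normalized weighting coefficient calculated above.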
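The weighting and aggregation for one voxel can be sketched as follows. The exact weight function is an assumption of this sketch: it multiplies a viewing-angle term by a Gaussian depth-consistency term controlled by `sigma`, and it clamps back-facing pixels to zero weight, which the text does not state. The function names are hypothetical.

```python
import math


def projection_weights(cos_angles, voxel_depth, pixel_depths, sigma):
    """Normalized weights of the pixels that one voxel projects onto.

    Assumed form: max(cos(theta), 0) * exp(-(z_v - d_p)^2 / (2 sigma^2)),
    normalized over the pixel set of the voxel.
    """
    raw = [
        max(c, 0.0) * math.exp(-((voxel_depth - d) ** 2) / (2.0 * sigma ** 2))
        for c, d in zip(cos_angles, pixel_depths)
    ]
    total = sum(raw)
    if total == 0.0:
        # No pixel supports this voxel, so every weight stays zero
        return [0.0] * len(raw)
    return [w / total for w in raw]


def aggregate_feature(weights, pixel_features):
    """Weighted sum of 2D feature vectors, giving one 3D voxel feature."""
    dim = len(pixel_features[0])
    return [sum(w * f[k] for w, f in zip(weights, pixel_features))
            for k in range(dim)]


weights = projection_weights([1.0, 1.0], 2.0, [2.0, 2.5], sigma=0.5)
voxel_feature = aggregate_feature(weights, [[1.0, 0.0], [0.0, 1.0]])
```

In this example the first pixel's depth agrees with the voxel depth, so it receives the larger weight and dominates the aggregated feature.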
[0045] Step S103: Generate an initial level of sparse candidate voxel set through the sparse voxel generation and screening module, and output other levels of sparse candidate voxel sets containing geometric coverage information and corresponding masks based on the initial three-dimensional feature volume and the initial level of sparse candidate voxel set.
[0046] It should be noted that the sparse candidate voxel set containing geometric coverage information refers to a subset of sparse voxels that retains only the effective spatial region of the scene and can characterize the spatial distribution and geometric coverage of the scene in the current frame. The corresponding mask refers to a binary screening template that marks and distinguishes between effective operational voxels and voxels to be eliminated.
[0047] Specifically, a sparse voxel generation and screening module is constructed, and a coarse-to-fine cascade strategy is used to generate candidate voxels level by level. First, the sparse candidate voxel set of the initial level (i.e., the first level, which is the coarsest level) is generated. Then, for each finer current level (denoted as level $l$, $l \geq 2$), the effective voxels determined to be "non-empty" at the previous level (level $l-1$), together with their occupancy masks, are taken as the sparsification result. This sparsification result is spatially upsampled (e.g., one coarse voxel is split into eight fine voxels) to generate the sparse candidate voxel set of the current level. Subsequently, in combination with the initial three-dimensional feature volume of this level, historical information is preliminarily fused using a gating mechanism, and the sparse candidate voxel set containing geometric coverage information and the corresponding mask of the current level are output.
[0048] In an optional embodiment, the step of generating an initial level of sparse candidate voxels through a sparse voxel generation and filtering module, and outputting other levels of sparse candidate voxel sets containing geometric coverage information and corresponding masks based on the initial 3D feature volume and the initial level of sparse candidate voxels, specifically includes: For the initial level, all low-resolution voxels within the initialized view frustum are used as the sparse candidate voxel set for the initial level. For the other levels, upsampling is performed based on the occupancy prediction results of the previous level to generate a sparse candidate voxel set for the other levels. Historical information is initially fused using a gating mechanism. Based on the initial 3D feature volume and the sparse candidate voxel sets of other levels, a sparse candidate voxel set containing geometric coverage information and a corresponding mask for the other levels are output.
[0049] Specifically, different initialization strategies are adopted for different resolution levels. For the coarsest level (Level 1), an initial view frustum for the current moment is constructed in three-dimensional space based on the camera intrinsic and extrinsic parameters corresponding to the current frame image and the preset depth perception range, and all low-resolution voxels falling inside this initial view frustum are extracted as the sparse candidate voxel set of the initial level. For the finer levels (Level 2, ...), upsampling is performed based on the occupancy prediction results of the previous level (i.e., the sparsification result of the previous level). Specifically, only the sparse voxels determined to be "non-empty" at the previous level are split into subdivided voxels of the current level, with one coarse voxel split into eight fine voxels. This forms the initial subdivided voxel set of the current level, which has not yet undergone the temporal state update and mask filtering (i.e., the basic candidate space to be processed in the subsequent steps).
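The one-to-eight split used for upsampling can be sketched in a few lines of Python. The sketch assumes that voxels are addressed by integer grid indices and that the resolution doubles along each axis between adjacent levels, which is what a one-to-eight split implies; the function names are hypothetical.

```python
def split_voxel(ix, iy, iz):
    """Return the eight child indices of a coarse voxel at doubled resolution."""
    return [(2 * ix + dx, 2 * iy + dy, 2 * iz + dz)
            for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]


def upsample_occupied(coarse_occupied):
    """Split only the non-empty coarse voxels into fine candidate voxels."""
    fine = set()
    for index in coarse_occupied:
        fine.update(split_voxel(*index))
    return fine


# Two non-empty coarse voxels yield sixteen fine candidates
candidates = upsample_occupied({(0, 0, 0), (1, 0, 0)})
```

Because empty coarse voxels are never split, the number of fine candidates grows with the occupied surface region rather than with the full grid volume.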
[0050] Furthermore, the sparse candidate voxel generation process includes three main stages: occupancy probability prediction, temporal state update, and sparse voxel screening.
[0051] Step 1: The voxel visibility score is calculated based on the view occlusion relationship and depth consistency, and, with the current-level features $F^{(l)}_{3D}$ as input, the occupancy probability of each voxel is obtained through a lightweight classification head:
$$o_v = \sigma\!\left(W_o\,f_v + b_o\right);$$
where $o_v$ denotes the occupancy probability of voxel $v$; $\sigma(\cdot)$ is the sigmoid activation function; $W_o$ and $b_o$ are the weight and bias parameters of the lightweight classification head; and $f_v$ is the input feature of voxel $v$ at the current level.
[0052] Step 2: A gated recurrent unit (GRU) network is used to fuse the current voxel features with the historical hidden state of the corresponding level along the time dimension and to update the temporal hidden state:
$$z_t = \sigma\!\left(W_z x_t + U_z h^{(l)}_{t-1}\right),\qquad r_t = \sigma\!\left(W_r x_t + U_r h^{(l)}_{t-1}\right),$$
$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h\big(r_t \odot h^{(l)}_{t-1}\big)\right),\qquad h^{(l)}_t = (1 - z_t)\odot h^{(l)}_{t-1} + z_t \odot \tilde{h}_t;$$
where $z_t$ is the update gate; $r_t$ is the reset gate; $\sigma(\cdot)$ is the sigmoid function; $x_t$ is the input voxel feature at the current time (i.e., the three-dimensional feature volume $F^{(l)}_{3D}$ extracted at the current level); $h^{(l)}_{t-1}$ is the hidden state of level $l$ at the previous time step; and $W_z$, $U_z$, $W_r$, $U_r$, $W_h$, $U_h$ are all learnable parameters of the linear transformations corresponding to this level. In the update gate computation, $W_z$ linearly transforms the voxel input features of the current time step and $U_z$ linearly transforms the hidden state of the previous time step. In the reset gate computation, $W_r$ linearly transforms the voxel input features of the current time step and $U_r$ linearly transforms the hidden state of the previous time step. In the candidate hidden state computation, $W_h$ linearly transforms the voxel input features of the current time step and $U_h$ linearly transforms the hidden state of the previous time step after it has been weighted by the reset gate. $\odot$ denotes element-wise multiplication; $\tanh$ is the hyperbolic tangent activation function; $\tilde{h}_t$ is the candidate hidden state; and $h^{(l)}_t$ is the updated hidden state of the level.
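A single gated recurrent update for one voxel can be sketched in plain Python. This follows the standard GRU form without bias terms; whether the embodiment uses biases, and the parameter names `Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh` in the dictionary, are assumptions of this sketch.

```python
import math


def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_prev, params):
    """One GRU update: gates, candidate state, then the blended hidden state."""
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wz"], x), matvec(params["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wr"], x), matvec(params["Ur"], h_prev))]
    # The reset gate scales the previous state before it enters the candidate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    candidate = [math.tanh(a + b) for a, b in
                 zip(matvec(params["Wh"], x), matvec(params["Uh"], reset_h))]
    # The update gate controls how much of the candidate is written
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, candidate)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
h_new = gru_step([1.0, -1.0], [0.4, -0.2], params)
```

With all-zero parameters both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one; this makes the blending rule easy to verify by hand.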
[0053] Step 3: A mask is constructed based on the occupancy threshold $\tau_{occ}$ and the upper limit on the candidate proportion, and the set is filtered:
$$m_v = \mathbb{1}\!\left[\,o_v > \tau_{occ}\,\right],\qquad \mathcal{C}^{(l)} = \{\,v \mid m_v = 1\,\};$$
where $m_v$ denotes the binary candidate mask of voxel $v$; $\mathbb{1}[\cdot]$ denotes the indicator function, which takes the value 1 if the condition is true and 0 otherwise; $\tau_{occ}$ denotes the set occupancy threshold; and $\mathcal{C}^{(l)}$ denotes the set of sparse candidate voxels selected at the current level (i.e., the set of sparse candidate voxels containing geometric coverage information).
[0054] Finally, connected component analysis is performed on $\mathcal{C}^{(l)}$, isolated voxels are removed to preserve structural coherence, and the updated hidden state $h^{(l)}_t$, the mask $M^{(l)} = \{m_v\}$, and the candidate set $\mathcal{C}^{(l)}$ are output to the next module.
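The screening stage can be sketched as a threshold test followed by a connected component filter. Several details here are assumptions of the sketch rather than statements of the embodiment: the proportion cap is enforced by keeping the highest-probability voxels, connectivity is taken over the six face neighbours, and components smaller than `min_size` are treated as isolated. All names are hypothetical.

```python
from collections import deque


def select_candidates(occupancy, tau, max_ratio):
    """Keep voxels above the threshold, capped at a share of all voxels.

    occupancy maps a voxel index (ix, iy, iz) to its occupancy probability.
    """
    kept = [v for v, prob in occupancy.items() if prob > tau]
    limit = int(max_ratio * len(occupancy))
    kept.sort(key=lambda v: occupancy[v], reverse=True)
    return set(kept[:limit])


def remove_small_components(voxels, min_size):
    """Drop 6-connected components that contain fewer than min_size voxels."""
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    remaining, result = set(voxels), set()
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            x, y, z = queue.popleft()
            for dx, dy, dz in neighbours:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    component.add(n)
                    queue.append(n)
        if len(component) >= min_size:
            result |= component
    return result


occupancy = {(0, 0, 0): 0.9, (1, 0, 0): 0.8, (5, 5, 5): 0.7, (9, 9, 9): 0.1}
mask = select_candidates(occupancy, tau=0.5, max_ratio=1.0)
coherent = remove_small_components(mask, min_size=2)
```

In this example the voxel at (5, 5, 5) passes the occupancy test but has no occupied neighbour, so the component filter removes it.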
[0055] Step S104: The conditional diffusion denoising and completion module constructs a condition set based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, and performs denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty metric.
[0056] It should be noted that the geometrically enhanced feature representation refers to the optimized three-dimensional voxel features with complete spatial geometric constraints, and the observation uncertainty metric refers to a confidence quantification index that characterizes the reliability of the observation result of each voxel.
[0057] Specifically, a conditional diffusion denoising and completion module is constructed. A condition set is constructed based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask. The condition set is then denoised and structurally completed to obtain the geometrically enhanced feature representation and the corresponding observation uncertainty metric.
[0058] In an optional embodiment, the conditional diffusion denoising and completion module constructs a condition set based on a sparse candidate voxel set containing geometric coverage information and corresponding masks, and performs denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure, specifically including: Based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, a condition set is constructed in conjunction with the initial 3D feature volume and the temporal context. The condition set is input into the conditional diffusion model to perform denoising and structured completion, and the output is a geometrically enhanced feature representation and the corresponding observation uncertainty metric.
[0059] Specifically, a 3D U-Net conditional denoising network is used, and the current-level features $F^{(l)}_{3D}$ (i.e., the initial three-dimensional feature volume), the mask $M^{(l)}$, the geometric projection information (i.e., the sparse candidate voxel set containing geometric coverage information), and the historical state $h^{(l)}_t$ (i.e., the temporal context, namely the updated hidden state mentioned above) are combined to form the condition set $c$. With the noise coefficients computed from a linear or cosine noise schedule, the forward diffusion process is performed on the candidate voxel set $\mathcal{C}^{(l)}$:
$$x_k = \sqrt{\bar{\alpha}_k}\,x_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0, I);$$
where $x_0$ is the true voxel geometric distribution (TSDF); $x_k$ denotes the noised voxel value at diffusion time step $k$; $\bar{\alpha}_k$ is the preset cumulative noise coefficient; $\epsilon$ is the sampled standard Gaussian noise; and $\mathcal{N}(0, I)$ denotes the standard normal distribution with zero mean and a covariance equal to the identity matrix.
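The forward diffusion step with a cosine schedule can be sketched as follows. The schedule shown is the commonly used cosine form with a small offset of 0.008; the source only states that a linear or cosine schedule is used, so the offset value and the function names are assumptions of this sketch.

```python
import math
import random


def cosine_alpha_bar(step, num_steps, offset=0.008):
    """Cumulative noise coefficient of a cosine schedule (1 at step 0)."""
    def f(k):
        return math.cos(((k / num_steps + offset) / (1.0 + offset))
                        * math.pi / 2.0) ** 2
    return f(step) / f(0)


def forward_diffuse(x0, step, num_steps, rng):
    """Noise clean TSDF values x0 to the given diffusion step.

    Returns the noised values and the Gaussian noise that was added,
    which is the regression target of the denoising network.
    """
    a_bar = cosine_alpha_bar(step, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xk = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]
    return xk, eps


rng = random.Random(0)
noised, noise = forward_diffuse([0.5, -0.3, 0.1], step=50, num_steps=100, rng=rng)
```

At step 0 the cumulative coefficient is 1 and the values are returned unchanged, while at the final step almost all of the signal has been replaced by noise.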
[0060] Furthermore, the denoising network parameters are iteratively optimized with $\mathcal{L}_{diff}$ as the training objective:
$$\mathcal{L}_{diff} = \mathbb{E}_{x_0,\,\epsilon,\,k}\!\left[\left\|\epsilon - \epsilon_\theta(x_k, k, c)\right\|_2^{2}\right];$$
where $\mathbb{E}$ denotes the mathematical expectation operation, and $\mathbb{E}_{x_0,\epsilon,k}$ denotes the expectation over the variables $x_0$, $\epsilon$, and $k$; $k$ denotes the diffusion time step; $\mathcal{L}_{diff}$ denotes the noise prediction loss function (diffusion loss) of the conditional diffusion model; $\epsilon_\theta$ is the noise predicted by the conditional denoising network; $c$ is the condition set, which includes the camera geometry (i.e., the sparse candidate voxel set containing geometric coverage information), the features $F^{(l)}_{3D}$, the mask $M^{(l)}$, and the temporal context; and $\theta$ denotes the network parameters.
[0061] Finally, in the inference stage, few-step DDIM (Denoising Diffusion Implicit Models) sampling is used on the candidate set $\mathcal{C}^{(l)}$ for reverse denoising, recovering the geometric features $\hat{x}_0$ after structured completion. Based on the output of the network prediction head, the observation mean $\mu^{(l)}_{obs}$ of the current level (i.e., the geometrically enhanced feature representation) and the observation standard deviation $\sigma^{(l)}_{obs}$ (i.e., the uncertainty, namely the observation uncertainty metric) are estimated.
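One reverse step of DDIM sampling can be sketched as below. The sketch uses the deterministic variant (no added noise at each step), which is an assumption, since the source does not state the stochasticity setting. The noise prediction `eps_pred` stands in for the output of the conditional denoising network, and the function name is hypothetical.

```python
import math


def ddim_step(x_cur, eps_pred, a_bar_cur, a_bar_prev):
    """Deterministic DDIM update from the current step to an earlier one.

    a_bar_cur and a_bar_prev are the cumulative noise coefficients of the
    current and the target step. Returns the less noisy sample and the
    current estimate of the clean signal.
    """
    # Estimate the clean signal implied by the predicted noise
    x0_hat = [(x - math.sqrt(1.0 - a_bar_cur) * e) / math.sqrt(a_bar_cur)
              for x, e in zip(x_cur, eps_pred)]
    # Re-noise that estimate to the earlier step along the same noise direction
    x_prev = [math.sqrt(a_bar_prev) * x0 + math.sqrt(1.0 - a_bar_prev) * e
              for x0, e in zip(x0_hat, eps_pred)]
    return x_prev, x0_hat


# With a perfect noise prediction, one step to a_bar_prev = 1 recovers x0
x0, eps = [0.5, -0.3], [1.0, -2.0]
x_noisy = [math.sqrt(0.25) * a + math.sqrt(0.75) * b for a, b in zip(x0, eps)]
x_prev, x0_hat = ddim_step(x_noisy, eps, 0.25, 1.0)
```

Because each step can jump between non-adjacent noise levels, a small number of such steps is enough to traverse the whole schedule, which is what makes few-step sampling possible.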
[0062] Step S105: Based on the geometrically enhanced feature representation and the corresponding observation uncertainty metric, the adaptive temporal probability fusion module updates the global truncated signed distance function (TSDF) estimate using an adaptive gated recurrent unit and a confidence weighting mechanism, and the cascaded probabilistic TSDF three-dimensional reconstruction network model is obtained through iteration at each level.
[0063] Specifically, an adaptive temporal probability fusion module is constructed. Based on the output geometrically enhanced feature representation and observation uncertainty metric, the global truncated signed distance function estimate is updated using an adaptive gated recurrent unit and a confidence weighting mechanism. Through iteration at each resolution level, the cascaded probabilistic TSDF 3D reconstruction network model is obtained.
[0064] In an optional embodiment, the step in which the adaptive temporal probability fusion module updates the global truncated signed distance function estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty metric, using an adaptive gated recurrent unit and a confidence weighting mechanism, and obtains the cascaded probabilistic TSDF 3D reconstruction network model through iteration at each level, specifically includes: based on the geometrically enhanced feature representation and the corresponding observation uncertainty metric, combined with the global historical hidden state of the previous time step, updating the global truncated signed distance function estimate using the adaptive gated recurrent unit and the confidence weighting mechanism; and constructing the cascaded probabilistic TSDF 3D reconstruction network model through coarse-to-fine iteration at each level. The iteration refers to the level-by-level optimization performed with the conditional diffusion denoising and completion module and the adaptive temporal probability fusion module.
[0065] It should be noted that the global historical hidden state of the previous moment refers to the three-dimensional spatial feature representation accumulated and stored in the cascaded recurrent neural network by the adaptive temporal probability fusion module after processing all observation sequences before the current frame.
[0066] Specifically, this hidden state is the set of feature vectors that, after the update of the global truncated signed distance function estimate has been completed at the previous time step ($t-1$), is output by the gated recurrent units at each resolution level and mapped to the global spatial coordinate system. It not only contains the historical inference results on the geometric topology of the three-dimensional scene but also implicitly encodes the probability distribution information (such as the mean and variance) of the environment surface under the observations at multiple previous time points. This state serves as the initial input of the recurrent iteration at the current time step ($t$), allowing the network to use historical "memory" to assist the denoising, completion, and fusion of the current observations, thereby ensuring the temporal consistency of the reconstruction results under long-sequence observations.
[0067] Specifically, a gating scheduling coefficient $\lambda$ is set and an uncertainty-aware constraint is applied to the update gate $z_t$; gated fusion is performed along the temporal dimension to update the hidden state containing probabilistic information:
$$z_t = \sigma\!\left(\lambda\,W_z x_t + (1-\lambda)\,U_z h_{t-1}\right);$$
$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h h_{t-1}\right);$$
$$h_t = (1 - z_t)\odot h_{t-1} + z_t\odot\tilde{h}_t;$$
where $z_t$ is the update gate, which controls the amount of new information written; $\sigma(\cdot)$ is the sigmoid activation function; $\lambda$ is the gating scheduling coefficient, which is automatically optimized through network training and is used to adjust the contributions of the input features and the historical hidden state; $W_z$, $U_z$, $W_h$, $U_h$ are the linear transformation parameters corresponding to this level (i.e., the learnable parameters of the linear transformations); $x_t$ denotes the voxel observation features at the current time after diffusion enhancement (i.e., the geometrically enhanced feature representation); $h_{t-1}$ is the hidden state of the previous time step (i.e., the global historical hidden state of the previous time step); $\tilde{h}_t$ denotes the candidate hidden state at the current time; and $h_t$ is the updated hidden state at the current time.
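[0068] Furthermore, confidence-weighted fusion is performed at the current level $l$. The inverse of the variance is used as the weight to calculate the fused global TSDF estimate $T^{(l)}_{fused}$ (i.e., to update the global truncated signed distance function estimate) and to update the global prior standard deviation $\sigma^{(l)}_{prior}$:
$$w_{obs} = \frac{1}{\big(\sigma^{(l)}_{obs}\big)^{2}},\quad w_{prior} = \frac{1}{\big(\sigma^{(l)}_{prior}\big)^{2}},\quad T^{(l)}_{fused} = \frac{w_{obs}\,\mu^{(l)}_{obs} + w_{prior}\,\mu^{(l)}_{prior}}{w_{obs} + w_{prior}},\quad \sigma^{(l)}_{prior} \leftarrow \sqrt{\frac{1}{w_{obs} + w_{prior}}};$$
where $\mu^{(l)}_{obs}$ and $\sigma^{(l)}_{obs}$ are the mean and standard deviation of the observed TSDF output by the diffusion module at the current level (i.e., the geometrically enhanced feature representation and the corresponding observation uncertainty metric); $\mu^{(l)}_{prior}$ and $\sigma^{(l)}_{prior}$ are the mean and standard deviation of the historically accumulated prior TSDF, respectively; $w_{obs}$ and $w_{prior}$ are the fusion weights of the observation and the prior, respectively; and $T^{(l)}_{fused}$ is the fused global TSDF estimate (i.e., the global truncated signed distance function estimate).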
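Inverse-variance fusion for a single voxel can be written in a few lines. The sketch below treats the observation and the prior as independent Gaussian estimates, under which the fused standard deviation is the square root of the reciprocal of the summed weights; that independence assumption and the function name are not taken from the source.

```python
import math


def fuse_tsdf(mu_obs, sigma_obs, mu_prior, sigma_prior):
    """Fuse an observed and a prior TSDF value with inverse-variance weights.

    Returns the fused TSDF value and the updated prior standard deviation.
    """
    w_obs = 1.0 / sigma_obs ** 2
    w_prior = 1.0 / sigma_prior ** 2
    fused = (w_obs * mu_obs + w_prior * mu_prior) / (w_obs + w_prior)
    sigma_fused = math.sqrt(1.0 / (w_obs + w_prior))
    return fused, sigma_fused


# A blurry frame (large sigma_obs) barely moves a confident prior
value, sigma = fuse_tsdf(mu_obs=0.30, sigma_obs=1.0, mu_prior=0.10, sigma_prior=0.1)
```

The fused standard deviation is never larger than either input, so confidence accumulates over time while low-quality frames contribute little to the global map.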
[0069] If the current level is not the finest level ($l < L$), the fused estimate $T^{(l)}_{fused}$ is sparsified and used as the input to the next level (that is, as the sparsification result of the previous level required by the next level). The method then returns to step S103 and continues the optimization with the conditional diffusion denoising and completion module and the adaptive temporal probability fusion module, finally obtaining the cascaded probabilistic TSDF three-dimensional reconstruction network model.
[0070] In an optional embodiment, after the adaptive temporal probability fusion module updates the global truncated signed distance function estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty metric, using an adaptive gated recurrent unit and a confidence weighting mechanism, and the cascaded probabilistic TSDF 3D reconstruction network model is obtained through iteration at each level, the method further includes: inputting the data of the test set sequentially into the reconstruction network; taking the finest-level globally fused TSDF estimate and the accumulated uncertainty information as input; filtering artifacts in high-uncertainty regions using an adaptive zero-crossing threshold; and extracting isosurfaces with the Marching Cubes algorithm to obtain a high-fidelity, topologically consistent three-dimensional triangular mesh model.
[0071] Specifically, if the finest level has been reached ($l = L$), the data of the pre-prepared test set can be input sequentially into the reconstruction network. The finest-level globally fused TSDF estimate and the accumulated uncertainty information are taken as input, artifacts in high-uncertainty regions are filtered using an adaptive zero-crossing threshold, and isosurfaces are extracted with the Marching Cubes algorithm to obtain a high-fidelity, topologically consistent three-dimensional triangular mesh model.
[0072] Specifically, the zero-crossing threshold $\tau(v)$ is dynamically adjusted by combining the baseline threshold $\tau_0$, the gain coefficient $\kappa$, and the total uncertainty information, and adaptive zero-crossing determination is performed:
$$S(v) = \mathbb{1}\!\left[\,\big|T^{(L)}_{fused}(v)\big| \le \tau(v)\,\right];$$
where $\mathbb{1}[\cdot]$ is an indicator function that outputs 1 (indicating a potential surface voxel) when the condition is met and 0 otherwise; $T^{(L)}_{fused}(v)$ denotes the TSDF value after fusion at the finest level; $\tau_0$ is the baseline cutoff threshold from which $\tau(v)$ is derived; $\kappa$ is the gain coefficient, used to adjust the degree to which the uncertainty influences the threshold width; and $\sigma^{(L)}_{obs}$ and $\sigma^{(L)}_{prior}$ denote the observation uncertainty and the prior uncertainty of the finest level, respectively, which together form the total uncertainty.
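[0073] Furthermore, based on the zero-crossing characteristic of the TSDF values, the Marching Cubes algorithm is used to extract the triangular mesh. The core step is zero-crossing linear interpolation:
$$\alpha = \frac{T_a}{T_a - T_b},\qquad \mathbf{x}^{*} = \mathbf{x}_a + \alpha\,(\mathbf{x}_b - \mathbf{x}_a);$$
where $a$ and $b$ are adjacent voxels whose TSDF values have opposite signs; $T_a$ and $T_b$ are the TSDF values of voxels $a$ and $b$; $\alpha$ is the zero-crossing interpolation coefficient; $\mathbf{x}_a$ and $\mathbf{x}_b$ are the spatial geometric coordinates of voxels $a$ and $b$; and $\mathbf{x}^{*}$ is the precise spatial coordinate of the triangular mesh vertex obtained through interpolation.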
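The zero-crossing interpolation along one grid edge can be sketched as follows. Only the vertex placement is shown, not the full Marching Cubes case table; the function name and the error handling for same-sign inputs are choices made for this sketch.

```python
def interpolate_zero_crossing(pos_a, pos_b, tsdf_a, tsdf_b):
    """Locate the surface vertex on the edge between two adjacent voxels.

    The TSDF is assumed to vary linearly along the edge, so the vertex lies
    where the interpolated value reaches zero.
    """
    if tsdf_a * tsdf_b > 0:
        raise ValueError("TSDF values must have opposite signs")
    if tsdf_a == tsdf_b:
        # Both values are zero: the surface passes through both voxel centres
        return tuple(pos_a)
    alpha = tsdf_a / (tsdf_a - tsdf_b)
    return tuple(a + alpha * (b - a) for a, b in zip(pos_a, pos_b))


vertex = interpolate_zero_crossing((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), -0.25, 0.75)
# vertex == (0.25, 0.0, 0.0): the surface is closer to the voxel with the smaller |TSDF|
```

Placing vertices by interpolation rather than at edge midpoints is what gives the extracted mesh sub-voxel accuracy.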
[0074] Finally, the extracted mesh is post-processed to obtain a high-fidelity, coherent final 3D mesh model.
[0075] It is worth mentioning that the present invention belongs to the fields of computer vision, robot perception, and three-dimensional geometric deep learning. The high-fidelity 3D reconstruction method for robot environments provided herein is based on a sparse diffusion probabilistic fusion mechanism. It constructs a cascaded, coarse-to-fine multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and performs computation only in the effective geometric region, which significantly reduces memory usage while ensuring the real-time performance of the robot perception system. A conditional diffusion denoising and completion module is constructed in the sparse voxel domain, introducing a generative prior to constrain the geometric structure. This effectively overcomes the geometric blurring and over-smoothing caused by traditional discriminative models and significantly improves the high-fidelity recovery of thin structures, sharp edges, and occluded blind spots. Furthermore, an adaptive temporal probability fusion module based on uncertainty metrics is employed. Through the gating mechanism and confidence weighting, it dynamically suppresses the interference of low-quality observation frames, such as those affected by motion blur or localization drift, on the global map, and thereby prevents the accumulation of errors across the multi-scale space. The method achieves higher reconstruction accuracy, stronger structural integrity, and better topological consistency in complex dynamic robot scenarios and under long-sequence observation conditions, and has good application value and prospects for wider adoption. In summary, the present invention enhances geometric details by introducing generative prior constraints and realizes probabilistic TSDF 3D reconstruction with adaptive temporal optimization based on observation uncertainty. It addresses the geometric blurring and over-smoothing caused by discriminative models and, by introducing observation uncertainty constraints, effectively eliminates dynamic and noise interference, thereby ensuring the geometric accuracy and topological consistency of the 3D reconstruction model under complex observation conditions.
[0076] In a specific embodiment, as shown in Figure 2, the high-fidelity 3D reconstruction method for robot environments provided in this embodiment is based on probabilistic TSDF 3D reconstruction using sparse voxel conditional diffusion denoising and adaptive temporal fusion. Specifically, this method may include: S1. The RGB-D image sequence and camera pose data acquired by the robot's environmental perception are preprocessed to construct a unified world coordinate system, and are strictly sorted by timestamp into a time-series observation sequence, which is then divided into a training set, a validation set, and a test set. Specifically, this embodiment uses standard datasets such as ScanNet as well as data collected by a real robot. First, the pre-calibrated camera intrinsic parameters $K$ and distortion parameters, together with the extrinsic parameters $(R, \mathbf{t})$ corresponding to each frame, are retrieved, and the coordinates are uniformly transformed into the world coordinate system. Then, the pose sequence is strictly sorted by timestamp to simulate a real-time data stream. Finally, the processed sequence is divided into training, validation, and test sets according to a standard ratio.
[0077] S2. Construct a cascaded, coarse-to-fine multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module. The current frame image is input into the image backbone network to extract multi-scale two-dimensional features, the multi-scale two-dimensional features are back-projected into three-dimensional space according to the projection geometry, and the initial three-dimensional feature volume corresponding to each level is output. Specifically, the total number of levels $L$ is set based on the scene size and reconstruction requirements, the parameters of the cascaded, coarse-to-fine multi-resolution voxel grid pyramid are defined, and an image backbone network is used to extract multi-level two-dimensional features. These features are then fused across scales by the feature pyramid to obtain a feature pyramid set. The feature maps in this set, with resolutions ranging from low to high, correspond to the coarse, medium, and fine levels of the voxel space, respectively, and capture geometrically relevant cues such as texture, edges, and normals.
[0078] Furthermore, for each level $l$, a back-projection mapping from pixels to voxels is established based on the intrinsic parameters $K^{(l)}$ of that level, the extrinsic parameters $(R, \mathbf{t})$, and the estimated depth. The transformation equation is:
$$P^{(l)}(p) = R^{\top}\left(d^{(l)}(p)\,\big(K^{(l)}\big)^{-1}\tilde{p} - \mathbf{t}\right);$$
where $P^{(l)}(p)$ denotes the three-dimensional point obtained by back-projecting pixel $p$ at level $l$ into the world coordinate system; $d^{(l)}(p)$ denotes the depth value of pixel $p$ at the corresponding level; $K^{(l)}$ is the camera intrinsic parameter matrix of the corresponding level; $R$ and $\mathbf{t}$ are the camera extrinsic rotation matrix and translation vector, respectively; $\tilde{p}$ denotes the homogeneous coordinate vector of the pixel; and the superscript $(l)$ indicates the current resolution level.
[0079] Furthermore, for each pixel-voxel pair $(p, v)$, the projection weight $w^{(l)}_{p,v}$ is calculated and normalized:
$$\tilde{w}^{(l)}_{p,v} = \cos\theta_{p,v}\,\exp\!\left(-\frac{\big(z_v - d^{(l)}(p)\big)^{2}}{2\sigma_d^{2}}\right),\qquad w^{(l)}_{p,v} = \frac{\tilde{w}^{(l)}_{p,v}}{\sum_{q\in\Omega_v}\tilde{w}^{(l)}_{q,v}};$$
where $\theta_{p,v}$ denotes the angle between the line-of-sight vector and the voxel surface normal vector; $z_v$ denotes the depth of the center of voxel $v$ in the camera coordinate system; $d^{(l)}(p)$ denotes the pixel depth at the corresponding level; $\sigma_d$ is the scale hyperparameter that controls the sensitivity to depth consistency; $\Omega_v$ is the set of effective pixels onto which voxel $v$ projects in the image plane; and $q$ is the pixel index that traverses the set $\Omega_v$.
[0080] Subsequently, sparse aggregation is performed only on the candidate voxel set $\mathcal{C}^{(l)}$ of the current level to obtain the voxel three-dimensional features $F^{(l)}_{3D}$:
$$F^{(l)}_{3D}(v) = \sum_{p\in\Omega_v} w^{(l)}_{p,v}\,F^{(l)}_{2D}(p),\qquad v\in\mathcal{C}^{(l)};$$
where $F^{(l)}_{3D}(v)$ denotes the aggregated three-dimensional feature vector of voxel $v$ at level $l$; $F^{(l)}_{2D}$ denotes the two-dimensional image features corresponding to level $l$; and $w^{(l)}_{p,v}$ is the normalized weighting coefficient calculated above.
[0081] S3. Construct a sparse voxel generation and screening module. Taking the initial three-dimensional feature volume and the sparsification result of the previous level as input, the module generates the sparse candidate voxel set of the current resolution level through an upsampling operation and lightweight occupancy prediction, preliminarily fuses historical information using a gating mechanism, and outputs the sparse candidate voxel set containing geometric coverage information and the corresponding mask. Specifically, as shown in Figure 3, different initialization strategies are adopted for different resolution levels. For the coarsest level (Level 1), all low-resolution voxels within the view frustum are initialized as the initial candidate set. For the finer levels (level $l$, $l \geq 2$), upsampling is performed based on the occupancy prediction results of the previous level. Specifically, only the sparse voxels determined to be "non-empty" at the previous level are split into subdivided voxels of the current level, with one coarse voxel split into eight fine voxels, forming the candidate set $\mathcal{C}^{(l)}$ of the current level.
[0082] Furthermore, the sparse candidate voxel generation process includes three main stages: occupancy probability prediction, temporal state update, and sparse voxel screening.
[0083] Step 1: The voxel visibility score is calculated based on the view occlusion relationship and depth consistency, and, with the current-level features $F^{(l)}_{3D}$ as input, the occupancy probability of each voxel is obtained through a lightweight classification head:
$$o_v = \sigma\!\left(W_o\,f_v + b_o\right);$$
where $o_v$ denotes the occupancy probability of voxel $v$; $\sigma(\cdot)$ is the sigmoid activation function; $W_o$ and $b_o$ are the weight and bias parameters of the lightweight classification head; and $f_v$ is the input feature of voxel $v$ at the current level.
[0084] Step 2: A gated recurrent unit (GRU) network is used to fuse the current voxel features with the historical hidden state of the corresponding level along the time dimension and to update the temporal hidden state:
$$z_t = \sigma\!\left(W_z x_t + U_z h^{(l)}_{t-1}\right),\qquad r_t = \sigma\!\left(W_r x_t + U_r h^{(l)}_{t-1}\right),$$
$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h\big(r_t \odot h^{(l)}_{t-1}\big)\right),\qquad h^{(l)}_t = (1 - z_t)\odot h^{(l)}_{t-1} + z_t \odot \tilde{h}_t;$$
where $z_t$ is the update gate; $r_t$ is the reset gate; $\sigma(\cdot)$ is the sigmoid function; $x_t$ is the input voxel feature at the current time (i.e., the three-dimensional feature volume $F^{(l)}_{3D}$ extracted at the current level); $h^{(l)}_{t-1}$ is the hidden state of level $l$ at the previous time step; $W_z$, $U_z$, $W_r$, $U_r$, $W_h$, $U_h$ are all learnable parameters of the linear transformations corresponding to this level; $\odot$ denotes element-wise multiplication; $\tanh$ is the hyperbolic tangent activation function; $\tilde{h}_t$ is the candidate hidden state; and $h^{(l)}_t$ is the updated hidden state of the level.
[0085] Step 3: A mask is constructed based on the occupancy threshold $\tau_{occ}$ and the upper limit on the candidate proportion, and the set is filtered:
$$m_v = \mathbb{1}\!\left[\,o_v > \tau_{occ}\,\right],\qquad \mathcal{C}^{(l)} = \{\,v \mid m_v = 1\,\};$$
where $m_v$ denotes the binary candidate mask of voxel $v$; $\mathbb{1}[\cdot]$ denotes the indicator function, which takes the value 1 if the condition is true and 0 otherwise; $\tau_{occ}$ denotes the set occupancy threshold; and $\mathcal{C}^{(l)}$ denotes the set of sparse candidate voxels selected at the current level.
[0086] Finally, connected component analysis is performed on $\mathcal{C}^{(l)}$, isolated voxels are removed to preserve structural coherence, and the updated hidden state $h^{(l)}_t$, the mask $M^{(l)} = \{m_v\}$, and the candidate set $\mathcal{C}^{(l)}$ are output to the next module.
[0087] S4. Construct a conditional diffusion denoising and completion module. Based on the sparse candidate voxel set and the corresponding mask, a condition set is constructed in combination with the initial three-dimensional feature volume and the temporal context. The condition set is input into the conditional diffusion model to perform denoising and structured completion, and the geometrically enhanced feature representation and the corresponding observation uncertainty metric are output. Specifically, as shown in Figure 4, a 3D U-Net conditional denoising network is used, and the current-level features $F^{(l)}_{3D}$, the mask $M^{(l)}$, the geometric projection information, and the historical state $h^{(l)}_t$ are combined to form the condition set $c$. With the noise coefficients computed from a linear or cosine noise schedule, the forward diffusion process is performed on the candidate voxel set $\mathcal{C}^{(l)}$:
$$x_k = \sqrt{\bar{\alpha}_k}\,x_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0, I);$$
where $x_0$ is the true voxel geometric distribution (TSDF); $x_k$ denotes the noised voxel value at diffusion time step $k$; $\bar{\alpha}_k$ is the preset cumulative noise coefficient; $\epsilon$ is the sampled standard Gaussian noise; and $\mathcal{N}(0, I)$ denotes the standard normal distribution with zero mean and a covariance equal to the identity matrix.
[0088] Furthermore, the denoising network parameters are iteratively optimized with $\mathcal{L}_{diff}$ as the training objective:
$$\mathcal{L}_{diff} = \mathbb{E}_{x_0,\,\epsilon,\,k}\!\left[\left\|\epsilon - \epsilon_\theta(x_k, k, c)\right\|_2^{2}\right];$$
where $\mathbb{E}$ denotes the mathematical expectation operation; $\epsilon_\theta$ is the noise predicted by the conditional denoising network; $c$ denotes the condition set, which includes the camera geometry, the features $F^{(l)}_{3D}$, the mask $M^{(l)}$, and the temporal context; and $\theta$ denotes the network parameters.
[0089] Finally, in the inference stage, few-step DDIM sampling is used on the candidate set $\mathcal{C}^{(l)}$ for reverse denoising, recovering the geometric features $\hat{x}_0$ after structured completion. Based on the output of the network prediction head, the observation mean $\mu^{(l)}_{obs}$ and the observation standard deviation $\sigma^{(l)}_{obs}$ (i.e., the uncertainty) of the current level are estimated.
[0090] S5. Construct an adaptive temporal probability fusion module. Based on the output geometrically enhanced feature representation and observation uncertainty metric, combined with the global historical hidden state of the previous time step, the global truncated signed distance function estimate is updated using an adaptive gated recurrent unit and a confidence weighting mechanism. Through coarse-to-fine iteration over three resolution levels, with level-by-level optimization by the conditional diffusion denoising and completion module and the adaptive temporal probability fusion module, a cascaded probabilistic TSDF three-dimensional reconstruction network model is constructed.
[0091] Specifically, such as Figure 5 As shown, the gating scheduling coefficient is set. and update the door Apply uncertainty-aware constraints; perform gated fusion along the temporal dimension to update the hidden state containing probabilistic information: ; ; ; in, To update the gate, control the amount of new information written; Use the sigmoid activation function; The gating scheduling coefficient is automatically optimized through network training and is used to adjust the contribution of input features and historical hidden states. These are the linear transformation parameters corresponding to this level; The voxel observation features at the current moment are enhanced by diffusion. This is the hidden state from the previous moment; This represents the candidate hidden state at the current moment; This is the hidden state as updated at the current moment.
[0092] Furthermore, confidence-weighted fusion is performed at the current level: the inverse of the variance is used as the weight to compute the fused global TSDF estimate and to update the global prior standard deviation. The inputs are the mean and standard deviation of the observed TSDF output by the diffusion module at the current level, and the mean and standard deviation of the historically accumulated prior TSDF; the fusion weights are those of the observation and of the prior, respectively; and the result is the fused global TSDF estimate.
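Inverse-variance weighting can be sketched per voxel as follows. The function and parameter names are illustrative, the small `eps` guard against division by zero is an addition for numerical safety, and the update rule for the fused standard deviation assumes the standard product-of-Gaussians form, which the text does not spell out.

```python
def fuse_tsdf(mu_obs, sigma_obs, mu_prior, sigma_prior, eps=1e-8):
    """Fuse an observed TSDF value with the accumulated prior for one voxel.

    Each source is weighted by the inverse of its variance, so the more
    certain source dominates the fused mean.
    """
    w_obs = 1.0 / (sigma_obs ** 2 + eps)      # weight of the current observation
    w_prior = 1.0 / (sigma_prior ** 2 + eps)  # weight of the historical prior
    mu_fused = (w_obs * mu_obs + w_prior * mu_prior) / (w_obs + w_prior)
    # Assumed update: fused variance is the reciprocal of the summed weights.
    sigma_fused = (1.0 / (w_obs + w_prior)) ** 0.5
    return mu_fused, sigma_fused
```

For example, an observation with a standard deviation of 10.0 fused against a prior with a standard deviation of 0.1 barely moves the fused mean, which matches the stated goal of suppressing blurred or drifting frames.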
[0093] If the current level is not the finest level, the fused result is sparsified and used as the input for the next level, and execution returns to S3; if the finest level has been reached, execution proceeds to S6.
[0094] S6. Input the data from the test set into the reconstruction network in sequence. Taking the globally fused TSDF estimate at the finest level and the accumulated uncertainty information as input, filter artifacts in high-uncertainty regions using an adaptive zero-crossing threshold, and extract isosurfaces with the Marching Cubes algorithm to obtain a high-fidelity, topologically consistent three-dimensional triangular mesh model.
[0095] Specifically, the zero-crossing threshold is adjusted dynamically from a baseline threshold, a gain coefficient, and the total uncertainty information, and an adaptive zero-crossing determination is performed. The determination is an indicator function that outputs 1 (indicating a potential surface voxel) when the condition is met and 0 otherwise. Its inputs are the fused TSDF value at the finest level; the baseline truncation threshold; the gain coefficient, which adjusts how strongly the uncertainty influences the threshold width; and the observation uncertainty and prior uncertainty at the finest level.
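The exact threshold formula is not recoverable from the text, so the sketch below is only one plausible reading. It assumes that the total uncertainty is the sum of the observation and prior standard deviations, and that the threshold narrows as uncertainty grows, so that uncertain voxels are less likely to be accepted as surface candidates. Both choices, along with the function names and default values, are assumptions made for illustration.

```python
def adaptive_threshold(tau0, k, sigma_obs, sigma_prior):
    """Uncertainty-dependent zero-crossing threshold (assumed form).

    tau0 : baseline truncation threshold
    k    : gain coefficient controlling how strongly uncertainty acts
    """
    total_uncertainty = sigma_obs + sigma_prior  # assumed combination rule
    return tau0 / (1.0 + k * total_uncertainty)  # assumed: narrower when uncertain


def is_surface_candidate(tsdf, sigma_obs, sigma_prior, tau0=0.1, k=2.0):
    """Indicator function: 1 for a potential surface voxel, 0 otherwise."""
    tau = adaptive_threshold(tau0, k, sigma_obs, sigma_prior)
    return 1 if abs(tsdf) < tau else 0
```

Under these assumptions, a voxel with a TSDF value of 0.05 is kept when both uncertainties are zero and rejected when both are 1.0, since the threshold drops from 0.1 to 0.02.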
[0096] Furthermore, based on the zero-crossing property of the TSDF values, the Marching Cubes algorithm is used to extract the triangular mesh. The core step is zero-crossing linear interpolation between two adjacent voxels whose TSDF values have opposite signs: a zero-crossing interpolation coefficient is computed from the TSDF values of the two voxels and applied to their spatial coordinates to obtain the precise spatial coordinates of the triangular mesh vertex.
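The zero-crossing interpolation step can be sketched for a single voxel edge as below. This is the textbook linear interpolation used in Marching Cubes; the function name and the handling of the degenerate case where both values are zero are choices made for this sketch.

```python
def interpolate_zero_crossing(p1, p2, d1, d2):
    """Locate the TSDF zero crossing on the edge between two adjacent voxels.

    p1, p2 : spatial coordinates of the two voxels (sequences of equal length)
    d1, d2 : their TSDF values, expected to have opposite signs
    """
    if d1 * d2 > 0:
        raise ValueError("TSDF values must have opposite signs (or be zero)")
    if d1 == d2:
        # Both values are zero: the surface lies along the edge, use the midpoint.
        t = 0.5
    else:
        # Interpolation coefficient: fraction of the way from p1 to p2.
        t = d1 / (d1 - d2)
    return tuple(a + t * (b - a) for a, b in zip(p1, p2))
```

For voxels at (0, 0, 0) and (1, 0, 0) with TSDF values of -0.25 and 0.75, the coefficient is 0.25 and the vertex lands at (0.25, 0, 0), closer to the voxel whose value is nearer zero.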
[0097] Finally, the extracted mesh is post-processed to obtain a high-fidelity, coherent final 3D mesh model.
[0098] This embodiment uses the ScanNet standard indoor scene dataset for experiments. ScanNet, a benchmark dataset in 3D computer vision, covers hundreds of real, complex indoor scenes, including offices, apartments, and hotels. Its video sequences, captured with a handheld RGB-D camera, contain color images, depth maps, and high-precision camera trajectories. These data characterize scene geometry under varying lighting conditions, texture richness, and line-of-sight occlusion, providing a sufficient basis for verifying the accuracy and robustness of the probabilistic TSDF reconstruction algorithm. In this experiment, the RGB images and depth maps of each sequence are the model inputs, and the high-fidelity 3D triangular mesh model is the final output target. In the preprocessing stage, a unified world coordinate system is first constructed from the pre-calibrated camera intrinsic and distortion parameters, the total number of levels is set according to the scene scale, and a cascaded coarse-to-fine multi-resolution voxel grid pyramid is constructed as the geometric reference. The pose sequences are then strictly sorted by timestamp to simulate the robot's real-time data stream, and the cascaded strategy is used to extract multi-scale features frame by frame and perform sparse voxel updates. Finally, the training, validation, and test sets are split according to the official standard, and the geometric error between the reconstructed model and the ground-truth surface is calculated to evaluate the performance of the method under complex temporal observations.
[0099] The 3D geometric metrics of this embodiment are shown in Table 1. The first real-world 3D model and the first test result of this embodiment are shown in Figure 6a and Figure 6b, respectively. The second real-world scene and the corresponding second test result model are shown in Figure 7a and Figure 7b, respectively.
[0100] Table 1: Test Results 3D Geometric Indicators
[0101] The high-fidelity 3D reconstruction method for robot environments provided in this invention first acquires a temporal observation sequence for robot environmental perception. It then extracts multi-scale two-dimensional features from the current frame image in the temporal observation sequence using a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and back-projects these features into 3D space to obtain the initial 3D feature volume corresponding to each level. Next, the sparse voxel generation and screening module generates a sparse candidate voxel set for the initial level and, based on the initial 3D feature volumes and the initial-level sparse candidate voxel set, outputs for the other levels sparse candidate voxel sets containing geometric coverage information together with their corresponding masks. The conditional diffusion denoising and completion module then constructs a condition set from the sparse candidate voxel sets containing geometric coverage information and their corresponding masks, and performs denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure. Finally, based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, the adaptive temporal probability fusion module updates the global truncated signed distance function (TSDF) estimate using an adaptive gated recurrent unit and a confidence weighting mechanism, and cyclic iteration over the levels yields a cascaded probabilistic TSDF 3D reconstruction network model. By using the multi-resolution voxel grid pyramid network and the multi-scale feature back-projection module, the invention performs computation only within the effective geometric region.
This significantly reduces memory usage while maintaining the real-time performance of the robot's perception system. The conditional diffusion denoising and completion module introduces generative priors to constrain the geometric structure, overcoming the geometric blurring and over-smoothing caused by traditional discriminative models and significantly improving high-fidelity recovery of thin structures, sharp edges, and occluded blind spots. Furthermore, the adaptive temporal probability fusion module uses gating and confidence-weighted dynamic suppression to limit the influence on the global map of low-quality observation frames, such as those affected by motion blur or localization drift, and thereby prevents errors from accumulating across scales. The result is higher reconstruction accuracy, stronger structural integrity, and better topological consistency in complex dynamic robot scenarios and under long-sequence observation conditions. The method therefore addresses the shortcomings of existing 3D reconstruction methods, namely insufficient generation of geometric detail and a lack of robustness of the temporal fusion mechanism under dynamic observation conditions.
[0102] Embodiment 2: As shown in Figure 8, this embodiment provides a high-fidelity 3D reconstruction device for a robot environment, used to perform the above-described high-fidelity 3D reconstruction method for a robot environment. The device includes the following modules. The acquisition module 11 is used to acquire the time-series observation sequence for robot environmental perception. The extraction and back-projection module 12 is connected to the acquisition module 11 and is used to extract multi-scale two-dimensional features from the current frame image in the time-series observation sequence through a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and to back-project the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level. The output generation module 13 is connected to the extraction and back-projection module 12 and is used to generate a sparse candidate voxel set for the initial level through the sparse voxel generation and screening module, and to output, based on the initial three-dimensional feature volume and the initial-level sparse candidate voxel set, sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels. The construction module 14 is connected to the output generation module 13 and is used to construct, through the conditional diffusion denoising and completion module, a condition set based on the sparse candidate voxel sets containing geometric coverage information and the corresponding masks, and to perform denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure. The update module 15 is connected to the construction module 14 and is used to update, through the adaptive temporal probability fusion module, the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism, and to obtain a cascaded probabilistic TSDF three-dimensional reconstruction network model through cyclic iteration over the levels.
[0103] Furthermore, the acquisition module 11 specifically includes an acquisition and preprocessing unit, which is used to acquire RGB-D image sequences and camera pose data for robot environmental perception, and to preprocess the RGB-D image sequences and camera pose data to obtain the time-series observation sequence.
[0104] Furthermore, the extraction and back-projection module 12 specifically includes the following units. A definition unit is used to define the parameters of the cascaded coarse-to-fine multi-resolution voxel grid pyramid according to the scene size and reconstruction requirements. An extraction and fusion unit is used to input the current frame image into the image backbone network to extract multi-scale two-dimensional features, and to fuse the multi-scale two-dimensional features across scales in combination with the feature pyramid to obtain a feature pyramid set. An establishment unit is used to predict the depth for each level in the feature pyramid set and to establish a back-projection mapping from pixels to voxels. A computational processing unit is used to calculate a projection weight for each pixel-voxel pair based on the back-projection mapping from pixels to voxels, and to normalize the weights. A calculation unit is used to calculate the initial three-dimensional feature volume corresponding to the level based on the normalized projection weights.
[0105] Furthermore, the output generation module 13 specifically includes the following units. An initialization unit is used, for the initial level, to take all low-resolution voxels within the initialized view frustum as the sparse candidate voxel set for the initial level. An output generation unit is used, for the other levels, to upsample the occupancy prediction results of the previous level to generate the sparse candidate voxel sets for the other levels, to perform a preliminary fusion of historical information using a gating mechanism, and to output, based on the initial 3D feature volume and the sparse candidate voxel sets of the other levels, sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels.
[0106] Furthermore, the construction module 14 specifically includes the following units. A construction unit is used to construct a condition set based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, combined with the initial three-dimensional feature volume and the temporal context. An output unit is used to input the condition set into the conditional diffusion model to perform denoising and structured completion, and to output the geometrically enhanced feature representation and the corresponding observation uncertainty measure.
[0107] Furthermore, the update module 15 specifically includes an update and construction unit, which is used to update the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, combined with the global historical hidden state of the previous time step, using an adaptive gated recurrent unit and a confidence weighting mechanism, and to construct a cascaded probabilistic TSDF 3D reconstruction network model through coarse-to-fine cyclic iteration over the levels. In this cyclic iteration, the conditional diffusion denoising and completion module and the adaptive temporal probability fusion module refine the result level by level.
[0108] Furthermore, the device also includes an input module, which is used to input the data from the test set into the reconstruction network in sequence. Taking the globally fused TSDF estimate at the finest level and the accumulated uncertainty information as input, it filters artifacts in high-uncertainty regions using an adaptive zero-crossing threshold and extracts isosurfaces with the Marching Cubes algorithm to obtain a high-fidelity, topologically consistent three-dimensional triangular mesh model.
[0109] Embodiment 3: Referring to Figure 9, this embodiment provides a high-fidelity 3D reconstruction device for a robot environment, including a memory 21 and a processor 22. The memory 21 stores a computer program, and the processor 22 is configured to run the computer program to execute the high-fidelity 3D reconstruction method for a robot environment in Embodiment 1.
[0110] The memory 21 is connected to the processor 22. The memory 21 can be a flash memory, a read-only memory or other memory, and the processor 22 can be a central processing unit or a microcontroller.
[0111] Embodiment 4: This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the high-fidelity 3D reconstruction method for a robot environment described in Embodiment 1 above.
[0112] The computer-readable storage medium includes volatile or non-volatile, removable or non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technologies, CD-ROM (Compact Disc Read-Only Memory), DVD or other optical disc storage, cartridges, magnetic tapes, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer.
[0113] In summary, the high-fidelity 3D reconstruction method, apparatus, and medium for robot environments provided in this embodiment of the invention first acquire a temporal observation sequence for robot environmental perception. Multi-scale two-dimensional features are then extracted from the current frame image in the temporal observation sequence through a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and are back-projected into 3D space to obtain the initial 3D feature volume corresponding to each level. Next, the sparse voxel generation and screening module generates a sparse candidate voxel set for the initial level and, based on the initial 3D feature volumes and the initial-level sparse candidate voxel set, outputs for the other levels sparse candidate voxel sets containing geometric coverage information together with their corresponding masks. The conditional diffusion denoising and completion module then constructs a condition set from the sparse candidate voxel sets containing geometric coverage information and their corresponding masks, and performs denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure. Finally, based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, the adaptive temporal probability fusion module updates the global truncated signed distance function (TSDF) estimate using an adaptive gated recurrent unit and a confidence weighting mechanism, and cyclic iteration over the levels yields a cascaded probabilistic TSDF 3D reconstruction network model. By using the multi-resolution voxel grid pyramid network and the multi-scale feature back-projection module, the invention performs computation only within the effective geometric region.
This significantly reduces memory usage while maintaining the real-time performance of the robot's perception system. The conditional diffusion denoising and completion module introduces generative priors to constrain the geometric structure, overcoming the geometric blurring and over-smoothing caused by traditional discriminative models and significantly improving high-fidelity recovery of thin structures, sharp edges, and occluded blind spots. Furthermore, the adaptive temporal probability fusion module uses gating and confidence-weighted dynamic suppression to limit the influence on the global map of low-quality observation frames, such as those affected by motion blur or localization drift, and thereby prevents errors from accumulating across scales. The result is higher reconstruction accuracy, stronger structural integrity, and better topological consistency in complex dynamic robot scenarios and under long-sequence observation conditions. The invention therefore addresses the shortcomings of existing 3D reconstruction methods, namely insufficient generation of geometric detail and a lack of robustness of the temporal fusion mechanism under dynamic observation conditions.
[0114] It is understood that the above embodiments are merely exemplary implementations used to illustrate the principles of the present invention, and the present invention is not limited thereto. For those skilled in the art, various modifications and improvements can be made without departing from the spirit and essence of the present invention, and these modifications and improvements are also considered to be within the scope of protection of the present invention.
Claims
1. A method for high-fidelity 3D reconstruction of a robot environment, characterized in that the method includes: acquiring a time-series observation sequence for robot environmental perception; extracting multi-scale two-dimensional features from the current frame image in the time-series observation sequence using a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and back-projecting the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level; generating, by a sparse voxel generation and screening module, a sparse candidate voxel set for the initial level, and outputting, based on the initial 3D feature volume and the initial-level sparse candidate voxel set, sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels; constructing, by a conditional diffusion denoising and completion module, a condition set based on the sparse candidate voxel sets containing geometric coverage information and the corresponding masks, and performing denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure; and updating, by an adaptive temporal probability fusion module and based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, the global truncated signed distance function (TSDF) estimate using an adaptive gated recurrent unit and a confidence weighting mechanism, and obtaining a cascaded probabilistic TSDF 3D reconstruction network model through cyclic iteration over the levels.
2. The method according to claim 1, characterized in that acquiring the time-series observation sequence for robot environmental perception specifically includes: acquiring RGB-D image sequences and camera pose data for robot environmental perception, and preprocessing the RGB-D image sequences and camera pose data to obtain the time-series observation sequence.
3. The method according to claim 1, characterized in that extracting multi-scale two-dimensional features from the current frame image in the time-series observation sequence using the multi-resolution voxel grid pyramid network and the multi-scale feature back-projection module, and back-projecting the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level, specifically includes: defining, based on the scene size and reconstruction requirements, the parameters of the cascaded coarse-to-fine multi-resolution voxel grid pyramid; inputting the current frame image into an image backbone network to extract multi-scale two-dimensional features, and fusing the multi-scale two-dimensional features across scales in combination with a feature pyramid to obtain a feature pyramid set; for each level in the feature pyramid set, estimating the depth based on the intrinsic and extrinsic parameters of the level, and establishing a back-projection mapping from pixels to voxels; calculating, based on the back-projection mapping from pixels to voxels, a projection weight for each pixel-voxel pair and normalizing the weights; and calculating the initial 3D feature volume corresponding to the level based on the normalized projection weights.
4. The method according to claim 1, characterized in that, The process of generating an initial-level sparse candidate voxel set through a sparse voxel generation and filtering module, and outputting other-level sparse candidate voxel sets containing geometric coverage information and corresponding masks based on the initial 3D feature volume and the initial-level sparse candidate voxel set, specifically includes: For the initial level, all low-resolution voxels within the initialized view frustum are used as the sparse candidate voxel set for the initial level. For the other levels, upsampling is performed based on the occupancy prediction results of the previous level to generate a sparse candidate voxel set for the other levels. Historical information is initially fused using a gating mechanism. Based on the initial 3D feature volume and the sparse candidate voxel sets of other levels, a sparse candidate voxel set containing geometric coverage information and a corresponding mask for the other levels are output.
5. The method according to claim 1, characterized in that, The conditional diffusion denoising and completion module constructs a condition set based on a sparse candidate voxel set containing geometric coverage information and its corresponding mask, and then denoises and completes the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty metric, specifically including: Based on the sparse candidate voxel set containing geometric coverage information and the corresponding mask, a condition set is constructed in conjunction with the initial 3D feature volume and the temporal context. The condition set is input into the conditional diffusion model to perform denoising and structured completion, and the output is a geometrically enhanced feature representation and the corresponding observation uncertainty metric.
6. The method according to claim 1, characterized in that updating, by the adaptive temporal probability fusion module and based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, the global truncated signed distance function (TSDF) estimate using the adaptive gated recurrent unit and the confidence weighting mechanism, and obtaining the cascaded probabilistic TSDF 3D reconstruction network model through cyclic iteration over the levels, specifically includes: updating the global TSDF estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, combined with the global historical hidden state of the previous time step, using the adaptive gated recurrent unit and the confidence weighting mechanism; and constructing the cascaded probabilistic TSDF 3D reconstruction network model through coarse-to-fine cyclic iteration over the levels, wherein in the cyclic iteration the conditional diffusion denoising and completion module and the adaptive temporal probability fusion module refine the result level by level.
7. The method according to claim 1, characterized in that, after the adaptive temporal probability fusion module has updated the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using the adaptive gated recurrent unit and the confidence weighting mechanism, and the cascaded probabilistic TSDF 3D reconstruction network model has been obtained through cyclic iteration over the levels, the method further includes: inputting the data from the test set into the reconstruction network in sequence; taking the globally fused TSDF estimate at the finest level and the accumulated uncertainty information as input; filtering artifacts in high-uncertainty regions using an adaptive zero-crossing threshold; and extracting isosurfaces with the Marching Cubes algorithm to obtain a high-fidelity, topologically consistent three-dimensional triangular mesh model.
8. A high-fidelity 3D reconstruction device for robot environments, characterized in that the device includes: an acquisition module, used to acquire a time-series observation sequence for robot environmental perception; an extraction and back-projection module, connected to the acquisition module and used to extract multi-scale two-dimensional features from the current frame image in the time-series observation sequence through a multi-resolution voxel grid pyramid network and a multi-scale feature back-projection module, and to back-project the multi-scale two-dimensional features into three-dimensional space to obtain the initial three-dimensional feature volume corresponding to each level; an output generation module, connected to the extraction and back-projection module and used to generate a sparse candidate voxel set for the initial level through a sparse voxel generation and screening module, and to output, based on the initial 3D feature volume and the initial-level sparse candidate voxel set, sparse candidate voxel sets containing geometric coverage information and corresponding masks for the other levels; a construction module, connected to the output generation module and used to construct, through a conditional diffusion denoising and completion module, a condition set based on the sparse candidate voxel sets containing geometric coverage information and the corresponding masks, and to perform denoising and structured completion on the condition set to obtain a geometrically enhanced feature representation and a corresponding observation uncertainty measure; and an update module, connected to the construction module and used to update, through an adaptive temporal probability fusion module, the global truncated signed distance function (TSDF) estimate based on the geometrically enhanced feature representation and the corresponding observation uncertainty measure, using an adaptive gated recurrent unit and a confidence weighting mechanism, and to obtain a cascaded probabilistic TSDF three-dimensional reconstruction network model through cyclic iteration over the levels.
9. A high-fidelity 3D reconstruction device for robot environments, characterized in that, It includes a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to implement the high-fidelity 3D reconstruction method for a robot environment as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the high-fidelity three-dimensional reconstruction method for robot environments as described in any one of claims 1-7.