Underwater large scene three-dimensional reconstruction method based on binocular images and application

By employing a segmented reconstruction method based on binocular images, local reconstruction is performed using segment-level sparse point clouds and camera pose. Point cloud registration is then performed in conjunction with Umeyama and ICP algorithms, thus solving the problems of scale drift and computational complexity in underwater 3D reconstruction and achieving high-precision, large-scale underwater 3D reconstruction.

CN122199829A · Pending · Publication Date: 2026-06-12 · OCEAN UNIV OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
OCEAN UNIV OF CHINA
Filing Date
2026-05-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing underwater 3D reconstruction technologies struggle to maintain robustness, consistent scale recovery, and scalability to large scenes under long-distance shooting and underwater optical degradation. They also demand substantial computational resources, which makes long-sequence, large-scene reconstruction difficult to support.

Method used

A segmented reconstruction method based on binocular images is adopted. The underwater image sequence is divided into multiple data segments, and local reconstruction is performed using segment-level sparse point clouds and camera pose. The Umeyama algorithm and ICP algorithm are combined for point cloud registration to achieve globally consistent 3D reconstruction.

Benefits of technology

It achieves high-precision, large-scale 3D reconstruction in complex underwater environments, solves the problems of scale drift and computational complexity, improves reconstruction efficiency and robustness, and is suitable for marine ecological monitoring, engineering inspection and underwater robot navigation.



Abstract

A method for three-dimensional reconstruction of large underwater scenes based on binocular images, and its application, belonging to the field of computer vision and three-dimensional reconstruction. The method comprises: dividing a continuous binocular image sequence into multiple data segments with overlapping frames; performing segment-level pose estimation and sparse reconstruction in a local coordinate system for each data segment based on structure from motion; generating a segment-level dense point cloud with a multi-view stereo algorithm; restoring the scale of each segment using the known physical baseline of the binocular camera; using the Umeyama algorithm for coarse alignment between segments based on the poses of the overlapping frames, followed by iterative ICP registration of the inter-segment point clouds for fine alignment; and fusing and resampling all segment-level point clouds to obtain a complete three-dimensional model of the large underwater scene at a unified scale. Through the technical path of segmented reconstruction, scale recovery, and coarse-to-fine registration, the invention effectively suppresses scale drift in large-scale underwater three-dimensional reconstruction, reduces global computational complexity, and achieves high-precision underwater three-dimensional reconstruction.

Description

Technical Field

[0001] This invention relates to a method and application for 3D reconstruction of large underwater scenes based on binocular images, belonging to the field of computer vision and 3D reconstruction technology.

Background Technology

[0002] Underwater 3D reconstruction technology plays a crucial role in tasks such as marine ecological monitoring, subsea pipeline and cable inspection, marine engineering facility inspection, and autonomous navigation of underwater robots. High-quality 3D scene models not only assist operators in making accurate measurements but can also be used to construct underwater environmental maps, ensure the safe operation of marine engineering equipment, and support the deployment of long-term monitoring systems. With the increasing prevalence of underwater robot platforms, the demand for 3D reconstruction for large-scale, long-range missions is growing. However, underwater optical imaging is affected by light attenuation, scattering, and absorption, and is further degraded by interference from suspended particles and insufficient target surface texture. The result is unstable image quality and reduced contrast, which significantly weakens the robustness of vision-based 3D reconstruction.

[0003] In multi-view scenarios, the combination of classic Structure-from-Motion (SfM) and Multi-View Stereo (MVS) algorithms is the most mature multi-view geometric reconstruction technology in terrestrial environments, but it has the following core problems when processing long underwater image sequences: (1) Since monocular cameras do not provide distance information, even if the relative structure can be recovered in a local view, no measurable real-world scale can be obtained, so the physical scale consistency required in engineering scenarios cannot be met and tasks that require precise distance or volume measurement cannot be handled; (2) Processing time grows as a high-order polynomial, or even exponentially, with the number of images, so the computational cost of long-sequence tasks is extremely high and large scenes are difficult to reconstruct in real time; (3) In sequences captured over long distances or long durations, pose estimation errors and reconstruction errors continue to accumulate, and global optimization cannot completely suppress pose drift and local detail degradation, ultimately leading to deviations or distortions in the overall geometric structure of the reconstruction; (4) Underwater shooting conditions are unstable and image quality changes drastically, so feature extraction and matching are prone to failure, making it difficult for SfM-based methods to maintain stable pose estimation and point cloud density when reconstructing large underwater scenes. Therefore, traditional multi-view geometry methods are not suitable for continuous reconstruction of large-scale underwater environments.

[0004] To overcome the lack of scale in monocular vision data, some underwater platforms employ binocular vision solutions. These solutions can recover depth and absolute scale through physical baselines, providing geometric consistency constraints. Furthermore, binocular cameras achieve pixel-level depth estimation using known baselines, maintaining high robustness in underwater structural degradation, illumination variations, or weakly textured regions. However, existing binocular image-based reconstruction methods still rely on global optimization frameworks. In long sequences, these methods not only face heavy computational burdens but also suffer from uncontrollable scale drift due to the high sensitivity of global solutions to local errors. Moreover, image degradation caused by water scattering and illumination variations further amplifies these errors, significantly reducing the robustness of global reconstruction.

[0005] On the other hand, some deep learning-based feedforward 3D reconstruction methods that have emerged in recent years (such as DUSt3R and VGGSfM) have shown a certain degree of robustness. However, these methods generally rely on stable, high-quality poses as input, generalize poorly to underwater environments, require substantial computational resources, and can process only a limited number of images, making it difficult to support long-sequence, large-scene reconstruction.

[0006] In summary, existing methods struggle to maintain robustness of 3D reconstruction, consistency of scale recovery, and extensibility of large scenes under conditions of long-distance shooting and underwater optical degradation, and the computational cost increases significantly with the number of images.

Summary of the Invention

[0007] The purpose of this invention is to provide a method and application for 3D reconstruction of large underwater scenes based on binocular images. Specifically, it combines segmented reconstruction based on binocular vision with segment-level scale recovery and two-level inter-segment registration. The method addresses the problems of existing underwater 3D reconstruction technologies in complex aquatic environments: being limited to small-scale scenes, susceptibility to scale drift, insufficient reconstruction accuracy, and computation time that rises rapidly with the number of images. It is suitable for large-scale scene reconstruction tasks in complex aquatic environments, achieving high-precision and scalable 3D reconstruction of large-scale underwater environments.

[0008] A method for 3D reconstruction of large underwater scenes based on binocular images, characterized by the following steps:
S1. Acquire a continuous binocular image sequence collected by an underwater mobile platform, and divide the sequence into multiple adjacent data segments with preset overlapping frames;
S2. Using parallel processing, perform structure-from-motion (SfM) reconstruction on the binocular image sequence in each data segment to obtain a segment-level sparse point cloud and a segment-level camera pose sequence;
S3. Based on the segment-level sparse point cloud and the segment-level camera poses, perform multi-view stereo calculations on each data segment to generate a segment-level dense point cloud;
S4. Using the known physical baseline of the binocular camera, calculate the scale factor of each data segment, and perform scale recovery of the segment-level dense point cloud and segment-level camera poses in the segment-level coordinate system of each data segment, so that all data segments share a uniform physical scale;
S5. Based on the camera poses corresponding to the overlapping frames between the dense point clouds of adjacent data segments, calculate the transformation matrix between the point clouds of adjacent data segments;
S6. Based on the transformation relationships between all data segments, transform the dense point clouds of each data segment to a unified global coordinate system and fuse them to obtain a complete three-dimensional point cloud model of the large underwater scene.
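Step S2 processes the data segments in parallel. The following is a minimal sketch of how such dispatch might look, assuming each segment is described by a frame index range. `reconstruct_segment` is a hypothetical placeholder, not part of the patent or of any library; in practice it would launch the per-segment SfM and MVS tools.

```python
from concurrent.futures import ThreadPoolExecutor


def reconstruct_segment(segment):
    """Hypothetical placeholder for per-segment SfM + MVS.

    A real implementation would typically launch an external
    reconstruction process here and return its output paths.
    """
    start, end = segment
    return {"frames": (start, end), "num_frames": end - start}


def reconstruct_all(segments, max_workers=4):
    """Run the per-segment reconstruction concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(reconstruct_segment, segments))


# Three overlapping segments given as (start, end) frame ranges.
results = reconstruct_all([(0, 200), (180, 380), (360, 560)])
```

Threads are sufficient when each worker mostly waits on an external process; for CPU-bound Python work a process pool would be the more usual choice.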

[0009] After the transformation matrix is obtained in step S5, it is used as the initial value for iterative closest point (ICP) registration of the dense point clouds of adjacent data segments, yielding the optimized inter-segment transformation relationship.

[0010] In step S1, the method of dividing the sequence into multiple data segments is a sliding window method based on a fixed number of frames.

[0011] In step S4, the scale factor is the ratio of the known physical baseline length of the binocular camera to the baseline length estimated by the structure-from-motion reconstruction described in S2.

[0012] In step S5, the transformation matrix is calculated using the Umeyama method.

[0013] The constraint used in the iterative closest point registration is the point-to-point distance or the point-to-plane distance.
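For reference, the two constraints are commonly written as the following error functions. The notation is an assumption introduced here, not taken from the patent: $p_i$ is a source point, $q_i$ its closest point in the target cloud, $n_i$ the unit surface normal at $q_i$, and $(R, t)$ the rigid transform being estimated.

```latex
E_{\text{point}}(R, t) = \sum_i \left\lVert R\,p_i + t - q_i \right\rVert^2,
\qquad
E_{\text{plane}}(R, t) = \sum_i \left( n_i^{\top} \left( R\,p_i + t - q_i \right) \right)^2
```

The point-to-plane form penalizes only the displacement along the target normal, so it requires normals to be estimated for the target cloud.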

[0014] In step S6, the point cloud fusion process also includes post-processing steps such as denoising and/or resampling.

[0015] After the stereo image sequence is acquired and before segmentation, the process includes preprocessing steps such as calibration of the stereo camera's intrinsic and extrinsic parameters, and distortion correction and stereo rectification of the sequence images.

[0016] The number of overlapping frames should preferably be greater than or equal to one-tenth of the segment sequence length.

[0017] The underwater large-scene reconstruction method based on binocular images is applied in marine ecological monitoring, engineering inspection, seabed topography modeling, or underwater robot navigation.

[0018] The above method can be used to construct a device for reconstructing large underwater scenes based on binocular vision, including:
The image acquisition and segmentation module, used to acquire stereo image sequences and divide them into data segments with overlapping frames.
The segment-level 3D reconstruction module, used to perform feature extraction and matching, incremental structure-from-motion calculation, and multi-view stereo calculation on each data segment to generate a dense point cloud for each data segment.
The scale restoration module, used to scale the dense point cloud and camera poses of each data segment based on the known physical baseline of the stereo camera.
The point cloud registration module, used to calculate the initial transformation between the point clouds of adjacent data segments based on the camera poses of the overlapping frames, and to perform fine registration using the iterative closest point algorithm.
The point cloud fusion and output module, used to convert the point clouds of all data segments to a unified coordinate system and fuse them to output a complete 3D model.

Beneficial effects

[0019] Compared with existing underwater 3D reconstruction technologies, this invention has the following advantages: Solving the challenge of large-scale scene reconstruction: By introducing an "automatic sequence segmentation" mechanism, the global high-complexity problem is decomposed into multiple local problems that can be processed in parallel, breaking through the bottleneck of the traditional global SfM computational load increasing with the number of images, and significantly improving the processing capability and efficiency for large-scale scenes.

[0020] Effectively suppress scale drift: By using binocular baselines to independently perform absolute scale recovery on each data segment, the problem of scale error accumulation in long sequences in traditional monocular SfM is avoided, ensuring the uniformity and accuracy of global scale.

[0021] Robust inter-segment stitching: A two-level registration strategy combining the Umeyama algorithm and ICP is adopted to realize the entire process from initial alignment to geometric detail optimization, which effectively improves the accuracy and robustness of inter-segment point cloud fusion.

[0022] Adapting to Unstable Underwater Imaging: Segmented processing confines local disturbances such as water scattering and illumination changes to a single segment, preventing the propagation of errors throughout the system and improving the overall adaptability to complex underwater environments.

[0023] Highly scalable: It can process long-term, long-distance seabed mission data and is suitable for scenarios such as marine ecological monitoring, engineering inspection, seabed topography modeling, and underwater robot navigation.

Description of the Attached Figures

[0024] Figure 1 is a schematic diagram of the overall process of the present invention.

[0025] Figure 2 shows the underwater segment-level 3D reconstruction and point cloud stitching results: (a) is the reconstruction result of the first data segment, (b) is the reconstruction result of the second data segment, and (c) is the stitching result of the reconstructions of the first and second data segments.

[0026] Figure 3 is a schematic diagram of the 3D reconstruction results of a large underwater scene.

[0027] Figure 4 is a schematic diagram of the underwater large-scale scene reconstruction and engineering measurement results.

Detailed Implementation

[0028] The present invention will be further described in detail below with reference to embodiments. Those skilled in the art should understand that the present invention is not limited to the following embodiments, and features in the embodiments can be combined with each other when there is no conflict.

[0029] The core of this invention lies in providing a "divide and conquer" scheme for underwater large-scale scene reconstruction. The basic idea is to divide a long sequence into multiple manageable short segments, achieve high-precision dense reconstruction at true physical scale within each segment, and finally stitch all segments together into a globally consistent complete model using a robust registration strategy. This method effectively overcomes the inherent shortcomings of traditional global reconstruction methods in terms of computational complexity and scale drift.

[0030] The method of the present invention includes the following steps:
Acquiring and calibrating binocular images: Acquire a continuous sequence of binocular images from an underwater platform (such as an AUV or ROV); this step may include calibrating the intrinsic and extrinsic parameters of the binocular camera and performing distortion correction on the image sequence;
Automatic sequence segmentation: Based on temporal continuity and the number of images, long sequences are automatically divided into multiple data segments; overlapping frames are set between adjacent data segments to facilitate inter-segment pose association and subsequent registration;
Segment-level feature extraction and matching: The structure-from-motion reconstruction used in this invention implicitly requires establishing visual associations (feature correspondences) within each data segment of binocular images. A method based on deep learning and geometric verification can be used in place of the default association method;
Segment-level sparse reconstruction: A structure-from-motion algorithm estimates the poses of the stereo images in each data segment to obtain the segment-level sparse point cloud and camera poses;
Segment-level dense reconstruction: The sparse point cloud and camera poses are input into a multi-view stereo algorithm to generate a dense point cloud for each segment;
Scale restoration: Based on the known physical baseline of the stereo camera, calculate the scale factor of the segment-level dense point cloud, and bring the dense point cloud and camera poses of each segment to a unified scale;
Inter-segment point cloud alignment: Using the camera poses corresponding to overlapping frames, the Umeyama algorithm calculates the transformation matrix between adjacent data segments to achieve inter-segment point cloud alignment;
Fine point cloud registration: Starting from the inter-segment alignment result, the iterative closest point (ICP) algorithm can be applied to the dense point clouds of adjacent segments to achieve fine registration;
Global point cloud fusion: All segment-level point clouds that have completed scale restoration and fine registration are fused into a single model. Optional denoising and resampling are performed to obtain globally consistent and scale-accurate 3D reconstruction results of large underwater scenes.
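The final fusion step needs every segment expressed in one coordinate system. A minimal sketch of one way to chain the pairwise inter-segment transforms into global ones is shown below, assuming the first segment defines the global frame and each pairwise transform is stored as a 4×4 homogeneous matrix. The function names are illustrative only.

```python
import numpy as np


def similarity_matrix(s, R, t):
    """Pack scale s, rotation R (3x3) and translation t (3,) into a 4x4 matrix."""
    T = np.eye(4)
    T[:3, :3] = s * np.asarray(R, dtype=float)
    T[:3, 3] = t
    return T


def chain_to_global(pairwise):
    """pairwise[k] maps segment k+1 into segment k.

    Returns a list whose entry k maps segment k into segment 0.
    """
    global_T = [np.eye(4)]
    for T in pairwise:
        global_T.append(global_T[-1] @ T)
    return global_T


def transform_points(T, points):
    """Apply a 4x4 similarity transform to an (N, 3) array of points."""
    return np.asarray(points, dtype=float) @ T[:3, :3].T + T[:3, 3]


# Two pairwise transforms, each a pure 1 m shift along x.
shift = similarity_matrix(1.0, np.eye(3), np.array([1.0, 0.0, 0.0]))
global_T = chain_to_global([shift, shift])
moved = transform_points(global_T[2], np.zeros((1, 3)))
```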

Example

[0031] This embodiment provides a system implementing the method for 3D reconstruction of large underwater scenes based on binocular images. The system comprises the following modules: image acquisition and calibration module, sequence segmentation module, sparse reconstruction module, dense reconstruction module, scale restoration module, coarse alignment module, fine registration module, and point cloud fusion module. This system can be deployed on the ground station of an underwater platform or on a back-end data processing server.

[0032] The specific process of this method is shown in Figure 1.

[0033] S110. Acquire the continuous binocular image sequence collected by the underwater platform and complete the camera calibration.

[0034] In this embodiment, the ROV is equipped with an industrial-grade binocular camera to acquire a continuous underwater image sequence.

[0035] The binocular camera is calibrated using Zhang Zhengyou's method followed by a binocular extrinsic parameter calibration procedure:
1. Prepare a 9×6 or 11×8 checkerboard with a square size of 20–25 mm;
2. Collect at least 30 images from each of the left and right cameras;
3. Use sub-pixel corner extraction (accuracy approximately 0.1 pixels);
4. Solve for the intrinsic parameter matrices of the left and right cameras:
$$K_L = \begin{bmatrix} f_x^L & 0 & c_x^L \\ 0 & f_y^L & c_y^L \\ 0 & 0 & 1 \end{bmatrix}, \qquad K_R = \begin{bmatrix} f_x^R & 0 & c_x^R \\ 0 & f_y^R & c_y^R \\ 0 & 0 & 1 \end{bmatrix}$$
where $(f_x^L, f_y^L)$ and $(c_x^L, c_y^L)$ are the focal length and principal point of the left camera, and $(f_x^R, f_y^R)$ and $(c_x^R, c_y^R)$ are the focal length and principal point of the right camera;
5. Solve for the distortion parameters $D_L = (\mathbf{k}^L, \mathbf{p}^L)$ and $D_R = (\mathbf{k}^R, \mathbf{p}^R)$, where $\mathbf{k}^L$ and $\mathbf{p}^L$ are the radial and tangential distortion coefficients of the left camera, and $\mathbf{k}^R$ and $\mathbf{p}^R$ are the radial and tangential distortion coefficients of the right camera;
6. Calibrate the binocular extrinsic parameters and determine the baseline: rotation matrix $R$ and translation vector $T$, with baseline length $b = \lVert T \rVert$. Typical baseline length: 70–120 mm.
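As a small illustration of the last calibration step, the physical baseline is the Euclidean norm of the extrinsic translation vector between the two cameras. The sketch below assumes the translation is expressed in metres; the numeric values are hypothetical and not taken from the patent.

```python
import numpy as np


def intrinsic_matrix(fx, fy, cx, cy):
    """Build a 3x3 pinhole intrinsic matrix from focal length and principal point."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])


def baseline_length(T):
    """Physical baseline = norm of the left-to-right translation vector."""
    return float(np.linalg.norm(T))


# Hypothetical calibration output: 100 mm baseline along the x axis.
K_left = intrinsic_matrix(fx=800.0, fy=800.0, cx=320.0, cy=240.0)
T = np.array([-0.1, 0.0, 0.0])
b = baseline_length(T)
in_typical_range = 0.070 <= b <= 0.120
```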

[0036] S120. Automatically divide the sequence into overlapping data segments based on temporal continuity. Since image sequences of large underwater scenes can reach thousands of frames in length, direct global reconstruction would lead to the following problems:
a. The computational load increases quadratically with the number of frames;
b. Trajectory drift is prone to occur with the commonly used COLMAP pipeline;
c. Geometric feature extraction and visual association are unstable under weak underwater texture conditions.

[0037] To this end, the present invention adopts the idea of "segmented reconstruction": the long sequence is divided along its temporal continuity into several overlapping short segments, which significantly reduces COLMAP drift and computational load.

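[0038] Given a complete stereo sequence $\mathcal{I} = \{(I_i^L, I_i^R)\}_{i=1}^{N}$, automatic segmentation with segment length $N_s$ is represented as follows:

$$\mathcal{S}_k = \left\{ (I_i^L, I_i^R) \;\middle|\; (k-1)(N_s - M) < i \le \min\big((k-1)(N_s - M) + N_s,\; N\big) \right\}, \quad k = 1, 2, \dots$$

An overlap of $M$ frames is maintained between adjacent segments. $M$ should preferably be greater than or equal to one-tenth of the segment length, i.e. $M \ge N_s / 10$.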
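The fixed-length sliding window with overlapping frames can be sketched as follows. This is an illustrative implementation only; the function name and the zero-based, end-exclusive index convention are assumptions, and the last segment is simply truncated at the end of the sequence.

```python
def split_into_segments(num_frames, segment_len, overlap):
    """Return (start, end) frame ranges, end exclusive.

    Adjacent ranges share exactly `overlap` frames.
    """
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    stride = segment_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments


# 1000 stereo frames, 200-frame segments, 20-frame overlap (one-tenth of the segment).
segments = split_into_segments(num_frames=1000, segment_len=200, overlap=20)
overlaps = [prev_end - start for (_, prev_end), (start, _) in zip(segments, segments[1:])]
```

With these values the stride is 180 frames, giving six segments whose final one covers frames 900 to 999.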

[0039] S130. Segment-level feature extraction and matching. Traditional feature matching methods are prone to failure due to the weak texture, uneven lighting, and interference from suspended particles in the underwater environment. To address this issue, this embodiment uses existing technologies to build a feature extraction and matching method that combines deep learning with geometric priors. The specific process is as follows: (1) Feature extraction: Preferably, a deep learning-based keypoint detection and description method, such as the DISK feature extractor, is used to extract feature points and improve feature recall in weakly textured scenes and turbid water. Those skilled in the art may also consider other features with similar robustness, such as SuperPoint.

[0040] (2) Global Retrieval: MegaLoc global descriptors are used for feature encoding and retrieval of images. By calculating the global similarity between images, image pairs that may have co-view relationships are screened out, effectively dealing with large viewpoint changes and loop closure detection.

[0041] (3) Feature matching: To further improve matching accuracy and robustness, a learned attention-based matcher, such as LightGlue, can be used. LightGlue aggregates contextual information through an attention mechanism, effectively eliminating mismatches caused by suspended particles. Using the image pairs obtained from global retrieval as a reference, feature matching between binocular images is achieved.

(4) Rig geometric constraint injection and database construction: By parsing the calibration configuration file, the camera model is initialized in the database, and the feature extraction and feature matching results are written to it for the subsequent structure-from-motion sparse reconstruction.
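The global retrieval in step (2) can be illustrated by ranking images by the cosine similarity of their global descriptors and keeping the top candidates for each image. The sketch below is generic: the descriptors are small hand-made vectors standing in for real MegaLoc outputs, and `retrieve_pairs` is a hypothetical helper, not an API of any library.

```python
import numpy as np


def retrieve_pairs(descriptors, top_k=2):
    """Return sorted unique image index pairs (i, j), i < j.

    Each image is paired with its top_k most similar images
    under cosine similarity of the global descriptors.
    """
    d = np.asarray(descriptors, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = d @ d.T
    np.fill_diagonal(sim, -np.inf)  # never pair an image with itself
    pairs = set()
    for i in range(len(d)):
        for j in np.argsort(-sim[i])[:top_k]:
            j = int(j)
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Four toy descriptors: images 0 and 1 look alike, as do images 2 and 3.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pairs = retrieve_pairs(toy, top_k=1)
```

Only the returned pairs would then be passed to the feature matcher, which avoids exhaustive matching of all image pairs in a segment.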

[0042] S140. Segment-level sparse reconstruction based on structure from motion. Taking COLMAP as an example, bundle adjustment (BA) is performed within each segment. By minimizing the reprojection error, the camera poses and 3D point coordinates are jointly optimized. The outputs are the camera pose sequence and the sparse point cloud of each segment.
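The reprojection error that bundle adjustment minimizes can be written down directly. The sketch below evaluates it for one camera under an undistorted pinhole model, with `R` and `t` assumed to map world coordinates to camera coordinates. It only evaluates the residuals and does not perform the optimization, which COLMAP handles internally.

```python
import numpy as np


def reprojection_errors(K, R, t, points_3d, observations):
    """Pixel distance between observed keypoints and projected 3D points."""
    cam = np.asarray(points_3d, dtype=float) @ R.T + t  # world -> camera frame
    uvw = cam @ K.T                                     # apply intrinsics
    projected = uvw[:, :2] / uvw[:, 2:3]                # perspective division
    return np.linalg.norm(projected - observations, axis=1)


K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]])
observed = np.array([[320.0, 240.0], [521.0, 240.0]])  # second keypoint is 1 px off
errors = reprojection_errors(K, np.eye(3), np.zeros(3), points, observed)
```

Bundle adjustment sums the squares of these residuals over all cameras and points and minimizes the total with respect to the poses and 3D coordinates.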

[0043] S150. Segment-level dense reconstruction based on a multi-view stereo algorithm. Sparse reconstruction is mainly used to obtain reliable camera poses, while dense reconstruction is responsible for generating complete, structurally continuous, high-resolution point clouds. This embodiment uses OpenMVS as the dense reconstruction framework to estimate depth maps and convert all depth maps into dense 3D point clouds, outputting segment-level dense point clouds.
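Converting a depth map into 3D points is a back-projection through the camera intrinsics followed by a change to the world frame. The following is a generic pinhole sketch, not OpenMVS code; it assumes depth is stored along the camera z axis, zero marks invalid pixels, and `R`, `t` map world coordinates to camera coordinates.

```python
import numpy as np


def depth_to_points(depth, K, R, t):
    """Back-project valid depth pixels to world-frame 3D points, shape (N, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    cam = np.stack([x, y, z], axis=1)
    # Invert X_cam = R X_world + t, written for row vectors.
    return (cam - t) @ R


K = np.array([[100.0, 0.0, 0.5],
              [0.0, 100.0, 0.5],
              [0.0, 0.0, 1.0]])
depth = np.array([[1.0, 0.0],
                  [1.0, 1.0]])  # one invalid pixel
points = depth_to_points(depth, K, np.eye(3), np.zeros(3))
```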

[0044] S160. Calculate the scale factor to achieve segment-level scale recovery.
(1) Calculate the segment-level scale factor from the calibrated baseline and the estimated baseline: $s_k = b_{\text{real}} / b_k^{\text{est}}$, where $b_{\text{real}}$ is the physical baseline obtained from the binocular calibration in S110, and $b_k^{\text{est}}$ is the baseline estimated during the sparse point cloud reconstruction of the $k$-th segment;
(2) Segment-level point cloud scaling: if the point cloud of segment $k$ is $P_k$, the scaled result is $P_k' = s_k P_k$;
(3) Camera center scaling: if the camera center of segment $k$ output by COLMAP is $C_k$, the scaled camera center is $C_k' = s_k C_k$.
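The scale recovery above amounts to a single ratio and two multiplications. In the sketch below, the estimated baseline is taken as the median distance between the reconstructed left and right camera centers over all frames of the segment. The median is an assumption made here for robustness; the patent does not state how the per-frame estimates are aggregated.

```python
import numpy as np


def segment_scale_factor(b_real, left_centers, right_centers):
    """Ratio of the calibrated baseline to the SfM-estimated baseline."""
    diffs = np.asarray(left_centers, dtype=float) - np.asarray(right_centers, dtype=float)
    b_est = np.median(np.linalg.norm(diffs, axis=1))  # assumed aggregation: median
    return b_real / b_est


def apply_scale(points, centers, s):
    """Scale the segment's point cloud and camera centers by the same factor."""
    return s * np.asarray(points, dtype=float), s * np.asarray(centers, dtype=float)


# Toy segment: SfM places the two cameras 0.5 units apart; the rig baseline is 0.1 m.
left = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
right = left + np.array([0.5, 0.0, 0.0])
s_k = segment_scale_factor(0.1, left, right)
scaled_points, scaled_centers = apply_scale(np.array([[5.0, 0.0, 10.0]]), left, s_k)
```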

[0045] S170. Inter-segment point cloud alignment based on Umeyama. For two adjacent segments $k$ and $k+1$, the shared overlapping frames are used to establish corresponding camera centers between the two segments. The sets of camera centers of the overlapping frames in the two segments are $\{c_i^{k}\}_{i=1}^{M}$ and $\{c_i^{k+1}\}_{i=1}^{M}$, respectively.

[0046] Based on this set of shared viewpoints, a least-squares problem is constructed to solve for the similarity transformation from the $(k+1)$-th segment to the $k$-th segment, comprising the scale factor $s$, rotation matrix $R$, and translation vector $t$, so as to minimize the error between the camera centers of the overlapping frames:

$$\min_{s, R, t} \sum_{i=1}^{M} \left\lVert c_i^{k} - \left( s R\, c_i^{k+1} + t \right) \right\rVert^2 .$$

[0047] The Umeyama algorithm solves this objective function in closed form, analytically yielding the optimal $s$, $R$, and $t$. Finally, using the calculated similarity transformation parameters, the coordinates of all sparse point clouds and camera poses in the $(k+1)$-th segment are transformed to the coordinate system of the $k$-th segment, completing the spatial alignment and scale unification of adjacent segments. In this embodiment, frame pairs with large co-visible areas can be preferred for calculating the transformation matrix.
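A compact NumPy sketch of the closed-form Umeyama solution is given below. Here `src` plays the role of the overlapping-frame camera centers of segment k+1 and `dst` those of segment k. The synthetic test data and variable names are illustrative and not from the patent.

```python
import numpy as np


def umeyama(src, dst):
    """Similarity transform (s, R, t) minimizing sum ||dst_i - (s R src_i + t)||^2."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n, dim = src.shape
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / n                  # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(dim)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0                       # guard against a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: transform known centers and recover the parameters.
rng = np.random.default_rng(0)
centers_next = rng.normal(size=(10, 3))        # centers in segment k+1
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 1.02, np.array([0.5, -1.0, 2.0])
centers_prev = s_true * centers_next @ R_true.T + t_true   # same centers in segment k
s, R, t = umeyama(centers_next, centers_prev)
```

One practical caveat: if the overlapping camera centers are nearly collinear, as on a straight transect, the rotation about the travel direction is not constrained by the centers alone, which is one reason a subsequent fine registration on the dense clouds is useful.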

[0048] S180. Optimized point cloud alignment based on the ICP algorithm. After Umeyama provides the initial coarse alignment, ICP registration is performed on the two adjacent dense point cloud segments $P_k$ and $P_{k+1}$ to obtain an optimized aligned point cloud. To improve the convergence speed and accuracy of ICP, the point clouds can be downsampled before registration, with the coarse alignment serving as the initial value. The fused adjacent segments are shown in Figure 2.
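The fine registration step can be illustrated with a minimal point-to-point ICP. This sketch uses brute-force nearest-neighbour search in NumPy, which is only practical for small, downsampled clouds; a production pipeline would use a spatial index or an existing registration library. The optional `R0` and `t0` arguments stand for the coarse alignment used as the initial value.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rotation and translation mapping src onto dst (Kabsch)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # fix an improper rotation
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s


def icp_point_to_point(src, dst, R0=None, t0=None, iters=30, tol=1e-10):
    """Estimate R, t such that R @ src_i + t aligns with dst."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    R = np.eye(3) if R0 is None else np.array(R0, dtype=float)
    t = np.zeros(3) if t0 is None else np.array(t0, dtype=float)
    prev_err = np.inf
    for _ in range(iters):
        moved = src @ R.T + t
        # Brute-force squared distances, O(N*M) memory.
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nn = d2.argmin(axis=1)
        err = np.sqrt(d2[np.arange(len(src)), nn]).mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
        R, t = best_rigid_transform(src, dst[nn])
    return R, t


# Toy data: a 5x5x5 grid and a slightly rotated, shifted copy of it.
axes = [np.arange(5.0)] * 3
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
theta = 0.05
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.03, 0.02])
target = grid @ R_true.T + t_true
R_est, t_est = icp_point_to_point(grid, target)
```

The example converges because the initial misalignment is smaller than half the point spacing, which mirrors the role of the coarse alignment: ICP only refines a solution that is already close.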

[0049] S190. Merge all segment-level point clouds and output a complete large-scale scene point cloud. As shown in Figure 3, the segment-level dense point clouds are fused segment by segment on the basis of S180, ultimately outputting a complete dense point cloud of the large scene. This embodiment not only achieves large-scene reconstruction but also ensures high geometric accuracy, as shown by the measurement results in Figure 4.
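Fusion with resampling can be sketched as concatenating the aligned clouds and replacing all points that fall in the same voxel by their centroid. Voxel-grid averaging is one common resampling choice and is an assumption here; the patent does not specify the resampling method.

```python
import numpy as np


def fuse_and_voxel_downsample(clouds, voxel_size):
    """Merge (N_i, 3) clouds and keep one centroid per occupied voxel."""
    pts = np.vstack([np.asarray(c, dtype=float) for c in clouds])
    keys = np.floor(pts / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)          # shape differs across NumPy versions
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, pts)          # accumulate the points of each voxel
    return sums / counts[:, None]


# Two already-aligned toy segments; three points share one 5 cm voxel.
segment_a = np.array([[0.00, 0.0, 0.0], [0.01, 0.0, 0.0]])
segment_b = np.array([[0.02, 0.0, 0.0], [1.00, 1.0, 1.0]])
fused = fuse_and_voxel_downsample([segment_a, segment_b], voxel_size=0.05)
```

Averaging per voxel also evens out the doubled point density in the overlap regions between adjacent segments.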

[0050] The final output formats include PLY point clouds and PCD point clouds. Optionally, a mesh may also be output.
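For the PLY output, an ASCII file with only vertex positions is enough for most viewers. The writer below is a minimal sketch with a hypothetical function name; real pipelines usually write binary PLY with colors and normals through a point cloud library.

```python
import os
import tempfile

import numpy as np


def write_ply_ascii(path, points):
    """Write an (N, 3) array of xyz coordinates as an ASCII PLY file."""
    points = np.asarray(points, dtype=float)
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for x, y, z in points:
            f.write(f"{x:.6f} {y:.6f} {z:.6f}\n")


out_path = os.path.join(tempfile.mkdtemp(), "scene.ply")
write_ply_ascii(out_path, np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]))
```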

[0051] A typical reconstruction can reach 10-30 million points, with a scene area of 100 square meters.

[0052] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for 3D reconstruction of large underwater scenes based on binocular images, characterized in that it includes the following steps:
S1. Acquire a continuous binocular image sequence collected by an underwater mobile platform, and divide the sequence into multiple adjacent data segments with preset overlapping frames;
S2. Using parallel processing, perform structure-from-motion (SfM) reconstruction on the binocular image sequence in each data segment to obtain a segment-level sparse point cloud and a segment-level camera pose sequence;
S3. Based on the segment-level sparse point cloud and the segment-level camera poses, perform multi-view stereo calculations on each data segment to generate a segment-level dense point cloud;
S4. Using the known physical baseline of the binocular camera, calculate the scale factor of each data segment, and perform scale recovery of the segment-level dense point cloud and segment-level camera poses in the segment-level coordinate system of each data segment, so that all data segments share a uniform physical scale;
S5. Based on the camera poses corresponding to the overlapping frames between the dense point clouds of adjacent data segments, calculate the transformation matrix between the point clouds of adjacent data segments;
S6. Based on the transformation relationships between all data segments, transform the dense point clouds of each data segment to a unified global coordinate system and fuse them to obtain a complete three-dimensional point cloud model of the large underwater scene.

2. The method according to claim 1, characterized in that, after the transformation matrix is obtained in step S5, it is used as the initial value for iterative closest point (ICP) registration of the dense point clouds of adjacent data segments, yielding the optimized inter-segment transformation relationship.

3. The method according to claim 1, characterized in that, in step S1, the method of dividing the sequence into multiple data segments is a sliding window method based on a fixed number of frames.

4. The method according to claim 1, characterized in that, in step S4, the scale factor is the ratio of the known physical baseline length of the binocular camera to the baseline length estimated by the structure-from-motion reconstruction described in S2.

5. The method according to claim 1, characterized in that, in step S5, the transformation matrix is calculated using the Umeyama method.

6. The method according to claim 2, characterized in that the constraint used in the iterative closest point registration is the point-to-point distance or the point-to-plane distance.

7. The method according to claim 1, characterized in that, in step S6, the point cloud fusion process also includes post-processing steps such as denoising and/or resampling.

8. The method according to claim 1, characterized in that, after the stereo image sequence is acquired and before segmentation, the process includes preprocessing steps such as calibration of the stereo camera's intrinsic and extrinsic parameters, and distortion correction and stereo rectification of the sequence images.

9. Use of the method according to any one of claims 1-8 in marine ecological monitoring, engineering inspection, seabed topography modeling, or underwater robot navigation.