A method, system, device, and medium for stereo depth estimation
By processing stereo image pairs step by step, using a visual base model together with stereo geometric constraints for scale calibration and iterative optimization, the method addresses the difficulty of balancing global consistency and local metric accuracy in underwater stereo depth estimation, and generates a high-precision metric depth map suitable for underwater robot operations and surveying tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PEKING UNIV
- Filing Date
- 2026-03-03
- Publication Date
- 2026-06-23
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, specifically to a stereo depth estimation method, system, device, and medium.
Background Technology
[0002] Stereo depth estimation is a key technology in robotics, widely used in tasks such as environmental understanding, autonomous navigation, mechanical manipulation, and target detection. It enables the acquisition of accurate 3D geometric information of a scene using low-cost passive binocular cameras and effectively avoids the scale ambiguity problem commonly found in monocular depth estimation. Underwater stereo depth estimation, as an important branch, is crucial for tasks such as mapping operations of autonomous underwater vehicles (AUVs) / remotely operated underwater vehicles (ROVs), underwater infrastructure inspection, marine ecological monitoring, and underwater archaeology. Reliable underwater 3D geometric information directly determines the autonomy and safety of underwater robot operations. Existing stereo depth estimation methods are mostly trained on terrestrial scenes. When applied to underwater scenes, they need to be adapted to the characteristics of underwater imaging, while simultaneously balancing the global consistency of monocular depth estimation with the local metric accuracy of stereo matching.
[0003] Underwater imaging environments differ significantly from those on land. Influenced by wavelength-dependent attenuation, forward and backscattering, and refraction at the water-glass interface, underwater imaging breaks the photometric assumptions upon which terrestrial stereo vision relies, leading to severe domain shift problems. This makes it difficult for large-scale visual encoders trained on terrestrial scenes to be efficiently adapted to the underwater environment. Furthermore, in existing technologies, monocular depth estimation results possess global consistency but suffer from scale ambiguity, while stereo matching achieves accurate local measurements but is photometrically fragile. The failure to efficiently and tightly integrate these two approaches makes it difficult to balance global consistency with local measurement accuracy, thus failing to meet the practical needs of underwater stereo depth estimation.
Summary of the Invention
[0004] The main objective of this invention is to provide a stereo depth estimation method, system, device, and medium. It achieves high-precision stereo depth estimation of the image to be processed through step-by-step processing, generating a depth map with practical metric significance. First, a stereo image pair is acquired as the basic input, and a visual base model performs monocular depth estimation on the left view to quickly obtain monocular depth reference information that serves as a benchmark for subsequent processing. Second, the geometric constraints of the stereo image pair are used to correct the scale deviation of the monocular depth reference information, transforming the monocular depth, which lacks absolute scale, into an initial disparity with a calibrated scale. Finally, the stereo geometric features of the stereo image pair are extracted to iteratively optimize the initial disparity, correcting local errors and improving detail accuracy. A usable metric depth map is then generated from the final disparity, addressing both the scale ambiguity of monocular depth estimation and the limited accuracy of the initial disparity, and ensuring the practicality and accuracy of the depth results.
[0005] To achieve the above objectives, the embodiments of this application provide the following technical solutions: According to a first aspect of the embodiments of this application, a stereo depth estimation method is provided, the method comprising: obtaining a stereo image pair of the image to be processed, wherein the stereo image pair includes a left view and a right view of the image to be processed; calling the visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information; using the geometric constraints of the stereo image pair to perform scale calibration on the monocular depth reference information to obtain the initial disparity; iteratively optimizing the initial disparity using the stereo geometric features extracted from the stereo image pair to obtain the final disparity; and generating a metric depth map based on the final disparity.
[0006] Optionally, the step of calling the visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information includes: adjusting the encoder of the visual base model using low-rank adaptation technology to obtain an encoder adapted to the target scene, and integrating the adapted encoder into the visual base model; inputting the left view into the visual base model integrated with the adapted encoder and extracting a multi-scale feature pyramid, wherein the multi-scale feature pyramid includes image features at several resolution levels; performing adjacent-scale feature aggregation on the multi-scale feature pyramid to obtain aggregated features; and decoding the aggregated features to generate the monocular depth reference information.
[0007] Optionally, adjusting the encoder of the visual base model using low-rank adaptation technology to obtain an encoder adapted to the target scene includes: introducing a low-rank adaptation module into the encoder of the visual base model, wherein the low-rank adaptation module includes a low-rank matrix and is used to incrementally update the base weights of the encoder; assigning a learnable importance weight to each rank component in the low-rank matrix; optimizing and training the low-rank matrix parameters and the importance weights in the low-rank adaptation module while keeping the pre-trained base weights of the encoder of the visual base model unchanged; adopting a sparse regularization constraint to set to zero the values of rank components whose importance weights are lower than a preset threshold; and integrating the effective rank components whose values are not set to zero into the pre-trained base weights of the encoder of the visual base model to obtain an encoder adapted to the target scene.
[0008] Optionally, in the process of decoding the aggregated features to generate the monocular depth reference information, the decoding process and the adaptation parameters of the visual base model are jointly optimized using a self-supervised loss, including: reconstructing the left view based on the monocular depth reference information to obtain a reconstructed left view; calculating a photometric reconstruction loss, which is the photometric difference between the reconstructed left view and the original left view; calculating an edge-aware smoothness loss, which is the edge continuity constraint value of the monocular depth reference information; constructing a self-supervised loss function based on the photometric reconstruction loss and the edge-aware smoothness loss; and backpropagating the loss value of the self-supervised loss function to optimize the adapted encoder parameters of the visual base model and the decoding parameters of the decoding process.
[0009] Optionally, the step of using the geometric constraints of the stereo image pair to perform scale calibration on the monocular depth reference information to obtain the initial disparity includes: extracting feature points of the left and right views in the stereo image pair, and obtaining sparse correspondences based on feature point matching and bidirectional consistency checks, wherein the sparse correspondences include the set of matching points and the corresponding depth values; calculating the scale factor between the monocular depth reference information and the depth values of the sparse correspondences; determining whether the scale factor is within a preset threshold range; if it is within the preset threshold range, directly converting the monocular depth reference information into the initial disparity based on the camera intrinsic parameters; if it is not within the preset threshold range, correcting the monocular depth reference information using the scale factor and then converting it into the initial disparity based on the camera intrinsic parameters; wherein the camera intrinsic parameters include the camera focal length and the binocular baseline distance, and the binocular baseline distance is the distance between the camera positions from which the left and right views of the stereo image pair are captured.
[0010] Optionally, iteratively optimizing the initial disparity using the stereo geometric features extracted from the stereo image pair to obtain the final disparity includes: extracting the combined context features of the stereo image pair, which include global features output by the visual base model adapted with low-rank adaptation technology and local detail features extracted by a lightweight convolutional neural network; inputting the combined context features into the loop update module to initialize the hidden state of the loop update module; assigning the initial disparity as the current disparity at the beginning of the iteration; in each iteration, extracting resolution-level features matching the current iteration from the stereo geometric features, fusing the resolution-level features, the features corresponding to the current disparity, and the combined context features to obtain fused features, inputting the fused features into the initialized loop update module, which outputs the disparity update amount, and adjusting the current disparity using the disparity update amount to obtain the updated current disparity; and repeating the steps of each iteration until the number of iterations reaches the preset termination condition, and taking the current disparity at that point as the final disparity.
[0011] Optionally, each iteration further includes: processing the right view of the stereo image pair based on the current disparity of this iteration to obtain a reconstructed left view; calculating the stereo reconstruction loss, which is the photometric difference between the reconstructed left view and the left view of the stereo image pair; using the monocular depth reference information to distinguish the occlusion regions in the stereo image pair, and calculating the occlusion perception loss; calculating the disparity guidance loss, which is the difference between the current disparity in the current iteration and the initial disparity; constructing a self-supervised joint loss function by combining the stereo reconstruction loss, the occlusion perception loss, and the disparity guidance loss; and optimizing the parameters of the loop update module by minimizing the self-supervised joint loss function through backpropagation.
[0012] Optionally, the stereo geometric features are extracted according to the following steps: extracting feature maps of the left and right views of the stereo image pair with a feature extraction encoder; calculating the similarity correlation between the left view feature map and the right view feature map under different disparities, and constructing the correlation volume; performing multi-scale downsampling on the correlation volume to construct a multi-scale correlation feature pyramid; and using the multi-scale correlation feature pyramid as the stereo geometric features extracted from the stereo image pair.
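The correlation volume and its multi-scale pyramid described above can be sketched as follows. This is a minimal illustration only, assuming rectified views, a dot-product similarity normalised by the channel count, and average pooling along the disparity axis; the function names `correlation_volume` and `correlation_pyramid`, the `max_disp` parameter, and the pooling choice are hypothetical and are not specified in this application.

```python
import numpy as np

def correlation_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) volume from (C, H, W) feature maps.

    Entry [d, y, x] is the channel-averaged dot product between the left
    feature at column x and the right feature at column x - d.
    """
    c, h, w = feat_left.shape
    volume = np.zeros((max_disp, h, w))
    for d in range(max_disp):
        # Columns x < d have no counterpart in the right view and stay zero.
        volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :w - d]).sum(axis=0) / c
    return volume

def correlation_pyramid(volume, num_levels=2):
    """Downsample the volume by 2 along the disparity axis at each level (assumed scheme)."""
    pyramid = [volume]
    for _ in range(num_levels - 1):
        v = pyramid[-1]
        d_even = v.shape[0] - v.shape[0] % 2      # drop a trailing odd slice
        pyramid.append(0.5 * (v[0:d_even:2] + v[1:d_even:2]))
    return pyramid
```

A learned feature extraction encoder would supply `feat_left` and `feat_right` in practice; any array of matching shape works for experimentation.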
[0013] According to a second aspect of the embodiments of this application, a stereo depth estimation system is provided, the system comprising: a stereo image acquisition module, used to acquire a stereo image pair of an image to be processed, wherein the stereo image pair includes a left view and a right view of the image to be processed; a monocular depth estimation module, used to call the visual base model to perform monocular depth estimation on the left view and obtain monocular depth reference information; a scale calibration module, used to perform scale calibration on the monocular depth reference information using the geometric constraints of the stereo image pair to obtain the initial disparity; an iterative module, used to iteratively optimize the initial disparity using the stereo geometric features extracted from the stereo image pair to obtain the final disparity; and a metric depth map module, used to generate a metric depth map based on the final disparity.
[0014] According to a third aspect of the present application, an electronic device is provided, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method described in the first aspect above.
[0015] According to a fourth aspect of the embodiments of this application, a computer-readable storage medium is provided having computer-readable instructions stored thereon, which can be executed by a processor to implement the method described in the first aspect above.
[0016] In summary, this application provides a stereo depth estimation method, system, device, and medium. It involves acquiring a stereo image pair of an image to be processed, the stereo image pair including a left view and a right view of the image to be processed; using a visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information; using the geometric constraints of the stereo image pair to perform scale calibration on the monocular depth reference information to obtain an initial disparity; using stereo geometric features extracted from the stereo image pair to iteratively optimize the initial disparity to obtain a final disparity; and generating a metric depth map based on the final disparity. This step-by-step processing achieves high-precision stereo depth estimation of the image to be processed, generating a depth map with practical metric significance. First, a stereo image pair is acquired as the basic input, and the visual base model performs monocular depth estimation on the left view to quickly obtain monocular depth reference information that serves as a benchmark for subsequent processing. Second, the geometric constraints of the stereo image pair are used to correct the scale deviation of the monocular depth reference information, transforming the monocular depth without absolute scale into an initial disparity with a calibrated scale. Finally, the stereo geometric features of the stereo image pair are extracted to iteratively optimize the initial disparity, correcting local errors and improving detail accuracy. A usable metric depth map is then generated from the final disparity, addressing both the scale ambiguity of monocular depth estimation and the limited accuracy of the initial disparity, and ensuring the practicality and accuracy of the depth results.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the structures shown in these drawings without creative effort.
[0018] The structures, proportions, sizes, etc. illustrated in this specification are only for the purpose of assisting those skilled in the art in understanding and reading the content disclosed herein, and are not intended to limit the conditions under which the present invention can be implemented. Therefore, they have no substantial technical significance. Any modifications to the structure, changes in the proportions, or adjustments to the size, without affecting the effects and objectives that the present invention can produce, should still fall within the scope of the technical content disclosed in the present invention.
[0019] Figure 1 is a flowchart of a stereo depth estimation method provided in an embodiment of this application; Figure 2 is a visualization of the core architecture and training process of the underwater stereo depth estimation scheme provided in the embodiments of this application; Figure 3 is a schematic diagram comparing the architecture of the underwater stereo depth estimation method provided in the embodiments of this application with existing benchmark methods; Figure 4 is a schematic diagram of a stereo depth estimation system provided in an embodiment of this application; Figure 5 is a structural diagram of an electronic device provided in an embodiment of this application; Figure 6 is a diagram of a computer-readable storage medium provided in an embodiment of this application.
[0020] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.
[0022] It should be noted that all directional indications (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative positional relationship and movement of each component in a certain specific posture (as shown in the figure). If the specific posture changes, the directional indication will also change accordingly.
[0023] Furthermore, in this invention, descriptions involving "first," "second," etc., are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0024] In this invention, unless otherwise explicitly specified and limited, the terms "connection," "fixed," etc., should be interpreted broadly. For example, "fixed" can mean a fixed connection, a detachable connection, or an integral part; it can mean a mechanical connection or an electrical connection; it can mean a direct connection or an indirect connection through an intermediate medium; it can mean the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0025] Furthermore, the technical solutions of the various embodiments of the present invention can be combined with each other, but only if they are based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.
[0026] Figure 1 illustrates a stereo depth estimation method provided in an embodiment of this application, the method comprising: Step 101: Obtain a stereo image pair of the image to be processed, wherein the stereo image pair includes a left view and a right view of the image to be processed; Step 102: Call the visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information; Step 103: Using the geometric constraints of the stereo image pair, perform scale calibration on the monocular depth reference information to obtain the initial disparity; Step 104: Iteratively optimize the initial disparity using the stereo geometric features extracted from the stereo image pair to obtain the final disparity; Step 105: Generate a metric depth map based on the final disparity.
[0027] This application provides a stereo depth estimation method that addresses problems of existing stereo depth estimation methods, such as the difficulty of balancing global consistency with local measurement accuracy and the scale ambiguity of monocular depth estimation. It achieves accurate and usable stereo depth estimation that can be applied in special scenarios such as underwater environments. Step 101 obtains the stereo image pair of the image to be processed, providing the basic input data and a reliable image source for all subsequent depth processing steps. Step 102 calls the visual base model to perform monocular depth estimation on the left view, quickly obtaining globally consistent monocular depth reference information, which provides a benchmark for subsequent disparity calculation and simplifies the initial depth calculation. Step 103 uses the inherent geometric constraints of the stereo image pair to perform scale calibration on the monocular depth reference information, correcting the scale ambiguity of monocular depth estimation and obtaining an initial disparity with acceptable accuracy. Step 104 extracts stereo geometric features from the stereo image pair and iteratively optimizes the initial disparity based on these features, compensating for local detail errors in the initial disparity and further improving its accuracy. Step 105 generates a metric depth map from the optimized final disparity, transforming the disparity result into depth information with physical measurement significance. The method can adapt to special imaging scenarios such as underwater environments and provide reliable three-dimensional geometric information for autonomous underwater vehicle operations, underwater surveying and mapping, and related tasks.
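As an illustration of Step 105, a disparity map can be converted to metric depth with the standard rectified-stereo relation Z = f · B / d. The following is a minimal sketch, assuming rectified pinhole cameras, a focal length expressed in pixels, and a baseline in meters; the function name `disparity_to_depth` and the `min_disp` guard are hypothetical, and no underwater refraction correction is modeled.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert disparity (pixels) to depth (meters) using Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)   # zero disparity maps to infinite depth
    valid = disparity > min_disp               # guard against division by zero
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

For example, with a focal length of 1000 pixels and a baseline of 0.1 m, a disparity of 10 pixels corresponds to a depth of 10 m.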
[0028] In one possible implementation, in step 102, the step of calling the visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information includes: adjusting the encoder of the visual base model using low-rank adaptation technology to obtain an encoder adapted to the target scene, and integrating the adapted encoder into the visual base model; inputting the left view into the visual base model integrated with the adapted encoder, extracting a multi-scale feature pyramid, the multi-scale feature pyramid including image features at several resolution levels; aggregating adjacent scale features on the multi-scale feature pyramid to obtain aggregated features; and decoding the aggregated features to generate the monocular depth reference information.
[0029] In one possible implementation, step 102 involves calling the visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information. This step focuses on adapting the visual base model to the target scene and improving the accuracy of monocular depth estimation. Specifically, the process is as follows: First, the encoder of the visual base model is adjusted using low-rank adaptation technology to obtain an encoder adapted to the target scene. This adapted encoder is then integrated into the visual base model to ensure the model can adapt to the imaging characteristics of underwater and other target scenes. Next, the left view is input into the visual base model integrated with the adapted encoder, and a multi-scale feature pyramid containing image features at several resolution levels is extracted to provide comprehensive feature support for subsequent depth information extraction. Then, adjacent-scale features are aggregated from the multi-scale feature pyramid to integrate feature information at different resolution levels, improving the completeness and correlation of the features. Finally, the aggregated features are decoded to generate monocular depth reference information, providing a globally consistent foundational depth basis for scale calibration in step 103.
[0030] In one possible implementation, in step 102, adjusting the encoder of the visual base model using low-rank adaptation technology to obtain an encoder adapted to the target scene includes: introducing a low-rank adaptation module into the encoder of the visual base model, the low-rank adaptation module including a low-rank matrix, the low-rank adaptation module being used to incrementally update the base weights of the model encoder; configuring learnable importance weights for each rank component in the low-rank matrix; optimizing the training of the low-rank matrix parameters and importance weights in the low-rank adaptation module while keeping the pre-trained base weights of the visual base model encoder unchanged; using a sparse regularization constraint to set the values of rank components with importance weights below a preset threshold to zero; and integrating the valid rank components whose values are not set to zero into the pre-trained base weights of the visual base model encoder to obtain an encoder adapted to the target scene.
[0031] The specific process of "adjusting the visual base model encoder using low-rank adaptation technology" in the above implementation can be further refined in one possible implementation. Its core objective is to adapt the model efficiently to the target scene without changing the pre-trained weights of the visual base model, while reducing training costs. Specifically, this includes: introducing a low-rank adaptation module containing a low-rank matrix into the encoder of the visual base model; incrementally updating the base weights of the encoder through this low-rank adaptation module to avoid retraining the entire set of encoder weights; configuring a learnable importance weight for each rank component in the low-rank matrix to distinguish the role of each rank component in target scene adaptation; optimizing and training only the low-rank matrix parameters and importance weights of the low-rank adaptation module, keeping the pre-trained base weights of the visual base model encoder unchanged, which balances adaptation effect and training efficiency; adopting a sparse regularization constraint to set to zero the rank components whose importance weights are below a preset threshold, which automatically prunes redundant rank components and simplifies the model structure; and integrating the effective rank components whose values are not set to zero into the pre-trained base weights of the visual base model encoder, finally obtaining an encoder adapted to the imaging characteristics of the target scene.
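The pruning and merging steps above can be illustrated with a small numerical sketch. It assumes the incremental update takes the form W = W0 + B · diag(s) · A, where s holds one importance weight per rank component, that "below the threshold" is judged on the magnitude of s, and that the sparse regularization is an L1 penalty on s. These forms, and the names `merge_pruned_lora` and `sparsity_penalty`, are assumptions for illustration and are not stated in this application.

```python
import numpy as np

def sparsity_penalty(importance, lam):
    """L1 penalty on the importance weights (assumed form of the sparse regularization)."""
    return lam * float(np.sum(np.abs(importance)))

def merge_pruned_lora(w_base, lora_b, lora_a, importance, threshold):
    """Merge the effective rank components into the frozen base weights.

    w_base: (out, in) pre-trained weights, left unchanged during training.
    lora_b: (out, r) and lora_a: (r, in) low-rank factors.
    importance: (r,) learnable weight per rank component.
    """
    importance = np.asarray(importance, dtype=np.float64)
    keep = np.abs(importance) >= threshold       # effective rank components
    gated = np.where(keep, importance, 0.0)      # pruned components are set to zero
    delta = lora_b @ np.diag(gated) @ lora_a     # low-rank incremental update
    return w_base + delta, keep
```

Because the update is merged into the base weights, the adapted encoder has the same shape and inference cost as the original encoder.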
[0032] In one possible implementation, in step 102, during the process of decoding the aggregated features to generate the monocular depth reference information, the decoding process and the adaptation parameters of the visual base model are jointly optimized using a self-supervised loss, including: reconstructing the left view based on the monocular depth reference information to obtain a reconstructed left view; calculating a photometric reconstruction loss, where the photometric reconstruction loss is the photometric difference between the reconstructed left view and the original left view; calculating an edge-aware smoothness loss, where the edge-aware smoothness loss is the edge continuity constraint value of the monocular depth reference information; constructing a self-supervised loss function based on the photometric reconstruction loss and the edge-aware smoothness loss; and backpropagating the loss value of the self-supervised loss function to optimize the adapted encoder parameters of the visual base model and the decoding parameters of the decoding process.
[0033] In another possible implementation, in step 102, during the decoding of the aggregated features to generate the monocular depth reference information, a self-supervised loss is used to jointly optimize the decoding process and the adaptation parameters of the visual base model. The aim is to further improve the accuracy and edge coherence of the monocular depth reference information. Specifically, the following steps are taken: the left view is reconstructed based on the generated monocular depth reference information to obtain a reconstructed left view; the photometric difference between the reconstructed left view and the original left view is calculated as the photometric reconstruction loss to correct the photometric deviation in depth estimation; the edge continuity constraint value of the monocular depth reference information is calculated as the edge-aware smoothness loss to suppress spurious abrupt depth changes in the monocular depth reference information; a self-supervised loss function is constructed from the photometric reconstruction loss and the edge-aware smoothness loss to combine the constraints of the two losses; and the loss value of the self-supervised loss function is backpropagated to simultaneously optimize the adapted encoder parameters and the decoding parameters in the visual base model, making the generated monocular depth reference information more consistent with the actual scene and providing more accurate data for subsequent scale calibration and iterative optimization.
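The two loss terms described above can be sketched numerically as follows. This is a minimal illustration assuming a mean absolute (L1) photometric difference, an exponential image-gradient weighting for the edge-aware smoothness term, grayscale images of shape (H, W), and a weighted sum of the two terms; these specific forms, the function names, and the weight `w_smooth` are assumptions and are not given in this application. The view reconstruction itself is taken as an input.

```python
import numpy as np

def photometric_loss(reconstructed, original):
    """Mean absolute photometric difference between the two views."""
    return float(np.mean(np.abs(reconstructed - original)))

def edge_aware_smoothness(depth, image):
    """Penalise depth gradients, down-weighted where the image itself has strong edges."""
    d_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    d_dy = np.abs(depth[1:, :] - depth[:-1, :])
    i_dx = np.abs(image[:, 1:] - image[:, :-1])
    i_dy = np.abs(image[1:, :] - image[:-1, :])
    return float(np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy)))

def self_supervised_loss(reconstructed, original, depth, w_smooth=0.1):
    """Weighted sum of the photometric and smoothness terms (w_smooth is a hypothetical value)."""
    return photometric_loss(reconstructed, original) + w_smooth * edge_aware_smoothness(depth, original)
```

A depth discontinuity that coincides with an image edge contributes little to the smoothness term, whereas the same discontinuity in a textureless region is penalised in full.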
[0034] In one possible implementation, in step 103, performing scale calibration on the monocular depth reference information using the geometric constraints of the stereo image pair to obtain an initial disparity includes: extracting feature points from the left and right views of the stereo image pair, and obtaining a sparse correspondence based on feature point matching and bidirectional consistency checks, wherein the sparse correspondence includes a set of matching points and corresponding depth values; calculating a scale factor between the monocular depth reference information and the depth values of the sparse correspondence; determining whether the scale factor is within a preset threshold range; if it is within the preset threshold range, directly converting the monocular depth reference information into an initial disparity based on camera intrinsic parameters; if it is not within the preset threshold range, correcting the monocular depth reference information using the scale factor, and then converting it into an initial disparity based on camera intrinsic parameters; wherein the camera intrinsic parameters include camera focal length and binocular baseline distance, and the binocular baseline distance is the distance between the camera positions from which the left and right views of the stereo image pair are captured.
[0035] In one possible implementation, step 103 utilizes the geometric constraints of the stereo image pair to perform scale calibration on the monocular depth reference information obtained in step 102 to obtain an initial disparity, thereby solving the scale ambiguity problem in the monocular depth reference information and transforming the monocular depth without absolute scale into an initial disparity with precise scale, providing a reliable foundation for subsequent iterative optimization. The specific implementation process is as follows: First, feature points of the left and right views in the stereo image pair are extracted. Through feature point matching and bidirectional consistency checks, reliable matching results are selected to obtain a sparse correspondence containing the matching point set and the corresponding depth value. This sparse correspondence can provide a depth reference with accurate scale based on the geometric association of the stereo image pair.
[0036] Next, the scaling factor between the monocular depth reference information and the depth values in the sparse correspondence is calculated. This scaling factor quantifies the deviation between the monocular depth reference information and the actual scale. Then, it is determined whether the scaling factor is within a preset threshold range, that is, whether the scale deviation of the monocular depth reference information is acceptable. If the scaling factor is within the preset threshold range, the scale deviation of the monocular depth reference information is small, and the monocular depth reference information can be directly converted into the initial disparity based on the camera intrinsic parameters. If the scaling factor is not within the preset threshold range, the monocular depth reference information is first corrected using the scaling factor to eliminate the scale deviation, and then converted into the initial disparity based on the camera intrinsic parameters. The camera intrinsic parameters include the camera focal length and the binocular baseline distance. The binocular baseline distance is the distance between the viewpoints from which the left and right views of the stereo image pair are captured. These camera intrinsic parameters provide the parameters needed for the conversion between disparity and depth values, ensuring that the converted initial disparity has practical metric significance and further guaranteeing the accuracy of the entire stereo depth estimation process.
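The threshold decision and the depth-to-disparity conversion described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name, the use of a median ratio as the scaling factor, and the example numbers are assumptions, and the default threshold of 0.1 is taken from the value given later in this description.

```python
from statistics import median

def calibrate_and_convert(mono_depth, sparse_depth, focal_px, baseline_m, tau=0.1):
    """mono_depth and sparse_depth hold depths at the same matched points.

    Returns (alpha, initial disparities in pixels).
    """
    # Assumption: the scaling factor is the median ratio of stereo depth to monocular depth.
    alpha = median(s / m for s, m in zip(sparse_depth, mono_depth))
    # Inside the threshold range the monocular depth is used directly;
    # outside it, the depth is first corrected with the scaling factor.
    scale = 1.0 if abs(alpha - 1.0) < tau else alpha
    disparity = [focal_px * baseline_m / (scale * m) for m in mono_depth]
    return alpha, disparity

# Hypothetical example: the monocular depth is 1.5x too small.
alpha, disp = calibrate_and_convert(
    [1.0, 2.0, 4.0], [1.5, 3.0, 6.0], focal_px=800.0, baseline_m=0.1
)
print(alpha, disp)
```

With a focal length in pixels and a baseline in metres, the resulting disparity is in pixels, which is what gives the initial disparity its metric meaning.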
[0037] In one possible implementation, in step 104, the initial disparity is iteratively optimized using the stereo geometric features extracted from the stereo image pair to obtain the final disparity. This includes: extracting combined context features from the stereo image pair, the combined context features including global features output by a visual base model adapted with low-rank adaptation technology and local detail features extracted by a lightweight convolutional neural network; inputting the combined context features into a loop update module to initialize the hidden state of the loop update module; assigning the initial disparity as the current disparity at the beginning of the iteration; in each iteration, extracting resolution-level features matching the current iteration from the stereo geometric features; fusing the resolution-level features, the features corresponding to the current disparity, and the combined context features to obtain fused features; inputting the fused features into the initialized loop update module, which outputs a disparity update amount; adjusting the current disparity using the disparity update amount to obtain the updated current disparity; repeating the steps of each iteration until the number of iterations reaches a preset termination condition, and using the current disparity at this point as the final disparity.
[0038] In one possible implementation, step 104 uses the stereo geometric features extracted from the stereo image pair to iteratively optimize the initial disparity obtained in step 103 to obtain the final disparity. This compensates for the initial disparity's lack of local detail accuracy, improves the accuracy and fit of the disparity, and makes it more consistent with the three-dimensional geometric relationship of the actual scene, providing support for the subsequent generation of accurate depth measurement maps. The specific implementation process is as follows: First, the combined context features of the stereo image pair are extracted. These combined context features include global features output by the visual base model adapted by low-rank adaptation technology and local detail features extracted by a lightweight convolutional neural network, taking into account both global scene information and local detail information, providing comprehensive feature support for iterative optimization. Then, the combined context features are input into the loop update module to initialize the hidden state of the loop update module, providing a stable starting state for disparity updates. At the same time, the initial disparity obtained in step 103 is assigned as the current disparity at the initial stage of iteration, fixing the starting point for iterative optimization.
[0039] In each iteration, the resolution-level features matching the current iteration are first extracted from the stereo geometric features to ensure precise correspondence between feature extraction and iteration progress. Then, these resolution-level features, the features corresponding to the current disparity, and the combined context features are fused to integrate multi-dimensional feature information and improve feature relevance and effectiveness. Next, the fused features are input into the initialized loop update module, which outputs the disparity update amount, providing a basis for adjusting the current disparity. Finally, the current disparity is adjusted using this disparity update amount to obtain the updated current disparity. The above steps for each iteration are repeated until the number of iterations reaches a preset termination condition. The current disparity at this point is taken as the final disparity, completing the precise optimization of the initial disparity.
[0040] In one possible implementation, step 104 further includes, in each iteration: processing the right view of the stereo image pair based on the current disparity of the current iteration to obtain a reconstructed left view; calculating the stereo reconstruction loss, which is the photometric difference between the reconstructed left view and the left view of the stereo image pair; using the monocular depth reference information to distinguish occlusion regions in the stereo image pair and calculating the occlusion perception loss; calculating the disparity guidance loss, which is the difference between the current disparity of the current iteration and the initial disparity; constructing a self-supervised joint loss function by combining the stereo reconstruction loss, the occlusion perception loss, and the disparity guidance loss; and optimizing the parameters of the loop update module by minimizing the self-supervised joint loss function through backpropagation.
[0041] In one possible implementation, a self-supervised joint loss optimization step is added to each iteration of step 104 to further improve the accuracy of iterative optimization, correct deviations in the disparity update process, and ensure the accuracy of the final disparity. Specifically, the implementation is as follows: the right view of the stereo image pair is processed based on the current disparity of the current iteration to obtain the reconstructed left view; the photometric difference between the reconstructed left view and the left view of the stereo image pair is calculated as the stereo reconstruction loss to constrain the photometric consistency of the disparity; the monocular depth reference information obtained in step 102 is used to distinguish the occlusion regions in the stereo image pair, and the occlusion perception loss is calculated to avoid interference from the occlusion regions on the disparity optimization; the difference between the current disparity of the current iteration and the initial disparity is calculated as the disparity guidance loss to constrain the global consistency between the current disparity and the initial disparity; a self-supervised joint loss function is constructed by combining the stereo reconstruction loss, the occlusion perception loss, and the disparity guidance loss to integrate the constraint effects of the three types of losses; the parameters of the loop update module are optimized by minimizing the self-supervised joint loss function through backpropagation, so that the disparity update amount output by the loop update module is more accurate, and the iterative optimization effect is further improved.
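The three loss terms can be illustrated on a single image scanline. The sketch below is illustrative only: the function names, the weights `w_occ` and `w_guide`, the nearest-pixel warping, and in particular the form of the occlusion term (occluded pixels are pulled toward the initial disparity) are assumptions, since the exact loss formulas are not reproduced in this text.

```python
def warp_right_to_left(right_row, disparity_row):
    """Reconstruct a left scanline: left[x] is sampled from right[x - d] (nearest pixel)."""
    width = len(right_row)
    out = []
    for x, d in enumerate(disparity_row):
        src = min(max(int(round(x - d)), 0), width - 1)  # clamp to the image border
        out.append(right_row[src])
    return out

def joint_loss(left_row, right_row, disp, init_disp, occluded, w_occ=1.0, w_guide=0.1):
    recon = warp_right_to_left(right_row, disp)
    visible = [i for i, o in enumerate(occluded) if not o]
    hidden = [i for i, o in enumerate(occluded) if o]
    # Stereo reconstruction loss: mean absolute photometric error on visible pixels.
    l_rec = sum(abs(recon[i] - left_row[i]) for i in visible) / max(len(visible), 1)
    # Occlusion term (assumption): occluded pixels follow the initial disparity instead.
    l_occ = sum(abs(disp[i] - init_disp[i]) for i in hidden) / max(len(hidden), 1)
    # Disparity guidance loss: mean deviation from the initial disparity.
    l_guide = sum(abs(a - b) for a, b in zip(disp, init_disp)) / len(disp)
    return l_rec + w_occ * l_occ + w_guide * l_guide

# Hypothetical scanline where the true disparity is 1 pixel everywhere.
right = [10, 20, 30, 40, 50]
left = [10, 10, 20, 30, 40]
no_occlusion = [False] * 5
loss_good = joint_loss(left, right, [1] * 5, [1] * 5, no_occlusion)
loss_bad = joint_loss(left, right, [0] * 5, [1] * 5, no_occlusion)
print(loss_good, loss_bad)
```

The correct disparity gives a lower joint loss than the wrong one, which is the signal that backpropagation uses to update the loop update module.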
[0042] In one possible implementation, in step 104, the stereo geometric features are extracted according to the following steps: feature maps of the left and right views of the stereo image pair are extracted by a feature extraction encoder; the similarity correlation between the feature map of the left view and the feature map of the right view under different disparities is calculated to construct a correlation volume; the correlation volume is subjected to multi-scale downsampling processing to construct a multi-scale correlation feature pyramid; and the multi-scale correlation feature pyramid is used as the stereo geometric features extracted from the stereo image pair.
[0043] In one possible implementation, the extraction process of the required stereo geometric features in step 104 is further refined to obtain features that accurately reflect the spatial geometric relationship between the left and right views of the stereo image pair, providing reliable feature support for disparity iterative optimization. The specific extraction steps are as follows: feature maps of the left and right views of the stereo image pair are extracted by a feature extraction encoder to capture the image features of each view; the similarity correlation between the feature maps of the left and right views under different disparities is calculated to construct a correlation volume, which can characterize the degree of correlation between the features of the two views under different viewpoint offsets; multi-scale downsampling processing is performed on the correlation volume to construct a multi-scale correlation feature pyramid, integrating the correlation feature information of different scales; the multi-scale correlation feature pyramid is used as the stereo geometric features extracted from the stereo image pair to provide accurate feature input for feature extraction and disparity optimization in each subsequent iteration.
[0044] In one possible implementation, step 105, generating a depth map based on the final disparity, specifically includes: The final disparity is upsampled at full resolution to restore it to the same resolution as the stereo image pair; based on the camera's intrinsic parameters, the final disparity at full resolution is converted into a depth value with practical metric significance; the depth value is then smoothed at the edges to generate a visually coherent and metrically accurate depth map.
[0045] In one possible implementation, step 105 generates a metric depth map based on the final disparity obtained in step 104. This transforms the optimized disparity result into depth information with practical physical metric significance and visual coherence, realizing the practical application value of stereo depth estimation and providing reliable 3D geometric data support for tasks such as underwater robot operations and scene mapping. The specific implementation process is as follows: First, the final disparity is upsampled at full resolution to restore the disparity resolution to the same level as the stereo image pair, ensuring that the depth map has the same resolution as the original image, guaranteeing the spatial correspondence of depth information, and avoiding depth positioning deviations caused by resolution differences. Then, based on the camera's intrinsic parameters, the final disparity at full resolution is converted into a depth value with practical metric significance. The camera intrinsic parameters can use parameters such as camera focal length and binocular baseline distance mentioned in step 103 to ensure parameter uniformity and achieve accurate conversion from disparity to depth values. This allows the depth values to accurately reflect the actual distances of various targets in the scene, making them practically applicable. Finally, the transformed depth values are smoothed at the edges to eliminate abrupt changes and noise interference at the edges of the depth values, avoid visual discontinuities, and ultimately generate a visually coherent and accurately measured depth map, completing the entire stereo depth estimation process. This not only ensures the measurement accuracy of the depth map but also improves its visual readability, meeting the needs of practical applications for the use of three-dimensional geometric information.
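A minimal sketch of the upsampling and conversion steps follows. It makes several assumptions: nearest-neighbour upsampling stands in for the learned upsampling module, edge smoothing is omitted, and the function names and numbers are hypothetical. One detail worth noting is that disparity is measured in pixels, so its values are multiplied by the upsampling factor when the resolution increases.

```python
def upsample_disparity_row(disp_row, factor):
    """Nearest-neighbour upsampling of one row; pixel disparities scale with the width."""
    return [d * factor for d in disp_row for _ in range(factor)]

def disparity_to_depth(disp, focal_px, baseline_m, min_disp=1e-6):
    """depth = f * b / d, with a floor on d to avoid division by zero."""
    return [focal_px * baseline_m / max(d, min_disp) for d in disp]

# Hypothetical quarter-resolution disparities, focal length 800 px, baseline 0.1 m.
full_res = upsample_disparity_row([2.0, 4.0], factor=4)
depth = disparity_to_depth(full_res, focal_px=800.0, baseline_m=0.1)
print(full_res)
print(depth)
```

Using the same focal length and baseline as in step 103 keeps the initial disparity and the final depth in the same metric frame.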
[0046] In one possible implementation, the method further includes a model pre-training step, specifically: constructing a synthetic stereo dataset for the target domain, the synthetic stereo dataset containing stereo image pairs of multiple scenes, and each stereo image pair being accompanied by dense ground truth depth values and semantic masks; using the synthetic stereo dataset, jointly pre-training the parameter-efficient adaptation of the visual base model and the iterative optimization module to improve the model's adaptability and depth estimation accuracy in the target domain.
[0047] In one possible implementation, the stereo depth estimation method of this application further includes a model pre-training step, which improves the adaptability and depth estimation accuracy of the visual base model and iterative optimization module in the target domain (such as underwater scenes) in advance, laying the model foundation for the efficient and accurate execution of subsequent steps 102-105, and solving the problem of insufficient adaptability and low estimation accuracy of the visual base model due to the difference between the training scene and the target scene. The specific implementation process is as follows: First, a synthetic stereo dataset of the target domain is constructed. This synthetic stereo dataset contains stereo image pairs of multiple scenes, which can comprehensively cover various imaging scenes in the target domain. At the same time, each stereo image pair is accompanied by dense depth ground truth and semantic mask. The dense depth ground truth provides an accurate reference benchmark for model training, and the semantic mask can help the model distinguish different targets in the scene and improve the model's understanding of the target domain scene.
[0048] Subsequently, using the constructed synthetic stereo dataset, the parameter-efficient adaptation of the visual base model and the iterative optimization module are jointly pre-trained. The parameter-efficient adaptation of the visual base model can use the low-rank adaptation technique mentioned in step 102, so that the adaptation logic is the same during pre-training and subsequent actual processing. During joint pre-training, the adaptation parameters of the visual base model and the parameters of the iterative optimization module are optimized simultaneously, so that the two modules adapt to each other, which pre-training either module alone would not achieve. This joint pre-training improves the model's adaptability to the imaging characteristics of the target domain and its depth estimation accuracy, providing more reliable model support for monocular depth estimation in step 102 and disparity iterative optimization in step 104, further ensuring the stability and accuracy of the entire stereo depth estimation process.
[0049] To make the technical solutions, implementation processes, and beneficial effects of the embodiments of this application clearer, the stereo depth estimation method provided by the embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0050] Figure 2 is a visualization of the core architecture and training process of the Stereo Adapter underwater stereo depth estimation scheme provided in this application embodiment. It is divided into two sub-figures, (a) and (b), which correspond to the overall two-stage self-supervised training framework and the continuous learning mechanism of dynamic LoRA adaptation, respectively. These sub-figures are the core visual representation of the entire technical solution. The complete technical process of the Stereo Adapter is then broken down into five core stages: data preprocessing, monocular depth estimation, stereo depth estimation, dynamic LoRA adaptation, and model optimization and inference, which are described in detail below.
[0051] (I) Appendix Figure 2(a): Stereo Adapter two-stage self-supervised training framework (core execution process) Figure 2 (a) demonstrates the end-to-end process from underwater binocular image input to high-precision depth map output. The core is the connection and interaction between the mono stage and the stereo stage, achieving the goal of "coarse prior estimation → fine depth refinement". At the same time, the entire process adopts self-supervised training, without the need for manual depth annotation.
[0052] 1. Input Layer: The input is a pair of underwater stereo images (left image $I_L$, right image $I_R$) that has been visually enhanced by Mobile IE. Distortions such as color shift, scattering, and blurring in underwater imaging are eliminated first to provide high-quality input for subsequent feature extraction. This is a necessary prerequisite for underwater vision tasks.
[0053] 2. Mono Stage: This stage adapts the visual base model encoder using low-rank adaptation techniques, extracts multi-scale features, and performs self-supervised optimization to generate globally consistent monocular depth reference information, providing a benchmark for subsequent stereo depth estimation. Specifically, it includes: Step a: Feature extraction: Using the pre-trained DepthAnythingV2 as the base encoder, extract multi-scale feature pyramids with resolutions of {H / 4, H / 8, H / 16, H / 32}. A LoRA low-rank adapter module is embedded in the Transformer layers of the encoder to adjust the base weights $W_0 \in \mathbb{R}^{d \times k}$ through a low-rank decomposition: $W = W_0 + \Delta W = W_0 + BA$
[0054] where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are low-rank matrices and $r \ll \min(d, k)$ is the adaptation rank. Only A, B, and the decoder are optimized while $W_0$ is frozen, achieving parameter-efficient underwater domain adaptation.
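The low-rank update can be illustrated with a small forward pass, h = W0 x + B A x, together with the fraction of parameters that is trained. This is a sketch under assumptions: the helper names are hypothetical, the matrices are toy values, and the layer size 768 x 768 is only an illustrative dimension; the rank r = 16 is the value stated later in this description.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w0, a, b, x):
    """h = W0 x + B (A x); W0 is d x k (frozen), A is r x k, B is d x r."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, update)]

def trainable_fraction(d, k, r):
    """Parameters in A and B relative to the frozen d x k weight."""
    return r * (d + k) / (d * k)

w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy values)
a = [[1.0, 1.0]]                # r = 1, k = 2
b = [[0.5], [-0.5]]             # d = 2, r = 1
h = lora_forward(w0, a, b, [2.0, 3.0])
print(h)
print(trainable_fraction(768, 768, 16))
```

Computing B(Ax) instead of forming BA first keeps the extra cost proportional to r, which is why a small rank adds little overhead during training.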
[0055] Step b: Feature aggregation and decoding: After the multi-scale features are aggregated through SDFA blocks (fusing adjacent-scale contexts while maintaining spatial consistency), the decoder generates a discrete disparity volume $V \in \mathbb{R}^{N \times H \times W}$ (N is the predefined number of disparity levels), which is converted into the monocular depth prior $M_{mono}$.
[0056] Step c: Self-supervised loss constraint: The monocular training objective is composed of the photometric reconstruction loss $L_{mono}^{rec}$ and the edge-aware smoothness loss $L_{mono}^{smooth}$: $L_{mono} = L_{mono}^{rec} + \lambda_1 L_{mono}^{smooth}$
[0057] where $\lambda_1$ is a predefined weight parameter; $L_{mono}^{rec}$ reconstructs the left view from the predicted disparity and the right view $I_R$, and measures the photometric difference between the reconstructed view and the original $I_L$, enabling self-supervised optimization.
[0058] 3. Stereo Stage: First, scale calibration is performed on the monocular depth reference information using the geometric constraints of the stereo image pair to obtain the initial disparity. Then, stereo geometric features are extracted, and the initial disparity is iteratively optimized through a loop update module. Finally, a self-supervised joint loss constrains the optimization process to obtain the accurate final disparity. Based on the monocular depth prior $M_{mono}$, this stage achieves iterative optimization of disparity through stereo correlation pyramid construction, hybrid scale alignment, and ConvGRU iterative refinement, ultimately outputting a full-resolution metric depth. Specifically, this includes: Step a: Disparity initialization: Convert the monocular depth prior $M_{mono}$ into the initial disparity $d^{(1)}$ using the following formula: $d^{(1)} = \frac{f \cdot b}{M_{mono}}$
[0059] where $d^{(1)}$ is the initial disparity, f is the camera focal length, b is the binocular baseline distance, and $M_{mono}$ is the monocular depth prior (the depth map obtained from monocular depth estimation).
[0060] Using the geometric constraints of the stereo image pair, the monocular depth reference information is scaled to obtain an initial disparity. Through sparse depth matching and scale coefficient correction, the monocular depth without absolute scale is transformed into an initial disparity with accurate scale.
[0061] Step b: Construction of stereo correlation pyramid: Extract stereo geometric features from the stereo image pairs, and construct a multi-scale correlation feature pyramid by calculating the correlation between the features of the left and right views under different disparities, which will serve as the input for stereo geometric features in subsequent iterative optimization.
[0062] Given the binocular feature maps $f_L, f_R \in \mathbb{R}^{C \times H/4 \times W/4}$, the 4D correlation volume is defined as follows: $C(i, j, d) = \langle f_L(i, j), f_R(i, j - d) \rangle, \quad d \in [0, D_{max}]$
[0063] where $\langle \cdot, \cdot \rangle$ represents the inner product operation and $D_{max}$ is the maximum disparity value. A multi-scale correlation pyramid is then constructed using average pooling, providing stereo matching cues of different granularities for coarse-to-fine disparity refinement.
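The correlation volume and its pyramid can be sketched on one scanline. The code below is an illustration under assumptions: the function names are hypothetical, features are toy one-hot vectors, out-of-range columns are given a correlation of 0, and pooling is applied along the disparity axis (the description only states that average pooling is used).

```python
def correlation_row(feat_left, feat_right, max_disp):
    """corr[x][d] = <f_L(x), f_R(x - d)> for one scanline; out-of-range columns give 0."""
    corr = []
    for x, fl in enumerate(feat_left):
        row = []
        for d in range(max_disp + 1):
            if x - d >= 0:
                row.append(sum(a * b for a, b in zip(fl, feat_right[x - d])))
            else:
                row.append(0.0)
        corr.append(row)
    return corr

def pool_disparity_axis(corr):
    """Average-pool neighbouring pairs of disparity bins (one pyramid level down)."""
    return [[(r[i] + r[i + 1]) / 2 for i in range(0, len(r) - 1, 2)] for r in corr]

def build_pyramid(corr, levels):
    pyramid = [corr]
    for _ in range(levels - 1):
        pyramid.append(pool_disparity_axis(pyramid[-1]))
    return pyramid

# Toy features: the left scanline is the right scanline shifted by one pixel.
feat_right = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
feat_left = [[0, 0, 0, 0]] + feat_right[:3]
corr = correlation_row(feat_left, feat_right, max_disp=3)
pyramid = build_pyramid(corr, levels=2)
print(corr[3])
print(pyramid[1][3])
```

The peak of `corr[3]` sits at d = 1, the true shift, and the pooled level keeps a coarser version of the same cue.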
[0064] Step c: Hybrid Scale Alignment and Refinement: Using sparse stereo correspondences as anchor points, verify and correct the scale of the initial monocular disparity. The process is as follows: Step c-1: Sparse Depth Measurement: Obtain the sparse metric depth $D_{sparse}$ through feature matching with bidirectional consistency checks: $D_{sparse}(p) = \frac{f \cdot b}{d_p}, \quad p \in P_{matched}$
[0065] where $P_{matched}$ is the sparse matching point set and $d_p$ is the disparity value of the matching point p.
[0066] Step c-2: Monocular scale verification: Calculate the scale factor α between monocular depth and sparse depth.
[0067] If $|\alpha - 1| < \tau$ ($\tau = 0.1$, the default threshold), the monocular scale is considered reliable; otherwise, correction coefficients (a scaling term and an offset term) are calculated using least squares:
[0068] Step c-3: Initial alignment depth:
[0069] Step c-4: Optimization of confidence-weighted propagation:
[0070] where the bilateral weight $w_{pq}$ is used to preserve image edges while the correction is propagated:
[0071] $\sigma_d$ and $\sigma_c$ are preset smoothing parameters.
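Steps c-1 to c-3 can be sketched as a sparse depth computation followed by a closed-form least-squares fit. The sketch is illustrative only: the names `s` and `t` for the scaling and offset coefficients, the affine model s * M + t, and the example numbers are assumptions, and the confidence-weighted propagation of step c-4 is not included.

```python
def sparse_depth(disparities, focal_px, baseline_m):
    """Metric depth at matched points: D_sparse(p) = f * b / d_p."""
    return [focal_px * baseline_m / d for d in disparities]

def fit_scale_offset(mono, sparse):
    """Least-squares s, t minimising sum((s * m + t - z)^2) over the matched points."""
    n = len(mono)
    mean_m = sum(mono) / n
    mean_z = sum(sparse) / n
    var_m = sum((m - mean_m) ** 2 for m in mono)
    cov = sum((m - mean_m) * (z - mean_z) for m, z in zip(mono, sparse))
    s = cov / var_m
    t = mean_z - s * mean_m
    return s, t

def align(mono, s, t):
    """Apply the fitted correction to the monocular depth."""
    return [s * m + t for m in mono]

# Hypothetical matched points: stereo depth is 2 * monocular depth + 0.5.
mono = [1.0, 2.0, 3.0]
stereo = sparse_depth([32.0, 160.0 / 9.0, 160.0 / 13.0], focal_px=800.0, baseline_m=0.1)
s, t = fit_scale_offset(mono, stereo)
print(s, t, align(mono, s, t))
```

An offset term is useful when the monocular prediction is biased as well as mis-scaled; a pure ratio cannot absorb that bias.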
[0072] Step d: Combined Context Encoder: Extracts the combined context features of the stereo image pairs, and fuses the global features output by the visual base model adapted by low-rank adaptation technology with the local detail features extracted by the lightweight convolutional neural network to provide comprehensive context information for the loop update module.
[0073] The features of the VFM encoder after LoRA adaptation in the monocular stage are reused and combined with the local features extracted by a lightweight CNN. After channel alignment and merging, the formula is defined as:
[0074]
[0075]
[0076] where $l \in \{4, 8, 16\}$ represents the resolution level; the recombination module integrates the LoRA weights from the monocular stage; and $Conv_{align}$ is a learnable channel-alignment convolution block that fuses features combining global semantic / geometric guidance with local detail accuracy.
[0077] Step e: ConvGRU iterative optimization: The initial disparity is iteratively optimized using stereo geometric features to obtain the final disparity. The disparity is gradually corrected through the loop update module to improve its local detail accuracy.
[0078] The ConvGRU hidden state is initialized with the fused context features $h^{(l)}$. In each iteration, the features $c^{(l)}$ extracted from the correlation pyramid, the features of the current disparity $d^{(l)}$, and the context features $g^{(l)}$ are concatenated and input into the ConvGRU; the formula is:
[0079] where the disparity feature operator represents the operation of extracting features from the current disparity map and $\Delta d^{(l)}$ represents the disparity update amount. The number of iterations is 32 (the optimal value verified by ablation experiments), ultimately yielding the optimized disparity $d^{(L)}$ (L = 32).
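The structure of the iteration can be sketched independently of the network. In the sketch below the update rule d(l+1) = d(l) + Δd(l) is an assumption consistent with the "disparity update amount" described above, and `toy_update` is a hypothetical stand-in for the ConvGRU that simply moves the disparity halfway toward a fixed target.

```python
def refine_disparity(init_disp, update_fn, hidden, num_iters=32):
    """Recurrent refinement: d(l+1) = d(l) + delta_d(l), repeated num_iters times."""
    disp = list(init_disp)
    for _ in range(num_iters):
        hidden, delta = update_fn(hidden, disp)
        disp = [d + dd for d, dd in zip(disp, delta)]
    return disp

def toy_update(hidden, disp, target=(10.0, 20.0)):
    """Stand-in for the ConvGRU: returns an unchanged state and a residual update."""
    delta = [0.5 * (t - d) for t, d in zip(target, disp)]
    return hidden, delta

final_disp = refine_disparity([8.0, 24.0], toy_update, hidden=None)
print(final_disp)
```

Predicting a residual rather than the disparity itself means that a good initial disparity only needs small corrections, which is the role of the calibrated monocular prior.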
[0080] Step f: Full-resolution restoration and depth conversion: Generate a metric depth map based on the final disparity. Through upsampling, conversion with camera intrinsic parameters, and edge smoothing, a visually coherent and metrically accurate depth map is generated. The same upsampling module as RAFT-Stereo is used to upsample $d^{(L)}$ to the full resolution of the input image, which is then converted to the metric depth D (directly converted using camera intrinsics).
[0081] Step g: Stereo self-supervised joint loss constraint: Define the complete stereo training objective, consisting of 4 parts: Step g-1: Stereo reconstruction loss (constraining binocular photometric consistency):
[0082] where the reconstructed left image is obtained by warping $I_R$ with $d^{(L)}$, and α is the weighting parameter.
[0083] Step g-2: Occlusion perception processing (using monocular prediction to process occluded areas):
[0084] where $M_{occ}$ is the occlusion mask calculated from $d^{(1)}$, the reconstructed left image used here is obtained from the monocular depth, and ⊙ indicates element-wise multiplication.
[0085] Step g-3: Disparity guidance loss (using the monocular prior to regularize stereo refinement):
[0086] where $\partial_x$ and $\partial_y$ are the horizontal and vertical gradient operators, and $M_{out}$ is the mask of invalid reprojection pixels.
[0087] Step g-4: Edge-aware smoothness loss (optimizing the coherence of the final depth map):
[0088] λ3 and λ4 are preset weight parameters (the optimal values are determined through ablation experiments).
[0089] 4. Output layer: The final output is a full-resolution, globally consistent, and locally metrically accurate underwater stereo depth map D, which can directly provide three-dimensional geometric information for underwater robot navigation, obstacle avoidance, and mapping.
[0090] (II) Appendix Figure 2 (b): Continuous learning mechanism for dynamic LoRA adaptation (highly efficient parameter core) Figure 2 (b) illustrates the dynamic LoRA architecture of the Stereo Adapter, which addresses the adaptation deficiencies and parameter redundancy issues of traditional fixed-rank LoRA, achieving continuous adaptation from "land-based pre-training → underwater monocular → underwater stereo," specifically including: 1. Basic encoder: Based on the DepthAnythingV2 Transformer encoder, with weights W0 (initial pre-trained weights) from the land pre-training.
[0091] 2. Adaptive Rank Selection: Abandoning fixed-rank decomposition, we introduce learnable importance weights w∈Rr and redefine low-rank update, with the following formula:
[0092] where the importance weight is the learned importance of the i-th rank component in the m-th module at the t-th iteration, and the associated vectors are the low-rank decomposition vectors corresponding to that rank component. The importance weights approximate the diagonal elements of a singular value matrix, enabling the model to automatically focus on the core subspace directions of the underwater task.
[0093] 3. Sparse Regularization Optimization: L1 sparse regularization is added to drive the weights of redundant rank components to zero. The formula is as follows:
[0094] where the learning objective being regularized corresponds to the self-supervised loss in the monocular / stereo stage, and λ is the regularization strength. Since L1 regularization is not differentiable at zero, the proximal gradient method with a soft thresholding operation is used to update the weights, as shown in the formula:
[0095] where the input of the thresholding operation is the importance weights after the gradient update, κ is the threshold parameter (increasing gradually from 0 to $\kappa_{max}$ during training), and 1(·) is an indicator function.
[0096] 4. Two-stage training process: Training is clearly divided into two stages: Dense training phase: accounts for 45%~50% of the total number of iterations; no soft thresholding operation is applied, allowing all rank components to fully capture the underwater task features. Sparse training phase: the soft thresholding operation is activated to prune redundant rank components; starting it only after the dense phase avoids prematurely eliminating important adaptation directions.
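The soft thresholding update and the two-stage schedule can be sketched as follows. The standard soft-thresholding operator is used here; the linear ramp of κ after the dense phase, the function names, and the example values are assumptions, since the description only states that κ increases gradually from 0 to its maximum.

```python
def soft_threshold(w_hat, kappa):
    """Proximal operator of kappa * |w|: shrink each weight toward zero, zeroing small ones."""
    out = []
    for w in w_hat:
        if abs(w) > kappa:
            out.append((abs(w) - kappa) * (1.0 if w > 0 else -1.0))
        else:
            out.append(0.0)
    return out

def kappa_at(step, total_steps, kappa_max, dense_fraction=0.5):
    """Assumed schedule: kappa = 0 in the dense phase, then a linear ramp to kappa_max."""
    dense_steps = int(dense_fraction * total_steps)
    if step < dense_steps:
        return 0.0
    return kappa_max * (step - dense_steps) / max(total_steps - dense_steps, 1)

# Hypothetical importance weights after a gradient step.
pruned = soft_threshold([0.5, -0.2, 0.05], kappa=0.1)
print(pruned)
print(kappa_at(0, 100, 0.2), kappa_at(75, 100, 0.2))
```

With κ = 0 the operator leaves the weights unchanged, so the dense phase is simply the special case in which no pruning pressure is applied.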
[0097] 5. Continuous Weight Integration: After each adaptation stage (monocular, stereo), the rank components of non-zero importance weights are integrated into the base weight W0, using the following formula:
[0098] The integration eliminates the overhead of auxiliary modules during inference while retaining domain-adaptive knowledge, achieving continuous, non-redundant adaptation.
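Weight integration can be sketched as adding the surviving rank components to the base weight. The form used below, W0 plus the sum of importance-weighted outer products of the decomposition vectors, is an assumption derived from the description of the importance weights, and all names and values are hypothetical.

```python
def merge_lora(w0, b_cols, a_rows, importance):
    """Return W0 + sum_i w_i * outer(b_i, a_i), skipping components pruned to zero."""
    merged = [row[:] for row in w0]
    for b_i, a_i, w_i in zip(b_cols, a_rows, importance):
        if w_i == 0.0:
            continue  # pruned rank component contributes nothing
        for r, b_val in enumerate(b_i):
            for c, a_val in enumerate(a_i):
                merged[r][c] += w_i * b_val * a_val
    return merged

w0 = [[1.0, 0.0], [0.0, 1.0]]
b_cols = [[1.0, 0.0], [0.0, 1.0]]   # one column vector per rank component
a_rows = [[0.0, 2.0], [3.0, 0.0]]   # one row vector per rank component
merged = merge_lora(w0, b_cols, a_rows, importance=[0.5, 0.0])
print(merged)
```

After merging, inference uses a single weight matrix of the original shape, so no adapter branch has to be evaluated.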
[0099] The Stereo Adapter technical solution provided in this application consists of five phases: pre-training preparation, monocular depth estimation training, dynamic LoRA adaptation, stereo depth estimation training, and inference deployment. Phase 1: Pre-training preparations (prerequisites), specifically including: 1. Construction of the synthetic dataset: The UW-StereoDepth-40K dataset was generated based on Unreal Engine 5 (UE5). This dataset contains 40,000 pairs of 1280×960 resolution underwater stereo images, covering four types of scenes: coral reefs, industrial buildings, shipwrecks, and natural seabed. The camera baselines were sampled from {4cm, 10cm, 20cm, 40cm} to simulate different ROV platform configurations. Underwater effects such as caustics, floating particles, and depth-related color attenuation were added, along with dense depth ground truth and semantic segmentation masks. Data quality was ensured through automatic filtering (removing low-texture / extreme depth frames) and manual inspection.
[0100] 2. Data preprocessing stage, including: 2-1: Data augmentation: Augment images in all datasets by cropping, scaling, and random flipping to improve generalization ability. 2-2: Visual Enhancement: Process underwater images through Mobile IE to eliminate distortion and restore visual quality. Its lightweight design and fast inference make it compatible with embedded platforms.
[0101] 3. Model initialization phase, including: 3-1: Encoder weights: Load the pre-trained weights W0 from DepthAnythingV2-B and freeze the base weights; 3-2: Parameter initialization: Initialize the LoRA low-rank matrices A and B (rank r=16, the optimal value in the ablation experiment), decoder, ConvGRU, and CNN local encoder parameters; 3-3: Training configuration: The optimizer is AdamW, and the learning rate η = 1 × 10⁻⁶. 4. Batch size BS=8 (optimal batch size), monocular training epoch=20, stereo training epoch=40 (fixed number of rounds in two-stage training).
[0102] Phase Two: Monocular Depth Estimation Training, specifically including: Step 1: Input a preprocessed single underwater image (extracted from a stereo image pair), and extract a multi-scale feature pyramid using the DepthAnythingV2 encoder; Step 2: The LoRA module calculates h=W0x+BAx to adapt to underwater scene features, optimizing only the A, B and decoder parameters; Step 3: SDFA blocks aggregate multi-scale features, and the decoder generates discrete disparity volumes V∈R. N×H×W Converted to monocular depth prior , ; Step 4: Calculate the monocular self-supervised loss, and backpropagate to update the parameters of A, B, and the decoder; Step 5: After training for 20 epochs, the basic weights for underwater monocular adaptation are obtained through sparse regularization and weight integration using dynamic LoRA.
[0103] The third stage: Dynamic LoRA adaptation. Low-rank adaptation techniques are used to adjust the encoder of the visual base model to obtain an encoder adapted to the target scene. Through adaptive rank selection, sparse regularization, and weight integration, efficient adaptation and continuous learning of the model in underwater scenes are achieved. Specifically, this includes: Step 1: Stereo Stage LoRA Initialization: Initialize new LoRA low-rank matrices A′ and B′ based on W^1, which are used for encoder adaptation from monocular settings to stereo settings; Step 2: Two-stage training optimization: First, perform 45%~50% intensive training (exploring rank components), then perform sparse training (soft thresholding to prune redundant components) to optimize the importance weight w′; Step 3: Stereo Stage Weight Integration: After stereo training is completed, the rank components with non-zero weights are integrated into the basic weights of underwater monocular adaptation to obtain the final basic weights of underwater stereo adaptation.
[0104] Phase 4: Stereo depth estimation training. This specifically includes: Step 1: Input the preprocessed stereo image pair I_L, I_R and extract the feature maps f_L, f_R ∈ ℝ^(C×H/4×W/4); Step 2: Convert the monocular depth prior into an initial disparity, and calibrate and correct its scale using hybrid scale alignment to obtain the calibrated initial disparity; Step 3: Calculate the 4D correlation volume C(i,j,d) = ⟨f_L(i,j), f_R(i,j−d)⟩ and construct a multi-scale correlation pyramid; Step 4: Extract the combined context features and initialize the ConvGRU hidden state h(0); Step 5: Perform 32 ConvGRU iterations to obtain the final disparity d(L); Step 6: Upsample d(L) to full resolution and convert it to a metric depth D; Step 7: Calculate the stereo self-supervised joint loss, and backpropagate to update the parameters of A′, B′, the ConvGRU, the CNN encoder, and the upsampling module; Step 8: After training for 40 epochs, save the complete model; optional fine-tuning on the TartanAir underwater subset can further improve performance.
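The correlation volume of Step 3 can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: the array layout `[d, i, j]`, the zero padding where j − d falls outside the image, and the average pooling along the disparity axis used to build the pyramid are choices made here, not details given in the text.

```python
import numpy as np

def correlation_volume(f_left, f_right, max_disp):
    """volume[d, i, j] = <f_left[:, i, j], f_right[:, i, j - d]>, zero where j - d < 0."""
    c, h, w = f_left.shape
    volume = np.zeros((max_disp, h, w), dtype=f_left.dtype)
    for d in range(max_disp):
        # Columns j >= d of the left map align with columns j - d of the right map.
        volume[d, :, d:] = np.sum(f_left[:, :, d:] * f_right[:, :, : w - d], axis=0)
    return volume

def correlation_pyramid(volume, levels):
    """Halve the disparity axis per level by averaging neighbours (assumed scheme)."""
    pyramid = [volume]
    for _ in range(levels - 1):
        v = pyramid[-1]
        d = v.shape[0] // 2 * 2
        pyramid.append(0.5 * (v[0:d:2] + v[1:d:2]))
    return pyramid

# Toy check: one-hot column codes, left features shifted by a true disparity of 3.
w, h = 8, 2
f_right = np.repeat(np.eye(w)[:, None, :], h, axis=1)   # shape (C=w, H, W)
f_left = np.roll(f_right, 3, axis=2)
f_left[:, :, :3] = 0.0                                   # no valid match in the first 3 columns

volume = correlation_volume(f_left, f_right, max_disp=6)
pyramid = correlation_pyramid(volume, levels=3)
best = volume.argmax(axis=0)                             # per-pixel best disparity
```

In the toy example the correlation peaks at d = 3 for every column that has a valid match, which is the behaviour the iterative refinement relies on when it looks up matching costs around the current disparity.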
[0105] Phase 5: Inference and Deployment: Fully execute the stereo depth estimation method. Starting from acquiring stereo image pairs, it sequentially completes monocular depth estimation, scale calibration, iterative optimization, and metric depth map generation. The final output is three-dimensional geometric information that can be directly used for underwater robot navigation and obstacle avoidance.
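The final conversion from disparity to metric depth follows the standard rectified-stereo relation D = f·b/d, with f the focal length in pixels and b the baseline. A minimal sketch is given below; the focal length, baseline, and disparity values are made-up numbers for illustration, and mapping non-positive disparities to infinity is a choice made here.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Metric depth D = f * b / d; disparities <= eps are mapped to inf."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: f = 700 px, b = 0.12 m.
disp = np.array([[42.0, 84.0], [0.0, 21.0]])
depth = disparity_to_depth(disp, focal_px=700.0, baseline_m=0.12)
```

Halving the disparity doubles the depth, so small disparity errors at long range translate into large depth errors, which is one reason the refinement stage matters for distant regions.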
[0106] Deployment on the Jetson Orin NX (16GB) platform of the BlueROV2 underwater robot fully followed the experimental protocol: 1. Hardware configuration: the robot is equipped with a stereo camera (with calibrated focal length f and baseline b), an STM32 motion controller, and a Jetson Orin NX computing platform; the left and right sensors of the camera are triggered by a built-in synchronization circuit. 2. Data acquisition: three obstacle environments (dispersed, side-by-side, and clustered) are set up in an indoor rectangular water tank; the remote-controlled robot moves along three preset trajectories and acquires 9 sets of synchronized stereo image sequences. 3. Real-time preprocessing: the onboard computing platform performs real-time distortion removal on the acquired I_L and I_R and scales them to 640×360 resolution (to suit the computing capabilities of embedded platforms). 4. End-to-end inference: the images are input into the Stereo Adapter model accelerated by TensorRT, which executes the process shown in Figure 2(a) and outputs the depth map D; the measured inference latency is 113 ms/frame. 5. Application of results: the depth map is transmitted to the robot's motion control system, providing three-dimensional geometric information for autonomous navigation, obstacle avoidance, and infrastructure inspection. 6. Evaluation benchmark: a 3D metric mesh of the scene is constructed using AprilTag, a reference depth map is generated, and the REL, SQREL, RMSE, LOGRMSE, and A1 metrics are used for evaluation.
[0107] Figure 3 compares the architecture of the Stereo Adapter underwater stereo depth estimation method provided in this application with existing benchmark methods: (a) is the Stereo Anywhere method, (b) is the TiO-Depth method, and (c) is the Stereo Adapter method of this application. The three methods differ significantly in encoder design, feature fusion method, and training mechanism.
[0108] In the Stereo Anywhere method shown in Figure 3(a), features are extracted from the left and right images by independent feature encoders. A multi-scale correlation pyramid is then constructed, and a GRU updater performs iterative optimization to output a full-resolution depth map. This method relies on the traditional stereo matching framework, does not fully utilize monocular depth priors, and its encoder is not adapted to underwater scenes, making it prone to photometric fragility and scale ambiguity in underwater imaging environments.
[0109] The TiO-Depth method shown in Figure 3(b) employs a dual-encoder architecture. Features are extracted from the left and right images via a convolutional window encoder and a Swin encoder, respectively. The SDFA module then performs multi-scale feature aggregation to generate monocular / binocular depth maps. Although this method introduces dual encoders to enhance feature representation, it still uses fixed-rank LoRA adaptation, which cannot dynamically adjust the adaptation strength. Furthermore, it does not achieve efficient fusion of monocular priors and stereo matching, which limits its generalization ability and accuracy in underwater scenes.
[0110] Figure 3(c) shows the Stereo Adapter method of this application. The left and right images are each processed by a Depth Anything V2 encoder (embedded with a dynamic LoRA module) to extract features. The dynamic LoRA module achieves efficient adaptation to underwater scenes through adaptive rank selection and sparse regularization. Subsequently, the features of the left image are processed by the SDFA module to generate a globally consistent monocular depth prior (initial depth), providing a reliable reference for subsequent stereo optimization. At the same time, the features of the left and right images are used to construct a multi-scale correlation pyramid that serves as the stereo geometric feature input. Together with the monocular depth prior and the combined context features, the pyramid is input to the GRU updater. The disparity is gradually corrected through iterative optimization, and the full-resolution final depth map is then output.
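The iterative correction performed by the GRU updater can be sketched as a simple loop. The sketch below is a simplification under explicit assumptions: the GRU is applied per pixel (equivalent to a ConvGRU with 1×1 kernels), the gate equations follow the standard GRU form, and `corr_lookup`, `w_head`, the channel sizes, and the random weights are hypothetical placeholders rather than the application's trained modules.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, Wz, Wr, Wq):
    """One GRU update applied independently at every pixel (1x1-kernel ConvGRU)."""
    hx = np.concatenate([h, x], axis=-1)
    z = sigmoid(hx @ Wz)                                   # update gate
    r = sigmoid(hx @ Wr)                                   # reset gate
    q = np.tanh(np.concatenate([r * h, x], axis=-1) @ Wq)  # candidate state
    return (1.0 - z) * h + z * q

def refine(disp0, context, corr_lookup, params, iters=32):
    """Start from the calibrated initial disparity and add a residual per iteration."""
    Wz, Wr, Wq, w_head = params
    h = np.tanh(context)                   # hidden state initialised from context features
    disp = disp0.copy()
    for _ in range(iters):
        # Input features: correlation features sampled at the current disparity,
        # concatenated with the disparity itself.
        x = np.concatenate([corr_lookup(disp), disp[..., None]], axis=-1)
        h = gru_step(h, x, Wz, Wr, Wq)
        disp = disp + h @ w_head           # disparity update amount
    return disp

rng = np.random.default_rng(0)
H, W, Ch, Cc = 5, 6, 4, 2                  # toy sizes (assumption)
params = (rng.normal(size=(Ch + Cc + 1, Ch)) * 0.1,
          rng.normal(size=(Ch + Cc + 1, Ch)) * 0.1,
          rng.normal(size=(Ch + Cc + 1, Ch)) * 0.1,
          rng.normal(size=Ch) * 0.01)
toy_lookup = lambda d: np.stack([np.sin(d), np.cos(d)], axis=-1)  # stand-in for pyramid sampling
disp0 = rng.uniform(1.0, 10.0, size=(H, W))
context = rng.normal(size=(H, W, Ch))

disp_final = refine(disp0, context, toy_lookup, params, iters=32)
```

The key design point shown here is that the network predicts residual updates to an already scale-calibrated disparity, so the monocular prior anchors the global structure while the recurrent updates correct local errors.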
[0111] In terms of training mechanism, the Stereo Adapter method in this application constructs the UW-StereoDepth-40K synthetic stereo dataset (synthesized in UE5, containing 40,000 pairs of high-fidelity underwater stereo images that simulate different attenuation, scattering, particle, and baseline settings) to jointly pre-train the parameter-efficiently adapted visual base model and the iterative optimization module. At the same time, dynamic LoRA automatically selects the effective rank of each layer and integrates the remaining components into the base weights, and multi-objective optimization is performed through monocular prior guidance, photometric reconstruction, an occlusion-aware mask, and edge-aware smoothness, so as to achieve continual scene adaptation without relying on dense real underwater data.
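Of the objectives listed above, the edge-aware smoothness term is the easiest to show in isolation. The text does not give its exact form; the sketch below uses the formulation common in self-supervised depth estimation, in which disparity gradients are down-weighted by exp(−|∇I|) wherever the image itself has an edge. The toy image and disparity map are made up for illustration.

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Mean |grad disp| weighted by exp(-|grad image|).

    disp: (H, W) disparity map; image: (H, W, 3) image with values in [0, 1].
    """
    ddx = np.abs(disp[:, 1:] - disp[:, :-1])
    ddy = np.abs(disp[1:, :] - disp[:-1, :])
    # Image gradients averaged over the colour channels.
    idx = np.mean(np.abs(image[:, 1:] - image[:, :-1]), axis=-1)
    idy = np.mean(np.abs(image[1:, :] - image[:-1, :]), axis=-1)
    return float(np.mean(ddx * np.exp(-idx)) + np.mean(ddy * np.exp(-idy)))

# A disparity step that coincides with an image edge is penalised less
# than the same step over a textureless region.
disp_step = np.zeros((4, 6))
disp_step[:, 3:] = 1.0
img_edge = np.zeros((4, 6, 3))
img_edge[:, 3:, :] = 1.0
img_flat = np.zeros((4, 6, 3))

loss_on_edge = edge_aware_smoothness(disp_step, img_edge)
loss_on_flat = edge_aware_smoothness(disp_step, img_flat)
loss_constant = edge_aware_smoothness(np.ones((4, 6)), img_edge)
```

This weighting lets depth discontinuities survive at object boundaries while still discouraging noise in regions of uniform appearance, which are frequent in turbid underwater images.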
[0112] In terms of experimental verification, the method of this application achieved significant performance improvements on both simulated datasets (such as TartanAir) and real underwater stereo depth benchmarks (such as SQUID), with RMSE improvements of 6.11% and 5.12%, respectively. In the actual deployment test on the BlueROV2 underwater robot, the method stably output high-precision three-dimensional geometric information, directly supporting autonomous navigation, obstacle avoidance, and mapping operations, which verifies the practicality and reliability of the stereo depth estimation method.
[0113] To verify the effectiveness, adaptability, and practical application value of the stereo depth estimation method proposed in this application, multi-dimensional experiments were conducted in the embodiments of this application. The overall process was divided into three core stages: pre-training verification on synthetic datasets, performance evaluation on public benchmark datasets, and deployment testing in real-world underwater scenarios. The experiments focused on the model's accuracy, generalization ability, and real-time performance in underwater stereo depth estimation tasks. Simultaneously, ablation experiments were used to explore the role of core modules, comprehensively verifying the feasibility and superiority of the technical solution.
[0114] (I) Dataset and Evaluation Metrics. The experiment used the UW-StereoDepth-40K synthetic dataset for model pre-training. This dataset was generated with Unreal Engine 5 and contains 40,000 pairs of underwater stereo images at 800×1280 resolution, covering four types of scenes: coral reefs, industrial ruins, shipwrecks, and natural seabed. It simulates different attenuation, scattering, particle effects, and binocular baseline distances, and includes dense depth ground truth and semantic masks.
[0115] The performance evaluation phase used two underwater stereo datasets: the simulated TartanAir dataset and the real-world SQUID dataset. TartanAir contains 14 subsets of underwater stereo images from 22 different environments, while SQUID contains 57 pairs of stereo images captured in 4 different scenes. The evaluation metrics are the standard metrics of the depth estimation field, including relative error (REL), squared relative error (SQREL), root mean square error (RMSE), log root mean square error (LOGRMSE), and the accuracy metrics δ1, δ2, and δ3 (with thresholds of 1.25, 1.25², and 1.25³, respectively). Qualitative evaluations of visual coherence and edge accuracy were also carried out.
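For reference, the metrics named above can be computed as in the following sketch, which uses their conventional definitions from the depth estimation literature. The masking of pixels without ground truth (gt ≤ 0) is an assumption made here, and predictions are assumed to be strictly positive; the two-pixel example is invented for illustration.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics over pixels with valid ground truth (gt > 0)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "REL": float(np.mean(np.abs(pred - gt) / gt)),
        "SQREL": float(np.mean((pred - gt) ** 2 / gt)),
        "RMSE": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "LOGRMSE": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

# One pixel predicted at twice its true depth, one pixel predicted exactly.
metrics = depth_metrics(pred=[2.0, 4.0], gt=[1.0, 4.0])
perfect = depth_metrics(pred=[1.0, 4.0], gt=[1.0, 4.0])
```

The error metrics are lower-is-better, while δ1 to δ3 report the fraction of pixels whose prediction is within the given ratio of the ground truth and are higher-is-better.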
[0116] (II) Experimental Environment and Model Configuration. Model training is carried out on a hardware environment consisting of an Intel Xeon Platinum 8469C CPU, an NVIDIA L4 480GB GPU, and 64GB of memory. Inference deployment testing is carried out on the BlueROV2 underwater robot, with a Jetson Orin NX (16GB) as the core computing unit, equipped with a ZED 2i stereo camera and an STM32 motion controller. All sensor data are collected through the robot's built-in synchronization circuit.
[0117] The model is initialized with a pre-trained DepthAnythingV2-B[7] encoder. The training strategy is divided into two stages: first, 20 epochs of monocular depth estimation training are performed, and then 40 epochs of stereo depth estimation training are performed. The optimizer is AdamW, the learning rate is fixed at 1×10⁻⁴, and the batch size is 8. Dynamic LoRA adaptation uses a threshold of 0.01, and dense training rounds account for 45%-50% of the total to ensure that the model is fully adapted to the characteristics of underwater tasks.
[0118] (III) Experimental Results and Analysis. After training on the UW-StereoDepth-40K pre-training dataset, the proposed method demonstrated state-of-the-art performance in tests on both the TartanAir underwater subset and the SQUID dataset. On the TartanAir dataset, the model achieved the lowest relative error (0.0527) and root mean square error (RMSE) (2.8047), with a δ1 accuracy of 94.67%. After fine-tuning on the TartanAir dataset, the error was further reduced, with the RMSE decreasing to 2.7834 and the accuracies for δ1, δ2, and δ3 reaching 95.12%, 98.36%, and 99.04%, respectively. On the SQUID dataset, the RMSE was as low as 1.883 and was further reduced to 1.862 after fine-tuning on the TartanAir dataset, with accuracies for δ1, δ2, and δ3 reaching 94.13%, 97.48%, and 98.52%, respectively, significantly outperforming existing mainstream stereo matching methods.
[0119] Qualitative evaluation results show that the depth map generated by the method in this application has stronger visual coherence, more accurate scale estimation for distant areas, clearer edge details, and high geometric shape reproduction, effectively solving the problems of texture loss and edge blurring that occur in traditional methods in underwater scenes.
[0120] (IV) Ablation experiment verification. To clarify the technical contributions of the core modules, this embodiment conducted multiple ablation experiments. First, ablation experiments on the recurrent refinement module showed that the number of GRU layers, the hidden dimension, and the number of iterations significantly affect performance; a configuration of 4 GRU layers, a hidden dimension of 128, and 32 iterations achieved the best balance between performance and computational efficiency. Second, ablation experiments on the dynamic LoRA configuration verified that a threshold of 0.01 and a dense round ratio of 45%-50% gave the best root mean square error and relative error. Third, ablation experiments on training hyperparameters showed that a medium batch size (8-16) and a learning rate of 1×10⁻⁴ were optimal; these settings, combined with the two-stage training strategy, enable more stable model convergence. Ultimately, the best performance was achieved with a batch size of 8, 20 epochs of monocular training, and 40 epochs of stereo training.
[0121] Real-time performance testing was conducted on a Jetson Orin NX (MaxN mode) with an input resolution of 640×320. The inference time per frame of the method in this application is 113 milliseconds, which is lower than that of comparative methods such as Stereo Anywhere (140 milliseconds), DepthAnything-B (327 milliseconds), and FoundationStereo (702 milliseconds), thus meeting the timing requirements for real-time operation of underwater robots.
[0122] Real-world deployment experiments were conducted in an indoor rectangular water tank, constructing three obstacle environments: dispersed, well-shaped, and clustered. A BlueROV2 robot navigated along three different trajectories, acquiring nine sets of synchronized stereo image sequences. Experimental results show that the depth information output by the proposed method accurately supports the robot's autonomous navigation and obstacle avoidance operations. The depth measurements highly match the ground truth values of the 3D reference mesh constructed based on AprilTag, fully verifying the reliability and practicality of the technical solution in real-world underwater operation scenarios.
[0123] In summary, this application provides a stereo depth estimation method that comprises: acquiring a stereo image pair of the image to be processed, including a left view and a right view; performing monocular depth estimation on the left view using a visual base model to obtain monocular depth reference information; performing scale calibration on the monocular depth reference information using the geometric constraints of the stereo image pair to obtain an initial disparity; iteratively optimizing the initial disparity using stereo geometric features extracted from the stereo image pair to obtain a final disparity; and generating a metric depth map based on the final disparity. This step-by-step processing achieves high-precision stereo depth estimation of the image to be processed and generates a depth map with practical metric significance. First, the stereo image pair is acquired as the basic input and, based on the visual base model, monocular depth estimation is performed on the left view to quickly obtain monocular depth reference information that serves as a reference for subsequent processing. Second, the geometric constraints of the stereo image pair are used to correct the scale bias of the monocular depth reference information, transforming the monocular depth, which lacks absolute scale, into an accurate initial disparity. Finally, the stereo geometric features of the stereo image pair are extracted to iteratively optimize the initial disparity, correcting local errors and improving detail accuracy. A usable metric depth map is then generated from the final disparity, which resolves the lack of absolute scale in monocular depth estimation and the limited accuracy of the initial disparity, and ensures the practicality and accuracy of the depth results.
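The scale calibration step summarized above (and detailed in claim 5) can be sketched as follows. This is a hedged illustration: estimating the scale factor as the median ratio between sparse stereo depths and the monocular depth at the matched pixels, the tolerance range (0.9, 1.1), and the camera values are assumptions introduced here, since the application only states that a scale factor is computed and compared with a preset threshold range.

```python
import numpy as np

def calibrate_to_disparity(mono_depth, sparse_rc, sparse_depth,
                           focal_px, baseline_m, tol=(0.9, 1.1)):
    """Scale-correct a monocular depth map and convert it to an initial disparity.

    sparse_rc: (K, 2) integer (row, col) locations of matched feature points.
    sparse_depth: (K,) metric depths triangulated at those points.
    tol: hypothetical preset range inside which no correction is applied.
    """
    rows, cols = sparse_rc[:, 0], sparse_rc[:, 1]
    # Assumed estimator: robust median of per-point depth ratios.
    scale = float(np.median(sparse_depth / mono_depth[rows, cols]))
    if not (tol[0] <= scale <= tol[1]):
        mono_depth = mono_depth * scale        # correct the scale bias
    return focal_px * baseline_m / mono_depth, scale

true_depth = np.array([[2.0, 4.0], [1.0, 8.0]])
rc = np.array([[0, 0], [0, 1], [1, 1]])
sparse = true_depth[rc[:, 0], rc[:, 1]]

# Monocular prediction that is off by a global factor of 2.
disp_a, scale_a = calibrate_to_disparity(true_depth / 2.0, rc, sparse, 700.0, 0.12)
# Monocular prediction that is already at the right scale.
disp_b, scale_b = calibrate_to_disparity(true_depth, rc, sparse, 700.0, 0.12)
```

In both cases the resulting initial disparity matches the one implied by the true depth, which is the property the subsequent iterative optimization depends on.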
[0124] Based on the same technical concept, embodiments of this application also provide a stereo depth estimation system. As shown in Figure 4, the system includes: a stereo image acquisition module 401, used to acquire a stereo image pair of an image to be processed, wherein the stereo image pair includes a left view and a right view of the image to be processed; a monocular depth estimation module 402, used to call the visual base model to perform monocular depth estimation on the left view and obtain monocular depth reference information; a scale calibration module 403, used to perform scale calibration on the monocular depth reference information using the geometric constraints of the stereo image pair to obtain the initial disparity; an iterative module 404, used to iteratively optimize the initial disparity using the stereo geometric features extracted from the stereo image pair to obtain the final disparity; and a metric depth map module 405, used to generate a metric depth map based on the final disparity.
[0125] This application also provides an electronic device corresponding to the method provided in the foregoing embodiments. Please refer to Figure 5, which illustrates an electronic device provided by some embodiments of this application. The electronic device 20 may include: a processor 200, a memory 201, a bus 202, and a communication interface 203, wherein the processor 200, the communication interface 203, and the memory 201 are connected via the bus 202; the memory 201 stores a computer program that can run on the processor 200, and when the processor 200 runs the computer program, it executes the method provided by any of the foregoing embodiments of this application.
[0126] The memory 201 may include high-speed random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Communication between this system network element and at least one other network element is achieved through at least one physical port (which can be wired or wireless), such as the Internet, wide area network, local area network, or metropolitan area network.
[0127] Bus 202 can be an ISA bus, PCI bus, or EISA bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. The memory 201 is used to store programs. After receiving an execution instruction, the processor 200 executes the program. The method disclosed in any of the foregoing embodiments of this application can be applied to the processor 200, or implemented by the processor 200.
[0128] The processor 200 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 200 or by instructions in software form. The processor 200 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media that are mature in the art. The storage medium is located in the memory 201. The processor 200 reads the information in the memory 201 and, in conjunction with its hardware, completes the steps of the above method.
[0129] The electronic devices and methods provided in the embodiments of this application are based on the same inventive concept and have the same beneficial effects as the methods they employ, operate, or implement.
[0130] This application also provides a computer-readable storage medium corresponding to the method provided in the foregoing embodiments. Please refer to Figure 6: the computer-readable storage medium shown is an optical disc 30, on which a computer program (i.e., a program product) is stored, which, when run by a processor, executes the methods provided in any of the foregoing embodiments.
[0131] It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other optical and magnetic storage media, which will not be elaborated here.
[0132] The computer-readable storage medium provided in the above embodiments of this application and the method provided in the embodiments of this application are based on the same inventive concept and have the same beneficial effects as the methods adopted, run or implemented by the applications stored therein.
[0133] It should be noted that the above embodiments are illustrative of this application and not limiting of it, and that those skilled in the art can devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses should not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in the claims. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. This application can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by the same item of hardware. The use of the words first, second, and third, etc., does not indicate any order. These words can be interpreted as names.
[0134] The above description is merely a preferred embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0135] The above description is only a preferred embodiment of the present invention and does not limit the patent scope of the present invention. All equivalent structural transformations made under the concept of the present invention using the contents of the present invention specification and drawings, or direct / indirect applications in other related technical fields, are included within the patent protection scope of the present invention.
Claims
1. A method for estimating stereo depth, characterized in that, The method includes: Obtain a stereo image pair of the image to be processed, wherein the stereo image pair includes a left view and a right view of the image to be processed; The visual base model is invoked to perform monocular depth estimation on the left view to obtain monocular depth reference information; Using the geometric constraints of the stereo image pair, the monocular depth reference information is scaled to obtain the initial disparity; The initial disparity is iteratively optimized using the stereo geometric features extracted from the stereo image pair to obtain the final disparity; A depth map is generated based on the final disparity.
2. The method as described in claim 1, characterized in that, The step of calling the visual base model to perform monocular depth estimation on the left view to obtain monocular depth reference information includes: The encoder of the visual base model is adjusted using low-rank adaptation technology to obtain an encoder adapted to the target scene, and the adapted encoder is integrated into the visual base model. The left view is input into the visual base model with the adapted encoder, and a multi-scale feature pyramid is extracted. The multi-scale feature pyramid includes image features at several resolution levels. The multi-scale feature pyramid is subjected to adjacent-scale feature aggregation to obtain aggregated features; The aggregated features are decoded to generate the monocular depth reference information.
3. The method as described in claim 2, characterized in that, The step of adjusting the encoder of the visual base model using low-rank adaptation technology to obtain an encoder adapted to the target scene includes: A low-rank adaptation module is introduced into the encoder of the visual base model. The low-rank adaptation module includes a low-rank matrix and is used to incrementally update the base weights of the model encoder. Assign learnable importance weights to each rank component in the low-rank matrix; The low-rank matrix parameters and importance weights in the low-rank adaptation module are optimized and trained, while keeping the pre-trained basic weights of the encoder of the visual base model unchanged. A sparse regularization constraint is adopted to set the values of rank components whose importance weights are lower than a preset threshold to zero. The effective rank components whose values are not set to zero are integrated into the pre-trained base weights of the visual base model encoder to obtain an encoder adapted to the target scene.
4. The method as described in claim 2, characterized in that, In the process of decoding the aggregated features to generate the monocular depth reference information, the decoding process and the adaptation parameters of the visual base model are jointly optimized using self-supervised loss, including: The left view is reconstructed based on the monocular depth reference information to obtain the reconstructed left view; Calculate the photometric reconstruction loss, which is the photometric difference between the reconstructed left view and the original left view; Calculate the edge-aware smoothness loss, where the edge-aware smoothness loss is the edge continuity constraint value of the monocular depth reference information; A self-supervised loss function is constructed based on the photometric reconstruction loss and the edge-aware smoothness loss. The loss value of the self-supervised loss function is backpropagated through backpropagation to optimize the adapted encoder parameters and decoding parameters of the decoding process in the visual base model.
5. The method as described in claim 1, characterized in that, The step of using the geometric constraints of the stereo image pair to perform scale calibration on the monocular depth reference information to obtain the initial disparity includes: Feature points of the left and right views in the stereo image pair are extracted, and sparse correspondences are obtained based on feature point matching and bidirectional consistency checks. The sparse correspondences include the set of matching points and the corresponding depth values. Calculate the scale factor between the monocular depth reference information and the depth values of the sparse correspondences; Determine whether the scale factor is within a preset threshold range; if it is within the preset threshold range, directly convert the monocular depth reference information into the initial disparity based on the camera intrinsic parameters; if it is not within the preset threshold range, correct the monocular depth reference information using the scale factor, and then convert it into the initial disparity based on the camera intrinsic parameters; wherein, the camera intrinsic parameters include the camera focal length and the binocular baseline distance, and the binocular baseline distance is the distance between the viewpoints from which the left and right views of the stereo image pair are captured.
6. The method as described in claim 1, characterized in that, The initial disparity is iteratively optimized using stereo geometric features extracted from the stereo image pair to obtain the final disparity, including: Extract the combined contextual features of the stereo image pair, which include global features output by a visual base model adapted by low-rank adaptation technique and local detail features extracted by a lightweight convolutional neural network. The combined context features are input into the loop update module to initialize the hidden state of the loop update module; the initial disparity is assigned the current disparity at the beginning of the iteration. In each iteration, resolution-level features matching the current iteration are extracted from the stereo geometric features; the resolution-level features, the features corresponding to the current disparity, and the combined context features are fused to obtain fused features; the fused features are input into the initialized loop update module, which outputs the disparity update amount; the current disparity is adjusted using the disparity update amount to obtain the updated current disparity; Repeat the steps of each iteration until the number of iterations reaches the preset termination condition, and take the current disparity at this time as the final disparity.
7. The method as described in claim 6, characterized in that, Each iteration also includes: Based on the current disparity of this iteration, the right view of the stereo image pair is processed to obtain the reconstructed left view; Calculate the stereo reconstruction loss, which is the photometric difference between the reconstructed left view and the left view of the stereo image pair; The monocular depth reference information is used to distinguish the occlusion regions in the stereo image pair, and the occlusion perception loss is calculated. Calculate the disparity guidance loss, which is the difference between the current disparity and the initial disparity in the current iteration; A self-supervised joint loss function is constructed by combining the stereo reconstruction loss, the occlusion perception loss, and the disparity guidance loss. The parameters of the loop update module are optimized by minimizing the self-supervised joint loss function through backpropagation.
8. The method as described in claim 1, characterized in that, The stereo geometric features are extracted according to the following steps: Feature maps of the left and right views of the stereo image pair are extracted by a feature extraction encoder. Calculate the similarity correlation between the left view feature map and the right view feature map under different disparities, and construct the correlation volume; Multi-scale downsampling is performed on the correlation volume to construct a multi-scale correlation feature pyramid; The multi-scale correlation feature pyramid is used as the stereo geometric features extracted from the stereo image pair.
9. A stereo depth estimation system, characterized in that, The system includes: A stereo image acquisition module is used to acquire a stereo image pair of an image to be processed, wherein the stereo image pair includes a left view and a right view of the image to be processed; The monocular depth estimation module is used to call the visual base model to perform monocular depth estimation on the left view and obtain monocular depth reference information. The scale calibration module is used to perform scale calibration on the monocular depth reference information using the geometric constraints of the stereo image pair to obtain the initial disparity; An iterative module is used to iteratively optimize the initial disparity using the stereo geometric features extracted from the stereo image pair to obtain the final disparity; The metric depth map module is used to generate a metric depth map based on the final disparity.
10. An electronic device, comprising: A memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method as claimed in any one of claims 1-8.
11. A computer-readable storage medium, characterized in that, It stores computer-readable instructions that can be executed by a processor to implement the method as described in any one of claims 1-8.