Intracavity depth estimation method based on linear Taylor Transformer repair and self-supervised adaptation

By employing a linear Taylor Transformer repair and self-supervised adaptation method, the problem of low geometric perception accuracy caused by endoscopic motion blur in minimally invasive surgery was solved. This method achieved high-precision image degradation repair and depth estimation, improving the perception accuracy and robustness of intracavitary scene details.

CN122265089A · Pending · Publication Date: 2026-06-23 · HARBIN UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HARBIN UNIV OF SCI & TECH
Filing Date
2026-03-19
Publication Date
2026-06-23

Abstract

The application discloses an intracavity depth estimation method based on linear Taylor Transformer repair and self-supervised adaptation, belonging to the technical field of medical image processing and deep learning. The application introduces linear Taylor Transformer repair to keep image information clear and accurate during depth estimation adaptation. First, a high-precision training dataset is constructed, and a parameterized degradation model is established to simulate the real intracavitary motion imaging environment. Subsequently, an improved Multi-branch Transformer Expanded by Taylor Formula (MB-TaylorFormerV2) performs image repair, with the T-MSA++ operator accurately restoring anatomical structural features while maintaining linear computational efficiency. Finally, the repaired clear frames are input into the Endoscopic Depth Any Camera (EndoDAC) framework for self-supervised transfer learning, achieving robust depth estimation. The application integrates image degradation repair and monocular depth estimation, significantly improving the network's perception accuracy for intracavitary scene details and effectively solving the problem of motion blur interfering with geometric perception due to rapid endoscope movement in minimally invasive surgery.

Description

Technical Field

[0001] This invention belongs to the fields of medical image processing and deep learning technology, specifically relating to an image deblurring and self-supervised depth estimation method for dynamic scenes in monocular endoscopy.

Background Technology

[0002] In minimally invasive surgery, images captured by endoscopes are often accompanied by significant motion blur, which severely erodes critical edge contours and texture details, causing subsequent perception algorithms to fail in modeling spatial geometric consistency. Most existing deep learning algorithms rely on the geometric topology of high-fidelity images for inference, but accuracy collapse in dynamic environments has become a technical bottleneck. Therefore, introducing high-performance image inpainting mechanisms to restore the structural features of the input source has become an essential technique for improving the robustness of depth estimation.

[0003] To more accurately recover scene depth from disturbed intracavitary images, this invention introduces linear Taylor Transformer inpainting to ensure clear and accurate image information during depth estimation. First, a high-precision training dataset is constructed, and a parameterized degradation model is established to simulate the real intracavitary motion imaging environment. Then, an improved Multi-branch Transformer Expanded by Taylor Formula (MB-TaylorFormerV2) is used for image inpainting, and the T-MSA++ operator accurately restores anatomical structural features while maintaining linear computational efficiency. Finally, the inpainted, clear frames are input into an Endoscopic Depth Any Camera (EndoDAC) framework for self-supervised transfer learning, achieving robust depth estimation. This invention integrates image degradation inpainting and monocular depth estimation, significantly improving the network's perception accuracy of intracavitary scene details and effectively solving the problem of motion blur interfering with geometric perception due to rapid endoscopic movement in minimally invasive surgery.

Summary of the Invention

[0004] To overcome the problems of low geometric perception accuracy and poor robustness caused by motion blur due to the dynamic and changing nature of minimally invasive surgical scenarios and the rapid movement of the endoscope, this invention introduces a linear Taylor Transformer repair method to ensure clear and accurate image information during depth estimation. This invention integrates image degradation repair and monocular depth estimation, significantly improving the network's perception accuracy of details within the intracavitary scene and effectively solving the problem of motion blur interfering with geometric perception due to the rapid movement of the endoscope in minimally invasive surgery.

[0005] The present invention solves the above-mentioned technical problems by adopting the following technical solution:

[0006] A method for estimating lumen depth based on linear Taylor Transformer repair and self-supervised adaptation, comprising:

[0007] Step a: Construct a depth dataset of the cavity environment covering the training set, validation set, and test set, and perform preprocessing. The specific steps are as follows:

[0008] Step a1: The present invention uses the Stereo Correspondence and Reconstruction of Endoscopic Data (SCARED) dataset as the basis for algorithm verification. This dataset was collected by the da Vinci Xi surgical robot from porcine abdominal dissection scenes and contains 35 video segments, totaling 22,950 frames of images.

[0009] Step a2: Perform normalization preprocessing on the dataset. Each video frame is paired with a high-precision ground-truth depth map acquired by a structured-light projector.

[0010] Step a3: Strictly implement the data partitioning scheme according to the experimental verification requirements. The training set contains 15,351 frames, the validation set contains 1,705 frames, and the test set contains 551 frames; the test set is used for quantitative evaluation of the final performance.

[0011] Step b: Construct a parametric image degradation simulation model, establish a degradation process combining linear convolution and additive noise, and generate a motion-blurred image sequence with highly realistic simulation characteristics. The specific steps are as follows:

[0012] Step b1: To simulate the irregular camera trajectory during surgery, the parameter sampling module is used to randomly sample motion displacement, rotation angle and noise level from a preset range.

[0013] Step b2: Construct the point spread function. A unit impulse response is constructed at the center of the spatial matrix, and then spatially rotated using an affine transformation matrix to characterize the direction of the motion vector at the moment of exposure.

[0014] Step b3: Perform convolution and noise addition. A mirror reflection boundary expansion strategy is used to suppress edge artifacts. Spatial domain convolution is applied to the image, and a normally distributed noise tensor is superimposed to generate a simulated degraded image.

[0015] Step c: Deblur the degraded image using the improved MB-TaylorFormerV2 network, and accurately restore the key features of the image using the T-MSA++ operator. The specific steps are as follows:

[0016] Step c1: Employ a linear attention mechanism based on Taylor series expansion to reduce the computational complexity of attention from quadratic, $O(N^2)$, to linear, $O(N)$, in the number of pixels $N$.

[0017] Step c2: Image restoration is performed using the core operator T-MSA++. A preserving mapping function ensures the non-negativity of the attention map, guiding the model to focus on high-frequency details such as blood vessel texture and tissue edges.

[0018] Step c3: Dataset evaluation and verification. Calculate the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) between the repaired image and the blurred image.

[0019] Step d: Using the training set as input, the pre-trained base model is transferred to the surgical cavity domain using dynamic vector low-rank adaptation technology, and a self-supervised training strategy is employed. The specific steps are as follows:

[0020] Step d1: Input the repaired frame into the DepthNet module of the EndoDAC framework, which has a pre-trained visual base model DepthAnything built in.

[0021] Step d2: Apply the dynamic vector low-rank adaptation technique to the weight matrix. Fine-tuning is performed using trainable low-rank matrices and dynamic diagonal vectors, which enables fast domain transfer to the lighting and texture environment of the cavity.

[0022] Step e: Combining photometric consistency loss, an effective estimate of the lumen depth can be obtained through model prediction.

[0023] Step e1: Employ a self-supervised strategy to jointly estimate image depth, camera pose, and intrinsic parameters, enabling the learning of geometric constraints from monocular videos without requiring prior knowledge of camera information.

[0024] Step e2: After multi-scale feature fusion and decoding, the Final Head outputs the predicted depth map.

Description of the Drawings

[0025] Figure 1 is a flowchart of the image degradation simulation of the present invention.

[0026] Figure 2 is the overall flowchart of the present invention.

[0027] Figure 3 is a comparison of the image degradation effects of the present invention, where (a) is the original cavity image and (b) is the degraded image produced by this method.

[0028] Figure 4 is a comparison of the image restoration effects of the present invention, where (a) is the original cavity image, (b) is the degraded image, and (c) is the restored image.

[0029] Figure 5 is a comparison of the depth maps of the present invention, where (a) is the original cavity image, (b) is the depth map estimated by the EndoDAC method, and (c) is the depth map estimated by the present invention.

Detailed Implementation

[0030] To make the objectives, technical solutions, and features of this invention clearer, the specific embodiments of this invention will be described in further detail below with reference to the accompanying drawings.

[0031] This specific implementation describes the cavity depth estimation method based on linear Taylor Transformer restoration and self-supervised adaptation. The image degradation simulation flow is shown in Figure 1, and the overall flow is shown in Figure 2. The steps can be summarized as follows:

[0032] Step a: Construct a depth dataset of the cavity environment covering the training set, validation set, and test set, and perform preprocessing. The specific steps are as follows:

[0033] Step a1: The present invention uses the Stereo Correspondence and Reconstruction of Endoscopic Data (SCARED) dataset as the basis for algorithm verification. This dataset was collected by the da Vinci Xi surgical robot from porcine abdominal dissection scenes and contains 35 video segments, totaling 22,950 frames of images.

[0034] Step a2: Perform normalization preprocessing on the dataset. Each video frame is paired with a high-precision ground-truth depth map acquired by a structured-light projector.

[0035] Step a3: Strictly implement the data partitioning scheme according to the experimental verification requirements. The training set contains 15,351 frames, the validation set contains 1,705 frames, and the test set contains 551 frames; the test set is used for quantitative evaluation of the final performance.

[0036] Step b: Construct a parametric image degradation simulation model, establish a degradation process combining linear convolution and additive noise, and generate a motion-blurred image sequence with highly realistic simulation characteristics. The specific steps are as follows:

[0037] Step b1: To simulate the irregular camera trajectory during surgery, a parameter sampling module is used to randomly sample motion displacement, rotation angle, and noise level from a preset range. This process models the image as a composite process of linear convolution and additive noise.

[0038] $I_{\text{deg}} = I_{\text{clear}} * K + n$

[0039] where $I_{\text{deg}}$ is the generated degraded image; $I_{\text{clear}}$ is the original, high-fidelity, clear endoscopic frame acquired from the SCARED dataset; $K$ is the point spread function (PSF), which represents the camera's motion vector; $*$ denotes two-dimensional convolution; and $n$ is additive noise that simulates the thermal noise generated by the endoscope sensor.

[0040] This method first uses a parameter sampling module to simulate the irregular camera trajectory of real surgery, randomly sampling the motion displacement, rotation angle, and noise level from preset intervals. In the blur kernel generation stage, the system first constructs a unit impulse response at the center of the spatial matrix to characterize the initial displacement, then rotates it spatially with an affine transformation matrix to depict the direction of the motion vector of the endoscope lens relative to the organ tissue at the moment of exposure. So that the convolution conserves energy and preserves the original brightness of the tissue region, the rotated kernel elements are normalized.

[0041]

[0042] where the two-dimensional rotation matrix maps the linear motion displacement onto a spatial vector at the sampled angle, and the transformed coordinate space determines the shape of the point spread function's distribution within the matrix.
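
The parameter sampling described in Step b1 could be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source does not state the preset intervals or the sampling distribution, so the ranges, the uniform distribution, and the names `BlurParams` and `sample_blur_params` are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlurParams:
    displacement: float  # motion length in pixels
    angle_deg: float     # motion direction in degrees
    noise_sigma: float   # standard deviation of the additive noise


# Hypothetical preset ranges; the source does not give the actual intervals.
DISPLACEMENT_RANGE = (5.0, 25.0)
ANGLE_RANGE = (0.0, 180.0)
NOISE_SIGMA_RANGE = (0.0, 10.0)


def sample_blur_params(rng: random.Random) -> BlurParams:
    """Draw one set of degradation parameters (uniform sampling is assumed)."""
    return BlurParams(
        displacement=rng.uniform(*DISPLACEMENT_RANGE),
        angle_deg=rng.uniform(*ANGLE_RANGE),
        noise_sigma=rng.uniform(*NOISE_SIGMA_RANGE),
    )


# A seeded generator makes the simulated degradation reproducible.
params = sample_blur_params(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling deterministic per seed, which is convenient when the same blurred sequence has to be regenerated for validation.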

[0043] Step b2: Construct the point spread function. A unit impulse response is constructed at the center of the spatial domain matrix and spatially rotated using an affine transformation matrix to characterize the motion vector direction at the moment of exposure. A spatial domain convolution operator is applied to each color channel of the endoscopic image. Because edge information is important for feature matching in endoscopic imaging, the algorithm employs a mirror reflection boundary expansion strategy, which suppresses the edge artifacts induced by discrete convolution and preserves the semantic continuity of soft tissue texture after degradation.

[0044]

[0045] where the unit impulse function is defined at the center of the spatial matrix and represents the initial pixel position at the moment of exposure.
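
One way to picture the point spread function of Step b2 is as a line-shaped kernel: the impulse at the kernel center is swept along the motion direction and the result is normalized. The sketch below is only an approximation under that assumption. It rasterizes the rotated line directly instead of applying an affine warp, and the function name `motion_psf` and its parameters are hypothetical.

```python
import math

import numpy as np


def motion_psf(size: int, displacement: float, angle_deg: float) -> np.ndarray:
    """Line-shaped PSF: an impulse at the kernel center swept along one direction."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    kernel = np.zeros((size, size), dtype=np.float64)
    center = size // 2
    theta = math.radians(angle_deg)
    # Rotating the horizontal unit vector (1, 0) by theta gives the motion direction.
    dx, dy = math.cos(theta), math.sin(theta)
    # Sample positions along the motion path, centered on the impulse.
    steps = max(int(math.ceil(displacement)) * 2, 1)
    for t in np.linspace(-displacement / 2.0, displacement / 2.0, steps + 1):
        col = int(round(center + t * dx))
        row = int(round(center - t * dy))  # image rows grow downward
        if 0 <= row < size and 0 <= col < size:
            kernel[row, col] = 1.0
    # Normalize so the kernel sums to 1 and overall brightness is preserved.
    return kernel / kernel.sum()


psf_horizontal = motion_psf(9, 6.0, 0.0)   # energy lies on the middle row
psf_vertical = motion_psf(9, 6.0, 90.0)    # energy lies on the middle column
```

The final division is the normalization step mentioned in the text: without it, convolving with the kernel would brighten or darken the tissue region.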

[0046] Step b3: Perform convolution and noise addition. A mirror reflection boundary expansion strategy is used to suppress edge artifacts. Spatial domain convolution is applied to the image, and a normally distributed noise tensor is superimposed to generate a simulated degraded image. To simulate the thermal noise and readout noise generated by the endoscope sensor in a low-light cavity environment, the system synchronously generates an isotropic, normally distributed noise tensor, which is then additively superimposed onto the blurred image. Finally, the pixel values are dynamically clipped and remapped to the integer pixel space, generating degraded images with highly realistic characteristics.
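
The convolution and noise step can be illustrated with a small NumPy sketch. It assumes an H x W x C image stored as 8-bit integers and clips the result to [0, 255]; the source does not state the target integer range, so that range, like the function name `degrade`, is an assumption of this example.

```python
import numpy as np


def degrade(image: np.ndarray, psf: np.ndarray, noise_sigma: float,
            rng: np.random.Generator) -> np.ndarray:
    """Blur an H x W x C image with `psf`, add Gaussian noise, clip to 8 bits."""
    kh, kw = psf.shape
    ph, pw = kh // 2, kw // 2
    img = image.astype(np.float64)
    # Mirror-reflect the borders so the convolution does not darken the edges.
    padded = np.pad(img, ((ph, ph), (pw, pw), (0, 0)), mode="reflect")
    flipped = psf[::-1, ::-1]  # true convolution flips the kernel
    blurred = np.zeros_like(img)
    height, width = img.shape[:2]
    for r in range(kh):
        for c in range(kw):
            w = flipped[r, c]
            if w != 0.0:
                # Each kernel tap adds a shifted, weighted copy of every channel.
                blurred += w * padded[r:r + height, c:c + width, :]
    # Additive, normally distributed noise with the sampled standard deviation.
    noisy = blurred + rng.normal(0.0, noise_sigma, size=img.shape)
    # Assumed 8-bit output range; the source leaves the range unspecified.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
box_psf = np.full((3, 3), 1.0 / 9.0)
degraded = degrade(frame, box_psf, noise_sigma=5.0, rng=rng)
```

Looping over kernel taps rather than pixels keeps the example short and applies the same kernel to all color channels at once, matching the per-channel convolution described in Step b2.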

[0047] Step c: Deblur the degraded image using the improved MB-TaylorFormerV2 network, and accurately restore the key features of the image using the T-MSA++ operator. The specific steps are as follows:

[0048] Step c1: Employ a linear attention mechanism based on Taylor series expansion to reduce the computational complexity of attention from quadratic, $O(N^2)$, to linear, $O(N)$, in the number of pixels $N$.

[0049] Step c2: Image restoration is performed using the core operator T-MSA++. A preserving mapping function ensures the non-negativity of the attention map, guiding the model to focus on high-frequency details such as blood vessel texture and tissue edges. T-MSA++ achieves accurate recovery of key image features by combining first-order Taylor terms with focusing remainder terms. Its output vector is expressed as follows:

[0050]

[0051] where the preserving mapping function applies a non-linear scaling that ensures the non-negativity of the attention map and guides the model to focus on high-frequency details such as tissue edges; the output is the final feature vector at a given position; the summation index runs over all positions in the image, up to the total number of pixels in the input feature map; and the remaining terms are the input feature vector (Value) at the summed position, the projected query vector (Query) at the output position, the projected and transposed key vector (Key) at the summed position, and a scaling factor used to adjust the distribution of attention.
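
To show why a Taylor expansion makes attention linear, the sketch below implements only the first-order term, replacing exp(q·k) with 1 + q·k. It is not T-MSA++ itself: the focusing remainder term and the preserving mapping function are omitted, and L2-normalizing the queries and keys is an assumption used here to keep the weights non-negative. The two functions compute the same output; the second never builds the N x N attention map.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so that 1 + q.k stays in [0, 2]."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def taylor_attention_quadratic(q, k, v):
    """Reference version: builds the full N x N weight matrix, O(N^2)."""
    q, k = l2_normalize(q), l2_normalize(k)
    weights = 1.0 + q @ k.T  # first-order Taylor expansion of exp(q.k)
    return (weights @ v) / weights.sum(axis=1, keepdims=True)


def taylor_attention_linear(q, k, v):
    """Same result via associativity: K^T V is computed once, O(N)."""
    q, k = l2_normalize(q), l2_normalize(k)
    n = k.shape[0]
    kv = k.T @ v                                    # (d, d_v), size independent of N
    numer = v.sum(axis=0, keepdims=True) + q @ kv   # sum_j (1 + q_i.k_j) v_j
    denom = n + q @ k.sum(axis=0)                   # sum_j (1 + q_i.k_j)
    return numer / denom[:, None]


rng = np.random.default_rng(0)
q = rng.normal(size=(16, 4))
k = rng.normal(size=(16, 4))
v = rng.normal(size=(16, 3))
out_linear = taylor_attention_linear(q, k, v)
out_quadratic = taylor_attention_quadratic(q, k, v)
```

The saving comes from reordering the matrix product: `q @ (k.T @ v)` costs time proportional to N, whereas `(q @ k.T) @ v` costs time proportional to N squared.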

[0052] Step c3: Dataset evaluation and verification. Calculate the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) between the repaired image and the blurred image.
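
PSNR from Step c3 follows a standard definition and can be sketched in a few lines. The function below is symmetric in its two arguments, so which pair of images is compared is left to the caller; the 8-bit peak value of 255 is an assumption, and SSIM is omitted because it requires windowed local statistics.

```python
import math

import numpy as np


def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the images are closer."""
    # Cast first so that unsigned 8-bit subtraction cannot wrap around.
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


a = np.zeros((4, 4), dtype=np.uint8)
b = np.ones((4, 4), dtype=np.uint8)
value = psnr(a, b)  # MSE is 1, so this equals 20 * log10(255)
```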

[0053] Step d: Using the training set as input, the pre-trained base model is transferred to the surgical cavity domain using dynamic vector low-rank adaptation. A self-supervised training strategy is employed. The specific steps are as follows:

[0054] Step d1: Input the repaired frame into the DepthNet module of the EndoDAC framework, which has a pre-trained visual base model DepthAnything built in.

[0055] Step d2: Apply the dynamic vector low-rank adaptation technique to the weight matrix. Fine-tuning is performed using trainable low-rank matrices and dynamic diagonal vectors, which enables fast domain transfer to the lighting and texture environment of the cavity.

[0056] In practice, the DepthNet module receives the clear frames recovered by MB-TaylorFormerV2 as input, and the pre-trained weights are finely adjusted using its built-in dynamic vector low-rank adaptation technique. This adaptation mechanism maintains the general feature extraction capability of the base model while achieving fast domain transfer to cavity-specific lighting and texture environments with very few additional parameters. The dynamic adjustment logic of its core weights is defined by the following formula:

[0057] $W = W_0 + \Lambda_v B \Lambda_u A$

[0058] where $W_0$ is the frozen weight matrix of the base model; $A$ and $B$ are trainable low-rank matrices; and $\Lambda_u$ and $\Lambda_v$ are diagonal matrices built from trainable vectors, used to dynamically adjust the adaptation weights during training.
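
A small numerical sketch may help make the adaptation concrete. It assumes the update has the form W = W0 + diag(v) B diag(u) A, which is one reading of the definitions above (a frozen base matrix, two trainable low-rank matrices, two diagonal vectors); the variable names, shapes, and the zero initialization of `b` are assumptions borrowed from common low-rank adaptation practice, not details given in the source.

```python
import numpy as np


def dv_lora_weight(w0, a, b, u, v):
    """Effective weight W0 + diag(v) @ B @ diag(u) @ A, with W0 left untouched."""
    # Multiplying by a diagonal matrix is a row-wise scaling, so no d x d matrix is built.
    return w0 + (v[:, None] * b) @ (u[:, None] * a)


d, k, r = 64, 64, 4                      # output dim, input dim, low rank
rng = np.random.default_rng(0)
w0 = rng.normal(size=(d, k))             # frozen pre-trained weight
a = rng.normal(size=(r, k))              # trainable low-rank factor
b = np.zeros((d, r))                     # trainable; zero init keeps W == W0 at the start
u = np.ones(r)                           # trainable diagonal vector (rank side)
v = np.ones(d)                           # trainable diagonal vector (output side)

w_start = dv_lora_weight(w0, a, b, u, v)
trainable = a.size + b.size + u.size + v.size   # 580 parameters versus 4096 in W0
```

With these illustrative shapes, the adapter trains 580 values while the 4,096 entries of the base weight stay frozen, which is the sense in which the text speaks of very few additional parameters.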

[0059] Step e: Combining photometric consistency loss, an effective estimate of the lumen depth can be obtained through model prediction. The specific steps are as follows:

[0060] Step e1: Employing a self-supervised strategy, image depth, camera pose, and intrinsic parameters are jointly estimated, allowing geometric constraints to be learned from monocular video without prior knowledge of camera information. The total loss function $L_{\text{total}}$ combines the scale-invariant logarithmic loss $L_{\text{si}}$, the gradient loss $L_{\text{grad}}$, and the neighborhood smoothing constraint term $L_{\text{smooth}}$:

[0061] $L_{\text{total}} = \lambda_1 L_{\text{si}} + \lambda_2 L_{\text{grad}} + \lambda_3 L_{\text{smooth}}$

[0062] where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight hyperparameters of the respective loss terms. The neighborhood smoothing term is defined as:

[0063]

[0064] where the weight hyperparameters control the proportion of the smoothing term's contribution to the total loss function; the summation is performed only over pairs of pixels within a local neighborhood of the image; the depth terms are the depth values predicted by the network at the two neighboring pixels; and the image terms are the color or intensity values of the original image at those pixels.
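
The exact smoothing formula is not recoverable from the text, but the quantities it names (predicted depth and image intensity at neighboring pixels) match a commonly used edge-aware smoothness penalty. The sketch below shows that common form purely as an illustration: the exponential weighting, the 4-neighborhood, and the function name `edge_aware_smoothness` are assumptions rather than the patented definition.

```python
import numpy as np


def edge_aware_smoothness(depth: np.ndarray, image: np.ndarray, weight: float = 1.0) -> float:
    """Mean over horizontal and vertical neighbor pairs of |dD| * exp(-|dI|)."""
    d = depth.astype(np.float64)
    i = image.astype(np.float64)
    # Depth differences between each pixel and its right / lower neighbor.
    dd_x = np.abs(d[:, 1:] - d[:, :-1])
    dd_y = np.abs(d[1:, :] - d[:-1, :])
    # Intensity differences over the same pixel pairs.
    di_x = np.abs(i[:, 1:] - i[:, :-1])
    di_y = np.abs(i[1:, :] - i[:-1, :])
    # Depth jumps are penalized less where the image itself has an edge.
    total = (dd_x * np.exp(-di_x)).sum() + (dd_y * np.exp(-di_y)).sum()
    return weight * float(total) / (dd_x.size + dd_y.size)


depth_step = np.zeros((4, 4))
depth_step[:, 2:] = 1.0                  # depth discontinuity between columns 1 and 2
flat_image = np.zeros((4, 4))            # no image edge at the discontinuity
edge_image = depth_step.copy()           # image edge aligned with the depth edge

loss_flat = edge_aware_smoothness(depth_step, flat_image)
loss_edge = edge_aware_smoothness(depth_step, edge_image)
```

In this toy case the depth step costs less when an image edge lines up with it (`loss_edge < loss_flat`), which is the behavior a neighborhood smoothing constraint of this kind is meant to encourage.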

[0065] Step e2: After multi-scale feature fusion and decoding, the Final Head outputs the predicted depth map.

[0066] The depth estimation accuracy of this invention on the SCARED intracavitary dataset is shown in Table 1. The evaluation metrics are the absolute relative error (Abs_Rel), squared relative error (Sq_Rel), root mean square error (RMSE), logarithmic root mean square error (RMSE_Log), and the threshold accuracy between the predicted depth and the true depth (DELTA < 2.5).
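
For reference, the listed metrics are usually computed as in the sketch below, which follows the definitions common in monocular depth estimation work. The evaluation protocol actually used (depth capping, scale alignment, masking of invalid pixels) is not stated in the source, and reading DELTA as the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below the threshold is an assumption.

```python
import numpy as np


def depth_metrics(pred, gt, threshold: float) -> dict:
    """Standard depth errors; `pred` and `gt` must contain positive depths."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    diff = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta": float(np.mean(ratio < threshold)),  # fraction of pixels within the threshold
    }


gt_depth = np.array([1.0, 2.0, 4.0])
perfect = depth_metrics(gt_depth, gt_depth, threshold=2.5)
doubled = depth_metrics(2.0 * gt_depth, gt_depth, threshold=2.5)
```

The first four metrics are errors (lower is better), while the threshold accuracy is a fraction between 0 and 1 (higher is better), which is the direction in which Table 1 should be read.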

[0067] Experimental results show that this invention solves the problem of motion degradation in endoscopic images by integrating efficient image inpainting techniques with a self-supervised depth estimation model, fusing the improved MB-TaylorFormerV2 image inpainting into the EndoDAC depth estimation network. MB-TaylorFormerV2 achieves high-quality image inpainting thanks to its refined Taylor series approximation capability, providing a stable input benchmark for the geometric feature extraction of the backend EndoDAC. Meanwhile, the dynamic vector low-rank adaptation strategy ensures rapid domain transfer of depth estimation to the medical field with extremely low training costs. This invention can significantly correct geometric distortions in depth prediction, providing consistently stable image input for downstream reconstruction tasks. Compared to the EndoDAC depth estimation method, this invention outperforms it in all metrics.

[0068] Table 1. Accuracy of depth estimation in the SCARED lumen dataset

[0069]
Method         Abs_Rel  Sq_Rel  RMSE   RMSE_Log  DELTA < 2.5
EndoDAC        0.060    0.482   5.312  0.085     0.969
The invention  0.057    0.450   5.071  0.081     0.976

Claims

1. A method for estimating lumen depth based on linear Taylor Transformer repair and self-supervised adaptation, the steps of which are as follows:
Step a: Construct a depth dataset of the cavity environment covering the training set, validation set, and test set, and perform preprocessing. The specific steps are as follows:
Step a1: The present invention uses the Stereo Correspondence and Reconstruction of Endoscopic Data (SCARED) dataset as the basis for algorithm verification. This dataset was collected by the da Vinci Xi surgical robot from porcine abdominal dissection scenes and contains 35 video segments, totaling 22,950 frames of images.
Step a2: Perform normalization preprocessing on the dataset. Each video frame is paired with a high-precision ground-truth depth map acquired by a structured-light projector.
Step a3: Strictly implement the data partitioning scheme according to the experimental verification requirements. The training set contains 15,351 frames, the validation set contains 1,705 frames, and the test set contains 551 frames; the test set is used for quantitative evaluation of the final performance.
Step b: Construct a parametric image degradation simulation model, establish a degradation process combining linear convolution and additive noise, and generate a motion-blurred image sequence with highly realistic simulation characteristics. The specific steps are as follows:
Step b1: To simulate the irregular camera trajectory during surgery, the parameter sampling module is used to randomly sample motion displacement, rotation angle, and noise level from a preset range.
Step b2: Construct the point spread function. A unit impulse response is constructed at the center of the spatial matrix, and then spatially rotated using an affine transformation matrix to characterize the direction of the motion vector at the moment of exposure.
Step b3: Perform convolution and noise addition. A mirror reflection boundary expansion strategy is used to suppress edge artifacts. Spatial domain convolution is applied to the image, and a normally distributed noise tensor is superimposed to generate a simulated degraded image.
Step c: Deblur the degraded image using the improved MB-TaylorFormerV2 network, and accurately restore the key features of the image using the T-MSA++ operator. The specific steps are as follows:
Step c1: Employ a linear attention mechanism based on Taylor series expansion to reduce the computational complexity of attention from quadratic, $O(N^2)$, to linear, $O(N)$, in the number of pixels $N$.
Step c2: Image restoration is performed using the core operator T-MSA++. A preserving mapping function ensures the non-negativity of the attention map, guiding the model to focus on high-frequency details such as blood vessel texture and tissue edges.
Step c3: Dataset evaluation and verification. Calculate the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) between the repaired image and the blurred image.
Step d: Using the training set as input, the pre-trained base model is transferred to the surgical cavity domain using dynamic vector low-rank adaptation technology, and a self-supervised training strategy is employed. The specific steps are as follows:
Step d1: Input the repaired frame into the DepthNet module of the EndoDAC framework, which has a pre-trained visual base model DepthAnything built in.
Step d2: Apply the dynamic vector low-rank adaptation technique to the weight matrix. Fine-tuning is performed using trainable low-rank matrices and dynamic diagonal vectors, which enables fast domain transfer to the lighting and texture environment of the cavity.
Step e: Combining photometric consistency loss, an effective estimate of the lumen depth can be obtained through model prediction. The specific steps are as follows:
Step e1: Employ a self-supervised strategy to jointly estimate image depth, camera pose, and intrinsic parameters, enabling the learning of geometric constraints from monocular videos without requiring prior knowledge of camera information.
Step e2: After multi-scale feature fusion and decoding, the Final Head outputs the predicted depth map.