A self-supervised monocular depth estimation method and system based on three-dimensional Gaussian splatting

By synthesizing multi-view images and depth maps with 3D Gaussian splatting, the method generates an occlusion mask that is used to optimize the depth estimation network, addressing the occlusion problem in self-supervised monocular depth estimation and yielding more accurate and stable depth prediction.

CN122244300A · Pending · Publication Date: 2026-06-19 · XIAMEN GREAT POWER GEO INFORMATION TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: XIAMEN GREAT POWER GEO INFORMATION TECH
Filing Date: 2026-02-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing self-supervised monocular depth estimation methods suffer from large depth prediction errors and erroneous gradient propagation when dealing with local occlusion areas caused by dynamic object motion and camera viewpoint changes in video sequences, which limits the improvement of model performance.

Method used

3D Gaussian splatting is used to synthesize multi-view images and depth maps. An occlusion mask is generated from them to optimize the depth estimation network, and the networks are trained in stages to improve robustness and accuracy.

Benefits of technology

Accurately identifying occluded areas and generating high-quality depth prediction results improves the model's prediction accuracy and stability in both occluded and unoccluded areas.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122244300A_ABST]

Abstract

This application relates to a self-supervised monocular depth estimation method and system based on 3D Gaussian splatting, belonging to the field of computer vision technology. The method includes: acquiring a target image and adjacent frame images; processing the target image and adjacent frame images through a first depth estimation network and a first pose estimation network to obtain an initial depth map and a relative camera pose; based on the initial depth map and the relative camera pose, using 3D Gaussian splatting to synthesize a multi-view image and a multi-view depth map corresponding to the target image; processing the target image and adjacent frame images through a second depth estimation network and a second pose estimation network to obtain a student depth map and a student relative camera pose; generating an occlusion mask for identifying occluded regions in the target image based on the multi-view depth map and the student depth map; and constructing a loss function during training to optimize the second depth estimation network and the second pose estimation network to obtain the final depth estimation model.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and more specifically, to a self-supervised monocular depth estimation method and system based on 3D Gaussian splatting.

Background Technology

[0002] Self-supervised monocular depth estimation is an important research direction in computer vision, aiming to predict scene depth information from a single image without ground truth depth supervision. Based on different training data, existing methods are mainly divided into two categories: those based on binocular stereo image pairs and those based on monocular video sequences. Self-supervised methods based on monocular video sequences typically use the target image and its temporally adjacent frames as input. A depth estimation network predicts the depth map of the target image, while a pose estimation network predicts the relative camera pose between the target frame and its adjacent frames. Using the predicted depth and pose, the adjacent frame images are reprojected onto the viewpoint of the target image through a differentiable geometric transformation to obtain the reconstructed image. The training objective of the model is to minimize the photometric reconstruction loss between the reconstructed image and the original target image. Its core assumption is that the scene has appearance consistency and geometric continuity between consecutive frames.

[0003] However, such methods face a fundamental challenge: dynamic object motion and camera viewpoint changes in video sequences lead to local occlusion regions in the scene. In these occluded regions, pixels cannot establish effective correspondences between adjacent frames, rendering the photometric consistency-based reconstruction loss ineffective. This failure not only results in significant errors in depth prediction within occluded regions, but the resulting erroneous gradients also propagate during training, misleading the learning of the depth and pose estimation networks and limiting further improvement of overall model performance. Therefore, effectively handling the local occlusion problem is crucial for improving the accuracy and robustness of self-supervised monocular depth estimation models.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention proposes a self-supervised monocular depth estimation method and system based on three-dimensional Gaussian splatting.

[0005] The technical solution of this invention is as follows: This invention proposes a self-supervised monocular depth estimation method based on 3D Gaussian splatting, comprising the following steps: acquiring the target image and its adjacent frame images in the video sequence; processing the target image and adjacent frame images with the first depth estimation network and the first pose estimation network to obtain the initial depth map and the relative camera pose; based on the initial depth map and the relative camera pose, synthesizing a multi-view image and a multi-view depth map corresponding to the target image using 3D Gaussian splatting; processing the target image and adjacent frame images with the second depth estimation network and the second pose estimation network to obtain the student depth map and the student relative camera pose; based on the multi-view depth map and the student depth map, generating an occlusion mask that identifies occluded areas in the target image; and, during training, constructing a loss function based on at least the multi-view images, the multi-view depth maps, the student depth map, and the occlusion mask, and optimizing the second depth estimation network and the second pose estimation network to obtain the final depth estimation model.

[0006] Preferably, the synthesis of multi-view images and multi-view depth maps corresponding to the target image using 3D Gaussian splatting is achieved through a 3D Gaussian splatting module, which includes: a Gaussian prediction network, used to extract features from the target image and predict a Gaussian tensor containing the color, opacity, scale and rotation parameters, and position offset of each 3D Gaussian; a position generation submodule, used to calculate and optimize the position parameters of each 3D Gaussian in 3D space based on the initial depth map and the position offset in the Gaussian tensor; and a rendering submodule, used to synthesize multi-view images and multi-view depth maps from the optimized 3D Gaussian parameters using the 3D Gaussian splatting rendering algorithm.

[0007] Preferably, the position generation submodule optimizes the 3D Gaussian position parameters as follows: for a pixel position p in the target image, its initial 3D coordinates are computed by back projection from its depth value $D(p)$ and the camera intrinsic parameter matrix $K$, and the position offset $\Delta(p)$ predicted by the Gaussian prediction network is then added to obtain the optimized 3D position $\mu(p)$. The calculation formula is: $\mu(p) = D(p)\,K^{-1}\,\tilde{p} + \Delta(p)$; in the formula, $\tilde{p}$ is the homogeneous coordinate of pixel p.
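The back projection and offset step can be illustrated with a minimal sketch. It assumes a pinhole intrinsic matrix K with focal lengths fx, fy and principal point (cx, cy), for which multiplying the homogeneous pixel by the inverse of K has a closed form; the function name `gaussian_center` and the sample values are hypothetical and not taken from the application.

```python
def gaussian_center(u, v, depth, fx, fy, cx, cy, offset=(0.0, 0.0, 0.0)):
    """Back-project pixel (u, v) at the given depth and add a predicted offset.

    Assumes K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], so that
    K^-1 applied to (u, v, 1) equals ((u - cx) / fx, (v - cy) / fy, 1).
    """
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    point = tuple(depth * r for r in ray)  # initial 3D coordinates
    return tuple(p + o for p, o in zip(point, offset))  # optimized position


# Hypothetical pixel, depth, intrinsics and offset, for illustration only.
center = gaussian_center(420.0, 240.0, 5.0, 500.0, 500.0, 320.0, 240.0,
                         offset=(0.0, 0.1, -0.2))
print(center)
```

A pixel at the principal point back-projects onto the optical axis, so with a zero offset its Gaussian center is (0, 0, depth).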

[0008] Preferably, the rendering submodule uses alpha-blending rendering, where the color and depth values of pixel p in a multi-view image are calculated as: $\hat{C}(p) = \sum_{i \in N} c_i\,\alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$; $\hat{D}(p) = \sum_{i \in N} d_i\,\alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$; in the formula: $\hat{C}(p)$ is the color value of pixel p in the multi-view image; $\hat{D}(p)$ is the depth value of pixel p in the multi-view depth map; $N$ is the sorted set of 3D Gaussians that contribute to the color or depth of pixel p; $c_i$ and $d_i$ are the color and depth values of the i-th 3D Gaussian, respectively; $\alpha_i$ is the opacity of the i-th 3D Gaussian after mapping by the 2D covariance matrix; and $\alpha_j$ is the opacity of each Gaussian preceding the i-th Gaussian.
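The alpha-blending accumulation for a single pixel can be sketched as follows. This is an illustration under stated assumptions rather than the renderer of the application: the Gaussians are assumed to be already sorted from nearest to farthest, each is given as a (color, depth, alpha) tuple, and the function name `composite` is hypothetical.

```python
def composite(gaussians):
    """Front-to-back alpha blending of color and depth for one pixel.

    `gaussians` is a list of (color, depth, alpha) tuples overlapping the
    pixel, sorted from nearest to farthest. `color` is an (r, g, b) tuple
    and `alpha` is the opacity after the 2D projection.
    """
    color = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # product of (1 - alpha_j) over Gaussians in front
    for c_i, d_i, alpha_i in gaussians:
        weight = alpha_i * transmittance
        color = [acc + weight * ch for acc, ch in zip(color, c_i)]
        depth += weight * d_i
        transmittance *= 1.0 - alpha_i
    return tuple(color), depth


# Two half-transparent Gaussians: a red one in front of a blue one.
pixel_color, pixel_depth = composite([
    ((1.0, 0.0, 0.0), 2.0, 0.5),
    ((0.0, 0.0, 1.0), 4.0, 0.5),
])
print(pixel_color, pixel_depth)  # (0.5, 0.0, 0.25) 2.0
```

The same weights are used for color and depth, which is why the depth map can be rendered in the same pass as the image.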

[0009] Preferably, the step of generating an occlusion mask for identifying occluded areas in the target image includes: using the student relative camera pose, transforming the multi-view depth map to the viewpoint corresponding to the student depth map to obtain the reprojected depth map; calculating the geometric consistency between the reprojected depth map and the student depth map to generate a visibility mask; computing, by image reprojection, the reconstructed images based on the multi-view depth map and on the student depth map respectively, and comparing their photometric errors with respect to the target image to generate a superiority mask; and combining the visibility mask and the superiority mask to obtain the final occlusion mask.

[0010] Preferably, the visibility mask $M_{vis}$ is calculated from the multi-view depth map, the reprojected depth map, and the consistency threshold [formula not reproduced in the source text]. The formula for calculating the superiority mask is as follows: $M_{sup} = [\,pe(I_t, \hat{I}_{mv}) < pe(I_t, \hat{I}_{s})\,]$; in the formula: $M_{sup}$ is the superiority mask; $pe$ is the photometric error function; $I_t$ is the target image; $\hat{I}_{mv}$ is the reconstructed image based on the multi-view depth map; $\hat{I}_{s}$ is the reconstructed image based on the student depth map. The formula for calculating the occlusion mask is: $M_{occ} = M_{vis} \odot M_{sup}$; in the formula: $M_{occ}$ is the occlusion mask; $\odot$ is the element-wise product.

[0011] Preferably, the training process includes three stages: In the first stage, the first depth estimation network and the first pose estimation network are jointly trained, and the loss functions used include a photometric reconstruction loss based on the adjacent frame images and a depth smoothing loss. In the second stage, the trained first depth estimation network and first pose estimation network are frozen, and the 3D Gaussian splatting module is trained separately; the loss functions used include a photometric loss on the synthesized multi-view images and a smoothing loss on the synthesized multi-view depth maps. In the third stage, the trained first depth estimation network, first pose estimation network, and 3D Gaussian splatting module are frozen, and the second depth estimation network and second pose estimation network are trained. The loss function in this stage includes a pseudo-supervised loss, as well as a student photometric reconstruction loss and a student depth smoothing loss weighted by the occlusion mask. The pseudo-supervised loss is calculated from the difference between the multi-view depth map and the student depth map.
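The three-stage schedule can be written down as a small table of which modules are trained and which are frozen in each stage. The module names below are hypothetical labels for the roles described above, and `check_schedule` is an illustrative helper, not part of the application.

```python
# Hypothetical module names; the text only describes their roles.
TRAINING_STAGES = {
    1: {"train": ["depth_net_1", "pose_net_1"], "frozen": []},
    2: {"train": ["gaussian_splatting_module"],
        "frozen": ["depth_net_1", "pose_net_1"]},
    3: {"train": ["depth_net_2", "pose_net_2"],
        "frozen": ["depth_net_1", "pose_net_1", "gaussian_splatting_module"]},
}


def is_trainable(stage, module):
    """Return True if `module` receives gradient updates in `stage`."""
    return module in TRAINING_STAGES[stage]["train"]


def check_schedule(stages):
    """Verify that every frozen module was trained in an earlier stage."""
    trained = set()
    for stage in sorted(stages):
        if not set(stages[stage]["frozen"]) <= trained:
            raise ValueError(f"stage {stage} freezes an untrained module")
        trained |= set(stages[stage]["train"])
    return trained


print(sorted(check_schedule(TRAINING_STAGES)))
```

Keeping the schedule as data makes it easy to confirm that no stage freezes a module that has not been trained yet.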

[0012] In another aspect, the present invention also provides a self-supervised monocular depth estimation system based on three-dimensional Gaussian splatting, comprising: a data acquisition module, configured to acquire the target image and its adjacent frame images in the video sequence; a first processing module, including a first depth estimation network and a first pose estimation network, configured to process the target image and adjacent frame images to obtain an initial depth map and a relative camera pose; a 3D Gaussian splatting module, configured to synthesize multi-view images and multi-view depth maps corresponding to the target image based on the initial depth map and the relative camera pose; a second processing module, including a second depth estimation network and a second pose estimation network, configured to process the target image and adjacent frame images to obtain the student depth map and the student relative camera pose; a mask generation module, configured to generate an occlusion mask for identifying occluded areas in the target image based on the multi-view depth maps and the student depth map; and a model training module, configured to construct a loss function based on at least the multi-view images, the multi-view depth maps, the student depth map, and the occlusion mask during training, and to optimize the second processing module to obtain the final depth estimation model.

[0013] In another aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement a self-supervised monocular depth estimation method based on three-dimensional Gaussian splatting as described in any embodiment of the present invention.

[0014] In another aspect, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a self-supervised monocular depth estimation method based on three-dimensional Gaussian splatting as described in any embodiment of the present invention.

[0015] The present invention has the following beneficial effects: 1. This invention identifies occluded regions in target images by comparing the geometric consistency and reconstruction superiority between the multi-view depth maps synthesized by 3D Gaussian splatting and the student depth maps predicted by a lightweight student model. The generated occlusion mask weights the loss function during training, particularly in the third stage, mitigating the adverse effects of occlusion on model optimization. This results in a more accurate and reliable depth prediction model for both occluded and unoccluded regions.

[0016] 2. This invention integrates 3D Gaussian splatting into a self-supervised depth estimation framework, designing a self-supervised monocular 3D Gaussian splatting module. This module uses the initially predicted depth and pose to construct an explicit, differentiable 3D scene representation and to synthesize high-quality images and depth maps from different viewpoints. The synthesized multi-view data provides pseudo-supervision signals for training the lightweight student model, compensating for the shortcomings of relying solely on raw video sequences for self-supervision and guiding the student model to learn more robust scene geometry.

[0017] 3. This invention designs a three-stage progressive training process. By training the first networks, the 3D Gaussian splatting module, and the lightweight student model stage by stage and module by module, this process avoids the training instability that may arise from optimizing all parameters simultaneously. This design allows model performance to improve steadily from stage to stage, ultimately yielding a high-performance depth estimation model.

Attached Figure Description

[0018] Figure 1 is a flowchart of the method of the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] It should be understood that the step numbers used in the text are for ease of description only and are not intended to limit the order in which the steps are performed.

[0021] It should be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0022] The terms “comprising” and “including” indicate the presence of the described feature, whole, step, operation, element and / or component, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and / or collections thereof.

[0023] The term “and / or” refers to any combination of one or more of the associated listed items, as well as all possible combinations, and includes these combinations.

[0024] Example 1: To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below in conjunction with specific embodiments of the present application and with reference to the accompanying drawings.

[0025] To address the problems of existing technologies, this invention provides a self-supervised monocular depth estimation method based on three-dimensional Gaussian splatting, comprising the following steps: Acquire the target image and its adjacent frame images in the video sequence. In this embodiment, a video sequence is acquired using a monocular camera. A single frame is extracted from the video sequence as the target image, and the frames immediately before and after it are taken as the adjacent frame images. The images are preprocessed, for example normalized to the [0, 1] range or standardized.
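The frame selection and normalization described in this step can be sketched as follows, assuming 8-bit grayscale frames stored as nested lists; the helper names `normalize` and `frame_triplets` are hypothetical.

```python
def normalize(image):
    """Scale 8-bit pixel values to the [0, 1] range."""
    return [[value / 255.0 for value in row] for row in image]


def frame_triplets(frames):
    """Yield (previous, target, next) for each frame with both neighbours."""
    for t in range(1, len(frames) - 1):
        yield frames[t - 1], frames[t], frames[t + 1]


# Four tiny 1x2 frames stand in for a video sequence.
video = [normalize([[0, 255]]), normalize([[51, 204]]),
         normalize([[102, 153]]), normalize([[153, 102]])]
for previous, target, following in frame_triplets(video):
    print(previous, target, following)
```

The first and last frames of a sequence have only one neighbour, so in this sketch they serve as adjacent frames but never as targets.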

[0026] The target image and adjacent frame images are processed by the first depth estimation network and the first pose estimation network to obtain the initial depth map and the relative camera pose. In this embodiment, the first depth estimation network can adopt a U-Net structure; its input is the target image and its output is the initial depth map. The first pose estimation network can be a convolutional neural network; its input is the target image stacked with the previous frame (or with the next frame), and its output is the relative camera pose between the two, giving one relative pose for each adjacent frame.

[0027] Based on the initial depth map and the relative camera pose, a multi-view image and a multi-view depth map corresponding to the target image are synthesized using 3D Gaussian splatting. In this embodiment, the multi-view images and multi-view depth maps are synthesized by a 3D Gaussian splatting module. Specifically, the initial depth map and the relative camera pose are input to the 3D Gaussian splatting module, which predicts a 3D Gaussian representation of the scene; rendering this representation by 3D Gaussian splatting yields the multi-view images and multi-view depth maps corresponding to the input target image.

[0028] In this embodiment, the 3D Gaussian splatting module includes: Gaussian prediction network: a lightweight CNN such as ResNet-18 is used as the Gaussian prediction network. From the extracted features it predicts the output Gaussian tensor, in which each spatial location corresponds to one 3D Gaussian and the tensor channels contain that Gaussian's color, opacity, scale, rotation, and center position offset.

[0029] Position generation submodule: based on the initial depth map and the Gaussian tensor, for a pixel position p in the target image, its initial 3D coordinates are computed by back projection from its depth value and the camera intrinsic parameter matrix K, and the position offset predicted by the Gaussian prediction network is then added to obtain the optimized 3D Gaussian position parameters. The position generation submodule ultimately outputs a 3D Gaussian representation of the scene.

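[0030] The 3D Gaussian position parameter $\mu(p)$ is calculated as: $\mu(p) = D(p)\,K^{-1}\,\tilde{p} + \Delta(p)$; in the formula: $D(p)$ is the depth value of pixel p in the initial depth map; $K$ is the camera intrinsic parameter matrix; $\Delta(p)$ is the position offset predicted by the Gaussian prediction network; $\tilde{p}$ is the homogeneous coordinate of pixel p. This step makes direct use of the depth information, providing an accurate initial spatial distribution for the 3D Gaussians.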

[0031] Rendering submodule: using a visibility-aware rendering algorithm based on 3D Gaussian splatting, each 3D Gaussian is projected onto the two-dimensional image plane of the target image to obtain splatted Gaussians. The splatted Gaussians in each image patch are sorted by depth and then combined with the alpha-blending rendering algorithm to obtain the synthesized image. The color of each pixel in the synthesized image is: $\hat{C}(p) = \sum_{i \in N} c_i\,\alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$; in the formula: $\hat{C}(p)$ is the color value of pixel p in the synthesized image; $N$ is the sorted set of 3D Gaussians that contribute to the color or depth of pixel p; $c_i$ is the color value of the i-th 3D Gaussian; $\alpha_i$ is the opacity of the i-th 3D Gaussian after mapping by the 2D covariance matrix; $\alpha_j$ is the opacity of each Gaussian prior to the i-th Gaussian. The depth map can be synthesized in the same way: $\hat{D}(p) = \sum_{i \in N} d_i\,\alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$; in the formula: $\hat{D}(p)$ is the depth value of pixel p in the synthesized depth map; $d_i$ is the depth value of the i-th 3D Gaussian. In addition, the relative camera poses output by the first pose estimation network are input to the rendering submodule, so that the 3D Gaussians are projected onto the two-dimensional image planes of the previous and next frame images to obtain splatted Gaussians, and the alpha-blending rendering algorithm then produces the synthesized multi-view images and the synthesized multi-view depth maps.

[0032] The target image and adjacent frame images are processed by the second depth estimation network and the second pose estimation network to obtain the student depth map and the student relative camera pose. In this embodiment, the second depth estimation network and the second pose estimation network can have the same structure as the first networks but use more lightweight parameter settings, such as fewer channels, to reduce the computational burden. They process the data in the same way, outputting the student depth map and the student relative camera poses with respect to the adjacent frames.

[0033] Based on the multi-view depth maps and the student depth map, an occlusion mask is generated to identify occluded areas in the target image. In this embodiment, the step of generating the occlusion mask includes: using the student relative camera pose, the multi-view depth map is transformed to the viewpoint corresponding to the student depth map to obtain the reprojected depth map [formula not reproduced in the source text]; that formula is expressed in terms of the homogeneous coordinates of corresponding pixels in the target image and in the source image. The geometric consistency between the reprojected depth map and the student depth map is then calculated to generate a visibility mask [formula not reproduced in the source text]; its terms are the visibility mask, the multi-view depth map, the reprojected depth map, and the consistency threshold, which is set to 2 in this embodiment. The brackets in the formula are Iverson brackets: the value is 1 if the condition inside the brackets is met and 0 otherwise. A region with a value of 1 is geometrically consistent and is likely not occluded.

[0034] Based on the multi-view depth map and on the student depth map respectively, the source image is reprojected onto the target viewpoint to obtain two reconstructed images, and their photometric errors with respect to the target image are compared to generate a superiority mask. The formula for calculating the superiority mask is: $M_{sup} = [\,pe(I_t, \hat{I}_{mv}) < pe(I_t, \hat{I}_{s})\,]$; in the formula: $M_{sup}$ is the superiority mask; $pe$ is the photometric error function; $I_t$ is the target image; $\hat{I}_{mv}$ is the reconstructed image based on the multi-view depth map; $\hat{I}_{s}$ is the reconstructed image based on the student depth map. A value of 1 indicates that the reconstruction based on the synthesized depth map is the better of the two.

[0035] By combining the visibility mask and the superiority mask, the final occlusion mask is obtained. The calculation formula for the occlusion mask is as follows: $M_{occ} = M_{vis} \odot M_{sup}$; in the formula: $M_{occ}$ is the occlusion mask; $M_{vis}$ is the visibility mask; $M_{sup}$ is the superiority mask; $\odot$ is the element-wise product. Only locations where both masks are 1 are assigned a value of 1. These locations are geometrically consistent regions where the synthesized depth map gives the better reconstruction, and they are treated as unoccluded regions that should be relied upon during training.
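The three masks can be sketched on small nested lists. The source text does not reproduce the visibility formula or name the photometric error function, so two choices below are assumptions made only for illustration: geometric consistency is tested as an absolute depth difference below the threshold, and the photometric error is a per-pixel absolute intensity difference. All function names are hypothetical.

```python
def visibility_mask(depth_reprojected, depth_student, threshold):
    """1 where the two depths agree within `threshold` (assumed test)."""
    return [
        [1 if abs(a - b) < threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(depth_reprojected, depth_student)
    ]


def superiority_mask(target, recon_multiview, recon_student):
    """1 where the multi-view-based reconstruction has the lower error."""
    return [
        [1 if abs(t - m) < abs(t - s) else 0 for t, m, s in zip(rt, rm, rs)]
        for rt, rm, rs in zip(target, recon_multiview, recon_student)
    ]


def occlusion_mask(m_vis, m_sup):
    """Element-wise product: 1 only where both input masks are 1."""
    return [[v * s for v, s in zip(rv, rs)] for rv, rs in zip(m_vis, m_sup)]


# Hypothetical 1x3 depth maps and grayscale images.
m_vis = visibility_mask([[2.0, 5.0, 9.0]], [[2.5, 5.5, 3.0]], threshold=2.0)
m_sup = superiority_mask([[0.5, 0.5, 0.5]], [[0.5, 0.9, 0.5]], [[0.7, 0.6, 0.9]])
print(m_vis, m_sup, occlusion_mask(m_vis, m_sup))
# [[1, 1, 0]] [[1, 0, 1]] [[1, 0, 0]]
```

In this example only the first pixel passes both tests, so it is the only one kept as reliable for the mask-weighted losses.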

[0036] During training, a loss function is constructed based on at least multi-view images, multi-view depth maps, student depth maps, and occlusion masks. The second depth estimation network and the second pose estimation network are then optimized to obtain the final depth estimation model.

[0037] In this embodiment, the training process includes three stages: In the first stage, the first depth estimation network and the first pose estimation network are jointly trained. The loss functions used include a photometric reconstruction loss based on the adjacent frame images and a depth smoothing loss. The specific loss function for the first stage and its two component terms are given by formulas [not reproduced in the source text]; the terms defined there are: the loss function of the first stage; an adaptive mask used for filtering static frames; hyperparameters used to adjust the weights of the loss terms; the minimum photometric reconstruction loss between the reconstructed images and the target image; the reprojected image obtained by warping a source frame to the target viewpoint using the pose predicted by the first pose estimation network; the depth smoothing loss; and the mean-normalized inverse depth of the initial depth map.
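The depth smoothing term operates on the mean-normalized inverse depth. The sketch below shows one common form along a single image row, assuming an edge-aware weighting in which the inverse-depth gradient is down-weighted by the exponential of the negative image gradient. The first-stage formula is not reproduced in the source text, so this form and the function names are assumptions.

```python
import math


def mean_normalized_inverse_depth(depth_row):
    """Invert the depths and divide by their mean to remove global scale."""
    inverse = [1.0 / d for d in depth_row]
    mean = sum(inverse) / len(inverse)
    return [v / mean for v in inverse]


def edge_aware_smoothness(depth_row, image_row):
    """Mean horizontal smoothness penalty over one row (assumed form)."""
    disp = mean_normalized_inverse_depth(depth_row)
    terms = [
        abs(disp[x + 1] - disp[x]) * math.exp(-abs(image_row[x + 1] - image_row[x]))
        for x in range(len(disp) - 1)
    ]
    return sum(terms) / len(terms)


flat = edge_aware_smoothness([2.0, 2.0, 2.0], [0.1, 0.5, 0.9])
step_no_edge = edge_aware_smoothness([1.0, 2.0], [0.3, 0.3])
step_on_edge = edge_aware_smoothness([1.0, 2.0], [0.0, 1.0])
print(flat, step_no_edge, step_on_edge)
```

A constant depth row gives zero penalty, and the same depth step is penalized less when it coincides with an image edge.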

[0038] In the second stage, the trained first depth estimation network and first pose estimation network are frozen, and the 3D Gaussian splatting module is trained separately. The loss functions used include a photometric loss on the synthesized multi-view images and a smoothing loss on the synthesized multi-view depth maps. The specific loss function for the second stage and its two component terms are given by formulas [not reproduced in the source text]; the terms defined there are: the loss function of the second stage; hyperparameters used to adjust the weights of the loss terms; the photometric loss between the multi-view images synthesized by the 3D Gaussian splatting module and the input images; the edge-aware smoothing loss for the depth maps synthesized by the 3D Gaussian splatting module; the real input image; the image synthesized by the 3D Gaussian splatting module; a balance parameter; and the mean-normalized inverse depth of the synthesized depth map.
In the third stage, the trained first depth estimation network, first pose estimation network, and 3D Gaussian splatting module are frozen, and the second depth estimation network and second pose estimation network are trained. The loss function in this stage includes a pseudo-supervised loss, as well as a student photometric reconstruction loss and a student depth smoothing loss weighted by the occlusion mask. The pseudo-supervised loss is calculated from the difference between the multi-view depth map and the student depth map. The specific loss function for the third stage and its three component terms are given by formulas [not reproduced in the source text]; the terms defined there are: the loss function of the third stage; hyperparameters used to adjust the weights of the loss terms; the pseudo-supervised loss; the edge smoothing loss for the student depth map; the minimum photometric reconstruction loss between the student reconstructed images and the target image; the reprojected image obtained by warping a source frame to the target viewpoint using the pose predicted by the second pose estimation network; and the mean-normalized inverse depth of the student depth map.
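How the occlusion mask enters the third-stage loss can be sketched on flattened per-pixel values. Since the third-stage formulas are not reproduced in the source text, the following are assumptions for illustration: the pseudo-supervised loss is a mean absolute depth difference, the mask weighting is a per-pixel multiplication averaged over all pixels, and the default weights are placeholder values. The name `stage_three_loss` is hypothetical.

```python
def stage_three_loss(depth_multiview, depth_student, photo_error, smooth_error,
                     mask, w_pseudo=1.0, w_photo=1.0, w_smooth=0.001):
    """Combine the three third-stage terms over flattened per-pixel lists.

    The pseudo-supervised term is not masked; the student photometric and
    smoothing terms are multiplied by the occlusion mask before averaging.
    """
    n = len(mask)
    pseudo = sum(abs(a - b) for a, b in zip(depth_multiview, depth_student)) / n
    photo = sum(m * e for m, e in zip(mask, photo_error)) / n
    smooth = sum(m * e for m, e in zip(mask, smooth_error)) / n
    return w_pseudo * pseudo + w_photo * photo + w_smooth * smooth


# Two pixels; the second is masked out, so its large photometric error
# does not contribute to the loss.
loss = stage_three_loss(
    depth_multiview=[2.0, 4.0],
    depth_student=[2.5, 4.0],
    photo_error=[0.2, 0.8],
    smooth_error=[1.0, 1.0],
    mask=[1, 0],
)
print(loss)
```

Pixels with a mask value of 0 drop out of the photometric and smoothing terms, which keeps their gradients from reaching the student networks through those terms.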

[0039] Example 2: This embodiment provides a self-supervised monocular depth estimation system based on 3D Gaussian splatting, including: a data acquisition module, configured to acquire the target image and its adjacent frame images in the video sequence; a first processing module, including a first depth estimation network and a first pose estimation network, configured to process the target image and adjacent frame images to obtain an initial depth map and a relative camera pose; a 3D Gaussian splatting module, configured to synthesize multi-view images and multi-view depth maps corresponding to the target image based on the initial depth map and the relative camera pose; a second processing module, including a second depth estimation network and a second pose estimation network, configured to process the target image and adjacent frame images to obtain the student depth map and the student relative camera pose; a mask generation module, configured to generate an occlusion mask for identifying occluded areas in the target image based on the multi-view depth maps and the student depth map; and a model training module, configured to construct a loss function based on at least the multi-view images, the multi-view depth maps, the student depth map, and the occlusion mask during training, and to optimize the second processing module to obtain the final depth estimation model.

[0040] Example 3: This embodiment proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements a self-supervised monocular depth estimation method based on three-dimensional Gaussian splatting as described in any embodiment of the present invention.

[0041] Example 4: This embodiment proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements a self-supervised monocular depth estimation method based on three-dimensional Gaussian splatting as described in any embodiment of the present invention.

[0042] In this application embodiment, "at least one" refers to one or more, and "more than one" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent the existence of A alone, A and B simultaneously, or B alone. A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c can represent: a, b, c, a and b, a and c, b and c, or a and b and c, where a, b, and c can be single or multiple.

[0043] Those skilled in the art will recognize that the units and algorithm steps described in the embodiments disclosed herein can be implemented using electronic hardware, computer software, or a combination of electronic hardware and software. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0044] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0045] In the several embodiments provided in this application, any function, if implemented as a software functional unit and sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0046] The above description is merely an embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.

Claims

1. A self-supervised monocular depth estimation method based on 3D Gaussian splashing, characterized in that it includes the following steps: acquiring a target image and its adjacent frame images in a video sequence; processing the target image and the adjacent frame images through a first depth estimation network and a first pose estimation network to obtain an initial depth map and a relative camera pose; based on the initial depth map and the relative camera pose, synthesizing a multi-view image and a multi-view depth map corresponding to the target image using 3D Gaussian splashing technology; processing the target image and the adjacent frame images through a second depth estimation network and a second pose estimation network to obtain a student depth map and a student relative camera pose; generating, based on the multi-view depth map and the student depth map, an occlusion mask for identifying occluded regions in the target image; and, during training, constructing a loss function based on at least the multi-view image, the multi-view depth map, the student depth map, and the occlusion mask, and optimizing the second depth estimation network and the second pose estimation network to obtain the final depth estimation model.

2. The self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing as described in claim 1, characterized in that: the synthesis of the multi-view image and the multi-view depth map corresponding to the target image using 3D Gaussian splashing technology is performed by a 3D Gaussian splashing module, which includes: a Gaussian prediction network, used to extract features from the target image and predict a Gaussian tensor containing the color, opacity, scale and rotation parameters, and position offset of each three-dimensional Gaussian; a position generation submodule, used to calculate and optimize the position parameters of each 3D Gaussian in 3D space based on the initial depth map and the position offset in the Gaussian tensor; and a rendering submodule, used to synthesize the multi-view image and the multi-view depth map from the optimized 3D Gaussian parameters using the 3D Gaussian splashing rendering algorithm.
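As an illustration of the Gaussian tensor named in claim 2, the following minimal sketch splits a per-pixel tensor into the listed parameters. The 14-channel layout (3 color, 1 opacity, 3 scale, 4 rotation as a quaternion, 3 position offset), the activation functions, and the function name `split_gaussian_tensor` are assumptions made for this example; the claim does not specify them.

```python
import numpy as np

def split_gaussian_tensor(g):
    """Split an (H, W, 14) per-pixel Gaussian tensor into named parameters.

    Assumed channel layout (not given in the claim):
    0:3 color, 3:4 opacity, 4:7 scale, 7:11 rotation quaternion, 11:14 offset.
    """
    g = np.asarray(g, dtype=np.float64)
    color = 1.0 / (1.0 + np.exp(-g[..., 0:3]))    # sigmoid keeps color in (0, 1)
    opacity = 1.0 / (1.0 + np.exp(-g[..., 3:4]))  # sigmoid keeps opacity in (0, 1)
    scale = np.exp(g[..., 4:7])                   # exp keeps scales positive
    quat = g[..., 7:11]
    # Normalize so the rotation is a unit quaternion.
    rotation = quat / (np.linalg.norm(quat, axis=-1, keepdims=True) + 1e-8)
    offset = g[..., 11:14]                        # raw 3D position offset
    return {
        "color": color,
        "opacity": opacity,
        "scale": scale,
        "rotation": rotation,
        "offset": offset,
    }
```

Keeping one Gaussian per pixel means the position offset can later be added directly to the back-projected point of the same pixel.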

3. The self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing as described in claim 1, characterized in that: the position generation submodule optimizes the three-dimensional Gaussian position parameters as follows: for a pixel location p in the target image, its initial 3D coordinates are calculated by back projection from its depth value $D(p)$ and the camera intrinsic parameter matrix $K$, and the position offset $\Delta(p)$ predicted by the Gaussian prediction network is then added to obtain the optimized three-dimensional position $\mu(p)$. The calculation formula is: $\mu(p) = D(p)\,K^{-1}\,\tilde{p} + \Delta(p)$; in the formula, $\tilde{p}$ denotes the homogeneous coordinates of pixel p.
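The back projection and offset step described in claim 3 can be sketched with NumPy as follows. This is a minimal sketch, assuming a pinhole camera, pixel coordinates given as (column, row), and one Gaussian per pixel; the function name `gaussian_positions` and the array shapes are illustrative choices, not part of the claim.

```python
import numpy as np

def gaussian_positions(depth, K, offset):
    """Back-project every pixel with its depth, then add the predicted offset.

    depth:  (H, W) initial depth map
    K:      (3, 3) camera intrinsic matrix
    offset: (H, W, 3) position offsets from the Gaussian prediction network
    returns (H, W, 3) optimized 3D positions in the camera frame
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # u: columns, v: rows
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Apply K^{-1} to each homogeneous pixel to get its viewing ray.
    rays = pix @ np.linalg.inv(K).T
    # Scale each ray by its depth, then superimpose the offset.
    return depth[..., None] * rays + offset
```

With a zero offset, a pixel at the principal point maps to a point straight ahead of the camera at its depth, which is a quick sanity check for the intrinsics convention.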

4. A self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing as described in claim 2 or 3, characterized in that: the rendering submodule uses $\alpha$-blending rendering, where the color value and depth value of pixel p in the multi-view image and the multi-view depth map are calculated as: $\hat{C}(p) = \sum_{i \in N} c_i \alpha_i T_i$; $\hat{D}(p) = \sum_{i \in N} d_i \alpha_i T_i$; in the formulas: $\hat{C}(p)$ is the color value of pixel p in the multi-view image; $\hat{D}(p)$ is the depth value of pixel p in the multi-view depth map; $N$ is the ordered set of three-dimensional Gaussians contributing to the color or depth of pixel p; $c_i$ and $d_i$ are the color and depth values of the i-th 3D Gaussian, respectively; $\alpha_i$ is the opacity of the i-th 3D Gaussian after being mapped by the 2D covariance matrix; and $T_i$ is the accumulated transmittance of all Gaussians preceding the i-th Gaussian.
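The per-pixel compositing in claim 4 can be sketched as front-to-back alpha blending for a single pixel. The sketch assumes the Gaussians are already ordered front to back and already projected, so each one contributes a scalar opacity at this pixel. It also assumes the accumulated term for the i-th Gaussian is the product of (1 - opacity) over all earlier Gaussians, which is the usual 3D Gaussian rendering convention rather than something the claim text spells out. The function name `alpha_blend` is hypothetical.

```python
import numpy as np

def alpha_blend(colors, depths, alphas):
    """Composite one pixel from N Gaussians ordered front to back.

    colors: (N, 3) per-Gaussian colors
    depths: (N,)   per-Gaussian depths
    alphas: (N,)   per-Gaussian opacities at this pixel, each in [0, 1]
    returns (color, depth) of the rendered pixel
    """
    colors = np.asarray(colors, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # Transmittance before Gaussian i: product of (1 - alpha_j) for j < i.
    # The first Gaussian sees full transmittance 1.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * trans                        # blending weight per Gaussian
    color = (weights[:, None] * colors).sum(axis=0)
    depth = (weights * depths).sum()
    return color, depth
```

Color and depth share the same blending weights, which is why the synthesized depth map stays consistent with the synthesized image.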

5. The self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing according to claim 1, characterized in that: the step of generating an occlusion mask for identifying occluded regions in the target image includes: using the student relative camera pose to transform the multi-view depth map to the viewpoint corresponding to the student depth map, obtaining a reprojected depth map; calculating the geometric consistency between the reprojected depth map and the student depth map to generate a visibility mask; calculating, through image reprojection, the reconstructed images corresponding to the multi-view depth map and to the student depth map respectively, and comparing their photometric errors with respect to the target image to generate a superiority mask; and combining the visibility mask and the superiority mask to obtain the final occlusion mask.

6. The self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing according to claim 5, characterized in that: the visibility mask is calculated by a formula over the multi-view depth map, the reprojected depth map, and a consistency threshold; the superiority mask is calculated by a formula that applies the photometric error function to the target image, the reconstructed image based on the multi-view depth map, and the reconstructed image based on the student depth map; and the occlusion mask is calculated by a formula that combines the masks through an element-wise product operation.
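One possible reading of the mask construction in claims 5 and 6 is sketched below. Several points are assumptions, because the claim text does not fix them: the visibility test is taken to be an absolute depth difference below the threshold, the photometric error is a plain per-pixel mean absolute difference, the superiority mask is 1 where the multi-view reconstruction has the lower error, and the final mask is the element-wise product of the two. Claim 5 compares the reprojected depth map with the student depth map while claim 6 lists the multi-view depth map, so the sketch takes a generic reference depth `d_ref` and leaves that choice to the caller. All function and parameter names are hypothetical.

```python
import numpy as np

def photometric_error(a, b):
    """Per-pixel mean absolute difference over channels (a simple stand-in)."""
    return np.abs(np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)).mean(axis=-1)

def occlusion_mask(d_ref, d_reproj, target, recon_mv, recon_student, tau=0.1):
    """Combine a visibility mask and a superiority mask into one mask.

    d_ref, d_reproj:         (H, W) depth maps compared for consistency
    target:                  (H, W, 3) target image
    recon_mv, recon_student: (H, W, 3) images reconstructed from the
                             multi-view and student depth maps
    tau:                     consistency threshold (assumed absolute, in depth units)
    """
    # Visibility: 1 where the two depth maps agree within the threshold.
    visibility = (np.abs(d_ref - d_reproj) < tau).astype(np.float64)
    # Superiority: 1 where the multi-view reconstruction fits the target better.
    superiority = (
        photometric_error(target, recon_mv) < photometric_error(target, recon_student)
    ).astype(np.float64)
    # Element-wise product of the two binary masks.
    return visibility * superiority
```

A relative depth difference or an SSIM-based photometric error could be substituted without changing the structure of the combination.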

7. The self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing according to claim 1, characterized in that: the training process consists of three stages: in the first stage, the first depth estimation network and the first pose estimation network are jointly trained, and the loss functions used include a photometric reconstruction loss and a depth smoothing loss based on adjacent frame images; in the second stage, the trained first depth estimation network and first pose estimation network are frozen, and the 3D Gaussian splashing module is trained separately, with loss functions including a photometric loss on the synthesized multi-view images and a smoothing loss on the synthesized multi-view depth maps; in the third stage, the trained first depth estimation network, first pose estimation network, and 3D Gaussian splashing module are frozen, and the second depth estimation network and second pose estimation network are trained, with a loss function that includes a pseudo-supervised loss as well as a student photometric reconstruction loss and a student depth smoothing loss weighted by the occlusion mask; wherein the pseudo-supervised loss is calculated from the difference between the multi-view depth map and the student depth map.
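The third-stage objective in claim 7 can be illustrated with the following minimal sketch. It assumes the pseudo-supervised term is a mean absolute depth difference, that the per-pixel photometric and smoothness errors have already been computed, and that the mask acts as a per-pixel weight normalized by its sum. The term weights `w_pseudo`, `w_photo`, and `w_smooth` and the function name `stage3_loss` are hypothetical; the claim gives no numeric values or exact loss forms.

```python
import numpy as np

def stage3_loss(d_mv, d_student, photo_err, smooth_err, mask,
                w_pseudo=1.0, w_photo=1.0, w_smooth=1e-3):
    """Combine the three third-stage terms into one scalar loss.

    d_mv, d_student: (H, W) multi-view (teacher) and student depth maps
    photo_err:       (H, W) per-pixel student photometric reconstruction error
    smooth_err:      (H, W) per-pixel student depth smoothness error
    mask:            (H, W) occlusion-mask weights in [0, 1]
    """
    # Pseudo-supervised term: difference between teacher and student depth.
    pseudo = np.abs(d_mv - d_student).mean()
    # Mask-weighted averages; the epsilon guards against an all-zero mask.
    denom = mask.sum() + 1e-8
    photo = (mask * photo_err).sum() / denom
    smooth = (mask * smooth_err).sum() / denom
    return w_pseudo * pseudo + w_photo * photo + w_smooth * smooth
```

In an actual training loop only the second depth and pose networks would receive gradients from this loss, since the first-stage networks and the splashing module are frozen in this stage.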

8. A self-supervised monocular depth estimation system based on three-dimensional Gaussian splashing, characterized in that it includes: a data acquisition module, configured to acquire a target image and its adjacent frame images in a video sequence; a first processing module, including a first depth estimation network and a first pose estimation network, configured to process the target image and the adjacent frame images to obtain an initial depth map and a relative camera pose; a 3D Gaussian splashing module, configured to synthesize a multi-view image and a multi-view depth map corresponding to the target image based on the initial depth map and the relative camera pose; a second processing module, including a second depth estimation network and a second pose estimation network, configured to process the target image and the adjacent frame images to obtain a student depth map and a student relative camera pose; a mask generation module, configured to generate, based on the multi-view depth map and the student depth map, an occlusion mask for identifying occluded regions in the target image; and a model training module, configured to construct, during training, a loss function based on at least the multi-view image, the multi-view depth map, the student depth map, and the occlusion mask, and to optimize the second processing module to obtain the final depth estimation model.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing as described in any one of claims 1-7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the self-supervised monocular depth estimation method based on three-dimensional Gaussian splashing as described in any one of claims 1-7.