A depth perception method based on speckle structured light and stereo matching network joint optimization

By combining Fourier transform-generated differentiable speckle structured light with a dual-branch stereo matching network for joint optimization, the problems of structured light patterns being unable to transmit environmental information and the high cost of customized diffractive optical elements in active stereo technology are solved, achieving high-precision depth sensing and robust depth acquisition.

CN120451239B · Active · Publication Date: 2026-06-19 · NORTHEASTERN UNIV CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NORTHEASTERN UNIV CHINA
Filing Date: 2025-04-25
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing active stereoscopic technology projects structured light patterns that cannot effectively convey environmental information, resulting in low depth perception accuracy. Furthermore, custom diffractive optical elements are costly and difficult to implement.

Method used

A speckle structured light generation scheme based on Fourier transform is jointly optimized with a two-branch stereo matching network. Differentiable speckle structured light patterns are generated during the training phase, and depth estimation is performed using a binocular camera and a projector, thus avoiding the need for custom diffractive optical elements.

Benefits of technology

It improves the accuracy and performance of depth estimation, reduces costs, and achieves high-precision depth perception capabilities and robust depth acquisition results.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of depth perception technology and discloses a depth perception method based on joint optimization of speckle structured light and a stereo matching network. A speckle structured light generation scheme based on the Fourier transform is constructed, the size, grayscale, and density of the speckle structured light are parameterized, and the generation process is made differentiable. A differentiable projection model is constructed to synthesize active stereo images. A two-branch stereo matching network is constructed that takes the synthesized active stereo image pairs and RGB image pairs as inputs. During the training phase, the speckle structured light and the two-branch stereo matching network are jointly optimized. During the testing phase, the optimized speckle structured light is projected onto actual scenes, and the depth of the scene is obtained through the two-branch stereo matching network. The invention makes the structured light parameters optimizable and introduces active stereo images containing structured light information into the network to achieve high-precision depth estimation.

Description

Technical Field

[0001] This invention relates to the field of depth sensing technology, and in particular to a depth sensing method based on joint optimization of speckle structured light and a stereo matching network.

Background Technology

[0002] Depth sensing technology has become the cornerstone of development in fields such as virtual reality, autonomous driving, and facial recognition. Currently, mainstream depth sensing technologies mainly include binocular vision, time-of-flight, and structured light.

[0003] Binocular vision acquires depth by calculating parallax. In recent years, the development of binocular vision technology has been greatly propelled by advancements in deep learning. However, in textureless or repetitive textured regions, finding matching points inherently presents ambiguity. Time-of-flight (TOF) technology achieves depth perception by emitting light pulses into a scene and calculating the time difference or phase difference of the returning pulses; however, TOF suffers from multipath interference and low resolution. Structured light achieves depth perception by calculating distortion information in two-dimensional images or by marking three-dimensional space; however, structured light is vulnerable to lighting conditions.

[0004] Active stereo technology offers a low-cost depth perception solution to the aforementioned problems. An active stereo system is equipped with two cameras and a projection module. The projection module generates a pre-designed structured light pattern to artificially add a layer of texture to the measurement scene, while the cameras calculate parallax to obtain scene depth information. This process leverages the spatial characteristics of structured light, enhancing the uniqueness of matching points and achieving reliable estimation of stereo correspondences. Due to its low-cost infrared projection module and CMOS sensor, active stereo technology has already been deployed in commercially available products, such as the Intel RealSense D435.

[0005] Active stereo technology still faces some challenges. The structured light patterns it projects are pre-designed, so environmental information cannot be fed back into the structured light design process. This disconnects structured light generation from the depth estimation algorithm and reduces depth perception accuracy. Furthermore, the projection modules of currently deployed active stereo depth cameras primarily use diffractive optical elements, but manufacturing custom diffractive optical elements requires advanced equipment and is very expensive, which makes it difficult to achieve in practice.

Summary of the Invention

[0006] The purpose of this invention is to propose a depth perception method based on the joint optimization of speckle structured light and stereo matching network. By jointly optimizing digital speckle structured light and stereo matching network, the depth estimation task can be completed using only a projector and a binocular camera in real-world scenarios, without the need for custom diffractive optical elements.

[0007] The technical solution of this invention is as follows: a depth perception method based on joint optimization of speckle structured light and a stereo matching network, comprising two stages, a training stage and an actual testing stage. In the training stage, a speckle structured light generation scheme based on the Fourier transform is constructed, the size, grayscale, and density of the speckle structured light are parameterized, and the generation process is made differentiable. A differentiable projection model is constructed to synthesize active stereo images. A two-branch stereo matching network is constructed that takes the synthesized active stereo image pairs and RGB image pairs as inputs for depth acquisition. Throughout the training stage, the speckle structured light and the two-branch stereo matching network are jointly optimized. In the actual testing stage, the optimized speckle structured light is physically projected, and the depth of the actual scene is obtained through the two-branch stereo matching network.

[0008] The speckle structured light is generated from the Fourier transform as follows. According to Fourier optics theory, a speckle structured light pattern is formed by the superposition of multiple components with independent phases, so the pattern is generated with the discrete Fourier transform. First, a random phase matrix is generated, whose elements are defined as:

[0009]

[0010] where the terms denote, in order, a uniform distribution between 0 and 1, the speckle size, and the quantity used to control the density of the speckle structured light pattern;

[0011] A complex matrix is designed to represent the light field:

[0012]

[0013] A differentiable mask based on the sigmoid function is applied in front of the complex matrix to control the speckle size, and the final light field is expressed as:

[0014]

[0015] The speckle structured light pattern is obtained by applying a Fourier transform to the final light field and then a squaring operation:

[0016]

[0017] Pixel-by-pixel brightness control is then applied to the speckle structured light pattern:

[0018]

[0019] where the grayscale term denotes the grayscale value of a pixel;

[0020] Two of the three parameters are made differentiable through the automatic differentiation mechanism, and the third is inherently differentiable, so the whole speckle structured light generation process is differentiable; the speckle structured light pattern generated from these three parameters serves as the input to the differentiable sampling model.

[0021] The differentiable projection model includes the optical projection process and the camera imaging process;

[0022] In the simulated optical projection process, the sampling factor of the speckle structured light pattern is determined by calculating the ratio of the camera pixel size to the projector pixel size:

[0023]

[0024] where the first pair of symbols denotes the pixel pitch of the camera and of the projector, respectively, and the second pair denotes the focal length of the camera and of the projector, respectively; the sampling factor is used to resample the speckle structured light pattern;

[0025] After the optical projection process is simulated, the Lambertian model is used to simulate the camera imaging process. Using the depth map and the occlusion map of the camera viewpoint, the viewpoint transformation is performed: the sampled speckle structured light pattern is warped to the binocular viewpoints, yielding the speckle structured light pattern under binocular vision:

[0026]

[0027] where the product symbol denotes element-wise multiplication, and warp denotes the warping operator.

[0028] After the viewpoint transformation, the active stereo image is obtained based on the Lambertian model:

[0029]

[0030] where the terms denote, in order, a scalar describing the exposure and the sensor's spectral quantum efficiency, the ambient light, the projector power, the noise, and the reflectivity.

[0031] The two-branch stereo matching network includes a convolution-based backbone network, multiple attention layers, an attention mask, optimal transport, coarse disparity and occlusion regression, and a context adaptation layer.

[0032] The processing procedure of the two-branch stereo matching network is as follows:

[0033] Active stereo image pairs and RGB stereo image pairs are used as inputs. A convolutional backbone network is used to obtain the initial feature descriptors of the active stereo image pairs and the feature descriptors of the RGB image pairs, respectively. The initial feature descriptors of the active stereo image pairs are superimposed to obtain the local features of the active stereo image pairs.

[0034] The local features of the active stereo image pair and the feature descriptors of the RGB image pair are input into multiple attention layers for processing. Each attention layer includes: a main matching path, an active stereo path, and a fusion path. The local features of the active stereo image pair are input to the fusion path via the active stereo path. The feature descriptors of the RGB image pair are input to the fusion path via the main matching path.

[0035] The main matching path and the active stereo path each contain a self-attention layer and a cross-attention layer. The main matching path obtains contextual information and corresponding pixel similarity from the RGB stereo image pairs. The active stereo path obtains contextual information and corresponding pixel similarity from the active stereo image pairs. The fusion path integrates the contextual information of the active stereo image pairs into the main matching path. In the last attention layer, the features output by the main matching path are used as input to the attention mask. After optimal transport, the coarse disparity map and occlusion map are obtained. Finally, the context adaptation layer produces the final disparity map and occlusion map.

[0036] The convolution-based backbone network adopts an hourglass-shaped feature extraction architecture, and the generated feature descriptors are the same size as the original image.

[0037] The initial feature descriptors of the active stereo image pair are superimposed as follows: after the initial feature descriptors are obtained, for each pixel, the feature channels of the pixels in the surrounding neighborhood centered on that pixel are concatenated into a single vector, whose length is the number of neighborhood pixels multiplied by the feature channel dimension; this operation is performed on all pixels in the initial feature descriptor, and the result passes through one 1×1 convolution layer and one 3×3 convolution layer, which reduce the feature channel dimension to that of the initial feature descriptor; the output serves as the input to the attention layer.

[0038] The self-attention layer is used to aggregate information from features in the main matching path and the active stereo path. The cross-attention layer is used to calculate the pixel similarity on each epipolar line in the RGB stereo image features and the active stereo image features. The fusion path concatenates the features of the main matching path and the active stereo path, and then uses two convolutional layers to aggregate the active stereo image features containing speckle structured light information into the RGB stereo image features. The fused features are used as the input of the main matching path in the next attention layer, and the output features of the active stereo path continue to be used as the input of the active stereo path in the next attention layer.

[0039] The attention mask, in addition to ensuring the uniqueness of matching points between the two feature images, also takes the positional information of the matching points into account; let two pixels, one in each image, be matching points on an epipolar line, with coordinates increasing from left to right; the coordinates of the matching points then satisfy an inequality, so the final matching point lies within a lower triangular matrix;

[0040] The optimal transport step takes the features output by the attention mask as input. To ensure the uniqueness of the matching between pixels, entropy-regularized optimal transport is introduced; its soft assignment and differentiability allow gradients to flow normally.

[0041] The coarse disparity and occlusion regression first uses the winner-takes-all principle to find the optimal matching point. A 3-pixel window is then constructed around the matching point, the matching probabilities within the window are normalized, and the coarse disparity is obtained as the weighted sum of the candidate disparities within the window:

[0042]

[0043]

[0044] where the terms denote, in order, the candidate disparity probability, the normalized candidate disparity probability, and the candidate disparity; the occlusion probability of a point is calculated as follows:

[0045]

[0046] The context adaptation layer uses convolutional blocks and the Sigmoid function to produce the final occlusion estimate, and uses residual blocks and the ReLU function to produce the final disparity estimate, thereby aggregating contextual information across different epipolar lines.

[0047] In the joint learning process, the parameters of the two-branch stereo matching network are collectively denoted by one symbol, and the three parameters of the speckle structured light are collectively denoted by another. The end-to-end joint optimization problem can be summarized as follows:

[0048]

[0049] where the terms denote, in order, the predicted left disparity map, the ground-truth left disparity, the predicted left occlusion map, and the ground-truth left occlusion; the disparity loss combines the relative response loss and the L1 loss, and the occlusion loss is the binary cross-entropy loss.

[0050] In the actual testing stage, a projector projects the optimized speckle structured light onto the scene under test, and two monocular cameras acquire the active stereo images. These images are then input into the trained two-branch stereo matching network to complete depth acquisition.

[0051] The beneficial effects of this invention are:

[0052] (1) A differentiable structured light generation scheme is proposed, which makes the structured light parameters optimizable and helps improve depth estimation performance.

[0053] (2) A dual-branch stereo matching network is proposed, which introduces active stereo images containing structured light information into the network to achieve high-precision depth estimation.

[0054] (3) An experimental prototype was built, and comprehensive experiments were conducted on datasets and in the real world. The results show that the method of the present invention has robust depth acquisition capabilities and is superior to existing methods.

Attached Figure Description

[0055] Figure 1 is the overall flowchart of the present invention; (a) is the training phase, and (b) is the testing phase;

[0056] Figure 2 is a diagram of the two-branch stereo matching network architecture;

[0057] Figure 3 is a schematic diagram of the fusion path;

[0058] Figure 4: (a) is a schematic diagram of the attention mask; (b) is a schematic diagram of the matching point positions of the image pair; (c) is a schematic diagram of the attention mask calculation;

[0059] Figure 5 is a diagram of the apparatus used in the method of the present invention;

[0060] Figure 6 shows the qualitative accuracy evaluation: (a) the ActiveStereoNet method, (b) the Polka Lines structured light optimization method, (c) the STTR stereo matching method, (d) the CSTR stereo matching method, (e) the ELFNet stereo matching method, and (f) the present invention;

[0061] Figure 7 shows the quantitative accuracy evaluation;

[0062] Figure 8: (a) is a qualitative estimation image; (b) is a scene RGB image; (c) is a qualitative estimation image of the ActiveStereoNet method in different scenes; (d) is a qualitative estimation image of the Polka Lines structured light optimization method in different scenes; (e) is a qualitative estimation image of the STTR stereo matching method in different scenes; (f) is a qualitative estimation image of the ELFNet stereo matching method in different scenes; (g) is a qualitative estimation image of the present invention in different scenes; and (h) is a ground truth image.

Detailed Implementation

[0063] Figure 1 is a flowchart of the technical solution of the present invention. The present invention proposes a depth perception method based on joint optimization of speckle structured light and a stereo matching network, comprising two stages: a training stage and an actual testing stage. In the training stage, a speckle structured light generation scheme based on the Fourier transform is constructed, the size, grayscale, and density of the speckle structured light are parameterized, and the generation process is made differentiable. A differentiable projection model is constructed to synthesize active stereo images. A two-branch stereo matching network is constructed that takes the synthesized active stereo image pairs and RGB image pairs as inputs for depth acquisition. Throughout the training stage, the speckle structured light and the two-branch stereo matching network are jointly optimized. In the actual testing stage, the optimized speckle structured light is physically projected, and the depth of the actual scene is obtained through the two-branch stereo matching network.

[0064] The specific implementation process includes the following steps:

[0065] (1) Structured light generation. According to Fourier optics theory, a speckle pattern is formed by the superposition of multiple components with independent phases. This process can be simulated using the discrete Fourier transform. First, a random phase matrix is generated, whose elements are defined as:

[0066]

[0067] where the terms denote, in order, a uniform distribution between 0 and 1, the speckle size, and the quantity used to control the density of the speckle structured light pattern. A complex matrix is then obtained to represent the light field:

[0068]

[0069] Note that the range of values of this matrix approximates a rectangular function, which makes the speckle size non-differentiable. To keep the whole pipeline of structured light generation and the deep network differentiable, a differentiable mask that approximates the rectangular function is applied in front of the matrix to control the speckle size. The final light field is expressed as:

[0070]

[0071] The speckle structured light pattern is obtained by applying a Fourier transform to this matrix and then a squaring operation:

[0072]

[0073] In addition to controlling the phase and speckle size, we also perform pixel-by-pixel brightness control on the speckle structured light pattern:

[0074]

[0075] where the grayscale term denotes the grayscale value of a pixel.

[0076] At this point, we have completed the encoding of the key parameters of structured light, namely density (achieved by controlling the phase), size, and brightness, and ensured the differentiability of the parameters.
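
The generation steps above can be sketched in a few lines. The formulas themselves are not reproduced in this text, so the sketch fills them with assumptions: the phase is drawn as 2π times a uniform sample, the mask is a sigmoid of the distance to a circular aperture of radius `radius`, and the brightness control is a simple scaling by `gray`. All names, and the `sharpness` parameter, are hypothetical. A naive discrete Fourier transform is used to keep the example dependency-free; a training pipeline would use an automatic differentiation framework's FFT instead.

```python
import cmath
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dft2(field):
    """Naive 2-D discrete Fourier transform of a square complex matrix."""
    n = len(field)
    w = [[cmath.exp(-2j * math.pi * a * b / n) for b in range(n)] for a in range(n)]
    # Transform along x for every row, then along y for every column.
    rows = [[sum(field[y][x] * w[x][u] for x in range(n)) for u in range(n)]
            for y in range(n)]
    return [[sum(rows[y][u] * w[y][v] for y in range(n)) for u in range(n)]
            for v in range(n)]


def speckle_pattern(n, radius, gray, sharpness=4.0, seed=0):
    """Return an n x n speckle pattern with peak brightness `gray` (assumed model)."""
    rng = random.Random(seed)
    center = (n - 1) / 2.0
    field = []
    for y in range(n):
        row = []
        for x in range(n):
            phase = 2.0 * math.pi * rng.random()          # independent random phase
            r = math.hypot(x - center, y - center)
            mask = sigmoid(sharpness * (radius - r))       # soft, differentiable aperture
            row.append(mask * cmath.exp(1j * phase))       # masked complex light field
        field.append(row)
    # Fourier transform followed by the squared magnitude gives the intensity.
    intensity = [[abs(value) ** 2 for value in row] for row in dft2(field)]
    peak = max(max(row) for row in intensity)
    return [[gray * value / peak for value in row] for row in intensity]


pattern = speckle_pattern(n=8, radius=2.0, gray=200.0)
```

The sigmoid replaces a hard aperture edge with a smooth one, so a gradient with respect to `radius` exists everywhere. In this assumed model a wider aperture produces finer speckle grains, since the pattern is the Fourier transform of the masked field.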

[0077] (2) Differentiable projection model. The sampling factor of the structured light pattern during projection is determined by calculating the ratio of camera pixel size to projector pixel size:

[0078]

[0079] where the first pair of symbols denotes the pixel pitch of the camera and of the projector, respectively, and the second pair denotes the focal length of the camera and of the projector, respectively. The structured light pattern is resampled by this sampling factor using bicubic interpolation. Since this process does not depend on depth information, it can be applied in any scenario.
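
As a rough illustration of the sampling factor, the sketch below assumes it is the ratio of the angle subtended by one camera pixel to the angle subtended by one projector pixel, where each angle is approximated by pixel pitch divided by focal length. The exact formula is not reproduced in this text, so both the form and the direction of the ratio are assumptions, as are the function name and the example values.

```python
def sampling_factor(cam_pitch, proj_pitch, cam_focal, proj_focal):
    """Assumed sampling factor: camera pixel angle over projector pixel angle.

    Pitches must share one unit and focal lengths must share one unit.
    """
    cam_angle = cam_pitch / cam_focal      # small-angle size of one camera pixel
    proj_angle = proj_pitch / proj_focal   # small-angle size of one projector pixel
    return cam_angle / proj_angle


# Hypothetical hardware values (pitch in micrometres, focal length in millimetres).
factor = sampling_factor(cam_pitch=3.45, proj_pitch=5.4, cam_focal=8.0, proj_focal=12.0)
```

The factor depends only on the hardware, not on the scene, which matches the statement that the resampling step is independent of depth.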

[0080] After simulating the projection process, it is necessary to simulate the light transmission process from the scene under test to the stereo camera. This process is implemented using geometric optics.

[0081] Using the depth map and the occlusion map of the camera viewpoint, the viewpoint transformation is performed and the structured light pattern is warped to the binocular viewpoints:

[0082]

[0083] where the product symbol denotes element-wise multiplication, and warp denotes the warping operator.
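
A minimal sketch of such a warp follows, under assumptions not stated in the text: a rectified horizontal setup, a disparity of focal length times baseline divided by depth, a fixed sign convention for the shift, linear interpolation along the row, and a visibility mask in which 1 means not occluded. All names are hypothetical.

```python
import math


def warp_pattern(pattern, depth, visible, focal_px, baseline):
    """Warp a projector pattern into a camera view and mask occluded pixels."""
    h, w = len(pattern), len(pattern[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            disparity = focal_px * baseline / depth[y][x]
            src = x + disparity                      # assumed sign convention
            x0 = math.floor(src)
            if 0 <= x0 <= w - 1:
                x1 = min(x0 + 1, w - 1)
                t = src - x0
                # Linear interpolation keeps the warp differentiable in depth.
                sample = (1.0 - t) * pattern[y][x0] + t * pattern[y][x1]
            else:
                sample = 0.0                         # pattern does not reach this pixel
            out[y][x] = visible[y][x] * sample       # element-wise occlusion masking
    return out


pattern = [[0.0, 10.0, 20.0, 30.0]]
depth = [[2.0, 2.0, 2.0, 2.0]]
visible = [[1.0, 1.0, 1.0, 1.0]]
warped = warp_pattern(pattern, depth, visible, focal_px=2.0, baseline=1.0)
```

In a training pipeline the same operation would be expressed with a differentiable sampling primitive so that gradients reach the pattern parameters.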

[0084] After the viewpoint transformation, the active stereo image is obtained using the Lambertian model:

[0085]

[0086] where the terms denote, in order, a scalar describing the exposure and the sensor's spectral quantum efficiency, the ambient light, the projector power, the noise, and the reflectivity.
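
One plausible arrangement of these quantities is sketched below. The original formula is not reproduced in this text, so the way the terms combine (reflectivity times the sum of ambient light and scaled pattern, multiplied by the sensor scalar, plus Gaussian noise, clipped to the range 0 to 1) is an assumption, as are all parameter names.

```python
import random


def render_active_image(pattern, reflectivity, gain, ambient, power, noise_std, seed=0):
    """Assumed Lambertian image formation for one warped pattern."""
    rng = random.Random(seed)
    image = []
    for pattern_row, refl_row in zip(pattern, reflectivity):
        row = []
        for p, rho in zip(pattern_row, refl_row):
            radiance = rho * (ambient + power * p)            # diffuse reflection
            value = gain * radiance + rng.gauss(0.0, noise_std)  # sensor response + noise
            row.append(min(1.0, max(0.0, value)))             # sensor saturation
        image.append(row)
    return image


image = render_active_image(
    pattern=[[0.0, 1.0]], reflectivity=[[0.5, 0.5]],
    gain=1.0, ambient=0.2, power=0.6, noise_std=0.01,
)
```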

[0087] (3) Two-branch stereo matching network. The architecture of the two-branch stereo matching network is shown in Figure 2. The entire process incorporates active stereo images to enhance region recognition. Active stereo image pairs and RGB stereo image pairs are used as inputs, and a convolutional backbone network is used to obtain the initial feature descriptors of the active stereo image pairs and the feature descriptors of the RGB image pairs, respectively. The initial feature descriptors of the active stereo image pairs are then superimposed to obtain the local features of the active stereo image pairs.

[0088] The local features of the active stereo image pair and the feature descriptors of the RGB stereo image pair are input into multiple attention layers for processing. Each attention layer includes: a main matching path, an active stereo path, and a fusion path. The local features of the active stereo image pair are input to the fusion path via the active stereo path. The feature descriptors of the RGB image pair are input to the fusion path via the main matching path.

[0089] The main matching path and the active stereo path each contain a self-attention layer and a cross-attention layer. The main matching path obtains contextual information and corresponding pixel similarity from the RGB stereo image pairs. The active stereo path obtains contextual information and corresponding pixel similarity from the active stereo image pairs. The fusion path integrates the contextual information of the active stereo image pairs into the main matching path. In the last attention layer, the features output by the main matching path are used as input to the attention mask. After optimal transport, the coarse disparity map and occlusion map are obtained. Finally, the context adaptation layer produces the final disparity map and occlusion map.

[0090] Regarding feature extraction, we employ an hourglass-shaped feature extraction architecture that generates feature descriptors of the same size as the original image. Notably, the feature descriptors of the active stereo image pairs are processed further. Considering the local uniqueness brought by the structured light information, the initial feature descriptors are superimposed after they are obtained. Specifically, for each pixel, the feature channels of the pixels in the surrounding neighborhood centered on that pixel are concatenated into a single vector, whose length is the number of neighborhood pixels multiplied by the feature channel dimension; here the neighborhood size is set to 5. This operation is performed on all pixels in the feature descriptor. The result then passes through one 1×1 convolution layer and one 3×3 convolution layer, which reduce the feature channel dimension to that of the original feature descriptors before they are fed into the attention layer. The purpose of the 1×1 convolution is to integrate the local structured light information into a single pixel, and the purpose of the 3×3 convolution is to enhance the capture of local features in the active stereo images.
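
The superposition step can be sketched as follows. The sketch assumes a square window of side `k` (5 in the text), zero padding at the image border, and features stored as nested lists of shape height × width × channels; the padding choice and the function name are assumptions. The two convolutions that follow are omitted.

```python
def stack_neighbourhood(features, k=5):
    """Concatenate the channels of every pixel in a k x k window around each pixel.

    features: H x W x C nested lists. Returns H x W x (k * k * C) nested lists.
    """
    h, w, c = len(features), len(features[0]), len(features[0][0])
    r = k // 2
    zero = [0.0] * c                       # assumed zero padding outside the image
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vec = []
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < h and 0 <= xx < w
                    vec.extend(features[yy][xx] if inside else zero)
            row.append(vec)
        out.append(row)
    return out


# Tiny hypothetical descriptor: 3 x 3 pixels with 2 channels each.
features = [[[float(10 * y + x), 1.0] for x in range(3)] for y in range(3)]
stacked = stack_neighbourhood(features, k=3)
```

Each output vector carries the local speckle context of its pixel, which is what the subsequent 1×1 convolution compresses back to the original channel count.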

[0091] Regarding the attention layer, self-attention is used to aggregate information from the images in both the main matching path and the active stereo path, and cross-attention is used to calculate pixel similarity along an epipolar line. Note that self-attention is calculated in both the horizontal and vertical directions in both paths to improve generalization performance. To fully utilize the features of the active stereo image pairs, a fusion path is added to each layer, as shown in Figure 3. Specifically, the features of the main matching path and the active stereo path are concatenated, and two convolutional layers then aggregate the features containing structured light information into the main features. The fused features are used as the input to the next main matching path layer.
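
The cross-attention similarity along one epipolar line can be sketched as a scaled dot product followed by a softmax. The learned query, key, and value projections and the multi-head structure of a real attention layer are left out, and the function name is hypothetical.

```python
import math


def epipolar_cross_attention(left_row, right_row):
    """Similarity of every left pixel to every right pixel on the same image row.

    left_row, right_row: lists of W feature vectors. Returns a W x W matrix
    whose rows are softmax-normalized.
    """
    channels = len(left_row[0])
    scale = 1.0 / math.sqrt(channels)
    attention = []
    for query in left_row:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in right_row]
        top = max(scores)                                 # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attention.append([wgt / total for wgt in weights])
    return attention


# Hypothetical features: the two pixels swap positions between the views.
left = [[4.0, 0.0], [0.0, 4.0]]
right = [[0.0, 4.0], [4.0, 0.0]]
attention = epipolar_cross_attention(left, right)
```

Restricting the comparison to a single row is what makes the similarity an epipolar-line quantity in a rectified stereo pair.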

[0092] The other modules are the attention mask, optimal transport, coarse disparity and occlusion regression, and the context adaptation layer. Regarding optimal transport and the attention mask: to ensure the uniqueness of each match, entropy-regularized optimal transport is introduced, whose soft assignment and differentiability allow gradients to flow normally. Besides the uniqueness of the matching points, their positional information also needs to be considered. As shown in Figure 4, assume two pixels, one in each image, are matching points on an epipolar line, with coordinates increasing from left to right. The coordinates of the matching points then satisfy an inequality, so the possible range of the final matching point is a lower triangular matrix. Regarding coarse disparity and occlusion regression: the winner-takes-all principle is first used to find the optimal matching point. A 3-pixel window is then constructed around that point, the matching probabilities within the window are normalized, and the coarse disparity is obtained as the weighted sum of the candidate disparities within the window:

[0093]

[0094]

[0095] where the terms denote, in order, the candidate disparity probability, the normalized candidate disparity probability, and the candidate disparity. This approach improves the model's robustness to multimodal distributions. The above gives the disparity for unoccluded points. The occlusion probability of a point can be calculated as:

[0096]
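
The regression described above can be sketched for a single left-image pixel. Since the formulas are not reproduced in this text, two details are assumptions: the candidate disparity is taken as the left coordinate minus the right coordinate, and the occlusion probability is taken as one minus the probability mass inside the 3-pixel window. The function name is hypothetical.

```python
def coarse_disparity(match_probs, left_x):
    """Winner-takes-all plus 3-pixel window regression for one left pixel.

    match_probs[j] is the matching probability between the left pixel at
    column `left_x` and the right pixel at column j.
    """
    width = len(match_probs)
    best = max(range(width), key=lambda j: match_probs[j])    # winner-takes-all
    window = [j for j in (best - 1, best, best + 1) if 0 <= j < width]
    total = sum(match_probs[j] for j in window)
    # Renormalize inside the window, then take the probability-weighted disparity.
    disparity = sum((left_x - j) * match_probs[j] / total for j in window)
    occlusion = 1.0 - total                                    # assumed definition
    return disparity, occlusion


# Hypothetical matching probabilities for the left pixel at column 4.
disparity, occlusion = coarse_disparity([0.0, 0.1, 0.6, 0.2, 0.0], left_x=4)
```

Limiting the weighted sum to the window around the strongest match keeps a second, distant peak from pulling the estimate toward a meaningless average.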

[0097] Regarding the context adaptation layer, we use convolutional blocks and the Sigmoid function to produce the final occlusion estimate, and residual blocks and the ReLU function to produce the final disparity estimate. This process aggregates contextual information across different epipolar lines.

[0098] (4) Joint learning. The parameters of the two-branch stereo matching network are collectively denoted by one symbol, and the three parameters of the speckle structured light are collectively denoted by another. The end-to-end joint optimization problem can then be summarized as follows:

[0099]

[0100] where the terms denote, in order, the predicted left disparity map, the ground-truth left disparity, the predicted left occlusion map, and the ground-truth left occlusion. For the disparity loss, we use the relative response loss and the L1 loss; for the occlusion loss, we use the binary cross-entropy loss.
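
A minimal sketch of how the three loss terms might be combined is given below. The weights, the clamping constant, the flattened-list inputs, and the form of the relative response term (the mean negative log of the probability assigned to the true match) are assumptions rather than the patented formulation.

```python
import math


def joint_loss(pred_disp, true_disp, pred_occ, true_occ, match_prob,
               w_rr=1.0, w_l1=1.0, w_occ=1.0, eps=1e-7):
    """Weighted sum of relative response, L1, and binary cross-entropy terms.

    All inputs are flat lists over pixels; match_prob holds the probability the
    network assigned to each pixel's ground-truth match.
    """
    n = len(pred_disp)
    # Relative response: reward high probability on the true match.
    rr = sum(-math.log(max(p, eps)) for p in match_prob) / len(match_prob)
    # L1 distance between predicted and ground-truth disparity.
    l1 = sum(abs(p - t) for p, t in zip(pred_disp, true_disp)) / n
    # Binary cross-entropy between predicted and ground-truth occlusion.
    bce = 0.0
    for p, t in zip(pred_occ, true_occ):
        p = min(1.0 - eps, max(eps, p))
        bce -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    bce /= len(pred_occ)
    return w_rr * rr + w_l1 * l1 + w_occ * bce


loss = joint_loss([1.5, 2.0], [1.0, 2.0], [0.2, 0.9], [0.0, 1.0], [0.8, 0.9])
```

In joint optimization the same scalar would be differentiated with respect to both the network weights and the speckle parameters, because the rendered active stereo images depend on the pattern.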

[0101] The apparatus for implementing the present invention is shown in Figure 5. The camera resolution is 1920×1200. The projector resolution is 1920×1080.

[0102] Compared with existing technologies, this invention improves depth sensing accuracy. Multiple measurements were performed on a standard plane with weak texture at distances ranging from 40 to 180 cm. The qualitative evaluation results are shown in Figure 6: compared to other methods, the planes reconstructed by this invention at different depths have smooth surfaces and continuous depth. The quantitative evaluation results are shown in Figure 7, where mean absolute error is used to measure reconstruction quality. This invention maintains a low overall error; although its results at 60 cm and 80 cm are not the best, they are close to the best. Other methods exhibit larger error fluctuations, with maximum errors significantly greater than those of this invention.
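
For reference, the mean absolute error on a fronto-parallel plane can be computed as in the sketch below. The depth values are made up for illustration and are not measurements from the patent; the function name is hypothetical.

```python
def mean_absolute_error(pred_depth, true_depth_cm):
    """Mean absolute deviation of a predicted depth map from a known plane distance."""
    errors = [abs(d - true_depth_cm) for row in pred_depth for d in row]
    return sum(errors) / len(errors)


# Illustrative (not measured) depth estimates in cm for a plane placed at 40 cm.
predicted = [[40.5, 39.5], [40.0, 41.0]]
mae = mean_absolute_error(predicted, true_depth_cm=40.0)
```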

[0103] In a qualitative comparison with current mainstream depth perception methods, shown in Figure 8, this invention obtains a more complete depth map than the other methods and achieves better matching results in detailed areas.

Claims

1. A depth perception method based on joint optimization of speckle structured light and a stereo matching network, characterized in that it includes two phases, a training phase and an actual testing phase; in the training phase, a speckle structured light generation scheme based on the Fourier transform is constructed, the size, gray level, and density of the speckle structured light are parameterized, and the speckle structured light generation process is made differentiable; a differentiable projection model is constructed to synthesize active stereo images; a two-branch stereo matching network is constructed, and the synthesized active stereo image pairs and RGB image pairs are used as inputs to the two-branch stereo matching network for depth acquisition; throughout the training phase, the joint optimization of the speckle structured light and the two-branch stereo matching network is completed; in the actual testing phase, the optimized speckle structured light is physically projected and the depth of the actual scene is obtained through the two-branch stereo matching network; the two-branch stereo matching network includes a convolution-based backbone network, multiple attention layers, an attention mask, optimal transport, coarse disparity and occlusion regression, and a context adaptation layer; the processing procedure of the two-branch stereo matching network is as follows: using active stereo image pairs and RGB image pairs as input, the convolution-based backbone network obtains the initial feature descriptors of the active stereo image pairs and the feature descriptors of the RGB image pairs, respectively; the initial feature descriptors of the active stereo image pair are superimposed to obtain the local features of the active stereo image pair; the local features of the active stereo image pair and the feature descriptors of the RGB image pair are input into the multiple attention layers for processing; each attention layer includes a main matching path, an active stereo path, and a fusion path; the local features of the active stereo image pair are input to the fusion path via the active stereo path; the feature descriptors of the RGB image pair are input to the fusion path via the main matching path; both the main matching path and the active stereo path contain one self-attention layer and one cross-attention layer; the main matching path acquires contextual information and corresponding pixel similarity based on the RGB stereo image pairs; the active stereo path acquires contextual information and corresponding pixel similarity based on the active stereo image pairs; the fusion path integrates the contextual information of the active stereo image pairs into the main matching path; in the final attention layer, the features output by the main matching path are used as input to the attention mask; after optimal transport, the coarse disparity and occlusion maps are obtained; finally, after the context adaptation layer, the final disparity and occlusion maps are obtained.

2. The method of claim 1, wherein the specific method for generating speckle structured light based on the Fourier transform is as follows. According to Fourier optics theory, a speckle structured light pattern is formed by the superposition of multiple components with independent phases, and the pattern is generated using the discrete Fourier transform. A random phase matrix is generated, whose elements are defined in terms of a uniform distribution between 0 and 1, a parameter denoting the speckle size, and a parameter used to control the density of the speckle structured light pattern. A complex matrix is designed to represent the optical field. A differentiable mask based on the Sigmoid function is placed before the complex matrix to control the speckle size, which gives the final light field representation. The speckle structured light pattern is obtained from the final light field by applying a Fourier transform followed by a squaring operation. The brightness of the speckle structured light pattern is then controlled pixel by pixel through the gray value of each pixel. Two of the parameters are handled differentiably through an automatic differentiation mechanism and the third is inherently differentiable, so that the speckle structured light generation process as a whole is differentiable. The speckle structured light pattern generated from these three parameters serves as the input to the differentiable sampling model.
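
The generation steps in this claim (random phase, Sigmoid mask, Fourier transform, squaring, gray-level scaling) can be sketched as below. The formulas themselves are not reproduced in this text, so the specific forms used here are assumptions: the phase is taken as 2π times a uniform sample, the mask is a soft circular aperture whose radius sets the speckle size (a larger aperture gives finer speckles), and `gray` is a scalar or per-pixel array. The density parameter is left out because its definition cannot be recovered from the text. NumPy is used only for illustration; in training the same operations would be written in an automatic differentiation framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def speckle_pattern(n=128, aperture_radius=16.0, edge_sharpness=1.0, gray=1.0, seed=0):
    """Illustrative speckle generator; parameter names are hypothetical."""
    rng = np.random.default_rng(seed)

    # Random phase matrix: 2*pi times samples from U(0, 1) (assumed form).
    phase = 2.0 * np.pi * rng.uniform(0.0, 1.0, size=(n, n))
    field = np.exp(1j * phase)  # complex matrix representing the optical field

    # Soft circular mask built from a sigmoid, smooth in aperture_radius.
    yy, xx = np.mgrid[0:n, 0:n]
    r = np.hypot(yy - n / 2.0, xx - n / 2.0)
    mask = sigmoid(edge_sharpness * (aperture_radius - r))

    # Fourier transform followed by squaring of the magnitude.
    intensity = np.abs(np.fft.fft2(mask * field)) ** 2
    intensity /= intensity.max()  # normalise to [0, 1]

    # Brightness control through the gray value.
    return gray * intensity

pattern = speckle_pattern(gray=0.8)
```

Replacing a hard aperture with a Sigmoid edge is what keeps the size parameter usable for gradient-based optimization, since a binary mask has zero gradient almost everywhere.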

3. The method of claim 1, wherein the differentiable projection model includes an optical projection process and a camera imaging process. In simulating the optical projection process, the sampling factor of the speckle structured light pattern is determined by calculating the ratio of the camera pixel size to the projector pixel size, using the pixel pitches of the camera and the projector and the focal lengths of the camera and the projector; the sampling factor sigma is then used to sample the speckle structured light pattern. After the optical projection process has been simulated, the Lambertian model is used to simulate the camera imaging process: based on the depth map and the occlusion map at the camera's viewpoint, a viewpoint transformation is carried out in which the sampled speckle structured light pattern is warped to the binocular viewpoints by a warp operator combined with element-wise multiplication, giving the speckle structured light pattern under binocular vision. After the viewpoint transformation, the active stereo image is obtained from the Lambertian model, whose terms are a scalar describing exposure and sensor spectral quantum efficiency, the ambient light, the projector power, the noise, and the reflectivity.
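
The two stages of this projection model can be sketched as below. The exact formulas are not reproduced in this text, so both are assumptions: the sampling factor is taken as the ratio of the angle subtended by one camera pixel to that of one projector pixel, and the Lambertian image is taken as gain × reflectivity × (ambient + power × pattern) + noise. All function and parameter names are hypothetical. The viewpoint warp is omitted, and nearest-neighbour sampling is used for brevity even though it is not differentiable.

```python
import numpy as np

def sampling_factor(cam_pitch, cam_focal, proj_pitch, proj_focal):
    # Assumed form: (camera pitch / camera focal) / (projector pitch / projector focal).
    return (cam_pitch / cam_focal) / (proj_pitch / proj_focal)

def resample_nearest(pattern, factor, out_shape):
    """Camera pixel i reads projector pixel floor(i * factor), clipped to the pattern."""
    h, w = out_shape
    rows = np.clip(np.floor(np.arange(h) * factor).astype(int), 0, pattern.shape[0] - 1)
    cols = np.clip(np.floor(np.arange(w) * factor).astype(int), 0, pattern.shape[1] - 1)
    return pattern[np.ix_(rows, cols)]

def lambertian_image(warped_pattern, reflectivity, ambient, power, gain, noise_std, rng):
    """Assumed image formation: gain * reflectivity * (ambient + power * pattern) + noise."""
    noise = rng.normal(0.0, noise_std, size=warped_pattern.shape)
    image = gain * reflectivity * (ambient + power * warped_pattern) + noise
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
pattern = rng.uniform(size=(8, 8))                   # stand-in speckle pattern
sigma = sampling_factor(3e-6, 4e-3, 6e-6, 4e-3)      # hypothetical pitches and focal lengths
sampled = resample_nearest(pattern, sigma, (16, 16))
reflectivity = np.full((16, 16), 0.7)
image = lambertian_image(sampled, reflectivity, ambient=0.1, power=0.9,
                         gain=1.0, noise_std=0.01, rng=rng)
```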

4. The method of claim 1, wherein the convolution-based backbone network adopts an hourglass-shaped feature extraction architecture, and the generated feature descriptors have the same size as the original image.

5. The method of claim 1, wherein the stacking of the initial feature descriptors of the active stereo image pair specifically involves: after the initial feature descriptors of the active stereo image pair are obtained, for each pixel, the feature channels of the pixels in the neighbourhood centred on that pixel are concatenated into a single vector whose length is determined by the neighbourhood size and the feature channel dimension; this operation is performed for all pixels in the initial feature descriptor, and the result is passed through one 1×1 convolution layer and one 3×3 convolution layer, which reduce the feature channel dimension to that of the initial feature descriptor; the output serves as the input to the attention layers.
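
The neighbourhood stacking described in this claim can be sketched as below. The neighbourhood size is not recoverable from this text, so the odd window size `k` is an assumption, as are zero padding at the borders and the function names. Only the 1×1 convolution is shown, written as a per-pixel linear map; the 3×3 convolution is omitted.

```python
import numpy as np

def stack_neighbourhood(features, k=3):
    """Concatenate the channels of every k x k neighbourhood.

    features: array of shape (C, H, W); returns (C * k * k, H, W).
    Borders are zero padded (an assumption); k must be odd.
    """
    c, h, w = features.shape
    pad = k // 2
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)))
    patches = [padded[:, dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    return np.concatenate(patches, axis=0)

def conv1x1(features, weight):
    """1x1 convolution as a per-pixel linear map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, features)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 4, 5))                 # hypothetical descriptor, C=2
stacked = stack_neighbourhood(feats, k=3)          # shape (18, 4, 5)
reduced = conv1x1(stacked, rng.normal(size=(2, 18)))  # back to C=2 channels
```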

6. The method of claim 5, wherein the self-attention layer is used to aggregate information within the features of the main matching path and of the active stereo path, and the cross-attention layer is used to calculate the pixel similarity along each epipolar line in the RGB stereo image features and in the active stereo image features. The fusion path concatenates the features of the main matching path and the active stereo path, and then uses two convolutional layers to aggregate the active stereo image features, which contain speckle structured light information, into the RGB stereo image features. The fused features serve as the input to the main matching path of the next attention layer, and the output features of the active stereo path continue to serve as the input to the active stereo path of the next attention layer.
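
Cross-attention along one epipolar line and the fusion of the two paths can be sketched as below. This is a simplified illustration under stated assumptions: the learned query, key, and value projections are omitted, the two convolutional layers of the fusion path are replaced by two per-pixel linear layers, and the ReLU between them is assumed. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_row(left_row, right_row):
    """left_row, right_row: (W, C) features lying on the same epipolar line.

    Returns the attended features (W, C) and the (W, W) similarity weights
    between every left pixel and every right pixel on that line.
    """
    scores = left_row @ right_row.T / np.sqrt(left_row.shape[1])
    weights = softmax(scores, axis=-1)
    return weights @ right_row, weights

def fuse(main_feat, active_feat, w1, w2):
    """Concatenate both paths, then mix with two per-pixel linear layers
    (stand-ins for the two convolutional layers; the ReLU is an assumption)."""
    stacked = np.concatenate([main_feat, active_feat], axis=-1)  # (W, 2C)
    hidden = np.maximum(stacked @ w1, 0.0)
    return hidden @ w2

rng = np.random.default_rng(0)
W, C = 6, 4
rgb_left, rgb_right = rng.normal(size=(W, C)), rng.normal(size=(W, C))
act_left, act_right = rng.normal(size=(W, C)), rng.normal(size=(W, C))

main_out, main_sim = cross_attention_row(rgb_left, rgb_right)      # main matching path
active_out, active_sim = cross_attention_row(act_left, act_right)  # active stereo path
fused = fuse(main_out, active_out, rng.normal(size=(2 * C, C)), rng.normal(size=(C, C)))
```

In this sketch `fused` would feed the main matching path of the next layer while `active_out` would continue along the active stereo path, mirroring the data flow in the claim.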

7. The method of claim 5, wherein the attention mask, in addition to ensuring the uniqueness of matching points between the two feature maps, also takes the positional information of the matching points into account: for a pair of matching points on an epipolar line, with coordinates increasing from left to right, the coordinates satisfy an ordering constraint, so that the final matching points lie within a lower triangular matrix. The optimal transport takes the features output by the attention mask as input; to ensure the uniqueness of matching between pixels, entropy-regularized optimal transport is introduced, which is based on soft assignment and is differentiable, so that gradients flow normally. The coarse disparity and occlusion regression first applies the winner-takes-all principle to find the optimal matching point, then constructs a 3-pixel window around that matching point, normalizes the matching probabilities of the pixels within the window, and finally obtains the coarse disparity as the sum of the candidate disparities within the window weighted by their normalized probabilities; the occlusion probability of each point is computed as well. The context adaptation layer uses convolutional blocks and the Sigmoid function to complete the final occlusion estimation, and uses residual blocks and the ReLU function to complete the final disparity estimation, thereby aggregating contextual information across different epipolar lines.
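
The masking and coarse regression steps can be sketched for a single epipolar line as below. Several points are assumptions because the formulas are not reproduced in this text: the mask follows the usual rectified-stereo convention that a right-image match lies at or to the left of its left-image pixel, a plain row-wise softmax stands in for the entropy-regularized optimal transport, and the occlusion probability is taken as one minus the probability mass inside the 3-pixel window. Function names are hypothetical.

```python
import numpy as np

def masked_assignment(scores):
    """scores: (W, W), rows = left pixels, columns = right pixels.

    Keeps only the lower triangle (column index <= row index), then applies a
    row-wise softmax as a simple stand-in for optimal transport.
    """
    w = scores.shape[0]
    allowed = np.tril(np.ones((w, w), dtype=bool))
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=1, keepdims=True)

def coarse_disparity(assignment):
    """Winner-takes-all match, then a probability-weighted sum over a 3-pixel window."""
    w = assignment.shape[0]
    disparity = np.zeros(w)
    occlusion = np.zeros(w)
    for i in range(w):
        best = int(np.argmax(assignment[i]))            # winner-takes-all
        lo, hi = max(best - 1, 0), min(best + 1, w - 1)
        cols = np.arange(lo, hi + 1)                    # 3-pixel window (clipped at borders)
        probs = assignment[i, cols]
        total = probs.sum()
        normalised = probs / total                      # normalise within the window
        disparity[i] = np.sum(normalised * (i - cols))  # candidate disparity = i - column
        occlusion[i] = 1.0 - total                      # assumed occlusion estimate
    return disparity, occlusion

# Hypothetical scores: every left pixel i >= 1 matches right pixel i - 1.
W = 5
scores = np.zeros((W, W))
for i in range(1, W):
    scores[i, i - 1] = 10.0
assignment = masked_assignment(scores)
disparity, occlusion = coarse_disparity(assignment)
```

A full implementation would replace the softmax with Sinkhorn iterations so that both rows and columns are constrained, which is what enforces one-to-one matching.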

8. The method of claim 5, wherein, in the joint learning process, the parameters of the two-branch stereo matching network are collectively denoted by one symbol and the parameters of the speckle structured light are collectively denoted by another; the end-to-end joint optimization problem is then formulated over both sets of parameters in terms of the predicted left disparity map, the ground-truth left disparity, the predicted left occlusion map, and the ground-truth left occlusion; for the disparity loss, a relative response loss and an L1 loss function are used, and for the occlusion loss, a binary cross-entropy loss is used.
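
The three loss terms named in this claim can be sketched as below. The objective itself is not reproduced in this text, so the following are assumptions: the relative response loss is taken in its commonly used form as the negative log of the matching probability at the ground-truth location (with integer ground-truth disparities for simplicity), and the terms are combined as a weighted sum whose weights are hypothetical.

```python
import numpy as np

def l1_loss(pred_disp, true_disp, valid):
    """Mean absolute disparity error over valid pixels."""
    return float(np.abs(pred_disp - true_disp)[valid].mean())

def bce_loss(pred_occ, true_occ, eps=1e-7):
    """Binary cross-entropy between predicted occlusion probability and ground truth."""
    p = np.clip(pred_occ, eps, 1.0 - eps)
    return float(-(true_occ * np.log(p) + (1.0 - true_occ) * np.log(1.0 - p)).mean())

def relative_response_loss(assignment, true_disp, valid, eps=1e-7):
    """Assumed form: -log of the probability assigned to the true match.

    assignment: (W, W) with rows = left pixels, columns = right pixels.
    true_disp:  integer disparities, so left pixel i matches column i - d.
    valid:      boolean mask; valid pixels must have 0 <= i - d < W.
    """
    rows = np.arange(assignment.shape[0])[valid]
    cols = rows - true_disp[valid].astype(int)
    probs = np.clip(assignment[rows, cols], eps, 1.0)
    return float(-np.log(probs).mean())

def joint_loss(assignment, pred_disp, true_disp, pred_occ, true_occ, valid,
               w_rr=1.0, w_l1=1.0, w_bce=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_rr * relative_response_loss(assignment, true_disp, valid)
            + w_l1 * l1_loss(pred_disp, true_disp, valid)
            + w_bce * bce_loss(pred_occ, true_occ))
```

Because the synthesized active stereo images depend on the speckle parameters, a single scalar loss of this kind can send gradients to both the network weights and the pattern parameters during joint training.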

9. The method of claim 1, wherein, in the actual projection of the optimized speckle structured light, two monocular cameras and a projector are employed: the projector projects the optimized speckle structured light pattern onto the scene under test, the cameras acquire active stereo image pairs, and the active stereo image pairs acquired by the cameras are input into the trained two-branch stereo matching network to complete depth acquisition.
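
Since the network produces disparity and occlusion maps, a final conversion to metric depth is needed at test time. The sketch below uses the standard triangulation relation for a rectified camera pair, depth = focal length × baseline / disparity. The patent text does not spell out this step, so the function name, the occlusion threshold, and the example calibration values are assumptions.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_cm, occlusion=None, occ_threshold=0.5):
    """Convert a disparity map (pixels) to depth (cm) for a rectified stereo pair.

    Pixels with non-positive disparity, or with occlusion probability at or
    above occ_threshold, are marked invalid with depth 0.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.zeros_like(disparity_px)
    valid = disparity_px > 0
    if occlusion is not None:
        valid &= np.asarray(occlusion) < occ_threshold
    depth[valid] = focal_px * baseline_cm / disparity_px[valid]
    return depth

# Hypothetical calibration: 1000 px focal length, 5 cm baseline.
disparity = np.array([[50.0, 0.0], [25.0, 100.0]])
occlusion = np.array([[0.1, 0.1], [0.9, 0.1]])
depth = disparity_to_depth(disparity, 1000.0, 5.0, occlusion)
```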