Generalized stereo matching method and device based on structure and motion cues

By introducing structural and motion cues into stereo matching and performing multiple rounds of iterative optimization, the invention addresses the inability of traditional recurrent structures to inherit monocular depth priors, thereby improving the generalization ability of stereo matching and the stability of disparity prediction.

CN122289342A · Pending · Publication Date: 2026-06-26 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HUAZHONG UNIV OF SCI & TECH
Filing Date: 2026-02-06
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing stereo matching techniques have not fully considered the impact of the iterative optimization stage on zero-shot generalization performance. Traditional recurrent structures struggle to inherit the rich prior knowledge of monocular depth foundation models, and scale inconsistency is easily introduced when monocular depth predictions are fused with binocular disparity information, which affects geometric consistency and iterative convergence.

Method used

By introducing structural and motion cues, an initial fused disparity map is generated using the matching cost volume and monocular relative depth map. Multiple rounds of iterative optimization are performed in the cue loop unit, and the hidden state is recursively updated in combination with structural and motion cues to avoid distorted state information and ambiguous guidance.

Benefits of technology

This improves the zero-shot generalization ability of stereo matching methods in unknown scenarios and across datasets, reduces the dependence on large-scale labeled data and scene-specific training, and achieves stable disparity prediction.


[Figure: CN122289342A_ABST, abstract drawing]

Abstract

This invention belongs to the field of image processing technology and specifically discloses a generalized stereo matching method and device based on structure and motion cues. The method includes: aligning image features based on disparity information to generate confidence information; fusing binocular disparity information and monocular depth information according to the confidence information to obtain an initial fused disparity map; aligning binocular image features based on the initial fused disparity map; and initializing the hidden state of the iteration process. During the iteration, structural cues are constructed from the geometric consistency information between the current disparity and the monocular depth, and motion cues are constructed from the cost information obtained during stereo matching. The hidden state is recursively updated accordingly and the disparity is progressively corrected, and the final disparity map is output. This invention effectively improves the zero-shot generalization ability of stereo matching methods in unseen scenes and across datasets, and reduces the dependence on large-scale labeled data and scene-specific training.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, and more specifically, relates to a generalized stereo matching method and device based on structure and motion cues.

Background Technology

[0002] Stereo matching is a key technique for estimating scene depth by analyzing the disparity between rectified image pairs. It produces pixel-level dense depth maps and plays an important role in 3D scene understanding. The technology is widely used in fields such as autonomous driving, robot navigation, augmented reality, and virtual reality, and has a significant impact on environmental perception accuracy and system reliability.

[0003] With the rapid development of monocular depth models, zero-shot generalized stereo matching has gradually become a research hotspot. Related research shows that monocular depth models, trained on large-scale data, can learn rich geometric priors. Introducing these priors into stereo matching tasks helps improve the model's generalization performance across datasets and in unknown scenarios. Based on this idea, existing techniques mainly focus on using monocular depth models to extract depth-aware features, construct more robust matching cost volumes, or generate more accurate initial disparity results, thereby reducing dependence on specific training data distributions.

[0004] However, in existing technologies, the impact of the iterative optimization phase on zero-shot generalization performance has not been adequately considered. Some methods attempt to introduce monocular depth priors as guiding information during the iteration process, but they typically still rely on traditional recurrent structures such as gated recurrent units for disparity updates. These recurrent structures generally require training from scratch, making it difficult to inherit the rich prior knowledge inherent in the monocular depth base model, thus limiting their hidden state representation capabilities and scalability. Furthermore, traditional recurrent structures often impose strong range constraints on the hidden states, which can easily lead to representational limitations when facing large disparity variations or complex geometric structures. Simultaneously, their direct convolutional fusion of external guiding information and internal states may distort the original state information and compress the guiding information, thereby reducing the effectiveness of iterative optimization.

[0005] On the other hand, monocular depth predictions typically carry scale and offset uncertainties, so their output is meaningful only as relative depth. If they are fused directly with binocular disparity information without proper processing, scale inconsistency is easily introduced, which affects geometric consistency and subsequent iterative convergence.

[0006] Therefore, existing stereo matching techniques still have shortcomings in using monocular depth models to improve zero-shot generalization, particularly in the structural design of iterative optimization, the effective fusion of monocular priors with binocular matching information, and scale consistency processing. Further improvements are still needed in these areas.

Summary of the Invention

[0007] To address the shortcomings or improvement needs in existing technologies regarding the comprehensive integration of monocular structural and binocular motion cues to enhance the generalization ability of stereo matching iterations and achieve accurate disparity map prediction, this invention provides a generalized stereo matching method and device based on structural and motion cues. This method guides the stereo matching process by introducing structural and motion cues, achieving stable disparity prediction for different scenes without requiring retraining for the target scene. The method extracts features from first and second images, using the matching cost volume and monocular relative depth map to obtain an initial fused disparity map. Structural and motion cues are injected into the cue loop unit to avoid distorted state information and ambiguous guidance. A cue loop unit inheriting monocular depth priors iteratively optimizes the initial fused disparity map through multiple rounds to obtain the final disparity map. By integrating monocular structural and binocular motion cues, this invention effectively improves the zero-shot generalization ability of stereo matching methods in unknown scenes and across datasets, reducing reliance on large-scale labeled data and scene-specific training, thus possessing high practical value.

[0008] To achieve the above objectives, according to one aspect of the present invention, a generalized stereo matching method based on structure and motion cues is proposed, comprising the following steps:
Step 1: Obtain binocular disparity information based on stereo matching and depth information based on a monocular depth model. After performing scale-independent normalization on the binocular disparity information and the monocular depth information, align the image features based on the binocular disparity information to generate confidence information. Then fuse the binocular disparity information and the monocular depth information according to the confidence information to obtain an initial fused disparity map.
Step 2: Align the features of the binocular images based on the initial fused disparity map.
Step 3: Initialize the hidden state of the iteration process. During the iteration, construct structural cues based on the geometric consistency information between the current disparity and the monocular depth, and construct motion cues by combining the cost information obtained during stereo matching.
Step 4: Under the combined guidance of the structural and motion cues, recursively update the hidden state and progressively correct the disparity.
Step 5: Repeat the iterative process of Steps 3 and 4 until the preset conditions are met, and output the final disparity map.

[0009] As a further preferred option, step one specifically includes the following steps:
(11) Obtain the initial binocular disparity map from the cost volume and the relative depth map from the monocular depth model, calculate their respective affine normalization parameters, and perform scale-independent normalization on both.
(12) Based on the normalized initial binocular disparity, project and align the relative depth to the disparity space; use the initial disparity to align the second image features, and generate a confidence map together with the first image features.
(13) Based on the confidence map, perform weighted fusion of the initial binocular disparity and the aligned relative depth, and map the result back to the disparity space through inverse normalization to obtain the initial fused disparity map.
Preferably, the generation of the initial fused disparity map specifically includes: calculating the median and scale factor of the initial binocular disparity map (from the matching cost volume) and of the monocular relative depth map (from the monocular depth model), and performing affine-invariant normalization on each; then, using the median and scale factor of the initial binocular disparity map, linearly mapping the normalized monocular relative depth map back to the binocular disparity space.
Preferably, the affine-invariant normalization of the initial binocular disparity map and the monocular relative depth map satisfies $\hat{x} = (x - t(x)) / s(x)$, where $x$ is the initial disparity or relative depth to be normalized, $t(x)$ is the median of $x$, $s(x)$ is the scale factor of $x$, and $\hat{x}$ is the result after normalization.
Preferably, the linear mapping of the normalized monocular relative depth map back to the binocular disparity space satisfies $z_a = s(d_0)\,\hat{z} + t(d_0)$, where $d_0$ is the initial binocular disparity obtained through the cost volume, $s(d_0)$ is the scale factor of the initial disparity, $t(d_0)$ is the median of the initial disparity, $\hat{z}$ is the normalized relative depth, and $z_a$ is the relative depth after alignment.

[0010] As a further preferred embodiment, in step three, initializing the hidden state of the iteration process includes:
(31) Obtain the image features of the first and second images at multiple resolution levels;
(32) Based on the initial fused disparity, backward-warp the second image features to align them spatially with the first image features;
(33) Concatenate the aligned first and second image features and input them into a convolution block to generate the initial hidden state at each resolution level.
Preferably, the feature map of the second image is backward-warped based on the initial fused disparity map; the warped feature map of the second image is then concatenated with the feature map of the first image, and a multi-resolution initial hidden state is generated through convolution processing: $h_0 = \mathrm{Conv}([F_1, \tilde{F}_2])$, where $h_0$ is the generated initial hidden state, $F_1$ denotes the features of the first image, $\tilde{F}_2$ denotes the second image features warped using the initial disparity, $[\cdot,\cdot]$ denotes concatenation, and $\mathrm{Conv}$ is the convolutional block.
As a further preferred embodiment, in step one, fusing the binocular disparity information and the monocular depth information based on the confidence information includes: based on the initial binocular disparity map, the feature map of the second image is projected to obtain a recovered first feature map; the recovered first feature map and the first feature map are concatenated, and a confidence map is generated through convolution and an activation function; the initial binocular disparity map and the mapped monocular relative depth map are then fused pixel by pixel according to the confidence map, satisfying $d_f = c \odot d_0 + (1 - c) \odot z_a$, where $d_f$ is the fused disparity, $c$ is the confidence map, $\odot$ denotes element-wise multiplication, $d_0$ is the initial disparity, and $z_a$ is the relative depth after alignment.

[0011] As a further preferred embodiment, constructing structural cues based on the geometric consistency information between the current disparity and the monocular depth includes: the disparity map obtained in the current iteration is subjected to affine-invariant normalization, and difference features are calculated between it and the normalized monocular relative depth map: $\Delta_k = \hat{d}_k - \hat{z}$, where $\Delta_k$ is the difference between the normalized current disparity and the normalized relative depth, $\hat{d}_k$ is the normalized disparity in the current iteration, and $\hat{z}$ is the normalized relative depth. Structural cue features are generated by jointly encoding the difference features with the frozen monocular depth features, and are then superimposed onto the hidden state of the iteration process in the form of residuals to guide the disparity update in geometrically inconsistent regions: $f_{str} = E_{str}(\Delta_k, F_{mono})$, $h_{str} = h + \mathrm{Conv}(f_{str})$, where $f_{str}$ denotes the structural cue features, $E_{str}$ is a structure encoder with the same architecture as existing methods, $F_{mono}$ denotes the monocular global depth features, $h$ is the hidden state before the cue is added, $h_{str}$ is the intermediate hidden state, and $\mathrm{Conv}$ is a convolutional block.
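
The difference feature described above can be illustrated with a minimal NumPy sketch. The function names are hypothetical, and the scale factor is assumed here to be the mean absolute deviation from the median, which the text does not specify. The learned parts (the structure encoder and the frozen monocular features) are omitted; only the geometric-consistency residual is shown.

```python
import numpy as np

def normalize(x, eps=1e-6):
    """Affine-invariant normalization: subtract the median, divide by a scale.
    The mean absolute deviation is an assumed choice of scale factor."""
    t = np.median(x)
    s = np.mean(np.abs(x - t)) + eps
    return (x - t) / s

def structure_difference(disp_current, rel_depth):
    """Difference between the normalized current disparity and the
    normalized monocular relative depth (large where they disagree)."""
    return normalize(disp_current) - normalize(rel_depth)

# Relative depth with an arbitrary scale and offset relative to disparity.
rel_depth = np.arange(1.0, 10.0).reshape(3, 3)
disp_ok = 3.0 * rel_depth + 5.0        # geometrically consistent disparity
disp_bad = disp_ok.copy()
disp_bad[0, 0] += 30.0                 # one inconsistent pixel

diff_ok = structure_difference(disp_ok, rel_depth)
diff_bad = structure_difference(disp_bad, rel_depth)
```

In this toy case the residual is close to zero everywhere for the consistent disparity, and its magnitude peaks at the perturbed pixel, which is the kind of region the structural cue is meant to flag.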

[0012] As a further preferred embodiment, constructing motion cues by combining the cost information obtained during stereo matching includes: based on the disparity map of the current iteration, the corresponding local cost volume features are indexed in the cost volume; the local cost volume features and the current disparity map are input into the motion encoder to generate motion cue features; and the motion cue features are superimposed onto the hidden state in the form of residuals: $f_{mot} = E_{mot}(C_k, d_k)$, $h_{mot} = h + \mathrm{Conv}(f_{mot})$, where $f_{mot}$ denotes the motion cue features, $E_{mot}$ is the motion encoder, $C_k$ is the local cost volume in the current iteration, $d_k$ is the current disparity, $h$ is the hidden state before the cue is added, $h_{mot}$ is the intermediate hidden state, and $\mathrm{Conv}$ is a convolutional block.
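
Indexing local cost features around the current disparity might look like the following sketch. It assumes a cost volume laid out as (disparity, height, width) with integer disparity bins and uses linear interpolation between bins; the layout, the `radius` parameter, and the function name are illustrative assumptions, and the motion encoder itself is not modeled.

```python
import numpy as np

def lookup_local_cost(cost_volume, disp, radius=2):
    """Sample the cost volume at disp + offset for offset in [-radius, radius].

    cost_volume: (D, H, W) array, one cost slice per integer disparity.
    disp:        (H, W) array of current (possibly fractional) disparities.
    Returns an array of shape (2 * radius + 1, H, W).
    """
    D, H, W = cost_volume.shape
    rows = np.arange(H)[:, None]
    cols = np.arange(W)[None, :]
    samples = []
    for offset in range(-radius, radius + 1):
        d = np.clip(disp + offset, 0.0, D - 1.0)   # stay inside the volume
        d0 = np.floor(d).astype(int)               # lower disparity bin
        d1 = np.minimum(d0 + 1, D - 1)             # upper disparity bin
        w1 = d - d0                                # interpolation weight
        samples.append((1.0 - w1) * cost_volume[d0, rows, cols]
                       + w1 * cost_volume[d1, rows, cols])
    return np.stack(samples, axis=0)

# Toy volume whose cost equals the disparity index at every pixel.
cost = np.arange(8.0)[:, None, None] * np.ones((1, 2, 3))
disp = np.full((2, 3), 3.5)
local = lookup_local_cost(cost, disp, radius=1)
```

Because the lookup is centered on the current disparity, the sampled window moves as the disparity is refined, so each iteration sees matching evidence near its current estimate.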

[0013] As a further preferred option, in step five, the iterative update of the disparity includes the following.
During the update, the formula for the update gate varies with the resolution level. Depending on the level, the update gate is computed through the Sigmoid activation function from quantities including the hidden state of the current level, the hidden state of the previous level, the structural cue features, and the motion cue features.
To update the hidden state, the hidden state of the higher-resolution level is first processed by residual blocks and added to the hidden state of the current level, and the sum undergoes further residual block processing. For the initial resolution level, the structural cue features and the motion cue features are additionally processed by convolutional blocks and added to the hidden state in sequence. The hidden state is then passed through a 1×1 convolution to adjust the feature dimensions. Finally, using the update gate, the processed hidden state and the original hidden state are weighted and fused by element-wise multiplication to obtain the updated hidden state. The quantities involved are the adjusted initial hidden state, the hidden state of the current level, the residual convolution blocks, the hidden state of the higher-resolution level, the 1×1 convolution, the updated hidden state, and element-wise multiplication.
The updated hidden state is then input into a convolutional block to obtain the disparity update, which is added to the current disparity to obtain the updated disparity: $d_{k+1} = d_k + \Delta d_k$, where $d_{k+1}$ is the updated disparity, $d_k$ is the current disparity, and $\Delta d_k$ is the disparity update.
It is then determined whether the current iteration number has reached the preset iteration number. When the current iteration number is greater than or equal to the preset iteration number, the iteration stops, and the disparity update is added to the input disparity map to obtain the final disparity map. When the current iteration number is less than the preset iteration number, the disparity update is added to the input disparity map, and the result is used as the new input disparity map in the next iteration.
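
The control flow of the gated update and the fixed-count iteration can be sketched as follows. This is only an illustration under stated assumptions: the gate is taken to weight the processed state against the original state as a convex combination (the exact formula is not given in the text), and `toy_step` is a hypothetical stand-in for the learned residual and convolution blocks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(hidden, candidate, gate_logits):
    """Blend the original and processed hidden states with an update gate.
    Assumed form: h_new = (1 - z) * h + z * h_processed, z = sigmoid(logits)."""
    z = sigmoid(gate_logits)
    return (1.0 - z) * hidden + z * candidate

def refine(disp, hidden, step_fn, num_iters):
    """Run a preset number of iterations; each adds a disparity update."""
    for _ in range(num_iters):
        hidden, delta = step_fn(hidden, disp)
        disp = disp + delta            # updated disparity = current + update
    return disp

# Hypothetical stand-in for the learned blocks: the processed state encodes
# the remaining error, and the disparity head outputs half of it.
target = np.array([[8.0, 16.0]])

def toy_step(hidden, disp):
    candidate = target - disp
    hidden = gated_update(hidden, candidate, np.full_like(disp, 20.0))
    return hidden, 0.5 * hidden

final = refine(np.zeros_like(target), np.zeros_like(target), toy_step, num_iters=20)
```

With this toy step the error halves on every pass, so the loop illustrates how repeated residual updates progressively correct the disparity before the preset iteration count stops the process.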

[0014] According to another aspect of the invention, a generalized stereo matching system based on structure and motion cues is also provided, comprising an affine invariant fusion module and a cue loop unit module, wherein:
The affine invariant fusion module is used to acquire binocular disparity information obtained based on stereo matching and depth information obtained based on a monocular depth model. After performing scale-independent normalization on the binocular disparity information and the monocular depth information, the image features are aligned based on the binocular disparity information to generate confidence information. The binocular disparity information and the monocular depth information are then fused according to the confidence information to obtain an initial fused disparity map.
The cue loop unit module is used to align the features of the binocular images based on the initial fused disparity map and then initialize the hidden state of the iteration process. During the iteration, structural cues are constructed based on the geometric consistency information between the current disparity and the monocular depth, and motion cues are constructed by combining the cost information obtained during stereo matching. Under the joint guidance of the structural and motion cues, the hidden state is recursively updated and the disparity is progressively corrected; the iteration continues until the preset conditions are met, and the final disparity map is output.
Preferably, the system further includes a feature extraction module and a cost volume construction module, wherein the feature extraction module, the cost volume construction module, the affine invariant fusion module, and the cue loop unit module are connected in sequence, wherein:
The feature extraction module is used to input the first image and the second image into a feature extraction network, which extracts features from both images and sends them to the cost volume construction module.
The cost volume construction module is used to calculate a group-wise correlation cost volume from the feature maps of the first image and the second image, filter this cost volume with a lightweight 3D convolutional network to obtain a geometry encoding volume, and finally calculate an all-pairs cost volume based on global feature similarity, which is pooled to construct a multi-scale cost volume pyramid that provides multi-scale matching relationships.
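
The all-pairs cost volume and its pooled pyramid can be sketched in NumPy as below. This covers only that last part of the module; the group-wise correlation and the 3D convolutional filtering are omitted. Dividing by the square root of the channel count and average-pooling along the second-image width axis are assumptions borrowed from common practice, not details stated in the text.

```python
import numpy as np

def all_pairs_correlation(feat_left, feat_right):
    """Dot-product similarity between every pair of pixels on the same row.

    feat_left, feat_right: (C, H, W) feature maps.
    Returns corr of shape (H, W, W), where corr[y, i, j] compares left
    pixel (y, i) with right pixel (y, j).
    """
    C = feat_left.shape[0]
    return np.einsum("chi,chj->hij", feat_left, feat_right) / np.sqrt(C)

def build_pyramid(corr, num_levels=3):
    """Average-pool the last axis by 2 at each level to get coarser matches."""
    pyramid = [corr]
    for _ in range(num_levels - 1):
        c = pyramid[-1]
        w = c.shape[-1] // 2 * 2                       # drop an odd trailing column
        pooled = c[..., :w].reshape(*c.shape[:-1], w // 2, 2).mean(axis=-1)
        pyramid.append(pooled)
    return pyramid

rng = np.random.default_rng(0)
feat_left = rng.standard_normal((4, 2, 8))
feat_right = rng.standard_normal((4, 2, 8))
corr = all_pairs_correlation(feat_left, feat_right)
pyramid = build_pyramid(corr, num_levels=3)
```

Each coarser level summarizes similarity over a wider span of candidate positions, which is one way to provide the multi-scale matching relationships mentioned above.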

[0015] According to another aspect of the invention, an electronic device is also provided, comprising at least one processor, at least one memory, and a communication interface, wherein the processor, the memory, and the communication interface communicate with each other. The memory stores program instructions executable by the processor, and the processor invokes the program instructions to execute the generalized stereo matching method based on structure and motion cues according to any of the above embodiments.

[0016] According to another aspect of the invention, a non-transitory computer-readable storage medium is also provided. The non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the generalized stereo matching method based on structure and motion cues according to any of the above embodiments.

[0017] In summary, compared with the prior art, the above-described technical solutions conceived by this invention mainly possess the following technical advantages: This invention guides the stereo matching process by introducing structural and motion cues, achieving stable disparity prediction for different scenes without requiring retraining for the target scene. The method extracts features from the first and second images, using the matching cost volume and monocular relative depth map to obtain an initial fused disparity map. A cue loop unit inheriting monocular depth priors iteratively optimizes the initial fused disparity map through multiple rounds to obtain the final disparity map. Structural and motion cues are injected into the cue loop unit to avoid distorted state information and ambiguous guidance. By fusing monocular structural and binocular motion cues, this invention effectively improves the zero-shot generalization ability of stereo matching methods in unknown scenes and across datasets, reducing reliance on large-scale labeled data and scene-specific training, thus possessing high practical value.

Description of the Drawings

[0018] Figure 1 is a flowchart of a generalized stereo matching method based on structure and motion cues provided in an embodiment of the present invention;
Figure 2 is a flowchart of the affine invariant fusion method in the generalized stereo matching method based on structure and motion cues provided in an embodiment of the present invention;
Figure 3 is a flowchart of the method for initializing the hidden state in the generalized stereo matching method based on structure and motion cues provided in an embodiment of the present invention;
Figure 4 is a flowchart of the method for constructing structural and motion cues in the generalized stereo matching method based on structure and motion cues provided in an embodiment of the present invention;
Figure 5 is a flowchart of the method for iteratively updating the cue loop unit in the generalized stereo matching method based on structure and motion cues provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of another generalized stereo matching system based on structure and motion cues provided in an embodiment of the present invention;
Figure 7 is a schematic structural diagram of an electronic device storing a generalized stereo matching method based on structure and motion cues, provided in an embodiment of the present invention.

Detailed Description of the Embodiments

[0019] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0020] Example 1: This embodiment provides a generalized stereo matching method based on structure and motion cues. As shown in Figure 1, the method includes: Step 1: Obtain binocular disparity information based on stereo matching and depth information based on a monocular depth model, and perform scale-independent normalization on both; align image features based on the disparity information to generate confidence information, and fuse the binocular disparity information and the monocular depth information according to the confidence information to obtain an initial fused disparity map.

[0021] Step 2: Align the features of the binocular image based on the initial fused disparity map and initialize the hidden state during the iteration process; during the iteration process, construct structural cues based on the geometric consistency information between the current disparity and the monocular depth, and construct motion cues by combining the cost information obtained during stereo matching.

[0022] Step 3: Guided by both structural and motion cues, recursively update the hidden state and progressively correct the disparity; repeat the above iterative process until the preset conditions are met, and output the final disparity map.

[0023] In step one, the initial binocular disparity information is obtained from the cost volume constructed by stereo matching, and the relative depth information is predicted by the monocular depth model. The two are normalized in a scale-independent manner, then aligned and fused in a unified scale space to obtain the initial fused disparity map.

[0024] In this embodiment, the initial binocular disparity information can be obtained by any existing stereo matching method, such as the disparity result obtained by constructing a cost volume based on feature correlation and performing disparity regression; the relative depth information can be predicted by a pre-trained monocular depth model, which reflects the relative distance relationship between different pixels in the scene, but does not have an absolute scale.

[0025] Because binocular disparity and monocular relative depth differ in numerical distribution and scale, direct fusion can easily bias the result towards one of them, which affects the stability of subsequent iterations. Therefore, in this embodiment, scale-independent normalization is first performed on both to make them comparable within a unified numerical space.

[0026] Furthermore, in this embodiment, the image features of the second image are spatially aligned using the normalized initial disparity, and combined with the image features of the first image to evaluate the reliability of the initial disparity in different regions and generate corresponding confidence information. Based on the confidence information, the initial binocular disparity information and the aligned relative depth information are weighted and fused to obtain an initial fused disparity map.

[0027] By fully combining binocular geometric constraints and monocular structural priors before the iteration begins, this step yields a more stable and reasonable initial result as the basis for subsequent disparity refinement.

[0028] In step two, the binocular image features are aligned based on the initial fused disparity map, and the hidden state of the loop unit is initialized during the iterative update process. During the iteration process, structural cues are constructed based on the current disparity and monocular depth information, and motion cues are constructed by combining the matching information in the cost volume.

[0029] In this embodiment, the image features of the second image are first backward-warped using the initial fused disparity to align them spatially with the image features of the first image. Then, the aligned second image features are concatenated with the first image features, and convolution processing is used to generate the initial hidden states of the recurrent unit at each resolution level.
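
The warping and concatenation described in this paragraph can be sketched as follows. The example assumes rectified images in which a first-image pixel at column x corresponds to the second-image pixel at column x minus the disparity, and it uses linear interpolation along the width. The function name is hypothetical, and the convolution block that turns the concatenated features into hidden states is not modeled.

```python
import numpy as np

def warp_second_to_first(feat_second, disp):
    """Backward-warp second-image features into the first view.

    feat_second: (C, H, W) features of the second image.
    disp:        (H, W) disparity defined in the first view, in pixels.
    """
    C, H, W = feat_second.shape
    xs = np.arange(W, dtype=np.float64)[None, :] - disp   # sampling columns
    xs = np.clip(xs, 0.0, W - 1.0)                        # clamp at the border
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = xs - x0
    rows = np.arange(H)[:, None]
    return (1.0 - w1) * feat_second[:, rows, x0] + w1 * feat_second[:, rows, x1]

C, H, W = 2, 3, 6
feat_first = np.arange(C * H * W, dtype=np.float64).reshape(C, H, W)
# Synthetic second view: the same content shifted by 2 pixels (wraps at the edge).
feat_second = np.roll(feat_first, -2, axis=2)
disp = np.full((H, W), 2.0)

warped = warp_second_to_first(feat_second, disp)
stacked = np.concatenate([feat_first, warped], axis=0)   # input to the conv block
```

When the disparity is correct, the warped features match the first-image features wherever the source pixel is visible, so the concatenated tensor gives the convolution block a direct view of residual misalignment.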

[0030] In subsequent iterations, for each round of disparity update, this embodiment constructs structural cues and motion cues simultaneously. The structural cues reflect the geometric consistency between the current disparity and the monocular depth prior; they are obtained by normalizing the scale of the two and calculating their difference, which characterizes the structural plausibility of the current disparity. The motion cues reflect the changes in pixel correspondence during stereo matching; they are constructed by extracting local matching information related to the current disparity from the cost volume.

[0031] By introducing both structural and motion cues in each iteration, richer and more complementary guidance information can be provided for the disparity update process without compromising monocular depth priors.

[0032] In step three, under the joint guidance of structural and motion cues, the hidden state of the loop unit is recursively updated, and the current disparity is corrected based on the updated hidden state; the above iterative process is repeated until the preset number of iterations or convergence conditions are reached, and the final disparity map is output.

[0033] In this embodiment, the loop unit adopts a multi-resolution structure, updating the hidden state at different resolution levels to gradually refine the disparity results. In each iteration, the loop unit generates a disparity residual based on the current hidden state and the cue information, and superimposes the residual onto the current disparity map to achieve gradual correction of the disparity.

[0034] Unlike traditional disparity update methods based on gated recurrent units, the recurrent unit in this embodiment no longer relies on complex gating structures. Instead, it guides the hidden state directly through cues, thereby improving the model's expressive power while avoiding damage to the original monocular depth prior. Through multiple rounds of iterative updates, the disparity results gradually converge, ultimately yielding a high-precision disparity map.

[0035] As shown in Figure 2, this embodiment provides a sub-process for generating an initial fused disparity map, which specifically includes the following steps: Step (201): Obtain the initial binocular disparity map based on the stereo matching cost volume and the relative depth map predicted by the monocular depth model, calculate the corresponding affine normalization parameters for each, and perform scale-independent normalization on the initial binocular disparity map and the relative depth map.

[0036] In this embodiment, the initial binocular disparity map can be obtained by constructing cost volumes from the first and second images and performing disparity regression, reflecting the pixel displacement relationship between the first and second images; the relative depth map can be directly predicted by a pre-trained monocular depth model, reflecting the relative distance relationship between pixels in the image. Since the two maps originate from different sources, they typically differ significantly in numerical scale and distribution range. Therefore, they need to be affine-invariantly normalized separately to eliminate the influence of scale and offset, providing a unified representation basis for subsequent fusion.

[0037] In this embodiment, affine-invariant normalization subtracts the median and divides by the scale factor: $t(x) = \mathrm{median}(x)$, $\hat{x} = (x - t(x)) / s(x)$, where $x$ is the initial disparity or relative depth to be normalized, $t(x)$ is the median of $x$, $s(x)$ is the scale factor of $x$, and $\hat{x}$ is the result after normalization. In this embodiment, the normalized monocular depth map is linearly mapped back to the binocular disparity space, satisfying $z_a = s(d_0)\,\hat{z} + t(d_0)$, where $d_0$ is the initial binocular disparity obtained through the cost volume, $s(d_0)$ is the scale factor of the initial disparity, $t(d_0)$ is the median of the initial disparity, $\hat{z}$ is the normalized relative depth, and $z_a$ is the relative depth after alignment.
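
A minimal NumPy sketch of the normalization and the mapping back to disparity space follows. Two things are assumed rather than taken from the text: the scale factor is the mean absolute deviation from the median, and the relative depth is in a disparity-like (inverse-depth) form so that it relates to disparity by a positive scale and an offset. Function names are illustrative.

```python
import numpy as np

def affine_normalize(x, eps=1e-6):
    """Return (normalized map, median, scale) for a 2-D array."""
    t = np.median(x)                      # shift: median of the map
    s = np.mean(np.abs(x - t)) + eps      # scale: mean absolute deviation (assumed)
    return (x - t) / s, t, s

def align_to_disparity(rel_depth, disp_init):
    """Map a relative-depth map into the disparity space of disp_init."""
    z_hat, _, _ = affine_normalize(rel_depth)
    _, t_d, s_d = affine_normalize(disp_init)
    return s_d * z_hat + t_d              # inverse normalization with disparity stats

disp_init = np.array([[10.0, 12.0], [14.0, 20.0]])
rel_depth = 0.5 * disp_init + 3.0         # same structure, different scale and offset
aligned = align_to_disparity(rel_depth, disp_init)
```

In this toy case the relative depth is an exact affine function of the disparity, so alignment recovers the disparity up to the small epsilon used for numerical safety. With real predictions the two maps differ, and the aligned map supplies the monocular structure in disparity units.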

[0038] Step (202): Based on the normalized initial binocular disparity map, the normalized relative depth map is projected and aligned to the disparity space. At the same time, the image features of the second image are aligned using the initial binocular disparity map, and a confidence map is generated together with the image features of the first image.

[0039] In this embodiment, disparity guidance maps the relative depth map to spatial coordinates consistent with the binocular disparity, making the two comparable at the pixel level. At the same time, the second image features are warped into alignment using the initial binocular disparity and fused with the first image features; the fused result reflects how reliable the current disparity estimate is in different image regions, and a confidence map is generated from it to characterize the credibility of the initial disparity.

[0040] Step (203): Based on the confidence map, the initial binocular disparity map and the aligned relative depth map are weighted and fused, and the fusion result is mapped back to the disparity space through inverse normalization to obtain the initial fused disparity map.

[0041] In this embodiment, the initial binocular disparity map and the mapped monocular relative depth map are fused pixel by pixel based on the confidence map, satisfying the following relationship: $d_f = c \odot d_0 + (1 - c) \odot \tilde{d}_m$, where $d_f$ is the fused disparity, $c$ is the confidence map, $\odot$ is element-wise multiplication, $d_0$ is the initial disparity, and $\tilde{d}_m$ is the relative depth after alignment.

[0042] In this embodiment, the confidence map is used to adjust the contribution ratio of binocular disparity information and monocular depth information in different regions, so that binocular disparity is relied upon more in regions where binocular matching is reliable, while monocular depth prior is introduced in regions where binocular matching is unstable or texture is weak. This results in an initial fused disparity map that is more consistent in overall geometry and more robust, providing high-quality initial input for subsequent iterative refinement.
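
The confidence-weighted fusion can be sketched as follows, assuming that the confidence map lies in [0, 1] and weights the binocular term, with the complement weighting the aligned monocular term. The function name `fuse_disparity` and the toy values are hypothetical.

```python
import numpy as np

def fuse_disparity(conf, disp0, depth_aligned):
    """Per-pixel blend: high confidence keeps the binocular disparity."""
    conf = np.clip(conf, 0.0, 1.0)  # assumption: confidence is a probability
    return conf * disp0 + (1.0 - conf) * depth_aligned

disp0 = np.array([[10.0, 40.0]])          # second pixel is a poor stereo match
depth_aligned = np.array([[11.0, 20.0]])  # monocular prior in disparity space
conf = np.array([[1.0, 0.25]])
fused = fuse_disparity(conf, disp0, depth_aligned)
```

In this example the first pixel keeps the binocular value, while the second pixel is pulled toward the monocular prior because its confidence is low.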

[0043] As shown in Figure 3, this embodiment provides a sub-process for initializing the hidden state of the cue loop unit, which specifically includes the following steps: Step (301): Obtain image features of the first image and the second image at multiple resolution levels.

[0044] In this embodiment, the first image and the second image can each be input into a feature extraction network with shared weights. The feature extraction network can adopt a multi-layer convolutional structure or a pyramid structure to extract image features at different resolution levels, so that both the local detail information and the global structural information of the image are characterized, providing a foundation for subsequent multi-scale disparity refinement.

[0045] Step (302): Based on the initial fused disparity map, the image features of the second image at each resolution level are backward-warped so that they are spatially aligned with the features of the first image.

[0046] In this embodiment, the initial fused disparity map describes the pixel correspondence between the first and second images. Through the disparity-guided warping operation, the features of the second image are mapped to coordinate positions consistent with the features of the first image, which establishes a one-to-one correspondence between the two images at the feature level and reduces the spatial inconsistency caused by the disparity offset.

[0047] Step (303): The aligned first and second image features are concatenated at each resolution level and input into a convolutional block for feature processing to generate the initial hidden state at the corresponding resolution level.

[0048] In this embodiment, concatenating the aligned first and second image features fully integrates the complementary information in the binocular images; convolutional blocks then process the concatenated features to adapt them to the state representation requirements of the cue loop unit, generating initial hidden states for subsequent iterative updates at each resolution level.

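[0049] In this embodiment, the hidden state is generated according to the following formula: $h_0 = \mathrm{Conv}\left(\left[f_1, \mathcal{W}(f_2, d_f)\right]\right)$, where $h_0$ is the generated initial hidden state, $f_1$ denotes the features of the first image, $\mathcal{W}(f_2, d_f)$ denotes the second image features $f_2$ warped using the initial disparity $d_f$, $[\cdot, \cdot]$ denotes concatenation, and $\mathrm{Conv}$ is a convolutional block.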
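
A minimal NumPy sketch of this initialization follows, assuming rectified images in which a pixel at column x in the first (left) view corresponds to column x - d in the second (right) view. The names `warp_right_to_left` and `init_hidden` are hypothetical, and a single 1x1 convolution followed by tanh stands in for the convolutional block, whose actual structure is not specified here.

```python
import numpy as np

def warp_right_to_left(feat_r, disp):
    """feat_r: (C, H, W); disp: (H, W). Sample feat_r at column x - disp."""
    C, H, W = feat_r.shape
    xs = np.arange(W, dtype=np.float64)[None, :] - disp   # source columns
    xs = np.clip(xs, 0.0, W - 1.0)                        # clamp at the border
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = xs - x0                                          # linear weights
    rows = np.arange(H)[:, None]
    return (1.0 - w1) * feat_r[:, rows, x0] + w1 * feat_r[:, rows, x1]

def init_hidden(feat_l, feat_r, disp, weight):
    """Concatenate left features with warped right features, then project."""
    warped = warp_right_to_left(feat_r, disp)
    stacked = np.concatenate([feat_l, warped], axis=0)    # (2C, H, W)
    # Assumption: 1x1 convolution + tanh as a stand-in for the convolutional block.
    return np.tanh(np.einsum("oc,chw->ohw", weight, stacked))

rng = np.random.default_rng(0)
feat_l = rng.standard_normal((2, 3, 8))
feat_r = np.roll(feat_l, -2, axis=2)       # right view shifted by 2 pixels
disp = np.full((3, 8), 2.0)
warped = warp_right_to_left(feat_r, disp)
weight = rng.standard_normal((4, 4))       # 4 hidden channels from 2C = 4 inputs
h0 = init_hidden(feat_l, feat_r, disp, weight)
```

With the correct disparity, the warped right features coincide with the left features everywhere except the two leftmost columns, which have no valid correspondence and are filled by border clamping.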

[0050] As shown in Figure 4, this embodiment provides a sub-process for constructing structural cues and motion cues during disparity iteration, which specifically includes the following steps: Step (401): In the current iteration round, obtain the current disparity map, the relative depth map inferred by the monocular depth model, and the global depth features extracted by the monocular depth model.

[0051] In this embodiment, the current disparity map is the disparity result after the previous iteration update, used to characterize the current binocular matching relationship; the relative depth map and monocular global depth features are derived from a pre-trained monocular depth model, used to provide stable monocular geometric priors.

[0052] Step (402): Perform affine invariant normalization on the current disparity map and the relative depth map respectively, and calculate the geometric difference features between the normalized current disparity and the normalized relative depth.

[0053] In this embodiment, performing affine-invariant normalization on the disparity and the relative depth eliminates their differences in scale and offset, so that they can be compared on a uniform relative scale; calculating the difference features between the two then reveals the regions where they are inconsistent.

[0054] In this embodiment, the specific formula is as follows: $\Delta_k = \left|\hat{d}_k - \hat{d}_m\right|$, where $\Delta_k$ is the difference between the normalized current disparity and the normalized relative depth, $\hat{d}_k$ is the normalized disparity in the current iteration, and $\hat{d}_m$ is the normalized relative depth.
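
The geometric difference feature can be sketched in NumPy as below. The absolute difference is used, and the scale factor inside the hypothetical helper `affine_normalize` is assumed to be the mean absolute deviation from the median; neither the helper name nor that scale definition is confirmed by this text.

```python
import numpy as np

def affine_normalize(x):
    """Median-shift and scale a map so that offset and scale are removed."""
    t = float(np.median(x))
    # Assumption: scale = mean absolute deviation from the median.
    s = max(float(np.mean(np.abs(x - t))), 1e-6)
    return (x - t) / s

def geometric_difference(disp_k, rel_depth):
    """Absolute difference between the two normalized maps."""
    return np.abs(affine_normalize(disp_k) - affine_normalize(rel_depth))

disp_k = np.array([[10.0, 12.0, 14.0], [16.0, 18.0, 20.0]])
rel_depth = 2.0 * disp_k + 1.0      # consistent up to scale and offset
diff_ok = geometric_difference(disp_k, rel_depth)

disp_bad = disp_k.copy()
disp_bad[0, 0] = 30.0               # one geometrically inconsistent pixel
diff_bad = geometric_difference(disp_bad, rel_depth)
```

When the two maps agree up to scale and offset the difference vanishes; after one pixel is corrupted, the largest response appears at that pixel, which is the kind of region the structural cue is meant to highlight.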

[0055] Step (403): Construct structural cues based on the geometric difference features and the monocular global depth features.

[0056] In this embodiment, the geometric difference features and the monocular global depth features are jointly encoded to generate structural cue features. The structural cue features guide the iterative process to focus on regions with inconsistent structures or significant depth changes, so that the stereo disparity result is gradually corrected while the stability of the monocular depth prior is maintained.

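[0057] In this embodiment, the structural cues are constructed according to the following formulas: $p_s = E_s(\Delta_k, F_m)$ and $h_s = h + \mathrm{Conv}(p_s)$, where $p_s$ denotes the structural cue features, $E_s$ is a structure encoder with the same architecture as in existing methods, $\Delta_k$ denotes the geometric difference features, $F_m$ denotes the monocular global depth features, $h$ is the current hidden state, $h_s$ is the intermediate hidden state, and $\mathrm{Conv}$ is a convolutional block.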

[0058] Step (404): Construct motion cues based on the local cost information indexed from the cost volume by the current disparity map.

[0059] In this embodiment, the cost volume contains the matching cost information of the binocular images under different disparity hypotheses; by indexing the local cost information corresponding to the current disparity in the cost volume, matching cost and matching confidence information related to stereo matching can be extracted and encoded as motion cues to assist subsequent disparity updates.

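[0060] In this embodiment, the motion cues are constructed according to the following formulas: $p_m = E_m(C_k, d_k)$ and $h_m = h_s + \mathrm{Conv}(p_m)$, where $p_m$ denotes the motion cue features, $E_m$ is the motion encoder, $C_k$ is the local cost volume in the current iteration, $d_k$ is the current disparity, $h_s$ and $h_m$ are intermediate hidden states before and after the motion cue is added, and $\mathrm{Conv}$ is a convolutional block.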
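
The lookup of local cost information around the current disparity can be sketched as follows. This is a hedged illustration: the function name `lookup_local_cost`, the (D, H, W) layout of the cost volume, the lookup radius, and the linear interpolation along the disparity axis are all assumptions, not details given in this text.

```python
import numpy as np

def lookup_local_cost(cost, disp, radius=2):
    """cost: (D, H, W); disp: (H, W). Return (2*radius+1, H, W) cost samples."""
    D, H, W = cost.shape
    offsets = np.arange(-radius, radius + 1, dtype=np.float64)[:, None, None]
    d = np.clip(disp[None] + offsets, 0.0, D - 1.0)   # hypotheses near disp
    d0 = np.floor(d).astype(int)
    d1 = np.minimum(d0 + 1, D - 1)
    w1 = d - d0                                       # interpolation weights
    rows = np.arange(H)[None, :, None]
    cols = np.arange(W)[None, None, :]
    return (1.0 - w1) * cost[d0, rows, cols] + w1 * cost[d1, rows, cols]

# Toy volume whose cost equals the disparity hypothesis index.
cost = np.arange(8, dtype=np.float64)[:, None, None] * np.ones((1, 2, 3))
disp = np.full((2, 3), 3.5)
local = lookup_local_cost(cost, disp, radius=1)
```

For the ramp-shaped toy volume, the samples at each pixel are the hypotheses 2.5, 3.5, and 4.5, that is, a small window centered on the current disparity; such a window would then be passed to the motion encoder together with the current disparity.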

[0061] As shown in Figure 5, this embodiment provides a sub-process for recursive disparity updates under the joint guidance of structural cues and motion cues, which specifically includes the following steps: Step (501): Guided by the structural and motion cues, calculate the update gate and update the hidden state in the current iteration round.

[0062] In this embodiment, the structural cue reflects the inconsistency between the current disparity result and the monocular geometry, and the motion cue characterizes the local cost information related to disparity changes during stereo matching. Calculating the update gate under the joint guidance of the two adaptively controls how strongly the hidden state is updated, so that the hidden state retains existing useful information while taking in the new structural and motion cues, which improves the stability and effectiveness of the iterative updates.

[0063] In this embodiment, the update gate is computed, through the Sigmoid activation function, from the hidden state of the current level, the hidden state of the previous level, the structural cue features, and the motion cue features.

[0064] Step (502): Fuse the hidden states at different resolution levels and perform feature adjustment on the fused hidden state.

[0065] In this embodiment, the cue loop unit adopts a multi-resolution structure, and the hidden states at different resolution levels capture spatial information at different scales. Fusing the hidden states across resolution levels integrates local detail information with global structural information, and feature adjustment makes the fused hidden state meet the representation requirements of subsequent disparity updates.

[0066] In this embodiment, the iterative update of the loop unit proceeds as follows: the hidden state of the higher resolution level is processed by a residual convolution block and added to the hidden state of the current level, and the sum is processed by a further residual convolution block; a 1×1 convolution operation then adjusts the feature dimension to give the adjusted hidden state; finally, the adjusted hidden state and the original hidden state are weighted by the update gate through element-wise multiplication and fused to give the updated hidden state.

[0067] Step (503): Predict the residual update amount of the current disparity based on the updated hidden state.

[0068] In this embodiment, the updated hidden state is input into the disparity prediction module to obtain the disparity residual used to correct the current disparity map; the disparity residual is used to characterize the direction and magnitude of the deviation between the current disparity and the target disparity, thereby achieving the purpose of gradually approaching the true disparity.

[0069] Step (504): The disparity residual is added to the current disparity map to obtain the updated disparity map. Based on preset iteration conditions, it is determined whether to proceed to the next iteration or to output the final disparity result. In this embodiment, the specific formula for the disparity update is: $d_{k+1} = d_k + \Delta d_k$, where $d_{k+1}$ is the updated disparity, $d_k$ is the current disparity, and $\Delta d_k$ is the disparity residual predicted from the updated hidden state.

[0070] In this embodiment, when the number of iterations has not reached the preset number, the updated disparity map is input as the new current disparity map into the next iteration; when the number of iterations reaches the preset number or the preset convergence condition is met, the iteration stops and the current disparity map is output as the final disparity map.
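
The stopping logic can be sketched as a simple loop. The function `refine`, the tolerance-based convergence test, and the toy step function `toy_step` are hypothetical stand-ins; in the embodiment, the residual would come from the cue loop unit rather than from a closed-form rule.

```python
import numpy as np

def refine(disp, step_fn, max_iters=16, tol=None):
    """Accumulate residuals until max_iters is reached or the residual is small."""
    iters = 0
    while iters < max_iters:
        delta = step_fn(disp)          # predicted disparity residual
        disp = disp + delta            # updated map becomes the current map
        iters += 1
        # Assumption: convergence is judged by the largest absolute residual.
        if tol is not None and np.max(np.abs(delta)) < tol:
            break
    return disp, iters

target = np.full((2, 2), 8.0)

def toy_step(d):
    """Toy residual predictor that halves the remaining error."""
    return 0.5 * (target - d)

disp_fixed, n_fixed = refine(np.zeros((2, 2)), toy_step, max_iters=4)
disp_conv, n_conv = refine(np.zeros((2, 2)), toy_step, max_iters=32, tol=0.01)
```

The first call stops because the preset iteration count is reached; the second stops early once the residual falls below the tolerance.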

[0071] Based on the above embodiments or any combination of them, this embodiment provides a generalized stereo matching system based on structure and motion cues, including: a first main control module, used to acquire binocular disparity information obtained by stereo matching and depth information obtained from a monocular depth model, perform scale-independent normalization on the binocular disparity information and the monocular depth information, align image features based on the binocular disparity information to generate confidence information, and fuse the binocular disparity information and the monocular depth information according to the confidence information to obtain an initial fused disparity map; a second main control module, used to align the features of the binocular images based on the initial fused disparity map; and a third main control module, used to initialize the hidden state of the iteration process, to construct, during the iteration process, structural cues based on the geometric consistency information between the current disparity and the monocular depth and motion cues from the cost information obtained during stereo matching, and, under the joint guidance of the structural cues and motion cues, to recursively update the hidden state and iteratively correct the disparity until the preset conditions are met, then output the final disparity map.

[0072] Embodiment 2: Based on Embodiment 1, this embodiment provides a generalized stereo matching system based on structure and motion cues, used to execute the generalized stereo matching method based on structure and motion cues described in Embodiment 1. The system includes a feature extraction module, a cost volume construction module, an affine-invariant fusion module, and a cue loop unit module, which are connected in sequence. The feature extraction module inputs the first image and the second image into the feature extraction network, which extracts features from both images and sends the extracted features to the cost volume construction module.

[0073] The cost volume construction module first calculates a group-wise correlation cost volume from the first and second feature maps. It then filters this cost volume with a lightweight 3D convolutional network to obtain the geometry encoding volume. To exploit global feature similarity, an all-pairs correlation cost volume is calculated at the same time. Finally, the cost volumes are pooled to construct a multi-scale cost volume pyramid, which provides multi-scale matching relationships.
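
A minimal NumPy sketch of a group-wise correlation cost volume and a pooled pyramid is given below. It omits the 3D convolutional filtering and the all-pairs volume, and the function names, the (groups, D, H, W) layout, and pooling along the disparity axis are assumptions made for illustration.

```python
import numpy as np

def groupwise_correlation(feat_l, feat_r, max_disp, groups):
    """feat_*: (C, H, W). Return a (groups, max_disp, H, W) correlation volume."""
    C, H, W = feat_l.shape
    assert C % groups == 0
    fl = feat_l.reshape(groups, C // groups, H, W)
    fr = feat_r.reshape(groups, C // groups, H, W)
    vol = np.zeros((groups, max_disp, H, W))
    for d in range(max_disp):
        # Left column x is compared with right column x - d; others stay zero.
        vol[:, d, :, d:] = (fl[..., d:] * fr[..., : W - d]).mean(axis=1)
    return vol

def cost_pyramid(vol, levels):
    """Assumption: each level halves the disparity axis by average pooling."""
    pyr = [vol]
    for _ in range(levels - 1):
        G, D, H, W = pyr[-1].shape
        D2 = D // 2
        pyr.append(pyr[-1][:, : 2 * D2].reshape(G, D2, 2, H, W).mean(axis=2))
    return pyr

rng = np.random.default_rng(0)
g = rng.standard_normal((2, 2, 2, 10))               # (groups, C/groups, H, W)
g = g / np.linalg.norm(g, axis=1, keepdims=True)     # unit norm per group
feat_l = g.reshape(4, 2, 10)
feat_r = np.roll(feat_l, -3, axis=2)                 # true disparity is 3
vol = groupwise_correlation(feat_l, feat_r, max_disp=6, groups=2)
best = vol.mean(axis=0).argmax(axis=0)               # (H, W) winning hypothesis
pyr = cost_pyramid(vol, levels=3)
```

With unit-normalized features, the correlation peaks at the true shift of 3 for every column that has a valid correspondence, and the pyramid levels expose the same matching evidence at coarser disparity resolution.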

[0074] The affine-invariant fusion module is used to obtain a more reliable initial disparity map. For the binocular disparity map regressed from the geometry encoding volume and the inverse depth map inferred by the monocular branch, the median and scale factor of each are calculated. Then, using the median and scale factor of the binocular disparity, the monocular inverse depth map is linearly scaled into the binocular disparity space. At the same time, the second feature map is projected using the binocular disparity map to obtain a recovered first feature map. The recovered first feature map and the first feature map are concatenated along the channel dimension and fed into a convolution module; after an activation function, a confidence probability map for fusion is obtained. Finally, the confidence map, the binocular disparity map, and the transformed monocular inverse depth map are fused into the initial disparity map that serves as the starting point of iteration. This initial disparity map is subsequently fed into the cue loop unit module for refinement.

[0075] The cue loop unit module iterates on the initial disparity map based on the fused features. It indexes the local cost volume corresponding to the current disparity from the cost volume through a lookup operation, and after dozens of iterations obtains a disparity map that has converged close to the accurate result. The cue loop unit module is based on the decoder of a monocular depth foundation model and refines the initial disparity map step by step. Specifically, the cue loop unit uses a multi-resolution architecture and updates the hidden state and the disparity by introducing structural cues and motion cues. Its processing includes the following steps. State initialization step: Unlike traditional recurrent units that use only the first feature map for initialization, this embodiment uses the initial disparity map output by the affine-invariant fusion module to backward-warp the second feature map. The warped second feature map is then concatenated with the first feature map along the channel dimension, and the concatenated features are input into the convolution module to generate initial hidden states at each scale. This step ensures that the hidden states already contain stereo correspondence information in the early stages of iteration.

[0076] Structural cue construction step: To use the monocular depth prior during iteration without disrupting the disparity range of stereo matching, this invention constructs structural cues. First, the disparity map obtained in the current iteration step and the monocular inverse depth map extracted by the feature extraction module are each subjected to affine-invariant normalization. Next, the absolute difference map between the normalized current disparity map and the normalized monocular depth map is calculated. Finally, this difference map and the frozen monocular depth features are input into the structure encoder, and structural cue features are obtained after convolution processing. The structural cue features are added to the current hidden state in residual form, guiding the network at the feature level to focus on regions where the geometric structure is inconsistent.

[0077] Motion cue construction step: To bring stereo matching cost information into the iteration, this invention constructs motion cues. Based on the disparity map of the current iteration, the cost volume pyramid is indexed to extract the corresponding local cost volume features. These local cost volume features and the current disparity map are input into the motion encoder to generate motion cue features. The motion cue features are likewise added to the hidden state as residuals, enabling the network to adaptively integrate stereo matching cost information.

[0078] State update and disparity regression step: The cue loop unit employs a simplified update strategy. In each iteration, an update gate and a candidate hidden state are calculated by convolutional layers from the current hidden state together with the structural and motion cue features. This embodiment removes the reset gate of the traditional gated recurrent unit and retains only the update gate, reducing computational complexity. Using the update gate, the hidden state from the previous time step and the candidate hidden state are weighted and fused to obtain the updated hidden state. Finally, the updated hidden state is input into a convolutional layer to predict the residual of the current disparity, and this residual is added to the current disparity map to obtain the updated disparity map. This process is repeated a preset number of times, ultimately outputting a high-precision disparity map.
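
One update of this kind, with an update gate but no reset gate, can be sketched as follows. This is an illustration under stated assumptions: 1x1 convolutions stand in for the convolutional layers, tanh is assumed for the candidate state, the blend convention (1 - z) * previous + z * candidate is assumed, and the names `cue_step`, `blend`, and `conv1x1` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) tensor with weight (C_out, C)."""
    return np.einsum("oc,chw->ohw", w, x)

def blend(h_prev, h_cand, z):
    """Update-gate fusion of previous and candidate states (no reset gate)."""
    return (1.0 - z) * h_prev + z * h_cand

def cue_step(h_prev, p_s, p_m, disp, w_z, w_c, w_d):
    x = np.concatenate([h_prev, p_s, p_m], axis=0)   # state plus both cues
    z = sigmoid(conv1x1(x, w_z))                     # update gate in (0, 1)
    h_cand = np.tanh(conv1x1(x, w_c))                # candidate hidden state
    h_new = blend(h_prev, h_cand, z)
    delta = conv1x1(h_new, w_d)[0]                   # (H, W) disparity residual
    return h_new, disp + delta

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5
h = rng.standard_normal((C, H, W))
p_s = rng.standard_normal((C, H, W))                 # structural cue features
p_m = rng.standard_normal((C, H, W))                 # motion cue features
disp = np.zeros((H, W))
w_z = rng.standard_normal((C, 3 * C))
w_c = rng.standard_normal((C, 3 * C))
w_d = rng.standard_normal((1, C))
h_new, disp_new = cue_step(h, p_s, p_m, disp, w_z, w_c, w_d)
```

Dropping the reset gate removes one convolution and one element-wise product per step compared with a standard gated recurrent unit, which is the source of the reduced computation mentioned above.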

[0079] Figure 6 is a schematic diagram of the generalized stereo matching system based on structure and motion cues in a practical application scenario. In the figure, the left image corresponds to the first image, the right image to the second image, feature extraction to the feature extraction module, cost aggregation to the cost volume construction module, affine invariant fusion to the affine-invariant fusion module, and the cue loop unit to the cue loop unit module.

[0080] Embodiment 3: As shown in Figure 7, this embodiment provides an electronic device for executing the steps of the generalized stereo matching method based on structure and motion cues, including one or more processors 41, multiple memories 42, and a communication interface.

[0081] Processor 41 and memory 42 can be connected via a bus or other means; Figure 7 takes a bus connection as an example. The processor, memory, and communication interface communicate with each other.

[0082] The memory 42, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs and non-volatile computer-executable programs, such as the generalized stereo matching method based on structure and motion cues in the above embodiments. The processor 41 executes the generalized stereo matching method based on structure and motion cues by running the non-volatile software programs and instructions stored in the memory 42.

[0083] Memory 42 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 42 may optionally include memory remotely located relative to processor 41, which can be connected to processor 41 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0084] The program instructions/modules are stored in the memory 42. When executed by one or more processors 41, they perform the generalized stereo matching method based on structure and motion cues in the above embodiments, for example, the method steps shown in Figures 1-5.

[0085] This invention also provides a non-transitory computer-readable storage medium storing computer program instructions; when executed by a processor, these computer program instructions implement the generalized stereo matching method based on structure and motion cues provided in this invention.

[0086] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A generalized stereo matching method based on structure and motion cues, characterized in that it includes the following steps: Step 1: Obtain binocular disparity information based on stereo matching and depth information based on a monocular depth model; after performing scale-independent normalization on the binocular disparity information and the monocular depth information, align image features based on the binocular disparity information to generate confidence information; then fuse the binocular disparity information and the monocular depth information according to the confidence information to obtain an initial fused disparity map. Step 2: Align the features of the binocular images based on the initial fused disparity map. Step 3: Initialize the hidden state of the iteration process; during the iteration process, construct structural cues based on the geometric consistency information between the current disparity and the monocular depth, and construct motion cues by combining the cost information obtained during stereo matching. Step 4: Under the joint guidance of the structural and motion cues, recursively update the hidden state and progressively correct the disparity. Step 5: Repeat the iterative process of Steps 3 and 4 until the preset conditions are met, and output the final disparity map.

2. The generalized stereo matching method based on structure and motion cues according to claim 1, characterized in that Step 1 specifically includes the following steps: (11) obtain the initial binocular disparity map from the cost volume and the relative depth map from the monocular depth model, calculate the affine normalization parameters of each, and perform scale-independent normalization on both; (12) based on the normalized initial binocular disparity, project and align the relative depth to the disparity space, use the initial disparity to align the second image features, and generate a confidence map together with the first image features; (13) based on the confidence map, fuse the initial binocular disparity and the aligned relative depth by weighting, and map the result back to the disparity space through inverse normalization to obtain the initial fused disparity map. Preferably, the generation of the initial fused disparity map specifically includes: calculating the median and scale factor of the initial binocular disparity map from the matching cost volume and of the monocular relative depth map from the monocular depth model, and performing affine-invariant normalization on each; and, using the median and scale factor of the initial binocular disparity map, linearly mapping the normalized monocular relative depth map back to the binocular disparity space. Preferably, the median and scale factor of the initial binocular disparity map and of the monocular relative depth map satisfy the following relationship: $t(x) = \mathrm{median}(x)$ and $\hat{x} = \frac{x - t(x)}{s(x)}$, where $x$ is the initial disparity or relative depth to be normalized, $t(x)$ is the median of $x$, $s(x)$ is the scale factor of $x$, and $\hat{x}$ is the result after normalization. Preferably, the linear mapping of the normalized monocular relative depth map back to the binocular disparity space satisfies: $\tilde{d}_m = s(d_0)\,\hat{d}_m + t(d_0)$, where $d_0$ is the initial binocular disparity obtained through the cost volume, $s(d_0)$ is the scale factor of the initial disparity, $t(d_0)$ is the median of the initial disparity, $\hat{d}_m$ is the normalized relative depth, and $\tilde{d}_m$ is the relative depth after alignment.

3. The generalized stereo matching method based on structure and motion cues according to claim 1, characterized in that, in Step 3, initializing the hidden state of the iteration process includes: (31) obtain the image features of the first and second images at multiple resolution levels; (32) based on the initial fused disparity, backward-warp the second image features so that they are spatially aligned with the first image features; (33) concatenate the aligned first and second image features and input them into a convolutional block to generate the initial hidden state at each resolution level. Preferably, the feature map of the second image is backward-warped based on the initial fused disparity map; the warped feature map of the second image is then concatenated with the feature map of the first image, and a multi-resolution initial hidden state is generated through convolution processing, as shown in the following formula: $h_0 = \mathrm{Conv}\left(\left[f_1, \mathcal{W}(f_2, d_f)\right]\right)$, where $h_0$ is the generated initial hidden state, $f_1$ denotes the features of the first image, $\mathcal{W}(f_2, d_f)$ denotes the second image features $f_2$ warped using the initial disparity $d_f$, $[\cdot, \cdot]$ denotes concatenation, and $\mathrm{Conv}$ is a convolutional block.

4. The generalized stereo matching method based on structure and motion cues according to claim 1, characterized in that, in step three, the fusion of the binocular disparity information and the monocular depth information based on the confidence information includes: based on the initial binocular disparity map, projecting the feature map of the second image to obtain a recovered first feature map; concatenating the recovered first feature map and the first feature map, and generating a confidence map through convolution and an activation function; and fusing the initial binocular disparity map and the mapped monocular relative depth map pixel by pixel according to the confidence map, satisfying the following relationship: $d_f = c \odot d_0 + (1 - c) \odot \tilde{d}_m$, where $d_f$ is the fused disparity, $c$ is the confidence map, $\odot$ is element-wise multiplication, $d_0$ is the initial disparity, and $\tilde{d}_m$ is the relative depth after alignment.

5. The generalized stereo matching method based on structure and motion cues according to claim 1, characterized in that constructing the structural cues based on the geometric consistency information between the current disparity and the monocular depth includes: performing affine-invariant normalization on the disparity map obtained in the current iteration, and calculating the difference features between it and the normalized monocular relative depth map, satisfying the following formula: $\Delta_k = \left|\hat{d}_k - \hat{d}_m\right|$, where $\Delta_k$ is the difference between the normalized current disparity and the normalized relative depth, $\hat{d}_k$ is the normalized disparity in the current iteration, and $\hat{d}_m$ is the normalized relative depth; jointly encoding the difference features with the frozen monocular depth features to generate structural cue features; and adding the structural cue features to the hidden state of the iteration process in residual form to guide the disparity update in geometrically inconsistent regions, according to the following formulas: $p_s = E_s(\Delta_k, F_m)$ and $h_s = h + \mathrm{Conv}(p_s)$, where $p_s$ denotes the structural cue features, $E_s$ is a structure encoder with the same architecture as in existing methods, $F_m$ denotes the monocular global depth features, $h$ is the current hidden state, $h_s$ is the intermediate hidden state, and $\mathrm{Conv}$ is a convolutional block.

6. The generalized stereo matching method based on structure and motion cues according to claim 1, characterized in that constructing the motion cues by combining the cost information obtained during stereo matching includes: based on the disparity map of the current iteration, indexing the corresponding local cost volume features in the cost volume; inputting the local cost volume features and the current disparity map into the motion encoder to generate motion cue features; and adding the motion cue features to the hidden state in residual form, according to the following formulas: $p_m = E_m(C_k, d_k)$ and $h_m = h_s + \mathrm{Conv}(p_m)$, where $p_m$ denotes the motion cue features, $E_m$ is the motion encoder, $C_k$ is the local cost volume in the current iteration, $d_k$ is the current disparity, $h_s$ and $h_m$ are intermediate hidden states before and after the motion cue is added, and $\mathrm{Conv}$ is a convolutional block.

7. The generalized stereo matching method based on structure and motion cues according to claim 1, characterized in that, in Step 5, the iterative update of the disparity includes: during the update process, calculating the update gate with a formula that varies with the resolution level, the update gate being computed, through the Sigmoid activation function, from the hidden state of the current level, the hidden state of the previous level, the structural cue features, and the motion cue features; for updating the hidden state, first processing the hidden state of the higher-resolution level with a residual convolution block and adding it to the hidden state of the current level, then applying further residual block processing; for the initial resolution level, additionally processing the structural cue features and the motion cue features with convolutional blocks and adding them to the hidden state in sequence; then applying a 1×1 convolution to the hidden state to adjust the feature dimension; finally, using the update gate, weighting the processed hidden state and the original hidden state by element-wise multiplication and fusing them to obtain the updated hidden state; inputting the updated hidden state into a convolutional block to obtain the disparity update amount, and adding this amount to the current disparity to obtain the updated disparity according to the following formula: $d_{k+1} = d_k + \Delta d_k$, where $d_{k+1}$ is the updated disparity, $d_k$ is the current disparity, and $\Delta d_k$ is the disparity update amount; and determining whether the current iteration number has reached the preset iteration number: when the current iteration number is greater than or equal to the preset iteration number, the iteration stops and the disparity update amount is added to the input disparity map to obtain the final disparity map; when the current iteration number is less than the preset iteration number, the disparity update amount is added to the input disparity map and the result is used as the new input disparity map in the next iteration.

8. A generalized stereo matching system based on structure and motion cues, characterized in that it comprises an affine-invariant fusion module and a cue loop unit module, wherein:
the affine-invariant fusion module is configured to acquire binocular disparity information obtained by stereo matching and depth information obtained from a monocular depth model; to perform scale-independent normalization on the binocular disparity information and the monocular depth information; to align the image features according to the binocular disparity information so as to generate confidence information; and to fuse the binocular disparity information and the monocular depth information according to the confidence information to obtain an initial fused disparity map;
the cue loop unit module is configured to align the features of the binocular images according to the initial fused disparity map and then initialize the hidden state of the iterative process; during iteration, structural cues are constructed from the geometric consistency information between the current disparity and the monocular depth, and motion cues are constructed in combination with the cost information obtained during stereo matching; under the joint guidance of the structural cues and the motion cues, the hidden state is updated recursively and the disparity is corrected step by step until a preset condition is met, whereupon the final disparity map is output.
Preferably, the system further comprises a feature extraction module and a cost volume construction module, the feature extraction module, the cost volume construction module, the affine-invariant fusion module, and the cue loop unit module being connected in sequence, wherein:
the feature extraction module is configured to input the first image and the second image into a feature extraction network, which extracts features from both images and sends them to the cost volume construction module;
the cost volume construction module is configured to compute a group-wise correlation cost volume from the feature maps of the first image and the second image, filter that cost volume with a lightweight 3D convolutional network to obtain a geometry encoding volume, compute an all-pairs correlation volume from global feature similarity, and pool the all-pairs correlation volume to build a multi-scale cost volume pyramid that provides multi-scale matching relationships.
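The cost volume construction described in claim 8 can be illustrated with a short NumPy sketch. It is a simplified reading, not the claimed implementation: the 3D convolutional filtering step is omitted, the function names are hypothetical, and the mean over each channel group, the scaling by the square root of the channel count, and the factor-2 average pooling along the right-image axis are assumptions chosen because they are common in stereo networks.

```python
import numpy as np


def groupwise_correlation_volume(feat_left, feat_right, max_disp, num_groups):
    """Group-wise correlation volume of shape (groups, max_disp, H, W).

    feat_left, feat_right: arrays of shape (C, H, W) from rectified images.
    Left pixel x is compared with right pixel x - d for each disparity d.
    """
    channels, height, width = feat_left.shape
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    if max_disp > width:
        raise ValueError("max_disp must not exceed the image width")
    per_group = channels // num_groups
    fl = feat_left.reshape(num_groups, per_group, height, width)
    fr = feat_right.reshape(num_groups, per_group, height, width)
    volume = np.zeros((num_groups, max_disp, height, width), dtype=np.float64)
    for d in range(max_disp):
        # Columns x < d have no counterpart in the right image and stay zero.
        volume[:, d, :, d:] = (fl[..., d:] * fr[..., : width - d]).mean(axis=1)
    return volume


def all_pairs_correlation(feat_left, feat_right):
    """Correlate every left column with every right column in the same row.

    Returns an array of shape (H, W_left, W_right), scaled by sqrt(C).
    """
    channels = feat_left.shape[0]
    return np.einsum("chi,chj->hij", feat_left, feat_right) / np.sqrt(channels)


def correlation_pyramid(corr, num_levels):
    """Average-pool the last axis by 2 repeatedly to form a pyramid."""
    pyramid = [corr]
    for _ in range(num_levels - 1):
        prev = pyramid[-1]
        even = (prev.shape[-1] // 2) * 2  # drop a trailing odd column
        pooled = prev[..., :even].reshape(*prev.shape[:-1], even // 2, 2).mean(axis=-1)
        pyramid.append(pooled)
    return pyramid


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.standard_normal((8, 4, 16))
    right = np.roll(left, -2, axis=-1)  # synthetic pair with disparity 2
    gwc = groupwise_correlation_volume(left, right, max_disp=4, num_groups=2)
    pyramid = correlation_pyramid(all_pairs_correlation(left, right), num_levels=3)
    print(gwc.shape, [level.shape for level in pyramid])
```

Pooling only the right-image axis keeps the left-image resolution fixed, so each pyramid level can be sampled at the same pixel grid while covering a progressively wider disparity range.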

9. An electronic device, characterized in that it comprises: at least one processor, at least one memory, and a communication interface, wherein the processor, the memory, and the communication interface communicate with one another; and the memory stores program instructions executable by the processor, the processor invoking the program instructions to execute the generalized stereo matching method based on structure and motion cues according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the generalized stereo matching method based on structure and motion cues according to any one of claims 1 to 7.