Systems and methods for cost-volume attention based disparity estimation

By using a cost-volume attention-based deep learning system, this patent addresses the disparity estimation problem in complex real-world scenes using a single model. This overcomes the poor accuracy of existing methods across different environments and achieves robust disparity estimation for both indoor and outdoor scenes.

CN114078113B · Active · Publication Date: 2026-06-23 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SAMSUNG ELECTRONICS CO LTD
Filing Date
2021-06-30
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing disparity estimation methods cannot effectively handle various complex real-world scenes, especially indoor and street scenes, resulting in poor accuracy and the need for multiple models to handle different environments.

Method used

A deep learning system based on cost-volume attention is adopted. Through feature map extraction, cost volume calculation, attention-aware cost volume generation, and aggregation, a single model is used to estimate real-world disparity. The system includes a feature map extraction module, a cost volume calculation module, a cost volume attention module, and a cost aggregation module. Multi-branch and single-branch attention mechanisms are used to handle different scenarios.

Benefits of technology

Robust disparity estimation for various scenarios is achieved under a single model, improving the accuracy and efficiency of disparity estimation, and is applicable to both indoor and outdoor scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

Systems and methods for cost-volume attention-based disparity estimation are provided. One method includes extracting a first feature map from a left image taken by a first camera; extracting a second feature map from a right image taken by a second camera; computing matching costs based on a comparison of the first feature map and the second feature map to generate a cost volume; generating an attention-aware cost volume from the generated cost volume; and aggregating the attention-aware cost volume to generate an output disparity.

Description

Technical Field

[0001] This disclosure generally relates to the estimation of real-world disparity of elements in a scene captured by two cameras, and more specifically, to deep learning systems and methods for robust disparity estimation based on cost-volume attention.

Background Technology

[0002] Deep learning is currently leading many performance breakthroughs in various computer vision tasks. The state-of-the-art performance of deep learning comes from over-parameterized deep neural networks, which, when trained on very large datasets, are able to automatically extract useful representations (features) of the data for the target task.

[0003] There is also interest in estimating the real-world depth of elements in a captured scene, which has numerous applications, such as separating foreground (near) objects from background (far) objects within a captured scene. Accurate disparity estimation allows foreground objects of interest to be separated from the background of a scene, and accurate foreground-background separation in turn allows captured images to be processed to simulate effects such as bokeh. Bokeh is the soft out-of-focus blur of the background. It is normally achieved by using the right settings on an expensive camera with a fast lens and a wide aperture, and by moving the camera closer to the subject and the subject farther from the background to produce a shallow depth of field. Accurate disparity estimation therefore allows images taken by non-professional photographers, or by cameras with smaller lenses (such as mobile phone cameras), to be processed into more aesthetically pleasing images with a bokeh effect that focuses on the subject. Other applications of accurate disparity estimation include 3D object reconstruction and virtual reality applications, where it is desirable to alter the background or the subject and render them according to the desired virtual reality.

[0004] However, real-world scenarios are highly complex and consist of various modalities (such as indoor and street driving). Therefore, existing disparity estimation methods do not perform well because they are optimized only for a limited number of scenarios. Furthermore, they require multiple models to handle disparity estimation in different real-world environments.

Summary of the Invention

[0005] This disclosure is provided to at least address the aforementioned problems and / or disadvantages and to provide at least the following advantages.

[0006] One aspect of this disclosure is to provide a system and method for estimating the real-world disparity of elements in a scene captured by two cameras using a single model that works well for scenes with various modalities.

[0007] Another aspect of this disclosure is to provide a deep learning system and method for robust disparity estimation based on cost-volume attention.

[0008] Another aspect of this disclosure is to provide a system and method for cost-volume attention-based disparity estimation, which can handle real-world disparity estimation problems using a single model.

[0009] According to one embodiment, a method is provided, the method comprising: extracting a first feature map from a left image captured by a first camera; extracting a second feature map from a right image captured by a second camera; calculating a matching cost based on a comparison of the first and second feature maps to generate a cost volume; generating an attention-aware cost volume from the generated cost volume; and aggregating the attention-aware cost volume to generate an output disparity.

[0010] According to one embodiment, a system is provided, the system including a memory and a processor, the processor being configured to extract a first feature map from a left image captured by a first camera, extract a second feature map from a right image captured by a second camera, compute a matching cost based on a comparison of the first and second feature maps to generate a cost volume, generate an attention-aware cost volume from the generated cost volume, and aggregate the attention-aware cost volume to generate an output disparity.

[0011] According to one embodiment, a system is provided, the system comprising: a feature map extraction module configured to extract a first feature map from a left image captured by a first camera and extract a second feature map from a right image captured by a second camera; a cost volume calculation module configured to calculate a matching cost based on a comparison of the first and second feature maps to generate a cost volume; a cost volume attention module configured to generate an attention-aware cost volume from the generated cost volume; and a cost aggregation module configured to aggregate the attention-aware cost volume to generate an output disparity.

Attached Figure Description

[0012] The above and other aspects, features, and advantages of certain embodiments of this disclosure will become clearer from the following detailed description taken in conjunction with the accompanying drawings, in which:

[0013] Figure 1 shows a deep learning system for robust disparity estimation based on cost-volume attention, according to an embodiment;

[0014] Figure 2 illustrates the process of generating the final output disparity by a deep learning system, according to an embodiment;

[0015] Figure 3 shows the processing of channel-wise disparity attention on the cost volume (CVA-CWDA), according to an embodiment;

[0016] Figure 4 shows the detailed processing of an attention block in CVA-CWDA, according to an embodiment;

[0017] Figure 5 shows the processing of disparity-wise channel attention on the cost volume (CVA-DWCA), according to an embodiment;

[0018] Figure 6 shows the detailed processing of an attention block in CVA-DWCA, according to an embodiment;

[0019] Figure 7 shows the processing of single-branch disparity attention on the cost volume (CVA-SBDA), according to an embodiment;

[0020] Figure 8 shows the processing of single-branch channel attention on the cost volume (CVA-SBCA), according to an embodiment;

[0021] Figure 9 shows the processing of single-branch combined disparity-channel attention on the cost volume (CVA-SBCDCA), according to an embodiment;

[0022] Figure 10 shows the processing of single-branch spatial attention on the cost volume (CVA-SBSA), according to an embodiment;

[0023] Figure 11 shows the processing of dual cost volume attention using sequential and parallel arrangements, according to an embodiment;

[0024] Figure 12 shows graphs demonstrating the effectiveness of the cost-volume attention module, according to an embodiment; and

[0025] Figure 13 shows an electronic device in a network environment, according to an embodiment.

Detailed Implementation

[0026] In the following, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. It should be noted that although the same elements are shown in different drawings, the same elements will be designated by the same reference numerals. In the following description, only specific details such as detailed configurations and components are provided to aid in a comprehensive understanding of the embodiments of the present disclosure. Therefore, it will be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope of the present disclosure. Furthermore, for clarity and conciseness, descriptions of well-known functions and structures are omitted. The terminology described below is defined in consideration of the functions in this disclosure and may vary depending on the user, the user's intent, or habit. Therefore, the definitions of the terms should be determined based on the content throughout this specification.

[0027] This disclosure can have various modifications and various embodiments, some of which are described in detail below with reference to the accompanying drawings. However, it should be understood that this disclosure is not limited to the embodiments, but includes all modifications, equivalents, and substitutions within the scope of this disclosure.

[0028] Although various elements may be described using ordinal terms including first, second, etc., structural elements are not limited by these terms. These terms are used only to distinguish one element from another. For example, a first structural element may be referred to as a second structural element without departing from the scope of this disclosure. Similarly, a second structural element may also be referred to as a first structural element. As used herein, the term "and / or" includes any and all combinations of one or more related items.

[0029] The terminology used herein is for describing various embodiments of this disclosure only and is not intended to limit the disclosure. Unless the context clearly indicates otherwise, the singular form is intended to include the plural form. In this disclosure, it should be understood that the terms “comprising” or “having” indicate the presence of features, numbers, steps, operations, structural elements, components, or combinations thereof, and do not preclude the possibility of the presence or addition of one or more other features, numbers, steps, operations, structural elements, components, or combinations thereof.

[0030] Unless otherwise defined, all terms used herein have the same meaning as understood by one of ordinary skill in the art to which this disclosure pertains. Terms (such as those defined in commonly used dictionaries) should be interpreted as having the same meaning as in the context of the relevant field, and should not be interpreted as having an ideal or overly formal meaning unless explicitly defined in this disclosure.

[0031] The electronic device according to one embodiment can be one of various types of electronic devices. The electronic device may include, for example, a portable communication device (e.g., a smartphone), a computer, a portable multimedia device, a portable medical device, a camera, a wearable device, or a home appliance. According to one embodiment of this disclosure, the electronic device is not limited to the electronic devices described above.

[0032] The terminology used in this disclosure is not intended to limit the disclosure, but rather to include various modifications, equivalents, or substitutions of the corresponding embodiments. Regarding the description of the drawings, similar reference numerals may be used to refer to similar or related elements. Unless the relevant context clearly indicates otherwise, the singular form of a noun corresponding to an item may include one or more of the things. As used herein, each of phrases such as “A or B,” “at least one of A and B,” “at least one of A or B,” “A, B, or C,” “at least one of A, B, and C,” and “at least one of A, B, or C” may include all possible combinations of items listed together in the corresponding phrase. As used herein, terms such as “1st,” “2nd,” “first,” and “second” may be used to distinguish a corresponding component from another component, but are not intended to limit the components in other respects (e.g., importance or order). If an element (e.g., a first element) is referred to as being “coupled with” another element (e.g., a second element), with or without the term “operatively” or “communicatively,” it means that the element can be coupled with the other element directly (e.g., by wire), wirelessly, or via a third element.

[0033] As used herein, the term "module" can include units implemented in hardware, software, or firmware, and is used interchangeably with other terms such as "logic," "logic block," "component," and "circuit." A module can be a single integrated component or its smallest unit or portion adapted to perform one or more functions. For example, according to one embodiment, a module can be implemented as an application-specific integrated circuit (ASIC).

[0034] Traditional disparity estimation methods focus only on estimating disparities from specific domains, such as those specific to indoor scenes or street scenes. Therefore, the accuracy can be very poor when traditional methods are tested on different scenarios.

[0035] Figure 1 shows a deep learning system for robust disparity estimation based on cost-volume attention, according to an embodiment.

[0036] Referring to Figure 1, the deep learning system includes a feature map extraction module 101, a cost volume calculation module 102, a cost volume attention (CVA) module 103, a cost aggregation module 104, and a disparity fusion module 105. The feature map extraction module 101 extracts feature maps from the left and right images. The cost volume calculation module 102 calculates the matching cost based on the left / right feature maps. The CVA module 103 adjusts (emphasizes / suppresses) parts of the cost volume based on attention, providing different weights for different disparities within the cost volume. The cost aggregation module 104 aggregates the attention-aware cost volumes to output the disparity. The disparity fusion module 105 fuses two aggregated disparities (e.g., trained on different disparity ranges) to provide the final output disparity.

[0037] Although Figure 1 shows each module as a separate element, the modules can be included in a single element such as a processor or an ASIC.

[0038] Figure 2 illustrates the process of generating the final output disparity by a deep learning system according to an embodiment. Specifically, Figure 2 illustrates the processing of a CVA-based deep learning system that works well for various scenarios using a single model. Herein, the system may be referred to as CVANet. For example, the processing shown in Figure 2 can be performed by the deep learning system shown in Figure 1.

[0039] Referring to Figure 2, a disparity fusion scheme is provided based on two networks trained on different disparity ranges. The first network is optimized on a partial disparity range [0, a], and the second network is optimized on the full disparity range [0, b], where b > a. For the two disparity estimation networks with different disparity ranges, the feature map extraction module, cost volume calculation module, CVA module, and cost aggregation module can be the same.

[0040] In both networks, the feature map extraction module extracts feature maps from the left and right images. Subsequently, the cost volume calculation module calculates the matching cost between the left and right feature maps. The output is a cost volume representing the matching cost between the left and right feature maps at each disparity level. Ideally, the matching cost at the true disparity level will be 0.

[0041] The CVA module modifies the cost volume using attention techniques, assigning different weights to different disparities within the cost volume. For different scenes, the attention module focuses on different parts of the cost volume. For example, in outdoor scenes, the attention module can give more weight to smaller disparities (because outdoor objects are farther away), but in indoor scenes, it can give more weight to larger disparities. The CVA module can refine the matching cost volume in a multi-branch or single-branch manner.

[0042] The cost aggregation module aggregates the attention-aware cost volume to output a disparity map from each network. Subsequently, the disparity fusion module fuses the aggregated disparities from the two networks (trained on different disparity ranges) to provide the final estimated disparity.

[0043] Feature extraction can be achieved using traditional feature extraction backbones (such as ResNet or stacked hourglass networks). The input to feature extraction is a left and a right image, each with a size of H×W, where H is the height and W is the width, and the output is the corresponding feature maps C×W×H for the left and right images, respectively, where C is the number of channels.

[0044] The cost volume can also be implemented using existing cost volumes (such as standard cost volumes based on feature map correlation, or extended cost volumes that integrate multiple cost volumes). The output of the cost volume can be a four-dimensional (4D) feature map C×D×H×W, where C is the number of channels, D is the number of disparity levels, H is the height, and W is the width.
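As a rough illustration of the shapes involved, the following sketch builds a C×D×H×W cost volume from two C×H×W feature maps. The function name `build_cost_volume`, the absolute-difference matching cost, and the zero-filling of columns that have no valid match are assumptions made for illustration only; the disclosure states only that existing cost volumes (for example correlation-based or extended cost volumes) can be used.

```python
import numpy as np

def build_cost_volume(feat_left, feat_right, max_disp):
    """Return a (C, D, H, W) absolute-difference cost volume (illustrative only)."""
    C, H, W = feat_left.shape
    cost = np.zeros((C, max_disp, H, W), dtype=feat_left.dtype)
    for d in range(max_disp):
        # Left pixel at column x is compared with right pixel at column x - d.
        cost[:, d, :, d:] = np.abs(feat_left[:, :, d:] - feat_right[:, :, :W - d])
    return cost

# Toy input: the left features are the right features shifted by 2 pixels,
# so the matching cost at disparity level 2 is zero wherever a match exists.
rng = np.random.default_rng(0)
right = rng.standard_normal((4, 3, 10))
left = np.roll(right, 2, axis=2)
cv = build_cost_volume(left, right, max_disp=4)
print(cv.shape)
```

In this toy case the cost at the true disparity level is exactly 0, which matches the ideal behavior described for the cost volume calculation module.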

[0045] Regarding CVA, because the cost volume is a 4D feature map, traditional attention algorithms based on 3D feature maps cannot be directly applied. Therefore, various embodiments for performing attention on the cost volume are provided here.

[0046] CVA based on multi-branch attention

[0047] The concept of multi-branch CVA is to split the 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$ into several 3D feature maps and then apply an attention mechanism to each 3D feature map. Two different methods for the 4D-to-3D partitioning are described below: (a) partitioning along the channel dimension of the cost volume, i.e., channel-wise disparity attention on the cost volume (CVA-CWDA), and (b) partitioning along the disparity dimension of the cost volume, i.e., disparity-wise channel attention on the cost volume (CVA-DWCA).

[0048] Figure 3 illustrates the processing of CVA-CWDA according to an embodiment. For example, the processing of Figure 3 can be performed by the CVA module 103 of Figure 1.

[0049] In Figure 3, M represents the 3D map for each of channels 1 to C of the cost volume. Y is the attention-aware output corresponding to M. The attention map is D×D, which makes it possible to show different attention over the disparity levels for different datasets.

[0050] Referring to Figure 3, the 4D feature map is divided into C 3D feature maps, each with a size of D×H×W (labeled M). Specifically, the 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$ is divided along the channel dimension of CV to obtain the 3D feature maps $CV = \{M_1, \ldots, M_C\}$, $M_i \in \mathbb{R}^{D \times H \times W}$, $1 \le i \le C$.

[0051] Subsequently, channel-wise attention is applied at attention blocks $A_1$ to $A_C$ to each of the C feature maps to obtain the attention-aware feature maps $Y_i \in \mathbb{R}^{D \times H \times W}$. For each $Y_i$, attention is calculated along the disparity dimension, resulting in a D×D attention matrix.

[0052] The attention-aware feature maps are then concatenated back into a 4D feature map $CV' = \{Y_1, Y_2, \ldots, Y_C\}$, which is the output of the CVA module.

[0053] Figure 4 shows the detailed processing of an attention block in CVA-CWDA according to an embodiment.

[0054] Referring to Figure 4, the attention block reshapes the D×H×W map M into a reshaped (WH)×D map $M_r$ and a reshaped and transposed D×(WH) map $M_r^T$. $M_r$ and $M_r^T$ are then multiplied and a softmax is applied to obtain the D×D attention map (i.e., the attention matrix $X \in \mathbb{R}^{D \times D}$). The D×D attention map is then multiplied with $M_r$, and the result is reshaped back to D×H×W and added to M to output a D×H×W attention-aware feature map Y.
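The reshape, multiply, softmax, and residual-add steps of the CVA-CWDA attention block can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function names are hypothetical, the multiplication order (X applied on the left of the transposed map) and the row-wise softmax are chosen so that the shapes are consistent, and no learned parameters are included.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disparity_attention_block(M):
    """M: (D, H, W) map of one channel -> (Y, X), Y: (D, H, W), X: (D, D)."""
    D, H, W = M.shape
    M_r = M.reshape(D, H * W).T            # (WH) x D
    X = softmax(M_r.T @ M_r, axis=-1)      # D x D attention matrix (rows sum to 1)
    Y = (X @ M_r.T).reshape(D, H, W) + M   # re-weight, reshape, add the input back
    return Y, X

def cva_cwda(CV):
    """CV: (C, D, H, W). Apply one attention block per channel, then re-stack."""
    return np.stack([disparity_attention_block(M)[0] for M in CV], axis=0)
```

Each channel yields its own D×D matrix, so a cost volume with C channels produces C attention matrices before the outputs are stacked back into a C×D×H×W volume.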

[0055] Figure 5 illustrates the processing of CVA-DWCA according to an embodiment. For example, the processing of Figure 5 can be performed by the CVA module 103 of Figure 1.

[0056] In Figure 5, N represents the 3D map for each of disparity levels 1 to D of the cost volume. Y is the attention-aware output corresponding to N. The attention map is C×C, which allows different attention to be paid to the different channels of the cost volume.

[0057] Referring to Figure 5, the 4D feature map is divided into D 3D feature maps, each with a size of C×H×W (labeled N). Specifically, the 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$ is divided along the disparity dimension of CV to obtain the 3D feature maps $CV = \{N_1, \ldots, N_D\}$, $N_i \in \mathbb{R}^{C \times H \times W}$, $1 \le i \le D$.

[0058] Subsequently, disparity-wise attention is applied at the attention blocks to each of the D feature maps $N_1$ to $N_D$ to obtain the attention-aware feature maps $Y_i \in \mathbb{R}^{C \times H \times W}$. For each $Y_i$, attention is calculated along the channel dimension, resulting in a C×C attention matrix.

[0059] The attention-aware feature maps are then concatenated back into a 4D feature map $CV' = \{Y_1, Y_2, \ldots, Y_D\}$, which is the output of the CVA module.

[0060] Figure 6 shows the detailed processing of an attention block in CVA-DWCA according to an embodiment.

[0061] Referring to Figure 6, the attention block reshapes the C×H×W map N into a reshaped (WH)×C map $N_r$ and a reshaped and transposed C×(WH) map $N_r^T$. $N_r$ and $N_r^T$ are then multiplied and a softmax is applied to obtain the C×C attention map (i.e., the attention matrix $X \in \mathbb{R}^{C \times C}$). The C×C attention map is then multiplied with $N_r$, and the result is reshaped back to C×H×W and added to N to output a C×H×W attention-aware feature map Y.
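For CVA-DWCA the same steps run once per disparity level, which can be written as a single batched matrix product. The sketch below is illustrative only: `cva_dwca` is a hypothetical name, the row-wise softmax and the multiplication order are assumptions made so that the shapes line up, and learned parameters are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cva_dwca(CV):
    """CV: (C, D, H, W) -> (attention-aware volume (C, D, H, W), X of shape (D, C, C))."""
    C, D, H, W = CV.shape
    # N[d] is the flattened C x (WH) slice for disparity level d.
    N = CV.transpose(1, 0, 2, 3).reshape(D, C, H * W)
    X = softmax(N @ N.transpose(0, 2, 1), axis=-1)  # one C x C matrix per level
    Y = (X @ N).reshape(D, C, H, W)                  # apply attention per level
    return Y.transpose(1, 0, 2, 3) + CV, X           # back to (C, D, H, W), add input
```

The batched form gives the same result as looping over the D slices and calling a per-slice attention block, while avoiding a Python-level loop.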

[0062] In the above embodiments, the CVA-CWDA and CVA-DWCA modules capture different information. More specifically, CVA-CWDA attempts to find the correlation between different disparity levels. For example, if the input image is a close-up indoor scene, CVA-CWDA can emphasize the parts of the cost volume with large disparity levels. However, if the input image is an outdoor scene, CVA-CWDA can emphasize the parts of the cost volume with small disparity levels.

[0063] CVA-DWCA focuses on the correlations between different channels of the cost volume, which can be useful when the cost volume (such as the extended cost volume in AMNet) consists of multiple types of information. When the cost volume consists of feature map correlations and dissimilarity, CVA-DWCA can modify which information used in the cost volume is better for a particular image.

[0064] CVA based on single-branch attention

[0065] The concept of single-branch CVA is to operate directly on the 4D cost volume. Before computing the attention matrix, the high-dimensional feature map is "flattened" into a low-dimensional feature map. This is achieved through a one-shot attention module, in which the input cost volume is flattened into a 2D feature map.

[0066] Below are four different methods for flattening high-dimensional feature maps for attention computation: (a) CVA-SBDA, (b) CVA-SBCA, (c) CVA-SBCDCA, and (d) CVA-SBSA.

[0067] Figure 7 shows the processing of CVA-SBDA according to an embodiment.

[0068] Referring to Figure 7, the input to CVA-SBDA is a 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$. CV is reshaped into a 2D (WHC)×D map $CV_r \in \mathbb{R}^{(WHC) \times D}$, and reshaped and transposed into a 2D D×(WHC) map $CV_r^T \in \mathbb{R}^{D \times (WHC)}$. $CV_r$ and $CV_r^T$ are multiplied and a softmax is applied to obtain the attention matrix $X \in \mathbb{R}^{D \times D}$. The D×D attention matrix X is multiplied with $CV_r$, and the result is reshaped to 4D and added to CV to output the attention-aware cost volume $CV' \in \mathbb{R}^{C \times D \times H \times W}$.

[0069] Figure 8 shows the processing of single-branch channel attention on the cost volume (CVA-SBCA) according to an embodiment.

[0070] Referring to Figure 8, the input to CVA-SBCA is a 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$. CV is reshaped into a 2D (DWH)×C map $CV_r \in \mathbb{R}^{(DWH) \times C}$, and reshaped and transposed into a 2D C×(DWH) map $CV_r^T \in \mathbb{R}^{C \times (DWH)}$. $CV_r$ and $CV_r^T$ are multiplied and a softmax is applied to obtain the attention matrix $X \in \mathbb{R}^{C \times C}$. The C×C attention matrix X is multiplied with $CV_r$, and the result is reshaped to 4D and added to CV to output the attention-aware cost volume $CV' \in \mathbb{R}^{C \times D \times H \times W}$.

[0071] Figure 9 shows the processing of single-branch combined disparity-channel attention on the cost volume (CVA-SBCDCA) according to an embodiment.

[0072] Referring to Figure 9, the input to CVA-SBCDCA is a 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$. CV is reshaped into a 2D (WH)×(CD) map $CV_r \in \mathbb{R}^{(WH) \times (CD)}$, and reshaped and transposed into a 2D (CD)×(WH) map $CV_r^T \in \mathbb{R}^{(CD) \times (WH)}$. $CV_r$ and $CV_r^T$ are multiplied and a softmax is applied to obtain the attention matrix $X \in \mathbb{R}^{(CD) \times (CD)}$. The CD×CD attention matrix X is multiplied with $CV_r$, and the result is reshaped to 4D and added to CV to output the attention-aware cost volume $CV' \in \mathbb{R}^{C \times D \times H \times W}$.

[0073] Figure 10 shows the processing of single-branch spatial attention on the cost volume (CVA-SBSA) according to an embodiment.

[0074] Referring to Figure 10, the input to CVA-SBSA is a 4D feature map $CV \in \mathbb{R}^{C \times D \times H \times W}$. CV is reshaped into a 2D (CD)×(WH) map $CV_r \in \mathbb{R}^{(CD) \times (WH)}$, and reshaped and transposed into a 2D (WH)×(CD) map $CV_r^T \in \mathbb{R}^{(WH) \times (CD)}$. $CV_r$ and $CV_r^T$ are multiplied and a softmax is applied to obtain the attention matrix $X \in \mathbb{R}^{(WH) \times (WH)}$. The WH×WH attention matrix X is multiplied with $CV_r$, and the result is reshaped to 4D and added to CV to output the attention-aware cost volume $CV' \in \mathbb{R}^{C \times D \times H \times W}$.
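The four single-branch variants differ only in which axes of the cost volume the attention matrix is computed over, so they can be illustrated with one function. The sketch below is a simplified reading, not the disclosed implementation: `single_branch_cva`, `VARIANTS`, the row-wise softmax, and the multiplication order are assumptions chosen so that the shapes are consistent, and learned parameters are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_branch_cva(CV, attend_axes):
    """CV: (C, D, H, W); attend_axes: axes (C=0, D=1, H=2, W=3) that index the attention matrix."""
    rest = [ax for ax in range(4) if ax not in attend_axes]
    perm = list(attend_axes) + rest
    moved = CV.transpose(perm)                       # attended axes first
    n = int(np.prod([CV.shape[ax] for ax in attend_axes]))
    flat = moved.reshape(n, -1)                      # n x (all remaining positions)
    X = softmax(flat @ flat.T, axis=-1)              # n x n attention matrix
    out = (X @ flat).reshape(moved.shape)            # re-weight, reshape to 4D
    inverse = np.argsort(perm).tolist()              # undo the axis permutation
    return np.transpose(out, inverse) + CV, X

# Axes attended by each variant (hypothetical lookup table for this sketch).
VARIANTS = {"SBDA": (1,), "SBCA": (0,), "SBCDCA": (0, 1), "SBSA": (2, 3)}
```

For example, `single_branch_cva(CV, VARIANTS["SBSA"])` returns a WH×WH attention matrix, while `VARIANTS["SBDA"]` returns a D×D one computed from all channels at once.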

[0075] Comparing the above embodiments, CVA-SBDA and CVA-SBCA have attention matrices of the same size as CVA-CWDA and CVA-DWCA, respectively, but their attention matrices are calculated from all channels of the cost volume, whereas in multi-branch CVA an attention matrix is calculated for each channel. Since the size of the attention matrix remains unchanged, their computational costs are similar.

[0076] CVA-SBCDCA has an attention matrix of size CD×CD, which is a combined attention between the disparity level and the channels, but results in a much higher computational cost.

[0077] CVA-SBSA has an attention matrix of size WH×WH, which is a spatial attention mechanism and also has high computational cost.
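To make the difference in attention matrix sizes concrete, the following sketch counts the entries of one attention matrix for each variant. The cost volume dimensions used in the call are hypothetical and chosen only for illustration; they do not come from the disclosure.

```python
def attention_matrix_entries(C, D, H, W):
    """Number of entries in one attention matrix for each CVA variant."""
    return {
        "CVA-CWDA / CVA-SBDA (D x D)": D * D,
        "CVA-DWCA / CVA-SBCA (C x C)": C * C,
        "CVA-SBCDCA (CD x CD)": (C * D) ** 2,
        "CVA-SBSA (WH x WH)": (W * H) ** 2,
    }

# Hypothetical cost volume size, chosen only for illustration.
sizes = attention_matrix_entries(C=32, D=48, H=64, W=128)
for name, n in sizes.items():
    print(f"{name}: {n:,} entries")
```

With these assumed dimensions, the D×D and C×C matrices hold 2,304 and 1,024 entries, while the CD×CD and WH×WH matrices hold 2,359,296 and 67,108,864 entries, which is why the combined and spatial variants are described as much more expensive.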

[0078] Dual cost volume attention

[0079] Dual cost volume attention can use any two of the CVA modules described above. Since dual attention is constructed by using two CVA modules together, the modules can be organized in a sequential or a parallel arrangement.

[0080] Figure 11 shows the processing of dual cost volume attention using sequential and parallel arrangements, according to an embodiment.

[0081] Referring to Figure 11, in the sequential arrangement (a), two CVA modules are applied in series, and in the parallel arrangement (b), two CVA modules are applied in parallel and their results are combined to provide the final cost volume estimate. Because different attention matrices capture different information, dual cost volume attention can be utilized by organizing the CVA modules in a sequential or parallel manner, as shown in Figure 11.
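The two arrangements can be sketched as simple function composition. In this sketch, `axis_attention` is a hypothetical stand-in for any CVA module, and combining the parallel branches by an element-wise mean is an assumption: the text states only that the results are combined, not how.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(CV, axis):
    """Stand-in CVA module: single-branch attention along one axis of (C, D, H, W)."""
    moved = np.moveaxis(CV, axis, 0)
    flat = moved.reshape(moved.shape[0], -1)
    X = softmax(flat @ flat.T, axis=-1)
    return np.moveaxis((X @ flat).reshape(moved.shape), 0, axis) + CV

def dual_sequential(CV, first, second):
    # Arrangement (a): the second module refines the output of the first.
    return second(first(CV))

def dual_parallel(CV, first, second):
    # Arrangement (b): both modules see the same input.
    # Assumption: the two outputs are merged by an element-wise mean.
    return 0.5 * (first(CV) + second(CV))

def disparity_attn(cv):
    return axis_attention(cv, axis=1)

def channel_attn(cv):
    return axis_attention(cv, axis=0)
```

Calling `dual_sequential(CV, disparity_attn, channel_attn)` or `dual_parallel(CV, disparity_attn, channel_attn)` returns a volume with the same C×D×H×W shape as the input, so either arrangement can feed the cost aggregation module unchanged.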

[0082] Cost aggregation

[0083] The cost aggregation module outputs a disparity map from the input attention-aware cost volume. It can be implemented using any existing cost aggregation module, such as the semi-global cost aggregation in the Guided Aggregation Net (GANet) or stacked atrous multi-scale (AM) modules, used together with components 101, 102, and 104 shown in Figure 1.

[0084] Disparity fusion

[0085] To further improve accuracy and robustness, two networks can be trained on different disparity ranges. Both networks can use the same feature extraction / cost volume / cost attention / cost aggregation, but with different maximum disparity ranges.

[0086] For example, two networks (CVANet) can be based on two commonly used backbones (AMNet and GANet).

[0087] AMNet uses a depthwise separable version of ResNet-50 as its feature extractor. Following this depthwise separable version are AM modules that capture global depth context information at multiple scales. An Extended Cost Volume (ECV) can be used for cost aggregation, simultaneously computing different cost matching metrics. The output of the ECV can be processed by stacked AM modules to output the final disparity.

[0088] GANet implements a feature extractor using an hourglass network and uses feature map correlation as the cost volume. GANet incorporates a semi-global guided aggregation (SGA) layer, which achieves a differentiable approximation of semi-global matching and aggregates matching costs along different directions across the entire image. This allows for accurate estimation of occluded and reflected regions.

[0089] More specifically, the first CVANet is trained on the disparity range [0, a] and outputs a first disparity map D1, where $P_{1,i}$ is the probability that a pixel has an estimated disparity equal to i when i < a, and $P_{1,i}$ is the probability that a pixel has an estimated disparity greater than or equal to a when i = a.

[0090] A second CVANet is trained on the full disparity range [0, b], where a < b, and outputs a second disparity map D2, where $P_{2,i}$ is the probability that a pixel has an estimated disparity equal to i when i < b, and $P_{2,i}$ is the probability that a pixel has an estimated disparity greater than or equal to b when i = b.
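The meaning of the last entry of each probability vector can be illustrated with a small example. The sketch below does not describe how either network is trained; it only shows how a distribution over the full range [0, b] relates to a distribution over the partial range [0, a] when all probability mass at disparities of a or more is collected in the final bin. The helper name `lump_tail` and the example values are hypothetical.

```python
import numpy as np

def lump_tail(p_full, a):
    """p_full: probabilities for disparities 0..b.
    Returns probabilities for 0..a, where the last entry is P(disparity >= a)."""
    p_full = np.asarray(p_full, dtype=float)
    return np.concatenate([p_full[:a], [p_full[a:].sum()]])

# Example with b = 4 and a = 2 for a single pixel.
p2 = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
p1 = lump_tail(p2, a=2)
print(p1)
```

Here the entries for disparities 2, 3, and 4 are merged into the final bin of the partial-range vector, so both vectors still sum to 1.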

[0091] D1 and D2 can be fused either by a direct disparity combination based on D1 and D2, or by a soft combination (or probabilistic combination) that uses the probability vectors $P_{1,i}$ and $P_{2,i}$.

[0092] When the disparity combination is used, the final output disparity $D_{fused}$ is a simple weighted sum, as follows:

[0093]

[0094]

[0095] w1 and w2 are constants in [0,1]. They are set based on the validation results.

[0096] When the soft combination is used, fusion takes place on the probability vectors as follows, where w1, w2, and w3 are constants in [0,1]:

[0097]

[0098] The fused probability vector P_{fused,i} is then further normalized.

[0099] The final disparity output based on soft combination can be expressed by the following formula:

[0100]
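
The soft combination can be sketched as follows, with the caveat that the fusion equations are not reproduced in this text. The piecewise weighting (w1 and w2 for levels below a, w3 for levels from a upward) and the expected-value readout of the final disparity are assumptions made for illustration. The name `soft_fuse` is hypothetical.

```python
import numpy as np

def soft_fuse(p1, p2, a, w1, w2, w3):
    """Fuse two per-pixel disparity probability volumes.

    p1: shape (a + 1, H, W), from the network trained on [0, a].
    p2: shape (b + 1, H, W), from the network trained on [0, b], a < b.
    Assumed rule (not confirmed by the source text):
      i <  a : w1 * P1[i] + w2 * P2[i]
      i >= a : w3 * P2[i]
    The result is normalized to sum to 1 over the disparity axis.
    """
    p1 = np.asarray(p1, dtype=np.float64)
    p2 = np.asarray(p2, dtype=np.float64)
    b = p2.shape[0] - 1
    if p1.shape[0] != a + 1 or a >= b:
        raise ValueError("expected p1 with a + 1 levels and a < b")
    fused = np.empty_like(p2)
    fused[:a] = w1 * p1[:a] + w2 * p2[:a]
    fused[a:] = w3 * p2[a:]
    # Normalization step, guarded against division by zero.
    fused = fused / np.maximum(fused.sum(axis=0, keepdims=True), 1e-12)
    # Assumed readout: expected disparity under the fused distribution.
    levels = np.arange(b + 1, dtype=np.float64).reshape(-1, 1, 1)
    disparity = (levels * fused).sum(axis=0)
    return fused, disparity

# Single-pixel example with a = 2 and b = 4 (illustrative values).
p1 = np.array([0.2, 0.3, 0.5]).reshape(3, 1, 1)
p2 = np.array([0.1, 0.2, 0.3, 0.2, 0.2]).reshape(5, 1, 1)
p_fused, d_soft = soft_fuse(p1, p2, a=2, w1=0.5, w2=0.5, w3=0.5)
```

The last bin of the first network only states that the disparity is at least a, so this sketch lets the second network alone decide how probability is spread over levels a through b.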

[0101] Using a single model, the above process can generate reasonable disparity outputs for both indoor and outdoor scenes.

[0102] Table 1 below compares the accuracy and efficiency (AE) of CVANet with different attention modules, showing that multi-branch attention modules generally achieve better accuracy/efficiency than single-branch attention modules.

[0103] Table 1

[0104]

[0105] The attention maps also show that the cost-volume attention module described above works well for images of different scenes.

[0106] Figure 12 shows graphs demonstrating the effectiveness of the cost-volume attention module according to an embodiment.

[0107] Referring to Figure 12, to demonstrate the effectiveness of the above techniques, graphs (a) to (c) in the top row show the column-wise sums of the values of the D×D attention matrix, which exhibit a pattern consistent with the disparity distributions shown in graphs (d) to (f) in the bottom row.

[0108] Figure 13 shows a block diagram of an electronic device in a network environment according to one embodiment.
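
For readers who want to reproduce this kind of diagnostic, the following Python sketch computes a D×D attention matrix from a C×D×H×W cost volume and sums it column by column. The reshaping to (WHC)×D follows the claims. How the attention matrix itself is formed (a row-wise softmax over X^T X) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def disparity_attention_column_sums(cost_volume):
    """Return a DxD attention matrix and its column-wise sums.

    cost_volume: array of shape (C, D, H, W).
    The softmax(X^T X) construction is an assumed, common self-attention
    form; the source does not spell it out in this passage.
    """
    c, d, h, w = cost_volume.shape
    # Move the disparity axis last and flatten the rest: (C*H*W, D).
    x = np.moveaxis(cost_volume, 1, -1).reshape(-1, d)
    scores = x.T @ x                                   # (D, D) similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)      # each row sums to 1
    col_sums = attn.sum(axis=0)                        # one value per level
    return attn, col_sums

rng = np.random.default_rng(0)
cv = rng.standard_normal((2, 4, 3, 3))  # C=2, D=4, H=3, W=3 (toy sizes)
attn, col_sums = disparity_attention_column_sums(cv)
```

A large column sum for a given disparity level means that many other levels attend to it, which is why the column sums can be compared against the disparity distribution of the scene.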

[0109] Referring to Figure 13, in network environment 1300, electronic device 1301 can communicate with electronic device 1302 via a first network 1398 (e.g., a short-range wireless communication network), or with electronic device 1304 or server 1308 via a second network 1399 (e.g., a long-range wireless communication network). Electronic device 1301 can communicate with electronic device 1304 via server 1308. Electronic device 1301 may include processor 1320, memory 1330, input device 1350, sound output device 1355, display device 1360, audio module 1370, sensor module 1376, interface 1377, haptic module 1379, camera module 1380, power management module 1388, battery 1389, communication module 1390, subscriber identification module (SIM) 1396, or antenna module 1397. In one embodiment, at least one of the components (e.g., display device 1360 or camera module 1380) may be omitted from electronic device 1301, or one or more other components may be added to electronic device 1301. In one embodiment, some components may be implemented as a single integrated circuit (IC). For example, sensor module 1376 (e.g., fingerprint sensor, iris sensor, or illuminance sensor) may be embedded in display device 1360 (e.g., display).

[0110] Processor 1320 can execute, for example, software (e.g., program 1340) to control at least one other component (e.g., hardware or software component) of electronic device 1301 associated with processor 1320, and can perform various data processing or calculations. As at least part of the data processing or calculations, processor 1320 can load commands or data received from another component (e.g., sensor module 1376 or communication module 1390) into volatile memory 1332, process the commands or data stored in volatile memory 1332, and store the resulting data in non-volatile memory 1334. Processor 1320 may include a main processor 1321 (e.g., central processing unit (CPU) or application processor (AP)) and auxiliary processors 1323 (e.g., graphics processing unit (GPU), image signal processor (ISP), sensor hub processor, or communication processor (CP)) that may operate independently of or with the main processor 1321. Alternatively, the auxiliary processor 1323 may be adapted to consume less power than the main processor 1321, or to perform specific functions. The auxiliary processor 1323 may be implemented as separate from or part of the main processor 1321.

[0111] The auxiliary processor 1323 may, when the main processor 1321 is inactive (e.g., in sleep) state, take over control of at least some functions or states associated with at least one component of the electronic device 1301 (e.g., display device 1360, sensor module 1376, or communication module 1390), or, when the main processor 1321 is active (e.g., executing an application), control, together with the main processor 1321, at least some functions or states associated with at least one component of the electronic device 1301 (e.g., display device 1360, sensor module 1376, or communication module 1390). According to one embodiment, the auxiliary processor 1323 (e.g., an image signal processor or a communication processor) may be implemented as part of another component (e.g., camera module 1380 or communication module 1390) functionally associated with the auxiliary processor 1323.

[0112] The memory 1330 may store various data used by at least one component of the electronic device 1301 (e.g., processor 1320 or sensor module 1376). The various data may include, for example, software (e.g., program 1340) and input or output data for commands associated therewith. The memory 1330 may include volatile memory 1332 or non-volatile memory 1334.

[0113] Program 1340 may be stored as software in memory 1330 and may include, for example, an operating system (OS) 1342, middleware 1344, or application 1346.

[0114] Input device 1350 can receive commands or data from outside electronic device 1301 (e.g., a user) to be used by another component of electronic device 1301 (e.g., processor 1320). Input device 1350 may include, for example, a microphone, mouse, or keyboard.

[0115] The sound output device 1355 can output sound signals to the outside of the electronic device 1301. The sound output device 1355 may include, for example, a speaker or a receiver. The speaker can be used for general purposes (such as playing multimedia or recording), and the receiver can be used to receive incoming calls. According to one embodiment, the receiver can be implemented separately from the speaker or as part of the speaker.

[0116] Display device 1360 can visually provide information to the outside of electronic device 1301 (e.g., to a user). Display device 1360 may include, for example, a display, a holographic device, or a projector, and control circuitry that controls a respective one of the display, holographic device, and projector. According to one embodiment, display device 1360 may include touch circuitry adapted to detect touch, or sensor circuitry adapted to measure the intensity of the force caused by touch (e.g., a pressure sensor).

[0117] The audio module 1370 can convert sound into electrical signals and vice versa. According to one embodiment, the audio module 1370 can obtain sound via an input device 1350, or output sound via a sound output device 1355 or an earphone of an external electronic device 1302 that is directly (e.g., wired) or wirelessly connected to the electronic device 1301.

[0118] Sensor module 1376 can detect the operating state of electronic device 1301 (e.g., power or temperature) or the environmental state outside electronic device 1301 (e.g., user state), and then generate an electrical signal or data value corresponding to the detected state. Sensor module 1376 may include, for example, an attitude sensor, a gyroscope sensor, an atmospheric pressure sensor, a magnetic sensor, an accelerometer, a grip sensor, a proximity sensor, a color sensor, an infrared (IR) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.

[0119] Interface 1377 may support one or more specified protocols for direct (e.g., wired) or wireless connection of electronic device 1301 to external electronic device 1302. According to one embodiment, interface 1377 may include, for example, a High Definition Multimedia Interface (HDMI), a Universal Serial Bus (USB) interface, a Secure Digital (SD) card interface, or an audio interface.

[0120] The connection terminal 1378 may include a connector via which the electronic device 1301 can be physically connected to an external electronic device 1302. According to one embodiment, the connection terminal 1378 may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).

[0121] The haptic module 1379 can convert electrical signals into mechanical stimulation (e.g., vibration or motion) or electrical stimulation that can be recognized by a user via touch or kinesthesia. According to one embodiment, the haptic module 1379 may include, for example, a motor, a piezoelectric element, or an electrical stimulator.

[0122] Camera module 1380 can capture still images or moving images. According to one embodiment, camera module 1380 may include one or more lenses, an image sensor, an image signal processor, or a flash.

[0123] The power management module 1388 manages the power supplied to the electronic device 1301. The power management module 1388 can be implemented as at least part of, for example, a power management integrated circuit (PMIC).

[0124] Battery 1389 can supply power to at least one component of electronic device 1301. According to one embodiment, battery 1389 may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.

[0125] Communication module 1390 can support the establishment of a direct (e.g., wired) or wireless communication channel between electronic device 1301 and external electronic devices (e.g., electronic device 1302, electronic device 1304, or server 1308), and perform communication via the established communication channel. Communication module 1390 may include one or more communication processors that can operate independently of processor 1320 (e.g., AP) and support direct (e.g., wired) or wireless communication. According to one embodiment, communication module 1390 may include wireless communication module 1392 (e.g., cellular communication module, short-range wireless communication module, or Global Navigation Satellite System (GNSS) communication module) or wired communication module 1394 (e.g., local area network (LAN) communication module or power line communication (PLC) module). A corresponding one of these communication modules can communicate with external electronic devices via the first network 1398 (e.g., a short-range communication network such as Bluetooth™) or via the second network 1399 (e.g., a long-range communication network such as a cellular network, the Internet, or a computer network such as a LAN or a wide area network (WAN)). These various types of communication modules can be implemented as a single component (e.g., a single IC) or as multiple components that are separate from each other (e.g., multiple ICs). The wireless communication module 1392 can use subscriber information (e.g., International Mobile Subscriber Identity (IMSI)) stored in the subscriber identification module 1396 to identify and authenticate electronic device 1301 in the communication network (e.g., the first network 1398 or the second network 1399).

[0126] Antenna module 1397 can transmit or receive signals or power to or from the outside of electronic device 1301 (e.g., external electronic device). According to one embodiment, antenna module 1397 may include one or more antennas, and thus, at least one antenna suitable for a communication scheme used in a communication network such as a first network 1398 or a second network 1399 may be selected, for example, by communication module 1390 (e.g., wireless communication module 1392). Signals or power can then be transmitted or received between communication module 1390 and external electronic device via the selected at least one antenna.

[0127] At least some of the above components can be coupled to each other and communicate signals (e.g., commands or data) between them via an inter-peripheral communication scheme (e.g., a bus, general purpose input and output (GPIO), serial peripheral interface (SPI), or mobile industry processor interface (MIPI)).

[0128] According to one embodiment, commands or data can be sent or received between electronic device 1301 and external electronic device 1304 via server 1308 connected to the second network 1399. Each of electronic devices 1302 and 1304 can be a device of the same type as, or a different type from, electronic device 1301. All or some of the operations to be performed on electronic device 1301 can be performed at one or more of the external electronic devices 1302 or 1304. For example, if electronic device 1301 is to perform a function or service automatically, or in response to a request from a user or another device, electronic device 1301 may, instead of or in addition to performing the function or service itself, request one or more external electronic devices to perform at least a portion of the function or service. The one or more external electronic devices receiving the request may perform at least a portion of the requested function or service, or an additional function or service associated with the request, and transmit the result of the execution to electronic device 1301. Electronic device 1301 can provide the result, with or without further processing, as at least part of a response to the request. For this purpose, cloud computing, distributed computing, or client-server computing technologies can be used, for example.

[0129] One embodiment may be implemented as software (e.g., program 1340) including one or more instructions stored in a storage medium (e.g., internal memory 1336 or external memory 1338) readable by a machine (e.g., electronic device 1301). For example, a processor of electronic device 1301 may invoke at least one of the one or more instructions stored in the storage medium and execute it under the processor's control, with or without one or more other components. Thus, the machine can be operated to perform at least one function according to the invoked at least one instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. The term "non-transitory" indicates that the storage medium is a tangible device and does not include signals (e.g., electromagnetic waves), but the term does not distinguish whether data is stored semi-permanently or temporarily in the storage medium.

[0130] According to one embodiment, the methods of this disclosure may be included and provided in a computer program product. The computer program product may be traded as a product between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) via an app store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). If distributed online, at least a portion of the computer program product may be temporarily generated or at least temporarily stored in a machine-readable storage medium such as the memory of a manufacturer's server, an app store's server, or a relay server.

[0131] According to one embodiment, each of the above-described components (e.g., a module or program) may include a single entity or multiple entities. One or more of the above-described components may be omitted, or one or more other components may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component can still perform one or more functions of each of the multiple components in the same or similar manner as the functions performed by the corresponding component of the multiple components prior to integration. Operations performed by a module, program, or other component may be performed sequentially, in parallel, repeatedly, or heuristically, or may be performed in a different order, or one or more operations may be omitted, or one or more other operations may be added.

[0132] Although certain embodiments of this disclosure have been described in detail, this disclosure can be modified in various forms without departing from its scope. Therefore, the scope of this disclosure should not be determined solely based on the described embodiments, but rather on the appended claims and their equivalents.

Claims

1. A method for generating disparity, comprising: extracting a first feature map from a one-side image captured by a first camera; extracting a second feature map from an other-side image captured by a second camera; computing matching costs based on a comparison of the first feature map and the second feature map to generate a cost volume; generating an attention-aware cost volume from the generated cost volume; and aggregating the attention-aware cost volume to generate an output disparity, wherein generating the attention-aware cost volume includes assigning different weights to different disparity levels in the generated cost volume.

2. The method according to claim 1, wherein generating the attention-aware cost volume further includes: dividing a 4D feature map of the generated cost volume into D 3D feature maps, each 3D feature map having a size of C×H×W, where D represents the disparity level, C represents the number of channels, H represents the height, and W represents the width; applying disparity attention to each of the 3D feature maps to obtain attention-aware feature maps; and concatenating the attention-aware feature maps into a 4D feature map of the attention-aware cost volume.

3. The method according to claim 2, wherein the disparity attention is applied to each of the 3D feature maps using a C×C attention matrix.

4. The method according to claim 1, wherein generating the attention-aware cost volume further includes assigning different weights to different channels in the generated cost volume.

5. The method according to claim 1, wherein generating the attention-aware cost volume further includes: dividing a 4D feature map of the generated cost volume into C 3D feature maps, each 3D feature map having a size of D×H×W, where C represents the number of channels, D represents the disparity level, H represents the height, and W represents the width; applying channel attention to each of the 3D feature maps to obtain attention-aware feature maps; and concatenating the attention-aware feature maps into a 4D feature map of the attention-aware cost volume.

6. The method according to claim 5, wherein the channel attention is applied to each of the 3D feature maps using a D×D attention matrix.

7. The method according to claim 1, wherein generating the attention-aware cost volume further includes: reshaping a 4D feature map of the generated cost volume, having a size of C×D×H×W, into a 2D feature map having a size of (WHC)×D, where C represents the number of channels, D represents the disparity level, H represents the height, and W represents the width; applying channel attention to the 2D feature map to obtain an attention-aware feature map; and reshaping the attention-aware feature map into a 4D feature map of the attention-aware cost volume.

8. The method according to claim 7, wherein the channel attention is applied to the 2D feature map using a D×D attention matrix.

9. The method according to claim 1, wherein generating the attention-aware cost volume further includes: reshaping a 4D feature map of the generated cost volume, having a size of C×D×H×W, into a 2D feature map having a size of (DWH)×C, where C represents the number of channels, D represents the disparity level, H represents the height, and W represents the width; applying channel attention to the 2D feature map to obtain an attention-aware feature map; and reshaping the attention-aware feature map into a 4D feature map of the attention-aware cost volume.

10. The method according to claim 9, wherein the channel attention is applied to the 2D feature map using a C×C attention matrix.

11. The method according to claim 1, wherein generating the attention-aware cost volume further includes: reshaping a 4D feature map of the generated cost volume, having a size of C×D×H×W, into a 2D feature map having a size of WH×CD, where C represents the number of channels, D represents the disparity level, H represents the height, and W represents the width; applying channel attention to the 2D feature map to obtain an attention-aware feature map; and reshaping the attention-aware feature map into a 4D feature map of the attention-aware cost volume.

12. The method according to claim 11, wherein the channel attention is applied to the 2D feature map using a CD×CD attention matrix.

13. The method according to claim 1, wherein generating the attention-aware cost volume further includes: reshaping a 4D feature map of the generated cost volume, having a size of C×D×H×W, into a 2D feature map having a size of CD×WH, where C represents the number of channels, D represents the disparity level, H represents the height, and W represents the width; applying channel attention to the 2D feature map to obtain an attention-aware feature map; and reshaping the attention-aware feature map into a 4D feature map of the attention-aware cost volume.

14. The method according to claim 11, wherein the channel attention is applied to the 2D feature map using a WH×WH attention matrix.

15. The method according to any one of claims 1 to 14, further comprising: fusing two or more aggregated disparities from different networks trained with different disparity ranges to provide a final output disparity.

16. The method according to claim 15, further comprising: before fusing the two or more aggregated disparities, training the two or more aggregated disparities on different disparity ranges.

17. A system for generating disparity, comprising: a memory; and a processor configured to: extract a first feature map from a one-side image captured by a first camera; extract a second feature map from an other-side image captured by a second camera; compute matching costs based on a comparison of the first feature map and the second feature map to generate a cost volume; generate an attention-aware cost volume from the generated cost volume; and aggregate the attention-aware cost volume to generate an output disparity, wherein the processor is configured to generate the attention-aware cost volume by assigning different weights to different disparity levels in the generated cost volume.

18. The system according to claim 17, wherein the processor is configured to generate the attention-aware cost volume by assigning different weights to different disparity levels in the generated cost volume and different weights to different channels in the generated cost volume.

19. A system for generating disparity, comprising: a feature map extraction module configured to extract a first feature map from a one-side image captured by a first camera and a second feature map from an other-side image captured by a second camera; a cost volume calculation module configured to compute matching costs based on a comparison of the first feature map and the second feature map to generate a cost volume; a cost volume attention module configured to generate an attention-aware cost volume from the generated cost volume; and a cost aggregation module configured to aggregate the attention-aware cost volume to generate an output disparity, wherein the cost volume attention module is configured to generate the attention-aware cost volume by assigning different weights to different disparity levels in the generated cost volume.