Semantic segmentation and stereo matching method and framework based on multi-task joint learning
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2023-11-08
- Publication Date
- 2026-06-26
AI Technical Summary
Existing semantic segmentation and stereo matching methods based on multi-task joint learning require a large amount of well-annotated training data and complex training strategies, and it is difficult to achieve simultaneous convergence of the two networks.
A joint encoder is used to extract shared features, a preliminary disparity map is obtained by calculating disparity, and the disparity map is updated and refined. A densely connected decoder and a multi-level GRU update operator are combined, and the entire learning process is supervised by a semantic consistency loss function, so as to achieve end-to-end learning of semantic segmentation and stereo matching.
It improves the real-time performance of the driving three-dimensional environment perception system, reduces the dependence on the dataset, simplifies the training strategy, and improves the computational efficiency, accuracy, spatial consistency and robustness of the results.
Figure CN117710453B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision, and in particular to a semantic segmentation and stereo matching method and framework based on multi-task joint learning.
Background Technology
[0002] The semantic segmentation and stereo matching method based on multi-task joint learning can combine multiple sub-tasks, namely semantic segmentation and stereo matching, for learning, thereby reducing computational complexity while significantly improving performance.
[0003] Semantic segmentation refers to assigning each pixel in an image to a semantic category, essentially dividing the image into multiple semantic regions. Stereo matching, on the other hand, finds corresponding pixels between the left and right images of a stereo pair; the resulting disparity encodes depth information, from which the position and distance relationships between different objects can be determined. Traditionally, semantic segmentation and stereo matching are handled independently, using different models for learning and inference. However, this independent approach can lead to information loss and inconsistencies. Therefore, the idea of multi-task joint learning has been introduced, aiming to improve the performance and complementarity of both tasks by learning them simultaneously.
[0004] In related technologies, existing semantic segmentation and stereo matching methods based on multi-task joint learning have the following drawbacks:
[0005] A large amount of well-annotated training data is required: for example, SegStereo, DSNet, SGNet and DispSegNet first require an initial unsupervised training phase on a large dataset, and then subsequent supervised fine-tuning on a smaller dataset.
[0006] Complex training strategies are needed for joint learning of the two tasks: for example, DSNet employs different joint learning strategies, where training alternates between the semantic segmentation network and the stereo matching network, with each network frozen during the training of the other. However, achieving simultaneous convergence of the two networks can be challenging.
[0007] The shortcomings of the above methods are problems that urgently need to be solved by those skilled in the art.
Summary of the Invention
[0008] The purpose of this invention is to provide a semantic segmentation and stereo matching method and framework based on multi-task joint learning to improve the real-time performance of a driving stereo environment perception system.
[0009] The objective of this invention can be achieved through the following technical solutions:
[0010] A semantic segmentation and stereo matching method based on multi-task joint learning includes the following steps:
[0011] Acquire stereo image pair information, wherein the stereo image pair information includes a left image and a right image;
[0012] Based on the left and right images, a joint encoder is used to extract shared features, and a preliminary disparity map is obtained by calculating disparity.
[0013] Based on the shared features, the disparity is updated to refine the preliminary disparity map into a refined disparity map, wherein the refined disparity map is the stereo matching result;
[0014] The shared features are transformed into a semantic space and fused with the refined disparity map to obtain fused features.
[0015] A skip-connection decoder based on dense connections decodes the fused features to obtain semantic segmentation results.
[0016] Furthermore, in the multi-task joint learning method, the joint encoder and the skip connection decoder use the same parameters. The joint encoder includes residual blocks and downsampling layers, and the skip connection decoder includes a decoder layer, a skip connection layer, an upsampling layer, a deconvolution layer, and an output layer.
[0017] Furthermore, a multi-level GRU update operator is employed to obtain the disparity map.
[0018] Furthermore, the specific steps for obtaining the disparity map include:
[0019] An initial 3D correlation volume is constructed based on the shared features of the left and right images;
[0020] A 3D correlation volume pyramid is constructed based on the initial 3D correlation volume, with average pooling performed for downsampling;
[0021] Based on the 3D correlation volume pyramid, a multi-level GRU update operator is used to update the disparity in the initial disparity map to obtain a refined disparity map.
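The first two steps above can be illustrated with a minimal NumPy sketch. It assumes the correlation between two pixels is the dot product of their feature vectors along the same image row, which the text does not specify; the kernel size of 2 and stride of 2 follow the embodiment described later, while the function names and the number of pyramid levels are hypothetical.

```python
import numpy as np

def build_correlation_volume(feat_left, feat_right):
    """Correlate every left pixel with every right pixel on the same row.

    feat_left, feat_right: arrays of shape (H, W, D) holding shared features.
    Returns an array of shape (H, W, W); entry [y, x, x2] is the dot product
    of the left feature at (y, x) and the right feature at (y, x2).
    The dot-product form is an assumption, not taken from the source.
    """
    return np.einsum("yxd,yzd->yxz", feat_left, feat_right)

def build_pyramid(volume, num_levels=3):
    """Downsample the last axis by 1D average pooling (kernel 2, stride 2)."""
    pyramid = [volume]
    for _ in range(num_levels - 1):
        prev = pyramid[-1]
        w2 = prev.shape[-1] // 2  # a trailing odd column, if any, is dropped
        pooled = prev[..., : 2 * w2].reshape(*prev.shape[:-1], w2, 2).mean(axis=-1)
        pyramid.append(pooled)
    return pyramid

rng = np.random.default_rng(0)
left = rng.standard_normal((4, 16, 8))
right = rng.standard_normal((4, 16, 8))
pyramid = build_pyramid(build_correlation_volume(left, right), num_levels=3)
shapes = [level.shape for level in pyramid]
```

Each level halves the resolution along the matching axis only, so coarser levels summarize matching costs over wider disparity ranges.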
[0022] Furthermore, the specific steps for obtaining the fusion features include:
[0023] Disparity coding: Encode the disparity map and perform feature extraction and enhancement;
[0024] Max pooling and residual layers: Max pooling layers and four residual layers are applied sequentially to gradually increase the number of feature map channels;
[0025] Feature fusion: The shared features and the features extracted from the encoded disparity map are fused to obtain fused features.
[0026] Furthermore, the feature extraction and enhancement are achieved by using convolutional layers, batch normalization layers, and ReLU activation layers.
[0027] Furthermore, the output of each layer of the densely connected skip-connection decoder is concatenated along the channel dimension with the outputs of all preceding layers and serves as the input to the next layer.
[0028] Furthermore, in the multi-task joint learning process, a semantic consistency loss function is used to supervise the entire joint learning process.
[0029] Furthermore, the calculation process of the semantic consistency loss function specifically includes:
[0030] Constructing a 3D tensor: a 3D tensor $V_{3D} \in \mathbb{R}^{H \times W \times C}$ is constructed, where for each pixel p and each channel c the entry is built using the Kronecker delta function, $V_{3D}(p, c) = \delta(M^G(p), c)$. In the formula, $M^G(p)$ is the ground-truth semantic label of pixel p, and H, W, and C represent height, width, and number of channels, respectively;
[0031] Average pooling operation: The average pooling operation is applied to each channel of the tensor to obtain features between different semantic classes;
[0032] Normalization operation: a normalization operation is performed on the feature tensor to obtain the normalized tensor $V_N \in \mathbb{R}^{H \times W \times C}$, where N is the number of semantic categories;
[0033] Weight mapping: For each pixel p, the normalized feature of the channel c with the maximum value is selected as the weight to achieve semantic consistency-guided weight mapping.
[0034] Total loss calculation: $L_{scg} = L_{ss} + L_{sm}$. In the formula, $L_{scg}$ indicates the total loss, and $L_{ss}$ and $L_{sm}$ represent the semantic segmentation loss and the stereo matching loss, respectively. N represents the number of pixels, C represents the number of classes, the label term represents the true label of pixel p in category c, α represents the weight of $L_{ss}$, $D^G$ represents the ground-truth disparity, and $D_i$ represents the i-th disparity.
[0035] This invention also provides a semantic segmentation and stereo matching framework based on multi-task joint learning, comprising:
[0036] Image acquisition module: acquires stereo image pair information, the stereo image pair information including left image and right image;
[0037] Shared feature extraction module: Based on the left and right images, a joint encoder is used to extract shared features, and a preliminary disparity map is obtained by calculating disparity;
[0038] Refined disparity map acquisition module: Based on the shared features, the disparity is updated to refine the preliminary disparity map into a refined disparity map, wherein the refined disparity map is the stereo matching result;
[0039] Feature fusion adaptive module: transforms the shared features into a semantic space and fuses them with the refined disparity map to obtain fused features;
[0040] Decoding module: Based on a densely connected skip connection decoder, it decodes the fused features to obtain semantic segmentation results.
[0041] Compared with the prior art, the present invention has the following beneficial effects:
[0042] (1) The present invention performs semantic segmentation and stereo matching based on the extracted shared features, avoiding redundant calculations and improving computational efficiency and speed. Semantic segmentation is performed based on the disparity map of the stereo matching results, making full use of the structural consistency information of the two tasks, improving the understanding and reasoning ability of the image, thereby improving the real-time performance of the driving stereo environment perception system.
[0043] (2) This invention trains the entire joint learning process through a loss function guided by semantic consistency, reducing the dependence on the dataset. The loss function emphasizes the consistency of the structure in the tasks of semantic segmentation and stereo matching, which can improve the accuracy, spatial consistency, boundary accuracy and robustness of the learning process, thereby improving the overall performance and result quality of the task.
[0044] (3) This invention implements semantic segmentation and stereo matching tasks in the same framework. This end-to-end learning process has a simple training strategy and requires less data compared to other existing joint learning frameworks.
[0045] (4) The present invention employs a feature fusion adaptive module to convert shared features into semantic space and then fuse them with encoded disparity features to improve the overall scene understanding capability.
Attached Figure Description
[0046] Figure 1 is a schematic diagram of the method flow according to an embodiment of the present invention;
[0047] Figure 2 is a schematic diagram of feature fusion according to an embodiment of the present invention;
[0048] Figure 3 is a diagram illustrating the effect of stereo matching and semantic segmentation of images according to an embodiment of the present invention.
Detailed Implementation
[0049] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0050] Example 1
[0051] This embodiment provides a semantic segmentation and stereo matching method based on multi-task joint learning. Given left and right images as input, the method performs semantic segmentation and stereo matching simultaneously and outputs both results for the images. As shown in Figure 1, the method includes the following steps:
[0052] S1. Obtain stereo image pair information, which includes a left image and a right image.
[0053] First, it is necessary to obtain the left and right images in front of the driver during autonomous driving. In this embodiment, the images are RGB images, and their storage format can be PNG, JPG, or other image formats.
[0054] S2. Based on the left and right images, a joint encoder is used to extract shared features, and a preliminary disparity map is obtained by calculating disparity.
[0055] The joint encoder consists of a series of residual blocks and downsampling layers, and its main function is to extract features from the input image. Residual blocks can help the model learn more complex features, avoid the vanishing gradient problem, and perform gradient updates more effectively.
[0056] S3. Based on the shared features, update the disparity to refine the preliminary disparity map into a refined disparity map, wherein the refined disparity map is the stereo matching result.
[0057] To refine the disparity map, follow these steps:
[0058] Construct an initial 3D correlation volume from the features of the left and right views;
[0059] Construct a 3D correlation volume pyramid and perform average pooling for downsampling, where the m-th 3D correlation volume is obtained from the (m-1)-th 3D correlation volume $C_{m-1}$ by 1D average pooling with a kernel size of 2 and a stride of 2. The constructed correlation volume pyramid can capture depth information in the image, providing a foundation for subsequent disparity estimation tasks;
[0060] Finally, the preliminary disparity map and contextual features are fed into the GRU (Gated Recurrent Unit) update operator. The GRU updates its hidden states and then uses the new hidden states to predict disparity updates, thereby updating the disparity map. The GRU update operator updates the disparity from coarse to fine, with the disparity initialized to 0, and finally yields an accurate, refined disparity map.
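The update loop in the last step can be sketched as follows. This is a simplified, single-level illustration with untrained random weights, not the operator of the embodiment: it omits the correlation pyramid lookup and the multi-level structure, and the class name, dimensions, and iteration count are hypothetical. It shows only the pattern of starting from zero disparity, updating a hidden state, and predicting a disparity increment at each iteration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """A plain GRU cell applied independently at every pixel (untrained weights)."""

    def __init__(self, input_dim, hidden_dim, rng):
        shape = (hidden_dim + input_dim, hidden_dim)
        self.w_z = rng.standard_normal(shape) * 0.1  # update gate
        self.w_r = rng.standard_normal(shape) * 0.1  # reset gate
        self.w_h = rng.standard_normal(shape) * 0.1  # candidate state

    def __call__(self, h, x):
        hx = np.concatenate([h, x], axis=-1)
        z = sigmoid(hx @ self.w_z)
        r = sigmoid(hx @ self.w_r)
        candidate = np.tanh(np.concatenate([r * h, x], axis=-1) @ self.w_h)
        return (1.0 - z) * h + z * candidate

def refine_disparity(context, num_iters=4, hidden_dim=16, seed=0):
    """Return the list of disparity maps produced by successive GRU updates."""
    rng = np.random.default_rng(seed)
    height, width, ctx_dim = context.shape
    cell = GRUCell(ctx_dim + 1, hidden_dim, rng)
    w_out = rng.standard_normal((hidden_dim, 1)) * 0.1  # hidden state -> update
    hidden = np.zeros((height, width, hidden_dim))
    disparity = np.zeros((height, width))  # disparity is initialized to 0
    predictions = []
    for _ in range(num_iters):
        x = np.concatenate([context, disparity[..., None]], axis=-1)
        hidden = cell(hidden, x)
        disparity = disparity + (hidden @ w_out)[..., 0]  # apply predicted update
        predictions.append(disparity)
    return predictions

predictions = refine_disparity(np.ones((3, 5, 4)))
```

Keeping every intermediate prediction makes it possible to supervise all iterations during training rather than only the final one.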
[0061] S4. The shared features are converted into a semantic space and fused with the disparity map to obtain fused features.
[0062] As shown in Figure 2, this embodiment implements feature fusion through the Feature Fusion Adaptive (FFA) module. The execution steps are as follows:
[0063] First, the disparity is encoded using convolutional layers, batch normalization layers, and ReLU activation layers to refine the disparity map. Then, the features of the left image are remapped and fused to obtain a fused feature map. Finally, max pooling layers and four residual layers are applied sequentially to gradually increase the number of channels in the fused feature map.
[0064] The shared feature map and disparity map are then fused to incorporate both semantic and spatial geometric information, thereby enhancing the accuracy of semantic scene understanding.
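A rough NumPy sketch of this fusion path is given below. Several choices are assumptions rather than details from the text: the convolution is reduced to a 1x1 kernel, batch normalization is applied without learned scale and shift, fusion is done by element-wise addition, and the four residual layers are left out. All function names are hypothetical.

```python
import numpy as np

def encode_disparity(disparity, weight, eps=1e-5):
    """1x1 convolution, per-channel normalization, then ReLU.

    disparity: (H, W) map; weight: (1, C) kernel of a hypothetical 1x1 convolution.
    """
    feat = disparity[..., None] @ weight                # (H, W, C)
    mean = feat.mean(axis=(0, 1), keepdims=True)
    var = feat.var(axis=(0, 1), keepdims=True)
    feat = (feat - mean) / np.sqrt(var + eps)           # batch-norm style, no affine
    return np.maximum(feat, 0.0)                        # ReLU

def max_pool_2x2(feat):
    """2x2 max pooling with stride 2 over the spatial axes of an (H, W, C) array."""
    h, w, c = feat.shape
    h2, w2 = h // 2, w // 2
    return feat[: 2 * h2, : 2 * w2].reshape(h2, 2, w2, 2, c).max(axis=(1, 3))

def fuse(shared, disparity, weight):
    """Fuse shared features with encoded disparity features, then pool."""
    encoded = encode_disparity(disparity, weight)
    fused = shared + encoded  # assumed fusion operator (element-wise addition)
    return max_pool_2x2(fused)

rng = np.random.default_rng(0)
shared = rng.standard_normal((4, 6, 3))
disparity = rng.uniform(0.0, 64.0, size=(4, 6))
fused = fuse(shared, disparity, weight=rng.standard_normal((1, 3)))
```

Addition requires the encoded disparity features to have the same channel count as the shared features; concatenation would be an equally plausible reading of the text.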
[0065] S5. A skip connection decoder based on dense connections decodes the fused features to obtain semantic segmentation results.
[0066] In this embodiment, the skip connection decoder uses the same parameters as the feature extractor. In the last layer, the features are upsampled to create a prediction mapping with N channels, where N represents the number of semantic categories.
[0067] In a densely connected skip decoder, the output of each layer is concatenated along the channel dimension with the outputs of all preceding layers, and the result serves as the input to the next layer. This approach fuses low-level and high-level features, which helps the model consider both global and local information simultaneously, thereby improving the accuracy of disparity maps and the performance of semantic segmentation. Furthermore, the densely connected skip decoder directly connects feature maps from different layers, enabling feature reuse and improving efficiency.
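The dense connection pattern can be sketched in a few lines of NumPy. This is only an illustration of the connectivity, assuming that the connection is a channel-wise concatenation and that each layer is a 1x1 convolution with ReLU and random weights; upsampling and the skip connections from the encoder are omitted, and the names and layer sizes are hypothetical.

```python
import numpy as np

def dense_block(x, num_layers=3, growth=4, seed=0):
    """Each layer consumes the channel-wise concatenation of all earlier outputs."""
    rng = np.random.default_rng(seed)
    features = [x]
    input_widths = []
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=-1)      # all preceding outputs
        input_widths.append(stacked.shape[-1])
        weight = rng.standard_normal((stacked.shape[-1], growth)) * 0.1
        features.append(np.maximum(stacked @ weight, 0.0))  # 1x1 conv + ReLU
    return np.concatenate(features, axis=-1), input_widths

x = np.random.default_rng(1).standard_normal((2, 2, 8))
out, input_widths = dense_block(x)
```

With 8 input channels and a growth of 4, successive layers see 8, 12, and 16 input channels, which is how earlier feature maps get reused by later layers.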
[0068] In this embodiment, the entire joint learning process is supervised by a semantic consistency loss function. The steps for constructing the loss function are as follows:
[0069] First, a three-dimensional tensor $V_{3D} \in \mathbb{R}^{H \times W \times C}$ is constructed using the expression $V_{3D}(p, c) = \delta(M^G(p), c)$, where c represents the c-th channel in the tensor, p represents each pixel, $M^G(p)$ is the ground-truth semantic label of pixel p, and δ represents the Kronecker delta function.
[0070] Each channel of this tensor can be viewed as a binary segmentation map of class c. To emphasize semantic consistency, average pooling is applied to each channel to obtain the features between different classes, $V_I \in \mathbb{R}^{H \times W \times C}$: $V_I = \mathrm{AvgPool}(V_{3D})$, where $\mathrm{AvgPool}(\cdot)$ represents the average pooling operation.
[0071] Then normalization is applied to $V_I$ to obtain the normalized tensor $V_N \in \mathbb{R}^{H \times W \times C}$. The semantic consistency-guided weight mapping is then obtained by selecting, for each pixel p, the normalized feature of the channel with the maximum value, that is, $\max_{c} V_N(p, c)$.
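The construction of the weight map can be sketched with NumPy as follows. The pooling kernel size (3), the stride (1), the edge padding, and the form of the normalization (values summing to 1 across classes at each pixel) are all assumptions, since the text does not state them; the function name is hypothetical.

```python
import numpy as np

def scg_weight_map(labels, num_classes, kernel=3):
    """Semantic consistency-guided weights from an (H, W) integer label map."""
    h, w = labels.shape
    # V_3D(p, c) = delta(label(p), c): one binary segmentation map per class.
    v_3d = (labels[..., None] == np.arange(num_classes)).astype(float)
    # Per-channel average pooling, stride 1, edge padding keeps the H x W size.
    pad = kernel // 2
    padded = np.pad(v_3d, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    v_i = np.zeros_like(v_3d)
    for dy in range(kernel):
        for dx in range(kernel):
            v_i += padded[dy:dy + h, dx:dx + w]
    v_i /= kernel * kernel
    # Assumed normalization: per-pixel values sum to 1 across classes.
    v_n = v_i / v_i.sum(axis=-1, keepdims=True)
    # Weight of each pixel: the largest normalized value over the channels.
    return v_n.max(axis=-1)

labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1  # left half is class 0, right half is class 1
weights = scg_weight_map(labels, num_classes=2)
```

Under these assumptions, pixels inside a uniform region receive a weight of 1, while pixels next to the class boundary receive 2/3 because one third of their neighborhood belongs to the other class.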
[0072] The total loss is calculated as: $L_{scg} = L_{ss} + L_{sm}$.
[0073] Here, $L_{ss}$ and $L_{sm}$ represent the semantic segmentation loss and the stereo matching loss, respectively, N represents the number of pixels, C represents the number of classes, the label term represents the true label of pixel p in category c, and α represents the weight of $L_{ss}$, which was experimentally determined to be 0.1.
[0074] In addition, $D^G$ represents the ground-truth disparity and $D_i$ represents the i-th disparity. In this embodiment, α = 0.1 and γ = 0.9.
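To make the roles of α and γ concrete, the sketch below combines two loss terms in the form $L_{scg} = L_{ss} + L_{sm}$. The expressions for the individual terms are not recoverable from the text, so both are assumptions: $L_{ss}$ is taken to be an α-weighted mean cross-entropy, and $L_{sm}$ an L1 sequence loss in which later disparity predictions are weighted more heavily through powers of γ. The semantic consistency weighting is not included, and all function names are hypothetical.

```python
import numpy as np

def segmentation_loss(probs, labels, alpha=0.1):
    """Assumed L_ss: alpha-weighted mean cross-entropy over N pixels, C classes."""
    n = labels.size
    picked = probs.reshape(n, -1)[np.arange(n), labels.reshape(n)]
    return -alpha * np.log(picked).mean()

def stereo_loss(predictions, disparity_gt, gamma=0.9):
    """Assumed L_sm: L1 error of each prediction D_i, weighted by gamma**(n-1-i)."""
    n = len(predictions)
    return sum(
        gamma ** (n - 1 - i) * np.abs(disparity_gt - d).mean()
        for i, d in enumerate(predictions)
    )

def total_loss(probs, labels, predictions, disparity_gt):
    """L_scg = L_ss + L_sm."""
    return segmentation_loss(probs, labels) + stereo_loss(predictions, disparity_gt)

labels = np.zeros((2, 2), dtype=int)
probs = np.full((2, 2, 2), 0.5)          # uniform prediction over 2 classes
disparity_gt = np.full((2, 2), 10.0)
predictions = [disparity_gt + 2.0, disparity_gt + 1.0]
loss = total_loss(probs, labels, predictions, disparity_gt)
```

In this toy case the segmentation term equals 0.1 · ln 2 and the stereo term equals 0.9 · 2 + 1 · 1 = 2.8.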
[0075] Through backpropagation of the semantic consistency loss and model iteration, the trained model can be used for semantic segmentation and stereo matching tasks. The obtained semantic segmentation and stereo matching results are shown in Figure 3.
[0076] To verify the effectiveness of the above method, extensive experiments were conducted on the vKITTI2 and KITTI datasets in this embodiment. The results demonstrate the effectiveness of the proposed multi-task joint learning method and its superior performance compared with other state-of-the-art single-task networks.
[0077] Example 2
[0078] This embodiment provides a semantic segmentation and stereo matching framework based on multi-task joint learning, including:
[0079] Image acquisition module: acquires stereo image pair information, the stereo image pair information including left image and right image;
[0080] Shared feature extraction module: Based on the left and right images, a joint encoder is used to extract shared features, and a preliminary disparity map is obtained by calculating disparity;
[0081] Refined disparity map acquisition module: Based on the shared features, the disparity is updated to refine the preliminary disparity map into a refined disparity map, wherein the refined disparity map is the stereo matching result;
[0082] Feature fusion adaptive module: transforms the shared features into a semantic space and fuses them with the refined disparity map to obtain fused features;
[0083] Decoding module: Based on a densely connected skip connection decoder, it decodes the fused features to obtain semantic segmentation results.
[0084] The rest are as in Example 1.
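The data flow between the five modules can be sketched as a small Python skeleton. The class and field names are hypothetical, and each module is represented by a placeholder callable; the sketch only shows which outputs feed which inputs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class JointPerceptionFramework:
    """Wires the five modules together; every callable is a placeholder."""
    acquire: Callable[[], Tuple[Any, Any]]                 # -> (left, right)
    extract_shared: Callable[[Any, Any], Tuple[Any, Any]]  # -> (shared, preliminary)
    refine_disparity: Callable[[Any, Any], Any]            # -> refined disparity
    fuse_features: Callable[[Any, Any], Any]               # -> fused features
    decode: Callable[[Any], Any]                           # -> segmentation

    def run(self):
        left, right = self.acquire()
        shared, preliminary = self.extract_shared(left, right)
        refined = self.refine_disparity(shared, preliminary)
        fused = self.fuse_features(shared, refined)
        segmentation = self.decode(fused)
        return refined, segmentation

# Stub modules that return strings, so the data flow can be read off directly.
framework = JointPerceptionFramework(
    acquire=lambda: ("L", "R"),
    extract_shared=lambda l, r: (f"feat({l},{r})", "d0"),
    refine_disparity=lambda s, d: f"refine({s},{d})",
    fuse_features=lambda s, d: f"fuse({s},{d})",
    decode=lambda f: f"seg({f})",
)
disparity, segmentation = framework.run()
```

The shared features are computed once and reused by both the disparity refinement and the fusion step, which is where the framework avoids redundant computation.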
[0085] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0086] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
[0087] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0088] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0089] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0090] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0091] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A semantic segmentation and stereo matching method based on multi-task joint learning, characterized in that it includes the following steps: acquiring stereo image pair information, wherein the stereo image pair information includes a left image and a right image; based on the left and right images, using a joint encoder to extract shared features, and obtaining a preliminary disparity map by calculating disparity; based on the shared features, updating the disparity to refine the preliminary disparity map into a refined disparity map, wherein the refined disparity map is a stereo matching result; wherein a multi-level GRU update operator is used to obtain the refined disparity map, and the specific steps for obtaining the refined disparity map include: constructing an initial 3D correlation volume based on the shared features of the left and right images; constructing a 3D correlation volume pyramid based on the initial 3D correlation volume, with average pooling performed for downsampling; and, based on the 3D correlation volume pyramid, using the multi-level GRU update operator to update the disparity in the initial disparity map to obtain the refined disparity map; transforming the shared features into a semantic space and fusing them with the refined disparity map to obtain fused features, wherein the specific steps for obtaining the fused features include: disparity coding: the refined disparity map is encoded and features are extracted and enhanced, the feature extraction and enhancement being achieved by using convolutional layers, batch normalization layers and ReLU activation layers; max pooling and residual layers: max pooling layers and four residual layers are applied sequentially to gradually increase the number of feature map channels; feature fusion: the shared features and the features extracted from the encoded, refined disparity map are fused to obtain the fused features; and decoding the fused features with a skip-connection decoder based on dense connections to obtain semantic segmentation results.
2. The semantic segmentation and stereo matching method based on multi-task joint learning according to claim 1, characterized in that, in the multi-task joint learning method, the joint encoder and the skip connection decoder use the same parameters; the joint encoder includes residual blocks and downsampling layers, and the skip connection decoder includes a decoder layer, a skip connection layer, an upsampling layer, a deconvolution layer, and an output layer.
3. The semantic segmentation and stereo matching method based on multi-task joint learning according to claim 1, characterized in that the output of each layer of the densely connected skip-connection decoder is concatenated along the channel dimension with the outputs of all preceding layers and serves as the input to the next layer.
4. The semantic segmentation and stereo matching method based on multi-task joint learning according to claim 1, characterized in that, in the multi-task joint learning process, a semantic consistency loss function is used to supervise the entire joint learning process.
5. The semantic segmentation and stereo matching method based on multi-task joint learning according to claim 4, characterized in that the calculation process of the semantic consistency loss function specifically includes: constructing a 3D tensor: a 3D tensor $V_{3D} \in \mathbb{R}^{H \times W \times C}$ is constructed, where for each pixel p and each channel c the entry is built using the Kronecker delta function, $V_{3D}(p, c) = \delta(M^G(p), c)$, and H, W, and C represent the height, width, and number of channels, respectively; average pooling operation: the average pooling operation is applied to each channel of the tensor to obtain features between different semantic classes; normalization operation: a normalization operation is performed on the feature tensor to obtain the normalized tensor $V_N \in \mathbb{R}^{H \times W \times C}$, where N is the number of semantic categories; weight mapping: for each pixel p, the normalized feature of the channel c with the maximum value is selected as the weight to achieve semantic consistency-guided weight mapping; total loss calculation: $L_{scg} = L_{ss} + L_{sm}$, where $L_{scg}$ indicates the total loss, $L_{ss}$ and $L_{sm}$ represent the semantic segmentation loss and the stereo matching loss, respectively, N represents the number of pixels, C represents the number of classes, the label term represents the true label of pixel p in category c, α represents the weight of $L_{ss}$, $D^G$ represents the ground-truth disparity, and $D_i$ represents the i-th disparity.
6. A semantic segmentation and stereo matching device based on the multi-task joint learning method according to any one of claims 1-5, characterized in that it includes: an image acquisition module, which acquires stereo image pair information, the stereo image pair information including a left image and a right image; a shared feature extraction module, which, based on the left and right images, uses a joint encoder to extract shared features and obtains a preliminary disparity map by calculating disparity; a refined disparity map acquisition module, which, based on the shared features, updates the disparity to refine the preliminary disparity map into a refined disparity map, wherein the refined disparity map is a stereo matching result; a feature fusion adaptive module, which transforms the shared features into a semantic space and fuses them with the refined disparity map to obtain fused features; and a decoding module, which, based on a densely connected skip connection decoder, decodes the fused features to obtain semantic segmentation results.