Model training method and device, electronic equipment and storage medium

By inputting the left and right view images into the semantic segmentation model, performing image transformation and feature extraction, and combining image disparity to calculate the loss value, the high cost problem caused by manual annotation is solved, and efficient self-supervised semantic segmentation model training is achieved.

CN115346079B · Active · Publication Date: 2026-06-30 · BEIJING PHIGENT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING PHIGENT TECHNOLOGY CO LTD
Filing Date: 2022-07-12
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, training semantic segmentation models requires manual annotation of sample images, resulting in high sample annotation costs and low training efficiency.

Method used

The left and right view images are input into the semantic segmentation model to be trained, and the image transformation layer, segmentation map acquisition layer, and loss function layer perform image transformation, feature extraction, and loss value calculation, so that self-supervised training of the semantic segmentation model is achieved without manual annotation.

Benefits of technology

It reduces sample labeling costs, improves model training efficiency, and enables self-supervised semantic segmentation model training.


Abstract

This application provides a model training method, apparatus, electronic device, and storage medium. The method includes: inputting a left-view image and a right-view image into a semantic segmentation model to be trained; calling an image transformation layer to perform image transformation processing on the left-view image and the right-view image to obtain a first image and a second image of the left-view image, and a third image and a fourth image of the right-view image; calling a segmentation map acquisition layer to process the feature maps of the first image, the second image, the third image, and the fourth image to obtain a first segmentation map, a second segmentation map, a third segmentation map, and a fourth segmentation map; calling a loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and image disparity; and, if the loss value is within a preset range, using the trained semantic segmentation model as the target semantic segmentation model. This application can reduce sample annotation costs and improve model training efficiency.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to a model training method, electronic device, and storage medium.

Background Technology

[0002] Semantic segmentation is a task that performs pixel-level classification of images and is now widely used in the field of autonomous driving. In autonomous driving, it can assist the autonomous driving system in making decisions by identifying objects such as trees, vehicles, pedestrians, and lane lines on the road.

[0003] With the significant advances of deep learning in computer vision tasks, semantic segmentation based on deep learning has also achieved good results. However, traditional semantic segmentation methods require collecting sample images for model training, manually labeling all object contours and categories appearing in the images, and then performing end-to-end training on the deep learning model to obtain the semantic segmentation map of the target image. Sample labeling consumes significant human and financial resources, which increases the cost of sample labeling and reduces the training efficiency of the semantic segmentation model.

Summary of the Invention

[0004] This application provides a model training method, apparatus, electronic device, and storage medium to solve the problem in related technologies that require manual annotation of images to generate model training samples, which increases the cost of sample annotation and reduces the training efficiency of semantic segmentation models.

[0005] To solve the above-mentioned technical problems, the embodiments of this application are implemented as follows:

[0006] In a first aspect, embodiments of this application provide a model training method, including:

[0007] The left and right view images are input into the semantic segmentation model to be trained; the semantic segmentation model to be trained includes: an image transformation layer, a segmentation map acquisition layer, and a loss function layer;

[0008] The image transformation layer is invoked to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image;

[0009] The segmentation map acquisition layer is invoked to process the feature maps corresponding to the first image, the second image, the third image, and the fourth image respectively, to obtain the first segmentation map of the first image, the second segmentation map of the second image, the third segmentation map of the third image, and the fourth segmentation map of the fourth image;

[0010] The loss function layer is invoked to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and the image disparity.

[0011] If the loss value is within a preset range, the trained semantic segmentation model will be used as the final target semantic segmentation model.
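The stopping criterion in the last step above can be illustrated with a minimal sketch. The names `train_until_in_range`, `training_step`, `loss_range`, and `max_steps` are hypothetical and do not come from the source; `training_step` stands in for one forward pass and parameter update of the model to be trained and simply returns the current loss value.

```python
from typing import Callable, Optional, Tuple


def train_until_in_range(
    training_step: Callable[[], float],
    loss_range: Tuple[float, float],
    max_steps: int = 1000,
) -> Optional[Tuple[int, float]]:
    """Run training steps until the loss falls inside the preset range.

    Returns (step_index, loss) for the first step whose loss is in range,
    or None if max_steps is reached first.
    """
    low, high = loss_range
    for step in range(1, max_steps + 1):
        loss = training_step()  # hypothetical: one update, returns the loss
        if low <= loss <= high:
            return step, loss
    return None


# Usage with a stand-in step function that replays a fixed loss sequence.
_losses = iter([1.0, 0.5, 0.2, 0.05])
result = train_until_in_range(lambda: next(_losses), loss_range=(0.0, 0.1))
```

The `max_steps` guard is a design choice of this sketch, added so that a model whose loss never enters the range does not train indefinitely.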

[0012] Optionally, the image transformation layer includes: a first image transformation unit and a second image transformation unit.

[0013] The step of calling the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image, includes:

[0014] The first image transformation unit is invoked to perform a first transformation process on the left view image and the right view image to obtain a first image corresponding to the left view image and a third image corresponding to the right view image;

[0015] The second image transformation unit is invoked to perform a second transformation process on the left view image and the right view image to obtain a second image corresponding to the left view image and a fourth image corresponding to the right view image.

[0016] Optionally, the semantic segmentation model to be trained further includes a feature extraction layer, which is located between the image transformation layer and the segmentation map acquisition layer.

[0017] After calling the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image, the method further includes:

[0018] The feature extraction layer is invoked to perform image feature extraction processing on the first image, the second image, the third image, and the fourth image respectively, so as to obtain the first feature map of the first image, the second feature map of the second image, the third feature map of the third image, and the fourth feature map of the fourth image.

[0019] Optionally, the step of calling the segmentation map acquisition layer to process the feature maps corresponding to the first image, the second image, the third image, and the fourth image respectively to obtain a first segmentation map of the first image, a second segmentation map of the second image, a third segmentation map of the third image, and a fourth segmentation map of the fourth image includes:

[0020] The segmentation map acquisition layer is invoked to perform clustering processing on the image pixels within the first feature map to obtain the pixel clustering center corresponding to the first feature map, and the first segmentation map of the first feature map is output based on the pixel clustering center;

[0021] The segmentation map acquisition layer is invoked to perform clustering processing on the image pixels within the second feature map to obtain the pixel clustering center corresponding to the second feature map, and the second segmentation map of the second feature map is output based on the pixel clustering center;

[0022] The segmentation map acquisition layer is invoked to perform clustering processing on the image pixels within the third feature map to obtain the pixel clustering center corresponding to the third feature map, and the third segmentation map of the third feature map is output based on the pixel clustering center;

[0023] The segmentation map acquisition layer is invoked to perform clustering processing on the image pixels within the fourth feature map to obtain the pixel cluster centers corresponding to the fourth feature map, and the fourth segmentation map of the fourth feature map is output based on the pixel cluster centers.
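The clustering steps above can be sketched as follows. The source does not name a clustering algorithm, so plain k-means is used here purely as an assumption; the function name `cluster_segmentation` and the parameters `num_clusters`, `num_iters`, and `seed` are hypothetical. The feature map is assumed to be an array of shape (height, width, channels).

```python
import numpy as np


def cluster_segmentation(feature_map, num_clusters, num_iters=10, seed=0):
    """Cluster per-pixel feature vectors (k-means, an assumed choice) and
    return a (H, W) label map plus the (num_clusters, C) cluster centers."""
    h, w, c = feature_map.shape
    pixels = feature_map.reshape(-1, c).astype(np.float64)

    # Initialize the centers from randomly chosen pixels.
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=num_clusters, replace=False)]

    def assign(current_centers):
        # Distance from every pixel to every center, then nearest center.
        dists = np.linalg.norm(pixels[:, None, :] - current_centers[None], axis=2)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centers)
        for k in range(num_clusters):
            members = pixels[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)

    # Final assignment against the updated centers gives the segmentation map.
    return assign(centers).reshape(h, w), centers


# Usage: a toy feature map whose left and right halves differ clearly.
toy = np.zeros((4, 4, 3))
toy[:, 2:, :] = 10.0
seg, centers = cluster_segmentation(toy, num_clusters=2)
```

In this sketch the same function would be called once per feature map, matching the four parallel clustering steps described above.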

[0024] Optionally, the loss function layer includes: a transformation loss function unit and a disparity loss function unit.

[0025] The step of calling the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and image disparity includes:

[0026] The transformation loss function unit is invoked to calculate the first transformation loss value of the semantic segmentation model to be trained based on the first segmentation map and the second segmentation map;

[0027] The disparity loss function unit is invoked to calculate the first disparity loss value of the semantic segmentation model to be trained based on the first segmentation map, the third segmentation map, and the image disparity.

[0028] The loss value of the semantic segmentation model to be trained is calculated based on the first transformation loss value and the first disparity loss value.
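The source does not give the formulas for the two loss terms at this point, so the following is only a sketch under stated assumptions: each segmentation map is a soft map of shape (H, W, K) already mapped back to the geometry of the untransformed image; channel k denotes the same cluster in every map; both terms are mean squared differences; the disparity is a per-pixel horizontal offset such that left pixel (y, x) corresponds to right pixel (y, x - d); and the two terms are combined by a weighted sum. All function and parameter names are hypothetical.

```python
import numpy as np


def transformation_loss(seg_a, seg_b):
    """Assumed form: mean squared difference of two aligned soft maps."""
    return float(np.mean((seg_a - seg_b) ** 2))


def disparity_loss(seg_left, seg_right, disparity):
    """Assumed form: warp the right map into the left view using the
    disparity, then compare with the left map on valid pixels only."""
    h, w, _ = seg_left.shape
    # Column in the right view that matches each left-view pixel.
    xs = np.arange(w)[None, :] - np.rint(disparity).astype(int)
    valid = (xs >= 0) & (xs < w)
    rows = np.arange(h)[:, None]
    warped = seg_right[rows, np.clip(xs, 0, w - 1)]  # shape (H, W, K)
    diff = ((seg_left - warped) ** 2).mean(axis=2)
    return float(diff[valid].mean()) if valid.any() else 0.0


def total_loss(seg1, seg2, seg3, disparity, disparity_weight=1.0):
    """Weighted sum of both terms; the weighting is an assumption."""
    return (transformation_loss(seg1, seg2)
            + disparity_weight * disparity_loss(seg1, seg3, disparity))


# Usage: the right map is the left map shifted by exactly one pixel.
rng = np.random.default_rng(0)
seg_right = rng.random((3, 5, 2))
seg_left = np.roll(seg_right, 1, axis=1)
disp = np.ones((3, 5))
loss = total_loss(seg_left, seg_left, seg_right, disp)
```

Masking out pixels whose matching column falls outside the image avoids penalizing regions visible in only one view.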

[0029] Optionally, the loss function layer includes: a transformation loss function unit and a disparity loss function unit.

[0030] The step of calling the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and image disparity includes:

[0031] The transformation loss function unit is invoked to calculate the second transformation loss value of the semantic segmentation model to be trained based on the third segmentation map and the fourth segmentation map;

[0032] The disparity loss function unit is invoked to calculate the second disparity loss value of the semantic segmentation model to be trained based on the second segmentation map, the fourth segmentation map, and the image disparity;

[0033] The loss value of the semantic segmentation model to be trained is calculated based on the second transformation loss value and the second disparity loss value.
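The two optional loss formulations can be written compactly. The notation is assumed for illustration and does not appear in the source: $S_1,\dots,S_4$ denote the four segmentation maps, $D$ the image disparity, and $\lambda$ a hypothetical weighting factor. The source only states that the loss value is calculated "based on" the two terms, so the additive combination is an assumption.

```latex
\begin{aligned}
\mathcal{L}^{(1)} &= \mathcal{L}_{\mathrm{trans}}(S_1, S_2) + \lambda\,\mathcal{L}_{\mathrm{disp}}(S_1, S_3, D),\\
\mathcal{L}^{(2)} &= \mathcal{L}_{\mathrm{trans}}(S_3, S_4) + \lambda\,\mathcal{L}_{\mathrm{disp}}(S_2, S_4, D).
\end{aligned}
```

In both variants the transformation term compares two maps derived from the same view, while the disparity term compares maps derived from the left and right views under the same transformation.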

[0034] In a second aspect, embodiments of this application provide a model training apparatus, including:

[0035] An image input module is used to input left and right view images into a semantic segmentation model to be trained; the semantic segmentation model to be trained includes: an image transformation layer, a segmentation map acquisition layer, and a loss function layer;

[0036] The image transformation module is used to call the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, so as to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image;

[0037] The segmentation map acquisition module is used to call the segmentation map acquisition layer to process the feature maps corresponding to the first image, the second image, the third image and the fourth image respectively, to obtain the first segmentation map of the first image, the second segmentation map of the second image, the third segmentation map of the third image and the fourth segmentation map of the fourth image;

[0038] The loss value calculation module is used to call the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map and the image disparity;

[0039] The semantic segmentation model acquisition module is used to use the trained semantic segmentation model as the final target semantic segmentation model when the loss value is within a preset range.

[0040] Optionally, the image transformation layer includes: a first image transformation unit and a second image transformation unit.

[0041] The image transformation module includes:

[0042] The first image transformation unit is used to call the first image transformation unit to perform a first transformation process on the left view image and the right view image to obtain a first image corresponding to the left view image and a third image corresponding to the right view image;

[0043] The second image transformation unit is used to call the second image transformation unit to perform a second transformation process on the left view image and the right view image to obtain a second image corresponding to the left view image and a fourth image corresponding to the right view image.

[0044] Optionally, the semantic segmentation model to be trained further includes a feature extraction layer, which is located between the image transformation layer and the segmentation map acquisition layer.

[0045] The device further includes:

[0046] The feature map acquisition module is used to call the feature extraction layer to perform image feature extraction processing on the first image, the second image, the third image and the fourth image respectively, so as to obtain the first feature map of the first image, the second feature map of the second image, the third feature map of the third image and the fourth feature map of the fourth image.

[0047] Optionally, the segmentation map acquisition module includes:

[0048] The first segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the first feature map, obtain the pixel clustering center corresponding to the first feature map, and output the first segmentation map of the first feature map according to the pixel clustering center.

[0049] The second segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the second feature map, obtain the pixel clustering center corresponding to the second feature map, and output the second segmentation map of the second feature map according to the pixel clustering center;

[0050] The third segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the third feature map, obtain the pixel clustering center corresponding to the third feature map, and output the third segmentation map of the third feature map according to the pixel clustering center;

[0051] The fourth segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the fourth feature map, obtain the pixel clustering center corresponding to the fourth feature map, and output the fourth segmentation map of the fourth feature map according to the pixel clustering center.

[0052] Optionally, the loss function layer includes: a transformation loss function unit and a disparity loss function unit.

[0053] The loss value calculation module includes:

[0054] The first transformation loss calculation unit is used to call the transformation loss function unit to calculate the first transformation loss value of the semantic segmentation model to be trained based on the first segmentation map and the second segmentation map;

[0055] The first disparity loss calculation unit is used to call the disparity loss function unit to calculate the first disparity loss value of the semantic segmentation model to be trained based on the first segmentation map, the third segmentation map and the image disparity;

[0056] The first loss value calculation unit is used to calculate the loss value of the semantic segmentation model to be trained based on the first transformation loss value and the first disparity loss value.

[0057] Optionally, the loss function layer includes: a transformation loss function unit and a disparity loss function unit.

[0058] The loss value calculation module includes:

[0059] The second transformation loss calculation unit is used to call the transformation loss function unit to calculate the second transformation loss value of the semantic segmentation model to be trained based on the third segmentation map and the fourth segmentation map;

[0060] The second disparity loss calculation unit is used to call the disparity loss function unit to calculate the second disparity loss value of the semantic segmentation model to be trained based on the second segmentation map, the fourth segmentation map and the image disparity;

[0061] The second loss value calculation unit is used to calculate the loss value of the semantic segmentation model to be trained based on the second transformation loss value and the second disparity loss value.

[0062] In a third aspect, embodiments of this application provide an electronic device, including:

[0063] A memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the object detection result determination method or the model training method described in any of the preceding claims.

[0064] In a fourth aspect, embodiments of this application provide a readable storage medium; when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform the object detection result determination method or the model training method described in any of the preceding claims.

[0065] In this embodiment, the left and right view images are input into the semantic segmentation model to be trained. The semantic segmentation model to be trained includes an image transformation layer, a segmentation map acquisition layer, and a loss function layer. The image transformation layer performs image transformation processing on the left and right view images respectively, obtaining a first and second image corresponding to the left view image, and a third and fourth image corresponding to the right view image. The segmentation map acquisition layer processes the feature maps corresponding to the first, second, third, and fourth images respectively, obtaining a first segmentation map of the first image, a second segmentation map of the second image, a third segmentation map of the third image, and a fourth segmentation map of the fourth image. The loss function layer calculates the loss value of the semantic segmentation model to be trained based on the first, second, third, and fourth segmentation maps and image disparity. If the loss value is within a preset range, the trained semantic segmentation model is used as the final target semantic segmentation model. This embodiment achieves self-supervised training of the semantic segmentation model by combining image features and image disparity between the left and right views, eliminating the need for manual annotation of sample images, reducing sample annotation costs, and thus improving the training efficiency of the model.

[0066] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the following are specific embodiments of this application.

Attached Figure Description

[0067] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0068] Figure 1 is a flowchart of the steps of a model training method provided in an embodiment of this application;

[0069] Figure 2 is a flowchart of the steps of an image transformation processing method provided in an embodiment of this application;

[0070] Figure 3 is a flowchart of the steps of a segmentation map acquisition method provided in an embodiment of this application;

[0071] Figure 4 is a flowchart of the steps of a loss value calculation method provided in an embodiment of this application;

[0072] Figure 5 is a flowchart of the steps of another loss value calculation method provided in an embodiment of this application;

[0073] Figure 6 is a schematic diagram of a model training process provided in an embodiment of this application;

[0074] Figure 7 is a schematic diagram of the structure of a model training device provided in an embodiment of this application;

[0075] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0076] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0077] Referring to Figure 1, which shows a flowchart of the steps of a model training method provided in an embodiment of this application, the model training method may include the following steps:

[0078] Step 101: Input the left and right view images into the semantic segmentation model to be trained; the semantic segmentation model to be trained includes: an image transformation layer, a segmentation map acquisition layer, and a loss function layer.

[0079] The embodiments of this application can be applied to scenarios where self-supervised semantic segmentation model training is achieved by combining image depth information and the disparity of left and right views.

[0080] In this embodiment, the left and right view images can be the left view image and the right view image of the target vehicle, which can be an autonomous vehicle. In practical applications, the target vehicle can be an unmanned vehicle, such as an unmanned delivery vehicle.

[0081] In one specific implementation, the target vehicle can be a single vehicle. This involves collecting left and right view images of the target vehicle under different driving conditions, and using the collected left and right view images as training samples for the model.

[0082] In another specific implementation, the target vehicle can be multiple vehicles. That is, by collecting left and right view images of multiple target vehicles under different driving conditions, the collected left and right view images are used as training samples for the model.

[0083] The left and right view images refer to the two views (left and right) captured by a binocular camera installed on the target vehicle. In this example, the left and right view images of the target vehicle form the training sample images for the semantic segmentation model.

[0084] The semantic segmentation model to be trained refers to a model designed for pixel-level classification of images. In this example, the semantic segmentation model to be trained may include: an image transformation layer, a segmentation map acquisition layer, and a loss function layer. The image transformation layer can be used to enhance the left and right view images. The segmentation map acquisition layer can be used to cluster the feature maps corresponding to the enhanced images and output a segmentation map corresponding to the enhanced image based on the pixel cluster centers. The loss function layer can be used to calculate the loss value of the semantic segmentation model to be trained based on the output segmentation map.

[0085] When training a semantic segmentation model, left and right view images of the target vehicle can be collected as training samples. These images can then be input into the semantic segmentation model.

[0086] Understandably, when training the semantic segmentation model to be trained, the left view image and the right view image are used as a pair as model training samples. When inputting the model training samples into the semantic segmentation model to be trained, a pair of model training samples (i.e., the matched left view image and the right view image) are input into the semantic segmentation model to be trained for training.
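Pairing matched left and right view images into training samples might look like the following minimal sketch. The file layout (a `left/` and a `right/` directory with identical file names for matched frames) and the function name `pair_stereo_samples` are assumptions for illustration; the source does not describe how the images are stored.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def pair_stereo_samples(
    left_paths: Iterable[str], right_paths: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair left and right view images that share a file name.

    Left images without a matching right image are skipped, so every
    returned sample is a complete (left, right) pair.
    """
    right_by_name = {Path(p).name: p for p in right_paths}
    pairs = []
    for left in sorted(left_paths):
        name = Path(left).name
        if name in right_by_name:
            pairs.append((left, right_by_name[name]))
    return pairs


# Usage with hypothetical file names.
pairs = pair_stereo_samples(
    ["left/0002.png", "left/0001.png"],
    ["right/0001.png", "right/0003.png"],
)
```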

[0087] After inputting the left and right view images into the semantic segmentation model to be trained, step 102 is executed.

[0088] Step 102: Call the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image.

[0089] The first image and the second image refer to the two transformed images of the left view image obtained after performing two image transformations on the left view image. For example, both the first image and the second image are images obtained after rotating the left view image. The first image can be the image obtained after rotating the left view image by 90°, and the second image can be the image obtained after rotating the left view image by 180°, etc.

[0090] The third and fourth images refer to the two transformed images of the right view image obtained after performing two image transformations on the right view image.

[0091] After inputting the left and right view images into the semantic segmentation model to be trained, an image transformation layer can be called to perform image transformation processing on the left and right view images respectively, to obtain the first and second images corresponding to the left view image, and the third and fourth images corresponding to the right view image. For example, as shown in Figure 6, the input to the semantic segmentation model to be trained consists of a left view image and a right view image. After inputting the left view image into the semantic segmentation model, two data augmentation processes are performed on the left view image to obtain two transformed images corresponding to the left view image, namely View1 and View2. After inputting the right view image into the semantic segmentation model, two data augmentation processes are performed on the right view image to obtain two transformed images corresponding to the right view image, namely View3 and View4.

[0092] In this example, the two transformations applied to the right view image are the same as the two applied to the left view image: one of the third and fourth images shares its transformation with the first image, and the other shares its transformation with the second image. For example, the first image is generated by rotating the left view image by 90°, the third image is generated by rotating the right view image by 90°, the second image is generated by rotating the left view image by 180°, the fourth image is generated by rotating the right view image by 180°, and so on. The image transformation process is described in detail below with reference to Figure 2.

[0093] Referring to Figure 2, which shows a flowchart of the steps of an image transformation processing method provided in an embodiment of this application, the image transformation processing method may include steps 201 and 202.

[0094] Step 201: Call the first image transformation unit to perform a first transformation process on the left view image and the right view image to obtain a first image corresponding to the left view image and a third image corresponding to the right view image.

[0095] In this embodiment, the image transformation layer may include: a first image transformation unit and a second image transformation unit, wherein the first image transformation unit may be used to perform a first transformation processing operation on the input left view image and right view image, and the second image transformation unit may be used to perform a second transformation processing operation on the input left view image and right view image.

[0096] After inputting the left and right view images into the semantic segmentation model to be trained, the first image transformation unit can be invoked to perform a first transformation process on the left and right view images to obtain a first image corresponding to the left view image and a third image corresponding to the right view image. That is, the first and third images are images obtained by performing the same image transformation process on the left and right view images. For example, after inputting the left and right view images into the semantic segmentation model to be trained, a 90° rotation transformation operation can be performed on the left and right view images respectively to obtain the first image of the left view image and the third image of the right view image. Alternatively, color transformation processing can be performed on the left and right view images respectively to obtain the first image of the left view image and the third image of the right view image, and so on.

[0097] Step 202: Call the second image transformation unit to perform a second transformation process on the left view image and the right view image to obtain a second image corresponding to the left view image and a fourth image corresponding to the right view image.

[0098] After inputting the left and right view images into the semantic segmentation model to be trained, the second image transformation unit can be invoked to perform a second transformation process on the left and right view images, resulting in a second image corresponding to the left view image and a fourth image corresponding to the right view image. That is, the second and fourth images are images obtained by performing the same image transformation process on the left and right view images.

[0099] In this example, the first transformation process and the second transformation process are different. For example, if the first transformation process is a 90° rotation, the second transformation process can be a 180° rotation. Or, if the first transformation process is a 90° rotation, the second transformation process can be a color transformation, etc.
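Steps 201 and 202 can be sketched as follows, using the 90° and 180° rotations mentioned in the text as the first and second transformation processes. The function names are hypothetical, and images are assumed to be NumPy arrays with the two spatial axes first.

```python
import numpy as np


def first_transform(image):
    """Example first transformation process: rotate by 90 degrees."""
    return np.rot90(image, k=1, axes=(0, 1))


def second_transform(image):
    """Example second transformation process: rotate by 180 degrees."""
    return np.rot90(image, k=2, axes=(0, 1))


def image_transformation_layer(left_view, right_view):
    """Apply both transformations to both views.

    Returns (first, second, third, fourth): the first and third images
    share the first transformation, and the second and fourth images
    share the second transformation.
    """
    first = first_transform(left_view)
    third = first_transform(right_view)
    second = second_transform(left_view)
    fourth = second_transform(right_view)
    return first, second, third, fourth


# Usage on small stand-in images.
left = np.arange(6).reshape(2, 3)
right = left + 100
first, second, third, fourth = image_transformation_layer(left, right)
```

Swapping either function for another operation, such as a color transformation, would leave the pairing logic unchanged.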

[0100] It is understood that the above examples are merely examples listed to better understand the technical solutions of the embodiments of this application, and are not intended to be the only limitation on the embodiments.
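As an illustration of steps 201 and 202, the following is a minimal sketch that assumes the first transformation is a 90° rotation and the second is a 180° rotation, as in the example above. The function names (`rotate90`, `rotate180`, `make_views`) are hypothetical and are not taken from this application.

```python
import numpy as np

def rotate90(img: np.ndarray) -> np.ndarray:
    """Rotate an H x W x C image by 90 degrees (counter-clockwise)."""
    return np.rot90(img, k=1, axes=(0, 1))

def rotate180(img: np.ndarray) -> np.ndarray:
    """Rotate an H x W x C image by 180 degrees."""
    return np.rot90(img, k=2, axes=(0, 1))

def make_views(left, right, first_transform=rotate90, second_transform=rotate180):
    """Apply the same two transformations to both views of a stereo pair.

    Returns (view1, view2, view3, view4): view1 and view2 come from the
    left view image, view3 and view4 from the right view image.
    """
    view1 = first_transform(left)
    view2 = second_transform(left)
    view3 = first_transform(right)
    view4 = second_transform(right)
    return view1, view2, view3, view4

# Toy 4 x 4 RGB stereo pair (random values stand in for real images).
rng = np.random.default_rng(0)
left_img = rng.random((4, 4, 3))
right_img = rng.random((4, 4, 3))
view1, view2, view3, view4 = make_views(left_img, right_img)
```

Any other pair of distinct transformations (for example, a rotation and a color transformation) could be passed in place of the two defaults.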

[0101] After calling the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, and obtaining the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image, step 103 is executed.

[0102] Step 103: Call the segmentation map acquisition layer to process the feature maps corresponding to the first image, the second image, the third image, and the fourth image respectively, to obtain the first segmentation map of the first image, the second segmentation map of the second image, the third segmentation map of the third image, and the fourth segmentation map of the fourth image.

[0103] After obtaining the first and second images corresponding to the left view image, and the third and fourth images corresponding to the right view image, the segmentation map acquisition layer can be called to process the feature maps of these four transformed images respectively, so as to obtain the first segmentation map of the first image, the second segmentation map of the second image, the third segmentation map of the third image, and the fourth segmentation map of the fourth image.

[0104] Understandably, the semantic segmentation model to be trained provided in this embodiment may further include a feature extraction layer, which is located between the image transformation layer and the segmentation map acquisition layer. After obtaining the first and second images corresponding to the left view image, and the third and fourth images corresponding to the right view image, the first, second, third, and fourth images can be used as inputs to the feature extraction layer. The feature extraction layer can be invoked to perform image feature extraction processing on the first, second, third, and fourth images respectively, and obtain the first feature map of the first image, the second feature map of the second image, the third feature map of the third image, and the fourth feature map of the fourth image. In this example, the feature extraction layer can be a CNN (Convolutional Neural Network) layer, which extracts the image features of each of the four transformed images and outputs the feature map corresponding to each image. As shown in Figure 6, after obtaining View1, View2, View3, and View4, the same feature extractor can be used to extract features from each of the four views to obtain their respective feature maps.
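The idea of applying the same feature extractor to all four views can be sketched as follows. This is only an illustration under stated assumptions: a single 3 x 3 convolution followed by a ReLU stands in for the CNN layer, the weights are random rather than learned, and the name `conv3x3_relu` is hypothetical.

```python
import numpy as np

def conv3x3_relu(img: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """One 3x3 convolution with zero padding, followed by a ReLU.

    img:     H x W x C_in image
    kernels: 3 x 3 x C_in x C_out weights (shared across all views)
    returns: H x W x C_out feature map
    """
    h, w, _ = img.shape
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernels.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + h, dx:dx + w, :]   # H x W x C_in
            out += patch @ kernels[dy, dx]            # H x W x C_out
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
kernels = rng.normal(size=(3, 3, 3, 8))               # assumed, untrained weights
views = [rng.random((4, 4, 3)) for _ in range(4)]     # stand-ins for View1..View4
feature_maps = [conv3x3_relu(v, kernels) for v in views]
```

Because the same `kernels` array is used for every view, the four feature maps live in the same feature space, which is what makes the later comparison of segmentation maps meaningful.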

[0105] After obtaining the feature maps corresponding to the four transformed images, the segmentation map acquisition layer can be called to cluster the image pixel features of the four feature maps to obtain the category centers of the pixel features. Then, the segmentation map corresponding to each feature map is output based on the category centers. For example, as shown in Figure 6, after the features of the four views are extracted to obtain the four corresponding feature maps, pixel clustering can be performed on the four feature maps to obtain four segmentation maps.

[0106] The process of obtaining the above segmentation maps is described in detail below with reference to Figure 3.

[0107] Referring to Figure 3, a flowchart of the steps of a segmentation map acquisition method provided in an embodiment of this application is shown. As shown in Figure 3, the segmentation map acquisition method may include steps 301, 302, 303 and 304.

[0108] Step 301: Call the segmentation map acquisition layer to perform clustering processing on the image pixels in the first feature map, obtain the pixel cluster center corresponding to the first feature map, and output the first segmentation map of the first feature map according to the pixel cluster center.

[0109] In this embodiment, after obtaining the first feature map of the first image, the segmentation map acquisition layer can be invoked to perform clustering processing on the image pixels within the first feature map to obtain the pixel cluster centers corresponding to the first feature map. Specifically, the segmentation map acquisition layer can be invoked to use the KNN (k-Nearest Neighbor) clustering algorithm to perform clustering processing on the pixel features within the first feature map to obtain the pixel cluster centers within the first feature map.

[0110] Of course, this is not the only option. In practical applications, other clustering algorithms can also be used to cluster the pixel features within the feature map. Specifically, it can be determined according to business needs. This embodiment does not limit the clustering method for pixel features.

[0111] After obtaining the pixel cluster centers corresponding to the first feature map, the first segmentation map of the first feature map can be output based on the pixel cluster centers. The first segmentation map contains multiple pixel cluster centers and the category corresponding to each pixel cluster center, such as trees, vehicles, pedestrians, etc.
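The clustering step can be sketched as below. Note the swap: this sketch uses a plain k-means loop as a stand-in for the clustering algorithm, because it directly yields cluster centers; it is not presented as the algorithm of this application. The function name `kmeans_segment`, the number of clusters, and the iteration count are all assumptions.

```python
import numpy as np

def kmeans_segment(feature_map: np.ndarray, num_clusters: int,
                   num_iters: int = 10, seed: int = 0):
    """Cluster per-pixel features and return (segmentation_map, centers).

    feature_map: H x W x C array of pixel features
    returns:     H x W integer label map and num_clusters x C centers
    """
    h, w, c = feature_map.shape
    pixels = feature_map.reshape(-1, c).astype(float)
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen pixels (fancy indexing copies).
    centers = pixels[rng.choice(len(pixels), size=num_clusters, replace=False)]

    def assign(centers):
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centers)
        for k in range(num_clusters):
            members = pixels[labels == k]
            if len(members) > 0:          # leave empty clusters unchanged
                centers[k] = members.mean(axis=0)
    return assign(centers).reshape(h, w), centers

# Toy feature map: left half and right half have clearly different features.
toy_features = np.zeros((4, 4, 2))
toy_features[:, 2:, :] = 10.0
seg_map, cluster_centers = kmeans_segment(toy_features, num_clusters=2)
```

The cluster indices produced this way are arbitrary integers; mapping them to named categories such as trees or vehicles is outside the scope of this sketch.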

[0112] Step 302: Call the segmentation map acquisition layer to perform clustering processing on the image pixels in the second feature map, obtain the pixel cluster center corresponding to the second feature map, and output the second segmentation map of the second feature map according to the pixel cluster center.

[0113] Step 303: Call the segmentation map acquisition layer to perform clustering processing on the image pixels in the third feature map to obtain the pixel clustering center corresponding to the third feature map, and output the third segmentation map of the third feature map according to the pixel clustering center.

[0114] Step 304: Call the segmentation map acquisition layer to perform clustering processing on the image pixels in the fourth feature map to obtain the pixel clustering center corresponding to the fourth feature map, and output the fourth segmentation map of the fourth feature map according to the pixel clustering center.

[0115] Understandably, the second, third, and fourth segmentation maps are obtained in the same way as the first segmentation map, so their acquisition process is not described again in this embodiment.

[0116] After obtaining the first segmentation map, the second segmentation map, the third segmentation map, and the fourth segmentation map, proceed to step 104.

[0117] Step 104: Call the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map and the image disparity.

[0118] Image disparity refers to the parallax between the left view image and the right view image. Parallax is the difference in apparent direction of the same object when it is viewed from two points separated by a certain distance.

[0119] In the specific implementation, the left and right view images are obtained by taking pictures with a binocular camera on the target vehicle. When calculating the image disparity, the image disparity between the left and right view images can be calculated based on the camera intrinsic parameters of the binocular camera.
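This application does not give the formula used to derive disparity from the camera parameters. As a hedged illustration only, for a rectified binocular camera the standard pinhole relation is d = f * B / Z, where f is the focal length in pixels, B is the baseline between the two cameras, and Z is the depth of the point. The baseline and depth are assumptions of this sketch (the baseline normally comes from stereo calibration rather than from the intrinsic parameters alone), and the name `disparity_pixels` is hypothetical.

```python
def disparity_pixels(focal_length_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity (in pixels) of a point at a given depth for a rectified stereo pair.

    Uses the standard relation d = f * B / Z. All three inputs are
    assumptions of this sketch, not values specified in the application.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline_m / depth_m

# Example: f = 1000 px, B = 0.5 m, Z = 10 m
example_disparity = disparity_pixels(1000.0, 0.5, 10.0)
```

Nearer objects produce larger disparities, which is why the disparity loss described below constrains the left and right segmentation maps differently at different depths.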

[0120] After obtaining the first, second, third, and fourth segmentation maps, the loss function layer can be called to calculate the loss value of the semantic segmentation model to be trained based on the first, second, third, and fourth segmentation maps and the image disparity.

[0121] In this example, the loss value of the semantic segmentation model to be trained can include a transform loss and a disparity loss. The transform loss can be calculated using the two segmentation maps corresponding to the left view image or the two segmentation maps corresponding to the right view image. The disparity loss can be calculated by combining one segmentation map from the left view image, one segmentation map from the right view image, and the image disparity. The loss value of the semantic segmentation model to be trained can then be determined based on the transform loss and the disparity loss. This process is described in detail below with reference to Figure 4 and Figure 5.

[0122] Referring to Figure 4, a flowchart of the steps of a loss value calculation method provided in an embodiment of this application is shown. As shown in Figure 4, the loss value calculation method may include steps 401, 402 and 403.

[0123] Step 401: Call the transformation loss function unit to calculate the first transformation loss value of the semantic segmentation model to be trained based on the first segmentation map and the second segmentation map.

[0124] In this embodiment, the loss function layer may include a transform loss function unit and a disparity loss function unit. The transform loss function unit is used to calculate the transform loss value, and the disparity loss function unit is used to calculate the disparity loss value.

[0125] After obtaining the first, second, third, and fourth segmentation maps, the transformation loss function unit can be called to calculate the first transformation loss value of the semantic segmentation model to be trained based on the first and second segmentation maps.

[0126] In this example, the formula for calculating the first transformation loss value is as follows:

[0127] Segmentation map of View1 = the segmentation map of View2 after it is restored (its own transformation is undone) and then transformed in the same way as View1 (i.e., the case with the minimum transformation loss).

[0128] Here, View1 and View2 are the images obtained by transforming the left view image. After obtaining the first segmentation map (the segmentation map of View1) and the second segmentation map (the segmentation map of View2) corresponding to the two transformed images of the left view image, the segmentation map of View2 can be restored and then given the same transformation as View1. The first transformation loss value is then calculated based on the difference between the transformed segmentation map of View2 and the segmentation map of View1. For example, if the first image is obtained by a 90° rotation and the second image by a 180° rotation, the second segmentation map can be restored by rotating it 180° in the opposite direction. The restored segmentation map can then be transformed in the same way as the first image (i.e., rotated 90° in the same direction as the first image) to obtain the transformed segmentation map. Finally, the first transformation loss value is calculated based on the difference between the transformed segmentation map and the first segmentation map.

[0129] Alternatively, the segmentation map of View1 can be restored and then given the same transformation as View2. The first transformation loss value can then be calculated based on the difference between the resulting segmentation map and the segmentation map of View2.
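The restore-then-retransform comparison can be sketched as follows, assuming rotations by multiples of 90° as in the example above. This application does not specify the difference measure; the sketch uses the fraction of disagreeing pixels and assumes both segmentation maps use the same label indices. The function name `transform_loss` is hypothetical.

```python
import numpy as np

def transform_loss(seg1: np.ndarray, seg2: np.ndarray,
                   turns1: int = 1, turns2: int = 2) -> float:
    """Transformation loss between two segmentation maps of the same view.

    seg1 was produced from an image rotated by turns1 quarter turns,
    seg2 from the same image rotated by turns2 quarter turns.
    """
    restored = np.rot90(seg2, k=-turns2)      # undo the second transformation
    realigned = np.rot90(restored, k=turns1)  # apply the first transformation
    # Assumed difference measure: fraction of pixels whose labels disagree.
    return float(np.mean(realigned != seg1))

# Toy check: two consistent segmentation maps give zero loss.
rng = np.random.default_rng(0)
base_labels = rng.integers(0, 3, size=(4, 4))
seg_view1 = np.rot90(base_labels, k=1)
seg_view2 = np.rot90(base_labels, k=2)
consistent_loss = transform_loss(seg_view1, seg_view2)
```

A differentiable variant would compare per-pixel class probabilities instead of hard labels; that choice is left open here.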

[0130] Step 402: Call the disparity loss function unit to calculate the first disparity loss value of the semantic segmentation model to be trained based on the first segmentation map, the third segmentation map and the image disparity.

[0131] After obtaining the first, second, third, and fourth segmentation maps, the disparity loss function unit can be invoked to calculate the first disparity loss value of the semantic segmentation model to be trained based on the first and third segmentation maps and the image disparity. The first image corresponding to the first segmentation map and the third image corresponding to the third segmentation map employ the same transformation processing method.

[0132] In this example, the formula for calculating the first disparity loss value is as follows:

[0133] First segmentation map = third segmentation map + image disparity (i.e., the case with the minimum disparity loss).

[0134] In this embodiment, after obtaining the first segmentation map and the third segmentation map, the disparity between the first segmentation map and the third segmentation map can be determined by comparing them, and the first disparity loss value can then be calculated based on the difference between this disparity and the image disparity.
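One way to read the relation "first segmentation map = third segmentation map + image disparity" is to shift the right-view segmentation map by the disparity and compare it with the left-view map. The sketch below makes several simplifying assumptions that are not in this application: a single integer disparity for the whole image, segmentation maps expressed in the original (untransformed) image orientation, shared non-negative label indices, and a pixel-disagreement difference measure. The function names are hypothetical.

```python
import numpy as np

def shift_by_disparity(seg_right: np.ndarray, disparity_px: int) -> np.ndarray:
    """Shift a right-view label map to the left-view frame.

    In a rectified pair, a point at column x in the left view appears at
    column x - d in the right view. Pixels with no correspondence get -1.
    """
    h, w = seg_right.shape
    shifted = np.full((h, w), -1, dtype=np.int64)
    if disparity_px == 0:
        shifted[:] = seg_right
    elif 0 < disparity_px < w:
        shifted[:, disparity_px:] = seg_right[:, :w - disparity_px]
    return shifted

def disparity_loss(seg_left: np.ndarray, seg_right: np.ndarray, disparity_px: int) -> float:
    """Fraction of corresponding pixels whose labels disagree (assumed measure)."""
    shifted = shift_by_disparity(seg_right, disparity_px)
    valid = shifted >= 0
    if not valid.any():
        return 0.0
    return float(np.mean(seg_left[valid] != shifted[valid]))

# Toy pair: the left map equals the right map shifted by 2 columns.
seg_right_toy = np.array([[0, 0, 1, 1, 2, 2]] * 3)
seg_left_toy = np.array([[3, 3, 0, 0, 1, 1]] * 3)
aligned_loss = disparity_loss(seg_left_toy, seg_right_toy, disparity_px=2)
```

With a per-pixel disparity map, the constant shift would be replaced by a per-pixel warp; the comparison itself would stay the same.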

[0135] Step 403: Calculate the loss value of the semantic segmentation model to be trained based on the first transform loss value and the first disparity loss value.

[0136] After calculating the first transform loss and the first disparity loss, the loss value of the semantic segmentation model to be trained can be calculated based on these values. Specifically, the sum of the first transform loss and the first disparity loss can be calculated and used as the loss value of the semantic segmentation model to be trained. Alternatively, the first transform loss and the first disparity loss can be weighted and summed, and the resulting weighted sum can be used as the loss value of the semantic segmentation model to be trained.
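The two combination options described above (a plain sum or a weighted sum) can be sketched in one function. The weight names and their default values are assumptions of this sketch; this application does not specify particular weights.

```python
def total_loss(transform_loss_value: float, disparity_loss_value: float,
               w_transform: float = 1.0, w_disparity: float = 1.0) -> float:
    """Combine the transform loss and the disparity loss.

    With the default weights this is the plain sum; other weights give
    the weighted sum.
    """
    return w_transform * transform_loss_value + w_disparity * disparity_loss_value

plain_sum = total_loss(0.25, 0.5)
weighted_sum = total_loss(0.25, 0.5, w_transform=2.0, w_disparity=0.5)
```

The same function applies unchanged to the second transform loss and the second disparity loss.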

[0137] In this embodiment, the training of a self-supervised semantic segmentation model is achieved by combining image features and the image parallax of the left and right views. Model training can be completed without manual annotation of sample images, which can reduce sample annotation costs and improve model training efficiency.

[0138] Referring to Figure 5, a flowchart of the steps of another loss value calculation method provided in an embodiment of this application is shown. As shown in Figure 5, the loss value calculation method may include steps 501, 502 and 503.

[0139] Step 501: Call the transformation loss function unit to calculate the second transformation loss value of the semantic segmentation model to be trained based on the third segmentation map and the fourth segmentation map.

[0140] In this embodiment, the loss function layer may include a transform loss function unit and a disparity loss function unit. The transform loss function unit is used to calculate the transform loss value, and the disparity loss function unit is used to calculate the disparity loss value.

[0141] After obtaining the first, second, third, and fourth segmentation maps, the transformation loss function unit can be called to calculate the second transformation loss value of the semantic segmentation model to be trained based on the third and fourth segmentation maps.

[0142] In this example, the formula for calculating the second transformation loss value is as follows:

[0143] Segmentation map of View3 = the segmentation map of View4 after it is restored (its own transformation is undone) and then transformed in the same way as View3 (i.e., the case with the minimum transformation loss).

[0144] Here, View3 and View4 are the images obtained by transforming the right view image. After obtaining the third segmentation map (the segmentation map of View3) and the fourth segmentation map (the segmentation map of View4) corresponding to the two transformed images of the right view image, the segmentation map of View4 can be restored and then given the same transformation as View3. The second transformation loss value is then calculated based on the difference between the transformed segmentation map of View4 and the segmentation map of View3. For example, if the third image is obtained by a 90° rotation and the fourth image by a 180° rotation, the fourth segmentation map can be restored by rotating it 180° in the opposite direction. The restored segmentation map can then be transformed in the same way as the third image (i.e., rotated 90° in the same direction as the third image) to obtain the transformed segmentation map. Finally, the second transformation loss value is calculated based on the difference between the transformed segmentation map and the third segmentation map.

[0145] Alternatively, the segmentation map of View3 can be restored and then given the same transformation as View4. The second transformation loss value can then be calculated based on the difference between the resulting segmentation map and the segmentation map of View4.

[0146] Step 502: Call the disparity loss function unit to calculate the second disparity loss value of the semantic segmentation model to be trained based on the second segmentation map, the fourth segmentation map and the image disparity.

[0147] After obtaining the first, second, third, and fourth segmentation maps, the disparity loss function unit can be invoked to calculate the second disparity loss value of the semantic segmentation model to be trained based on the second and fourth segmentation maps and the image disparity. The second image corresponding to the second segmentation map and the fourth image corresponding to the fourth segmentation map employ the same transformation processing method.

[0148] In this example, the formula for calculating the second disparity loss value is as follows:

[0149] Second segmentation map = fourth segmentation map + image disparity (i.e., the case with the minimum disparity loss).

[0150] In this embodiment, after obtaining the second segmentation map and the fourth segmentation map, the disparity between the second segmentation map and the fourth segmentation map can be compared, and then the second disparity loss value can be calculated based on the difference between the disparity obtained from the comparison and the image disparity.

[0151] Step 503: Calculate the loss value of the semantic segmentation model to be trained based on the second transform loss value and the second disparity loss value.

[0152] After calculating the second transform loss and the second disparity loss, the loss value of the semantic segmentation model to be trained can be calculated based on these values. Specifically, the sum of the second transform loss and the second disparity loss can be calculated and used as the loss value of the semantic segmentation model to be trained. Alternatively, a weighted sum of the second transform loss and the second disparity loss can be performed and used as the loss value of the semantic segmentation model to be trained.

[0153] After calculating the loss value of the semantic segmentation model to be trained, proceed to step 105.

[0154] Step 105: If the loss value is within a preset range, the trained semantic segmentation model is used as the final target semantic segmentation model.

[0155] After calculating the loss value of the semantic segmentation model to be trained, it can be determined whether the loss value is within the preset range.

[0156] If the loss value is not within the preset range, it means that the model has not converged. In this case, more model training samples (i.e., pairs of left and right view images) can be combined to continue training the semantic segmentation model to be trained until the model converges, that is, the loss value is within the preset range.

[0157] If the loss value is within the preset range, it means that the model has converged. At this time, the trained semantic segmentation model can be used as the final target semantic segmentation model, which can then be applied to the semantic segmentation scenario of the side view image of the autonomous vehicle.
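The convergence check of step 105 can be sketched as a simple loop. The preset range `(0.0, 0.05)`, the function name `train_until_converged`, and the `train_step` callback are assumptions of this sketch; the actual parameter update is not shown and is represented by a stand-in that replays fixed loss values.

```python
from typing import Callable, Iterable, Optional, Tuple

def train_until_converged(sample_pairs: Iterable,
                          train_step: Callable[[object], float],
                          loss_range: Tuple[float, float] = (0.0, 0.05)
                          ) -> Tuple[bool, int, Optional[float]]:
    """Train on left/right image pairs until the loss falls in the preset range.

    train_step(pair) is assumed to run one training step on a pair and
    return the resulting loss value.
    Returns (converged, number_of_steps, last_loss).
    """
    low, high = loss_range
    steps = 0
    last_loss = None
    for pair in sample_pairs:
        last_loss = train_step(pair)
        steps += 1
        if low <= last_loss <= high:
            return True, steps, last_loss   # model is taken as the target model
    return False, steps, last_loss          # more sample pairs are needed

# Stand-in for real training: replay a fixed, decreasing loss sequence.
fake_losses = iter([0.5, 0.2, 0.04, 0.01])
converged, steps_taken, final_loss = train_until_converged(
    range(4), lambda pair: next(fake_losses))
```

When the loop returns without convergence, the caller would supply additional pairs of left and right view images and continue training, as described in paragraph [0156].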

[0158] The model training method provided in this application involves inputting a left-view image and a right-view image into a semantic segmentation model to be trained. The semantic segmentation model includes an image transformation layer, a segmentation map acquisition layer, and a loss function layer. The image transformation layer performs image transformation processing on the left-view and right-view images respectively, obtaining a first image and a second image corresponding to the left-view image, and a third image and a fourth image corresponding to the right-view image. The segmentation map acquisition layer processes the feature maps corresponding to the first, second, third, and fourth images respectively, obtaining a first segmentation map of the first image, a second segmentation map of the second image, a third segmentation map of the third image, and a fourth segmentation map of the fourth image. The loss function layer calculates the loss value of the semantic segmentation model to be trained based on the first, second, third, and fourth segmentation maps and image disparity. If the loss value is within a preset range, the trained semantic segmentation model is used as the final target semantic segmentation model. This application embodiment achieves self-supervised semantic segmentation model training by combining image features and image disparity between the left and right views, eliminating the need for manual annotation of sample images, reducing sample annotation costs, and thus improving model training efficiency.

[0159] Referring to Figure 7, a schematic structural diagram of a model training device provided in an embodiment of this application is shown. As shown in Figure 7, the model training device 700 may include the following modules:

[0160] The image input module 710 is used to input the left view image and the right view image into the semantic segmentation model to be trained; the semantic segmentation model to be trained includes: an image transformation layer, a segmentation map acquisition layer and a loss function layer;

[0161] The image transformation module 720 is used to call the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image;

[0162] The segmentation map acquisition module 730 is used to call the segmentation map acquisition layer to process the feature maps corresponding to the first image, the second image, the third image and the fourth image respectively, to obtain the first segmentation map of the first image, the second segmentation map of the second image, the third segmentation map of the third image and the fourth segmentation map of the fourth image;

[0163] The loss value calculation module 740 is used to call the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map and the image disparity;

[0164] The semantic segmentation model acquisition module 750 is used to take the trained semantic segmentation model as the final target semantic segmentation model when the loss value is within a preset range.

[0165] Optionally, the image transformation layer includes: a first image transformation unit and a second image transformation unit.

[0166] The image transformation module includes:

[0167] The first image transformation unit is used to call the first image transformation unit to perform a first transformation process on the left view image and the right view image to obtain a first image corresponding to the left view image and a third image corresponding to the right view image;

[0168] The second image transformation unit is used to call the second image transformation unit to perform a second transformation process on the left view image and the right view image to obtain a second image corresponding to the left view image and a fourth image corresponding to the right view image.

[0169] Optionally, the semantic segmentation model to be trained further includes a feature extraction layer, which is located between the image transformation layer and the segmentation map acquisition layer.

[0170] The device further includes:

[0171] The feature map acquisition module is used to call the feature extraction layer to perform image feature extraction processing on the first image, the second image, the third image and the fourth image respectively, so as to obtain the first feature map of the first image, the second feature map of the second image, the third feature map of the third image and the fourth feature map of the fourth image.

[0172] Optionally, the segmentation map acquisition module includes:

[0173] The first segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the first feature map, obtain the pixel clustering center corresponding to the first feature map, and output the first segmentation map of the first feature map according to the pixel clustering center.

[0174] The second segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the second feature map, obtain the pixel clustering center corresponding to the second feature map, and output the second segmentation map of the second feature map according to the pixel clustering center;

[0175] The third segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the third feature map, obtain the pixel clustering center corresponding to the third feature map, and output the third segmentation map of the third feature map according to the pixel clustering center;

[0176] The fourth segmentation map output unit is used to call the segmentation map acquisition layer to perform clustering processing on the image pixels in the fourth feature map, obtain the pixel clustering center corresponding to the fourth feature map, and output the fourth segmentation map of the fourth feature map according to the pixel clustering center.

[0177] Optionally, the loss function layer includes: a transformation loss function unit and a disparity loss function unit.

[0178] The loss value calculation module includes:

[0179] The first transformation loss calculation unit is used to call the transformation loss function unit to calculate the first transformation loss value of the semantic segmentation model to be trained based on the first segmentation map and the second segmentation map;

[0180] The first disparity loss calculation unit is used to call the disparity loss function unit to calculate the first disparity loss value of the semantic segmentation model to be trained based on the first segmentation map, the third segmentation map and the image disparity;

[0181] The first loss value calculation unit is used to calculate the loss value of the semantic segmentation model to be trained based on the first transform loss value and the first disparity loss value.

[0182] Optionally, the loss function layer includes: a transformation loss function unit and a disparity loss function unit.

[0183] The loss value calculation module includes:

[0184] The second transformation loss calculation unit is used to call the transformation loss function unit to calculate the second transformation loss value of the semantic segmentation model to be trained based on the third segmentation map and the fourth segmentation map;

[0185] The second disparity loss calculation unit is used to call the disparity loss function unit to calculate the second disparity loss value of the semantic segmentation model to be trained based on the second segmentation map, the fourth segmentation map and the image disparity;

[0186] The second loss value calculation unit is used to calculate the loss value of the semantic segmentation model to be trained based on the second transform loss value and the second disparity loss value.

[0187] The model training apparatus provided in this application inputs a left-view image and a right-view image into a semantic segmentation model to be trained. The semantic segmentation model to be trained includes an image transformation layer, a segmentation map acquisition layer, and a loss function layer. The image transformation layer performs image transformation processing on the left-view image and the right-view image respectively, obtaining a first image and a second image corresponding to the left-view image, and a third image and a fourth image corresponding to the right-view image. The segmentation map acquisition layer processes the feature maps corresponding to the first image, the second image, the third image, and the fourth image respectively, obtaining a first segmentation map of the first image, a second segmentation map of the second image, a third segmentation map of the third image, and a fourth segmentation map of the fourth image. The loss function layer calculates the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and the image disparity. If the loss value is within a preset range, the trained semantic segmentation model is used as the final target semantic segmentation model. This application embodiment achieves self-supervised training of the semantic segmentation model by combining image features and the image disparity of the left and right views, eliminating the need for manual annotation of sample images, reducing sample annotation costs, and thus improving the training efficiency of the model.

[0188] This application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the above-described model training method.

[0189] Figure 8 shows a schematic diagram of the structure of an electronic device 800 according to an embodiment of the present invention. As shown in Figure 8, the electronic device 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) 802 or loaded from storage unit 808 into random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the electronic device 800. The CPU 801, ROM 802, and RAM 803 are interconnected via bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

[0190] Multiple components in electronic device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, microphone, etc.; output unit 807, such as various types of displays, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows electronic device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0191] The various processes and handling described above can be executed by CPU 801. For example, the methods of any of the above embodiments can be implemented as computer software programs tangibly contained in a computer-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed on electronic device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by CPU 801, one or more actions of the methods described above can be performed.

[0192] This application provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the various processes of the above-described model training method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, etc.

[0193] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element.

[0194] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0195] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

[0196] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed in this application can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0197] Those skilled in the art will understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above correspond to the processes in the foregoing method embodiments and are not repeated here.

[0198] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

[0199] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0200] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0201] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks.

[0202] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A model training method, characterized in that the method comprises: inputting a left view image and a right view image into a semantic segmentation model to be trained, wherein the semantic segmentation model to be trained includes an image transformation layer, a segmentation map acquisition layer, and a loss function layer; invoking the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain a first image and a second image corresponding to the left view image, and a third image and a fourth image corresponding to the right view image, wherein either the first image and the third image adopt the same image transformation processing method and the second image and the fourth image adopt the same image transformation processing method, or the first image and the fourth image adopt the same image transformation processing method and the second image and the third image adopt the same image transformation processing method; invoking the segmentation map acquisition layer to perform clustering processing on the image pixel features of the feature maps corresponding to the first image, the second image, the third image, and the fourth image, respectively, to obtain the pixel cluster centers corresponding to the feature maps, and outputting, based on the pixel cluster centers, the segmentation map corresponding to each feature map, to obtain a first segmentation map of the first image, a second segmentation map of the second image, a third segmentation map of the third image, and a fourth segmentation map of the fourth image; invoking the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and the image disparity, wherein calculating the loss value of the semantic segmentation model to be trained includes: calculating a transformation loss value using two segmentation maps corresponding to the left view image or two segmentation maps corresponding to the right view image; calculating a disparity loss value by combining one segmentation map corresponding to the left view image and one segmentation map corresponding to the right view image with the image disparity; and calculating the loss value of the semantic segmentation model to be trained based on the transformation loss value and the disparity loss value; and, if the loss value is within a preset range, using the trained semantic segmentation model as the final target semantic segmentation model.
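
As an illustration of the last two steps of claim 1, the following is a minimal sketch of combining a transformation loss value and a disparity loss value and testing the result against a preset range. The claim does not state how the two values are combined or what the range is, so the weighted sum, the parameter `disparity_weight`, the bounds `lower` and `upper`, and both function names are assumptions made only for this example.

```python
def total_loss(transformation_loss: float,
               disparity_loss: float,
               disparity_weight: float = 1.0) -> float:
    """Combine the two loss values.

    Assumption: a weighted sum. The claim only says the loss value is
    calculated "based on" the two values.
    """
    return transformation_loss + disparity_weight * disparity_loss


def loss_in_preset_range(loss: float,
                         lower: float = 0.0,
                         upper: float = 0.05) -> bool:
    """Return True when the loss lies inside the (hypothetical) preset range."""
    return lower <= loss <= upper


if __name__ == "__main__":
    loss = total_loss(0.25, 0.5, disparity_weight=0.5)
    # When this prints True, training would stop and the current model
    # would be kept as the target semantic segmentation model.
    print(loss, loss_in_preset_range(loss))
```

In a training loop, the check would typically run after each iteration, with training continuing while it returns False.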

2. The method according to claim 1, characterized in that the image transformation layer includes a first image transformation unit and a second image transformation unit, and the step of invoking the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image, includes: invoking the first image transformation unit to perform a first transformation process on the left view image and the right view image, to obtain the first image corresponding to the left view image and the third image corresponding to the right view image; and invoking the second image transformation unit to perform a second transformation process on the left view image and the right view image, to obtain the second image corresponding to the left view image and the fourth image corresponding to the right view image.
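
The pairing in claim 2 can be sketched as follows: the same first transformation yields the first and third images, and the same second transformation yields the second and fourth images. The claim does not name the transformations, so the two used here (`scale_brightness` and `invert`), the grayscale list-of-lists image representation, and the function `make_four_images` are hypothetical choices for illustration only.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]          # grayscale image, pixel values in [0, 1]
Transform = Callable[[Image], Image]


def scale_brightness(image: Image, factor: float = 0.5) -> Image:
    """Hypothetical first transformation: scale every pixel, clipped to 1.0."""
    return [[min(1.0, p * factor) for p in row] for row in image]


def invert(image: Image) -> Image:
    """Hypothetical second transformation: invert pixel intensities."""
    return [[1.0 - p for p in row] for row in image]


def make_four_images(left: Image, right: Image,
                     first_transform: Transform,
                     second_transform: Transform
                     ) -> Tuple[Image, Image, Image, Image]:
    """Apply each transformation to both views, as in claim 2."""
    first = first_transform(left)     # first image  (left view)
    second = second_transform(left)   # second image (left view)
    third = first_transform(right)    # third image  (right view)
    fourth = second_transform(right)  # fourth image (right view)
    return first, second, third, fourth


if __name__ == "__main__":
    left_view = [[0.25, 0.5]]
    right_view = [[0.5, 1.0]]
    for img in make_four_images(left_view, right_view, scale_brightness, invert):
        print(img)
```

Both example transformations leave pixel positions unchanged, so the resulting images stay spatially aligned with their source views.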

3. The method according to claim 1, characterized in that the semantic segmentation model to be trained further includes a feature extraction layer located between the image transformation layer and the segmentation map acquisition layer, and, after invoking the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain the first image and the second image corresponding to the left view image, and the third image and the fourth image corresponding to the right view image, the method further includes: invoking the feature extraction layer to perform image feature extraction processing on the first image, the second image, the third image, and the fourth image respectively, to obtain a first feature map of the first image, a second feature map of the second image, a third feature map of the third image, and a fourth feature map of the fourth image.

4. The method according to claim 3, characterized in that the step of invoking the segmentation map acquisition layer to process the feature maps corresponding to the first image, the second image, the third image, and the fourth image respectively, to obtain the first segmentation map of the first image, the second segmentation map of the second image, the third segmentation map of the third image, and the fourth segmentation map of the fourth image, includes: invoking the segmentation map acquisition layer to perform clustering processing on the image pixels within the first feature map to obtain the pixel cluster centers corresponding to the first feature map, and outputting the first segmentation map of the first feature map based on the pixel cluster centers; invoking the segmentation map acquisition layer to perform clustering processing on the image pixels within the second feature map to obtain the pixel cluster centers corresponding to the second feature map, and outputting the second segmentation map of the second feature map based on the pixel cluster centers; invoking the segmentation map acquisition layer to perform clustering processing on the image pixels within the third feature map to obtain the pixel cluster centers corresponding to the third feature map, and outputting the third segmentation map of the third feature map based on the pixel cluster centers; and invoking the segmentation map acquisition layer to perform clustering processing on the image pixels within the fourth feature map to obtain the pixel cluster centers corresponding to the fourth feature map, and outputting the fourth segmentation map of the fourth feature map based on the pixel cluster centers.
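
The clustering step of claim 4 can be illustrated with a small k-means style sketch: per-pixel feature vectors are grouped around cluster centers, and each pixel is then labeled with the index of its nearest center to form a segmentation map. The claim does not name a clustering algorithm, so k-means, the fixed iteration count, the caller-supplied `initial_centers`, and all function names are assumptions for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
FeatureMap = List[List[Vector]]   # H x W grid of per-pixel feature vectors


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(feature: Vector, centers: List[List[float]]) -> int:
    """Index of the cluster center closest to the given feature vector."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(feature, centers[k]))


def cluster_feature_map(feature_map: FeatureMap,
                        initial_centers: List[Vector],
                        iterations: int = 10
                        ) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (pixel cluster centers, segmentation map of cluster indices)."""
    centers = [list(c) for c in initial_centers]
    pixels = [f for row in feature_map for f in row]
    for _ in range(iterations):
        labels = [nearest_center(f, centers) for f in pixels]
        for k in range(len(centers)):
            members = [f for f, lab in zip(pixels, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    segmentation = [[nearest_center(f, centers) for f in row]
                    for row in feature_map]
    return centers, segmentation


if __name__ == "__main__":
    features = [[[0.0], [0.2]],
                [[1.0], [0.8]]]
    centers, seg = cluster_feature_map(features, [[0.0], [1.0]])
    print(centers, seg)
```

The same function would be applied to each of the four feature maps in turn to obtain the four segmentation maps.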

5. The method according to claim 2, characterized in that the loss function layer includes a transformation loss function unit and a disparity loss function unit, and the step of invoking the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and the image disparity includes: invoking the transformation loss function unit to calculate a first transformation loss value of the semantic segmentation model to be trained based on the first segmentation map and the second segmentation map; invoking the disparity loss function unit to calculate a first disparity loss value of the semantic segmentation model to be trained based on the first segmentation map, the third segmentation map, and the image disparity; and calculating the loss value of the semantic segmentation model to be trained based on the first transformation loss value and the first disparity loss value.
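
The two loss terms of claim 5 can be sketched as simple label-disagreement rates. This is an illustration only: the claim does not give the loss formulas, and a trainable model would need a differentiable loss. The sketch assumes that cluster labels are comparable across maps, that the two transformations preserve pixel positions, and that, for a rectified stereo pair, a left-view pixel at column x with integer disparity d corresponds to the right-view pixel at column x - d. The function names are hypothetical.

```python
from typing import List

LabelMap = List[List[int]]   # H x W grid of cluster indices


def transformation_loss(seg_a: LabelMap, seg_b: LabelMap) -> float:
    """Fraction of pixels whose labels differ between two maps of one view."""
    total = sum(len(row) for row in seg_a)
    mismatches = sum(1 for row_a, row_b in zip(seg_a, seg_b)
                     for a, b in zip(row_a, row_b) if a != b)
    return mismatches / total


def disparity_loss(seg_left: LabelMap, seg_right: LabelMap,
                   disparity: List[List[int]]) -> float:
    """Fraction of left pixels whose label differs from the matched right pixel.

    Assumption: left column x maps to right column x - disparity[y][x].
    Left pixels that map outside the right image are skipped.
    """
    mismatches = 0
    valid = 0
    for y, row in enumerate(seg_left):
        for x, label in enumerate(row):
            x_right = x - disparity[y][x]
            if 0 <= x_right < len(seg_right[y]):
                valid += 1
                if label != seg_right[y][x_right]:
                    mismatches += 1
    return mismatches / valid if valid else 0.0


if __name__ == "__main__":
    first_seg = [[0, 0, 1, 1]]    # first segmentation map (left view)
    second_seg = [[0, 1, 1, 1]]   # second segmentation map (left view)
    third_seg = [[0, 1, 1, 0]]    # third segmentation map (right view)
    disp = [[1, 1, 1, 1]]         # per-pixel disparity of the left view
    print(transformation_loss(first_seg, second_seg))
    print(disparity_loss(first_seg, third_seg, disp))
```

The first value plays the role of the first transformation loss value and the second the role of the first disparity loss value; claim 6 follows the same pattern with the third and fourth maps for the transformation term and the second and fourth maps for the disparity term.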

6. The method according to claim 2, characterized in that the loss function layer includes a transformation loss function unit and a disparity loss function unit, and the step of invoking the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and the image disparity includes: invoking the transformation loss function unit to calculate a second transformation loss value of the semantic segmentation model to be trained based on the third segmentation map and the fourth segmentation map; invoking the disparity loss function unit to calculate a second disparity loss value of the semantic segmentation model to be trained based on the second segmentation map, the fourth segmentation map, and the image disparity; and calculating the loss value of the semantic segmentation model to be trained based on the second transformation loss value and the second disparity loss value.

7. A model training apparatus, characterized in that the apparatus comprises: an image input module, used to input a left view image and a right view image into a semantic segmentation model to be trained, wherein the semantic segmentation model to be trained includes an image transformation layer, a segmentation map acquisition layer, and a loss function layer; an image transformation module, used to invoke the image transformation layer to perform image transformation processing on the left view image and the right view image respectively, to obtain a first image and a second image corresponding to the left view image, and a third image and a fourth image corresponding to the right view image, wherein either the first image and the third image adopt the same image transformation processing method and the second image and the fourth image adopt the same image transformation processing method, or the first image and the fourth image adopt the same image transformation processing method and the second image and the third image adopt the same image transformation processing method; a segmentation map acquisition module, used to invoke the segmentation map acquisition layer to perform clustering processing on the image pixel features of the feature maps corresponding to the first image, the second image, the third image, and the fourth image respectively, to obtain the pixel cluster centers corresponding to the feature maps, and to output the segmentation map corresponding to each feature map according to the pixel cluster centers, to obtain a first segmentation map of the first image, a second segmentation map of the second image, a third segmentation map of the third image, and a fourth segmentation map of the fourth image; a loss value calculation module, used to invoke the loss function layer to calculate the loss value of the semantic segmentation model to be trained based on the first segmentation map, the second segmentation map, the third segmentation map, the fourth segmentation map, and the image disparity, wherein calculating the loss value of the semantic segmentation model to be trained includes: calculating a transformation loss value using two segmentation maps corresponding to the left view image or two segmentation maps corresponding to the right view image; calculating a disparity loss value by combining one segmentation map corresponding to the left view image and one segmentation map corresponding to the right view image with the image disparity; and calculating the loss value of the semantic segmentation model to be trained based on the transformation loss value and the disparity loss value; and a semantic segmentation model acquisition module, used to take the trained semantic segmentation model as the final target semantic segmentation model when the loss value is within a preset range.

8. The apparatus according to claim 7, characterized in that the image transformation layer includes a first image transformation unit and a second image transformation unit, and the image transformation module includes: a first image transformation unit of the image transformation module, used to invoke the first image transformation unit of the image transformation layer to perform a first transformation process on the left view image and the right view image, to obtain the first image corresponding to the left view image and the third image corresponding to the right view image; and a second image transformation unit of the image transformation module, used to invoke the second image transformation unit of the image transformation layer to perform a second transformation process on the left view image and the right view image, to obtain the second image corresponding to the left view image and the fourth image corresponding to the right view image.

9. The apparatus according to claim 7, characterized in that the semantic segmentation model to be trained further includes a feature extraction layer located between the image transformation layer and the segmentation map acquisition layer, and the apparatus further includes: a feature map acquisition module, used to invoke the feature extraction layer to perform image feature extraction processing on the first image, the second image, the third image, and the fourth image respectively, to obtain a first feature map of the first image, a second feature map of the second image, a third feature map of the third image, and a fourth feature map of the fourth image.

10. The apparatus according to claim 9, characterized in that the segmentation map acquisition module includes: a first segmentation map output unit, used to invoke the segmentation map acquisition layer to perform clustering processing on the image pixels in the first feature map, obtain the pixel cluster centers corresponding to the first feature map, and output the first segmentation map of the first feature map according to the pixel cluster centers; a second segmentation map output unit, used to invoke the segmentation map acquisition layer to perform clustering processing on the image pixels in the second feature map, obtain the pixel cluster centers corresponding to the second feature map, and output the second segmentation map of the second feature map according to the pixel cluster centers; a third segmentation map output unit, used to invoke the segmentation map acquisition layer to perform clustering processing on the image pixels in the third feature map, obtain the pixel cluster centers corresponding to the third feature map, and output the third segmentation map of the third feature map according to the pixel cluster centers; and a fourth segmentation map output unit, used to invoke the segmentation map acquisition layer to perform clustering processing on the image pixels in the fourth feature map, obtain the pixel cluster centers corresponding to the fourth feature map, and output the fourth segmentation map of the fourth feature map according to the pixel cluster centers.

11. The apparatus according to claim 8, characterized in that the loss function layer includes a transformation loss function unit and a disparity loss function unit, and the loss value calculation module includes: a first transformation loss calculation unit, used to invoke the transformation loss function unit to calculate a first transformation loss value of the semantic segmentation model to be trained based on the first segmentation map and the second segmentation map; a first disparity loss calculation unit, used to invoke the disparity loss function unit to calculate a first disparity loss value of the semantic segmentation model to be trained based on the first segmentation map, the third segmentation map, and the image disparity; and a first loss value calculation unit, used to calculate the loss value of the semantic segmentation model to be trained based on the first transformation loss value and the first disparity loss value.

12. The apparatus according to claim 8, characterized in that the loss function layer includes a transformation loss function unit and a disparity loss function unit, and the loss value calculation module includes: a second transformation loss calculation unit, used to invoke the transformation loss function unit to calculate a second transformation loss value of the semantic segmentation model to be trained based on the third segmentation map and the fourth segmentation map; a second disparity loss calculation unit, used to invoke the disparity loss function unit to calculate a second disparity loss value of the semantic segmentation model to be trained based on the second segmentation map, the fourth segmentation map, and the image disparity; and a second loss value calculation unit, used to calculate the loss value of the semantic segmentation model to be trained based on the second transformation loss value and the second disparity loss value.

13. An electronic device, characterized in that the electronic device comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the model training method of any one of claims 1 to 6.

14. A readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is able to perform the model training method according to any one of claims 1 to 6.