Model training method, apparatus, device, and storage medium

A model is trained on images with and without occlusions taken from different scenes, using an image processing network and discrimination networks to produce a set of losses. This addresses the effect of rain and snow on intelligent driving perception, improving image processing accuracy and enabling automatic occlusion removal in adverse weather.

CN117197610B · Active · Publication Date: 2026-06-26 · CHONGQING CHANGAN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHONGQING CHANGAN TECH CO LTD
Filing Date: 2023-08-25
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In adverse weather conditions, images captured by cameras may contain rain, snow, and other occlusions, which can reduce the accuracy of intelligent driving technology. Existing technologies cannot effectively remove rain and snow from images, and they require training on images of the same scene both with and without rain and snow. Such paired images are difficult to obtain.

Method used

Images with and without occlusions are acquired from different scenes. An image processing network, comprising a feature extraction sub-network, an occlusion addition sub-network, and an occlusion removal sub-network, generates occlusion-removed and occlusion-added images, and discrimination networks compare these with real images to produce losses. The model is trained using the resulting loss set. Optionally, a target noise matrix is generated, filtered, and normalized to obtain a reference feature that is used to assess the quality of feature extraction.

Benefits of technology

A model that can effectively remove occlusions from images can be trained without using images of the same scene with and without occlusions, improving the model training effect and the accuracy of image processing, and can automatically identify and remove occlusions in images.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to a model training method, apparatus, device, and storage medium in the technical field of deep learning. The method comprises the following steps: training images are acquired, the training images including an occluded image and an unoccluded image taken from different scenes; a first processed image and a second processed image are determined based on an image processing network; the first processed image and the unoccluded image are input into a first image discrimination network to obtain a first loss, and the second processed image and the occluded image are input into a second image discrimination network to obtain a second loss; a model to be trained is then trained based on a loss set to obtain an image processing model, which is used for removing occlusions from an image to be processed. The image processing model can therefore be trained using occluded and unoccluded images from different scenes, which makes an occlusion removal model more convenient to train.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and more particularly to the field of natural image restoration technology, specifically to a model training method, apparatus, device, and storage medium.

Background Technology

[0002] In recent years, intelligent driving technology for automobiles has developed rapidly, and perception technology is a crucial component of this technology. Cars can use cameras to perceive information about their external environment, thereby providing information for vehicle control.

[0003] However, in severe weather such as rain and snow, the accuracy of the car's perception is greatly affected because the images perceived by the camera will contain rain and snow, thus making it impossible to apply intelligent driving technology in such environments.

[0004] Currently, models are typically trained using images of the same region both with and without rain and snow, so that the trained model can remove rain and snow from images. However, obtaining such paired images of the same region is difficult, making it challenging to remove rain and snow from images using model-based methods.

Summary of the Invention

[0005] This application provides a model training method, apparatus, device, and storage medium to at least solve the technical problem of how to train a model to remove rain and snow from images in related technologies. The technical solution of this application is as follows:

[0006] According to a first aspect of this application, a model training method is provided, comprising: acquiring training images; the training images including: occluded images and unoccluded images; the occluded images and unoccluded images being images from different scenes; determining a first processed image and a second processed image based on an image processing network; the first processed image being an image after removing occluders from the occluded image; the second processed image being an image after adding occluders to the unoccluded image; inputting the first processed image and the unoccluded image into a first image discrimination network to obtain a first loss, and inputting the second processed image and the occluded image into a second image discrimination network to obtain a second loss; training a model to be trained based on a loss set to obtain an image processing model; the loss set including the first loss and the second loss; the model to be trained including the image processing network, the first image discrimination network, and the second image discrimination network; the image processing model being used to remove occluders from the image to be processed.

[0007] Based on the aforementioned technical means, this application can input occluded and unoccluded images from different scenes into an image processing network to obtain a first processed image and a second processed image. A first image discrimination network then distinguishes between the first processed image (with occlusions removed) and the unoccluded image to obtain a first loss. A second image discrimination network distinguishes between the second processed image (with occlusions added) and the occluded image to obtain a second loss. Subsequently, the model to be trained can be trained using a loss set including the first and second losses, so that the resulting image processing model can remove occlusions from the image to be processed. In this way, an image processing model for removing occlusions from an image can be trained without needing occluded and unoccluded images from the same scene.
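As a rough illustration of how the first and second losses might be computed, the sketch below assumes that each discrimination network outputs a probability that its input is a genuine (not generated) image, and that the loss is a binary cross-entropy. The text above does not specify the loss formula, so this form, the function names `bce` and `discrimination_loss`, and the example scores are all assumptions.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one probability and a 0/1 target."""
    eps = 1e-12  # guard against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def discrimination_loss(score_real, score_generated):
    """Loss for one discrimination network (assumed form, not from the patent).

    score_real:      network output for a genuine training image
    score_generated: network output for an image produced by the
                     image processing network
    """
    return bce(score_real, 1) + bce(score_generated, 0)

# First loss: unoccluded image (real) vs. first processed image (generated).
first_loss = discrimination_loss(score_real=0.9, score_generated=0.2)
# Second loss: occluded image (real) vs. second processed image (generated).
second_loss = discrimination_loss(score_real=0.8, score_generated=0.4)
# The loss set holds at least these two losses.
loss_set = {"first": first_loss, "second": second_loss}
```

Because each discrimination network only compares generated images with real images of the same kind, the occluded and unoccluded inputs never need to show the same scene.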

[0008] In one possible implementation, the image processing network includes: a feature extraction subnetwork, an occlusion addition subnetwork, and an occlusion removal subnetwork; determining a first processed image and a second processed image based on the image processing network includes: performing feature extraction processing on the occluded image and the unoccluded image through the feature extraction subnetwork to obtain a first occlusion image feature corresponding to the occluded image and a second occlusion image feature corresponding to the unoccluded image; inputting the first occlusion image feature and the occluded image into the occlusion removal subnetwork to obtain the first processed image, and inputting the second occlusion image feature and the unoccluded image into the occlusion addition subnetwork to obtain the second processed image.

[0009] Based on the aforementioned technical means, this application can remove occlusions from occluded images using the feature extraction subnetwork and the occlusion removal subnetwork in the image processing network, and can add occlusions to unoccluded images using the feature extraction subnetwork and the occlusion addition subnetwork in the image processing network. In this way, the occlusion addition subnetwork aids the training of the image processing network, and the trained feature extraction subnetwork and occlusion removal subnetwork can remove occlusions from occluded images to obtain a natural image.

[0010] In one possible implementation, the loss set further includes: a third loss, and / or, a discriminative loss; the discriminative loss includes: a fourth loss and a fifth loss; when the loss set includes the third loss, the model training method further includes: generating a target noise matrix of the same size as the training image through a feature extraction subnetwork; determining a filtering kernel corresponding to the target noise matrix; filtering the target noise matrix through the filtering kernel corresponding to the target noise matrix to obtain a filtering result, and normalizing the filtering result to obtain a third occlusion image feature; determining the loss between the first occlusion image feature and the third occlusion image feature, and the loss between the second occlusion image feature and the third occlusion image feature, as the third loss;

[0011] When the loss set includes discriminative loss, the model training method further includes: processing the first processed image through a feature extraction subnetwork and an occlusion addition subnetwork to obtain a third processed image, and processing the second processed image through a feature extraction subnetwork and an occlusion removal subnetwork to obtain a fourth processed image; determining the consistency loss between the third processed image and the image with occlusion as the fourth loss, and determining the consistency loss between the fourth processed image and the image without occlusion as the fifth loss.

[0012] Based on the aforementioned technical means, this application can also generate a target noise matrix through a feature extraction sub-network, and filter and normalize it using the corresponding filtering kernel to obtain the third occlusion image features. In this way, the electronic device can determine the third occlusion image features as relatively accurate features. Subsequently, the electronic device determines the loss between the first and third occlusion image features extracted by the feature extraction sub-network, as well as the loss between the second and third occlusion image features, as a third loss, so that the quality of feature extraction by the feature extraction sub-network can be determined through the third loss.
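The third loss described above can be sketched as follows. The text does not say how the noise is drawn, which filtering kernel is used, how the result is normalized, or how the two feature losses are combined. This sketch therefore assumes Gaussian noise, a 3×3 mean kernel, min-max normalization, a mean absolute difference, and a simple sum. All function names are hypothetical, and the noise is generated directly rather than by a feature extraction subnetwork.

```python
import random

# ASSUMPTION: a 3x3 mean filter stands in for "the filtering kernel
# corresponding to the target noise matrix".
MEAN_KERNEL = [[1.0 / 9.0] * 3 for _ in range(3)]

def make_noise(height, width, seed=0):
    """Target noise matrix with the same size as the training image."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]

def filter2d(matrix, kernel):
    """Same-size 2D correlation with zero padding at the borders."""
    h, w = len(matrix), len(matrix[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - oy, x + j - ox
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * matrix[yy][xx]
            out[y][x] = acc
    return out

def normalize(matrix):
    """Min-max normalization to the range [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [[(v - lo) / span for v in row] for row in matrix]

def l1(a, b):
    """Mean absolute difference between two equally sized matrices."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def third_loss(first_feature, second_feature, kernel=MEAN_KERNEL, seed=0):
    """Loss of the first and second features against the third feature."""
    h, w = len(first_feature), len(first_feature[0])
    third_feature = normalize(filter2d(make_noise(h, w, seed), kernel))
    # ASSUMPTION: the two feature losses are summed.
    return l1(first_feature, third_feature) + l1(second_feature, third_feature)
```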

[0013] Furthermore, this application can add occlusions to the first processed image using a feature extraction subnetwork and an occlusion addition subnetwork in the image processing network, and remove occlusions from the second processed image using a feature extraction subnetwork and an occlusion removal subnetwork in the image processing network. Since the first processed image is the image after removing occlusions, and the second processed image is the image after adding occlusions, both the occlusion addition subnetwork and the occlusion removal subnetwork can be trained on images with and without occlusions in different scenarios, resulting in better model training performance. Moreover, since the fourth loss is the consistency loss between the third processed image and the image with occlusions, and the fifth loss is the consistency loss between the fourth processed image and the image without occlusions, the model to be trained can also be trained using a loss set that includes a discriminative loss.
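The round trip behind the fourth and fifth losses can be sketched as follows. For brevity the sketch folds the feature extraction subnetwork into two hypothetical callables, `remove_occlusion` and `add_occlusion`, treats an image as a flat list of pixel values, and assumes the consistency loss is a mean absolute difference. None of these choices is specified in the text above.

```python
def mean_abs_diff(a, b):
    """Assumed form of the consistency loss: mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def consistency_losses(occluded, unoccluded, remove_occlusion, add_occlusion):
    # Forward pass of the image processing network.
    first_processed = remove_occlusion(occluded)      # occlusions removed
    second_processed = add_occlusion(unoccluded)      # occlusions added
    # Round trip: undo what the first pass did.
    third_processed = add_occlusion(first_processed)
    fourth_processed = remove_occlusion(second_processed)
    # Each round trip should reproduce the original training image.
    fourth_loss = mean_abs_diff(third_processed, occluded)
    fifth_loss = mean_abs_diff(fourth_processed, unoccluded)
    return fourth_loss, fifth_loss

# Toy stand-ins: an "occlusion" is modeled as a constant brightening of 0.25.
def toy_add(image):
    return [p + 0.25 for p in image]

def toy_remove(image):
    return [p - 0.25 for p in image]

losses = consistency_losses([0.5, 0.75], [0.25, 0.5], toy_remove, toy_add)
```

With these perfectly inverse stand-ins both losses are zero; an imperfect removal or addition step leaves a residual that the training signal can penalize.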

[0014] In one possible implementation, the model training method further includes: acquiring the original image; when it is determined that the original image contains occlusions, identifying the original image as the image to be processed; and removing the occlusions in the image to be processed through a feature extraction subnetwork and an occlusion removal subnetwork in the image processing model.

[0015] Based on the above technical means, when it is determined that the original image obtained is an image to be processed that includes occlusions, the occlusions in the image to be processed can be removed by the feature extraction subnetwork and the occlusion removal subnetwork in the trained image processing model.

[0016] In one possible implementation, the model training method further includes: acquiring an image to be processed; inputting the image to be processed into a feature extraction subnetwork in an image processing model to obtain a fourth occlusion image feature corresponding to the image to be processed; when the proportion of pixels of the occlusion in the fourth occlusion image feature is greater than a preset proportion, removing the occlusion in the image to be processed through the occlusion removal subnetwork in the image processing model and the fourth occlusion image feature.

[0017] Based on the aforementioned technical means, this application can process the image to be processed using the feature extraction sub-network in the image processing model to obtain the fourth occlusion image features. When the proportion of pixels containing occlusions is greater than a preset proportion, the occlusions in the image to be processed are removed using the occlusion removal sub-network in the image processing model and the fourth occlusion image features. In this way, it is possible to automatically determine whether occlusions exist in the acquired image, thus facilitating the determination of whether to remove them from the acquired image.
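A minimal sketch of this gating step follows, assuming the fourth occlusion image feature is a per-pixel map with values in [0, 1] and that a pixel counts as occluded when its value exceeds 0.5. The per-pixel threshold, the default preset proportion of 0.05, and the function names are hypothetical.

```python
def occlusion_ratio(feature, pixel_threshold=0.5):
    """Fraction of pixels in the feature map that are marked as occlusion."""
    total = sum(len(row) for row in feature)
    occluded = sum(1 for row in feature for v in row if v > pixel_threshold)
    return occluded / total

def maybe_remove(image, feature, remove_fn, preset_ratio=0.05):
    """Run occlusion removal only when enough pixels are occluded."""
    if occlusion_ratio(feature) > preset_ratio:
        return remove_fn(image, feature)
    return image  # treated as occlusion-free and passed through unchanged

# One of four pixels is marked as occluded, so the ratio is 0.25.
feature = [[0.9, 0.1], [0.2, 0.1]]
result = maybe_remove("raw image", feature, lambda img, feat: "cleaned image")
```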

[0018] In one possible implementation, the feature extraction subnetwork includes: four convolutional stages, two fast transformers, and four upsampling stages; the fast transformer includes: a 3×3 convolutional layer, three 1×1 convolutional layers, two transformers, a feature fusion layer, and a feature summing layer; the 3×3 convolutional layer, the first 1×1 convolutional layer of the three 1×1 convolutional layers, the two transformers, the second 1×1 convolutional layer of the three 1×1 convolutional layers, the feature fusion layer, the third 1×1 convolutional layer of the three 1×1 convolutional layers, and the feature summing layer are sequentially connected in communication; the feature fusion layer and the first 1×1 convolutional layer are also connected in communication.

[0019] Based on the aforementioned technical means, this application can rapidly construct a full-image attention extraction mechanism using a fast transformer. Local feature extraction is performed using a 3×3 convolutional layer and a 1×1 convolutional layer, and global feature extraction is performed using two transformers and one 1×1 convolutional layer. The global and local features are then combined to improve image feature extraction. Applying the fast transformer to the feature extraction subnetwork in the image processing network therefore allows features to be extracted from the image more effectively.
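The connection order inside one fast transformer can be traced with placeholder layers, as in the sketch below. Each placeholder only records its name, so the output string shows the data path and nothing is computed. The text does not say what the feature summing layer adds to the output of the third 1×1 convolution, so the sketch assumes it is the block input, in the manner of a residual connection. The dictionary keys and function names are hypothetical.

```python
def fast_transformer_block(x, layers):
    # Local branch: 3x3 convolution followed by the first 1x1 convolution.
    local = layers["conv1x1_a"](layers["conv3x3"](x))
    # Global branch: two transformers, then the second 1x1 convolution.
    global_ = layers["conv1x1_b"](
        layers["transformer_2"](layers["transformer_1"](local))
    )
    # The fusion layer also receives the first 1x1 convolution's output.
    fused = layers["fusion"](local, global_)
    # ASSUMPTION: the summing layer adds the block input (residual style).
    return layers["sum"](layers["conv1x1_c"](fused), x)

def named(name):
    """Placeholder layer that wraps its inputs in its own name."""
    return lambda *inputs: f"{name}({','.join(inputs)})"

layer_names = ["conv3x3", "conv1x1_a", "transformer_1", "transformer_2",
               "conv1x1_b", "fusion", "conv1x1_c", "sum"]
layers = {name: named(name) for name in layer_names}
trace = fast_transformer_block("x", layers)
```

In `trace`, the first 1×1 convolution appears twice: once on the direct path into the fusion layer and once at the start of the transformer path.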

[0020] According to a second aspect of this application, a model training apparatus is provided, comprising an acquisition unit and a processing unit; the acquisition unit is configured to acquire training images; the training images include: occluded images and unoccluded images; the occluded images and the unoccluded images are images from different scenes; the processing unit is configured to determine a first processed image and a second processed image based on an image processing network; the first processed image is an image after removing occluders from the occluded image; the second processed image is an image after adding occluders to the unoccluded image; the processing unit is further configured to input the first processed image and the unoccluded image into a first image discrimination network to obtain a first loss, and input the second processed image and the occluded image into a second image discrimination network to obtain a second loss; the processing unit is further configured to train a model to be trained based on a loss set to obtain an image processing model; the loss set includes the first loss and the second loss; the model to be trained includes the image processing network, the first image discrimination network, and the second image discrimination network; the image processing model is used to remove occluders from the image to be processed.

[0021] In one possible implementation, the image processing network includes: a feature extraction subnetwork, an occlusion addition subnetwork, and an occlusion removal subnetwork; the processing unit is specifically configured to: perform feature extraction processing on the occluded image and the unoccluded image through the feature extraction subnetwork to obtain a first occlusion image feature corresponding to the occluded image and a second occlusion image feature corresponding to the unoccluded image; input the first occlusion image feature and the occluded image into the occlusion removal subnetwork to obtain the first processed image; and input the second occlusion image feature and the unoccluded image into the occlusion addition subnetwork to obtain the second processed image.

[0022] In one possible implementation, the loss set further includes: a third loss, and / or, a discriminative loss; the discriminative loss includes: a fourth loss and a fifth loss; when the loss set includes the third loss, the processing unit is further configured to: generate a target noise matrix of the same size as the training image through the feature extraction subnetwork; determine a filtering kernel corresponding to the target noise matrix; filter the target noise matrix through the filtering kernel corresponding to the target noise matrix to obtain a filtering result; normalize the filtering result to obtain a third occlusion image feature; and determine the loss between the first occlusion image feature and the third occlusion image feature, and the loss between the second occlusion image feature and the third occlusion image feature, as the third loss.

[0023] When the loss set includes the discriminative loss, the processing unit is further configured to: process the first processed image through the feature extraction subnetwork and the occlusion addition subnetwork to obtain a third processed image; process the second processed image through the feature extraction subnetwork and the occlusion removal subnetwork to obtain a fourth processed image; determine the consistency loss between the third processed image and the occluded image as the fourth loss; and determine the consistency loss between the fourth processed image and the unoccluded image as the fifth loss.

[0024] In one possible implementation, the acquisition unit is further configured to acquire an original image; the processing unit is further configured to determine the original image as an image to be processed when it is determined that the original image contains an occlusion; the processing unit is further configured to remove the occlusion in the image to be processed through a feature extraction subnetwork and an occlusion removal subnetwork in the image processing model.

[0025] In one possible implementation, the acquisition unit is further configured to acquire the image to be processed; the processing unit is further configured to input the image to be processed into the feature extraction sub-network in the image processing model to obtain the fourth occlusion image features corresponding to the image to be processed; the processing unit is further configured to remove the occlusion in the image to be processed by the occlusion removal sub-network in the image processing model and the fourth occlusion image features when the proportion of occlusion pixels in the fourth occlusion image features is greater than a preset proportion.

[0026] In one possible implementation, the feature extraction subnetwork includes: four convolutional stages, two fast transformers, and four upsampling stages; the fast transformer includes: a 3×3 convolutional layer, three 1×1 convolutional layers, two transformers, a feature fusion layer, and a feature summing layer; the 3×3 convolutional layer, the first 1×1 convolutional layer of the three 1×1 convolutional layers, the two transformers, the second 1×1 convolutional layer of the three 1×1 convolutional layers, the feature fusion layer, the third 1×1 convolutional layer of the three 1×1 convolutional layers, and the feature summing layer are sequentially connected in communication; the feature fusion layer and the first 1×1 convolutional layer are also connected in communication.

[0027] According to a third aspect provided in this application, an electronic device is provided, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the method of the first aspect described above and any possible implementation thereof.

[0028] According to a fourth aspect provided in this application, a computer-readable storage medium is provided; when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method described in the first aspect and any possible implementation thereof.

[0029] According to a fifth aspect provided in this application, a computer program product is provided, the computer program product including computer instructions which, when executed on an electronic device, cause the electronic device to perform the method described in the first aspect and any possible implementation thereof.

[0030] Therefore, the above-mentioned technical features of this application have the following beneficial effects:

[0031] (1) By inputting occluded and unoccluded images from different scenes into an image processing network, a first processed image and a second processed image can be obtained. A first image discrimination network can then distinguish between the first processed image after removing occlusions and the unoccluded image to obtain a first loss. A second image discrimination network can then distinguish between the second processed image after adding occlusions and the occluded image to obtain a second loss. Afterward, a model to be trained can be trained using a loss set including the first and second losses, so that the image processing model corresponding to the model to be trained can remove occlusions from the image to be processed. In this way, an image processing model for removing occlusions from an image can be trained without needing occluded and unoccluded images from the same scene.

[0032] (2) Occlusions in occluded images can be removed using the feature extraction subnetwork and the occlusion removal subnetwork in the image processing network, and occlusions can be added to unoccluded images using the feature extraction subnetwork and the occlusion addition subnetwork in the image processing network. In this way, the occlusion addition subnetwork aids the training of the image processing network, and the occlusions in occluded images can be removed using the trained feature extraction subnetwork and occlusion removal subnetwork to obtain a natural image.

[0033] (3) This application can also generate a target noise matrix through a feature extraction subnetwork, and filter and normalize it through the filter kernel corresponding to the target noise matrix to obtain the third occlusion image features. In this way, the electronic device can determine the third occlusion image features as relatively accurate features. Then, the electronic device determines the loss between the first occlusion image features and the third occlusion image features extracted by the feature extraction subnetwork, as well as the loss between the second occlusion image features and the third occlusion image features, as the third loss, so that the quality of feature extraction by the feature extraction subnetwork can be determined by the third loss.

[0034] Furthermore, this application can add occlusions to the first processed image using a feature extraction subnetwork and an occlusion addition subnetwork in the image processing network, and remove occlusions from the second processed image using a feature extraction subnetwork and an occlusion removal subnetwork in the image processing network. Since the first processed image is the image after removing occlusions, and the second processed image is the image after adding occlusions, both the occlusion addition subnetwork and the occlusion removal subnetwork can be trained on images with and without occlusions in different scenarios, resulting in better model training performance. Moreover, since the fourth loss is the consistency loss between the third processed image and the image with occlusions, and the fifth loss is the consistency loss between the fourth processed image and the image without occlusions, the model to be trained can also be trained using a loss set that includes a discriminative loss.

[0035] (4) When it is determined that the original image obtained is an image to be processed containing occlusions, the occlusions in the image to be processed can be removed by the feature extraction subnetwork and the occlusion removal subnetwork in the trained image processing model.

[0036] (5) The image to be processed can be processed by the feature extraction sub-network in the image processing model to obtain the fourth occlusion image features. When the proportion of pixels of the occlusion is greater than the preset proportion, the occlusion in the image to be processed is removed by the occlusion removal sub-network in the image processing model and the fourth occlusion image features. In this way, it is possible to automatically determine whether there is an occlusion in the acquired image, so as to determine whether to remove the occlusion from the acquired image.

[0037] (6) A full-image attention extraction mechanism can be quickly constructed using a fast transformer, and local feature extraction can be performed using a 3×3 convolutional layer and a 1×1 convolutional layer. Two transformers and one 1×1 convolutional layer can be used for global feature extraction. Then, global and local features can be combined to improve image feature extraction. In this way, applying it to the feature extraction subnetwork in an image processing network can better extract features from the image.

[0038] It should be noted that the technical effects of any of the implementation methods in the second to fifth aspects can be found in the technical effects of the corresponding implementation methods in the first aspect, and will not be repeated here.

[0039] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0040] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application, and do not constitute an undue limitation of this application.

[0041] Figure 1 is a first flowchart of a model training method according to an exemplary embodiment;

[0042] Figure 2 is a schematic diagram illustrating the structure of a first image discrimination network or a second image discrimination network according to an exemplary embodiment;

[0043] Figure 3 is a second flowchart of a model training method according to an exemplary embodiment;

[0044] Figure 4 is a schematic diagram of the structure of a fast transformer according to an exemplary embodiment;

[0045] Figure 5 is a schematic diagram illustrating the structure of a feature extraction subnetwork according to an exemplary embodiment;

[0046] Figure 6 is a schematic diagram illustrating the structure of an occlusion removal subnetwork or an occlusion addition subnetwork according to an exemplary embodiment;

[0047] Figure 7 is a schematic diagram illustrating the generation of a first processed image via an image processing network according to an exemplary embodiment;

[0048] Figure 8 is a schematic diagram illustrating the generation of a second processed image via an image processing network according to an exemplary embodiment;

[0049] Figure 9 is a third flowchart of a model training method according to an exemplary embodiment;

[0050] Figure 10 is a fourth flowchart of a model training method according to an exemplary embodiment;

[0051] Figure 11 is a schematic diagram illustrating the structure of a model to be trained according to an exemplary embodiment;

[0052] Figure 12 is a fifth flowchart of a model training method according to an exemplary embodiment;

[0053] Figure 13 is a sixth flowchart of a model training method according to an exemplary embodiment;

[0054] Figure 14 is a schematic diagram illustrating a process for determining whether an image to be processed contains an occlusion, according to an exemplary embodiment;

[0055] Figure 15 is a block diagram illustrating a model training apparatus according to an exemplary embodiment;

[0056] Figure 16 is a block diagram illustrating an electronic device according to an exemplary embodiment.

Detailed Implementation

[0057] To enable those skilled in the art to better understand the technical solutions of this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings.

[0058] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0059] In recent years, intelligent driving technology for automobiles has developed rapidly, and perception technology is a crucial component of this technology. Cars can use cameras to perceive information about their external environment, thereby providing information for vehicle control.

[0060] However, when encountering severe weather such as rain or snow, the images captured by the camera include rain and snow, which can significantly affect the accuracy of the car's perception, thus rendering the intelligent driving technology unusable.

[0061] Currently, deep learning techniques are commonly used to remove rain and snow from images to improve image quality and enhance the robustness of subsequent perception applications such as object detection and semantic segmentation. For example, rain can be removed from a single image using cascaded dilated convolutional neural networks; image desnowing algorithms can reuse low-level features and combine model-based learning with deep learning to achieve higher snow removal accuracy; and image desnowing can also be implemented using recurrent generative adversarial networks.

[0062] However, the methods described above all require training with paired images of the same scene, one with rain or snow and one without. Such paired images are difficult to acquire, and the methods do not utilize full-image information for training. Furthermore, these methods process only rain or only snow and cannot handle both simultaneously. Additionally, these methods require that rain or snow be present in the image to be processed, which is often difficult to determine in practical applications.

[0063] To facilitate understanding, the model training method provided in this application will be described in detail below with reference to the accompanying drawings.

[0064] Figure 1 is a flowchart illustrating a model training method according to an exemplary embodiment. As shown in Figure 1, the model training method includes the following steps:

[0065] S101. The electronic device acquires training images.

[0066] The training images include images with occlusions and images without occlusions. The images with occlusions and images without occlusions represent images from different scenes.

[0067] Optionally, the electronic device can be a terminal, a server, or a vehicle controller in a vehicle; this application does not limit the specific type of device.

[0068] Optionally, the electronic device can be configured with an i7-4790K processor with a clock speed of 4.0 GHz, 32 gigabytes of memory, and a graphics card with 24 GB of video memory. This method is implemented under the operating system (Ubuntu 22.04) and the PyTorch deep learning framework, and the model to be trained is built using the PyTorch framework.

[0069] Specifically, in intelligent driving technology, since images captured by cameras may contain obstructions such as rain and snow, electronic devices can use historical images as training images to train a model. This allows the trained model to remove obstructions like rain and snow from the camera-captured images. Furthermore, electronic devices can add weak labels to the training images, classifying them as images with or without obstructions, enabling the device to train based on these weak labels.

[0070] Optionally, the electronic device can store images with occlusions in one folder and images without occlusions in another folder to classify training images.

[0071] Optionally, the electronic device can also divide the training data into a training set and a test set in a 3:1 ratio (or other ratio). The training set includes images with occlusions and images without occlusions. The test set includes images with occlusions and images without occlusions.
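
As a minimal sketch of the folder-based weak labelling and the 3:1 split described above, the following Python snippet splits two lists of file names. The function name `split_3_to_1`, the file names, and the fixed random seed are illustrative assumptions and are not part of the application.

```python
import random

def split_3_to_1(paths, seed=0):
    """Shuffle paths and split them into (train, test) at a 3:1 ratio."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 3) // 4  # three quarters go to training
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names; the folder name acts as the weak label.
occluded = [f"with_occlusion/{i:03d}.png" for i in range(8)]
clean = [f"without_occlusion/{i:03d}.png" for i in range(12)]

train_occ, test_occ = split_3_to_1(occluded)
train_clean, test_clean = split_3_to_1(clean)
print(len(train_occ), len(test_occ), len(train_clean), len(test_clean))
```

Splitting each folder separately keeps both classes present in the training set and in the test set, as the paragraph above requires.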

[0072] It should be noted that the rain and snow in the embodiments of this application refer to rain and / or snow.

[0073] S102. The electronic device determines the first processed image and the second processed image based on the image processing network.

[0074] The first processed image is the image after removing the occluders from the image with occluders. The second processed image is the image after adding occluders to the image without occluders.

[0075] Specifically, when training the model based on training images, the electronic device can input an image with occlusions and an image without occlusions into the image processing network. In this way, the electronic device can use the image processing network to remove occlusions from the image with occlusions to obtain a first processed image, and to add occlusions to the image without occlusions to obtain a second processed image.

[0076] S103. The electronic device inputs the first processed image and the image without obstructions into the first image discrimination network to obtain the first loss, and inputs the second processed image and the image with obstructions into the second image discrimination network to obtain the second loss.

[0077] Specifically, during training, since the image processing network is not yet fully trained, the quality of the first and second processed images generated by the network is poor. In this case, the electronic device can input the first processed image and the image without occlusion into the first image discrimination network. Then, the electronic device can compare the first processed image with the image without occlusion to determine whether it was generated by the image processing network. The binary loss of the first image discrimination network in identifying the first processed image as not generated by the network is used as the first loss; that is, the electronic device can use the binary loss of the first image discrimination network in identifying the first processed image as a natural image as the first loss.

[0078] Accordingly, the electronic device can determine the second loss as the binary loss by which the second image discrimination network identifies the second processed image as a natural image.
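
The first and second losses can be illustrated with a short Python sketch. It assumes that the "binary loss" is binary cross-entropy and that each discrimination network outputs a probability that its input is a natural image; neither detail is stated in the application, and the probabilities below are made up for illustration.

```python
import math

def binary_cross_entropy(prob, target, eps=1e-7):
    """BCE for one probability in (0, 1) against a 0/1 target."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def adversarial_loss(disc_probs_on_generated):
    """Mean BCE of the discriminator outputs against the 'natural' label 1."""
    losses = [binary_cross_entropy(p, 1) for p in disc_probs_on_generated]
    return sum(losses) / len(losses)

# Hypothetical discriminator outputs: probability that each image is natural.
first_loss = adversarial_loss([0.9, 0.8])   # first network on first processed images
second_loss = adversarial_loss([0.2, 0.1])  # second network on second processed images
print(first_loss < second_loss)
```

Under these assumptions, a processed image that the discrimination network accepts as natural yields a small loss, so a falling first or second loss indicates more convincing generated images.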

[0079] It should be noted that the electronic device can determine the quality of the model being trained based on the first loss and the second loss. Because the electronic device uses the first image discrimination network and an unoccluded image to determine whether the first processed image is a natural image, the first processed image does not need to be compared with an unoccluded image of the same scene as the occluded image. In this way, the electronic device can train the model using occluded and unoccluded images from different scenes.

[0080] It should be understood that the first image discrimination network and the second image discrimination network are constructed in the same way. Figure 2 shows a schematic diagram of the first image discrimination network or the second image discrimination network. The electronic device can input an input image into convolution stage 1 to obtain feature M1. Feature M1 is then processed through convolution stage 2 to obtain feature M2, and feature M2 is processed through convolution stage 3 to obtain feature M3. Finally, feature M4 is processed through a fully connected layer to obtain a true or false discrimination result. When the result is true, the first or second image discrimination network determines that the input image is a natural image; when the result is false, the network determines that the input image is an image generated by the image processing network (or an electronic device).

[0081] Optionally, in the first image discrimination network and the second image discrimination network, the embodiments of this application do not limit the number of convolution stages, residual stages, and upsampling stages.

[0082] S104. The electronic device trains the model to be trained based on the loss set to obtain the image processing model.

[0083] The loss set includes a first loss and a second loss. The model to be trained includes an image processing network, a first image discrimination network, and a second image discrimination network. The image processing model is used to remove occlusions from the image to be processed.

[0084] Specifically, during training, the electronic device can train the model to be trained based on a loss set including a first loss and a second loss. The electronic device can then determine when the training of the model is complete based on the loss set, thus obtaining the image processing model. In this way, the electronic device can remove occlusions such as rain and snow from images using the trained image processing model.

[0085] Optionally, when the first loss in the loss set is less than a first preset threshold and the second loss is less than a second preset threshold, the electronic device can determine that the training of the model to be trained is complete.

[0086] It should be noted that since the image processing network, the first image discrimination network, and the second image discrimination network are all networks under training, the model to be trained includes the image processing network, the first image discrimination network, and the second image discrimination network, and the image processing model also includes the image processing network, the first image discrimination network, and the second image discrimination network.

[0087] However, since the first and second image discrimination networks are used to determine whether the images generated by the image processing network are natural images, in practical applications, the image processing model removes occlusions from the images to be processed through the feature extraction subnetwork and the occlusion removal subnetwork in the image processing network.

[0088] In some embodiments, the image processing network includes a feature extraction subnetwork, an occlusion addition subnetwork, and an occlusion removal subnetwork. With reference to Figure 1 and as shown in Figure 3, in S102 above, the electronic device determining the first processed image and the second processed image based on the image processing network specifically includes:

[0089] S301. The electronic device performs feature extraction on the occluded image and the unoccluded image through the feature extraction subnetwork to obtain the first occlusion image features corresponding to the occluded image and the second occlusion image features corresponding to the unoccluded image.

[0090] The image processing network includes a feature extraction sub-network. This sub-network comprises four convolutional stages, two fast transformers, and four upsampling stages. The four convolutional stages are sequentially connected; the third convolutional stage is connected to the first fast transformer; the fourth convolutional stage is connected to the second fast transformer; the first fast transformer is connected to the second upsampling stage; the second fast transformer is connected to the first upsampling stage; the four upsampling stages are sequentially connected; the first convolutional stage is connected to the fourth upsampling stage; and the second convolutional stage is connected to the third upsampling stage.

[0091] The fast transformer comprises: a 3×3 convolutional layer, three 1×1 convolutional layers, two transformers, a feature fusion layer, and a feature summing layer; the 3×3 convolutional layer, the first 1×1 convolutional layer of the three 1×1 convolutional layers, the two transformers, the second 1×1 convolutional layer of the three 1×1 convolutional layers, the feature fusion layer, the third 1×1 convolutional layer of the three 1×1 convolutional layers, and the feature summing layer are sequentially connected in communication; the feature fusion layer and the first 1×1 convolutional layer are also connected in communication.

[0092] Each of the four convolutional stages in the feature extraction subnetwork consists of two 3×3 convolutional layers and one pooling layer with 2x downsampling. Each of the four upsampling stages consists of an upsampling operation and one 1×1 convolutional layer.

[0093] It should be noted that, considering the memory capacity and computing power of electronic devices, the fast transformer is applied only to the third and fourth convolution stages. Figure 4 illustrates the structure of a fast transformer. In this fast transformer, the first layer is a 3×3 convolutional layer, the second layer is a 1×1 convolutional layer, the third layer consists of two transformers, the fourth layer is a 1×1 convolutional layer, the fifth layer is a feature fusion layer (i.e., the ... in Figure 4), the sixth layer is a 1×1 convolutional layer, and the seventh layer is a feature summing layer (i.e., the "C" in Figure 4). After being input to the fast transformer, the image is first processed through the 3×3 convolutional layer to obtain feature L1. Next, feature L1 is processed through the 1×1 convolutional layer in the second layer to obtain feature L2. Then, feature L2 is processed through the two transformers in the third layer to obtain feature L3. Next, feature L3 is processed through the 1×1 convolutional layer in the fourth layer to obtain feature L4. Then, features L4 and L2 are fused through the feature fusion layer in the fifth layer to obtain feature L5. Next, feature L5 is processed through the 1×1 convolutional layer in the sixth layer to obtain feature L6. Finally, the feature summing layer in the seventh layer adds the input image and feature L6 to obtain the output feature image.
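
The seven-layer data flow described above can be traced with a small Python sketch in which every layer is passed in as a callable. The function name `fast_transformer`, the stand-in layers, and the choice to apply the two transformers one after the other are assumptions made for illustration; the actual fusion operation is not specified here, so it is also left as a parameter.

```python
def fast_transformer(x, conv3x3, conv1x1_a, transformers, conv1x1_b, fuse, conv1x1_c):
    """Trace the seven-layer data flow; every layer is passed in as a callable."""
    l1 = conv3x3(x)              # layer 1: 3x3 convolution
    l2 = conv1x1_a(l1)           # layer 2: 1x1 convolution (local features)
    l3 = l2
    for block in transformers:   # layer 3: two transformers, assumed sequential
        l3 = block(l3)
    l4 = conv1x1_b(l3)           # layer 4: 1x1 convolution (global features)
    l5 = fuse(l4, l2)            # layer 5: fuse global (L4) and local (L2) features
    l6 = conv1x1_c(l5)           # layer 6: 1x1 convolution
    return [a + b for a, b in zip(x, l6)]  # layer 7: add the input and L6

# Stand-in layers operating on flat lists of floats (purely illustrative).
double = lambda v: [2.0 * t for t in v]
plus_one = lambda v: [t + 1.0 for t in v]
average = lambda a, b: [(s + t) / 2.0 for s, t in zip(a, b)]

out = fast_transformer([1.0, 2.0], double, plus_one, [double, double], plus_one, average, double)
print(out)  # [17.0, 28.0]
```

The final addition acts as a residual connection around the whole block, so the block's output has the same shape as its input.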

[0094] Furthermore, the first and second layers of the fast transformer are used to extract local features from the input image, while the two transformers in the third layer and the fourth layer are used to extract global features. The feature fusion layer then combines the local and global features, which improves the feature extraction performance of the fast transformer.

[0095] Specifically, when the electronic device processes training images through the image processing network, it can first extract features through the feature extraction subnetwork in the image processing network. Figure 5 shows a schematic diagram of the feature extraction subnetwork. The electronic device inputs the training image into convolution stage 1 (the first convolution stage) to obtain feature I1. Feature I1 is then processed through convolution stage 2 (the second convolution stage) to obtain feature I2. Feature I2 is further processed through convolution stage 3 (the third convolution stage) to obtain feature I3. Feature I3 is then processed through convolution stage 4 (the fourth convolution stage) to obtain feature I4. Feature I3 is processed through fast transformer 1 (the first fast transformer) to obtain feature I5, and feature I4 is processed through fast transformer 2 (the second fast transformer) to obtain feature I6. Subsequently, feature I6 is processed through upsampling stage 1 (the first upsampling stage) to obtain feature I7. Features I5 and I7 are then processed through upsampling stage 2 (the second upsampling stage) to obtain feature I8. Finally, feature I9 is processed through upsampling stage 3 (the third upsampling stage) to obtain feature I10, and feature I10 is processed through upsampling stage 4 (the fourth upsampling stage) to obtain the first occlusion image features or the second occlusion image features.

[0096] It should be understood that the features in the embodiments of this application are all feature images. The occlusion image features (e.g., the first occlusion image features and the second occlusion image features) produced by the feature extraction subnetwork are the same size as the training images.
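
The claim that the extracted features match the size of the training image can be checked with a shape trace of the stage connections. The sketch below is in Python and assumes that each upsampling stage doubles the resolution, that each fast transformer preserves the size of its input, and that the image sides are divisible by 16; none of these is stated explicitly, and the stage names used as dictionary keys are illustrative.

```python
def trace_feature_extractor(height, width):
    """Return the output size of every stage for an input of (height, width)."""
    def down(size):  # a convolution stage ends with 2x-downsampling pooling
        return (size[0] // 2, size[1] // 2)

    def up(size, skip):  # assumed: 2x upsampling; the skip input has the same size
        assert size == skip, "skip connection sizes must match"
        return (size[0] * 2, size[1] * 2)

    s = {}
    s["conv1"] = down((height, width))
    s["conv2"] = down(s["conv1"])
    s["conv3"] = down(s["conv2"])
    s["conv4"] = down(s["conv3"])
    s["ft1"] = s["conv3"]  # fast transformer 1, assumed size-preserving
    s["ft2"] = s["conv4"]  # fast transformer 2, assumed size-preserving
    s["up1"] = up(s["ft2"], s["ft2"])    # only input: fast transformer 2
    s["up2"] = up(s["up1"], s["ft1"])    # joined by fast transformer 1
    s["up3"] = up(s["up2"], s["conv2"])  # joined by convolution stage 2
    s["up4"] = up(s["up3"], s["conv1"])  # joined by convolution stage 1
    return s

sizes = trace_feature_extractor(256, 512)
print(sizes["conv4"], sizes["up4"])  # (16, 32) (256, 512)
```

Four halvings followed by four doublings return to the input resolution, and every skip connection joins two features of equal size.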

[0097] Optionally, the first occluder image feature and the second occluder image feature can be an occluder mask, such as a rain / snow line mask or a rain / snow mask.

[0098] S302, the electronic device inputs the first occlusion image features and the occluded image into the occlusion removal sub-network to obtain a first processed image, and inputs the second occlusion image features and the unoccluded image into the occlusion addition sub-network to obtain a second processed image.

[0099] Specifically, to remove occlusions, the electronic device can input the first occlusion image features and the image with occlusions into the occlusion removal sub-network. Since the first occlusion image features (e.g., an occlusion mask) can represent the position of the occlusion in the image with occlusions, the electronic device can remove the occlusions in the image with occlusions using the first occlusion image features to obtain the first processed image.

[0100] Correspondingly, the electronic device can input the second occlusion image features and the unoccluded image into the occlusion addition subnetwork. Since the second occlusion image features are features of the occlusion generated by the feature extraction subnetwork, the occlusion addition subnetwork can add occlusions to the unoccluded image based on the second occlusion image features to obtain the second processed image.

[0101] It should be noted that the occlusion removal subnetwork and the occlusion addition subnetwork have the same structure, but the parameters of their layers are not shared. Figure 6 shows a schematic diagram of the occlusion removal subnetwork or the occlusion addition subnetwork. The electronic device can input an occlusion mask (i.e., the first or second occlusion image features) and an input image (e.g., a training image) into convolution stage 1 to obtain feature X1. Feature X1 is then processed through convolution stage 2 to obtain feature X2. Feature X2 is further processed through convolution stage 3 to obtain feature X3. Feature X3 is then processed through convolution stage 4 to obtain feature X4. Feature X4 is processed through a residual stage to obtain feature X5. Features X5 and X4 are then processed through upsampling stage 1 to obtain feature X6. Features X6 and X3 are then processed through upsampling stage 2 to obtain feature X7. Features X7 and X2 are then processed through upsampling stage 3 to obtain feature X8. Finally, features X8 and X1 are processed through upsampling stage 4 to obtain an output image (e.g., the first or second processed image).

[0102] Furthermore, the convolution stage consists of two 3×3 convolutional layers and one pooling layer with a 2x downsampling. The residual stage includes six residual layers, and the upsampling stage consists of upsampling and a 1×1 convolutional layer. The features from the convolution stage are fused through summation. After inputting the rain / snow line mask and the input image, the electronic device can input the result of concatenating the input image and the rain / snow line mask along the channel dimension into convolution stage 1.

[0103] Optionally, in the image processing network, the embodiments of this application do not limit the number of convolution stages, fast transformers, and upsampling stages.

[0104] For example, assuming the occlusion is rain or snow, an image with occlusions is an image with rain or snow, and an image without occlusions is an image without rain or snow. Figure 7 shows a schematic diagram of generating the first processed image via the image processing network. The electronic device can input the rain / snow image $I_W$ into the rain / snow line generation module $N_{trans}$ of the image processing network (i.e., the feature extraction subnetwork in this application) to obtain the rain / snow line mask $M_W$ (i.e., the first occlusion image features in this application). Next, the electronic device can input the rain / snow line mask $M_W$ and the rain / snow image $I_W$ into the rain / snow image generation module $N_{gen\_dw}$ (i.e., the occlusion removal subnetwork in this application) to obtain the de-rained and de-snowed image $I_{d\_w}$ (i.e., the first processed image in this application).

[0105] Figure 8 shows a schematic diagram of generating the second processed image via the image processing network. The electronic device can input the rain / snow-free image $I_N$ into the rain / snow line generation module $N_{trans}$ of the image processing network (i.e., the feature extraction subnetwork in this application) to obtain the rain / snow line mask $M_N$ (i.e., the second occlusion image features in this application). Next, the electronic device can input the rain / snow line mask $M_N$ and the rain / snow-free image $I_N$ into the rain / snow image generation module $N_{gen\_gw}$ (i.e., the occlusion addition subnetwork in this application) to obtain the rain / snow-added image $I_{g\_w}$ (i.e., the second processed image in this application).

[0106] In some embodiments, the loss set further includes a third loss and / or a discriminative loss. Where the loss set includes the third loss, with reference to Figure 3 and as shown in Figure 9, the model training method provided in this embodiment of the application further includes:

[0107] S901, the electronic device generates a target noise matrix of the same size as the training image through a feature extraction sub-network.

[0108] Specifically, the electronic device can generate random noise with values from 0 to 256 on a matrix of the same size as the training image to obtain a first noise matrix. Next, the electronic device can set the entries of the first noise matrix that are less than 256 - 0.01 * V to 0 to obtain a second noise matrix. Afterwards, the electronic device can filter the second noise matrix with a filter kernel ... to obtain the target noise matrix.

[0109] Where V is the intensity of the occlusion in the model to be trained, ranging from 100 to 10000, and can fluctuate randomly within this range.

[0110] S902. The electronic device determines the filter kernel corresponding to the target noise matrix.

[0111] Specifically, after determining the target noise matrix, the electronic device can determine the translation amount as (L / 2, L / 2) and the rotation matrix as ... In addition, the electronic device can apply the affine transformation to a diagonal matrix of size L and then apply Gaussian blurring with a window of (W, W) to obtain the filter kernel corresponding to the target noise matrix.

[0112] Among the training hyperparameters, the length L of the occlusion (e.g., the length of a rain / snow line) ranges from 1 to 10, the size W of the occlusion (e.g., the size of a rain / snow line) is 1 or 3, and the angle θ of the occlusion (e.g., the angle of a rain / snow line) ranges from -45 degrees to 45 degrees. These hyperparameters can be used as weak labels for training and fluctuate randomly within the above ranges.

[0113] S903. The electronic device filters the target noise matrix with the filter kernel corresponding to the target noise matrix to obtain a filtering result, and normalizes the filtering result to obtain the third occluder image features.

[0114] Specifically, after determining the filter kernel corresponding to the target noise matrix, the electronic device can filter the target noise matrix using the filter kernel to obtain the filtering result. Then, the electronic device can normalize the filtering result to obtain the third occluder image features.
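
Steps S901 to S903 can be sketched end to end in plain Python. This is an illustration under several assumptions rather than the method of the application: the kernel is built by rotating the diagonal of an L×L matrix about its centre with nearest-neighbour rounding (standing in for the affine transformation, whose rotation matrix is not reproduced in this text), the Gaussian blur is omitted (which corresponds to W = 1), the intermediate filtering that produces the target noise matrix is skipped, and normalization is taken to be min-max scaling to [0, 1]. All function names are hypothetical.

```python
import math
import random

def thresholded_noise(h, w, v, rng):
    """Uniform noise in [0, 256); entries below 256 - 0.01 * v are set to 0."""
    limit = 256 - 0.01 * v
    noise = [[rng.uniform(0, 256) for _ in range(w)] for _ in range(h)]
    return [[x if x >= limit else 0.0 for x in row] for row in noise]

def line_kernel(length, theta_deg):
    """Rotate the diagonal of a length x length matrix by theta about its centre."""
    c = (length - 1) / 2.0
    t = math.radians(theta_deg)
    kernel = [[0.0] * length for _ in range(length)]
    for i in range(length):
        dx, dy = i - c, i - c  # point on the diagonal, relative to the centre
        x = int(round(c + dx * math.cos(t) - dy * math.sin(t)))
        y = int(round(c + dx * math.sin(t) + dy * math.cos(t)))
        if 0 <= x < length and 0 <= y < length:
            kernel[y][x] = 1.0
    return kernel

def filter2d(img, kernel):
    """Same-size 2-D correlation with zero padding."""
    h, w, k = len(img), len(img[0]), len(kernel)
    off = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for col in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - off, col + j - off
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += img[rr][cc] * kernel[i][j]
            out[r][col] = acc
    return out

def normalize(img):
    """Min-max scale to [0, 1]; a constant image maps to all zeros."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    span = hi - lo
    return [[(x - lo) / span if span else 0.0 for x in row] for row in img]

rng = random.Random(0)
noise = thresholded_noise(32, 32, v=5000, rng=rng)
mask = normalize(filter2d(noise, line_kernel(length=7, theta_deg=-30)))
print(len(mask), len(mask[0]))
```

A larger V lowers the threshold and keeps more noise points, which produces denser streaks, while L and θ control the length and direction of each streak.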

[0115] S904, the electronic device determines the loss between the first occluder image features and the third occluder image features, as well as the loss between the second occluder image features and the third occluder image features, as the third loss.

[0116] Specifically, since both the first and third occlusion image features are features of the occlusion itself, and the third occlusion image feature is generated by the electronic device through the feature extraction sub-network, the electronic device can use the third occlusion image feature as the accurate feature of the occlusion. Then, the electronic device can compare the first and third occlusion image features and determine the loss between them, as well as the loss between the second and third occlusion image features, as the third loss. This allows the electronic device to determine the quality of feature extraction performed by the feature extraction sub-network based on the third loss.
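
As a minimal sketch of the third loss, the snippet below compares two extracted masks against a synthesized one. It follows paragraph [0131] in using an L2 loss, but the mean reduction and the summing of the two terms are assumptions, and the 2×2 masks are made-up values.

```python
def l2_loss(a, b):
    """Mean squared difference between two equally sized 2-D feature maps."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def third_loss(first_feat, second_feat, third_feat):
    """Sum of the two mask losses against the synthesized third feature."""
    return l2_loss(first_feat, third_feat) + l2_loss(second_feat, third_feat)

# Tiny hypothetical 2x2 masks.
third = [[0.0, 1.0], [0.0, 0.0]]
first = [[0.0, 1.0], [0.0, 0.0]]   # matches the synthesized mask exactly
second = [[1.0, 1.0], [0.0, 0.0]]  # differs in one of four positions
print(third_loss(first, second, third))  # 0.25
```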

[0117] In some embodiments, the discriminative loss includes a fourth loss and a fifth loss. Where the loss set includes the discriminative loss, with reference to Figure 9 and as shown in Figure 10, the model training method provided in this embodiment of the application further includes:

[0118] S1001. The electronic device processes the first processed image through the feature extraction subnetwork and the occlusion addition subnetwork to obtain a third processed image, and processes the second processed image through the feature extraction subnetwork and the occlusion removal subnetwork to obtain a fourth processed image.

[0119] Specifically, since the first processed image is the image after removing the occluders from the image with occluders, while the image without occluders and the first processed image are images without occluders in different scenes, the electronic device can further process the first processed image through a feature extraction sub-network and an occlusion-adding sub-network to obtain the third processed image. In this way, the electronic device can train the feature extraction sub-network and the occlusion-adding sub-network using images without occluders in different scenes (i.e., the first processed image and the image without occluders).

[0120] Correspondingly, since the second processed image is the unoccluded image with occlusions added, and the image with occlusions and the second processed image are images with occlusions in different scenes, the electronic device can further process the second processed image using the feature extraction subnetwork and the occlusion removal subnetwork to obtain a fourth processed image. In this way, the electronic device can train the feature extraction subnetwork and the occlusion removal subnetwork using images with occlusions in different scenes (i.e., the second processed image and the image with occlusions).

[0121] In this scenario, even if the training images contain only unoccluded images of a given scene and no occluded images of that scene, the electronic device can train the feature extraction subnetwork and the occlusion removal subnetwork using the second processed image corresponding to the unoccluded image. This allows the occlusion removal subnetwork to be trained using an image with occlusions in that scene (i.e., the second processed image), without needing to collect images with and without occlusions in the same scene.

[0122] Correspondingly, the electronic device can train the feature extraction subnetwork and the occlusion addition subnetwork on the first processed image, so that the occlusion addition subnetwork adds more realistic occlusions to the unoccluded image; that is, the second processed image generated by the occlusion addition subnetwork is more realistic.

[0123] It should be understood that the model to be trained includes both the generation of the first and second processed images through the image processing network, and the generation of the third and fourth processed images through the image processing network. The model to be trained can also be referred to as a weakly supervised recurrent transformation network.

[0124] Based on the examples above, Figure 11 shows a schematic diagram of the structure of the model to be trained. In the model to be trained, the electronic device can input the rain / snow image $I_W$ and the rain / snow-free image $I_N$ into the rain / snow line generation module $N_{trans}$ (i.e., the feature extraction subnetwork in this application) to obtain the rain / snow line mask $M_W$ corresponding to $I_W$ (i.e., the first occlusion image features in this application) and the rain / snow line mask $M_N$ corresponding to $I_N$ (i.e., the second occlusion image features in this application). Next, the electronic device can input the mask $M_W$ and the rain / snow image $I_W$ into the rain / snow image generation module $N_{gen\_dw}$ (i.e., the occlusion removal subnetwork in this application) to obtain the de-rained and de-snowed image $I_{d\_w}$ (i.e., the first processed image in this application), and input the mask $M_N$ and the rain / snow-free image $I_N$ into the rain / snow image generation module $N_{gen\_gw}$ (i.e., the occlusion addition subnetwork in this application) to obtain the rain / snow-added image $I_{g\_w}$ (i.e., the second processed image in this application).

[0125] Next, the electronic device can also input the de-rained and de-snowed image $I_{d\_w}$ and the rain / snow-added image $I_{g\_w}$ into the rain / snow line generation module to obtain the rain / snow line mask $M_{d\_w}$ corresponding to $I_{d\_w}$ and the rain / snow line mask $M_{g\_w}$ corresponding to $I_{g\_w}$. The electronic device can then input the mask $M_{d\_w}$ and the image $I_{d\_w}$ into the rain / snow image generation module $N_{gen\_gw}$ to obtain the rain / snow recovery image $I_{r\_w}$ (i.e., the third processed image in this application), and input the mask $M_{g\_w}$ and the image $I_{g\_w}$ into the rain / snow image generation module $N_{gen\_dw}$ to obtain the rain / snow-free recovery image $I_{r\_n}$ (i.e., the fourth processed image in this application).

[0126] In addition, the electronic device can input the de-rained and de-snowed image $I_{d\_w}$ and the rain / snow-free image $I_N$ into the rain / snow image discrimination module $N_{dis\_dw}$ (i.e., the first image discrimination network in this application), so that the electronic device determines through $N_{dis\_dw}$ whether $I_{d\_w}$ is a natural image (or an image generated by the electronic device). The electronic device can likewise input the rain / snow-added image $I_{g\_w}$ and the rain / snow image $I_W$ into the rain / snow image discrimination module $N_{dis\_gw}$ (i.e., the second image discrimination network in this application), so that the electronic device determines through $N_{dis\_gw}$ whether $I_{g\_w}$ is a natural image.
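
The forward data flow of Figure 11 can be summarized in a short Python sketch in which the three generator-side networks are passed in as callables. The function name `forward_cycle` and the toy stand-in networks, which represent an image as a `(scene, has_rain)` tuple, are hypothetical and serve only to show which module consumes which input.

```python
def forward_cycle(i_w, i_n, n_trans, n_gen_dw, n_gen_gw):
    """Run one forward pass of the two cycles; the three networks are callables."""
    m_w, m_n = n_trans(i_w), n_trans(i_n)  # rain / snow line masks
    i_d_w = n_gen_dw(m_w, i_w)             # first processed image
    i_g_w = n_gen_gw(m_n, i_n)             # second processed image
    m_d_w, m_g_w = n_trans(i_d_w), n_trans(i_g_w)
    i_r_w = n_gen_gw(m_d_w, i_d_w)         # third processed image
    i_r_n = n_gen_dw(m_g_w, i_g_w)         # fourth processed image
    return {"I_d_w": i_d_w, "I_g_w": i_g_w, "I_r_w": i_r_w, "I_r_n": i_r_n}

# Toy stand-ins: an image is a (scene, has_rain) tuple.
n_trans = lambda image: ("mask", image[0])
n_gen_dw = lambda mask, image: (image[0], False)  # removes rain / snow
n_gen_gw = lambda mask, image: (image[0], True)   # adds rain / snow

out = forward_cycle(("street", True), ("park", False), n_trans, n_gen_dw, n_gen_gw)
print(out["I_r_w"], out["I_r_n"])
```

With ideal networks, the third processed image reproduces the original rain / snow image and the fourth reproduces the original rain / snow-free image, which is what the consistency losses measure.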

[0127] S1002, the electronic device determines the consistency loss between the third processed image and the image with occlusion as the fourth loss, and determines the consistency loss between the fourth processed image and the image without occlusion as the fifth loss.

[0128] Specifically, since the third processed image is the image after adding occlusions to the first processed image, and the first processed image is the image after removing occlusions from the occluded image, the third processed image can be the image after removing occlusions from the occluded image and then adding occlusions again. In this case, the electronic device can determine whether the third processed image and the occluded image are consistent, and determine the consistency L1 loss between the third processed image and the occluded image as the fourth loss.

[0129] Accordingly, since the fourth processed image is the image after removing the occlusions from the second processed image, and the second processed image is the image after adding occlusions to the unoccluded image, the fourth processed image can be the image after adding occlusions to the unoccluded image and then removing the occlusions. In this case, the electronic device can determine whether the fourth processed image and the unoccluded image are consistent, and determine the consistency L1 loss between the fourth processed image and the unoccluded image as the fifth loss.
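
A consistency L1 loss of this kind can be written in a few lines of Python. The mean reduction over pixels is an assumption, and the 2×2 "images" are made-up values chosen so that the arithmetic is exact.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized 2-D images."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Hypothetical pixel values for the occluded image and its reconstruction.
occluded = [[0.25, 0.75], [0.5, 0.5]]
third_processed = [[0.25, 0.5], [0.5, 0.5]]

fourth_loss = l1_loss(third_processed, occluded)
print(fourth_loss)  # 0.0625
```

The fifth loss is computed in the same way between the fourth processed image and the unoccluded image.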

[0130] Optionally, the electronic device can determine that the training of the model to be trained is complete when the first loss in the loss set is less than a first preset threshold, the second loss is less than a second preset threshold, the third loss is less than a third preset threshold, the fourth loss is less than a fourth preset threshold, and the fifth loss is less than a fifth preset threshold.
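
The stopping rule above amounts to a conjunction of five comparisons, sketched below in Python. The function name `training_complete` and the threshold values are hypothetical; the application does not give concrete thresholds.

```python
def training_complete(losses, thresholds):
    """True only if every tracked loss is below its own preset threshold."""
    return all(losses[name] < thresholds[name] for name in thresholds)

# Hypothetical preset thresholds for the five losses.
thresholds = {"first": 0.7, "second": 0.7, "third": 0.05, "fourth": 0.1, "fifth": 0.1}

early = {"first": 1.2, "second": 0.9, "third": 0.04, "fourth": 0.3, "fifth": 0.2}
late = {"first": 0.6, "second": 0.65, "third": 0.01, "fourth": 0.05, "fifth": 0.04}
print(training_complete(early, thresholds), training_complete(late, thresholds))
```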

[0131] In one possible implementation, the loss set may include the loss $L_G$, the loss $L_{dis\_dw}$, and the loss $L_{dis\_gw}$. The loss $L_G$ may include: the binary loss $L_{adv1}$ with which the first image discrimination network determines that the first processed image is real (i.e., a natural image), that is, the first loss; the binary loss $L_{adv2}$ with which the second image discrimination network determines that the second processed image is real, that is, the second loss; the L2 loss $L_{mask}$ between the first and third occlusion image features and between the second and third occlusion image features generated by the feature extraction subnetwork, that is, the third loss; the consistency L1 loss $L_{con1}$ between the third processed image and the image with occlusions, that is, the fourth loss; and the consistency L1 loss $L_{con2}$ between the fourth processed image and the unoccluded image, that is, the fifth loss.

[0132] The loss $L_G$ is:

[0133] $L_G = \lambda_{mask} L_{mask} + \lambda_{adv1} L_{adv1} + \lambda_{adv2} L_{adv2} + \lambda_{con1} L_{con1} + \lambda_{con2} L_{con2}$;

[0134] where $\lambda_{mask}$, $\lambda_{adv1}$, $\lambda_{adv2}$, $\lambda_{con1}$, and $\lambda_{con2}$ are hyperparameters (i.e., weight parameters) and can be set to 10, 1, 1, 10, and 10, respectively.
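As a worked illustration of the weighted sum above, the following sketch combines five loss terms using the weights 10, 1, 1, 10, and 10 given in this paragraph. The individual loss values and the dictionary-based interface are hypothetical.

```python
from typing import Dict

# Weights taken from the text: mask, adv1, adv2, con1, con2.
GENERATOR_WEIGHTS: Dict[str, float] = {
    "mask": 10.0, "adv1": 1.0, "adv2": 1.0, "con1": 10.0, "con2": 10.0,
}


def generator_loss(terms: Dict[str, float],
                   weights: Dict[str, float] = GENERATOR_WEIGHTS) -> float:
    """Weighted sum of the five loss terms that make up L_G."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical per-term loss values for one training step.
terms = {"mask": 0.02, "adv1": 0.7, "adv2": 0.6, "con1": 0.05, "con2": 0.04}
l_g = generator_loss(terms)  # 0.2 + 0.7 + 0.6 + 0.5 + 0.4
```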

[0135] The loss $L_{dis\_dw}$ may include: the binary loss $L_{real\_dw}$ with which the electronic device determines that the image without occlusions is real, and the binary loss $L_{fake\_dw}$ with which the electronic device determines that the first processed image is fake (i.e., determines that the first processed image is an image generated by the image processing network).

[0136] The loss $L_{dis\_dw}$ is:

[0137] $L_{dis\_dw} = \lambda_{real\_dw} L_{real\_dw} + \lambda_{fake\_dw} L_{fake\_dw}$;

[0138] where $\lambda_{real\_dw}$ and $\lambda_{fake\_dw}$ are hyperparameters (i.e., weight parameters) and can be set to 0.5 and 0.5, respectively.

[0139] The loss $L_{dis\_gw}$ may include: the binary loss $L_{real\_gw}$ with which the electronic device determines that the image with occlusions is real, and the binary loss $L_{fake\_gw}$ with which the electronic device determines that the second processed image is fake (i.e., determines that the second processed image is an image generated by the image processing network).

[0140] The loss $L_{dis\_gw}$ is:

[0141] $L_{dis\_gw} = \lambda_{real\_gw} L_{real\_gw} + \lambda_{fake\_gw} L_{fake\_gw}$;

[0142] where $\lambda_{real\_gw}$ and $\lambda_{fake\_gw}$ are hyperparameters (i.e., weight parameters) and can be set to 0.5 and 0.5, respectively.
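The two discrimination losses share the same form, which can be sketched as below. The text only says "binary loss"; this sketch assumes binary cross-entropy on a discriminator output interpreted as the probability that its input is real. That choice, the function names, and the sample probabilities are assumptions for illustration.

```python
import math


def binary_loss(prob_real: float, target_is_real: bool) -> float:
    """Binary cross-entropy for a single prediction (assumed loss form)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_real, eps), 1.0 - eps)
    return -math.log(p) if target_is_real else -math.log(1.0 - p)


def discriminator_loss(prob_on_real: float, prob_on_generated: float,
                       w_real: float = 0.5, w_fake: float = 0.5) -> float:
    """Weighted sum of the 'real' and 'fake' binary losses (weights 0.5, 0.5)."""
    l_real = binary_loss(prob_on_real, target_is_real=True)
    l_fake = binary_loss(prob_on_generated, target_is_real=False)
    return w_real * l_real + w_fake * l_fake


# Hypothetical discriminator outputs.
l_dis_dw = discriminator_loss(prob_on_real=0.9, prob_on_generated=0.2)
l_dis_gw = discriminator_loss(prob_on_real=0.8, prob_on_generated=0.3)
```

A discriminator that outputs 0.5 for every input would score `math.log(2)` under this assumed loss, which is a convenient reference value when reading training curves.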

[0143] The training objective of the model to be trained is:

[0144] $L_{total} = L_G + L_{dis\_dw} + L_{dis\_gw}$;

[0145] The training objective of the model to be trained is solved by optimization: when the loss $L_G$ in the loss set is at its minimum and the losses $L_{dis\_dw}$ and $L_{dis\_gw}$ are at their maximum, the electronic device determines that the training of the model to be trained is complete.

[0146] Furthermore, when training the model, after one round of training, the electronic device can store the model parameters and first update the loss $L_G$, then update the losses $L_{dis\_dw}$ and $L_{dis\_gw}$.

[0147] Optionally, during training, the initial learning rate can be 0.0005, the number of training epochs can be 100, the batch size can be 8, and the optimizer can be the Adam optimizer (Adaptive Moment Estimation).
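These optional settings could be collected in a small configuration object, sketched below. The class name `TrainingConfig`, the helper `steps_per_epoch`, and the dataset size of 1000 images are hypothetical; only the four default values come from this paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Optional training settings named in the text."""
    initial_learning_rate: float = 0.0005
    epochs: int = 100
    batch_size: int = 8
    optimizer: str = "adam"  # Adaptive Moment Estimation


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of batches needed to see every image once (ceiling division)."""
    return -(-num_images // batch_size)


config = TrainingConfig()
# Hypothetical dataset of 1000 training images.
total_steps = config.epochs * steps_per_epoch(1000, config.batch_size)
```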

[0148] Optionally, the loss set may include a first loss and a second loss, or a third loss, or a first loss, a second loss, a fourth loss, and a fifth loss.

[0149] In some embodiments, as shown in Figure 12, the model training method provided in this embodiment of the application further includes:

[0150] S1201. The electronic device acquires the original image.

[0151] Specifically, after training of the model to be trained is complete, the electronic device can acquire the original image through a device such as a camera.

[0152] S1202. When it is determined that the original image contains an obstruction, the electronic device determines the original image as the image to be processed.

[0153] Specifically, since the electronic device may be unable to identify on its own whether the original image contains occlusions, the presence of occlusions can be determined manually. Then, when the electronic device determines that the original image contains occlusions, it can designate the original image as the image to be processed, so that the image processing model processes it.

[0154] Optionally, the presence of an occlusion in the original image can be determined from signals such as the car's windshield wiper switch. For example, when the car's windshield wipers are turned on, the electronic device determines that the acquired original image includes an occlusion. Alternatively, the presence of an occlusion in the original image can be determined manually; for example, when a switch corresponding to the image processing model is manually turned on, the electronic device determines that the acquired original image includes an occlusion. This application does not limit the manner of determination.
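The gating logic of S1202 could be sketched as follows, assuming the wiper state and the manual switch are available as boolean signals. The function names and the nested-list image are hypothetical.

```python
from typing import Optional, Sequence

Image = Sequence[Sequence[float]]  # hypothetical image representation


def contains_occlusion(wiper_on: bool, manual_switch_on: bool = False) -> bool:
    """Treat either signal as evidence that the frame contains occlusions."""
    return wiper_on or manual_switch_on


def select_image_to_process(original: Image, wiper_on: bool,
                            manual_switch_on: bool = False) -> Optional[Image]:
    """Return the frame if it should go to occlusion removal, else None."""
    return original if contains_occlusion(wiper_on, manual_switch_on) else None


frame = [[0.1, 0.9], [0.4, 0.3]]
to_process = select_image_to_process(frame, wiper_on=True)
skipped = select_image_to_process(frame, wiper_on=False)
```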

[0155] S1203. The electronic device removes occlusions from the image to be processed through the feature extraction subnetwork and the occlusion removal subnetwork in the image processing model.

[0156] Specifically, after determining the image to be processed, the electronic device can extract the occlusion mask (i.e., occlusion image features) of the image through a trained feature extraction sub-network. Then, the occlusions in the image are removed by the occlusion removal sub-network and the occlusion mask to obtain the corresponding natural image without occlusions. Based on this natural image, the car can perform intelligent driving.

[0157] Correspondingly, when it is determined that the image to be processed does not contain occlusions, the electronic device can add occlusions to the image to be processed through the feature extraction subnetwork and the occlusion addition subnetwork in the image processing model.

[0158] In some embodiments, as shown in Figure 13, the model training method provided in this embodiment of the application further includes:

[0159] S1301. The electronic device acquires the image to be processed.

[0160] Specifically, electronic devices can acquire images to be processed through car cameras and other means.

[0161] S1302. The electronic device inputs the image to be processed into the feature extraction sub-network in the image processing model to obtain the fourth occluder image features corresponding to the image to be processed.

[0162] Specifically, in order to determine whether there are occluders in the image to be processed, the electronic device can input the image to be processed into the feature extraction sub-network in the image processing model. The trained feature extraction sub-network processes the image to obtain the fourth occluder image features corresponding to the image to be processed, namely the occluder mask.

[0163] S1303. When the proportion of pixels of the occluder in the fourth occluder image features is greater than the preset proportion, the electronic device removes the occluder in the image to be processed through the occluder removal sub-network and the fourth occluder image features in the image processing model.

[0164] Specifically, when the proportion of occluder pixels in the fourth occluder image features is greater than a preset proportion (e.g., 0.01), the electronic device can remove the occluder in the image to be processed through the occluder removal sub-network and the fourth occluder image features in the image processing model, so as to obtain a natural image without occluders corresponding to the image to be processed.

[0165] Correspondingly, when the pixel ratio of the occluder in the fourth occluder image feature is less than or equal to a preset ratio, the electronic device can determine that there is no occluder in the image to be processed and will not process the image to be processed.
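The decision in S1303 can be sketched as a ratio test on the occluder mask. This sketch assumes the mask holds values in [0, 1] and that a pixel counts as an occluder pixel when its value exceeds 0.5; that per-pixel cutoff and the function names are assumptions, while the preset proportion of 0.01 is the example value from the text.

```python
from typing import Sequence

Mask = Sequence[Sequence[float]]  # hypothetical occluder mask, values in [0, 1]


def occluder_pixel_ratio(mask: Mask, pixel_threshold: float = 0.5) -> float:
    """Fraction of mask pixels whose value exceeds the assumed cutoff."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    occluded = sum(1 for row in mask for value in row if value > pixel_threshold)
    return occluded / total


def should_remove_occluder(mask: Mask, preset_ratio: float = 0.01) -> bool:
    """Remove occluders only when the ratio is strictly above the preset."""
    return occluder_pixel_ratio(mask) > preset_ratio


# Hypothetical 10x10 masks with two and one occluder pixels respectively.
rainy = [[0.0] * 10 for _ in range(10)]
rainy[2][3] = rainy[7][5] = 1.0
nearly_clear = [[0.0] * 10 for _ in range(10)]
nearly_clear[0][0] = 1.0
```

With these masks, `should_remove_occluder(rainy)` is true (ratio 0.02), while `should_remove_occluder(nearly_clear)` is false because a ratio equal to the preset proportion is treated as "no occluder".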

[0166] Based on the examples above, Figure 14 shows a schematic diagram of a process for determining whether an image to be processed contains occlusions. The electronic device can input the image to be processed into the trained rain/snow line generation module $N_{trans}$ to obtain a rain and snow line mask (i.e., the fourth occluder image features in this application). Next, the electronic device determines whether the proportion of the rain and snow area (i.e., the occluder pixels in this application) in the rain and snow line mask is greater than a preset proportion. When the electronic device determines that this proportion is greater than the preset proportion, it inputs the rain and snow line mask and the image to be processed into the rain and snow image generation module $N_{gen\_dw}$ to obtain a rain- and snow-removed image (i.e., the natural image without occlusions corresponding to the image to be processed in this application).

[0167] The foregoing primarily describes the solutions provided by the embodiments of this application from a methodological perspective. To achieve the aforementioned functions, the model training device or electronic device includes corresponding hardware structures and / or software modules for executing each function. Those skilled in the art should readily recognize that, based on the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0168] This application embodiment can, based on the above method, exemplarily divide a model training device or electronic device into functional modules. For example, the model training device or electronic device may include functional modules corresponding to each functional division, or two or more functions may be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the module division in this application embodiment is illustrative and only represents one logical functional division; in actual implementation, there may be other division methods.

[0169] Figure 15 is a block diagram illustrating a model training apparatus according to an exemplary embodiment. Referring to Figure 15, the model training device 1500 includes an acquisition unit 1501 and a processing unit 1502.

[0170] The acquisition unit 1501 is used to acquire training images; the training images include: images with occlusion and images without occlusion; the images with occlusion and images without occlusion are images from different scenes.

[0171] The processing unit 1502 is used to determine a first processed image and a second processed image based on an image processing network; the first processed image is the image after removing occluders from the image with occluders; the second processed image is the image after adding occluders to the image without occluders.

[0172] The processing unit 1502 is further configured to input the first processed image and the image without occlusion into the first image discrimination network to obtain a first loss, and to input the second processed image and the image with occlusion into the second image discrimination network to obtain a second loss.

[0173] The processing unit 1502 is also used to train the model to be trained based on the loss set to obtain an image processing model; the loss set includes a first loss and a second loss; the model to be trained includes an image processing network, a first image discrimination network and a second image discrimination network; the image processing model is used to remove occlusions in the image to be processed.

[0174] In one possible implementation, the image processing network further includes: a feature extraction subnetwork, an occlusion addition subnetwork, and an occlusion removal subnetwork; the processing unit 1502 is specifically used for:

[0175] The feature extraction sub-network is used to extract features from images with and without occlusions to obtain the first occluded image features corresponding to the occluded image and the second occluded image features corresponding to the unoccluded image.

[0176] The first occluded image features and the image with occlusion are input into the occlusion removal sub-network to obtain the first processed image, and the second occluded image features and the image without occlusion are input into the occlusion addition sub-network to obtain the second processed image.

[0177] In one possible implementation, the loss set further includes a third loss and/or a discriminative loss; the discriminative loss includes a fourth loss and a fifth loss.

[0178] In the case where the loss set includes the third loss:

[0179] The processing unit 1502 is also used to generate a target noise matrix of the same size as the training image through the feature extraction sub-network; the processing unit is also used to determine the filter kernel corresponding to the target noise matrix.

[0180] The processing unit 1502 is also used to filter the target noise matrix through the filter kernel corresponding to the target noise matrix to obtain the filtering result, and to normalize the filtering result to obtain the third occlusion image features.

[0181] The processing unit 1502 is further configured to determine the loss between the first occluder image features and the third occluder image features, as well as the loss between the second occluder image features and the third occluder image features, as the third loss.
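The third-loss pipeline (noise matrix, filtering, normalization, comparison) could be sketched in pure Python as below. Several points are assumptions: a seeded random generator stands in for the feature extraction sub-network's noise output, the diagonal 3x3 kernel is a hypothetical choice, normalization is taken to be min-max scaling, the L2 loss is taken as mean squared difference, and the two comparisons are simply summed.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_noise(height: int, width: int, seed: int = 0) -> Matrix:
    """Stand-in for the target noise matrix (same size as the training image)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(width)] for _ in range(height)]


def filter2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Same-size 2D correlation with zero padding at the borders."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - kh // 2, x + j - kw // 2
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def normalize(image: Matrix) -> Matrix:
    """Min-max scaling to [0, 1] (assumed normalization)."""
    flat = [v for row in image for v in row]
    lo, span = min(flat), max(flat) - min(flat)
    if span == 0:
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / span for v in row] for row in image]


def l2_loss(a: Matrix, b: Matrix) -> float:
    """Mean squared difference between two equally sized matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical diagonal kernel that smears noise into streak-like shapes.
streak_kernel = [[1 / 3, 0.0, 0.0], [0.0, 1 / 3, 0.0], [0.0, 0.0, 1 / 3]]
third_features = normalize(filter2d(make_noise(4, 6), streak_kernel))

# Hypothetical first and second occluder image features of the same size.
first_features = [[0.5] * 6 for _ in range(4)]
second_features = [[0.0] * 6 for _ in range(4)]
third_loss = (l2_loss(first_features, third_features)
              + l2_loss(second_features, third_features))
```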

[0182] In the case where the loss set includes the discriminative loss:

[0183] The processing unit 1502 is further configured to process the first processed image through a feature extraction subnetwork and an occlusion addition subnetwork to obtain a third processed image, and to process the second processed image through a feature extraction subnetwork and an occlusion removal subnetwork to obtain a fourth processed image.

[0184] The processing unit 1502 is further configured to determine the consistency loss between the third processed image and the image with occlusion as the fourth loss, and to determine the consistency loss between the fourth processed image and the image without occlusion as the fifth loss.

[0185] In one possible implementation, the acquisition unit 1501 is also used to acquire the original image.

[0186] The processing unit 1502 is further configured to determine the original image as the image to be processed when it is determined that the original image contains an occluder; the processing unit is further configured to remove the occluder in the image to be processed through the feature extraction subnetwork and the occluder removal subnetwork in the image processing model.

[0187] In one possible implementation, the acquisition unit 1501 is also used to acquire the image to be processed.

[0188] The processing unit 1502 is also used to input the image to be processed into the feature extraction sub-network in the image processing model to obtain the fourth occluder image features corresponding to the image to be processed.

[0189] The processing unit 1502 is further configured to remove the occluder in the image to be processed by using the occluder removal sub-network and the fourth occluder image features in the image processing model when the proportion of occluder pixels in the fourth occluder image features is greater than a preset proportion.

[0190] In one possible implementation, the feature extraction subnetwork includes: four convolutional stages, two fast transformers, and four upsampling stages; the fast transformer includes: a 3×3 convolutional layer, three 1×1 convolutional layers, two transformers, a feature fusion layer, and a feature summing layer; the 3×3 convolutional layer, the first 1×1 convolutional layer of the three 1×1 convolutional layers, the two transformers, the second 1×1 convolutional layer of the three 1×1 convolutional layers, the feature fusion layer, the third 1×1 convolutional layer of the three 1×1 convolutional layers, and the feature summing layer are sequentially connected in communication; the feature fusion layer and the first 1×1 convolutional layer are also connected in communication.

[0191] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0192] Figure 16 is a block diagram illustrating an electronic device according to an exemplary embodiment. As shown in Figure 16, the electronic device 1600 includes, but is not limited to, a processor 1601 and a memory 1602.

[0193] The aforementioned memory 1602 is used to store the executable instructions of the aforementioned processor 1601. It is understood that the aforementioned processor 1601 is configured to execute instructions to implement the model training method in the above embodiments.

[0194] It should be noted that those skilled in the art will understand that the electronic device structure shown in Figure 16 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown in Figure 16, or combine certain components, or have a different arrangement of components.

[0195] Processor 1601 is the control center of the electronic device. It connects various parts of the electronic device via various interfaces and lines. By running or executing software programs and / or modules stored in memory 1602, and by calling data stored in memory 1602, it performs various functions and processes data, thereby providing overall monitoring of the electronic device. Processor 1601 may include one or more processing units. Optionally, processor 1601 may integrate an application processor and a modem processor. The application processor mainly handles the operating system, user interface, and applications, while the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into processor 1601.

[0196] The memory 1602 can be used to store software programs and various data. The memory 1602 may primarily include a program storage area and a data storage area. The program storage area may store the operating system, application programs required by at least one functional module (such as a determination unit, processing unit, etc.), etc. Furthermore, the memory 1602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device.

[0197] In an exemplary embodiment, a computer-readable storage medium including instructions is also provided, such as a memory 1602 including instructions, which can be executed by a processor 1601 of an electronic device 1600 to implement the methods in the above embodiments.

[0198] In actual implementation, the functions of the acquisition unit 1501 and the processing unit 1502 in Figure 15 can both be implemented by the processor 1601 in Figure 16 calling the computer program stored in the memory 1602. The specific execution process can be found in the method section of the previous embodiments and will not be repeated here.

[0199] Optionally, the computer-readable storage medium may be a non-transitory computer-readable storage medium, such as a read-only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage device.

[0200] In an exemplary embodiment, this application also provides a computer program product including one or more instructions, which can be executed by a processor 1601 of an electronic device to perform the methods described above.

[0201] It should be noted that when one or more instructions in the computer-readable storage medium or computer program product are executed by the processor of an electronic device, they implement the various processes of the above method embodiments and achieve the same technical effect as the above method. To avoid repetition, they will not be described again here.

[0202] Through the above description of the embodiments, those skilled in the art can clearly understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0203] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0204] The units described as separate components may or may not be physically separate. A component shown as a unit can be one or more physical units; that is, it can be located in one place or distributed in multiple different locations. Some or all of the classified units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0205] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0206] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, essentially, or the part that contributes to the prior art, or a complete or partial classification of the technical solution, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks.

[0207] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A model training method, characterized in that it comprises: acquiring training images, wherein the training images include images with occlusions and images without occlusions, and the images with occlusions and the images without occlusions are images from different scenes; performing feature extraction on the occluded image and the unoccluded image through a feature extraction subnetwork to obtain first occluder image features corresponding to the occluded image and second occluder image features corresponding to the unoccluded image; inputting the first occluder image features and the occluded image into an occluder removal subnetwork to obtain a first processed image; inputting the second occluder image features and the unoccluded image into an occluder addition subnetwork to obtain a second processed image, wherein the first processed image is the image after removing the occluders from the occluded image, and the second processed image is the image after adding occluders to the unoccluded image; inputting the first processed image and the unoccluded image into a first image discrimination network to obtain a first loss, and inputting the second processed image and the occluded image into a second image discrimination network to obtain a second loss; generating, through the feature extraction subnetwork, a target noise matrix of the same size as the training image; determining the filter kernel corresponding to the target noise matrix; filtering the target noise matrix with the filter kernel corresponding to the target noise matrix to obtain a filtering result, and normalizing the filtering result to obtain third occluder image features; determining the loss between the first occluder image features and the third occluder image features, together with the loss between the second occluder image features and the third occluder image features, as a third loss; and training the model to be trained based on a loss set to obtain an image processing model, wherein the loss set includes the first loss, the second loss, and the third loss; the model to be trained includes the feature extraction subnetwork, the occluder removal subnetwork, the occluder addition subnetwork, the first image discrimination network, and the second image discrimination network; and the image processing model is used to remove occluders from the image to be processed.

2. The model training method according to claim 1, characterized in that the loss set further includes a discriminative loss, the discriminative loss including a fourth loss and a fifth loss; when the loss set includes the discriminative loss, the model training method further includes: processing the first processed image through the feature extraction subnetwork and the occlusion addition subnetwork to obtain a third processed image, and processing the second processed image through the feature extraction subnetwork and the occlusion removal subnetwork to obtain a fourth processed image; and determining the consistency loss between the third processed image and the image with occlusions as the fourth loss, and determining the consistency loss between the fourth processed image and the image without occlusions as the fifth loss.

3. The model training method according to any one of claims 1-2, characterized in that the model training method further includes: acquiring an original image; when it is determined that the original image contains an occlusion, determining the original image as the image to be processed; and removing the occlusions in the image to be processed through the feature extraction subnetwork and the occlusion removal subnetwork in the image processing model.

4. The model training method according to any one of claims 1-2, characterized in that the model training method further includes: acquiring the image to be processed; inputting the image to be processed into the feature extraction sub-network in the image processing model to obtain fourth occluder image features corresponding to the image to be processed; and, when the proportion of occluder pixels in the fourth occluder image features is greater than a preset proportion, removing the occluder in the image to be processed through the occluder removal sub-network in the image processing model and the fourth occluder image features.

5. The model training method according to claim 1, characterized in that the feature extraction subnetwork includes: four convolutional stages, two fast transformers, and four upsampling stages; the fast transformer includes: a 3×3 convolutional layer, three 1×1 convolutional layers, two transformers, a feature fusion layer, and a feature summing layer; the 3×3 convolutional layer, the first 1×1 convolutional layer of the three 1×1 convolutional layers, the two transformers, the second 1×1 convolutional layer of the three 1×1 convolutional layers, the feature fusion layer, the third 1×1 convolutional layer of the three 1×1 convolutional layers, and the feature summing layer are sequentially and communicatively connected, and the feature fusion layer and the first 1×1 convolutional layer are also communicatively connected.

6. A model training device, characterized in that it comprises an acquisition unit and a processing unit; the acquisition unit is used to acquire training images, wherein the training images include images with occlusions and images without occlusions, and the images with occlusions and the images without occlusions are images from different scenes; the processing unit is configured to perform feature extraction on the occluded image and the unoccluded image through a feature extraction subnetwork to obtain first occluder image features corresponding to the occluded image and second occluder image features corresponding to the unoccluded image; input the first occluder image features and the occluded image into an occluder removal subnetwork to obtain a first processed image; and input the second occluder image features and the unoccluded image into an occluder addition subnetwork to obtain a second processed image, wherein the first processed image is the image after removing the occluders from the occluded image, and the second processed image is the image after adding occluders to the unoccluded image; the processing unit is further configured to input the first processed image and the unoccluded image into a first image discrimination network to obtain a first loss, and input the second processed image and the occluded image into a second image discrimination network to obtain a second loss; generate, through the feature extraction subnetwork, a target noise matrix of the same size as the training image; determine the filter kernel corresponding to the target noise matrix; filter the target noise matrix with the filter kernel corresponding to the target noise matrix to obtain a filtering result, and normalize the filtering result to obtain third occluder image features; and determine the loss between the first occluder image features and the third occluder image features, together with the loss between the second occluder image features and the third occluder image features, as a third loss; the processing unit is further configured to train the model to be trained based on a loss set to obtain an image processing model, wherein the loss set includes the first loss, the second loss, and the third loss; the model to be trained includes the feature extraction subnetwork, the occluder removal subnetwork, the occluder addition subnetwork, the first image discrimination network, and the second image discrimination network; and the image processing model is used to remove occluders from the image to be processed.

7. An electronic device, characterized in that it comprises: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method according to any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that, when computer-executable instructions stored in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method according to any one of claims 1 to 5.