Neural network model for image segmentation

A two-stage approach first generates a pixel-wise label distribution using a calibration network, and then trains a refinement network against a discriminator network. This addresses the problem of incoherent semantic segmentation under noisy data and yields accurate, stable segmentation results.

CN115485741B · Active · Publication Date: 2026-06-16 · TOMTOM GLOBAL CONTENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TOMTOM GLOBAL CONTENT
Filing Date
2021-05-27
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing likelihood-based semantic segmentation methods tend to generate incoherent semantic maps when dealing with noisy data. In GAN-based models, the generator network is unstable during training, resulting in poor classification results.

Method used

A two-stage approach is adopted. First, a pixel-wise predicted label distribution is generated by a calibration network. Then, a refinement network generates multiple segmentation maps from that distribution and random noise, and is trained against a discriminator network. The objective function includes a cross-entropy loss and a calibration loss; the calibration loss ensures that the average of the segmentation maps is consistent with the predicted label distribution.

Benefits of technology

It improves the accuracy and stability of semantic image segmentation, handles noisy data effectively, and achieves segmentation results with fast convergence and high mode coverage.

Generated by Eureka AI based on patent content.


Abstract

A computer processing system is configured to train a model for use in semantic image segmentation. The model includes a refinement neural network and a discriminator neural network. The refinement neural network is configured to receive a predicted label distribution for an image, obtain one or more random values from a random or pseudo-random noise source, generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values, and output the plurality of predicted segmentation maps to the discriminator neural network. The computer processing system is configured to train the refinement neural network using an objective function that is a function of an output of the discriminator neural network and further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

Description

Background Technology

[0001] This invention relates to neural networks used in semantic image segmentation.

[0002] By treating various computer vision problems as image segmentation problems, convolutional neural networks (CNNs) have been successfully applied to a wide range of computer vision tasks. Examples include road scene understanding for autonomous driving and interpretation of medical imaging. For such applications, networks are typically trained using multi-class per-pixel labels, which together form an image-sized segmentation map (also referred to as "labels" herein). The output of such networks is then another image-sized map representing the per-pixel class probabilities.

[0003] In likelihood-based semantic segmentation, a neural network can be trained to perform semantic segmentation using training data consisting of pairs of images (e.g., photos) and corresponding human-labeled segmentation maps. The final segmentation map for a given input image is then obtained from the trained network by applying the argmax function along the category dimension of the predicted probabilities (i.e., selecting the most likely category for each pixel). Such a segmentation map can be useful, for example, in enabling an autonomous vehicle to determine whether an object in its field of view is another vehicle or a pedestrian.

[0004] However, a drawback of likelihood-based semantic segmentation is its tendency to generate incoherent semantic maps. The underlying reason is the way the training loss is formulated (e.g., using per-pixel cross-entropy), which treats each output pixel in the segmentation map independently of all other pixels; that is, it does not enforce explicit inter-pixel consistency. In particular, for noisy real-world datasets, maximizing a factorized likelihood leads to uncertain predictions in regions where the labels are inconsistent.

[0005] Generative Adversarial Networks (GANs) have been applied to the semantic segmentation problem in an attempt to address the aforementioned per-pixel loss issue. GANs work by training two networks alternately in a minimax game: a generator neural network is trained to produce semantic segmentation maps, while a binary discriminator neural network is trained to distinguish between generated (predicted) segmentation maps ("fake") and ground truth labels ("real"). During training, the generator produces semantic segmentation maps, while the discriminator alternately observes the ground truth labels and the predicted segmentation maps. After training is complete, a copy of the trained generator network can be used to perform semantic segmentation on received images, for example to interpret live video of street scenes in autonomous vehicles.

[0006] However, in practice, GAN models can lead to poor classification results in semantic image segmentation.

[0007] In WO 2019 / 238560, the applicant proposed training a generator network to perform image segmentation using an objective function that includes an adversarial loss term and a pixel-level cross-entropy loss term. Such methods can help stabilize the dynamics of adversarial training, but are not always ideal, especially when the training data is noisy.

[0008] However, the applicant has now devised a different approach that still delivers good image segmentation performance even when trained on noisy data.

Summary of the Invention

[0009] In a first aspect, the present invention provides a computer processing system configured to train a model for use in semantic image segmentation, wherein the model comprises:

[0010] a refinement neural network; and

[0011] a discriminator neural network,

[0012] wherein the refinement neural network is configured to:

[0013] receive a predicted label distribution for an image;

[0014] obtain one or more random values from a random or pseudo-random noise source;

[0015] generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and

[0016] output the plurality of predicted segmentation maps to the discriminator neural network, and

[0017] wherein the computer processing system is configured to train the refinement neural network using an objective function that is a function of an output of the discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

[0018] The computer processing system may be further configured to implement the refinement neural network and / or the discriminator neural network.

[0019] In a second aspect, the present invention provides a method for training a model for use in semantic image segmentation, wherein the model comprises:

[0020] a refinement neural network; and

[0021] a discriminator neural network,

[0022] wherein the refinement neural network is configured to:

[0023] receive a predicted label distribution for an image;

[0024] obtain one or more random values from a random or pseudo-random noise source;

[0025] generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and

[0026] output the plurality of predicted segmentation maps to the discriminator neural network, and

[0027] wherein the method includes training the refinement neural network using an objective function that is a function of an output of the discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

[0028] In a third aspect, the present invention provides computer software including instructions that, when executed on a computer processing system, cause the computer processing system to train a model for use in semantic image segmentation, wherein the model includes:

[0029] a refinement neural network; and

[0030] a discriminator neural network,

[0031] wherein the refinement neural network is configured to:

[0032] receive a predicted label distribution for an image;

[0033] obtain one or more random values from a random or pseudo-random noise source;

[0034] generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and

[0035] output the plurality of predicted segmentation maps to the discriminator neural network, and

[0036] wherein the instructions cause the computer processing system to train the refinement neural network using an objective function that is a function of an output of the discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

[0037] The computer software may be stored on a non-transitory storage medium (e.g., magnetic or solid-state memory) or carried on a transient signal (e.g., an electrical or electromagnetic signal). The computer software may further include instructions for the computer processing system to implement the refinement neural network and / or the discriminator neural network.

[0038] In a fourth aspect, the present invention provides computer software or a computer processing system for implementing a trained model for use in semantic image segmentation, said trained model comprising a trained refinement neural network configured to:

[0039] receive a predicted label distribution for an image;

[0040] obtain one or more random values from a random or pseudo-random noise source;

[0041] generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and

[0042] output the plurality of predicted segmentation maps.

[0043] The trained refinement neural network may have been trained as disclosed herein, that is, using an objective function that is a function of an output of a discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of a plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

[0044] In some embodiments, the trained model is further configured to compute an average of the plurality of predicted segmentation maps. It may be configured to output or further process an averaged predicted segmentation map. It may be configured to apply an argmax operation to the averaged predicted segmentation map.
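The averaging and argmax steps described above can be sketched as follows. This is a minimal illustration in plain Python lists rather than an array library; `sample_map` is a hypothetical stand-in for a trained refinement network (it simply draws a one-hot label per pixel from the predicted distribution), and all function names are assumptions made for this example, not part of the patent.

```python
import random

def average_maps(maps):
    """Pixel-wise arithmetic mean of N maps, each shaped [H][W][K]."""
    n = len(maps)
    h, w, k = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    return [[[sum(m[i][j][c] for m in maps) / n for c in range(k)]
             for j in range(w)] for i in range(h)]

def argmax_labels(avg):
    """Index of the most likely class for every pixel of an [H][W][K] map."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in avg]

def sample_map(dist, rng):
    """Hypothetical stand-in for a trained refinement network: draws one
    one-hot label per pixel from the predicted label distribution."""
    out = []
    for row in dist:
        out_row = []
        for px in row:
            k = rng.choices(range(len(px)), weights=px)[0]
            out_row.append([1.0 if c == k else 0.0 for c in range(len(px))])
        out.append(out_row)
    return out

rng = random.Random(0)
dist = [[[0.9, 0.1], [0.2, 0.8]]]                  # 1x2 image, K = 2 classes
maps = [sample_map(dist, rng) for _ in range(10)]  # N = 10 predicted maps
avg = average_maps(maps)                           # averaged segmentation map
labels = argmax_labels(avg)                        # final per-pixel labels
```

With a real refinement network, only `sample_map` would change; the averaging and argmax steps operate on whatever maps the network outputs.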

[0045] It will therefore be seen that, according to the present invention, semantic image segmentation is performed in multiple stages: predicted label distributions are generated in a first stage, and a second stage is trained to refine these predicted distributions. The second stage uses a generative adversarial method to generate multiple segmentation maps for each predicted label distribution, with an objective function that encourages the average of the multiple predicted maps to align with the categorical distribution given by the corresponding predicted label distribution.

[0046] This approach differs architecturally from the method disclosed in WO 2019 / 238560, which uses a generator network to perform a complete segmentation from the input image to the segmentation map in a single step, without the need to create multiple predictions from each input image.

[0047] The multi-stage approach of this method, in which the predicted label distribution is provided instead of the original image as input for training the generative adversarial stage, has been found to lead to particularly fast convergence in at least some cases.

[0048] Incorporating randomness to generate multiple plausible output segmentation maps from the generative refinement network allows the model to handle uncertainty in the input data well. This method has been found to produce high mode coverage (i.e., producing outputs across the entire output space with virtually no missing modes). At the same time, constraining the average of the multiple output maps to align with the predicted label distribution enables the sampled proposals to accurately reflect the distribution of the underlying data.

[0049] The predicted label distribution can be an image-sized, pixel-wise distribution over a set of labels. It can be a calibrated distribution. It can also be a likelihood or probability map of the image.

[0050] The refinement neural network can be configured to receive the predicted label distribution from a "calibration" neural network. In some embodiments, the model may additionally include this calibration neural network. The calibration neural network can be configured to receive input images and output a corresponding predicted label distribution for each input image. It can map each input image to a per-pixel categorical distribution. Specifically, it can predict a calibrated, image-sized, per-pixel distribution over a set of labels for the input image. It can perform likelihood-based semantic segmentation. It can use a cross-entropy loss function. It can be a well-calibrated neural network. Calibrating the refinement neural network against a well-calibrated calibration neural network, as described herein, can make the entire model well-calibrated.

[0051] The training can be implemented through training logic in the software or computer processing system.

[0052] The calibration neural network can be trained on the same training data as the refinement and discriminator networks. It can be trained by the processing system.

[0053] However, this is not necessary, and in some embodiments, the refinement network may receive the predicted label distribution from a calibration neural network that has been trained independently, for example on a different system or at a different time. In some embodiments, the calibration network may be implemented or executed in inference mode (i.e., without itself being trained) while the refinement network is being trained. This can advantageously enable the model to be trained with a reduced peak computational load.

[0054] The processing system can be configured to train the discriminator neural network. The discriminator can be trained alternately on a set of predicted segmentation maps and on one or more ground truth segmentation maps (which may be the ground truth segmentation maps corresponding to the predicted segmentation maps, or different ground truth segmentation maps). The refinement neural network and the discriminator neural network can form a generative adversarial network (GAN). They can be trained together. The processing system can be configured to train the refinement network and the discriminator alternately.

[0055] The refinement and discriminator neural networks can be trained on training data including a set of predicted label distributions and a set of ground truth (e.g., human-annotated) segmentation maps. The training data may additionally include corresponding input images (e.g., photographs or video frames). In some embodiments, the refinement and / or discriminator neural network can be conditioned on the input images; this may improve the quality of the segmentation results. The same training data (e.g., the input images and corresponding ground truth segmentation maps) can be used to train the calibration neural network, but this is not necessary, as the calibration network can be trained independently using different training data.

[0056] The objective function used to train the refinement neural network can be a loss function. It can be a function of a first objective term and a second objective term (e.g., a weighted sum of them). The first objective term can depend on the output of the discriminator network. The second objective term (also referred to herein as a calibration term) can depend on the difference between the predicted label distribution and the mean of the segmentation maps generated by the refinement network. Optimizing the first objective term increases the probability that the discriminator network cannot distinguish the predicted segmentation maps from ground truth maps. Optimizing the second objective term reduces the difference between the predicted label distribution and the mean segmentation map.

[0057] The objective function may include a weighting parameter λ for weighting the first objective term relative to the second term. This allows their relative influence to be adjusted. The system may include an input for receiving a value of the weighting parameter, for example from a user.

[0058] In some embodiments, the average value of the segmentation maps can be determined by calculating the pixel-by-pixel arithmetic mean of multiple segmentation maps. However, other embodiments may use other types of average values, such as weighted averages.

[0059] The difference between the predicted label distribution and the average segmentation map can be expressed in any suitable manner. However, in some embodiments, the difference may represent the cross-entropy between the predicted label distribution and the average segmentation map. The difference can be the Kullback-Leibler divergence (forward or reverse).
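As a rough illustration of these difference measures, the sketch below computes the cross-entropy and both directions of the Kullback-Leibler divergence at a single pixel. The function names, the small `eps` stabilizer, and the example values are assumptions made for this illustration; a full calibration term would aggregate such per-pixel values over the image.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_k p_k log q_k."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

predicted = [0.7, 0.2, 0.1]   # predicted label distribution at one pixel
mean_map = [0.6, 0.3, 0.1]    # average of the sampled segmentation maps there

# "Reverse" here means the divergence is taken from the sample average to the
# predicted distribution; "forward" is the opposite direction.
reverse_kl = kl_divergence(mean_map, predicted)
forward_kl = kl_divergence(predicted, mean_map)
```

Both divergences are zero only when the average map matches the predicted distribution at that pixel, which is the condition the calibration term encourages.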

[0060] The discriminator network can be trained to minimize its loss in discriminating between the predicted segmentation map output by the refinement network and the ground truth label data.

[0061] Training any of the neural networks can include a gradient descent process.

[0062] The model can be trained on multiple input images, which may include 100, 1000, 10000 or more images.

[0063] In some embodiments, the input image may be a photographic image from a camera—for example, a frame from a video stream. It may be an image of a street scene. Ground truth label data may include data representing one or more object categories in the image.

[0064] In some embodiments, each segmentation map has a common size, which may be the number of pixels in the input image and / or the predicted label distribution. The predicted label distribution may assign a set of likelihood values to each pixel of the map, representing the probability that the pixel has a corresponding label (i.e., belongs to a corresponding category). There may be any number of labels, for example 5, 10, 50, or 100 or more. In some embodiments, the labels relate to categories in a street scene and may include "vehicles," "people," "buildings," etc. Each segmentation map may assign one of these predicted labels to each pixel.

[0065] The model may have a training mode and a trained (inference) mode. Training may occur during a training phase. After the training phase, the network may be configured to receive an input image and segment the input image. In some embodiments, weights may be extracted from a trained refinement neural network and used to create a trained model; this may be a standalone inference model that does not contain a discriminator neural network or any training logic. The trained model may be configured to output one or more predicted segmentation maps of an image, for example multiple segmentation maps, their average, or an argmax or other function of these.

[0066] The computer processing system may include input for receiving image data from a camera. It may be an in-vehicle computer processing system. It may be configured to output segmented data—for example, to an autonomous driving system.

[0067] Images, predicted label distributions, and segmentation maps can be represented and encoded in any suitable manner. Data (including training data) can be stored in and accessed from databases or other data retrieval systems. The weights of the neural network can be stored as values in digital memory.

[0068] The calibration neural network and / or refinement neural network and / or discriminator neural network may each include any number of convolutional layers, dense blocks, and other processing layers. The model and / or training logic may include software instructions for one or more processors, or may include dedicated hardware logic, or may be implemented using a combination of software and dedicated hardware. The software may include instructions stored in the memory of a computer processing system. The computer processing system may include one or more of the following: CPU, DSP, GPU, FPGA, ASIC, volatile memory, non-volatile memory, inputs, outputs, a display, network connectivity, power supply, radio, clock, and any other suitable components. It may include one or more servers or supercomputers. It may include a microcontroller or system-on-a-chip (e.g., when implementing a trained model for inference operations). It may be configured to store, display, or output predicted segmentation maps or other segmentation data.

[0069] Features of any aspect or embodiment described herein may be applied, where appropriate, to any other aspect or embodiment described herein. When referring to different embodiments or groups of embodiments, it should be understood that these embodiments are not necessarily different, but may overlap.

Attached Figure Description

[0070] Some preferred embodiments of the invention will now be described by way of example only with reference to the accompanying drawings, in which:

[0071] Figure 1 is a schematic diagram illustrating a semantic segmentation model embodying the present invention;

[0072] Figure 2 is pseudocode for training the model;

[0073] Figure 3 is pseudocode for running the model in inference mode;

[0074] Figure 4 is a graph of the data log-likelihood from a simplified implementation of a model embodying the present invention;

[0075] Figure 5 is a graph of various outputs from the simplified example implementation when the calibration loss term is used;

[0076] Figure 6 is a graph of various outputs from the simplified example implementation when the calibration loss term is not used;

[0077] Figure 7 shows three street scenes from the Cityscapes dataset, presented with increased blur;

[0078] Figure 8 shows four predicted segmentation maps for each of the three street scenes, generated by a model embodying the present invention; and

[0079] Figure 9 is a schematic diagram illustrating a computer system embodying the present invention.

Detailed Implementation

[0080] The following describes some exemplary neural networks embodying the present invention, which use adversarial training to perform semantic image segmentation. The techniques disclosed herein have been tested on the segmentation of street scenes captured by cameras on vehicles and have been found to be particularly effective in this task, especially when the input data contains noise or ambiguity. However, it should be understood that these networks can be applied to many other image segmentation tasks in different domains.

[0081] Given an input image $x$ with height $H$, width $W$, and $C$ color channels, semantic segmentation is the task of predicting pixel-wise category labels $y \in \{1, \ldots, K\}^{H \times W}$. In some applications, the image $x$ can be an RGB image of a street scene, and the $K$ categories can include "road", "sidewalk", "building", "wall", "person", "vehicle", etc.

[0082] To provide context for the description and the terms used below, two common semantic segmentation methods will first be described.

[0083] Likelihood-based semantic segmentation

[0084] In likelihood-based semantic segmentation, a neural network can be trained to perform semantic segmentation using training data consisting of pairs of images (e.g., photos) and corresponding human-labeled segmentation maps.

[0085] For a dataset of $N$ image and label pairs, the conditional distribution of labels given an image is modelled by a likelihood $q_\theta(y \mid x)$, parameterized by a neural network $F$ with weights $\theta$ and a softmax output activation function.

[0086] A simple and effective way to have a neural network learn class probabilities is to represent the training labels $y \in \{0,1\}^{H \times W \times K}$ as one-hot encoded label maps and to set $q_\theta$ to a pixel-wise factorized categorical distribution, implemented using a softmax output. The probability mass of $q_\theta$ is then given as follows:

[0087]

[0088] Because minimizing the forward Kullback-Leibler divergence between the data distribution and $q_\theta$ with respect to $\theta$ is equivalent to minimizing the cross-entropy between them, for the choice of $q_\theta$ in Equation (1) the loss function simplifies to:

[0089]
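The formulas for Equations (1) and (2) are not reproduced in this text. A standard form consistent with the description above (one-hot labels and a pixel-wise factorized categorical likelihood) would be the following; the pixel index $i$, class index $k$, and the symbols $p_{\mathcal{D}}$ and $\mathcal{L}_{ce}$ are notational assumptions, not taken from the source.

```latex
% Eq. (1): pixel-wise factorized categorical likelihood with one-hot labels y
q_\theta(y \mid x) = \prod_{i=1}^{H \times W} \prod_{k=1}^{K}
    F_\theta(x)_{i,k}^{\,y_{i,k}}
\tag{1}

% Eq. (2): cross-entropy loss, i.e. the expected negative log-likelihood
\mathcal{L}_{ce}(\theta)
  = -\,\mathbb{E}_{p_{\mathcal{D}}(x,y)}\!\left[\log q_\theta(y \mid x)\right]
  = -\,\mathbb{E}_{p_{\mathcal{D}}(x,y)}\!\left[\sum_{i=1}^{H \times W}\sum_{k=1}^{K}
      y_{i,k}\,\log F_\theta(x)_{i,k}\right]
\tag{2}
```

The second equality in (2) follows by taking the logarithm of (1), which turns the double product into a double sum.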

[0090] Typically, $F_\theta$ is implemented as a convolutional neural network (CNN), and its weights are optimized using stochastic gradient descent (SGD) or a variant thereof.

[0091] The final segmentation map of a given input image can be obtained by applying the argmax function along the category dimension of the predicted probabilities (i.e., selecting the most likely category for each pixel).

[0092] The drawback of this method is that it generates incoherent semantic maps. For noisy real-world datasets, maximizing a factorized likelihood leads to uncertain predictions in regions where the labels are inconsistent.
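The argmax step can be illustrated with a small, self-contained sketch. The logit values below are made up for the example, and the nested-list layout `[H][W][K]` and function names are assumptions; in practice the logits would come from the CNN and be held in an array library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def segment(logit_map):
    """Per-pixel class probabilities and argmax labels for an [H][W][K] logit map."""
    probs = [[softmax(px) for px in row] for row in logit_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    return probs, labels

# Illustrative 2x2 image with K = 3 classes
logit_map = [[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]],
             [[0.0, 1.5, 0.3], [1.0, 0.4, 0.9]]]
probs, labels = segment(logit_map)   # labels == [[0, 2], [1, 0]]
```

Note that each pixel is decided independently of its neighbours, which is exactly the property that can produce incoherent maps on noisy data.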

[0093] Adversarial semantic segmentation

[0094] It is known that attempts have been made to mitigate this problem by using generative adversarial networks (GANs) to perform conditional semantic segmentation.

[0095] In a typical adversarial training scenario, the discriminator network learns to distinguish between manually labeled ground truth segmentations (i.e., "real" labels) and predictions provided by the generator network (i.e., "fake" labels). By engaging the generator (or segmenter) and the discriminator in an alternating minimax game, the generator is challenged to improve its predictions so as to deceive the discriminator with realistic segmentation predictions. Ideally, this guides the generator to learn to produce desired high-level structural qualities, such as connectivity, inter-pixel consistency, and smoothness, without these properties being explicitly specified.

[0096] Formally, this involves training a binary discriminator network $D_\psi$ to optimally distinguish between ground truth and predicted semantic labels, while simultaneously training a segmentation network $G_\phi$ to maximize the probability that prediction samples $G_\phi(x_i)$ are perceived as real by $D_\psi$.

[0097] To account for ambiguity in the labels, the segmentation network can also be conditioned on an extraneous noise variable (e.g., Gaussian noise $\epsilon \sim \mathcal{N}(0, 1)$).

[0098] The non-saturating loss function of the generator can be expressed as:

[0099]

[0100] The non-saturating loss function of the discriminator can be expressed as:

[0101]
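The formulas for Equations (3) and (4) are not reproduced in this text. The standard non-saturating GAN losses, written in the notation of the preceding paragraphs, would take the following form; this is offered as an assumption about their shape rather than as the exact expressions of the source, and the symbols $\mathcal{L}_G$, $\mathcal{L}_D$ and $p_{\mathcal{D}}$ are chosen for this illustration. The discriminator may additionally be conditioned on the image $x$, which is omitted here.

```latex
% Eq. (3): non-saturating generator loss
\mathcal{L}_{G}(\phi)
  = -\,\mathbb{E}_{p_{\mathcal{D}}(x),\,p(\epsilon)}
      \left[\log D_\psi\big(G_\phi(x, \epsilon)\big)\right]
\tag{3}

% Eq. (4): discriminator loss (binary cross-entropy on real vs. generated labels)
\mathcal{L}_{D}(\psi)
  = -\,\mathbb{E}_{p_{\mathcal{D}}(x,y),\,p(\epsilon)}
      \left[\log D_\psi(y)
        + \log\!\Big(1 - D_\psi\big(G_\phi(x, \epsilon)\big)\Big)\right]
\tag{4}
```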

[0102] Essentially, GAN methods leverage the maximum likelihood learning paradigm to customize an adaptive loss function parameterized by D for the training data. Compared to explicit pixel-wise likelihood maximization, this adversarial setting learns an implicit sampler through G, which has the potential to model the joint pixel configuration of the synthesized labels and capture the local and global consistency revealed by ground truth.

[0103] In practice, simply using a noise vector as an additional input to G does not produce diverse outputs. The lack of regularization between the noise vector and the output is the main driver of mode collapse, and various strategies (such as cycle consistency) have been proposed to mitigate it.

[0104] Furthermore, since the loss function is a moving target, the dynamic instability of adversarial training is well-known. Therefore, it is known to supplement the adversarial loss term with pixel-level cross-entropy loss from Equation 2.

[0105] Hybrid supervision with adversarial and cross-entropy losses can lead to improved empirical results. However, in the presence of noisy data, the two losses may have opposing objectives, so enforcing them on the same set of parameters φ may be suboptimal. The categorical cross-entropy loss recovers the underlying class probabilities, which may have high entropy in noisy regions of the data; the adversarial term, by contrast, is optimized for low-entropy, sample-like outputs.

[0106] Calibrated multimodal adversarial semantic segmentation

[0107] In contrast to conventional methods, this embodiment decouples two conflicting losses in a two-stage, cascaded architecture, which consists of the following:

[0108] - a first-stage, likelihood-based "calibration" network $F_\theta$, optimized using a cross-entropy loss (as in Equation 1 above); and

[0109] - a second-stage generative "refinement" network $G_\phi$, paired with an adversarial "discriminator" network $D_\psi$, each optimized using a respective objective term, similar to those in Equations 3 and 4.

[0110] Advantageously, this approach of decoupling cross-entropy optimization from adversarial optimization allows the complementary advantages of the two losses to be utilized, while avoiding the adverse interactions that may occur when they are linearly combined.

[0111] Furthermore, this method enables $F_\theta$ to provide $G_\phi$ with a good initial representation from which to draw the final refined predictions. This can lead to sampled label proposals that better reflect the distribution of the underlying data.

[0112] Equally advantageously, the well-calibrated predictions of $F_\theta$ provide a target for the predictive distribution of $G_\phi$. This has been found to lead to high overall mode coverage, stable training, and fast convergence.

[0113] Intuitively, the refinement network $G_\phi$ can be viewed as a sampler from the explicit likelihood modelled by the calibration network, such that both the pixel-wise class probabilities and object coherence are preserved.

[0114] Figure 1 shows the overall machine learning model 1, which includes the calibration network 2, $F_\theta$; the refinement network 3, $G_\phi$; and the discriminator network 4, $D_\psi$. Figure 1 also shows the forward flow of the training data through $F_\theta$, $G_\phi$ and $D_\psi$, and the inputs to the various loss function terms.

[0115] The calibration network 2 receives the input image 5 and outputs an image-sized, pixel-wise calibrated predicted label distribution 6.

[0116] The refinement network 3 receives the predicted label distribution 6 from the calibration network 2 and Gaussian noise variables $\epsilon \sim \mathcal{N}(0, 1)$ from the random noise generator 7. It outputs a set 8 of N predicted segmentation maps, for a desired value of N greater than 1. It can therefore support multimodal classification, in which a single input image can correspond to multiple valid outputs.

[0117] The discriminator network 4 receives the set 8 of N segmentation maps and the ground truth segmentation map 9 of the input image 5.

[0118] The cross-entropy loss function is a function of the predicted label distribution 6 and the ground truth labels 9.

[0119] The output of the binary discriminator network 4 is used in the generator loss term and in the discriminator loss function.

[0120] The calibration loss term is a function of the predicted label distribution 6 and the average predicted segmentation map 10, wherein the average predicted segmentation map 10 is the pixel-wise arithmetic mean of the set 8 of N segmentation maps.
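The two adversarial terms that consume the discriminator output can be sketched as below, assuming the standard non-saturating formulation. The function names, the `eps` stabilizer, and the example discriminator scores are assumptions for illustration; the scores stand in for probabilities in (0, 1) that a real discriminator network would emit.

```python
import math

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: -mean(log D(fake))."""
    return -sum(math.log(d + eps) for d in d_fake) / len(d_fake)

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: -mean(log D(real)) - mean(log(1 - D(fake)))."""
    real_term = -sum(math.log(d + eps) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - d + eps) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator scores for ground truth maps and predicted maps
d_real = [0.9, 0.8]
d_fake = [0.3, 0.1, 0.2]

loss_g = generator_loss(d_fake)              # large while D rejects the samples
loss_d = discriminator_loss(d_real, d_fake)  # small while D separates them well
```

The generator loss falls as the discriminator assigns higher scores to predicted maps, while the discriminator loss falls as it separates the two sets more confidently, which is the alternating tension the training relies on.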

[0121] The objectives for G_φ and D_ψ differ from those in equations (3) and (4) so as to accommodate the preprocessing performed by the calibration network F_θ.

[0122] The refinement network G_φ is trained to minimize a loss function that includes the following loss term:

[0123]

[0124] The discriminator network D_ψ is trained to minimize the loss function:

[0125]

[0126] The complete loss function for the refinement network G_φ also includes the calibration loss term, as follows:

[0127]

[0128] Where λ is a variable weighting factor.
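The way the adversarial and calibration terms combine can be sketched as follows. The patent's own expressions are not reproduced in this text, so the sketch substitutes a common binary cross-entropy GAN formulation with a non-saturating generator term; that choice, the function names and the `EPS` constant are all assumptions. Only the weighted sum with λ follows directly from the description.

```python
import numpy as np

EPS = 1e-12  # numerical floor so that log() stays finite (assumption)

def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Binary cross-entropy: ground-truth maps should score 1, generated maps 0."""
    return float(-(np.log(d_real + EPS).mean() + np.log(1.0 - d_fake + EPS).mean()))

def generator_adversarial_loss(d_fake: np.ndarray) -> float:
    """Non-saturating generator term (assumed form): generated maps should score 1."""
    return float(-np.log(d_fake + EPS).mean())

def refinement_total_loss(d_fake: np.ndarray, calibration_loss: float, lam: float) -> float:
    """Adversarial term plus the calibration term weighted by lambda."""
    return generator_adversarial_loss(d_fake) + lam * calibration_loss

# Toy discriminator outputs in (0, 1); the values are purely illustrative.
d_real = np.array([0.5, 0.5])
d_fake = np.array([0.5, 0.5])
loss_d = discriminator_loss(d_real, d_fake)
loss_g = refinement_total_loss(d_fake, calibration_loss=0.2, lam=1.0)
```

Setting `lam=0.0` recovers a purely adversarial objective, which corresponds to the λ = 0 baseline used for comparison in the synthetic example.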

[0129] For the optimal cross-entropy weights, the entropy of the model distribution equals the entropy of the data distribution. Therefore, in order to calibrate the predictive distribution, diversity regularization is applied to the refinement network G_φ by encouraging the sample mean to match the class probabilities predicted by F_θ(x). That is, the output of the calibration network serves as a target that encourages G_φ to generate diverse samples.

[0130] To this end, an auxiliary fully factorized likelihood q_φ can be defined:

[0131]

[0132] The reverse Kullback-Leibler divergence KL(q_φ || q_θ) can then be optimized. Since q_θ and q_φ are both categorical distributions, the divergence between them can be computed exactly.

[0133] Therefore, the calibration loss term can be given as follows:

[0134]
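A minimal numerical sketch of such a calibration term is given below, assuming that q_φ is the categorical distribution defined by the pixel-wise sample mean and q_θ is the predicted label distribution. The function name, array shapes, clipping constant and averaging over pixels are assumptions for illustration only.

```python
import numpy as np

def calibration_loss(pred_dist: np.ndarray, samples: np.ndarray, eps: float = 1e-8) -> float:
    """Reverse KL(q_phi || q_theta), averaged over pixels.

    pred_dist: (C, H, W) class probabilities from the calibration network.
    samples:   (N, C, H, W) class probabilities from the refinement network.
    """
    q_phi = np.clip(samples.mean(axis=0), eps, 1.0)   # Monte Carlo sample mean
    q_theta = np.clip(pred_dist, eps, 1.0)
    kl_per_pixel = (q_phi * (np.log(q_phi) - np.log(q_theta))).sum(axis=0)
    return float(kl_per_pixel.mean())

# One pixel, two classes, target distribution [0.5, 0.5].
pred = np.array([0.5, 0.5]).reshape(2, 1, 1)
diverse = np.array([[1.0, 0.0], [0.0, 1.0]]).reshape(2, 2, 1, 1)    # both modes covered
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]]).reshape(2, 2, 1, 1)  # one mode only
loss_diverse = calibration_loss(pred, diverse)
loss_collapsed = calibration_loss(pred, collapsed)
```

In this toy case the diverse sample set matches the target and gives a loss of approximately zero, while the collapsed set is penalized by roughly log 2, which illustrates how the term discourages mode collapse.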

[0135] Implementation Details

[0136] A consequence of the loss decomposition is that θ can be kept fixed while φ and ψ are learned. This allows some embodiments to use a pre-trained calibration network F_θ in inference mode only, which reduces the overall peak computational burden.

[0137] The average of the predicted segmentation maps is computed using Monte Carlo estimation. In practice, the number N of predicted samples in the set 8 needed to achieve good performance may be on the order of 10, at least in some cases, for example N = 10.

[0138] Deep learning frameworks can be used that allow the samples to be folded into the batch dimension, so that they can be computed efficiently on GPUs.
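The batching idea can be illustrated with NumPy standing in for a deep learning framework. In this sketch each predicted label distribution is repeated N times along the batch axis, a single generator call produces all samples, and the result is reshaped so that the Monte Carlo mean is taken over the sample axis. `toy_generator` is a hypothetical placeholder for the refinement network, and all names and shapes are assumptions.

```python
import numpy as np

def softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def toy_generator(inputs, noise):
    """Hypothetical stand-in for G_phi: perturb log-probabilities with noise."""
    return softmax(np.log(inputs + 1e-8) + noise, axis=1)

def sample_in_batch(pred_dist, generator, n_samples, rng):
    """Draw n_samples maps per image with one batched generator call.

    pred_dist: (B, C, H, W)  ->  returns (B, N, C, H, W)
    """
    b, c, h, w = pred_dist.shape
    tiled = np.repeat(pred_dist, n_samples, axis=0)   # (B*N, C, H, W)
    noise = rng.standard_normal(tiled.shape)          # eps ~ N(0, 1)
    out = generator(tiled, noise)
    return out.reshape(b, n_samples, c, h, w)

rng = np.random.default_rng(0)
pred = softmax(rng.standard_normal((2, 3, 4, 4)), axis=1)   # B=2, C=3, 4x4 pixels
maps = sample_in_batch(pred, toy_generator, n_samples=10, rng=rng)
mean_maps = maps.mean(axis=1)                                # Monte Carlo mean, (B, C, H, W)
```

`np.repeat` keeps all copies of one image adjacent in the batch, so the final reshape groups the N samples of each image together without any further indexing.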

[0139] Figure 2 outlines an exemplary training process for the model using a learning rate η and a calibration loss weight λ. In this example, the calibration network F_θ is trained as part of the model. However, in other embodiments, the calibration network can be trained separately, for example using different training data.

[0140] Figure 3 outlines an exemplary use of the model in inference mode (i.e., after training).

[0141] Although not strictly necessary, conditioning the refinement network and the discriminator network on the input image has been found to improve the quality of the results, at least in some cases.

[0142] More generally, the flexibility of this conditioning allows any existing black-box semantic segmentation model B (which may not be able to provide multiple segmentation maps for an input image) to be extended by conditioning F_θ on the output of B.

[0143] Example I - Synthetic Dataset

[0144] The following simple one-dimensional regression task provides an intuitive understanding of how the present calibration loss works.

[0145] An input x ∈ [0, 1] is mapped according to the following stochastic relation:

[0146]

[0147] We generate nine different scenarios by changing the degree of mode bias π∈{0, 0.1, 0.4} and mode noise σ∈{0.01, 0.02, 0.03}.

[0148] For each configuration, we train three 4-layer multilayer perceptrons (MLPs) for F, G, and D respectively, with a calibration loss coefficient λ = 1 and a learning rate η = 0.0001. We then compare the results with the case where λ = 0 (i.e., the refinement loss function has no calibration loss term).

[0149] For statistical significance, each experiment was repeated five times. Note that, unlike the categorical likelihood used in semantic segmentation tasks, we used a Gaussian likelihood with a fixed scale parameter of 1. This turns the expressions in equations (2) and (8) into mean squared error losses.
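The reduction to a mean squared error follows from standard identities for unit-variance Gaussians, shown below for the negative log-likelihood and for the KL divergence between two such Gaussians. The symbols μ_θ(x), for the calibration network output, and μ̄_φ(x), for the mean of the refinement network samples, are notation assumed here for illustration and do not appear in the text above.

```latex
% Assumed notation: \mu_\theta(x) is the calibration network output,
% \bar{\mu}_\phi(x) is the mean of the refinement network samples.
-\log \mathcal{N}\!\left(y;\, \mu_\theta(x),\, 1\right)
  = \tfrac{1}{2}\left(y - \mu_\theta(x)\right)^2 + \tfrac{1}{2}\log(2\pi)

\mathrm{KL}\!\left(\mathcal{N}(\bar{\mu}_\phi(x),\, 1)\,\middle\|\,\mathcal{N}(\mu_\theta(x),\, 1)\right)
  = \tfrac{1}{2}\left(\bar{\mu}_\phi(x) - \mu_\theta(x)\right)^2
```

Both expressions are squared errors up to a constant factor and an additive constant, so minimizing them is equivalent to minimizing a mean squared error.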

[0150] Figures 4, 5 and 6 show the data configuration and the converged calibration network output, the data likelihood over the course of training for λ = 1 and λ = 0, and samples from the GAN with the discriminator output probabilities shown in the background.

[0151] Figure 4 presents the data log-likelihood averaged over all 90 experiments (nine configurations, each repeated five times, for both λ = 1 and λ = 0).

[0152] Figure 5 shows the results for a high-bias, high-noise configuration (π = 0.4, σ = 0.03) with the calibration loss. The ground-truth data is shown in diffuse black, and the output predictions of the refinement (generator) network are shown as dots (light blue in the original). The calibration target (the thicker line, red in the original) is closely tracked by the average output of the refinement network (the thinner line, blue in the original). The discriminator output is shown as background shading, where "false" is located in the region extending to the right from x = 0, y = 0 (blue in the original), and "true" is located in the darker regions around the top, right, and bottom edges of the graph (red in the original).

[0153] Figure 6 shows the same experiment as Figure 5, but without the calibration loss term. This typically leads to mode collapse. The "false" discriminator output is located in the upper right half of the graph, while the "true" discriminator output is located in the lower left half.

[0154] The results demonstrate faster convergence and less mode oscillation when using calibration loss. Sample plots show that mode collapse is also less pronounced in this case.

[0155] Example II - Modified Cityscapes

[0156] This experiment uses the following evaluation scheme: the 19-class version of the Cityscapes dataset (www.cityscapes-dataset.com) is augmented with five additional classes, each an ambiguous counterpart of an existing Cityscapes class: (road, road2), (sidewalk, sidewalk2), (car, car2), (vegetation, vegetation2), and (person, person2).

[0157] In the ground-truth segmentation maps, for each of these five original classes, a random subset of the pixels belonging to that class is swapped to the corresponding additional class (e.g., road2 instead of road) with a fixed probability.
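The label-flipping step can be sketched as below. The text does not pin down whether the swap is decided once per class per image or independently per pixel, so the sketch exposes both behaviours through a flag. The function name, the `flip_spec` format and the integer class IDs are hypothetical and chosen only for the example.

```python
import numpy as np

def flip_labels(labels, flip_spec, rng, per_pixel=False):
    """Return a copy of `labels` with some classes swapped for their counterparts.

    labels:    (H, W) integer class map.
    flip_spec: {source_id: (target_id, probability)}  (hypothetical format).
    per_pixel: if True, each pixel flips independently; otherwise all pixels
               of a class in this map flip together.
    """
    out = labels.copy()
    for src, (dst, prob) in flip_spec.items():
        mask = labels == src            # always computed on the original map
        if per_pixel:
            mask &= rng.random(labels.shape) < prob
        elif rng.random() >= prob:
            continue                    # this class is left unchanged
        out[mask] = dst
    return out

rng = np.random.default_rng(0)
labels = np.array([[0, 0, 1],
                   [1, 2, 2]])
# Hypothetical IDs: class 0 -> 19 and class 1 -> 20, with demonstration probabilities.
always = flip_labels(labels, {0: (19, 1.0), 1: (20, 1.0)}, rng)
never = flip_labels(labels, {0: (19, 0.0), 1: (20, 0.0)}, rng)
```

Masks are built from the original label map rather than the partially flipped copy, so the order of the entries in `flip_spec` does not affect the result.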

[0158] We used class flip probabilities of 8/17, 7/17, 6/17, 5/17, and 4/17 respectively. Following Simon Kohl et al., "A probabilistic u-net for segmentation of ambiguous images," Advances in Neural Information Processing Systems, 2018, pp. 6965-6975, we used the average generalized energy distance (GED) metric to measure the quality of the samples.
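For orientation, the squared GED compares a set of predicted samples S with a set of ground-truth maps Y as 2·E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)], with d = 1 − IoU. The sketch below implements that definition. The way IoU is averaged over classes here (mean over classes present in either map) is an assumption of the example and may differ from the evaluation code actually used.

```python
import numpy as np
from itertools import product

def iou_distance(a, b, num_classes):
    """1 - mean IoU over the classes present in either label map (assumed averaging)."""
    ious = []
    for c in range(num_classes):
        in_a, in_b = a == c, b == c
        union = np.logical_or(in_a, in_b).sum()
        if union > 0:
            ious.append(np.logical_and(in_a, in_b).sum() / union)
    return (1.0 - float(np.mean(ious))) if ious else 0.0

def ged_squared(samples, ground_truths, num_classes):
    """Squared generalized energy distance between two sets of label maps."""
    def mean_dist(xs, ys):
        return float(np.mean([iou_distance(x, y, num_classes) for x, y in product(xs, ys)]))
    return (2.0 * mean_dist(samples, ground_truths)
            - mean_dist(samples, samples)
            - mean_dist(ground_truths, ground_truths))

a = np.array([[0, 1], [1, 1]])
b = np.array([[0, 0], [1, 1]])
ged_matched = ged_squared([a, b], [a, b], num_classes=2)     # same set of modes
ged_collapsed = ged_squared([a, a], [b, b], num_classes=2)   # samples miss the target mode
```

A lower value indicates that the samples are both close to the ground-truth maps and as diverse as they are. Matching sets of modes give a value of zero.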

[0159] Figure 7 shows three different street scenes with the corresponding labels superimposed.

[0160] Figure 8 shows, for each of the three street scenes (stacked vertically), four different predictions obtained from the refinement network, arranged horizontally along each row.

[0161] It can be seen that the refinement network has learned to generate diverse and coherent segmentation maps that are consistent with the input image and whose variation arises from the ambiguity in the input image.

[0162] Figure 9 shows an exemplary computer processing system on which embodiments can be implemented. The computer 100 includes a processor 101 (e.g., an Intel™ processor) configured to execute software stored in a memory 102. The processor 101 also uses the memory 102 to read and write data, such as input data, intermediate computation results, and output data. The software can control the processor 101 to implement any of the methods disclosed herein. The computer 100 has input/output peripherals, for example for receiving training data and/or for outputting data encoded by the trained network.

[0163] The model can be trained centrally (e.g., on a GPU-based supercomputer), and the trained model can be replicated and installed on other devices, such as car guidance or warning systems or autonomous vehicles. Alternatively, training can be performed in the field, for example, as continuous learning within the control system of an autonomous vehicle.

[0164] Those skilled in the art will appreciate that the invention has been described by way of one or more specific embodiments, but is not limited to these embodiments; many changes and modifications are possible within the scope of the appended claims.

Claims

1. A computer processing system configured to train a model for use in semantic image segmentation, wherein the model comprises: a calibration neural network; a refinement neural network; and a discriminator neural network, wherein the refinement neural network is configured to: receive a predicted label distribution for an image from the calibration neural network; obtain one or more random values from a random or pseudo-random noise source; generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and output the plurality of predicted segmentation maps to the discriminator neural network, and wherein the computer processing system is configured to train the refinement neural network using an objective function that is a function of an output of the discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

2. The computer processing system according to claim 1, configured to implement the model.

3. The computer processing system of claim 1, wherein the calibration neural network is configured to receive an input image and output a corresponding predicted label distribution of the input image by performing likelihood-based semantic segmentation.

4. The computer processing system of claim 1, further configured to train the calibration neural network and the discriminator neural network.

5. The computer processing system according to claim 1 or 2, configured to train the refinement neural network and the discriminator neural network using a generative adversarial network training process.

6. The computer processing system of claim 5, configured to condition the refinement neural network and the discriminator neural network on an input image.

7. The computer processing system of claim 1 or 2, configured to train the discriminator neural network using a loss function relating to the ability of the discriminator neural network to distinguish between the predicted segmentation maps output by the refinement neural network and ground-truth label data.

8. The computer processing system of claim 1 or 2, wherein the objective function for training the refinement neural network comprises a first objective term depending on the output of the discriminator neural network and a second objective term depending on the difference between the predicted label distribution and the average of the predicted segmentation maps generated by the refinement neural network.

9. The computer processing system of claim 1 or 2, configured to calculate the average value of the predicted segmentation maps by calculating the pixel-wise arithmetic mean of the plurality of predicted segmentation maps.

10. The computer processing system of claim 1 or 2, configured to determine the difference between the predicted label distribution and the average of the predicted segmentation maps by determining the forward or reverse Kullback-Leibler divergence between the predicted label distribution and the average of the predicted segmentation maps.

11. The computer processing system of claim 1 or 2, configured to train either the refinement neural network or the discriminator neural network using a gradient descent process.

12. A method for training a model for use in semantic image segmentation, wherein the model comprises: a calibration neural network; a refinement neural network; and a discriminator neural network, wherein the refinement neural network is configured to: receive a predicted label distribution for an image from the calibration neural network; obtain one or more random values from a random or pseudo-random noise source; generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and output the plurality of predicted segmentation maps to the discriminator neural network, and wherein the method includes training the refinement neural network using an objective function that is a function of an output of the discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

13. The method of claim 12, wherein the image is a street scene.

14. A computer-readable storage medium including instructions that, when executed on a computer processing system, cause the computer processing system to train a model for use in semantic image segmentation, wherein the model includes: a calibration neural network; a refinement neural network; and a discriminator neural network, wherein the refinement neural network is configured to: receive a predicted label distribution for an image from the calibration neural network; obtain one or more random values from a random or pseudo-random noise source; generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and output the plurality of predicted segmentation maps to the discriminator neural network, and wherein the instructions cause the computer processing system to train the refinement neural network using an objective function that is a function of an output of the discriminator neural network and that further includes a term representing a difference between the predicted label distribution and an average of the plurality of predicted segmentation maps output by the refinement neural network for the predicted label distribution.

15. The computer-readable storage medium of claim 14, further comprising instructions for causing the computer processing system to implement the refinement neural network and the discriminator neural network.

16. A computer-readable storage medium including instructions that, when executed on a computer processing system, cause the computer processing system to implement a trained model for use in semantic image segmentation, the trained model including a trained refinement neural network configured to: receive a predicted label distribution for an image from a calibration neural network; obtain one or more random values from a random or pseudo-random noise source; generate a plurality of predicted segmentation maps from the received predicted label distribution using the one or more random values; and output the plurality of predicted segmentation maps.

17. The computer-readable storage medium of claim 16, wherein the trained refinement neural network has been trained according to the method of claim 12.

18. The computer-readable storage medium of claim 16 or 17, wherein the trained model is further configured to calculate and output the average value of the plurality of predicted segmentation maps.

19. The computer-readable storage medium of claim 16 or 17, further comprising instructions for generating the predicted label distribution from an input image using a trained calibration neural network.