Method and apparatus for image segmentation
By employing a two-stage training method with a multi-exit semantic segmentation network (MESS) and a positive filtering distillation technique, the latency problem of semantic segmentation on resource-constrained devices is solved, achieving efficient and accurate semantic image segmentation suitable for heterogeneous target devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SAMSUNG ELECTRONICS CO LTD
- Filing Date
- 2021-11-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing semantic segmentation techniques struggle to meet real-time latency requirements on resource-constrained devices, especially consumer devices such as smartphones, and existing methods for reducing inference latency are ineffective in semantic segmentation networks.
A two-stage training scheme is adopted for the Multi-Exit Semantic Segmentation Network (MESS). The backbone network and early exits are trained end-to-end, and the training of early exits is optimized by positive filtering distillation technology. Combined with architecture configuration search, it can adapt to the hardware constraints and inference requirements of different devices.
The scheme significantly reduces inference latency without compromising accuracy: it achieves a 2.83x latency gain when optimized for speed, or an accuracy improvement of up to 5.33 percentage points under the same computational budget, while adapting to the computing power and application requirements of different devices.
Figure CN116438570B_ABST
Abstract
Description
Technical Field
[0001] This application generally relates to a method for generating machine learning (ML) models to perform semantic image segmentation, and to a computer-implemented method for performing semantic image segmentation using a trained ML model.
Background Technology
[0002] Semantic segmentation constitutes a core task in machine vision, and significant progress has been made in this area thanks to the advent of deep learning. Semantic image segmentation networks handle the finest-grained visual scene understanding tasks by predicting dense (per-pixel) semantic labels for images of arbitrary resolution. These dense semantic predictions facilitate a wide range of applications, including mobile augmented reality / virtual reality (AR / VR), autonomous robotics, navigation, semantic mapping, telepresence agents, efficient video communication, and more. Quality of service and safety are paramount when deploying such real-time systems, which typically run on resource-constrained platforms such as smartphones, consumer robotic devices, and autonomous vehicles. Therefore, efficient and accurate segmentation is a core problem that needs to be addressed.
Summary of the Invention
[0003] [Technical Issues]
[0004] Current semantic segmentation techniques typically involve computationally and storage-intensive deep learning models, which often fail to meet the real-time latency requirements of applications when deployed on consumer devices such as smartphones. Specifically, the per-pixel nature of the segmentation output requires preserving high-resolution feature maps throughout the underlying neural network (to avoid eliminating spatial information) while maintaining a large receptive field at the output (to incorporate context and achieve robust semantic predictions). Consequently, the resulting network architectures often consist of numerous layers and frequently replace downsampling of the feature volume with dilated convolutions of increasing rate. This concentrates a significant share of the workload in the deeper layers, which in turn results in latency-intensive inference. The situation worsens on low-end to mid-tier devices, as these tend to have less processing power and memory than high-end devices. Therefore, reducing inference latency is desirable. Reducing inference latency can also improve user experience through smooth and seamless interaction, enhance functionality by freeing up capacity for other tasks that share the device's resources, and improve safety when semantic segmentation predictions contribute to real-time mission-critical decision-making (e.g., in autonomous vehicles). Current methods for reducing inference latency include efficient hand-crafted model design and adaptive computational models. For coarser image classification tasks, this challenge is effectively addressed through cascaded systems and early exit architectures. However, semantic segmentation networks present unique challenges when such methods are employed.
[0005] [Problem Solving]
[0006] The applicant has recognized the need for an improved semantic image segmentation network or ML model that can make predictions faster without significant loss of accuracy.
[0007] Semantic segmentation is a cornerstone of many vision systems, from self-driving cars and robot navigation to augmented reality and remote conferencing. Because these systems often operate under strict latency constraints with limited resources, efficient execution becomes crucial. To address this, this technique provides a framework for transforming existing segmentation models into MESS networks: specially trained convolutional neural networks (CNNs) that employ parameterized early exits along their depth to save computation during inference on simpler samples. Naively designing and training such networks can compromise performance. Therefore, this technique offers a two-stage training process that pushes semantically important features toward the early layers of the network. The number, location, and architecture of the attached segmentation heads are jointly optimized with the exit strategy to suit device capabilities and application-specific requirements. When optimized for speed, the MESS network achieves a 2.83x latency gain compared to state-of-the-art methods without sacrificing accuracy. Alternatively, when optimized for accuracy under the same computational budget, this technique achieves an improvement of up to 5.33 percentage points in accuracy.
[0008] In a first method of this technology, a computer-implemented method for generating a machine learning (ML) model for semantic image segmentation is provided, the method comprising: providing a backbone feature extraction network of the ML model having multiple early exits in a backbone network to generate an over-feeding network including multiple candidate early exit segmentation network architectures, wherein each early exit includes a customized network architecture; obtaining a training dataset comprising multiple images; and training the backbone network, final exit, and early exits of the ML model by the following steps to output feature maps of the multiple images input to the backbone network: during a first training phase, training the final exit, the backbone network, and the early exits end-to-end; and after the end-to-end training is completed, freezing the weights of the backbone network and the final exit, and during a second training phase, training the early exits individually using the final exit as the teacher of the remaining early exits.
[0009] Preferably, each early exit includes a "segmentation head". The segmentation head has a neural network architecture, which can be, for example, a head based on a fully convolutional network (FCN head) or a head based on DeepLabV3 (DLB head). Therefore, each segmentation head includes a neural network for providing image segmentation predictions. Each early exit / segmentation head includes a customized network architecture. That is, each early exit in the candidate early exit architectures can have the same network architecture, or it can be different, or it can have a network architecture selected from a set of possible network architectures. This means that the early exit network architecture is not necessarily consistent across specific candidate early exit segmentation network architectures. This is advantageous because shallow exits benefit from network architectures with many lightweight layers, while deep exits favor channel-rich network architectures, thus allowing non-uniform early exit network architectures to be tailored to different devices, different inference settings, and different user inference requirements.
[0010] In other words, this technique provides a method for training ML models in the form of multi-exit semantic segmentation networks (or progressive segmentation networks). This network includes numerous early exit points (i.e., segmentation heads) attached to different depths of a backbone convolutional neural network (CNN) architecture. This provides segmentation predictions with varying workload (and accuracy) characteristics, introducing a "train once, deploy anywhere" approach for efficient semantic segmentation. Advantageously, this means the network can be parameterized without retraining to be deployed on heterogeneous target devices with varying capabilities (from low-end to high-end).
[0011] This is achieved through two processes. First, the technique includes a two-stage training scheme tailored for multi-exit semantic segmentation networks. In the first stage, a novel regularized end-to-end training algorithm is introduced, in which the network's backbone architecture and all exit points (i.e., the final exit point and any early exit points) are trained together, with early exits being sequentially dropped in a round-robin manner in each training iteration. (That is, in each training iteration the backbone is trained together with a single early exit while the other early exits are dropped, and this process is repeated for each combination of the backbone and an individual early exit.) The first stage fully trains the weights of the backbone network and the final exit in an exit-aware manner, while initializing the weights of the early exits for fine-tuning in the next stage. In the second stage, the backbone and the final exit are frozen (i.e., their weights are not updated), and the early exits are trained independently. This stage employs a novel knowledge distillation method that quantifies the difficulty of classifying each pixel (based on the correctness of the final exit's prediction) and distills only the pixels correctly classified by the final exit. This two-stage scheme achieves high accuracy for both shallow and final exits.
[0012] The first training phase may include: iteratively training the backbone network and early exits, wherein during each iteration, training includes: selecting an early exit from a plurality of early exits to be updated; removing the remaining early exits; and training the backbone network and the selected early exit, and updating the weights of the backbone network and the selected early exit.
[0013] Preferably, for each selected early exit, the remaining early exits are sequentially dropped during each iteration of training the selected early exit.
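By way of illustration only, the round-robin selection of the first training phase can be sketched as a schedule that names, for each iteration, the early exit to be updated and the early exits to be dropped. The function name `round_robin_schedule` and the use of integer indices for the exits are assumptions made for this sketch; the actual weight updates of the backbone and the selected exit are not shown.

```python
from typing import List, Tuple


def round_robin_schedule(num_early_exits: int, num_iterations: int) -> List[Tuple[int, List[int]]]:
    """Return, for each training iteration, (selected early exit, dropped early exits)."""
    if num_early_exits < 1:
        raise ValueError("at least one early exit is required")
    schedule = []
    for step in range(num_iterations):
        selected = step % num_early_exits  # early exit updated in this iteration
        dropped = [e for e in range(num_early_exits) if e != selected]
        schedule.append((selected, dropped))
    return schedule


# Hypothetical setup: three early exits, four training iterations.
schedule = round_robin_schedule(num_early_exits=3, num_iterations=4)
for step, (selected, dropped) in enumerate(schedule):
    print(f"iteration {step}: update backbone and early exit {selected}; drop early exits {dropped}")
```

In a real training loop, each schedule entry would correspond to one forward and backward pass in which only the backbone and the selected exit receive gradient updates.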
[0014] The second training phase may include: using segmentation predictions of the image made by the final exit, determining the difficulty of each pixel in the image based on whether the prediction for that pixel is correct; and training the early exits using only the pixels for which the predictions are correct. That is, this technique provides a positive filtering distillation technique that selectively controls the flow of information to earlier exits, using only the signal from samples that the final exit classified correctly. The proposed distillation scheme evaluates the difficulty of each pixel in the input sample according to whether the teacher's prediction (i.e., the prediction of the final exit) for that pixel is correct. Subsequently, the stronger (higher entropy) ground truth reference signal fed to the early exits is filtered, allowing only information from "easy" pixels to pass through. Therefore, by avoiding contamination of the training algorithm with noisy gradients from contradictory loss terms, the training effort and learning capacity of each exit are focused on the "easier" pixels.
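A minimal sketch of one possible positive filtering distillation loss is given below, purely as an example. It assumes that every pixel contributes a cross-entropy term against the ground truth and that a distillation term (here a KL divergence weighted by `alpha`) is added only for pixels that the final exit classifies correctly. The exact combination of terms, the weight `alpha`, and the names `softmax` and `pfd_loss` are assumptions and are not taken from the description above.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the class logits of one pixel."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pfd_loss(student_logits: Sequence[Sequence[float]],
             teacher_logits: Sequence[Sequence[float]],
             labels: Sequence[int],
             alpha: float = 0.5) -> float:
    """Mean per-pixel loss for one early exit (student).

    Cross-entropy is applied to every pixel; the distillation term is applied
    only where the teacher (final exit) predicts the ground-truth class.
    """
    total = 0.0
    for s_logit, t_logit, label in zip(student_logits, teacher_logits, labels):
        p_student = softmax(s_logit)
        p_teacher = softmax(t_logit)
        loss = -math.log(p_student[label])  # ground-truth term
        teacher_pred = max(range(len(p_teacher)), key=p_teacher.__getitem__)
        if teacher_pred == label:  # "easy" pixel: teacher is correct, so distill
            kl = sum(pt * math.log(pt / ps)
                     for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
            loss += alpha * kl
        total += loss
    return total / len(labels)


# Two pixels, two classes (hypothetical logits).
student = [[2.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
teacher_wrong = [[0.0, 3.0], [3.0, 0.0]]  # teacher misclassifies both pixels
teacher_right = [[5.0, 0.0], [0.0, 5.0]]  # teacher classifies both pixels correctly

loss_filtered = pfd_loss(student, teacher_wrong, labels, alpha=1.0)
loss_distilled = pfd_loss(student, teacher_right, labels, alpha=1.0)
```

With `teacher_wrong`, the distillation term is filtered out for both pixels, so the loss reduces to plain cross-entropy; with `teacher_right`, the teacher signal is added for both.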
[0015] The method for generating ML models may also include performing an architecture configuration search to identify an architecture suitable for a specific application from multiple candidate early exit segmentation network architectures.
[0016] The method may further include: receiving hardware constraints and / or inference performance requirements; receiving inference settings for a specific device or device class to be used for processing the input image at inference time; and performing an architecture configuration search using the received hardware constraints and / or inference performance requirements and the received inference settings.
[0017] Therefore, this advantageously enables the identification of a suitable early exit segmentation network architecture from all possible candidates, which will be suitable for operation on devices with specific hardware constraints. Since this technology provides a "train once, deploy anywhere" approach, a trained network can be parameterized without retraining for deployment on devices with hardware constraints and / or varying inference time performance requirements, which can be user-configurable or application-specific. For example, for safety-critical autonomous vehicles, fast and high-accuracy semantic image segmentation may be required, while for other use cases / applications, slower processing and / or lower accuracy may be acceptable.
[0018] At least one hardware constraint may be one of the following: the device's computational load, the device's storage capacity, and the device's power consumption.
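The architecture configuration search under hardware constraints can be illustrated, under stated assumptions, as a filter over candidate architectures followed by a selection of the most accurate feasible candidate. The description does not specify the search algorithm; the exhaustive search below, the `Candidate` fields, and all numeric values are hypothetical placeholders chosen only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_ms: float  # latency on the target device (placeholder value)
    memory_mb: float   # memory footprint (placeholder value)
    accuracy: float    # validation accuracy in [0, 1] (placeholder value)


def search(candidates: Sequence[Candidate],
           max_latency_ms: float,
           max_memory_mb: float,
           min_accuracy: float = 0.0) -> Optional[Candidate]:
    """Return the most accurate candidate that satisfies all constraints, or None."""
    feasible = [c for c in candidates
                if c.latency_ms <= max_latency_ms
                and c.memory_mb <= max_memory_mb
                and c.accuracy >= min_accuracy]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties in favour of lower latency.
    return max(feasible, key=lambda c: (c.accuracy, -c.latency_ms))


# Made-up candidates, for illustration only.
candidates = [
    Candidate("A", latency_ms=12.0, memory_mb=80.0, accuracy=0.61),
    Candidate("B", latency_ms=18.0, memory_mb=120.0, accuracy=0.68),
    Candidate("C", latency_ms=35.0, memory_mb=200.0, accuracy=0.74),
]
best = search(candidates, max_latency_ms=20.0, max_memory_mb=150.0)
```

Because the overprovisioned network is trained once, such a search only needs to evaluate candidates; no retraining is implied by this sketch.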
[0019] The method may also include sending or transmitting, or otherwise making available, the identified and extracted early exit segmentation network architecture to devices with the same hardware constraints and / or inference performance requirements and using the same inference settings.
[0020] The received inference setting can be a budgeted inference setting. In this case, the architecture configuration search outputs an architecture consisting of a backbone feature extraction network and a single early exit. During inference, all samples are processed by this architecture, deterministically satisfying requirements such as workload, memory, and size.
[0021] Inference performance requirements can be any of the following: required confidence level, required minimum accuracy, latency limit per image, latency limit for image group, and inference time limit.
[0022] The received inference settings can be on-demand inference settings. In this case, the architecture configuration search output includes the architecture of the backbone feature extraction network and multiple early exits. During inference time, samples are processed sequentially by each of the selected early exits, where each early exit provides a segmentation prediction that gradually improves / enhances over time. Other components of the system or the user can benefit from the early predictions at runtime.
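As a non-limiting sketch, on-demand inference can be modelled as a generator that runs the backbone stage by stage and emits a prediction whenever a stage has an early exit attached, so that downstream consumers can use early predictions while later ones are still being computed. The names `on_demand_inference`, `backbone_stages`, and `exit_heads`, and the toy stand-in functions, are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterator, Sequence, Tuple


def on_demand_inference(image: Any,
                        backbone_stages: Sequence[Callable[[Any], Any]],
                        exit_heads: Dict[int, Callable[[Any], Any]]) -> Iterator[Tuple[int, Any]]:
    """Yield (stage index, prediction) for every stage that has an exit attached."""
    features = image
    for index, stage in enumerate(backbone_stages):
        features = stage(features)  # backbone computation is shared by all exits
        head = exit_heads.get(index)
        if head is not None:
            yield index, head(features)


# Toy stand-ins: each "stage" increments a counter, each "head" formats it.
stages = [lambda x: x + 1, lambda x: x + 1, lambda x: x + 1]
heads = {0: lambda f: f"coarse({f})", 2: lambda f: f"refined({f})"}
results = list(on_demand_inference(0, stages, heads))
```

A consumer that iterates over the generator lazily receives the first prediction before the later backbone stages have run.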
[0023] When the received inference setting is an input-dependent inference setting, the architecture configuration search output includes an architecture comprising a backbone feature extraction network and multiple early exits. This architecture includes a confidence evaluation unit associated with each early exit to evaluate the confidence of the predictions made by each early exit during inference. In this case, the exits process each sample sequentially, and after each prediction, the confidence evaluation unit determines whether the current image requires further processing (via subsequent exits) or can terminate its computation if a prediction with sufficient confidence has already been provided at the image level (not per pixel). The confidence level attempts to capture the concept of image segmentation difficulty, allowing for early exits for simple samples, and performing "moderate computation" on each input sample at runtime.
[0024] The confidence evaluation unit for each early exit can be configured to: calculate the confidence value of the image segmentation prediction made by the relevant early exit on the overall image; determine whether the confidence value is greater than or equal to a threshold confidence value; and instruct the processing to continue to the subsequent early exit when the confidence value is lower than the threshold confidence value, or instruct the processing to terminate when the confidence value is greater than or equal to the threshold confidence value.
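The decision logic of the confidence evaluation unit can be sketched as follows, by way of example only. Each exit is modelled as a hypothetical callable returning a prediction together with an image-level confidence value; the function name `run_with_early_exit` and the convention that the last entry represents the final exit are assumptions.

```python
from typing import Any, Callable, Sequence, Tuple

Exit = Callable[[Any], Tuple[Any, float]]  # returns (prediction, image-level confidence)


def run_with_early_exit(image: Any,
                        exits: Sequence[Exit],
                        thresholds: Sequence[float]) -> Tuple[int, Any]:
    """Evaluate exits in order and stop at the first sufficiently confident one.

    Returns (index of the exit used, its prediction). If no exit reaches its
    threshold, the prediction of the last exit is returned.
    """
    if not exits or len(exits) != len(thresholds):
        raise ValueError("need one threshold per exit and at least one exit")
    prediction = None
    for index, (exit_fn, threshold) in enumerate(zip(exits, thresholds)):
        prediction, confidence = exit_fn(image)
        if confidence >= threshold:
            return index, prediction  # terminate: confident enough
    return len(exits) - 1, prediction  # fall through to the last exit


# Toy exits with fixed, made-up confidence values.
exits = [lambda img: ("coarse", 0.55),
         lambda img: ("medium", 0.82),
         lambda img: ("final", 0.97)]
used, prediction = run_with_early_exit("image", exits, thresholds=[0.8, 0.8, 0.0])
```

In this toy run the first exit falls below its threshold, so processing continues, and the second exit terminates the computation.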
[0025] Calculating the overall confidence value of an image can include: obtaining a per-pixel confidence map that includes the confidence values of each pixel in the image; identifying pixels located near semantic edges of objects in the prediction; and outputting the percentage of pixels in the image with a per-pixel confidence value greater than or equal to a threshold confidence value, where the contribution of the identified pixels is reduced. During architecture configuration search, the threshold confidence value can be optimized for each early exit, as well as the number, location, and configuration (architecture) of early exits.
[0026] Therefore, this technique includes an input-dependent inference method for multi-exit semantic segmentation networks. It employs a novel mechanism to estimate prediction confidence in segmentation tasks, which involve dense per-pixel classification rather than per-image classification. This involves using the percentage of pixels whose prediction confidence exceeds a given confidence threshold. Furthermore, pixels closer to the semantic edges of objects are reweighted to reduce their contribution, based on the observation that their confidence tends to be lower. Thus, this technique provides a stable confidence estimate for segmentation predictions, unaffected by extremely unconfident pixels / regions in the image. This input-dependent inference method is used to estimate the prediction confidence for each exit, enabling "simple" inputs to exit early with corresponding performance gains.
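The image-level confidence estimate can be illustrated with the following minimal sketch. It assumes that a per-pixel confidence map and a boolean mask of pixels near semantic edges are already available, that edge pixels are down-weighted by a fixed factor `edge_weight`, and that the result is normalised by the total weight. The weighting factor, the normalisation, and the name `image_confidence` are assumptions not specified above.

```python
from typing import Sequence


def image_confidence(conf_map: Sequence[Sequence[float]],
                     edge_mask: Sequence[Sequence[bool]],
                     pixel_threshold: float,
                     edge_weight: float = 0.5) -> float:
    """Weighted fraction of pixels whose confidence reaches pixel_threshold.

    Pixels flagged in edge_mask contribute with the reduced weight edge_weight.
    """
    confident_weight = 0.0
    total_weight = 0.0
    for conf_row, edge_row in zip(conf_map, edge_mask):
        for conf, near_edge in zip(conf_row, edge_row):
            weight = edge_weight if near_edge else 1.0
            total_weight += weight
            if conf >= pixel_threshold:
                confident_weight += weight
    if total_weight == 0.0:
        raise ValueError("confidence map is empty or all weights are zero")
    return confident_weight / total_weight


# 2x2 toy image: the right-hand column lies near a semantic edge.
conf_map = [[0.90, 0.40],
            [0.95, 0.30]]
edge_mask = [[False, True],
             [False, True]]
weighted = image_confidence(conf_map, edge_mask, pixel_threshold=0.5, edge_weight=0.5)
unweighted = image_confidence(conf_map, edge_mask, pixel_threshold=0.5, edge_weight=1.0)
```

In this toy case the low-confidence pixels sit on the edge, so reducing their contribution raises the image-level confidence from 1/2 to 2/3.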
[0027] In a second method of this technology, a computer-implemented method for performing semantic image segmentation on a device using a trained machine learning (ML) model is provided. The method includes: obtaining an instance of the trained ML model, the instance being an early exit segmentation network architecture associated with the device or the device category to which the device belongs and suitable for an inference setup used by the device; receiving an image to be processed by the instance of the trained ML model; and performing image segmentation on the received image using the instance of the trained ML model. It will be understood that the term "obtain" can mean obtaining an instance of the trained ML model from a server, which may occur once. It will also be understood that the term "obtain" can mean obtaining an instance of the trained ML model from memory or local storage on the device, which is used each time image segmentation is performed.
[0028] When the early exit segmentation network architecture includes a backbone feature extraction network and a single early exit, performing image segmentation can include outputting image segmentation predictions from the single early exit.
[0029] When the early exit segmentation network architecture includes a backbone feature extraction network and multiple early exits, performing image segmentation may include processing the image sequentially by the early exits. As mentioned above, the network architecture of each early exit may be the same or different (i.e., non-uniform). After processing by an early exit, the method may include: providing an image segmentation prediction from that early exit; calculating a confidence value for the overall image segmentation prediction; determining whether the confidence value is greater than or equal to a threshold confidence value; and processing the image using the subsequent early exit when the confidence value is less than the threshold confidence value, or outputting the image segmentation prediction from that early exit when the confidence value is greater than or equal to the threshold confidence value. The overall confidence value of the image can be determined by considering the percentage of pixels in the image with a pixel-level confidence value higher than the threshold confidence value. That is, the pixels with a confidence value higher than the threshold confidence value are counted as a proportion of the total number of pixels in the image, and this percentage is used to determine whether the entire image meets the threshold confidence value.
[0030] When the early exit segmentation network architecture includes a backbone feature extraction network, multiple early exits, and a confidence evaluation unit associated with each early exit, the confidence evaluation unit of each early exit is configured to: obtain a per-pixel confidence map including the confidence value of each pixel in the image; identify pixels located near the semantic edges of objects in the prediction; and generate a confidence value for the entire image, wherein the confidence value is the percentage of pixels in the image that have a per-pixel confidence value greater than or equal to a threshold confidence value associated with the early exit, wherein the contribution of the identified pixels is reduced.
[0031] In a third method of the present technology, an apparatus is provided for performing semantic image segmentation using a trained machine learning (ML) model. The apparatus includes: at least one processor coupled to a memory and arranged to: obtain an instance of the trained ML model, the instance being an early exit segmentation network architecture associated with the apparatus or a device category to which the apparatus belongs; receive an image to be processed by the trained ML model; and perform image segmentation on the received image using the instance of the trained ML model.
[0032] The features of the second method described above also apply to the third method.
[0033] The apparatus may also include at least one image capture device for capturing images or videos to be processed by the ML model.
[0034] The apparatus may also include at least one interface for providing the processing results of the ML model to the user of the apparatus.
[0035] The apparatus can be any of the following: smartphone, tablet, laptop, computer or computing device, virtual assistant device, vehicle, drone, autonomous vehicle, robot or robotic device, robotic assistant, image capture system or device, augmented reality system or device, virtual reality system or device, gaming system, Internet of Things device, or smart consumer device (such as a smart refrigerator). It should be understood that this is a non-exhaustive and non-limiting list of example devices.
[0036] In a fourth method of this technology, a server is provided for generating a machine learning (ML) model for semantic image segmentation. The server includes: at least one processor connected to memory and configured to: provide a backbone feature extraction network of the ML model with multiple early exits in a backbone network to generate an over-feeding network including multiple candidate early exit segmentation network architectures; obtain a training dataset including multiple images; and train the backbone network, final exit, and early exits of the ML model through the following steps to output feature maps of the multiple images input to the backbone network: training the backbone network, final exit, and early exits end-to-end; and after end-to-end training is complete, freezing the weights of the backbone network and the final exit, and using the final exit as the teacher to train all early exits individually.
[0037] The features described above for the first method also apply to the fourth method.
[0038] In a related method of this technology, a non-transitory data carrier carrying processor control code is provided to implement the methods described herein. That is, a computer-readable storage medium is provided that includes instructions which, when executed by a computer, cause the computer to perform the steps of the methods described herein.
[0039] As will be understood by those skilled in the art, this technology can be embodied as a system, method, or computer program product. Therefore, this technology can take the form of a purely hardware embodiment, a purely software embodiment, or an embodiment combining software and hardware aspects.
[0040] Furthermore, this technology can take the form of a computer program product embodied in a computer-readable medium having computer-readable program code contained thereon. The computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. The computer-readable medium can be, for example, but not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
[0041] The computer program code used to perform the operations of this technology can be written in any combination of one or more programming languages, including object-oriented programming languages and traditional procedural programming languages. Code components can be embodied as procedures or methods, and can include sub-components, which can take the form of instructions or sequences of instructions at any level of abstraction, from direct machine instructions of the native instruction set to high-level compiled or interpreted language constructs.
[0042] Embodiments of this technology also provide a non-transitory data carrier carrying code that, when implemented on a processor, causes the processor to perform any of the methods described herein.
[0043] This technology also provides processor control code to implement the methods described above on, for example, general-purpose computer systems or digital signal processors (DSPs). The technology also provides a carrier carrying processor control code which, when executed, implements any of the methods described above, in particular a non-transitory data carrier. The code may be provided on a carrier such as a disk, microprocessor, CD-ROM or DVD-ROM, programmable memory such as non-volatile memory (e.g., flash memory) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. The code (and / or data) implementing embodiments of the technology described herein may include source, object, or executable code in conventional programming languages (interpreted or compiled) such as Python, C, or assembly code; code for setting up or controlling an ASIC (Application-Specific Integrated Circuit) or FPGA (Field-Programmable Gate Array); or code for a hardware description language such as Verilog (RTM) or VHDL (Very High Speed Integrated Circuit Hardware Description Language). As those skilled in the art will understand, such code and / or data may be distributed among multiple interconnected components communicating with each other. The technology may include a controller comprising a microprocessor, working memory, and program memory coupled to one or more components of the system.
[0044] Those skilled in the art will also appreciate that all or part of the logical methods according to embodiments of the present technology can be suitably embodied in a logic device, which includes logic elements that perform the steps of the described methods, and such logic elements can include components such as logic gates in, for example, programmable logic arrays or application-specific integrated circuits. This logical arrangement can also be embodied in enabling elements for temporarily or permanently establishing logical structures in such arrays or circuits using, for example, a virtual hardware descriptor language, which can be stored and transmitted using a fixed or transmissible carrier medium.
[0045] In an embodiment, this technology can be implemented in the form of a data carrier having functional data thereon, the functional data including a functional computer data structure, which, when loaded into a computer system or network and thus run, enables the computer system to perform all the steps of the above method.
[0046] Using machine learning or artificial intelligence models, the methods described above can be performed, in whole or in part, on a device, i.e., an electronic device. The model can be processed by a dedicated AI processor designed within a hardware architecture specified for processing the AI model. The AI model can be obtained through training. Here, "obtained through training" means obtaining a predefined operating rule or AI model configured to perform desired features (or objectives) by training a basic AI model with multiple training data using a training algorithm. The AI model can include multiple neural network layers. Each of the multiple neural network layers includes multiple weight values, and neural network computation is performed through calculations between the results of the previous layer and the multiple weight values.
[0047] As described above, this technology can be implemented using an AI model. AI-related functions can be executed via non-volatile memory, volatile memory, and a processor. The processor can include one or more processors. Here, the one or more processors can be: general-purpose processors, such as a central processing unit (CPU) or application processor (AP); graphics-only processing units, such as a graphics processing unit (GPU) or vision processing unit (VPU); and / or dedicated AI processors, such as a neural processing unit (NPU). The one or more processors control the processing of input data according to predefined operating rules or artificial intelligence (AI) models stored in the non-volatile memory and volatile memory. The predefined operating rules or AI models are provided through training or learning. Here, providing through learning means generating predefined operating rules or AI models with desired characteristics by applying a learning algorithm to multiple learning data sets. Learning can be performed within the device itself, where the AI according to the embodiment is performed, and / or can be implemented via a separate server / system.
[0048] AI models can consist of multiple neural network layers. Each layer has multiple weight values, and a layer operation is performed by a calculation between the result of the previous layer and the multiple weight values. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and Deep Q-Networks.
[0049] A learning algorithm is a method of training a predetermined target device (e.g., a robot) using multiple training datasets to enable, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Brief Description of the Drawings
[0050] The implementation of this technology will now be described by way of example only, with reference to the accompanying drawings, wherein:
[0051] Figure 1 is a schematic diagram illustrating an example of a multi-exit semantic image segmentation model;
[0052] Figure 2 is a schematic diagram of the exit structure of the model;
[0053] Figures 3A, 3B and 3C show the architecture of the ML model when performing budgeted inference, on-demand inference, and input-dependent inference, respectively;
[0054] Figure 4 is a chart comparing the performance of different early exit policies / criteria;
[0055] Figures 5A and 5B show two example input images processed using a semantic segmentation model during model training;
[0056] Figure 6A shows an example input image, along with two predictions made using the model, one from the final exit and one from an early exit;
[0057] Figure 6B is a graph showing the difference in accuracy between predictions made using the final exit and earlier exits for multiple input samples;
[0058] Figure 7 is a schematic diagram illustrating the use of the trained model;
[0059] Figure 8 is a flowchart illustrating example steps for generating a semantic segmentation model, specifically, training an over-feeding network that includes multiple candidate early exit segmentation network architectures;
[0060] Figure 9 is a flowchart illustrating example steps for generating a semantic segmentation model, specifically, searching for a particular candidate architecture;
[0061] Figure 10 is a flowchart illustrating example steps for semantic segmentation prediction using a trained model;
[0062] Figure 11 is a block diagram of an apparatus for implementing the trained model; and
[0063] Figure 12 illustrates the entire process of generating a "train once, deploy anywhere" ML model.
Detailed Implementation
[0064] In summary, this technology generally relates to a method and server for generating machine learning (ML) models to perform semantic image segmentation, and to a computer-implemented method and apparatus for performing semantic image segmentation using a trained machine learning (ML) model. The training method enables the semantic image segmentation ML model to make predictions faster without significant loss of accuracy. The training method also enables the ML model to be implemented on devices with different hardware specifications, such as different computing power and memory.
[0065] As mentioned above, mitigating latency overhead is crucial, especially for edge deployments on resource-constrained platforms. In this direction, recent work has focused on designing lightweight segmentation models manually or via Neural Architecture Search (NAS). Meanwhile, advances in adaptive DNN inference, which provide complementary gains by dynamically adjusting the computational path in an input-dependent manner, have mainly targeted image classification, leaving the challenges in segmentation largely unresolved. In fact, naively applying early exits to segmentation CNNs degrades accuracy due to "crosstalk" between early exits during training, and any potential latency gain can be cancelled out by the inherently heavyweight architecture of segmentation heads. For example, naively adding a single segmentation head to DeepLabV3 can incur up to 40% of the workload of the original model. Equally important, the dense outputs of segmentation models further complicate the exit policy. Some of these existing techniques will now be described.
[0066] Efficient segmentation. Semantic segmentation has been rapidly evolving since the first CNN-based method emerged. Recent advances have focused on optimizing accuracy through stronger backbone CNNs, dilated convolutions, multi-scale processing, and multi-path refinement. To reduce computational costs, researchers have explored lightweight handcrafted architectures and, more recently, architectures built using NAS, and further efforts have been made to compensate for lost accuracy through knowledge distillation or adversarial training. Advantageously, the framework of this technique is model-agnostic and can be applied to existing CNN backbones, whether lightweight or not, achieving significant complementary gains through the orthogonal dimension of input-dependent (dynamic) path selection.
[0067] Adaptive inference. The key paradigm behind adaptive inference is saving computation on “simple” samples, thereby reducing total computation time with minimal impact on accuracy. Existing methods in this direction can be categorized as: 1) Dynamically routed networks that select different sequences of operations by skipping layers or channels, thus operating in an input-dependent manner, and 2) Multi-exit networks forming a class of architectures with intermediate classifiers along their depth. Such networks offer different accuracy-cost tradeoffs, with earlier exits running faster and deeper exits being more accurate. Existing work has primarily focused on image classification, proposing handcrafted, model-agnostic, and deployment-aware architectures. However, adopting these techniques in segmentation models presents additional, unexplored challenges.
[0068] Adaptive segmentation networks. Recently, preliminary efforts have been made in adaptive segmentation. For example, NAS has been combined with a trainable dynamic routing mechanism that generates data-dependent processing paths at runtime. However, because it incorporates the computational cost into the loss function, this approach lacks the flexibility to be customized, without retraining, for applications with varying requirements or for deployment across heterogeneous devices. Layer cascade (LC) work studies early exiting for segmentation. This approach treats segmentation as a large set of independent classification tasks, where each pixel is propagated to the next exit only if the latest prediction does not exceed a confidence threshold. Nevertheless, this scheme results in heavily unstructured computation due to the disparate per-pixel paths, for which existing BLAS libraries cannot provide a practical speedup. Furthermore, LC is a handcrafted architecture that relies heavily on Inception-ResNet and is not applicable to a wide range of backbones, nor can its architecture be tailored to the capabilities of the target device.
[0069] Multi-exit network training. To date, the training of multi-exit models falls into two categories: 1) end-to-end schemes that jointly train the backbone and early exits, increasing the accuracy of the early exits, often at the cost of degraded accuracy at deeper exits or even divergence; and 2) frozen-backbone methods, which first train the backbone until convergence and then attach and train the intermediate exits separately. This independence between the backbone and the exits allows faster exit training, but accuracy is compromised because fewer parameters are free to be tuned. This technique introduces a novel two-stage training scheme for MESS networks, comprising an exit-aware backbone training step that pushes the extraction of semantically “strong” features to early in the network, followed by a frozen-backbone step that fully trains the early exits without compromising the accuracy of the final exit.
[0070] A complementary approach aimed at further improving early-exit accuracy involves knowledge distillation between exits, which has been investigated in classification and domain adaptation tasks. Such schemes employ self-distillation, treating the last exit as the teacher and the intermediate classifiers as students, without priors about the ground truth. In contrast, the proposed positive filtering distillation (PFD) scheme leverages the densely structured information in semantic segmentation and only allows knowledge to flow through pixels where the teacher is correct.
[0071] Multi-Exit Segmentation Network. This technique provides a novel framework for deriving and training a Multi-Exit Semantic Segmentation (MESS) network from a user-defined architecture for efficient segmentation, adaptable to the device and task at hand. Given a CNN, MESS treats it as the backbone architecture and attaches multiple “early exits” (i.e., segmentation heads) at different depths, providing predictions with varying workload-accuracy characteristics. Figure 1 is a schematic diagram illustrating an instance of this multi-exit semantic image segmentation model. The term "instance" as used herein refers to a specific configuration or version of the MESS network. That is, the training method trains a single network that includes multiple candidate early exit segmentation network architectures, and an "instance" is one of these candidate architectures. The instance shown in Figure 1 includes two early exits attached to the backbone at different locations, and a final exit provided at the end of the backbone (as part of the backbone itself). Importantly, the architecture, number, and location of the early exits remain configurable and can be co-optimized via search when deploying to target devices with different capabilities and application requirements, without retraining, thus enabling a "train once, deploy everywhere" paradigm. In this way, MESS can support various inference pipelines, from sub-network extraction to progressive refinement of predictions and confidence-based exiting.
[0072] The MESS network combines the strengths of all the aforementioned areas. In its holistic approach to addressing the unique challenges of detailed scene understanding models, the framework of this technique combines end-to-end training with frozen-backbone training, hand-designed dilated networks with automated architecture configuration search, and latency-constrained inference with confidence-based early exiting.
[0073] Advantageously, this technique provides a design for a MESS network that combines adaptive inference at early exits with architecture customization, offering a fine-grained speed and accuracy trade-off tailored for semantic segmentation tasks. This allows for efficient inference based on the difficulty of the input and the capabilities of the target device.
[0074] As described above, this technique provides a method for training ML models in the form of multi-exit semantic segmentation networks (or progressive segmentation networks). This network includes numerous early exit points (i.e., segmentation heads) attached to different depths of a backbone convolutional neural network (CNN) architecture. This provides segmentation predictions with varying workload (and accuracy) characteristics, introducing a "train once, deploy anywhere" approach for efficient semantic segmentation. Advantageously, this means the network can be parameterized without retraining to be deployed on heterogeneous target devices with varying capabilities (from low-end to high-end).
[0075] The ML model of this technology can be adapted to a wide variety of deployment scenarios, including, for example:
[0076] Extract a lighter-workload sub-model for deployment on devices with different computing capabilities (e.g., mobile phones) to meet latency constraints by completely skipping some computations.
[0077] Based on the allocation of available resources on the target device and the computational load, a computational path is selected at runtime to maintain consistent predicted latency.
[0078] A rapid approximation of the prediction is obtained in the early stages of computation and then gradually refined over time.
[0079] The computation path is selected at runtime based on the difficulty of each input sample / prediction confidence obtained at different computation stages.
[0080] Partition the model for cooperative device-cloud execution (i.e., computation offloading), while still being able to obtain an approximation of the final prediction using only on-device computing resources (to address network availability / quality issues).
[0081] Ensemble the outputs of expert segmentation heads, each focusing on a different set of categories (e.g., people / pets) or fine-tuned on a user-centric data distribution (e.g., indoor / outdoor).
[0082] Backbone initialization and exit placement. As a first step, the CNN backbone is provided. Typical semantic segmentation CNNs attempt to prevent the loss of spatial information (which inherently occurs in classification) without reducing the receptive field of the output pixels. For example, dilated residual networks allow at most an 8x spatial reduction of the feature maps and replace any further conventional downsampling with a doubled dilation rate in the convolution operations. Similar assumptions are made for the backbone used to generate the MESS network.
[0083] However, this approach increases the feature resolution of the deeper layers, which typically also have a greater number of channels. Therefore, typical CNN architectures for segmentation contain workload-heavy deeper layers, leading to an imbalanced distribution of computational demands and an increased overall workload. This fact further fuels the need for early exits to eliminate unnecessary computations and conserve performance.
[0084] Next, benchmarking is performed on the provided backbone. Benchmarking can be based on the per-layer FLOPs workload, the number of parameters, or the latency on the target device. Based on the results of this analysis, N candidate exit points are identified. For simplicity, exit points are restricted to the outputs of individual network blocks b_k, following an approximately equidistant workload distribution (i.e., every 1/N of the total backbone FLOPs). This maximizes the distance between candidate exits instead of searching among similar exit locations, thus improving search efficiency.
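By way of illustration only, the equidistant placement of candidate exit points can be sketched as follows. The function name `candidate_exit_blocks` and the per-block FLOPs list are hypothetical and are not taken from the source; the sketch assumes that the block whose cumulative workload lies nearest to each k/N target is selected, so the k = N target coincides with the end of the backbone.

```python
from itertools import accumulate


def candidate_exit_blocks(block_flops, n_exits):
    """Pick block indices whose cumulative FLOPs are nearest to k/N of the total.

    block_flops: per-block workload of the backbone, in execution order
                 (hypothetical benchmarking output).
    n_exits:     number N of candidate exit points.
    """
    cumulative = list(accumulate(block_flops))
    total = cumulative[-1]
    chosen = []
    for k in range(1, n_exits + 1):
        target = total * k / n_exits
        # Block whose output is nearest to the k-th workload target.
        idx = min(range(len(cumulative)), key=lambda b: abs(cumulative[b] - target))
        if idx not in chosen:  # avoid duplicates when blocks are very coarse
            chosen.append(idx)
    return chosen


# Usage: six blocks with an uneven workload, four candidate exit points.
exits = candidate_exit_blocks([1, 1, 2, 4, 4, 4], n_exits=4)
```

Latency or parameter count could be substituted for FLOPs in the same routine, since only the cumulative sum of a per-block cost is used.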
[0085] Early exit architecture. Early exits in DNNs face the challenges of limited receptive fields and weak semantics from shallow exits. These challenges are addressed in two ways: i) during training, by pushing the extraction of semantically strong features to shallower layers of the backbone, and ii) by introducing a carefully designed architecture configuration space for each exit based on its position in the backbone, exploring it to generate MESS instances, and customizing it to suit latency and accuracy constraints.
[0086] Architecture configuration space. Figure 2 shows a schematic diagram of the exit architecture of the model. In general, the configuration space of each exit head (Figure 2) is shaped as follows:
[0087] 1. Channel reduction module (CRM), with options O_crm
[0088] 2. Additional trainable blocks: O_blocks = {0, 1, 2, 3}
[0089] 3. Rapid dilation increase, with options O_dil
[0090] 4. Segmentation head, with options O_head
[0091] Formally, the configuration space of the i-th exit architecture is expressed in terms of O_crm, O_blocks, O_dil and O_head, which are the available options for the CRM, the number of trainable blocks, the rapid dilation increase, and the segmentation head, respectively.
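As a minimal sketch, and assuming that the per-exit configuration space is the Cartesian product of the four option sets, the candidate exit architectures can be enumerated as follows. Only the values of O_blocks are stated in the text; the CRM reduction factors and the on/off dilation flag below are hypothetical placeholders, and the two head names follow the segmentation heads described for the framework.

```python
from itertools import product

# Option sets. Only O_BLOCKS is given in the text; the others are assumptions.
O_CRM = (1, 2, 4, 8)               # hypothetical channel-reduction factors
O_BLOCKS = (0, 1, 2, 3)            # number of additional trainable blocks
O_DIL = (False, True)              # hypothetical: rapid dilation increase off/on
O_HEAD = ("FCN-Head", "DLB-Head")  # supported segmentation heads


def exit_configurations():
    """Enumerate every candidate architecture for a single exit."""
    return [
        {"crm": crm, "blocks": blocks, "dil": dil, "head": head}
        for crm, blocks, dil, head in product(O_CRM, O_BLOCKS, O_DIL, O_HEAD)
    ]


configs = exit_configurations()
```

Under these assumed option sets each exit position has 4 x 4 x 2 x 2 = 64 candidate architectures, which is small enough to train and evaluate exhaustively.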
[0092] Channel Reduction Module (CRM). A key challenge that distinguishes early exits in segmentation from those in classification is the significantly higher workload of the segmentation head, stemming from the larger input feature volume being processed. To reduce the overhead per exit without compromising the spatial resolution of the feature volume, which is particularly critical to accuracy, this technique focuses its optimization efforts on the channel dimension. In this direction, the proposed configuration space includes the optional addition of a lightweight CRM comprising 1×1 convolutional layers that rapidly reduce the number of channels fed to the segmentation head by an adjustable factor.
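The effect of the CRM on exit workload can be illustrated with a rough multiply-accumulate (MAC) count. This is a back-of-the-envelope sketch only: the head is modelled as a single 3×3 convolution, and the feature-map size, channel counts, and reduction factor are hypothetical values chosen for illustration, not figures from the source.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """MAC count of a stride-1 convolution that preserves spatial size."""
    return height * width * kernel * kernel * c_in * c_out


def head_macs(height, width, c_in, c_head, crm_factor=1):
    """Workload of a simplified exit: optional 1x1 CRM followed by a 3x3 conv."""
    if crm_factor == 1:
        return conv_macs(height, width, 3, c_in, c_head)
    reduced = c_in // crm_factor
    crm = conv_macs(height, width, 1, c_in, reduced)      # 1x1 channel reduction
    head = conv_macs(height, width, 3, reduced, c_head)   # head on fewer channels
    return crm + head


# Hypothetical exit on a 64x64 feature volume with 512 channels.
without_crm = head_macs(64, 64, c_in=512, c_head=256)
with_crm = head_macs(64, 64, c_in=512, c_head=256, crm_factor=4)
```

Because the 1×1 convolution is cheap relative to the spatial convolutions that follow it, the reduction in head workload outweighs the cost of the CRM itself in this example, while the spatial resolution is left untouched.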
[0093] Additional trainable blocks. Classification-centric approaches address the feature extraction challenges of early classifiers by adding extra layers at each exit. However, naively introducing layers of the original size can lead to a surge in exit workload overhead due to the larger feature volumes in segmentation networks, thus defeating the purpose of early exits. In the MESS network, this is offered as a configurable option that can be used to remedy weak semantics in shallow exits, and such layers are carefully appended after the CRM to take advantage of the computational efficiency of the reduced feature-volume width.
[0094] Rapid dilation increase. To address the limited receptive field of shallow exits, in addition to supporting the addition of dedicated trainable layers at each exit, this framework allows the dilation rate of these layers to be increased rapidly.
[0095] Segmentation Heads. Currently, the proposed framework supports two types of segmentation head architectures: i) the Fully Convolutional Network-based Head (FCN-Head) (Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, 2015) and ii) the DeepLabV3-based Head (DLB-Head) (Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision (ECCV), pp. 801–818, 2018). The former provides a simple and efficient mechanism for upsampling the feature volume through deconvolution and predicting the per-pixel probability distribution over all candidate classes. The latter incorporates Atrous Spatial Pyramid Pooling (ASPP), which includes parallel convolutions with different dilation rates to incorporate multi-scale contextual information into its predictions.
[0096] It is worth noting that, for simplicity, most related works adopt a uniform architecture for all exits. However, as described below, different exit depths present their own challenges, with shallow exits benefiting most from numerous lightweight layers, while deep exits favor channel-rich exit architectures. Our technique's framework is customizable, enabling efficient searching for models with customized architectures at each exit through a two-stage training scheme.
[0097] Training Scheme. Having established the network architecture, the training method of this technique is now explained, which involves a two-stage pipeline enhanced with positive filtering distillation. As mentioned above, early exit networks are typically trained either end-to-end or with a frozen backbone. However, both lead to suboptimal accuracy. Therefore, this technique combines the advantages of both by proposing a novel two-stage training scheme.
[0098] Phase 1 (End-to-End). In the exit-aware pre-training phase, a plain FCN head is attached to every candidate exit point to generate an intermediate "supermodel". (The FCN head is chosen here instead of the DLB head for its speed and because this coarse training step serves only as guidance; the selected heads are refined in Phase 2.) The network is trained end-to-end, updating the weights of the backbone and a single early exit in each iteration, where the remaining exits are dropped in a round-robin manner (Equation (1), also known as the exit-dropping loss). Formally, let y_i ∈ [0,1]^(R×C×M) denote the segmentation prediction of each early exit after softmax, where R and C are the number of rows and columns of the output, respectively, and M is the number of classes.
[0099] Given the ground-truth labels, the proposed loss function for the exit-aware pre-training phase is expressed as follows:
[0100]
[0101] Although the early exits were not fully trained after this stage, their contribution to the loss drove the backbone to extract semantically stronger features, even at shallower levels.
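The round-robin exit dropping of Phase 1 can be sketched as below. This is an illustration under stated assumptions rather than the exact loss of Equation (1): it assumes that the final-exit loss is always included and that exactly one early exit, chosen by the iteration index modulo the number of early exits, contributes in each iteration. The function names are hypothetical.

```python
def active_early_exit(iteration, num_early_exits):
    """Index of the single early exit kept in this iteration (round-robin)."""
    return iteration % num_early_exits


def pretrain_loss(iteration, early_exit_losses, final_exit_loss):
    """Exit-dropping loss for one iteration.

    early_exit_losses: per-exit cross-entropy values, shallowest exit first.
    final_exit_loss:   cross-entropy of the final exit (assumed always active).
    """
    kept = active_early_exit(iteration, len(early_exit_losses))
    # All other early exits are dropped, so they add nothing to the loss
    # and receive no gradient in this iteration.
    return final_exit_loss + early_exit_losses[kept]


# Usage: three early exits; iteration 4 keeps early exit index 1.
loss = pretrain_loss(4, early_exit_losses=[1.0, 2.0, 3.0], final_exit_loss=0.5)
```

Keeping a single early exit per iteration limits the interference between exits on the shared backbone while still exposing the backbone to every exit over the course of training.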
[0102] Phase Two (Frozen Backbone). In this phase, the backbone and final exit remain frozen (i.e., weights are not updated). All candidate early exit architectures are attached across all candidate exit points i∈{1,2,...,N} and trained individually using the strong semantics extracted from the backbone. Most importantly, maintaining the backbone invariant allows different exit architectures to be trained without interference and interchanged in a plug-and-play manner at deployment time, providing great flexibility for customization.
[0103] Therefore, this technology provides a computer-implemented method for generating a machine learning (ML) model for semantic image segmentation, the method comprising: providing a backbone feature extraction network of an ML model with multiple early exits in the backbone network to generate an over-provisioned network including multiple candidate early exit segmentation network architectures, wherein each early exit includes a customized network architecture; obtaining a training dataset including multiple images; and training the backbone network, final exit, and early exits of the ML model through the following steps to output feature maps of the multiple images input to the backbone network: during a first training phase, training the backbone network, final exit, and early exits end-to-end; and after the end-to-end training is completed, freezing the weights of the backbone network and the final exit, and during a second training phase, training the early exits individually using the final exit as the teacher of the early exits.
[0104] As described above, the first training phase includes iteratively training the backbone network and early exits, wherein during each iteration, training includes: selecting an early exit from a plurality of early exits to be updated; discarding the remaining early exits; and training the backbone network and the selected early exit, and updating the weights of the backbone network and the selected early exit. For each selected early exit, the remaining early exits may be discarded sequentially during each iteration of training the selected early exit.
[0105] Positive Filtering Distillation (PFD). The final stage of the training process further exploits the joint potential of knowledge distillation and early exit networks for semantic segmentation. In prior self-distillation work for multi-exit networks, the final output of the backbone is used as the teacher of the early classifiers, whose loss function typically combines ground-truth and distillation-specific terms. To further exploit the information backpropagated from the pre-trained final exit, PFD is proposed. This technique selectively controls the information flow to earlier exits, using only the signals from samples that the final exit predicts correctly. The hypothesis is that the early exit heads can become stronger feature extractors by incorporating signals from the final exit for simpler samples, while avoiding the confusion of trying to mimic contradictory references.
[0106] Driven by the fact that segmentation outputs are dense, the proposed distillation scheme assesses the difficulty of each pixel in the input sample based on the correctness of the teacher's prediction (i.e., the final output) for that pixel. Subsequently, the stronger (higher entropy) ground-truth reference signal fed to the early exits is filtered, allowing only information from "simple" pixels to pass through. Thus, by avoiding contamination of the training algorithm with noisy gradients from contradictory loss terms, the training effort and learning capacity of each exit are focused on the "simpler" pixels.
[0107] Formally, the tensor of predicted classes of the i-th exit is defined for each pixel p = (r, c), where r ∈ [1, R] and c ∈ [1, C], with each predicted class taking a value in {0, 1, ..., M-1}. Given the corresponding output of the final exit, the ground-truth labels, and a hyperparameter α, the following loss function is used for the frozen-backbone phase of the training scheme:
[0108]
[0109] where L_CE and L_KL denote the cross-entropy loss and the KL divergence, respectively, and I is the indicator function.
[0110] In other words, the second training phase may include: using segmentation predictions of the image made by the final exit, determining the difficulty of each pixel in the image based on whether the prediction for each pixel is correct; and using only the pixels in which the predictions are correct to train the early exit.
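A per-pixel sketch of positive filtering distillation is given below. It is a simplified illustration and not the exact loss of the source: the weighting α · KL + (1 − α) · CE, the direction of the KL divergence (teacher relative to student), and the averaging over pixels are assumptions, and the function name `pfd_loss` is hypothetical.

```python
import math


def pfd_loss(student_probs, teacher_probs, labels, alpha=0.5):
    """Positive filtering distillation over a flat list of pixels.

    student_probs: per-pixel class distributions of an early exit.
    teacher_probs: per-pixel class distributions of the (frozen) final exit.
    labels:        per-pixel ground-truth class indices.
    alpha:         assumed weight between distillation and ground-truth terms.
    """
    eps = 1e-12
    total = 0.0
    for student, teacher, label in zip(student_probs, teacher_probs, labels):
        cross_entropy = -math.log(student[label] + eps)
        teacher_class = max(range(len(teacher)), key=teacher.__getitem__)
        if teacher_class == label:
            # Teacher is correct: let its knowledge flow through this pixel.
            kl = sum(
                t * math.log((t + eps) / (s + eps))
                for t, s in zip(teacher, student)
                if t > 0
            )
        else:
            # Teacher is wrong: filter out the distillation signal.
            kl = 0.0
        total += alpha * kl + (1 - alpha) * cross_entropy
    return total / len(labels)


# Usage: one pixel where the teacher is wrong, one where it is right.
filtered = pfd_loss([[0.5, 0.5]], [[0.9, 0.1]], labels=[1])
distilled = pfd_loss([[0.5, 0.5]], [[0.9, 0.1]], labels=[0])
```

In the first call the teacher predicts class 0 while the label is 1, so only the ground-truth term remains; in the second call the teacher is correct and the distillation term is added.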
[0111] Parameterization at deployment time. After training an over-provisioned network that includes all candidate exit architectures, an exhaustive architecture search can be used to derive MESS instances for the use case at hand, reflecting the capabilities of the target device, the difficulty of the inputs, and the required accuracy or latency.
[0112] Inference Setup. To meet the performance requirements of each device and application-specific constraints, the MESS network supports different inference setups: i) Budgeted inference, where a lighter-workload sub-model is extracted up to a specific exit, enabling deployment on heterogeneous platforms with varying computational capabilities; ii) On-the-fly inference, where each sample passes sequentially through the exits, initially providing a fast approximation of the output and progressively refining it through deeper exits until a deadline is reached, adjusting the computational depth at runtime based on resource availability on the target platform; or iii) Input-dependent inference, where each sample dynamically follows a different computational path based on its difficulty, as captured by the confidence of the prediction at each exit. Figures 3A, 3B and 3C illustrate how the architecture of the ML model can be configured when performing budgeted inference, on-the-fly inference, and input-dependent inference, respectively. This will be explained in more detail below.
[0113] Configuration Search. This framework automatically searches the configuration space for a customized MESS network in each of these settings. Specifically, it searches for the number, location, and architecture of the early exits, as well as the exit policy in the input-dependent inference scenario.
[0114] Number, location, and configuration of exits. The proposed method takes all trained exit architectures into account and exhaustively creates different configurations, for example trading a workload-heavy shallow exit for a more lightweight deeper exit. The search strategy considers the target inference setting, as well as user-specified workload, latency, and accuracy requirements, which can be expressed as a combination of hard constraints and optimization objectives. As a result, the number and location of exits, as well as the architecture of each individual exit of the resulting MESS instance, are jointly optimized.
[0115] Given the exit-architecture search space, the configuration space for the MESS network is defined as follows:
[0116]
[0117] where the additional term represents the "None" option at each exit position. Under this formulation, given an accuracy constraint th_acc, the framework can minimize workload / latency (expressed as cost):
[0118]
[0119] or, given a cost constraint th_cost, optimize for accuracy:
[0120]
[0121] Most importantly, the two-stage training scheme described above allows all trained exits to be interchangeably attached to the same backbone for inference. This enables highly efficient searching of an otherwise overly complex space, avoiding the excessive search time of NAS methods. Furthermore, the MESS network can be customized for different requirements without retraining, while the proposed exhaustive search guarantees the optimality of the selected design point.
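The cost-minimizing variant of the search can be sketched as a plain exhaustive enumeration. In this illustration each exit position offers its trained candidates plus a "None" option, and a caller-supplied `evaluate` function returns the accuracy and cost of a configuration; `search_min_cost`, `toy_evaluate`, and the toy candidate tuples are all hypothetical and stand in for measurements on the target device.

```python
from itertools import product


def search_min_cost(options_per_position, evaluate, th_acc):
    """Return the cheapest configuration whose accuracy meets th_acc.

    options_per_position: list (one entry per exit position) of candidate exits.
    evaluate:             callable mapping a configuration to (accuracy, cost).
    """
    best_config, best_cost = None, float("inf")
    # Every position may also be left empty (the "None" option).
    for config in product(*[(None, *opts) for opts in options_per_position]):
        accuracy, cost = evaluate(config)
        if accuracy >= th_acc and cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def toy_evaluate(config):
    """Toy stand-in: accuracy of the best exit, summed cost of all exits."""
    chosen = [c for c in config if c is not None]
    if not chosen:
        return 0.0, 0.0
    return max(acc for _, acc, _ in chosen), sum(cost for _, _, cost in chosen)


# Candidates are (name, accuracy, cost) tuples with made-up numbers.
options = [
    [("a", 0.60, 1.0), ("b", 0.65, 2.0)],  # shallow exit position
    [("c", 0.70, 3.0)],                    # deeper exit position
]
best_config, best_cost = search_min_cost(options, toy_evaluate, th_acc=0.65)
```

The accuracy-maximizing variant under a cost constraint follows by swapping the roles of the two quantities in the comparison. Because every exit is trained against the same frozen backbone, each candidate only needs to be evaluated, not retrained, during this enumeration.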
[0122] The method may also include performing an architecture configuration search to identify an architecture suitable for a specific application from multiple candidate early exit segmentation network architectures.
[0123] The method may further include: receiving hardware constraints and / or inference performance requirements; receiving inference settings for a specific device or device class to be used for processing the input image at inference time; and performing an architecture configuration search using the received hardware constraints and / or inference performance requirements and the received inference settings.
[0124] Figure 3A shows an example of an ML model instance that can be used when performing budgeted inference on a user device. If budgeted inference is to be used to process received images on a specific device or device category, this framework searches for and extracts a suitable sub-model or instance from the ML model (which can be considered one of the candidate early exit segmentation network architectures of the ML model). In the case of budgeted inference, the architecture configuration search output includes the architecture of the backbone feature extraction network and a single early exit.
[0125] As shown in Figure 3A, the sub-model comprises a backbone network and a single early exit that, given the hardware configuration of a user device (i.e., its processing and storage limitations), meets any latency requirements. This is achieved by evaluating the accuracy and latency performance of all possible sub-models (or instances or candidates) and selecting the sub-model with the highest performance that meets the latency requirements. Therefore, in the example of Figure 3A, a single early exit or segmentation head is attached to the backbone network. In this example, the early exit is located at a relatively early position along the backbone network; however, it should be understood that the early exit can be located anywhere along the backbone network, provided the latency requirements are met.
[0126] Determining at least one hardware constraint may include receiving information about at least one of the following: the device's computational load, the device's storage capacity, and the device's power consumption.
[0127] Once an ML model has been configured for the device or device category (i.e., a sub-model has been extracted), the ML model can be provided to the device for image segmentation of the received images. Therefore, the method may also include sending or transmitting, or otherwise making available, the identified and extracted early exit segmentation network architecture to devices with the same hardware constraints.
[0128] Figure 3B shows an example of an ML model instance that can be used when performing on-the-fly inference on a user's device. If on-the-fly inference is to be used to process images, this framework selects a subset of early exits from all possible early exits with the aim of providing progressive refinement of the segmentation prediction. That is, the framework extracts a sub-model, instance, or architecture from the candidate early exit segmentation network architectures of the ML model. User-provided objectives can specify the minimum accuracy / latency for each refinement, as well as the refinement interval. To produce the highest-performing subset of early exits, the framework considers each exit independently and aims to minimize the overhead introduced by each exit while satisfying the objective requirements.
[0129] In the case of on-the-fly inference, the architecture configuration search output includes the architecture of the backbone feature extraction network and multiple early exits.
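On-the-fly inference with progressive refinement can be sketched as a loop over backbone stages and their exits that stops at a deadline. This is a minimal illustration: each stage is assumed to be a pair of callables (a backbone segment and an exit head), and the names `anytime_predict`, `stages`, and `deadline_s` are hypothetical.

```python
import time


def anytime_predict(stages, x, deadline_s):
    """Refine the prediction exit by exit until the deadline is reached.

    stages:     list of (backbone_segment, exit_head) callables, shallow to deep.
    x:          input sample.
    deadline_s: time budget in seconds, measured from the start of the call.
    """
    start = time.monotonic()
    features, prediction = x, None
    for backbone_segment, exit_head in stages:
        features = backbone_segment(features)  # continue along the backbone
        prediction = exit_head(features)       # refined approximation
        if time.monotonic() - start >= deadline_s:
            break  # return the latest available prediction
    return prediction


# Usage with toy stages: each segment adds 1, each head tags the result.
toy_stages = [
    (lambda f: f + 1, lambda f: ("exit-1", f)),
    (lambda f: f + 1, lambda f: ("exit-2", f)),
]
fast = anytime_predict(toy_stages, 0, deadline_s=0.0)
full = anytime_predict(toy_stages, 0, deadline_s=1e9)
```

The first exit is always evaluated, so a prediction is available even when the budget is exhausted immediately; deeper exits only refine it when time remains.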
[0130] Figure 3C shows an example of an ML model instance that can be used when performing input-dependent inference on a user device. If input-dependent inference is to be used to process a received image, this framework selects a subset of early exits and a tuned early-exit policy that will satisfy the target requirements on the target platform (i.e., the target user device). This is achieved by exhaustively enumerating the different numbers and locations of early exits, considering all possible combinations of early exits and exit policies. Thus, the framework generates the highest-performing early-exit policy, along with the number and location of early exits, for the target user device or target application / use case.
[0131] In the case of input-dependent inference, the architecture configuration search output includes an architecture comprising a backbone feature extraction network and multiple early exits, wherein the architecture includes a confidence evaluation unit associated with each early exit to evaluate the confidence of the predictions made by each early exit during inference.
[0132] Early-exit criterion. Driven by the fact that not all inputs have the same prediction difficulty, adaptive inference has been extensively studied in image classification. In this setting, each input sample passes sequentially through the selected early exits. After a prediction is generated at an exit, a mechanism that calculates an image-level confidence (as a measure of prediction difficulty) is used to determine whether inference should continue to the next exit.
[0133] This technique remains largely unexplored in dense prediction problems, such as semantic segmentation. In some existing techniques, each pixel in an image is treated as an independent classification task that exits early if its prediction confidence at an exit is high, resulting in irregular computational paths. In contrast, the present method treats the segmentation of each image as a single task, aiming to drive each sample through a uniform computational path. To this end, the present technique fills a gap in the literature by introducing a novel mechanism to quantify the overall confidence of a semantic segmentation prediction.
[0134] Confidence adjustment for the MESS network. Starting from a per-pixel confidence map $c^{map} = f_c(y)$, calculated from the probability distribution over classes at each pixel (where $f_c$ is typically top1(·) or entropy(·)), this technique provides a mechanism to reduce these per-pixel confidence values to a single (per-image) confidence value. The proposed metric considers the percentage of pixels in the dense output $y_i$ of an exit that have a high prediction confidence (above an adjustable threshold):
[0135]
[0136] Furthermore, it has been observed that some spatial information is lost due to the progressive downsampling of the feature volume in CNNs. Therefore, semantic predictions near object edges are naturally unreliable. Driven by this observation, the proposed metric is augmented to account for these expected low-confidence pixel predictions. Initially, edge detection is performed on the semantic mask, followed by an erosion filter with a kernel size equal to the spatial downsampling rate of the feature volume (8×), to compute the semantic edge map M:
[0137]
[0138] Finally, median-based smoothing is applied to the confidence values of pixels located at semantic edges:
[0139]
[0140] where $w_l = \{l - W, \ldots, l + W\}$ and W is the adjustable window size.
[0141] During inference, each sample is processed by the selected early exits in order. For each prediction $y_i$, the proposed metric is calculated, and an adjustable confidence threshold (exposed to the search space as the exit policy) determines whether the sample exits early or is further processed by the subsequent backbone layers / exits.
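The image-level confidence metric described in the preceding paragraphs can be illustrated with a small pure-Python sketch. It is a simplified illustration, not the patented implementation: the function names are hypothetical, the edge detector is a plain neighbour comparison chosen for brevity, and the erosion step with the 8× kernel is omitted.

```python
from statistics import median

def semantic_edge_map(labels):
    """Mark pixels whose predicted class differs from a 4-connected neighbour."""
    rows, cols = len(labels), len(labels[0])
    edges = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc] != labels[r][c]:
                    edges[r][c] = True
    return edges

def smooth_edge_confidence(conf, edges, window):
    """At edge pixels, replace the confidence by the median over the
    (2W+1) x (2W+1) neighbourhood, clipped at the image border."""
    rows, cols = len(conf), len(conf[0])
    out = [row[:] for row in conf]
    for r in range(rows):
        for c in range(cols):
            if edges[r][c]:
                neighbourhood = [
                    conf[rr][cc]
                    for rr in range(max(0, r - window), min(rows, r + window + 1))
                    for cc in range(max(0, c - window), min(cols, c + window + 1))
                ]
                out[r][c] = median(neighbourhood)
    return out

def image_confidence(conf, pixel_threshold):
    """Fraction of pixels whose confidence is at or above the threshold."""
    flat = [value for row in conf for value in row]
    return sum(value >= pixel_threshold for value in flat) / len(flat)

# Toy 4x4 prediction: two classes split down the middle, with one
# low-confidence column at the class boundary.
labels = [[0, 0, 1, 1] for _ in range(4)]
conf = [[0.9, 0.9, 0.4, 0.9] for _ in range(4)]
edges = semantic_edge_map(labels)
print(image_confidence(conf, 0.8))                                     # 0.75
print(image_confidence(smooth_edge_confidence(conf, edges, 1), 0.8))  # 1.0
```

In this toy case the only low-confidence pixels lie on the semantic edge, so smoothing them raises the image-level confidence from 0.75 to 1.0, which is the effect the augmented metric is intended to have.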
[0142] The confidence assessment unit for each early exit is configured to: calculate the confidence value of the image segmentation prediction made by the relevant early exit on the overall image; determine whether the confidence value is greater than or equal to a threshold confidence value; and instruct the processing to continue to the subsequent early exit when the confidence value is lower than the threshold confidence value, or instruct the processing to terminate when the confidence value is greater than or equal to the threshold confidence value.
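The decision made by the confidence assessment unit can be sketched as a simple loop. This is an illustrative sketch only: `run_input_dependent`, the `(backbone_stage, exit_head)` pairing and the `confidence_fn` callback are hypothetical names, and the last pair is assumed to hold the final exit, whose prediction is always returned.

```python
def run_input_dependent(image, stages, thresholds, confidence_fn):
    """Run backbone stages in order; after each exit, stop as soon as the
    image-level confidence reaches that exit's threshold."""
    features = image
    for index, (backbone_stage, exit_head) in enumerate(stages):
        features = backbone_stage(features)
        prediction = exit_head(features)
        is_last = index == len(stages) - 1
        if is_last or confidence_fn(prediction) >= thresholds[index]:
            return prediction, index

# Toy stand-ins: each "prediction" is simply its own confidence value.
stages = [(lambda f: f + 1, lambda f: f / 4)] * 3
print(run_input_dependent(0, stages, [0.5, 0.5], lambda p: p))  # (0.5, 1)
print(run_input_dependent(0, stages, [0.9, 0.9], lambda p: p))  # (0.75, 2)
```

Because the whole image either exits or continues, every sample follows a regular computational path, in contrast to per-pixel exit schemes.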
[0143] Therefore, calculating the overall confidence value of an image can include: obtaining a per-pixel confidence map that includes the confidence values of each pixel in the image; identifying pixels located near semantic edges of objects in the prediction; and outputting the percentage of pixels in the image with a per-pixel confidence value greater than or equal to a threshold confidence value, where the contribution of the identified pixels is reduced. During the architecture configuration search, the threshold confidence value is optimized for each early exit.
[0144] Therefore, this technology provides a computer-implemented method for performing semantic image segmentation on a device using a trained machine learning (ML) model, the method comprising: obtaining an instance of a trained ML model, the instance being an early exit segmentation network architecture associated with the device or the device category to which the device belongs and suitable for an inference setup used by the device; receiving an image to be processed by the instance of the trained ML model; and performing image segmentation on the received image using the instance of the trained ML model. It will be understood that the term "obtain" can mean obtaining an instance of the trained ML model from a server, which may occur once. It will be understood that the term "obtain" can also mean obtaining an instance of the trained ML model from memory or local storage on the device, which is used each time image segmentation is performed.
[0145] When the early exit segmentation network architecture includes a backbone feature extraction network and a single early exit, performing image segmentation can include outputting image segmentation predictions from the single early exit.
[0146] When the early exit segmentation network architecture includes a backbone feature extraction network and multiple early exits, performing image segmentation may include processing the image sequentially by the early exits. As mentioned above, the network architecture of each early exit may be the same or different (i.e., non-uniform). After processing by an early exit, the method may include: providing an image segmentation prediction from that early exit; calculating a confidence value for the image segmentation prediction on the overall image; determining whether the confidence value is greater than or equal to a threshold confidence value; and processing the image using a subsequent early exit when the confidence value is less than the threshold confidence value, or outputting the image segmentation prediction from that early exit when the confidence value is greater than or equal to the threshold confidence value. The overall confidence value of the image can be determined by considering the percentage of pixels in the image with a pixel-level confidence value higher than the threshold confidence value. That is, the number of pixels with a confidence value higher than the threshold confidence value is determined as a fraction of the total number of pixels in the image, and this percentage is used to determine whether the entire image meets the threshold confidence value.
[0147] When the early exit segmentation network architecture includes a backbone feature extraction network, multiple early exits, and a confidence evaluation unit associated with each early exit, the confidence evaluation unit of each early exit is configured to: obtain a per-pixel confidence map including the confidence value of each pixel in the image; identify pixels located near the semantic edges of objects in the prediction; and generate a confidence value for the entire image, wherein the confidence value is the percentage of pixels in the image that have a per-pixel confidence value greater than or equal to a threshold confidence value associated with the early exit, wherein the contribution of the identified pixels is reduced.
[0148] Evaluation. The evaluation of this technology will now be explained.
[0149] Models and datasets. The proposed method has been applied to the DRN-50 (Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated Residual Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 472–480, 2017), DeepLabV3 (Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587, 2017) and MNetV2 (Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520, 2018) segmentation CNNs, employing ResNet50 (Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016) and MobileNetV2 backbones, representing high-end and edge use cases, respectively. All backbones were trained on MS COCO (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pp. 740–755, 2014), and the early exits were independently fine-tuned on MS COCO and PASCAL VOC (Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010).
[0150] MS COCO: MS COCO forms one of the largest datasets for dense visual understanding tasks. Therefore, it serves as a common basis for pre-training semantic segmentation models across domains. Following the conventions of semantic segmentation, only the 20 semantic classes of PASCAL VOC (plus one background class) are considered, and any training images consisting solely of background pixels are discarded. This yields 92.5K training images and 5K validation images. For MS COCO, $b_R$ is set to 520×520. PASCAL VOC: PASCAL VOC (2012) contains the most widely used semantic segmentation benchmark. It includes 20 foreground object classes (plus one background class). The original dataset consists of 1464 training images and 1449 validation images. As is customary, the augmented training set provided by Hariharan et al. (Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. IEEE, 2011) is used to obtain 10.5K training images. For PASCAL VOC, $b_R$ is set to 520×520 (same as MS COCO).
[0151] Development and deployment setup. The MESS network is implemented in PyTorch (v1.6.0), built on top of torchvision (v0.6.0). During inference, MESS network instances are deployed on both high-end (desktop with an Nvidia GTX 1080 Ti; 400 W TDP) and edge (Nvidia Jetson Xavier AGX; 30 W TDP) platforms.
[0152] Baselines. To evaluate performance against state-of-the-art (SOTA) techniques, comparisons were made with the following baselines: 1) E2E SOTA (Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. Multi-Scale Dense Networks for Resource Efficient Image Classification. In International Conference on Learning Representations (ICLR), 2018); 2) Frozen SOTA (Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-Deep Networks: Understanding and Mitigating Network Overthinking. In International Conference on Machine Learning (ICML), pp. 3301–3310, 2019); 3) SelfDistill (Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models. In Advances in Neural Information Processing Systems (NeurIPS), 2019); 4) DRN (Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated Residual Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 472–480, 2017); 5) DLBV3 (Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision (ECCV), pp. 801–818, 2018); 6) segMBNetV2 (Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520, 2018); and 7) LC (Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3193–3202, 2017).
[0153] Exit-aware pre-training. First, the effectiveness of the proposed exit-aware pre-training scheme is demonstrated. The accuracy of models with uniform exit configurations across all candidate exit points is compared, with different training strategies employed. Table 1 summarizes the results of this comparison on a ResNet-50-based DRN backbone (DRN-50), where N=6. The top cluster (rows (i)-(iv)) provides alternative initialization schemes for the network, each using a different loss function and integrating different exits. These are alternatives to the end-to-end pre-training step (stage 1 as described above). The last group in the table represents different training schemes, using candidates from the first group as initializations. The experiments were repeated three times.
[0154] Table 1: Per-exit accuracy for different training schemes on DRN-50.
[0155]
[0156]
[0157] Adding early exits to the loss function aims to push the extraction of semantically strong features to the shallower parts of the network. Results from different initializations show that adding a single exit under end-to-end training can even improve the accuracy of the final segmentation prediction (row (ii)). Similar to its use in GoogLeNet, it is assumed that the extra signal in the middle of the network acts as both a regularizer and an additional source of backpropagation, reducing the potential impact of vanishing gradients. However, this effect quickly vanishes when exits are attached to very early layers (row (iii)) or when more exits are attached and trained jointly. The latter case is represented by E2E SOTA, the end-to-end training scheme (row (v)). Both of these training methods can lead to a decrease in the accuracy of the final output, which may be attributed to conflicting signals between early and late classifiers and to the large losses of the early exits dominating the loss function. Therefore, the exit-dropout loss is proposed, which trains early exits one by one in an alternating manner, thus producing the highest accuracy at the final exit (row (iv)).
[0158] The bottom group of rows in Table 1 lists the same settings as before, but after the second-stage training defined above. For example, Frozen SOTA (row (vi)) represents the case where the early exits are trained while attached to the frozen, conventionally pre-trained backbone. The key point is that joint pre-training with at least one early exit can partially benefit the adjacent exit heads in the second stage (rows (vii)-(viii) versus row (vi)). However, this effect largely favors the specific exits selected during pre-training and may come at the expense of the deepest classifier. In contrast, the proposed exit-aware initialization scheme (row (ix)) produces consistently high accuracy on all exits without compromising the final exit.
[0159] Indicatively, the proposed exit-dropout loss helps the resulting exits achieve an accuracy gain of up to 12.57 percentage points (pp) compared to the conventionally pre-trained segmentation network (row (i)) and of up to 3.38 pp compared to the end-to-end trained model (row (v)); the latter also suffers a 1.57 pp drop in the accuracy of the final exit.
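The difference between joint end-to-end training and the alternating scheme can be sketched on scalar loss values. This is a hedged illustration: the function names are hypothetical, and the assumptions that the final-exit loss is always included and that the active early exit is cycled round-robin by training step are illustrative choices; the text above states only that early exits are trained one by one in an alternating manner.

```python
def joint_loss(final_loss, early_losses):
    """End-to-end scheme: every exit contributes to the loss at every step."""
    return final_loss + sum(early_losses)

def exit_dropout_loss(final_loss, early_losses, step):
    """Alternating scheme (assumed round-robin): the final exit plus exactly
    one early exit contribute at each training step."""
    active = step % len(early_losses)
    return final_loss + early_losses[active], active

early = [0.5, 0.25, 0.125]           # toy per-exit losses, shallowest first
print(joint_loss(1.0, early))        # 1.875
print(exit_dropout_loss(1.0, early, step=4))  # (1.25, 1)
```

With only one early exit active per step, a large loss at a shallow exit cannot dominate every update, which matches the motivation given for the alternating scheme.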
[0160] Positive Filter Distillation. Here, the benefits of the Positive Filter Distillation (PFD) scheme are quantified for the second stage (frozen backbone) of the present training method. To this end, PFD is compared with E2E SOTA, which utilizes the cross-entropy (CE) loss, with the traditional knowledge distillation (KD) method, and with SelfDistill, which employs a combined loss (CE+KD).
[0161] Table 2 summarizes the results for representative exit architectures on DRN-50 and MobileNetV2.
[0162] Table 2: Positive Filter Distillation ablation (mIoU).
[0163]
[0164]
[0165] As can be seen, the proposed loss consistently yields higher accuracy in all cases, exceeding E2E SOTA, KD, and SelfDistill by 1.8, 1.28, and 2.32 percentage points, respectively. This accuracy improvement is more significant at shallow exits, where the training process focuses on "simple" pixels, while smaller improvements are achieved at deeper exits, where the accuracy gap with the final exit is naturally bridged.
[0166] Inference performance evaluation. Here, the effectiveness and flexibility of the proposed train-once, deploy-anywhere method for semantic segmentation are demonstrated under different deployment scenarios and workload / accuracy constraints. As mentioned above (and explained with reference to Figures 3A to 3C), there are three inference settings in the MESS network: i) budgeted inference, ii) on-the-fly inference, and iii) input-dependent inference, each of which can be optimized separately. For this purpose, a search is conducted to find the optimal early exit architecture for each use case. The performance of the optimized MESS networks in each case is shown below.
[0167] Budgeted inference and on-the-fly inference. In budgeted inference, the search finds a sub-model that can fit the device and execute within given latency / memory / accuracy targets. This approach delivers the most efficient MESS network configuration for the requirements of the underlying application and device. The search tends to favor designs with powerful exit architectures, comprising multiple trainable layers, attached early in the network. In the on-the-fly inference case, a given deadline is treated as the computation cutoff point. When the deadline is reached, the last output available from the early exits of the network is taken as the result, or used as a placeholder result that is refined asynchronously until the result is actually used. However, there is an inherent trade-off in this paradigm. On the one hand, denser early exits provide more frequent "checkpoints." On the other hand, each additional head is essentially computational overhead when its output is not used. To control this trade-off, the method also considers the additional computational cost of each exit when populating the MESS network architecture. In contrast to budgeted inference, in this setting the search produces heads with extremely lightweight architectures, sacrificing flexibility to reduce computational overhead, attached deeper in the network. As shown in Table 3, for on-the-fly inference the search-generated exit architecture requires 11.6x less computation than that of budgeted inference under the same accuracy constraint. Here, a mean intersection-over-union (mIoU) of 50% is required.
[0168] Table 3: Workload and accuracy of DRN-50 with an early exit under different inference schemes.
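The deadline-driven behaviour of on-the-fly inference can be sketched as follows. This is a minimal sketch under stated assumptions: `run_until_deadline`, the `(backbone_stage, exit_head)` pairs and the injectable `clock` are hypothetical, and the deadline is only checked between stages, so a stage that has started is always allowed to finish.

```python
import time

def run_until_deadline(image, stages, deadline_s, clock=time.monotonic):
    """Return the most recent exit prediction available when the deadline hits.
    At least one exit is always evaluated so that a result exists."""
    start = clock()
    features, latest = image, None
    for backbone_stage, exit_head in stages:
        if latest is not None and clock() - start >= deadline_s:
            break  # deadline reached: keep the last refinement
        features = backbone_stage(features)
        latest = exit_head(features)  # progressively refined prediction
    return latest

# Toy stand-ins with a fake clock that advances one second per reading.
stages = [(lambda f: f + 1, lambda f: f * 10)] * 3
ticks = iter([0.0, 1.0, 2.0, 3.0])
print(run_until_deadline(0, stages, 1.5, clock=lambda: next(ticks)))  # 20
```

Every exit head in this loop is executed whether or not its output is ultimately used, which is the overhead that the search counteracts by preferring lightweight heads in this setting.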
[0169]
[0170] Input-dependent inference. In an input-dependent inference setting, each input sample is propagated through the MESS network at hand until the model produces a prediction that it is confident enough to accept ($E_{sel}$, Equation (6)). In this section, the MESS network is instantiated in this setting, and the new confidence metric for dense scene understanding is evaluated.
[0171] Confidence assessment. Various confidence-based metrics have been proposed for early exits. In the classification domain, these revolve around comparing the softmax entropy or top-1 result of each exit with a threshold. Segmentation, in contrast, is a dense prediction problem. Thus, either these widely used classification-based metrics are naively adapted to segmentation by averaging the per-pixel confidence over the image, or the custom metric proposed above is applied. Figure 4 is a chart comparing the performance of different early-exit policies / criteria. The Cartesian product of these approaches defines the four baselines depicted in Figure 4, in which the effectiveness of the proposed exit policy is benchmarked against other policies on a DRN-50-based MESS network with two exits. By selecting different thresholds for the exit policy, even the simplest (dual-exit) configuration of the input-dependent MESS network can provide a fine-grained trade-off between workload and accuracy. Utilizing this trade-off, input-dependent inference is observed to offer the highest computational efficiency under the same 50% mIoU accuracy constraint.
[0172] Furthermore, it can be observed that the proposed image-level confidence metric, applied to both the top1-probability-based and the entropy-based pixel-level confidence estimators, consistently provides a better accuracy-efficiency trade-off than the corresponding averaging counterparts. Specifically, experiments with various architecture configurations show a gain of up to 6.34 pp (1.17 pp on average) across the entire threshold range.
[0173] Comparison with SOTA segmentation network.
[0174] Single-exit segmentation solutions. Here, input-dependent inference is considered, and the MESS framework is applied to single-exit alternatives from the literature, namely DRN, DLBV3, and segMBNetV2. Table 4 lists the results obtained by MESS instances optimized for different use cases (expressed as speed / accuracy requirements fed to the configuration search), alongside the original models.
[0175] Table 4: End-to-end evaluation of MESS network design
[0176]
[0177]
[0178]
[0179] For the DRN-50 backbone with FCN heads on MS COCO, a latency-optimized MESS instance with no accuracy degradation (row (iii)) was observed to achieve up to 3.36x workload reduction, translating to a 2.23x latency speedup over the single-exit DRN. When a controlled accuracy degradation of ≤1 pp is tolerated (row (iv)), this improvement is amplified to 4.01x in workload (2.65x in latency). Furthermore, at a workload budget comparable to that of the DRN, the accuracy-optimized MESS instance achieved a 5.33 pp mIoU gain with 1.22x fewer GFLOPs (row (ii)).
[0180] Similar results were obtained for DLBV3 and on the PASCAL VOC dataset. Furthermore, the performance gain is also consistent for segMBNetV2 (rows (ix)–(xii)), which is an inherently efficient segmentation design with a workload 15.7 times smaller than that of DRN-50. This demonstrates the model-independent nature of this framework, which generates complementary gains by exploiting the dimension of input-dependent inference.
[0181] Multi-exit segmentation solution. Next, the accuracy and performance of the MESS network are compared with those of the Deep Layer Cascade (LC) network (Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3193–3202, 2017), which is the state-of-the-art work proposing per-pixel propagation in early exit segmentation networks. Due to its heavy unstructured computation, standard BLAS libraries cannot realize the true advantages of this approach. Therefore, the pixel-level exit policy of LC is applied to multiple MESS configurations and compared with the present image-level policy.
[0182] When state-of-the-art methods in semantic segmentation are used, such as a larger dilation rate or DeepLab's ASPP, the gain of LC rapidly diminishes, because a large feature volume must be pre-computed for each pixel that propagates deeper. Specifically, with the FCN head, approximately 45% of the feature volume at the output of the first exit falls within the receptive field of a single pixel in the final output, compared to 100% for the DLB head. As a result, no workload reduction was observed for the dual-exit network in Table 4 (row (iv)) relative to the corresponding baseline in row (i), and a significantly reduced gain of 1.13x was observed for the three-exit network in Table 4 (row (iii)). In contrast, the corresponding MESS instances achieved workload reductions of 6.02x and 3.36x, respectively.
[0183] Figure 5A and Figure 5B show two example input images that were processed using a semantic segmentation model during model training. In each of Figure 5A and Figure 5B, the left image shows the example input image, the middle image shows the segmentation prediction at an exit point, and the right image shows the per-pixel confidence for each pixel in the prediction. To capture the overall confidence of the segmentation prediction, a method is needed to reduce the per-pixel confidence of the output to a single value. Instead of a simple approach using an arithmetic mean, this technique uses a reduction formula that considers the percentage of pixels in the prediction that exceed a given confidence threshold. This results in a more robust confidence estimate for the segmentation prediction, unaffected by extremely under-confident pixels / regions in the image.
[0184] Furthermore, the observation that semantic prediction confidence is always lower at pixels closer to the semantic edges of objects is taken into account, and the contribution of these pixels to the prediction confidence is reduced by weighting.
[0185] Figure 6A shows an example input image and two predictions made using the model's final exit point and an early exit point. The input image ("reference") is fed into the network, and two segmentation predictions are displayed, one from the final exit point and one from the early exit point. The figure also shows the ground truth image. Figure 6B is a graph showing the difference in accuracy between predictions made using the final exit point and earlier exit points for multiple input samples. Clearly, the earlier exits successfully segmented most of the input samples. The confidence estimator effectively captures the uncertainty of difficult input samples and directs them to deeper exits for further processing.
[0186] Figure 7 is a schematic diagram illustrating the use of a trained model. It demonstrates how live video can be captured using a smartphone's image capture device. The video may be of an office or workplace. Frames of the live video can be input into a trained model for processing using progressive segmentation. The model can output segmented frames, which can, for example, determine that the frame shows walls, doors, floors, windows, etc., of the office / workplace. Virtual overlays can be applied to the input frames to generate mixed reality frames that use information from the segmented frames. For example, such a mixed reality frame could show a smartphone user how to get to Jenny's office from their current location; the segmentation information makes it possible to guide the user correctly because the segmented frames identify the different features in the frames.
[0187] Robotic devices may require image segmentation algorithms. This is because they need to understand their environment, construct the "world" in which they operate, and know the location of other actors or agents within that world. Robustness and low latency are also required so that the robotic devices can make quick and timely decisions. For example, they may need to be able to avoid obstacles and detect users and other objects, such as pets. Therefore, robotic devices can implement trained models based on this technology to improve their visual perception and understanding.
[0188] The trained model of this technology can be used on a variety of devices. Advantageously, early exit allows for customization of the model to suit the capabilities of the device. That is, both the static capabilities of the device (i.e., the hardware specifications of the device) and the dynamic capabilities of the device (i.e., the processing load of the device when using the model) can be considered to determine whether to use an early exit. Furthermore, the model can be executed entirely on the device, or it can be executed partially on the device and partially in the cloud, as described in UK Patent Application No. 2005029.0, which is incorporated herein by reference in its entirety.
[0189] Figure 8 is a flowchart illustrating example steps for generating a semantic segmentation model, specifically for training an overprovisioned network comprising multiple candidate early exit segmentation network architectures. The method may include: providing a backbone feature extraction network of an ML model with multiple early exits within the backbone network to generate an overprovisioned network comprising multiple candidate early exit segmentation network architectures (step S100), wherein each early exit comprises a customized network architecture; obtaining a training dataset comprising multiple images (step S102); and training the backbone network, final exit, and early exits of the ML model through the following steps to output feature maps of the multiple images input to the backbone network: during a first training phase, training the backbone network, final exit, and early exits end-to-end (step S104); and after end-to-end training is complete, freezing the weights of the backbone network and final exit (step S106), and during a second training phase, training the early exits individually using the final exit as a teacher for the early exits (step S108).
[0190] Figure 9 is a flowchart illustrating example steps for generating a semantic segmentation model, specifically for searching for a particular candidate architecture. The method includes: receiving hardware constraints and / or inference performance requirements (step S200); receiving inference settings for a specific device or device class to be used for processing the input image at inference time (step S202); and performing an architecture configuration search using the received hardware constraints and / or inference performance requirements and the received inference settings to identify an architecture suitable for a specific application from multiple candidate early exit segmentation network architectures (step S204).
[0191] Figure 10 is a flowchart illustrating example steps for semantic segmentation prediction using a trained model. The method uses a trained machine learning (ML) model, which, as described above, can be configured to have any number of exit points depending on the inference settings to be used. Therefore, the method may include: obtaining an instance of the trained ML model (step S300); receiving an image to be processed by the trained ML model (step S302); and performing image segmentation on the received image using the instance of the trained ML model (step S304).
[0192] As described above, at inference time the network can follow: (i) a budgeted inference paradigm, in which a particular exit is selected based on the available latency budget; (ii) an on-the-fly inference paradigm, in which the network provides an approximation of the output from the first exit and progressively refines it with the predictions of deeper exits; or (iii) an input-dependent inference paradigm, in which each sample takes a different computational path based on its difficulty.
[0193] Figure 11 is a block diagram of an apparatus 100 for implementing a trained model and a server 112 for generating an ML model. The server 112 includes at least one processor coupled to memory (not shown) and arranged to: provide a backbone feature extraction network of the ML model with multiple early exits in the backbone network to generate an overprovisioned network including multiple candidate early exit segmentation network architectures, wherein each early exit includes a customized network architecture; obtain a training dataset including multiple images; and train the backbone network, final exit, and early exits of the ML model through the following steps to output feature maps of the multiple images input to the backbone network: during a first training phase, training the backbone network, final exit, and early exits end-to-end; and after end-to-end training is complete, freezing the weights of the backbone network and final exit, and during a second training phase, training the early exits individually using the final exit as a teacher for the early exits. The resulting trained ML model 114, which includes multiple candidate early exit segmentation network architectures, is stored on the server 112.
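The freezing logic of steps S104 to S108 can be sketched schematically. This is a framework-free illustration with hypothetical names (`Module`, `OverprovisionedNet`, `trainable`); in a PyTorch implementation the `trainable` flag would roughly correspond to the `requires_grad` setting of each module's parameters, and the actual optimization loop is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    trainable: bool = True  # stand-in for "weights receive gradient updates"

@dataclass
class OverprovisionedNet:
    backbone: Module
    final_exit: Module
    early_exits: list = field(default_factory=list)

def stage_one(net):
    """S104: backbone, final exit and early exits are trained end-to-end."""
    for module in [net.backbone, net.final_exit, *net.early_exits]:
        module.trainable = True

def stage_two(net):
    """S106-S108: freeze backbone and final exit, then train each early exit
    individually with the final exit acting as teacher."""
    net.backbone.trainable = False
    net.final_exit.trainable = False
    pairs = []
    for early_exit in net.early_exits:
        early_exit.trainable = True
        pairs.append((early_exit.name, net.final_exit.name))  # (student, teacher)
    return pairs

net = OverprovisionedNet(Module("backbone"), Module("final_exit"),
                         [Module("exit_1"), Module("exit_2")])
stage_one(net)
print(stage_two(net))
```

Because the backbone and final exit are frozen in the second phase, the early exits can be trained independently of one another without disturbing the final prediction.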
[0194] Device 100 can be any of a smartphone, tablet, laptop, computer or computing device, virtual assistant device, vehicle, drone, autonomous vehicle, robot or robotic device, robotic assistant, image capture system or device, augmented reality system or device, virtual reality system or device, gaming system, Internet of Things device, or smart consumer device (such as a smart refrigerator). It should be understood that this is a non-exhaustive and non-limiting list of example devices.
[0195] The apparatus 100 includes an instance of a trained machine learning ML model 106 for performing semantic image segmentation.
[0196] The device includes at least one processor 102 coupled to memory 104. The at least one processor 102 may include one or more of a microprocessor, a microcontroller, and an integrated circuit. Memory 104 may include volatile memory, such as random access memory (RAM) used as temporary memory, and / or non-volatile memory, such as flash memory, read-only memory (ROM), or electrically erasable programmable ROM (EEPROM) for storing, for example, data, programs, or instructions.
[0197] At least one processor 102 may be arranged to: receive an image to be processed by a trained ML model 106; and perform image segmentation on the received image using an instance of the trained ML model 106.
[0198] The apparatus may also include at least one image capture device 108 for capturing images or videos to be processed by the ML model.
[0199] The device may also include at least one interface 110 for providing the processing results of the ML model to the user of the device. For example, the device 100 may include a display screen to receive user input and to display the results of implementing the ML model 106 (such as in the example shown in Figure 7).
[0200] Figure 12 illustrates the entire process of generating a "train once, deploy anywhere" ML model. The process begins with a segmentation network, which includes a backbone feature extractor followed by a final segmentation head. The backbone is analyzed / benchmarked to determine a fixed-granularity set of candidate exit points along its depth, chosen so that they are distributed evenly in terms of workload / latency. An over-provisioned network is defined by creating multiple distinct early exit architectures (neural network layers followed by a segmentation head) and appending all of them to every candidate exit point. The set of all possible early exit architectures appended to each exit point is carefully designed and is termed the search space.
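As a minimal sketch of how candidate exit points might be distributed evenly by workload, the following places each point at an equal fraction of the cumulative per-layer cost. The function name `candidate_exit_points` and the per-layer cost figures are hypothetical; the description does not specify the benchmarking procedure, so cumulative cost is used here as an assumed proxy for workload / latency.

```python
from itertools import accumulate


def candidate_exit_points(layer_costs, num_points):
    """Return indices of layers after which a candidate exit is attached.

    Points sit at equal fractions of the cumulative cost, so consecutive
    points are separated by roughly the same amount of work.
    """
    total = sum(layer_costs)
    cumulative = list(accumulate(layer_costs))
    points = []
    for k in range(1, num_points + 1):
        target = total * k / (num_points + 1)
        # First layer whose cumulative cost reaches the target fraction.
        idx = next(i for i, c in enumerate(cumulative) if c >= target)
        if idx not in points:
            points.append(idx)
    return points


costs = [1, 1, 2, 4, 4, 2, 1, 1]  # hypothetical per-layer cost (e.g. GFLOPs)
print(candidate_exit_points(costs, 3))
```

With non-uniform layer costs the points cluster around the expensive layers rather than being spaced evenly by layer index.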
[0201] Given a segmentation dataset, the over-provisioned network (the backbone and all candidate exits at all exit points) is trained using the two-stage method described above, summarized here. First, the backbone network is trained in an exit-aware manner, considering only a single uniform architecture from the search space at each exit point and, in each iteration, discarding all but one early exit (in a round-robin fashion), so that each iteration updates the weights of the backbone, the final exit, and the one early exit that is retained. Afterward, the backbone and final exit parameters are frozen, and all candidate exit architectures in the search space are trained at all candidate exit points of the over-provisioned network. For this second stage, the positive filtering distillation scheme described above is used: the prediction of the final exit is used to determine the difficulty of each pixel in the training image (based on whether the final exit predicts it correctly), and during the training of the early exits of the over-provisioned model, only easy pixels (those correctly classified by the final exit) are considered.
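The pixel filtering used in the second stage can be sketched as follows, on a flattened list of pixels with per-class probabilities. This is an illustration under stated assumptions: the names `pfd_mask` and `pfd_distillation_loss` are hypothetical, and the use of a KL divergence between teacher and student as the distillation term is an assumption, since the paragraph above only states that pixels misclassified by the final exit are excluded.

```python
import math


def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)


def pfd_mask(teacher_probs, labels):
    """1 for pixels the final exit (teacher) classifies correctly, else 0."""
    return [int(argmax(p) == y) for p, y in zip(teacher_probs, labels)]


def pfd_distillation_loss(student_probs, teacher_probs, labels, eps=1e-12):
    """Mean KL(teacher || student) over the pixels kept by the mask."""
    mask = pfd_mask(teacher_probs, labels)
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = 0.0
    for s, t, m in zip(student_probs, teacher_probs, mask):
        if m:  # pixels the teacher got wrong contribute nothing
            total += sum(tc * math.log((tc + eps) / (sc + eps))
                         for tc, sc in zip(t, s))
    return total / kept


# Three pixels, two classes; the teacher is wrong on the last pixel.
teacher = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
student = [[0.5, 0.5], [0.2, 0.8], [0.1, 0.9]]
print(pfd_mask(teacher, labels))
print(pfd_distillation_loss(student, teacher, labels))
```

Because the third pixel is masked out, the student's output there has no effect on the loss, so an early exit is not pushed to imitate the final exit where the final exit itself is wrong.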
[0202] After training, various variants of the MESS network can be deployed by selecting some of the trained early exit architectures from the over-provisioned model. At most one exit architecture is chosen for each candidate exit point, and a given MESS instance need not have an exit attached at every exit point. This selection process is called architecture configuration search. The selection process does not involve any training and is performed on the server side before deploying the MESS network for a given application.
[0203] For the search, the user provides constraints or requirements such as latency, accuracy, energy, and memory (or combinations thereof) to the search algorithm, which jointly optimizes the number, location, and architecture of the exits (drawn from the search space) to best meet the user's specifications.
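A minimal sketch of such a search, assuming an exhaustive enumeration over a very small search space, is given below. The function `search`, the lookup table `TABLE`, and all latency and accuracy figures are hypothetical placeholders; the description does not state which search algorithm is used, and a practical search space would likely require something more efficient than brute force.

```python
from itertools import product


def search(options_per_point, evaluate, constraints):
    """Pick at most one exit architecture per exit point.

    options_per_point: one list of architecture names per exit point.
    evaluate: callable mapping a configuration to a dict of metrics.
    constraints: metric name -> upper bound (e.g. {"latency_ms": 25}).
    Returns the feasible configuration with the highest accuracy, or None.
    """
    best, best_acc = None, float("-inf")
    choices = [[None] + list(opts) for opts in options_per_point]
    for config in product(*choices):
        if all(arch is None for arch in config):
            continue  # a deployable model needs at least one exit
        metrics = evaluate(config)
        if any(metrics[k] > bound for k, bound in constraints.items()):
            continue  # violates a user constraint
        if metrics["accuracy"] > best_acc:
            best, best_acc = config, metrics["accuracy"]
    return best


# Hypothetical measurements: (exit point, architecture) -> (latency, accuracy)
TABLE = {(0, "small"): (10, 0.60), (0, "large"): (14, 0.64),
         (1, "small"): (20, 0.70), (1, "large"): (26, 0.74)}
OPTIONS = [["small", "large"], ["small", "large"]]


def evaluate(config):
    # Toy model: cost and quality are those of the deepest selected exit.
    deepest = max(i for i, arch in enumerate(config) if arch is not None)
    latency, accuracy = TABLE[(deepest, config[deepest])]
    return {"latency_ms": latency, "accuracy": accuracy}


print(search(OPTIONS, evaluate, {"latency_ms": 25}))
```

No training takes place inside the loop: every candidate is only measured, which is what allows the search to be repeated cheaply for a new device or a new constraint.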
[0204] The search also reveals the inference settings the model needs to follow during deployment. This means determining how the deployed MESS instance processes the input image at runtime. The MESS network supports three inference settings:
[0205] (a) Budgeted inference: During the search, a single early exit (point and architecture) of the over-provisioned network is selected to form a sub-model (including the portion of the backbone up to and including the single early exit architecture). During inference, all samples are processed by this sub-model, deterministically satisfying requirements such as workload, memory, size, etc.
[0206] (b) On-the-fly inference: This method uses multiple early exits (one for each of the multiple exit points). During inference, samples are processed sequentially by each of the selected early exits, each providing a progressively improving / enhanced segmentation prediction over time. Other components of the system or the user can benefit from these early predictions at runtime.
[0207] (c) Input-dependent inference: Similar to on-the-fly inference, but each exit includes a confidence evaluation unit after its output. The network's exits process each sample sequentially, and after each prediction, the confidence evaluator determines whether the current image needs further processing (via subsequent exits) or whether computation can terminate because a prediction with sufficient confidence has already been provided at the image level (not per pixel). The confidence evaluator attempts to capture the notion of image segmentation difficulty, allowing easy samples to exit early, so that an appropriate amount of computation is performed on each input sample at runtime.
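The control flow of the three inference settings can be sketched as follows. This is illustrative only: each exit is represented by a hypothetical stand-in callable, whereas in a real MESS instance the exits would share the backbone computation rather than reprocess the image, and the confidence function and thresholds shown are placeholders.

```python
def budgeted(image, submodel):
    """(a) Every sample goes through the single selected sub-model."""
    return submodel(image)


def on_the_fly(image, exits):
    """(b) Yield one prediction per exit, shallowest first."""
    for exit_fn in exits:
        yield exit_fn(image)


def input_dependent(image, exits, confidence, thresholds):
    """(c) Stop at the first exit whose image-level confidence suffices."""
    for index, (exit_fn, thr) in enumerate(zip(exits, thresholds)):
        prediction = exit_fn(image)
        if confidence(prediction) >= thr:
            return index, prediction
    return index, prediction  # no exit was confident: keep the deepest one


# Stand-in exits returning (segmentation, image-level confidence).
exits = [lambda img: ("coarse", 0.5),
         lambda img: ("medium", 0.8),
         lambda img: ("fine", 0.95)]
confidence = lambda pred: pred[1]

print(budgeted("image", exits[1]))
print(list(on_the_fly("image", exits)))
print(input_dependent("image", exits, confidence, [0.7, 0.7, 0.0]))
```

Writing the on-the-fly case as a generator reflects that other components can consume each intermediate prediction as soon as it becomes available.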
[0208] For input-dependent inference, a novel confidence metric suitable for semantic segmentation is proposed. Instead of simply averaging the per-pixel confidence values provided by the network to obtain a per-image confidence value for each prediction, the confidence metric considers the percentage of pixels in the image whose confidence exceeds a given level. Furthermore, the contribution of each pixel is weighted differently in this metric; that is, pixels that are naturally overconfident and close to semantic edges are down-weighted. For the input-dependent inference case, the exit strategy (a threshold for each exit) is jointly optimized by the search algorithm along with the number, location, and configuration of exits.
[0209] When a new deployment scenario arises (e.g., deploying the model to a different device, or for an application with stricter latency constraints), only a new search on top of the already trained, over-provisioned model is required, thus avoiding the need to retrain the model. This makes the proposed method a "train once, deploy anywhere" approach for semantic segmentation. Training and searching occur on the server side, while deployment can be targeted at different devices with varying computing capabilities.
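A minimal sketch of an image-level confidence of this kind is given below. Several details are assumptions made for illustration and are not stated in the description: a pixel is treated as near a semantic edge when a 4-connected neighbour has a different predicted label, edge pixels receive a fixed reduced weight `edge_weight`, and the function names and default values are hypothetical.

```python
def near_edge(labels, r, c):
    """True if a 4-connected neighbour has a different predicted label."""
    h, w = len(labels), len(labels[0])
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] != labels[r][c]:
            return True
    return False


def image_confidence(conf_map, labels, pixel_thr=0.9, edge_weight=0.5):
    """Weighted fraction of pixels whose confidence reaches pixel_thr."""
    num = den = 0.0
    for r, row in enumerate(conf_map):
        for c, conf in enumerate(row):
            w = edge_weight if near_edge(labels, r, c) else 1.0
            num += w * (conf >= pixel_thr)
            den += w
    return num / den


labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
conf = [[0.95, 0.95, 0.95, 0.95],
        [0.95, 0.50, 0.95, 0.95]]
print(image_confidence(conf, labels))                    # edge pixels count half
print(image_confidence(conf, labels, edge_weight=1.0))   # plain percentage
```

The resulting value would then be compared with the per-exit threshold to decide whether a sample exits early.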
[0210] Those skilled in the art will understand that, although the foregoing description has described what is considered the best mode and other suitable modes for performing this technology, the technology should not be limited to the specific configurations and methods disclosed in the description of the preferred embodiments. Those skilled in the art will recognize that the technology has broad applications and that extensive modifications can be made to the embodiments without departing from any inventive concept defined in the appended claims.
Claims
1. A computer-implemented method for generating a machine learning (ML) model for semantic image segmentation, the method comprising: providing a backbone feature extraction network of the ML model with multiple early exits in the backbone network to generate an over-provisioned network that includes multiple candidate early exit segmentation network architectures, wherein the over-provisioned network represents the set of all possible early exit segmentation network architectures; obtaining a training dataset that includes multiple images; and training the backbone network, a final exit, and the early exits of the ML model through the following steps to output feature maps of the multiple images input into the backbone network: during a first training phase, training the backbone network, the final exit, and the early exits end-to-end; and after the end-to-end training is completed, freezing the weights of the backbone network and the final exit, and during a second training phase, training the early exits individually using the final exit as a teacher for the early exits, wherein the second training phase includes: determining the difficulty of each pixel in an image, using the segmentation prediction made for the image by the final exit, based on whether the prediction for each pixel is correct; and training the early exits using only the pixels for which the prediction is correct.
2. The method of claim 1, wherein the first training phase comprises iteratively training the backbone network and the early exits, wherein during each iteration, the training comprises: selecting one of the multiple early exits to update; discarding the remaining early exits; and training the backbone network and the selected early exit, and updating the weights of the backbone network and the selected early exit.
3. The method of claim 2, wherein for each selected early exit, during each iteration of training the selected early exit, the remaining early exits are sequentially discarded.
4. The method of claim 1, further comprising performing an architecture configuration search to identify an architecture suitable for a particular application from the plurality of candidate early exit segmentation network architectures.
5. The method of claim 4, further comprising: receiving hardware constraints and / or inference performance requirements; receiving an inference setting for a specific device or device class that will be used to process input images during inference; and performing the architecture configuration search using the received hardware constraints and / or inference performance requirements, as well as the received inference setting.
6. The method of claim 5, wherein the received inference setting is a budgeted inference setting, and wherein the architecture configuration search output includes an architecture of a backbone feature extraction network and a single early exit.
7. The method of claim 5, wherein the received inference setting is an on-the-fly inference setting, and wherein the architecture configuration search output includes an architecture of a backbone feature extraction network and multiple early exits.
8. The method of claim 5, wherein the received inference setting is an input-dependent inference setting, wherein the architecture configuration search output includes an architecture of a backbone feature extraction network and multiple early exits, and wherein the architecture includes a confidence evaluation unit associated with each early exit to evaluate the confidence of predictions made by each early exit at inference time.
9. The method of claim 8, wherein the confidence evaluation unit for each early exit is configured to: calculate a confidence value, for the entire image, of the image segmentation prediction made by the associated early exit; determine whether the confidence value is greater than or equal to a threshold confidence value; and when the confidence value is lower than the threshold confidence value, instruct the processing to continue to a subsequent early exit, or when the confidence value is greater than or equal to the threshold confidence value, instruct the processing to terminate.
10. The method of claim 9, wherein calculating the confidence value for the entire image comprises: obtaining a per-pixel confidence map that includes the confidence value of each pixel in the image; identifying pixels located near the semantic edges of predicted objects; and outputting the percentage of pixels in the image whose per-pixel confidence value is greater than or equal to the threshold confidence value, wherein the contribution of the identified pixels is down-weighted.
11. The method of claim 10, wherein during the architecture configuration search, the threshold confidence value is optimized for each early exit.