Portrait segmentation model training method and related device
A portrait segmentation model comprising a data preprocessing module, a four-input encoding module, and a decoding module is trained with a combined loss function and an online hard example mining strategy. This addresses the inaccurate portrait segmentation of existing technologies, yields a high-precision segmentation of the main target, and improves the effect of virtual try-on.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN SHULIAN TIANXIA INTELLIGENT TECH CO LTD
- Filing Date
- 2023-01-31
- Publication Date
- 2026-06-23
AI Technical Summary
When segmenting human figures in images, existing technologies cannot accurately preserve the main subject and often retain similar-colored objects, resulting in poor virtual try-on effects.
The portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module. It is trained with a mean squared error loss function and a cross-entropy loss function, combined with an online hard example mining strategy. Human feature information is extracted and fused to obtain a high-precision segmentation map of the main target.
The method improves the stability and accuracy of portrait segmentation, accurately preserves the main target in the image, removes other human figures and leftover clutter, and enhances the effect of virtual try-on.
Figure CN116229066B_ABST
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a training method and related apparatus for a human portrait segmentation model.
Background Technology
[0002] With consumers' growing demand for online virtual fitting, virtual fitting has become an important direction in the field of modern computer vision. Currently, offline smart fitting mostly uses interactive fitting mirror devices, where shoppers stand in front of the mirror to select clothing to try on, and the mirror displays the overall outfit based on the user's basic features and the style of the clothing. Online fitting, on the other hand, generally involves taking a picture of the user, selecting target clothing provided by the system, and then automatically replacing the selected garment.
[0003] Whether it's a smart fitting mirror or online virtual try-on, most existing technologies reshape the user's image by collecting human body data and creating 3D models. However, in real-world photography, multiple human figures often appear in the image, but only the main subject needs to be captured. The presence of other human figures prevents virtual try-on from being completed.
[0004] Existing technologies typically employ segmentation algorithms based on pixel difference binarization or generative models based on convolutional networks to segment images and extract the main subject. However, both of these methods segment all foreground objects, leaving behind human figures or other objects of similar colors, and fail to accurately preserve the main subject in the image.
Summary of the Invention
[0005] The embodiments of this application aim to provide a training method and related apparatus for a portrait segmentation model, so as to improve the stability and accuracy of portrait segmentation.
[0006] The embodiments of this application provide the following technical solutions:
[0007] In a first aspect, embodiments of this application provide a method for training a portrait segmentation model, the portrait segmentation model including a data preprocessing module, a four-input encoding module, and a decoding module, the method comprising:
[0008] Obtain an image dataset, which includes person images containing multiple targets;
[0009] Based on the data preprocessing module, data preprocessing is performed on each person image to obtain the human body feature information image corresponding to each person image.
[0010] Based on the four-input encoding module, feature extraction and fusion are performed on human feature information images to obtain feature images;
[0011] Based on the decoding module, the feature image is decoded to obtain a high-precision segmentation map of the main target;
[0012] A loss function is constructed, and the human image segmentation model is trained based on the loss function and the online hard example mining strategy until the loss function converges. The loss function includes the mean squared error loss function and the cross-entropy loss function.
[0013] In some embodiments, the data preprocessing module includes a high-precision human body segmentation network and a human body instance segmentation network;
[0014] Human feature information images include high-precision human body segmentation maps and human body instance segmentation maps;
[0015] Based on the data preprocessing module, each person image is preprocessed to obtain the corresponding human feature information image, including:
[0016] Based on the high-precision human body segmentation network, the human contour features of each human image are extracted to obtain the high-precision human body segmentation map corresponding to each human image.
[0017] The human instance segmentation network is used to segment each human high-precision segmentation map to obtain the human instance segmentation map corresponding to each human high-precision segmentation map.
[0018] In some embodiments, the human feature information image also includes human key point images;
[0019] Based on the data preprocessing module, data preprocessing is performed on each person image to obtain the human feature information image corresponding to each person image, and it also includes:
[0020] A human pose estimation algorithm is applied to each person image to obtain the human key point image corresponding to each person image.
[0021] In some embodiments, based on a four-input encoding module, feature extraction and fusion are performed on the human feature information image to obtain a feature image, including:
[0022] Based on the four-input encoding module, feature extraction and fusion are performed on each person image and the corresponding human key point image, human high-precision segmentation map, and human instance segmentation map to obtain a feature image. The four-input encoding module includes a dilated convolutional layer, a convolutional layer, an activation function layer, and a batch normalization layer.
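The four inputs named above can be pictured as one multi-channel tensor before any convolution is applied. The sketch below assumes, purely for illustration, that the person image, key point image, high-precision segmentation map, and instance segmentation map are stacked along the channel axis; the source does not state how the encoder combines them, and the function name `stack_inputs` is hypothetical.

```python
from typing import List

# An image is represented as channels x height x width nested lists.
Image = List[List[List[float]]]


def stack_inputs(person: Image, keypoints: Image,
                 fine_mask: Image, instance_mask: Image) -> Image:
    """Concatenate the four inputs along the channel axis.

    Channel-wise stacking is an assumption made for illustration only;
    the patent text does not specify the fusion operation.
    """
    inputs = [person, keypoints, fine_mask, instance_mask]
    height, width = len(person[0]), len(person[0][0])
    for image in inputs:
        for channel in image:
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError("all inputs must share the same height and width")
    # Flatten the list of images into one list of channels, keeping input order.
    return [channel for image in inputs for channel in image]
```

With a 3-channel person image and three single-channel maps, the result has 6 channels, which a first convolutional layer could then consume.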
[0023] In some embodiments, the four-input encoding module includes:
[0024]
[0025] where $F_i^{x+1}$ denotes the i-th feature image of the (x+1)-th layer; BN denotes batch normalization; $F_i^{x}$ denotes the i-th feature image of the x-th layer; $G_i^{x+1}$ denotes the dilated convolution kernel of the (x+1)-th layer; and the remaining terms denote the Swish activation function, the ReLU activation function, the standard convolution kernel of the (x+1)-th layer, and the bias term.
[0026] In some embodiments, the decoding module includes a transposed convolutional layer, an activation function layer, and a batch normalization layer;
[0027] Loss functions include:
[0028] $Loss = L_{MSE} + \lambda_1 L_{BCE}$
[0029] where $Loss$ denotes the loss function, $L_{MSE}$ denotes the mean squared error loss function, $\lambda_1$ denotes a hyperparameter, and $L_{BCE}$ denotes the cross-entropy loss function.
[0030] In some embodiments, the mean squared error loss function includes:
[0031]
[0032] where $L_{MSE}$ denotes the mean squared error loss function, $x$ denotes the true sample and its counterpart in the formula denotes the predicted sample, $\lambda_{OHEM}$ denotes the coefficient of the online hard example mining strategy, and $n$ denotes the number of samples.
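For reference, a conventional mean squared error weighted by the online hard example mining coefficient would take the form below. This is a sketch under two assumptions that are not confirmed by the source: that $\lambda_{OHEM}$ acts as a multiplicative weight, and that the predicted sample is written $\hat{x}_i$ (a notation chosen here).

```latex
L_{MSE} = \lambda_{OHEM} \cdot \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^{2}
```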
[0033] In some embodiments, the cross-entropy loss function includes:
[0034]
[0035] where $L_{BCE}$ denotes the cross-entropy loss function, $n$ denotes the number of samples, $y_i$ denotes the classification label of the i-th true sample, and $p_i$ denotes the classification label of the i-th predicted sample.
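The combined loss can be sketched in a few lines of plain Python. The sketch assumes the standard binary form of cross-entropy (suggested by the subscript BCE), treats the online hard example mining coefficient as a multiplicative weight on the mean squared error, and clamps probabilities with a small `eps` for numerical safety. The function names, the default values, and `eps` are illustrative assumptions, not taken from the source.

```python
import math
from typing import Sequence


def mse_loss(true: Sequence[float], pred: Sequence[float],
             ohem_coeff: float = 1.0) -> float:
    """Mean squared error, scaled by an assumed multiplicative OHEM coefficient."""
    n = len(true)
    return ohem_coeff * sum((x - p) ** 2 for x, p in zip(true, pred)) / n


def bce_loss(labels: Sequence[float], probs: Sequence[float],
             eps: float = 1e-7) -> float:
    """Standard binary cross-entropy averaged over n samples."""
    n = len(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


def total_loss(true: Sequence[float], pred: Sequence[float],
               labels: Sequence[float], probs: Sequence[float],
               lambda_1: float = 1.0, ohem_coeff: float = 1.0) -> float:
    """Loss = L_MSE + lambda_1 * L_BCE."""
    return mse_loss(true, pred, ohem_coeff) + lambda_1 * bce_loss(labels, probs)
```

The hyperparameter `lambda_1` balances the two terms: a larger value makes misclassified pixels count more relative to the regression error.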
[0036] In some embodiments, the human image segmentation model is trained based on a loss function and an online hard example mining strategy until the loss function converges, including:
[0037] Acquire samples for each batch, including images of people;
[0038] Calculate the loss function value for each sample in the current batch;
[0039] Based on the loss function values, all samples in the current batch are sorted to identify the hard examples in the current batch;
[0040] The portrait segmentation model is trained on the hard examples;
[0041] Save the loss function value obtained in each training iteration;
[0042] When the number of iterations exceeds the preset number, the loss function value obtained from the current training is sorted in ascending order with the saved loss function values.
[0043] If the ranking of the loss function value obtained from the current training is greater than the preset sequence value, then the loss function converges.
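The batch-level steps above can be sketched as two small helpers. This is a minimal illustration, assuming that "hard examples" are the samples with the largest loss in the batch and that the rank of the current loss is its 1-based position in the ascending order of the saved losses. The names `select_hard_examples`, `has_converged`, `keep`, `min_iterations`, and `rank_threshold` are hypothetical.

```python
from typing import List, Sequence


def select_hard_examples(losses: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` samples with the largest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:keep]


def has_converged(current_loss: float, saved_losses: Sequence[float],
                  iteration: int, min_iterations: int,
                  rank_threshold: int) -> bool:
    """Rank-based convergence check.

    Once the iteration count exceeds `min_iterations`, the current loss is
    ranked among the saved losses in ascending order. A rank above
    `rank_threshold` means the current loss is no longer among the smallest
    values seen, which is taken as the sign that the loss has stopped improving.
    """
    if iteration <= min_iterations:
        return False
    rank = 1 + sum(1 for value in saved_losses if value < current_loss)
    return rank > rank_threshold
```

In a training loop, the model would be updated only on the indices returned by `select_hard_examples`, and the loop would stop once `has_converged` returns `True`.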
[0044] In a second aspect, embodiments of this application provide a method for segmenting a person image, including:
[0045] Obtain the person image to be segmented;
[0046] The person image to be segmented is input into the portrait segmentation model to obtain a high-precision segmentation map of the main target corresponding to the person image to be segmented. The portrait segmentation model is trained using the training method for a portrait segmentation model of the first aspect.
[0047] In a third aspect, embodiments of this application provide an electronic device, including:
[0048] At least one processor, and
[0049] A memory that is communicatively connected to at least one processor, wherein,
[0050] The memory stores instructions executable by the at least one processor, such that the at least one processor can perform the training method for a portrait segmentation model of the first aspect or the method for segmenting a person image of the second aspect.
[0051] In a fourth aspect, embodiments of this application provide a non-volatile computer-readable storage medium storing computer-executable instructions for causing an electronic device to execute the training method for a portrait segmentation model of the first aspect or the method for segmenting a person image of the second aspect.
[0052] The beneficial effects of this application's embodiments are as follows: Unlike existing technologies, this application provides a training method for a portrait segmentation model. This portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module. The method includes: acquiring an image dataset, wherein the image dataset includes multiple images of people; performing data preprocessing on each person image based on the data preprocessing module to obtain a human feature information image corresponding to each person image; performing feature extraction and fusion on the human feature information image based on the four-input encoding module to obtain a feature image; decoding the feature image based on the decoding module to obtain a high-precision segmentation map of the main target; constructing a loss function; and training the portrait segmentation model based on the loss function and an online hard example mining strategy until the loss function converges. The loss function includes a mean squared error loss function and a cross-entropy loss function.
[0053] The human image segmentation model is trained using a loss function and an online hard example mining strategy. The loss function includes a mean squared error loss function and a cross-entropy loss function. This enables the four-input encoding module to extract and fuse human feature information images, and then apply the obtained feature images to the decoding module to obtain a high-precision segmentation map of the main target. This allows the present application to improve the stability and accuracy of human image segmentation.
Attached Figure Description
[0054] One or more embodiments are illustrated by way of example with reference numerals in the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings are denoted as similar elements. Unless otherwise stated, the figures in the drawings are not to be limited by scale.
[0055] Figure 1 is a schematic diagram of an application environment provided in an embodiment of this application;
[0056] Figure 2 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;
[0057] Figure 3 is a flowchart of a training method for a portrait segmentation model provided in an embodiment of this application;
[0058] Figure 4 is a detailed flowchart of step S302 in Figure 3;
[0059] Figure 5 is another detailed flowchart of step S302 in Figure 3;
[0060] Figure 6 is a detailed flowchart of step S303 in Figure 3;
[0061] Figure 7 is a schematic diagram illustrating how a high-precision segmentation map of the main target is obtained according to an embodiment of this application;
[0062] Figure 8 is a detailed flowchart of step S305 in Figure 3;
[0063] Figure 9 is a flowchart of a method for segmenting a person image provided in an embodiment of this application;
[0064] Figure 10 is a schematic diagram of the structure of a training device for a portrait segmentation model provided in an embodiment of this application.
Detailed Implementation
[0065] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application. These all fall within the protection scope of the present application.
[0066] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0067] It should be noted that, unless there is a conflict, the various features in the embodiments of this application can be combined with each other, all of which are within the protection scope of this application. Furthermore, although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in a different order than the module division in the device or the order in the flowchart. In addition, the terms "first," "second," and "third" used herein do not limit the data or execution order, but only distinguish identical or similar items with essentially the same function and effect.
[0068] Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The term "and / or" as used in this specification includes any and all combinations of one or more of the associated listed items.
[0069] Furthermore, the technical features involved in the various embodiments of this application described below can be combined with each other as long as they do not conflict with each other.
[0070] Before introducing the embodiments of this application, a brief introduction will be given to the portrait segmentation methods known to the inventors of this application, so as to facilitate the understanding of the embodiments of this application later.
[0071] Some portrait segmentation methods employ pixel difference-based binarization: the image is divided into background, undetermined, and foreground regions based on the portrait segmentation model's output; color gradient analysis is used in the undetermined region to determine whether each pixel belongs to the foreground or background; the image is enlarged to match the original image size, and then binarized and filtered to obtain the processed image data.
[0072] Some portrait segmentation methods employ generative models based on convolutional networks to segment human figures in images.
[0073] However, the techniques known to the inventors of this application, whether based on pixel difference binarization or on convolutional generative models, retain the contours of all human figures and leave behind clutter of similar color. The main subject therefore cannot be accurately preserved and the virtual try-on operation cannot be completed, which limits the practicality of these techniques.
[0074] To address the aforementioned problems, this application provides a training method for a portrait segmentation model. The portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module. The method includes: acquiring an image dataset, wherein the image dataset includes images of multiple targets; performing data preprocessing on each portrait image based on the data preprocessing module to obtain a human feature information image corresponding to each portrait image; performing feature extraction and fusion on the human feature information image based on the four-input encoding module to obtain a feature image; decoding the feature image based on the decoding module to obtain a high-precision segmentation map of the main target; constructing a loss function; and training the portrait segmentation model based on the loss function and an online hard example mining strategy until the loss function converges, wherein the loss function includes a mean squared error loss function and a cross-entropy loss function.
[0075] The human image segmentation model is trained using a loss function and an online hard example mining strategy. The loss function includes a mean squared error loss function and a cross-entropy loss function. This allows the four-input encoding module to extract and fuse human feature information images, and then apply the resulting feature images to the decoding module to obtain a high-precision segmentation map of the main target. This enables the present application to accurately retain the main target in the image and remove the remaining human body and debris, thereby improving the stability and accuracy of human image segmentation.
[0076] In the embodiments of this application, the training method of the portrait segmentation model and the segmentation method of the person image can be executed by an electronic device with computing processing capabilities. The following describes an exemplary application of the electronic device provided in the embodiments of this application for training the portrait segmentation model or for segmenting the person image. It is understood that the electronic device can both train the portrait segmentation model and use the portrait segmentation model to segment the person image.
[0077] In this embodiment, the electronic device can be a server, such as a server deployed in the cloud. When the server is used to train a portrait segmentation model, it constructs the portrait segmentation model based on image datasets, data preprocessing modules, four-input encoding modules, and decoding modules provided by other devices or those skilled in the art. The model is then iteratively trained using a loss function and an online hard example mining strategy to determine the final model parameters. When the server is used to segment human images, it invokes the built-in portrait segmentation model to process the human images to be segmented provided by other devices or users, obtaining a high-precision segmentation map of the main target corresponding to the human image to be segmented.
[0078] In this embodiment, the electronic device can also be various types of terminals such as laptops, desktop computers, or mobile devices. When the terminal is used to train a portrait segmentation model, those skilled in the art input a prepared image dataset into the terminal, design a data preprocessing module, a four-input encoding module, and a decoding module on the terminal, and construct a loss function so that the terminal uses the loss function and an online hard example mining strategy to iteratively train the portrait segmentation model to determine the final model parameters. When the terminal is used to segment human images, it calls the built-in portrait segmentation model to process the user-input human image to be segmented, obtaining a high-precision segmentation map of the main target corresponding to the human image to be segmented.
[0079] Before providing a detailed description of this application, the nouns and terms used in the embodiments of this application are explained, and the nouns and terms used in the embodiments of this application shall be interpreted as follows:
[0080] (1) A neural network (NN), also known as a connectionist model, is an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks to perform distributed parallel information processing. A neural network relies on the complexity of the system and processes information by adjusting the interconnections between a large number of internal nodes. Specifically, a neural network is composed of neural units and can be understood as having an input layer, hidden layers, and an output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the intermediate layers are hidden layers. Neural networks with many hidden layers are called deep neural networks (DNNs). The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). From a physical perspective, the work of each layer can be understood as transforming the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space: 1. Dimensional increase / decrease; 2. Magnification / reduction; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are performed by "W·x", operation 4 by "+b", and operation 5 by "a()". The term "space" is used here because the objects being classified are not individual things, but a class of things; space refers to the set of all individuals within that class. W is the weight matrix of each layer in the neural network, where each value represents the weight of a neuron in that layer. This matrix W determines the spatial transformation from the input space to the output space, meaning that the W of each layer controls how the space is transformed. The purpose of training the neural network is to ultimately obtain the weight matrices of all layers in the trained neural network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, or more specifically, learning the weight matrices.
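The expression y = a(W·x + b) can be made concrete with a tiny fully connected layer written in plain Python. This is a minimal sketch; the function names and the choice of ReLU as the activation a() are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def relu(value: float) -> float:
    """The "bending" operation a(): negative values are clipped to zero."""
    return max(0.0, value)


def dense_layer(weights: Sequence[Sequence[float]], bias: Sequence[float],
                x: Sequence[float],
                activation: Callable[[float], float] = relu) -> List[float]:
    """Compute y = a(W.x + b) for one layer.

    Each row of `weights` holds the weights of one output neuron.
    """
    outputs = []
    for row, b in zip(weights, bias):
        z = sum(w * xi for w, xi in zip(row, x)) + b  # W.x scales/rotates, +b translates
        outputs.append(activation(z))                 # a() bends the space
    return outputs
```

For example, with W = [[1, -1], [0.5, 0.5]], b = [0, -2], and x = [2, 1], the pre-activations are 1.0 and -0.5, so the layer outputs [1.0, 0.0].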
[0081] It should be noted that, in the embodiments of this application, the models used for machine learning tasks are essentially neural networks. Common components in neural networks include convolutional layers, dilated convolutional layers, activation function layers, and batch normalization layers. A model is designed by assembling these common components. When the model parameters (the weight matrices of each layer) are determined such that the model error meets a preset condition, or the number of parameter adjustments reaches a preset threshold, the model converges.
[0082] The convolutional layer is configured with multiple convolutional kernels, each with a corresponding stride, to perform convolution operations on the image. The purpose of convolution is to extract different features from the input image. The first convolutional layer may only extract some low-level features such as edges, lines, and corners, while deeper convolutional layers can iteratively extract more complex features from low-level features.
[0083] Dilated convolutional layers increase the receptive field of a model by injecting holes into standard convolutional kernels. Dilated convolutional layers are configured with dilated kernels, ensuring that the number of parameters in the convolutional neural network does not increase while exponentially expanding the receptive field. Compared to convolutional layers, dilated convolutional layers have one more parameter: the dilation coefficient, also known as the dilation rate, which defines the spacing between the values in the dilated kernel.
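The effect of the dilation rate is easiest to see in one dimension. The sketch below is a minimal, unpadded, stride-1 example (computed as a sliding dot product, as deep learning frameworks commonly do); a dilation of 1 reduces to a standard convolution. The function names are illustrative assumptions.

```python
from typing import List, Sequence


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int = 1) -> List[float]:
    """Slide a dilated kernel over the signal (no padding, stride 1)."""
    span = receptive_field(len(kernel), dilation)
    if len(signal) < span:
        return []
    return [
        sum(kernel[k] * signal[start + k * dilation] for k in range(len(kernel)))
        for start in range(len(signal) - span + 1)
    ]
```

A 3-tap kernel with dilation 2 still has three weights but covers five input positions, which is how the receptive field grows without adding parameters.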
[0084] Activation function layers are used to allow each neuron in a neural network to accept the output value of the previous layer as its input value and pass the processing result to the next layer. Commonly used activation functions include, but are not limited to, the Rectified Linear Unit (ReLU) function, the Swish function, and the Parametric Rectified Linear Unit (PReLU) function.
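The three activation functions named above can be written directly from their usual definitions. This is a minimal sketch; the default values `beta=1.0` for Swish and `slope=0.25` for PReLU are illustrative assumptions (in practice the PReLU slope is a learned parameter).

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def prelu(x: float, slope: float = 0.25) -> float:
    """PReLU: x for x >= 0, slope * x otherwise."""
    return x if x >= 0 else slope * x
```

Unlike ReLU, Swish and PReLU let a small signal through for negative inputs, so gradients do not vanish entirely on that side.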
[0085] Batch Normalization (BN) layers are used to standardize the features of a certain layer in a network. Their purpose is to solve the problem of numerical instability in deep neural networks. Specifically, as the number of network layers increases, parameter updates during training can easily cause drastic changes in the feature outputs near the output layer, which is not conducive to training an effective neural network.
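As a concrete illustration, batch normalization of a single feature over a batch subtracts the batch mean, divides by the batch standard deviation, and then applies a scale and shift. The sketch below covers training-time normalization only and omits the running statistics used at inference; the parameter names `gamma`, `beta`, and `eps` follow common convention and are not taken from the source.

```python
import math
from typing import List, Sequence


def batch_norm(values: Sequence[float], gamma: float = 1.0,
               beta: float = 0.0, eps: float = 1e-5) -> List[float]:
    """Normalize one feature across a batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    scale = gamma / math.sqrt(variance + eps)  # eps guards against division by zero
    return [scale * (v - mean) + beta for v in values]
```

After normalization the feature has approximately zero mean and unit variance regardless of how the preceding layers' weights have drifted, which keeps the inputs of later layers in a stable range.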
[0086] (2) A loss function is a function that maps the values of a random event or its related random variables to non-negative real numbers to represent the "risk" or "loss" of that random event. In this context it is a non-negative real-valued function that quantifies the difference between the predicted label and the true label. In applications, the loss function usually serves as the learning criterion of an optimization problem, i.e., the model is solved and evaluated by minimizing the loss function; for example, it is used for parameter estimation in statistics and machine learning. During the training of a neural network, the output of the network should be as close as possible to the value that is actually to be predicted (the target value). The predicted value of the current network is therefore compared with the target value, and the weight matrix of each layer is updated based on the difference between the two (there is usually an initialization process before the first update, i.e., parameters are pre-configured for each layer of the neural network). For example, if the network's predicted value is too high, the weight matrices are adjusted to make it predict lower, and this adjustment continues until the neural network can predict the target value. It is therefore necessary to predefine how the difference between the predicted value and the target value is measured; this is the role of the loss function or objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training the neural network becomes the process of reducing this loss as much as possible.
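The compare-and-adjust loop described above can be illustrated with the smallest possible model, a single weight w fitted so that w * x approximates y. This is a sketch assuming plain gradient descent on the mean squared error; the function name, learning rate, and step count are illustrative assumptions and say nothing about the optimizer used in the embodiments.

```python
from typing import Sequence, Tuple


def train_weight(samples: Sequence[Tuple[float, float]], lr: float = 0.05,
                 steps: int = 100, w: float = 0.0) -> float:
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    n = len(samples)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / n
        w -= lr * grad  # prediction too high -> positive gradient -> w decreases
    return w
```

Starting from the initial value w = 0, each step moves the weight in the direction that lowers the loss; on the samples (1, 2), (2, 4), (3, 6) the weight approaches 2.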
[0087] The technical solution of this application will be described in detail below with reference to the accompanying drawings.
[0088] Please refer to Figure 1, which is a schematic diagram of an application environment provided in an embodiment of this application.
[0089] As shown in Figure 1, the application environment 100 includes an electronic device 10 and a server 20. The electronic device 10 is connected to the server 20 via network communication, wherein the network includes wired networks and / or wireless networks. It is understood that the network includes wireless networks such as 2G, 3G, 4G, 5G, wireless LAN, and Bluetooth, and may also include wired networks such as serial cables and network cables.
[0090] In this embodiment, the electronic device 10 is communicatively connected to the server 20 and is used to acquire image datasets, construct a data preprocessing module, a four-input encoding module, and a decoding module. For example, those skilled in the art can download images of people including multiple targets to the electronic device, and construct the data preprocessing module, the four-input encoding module, and the decoding module. It is understood that the electronic device 10 can also be used to acquire images of people to be segmented. For example, a user inputs an image of a person to be segmented through an input interface, and after input, the electronic device automatically acquires the image. Alternatively, the electronic device 10 may have a camera to capture images of people, or the electronic device 10 may store a library of images of people, from which the user can select images of people to be segmented. The electronic device in this embodiment includes, but is not limited to, various terminals with computing capabilities such as laptops, desktop computers, or mobile devices. Preferably, the electronic device is a smartphone.
[0091] In this embodiment, the server 20 is communicatively connected to the electronic device 10 and is used to train a portrait segmentation model, or to acquire a portrait image to be segmented input by the user on the electronic device 10, and call the built-in portrait segmentation model to process the portrait image to be segmented, obtaining a high-precision segmentation map of the main target corresponding to the portrait image to be segmented, and then sending the high-precision segmentation map of the main target to the electronic device 10. The number of servers 20 can also be multiple, and multiple servers can form a server cluster. For example, the server cluster includes: a first server, a second server, ..., an Nth server; or, the server cluster can be a cloud computing service center, which includes several servers. The servers in this embodiment include, but are not limited to: tower servers, rack servers, blade servers, and cloud servers. Preferably, the server is a cloud server (Elastic Compute Service, ECS).
[0092] It is understood that, in this embodiment of the application, the electronic device 10 is also used to display the high-precision segmentation map of the main target on its own display interface after receiving the high-precision segmentation map of the main target sent by the server, so as to inform the user, or to locally execute the training method of the portrait segmentation model or the segmentation method of the person image provided in this embodiment of the application.
[0093] Example 1
[0094] Please refer to Figure 2, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0095] As shown in Figure 2, the electronic device 200 includes one or more processors 201 and a memory 202. Figure 2 takes one processor 201 as an example.
[0096] The processor 201 and the memory 202 can be connected via a bus or other means. Figure 2 takes connection via a bus as an example.
[0097] Processor 201 provides computing and control capabilities to control electronic device 200 to perform corresponding tasks, such as controlling electronic device 200 to perform a training method for a portrait segmentation model in any of the following method embodiments, including: acquiring an image dataset, wherein the image dataset includes images of multiple targets; performing data preprocessing on each portrait image based on a data preprocessing module to obtain a human feature information image corresponding to each portrait image; performing feature extraction and fusion on the human feature information image based on a four-input encoding module to obtain a feature image; decoding the feature image based on a decoding module to obtain a high-precision segmentation map of the main target; constructing a loss function, and training the portrait segmentation model based on the loss function and an online hard example mining strategy until the loss function converges, wherein the loss function includes a mean squared error loss function and a cross-entropy loss function.
[0098] Alternatively, the electronic device 200 may execute the image segmentation method for a person in any of the following method embodiments, including: acquiring an image of a person to be segmented; inputting the image of a person to be segmented into a portrait segmentation model to obtain a high-precision segmentation map of the main target corresponding to the image of the person to be segmented, wherein the portrait segmentation model is trained based on the training method of the portrait segmentation model in any of the following embodiments.
[0099] Processor 201 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The aforementioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
[0100] The memory 202, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions / modules corresponding to the portrait segmentation model training method or the person image segmentation method in the embodiments of this application. The processor 201, by running the non-transitory software programs, instructions, and modules stored in the memory 202, can implement the portrait segmentation model training method or the person image segmentation method in any of the following method embodiments. Specifically, the memory 202 may include volatile memory (VM), such as random access memory (RAM); the memory 202 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, hard disk drive (HDD), solid-state drive (SSD), or other non-transitory solid-state storage devices; the memory 202 may also include combinations of the above types of memory.
[0101] Memory 202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 202 may optionally include memory remotely located relative to processor 201, and these remote memories may be connected to processor 201 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0102] One or more modules are stored in memory 202. When executed by the one or more processors 201, they perform the training method for the portrait segmentation model or the segmentation method for the person image in any of the following method embodiments, for example the steps shown in Figure 3 or the steps shown in Figure 10, both described below.
[0103] In the embodiments of this application, the electronic device 200 may also have wired or wireless network interfaces, input / output interfaces and other components to perform input and output. The electronic device 200 may also include other components for implementing device functions, which will not be described in detail here.
[0104] The following describes the training method of the portrait segmentation model provided in this application embodiment, with reference to exemplary applications and implementations of the electronic devices provided in the embodiments of this application.
[0105] Please refer to Figure 3, which is a flowchart of a training method for a portrait segmentation model provided in an embodiment of this application.
[0106] The training method of the portrait segmentation model is applied to electronic devices, such as terminals and servers. Specifically, the execution subject of the training method of the portrait segmentation model is one or at least two processors in the electronic device. The following uses a server as an example to illustrate the training method of the portrait segmentation model.
[0107] Specifically, the portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module.
[0108] As shown in Figure 3, the training method for the portrait segmentation model includes:
[0109] Step S301: Obtain the image dataset;
[0110] Specifically, the image dataset includes person images containing multiple targets, where each target is a person. At least 60% of the images in the image dataset contain at least two people, and 20% contain at least three people. Each image must be at least 1024*1024 in size.
[0111] Specifically, an image dataset consisting of multiple target images of people is downloaded from the Internet using an electronic device. The number of images in the image dataset can be in the tens of thousands, for example, 20,000, or the number of images can be determined by those skilled in the art based on the actual situation.
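The composition constraints on the dataset can be checked mechanically. The following is a minimal sketch, assuming each image has already been annotated with its pixel size and person count. The `ImageRecord` type and the `check_dataset` helper are hypothetical names introduced for illustration, and the 20% figure is treated here as a lower bound.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageRecord:
    """Hypothetical per-image metadata: pixel size and number of people."""
    width: int
    height: int
    num_people: int


def check_dataset(records: List[ImageRecord], min_side: int = 1024) -> bool:
    """Return True if the dataset meets the stated composition constraints."""
    if not records:
        return False
    total = len(records)
    # Every image must be at least min_side x min_side pixels.
    sizes_ok = all(r.width >= min_side and r.height >= min_side for r in records)
    # Share of images that contain two or more people (required: at least 60%).
    share_two_plus = sum(r.num_people >= 2 for r in records) / total
    # Share of images that contain three or more people (assumed lower bound: 20%).
    share_three_plus = sum(r.num_people >= 3 for r in records) / total
    return sizes_ok and share_two_plus >= 0.6 and share_three_plus >= 0.2
```

A dataset that fails the check can be rebalanced by collecting more multi-person images before training.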
[0112] Step S302: Based on the data preprocessing module, perform data preprocessing on each person image to obtain the human body feature information image corresponding to each person image;
[0113] Specifically, the data preprocessing module includes a high-precision human body segmentation network and a human body instance segmentation network, and the human body feature information image includes a high-precision human body segmentation map and a human body instance segmentation map.
[0114] Please refer to Figure 4, which is a detailed flowchart of step S302 in Figure 3.
[0115] As shown in Figure 4, step S302 (based on the data preprocessing module, performing data preprocessing on each person image to obtain the human body feature information image corresponding to each person image) includes:
[0116] Step S3021: Extract human contour features from each human image based on a high-precision human body segmentation network to obtain a high-precision human body segmentation map corresponding to each human image.
[0117] Specifically, the high-precision human body segmentation network is used to extract the overall human contour features of all people in each person image to obtain a high-precision human body segmentation map corresponding to each person image. The high-precision human body segmentation map includes an overall human body contour mask, which is used to determine the overall contour of all people in a person image.
[0118] Specifically, the high-precision human body segmentation network includes the U²-Net network. U²-Net is a two-level nested U-shaped structure: at the bottom level it designs a new residual U-block (RSU), and at the top level it is a U-Net-like structure in which each stage is filled with an RSU. The RSU replaces the ordinary single-stream convolution with a U-Net-like structure and replaces the original features with local features transformed by a weight layer. This design enables the network to extract features at multiple scales directly from each residual block.
[0119] Specifically, the U²-Net network consists of three parts: a six-stage encoder, a five-stage decoder, and a saliency map fusion module. The saliency map fusion module is connected to the decoder stages and the last encoder stage. The six encoder stages are En_1, En_2, En_3, En_4, En_5, and En_6, and the five decoder stages are De_1, De_2, De_3, De_4, and De_5.
[0120] Encoder stages En_1, En_2, En_3, and En_4 use the RSU structures RSU-7, RSU-6, RSU-5, and RSU-4, respectively. The numbers 7, 6, 5, and 4 denote the height L of the RSU, which is typically configured according to the spatial resolution of the input feature maps. In En_5 and En_6, the feature maps have relatively low resolution, and further downsampling would lose useful context. Therefore, the En_5 and En_6 stages use RSU-4F, where F indicates that the RSU is a dilated version in which the pooling and upsampling operations are replaced with dilated convolutions. As a result, all intermediate feature maps in RSU-4F have the same resolution as the input feature maps.
[0121] The decoder stages have structures similar to those of their symmetric encoder stages with respect to En_6. De_5 also uses the dilated version RSU-4F, as in the encoder stages En_5 and En_6. Each decoder stage takes as input the concatenation of the upsampled feature map from its previous stage and the feature map from its symmetric encoder stage.
[0122] The saliency map fusion module is used to generate a saliency probability map. The U²-Net network first generates six side-output saliency probability maps from En_6, De_5, De_4, De_3, De_2, and De_1 using 3x3 convolutions and a sigmoid function. Then, the logits of the side-output saliency maps (the convolution outputs before the sigmoid function) are upsampled to the size of the input image and fused through a concatenation operation. Finally, the fused result passes through a 1x1 convolutional layer and a sigmoid function to generate the final saliency probability map, which is the high-precision human body segmentation map corresponding to the person image.
[0123] In this embodiment of the application, by adopting the U²-Net network as the high-precision human body segmentation network, this application can integrate features from receptive fields of different scales, capture more contextual information, and increase the depth of the entire architecture without significantly increasing computational cost.
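The stage layout of the segmentation network can be summarized as a small lookup table. This is only a sketch: the text states the blocks for En_1 to En_6 and for De_5, so the blocks listed for De_4 to De_1 are an assumption derived from the stated encoder/decoder symmetry. `decoder_stages` and `side_output_sources` are hypothetical helper names.

```python
# RSU block used at each encoder stage, as described in the text.
ENCODER_STAGES = {
    "En_1": "RSU-7",
    "En_2": "RSU-6",
    "En_3": "RSU-5",
    "En_4": "RSU-4",
    "En_5": "RSU-4F",
    "En_6": "RSU-4F",
}


def decoder_stages(encoder):
    """Mirror En_1..En_5 onto De_1..De_5.

    Assumption: decoder stage k uses the same block as encoder stage k.
    En_6 is the bottleneck and has no decoder counterpart.
    """
    return {
        name.replace("En_", "De_"): block
        for name, block in encoder.items()
        if name != "En_6"
    }


def side_output_sources(encoder):
    """Stages feeding the saliency map fusion module: En_6, then De_5 down to De_1."""
    return ["En_6"] + sorted(decoder_stages(encoder), reverse=True)
```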
[0124] Step S3022: Segment each high-precision human body segmentation map based on the human instance segmentation network to obtain the human instance segmentation map corresponding to each high-precision human body segmentation map.
[0125] Specifically, the human instance segmentation network is used to segment the overall human contour mask in each high-precision human body segmentation map, distinguish the human contour mask region of each person, and thus obtain the human instance segmentation map corresponding to each high-precision human body segmentation map. The human instance segmentation map includes the human contour mask region of each person.
[0126] Specifically, human instance segmentation networks include the Detection Free Human Instance Segmentation (Pose2Seg) network, which is a pose-based human instance segmentation framework that can achieve better accuracy and handle occlusion better.
[0127] The Pose2Seg network mainly consists of three parts: an affine alignment module, a skeleton feature extraction module, and a segmentation module. The affine alignment module performs affine transformations based on the person's pose to align each pose to a uniform size; the skeleton feature extraction module detects body joints and connects them to obtain the skeleton features of each instance in the image; and the segmentation module segments each human figure in the image.
[0128] Specifically, the Pose2Seg network takes an image together with its human poses as input, extracts features using a base network, aligns each pose to a uniform size through an affine transformation, generates skeleton features for each person in the image, and segments each person based on the pose and skeleton features. Finally, the inverse of the affine transformation matrix is applied to map each aligned segmentation back onto the original image, yielding the final segmentation result, which is the human instance segmentation map corresponding to the high-precision human body segmentation map.
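The align-then-restore idea can be illustrated with a plain 2D affine transform and its inverse. This is a minimal sketch of the underlying geometry only, not the Pose2Seg implementation; the `Affine` tuple layout and the helper names `apply_affine` and `invert_affine` are assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical layout: (a, b, tx, c, d, ty) encodes
#   x' = a*x + b*y + tx
#   y' = c*x + d*y + ty
Affine = Tuple[float, float, float, float, float, float]


def apply_affine(m: Affine, x: float, y: float) -> Tuple[float, float]:
    """Map a point from image coordinates into aligned coordinates."""
    a, b, tx, c, d, ty = m
    return (a * x + b * y + tx, c * x + d * y + ty)


def invert_affine(m: Affine) -> Affine:
    """Return the affine transform that undoes m."""
    a, b, tx, c, d, ty = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # The inverse translation is -(A^-1) * t.
    itx = -(ia * tx + ib * ty)
    ity = -(ic * tx + id_ * ty)
    return (ia, ib, itx, ic, id_, ity)
```

Applying `invert_affine(m)` to a point produced by `apply_affine(m, x, y)` returns the original `(x, y)`, which is what lets per-person masks predicted at a uniform size be placed back at their original positions.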
[0129] Please refer to Figure 5, which is another detailed flowchart of step S302 in Figure 3.
[0130] In this embodiment of the application, the human body feature information image also includes human body key point images.
[0131] As shown in Figure 5, step S302 (based on the data preprocessing module, performing data preprocessing on each person image to obtain the human body feature information image corresponding to each person image) further includes:
[0132] Step S3023: Detect each person image using a human pose estimation algorithm to obtain the human body key point image corresponding to each person image.
[0133] Specifically, human pose estimation algorithms include, but are not limited to, the human skeletal keypoint detection (OpenPose) algorithm. Preferably, the OpenPose algorithm is used in the embodiments of this application.
[0134] Specifically, the OpenPose algorithm uses convolutional neural networks and supervised learning to perform human pose estimation. It is an open-source real-time system for multi-person 2D pose detection, and its main advantage is that it identifies human key points accurately and quickly.
[0135] Specifically, the OpenPose algorithm takes the entire person image as input to the network, then predicts confidence maps for body part detection and part affinity fields (PAFs) for part association. A parsing step then performs a set of bipartite matchings to associate the body part candidates. Finally, the matched key points are connected into skeletons and assembled into the full-body poses of all people in the image, thereby obtaining the human body key point image corresponding to each person image.
[0136] The human body key points are divided into 25 categories, labeled 0 to 24: {0, "Nose"}, {1, "Neck"}, {2, "Right Upper Arm"}, {3, "Right Elbow"}, {4, "Right Wrist"}, {5, "Left Upper Arm"}, {6, "Left Elbow"}, {7, "Left Wrist"}, {8, "Mid Hip"}, {9, "Right Hip"}, {10, "Right Knee"}, {11, "Right Ankle"}, {12, "Left Hip (LHip)"}, {13, "Left Knee (LKnee)"}, {14, "Left Ankle (LAnkle)"}, {15, "Right Eye (REye)"}, {16, "Left Eye (LEye)"}, {17, "Right Ear (REar)"}, {18, "Left Ear (LEar)"}, {19, "Left Big Toe (LBigToe)"}, {20, "Left Small Toe (LSmallToe)"}, {21, "Left Heel (LHeel)"}, {22, "Right Big Toe (RBigToe)"}, {23, "Right Small Toe (RSmallToe)"}, {24, "Right Heel (RHeel)"}.
[0137] Step S303: Based on the four-input encoding module, perform feature extraction and fusion on the human body feature information image to obtain a feature image;
[0138] Specifically, the human feature information image includes a human key point image, a high-precision human segmentation map, and a human instance segmentation map. Each human image corresponds to a human key point image, a high-precision human segmentation map, and a human instance segmentation map.
[0139] Please refer to Figure 6, which is a detailed flowchart of step S303 in Figure 3.
[0140] As shown in Figure 6, step S303 (based on the four-input encoding module, performing feature extraction and fusion on the human body feature information image to obtain a feature image) includes:
[0141] Step S3031: Based on the four-input encoding module, feature extraction and fusion are performed on each person image and the corresponding human body key point image, human body high-precision segmentation map, and human body instance segmentation map to obtain a feature image.
[0142] Specifically, the four-input encoding module is based on the U-Net network module. It is a convolutional neural network that takes four images as input (a person image and its corresponding human body key point image, high-precision human body segmentation map, and human instance segmentation map) and outputs a single feature image. The U-Net network module is a convolutional neural network for 2D image segmentation. It uses the same number of convolution operations in the downsampling and upsampling stages. The network structure contains only convolutional and pooling layers, without fully connected layers. The shallower, high-resolution layers handle pixel localization, while the deeper layers handle pixel classification, thus enabling semantic-level image segmentation.
[0143] Specifically, the four-input encoding module includes dilated convolutional layers, convolutional layers, activation function layers, and batch normalization layers, with the corresponding mathematical expressions as follows:
[0144]
[0145] where $F_i^{x+1}$ denotes the i-th feature image of the (x+1)-th layer; BN denotes batch normalization; the two activation operators in the expression denote the Swish activation function and the ReLU activation function; $F_i^{x}$ denotes the i-th feature image of the x-th layer; $G_i^{x+1}$ denotes the dilated convolution kernel of the (x+1)-th layer; and the remaining two symbols denote the standard convolution kernel of the (x+1)-th layer and the bias term.
[0146] Specifically, the four-input encoding module includes four encoding modules and one fusion module. Each encoding module includes, in sequence: a first dilated convolutional layer (32 dilated convolutional kernels), a first convolutional layer (32 standard convolutional kernels), a first activation function layer, and a first batch normalization layer; a second dilated convolutional layer (64 dilated convolutional kernels), a second convolutional layer (64 standard convolutional kernels), a second activation function layer, and a second batch normalization layer; a third dilated convolutional layer (128 dilated convolutional kernels), a third convolutional layer (128 standard convolutional kernels), a third activation function layer, and a third batch normalization layer; a fourth dilated convolutional layer (256 dilated convolutional kernels), a fourth convolutional layer (256 standard convolutional kernels), a fourth activation function layer, and a fourth batch normalization layer; a fifth dilated convolutional layer (512 dilated convolutional kernels), a fifth convolutional layer (512 standard convolutional kernels), a fifth activation function layer, and a fifth batch normalization layer; and a sixth dilated convolutional layer (1024 dilated convolutional kernels), a sixth convolutional layer (1024 standard convolutional kernels), a sixth activation function layer, and a sixth batch normalization layer.
[0147] The dilated convolutional kernel in each dilated convolutional layer of each encoding module is 5*5 in size, with a dilation factor of 2 and a stride of 2; the standard convolutional kernel in each convolutional layer of each encoding module is 3*3 in size, with a stride of 2; the activation functions used in the activation function layers include the Swish activation function and the ReLU activation function. The padding strategy used for convolution in each encoding module is the "same" mode, which keeps the output size as proportional as possible to the input size. In "same" mode, the input is padded first and then convolved; the padding amount is determined by the convolutional kernel size and the sliding stride.
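How the padding amount follows from kernel size and stride can be made concrete with a short calculation. The sketch below assumes the common convention in which a "same" convolution produces an output of size ceil(input / stride), and in which a dilated kernel of size k with dilation d covers d*(k-1)+1 pixels. The source does not state which framework convention is used, so both are assumptions, and `same_padding` is a hypothetical helper name.

```python
import math
from typing import Tuple


def same_padding(in_size: int, kernel: int, stride: int, dilation: int = 1) -> Tuple[int, int]:
    """Return (out_size, total_padding) along one axis for "same" mode."""
    # Footprint of the kernel on the input once dilation gaps are included.
    effective_kernel = dilation * (kernel - 1) + 1
    # Assumed convention: output size is the input size divided by the stride, rounded up.
    out_size = math.ceil(in_size / stride)
    # Padding needed so that the last window still fits inside the padded input.
    total_padding = max((out_size - 1) * stride + effective_kernel - in_size, 0)
    return out_size, total_padding
```

For a 1024-pixel axis, `same_padding(1024, 5, 2, 2)` covers the 5*5 dilated kernel and `same_padding(1024, 3, 2)` covers the 3*3 standard kernel; the total padding is split between the two sides of the axis.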
[0148] In this embodiment of the application, by adding a dilated convolutional layer, the application is able to better obtain the spatial features of the image.
[0149] Furthermore, the four encoding modules extract features from the four input images respectively: each encoding module extracts features from only one input image and produces one 8*8*1024 feature map, giving four 8*8*1024 feature maps in total. The fusion module then concatenates the four feature maps using a concatenation function to obtain a single 8*8*4096 feature image. The concatenation function can be the `torch.cat(tensors, dim)` function from PyTorch, a scientific computing package for Python. `torch.cat` concatenates a sequence of tensors along a given dimension: `tensors` is the sequence of tensors to be concatenated, and `dim` is the dimension along which they are concatenated.
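The shape arithmetic of the fusion step can be sketched without any deep learning dependency. The helper below mirrors the shape rule of concatenation (all dimensions must match except the one being joined); `cat_shape` is a hypothetical name, and the height*width*channels layout follows the 8*8*1024 notation used in the text rather than any particular framework's default layout.

```python
from typing import Sequence, Tuple


def cat_shape(shapes: Sequence[Tuple[int, ...]], dim: int) -> Tuple[int, ...]:
    """Shape obtained by concatenating tensors of the given shapes along `dim`."""
    first = shapes[0]
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError("all tensors must have the same number of dimensions")
        # Every dimension except `dim` must agree across inputs.
        if any(a != b for i, (a, b) in enumerate(zip(shape, first)) if i != dim):
            raise ValueError("shapes differ outside the concatenation dimension")
    out = list(first)
    out[dim] = sum(shape[dim] for shape in shapes)
    return tuple(out)
```

With PyTorch available, the corresponding call on four tensors stored in that layout would be `torch.cat([f1, f2, f3, f4], dim=2)`, where `f1` to `f4` are placeholder names for the four encoder outputs.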
[0150] In this embodiment, feature images are obtained by extracting and fusing human feature information images based on a four-input encoding module. This application can extract and fuse feature information from each human image and the corresponding human key point image, human high-precision segmentation image, and human instance segmentation image, thereby improving the stability and accuracy of human image segmentation.
[0151] Step S304: Based on the decoding module, decode the feature image to obtain a high-precision segmentation map of the main target;
[0152] Specifically, the decoding module decodes the feature image to obtain a high-precision segmentation map of the main target, which includes a high-precision mask of the main target. The decoding module is a convolutional neural network that takes an 8*8*4096 feature image output from the four-input encoding module as input and a 1024*1024*1 grayscale image as output. It includes transposed convolutional layers, activation function layers, and batch normalization layers.
[0153] Specifically, the decoding module includes a first transposed convolutional layer (2048 kernels), a first activation function layer, and a first batch normalization layer; a second transposed convolutional layer (1024 kernels), a second activation function layer, and a second batch normalization layer; a third transposed convolutional layer (512 kernels), a third activation function layer, and a third batch normalization layer; a fourth transposed convolutional layer (256 kernels), a fourth activation function layer, and a fourth batch normalization layer; a fifth transposed convolutional layer (128 kernels), a fifth activation function layer, and a fifth batch normalization layer; a sixth transposed convolutional layer (32 kernels), a sixth activation function layer, and a sixth batch normalization layer; and a seventh transposed convolutional layer (1 kernel), a seventh activation function layer, and a seventh batch normalization layer. Each transposed convolutional layer has a kernel size of 3*3 and a stride of 2. The activation functions used in the activation function layers include the PReLU activation function.
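The way seven stride-2 transposed convolutions take an 8*8 feature image to a 1024*1024 output can be traced layer by layer. The sketch below uses the standard transposed-convolution output-size formula; the `padding=1` and `output_padding=1` values are assumptions (the text gives only the 3*3 kernel and the stride of 2), chosen because they make each layer exactly double the spatial size. The function names are hypothetical.

```python
from typing import List, Tuple

# Number of kernels (output channels) of the seven transposed convolutional layers.
DECODER_CHANNELS = [2048, 1024, 512, 256, 128, 32, 1]


def transposed_conv_out(size: int, kernel: int = 3, stride: int = 2,
                        padding: int = 1, output_padding: int = 1) -> int:
    """Output size of a transposed convolution along one axis (dilation of 1)."""
    return (size - 1) * stride - 2 * padding + (kernel - 1) + output_padding + 1


def decoder_shapes(size: int = 8, channels: int = 4096) -> List[Tuple[int, int, int]]:
    """Height*width*channels after the input and after each decoder layer."""
    shapes = [(size, size, channels)]
    for out_channels in DECODER_CHANNELS:
        size = transposed_conv_out(size)
        shapes.append((size, size, out_channels))
    return shapes
```

Under these assumptions the spatial size doubles seven times (8, 16, 32, 64, 128, 256, 512, 1024), which matches the stated 1024*1024*1 grayscale output.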
[0154] Please refer to Figure 7, which is a schematic diagram of obtaining a high-precision segmentation map of the main target according to an embodiment of this application.
[0155] As shown in Figure 7, the person image, along with the corresponding human body key point image, high-precision human body segmentation map, and human instance segmentation map, is processed by the four-input encoding module and the decoding module to output a high-precision segmentation map of the main target. The four-input encoding module comprises four encoding modules and one fusion module.
[0156] Step S305: Construct a loss function, and train the human image segmentation model based on the loss function and the online hard example mining strategy until the loss function converges.
[0157] Specifically, a loss function is constructed, which, along with an online hard example mining strategy, is used to train the portrait segmentation model. The parameters of the portrait segmentation model are adjusted until the loss function converges, and the portrait segmentation model at this point is taken as the trained portrait segmentation model.
[0158] Specifically, the Online Hard Example Mining (OHEM) strategy automatically selects difficult samples as training samples to improve network parameter performance. Difficult samples refer to those with diversity and high loss; they can also be understood as the most difficult samples (samples the model struggles to distinguish) within a given training set batch size. The OHEM strategy has the following advantages: it eliminates the need to set a positive-to-negative sample ratio to address class imbalance, making this online selection method more targeted; and it allows for greater algorithm improvement as the dataset increases.
[0159] Specifically, the loss functions include the mean squared error loss function and the cross-entropy loss function, which can be expressed by the following formula:
[0160] $Loss = L_{MSE} + \lambda_1 L_{BCE}$
[0161] where $Loss$ denotes the loss function, $L_{MSE}$ denotes the mean squared error loss function, $\lambda_1$ denotes a hyperparameter set to 100, and $L_{BCE}$ denotes the cross-entropy loss function.
[0162] The mean squared error loss function includes:
[0163]
[0164] where $L_{MSE}$ denotes the mean squared error loss function, $x$ denotes the true sample, its counterpart in the formula denotes the predicted sample, $\lambda_{OHEM}$ denotes the coefficient of the online hard example mining strategy, and $n$ denotes the number of samples.
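The formula itself is not reproduced in this text. One form that is consistent with the listed symbols is a standard mean squared error in which each squared difference is scaled by the OHEM coefficient. This is a hedged reconstruction rather than the formula from the original filing: the symbol $\hat{x}_i$ for the predicted sample and the placement of $\lambda_{OHEM}$ inside the sum are both assumptions.

```latex
L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \lambda_{OHEM} \left( x_i - \hat{x}_i \right)^2
```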
[0165] It is understandable that real samples include high-precision segmentation maps of the main target corresponding to the person image in the image dataset, while predicted samples include high-precision segmentation maps of the main target corresponding to the person image predicted by the portrait segmentation model.
[0166] The cross-entropy loss function includes:
[0167]
[0168] where $L_{BCE}$ denotes the cross-entropy loss function, $n$ denotes the number of samples, $y_i$ denotes the classification label of the i-th real sample, and $p_i$ denotes the classification label of the i-th predicted sample.
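The formula itself is not reproduced in this text. Given that the labels are binary (background versus person) and the symbols are $n$, $y_i$, and $p_i$, the standard binary cross-entropy is the natural reading. It is shown here as an assumed form, not as the formula from the original filing.

```latex
L_{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```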
[0169] Understandably, the classification labels for real samples are manually labeled. The classification labels for real samples include a matrix corresponding to the high-precision segmentation map of the main target image in the image dataset. This matrix consists of 0s and 1s, where 0 represents the background region in the high-precision segmentation map of the main target and 1 represents the human figure region in the high-precision segmentation map of the main target.
[0170] Similarly, the classification label of the predicted sample includes the matrix corresponding to the high-precision segmentation map of the main target in the human image predicted by the human image segmentation model. This matrix is composed of 0 and 1, where 0 represents the background region in the high-precision segmentation map of the main target and 1 represents the human image region in the high-precision segmentation map of the main target.
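The combined loss can be sketched in plain Python on flattened masks. This is an illustration under stated assumptions, not the patented implementation: it uses the textbook forms of mean squared error and binary cross-entropy, averages over mask pixels, applies the OHEM coefficient as a single scalar `ohem_weight`, and clips predictions with a small `eps` to keep the logarithm finite. Only the weighting factor of 100 comes from the text; all function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def mse_loss(target: Sequence[float], pred: Sequence[float], ohem_weight: float = 1.0) -> float:
    """Mean squared error, scaled by an assumed scalar OHEM coefficient."""
    n = len(target)
    return ohem_weight * sum((t - p) ** 2 for t, p in zip(target, pred)) / n


def bce_loss(target: Sequence[float], pred: Sequence[float], eps: float = 1e-7) -> float:
    """Binary cross-entropy between 0/1 labels and predicted probabilities."""
    n = len(target)
    total = 0.0
    for y, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / n


def total_loss(target: Sequence[float], pred: Sequence[float],
               lambda_1: float = 100.0, ohem_weight: float = 1.0) -> float:
    """Loss = L_MSE + lambda_1 * L_BCE, with lambda_1 = 100 as stated in the text."""
    return mse_loss(target, pred, ohem_weight) + lambda_1 * bce_loss(target, pred)
```

With `lambda_1 = 100`, the cross-entropy term dominates the total whenever the two terms are of similar magnitude, so per-pixel classification errors are penalized much more heavily than small intensity differences.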
[0171] Please refer to Figure 8, which is a detailed flowchart of step S305 in Figure 3.
[0172] As shown in Figure 8, step S305 (constructing a loss function, and training the portrait segmentation model based on the loss function and the online hard example mining strategy until the loss function converges) includes:
[0173] Step S3051: Obtain samples for each batch;
[0174] Specifically, the samples include images of people from the image dataset.
[0175] Step S3052: Calculate the loss function value for each sample in the current batch;
[0176] Specifically, the loss function value includes the sum of the mean squared error loss function value and the cross-entropy loss function value.
[0177] Step S3053: Sort all samples in the current batch according to the loss function value to determine the difficult samples in the current batch;
[0178] Specifically, according to the OHEM strategy, the loss function values of all samples in the same batch are sorted in descending order, and each sample is assigned a weight. The weight of each sample is the reciprocal of its ranking according to the loss function value. For example, if a sample's ranking according to the loss function value is tenth, its weight is 1 / 10. It can be understood that samples with larger loss function values are assigned higher weights, and the difficult samples in the current batch are those with large loss function values, i.e., samples with high weights.
[0179] Furthermore, the number of difficult samples in the current batch is the ratio of the total number of samples to the number of samples in the current batch. Based on the number of difficult samples in the current batch, a corresponding number of samples are selected from the top of the ranking based on the loss function value. Alternatively, based on the number of difficult samples in the current batch, a corresponding number of samples with the highest weights are selected from all samples in the current batch. For example, if the number of difficult samples in the current batch is 10, then the top 10 samples ranked by loss function value are selected as the difficult samples in the current batch, or the 10 samples with the highest weights are selected as the difficult samples in the current batch.
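The ranking, weighting, and selection described above can be sketched as follows. This is a minimal illustration, assuming the per-sample loss values for the batch are already available and the number of difficult samples has been decided; `ohem_select` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def ohem_select(losses: Sequence[float], num_hard: int) -> Tuple[List[float], List[int]]:
    """Rank samples by loss (descending), weight each by 1/rank, pick the hardest.

    Returns (weights, hard_indices): weights[i] is the weight of sample i, and
    hard_indices lists the num_hard samples with the largest losses.
    """
    # Sample indices ordered from largest loss to smallest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * len(losses)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = 1.0 / rank  # rank 1 -> weight 1, rank 10 -> weight 1/10
    return weights, order[:num_hard]
```

Because the weight is a decreasing function of the rank, choosing the top-ranked samples and choosing the highest-weighted samples yield the same set, which is why the text presents the two selections as alternatives.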
[0180] Step S3054: Train the portrait segmentation model based on the difficult samples;
[0181] Specifically, the portrait segmentation model is trained with a focus on the difficult samples in the current batch.
[0182] In this embodiment, the Adam algorithm (Adaptive Moment Estimation Algorithm) is used to optimize the model parameters. For example, the upper limit of the number of iterations is set to 100,000, the initial learning rate is set to 0.0005, and the weight decay is set to 0.0008. After obtaining the adjusted model parameters output by the Adam algorithm, these adjusted model parameters are used for the next training iteration until the loss function converges. It can be understood that loss function convergence refers to the convergence of the sum of the mean squared error loss function and the cross-entropy loss function.
[0183] Understandably, the Adam algorithm (Adaptive Moment Estimation Algorithm) can be regarded as a combination of the momentum method and the RMSprop algorithm: it uses momentum to determine the parameter update direction and also adaptively adjusts the learning rate.
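A single Adam update for one scalar parameter shows how the momentum term and the adaptive learning rate combine. This is a sketch of the textbook algorithm, not the embodiment's training code: the learning rate 0.0005 and weight decay 0.0008 come from the text, while `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as is applying weight decay as an L2 term added to the gradient.

```python
import math
from typing import Tuple


def adam_step(param: float, grad: float, m: float, v: float, t: int,
              lr: float = 0.0005, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8, weight_decay: float = 0.0008) -> Tuple[float, float, float]:
    """One Adam update at step t (t starts at 1). Returns (param, m, v)."""
    # Weight decay applied as an L2 penalty on the gradient (assumed variant).
    grad = grad + weight_decay * param
    # First moment: exponential moving average of gradients (the momentum part).
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (the RMSprop part).
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v being initialized at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.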
[0184] In this embodiment, while optimizing the model parameters using the Adam algorithm, the loss function value curve is monitored in real time. If the loss function value curve does not decrease significantly within a preset number of iterations, the loss function converges, indicating that the portrait segmentation model training is complete. The model parameters of the portrait segmentation model at this time are then output, thus obtaining the trained portrait segmentation model.
[0185] Step S3055: Save the loss function value from each training iteration;
[0186] Step S3056: When the number of iterations exceeds the preset number, sort the loss function value obtained from the current training in ascending order with the saved loss function values;
[0187] Specifically, the preset number of iterations can be 500. When the number of iterations exceeds 500, the loss function value obtained from the current training is sorted in ascending order with the saved loss function values to obtain the ranking of the loss function value obtained from the current training.
[0188] Step S3057: If the ranking of the loss function value obtained from the current training is greater than the preset sequence value, then the loss function converges.
[0189] Specifically, the preset sequence value can be 500. If the ranking of the loss function value obtained from the current training is greater than the 500th, it indicates that the loss function value curve has not decreased significantly within 500 iterations, and the loss function has converged.
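The rank-based convergence test in steps S3055 to S3057 can be sketched as follows. The sketch assumes that the iteration count equals the number of saved loss values plus the current one, and that ties are resolved in favor of the current value (only strictly smaller saved values push its rank down); both are assumptions, and `has_converged` is a hypothetical helper name.

```python
from typing import Sequence


def has_converged(saved_losses: Sequence[float], current_loss: float,
                  preset_iterations: int = 500, preset_rank: int = 500) -> bool:
    """Return True if the current loss ranks beyond preset_rank in ascending order."""
    iterations = len(saved_losses) + 1
    # The test is only applied once enough iterations have been run.
    if iterations <= preset_iterations:
        return False
    # Rank of the current loss among all values sorted ascending (1 = smallest).
    rank = 1 + sum(1 for loss in saved_losses if loss < current_loss)
    return rank > preset_rank
```

A rank above 500 means that at least 500 earlier iterations achieved a lower loss than the current one, which is the sense in which the loss curve has stopped decreasing.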
[0190] Furthermore, if the loss function converges, it signifies that the portrait segmentation model has completed training. The model parameters of the portrait segmentation model at this point are then output, resulting in the trained portrait segmentation model. This trained model can then be used to segment the image of the person to be segmented, obtaining a high-precision segmentation map of the corresponding subject.
[0191] In this embodiment, by training the portrait segmentation model based on a loss function and an online hard example mining strategy until the loss function converges, the portrait segmentation model can be better trained on difficult samples, thereby improving the stability and accuracy of portrait segmentation.
[0192] After training the portrait segmentation model using the training method provided in this application embodiment, the portrait segmentation model can be used to segment human images. The human image segmentation method provided in this application embodiment can be implemented by various types of electronic devices with computing capabilities, such as smart terminals and servers.
[0193] In this embodiment, a training method for a portrait segmentation model is provided. The portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module. The method includes: acquiring an image dataset, wherein the image dataset includes multiple images of people; performing data preprocessing on each image based on the data preprocessing module to obtain a human feature information image corresponding to each image; performing feature extraction and fusion on the human feature information image based on the four-input encoding module to obtain a feature image; decoding the feature image based on the decoding module to obtain a high-precision segmentation map of the main target; constructing a loss function; and training the portrait segmentation model based on the loss function and an online hard example mining strategy until the loss function converges, wherein the loss function includes a mean squared error loss function and a cross-entropy loss function.
[0194] The portrait segmentation model is trained using a loss function and an online hard example mining strategy, where the loss function includes a mean squared error loss function and a cross-entropy loss function. The four-input encoding module extracts and fuses the human feature information images, and the resulting feature images are passed to the decoding module to obtain a high-precision segmentation map of the main target. In this way, the present application improves the stability and accuracy of portrait segmentation.
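The combination of the two loss terms can be illustrated with the sketch below. It assumes a weighted sum in which a single hyperparameter `ce_weight` scales the cross-entropy term, and it uses the binary form of cross-entropy over flattened per-pixel foreground probabilities. Both choices, and all function names, are assumptions for illustration; the coefficient of the online hard example mining strategy is left out because its form is not given in this passage.

```python
import math


def mse_loss(y_true, y_pred):
    """Mean squared error over flattened per-pixel values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy; predictions are clamped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(y_true)


def combined_loss(y_true, y_pred, ce_weight=1.0):
    """Assumed form: MSE term plus `ce_weight` times the cross-entropy term."""
    return mse_loss(y_true, y_pred) + ce_weight * binary_cross_entropy(y_true, y_pred)
```

Here `y_true` holds the ground-truth mask values (0 or 1) and `y_pred` holds the predicted foreground probabilities for the same pixels.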
[0195] Example 2
[0196] After training the portrait segmentation model using the method provided in the above embodiments, a trained portrait segmentation model is obtained, which can be used to segment human images.
[0197] It is understood that Example 1 above describes the training stage of the portrait segmentation model, while Example 2 of this application describes the stage in which the trained portrait segmentation model is used to segment a person image, so as to obtain a high-precision segmentation map of the main target corresponding to the person image to be segmented.
[0198] The following describes the method for segmenting human images provided in this application embodiment, using exemplary applications and implementations of the terminal provided in the embodiments of this application.
[0199] Please refer to Figure 9, which is a flowchart illustrating a method for segmenting a person image provided in an embodiment of this application;
[0200] As shown in Figure 9, the method for segmenting the person image includes:
[0201] Step S901: Obtain the image of the person to be segmented;
[0202] Specifically, the electronic device acquires the image of the person to be segmented. For example, the user inputs the image of the person to be segmented through an input interface, and after the input is completed, the electronic device automatically acquires the image of the person to be segmented. Alternatively, the electronic device has a camera that captures images of people. Or, the electronic device stores a library of images of people, from which the user can select the image of the person to be segmented. Or, the electronic device receives the image of the person to be segmented uploaded by the user through the network.
[0203] Step S902: Input the image of the person to be segmented into the portrait segmentation model to obtain a high-precision segmentation map of the main target corresponding to the image of the person to be segmented.
[0204] Specifically, the portrait segmentation model is trained using the training method of the portrait segmentation model described in Example 1.
[0205] The image of the person to be segmented is input into the trained portrait segmentation model to obtain a high-precision segmentation map of the main target output by the portrait segmentation model. This high-precision segmentation map of the main target includes a high-precision mask of the person.
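Steps S901 and S902 can be sketched as follows, under several assumptions: the trained model is represented by a hypothetical callable `model` that returns per-pixel foreground probabilities with the same height and width as the input, images are plain nested lists, and the mask is obtained with an assumed threshold of 0.5. None of these details is prescribed by this application.

```python
def segment_person(image, model, threshold=0.5):
    """Run `model` on `image` and return a binary mask (1 = main target)."""
    # `model` is a stand-in for the trained portrait segmentation model.
    probabilities = model(image)
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def keep_main_target(image, mask, background=0):
    """Keep the pixels inside the mask and replace the rest with `background`."""
    return [
        [pixel if m else background for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]
```

A downstream virtual try-on step would then operate on the output of `keep_main_target`, in which only the main target remains.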
[0206] It is understood that the portrait segmentation model is trained using the same training method as the portrait segmentation model in the above embodiments, and has the same structure and function as the portrait segmentation model in the above embodiments, which will not be described in detail here.
[0207] In this embodiment of the application, the image of the person to be segmented is input into the portrait segmentation model to obtain a high-precision segmentation map of the main target corresponding to that image. The model can accurately preserve the main target in an image containing one person or several people, and remove the other human figures and debris residue, thereby achieving accurate portrait segmentation.
[0208] Example 3
[0209] Please refer to Figure 10, which is a schematic diagram of the structure of a training device for a portrait segmentation model provided in an embodiment of this application;
[0210] The training device for the portrait segmentation model is applied to an electronic device; specifically, the training device for the portrait segmentation model is applied to one or more processors of the electronic device.
[0211] The portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module.
[0212] As shown in Figure 10, the training device 101 for the portrait segmentation model includes:
[0213] The acquisition unit 1011 is used to acquire an image dataset, wherein the image dataset includes multiple person images;
[0214] The data processing unit 1012 is used to perform data preprocessing on each person image based on the data preprocessing module to obtain the human body feature information image corresponding to each person image.
[0215] The feature image unit 1013 is used to extract and fuse human feature information images based on the four-input encoding module to obtain feature images;
[0216] The image decoding unit 1014 is used to decode the feature image based on the decoding module to obtain a high-precision segmentation map of the main target.
[0217] The model training unit 1015 is used to construct the loss function and train the portrait segmentation model based on the loss function and the online hard example mining strategy until the loss function converges. The loss function includes the mean squared error loss function and the cross-entropy loss function.
[0218] In the embodiments of this application, the training device for the portrait segmentation model can also be built from hardware devices. For example, the training device for the portrait segmentation model can be built from one or more chips, and the chips can work together to complete the training method for the portrait segmentation model described in the above embodiments. As another example, the training device for the portrait segmentation model can also be built from various logic devices, such as general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), microcontrollers, ARM (Advanced RISC Machines) processors or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination of these components.
[0219] The training device for the portrait segmentation model in this application embodiment can be a device, or a component, integrated circuit, or chip in a terminal. The device can be a mobile electronic device or a non-mobile electronic device. For example, mobile electronic devices can be mobile phones, tablets, laptops, in-vehicle electronic devices, wearable devices, ultra-mobile personal computers (UMPCs), netbooks, or personal digital assistants (PDAs), etc., while non-mobile electronic devices can be servers, network-attached storage (NAS), personal computers (PCs), televisions (TVs), ATMs, or self-service machines, etc. This application embodiment does not impose specific limitations.
[0220] The training device for the portrait segmentation model in this embodiment can be a device with an operating system. This operating system can be Android, iOS, or another possible operating system; this embodiment does not limit the specific operating system used.
[0221] The training device for the portrait segmentation model provided in this application embodiment can implement the processes of the method shown in Figure 3. To avoid repetition, they are not described in detail here.
[0222] It should be noted that the above-described training device for the portrait segmentation model can execute the training method for the portrait segmentation model provided in the above embodiments, and has the corresponding functional modules and beneficial effects of the method. Technical details not described in detail in the embodiments of the training device for the portrait segmentation model can be found in the training method for the portrait segmentation model provided in the above embodiments.
[0223] In this embodiment, a training device for a portrait segmentation model is provided. The portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module. The training device comprises: an acquisition unit for acquiring an image dataset, wherein the image dataset includes multiple images of people; a data processing unit for preprocessing each image based on the data preprocessing module to obtain a human feature information image corresponding to each image; a feature image unit for extracting and fusing features from the human feature information image based on the four-input encoding module to obtain a feature image; an image decoding unit for decoding the feature image based on the decoding module to obtain a high-precision segmentation map of the main target; and a model training unit for constructing a loss function and training the portrait segmentation model based on the loss function and an online hard example mining strategy until the loss function converges. The loss function includes a mean squared error loss function and a cross-entropy loss function.
[0224] The portrait segmentation model is trained using a loss function and an online hard example mining strategy, where the loss function includes a mean squared error loss function and a cross-entropy loss function. The four-input encoding module extracts and fuses the human feature information images, and the resulting feature images are passed to the decoding module to obtain a high-precision segmentation map of the main target. In this way, the present application improves the stability and accuracy of portrait segmentation.
[0225] Example 4
[0226] This application embodiment also provides a computer-readable storage medium storing computer-executable instructions for causing an electronic device to execute the training method of the portrait segmentation model provided in the above embodiments, for example the training method shown in Figure 3, or the person image segmentation method provided in the above embodiments, for example the person image segmentation method shown in Figure 9.
[0227] In the embodiments of this application, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a device including one or any combination of the above-mentioned memories.
[0228] In the embodiments of this application, the executable instructions may take the form of a program, software, software module, script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including being deployed as a standalone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
[0229] As an example, executable instructions may, but do not necessarily, correspond to files in the file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple collaborative files (e.g., a file that stores one or more modules, subroutines, or code sections).
[0230] As an example, executable instructions can be deployed to execute on a single computing device (including devices such as smart terminals and servers), or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.
[0231] This application also provides a computer-readable storage medium storing a computer program, which includes program instructions. When executed by an electronic device, the program instructions cause the electronic device to perform the training method for the portrait segmentation model or the segmentation method for the human image as described in the above embodiments.
[0232] This application also provides a computer program product, which includes one or more lines of program code stored in a computer-readable storage medium. The processor of an electronic device reads the program code from the computer-readable storage medium and executes the program code to complete the method steps of the portrait segmentation model training method or the portrait image segmentation method provided in the above embodiments.
[0233] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program or program code related to hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0234] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented using software and a general-purpose hardware platform, or of course, using hardware. Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0235] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and not to limit them; under the concept of this application, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of this application as described above. For the sake of brevity, they are not provided in detail; although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A training method for a portrait segmentation model, characterized in that the portrait segmentation model includes a data preprocessing module, a four-input encoding module, and a decoding module; the method includes: obtaining an image dataset, wherein the image dataset includes multiple person images; preprocessing each person image based on the data preprocessing module to obtain a human feature information image corresponding to each person image; performing feature extraction and fusion on the human feature information image based on the four-input encoding module to obtain a feature image; decoding the feature image based on the decoding module to obtain a high-precision segmentation map of the main target; and constructing a loss function and training the portrait segmentation model based on the loss function and an online hard example mining strategy until the loss function converges, wherein the loss function includes a mean squared error loss function and a cross-entropy loss function; the data preprocessing module includes a high-precision human body segmentation network and a human body instance segmentation network; the human feature information image includes a high-precision human body segmentation map and a human body instance segmentation map; the step of preprocessing each person image based on the data preprocessing module to obtain a human feature information image corresponding to each person image includes: extracting portrait contour features from each person image based on the high-precision human body segmentation network to obtain a high-precision human body segmentation map corresponding to each person image; and segmenting each high-precision human body segmentation map based on the human body instance segmentation network to obtain the human body instance segmentation map corresponding to each high-precision human body segmentation map;
the step of performing feature extraction and fusion on the human feature information image based on the four-input encoding module to obtain a feature image includes: performing feature extraction and fusion, based on the four-input encoding module, on each person image and the corresponding human key point image, high-precision human body segmentation map, and human body instance segmentation map to obtain a feature image.
2. The method according to claim 1, characterized in that the human feature information image also includes a human key point image; the step of preprocessing each person image based on the data preprocessing module to obtain a human feature information image corresponding to each person image further includes: detecting each person image using a human pose evaluation algorithm to obtain the human key point image corresponding to each person image.
3. The method according to claim 1, characterized in that the four-input encoding module includes a dilated convolutional layer, a convolutional layer, an activation function layer, and a batch normalization layer.
4. The method according to any one of claims 1-3, characterized in that the four-input encoding module includes: [formula not reproduced in the source text], wherein the terms of the formula represent, respectively: a feature image of the (x+1)th layer; the batch normalization operation (BN); the Swish activation function; the ReLU activation function; a feature image of the xth layer; the dilated convolution kernel of the (x+1)th layer; the standard convolution kernel of the (x+1)th layer; and the bias term.
5. The method according to any one of claims 1-3, characterized in that the decoding module includes a transposed convolutional layer, an activation function layer, and a batch normalization layer; the loss function includes: [formula not reproduced in the source text], wherein the terms of the formula represent, respectively: the loss function; the mean squared error loss function; a hyperparameter; and the cross-entropy loss function.
6. The method according to claim 5, characterized in that the mean squared error loss function includes: [formula not reproduced in the source text], wherein the terms of the formula represent, respectively: the mean squared error loss function; the real sample; the predicted sample; the coefficient of the online hard example mining strategy; and the number of samples.
7. The method according to claim 6, characterized in that the cross-entropy loss function includes: [formula not reproduced in the source text], wherein the terms of the formula represent, respectively: the cross-entropy loss function; the number of samples; the classification label of the i-th real sample; and the classification label of the i-th predicted sample.
8. The method according to any one of claims 1, 6 or 7, characterized in that the step of training the portrait segmentation model based on the loss function and the online hard example mining strategy until the loss function converges includes: acquiring the samples of each batch, wherein the samples include the person images; calculating the loss function value of each sample in the current batch; sorting all samples in the current batch based on the loss function values to determine the difficult samples in the current batch; training the portrait segmentation model based on the difficult samples; saving the loss function value obtained in each training iteration; when the number of iterations exceeds the preset number, sorting the loss function value obtained from the current training together with the saved loss function values in ascending order; and if the rank of the loss function value obtained from the current training is greater than the preset sequence value, determining that the loss function has converged.
9. A method for segmenting a person image, characterized in that the method includes: obtaining the image of the person to be segmented; and inputting the image of the person to be segmented into the portrait segmentation model to obtain a high-precision segmentation map of the main target corresponding to the image of the person to be segmented, wherein the portrait segmentation model is trained based on the method described in any one of claims 1-8.
10. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-9.