Method for training a fitting model, method for generating a fitting image and related devices

CN115587618B · Active · Publication Date: 2026-06-23 · SHENZHEN SHULIAN TIANXIA INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN SHULIAN TIANXIA INTELLIGENT TECH CO LTD
Filing Date
2022-04-24
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing virtual try-on algorithms generate try-on images with low resolution that is difficult to improve, resulting in unstable try-on effects that fail to meet user needs.

Method used

A virtual fitting network structure was designed, including a first encoding network, a decoding network, and a second encoding network. Feature maps are fused through normalized layers with cross-layer connections, and iterative training is performed using a human body parsing algorithm and a loss function to generate high-resolution virtual fitting images.

Benefits of technology

The generated fitting images have high resolution and a realistic and natural fitting effect, accurately reproducing the user's fitting experience.

✦ Generated by Eureka AI based on patent content.

Figure CN115587618B_ABST

Abstract

Embodiments of this application relate to the field of image processing technology and disclose a method for training a fitting model, a method for generating a fitting image, and related devices. In the fitting network, the decoding network is constructed from multiple cascaded, alternating normalization layers and decoding layers, and cross-layer connections exist between the normalization layer, the first encoding layer, and the second encoding layer at the same level. Through the normalization layer, the first clothing feature map, the second clothing feature map, and the upsampled feature map at the same level are fused, so that the identity feature map can fuse clothing features at different scales during decoding. This avoids the loss of clothing texture, so that a high-resolution fitting image with a realistic and natural fitting effect can be generated. As the fitting network is iteratively trained, the fused predicted fitting image continuously approaches the real fitting image, yielding an accurate fitting model.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to a method for training a virtual fitting model, a method for generating virtual fitting images, and related apparatus.

Background Technology

[0002] With the continuous advancement of modern technology and the increasing scale of online shopping, users can purchase clothing on online shopping platforms via their mobile phones. However, since the information about the clothing for sale is generally presented as two-dimensional images, users cannot know how the clothing will look on them, which may lead to them buying clothes that are not suitable for them and resulting in a poor shopping experience.

[0003] With the continuous development of neural networks, they have been widely used in the field of image generation. Researchers have therefore applied neural networks to virtual try-on and proposed various try-on algorithms. However, the virtual try-on images generated by existing algorithms are mainly low-resolution images (e.g., 256*192). If the resolution is forcibly increased during training, problems such as unstable training of the algorithm model, vanishing gradients, and mode collapse easily arise, making it difficult to obtain high-quality try-on results.

Summary of the Invention

[0004] The main technical problem solved by the embodiments of this application is to provide a method for training a fitting model, a method for generating fitting images, and related apparatus. The fitting model trained by this method can generate high-resolution fitting images, and the high-resolution fitting images can have a realistic and natural fitting effect.

[0005] To address the aforementioned technical problems, in a first aspect, this application provides a method for training a fitting model. The fitting network includes a first encoding network, a decoding network, and a second encoding network. The first encoding network includes multiple cascaded first encoding layers, the second encoding network includes multiple cascaded second encoding layers, and the decoding network includes multiple cascaded, spaced-apart normalization layers and decoding layers. Cross-layer connections exist between the normalization layers, the first encoding layers, and the second encoding layers at the same level. The normalization layer is used to fuse the feature map output by the preceding adjacent decoding layer or the last first encoding layer in the first encoding network with the feature maps output by the cross-layer connected first and second encoding layers.

[0006] The method includes:

[0007] Obtain a training set, which includes multiple training data, including clothing images and real fitting images, where the real fitting images include images of models wearing the corresponding clothing from the clothing images;

[0008] The real fitting images are analyzed using a human body analysis algorithm to obtain a human body analysis image, and the model's body torso image is obtained based on the human body analysis image;

[0009] The clothing in the clothing image is deformed according to the human body structure of the model in the real fitting image to obtain the deformed clothing image;

[0010] The deformed clothing image and the body torso image are input into the fitting network. The deformed clothing image is input into the first encoding network and the second encoding network for downsampling encoding, and the body torso image is input into the first encoding network for downsampling encoding and then into the decoding network for upsampling decoding. During the decoding process, the upsampled feature map output by the decoding layer is fused, through the normalization layer, with the first clothing feature map output by the first encoding layer and the second clothing feature map output by the second encoding layer. The upsampling decoding is performed layer by layer, and the last decoding layer in the decoding network outputs the predicted fitting image.

[0011] The difference between each real fitting image and each predicted fitting image in the training set is calculated using a loss function, and the fitting network is iteratively trained based on the difference until the fitting network converges to obtain the fitting model.

[0012] In some embodiments, deforming the clothing in the clothing image according to the human body structure of the model in the real fitting image to obtain the deformed clothing image includes:

[0013] A convolutional network is used to extract features from the human body image and the clothing image respectively, and the two extracted feature maps are combined into a one-dimensional tensor.

[0014] The one-dimensional tensor is input into the regression network to predict the transformation parameters.

[0015] The clothing image is interpolated and deformed according to the transformation parameters to obtain the deformed clothing image.
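The interpolation step in [0015] can be illustrated with a small sketch. The source does not state which family of transformation the regression network predicts, so this example assumes a 2x3 affine matrix (`theta`, a hypothetical name) and applies it with bilinear interpolation. The helper `warp_affine` and the toy 4x4 "clothing image" are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def warp_affine(image: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Warp a 2-D image with a 2x3 affine matrix using bilinear interpolation.

    theta maps OUTPUT pixel coordinates (x, y, 1) to INPUT coordinates,
    so every output pixel is filled by sampling the source image.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # shape (3, h*w)
    src_x, src_y = theta @ coords                                # each shape (h*w,)

    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    fx, fy = src_x - x0, src_y - y0  # fractional offsets used as blend weights

    def sample(yy, xx):
        # Pixels outside the source image contribute zero (black border).
        inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        vals = image[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
        return np.where(inside, vals, 0.0)

    out = (sample(y0, x0) * (1 - fx) * (1 - fy)
           + sample(y0, x0 + 1) * fx * (1 - fy)
           + sample(y0 + 1, x0) * (1 - fx) * fy
           + sample(y0 + 1, x0 + 1) * fx * fy)
    return out.reshape(h, w)

clothing = np.arange(16, dtype=float).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# Output column x samples input column x - 1, i.e. the content moves right.
shift_right = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]])

same = warp_affine(clothing, identity)
shifted = warp_affine(clothing, shift_right)
```

With the identity parameters the image is returned unchanged. With `shift_right` the content moves one pixel to the right and the vacated first column is filled with zeros.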

[0016] In some embodiments, obtaining the model's body torso image based on the human body analysis image includes:

[0017] Based on the human body analysis image, obtain the model's body pixel region;

[0018] Based on the real fitting image and the body pixel area, obtain the model's torso image.

[0019] In some embodiments, the method further includes:

[0020] The human body analysis image is sequentially input into the first encoding network for downsampling encoding and the decoding network for upsampling decoding; during the decoding process, the analysis feature map output by the decoding layer constrains the pixel categories of the upsampling feature maps at the same level.

[0021] In some embodiments, the normalization layer performs fusion using the following formula:

[0022] ;

[0023] where the symbols in the formula denote, respectively: the fused feature map output by the i-th normalization layer; the first clothing feature map output by the i-th first encoding layer; the upsampled feature map output by the (i-1)-th decoding layer; and the second clothing feature map output by the i-th second encoding layer.

[0024] In some embodiments, the loss function includes an adversarial loss, a perceptual loss, and a multi-scale resolution loss between the real fitting image and the predicted fitting image, wherein the multi-scale resolution loss reflects the loss between the multi-size feature maps corresponding to the predicted fitting image and the multi-size feature maps corresponding to the real fitting image.
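The idea of comparing two images at several sizes can be sketched as follows. The source's exact formula is not available in this text, so the example rests on assumptions: the "multi-size feature maps" are approximated by repeated 2x2 average pooling, the per-scale distance is a mean absolute difference, and all scales are weighted equally. The names `avg_pool2x` and `multi_scale_l1` are hypothetical.

```python
import numpy as np

def avg_pool2x(img: np.ndarray) -> np.ndarray:
    """Halve height and width by averaging non-overlapping 2x2 blocks."""
    h, w = img.shape
    cropped = img[: h - h % 2, : w - w % 2]  # drop an odd trailing row/column
    return cropped.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def multi_scale_l1(real: np.ndarray, pred: np.ndarray, num_scales: int = 3) -> float:
    """Sum of mean absolute differences over a pyramid of resolutions."""
    total = 0.0
    for _ in range(num_scales):
        total += float(np.abs(real - pred).mean())
        real, pred = avg_pool2x(real), avg_pool2x(pred)
    return total

real = np.ones((8, 8))   # stand-in for a real fitting image
pred = np.zeros((8, 8))  # stand-in for a predicted fitting image

loss_far = multi_scale_l1(real, pred)   # images differ at every scale
loss_same = multi_scale_l1(real, real)  # identical images give zero loss
```

A term of this kind penalizes differences in coarse structure as well as fine detail, which is one plausible reason for including it when training at high resolution.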

[0025] In some embodiments, the loss function includes:

[0026] ;

[0027] where,

[0028]

[0029]

[0030] where the three terms are, respectively, the aforementioned adversarial loss, the perceptual loss, and the multi-scale resolution loss. Here, T is the real fitting image, Y is the predicted fitting image, and D is the discriminator. The remaining symbols denote, in order: the k-th feature map extracted from the real fitting image; the k-th feature map extracted from the predicted fitting image; the number of elements in either of these two feature maps; V, the number of ...; the j-th feature map corresponding to the real fitting image; and the j-th feature map corresponding to the predicted fitting image.

[0031] To address the aforementioned technical problems, in a second aspect, embodiments of this application provide a method for generating fitting images, comprising:

[0032] Obtain images of the clothing to be tried on and the user's image;

[0033] The user image is analyzed using a human body analysis algorithm to obtain the user's human body analysis image, and the user's torso image is obtained based on the user's human body analysis image.

[0034] The clothing in the image of the clothing to be tried is deformed according to the user's human body structure in the user image to obtain the deformed image of the clothing to be tried.

[0035] The deformed image of the garment to be tried on and the user's torso image are input into the fitting model to generate the fitting image. The fitting model is trained using the method described in the first aspect.
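The order of the steps in [0032] to [0035] can be summarized in a short sketch. The four callables passed in (`parse_body`, `extract_torso`, `warp_clothing`, `fitting_model`) are hypothetical placeholders for the human body analysis algorithm, the torso extraction, the clothing deformation, and the trained fitting model. The source does not define such an interface.

```python
from typing import Any, Callable

def generate_fitting_image(
    clothing_image: Any,
    user_image: Any,
    parse_body: Callable[[Any], Any],
    extract_torso: Callable[[Any, Any], Any],
    warp_clothing: Callable[[Any, Any], Any],
    fitting_model: Callable[[Any, Any], Any],
) -> Any:
    """Run the generation steps in the order described in the second aspect."""
    parsing = parse_body(user_image)                    # human body analysis image
    torso = extract_torso(user_image, parsing)          # user's body torso image
    warped = warp_clothing(clothing_image, user_image)  # deformed clothing image
    return fitting_model(warped, torso)                 # generated fitting image

# Stub callables that only record what they were given, to show the data flow.
result = generate_fitting_image(
    "shirt",
    "user",
    parse_body=lambda user: f"parse({user})",
    extract_torso=lambda user, parsing: f"torso({user}, {parsing})",
    warp_clothing=lambda clothing, user: f"warp({clothing}, {user})",
    fitting_model=lambda warped, torso: f"fit({warped}, {torso})",
)
```

Passing the stages in as callables keeps the sketch independent of any particular parsing or warping implementation.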

[0036] To address the aforementioned technical problems, in a third aspect, this application provides a computer device, comprising:

[0037] At least one processor, and

[0038] A memory communicatively connected to at least one processor, wherein,

[0039] The memory stores instructions that can be executed by at least one processor, such that the at least one processor can perform the method described in the first aspect above.

[0040] To address the aforementioned technical problems, in a fourth aspect, this application provides a computer-readable storage medium storing computer-executable instructions for causing a computer device to perform the method described in the first aspect above.

[0041] The beneficial effects of this application's embodiments are as follows: Unlike the prior art, the method for training a fitting model provided in this application's embodiments includes a fitting model network structure comprising a first encoding network, a decoding network, and a second encoding network. The first encoding network includes multiple cascaded first encoding layers, the second encoding network includes multiple cascaded second encoding layers, and the decoding network includes multiple cascaded, spaced-apart normalization layers and decoding layers. Furthermore, there are cross-layer connections between normalization layers, first encoding layers, and second encoding layers at the same level. The normalization layer is used to fuse the feature map output by the preceding adjacent decoding layer or the last first encoding layer in the first encoding network with the feature maps output by the cross-layer connected first and second encoding layers.

[0042] First, a training set is acquired, comprising multiple training data items. Each training data item includes a clothing image and a real fitting image of a model wearing the corresponding clothing; the real fitting image serves as the ground-truth label. For each training data item, the clothing in the clothing image is deformed according to the human body structure of the model in the real fitting image, yielding a deformed clothing image in which the clothing appears three-dimensional and conforms to the human body structure. The real fitting image is classified pixel by pixel, and the model's body torso image is obtained from the classification result. The body torso image preserves the model's identity features while removing the background and other interfering information, which simplifies subsequent encoding and facilitates network training. Then, the deformed clothing image and the body torso image are input into the fitting network. The deformed clothing image is downsampled and encoded by the first encoding network and the second encoding network, respectively. The body torso image is sequentially downsampled and encoded by the first encoding network and upsampled and decoded by the decoding network. During decoding, the upsampled feature map output by a decoding layer is fused, through a normalization layer, with the first clothing feature map output by the first encoding layer and the second clothing feature map output by the second encoding layer. This fusion and upsampling decoding is performed layer by layer, and the last decoding layer in the decoding network outputs the predicted fitting image. Finally, a loss function is used to calculate the difference between each real fitting image and each predicted fitting image in the training set, and the fitting network is iteratively trained based on this difference until it converges, yielding the fitting model.

[0043] In the fitting network structure described above, the decoding network is constructed from multiple cascaded, spaced-apart normalization layers and decoding layers, and cross-layer connections exist between the normalization layer, the first encoding layer, and the second encoding layer at the same level. This allows the normalization layer to fuse the first clothing feature map, the second clothing feature map, and the upsampled feature map at the same level, so that the identity feature map can fuse clothing features at different scales during decoding and the loss of clothing texture is avoided. Consequently, high-resolution fitting images with a realistic and natural fitting effect can be generated. As the fitting network is iteratively trained, the fused predicted fitting images continuously approach the real fitting images (the ground-truth labels), yielding an accurate fitting model. Therefore, the fitting images generated using this model not only have high resolution but also a realistic and natural fitting effect.

Description of the Drawings

[0044] One or more embodiments are illustrated by way of example with reference to the figures in the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings denote similar elements. Unless otherwise stated, the figures in the drawings do not constitute a limitation of scale.

[0045] Figure 1 is a flowchart illustrating a method for training a fitting model according to some embodiments of this application;

[0046] Figure 2 is a schematic diagram of the fitting network structure in some embodiments of this application;

[0047] Figure 3 is a human body analysis image in some embodiments of this application;

[0048] Figure 4 is a schematic diagram of a sub-process of step S30 in the method shown in Figure 1;

[0049] Figure 5 is a schematic diagram illustrating the clothing deformation process in some embodiments of this application;

[0050] Figure 6 is a schematic diagram of the fusion process of the normalization layer in some embodiments of this application;

[0051] Figure 7 is a flowchart illustrating the method for generating fitting images in some embodiments of this application;

[0052] Figure 8 is a schematic diagram of the structure of a computer device in some embodiments of this application.

Detailed Description

[0053] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application. These all fall within the protection scope of the present application.

[0054] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0055] It should be noted that, unless there is a conflict, the various features in the embodiments of this application can be combined with each other, all of which are within the protection scope of this application. Furthermore, although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in a different order than the module division in the device or the order in the flowchart. In addition, the terms "first," "second," and "third" used herein do not limit the data or execution order, but only distinguish identical or similar items with essentially the same function and effect.

[0056] Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.

[0057] Furthermore, the technical features involved in the various embodiments of this application described below can be combined with each other as long as they do not conflict with each other.

[0058] To facilitate understanding of the methods provided in the embodiments of this application, the terms used in the embodiments of this application will first be introduced:

[0059] (1) Neural Network

[0060] A neural network is composed of neural units and can be understood as a network with an input layer, hidden layers, and an output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. Neural networks with many hidden layers are called deep neural networks (DNNs). The work of each layer in a neural network can be described by the mathematical expression y=a(W·x+b). From a physical perspective, each layer transforms the input space (the set of input vectors) into the output space (i.e., from the row space to the column space of a matrix) through five operations: 1. dimensionality increase/decrease; 2. enlargement/reduction; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by "W·x", operation 4 by "+b", and operation 5 by "a()". The term "space" is used here because the objects being classified are not individual things but a class of things; space refers to the set of all individuals within that class. W is the weight matrix of a layer, in which each value represents the weight of a neuron in that layer. The matrix W determines the spatial transformation from the input space to the output space described above; that is, the W of each layer controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained network. The training process of a neural network is therefore essentially learning how to control the spatial transformation, or more specifically, learning the weight matrices.
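As a small worked instance of y=a(W·x+b), the sketch below evaluates one layer with NumPy. The specific values of `W`, `b`, and `x`, and the choice of ReLU as the activation `a`, are illustrative assumptions rather than anything prescribed by the source.

```python
import numpy as np

def layer(x: np.ndarray, W: np.ndarray, b: np.ndarray, a) -> np.ndarray:
    """Evaluate one layer: linear map W @ x, translation + b, nonlinearity a."""
    return a(W @ x + b)

def relu(z: np.ndarray) -> np.ndarray:
    """The 'bending' step: negative values are clamped to zero."""
    return np.maximum(z, 0.0)

# A 3x2 weight matrix raises the dimension from 2 to 3 while scaling/rotating.
W = np.array([[2.0, 0.0],
              [0.0, -1.0],
              [1.0, 1.0]])
b = np.array([0.5, 0.0, -4.0])  # translation
x = np.array([1.0, 2.0])        # input vector

y = layer(x, W, b, relu)
# W @ x = [2, -2, 3]; adding b gives [2.5, -2, -1]; ReLU gives [2.5, 0, 0].
```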

[0061] It should be noted that, in the embodiments of this application, the models used for machine learning tasks are essentially neural networks. Common components in neural networks include convolutional layers, pooling layers, normalization layers, and deconvolutional layers, and a model is designed by assembling these components. When the model parameters (the weight matrices of each layer) are determined such that the model error meets a preset condition, or the number of times the model parameters have been adjusted reaches a preset threshold, the model is considered to have converged.

[0062] The convolutional layer is configured with multiple convolutional kernels, each with a corresponding stride, to perform convolution operations on the image. The purpose of convolution is to extract different features from the input image. The first convolutional layer may only extract some low-level features such as edges, lines, and corners, while deeper convolutional layers can iteratively extract more complex features from low-level features.
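The effect of the kernel and the stride can be seen in a minimal sketch. It implements an unpadded 2-D convolution (strictly a cross-correlation, as is conventional in CNN layers) with explicit loops. The function name `conv2d`, the toy image, and the edge kernel are illustrative assumptions.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Unpadded 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 6x6 image with a vertical edge: left half dark (0), right half bright (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A 1x2 kernel that responds where intensity increases from left to right.
edge_kernel = np.array([[-1.0, 1.0]])

dense = conv2d(image, edge_kernel, stride=1)    # shape (6, 5)
strided = conv2d(image, edge_kernel, stride=2)  # shape (3, 3)
```

Both outputs are non-zero only where the window straddles the edge. The stride-2 version produces a smaller feature map, which is how a downsampling convolutional layer reduces resolution.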

[0063] A deconvolutional layer is used to map a low-dimensional space to a high-dimensional space while preserving the connections/patterns between them (the connections established during convolution). A deconvolutional layer is configured with multiple convolutional kernels, each with a corresponding stride, to perform deconvolution operations on the image. Generally, framework libraries used for designing neural networks (such as the PyTorch library) provide built-in upsampling operations (for example, PyTorch's `torch.nn.Upsample` module), which allow for low-dimensional to high-dimensional spatial mapping.

[0064] Pooling layers mimic the human visual system's ability to reduce the dimensionality of data or to represent images with higher-level features. Common pooling operations include max pooling, mean pooling, stochastic pooling, median pooling, and combined pooling. Typically, pooling layers are periodically inserted between convolutional layers in neural networks to achieve dimensionality reduction.

[0065] The normalization layer is used to normalize all neurons in the intermediate layer to prevent gradient explosion and gradient vanishing.

[0066] (2) Loss function

[0067] During neural network training, we want the network's output to be as close as possible to the value we actually want to predict. To this end, we compare the network's current prediction with the target value and update the weight matrix of each layer based on the difference between them (there is usually an initialization process before the first update, in which parameters are pre-configured for each layer). For example, if the network's prediction is too high, the weight matrices are adjusted so that it predicts a lower value, and this adjustment continues until the neural network can predict the target value. Therefore, we need to predefine how to compare the difference between the predicted and target values; this is the role of the loss function or objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training the neural network becomes a process of minimizing this loss.
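The compare-and-adjust loop described above can be shown with a toy model that has a single weight. The mean squared error loss, the learning rate of 0.05, and the data following y = 2x are all illustrative assumptions. The source does not prescribe this loss or optimizer.

```python
def mse(w: float, xs: list, ys: list) -> float:
    """Loss: mean squared difference between predictions w*x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w: float, xs: list, ys: list) -> float:
    """Derivative of the loss with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # targets follow y = 2x

w = 0.0               # initialization before the first update
losses = []
for _ in range(50):
    losses.append(mse(w, xs, ys))     # compare prediction with target
    w -= 0.05 * mse_grad(w, xs, ys)   # adjust the weight to reduce the loss
```

Each iteration moves `w` toward 2, and the recorded loss shrinks accordingly, which is the sense in which training is a process of minimizing the loss.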

[0068] (3) Human body analysis

[0069] Human body analysis refers to segmenting a person captured in an image into multiple semantically consistent regions, such as body parts and clothing, or further subcategories of body parts and clothing. Essentially, it involves pixel-level identification of the input image and labeling each pixel with its corresponding object category. For example, neural networks can be used to distinguish various elements (including hair, face, limbs, clothing, and background) in an image containing a human body.

[0070] Before introducing the embodiments of this application, a brief introduction will be given to the virtual try-on method known to the inventors of this application, so as to facilitate the understanding of the embodiments of this application later.

[0071] Generally, Generative Adversarial Networks (GANs) are used to train virtual fitting models. These models are then deployed on a terminal for user access. After acquiring the user's image and an image of the clothing to be tried on, a virtual fitting image can be generated. However, most virtual fitting models generate fitting images of 256*192 pixels, which is insufficient for user needs. Forcibly increasing the resolution during training can easily lead to unstable training of the algorithm model, vanishing gradients, and mode collapse, so that high-quality fitting effects cannot be achieved. In other words, high resolution and a good fitting effect cannot both be obtained.

[0072] To address the aforementioned problems, this application provides a method for training a fitting model. The embodiments of this application are described below with reference to the accompanying drawings. Those skilled in the art will recognize that, with technological advancements and the emergence of new scenarios, the technical solutions provided in this application are also applicable to similar technical problems.

[0073] Please refer to Figure 1, which is a flowchart illustrating the method for training a fitting model provided in an embodiment of this application. Referring to Figure 2, the fitting network structure includes a first encoding network, a decoding network, and a second encoding network. The first encoding network comprises multiple cascaded first encoding layers, the second encoding network comprises multiple cascaded second encoding layers, and the decoding network comprises multiple cascaded, spaced-apart normalization layers and decoding layers. Cross-layer connections exist between normalization layers, first encoding layers, and second encoding layers at the same level. The normalization layer is used to fuse the feature map output by the preceding adjacent decoding layer, or by the last first encoding layer in the first encoding network, with the feature maps output by the cross-layer connected first and second encoding layers. Here, normalization layers, first encoding layers, and second encoding layers at the same level can be understood as those that output feature maps of the same resolution.

[0074] Those skilled in the art will understand that both the first and second encoding layers are downsampling convolutional layers. In the first and second encoding networks, the size of the output feature map decreases as the encoding layers (convolutional layers) progress. The decoding layer is an upsampling convolutional layer. In the decoding network, the size of the output feature map increases as the decoding layers (convolutional layers) progress. Those skilled in the art can configure parameters such as the kernel size and stride of the first encoding layers, second encoding layers, and decoding layers according to actual needs. In some embodiments, a pooling layer is configured after the first or second encoding layer to achieve dimensionality reduction. The pooling layer has been described in detail in "(1) Neural Network" above and will not be repeated here.
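To make "same level" concrete, the sketch below lists the feature-map sizes produced by a stack of downsampling layers and the order in which a mirrored decoder would meet them. The source leaves kernel size, stride, and depth to the implementer, so the halving per layer (stride 2), the four levels, and the 1024*768 input are assumptions chosen only for illustration.

```python
def level_sizes(height: int, width: int, num_levels: int) -> list:
    """Feature-map size after each encoding layer, assuming each halves H and W."""
    sizes = []
    for _ in range(num_levels):
        height, width = height // 2, width // 2
        sizes.append((height, width))
    return sizes

# Sizes output by the cascaded encoding layers for a 1024x768 input.
encoder_sizes = level_sizes(1024, 768, num_levels=4)

# Decoding mirrors encoding: the first normalization layer receives the output
# of the last encoding layer, and each later one works at the next larger size.
decoder_inputs = list(reversed(encoder_sizes))

# Pair every normalization layer with the encoding level of equal resolution.
pairs = [(i, encoder_sizes.index(size)) for i, size in enumerate(decoder_inputs)]
```

Here `pairs` records, for each normalization layer index, the index of the encoding layer whose output has the same resolution, which is the pairing a cross-layer connection would follow under these assumptions.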

[0075] The method S100 may specifically include the following steps:

[0076] S10: Obtain the training set.

[0077] The training set includes multiple training data items, each of which includes a clothing image and a real fitting image, where the real fitting image is an image of a model wearing the corresponding clothing from the aforementioned clothing image.

[0078] It is understood that training data includes image pairs consisting of clothing images and real-life fitting images. In some embodiments, the amount of training data is in the tens of thousands, for example, 20,000, which is beneficial for training an accurate general-purpose model. Those skilled in the art can determine the amount of training data according to the actual situation.

[0079] In an image pair consisting of a clothing image and a real-life fitting image, the clothing image includes the garment to be tried on; for example, clothing image 1# includes a green short-sleeved shirt. The real-life fitting image shows a model wearing the garment corresponding to the clothing image; for example, in the real-life fitting image corresponding to clothing image 1#, the model is wearing the green short-sleeved shirt.

[0080] It is understandable that the training data, including clothing images and real fitting images, can be collected in advance by those skilled in the art. For example, clothing images and corresponding images of models wearing the clothing (i.e., real fitting images) can be crawled from some clothing sales websites.

[0081] Since the training process needs to output high-resolution fitting images, the clothing images and real fitting images in the training set also have correspondingly high resolutions. In some embodiments, the resolution of the clothing images and real fitting images can be 1024*768.

[0082] S20: Use a human body analysis algorithm to analyze the real fitting image to obtain a human body analysis image, and obtain the model's body torso image based on the human body analysis image.

[0083] Understandably, when changing clothes on a model in a real fitting image, it's necessary to preserve the model's identity and other essential features. Extracting the model's identity features—specifically, obtaining a torso image—before merging the clothes and the model can prevent interference from the original clothing features and preserve the model's identity, ensuring the model doesn't appear distorted when wearing the clothes.

[0084] Specifically, as can be seen from "(3) Human body analysis" above, human body analysis involves dividing the human body into different parts. As shown in Figure 3, different body parts, such as hair, face, top, pants, arms, hat, and shoes, are identified and segmented, and are represented by different colors to obtain a human body analysis image.

[0085] In some embodiments, the human body analysis algorithm can be the existing Graphonomy algorithm. The Graphonomy algorithm segments the image into 20 categories, classifying each body part, and the categories can be distinguished by different colors. In some embodiments, the aforementioned 20 categories can also be denoted by the numbers 0-19, for example, 0 represents background, 1 represents hat, 2 represents hair, 3 represents gloves, 4 represents sunglasses, 5 represents top, 6 represents dress, 7 represents coat, 8 represents socks, 9 represents pants, 10 represents torso skin, 11 represents scarf, 12 represents skirt, 13 represents face, 14 represents left arm, 15 represents right arm, 16 represents left leg, 17 represents right leg, 18 represents left shoe, and 19 represents right shoe. From the human body analysis image, the category to which each body part in the image belongs can be determined.

[0086] To ensure that the model's identity information remains unchanged during the generation of the predicted fitting image, the parsing categories are simplified in some embodiments. Specifically, depending on the parsing requirements, pixels representing human identity features such as the head, arms, legs, and feet are assigned a category of 1, while the rest are assigned a category of 0. It is understood that reducing parsing errors in this way helps improve the model's convergence speed and accuracy.
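The category simplification can be sketched as a lookup over the 0-19 labels listed earlier. Which labels count as "identity features" is only described loosely in the text (head, arms, legs, feet), so the set `IDENTITY_LABELS` below (hair, face, both arms, both legs, both shoes) is an assumption, as is the function name `simplify_parsing`.

```python
import numpy as np

# Assumed identity-bearing labels: 2 hair, 13 face, 14/15 arms,
# 16/17 legs, 18/19 shoes. Every other label is treated as category 0.
IDENTITY_LABELS = [2, 13, 14, 15, 16, 17, 18, 19]

def simplify_parsing(parsing: np.ndarray) -> np.ndarray:
    """Collapse a 20-category label map into a binary identity map (1 or 0)."""
    return np.isin(parsing, IDENTITY_LABELS).astype(np.uint8)

# A toy 3x3 label map: background, hair, face / left arm, top, right arm /
# pants, left leg, left shoe.
parsing = np.array([[0, 2, 13],
                    [14, 5, 15],
                    [9, 16, 18]])

binary = simplify_parsing(parsing)
```

In this toy map the top (5) and pants (9) pixels become 0, so the original clothing is excluded while the identity regions are kept.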

[0087] Since a human body analysis image can reflect the category of each pixel, a model's torso image can be obtained from the human body analysis image. For example, pixels with a pixel category of 1 constitute the model's torso image. In some embodiments, the aforementioned "obtaining a model's torso image from a human body analysis image" specifically includes:

[0088] S21: Obtain the model's body pixel area based on the human body analysis image.

[0089] S22: Obtain the model's torso image based on the real fitting image and body pixel area.

[0090] The body pixel region can be the region of category 1 and can be represented by a matrix M. The size of matrix M is the same as that of the human body analysis image and the real fitting image. In matrix M, the elements corresponding to the model's body pixels are 1, and the elements corresponding to other pixels are 0. Therefore, multiplying the real fitting image T element-wise by the body pixel region M yields the body torso image T' = T*M.

S30: The clothing in the clothing image is deformed according to the human body structure of the model in the real fitting image to obtain the deformed clothing image.
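Steps S21 and S22 reduce to a single element-wise product, T' = T*M. The sketch below assumes an (H, W, 3) colour image and an (H, W) binary mask; the function name `torso_image` and the sample values are illustrative only.

```python
import numpy as np


def torso_image(real_image, body_mask):
    """Compute T' = T * M, broadcasting the (H, W) mask over colour channels."""
    if real_image.shape[:2] != body_mask.shape:
        raise ValueError("mask and image must have the same height and width")
    return real_image * body_mask[..., None]


# Hypothetical 2x3 RGB image with every channel set to 200
T = np.full((2, 3, 3), 200, dtype=np.uint8)
# Body pixel region M: 1 where the pixel belongs to the model's body
M = np.array([[1, 0, 1],
              [0, 1, 0]], dtype=np.uint8)
T_prime = torso_image(T, M)  # body pixels keep their value, the rest become 0
```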

[0091] Since the clothing in the clothing image is two-dimensional while the human body is three-dimensional, this step deforms the clothing in the clothing image according to the human body structure of the model in the real fitting image, so that the clothing adapts to the human body structure when the two are fused. The resulting deformed clothing image shows the clothing in a three-dimensional state that conforms to the human body structure, which helps the clothing fit the body closely after fusion and gives a realistic and natural fitting effect.

[0092] In some embodiments, referring to Figure 4, the aforementioned step S30 specifically includes:

[0093] S31: Use convolutional networks to extract features from the human body image and clothing image respectively, and combine the two extracted feature maps into a one-dimensional tensor.

[0094] S32: Input a one-dimensional tensor into the regression network to predict the transformation parameters.

[0095] S33: Interpolate and deform the clothing image according to the transformation parameters to obtain the deformed clothing image.

[0096] Referring to Figure 5, features are extracted from the human body image and the clothing image using convolutional networks, and the two extracted feature maps P and C are combined into a one-dimensional tensor. It is understood that the convolutional network here includes multiple convolutional layers, which perform downsampling and dimensionality reduction. Those skilled in the art can set the structure of the convolutional network and the parameters of the convolutional layers according to actual needs; no restrictions are imposed here.

[0097] The human body image is downsampled using a convolutional network to obtain an analytical feature map P. The clothing image is downsampled using a convolutional network to obtain a clothing feature map C. Then, the analytical feature map P is transformed into a 1*n vector P', and the clothing feature map C is transformed into a 1*n vector C'. Finally, vectors P' and C' are concatenated to form the one-dimensional tensor. It can be understood that in the above example, the size of the one-dimensional tensor is 1*2n.

[0098] It is understood that a regression network is a neural network used to perform regression analysis. Regression analysis determines a quantitative relationship between variables from a set of data. In this embodiment, the regression network is able to determine a quantitative relationship between a one-dimensional tensor and transformation parameters. It is understood that this regression network is trained from several pairs of one-dimensional tensors and transformation parameters. The training of regression networks is well known to those skilled in the art and will not be described in detail here.

[0099] When a new one-dimensional tensor is obtained, the regression network can determine the transformation parameters corresponding to the new one-dimensional tensor based on the quantitative relationship between the one-dimensional tensor and the transformation parameters. Therefore, the aforementioned one-dimensional tensor of size 1*2n can be input into the regression network to predict the corresponding transformation parameters.
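Steps S31 and S32 can be illustrated with a small sketch. The feature maps here are random stand-ins for the outputs of the convolutional networks, and the regression network is assumed to be a two-layer perceptron with untrained weights. The layer sizes and the choice of 18 output parameters (x and y values for a hypothetical 3x3 control-point grid) are assumptions, not values from the application.

```python
import numpy as np

rng = np.random.default_rng(0)


def flatten_and_concat(P, C):
    """Flatten feature maps P and C to 1*n vectors and join them into a 1*2n tensor."""
    p = P.reshape(1, -1)
    c = C.reshape(1, -1)
    if p.shape != c.shape:
        raise ValueError("P and C must contain the same number of elements")
    return np.concatenate([p, c], axis=1)


def regress(x, W1, b1, W2, b2):
    """Assumed regression network: one hidden ReLU layer, linear output."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2


# Stand-ins for the analysis feature map P and the clothing feature map C
P = rng.normal(size=(4, 8, 6))
C = rng.normal(size=(4, 8, 6))
x = flatten_and_concat(P, C)  # shape (1, 2n) with n = 4 * 8 * 6 = 192

n_params = 2 * 9  # assumed: (x, y) for each point of a 3x3 control grid
W1 = rng.normal(scale=0.01, size=(x.shape[1], 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.01, size=(32, n_params))
b2 = np.zeros(n_params)
theta = regress(x, W1, b1, W2, b2)  # predicted transformation parameters
```

In practice the weights would come from training on pairs of one-dimensional tensors and transformation parameters, as the text describes.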

[0100] Finally, the clothing image is interpolated and deformed according to the transformation parameters to obtain the deformed clothing image. In some embodiments, thin plate spline (TPS) interpolation can be used for the interpolation deformation. Using thin plate spline interpolation for image deformation requires specifying the coordinates of control points, then interpolating all pixels in the image according to the TPS function to obtain the interpolated positions, and finally mapping the pixel values to obtain the deformed image. In this embodiment, the transformation parameters are used as parameters of the TPS function. For the coordinates of each pixel in the clothing image, the TPS function is used to calculate the new coordinates, achieving positional change, i.e., deformation, to obtain the deformed clothing image.
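A thin plate spline warp of the kind described in step S33 can be sketched with NumPy alone. The sketch assumes that the transformation parameters are the target positions of a fixed grid of control points, which the application does not state. It also maps output pixels back to input pixels with nearest-neighbour sampling, a common implementation choice rather than the forward mapping described above. All function names are illustrative.

```python
import numpy as np


def tps_kernel(r2):
    """Radial basis U = r^2 * log(r^2), with U(0) defined as 0."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out


def fit_tps(src, dst):
    """Solve for TPS coefficients that map control points src (N, 2) to dst (N, 2)."""
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = tps_kernel(d2)
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)  # rows 0..n-1: kernel weights, last 3: affine part


def apply_tps(params, src, pts):
    """Evaluate the fitted TPS function at points pts (M, 2)."""
    n = src.shape[0]
    d2 = np.sum((pts[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return tps_kernel(d2) @ params[:n] + P @ params[n:]


def warp_image(image, src, dst):
    """Move image content from control points src to dst (backward mapping)."""
    h, w = image.shape[:2]
    params = fit_tps(dst, src)  # output coordinates -> input coordinates
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    mapped = apply_tps(params, dst, grid)
    mx = np.clip(np.rint(mapped[:, 0]), 0, w - 1).astype(int)
    my = np.clip(np.rint(mapped[:, 1]), 0, h - 1).astype(int)
    return image[my, mx].reshape(image.shape)


# Hypothetical 3x3 control grid over a small 6x8 image
H, W = 6, 8
gx, gy = np.meshgrid(np.linspace(0, W - 1, 3), np.linspace(0, H - 1, 3))
ctrl = np.stack([gx.ravel(), gy.ravel()], axis=1)
moved = ctrl.copy()
moved[4] += [1.0, 0.5]  # shift only the centre control point
image = np.arange(H * W, dtype=float).reshape(H, W)
warped = warp_image(image, ctrl, moved)
```

When the control points are left unmoved the warp returns the input image unchanged, which is a convenient sanity check for any TPS implementation.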

[0101] In this embodiment, a regression network is used to obtain the transformation parameters required for deformation. Then, the clothing pixels in the clothing image are interpolated and deformed according to the transformation parameters, which can obtain an accurate deformed clothing image. This is beneficial for the clothing to fit the human body when blending with the model, making it look realistic and natural.

[0102] S40: Input the deformed clothing image and body torso image into the fitting network. The deformed clothing image is input into the first encoding network and the second encoding network for downsampling encoding, and the body torso image is input into the first encoding network for downsampling encoding and the decoding network for upsampling decoding. During the decoding process, the upsampling feature map output by the decoding layer is fused with the first clothing feature map output by the first encoding layer and the second clothing feature map output by the second encoding layer through the normalization layer. The upsampling decoding is performed layer by layer. The last decoding layer in the decoding network outputs the pre-test clothing image.

[0103] Referring again to Figure 2, the deformed clothing image is input into the first encoding network for downsampling encoding, and each first encoding layer outputs a first clothing feature map. The deformed clothing image is also input into the second encoding network for downsampling encoding, and each second encoding layer outputs a second clothing feature map. The body torso image is input into the first encoding network for downsampling encoding, and the last first encoding layer in the first encoding network outputs an identity feature map. In this embodiment, the resolution of the deformed clothing image and the body torso image is 1024*768, and the resolution of the identity feature map is 8*6.

[0104] Then, the identity feature map, the first clothing feature map at the same level as the identity feature map, and the second clothing feature map are all input into the first normalization layer of the decoding network for fusion to obtain a fused feature map. Here, "same level" can be understood as having the same resolution. The obtained fused feature map is input into the first decoding layer of the decoding network for upsampling decoding. The obtained upsampled feature map, the first clothing feature map at the same level as the upsampled feature map, and the second clothing feature map are input into subsequent cascaded normalization layers and decoding layers to continue fusion and upsampling decoding until the last decoding layer of the decoding network outputs the pre-test clothing image. In this embodiment, the resolution of the pre-test clothing image is 1024*768.
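The resolutions quoted in this embodiment (1024*768 at the input and output, 8*6 for the identity feature map) are consistent with each layer changing the resolution by a factor of 2. That factor is an assumption; the application does not state it. Under that assumption, the resolution at each level can be listed as follows.

```python
def resolution_levels(full, bottleneck):
    """List the resolution at each level, halving until the bottleneck is reached.

    `full` and `bottleneck` are (a, b) tuples in the same order as the
    "1024*768" and "8*6" figures in the text.
    """
    levels = [tuple(full)]
    while levels[-1] != tuple(bottleneck):
        a, b = levels[-1]
        if a % 2 or b % 2 or a <= bottleneck[0]:
            raise ValueError("bottleneck not reachable by repeated halving")
        levels.append((a // 2, b // 2))
    return levels


levels = resolution_levels((1024, 768), (8, 6))
# Read top to bottom for the encoders, bottom to top for the decoder
for level in levels:
    print(level)
```

With these figures the list has eight entries, i.e. seven halving steps between the input and the identity feature map, so "same level" pairs a decoder stage with the encoder outputs of matching resolution.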

[0105] By fusing the first clothing feature map, the second clothing feature map, and the upsampled feature map at the same level through a normalization layer, the identity feature map can fuse clothing features from different scales during the decoding process, avoiding the loss of clothing texture. As a result, high-resolution fitting images, such as 1024*768 fitting images, can be generated, and these high-resolution fitting images can have a realistic and natural fitting effect.

[0106] In some embodiments, the normalization layer performs the fusion using the following formula:

[0107] [Formula image not reproduced]

[0108] where the symbols in the formula denote: the fused feature map output by the i-th normalization layer; the first clothing feature map output by the i-th first encoding layer; the upsampled feature map output by the (i-1)-th decoding layer; and the second clothing feature map output by the i-th second encoding layer.

[0109] As shown in Figure 6, the feature map entering the normalization layer (the upsampled feature map or, for the first normalization layer, the identity feature map) is first multiplied by the first clothing feature map of the corresponding size, and the product is then added to the second clothing feature map of the corresponding size. This multiply-then-add fusion allows the fused feature map to better preserve clothing texture features. Furthermore, the first and second clothing feature maps are produced by convolutional layers from different encoding networks, which effectively avoids feature omissions.
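The multiply-then-add fusion described above can be written as a one-line array operation. The sketch assumes the three same-level feature maps share one shape and that both operations are element-wise. Any additional normalization inside the layer is left out because the text does not specify it. The name `fuse` is illustrative.

```python
import numpy as np


def fuse(upsampled, cloth1, cloth2):
    """Fuse same-level feature maps: multiply by the first, then add the second."""
    if not (upsampled.shape == cloth1.shape == cloth2.shape):
        raise ValueError("same-level feature maps must share one shape")
    return upsampled * cloth1 + cloth2


# Hypothetical 8*6 feature maps with 2 channels
u = np.full((2, 8, 6), 2.0)    # upsampled (or identity) feature map
e1 = np.full((2, 8, 6), 3.0)   # first clothing feature map
e2 = np.ones((2, 8, 6))        # second clothing feature map
fused = fuse(u, e1, e2)        # every element equals 2 * 3 + 1
```

Read this way, the first clothing feature map acts as a per-element scale and the second as a per-element shift.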

[0110] S50: The loss function is used to calculate the difference between each real fitting image and each predicted fitting image in the training set, and the fitting network is iteratively trained based on the difference until the fitting network converges to obtain the fitting model.

[0111] Understandably, the smaller the difference between the real and predicted fitting images in the training set, the more similar the real and predicted fitting images are, indicating that the predicted fitting images can accurately reproduce the real fitting images. Therefore, the model parameters of the aforementioned fitting network can be adjusted based on the differences between the real and predicted fitting images in the training set, and the fitting network can be iteratively trained. Since the fitting network includes a first encoding network, a second encoding network, and a decoding network, the model parameters include the model parameters of the first encoding network, the second encoding network, and the decoding network. These differences are then backpropagated, causing the predicted fitting images output by the fitting network to continuously approximate the real fitting images until the fitting network converges, resulting in the fitting model.

[0112] It is understandable that the convergence of the virtual fitting network here can refer to the fact that, under a certain model parameter, the sum of the differences between each real virtual fitting image and the predicted virtual fitting image in the training set is less than a preset threshold or fluctuates within a certain range.
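The two convergence conditions above (total difference below a preset threshold, or fluctuating within a certain range) can be checked with a small helper. The window length and tolerance are hypothetical values chosen for illustration.

```python
def has_converged(loss_history, threshold, window=5, tolerance=1e-3):
    """Return True if training can be considered converged.

    Converged means either the latest total loss is below `threshold`,
    or the last `window` losses vary by no more than `tolerance`.
    """
    if not loss_history:
        return False
    if loss_history[-1] < threshold:
        return True
    recent = loss_history[-window:]
    return len(recent) == window and max(recent) - min(recent) <= tolerance


below_threshold = has_converged([5.0, 0.05], threshold=0.1)
plateau = has_converged([2.0, 2.0004, 1.9998, 2.0001, 2.0], threshold=0.1)
still_falling = has_converged([5.0, 4.0, 3.0], threshold=0.1)
```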

[0113] In some embodiments, the Adam algorithm is used to optimize the model parameters. For example, the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the learning rate weight decay is set to 0.0005, and the learning rate decays to 1 / 10 of its original value every 1,000 iterations. The learning rate and the differences between each real fitting image and each predicted fitting image in the training set can be input into the Adam algorithm to obtain the adjusted model parameters output by the Adam algorithm. The adjusted model parameters are used for the next training until the training is completed. Then, the model parameters of the converged fitting network are output, which is the fitting model.
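The stated schedule and one Adam update can be sketched in plain Python. The decay interval, initial rate, and weight decay value come from the paragraph above. Applying the weight decay as an L2 term added to the gradient, and the beta and epsilon values, are assumptions (they are the commonly used Adam defaults). The scalar toy problem is for illustration only.

```python
import math


def learning_rate(iteration, base_lr=0.001, step=1000, gamma=0.1):
    """Step decay: multiply the rate by `gamma` every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


def adam_step(param, grad, m, v, t, lr, weight_decay=0.0005,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    grad = grad + weight_decay * param          # assumed L2-style weight decay
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Toy problem: minimise w^2, whose gradient is 2w
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, learning_rate(t - 1))
```

Taken literally, a tenfold decay every 1,000 iterations makes the learning rate negligible long before 100,000 iterations, so anyone reproducing this embodiment may want to confirm the intended decay interval.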

[0114] It should be noted that, in this embodiment, the training set includes multiple training data sets, such as 20,000 training data sets, which cover different models and clothing, and can cover most of the characteristics of clothing on the market. Therefore, the trained virtual try-on model is a general-purpose model that can be widely used for virtual try-on and generating virtual try-on images.

[0115] In this embodiment, a loss function is used to calculate the difference between the real fitting images and the predicted fitting images corresponding to each real fitting image in the training set. The loss function has been described in detail in the aforementioned "Glossary (2)" section and will not be repeated here. It is understood that, based on different network structures and training methods, the structure of the loss function can be set according to the actual situation.

[0116] In summary, by designing the above-mentioned virtual try-on network structure, the decoding network is constructed using multiple cascaded, spaced-apart normalization and decoding layers. Furthermore, cross-layer connections exist between normalization layers, the first encoding layer, and the second encoding layer at the same level. Thus, the normalization layer fuses the first clothing feature map, the second clothing feature map, and the upsampled feature map at the same level, enabling the identity feature map to fuse clothing features at different scales during the decoding process. This avoids the loss of clothing texture, thereby generating high-resolution virtual try-on images with a realistic and natural virtual try-on effect. As the virtual try-on network iterates and trains, the fused pre-test virtual try-on images continuously approach the real virtual try-on images (real labels), resulting in an accurate virtual try-on model. Therefore, the virtual try-on images generated using this model not only have high resolution but also a realistic and natural virtual try-on effect.

[0117] In some embodiments, the loss function includes an adversarial loss, a perceptual loss, and a multi-scale resolution loss between the real fitting image and the predicted fitting image, wherein the multi-scale resolution loss reflects the loss between the multi-size feature maps corresponding to the predicted fitting image and the multi-size feature maps corresponding to the real fitting image.

[0118] The adversarial loss is the loss used to determine whether the predicted fitting image corresponds to the real fitting image. A large adversarial loss indicates a significant difference between the distribution of the predicted fitting image and the distribution of the real fitting image. Conversely, a small adversarial loss indicates a smaller difference or similarity between the distribution of the predicted fitting image and the real fitting image. Here, the distribution of the fitting image refers to the distribution of different parts of the image, such as the clothing, head, and limbs.

[0119] Perceptual loss compares the feature map obtained by convolving the real fitting image with the feature map obtained by convolving the predicted fitting image, so that the high-level information (content and global structure) is similar.
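A perceptual loss of this kind compares two lists of feature maps. The sketch below takes the feature maps as given, for example from a pretrained network such as VGG, and does not perform the extraction. The L1 distance and the division by the number of elements are assumptions about the exact form, and the name `perceptual_loss` is illustrative.

```python
import numpy as np


def perceptual_loss(real_feats, pred_feats):
    """Sum, over feature maps, of the mean absolute difference between pairs."""
    if len(real_feats) != len(pred_feats):
        raise ValueError("both images must provide the same number of feature maps")
    return sum(np.abs(r - p).sum() / r.size for r, p in zip(real_feats, pred_feats))


# Hypothetical feature maps at two depths of a feature extractor
real = [np.ones((2, 4, 4)), np.zeros((4, 2, 2))]
pred = [np.zeros((2, 4, 4)), np.zeros((4, 2, 2))]
loss = perceptual_loss(real, pred)  # first pair differs by 1 everywhere, second is equal
```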

[0120] The multi-scale resolution loss reflects the difference between the multi-scale feature maps corresponding to the predicted fitting image and those corresponding to the real fitting image. By comparing the predicted fitting image and the real fitting image at different resolutions, the predicted fitting image is driven toward the real fitting image at every scale, which stabilizes the generation of high-resolution fitting images during training and accelerates model convergence.
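One way to compare two images at several resolutions is to build a simple pyramid by repeated 2x2 average pooling and sum the mean absolute difference at each scale. This is a sketch under assumptions: the application does not say how the multi-scale feature maps are produced or which distance is used, and the default of three scales is chosen only to keep the example small.

```python
import numpy as np


def downsample2x(img):
    """2x2 average pooling on an (H, W) or (H, W, C) array with even H and W."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, *img.shape[2:]).mean(axis=(1, 3))


def multiscale_loss(real, pred, num_scales=3):
    """Sum of mean absolute differences over `num_scales` resolutions."""
    total = 0.0
    for _ in range(num_scales):
        total += np.abs(real - pred).mean()
        real, pred = downsample2x(real), downsample2x(pred)
    return total


real = np.ones((8, 8))
pred = np.zeros((8, 8))
loss = multiscale_loss(real, pred)  # difference of 1 at each of the 3 scales
```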

[0121] In some embodiments, the loss function is as follows:

[0122] [Formula image not reproduced]

[0123] where

[0124] [Formula image not reproduced]

[0125] [Formula image not reproduced]

[0126] where the symbols denote, respectively, the adversarial loss, the perceptual loss, and the multi-scale resolution loss, together with the hyperparameters; T is the real fitting image, Y is the predicted fitting image, and D is the discriminator. The remaining symbols denote: the k-th feature map extracted from the real fitting image; the k-th feature map extracted from the predicted fitting image; the number of elements in either of these k-th feature maps; V, the number of feature maps extracted from the real fitting image or from the predicted fitting image; the j-th feature map corresponding to the real fitting image; and the j-th feature map corresponding to the predicted fitting image.
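The symbol definitions above do not fix the exact expressions. A conventional form consistent with them, assumed here rather than taken from the application, is given below. All symbol names other than T, Y, D, V and the indices k and j are hypothetical: L_adv, L_per, and L_ms for the three losses, alpha and beta for the hyperparameters, F_k for the k-th extracted feature map, R_k for its number of elements, and T_j, Y_j for the j-th multi-scale feature maps.

```latex
\begin{aligned}
L &= L_{adv} + \alpha\, L_{per} + \beta\, L_{ms}, \\
L_{adv} &= \mathbb{E}\big[\log D(T)\big] + \mathbb{E}\big[\log\big(1 - D(Y)\big)\big], \\
L_{per} &= \sum_{k=1}^{V} \frac{1}{R_k}\, \big\lVert F_k(T) - F_k(Y) \big\rVert_1, \\
L_{ms} &= \sum_{j} \big\lVert T_j - Y_j \big\rVert_1 .
\end{aligned}
```

The choice of the L1 norm and of an unweighted adversarial term are likewise assumptions; other norms or weightings would fit the same definitions.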

[0127] In some embodiments, a convolutional neural network such as VGG can be used to downsample the real fitting image and extract V feature maps. Similarly, a convolutional neural network such as VGG can be used to downsample the predicted fitting image and extract V feature maps.

[0128] In some embodiments, seven scales are used (j = 1 to 7), and at each scale the feature map corresponding to the real fitting image and the feature map corresponding to the predicted fitting image have the same size; the specific sizes appear in formula images that are not reproduced here. Therefore, by iteratively training the virtual fitting network based on the differences calculated using loss functions including adversarial loss, perceptual loss, and the aforementioned multi-scale resolution loss, the predicted virtual fitting images can be constrained to continuously approach the real virtual fitting images in terms of distribution, content features, and multi-scale features, which is beneficial to improving the virtual fitting effect of the trained virtual fitting model.

[0129] In some embodiments, the method S100 further includes:

[0130] S60: Input the human body analysis image sequentially into the first encoding network for downsampling encoding and the decoding network for upsampling decoding.

[0131] During the decoding process, the parsed feature map output by the decoding layer constrains the pixel categories of the upsampled feature maps at the same level.

[0132] In this embodiment, since the parsed feature map can reflect the pixel category, it allows the pixel categories of upsampled feature maps at the same level to be close to the pixel categories of the corresponding parsed feature map, thereby optimizing the pixel boundaries of the upsampled feature map. As the decoding process constrains each layer, the boundaries of each region in the generated pre-test clothing image are finally clear, for example, the boundary between the clothing and the model is clear, effectively reducing boundary blurring.

[0133] In summary, the method for training a fitting model provided in this application includes a fitting model network structure comprising a first encoding network, a decoding network, and a second encoding network. The first encoding network includes multiple cascaded first encoding layers, the second encoding network includes multiple cascaded second encoding layers, and the decoding network includes multiple cascaded, spaced-apart normalization layers and decoding layers. Cross-layer connections exist between normalization layers, first encoding layers, and second encoding layers at the same level. The normalization layer is used to fuse the feature map output by the preceding adjacent decoding layer or the last first encoding layer in the first encoding network with the feature maps output by the cross-layer connected first and second encoding layers.

[0134] First, a training set is acquired, comprising multiple training data points. Each training data point includes clothing images and real-world images of models wearing the corresponding clothing from the images; these real-world images serve as the true labels. For each training data point, the clothing in the clothing image is deformed according to the model's human anatomy in the real-world images, resulting in a deformed clothing image that is three-dimensional and conforms to the human body structure. The real-world images are then classified pixel by pixel, and a model's torso image is obtained based on the classification results. This torso image preserves the model's identity features while removing background and other interfering information, simplifying subsequent encoding and facilitating network training. Then, the deformed clothing image and body torso image are input into the virtual try-on network. The deformed clothing image is downsampled and encoded by the first and second encoding networks, respectively. The body torso image is sequentially downsampled and encoded by the first encoding network and upsampled and decoded by the decoding network. During decoding, the upsampled feature map output from the decoding layer is fused with the first clothing feature map output from the first encoding layer and the second clothing feature map output from the second encoding layer through a normalization layer. This fusion and upsampling decoding is performed layer by layer. The last decoding layer in this network outputs a pre-test clothing image. Finally, a loss function is used to calculate the difference between each real virtual try-on image and each pre-test clothing image in the training set. Based on the difference, the virtual try-on network is iteratively trained until it converges, resulting in the virtual try-on model.

[0135] By designing the structure of the aforementioned virtual try-on network, the decoding network is constructed using multiple cascaded, spaced-apart normalization and decoding layers. Cross-layer connections exist between normalization layers, the first encoding layer, and the second encoding layer at the same level. This allows the normalization layers to fuse the first clothing feature map, the second clothing feature map, and the upsampled feature map at the same level. This enables the identity feature map to fuse clothing features at different scales during decoding, avoiding the loss of clothing texture. Consequently, high-resolution virtual try-on images can be generated, achieving a realistic and natural virtual try-on effect. As the virtual try-on network iterates and trains, the fused pre-test virtual try-on images continuously approach the real virtual try-on images (real labels), resulting in an accurate virtual try-on model. Therefore, the virtual try-on images generated using this model not only have high resolution but also a realistic and natural virtual try-on effect.

[0136] After a fitting model is trained using the method provided in this application, the model can be used for virtual fitting to generate fitting images. Figure 7 is a flowchart of the method for generating fitting images provided in an embodiment of this application. As shown in Figure 7, method S200 includes the following steps:

[0137] S201: Obtain the image of the clothing to be tried on and the user image.

[0138] The image of the clothes to be tried on includes the clothes, and the user image includes the user's body.

[0139] S202: Use a human body analysis algorithm to analyze the user's image to obtain the user's human body analysis image, and obtain the user's torso image based on the user's human body analysis image.

[0140] Understandably, when changing clothes for a user in a user image, it is necessary to retain the user's identity features and other necessary characteristics.

[0141] Before the clothes to be tried on are fused with the user in the fitting model, the user's identity features are extracted, i.e., the user's torso image is obtained. On the one hand, this avoids interference from the features of the user's original clothes during fusion; on the other hand, it preserves the user's identity features so that the user still looks authentic after changing into the clothes to be tried on.

[0142] The human body analysis algorithm has been described in detail in step S20 and will not be repeated here. In some embodiments, the user's torso image can be obtained by referring to the aforementioned steps S21-S22, which will not be described in detail here.

[0143] S203: The clothing in the image of the clothes to be tried on is deformed according to the user's human body structure in the user image to obtain the deformed image of the clothes to be tried on.

[0144] Considering that the clothing in the image of the clothes to be tried on is in a two-dimensional plane, while the user's human body structure is three-dimensional, in order to make the clothing in the image of the clothes to be tried on adapt to the human body structure when the clothing and the user are merged, this step deforms the clothing in the image of the clothes to be tried on according to the user's human body structure in the user image, so as to obtain the deformed image of the clothes to be tried on.

[0145] In some embodiments, the clothing in the image of the clothes to be tried on can be deformed according to the user's human body structure in the user image, as described in steps S31-S33 above. This will not be described in detail here.

[0146] S204: Input the deformed image of the clothing to be tried on and the user's torso image into the fitting model to generate a fitting image. The fitting model is trained using the method described in any of the above embodiments.

[0147] It is understood that the fitting model is trained using the same method as the fitting model training method described in the above embodiments, and has the same structure and function as the fitting model described in the above embodiments, which will not be described in detail here.

[0148] In short, the deformed image of the clothing to be tried on and the user's torso image are input into the fitting model. The fitting model will downsample and encode the deformed image of the clothing to be tried on and the user's torso image respectively. During the decoding process, feature fusion is performed to obtain the fitting image.
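Steps S201 to S204 form a short pipeline. The sketch below wires the stages together with each stage passed in as a callable, so it makes no claim about how any stage is implemented. The function name, parameter names, and the string stand-ins used in the example are all hypothetical.

```python
def generate_fitting_image(clothing_image, user_image,
                           parse, extract_torso, warp, model):
    """Run the generation steps in order and return the fitting image.

    parse(user_image)                      -> human body analysis image (S202)
    extract_torso(user_image, parsing_map) -> user's torso image        (S202)
    warp(clothing_image, user_image)       -> deformed clothing image   (S203)
    model(deformed_clothing, torso)        -> fitting image             (S204)
    """
    parsing_map = parse(user_image)
    torso = extract_torso(user_image, parsing_map)
    deformed = warp(clothing_image, user_image)
    return model(deformed, torso)


# Stand-in stages that only record the data flow as strings
result = generate_fitting_image(
    "clothes", "user",
    parse=lambda user: f"parsing({user})",
    extract_torso=lambda user, parsing: f"torso({user},{parsing})",
    warp=lambda clothes, user: f"warped({clothes},{user})",
    model=lambda warped, torso: f"fit({warped};{torso})",
)
```

Passing the stages as arguments keeps the same function usable with the trained fitting model and with test doubles.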

[0149] Based on the fact that this fitting model has the same structure and function as the fitting model in the above embodiment, the generated fitting images of users not only have high resolution, such as 1024*768, but also have a realistic and natural fitting effect.

[0150] Figure 8 is a schematic diagram of the structure of a computer device 50 provided in an embodiment of this application. The computer device 50 includes a processor 501 and a memory 502. The processor 501 is connected to the memory 502; for example, the processor 501 can be connected to the memory 502 via a bus.

[0151] Processor 501 is configured to support the computer device 50 in executing the method of Figures 1-6 or the method of Figure 7. The processor 501 can be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof. The aforementioned hardware chip can be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The aforementioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

[0152] The memory 502, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for training the fitting model in the embodiments of this application, or the program instructions/modules corresponding to the method for generating fitting images. The processor 501 can implement the method for training the fitting model or the method for generating fitting images in any of the above method embodiments by running the non-transitory software programs, instructions, and modules stored in the memory 502.

[0153] Memory 502 may include volatile memory (VM), such as random access memory (RAM); memory 502 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid-state drive (SSD); memory 502 may also include combinations of the above types of memory.

[0154] This application also provides a computer-readable storage medium storing a computer program, the computer program including program instructions, which, when executed by a computer, cause the computer to perform the method of training a fitting model or the method of generating a fitting image as described in the foregoing embodiments.

[0155] It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0156] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented using software and a general-purpose hardware platform, or of course, using hardware. Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.

[0157] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and not to limit them; under the concept of this application, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of this application as described above, which are not provided in detail for the sake of brevity; although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

Claims

1. A method for training a fitting model, characterized in that, The fitting network includes a first encoding network, a decoding network, and a second encoding network. The first encoding network includes multiple cascaded first encoding layers, the second encoding network includes multiple cascaded second encoding layers, and the decoding network includes multiple cascaded, spaced-apart normalization layers and decoding layers. There are cross-layer connections between the normalization layers, the first encoding layers, and the second encoding layers at the same level. The normalization layer is used to fuse the feature map output by the previous adjacent decoding layer or the last first encoding layer in the first encoding network with the feature maps output by the cross-layer connected first and second encoding layers. The method includes: Obtain a training set, which includes multiple training data, including clothing images and real fitting images, where the real fitting images include images of models wearing the corresponding clothing from the clothing images; The real fitting images are analyzed using a human body analysis algorithm to obtain a human body analysis diagram, and the model's torso diagram is obtained based on the human body analysis diagram; The clothing in the clothing image is deformed according to the human body structure of the model in the real fitting image to obtain the deformed clothing image; The deformed clothing image and the body torso image are input into the fitting network. The deformed clothing image is input into the first encoding network and the second encoding network for downsampling encoding, and the body torso image is input into the first encoding network for downsampling encoding and the decoding network for upsampling decoding. 
During the decoding process, the upsampling feature map output by the decoding layer is fused with the first clothing feature map output by the first encoding layer and the second clothing feature map output by the second encoding layer through the normalization layer. The upsampling decoding is performed layer by layer. The last decoding layer in the decoding network outputs a predicted fitting image. The difference between each real fitting image and each predicted fitting image in the training set is calculated using a loss function, and the fitting network is iteratively trained based on the difference until the fitting network converges to obtain the fitting model. The normalization layer performs the fusion using the following formula: [Formula image not reproduced]; where the symbols in the formula denote: the fused feature map output by the i-th normalization layer; the first clothing feature map output by the i-th first encoding layer; the upsampled feature map output by the (i-1)-th decoding layer; and the second clothing feature map output by the i-th second encoding layer.

2. The method according to claim 1, characterized in that deforming the clothing in the clothing image according to the human body structure of the model in the real fitting image to obtain the deformed clothing image includes: extracting features from the human body image and from the clothing image respectively using a convolutional network, and combining the two extracted feature maps into a one-dimensional tensor; inputting the one-dimensional tensor into a regression network to predict transformation parameters; and deforming the clothing image by interpolation according to the transformation parameters to obtain the deformed clothing image.
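
The last step of claim 2, deforming an image by interpolation from predicted transformation parameters, can be sketched as follows. The claim does not state which family of transformations is used, so this sketch assumes a six-parameter affine transform with bilinear interpolation purely for illustration. The parameters are supplied by hand rather than by a regression network, and the names `bilinear_sample` and `warp_affine` are hypothetical.

```python
import math
from typing import List, Sequence

Image = List[List[float]]  # single-channel image as rows of pixel values


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at real-valued (x, y); positions outside the image give 0."""
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp_affine(img: Image, theta: Sequence[float]) -> Image:
    """Warp img with theta = (a, b, tx, c, d, ty).

    Assumption: theta maps each output pixel (x, y) to the source position
    (a*x + b*y + tx, c*x + d*y + ty), which is then sampled bilinearly.
    """
    a, b, tx, c, d, ty = theta
    h, w = len(img), len(img[0])
    return [
        [
            bilinear_sample(img, a * x + b * y + tx, c * x + d * y + ty)
            for x in range(w)
        ]
        for y in range(h)
    ]
```

With `theta = (1, 0, 0, 0, 1, 0)` the warp is the identity; in a trained system the six values would come from the regression network instead.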

3. The method according to claim 1, characterized in that obtaining the torso image of the model based on the human body parsing map includes: obtaining a body pixel region of the model based on the human body parsing map; and obtaining the torso image of the model based on the real fitting image and the body pixel region.
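
The two steps of claim 3 amount to building a mask from the parsing map and applying it to the real fitting image. A minimal sketch follows, assuming the parsing map stores one integer class label per pixel. The label values, the fill colour, and the function names `body_mask` and `torso_image` are hypothetical; the patent does not specify a label scheme.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int, int]  # RGB


def body_mask(parsing: List[List[int]], body_labels: Set[int]) -> List[List[bool]]:
    """Mark pixels whose parsing label belongs to the body region."""
    return [[label in body_labels for label in row] for row in parsing]


def torso_image(
    image: List[List[Pixel]],
    parsing: List[List[int]],
    body_labels: Set[int],
    fill: Pixel = (0, 0, 0),
) -> List[List[Pixel]]:
    """Keep body pixels of the image and replace all other pixels with fill."""
    if len(image) != len(parsing) or any(
        len(img_row) != len(lab_row) for img_row, lab_row in zip(image, parsing)
    ):
        raise ValueError("image and parsing map must have the same size")
    mask = body_mask(parsing, body_labels)
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

For example, with hypothetical labels where 1 marks skin and 2 marks clothing, passing `body_labels={1}` keeps only the skin pixels.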

4. The method according to any one of claims 1-3, characterized in that the method further includes: sequentially inputting the human body parsing map into the first encoding network for downsampling encoding and into the decoding network for upsampling decoding, wherein, during decoding, the parsing feature map output by a decoding layer constrains the pixel categories of the upsampled feature map at the same level.

5. The method according to any one of claims 1-3, characterized in that the loss function includes an adversarial loss, a perceptual loss, and a multi-scale resolution loss between the real fitting image and the predicted fitting image, wherein the multi-scale resolution loss reflects the loss between the feature maps of multiple sizes corresponding to the predicted fitting image and the feature maps of multiple sizes corresponding to the real fitting image.

6. The method according to claim 5, wherein the loss function comprises: [formulas missing from the extracted text], where the three loss terms are, respectively, the adversarial loss, the perceptual loss, and the multi-scale resolution loss; T is the real fitting image, Y is the predicted fitting image, and D is the discriminator; the remaining symbols denote, in order: the k-th feature map extracted from the real fitting image; the k-th feature map extracted from the predicted fitting image; the number of elements in either of those k-th feature maps; V, the number of elements in the ... [remaining count definitions missing from the extracted text]; the j-th feature map corresponding to the real fitting image; and the j-th feature map corresponding to the predicted fitting image.
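
A multi-scale comparison of the kind described in claims 5 and 6 can be sketched as follows. Because the loss formulas are missing from this text, the sketch makes its own assumptions: the "feature maps of multiple sizes" are stood in for by repeated 2x2 average pooling of single-channel images, and each scale contributes a mean absolute difference normalized by its number of elements. The function names and the default `num_scales=3` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # single-channel image as rows of pixel values


def avg_pool_2x2(img: Image) -> Image:
    """Halve the resolution by averaging each 2x2 block (odd edges are dropped)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (
                img[2 * y][2 * x]
                + img[2 * y][2 * x + 1]
                + img[2 * y + 1][2 * x]
                + img[2 * y + 1][2 * x + 1]
            )
            / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Sum of absolute differences divided by the number of elements."""
    total = sum(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def multi_scale_loss(real: Image, predicted: Image, num_scales: int = 3) -> float:
    """Add up the per-scale differences, coarsening both images between scales."""
    loss = 0.0
    for _ in range(num_scales):
        loss += mean_abs_diff(real, predicted)
        if len(real) < 2 or len(real[0]) < 2:
            break  # cannot pool any further
        real, predicted = avg_pool_2x2(real), avg_pool_2x2(predicted)
    return loss
```

In a full training loss this term would be added to the adversarial and perceptual terms; how the three are weighted is not stated in this text.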

7. A method for generating a fitting image, characterized in that the method includes:
obtaining an image of the clothing to be tried on and a user image;
parsing the user image using a human body parsing algorithm to obtain a human body parsing map of the user, and obtaining a torso image of the user based on the human body parsing map of the user;
deforming the clothing in the image of the clothing to be tried on according to the human body structure of the user in the user image to obtain a deformed image of the clothing to be tried on;
inputting the deformed image of the clothing to be tried on and the torso image of the user into a fitting model to generate the fitting image, wherein the fitting model is trained using the method according to any one of claims 1-6.
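
The order of operations in claim 7 can be sketched as a small pipeline. Every stage is passed in as a callable, since the parsing algorithm, the torso extraction, the deformation step, and the trained fitting model are not specified here; all parameter names are hypothetical placeholders.

```python
from typing import Any, Callable


def generate_fitting_image(
    clothing_image: Any,
    user_image: Any,
    parse_human: Callable[[Any], Any],
    extract_torso: Callable[[Any, Any], Any],
    warp_clothing: Callable[[Any, Any], Any],
    fitting_model: Callable[[Any, Any], Any],
) -> Any:
    """Run the four inference steps in the order the claim recites them."""
    # Step 1: parse the user image into a human body parsing map.
    parsing_map = parse_human(user_image)
    # Step 2: derive the user's torso image from the image and its parsing map.
    torso = extract_torso(user_image, parsing_map)
    # Step 3: deform the clothing according to the user's body structure.
    warped_clothing = warp_clothing(clothing_image, user_image)
    # Step 4: the trained fitting model combines the two inputs.
    return fitting_model(warped_clothing, torso)
```

Passing the stages as arguments keeps the sketch independent of any particular parsing or warping implementation.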

8. A computer device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions for causing a computer device to perform the method according to any one of claims 1-7.