An e-commerce product matting method
By improving the Transformer structure and attention mechanism, and by fusing the global representation of the trimap with the product image features, the method addresses the insufficient accuracy and precision of product image matting on e-commerce platforms and achieves effective matting for large product images.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- FOCUS TECH
- Filing Date
- 2024-04-17
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies for image matting on e-commerce platforms suffer from problems such as high computational load, limitations in positional encoding, and poor saliency detection, resulting in insufficient accuracy and precision, especially for mechanical and component product images.
A Transformer-based segmentation model performs saliency detection. Thresholding is combined with erosion and dilation operations to generate a trimap. The matting model is optimized by improving the attention mechanism and using relative position encoding. The global representation of the trimap is fused with the features of the product image through a window-based attention mechanism, and the loss is minimized to generate high-precision matting results.
It improves the accuracy and precision of product cutout in e-commerce, solves the problems of subject optimization and edge refinement in large product images, and generates more accurate cutout results.
Figure CN118261933B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer image matting, and more particularly to a method for matting e-commerce products.
Background Technology
[0002] Product image matting is widely used in e-commerce platforms for product display, poster production, etc. High quality is required for matting, and high-precision matting results are needed to ensure good display effects. However, fine matting tasks often require more manpower and time. Existing matting methods face difficulties such as limited training data, complex environments, and low computational efficiency, making it difficult to meet the needs of matting massive amounts of product images.
[0003] Mainstream image matting algorithms rely on manually drawn, high-quality trimaps, which are complex and time-consuming to produce. Recent end-to-end frameworks offer simpler matting methods, but applying them directly to product images easily leads to missing subjects and blurred edges, and the results are even worse for complex product images such as those of machinery or components. Because product images have a prominent subject, a two-stage framework is a better fit: the first stage uses saliency detection to locate the subject, and the second stage uses a matting model for edge refinement. However, current saliency detection models perform poorly on certain machinery and component types, resulting in inaccurate trimaps. Commonly used convolution-based matting models also suffer from insufficient edge refinement.
[0004] Existing image matting models based on the transformer structure establish dependencies between different image patches through attention mechanisms. Compared with convolution-based image matting models, they can better understand global semantic information. However, existing transformer image matting models use global image patches for encoding, which is computationally intensive. At the same time, they use absolute position encoding for image patches, which cannot learn the contextual information and positional relationships between different image patches well, resulting in poor accuracy in matting large machinery images or product images with varying scales.
[0005] Therefore, there is a need for a method to cut out images of e-commerce products with improved accuracy and precision.
Summary of the Invention
[0006] The technical problem solved by this invention is to overcome the shortcomings of the prior art and provide a method for product matting in e-commerce. It uses a high-precision segmentation model to detect the salient product and generates a trimap through thresholding and erosion-dilation operations. It uses a Transformer-based structure to fuse the global representation of the trimap with the features of the product image, thereby further improving the matting results for general product images.
[0007] To solve the above technical problems, the present invention provides a method for image cutout of e-commerce products, comprising the following steps:
[0008] Step 1: Prepare the first dataset for training the binary segmentation model. The binary segmentation model is based on the transformer structure and is used to segment salient objects and background in the product image. The first dataset includes the RGB image of the product and the first mask image. Apply a threshold to the first mask image to binarize it into the second mask image. Apply erosion and dilation to the second mask image to obtain the trimap.
[0009] Step 2: Prepare a second dataset for training the matting model. The matting model is based on a transformer architecture. The second dataset includes the RGB image and alpha label of the product. Input the RGB image and the trimap obtained in Step 1 into the matting model. Improve the transformer architecture using a window-based attention mechanism, a window offset-based attention mechanism, and relative position encoding. The window size is tuned during training and set to 7 pixels, and attention weights are calculated only among pixels within a window. Input the trimap into the window-based attention mechanism to fuse the global representation of the trimap with the features of the RGB image, minimizing the loss between the predicted alpha image and the alpha label. Merge the predicted alpha image and the original RGB image along the channel dimension to obtain the final matting result.
[0010] Step 1 specifically includes:
[0011] Step 1-1: The first dataset includes the open-source, manually annotated DIS5K segmentation images and collected e-commerce product images. The e-commerce product images include real product images and product images synthesized by replacing the product background, totaling no fewer than 250,000 images. The first mask image is the mask image of the corresponding image. A binary cross-entropy loss is computed between the predicted mask image and the first mask image, and the model parameters are optimized by minimizing this loss.
[0012] Step 1-2: Perform data augmentation on the RGB image of the product, including random cropping and normalization. The input size is 1024*1024. The binary segmentation model includes an intermediate supervision module, which uses a lightweight deep learning model $F_{gt}$ that is fine-tuned again on the first dataset to supervise the multi-layer feature maps of the binary segmentation model and prevent overfitting. A convolution operation is added in front of the binary segmentation network.
[0013] Step 1-3: Apply a threshold to the first mask image, then apply erosion and dilation operations with a kernel size of 3 and 3 iterations to the resulting second mask image to obtain the trimap labels.
[0014] Step 2 specifically includes:
[0015] Step 2-1: The second dataset also includes background images, with no fewer than 100,000 RGB images and no fewer than 20,000 background images. New images are synthesized online using the RGB images and background images.
[0016] Step 2-2: Input the new image synthesized from the product RGB image and the background image, together with the trimap obtained by processing the second mask image, into the matting model; output three semantic feature maps and calculate the total matting loss against the true alpha label. During training, the trimap is obtained by processing the second mask image;
[0017] Step 2-3: The matting loss includes the alpha loss $L_{a}$, the synthesis loss $L_{com}$ of the foreground and background, and the Laplacian gradient constraint loss $L_{lap}$. The joint loss calculated from a single semantic feature map and the alpha label is $L_{matting} = L_{a} + L_{com} + L_{lap}$;
[0018] Step 2-4: Calculate the joint losses $L_{matting1}$, $L_{matting2}$, and $L_{matting3}$ between each of the three semantic feature maps and the alpha label, and calculate the weighted average of each type of loss across the three feature maps, denoted as:
[0019] Average alpha loss $L_{a\text{-}mean} = (2 L_{a1} + 2 L_{a2} + L_{a3}) / 5$,
[0020] Average synthesis loss $L_{com\text{-}mean} = (2 L_{com1} + 2 L_{com2} + L_{com3}) / 5$,
[0021] and
[0022] Average Laplacian gradient constraint loss $L_{lap\text{-}mean} = (2 L_{lap1} + 2 L_{lap2} + L_{lap3}) / 5$.
[0023] The parameters of the image matting model are updated using the average joint loss:
[0024] $L_{SUM} = L_{a\text{-}mean} + L_{com\text{-}mean} + L_{lap\text{-}mean}$;
[0025] Step 2-5: During model inference, the predicted alpha image is merged with the original RGB image of the product along the channel dimension to obtain a 4-channel image matting result.
[0026] The binary segmentation model uses U2-Net as the backbone network. The deep learning model $F_{gt}$ takes the ground-truth segmentation mask G as input and is trained as a self-supervised encoder that extracts six intermediate feature maps. The loss between the mask predicted from each intermediate feature map and the ground-truth segmentation mask G is minimized, using the binary cross-entropy (BCE) loss, as shown in the following formula:
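[0027] $\theta_{gt} = \arg\min_{\theta_{gt}} \sum_{d=1}^{D} \mathrm{BCE}\big(F_{gt}(\theta_{gt}, G)_{d}, G\big)$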
[0028] where $F_{gt}$ denotes the deep learning model, $\theta_{gt}$ denotes its weights, D is the number of intermediate feature maps, G is the first mask image, and BCE is the binary cross-entropy loss.
[0029] After the self-supervised GT encoder $F_{gt}$ has been trained, its weights $\theta_{gt}$ are frozen, and the frozen encoder generates the supervisory feature maps used to supervise the corresponding intermediate deep features that the segmentation model $F_{sg}$, with weights $\theta_{sg}$, produces from image I. The loss $L_{fs}$ between the intermediate feature maps generated by the supervision module and those generated by the segmentation model is calculated to ensure feature synchronization, with a weight assigned to the synchronization loss of each of the six layers. The training process of the segmentation model is expressed as the optimization problem $\arg\min (L_{fs} + L_{sg})$, where $L_{sg}$ is the BCE loss between the segmentation model's predictions from its multiple intermediate feature maps and the ground-truth mask label G, likewise with a weight for each of the six feature-map losses;
[0030] Using the loss $L_{fs}$ between the six intermediate feature maps generated by the self-supervised encoder and those generated by the segmentation model, and the loss $L_{sg}$ between the predictions from the segmentation model's intermediate feature maps and the ground-truth mask label G, the model parameters are updated iteratively to minimize the loss $L_{S} = L_{fs} + L_{sg}$.
[0031] In step 2, let $\hat{\alpha}$ be the predicted alpha, $\alpha$ the true alpha label, P the predicted composite image, I the new image synthesized from the RGB image and the background, F the foreground, and B the background.
[0032] The alpha loss $L_{a}$ uses the L1 loss function to calculate the absolute error between the predicted and true pixel values, expressed as:
[0033] $L_{a} = \sum \lVert \hat{\alpha} - \alpha \rVert_{1}$
[0034] The synthesis loss $L_{com}$ of the foreground and background uses the L1 loss function to calculate the absolute error of the corresponding pixel values between the predicted synthesized image and the synthesized new image input to the model, expressed as:
[0035] $L_{com} = \sum \lVert P - I \rVert_{1}$
[0036] where I is the new image synthesized from the RGB image and the background, and P is the predicted composite image.
[0037] For the Laplacian loss $L_{lap}$, the predicted alpha image is decomposed into three semantic feature layers, and the L1 loss function is applied at each layer to calculate the loss between the predicted alpha image and the ground-truth alpha label, supervising both the local and global alpha output. The alpha image predicted at the i-th layer is obtained by applying convolution and upsampling operations to the semantic feature map of that layer; upsampling ensures that the predicted alpha image has the same resolution as the true alpha label. The Laplacian loss $L_{lap}$ is computed as follows:
[0038]
[0039] The beneficial effects achieved by this invention are as follows:
[0040] This invention addresses the high computational cost and positional-encoding limitations of transformer attention mechanisms by improving the transformer structure. It employs window-based attention, window offset-based attention, and relative positional encoding, which better integrate local and global feature information. Furthermore, it incorporates the trimap directly into the transformer structure, integrating the global representation of the trimap into the attention calculation of each window region of the original image. This further improves subject optimization and edge refinement for large product images, yields accurate product matting results, and effectively solves the problem of inaccurate matting mask prediction caused by ignoring the prior information in the trimap.
Description of the Drawings
[0041] Figure 1 is a simplified flowchart of a method according to an embodiment of the present invention;
[0042] Figure 2 is a schematic diagram of the overall process of an e-commerce product image cutout method according to an embodiment of the present invention;
[0043] Figure 3 is a schematic diagram of a supervision module for a high-precision segmentation model according to an embodiment of the present invention;
[0044] Figure 4 is a schematic diagram of a high-precision segmentation model according to an embodiment of the present invention;
[0045] Figure 5 is a schematic diagram comparing the traditional saliency result for a product with the detection result of the high-precision segmentation model and the trimap generated from the high-precision segmentation result, according to an embodiment of the present invention;
[0046] Figure 6 is a comparison of the high-precision segmentation prediction for a product and the refined prediction of the transformer-based matting model, according to an embodiment of the present invention.
Detailed Description
[0047] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described in the drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0048] As shown in Figure 1, the e-commerce product image cutout method according to an embodiment of the present invention specifically includes the following steps:
[0049] Step 1: Prepare the dataset for training the high-precision binary segmentation model, which segments salient objects from the background in product images. The prepared dataset includes the RGB image of the product and the first mask image of the product. Apply a threshold to the predicted first mask image to binarize it into the second mask image, then apply erosion and dilation to the second mask image to obtain a trimap.
[0050] Step 2: Prepare the data for training the image matting model, including the product's RGB image and alpha label. Input the product's RGB image and the trimap from Step 1 into the image matting model. To address the high computational cost and positional-encoding limitations of existing transformer attention mechanisms, the transformer structure is improved with window-based attention, window offset-based attention, and relative positional encoding, which allows better fusion of local and global image features. The window size is tuned during training and finally set to 7. Whereas a traditional transformer computes attention weights across all pixels at high computational cost, the improved transformer computes attention weights only for pixels within each window. The trimap is fed directly into the window-based attention module, fusing the global representation of the trimap with the product image's features, and the loss between the predicted alpha image and the product's true alpha label is minimized. Finally, the predicted alpha image is merged with the original image to obtain the final matting result.
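To make the computational argument concrete, the following minimal sketch counts query-key pairs for all-pairs attention versus attention restricted to 7×7 windows, and shows one way a window offset and a relative position index could be computed. The function names, the 56×56 token grid in the usage note, and the cyclic-shift and bias-table indexing conventions are illustrative assumptions borrowed from common shifted-window designs; the patent does not specify them.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height: int, width: int, window: int = 7) -> int:
    """Query-key pairs when tokens attend only within non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return windows * tokens_per_window * tokens_per_window


def window_id(row: int, col: int, height: int, width: int,
              window: int = 7, shift: int = 0) -> int:
    """Window index of token (row, col) after a cyclic shift of the grid.

    The cyclic shift is an assumed way to realise the window offset.
    """
    shifted_row = (row - shift) % height
    shifted_col = (col - shift) % width
    return (shifted_row // window) * (width // window) + shifted_col // window


def relative_position_index(dr: int, dc: int, window: int = 7) -> int:
    """Map a relative offset (dr, dc) inside a window to a bias-table slot.

    Offsets lie in [-(window - 1), window - 1], giving (2 * window - 1) ** 2 slots.
    """
    span = 2 * window - 1
    return (dr + window - 1) * span + (dc + window - 1)
```

For a hypothetical 56×56 token grid, `global_attention_pairs(56, 56)` is 64 times larger than `window_attention_pairs(56, 56)`, which illustrates why restricting attention to windows reduces the cost. With `shift=3`, tokens that sat in different windows before the shift can share a window afterwards, which is how offset windows let information cross window borders.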
[0051] Figure 2 shows the overall process of the e-commerce product image cutout method, in which step 1 specifically includes:
[0052] Step 1-1: Prepare the dataset for training the high-precision binary segmentation model. The two classes are object and background. The RGB product images mainly comprise the open-source, high-precision, manually annotated DIS5K segmentation data and collected e-commerce product images. The e-commerce product images mainly include real product images and product images synthesized by replacing the product background, totaling no fewer than 250,000 images. The first mask label is the mask image of the corresponding image. A binary cross-entropy loss is computed between the predicted mask image and the first mask label, and the model parameters are optimized by minimizing this loss.
[0053] Step 1-2: Apply data augmentation such as random cropping and normalization to the RGB image of the product. The input size is 1024*1024. An intermediate supervision module is introduced, which uses a lightweight deep learning model $F_{gt}$ that is fine-tuned again on the first dataset. Supervision is usually applied to the side outputs, each of which is a single-channel probability map generated by convolving the last feature map of a specific depth layer. Generating a single-channel probability map from a high-dimensional feature map inevitably loses key information. This invention instead supervises the multi-layer feature maps of the binary segmentation model through the supervision module, which prevents the binary segmentation model from overfitting. In addition, a convolution operation is added in front of the backbone segmentation network to reduce the GPU overhead of the entire network.
[0054] Specifically, the high-precision binary segmentation model uses U2-Net as the backbone network;
[0055] Specifically, an intermediate supervision module is introduced into the segmentation model, with the structure shown in Figure 3. The input is the first mask G, and a lightweight deep learning model $F_{gt}$ is trained as a self-supervised ground-truth (GT) encoder that extracts six high-dimensional feature maps. The loss between the mask predicted from each feature map and the first mask G is minimized, using the binary cross-entropy (BCE) loss, as shown in the following formula:
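[0056] $\theta_{gt} = \arg\min_{\theta_{gt}} \sum_{d=1}^{D} \mathrm{BCE}\big(F_{gt}(\theta_{gt}, G)_{d}, G\big)$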
[0057] where $F_{gt}$ denotes the intermediate supervision model, $\theta_{gt}$ denotes its weights, D is the number of intermediate feature maps, G is the first mask image, and BCE is the binary cross-entropy loss.
[0058] Specifically, Figure 4 shows how the intermediate supervision module assists the training of the high-precision segmentation model. The intermediate supervision module is the self-supervised GT encoder $F_{gt}$. After it has been trained, its weights $\theta_{gt}$ are frozen so that it can generate the "ground truth" high-dimensional intermediate deep features, with D = {1, 2, 3, 4, 5, 6} indexing the layers. These features supervise the corresponding intermediate deep features that the segmentation model $F_{sg}$, with weights $\theta_{sg}$, produces from image I, likewise for D = {1, 2, 3, 4, 5, 6}. The loss $L_{fs}$ between the intermediate feature maps generated by the supervision module and those generated by the segmentation model is then calculated to ensure feature synchronization, with a weight assigned to the synchronization loss of each of the six layers. The final training process of the segmentation model can be expressed as the optimization problem $\arg\min (L_{fs} + L_{sg})$, where $L_{sg}$ is the BCE loss between the segmentation model's predictions from its multiple intermediate feature maps and the ground-truth mask label G, likewise with a weight for each of the six feature-map losses;
[0059] Specifically, using the loss $L_{fs}$ between the six intermediate feature maps generated by the self-supervised encoder and those generated by the segmentation model, and the loss $L_{sg}$ between the predictions from the segmentation model's intermediate feature maps and the true mask label G, the model parameters are iteratively updated to minimize the loss $L_{S} = L_{fs} + L_{sg}$;
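The combined objective $L_{S} = L_{fs} + L_{sg}$ can be sketched as below, using plain Python lists in place of tensors. This is an illustration under stated assumptions rather than the patented implementation: the squared-error distance for the feature synchronization term, the default per-layer weights of 1.0, and all function and parameter names are hypothetical, since the text names the two losses and their per-layer weights but gives no values.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def mse(a, b):
    """Mean squared error (assumed distance for feature synchronization)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def segmentation_loss(seg_feats, gt_feats, seg_preds, mask,
                      w_fs=None, w_sg=None):
    """Return (L_S, L_fs, L_sg) over all supervised layers.

    seg_feats / gt_feats: per-layer flattened features from the segmentation
    model and from the frozen GT encoder; seg_preds: per-layer predicted
    probabilities; mask: flattened ground-truth mask G.
    """
    depth = len(seg_feats)
    w_fs = w_fs or [1.0] * depth  # hypothetical per-layer weights
    w_sg = w_sg or [1.0] * depth
    l_fs = sum(w * mse(f_i, f_g)
               for w, f_i, f_g in zip(w_fs, seg_feats, gt_feats))
    l_sg = sum(w * bce(p, mask) for w, p in zip(w_sg, seg_preds))
    return l_fs + l_sg, l_fs, l_sg
```

In a real training loop only the segmentation model's weights would receive gradients from this loss, because the GT encoder is frozen after its own training.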
[0060] Figure 5 compares the result of a traditional saliency detection model with the segmentation of the product image by the high-precision segmentation model and with the trimap obtained by binarizing the segmentation result and applying erosion and dilation. Comparing the product RGB image, the traditional saliency map (the output of an earlier saliency detection model on the product image), the high-precision segmentation image, and the generated trimap shows that the segmentation of the present invention is better and that the trimap is also more accurate.
[0061] Step 1-3: Apply a threshold to the predicted mask image to obtain a binary image separating the predicted object from the background, then apply erosion and dilation operations with a kernel size of 3 and 3 iterations to the predicted binary image to obtain the trimap labels.
[0062] Specifically, the mask image that the segmentation model predicts for the input image I is more stable with the help of the supervision encoder. The prediction is binarized with a threshold to separate the salient target in image I from the background, and erosion and dilation are applied to this binary image to generate the trimap. A high-precision segmentation of the product image therefore yields a more accurate trimap, and the subsequent matting model uses this more accurate trimap to further optimize the edge region of the salient target.
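The trimap generation described in step 1-3 can be sketched in pure Python as follows. The kernel size of 3 and the 3 iterations come from the text; the 0.5 threshold, the square kernel shape, the 0/128/255 encoding, and the convention that the eroded mask is definite foreground while the band between the dilated and eroded masks is the unknown region are assumptions, chosen because they are common practice. In production code the same operations are typically done with an image library such as OpenCV (`cv2.erode`, `cv2.dilate`).

```python
def binarize(mask, threshold=0.5):
    """Turn a probability mask (rows of floats) into a 0/1 mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in mask]


def _morph(binary, kernel, reducer):
    """Apply min (erosion) or max (dilation) over a square neighbourhood."""
    height, width = len(binary), len(binary[0])
    radius = kernel // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbourhood = [
                binary[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(reducer(neighbourhood))
        out.append(row)
    return out


def erode(binary, kernel=3, iterations=3):
    for _ in range(iterations):
        binary = _morph(binary, kernel, min)
    return binary


def dilate(binary, kernel=3, iterations=3):
    for _ in range(iterations):
        binary = _morph(binary, kernel, max)
    return binary


def make_trimap(mask, threshold=0.5, kernel=3, iterations=3):
    """255 = definite foreground, 128 = unknown band, 0 = definite background."""
    binary = binarize(mask, threshold)
    sure_fg = erode(binary, kernel, iterations)
    maybe_fg = dilate(binary, kernel, iterations)
    return [
        [255 if fg else (128 if near else 0) for fg, near in zip(fg_row, near_row)]
        for fg_row, near_row in zip(sure_fg, maybe_fg)
    ]
```

With a 3×3 kernel and 3 iterations the unknown band extends roughly 3 pixels to each side of the binarized object boundary, which is the region the matting model is then asked to refine.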
[0063] Step 2 specifically includes:
[0064] Step 2-1: Prepare the data for training the image matting model, including product RGB images, product alpha labels, and background images. There should be no fewer than 100,000 product images and no fewer than 20,000 background images. Training images are synthesized online from the original product images and the backgrounds.
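Online synthesis can be pictured as drawing a random background for each training sample and compositing the product over it with its alpha label. The sketch below assumes the standard compositing relation I = αF + (1 − α)B, images stored as rows of RGB tuples with values in [0, 1], and a background already resized to the product image; the function names and the random sampling strategy are hypothetical.

```python
import random


def composite_pixel(fg, bg, alpha):
    """Blend one RGB pixel: alpha * foreground + (1 - alpha) * background."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def composite(foreground, background, alpha):
    """Blend two same-sized images (rows of RGB tuples) with an alpha matte."""
    return [
        [composite_pixel(f, b, a) for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


def synthesize_online(foreground, alpha, backgrounds, rng):
    """Pick a random background and composite the product over it."""
    background = rng.choice(backgrounds)
    return composite(foreground, background, alpha)
```

A call such as `synthesize_online(product, alpha, background_pool, random.Random(0))` yields a new training image each time a different background is drawn, so the model sees many more combinations than the stored images alone provide.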
[0065] Step 2-2: Input the new image synthesized from the product RGB image and the background image, together with the trimap obtained by processing the second mask image, into the matting model. The output is three semantic feature maps, defined as F1, F2, and F3, and the total matting loss is calculated against the true alpha label. A semantic feature map here refers to the alpha image predicted from the feature map of a given layer in the network, so the three semantic feature maps are three alpha images predicted from three different network feature maps.
[0066] During training, the trimap is obtained by processing the second mask image. During inference of the complete matting pipeline, the first mask image predicted by the model in step 1 is processed into the second mask image, from which the trimap is obtained.
[0067] Step 2-3: Current matting loss methods primarily calculate only the loss between the predicted alpha and the alpha label. This often leads to poor convergence of the matting model and, in particular, limited edge optimization. The matting loss proposed in this invention is a joint loss comprising the alpha loss $L_{a}$, the synthesis loss $L_{com}$ of the foreground and background, and the Laplacian gradient constraint loss $L_{lap}$. The total loss calculated from a single semantic feature map and the alpha label is $L_{matting} = L_{a} + L_{com} + L_{lap}$;
[0068] Specifically, let $\hat{\alpha}$ denote the alpha predicted by the image matting model and $\alpha$ the true alpha label of the product. The alpha loss $L_{a}$ uses the L1 loss function, which calculates the absolute error of the corresponding pixel values between the predicted label $\hat{\alpha}$ and the true label $\alpha$:
[0069] $L_{a} = \sum \lVert \hat{\alpha} - \alpha \rVert_{1}$
[0070] For the synthesis loss $L_{com}$ of the foreground and background, denote the foreground as F, the background as B, and the product's RGB image as I. With the matting model's predicted alpha denoted $\hat{\alpha}$, the predicted product RGB image can be represented as $P = \hat{\alpha} F + (1 - \hat{\alpha}) B$. The synthesis loss $L_{com}$ uses the L1 loss function to calculate the absolute error of the corresponding pixel values between the predicted synthetic image P and the synthetic image I input to the model, according to the following formula:
[0071] $L_{com} = \sum \lVert P - I \rVert_{1}$
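The alpha loss and the synthesis loss can be illustrated with a few lines of Python operating on flattened single-channel images. The sketch assumes the standard compositing relation, in which the predicted composite is the predicted alpha times the foreground plus one minus the predicted alpha times the background, and it sums absolute errors over pixels; the function names are hypothetical.

```python
def alpha_loss(pred_alpha, true_alpha):
    """L1 alpha loss: sum of absolute errors between predicted and true alpha."""
    return sum(abs(p - t) for p, t in zip(pred_alpha, true_alpha))


def composite(alpha, foreground, background):
    """Per-pixel composite: alpha * F + (1 - alpha) * B."""
    return [a * f + (1.0 - a) * b
            for a, f, b in zip(alpha, foreground, background)]


def synthesis_loss(pred_alpha, foreground, background, image):
    """L1 loss between the composite predicted from alpha and the input image I."""
    predicted = composite(pred_alpha, foreground, background)
    return sum(abs(p - i) for p, i in zip(predicted, image))
```

The synthesis term penalizes alpha errors in proportion to how different the foreground and background are at that pixel, so it emphasizes exactly the edge pixels where a wrong alpha visibly changes the composite.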
[0072] $L_{lap}$ is a Laplacian loss, which decomposes the alpha image into three semantic feature layers and applies an L1 loss function at each layer to supervise the local and global alpha output. It is calculated from the predicted alpha image as follows:
[0073]
[0074] Specifically, the matting model calculates the loss between the predicted alpha and the true alpha as the joint loss of the three terms above;
[0075] Step 2-4: The transformer-based matting model optimizes the matting of object edges by calculating the joint loss between each of the three semantic feature maps F1, F2, and F3 and the alpha label: $L_{matting1} = L_{a1} + L_{com1} + L_{lap1}$, $L_{matting2} = L_{a2} + L_{com2} + L_{lap2}$, and $L_{matting3} = L_{a3} + L_{com3} + L_{lap3}$. The losses of the different feature maps are then combined by weighted averaging: $L_{a\text{-}mean} = (2 L_{a1} + 2 L_{a2} + L_{a3}) / 5$, $L_{com\text{-}mean} = (2 L_{com1} + 2 L_{com2} + L_{com3}) / 5$, and $L_{lap\text{-}mean} = (2 L_{lap1} + 2 L_{lap2} + L_{lap3}) / 5$. Finally, the matting model parameters are updated using the total loss $L_{SUM} = L_{a\text{-}mean} + L_{com\text{-}mean} + L_{lap\text{-}mean}$;
[0076] Specifically, in the matting model, lower-level feature maps contain richer positional information, while higher-level feature maps contain richer semantic information. To optimize the matting result, this invention takes the alpha predictions from three semantic feature layers of the matting model, computes the three losses $L_{a}$, $L_{com}$, and $L_{lap}$ between each prediction and the true alpha label, and iteratively updates the model weights by minimizing the fused loss obtained after forming the joint loss and applying the weighted fusion.
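The weighted fusion in step 2-4 is simple arithmetic, sketched below. The weights (2, 2, 1) and the division by 5 come from the formulas in the text; the function names and the example loss values in the usage note are hypothetical.

```python
WEIGHTS = (2.0, 2.0, 1.0)  # weights for feature maps F1, F2, F3 from the text


def weighted_mean(losses, weights=WEIGHTS):
    """Weighted average of one loss type over the three feature maps."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


def total_matting_loss(alpha_losses, com_losses, lap_losses):
    """L_SUM: sum of the weighted means of the three loss types."""
    return (weighted_mean(alpha_losses)
            + weighted_mean(com_losses)
            + weighted_mean(lap_losses))
```

For example, per-map alpha losses of 1.0, 2.0, and 3.0 give a weighted mean of (2·1.0 + 2·2.0 + 3.0) / 5 = 1.8. Because the weighting is linear, the total equals the same weighted average applied to the three per-map joint losses $L_{matting1}$, $L_{matting2}$, and $L_{matting3}$.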
[0077] Step 2-5: During model inference, the predicted alpha image is merged with the original image along the channel dimension to obtain a 4-channel matting result;
[0078] Specifically, Figure 6 shows the original product image, the segmentation result of the high-precision segmentation model for the main subject of the product image, and the result of the final matting model. Comparing the product RGB image, the high-precision segmentation image, and the final cutout makes it clear that the transformer-based matting model, building on the high-precision segmentation result, produces more refined subject edges and a better cutout.
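The channel merge in step 2-5 amounts to appending the predicted alpha as a fourth channel of each pixel. The sketch below assumes 8-bit RGB tuples and an alpha matte of floats in [0, 1] that is clamped and scaled to 0 to 255; the storage format and the function name are illustrative assumptions, since the text only states that the result has 4 channels.

```python
def merge_rgba(rgb, alpha):
    """Attach a predicted alpha matte to an RGB image as a fourth channel.

    rgb: rows of (r, g, b) integers in 0..255.
    alpha: rows of floats, clamped to [0, 1] and scaled to 0..255.
    """
    rgba = []
    for rgb_row, alpha_row in zip(rgb, alpha):
        row = []
        for (r, g, b), a in zip(rgb_row, alpha_row):
            a = min(max(a, 0.0), 1.0)
            row.append((r, g, b, round(a * 255)))
        rgba.append(row)
    return rgba
```

Saved in a format that supports transparency, such a 4-channel image shows the product on a transparent background, ready to be placed on any poster or listing page.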
[0079] For trimap-based image matting methods, the trimap is crucial. A trimap derived from a saliency detection model is mainly generated by morphological erosion and dilation of the predicted mask. Because the erosion and dilation parameters are relatively fixed, the quality of the generated trimap depends mainly on the quality of the mask and on the threshold chosen for binarization. A more accurate mask is less sensitive to the threshold and separates the product subject from the background more cleanly. A more accurate product segmentation therefore yields a more accurate trimap, which in turn further improves the matting result.
[0080] This invention addresses the high computational cost and positional-encoding limitations of transformer attention mechanisms by improving the transformer structure. It employs window-based attention, window offset-based attention, and relative positional encoding, which better integrate local and global feature information. Furthermore, it incorporates the trimap directly into the transformer structure, integrating the global representation of the trimap into the attention calculation of each window region of the original image. This further improves subject optimization and edge refinement for large product images, yields accurate product matting results, and effectively solves the problem of inaccurate matting mask prediction caused by ignoring the prior information in the trimap.
[0081] This invention may have many other embodiments. The above embodiments do not limit this invention in any way. Without departing from the spirit and essence of this invention, those skilled in the art can make various corresponding changes and modifications according to this invention. All other improvements and applications made to the above embodiments by equivalent transformation should fall within the protection scope of the appended claims.
Claims
1. A method for image cutout of e-commerce products, characterized in that it comprises the following steps: Step 1: Prepare the first dataset for training the binary segmentation model. The binary segmentation model is based on the transformer structure and is used to segment salient objects and background in the product image. The first dataset includes the RGB image of the product and the first mask image. Apply a threshold to the first mask image to binarize it into the second mask image. Apply erosion and dilation to the second mask image to obtain the trimap. Step 2: Prepare a second dataset for training the matting model. The matting model is based on a transformer architecture. The second dataset includes the RGB image and alpha label of the product. Input the RGB image and the trimap obtained in Step 1 into the matting model. Improve the transformer architecture using a window-based attention mechanism, a window offset-based attention mechanism, and relative position encoding. The window size is tuned during training and set to 7 pixels, and attention weights are calculated only among pixels within a window. Input the trimap into the window-based attention mechanism to fuse the global representation of the trimap with the features of the RGB image, minimizing the loss between the predicted alpha image and the alpha label. Merge the predicted alpha image and the original RGB image along the channel dimension to obtain the final matting result.
2. The method for image cutout of e-commerce products as described in claim 1, characterized in that Step 1 specifically includes: Step 1-1: The first dataset includes the open-source, manually annotated DIS5K segmentation images and collected e-commerce product images. The e-commerce product images include real product images and product images synthesized by replacing the product background, totaling no fewer than 250,000 images. The first mask image is the mask image of the corresponding image. A binary cross-entropy loss is computed between the predicted mask image and the first mask image, and the model parameters are optimized by minimizing this loss. Step 1-2: Perform data augmentation on the RGB image of the product, including random cropping and normalization. The input size is 1024*1024. The binary segmentation model includes an intermediate supervision module, which uses a lightweight deep learning model that is trained again on the first dataset to supervise the multi-layer feature maps of the binary segmentation model and prevent overfitting. A convolution operation is added in front of the binary segmentation network. Step 1-3: Apply a threshold to the first mask image, then apply erosion and dilation operations with a kernel size of 3 and 3 iterations to the resulting second mask image to obtain the trimap labels.
3. The e-commerce product matting method according to claim 2, characterized in that Step 2 specifically includes:
Step 2-1: The second dataset also includes background images, with no fewer than 100,000 RGB images and no fewer than 20,000 background images. New images are synthesized online from the RGB images and the background images.
Step 2-2: Input the new image synthesized from the RGB image of the product and the background image, together with the trimap, into the matting model, output three layers of semantic feature maps, and compute the total matting loss against the real alpha label. During training, the trimap is obtained by processing the second mask image.
Step 2-3: The matting loss includes the alpha loss $L_a$, the foreground-background composition (synthesis) loss $L_{com}$, and the Laplacian gradient constraint loss $L_{lap}$. The joint loss computed from a single-layer semantic feature map and the alpha label is $L_{matting} = L_a + L_{com} + L_{lap}$.
Step 2-4: Compute the joint losses $L_{matting1}$, $L_{matting2}$, and $L_{matting3}$ between each of the three semantic feature maps and the alpha label, and compute the weighted average of each loss component across the three layers: the average alpha loss $L_{a\text{-}mean} = (2 L_{a1} + 2 L_{a2} + L_{a3}) / 5$, the average composition loss $L_{com\text{-}mean} = (2 L_{com1} + 2 L_{com2} + L_{com3}) / 5$, and the average Laplacian gradient constraint loss $L_{lap\text{-}mean} = (2 L_{lap1} + 2 L_{lap2} + L_{lap3}) / 5$. The parameters of the matting model are updated using the average joint loss $L_{SUM} = L_{a\text{-}mean} + L_{com\text{-}mean} + L_{lap\text{-}mean}$.
Step 2-5: During model inference, the predicted alpha image is merged with the original RGB image of the product to obtain a 4-channel matting result.
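The weighted averaging of Step 2-4 can be written out as a short Python sketch. The per-layer loss values are assumed to have been computed already as plain floats; the function names and the dictionary keys are hypothetical and chosen only for illustration. The weights 2, 2, 1 and the divisor 5 come directly from the claim.

```python
LAYER_WEIGHTS = (2.0, 2.0, 1.0)  # weights for semantic feature maps 1, 2, 3 (Step 2-4)


def weighted_mean(values, weights=LAYER_WEIGHTS):
    """Weighted average of per-layer losses, e.g. (2*L1 + 2*L2 + L3) / 5."""
    if len(values) != len(weights):
        raise ValueError("expected one loss value per semantic feature map")
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def average_joint_loss(alpha_losses, com_losses, lap_losses):
    """Combine the three averaged components into the average joint loss L_SUM."""
    a_mean = weighted_mean(alpha_losses)
    com_mean = weighted_mean(com_losses)
    lap_mean = weighted_mean(lap_losses)
    return {
        "a_mean": a_mean,
        "com_mean": com_mean,
        "lap_mean": lap_mean,
        "sum": a_mean + com_mean + lap_mean,
    }
```

Because the first two layers carry twice the weight of the third, an error on layer 1 or 2 moves the training objective twice as much as the same error on layer 3.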
4. The e-commerce product matting method according to claim 3, characterized in that the binary segmentation model uses U2-Net as its backbone network. The deep learning model of the intermediate supervision module takes the ground-truth segmentation mask G as input and is trained as a self-supervised encoder that extracts six intermediate feature maps. It is trained by minimizing the loss between the mask predicted from each intermediate feature map and the ground-truth segmentation mask G, using the binary cross-entropy (BCE) loss [formula not reproduced in the source text], where $F_{gt}$ denotes the deep learning model, $\theta_{gt}$ its weights, $D$ the number of intermediate feature maps, $G$ the first mask image, and BCE the binary cross-entropy loss.
After the self-supervised GT encoder $F_{gt}$ has been trained, its weights $\theta_{gt}$ are frozen and it generates supervisory probability maps for $D = \{1, 2, 3, 4, 5, 6\}$ [formula not reproduced in the source text]. These maps supervise the corresponding intermediate deep features generated by the binary segmentation model $F_{sg}$ [formula not reproduced in the source text], that is, the set of intermediate feature maps obtained by passing image $I$ through $F_{sg}$, with $D = \{1, 2, 3, 4, 5, 6\}$ and $\theta_{sg}$ denoting the weights of the binary segmentation model. The feature synchronization loss $L_{fs}$ is computed between the intermediate feature maps generated by the supervision module and those generated by the binary segmentation model [formula not reproduced in the source text], with a separate weight for the synchronization loss at each of the six layers, $D = \{1, 2, 3, 4, 5, 6\}$.
The training of the binary segmentation model is expressed as the optimization problem $\arg\min(L_{fs} + L_{sg})$, where $L_{sg}$ is the BCE loss between the multiple predicted intermediate feature maps and the ground-truth mask label $G$ [formula not reproduced in the source text], with a separate weight for each of the six feature map losses. Given the loss $L_{fs}$ between the six intermediate feature maps generated by the self-supervised encoder and those generated by the binary segmentation model, and the loss $L_{sg}$ between the intermediate feature maps generated by the binary segmentation model and the ground-truth mask label $G$, the model parameters are updated iteratively to minimize $L_S = L_{fs} + L_{sg}$.
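The BCE-based supervision of the intermediate outputs can be sketched in Python as below. This is an illustrative sketch under several assumptions that the claim does not state: each intermediate output is taken to be a probability map already passed through a sigmoid and upsampled to the mask's resolution, the per-pixel losses are averaged, and the per-layer weights default to 1. The feature synchronization loss $L_{fs}$ is not sketched because its exact form is not recoverable from the text.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a probability map and a 0/1 mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def weighted_side_output_loss(side_outputs, mask, weights=None):
    """Weighted sum of BCE losses over the intermediate outputs (six in claim 4).

    `weights` holds one value per intermediate output; the default of 1.0
    each is an assumption, not a value taken from the patent.
    """
    if weights is None:
        weights = [1.0] * len(side_outputs)
    if len(weights) != len(side_outputs):
        raise ValueError("expected one weight per intermediate output")
    return sum(w * bce(out, mask) for w, out in zip(weights, side_outputs))
```

With six intermediate outputs, this gives one scalar that could play the role of $L_{sg}$ in the sum $L_{fs} + L_{sg}$.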
5. The e-commerce product matting method according to claim 4, characterized in that, in Step 2, α denotes the predicted alpha, P the predicted composite image, I the new image synthesized from the RGB image and the background, F the foreground, and B the background.
The alpha loss $L_a$ uses the L1 loss function to compute the absolute error between the predicted and true pixel values, expressed as: [formula not reproduced in the source text].
The foreground-background composition (synthesis) loss $L_{com}$ uses the L1 loss function to compute the absolute error between corresponding pixel values of the predicted composite image and the synthesized new image input to the model, expressed as $L_{com} = \sum \lVert P - I \rVert_1$, where I is the new image synthesized from the RGB image and the background, and P is the predicted composite image.
For the Laplacian loss $L_{lap}$, the predicted alpha image is decomposed into three semantic feature layers, and the L1 loss function is applied at each layer to compute the loss between the predicted alpha image and the ground-truth alpha label, which supervises both the local and the global alpha output. The alpha image predicted at the i-th layer is obtained by applying convolution and upsampling operations to the semantic feature map of the corresponding i-th layer; the upsampling ensures that the predicted alpha image has the same resolution as the real alpha label. The Laplacian loss $L_{lap}$ is computed as follows: