Matting method, matting model training method and related device
A pre-trained matting model extracts image features with an encoder and decodes them with a decoder. Combined with atrous spatial pyramid pooling (ASPP) and gated recurrent units (GRU), it removes the need for users to adjust parameters manually and suppresses target color overflow, achieving automatic matting with clear boundaries.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN ZEGO TECH CO LTD
- Filing Date
- 2023-04-27
- Publication Date
- 2026-06-19
AI Technical Summary
Existing image cutout methods require users to manually set and adjust parameters, making it difficult for non-professional users to achieve good cutout results, and they are prone to blurring of boundaries when the target color overflows.
A pre-trained matting model extracts image features with an encoder and decodes them with a decoder. Combined with atrous spatial pyramid pooling (ASPP) and gated recurrent units (GRU), it performs matting and overflow suppression on images. Target color suppression is achieved by adjusting the grayscale values of the color channels.
It can automatically perform background removal without requiring users' professional knowledge, improving the efficiency and effect of background removal, ensuring clear boundaries of the foreground image, and is suitable for different devices and can process video streams in real time.
Figure CN116503421B_ABST
Description
Technical Field
[0001] This application relates to the field of image processing technology, specifically to a matting method, a matting model training method, and related equipment.
Background Technology
[0002] Image matting is a common operation in image processing: it extracts the desired portion of an image so that it can be composited later. Common matting methods often require users to manually set a series of adjustment parameters, which leads to poor results when users lack professional knowledge. Furthermore, when the target color overflows in the image being matted, the matting boundaries may become blurry.
Summary of the Invention
[0003] In view of the above, it is necessary to propose a matting method, a matting model training method, and related equipment that can reduce the difficulty of matting and improve the efficiency and effect of matting.
[0004] Embodiments of this application provide a method for image matting, the method comprising: acquiring an image to be matted; inputting the image to be matted into a pre-trained matting model, using the matting model to perform matting processing on the image to be matted to obtain a foreground image of the image to be matted; and using the matting model to perform overflow suppression processing on the foreground image to obtain a target foreground image.
[0005] In one embodiment, the matting model includes encoders, each of which includes: multiple convolutional layers, multiple batch normalization (BatchNorm) layers, multiple rectified linear unit (ReLU) activation functions, and the inverted residual structure of the MobileNetV3 architecture; the multiple encoders are used to extract features from the image to be matted multiple times and output foreground features.
[0006] In one embodiment, the matting model includes an atrous spatial pyramid pooling (ASPP) module, which samples the foreground features of the image to be matted in parallel at different sampling rates and fuses the sampled results to obtain multi-scale features covering different spatial scales.
[0007] In one embodiment, the image to be matted includes multiple images with temporal information; the matting model includes a decoder, and the decoder includes a gated recurrent unit (GRU); the decoder is used to decode the multi-scale features of the image to be matted based on the temporal information, and to output the foreground image of the image to be matted and the alpha channel corresponding to the foreground image.
[0008] In one embodiment, the matting model includes an overflow suppression module, which includes a convolutional layer and a rectified linear unit (ReLU) activation function; the overflow suppression module is used to suppress the overflow of the target color in the foreground image of the image to be matted, and to output the foreground image after overflow suppression.
[0009] In one embodiment, the overflow suppression processing of the foreground image using the matting model includes: determining the original grayscale value of each color channel of each pixel in the foreground image; when it is determined that the original grayscale value of the color channel corresponding to the target color in any pixel is higher than the original grayscale value of other color channels, adjusting the original grayscale value of the target color so that the adjusted grayscale value is lower than the original grayscale value.
[0010] Embodiments of this application provide a method for training an image matting model. The method includes: acquiring training samples; and training a neural network model using the training samples to obtain an image matting model. The neural network model includes: multiple encoders, an atrous spatial pyramid pooling (ASPP) module, multiple decoders, and an overflow suppression module.
[0011] In one embodiment, the method further includes: setting corresponding loss functions and corresponding weight coefficients for different modules of the neural network model, and using the weighted sum of the loss functions of all modules as the loss function of the neural network model.
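The weighted sum described in [0011] can be sketched as follows. This is a minimal illustration rather than the application's implementation: the module names, the per-module loss values, and the weight coefficients are all hypothetical.

```python
def total_loss(module_losses, weights):
    """Weighted sum of per-module losses; both dicts are keyed by module name."""
    if module_losses.keys() != weights.keys():
        raise ValueError("every module needs exactly one loss and one weight")
    return sum(weights[name] * module_losses[name] for name in module_losses)


# Hypothetical per-module loss values and weight coefficients.
losses = {"alpha": 0.5, "foreground": 0.25, "overflow_suppression": 1.0}
weights = {"alpha": 1.0, "foreground": 2.0, "overflow_suppression": 0.5}

# 1.0 * 0.5 + 2.0 * 0.25 + 0.5 * 1.0
loss = total_loss(losses, weights)
```

Keeping the weights in a separate table lets the contribution of each module be tuned without touching the individual loss functions.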
[0012] An embodiment of this application provides an image matting device, the device comprising: an acquisition module and a matting module; the acquisition module is used to acquire an image to be matted; the matting module is used to input the image to be matted into a pre-trained matting model, and use the matting model to perform matting processing on the image to be matted to obtain a foreground image of the image to be matted; the matting module is further used to use the matting model to perform overflow suppression processing on the foreground image to obtain a target foreground image.
[0013] An embodiment of this application provides an electronic device, which includes a processor and a memory. The processor is used to execute a computer program stored in the memory to implement the matting method and the matting model training method.
[0014] Embodiments of this application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the image matting method and the image matting model training method.
[0015] In summary, the image matting method, image matting model training method, and related equipment described in this application can use the image matting model to perform image matting processing to obtain the foreground image of the image to be matted. No professional knowledge is required from the user, and no parameter adjustment is needed, thereby reducing the difficulty of image matting and improving the efficiency of image matting. By using the image matting model to perform overflow suppression processing on the foreground image to obtain the target foreground image, overflow suppression of the target color can be performed, thereby improving the image matting effect.
Attached Figure Description
[0016] Figure 1 is a structural diagram of the electronic device provided in the embodiments of this application.
[0017] Figure 2 is a flowchart of the image matting method provided in the embodiments of this application.
[0018] Figure 3 is an example of the image matting model provided in this application as applied to a computer.
[0019] Figure 4 is an example of the image matting model provided in this application as applied to a mobile device.
[0020] Figure 5 is an example diagram illustrating the effect of using the GRU provided in the embodiments of this application.
[0021] Figure 6 is a flowchart of the image matting model training method provided in the embodiments of this application.
[0022] Figure 7 is a structural diagram of the image matting device provided in the embodiments of this application.
Detailed Implementation
[0023] To better understand the above-mentioned objectives, features, and advantages of this application, the application will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.
[0024] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the specification of this application is for the purpose of describing specific embodiments only and is not intended to limit the application.
[0025] It should be noted that in this application, "at least one" means one or more, and "more than one" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone, where A and B can be singular or plural. The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and drawings of this application are used to distinguish similar objects, not to describe a specific order or sequence.
[0026] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate examples, illustrations, or descriptions. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design solutions. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a specific manner. Unless otherwise specified, the following embodiments and features described herein can be combined with each other.
[0027] In one embodiment, image matting is a common operation in image processing, referring to the extraction of a desired portion from an image to prepare for later compositing. Common matting methods often require users to manually set a series of adjustment parameters, which can lead to poor results when users lack professional knowledge. Furthermore, they often fail to consider potential color overflow issues in the image, resulting in blurred object boundaries in the final composite image.
[0028] To address the aforementioned issues, this application provides a matting method and a matting model training method. These methods utilize a matting model to process the image to be matted, obtaining the foreground image of the image to be matted. No user expertise is required, and parameter adjustments are unnecessary, thus reducing the difficulty and improving the efficiency of matting. Furthermore, the matting model performs overflow suppression processing on the foreground image to obtain the target foreground image, enabling overflow suppression of the target color, thereby improving the matting effect.
[0029] Figure 1 is a structural diagram of an electronic device provided in an embodiment of this application. The image matting method provided in this embodiment is executed by the electronic device: the image matting function provided by the method is either integrated in the electronic device or runs on it in the form of a software development kit (SDK). The electronic device can be a computer, a server, a laptop, a mobile terminal, or a similar device.
[0030] In this embodiment of the application, the electronic device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
[0031] Those skilled in the art should understand that the structure of the electronic device shown in Figure 1 does not constitute a limitation of the embodiments of this application. It can be a bus structure or a star structure. The electronic device 3 may also include more or less hardware or software than shown, or a different arrangement of components.
[0032] In some embodiments, the electronic device 3 is a device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits, programmable gate arrays, digital processors, and embedded devices. The electronic device 3 may also include other external devices, such as input / output devices like keyboards, mice, remote controls, displays, touchpads, or voice-activated devices.
[0033] It should be noted that the electronic device 3 is merely an example. Other existing or future electronic products that are suitable for this application should also be included within the scope of protection of this application and are incorporated herein by reference.
[0034] Figure 2 is a flowchart of the image matting method provided in the embodiments of this application. The image matting method specifically includes the following steps. Depending on different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
[0035] S11, Obtain the image to be matted.
[0036] In one embodiment, the image to be matted may include one or more images. When it includes multiple images, they may be multiple video frames of the same video that carry temporal information. The temporal information may be the shooting time of each image, so such images may be called time-series images.
[0037] The image to be matted can also be an image captured in a specific shooting environment, for example one in which a background curtain of the target color (such as a green screen) is installed. Such an image may therefore contain overflow of the target color; for example, an image captured in a green screen environment may have green overflow. In scenarios such as film and television shooting, short video shooting, or online classes, when a curtain of a certain color is used as the background behind the person being filmed, the curtain color overflows when the person is later matted out of the video or image.
[0038] In another embodiment, the electronic device can receive an image to be matted that is input by the user. When the user inputs a video, the electronic device can receive the video and extract video frames from it at the same time, so that matting can subsequently be performed on those frames. The electronic device can also be electrically or communicatively connected to a camera device that captures the image to be matted, and acquire the image from it.
[0039] In one embodiment, matting the image to be matted means extracting the foreground image from it. For example, if the target object (e.g., a human body) is photographed in a green screen environment to obtain the image to be matted, the green area corresponding to the green screen is the background, and the area outside the green area, where the target object is located, is the foreground image of the image to be matted.
[0040] While obtaining the foreground image, the corresponding alpha channel (transparency channel image) can also be obtained. Based on the foreground image and its corresponding alpha channel, the foreground image can be composited into any new background to obtain a composite image. In the alpha channel of the foreground image, the area containing the target object is an opaque area (a white area with a grayscale value of 255), and the remaining areas are transparent areas (black areas with a grayscale value of 0).
[0041] In one embodiment of this application, the image matting principle can be explained by the following formula:
[0042] $img_{\text{foreground}} = alpha \cdot img$,
[0043] where $img_{\text{foreground}}$ represents the foreground image, $alpha$ is the alpha channel corresponding to the foreground image, and $img$ represents the image to be matted.
[0044] The principle of image synthesis can be explained by the following formula:
[0045] $img' = alpha \cdot img_{\text{foreground}} + (1 - alpha) \cdot img_{\text{background}}$,
[0046] where $img'$ is the composite image and $img_{\text{background}}$ represents the new background.
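The matting and compositing formulas above can be tried on a tiny image in plain Python. This sketch makes two assumptions that are not stated in the application: images are nested lists of RGB tuples, and the alpha channel has been normalized from the 0 to 255 grayscale range to 0.0 to 1.0. The function names are hypothetical.

```python
def extract_foreground(img, alpha):
    """img_foreground = alpha * img, applied to every channel of every pixel."""
    return [[tuple(a * c for c in px) for px, a in zip(row, a_row)]
            for row, a_row in zip(img, alpha)]


def composite(foreground, alpha, background):
    """img' = alpha * img_foreground + (1 - alpha) * img_background."""
    return [[tuple(a * f + (1 - a) * b for f, b in zip(fg_px, bg_px))
             for fg_px, bg_px, a in zip(fg_row, bg_row, a_row)]
            for fg_row, bg_row, a_row in zip(foreground, background, alpha)]


# 1x2 image: the left pixel belongs to the subject, the right one is green screen.
img = [[(200, 150, 120), (0, 255, 0)]]
alpha = [[1.0, 0.0]]  # assumed normalization: grayscale 255 -> 1.0, 0 -> 0.0
new_background = [[(10, 20, 30), (10, 20, 30)]]

foreground = extract_foreground(img, alpha)
result = composite(foreground, alpha, new_background)
```

In the result, the opaque pixel keeps the subject's color and the transparent pixel takes the color of the new background.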
[0047] Furthermore, unless special effects processing is required on the foreground image corresponding to the target object, the target object should avoid including the target color (e.g., green) to prevent transparent areas from appearing in the extracted foreground image. The target object used in this embodiment does not contain the target color; therefore, if there are pixels in the obtained foreground image that are biased towards the target color, they can be considered as residual pixels caused by target color overflow.
[0048] S12, input the image to be matted into a pre-trained matting model, and use the matting model to perform matting processing on the image to be matted to obtain its foreground image.
[0049] In one embodiment, the training method for the image matting model is described below with reference to Figure 6. The matting model primarily employs an encoder-decoder structure, where the number of encoders equals the number of decoders. The size of the matting model can be adjusted by changing the number of encoders or decoders, which makes the matting model suitable for devices with different memory sizes (e.g., computers, mobile devices).
[0050] Figure 3 is an example of the image matting model provided in this application as applied to a computer. Figure 4 is an example of the image matting model provided in this application as applied to a mobile device. The following describes how matting is achieved with this model, with reference to the model structures shown in Figure 3 and Figure 4.
[0051] In one embodiment, as shown in Figure 3 and Figure 4, the matting model may include, in addition to the encoders and decoders, a downsampling module, which may include Gaussian convolution. Before using the encoders to extract features from the image to be matted, the matting model can use the downsampling module to downsample the image to be matted. The downsampling shrinks the image to be matted to obtain a downsampled image with a smaller size and lower resolution than the original.
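One way to picture a downsampling module built on Gaussian convolution is a blur followed by subsampling. The sketch below is only an assumption about how such a module could work: the 3x3 kernel, the factor of 2, the border clamping, and the single-channel input are all choices made for illustration and are not taken from the application.

```python
# 3x3 approximation of a Gaussian kernel; its entries sum to 16.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_downsample(image):
    """Blur a 2-D grayscale image with KERNEL, keeping every second pixel."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp at the border
                    xx = min(max(x + dx, 0), w - 1)
                    acc += KERNEL[dy + 1][dx + 1] * image[yy][xx]
            row.append(acc / 16)
        out.append(row)
    return out


flat = [[8] * 4 for _ in range(4)]
small = gaussian_downsample(flat)  # a 4x4 input becomes a 2x2 output
```

Blurring before subsampling reduces aliasing, which is the usual reason to pair a Gaussian filter with a size reduction.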
[0052] In one embodiment, each encoder includes: multiple convolutional layers, multiple batch normalization (BatchNorm) layers, multiple rectified linear unit (ReLU) activation functions, and an inverted residual structure based on the MobileNetV3 architecture. The multiple encoders are used to perform multiple feature extractions on the image to be matted and output foreground features. The foreground features include features corresponding to the target object, such as the boundary features of the target object.
[0053] Specifically, the convolutional layers of each encoder extract convolutional features from the input (e.g., the image to be matted). Channel-separable convolution is used to reduce the computational load, the dimensionality of the feature map is then increased, and finally a 1x1 convolution compresses and reduces the dimensionality of the feature map. The data size is halved each time the data passes through an encoder, so the network extracts features at different resolutions, which makes feature extraction more effective.
[0054] The BatchNorm layer is used to improve the convergence speed of the encoder; the ReLU activation function (e.g., ReLU6) is used to add non-linear features.
[0055] The encoder as a whole adopts the inverted residual structure of MobileNetV3. In a traditional residual structure, the input (e.g., the image to be matted) is added to the output (e.g., features of the image to be matted) through a skip connection that spans multiple layers (e.g., multiple convolutional layers and multiple BatchNorm layers) to improve the efficiency of information flow. However, the cross-layer connection in the traditional residual structure can lead to gradient instability, which affects the stability of the encoder. The inverted residual structure addresses this problem by using reverse connections, that is, skip connections from output to input rather than from input to output. In addition, building on the residual connections of ResNet, the inverted residual structure changes the shortcut connections in the residual blocks from standard addition to multiplication, thereby reducing the computational cost of the encoder; it also adds a squeeze-and-excitation module, which adaptively adjusts the weight of each channel and thereby improves the ability to extract important features.
[0056] In one embodiment, the matting model includes an Atrous Spatial Pyramid Pooling (ASPP) module that follows the encoders' feature extraction, such as the Lite Reduced Atrous Spatial Pyramid Pooling (LRASPP) shown in the figure. The ASPP module samples the foreground features of the image to be matted in parallel at different sampling rates and fuses the sampled results to obtain multi-scale features covering different spatial scales.
[0057] Specifically, ASPP uses different dilated convolutional kernels to convolve the input feature map (e.g., the foreground features mentioned above). Each kernel has a different dilation rate, resulting in features at different scales (e.g., dividing the input feature map into 1, 4, and 16 parts evenly according to width and height to obtain features at three scales). After dilated convolution, ASPP also performs global average pooling on the output feature map to capture global contextual information. Finally, feature maps from all scales are concatenated and fused through a convolutional layer. ASPP can also be used to reduce model runtime, decrease model parameters and computational cost, improve model efficiency, and enhance network robustness.
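The idea of parallel dilated branches plus a global pooling branch can be illustrated in one dimension. This is a deliberately simplified sketch: the application works on 2-D feature maps, whereas the code below uses a 1-D signal, zero padding, a shared kernel for every branch, and a plain weighted sum in place of the fusion convolution. All names and values are hypothetical.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-size 1-D convolution whose taps are `rate` samples apart (zero padding)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def aspp_1d(signal, kernel, rates, fuse_weights):
    """Run one dilated branch per rate, add a global-average branch, then fuse."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    mean = sum(signal) / len(signal)
    branches.append([mean] * len(signal))  # global context branch
    # Stand-in for the fusion convolution: a weighted sum across branches.
    return [sum(w * b[i] for w, b in zip(fuse_weights, branches))
            for i in range(len(signal))]


signal = [0.0, 0.0, 4.0, 0.0]
near = dilated_conv1d(signal, [1, 1, 1], rate=1)  # neighbours 1 sample apart
far = dilated_conv1d(signal, [1, 1, 1], rate=2)   # neighbours 2 samples apart
fused = aspp_1d(signal, [1, 1, 1], rates=[1, 2], fuse_weights=[1, 1, 1])
```

A larger dilation rate widens the receptive field without adding kernel weights, which is why branches with different rates see the same features at different scales.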
[0058] In one embodiment, the image to be matted includes multiple images with temporal information. The encoders and the ASPP of the matting model extract and fuse the foreground features of each image to be matted. To obtain the foreground image corresponding to each image to be matted, the foreground features must then be decoded by a decoder.
[0059] The decoder includes a gated recurrent unit (GRU), which decodes the multi-scale features of the image to be matted based on the temporal information and outputs the foreground image of the image to be matted and the alpha channel corresponding to the foreground image.
[0060] Specifically, the GRU is a recurrent neural network with two gates, computed as follows:
[0061] $r = \sigma(W_r \cdot [h_{t-1}, x_t])$,
[0062] $z = \sigma(W_z \cdot [h_{t-1}, x_t])$,
[0063] where $r$ is the reset gate, $z$ is the update gate, $\sigma$ is the sigmoid function, and $W_r$ and $W_z$ denote weights;
[0064] the GRU uses the state $h_{t-1}$ passed from the previous node and the input $x_t$ of the current node to obtain the states of the two gates, and then uses the reset gate to obtain the reset data $h'_{t-1}$: $h'_{t-1} = h_{t-1} \odot r$, where $\odot$ denotes element-wise multiplication;
[0065] $h'_{t-1}$ is then concatenated with the input $x_t$, and the result is scaled to the range -1 to 1 by the tanh activation function to obtain $h'$. Since $h'$ contains the data of the current input $x_t$, the current data state is retained;
[0066] subsequently, the GRU updates its memory, performing forgetting and remembering in a single step with the update gate obtained earlier: $h_t = (1 - z) \odot h_{t-1} + z \odot h'$, where the gating signal $z$ ranges from 0 to 1.
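A single GRU step as described in [0063] to [0066] can be sketched in plain Python. The sketch assumes the standard GRU gate form, in which each gate is the sigmoid of a weight matrix applied to the concatenation of the previous state and the current input. Bias terms are omitted, and the candidate-state weight matrix `W_h` is a name introduced here for illustration; the text only names the weights of the two gates.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(h_prev, x, W_r, W_z, W_h):
    """One GRU update: h_t = (1 - z) * h_prev + z * h_candidate, element-wise."""
    concat = h_prev + x                                # [h_{t-1}, x_t]
    r = [sigmoid(v) for v in matvec(W_r, concat)]      # reset gate
    z = [sigmoid(v) for v in matvec(W_z, concat)]      # update gate
    h_reset = [h * g for h, g in zip(h_prev, r)]       # h_{t-1} scaled by r
    # Candidate state: tanh over the concatenation of reset state and input.
    h_cand = [math.tanh(v) for v in matvec(W_h, h_reset + x)]
    return [(1 - g) * h + g * c for g, h, c in zip(z, h_prev, h_cand)]


# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous one.
zeros = [[0.0] * 3 for _ in range(2)]
h_t = gru_step([2.0, -4.0], [1.0], zeros, zeros, zeros)
```

When `z` is close to 0 the previous state is kept almost unchanged, and when it is close to 1 the state is replaced by the candidate, which is how the update gate combines forgetting and remembering in one step.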
[0067] Using a GRU in the decoder alleviates the vanishing and exploding gradient problems in long-sequence training. The decoder can draw on the temporal information of the preceding and following frames among the images to be matted, and suppress the blurring of object boundaries in the matted foreground caused by lighting changes during capture.
[0068] Figure 5 shows an example of the effect of using the GRU provided in the embodiments of this application. The leftmost image is the decoded image without the GRU, and the images from left to right show the GRU correcting decoding prediction errors over time. It can be seen that the GRU yields clearer boundaries.
[0069] In one embodiment, the matting model further includes an upsampling module, which is used to upsample the foreground image output by the decoder and the alpha channel corresponding to the foreground image.
[0070] S13, the foreground image is subjected to overflow suppression processing using the matting model to obtain the target foreground image.
[0071] In one embodiment, the matting model utilizes an encoder and decoder to achieve precise matting of the image to be matted, thereby obtaining a foreground image with clear boundaries and the corresponding alpha channel of the foreground image. However, since the human eye is very sensitive to certain colors, such as green, if the foreground image still has residual green pixels due to green overflow, the green screen matting effect will not meet expectations.
[0072] Therefore, the image matting model includes an overflow suppression module, which can employ various methods to suppress the target color in the foreground image. The principle of the overflow suppression module includes: identifying the pixels of the target color in the foreground image and suppressing the target color in those pixels.
[0073] For example, the foreground image is an RGB image containing three color channels: red (R), green (G), and blue (B). The pixel value of each pixel can be represented as (the gray value of R, the gray value of G, and the gray value of B). When the target color is green, the corresponding green pixel represents a pixel where the gray value of G is greater than the gray values of R and B. For example, when the pixel value of pixel n is (145, 200, 100), the gray value of G > the gray value of R > the gray value of B. Therefore, pixel n is a green pixel, and the overflow suppression module performs green suppression on pixel n.
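The test for a target-color pixel described above reduces to a comparison between channels. A minimal sketch follows, assuming pixels are (R, G, B) tuples; the function name and the `target_channel` parameter are hypothetical.

```python
def is_target_color_pixel(pixel, target_channel=1):
    """True when the target channel's gray value exceeds every other channel.

    target_channel is the index into (R, G, B); 1 selects green.
    """
    target = pixel[target_channel]
    others = [v for i, v in enumerate(pixel) if i != target_channel]
    return all(target > v for v in others)


is_green = is_target_color_pixel((145, 200, 100))   # True: G exceeds R and B
not_green = is_target_color_pixel((145, 145, 100))  # False: G does not exceed R
```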
[0074] In one embodiment, the overflow suppression module includes a convolutional layer and a rectified linear unit (ReLU) activation function; the overflow suppression module is used to suppress the overflow of the target color in the foreground image of the image to be matted, and to output the foreground image after overflow suppression.
[0075] Specifically, the convolutional layer in the suppression module can be used to scan the green pixels in the foreground image point by point. The scan result is then superimposed on the foreground image with green overflow, thereby suppressing the green pixels in the foreground image and obtaining a green-suppressed foreground image. Alternatively, green pixels can be treated as noise points and denoised using convolutional layers, thus achieving green suppression.
[0076] In one embodiment, the overflow suppression module may be the aforementioned neural network module based on convolutional layers, or it may be a machine learning algorithm module implemented using a programming language.
[0077] When the overflow suppression module uses a machine learning algorithm, the overflow suppression processing of the foreground image using the image matting model includes: determining the original grayscale value of each color channel (e.g., red, green, blue) of each pixel in the foreground image; when it is determined that the original grayscale value of the color channel corresponding to the target color in any pixel is higher than the original grayscale value of other color channels, the original grayscale value of the target color is adjusted so that the adjusted grayscale value is lower than the original grayscale value.
[0078] In one embodiment, adjusting the original grayscale value of the target color includes method one: updating the grayscale value of the target color to the same grayscale value as the original grayscale value of any other color channel. For example, when the pixel value of pixel n is (145, 200, 100), and the original grayscale value of G is 200, 200 can be updated to 145, thereby updating the pixel value of pixel n to (145, 145, 100), thus achieving the effect of green suppression; alternatively, 200 can be updated to 100, thereby updating the pixel value of pixel n to (145, 100, 100), thus achieving the effect of green suppression.
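Method one can be sketched as a per-pixel function. The text allows the target channel to be set to the value of any other channel; the `use_max` flag below is a hypothetical way of choosing between the larger and the smaller of the two, and pixels are assumed to be (R, G, B) tuples.

```python
def suppress_method_one(pixel, target_channel=1, use_max=True):
    """Replace an overflowing target channel with another channel's original value."""
    others = [v for i, v in enumerate(pixel) if i != target_channel]
    if pixel[target_channel] <= max(others):
        return tuple(pixel)  # not a target-color pixel, leave it unchanged
    adjusted = list(pixel)
    adjusted[target_channel] = max(others) if use_max else min(others)
    return tuple(adjusted)


a = suppress_method_one((145, 200, 100))                 # (145, 145, 100)
b = suppress_method_one((145, 200, 100), use_max=False)  # (145, 100, 100)
```

Both results match the two outcomes given for pixel n in the text, and in both the adjusted green value is lower than the original 200.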
[0079] In one embodiment, adjusting the original grayscale value of the target color includes method two: determining the adjustment weight value (ranging from 0 to 1) for each color (e.g., R, G, B) from a preset weight value table, and adjusting the grayscale value of each color to the product of the original grayscale value and the corresponding adjustment weight value. The weight values in the weight value table can be derived through statistical analysis of a large amount of historical data (e.g., 10,000 images of green overflow and corresponding 10,000 images of green suppression).
[0080] For example, in the weight value table, R has an adjustment weight value of α, G has an adjustment weight value of β, and B has an adjustment weight value of γ. Then $Color'_R = Color_R \cdot \alpha$, $Color'_G = Color_G \cdot \beta$, and $Color'_B = Color_B \cdot \gamma$, where $Color_R$, $Color_G$, and $Color_B$ represent the original grayscale values of R, G, and B, and $Color'_R$, $Color'_G$, and $Color'_B$ represent the adjusted grayscale values of R, G, and B, respectively.
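Method two amounts to a per-channel multiplication by weights looked up in a table. In the sketch below the table contents are invented for illustration (the application derives the real values from historical data), and rounding the products back to integer grayscale values is an added assumption. The function is meant to be applied to pixels already identified as target-color pixels.

```python
def suppress_method_two(pixel, weights):
    """Scale each channel by its adjustment weight (each weight in [0, 1])."""
    if not all(0.0 <= w <= 1.0 for w in weights):
        raise ValueError("adjustment weights must lie in [0, 1]")
    return tuple(round(v * w) for v, w in zip(pixel, weights))


# Hypothetical weight table: (alpha, beta, gamma) for R, G, B per target color.
WEIGHT_TABLE = {"green": (1.0, 0.75, 1.0)}

adjusted = suppress_method_two((145, 200, 100), WEIGHT_TABLE["green"])
```

With these example weights only the green channel is reduced, from 200 to 150, while red and blue are left as they were.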
[0081] The image matting method provided in this application utilizes a pre-trained matting model to matte images where target colors overflow. It can automatically identify the foreground image of the image to be matted, eliminating the need for manual adjustment or parameter settings by the user, thus improving the efficiency and accuracy of matting. The GRU used in the matting model can perform matting on multiple consecutive images containing temporal information, ensuring the stability of the matting results. The matting model also suppresses overflow of target colors in the foreground image, resulting in clearer boundaries. Furthermore, by adjusting the model structure and training to obtain matting models of different sizes, it is easy to install on different devices, improving the model's usability. Moreover, compared to existing matting tools (such as the Keylight plugin in After Effects software), the matting model can perform matting and overflow suppression on each frame of the image to be matted in a real-time video stream, without waiting for the video stream to be cached locally before processing, thus improving the efficiency of matting and overflow suppression.
[0082] Figure 6 is a flowchart of the image matting model training method provided in the embodiments of this application. The image matting model training method specifically includes the following steps. Depending on different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
[0083] S21, Obtain training samples.
[0084] In one embodiment, the training samples include multiple historical images with time-series information, which can be obtained by constructing a green screen scene and taking multiple historical images with green overflow in the green screen scene.
[0085] For example, a professional green screen shooting location can be set up, and professional photography equipment and lighting sources can be used to shoot the target object. By adjusting the relative position of the target object and the green screen, as well as the brightness and color temperature of the light source, multiple historical images can be captured.
[0086] In one embodiment, the training samples further include a first sample image and its corresponding alpha channel. A preset tool can be used to perform image matting on a historical image to obtain the first sample image and its corresponding alpha channel. For example, a preset tool or software (such as the Keylight plugin in After Effects) can be used to perform image matting on a historical image to obtain a foreground image of the historical image as the first sample image, and simultaneously obtain the alpha channel corresponding to the first sample image.
[0087] In one embodiment, the training samples further include a second sample image obtained by suppressing the target color of the first sample image. For example, a preset tool or software (such as the Keylight plugin in After Effects) can be used to suppress green overflow in the first sample image to obtain the corresponding second sample image.
[0088] In one embodiment, the alpha channel of the first sample image is also the alpha channel of the corresponding second sample image.
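The composition of a training sample described above can be sketched as a small data structure. The class name `TrainingSample`, the field names, the array shapes, and the shape check are assumptions for illustration; the application only states which images and alpha channel a sample contains and that the alpha channel is shared by the first and second sample images.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(eq=False)
class TrainingSample:
    """One frame of training data (field names and layout are hypothetical)."""
    historical_image: np.ndarray     # H x W x 3 green-screen frame with green overflow
    first_sample_image: np.ndarray   # H x W x 3 foreground produced by the preset tool
    alpha: np.ndarray                # H x W alpha, shared by both sample images
    second_sample_image: np.ndarray  # H x W x 3 foreground after target-color suppression

    def __post_init__(self) -> None:
        if self.alpha.ndim != 2:
            raise ValueError("alpha must be a 2-D (H x W) array")
        expected = self.alpha.shape + (3,)
        for name in ("historical_image", "first_sample_image", "second_sample_image"):
            if getattr(self, name).shape != expected:
                raise ValueError(f"{name} must have shape {expected}")

# A clip with time-series information is then a list of samples in capture order.
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
sample = TrainingSample(rgb, rgb.copy(), np.zeros((4, 6), dtype=np.float32), rgb.copy())
clip = [sample]
```

Storing a single `alpha` field per sample reflects the statement that the first and second sample images use the same alpha channel.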
[0089] S22, Use the training samples to train a neural network model to obtain a matting model.
[0090] In one embodiment, the neural network model includes multiple encoders, a dilated spatial convolutional pooling pyramid, multiple decoders, and an overflow suppression module. For the specific structure of the neural network model, refer to the relevant description in Embodiment 1.
[0091] In one embodiment, when the neural network model is trained using the training samples, the historical images are input into the neural network model. The neural network model performs matting and overflow suppression on each historical image and outputs a historical foreground image, the corresponding historical alpha channel, and an overflow-suppressed historical foreground image. The processing of historical images by the neural network model (e.g., matting and overflow suppression) is similar to the processing of the image to be matted by the matting model; for details, refer to the description in Embodiment 1.
[0092] To verify the performance of the neural network model and optimize it into a matting model that meets preset requirements (e.g., loss function convergence), training the neural network model further includes: using a preset loss function (e.g., the L1 function) to calculate a first loss between the historical foreground image and the first sample image, a second loss between the historical alpha channel and the alpha channel of the first sample image, and a third loss between the overflow-suppressed historical foreground image and the second sample image; when any of the first loss, the second loss, and the third loss is greater than or equal to a preset loss threshold (e.g., 0.1), updating the parameters of the neural network model to obtain an updated neural network model; and repeating the above steps with the updated neural network model until the first loss, the second loss, and the third loss are all less than the loss threshold, thereby obtaining the matting model.
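The stopping rule in this paragraph can be sketched in NumPy as follows. Several details are assumptions rather than statements from the application: the L1 loss is taken as the mean absolute difference, the third loss is read as comparing the overflow-suppressed output with the second sample image, and `forward`, `update`, and `max_steps` are hypothetical placeholders for the model's forward pass, its parameter update, and a safety limit on iterations.

```python
import numpy as np

def l1_loss(prediction, target):
    """Mean absolute difference (the mean reduction is an assumption)."""
    diff = prediction.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(np.abs(diff)))

def compute_losses(pred_fg, pred_alpha, pred_suppressed_fg,
                   first_sample, alpha, second_sample):
    """Return the (first, second, third) losses described in the text."""
    return (l1_loss(pred_fg, first_sample),
            l1_loss(pred_alpha, alpha),
            l1_loss(pred_suppressed_fg, second_sample))

def needs_update(losses, threshold=0.1):
    """True while any loss is greater than or equal to the threshold."""
    return any(loss >= threshold for loss in losses)

def train_until_converged(forward, update, targets, threshold=0.1, max_steps=1000):
    """Repeat forward pass and update until all three losses are below threshold.

    forward() returns the three predictions and update(losses) adjusts the
    parameters; both are hypothetical callables supplied by the caller.
    """
    for step in range(max_steps):
        losses = compute_losses(*forward(), *targets)
        if not needs_update(losses, threshold):
            return step, losses
        update(losses)
    raise RuntimeError("losses did not fall below the threshold")
```

Note that the comparison uses `>=`, so a loss exactly equal to the threshold still triggers another update, as the text requires.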
[0093] In one embodiment, a corresponding loss function can be set for each module of the neural network model, such as setting a separate loss function for the overflow suppression module and a separate loss function for the GRU module; then, a corresponding weight coefficient is set for the loss function of each module, and the weighted sum of the loss functions of all modules is used as the loss function of the neural network model.
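The weighted combination of per-module losses can be sketched as below. The function name `total_loss`, the module keys, and the numeric loss values and weight coefficients are hypothetical; the application names the overflow suppression module and the GRU module as examples but gives no coefficients.

```python
def total_loss(module_losses, weights):
    """Weighted sum of per-module losses.

    Both arguments map a module name to a number; the two mappings must
    cover exactly the same modules.
    """
    if module_losses.keys() != weights.keys():
        raise ValueError("every module needs both a loss and a weight")
    return sum(weights[name] * module_losses[name] for name in module_losses)

# Hypothetical values for the two modules mentioned in the text.
module_losses = {"gru": 0.05, "overflow_suppression": 0.12}
weights = {"gru": 0.5, "overflow_suppression": 2.0}
model_loss = total_loss(module_losses, weights)  # 0.5 * 0.05 + 2.0 * 0.12
```

Raising a module's weight coefficient makes the overall loss more sensitive to that module's error, which is one way to prioritize, for example, overflow suppression quality during training.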
[0094] Figure 7 is a structural diagram of the image matting device provided in the embodiments of this application.
[0095] In some embodiments, the matting device 20 may include multiple functional modules composed of computer program segments. The computer program of each program segment in the matting device 20 may be stored in the memory of an electronic device and executed by at least one processor to perform the image matting function (see the description of Figure 2 for details).
[0096] In this embodiment, the image matting device 20 can be divided into multiple functional modules according to its functions. The functional modules may include: an acquisition module 201 and a matting module 202. The term "module" in this application refers to a series of computer program segments that can be executed by at least one processor and perform a fixed function, and which are stored in memory. In this embodiment, the limitations of the image matting device 20 can be found in the above-described limitations of the image matting method, and will not be repeated in detail here.
[0097] The acquisition module 201 is used to acquire the image to be matted.
[0098] The matting module 202 is used to input the image to be matted into a pre-trained matting model, and to use the matting model to perform matting processing on the image to be matted to obtain the foreground image of the image to be matted.
[0099] The matting module 202 is further configured to perform overflow suppression processing on the foreground image using the matting model to obtain the target foreground image.
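The division of the matting device 20 into an acquisition module 201 and a matting module 202 can be sketched as plain classes. All class, method, and parameter names here are hypothetical, and the matting model is represented only by two caller-supplied callables (`matte` and `suppress_overflow`); the sketch shows the order of operations, not the model itself.

```python
from typing import Any, Callable

class AcquisitionModule:
    """Module 201: acquires the image to be matted from some source."""
    def __init__(self, source: Callable[[], Any]) -> None:
        self._source = source

    def acquire(self) -> Any:
        return self._source()

class MattingModule:
    """Module 202: matting followed by overflow suppression."""
    def __init__(self, matte: Callable[[Any], Any],
                 suppress_overflow: Callable[[Any], Any]) -> None:
        self._matte = matte
        self._suppress_overflow = suppress_overflow

    def run(self, image: Any) -> Any:
        foreground = self._matte(image)             # foreground image
        return self._suppress_overflow(foreground)  # target foreground image

class MattingDevice:
    """Device 20: wires the two modules together."""
    def __init__(self, acquisition: AcquisitionModule, matting: MattingModule) -> None:
        self.acquisition = acquisition
        self.matting = matting

    def process(self) -> Any:
        return self.matting.run(self.acquisition.acquire())

# Stand-in callables that only record the order in which steps are applied.
device = MattingDevice(
    AcquisitionModule(lambda: "frame"),
    MattingModule(lambda img: f"fg({img})", lambda fg: f"suppressed({fg})"),
)
result = device.process()
```

Keeping acquisition separate from matting mirrors the logical division in the text and lets the image source (file, camera, stream) change without touching the matting step.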
[0100] Referring again to Figure 1, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the matting method and the matting model training method described above. The memory 31 includes read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
[0101] Furthermore, the computer-readable storage medium may primarily include a program storage area and a data storage area, wherein the program storage area may store the operating system, at least one application required for a function, etc.; and the data storage area may store data created based on the use of blockchain nodes, etc.
[0102] In one embodiment of this application, a computer program is stored on the computer-readable storage medium and, when executed by the processor 32, implements the processes shown in Figure 2 and Figure 6. Alternatively, when executed by the processor, the computer program implements the functions of each module/unit of the image matting device shown in Figure 7, for example modules 201-202 in Figure 7.
[0103] In some embodiments, the at least one processor 32 is the control unit of the electronic device 3, connecting various components of the electronic device 3 via various interfaces and lines. It executes programs or modules stored in the memory 31 and calls data stored in the memory 31 to perform various functions and process data. For example, when the at least one processor 32 executes a computer program stored in the memory, it implements all or part of the steps of the matting method and matting model training method described in the embodiments of this application; or it implements all or part of the functions of the matting device. The at least one processor 32 may be composed of integrated circuits, such as a single-packaged integrated circuit or multiple integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips.
[0104] In some embodiments, the at least one communication bus 33 is configured to enable communication between the memory 31 and the at least one processor 32, etc.
[0105] Although not shown, the electronic device 3 may also include a power supply (such as a battery) to power the various components. Preferably, the power supply can be logically connected to the at least one processor 32 through a power management device, thereby enabling functions such as charging, discharging, and power consumption management. The power supply may also include one or more DC or AC power supplies, recharging devices, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components. The electronic device 3 may also include various sensors, Bluetooth modules, Wi-Fi modules, camera devices, etc., which will not be described in detail here.
[0106] The integrated unit implemented as a software functional module described above can be stored in a computer-readable storage medium. This software functional module, stored in a storage medium, includes several instructions to cause a computer device (which may be a personal computer, electronic device, or network device, etc.) or processor to execute portions of the methods described in the various embodiments of this application.
[0107] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0108] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0109] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0110] It will be apparent to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that it can be implemented in other specific forms without departing from the spirit or essential characteristics of this application. Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of this application is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within this application. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other elements, and the singular does not exclude the plural. Multiple elements or devices recited in the specification may also be implemented by a single element or device through software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any particular order.
[0111] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit it. Although this application has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of this application without departing from the spirit and scope of the technical solutions of this application.
Claims
1. An image matting method, characterized in that, The method includes: obtaining an image to be matted; inputting the image to be matted into a pre-trained matting model, and performing matting processing on the image using the matting model to obtain a foreground image, wherein the matting model includes multiple encoders, each encoder comprising: multiple convolutional layers, multiple batch normalization (BatchNorm) layers, multiple rectified linear unit (ReLU) activation functions, and an inverted residual structure based on the MobileNetV3 architecture; the multiple encoders perform multiple feature extractions on the image to be matted and output foreground features, including: extracting convolutional features using the channel-separable convolutions of each encoder, and performing feature map dimensionality expansion and 1x1 convolutional compression for dimensionality reduction; using the BatchNorm layers to accelerate convergence, and using the ReLU activation function to enhance the nonlinear expressive power of the feature data; adopting the inverted residual structure of the MobileNetV3 architecture, solving the gradient instability problem through inverse skip connections, changing the shortcut connections of the residual blocks from addition to multiplication to reduce the amount of computation, and introducing a squeeze-and-excitation module to adaptively adjust the channel weights; the spatial resolution of the feature data is halved when passing through each encoder, realizing multi-scale feature extraction and outputting a foreground feature map containing the boundary features of the target object; and performing overflow suppression processing on the foreground image using the matting model to obtain a target foreground image.
2. The image matting method according to claim 1, characterized in that, The matting model includes a dilated spatial convolutional pooling pyramid, which is used to sample the foreground features of the image to be matted in parallel at different sampling rates, and fuse the multiple sampled feature results to obtain multi-scale features containing different spatial scales.
3. The image matting method according to claim 1, characterized in that, The image to be matted includes multiple images with temporal information; the matting model includes a decoder, the decoder includes a gated recurrent unit (GRU), and the decoder is used to decode the multi-scale features of the image to be matted based on the temporal information, and to output the foreground image of the image to be matted and the alpha channel corresponding to the foreground image.
4. The image matting method according to claim 1, characterized in that, The image matting model includes an overflow suppression module, which includes a convolutional layer and a rectified linear unit (ReLU) activation function. The overflow suppression module is used to suppress the overflow of the target color in the foreground image of the image to be matted, and outputs the foreground image after overflow suppression.
5. The image matting method according to claim 1, characterized in that, Performing overflow suppression processing on the foreground image using the matting model includes: determining the original grayscale value of each color channel of each pixel in the foreground image; and, when the original grayscale value of the color channel corresponding to the target color in any pixel is determined to be higher than the original grayscale values of the other color channels, adjusting the original grayscale value of the target color so that the adjusted grayscale value is lower than the original grayscale value.
6. A method for training an image matting model, characterized in that, The method includes: obtaining training samples; and training a neural network model using the training samples to obtain a matting model, wherein the neural network model includes: multiple encoders, a dilated spatial convolutional pooling pyramid, multiple decoders, and an overflow suppression module; each encoder includes: multiple convolutional layers, multiple batch normalization (BatchNorm) layers, multiple rectified linear unit (ReLU) activation functions, and an inverted residual structure based on the MobileNetV3 architecture; the multiple encoders are used to extract features from the training samples multiple times and output foreground features, including: extracting convolutional features using the channel-separable convolutions of each encoder, and performing feature map dimensionality expansion and 1x1 convolutional compression for dimensionality reduction; using the BatchNorm layers to accelerate convergence, and using the ReLU activation function to enhance the nonlinear expressive power of the feature data; adopting the inverted residual structure of the MobileNetV3 architecture, solving the gradient instability problem through inverse skip connections, changing the shortcut connections of the residual blocks from addition to multiplication to reduce the amount of computation, and introducing a squeeze-and-excitation module to adaptively adjust the channel weights; the spatial resolution of the feature data is halved when passing through each encoder, realizing multi-scale feature extraction and outputting a foreground feature map containing the boundary features of the target object.
7. The image matting model training method according to claim 6, characterized in that, The method further includes: Each module of the neural network model is assigned a loss function and a corresponding weight coefficient. The weighted sum of the loss functions of all modules is used as the loss function of the neural network model.
8. An electronic device, characterized in that, The electronic device includes a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the matting method as described in any one of claims 1 to 5, or to implement the matting model training method as described in claim 6.
9. A computer-readable storage medium storing a computer program thereon, characterized in that, When the computer program is executed by the processor, it implements the image matting method as described in any one of claims 1 to 5, or the image matting model training method as described in claim 6.