A method, apparatus, device, and medium for extracting a foreground image.

Processing images with a semantic segmentation model addresses the poor accuracy of foreground image extraction in traditional methods and achieves better separation of foreground objects.

CN115810103B · Active · Publication Date: 2026-06-30 · CHINA TELECOM CLOUD TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA TELECOM CLOUD TECH CO LTD
Filing Date: 2022-08-01
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Traditional pixel-based foreground extraction methods cannot effectively separate moving objects that are similar to the background, resulting in poor accuracy in foreground image extraction.

Method used

A semantic segmentation model is used to process images, and the semantic information in the images is used for foreground segmentation to improve extraction accuracy.

Benefits of technology

By considering semantic information in the image, foreground objects that are similar to the background can be effectively separated, thus improving the accuracy of foreground image extraction.



Abstract

This application provides a method, apparatus, device, and medium for extracting foreground images, addressing the problem of poor accuracy in foreground image extraction in existing technologies. In embodiments of this application, a first image from which the foreground image is to be extracted is acquired and input into a semantic segmentation model. Based on the semantic segmentation model, the foreground image can be segmented from the first image. Because the semantic segmentation model considers semantic information in the image, it can extract a foreground that is similar to the background, thereby improving the accuracy of foreground image extraction.

Description

Technical Field

[0001] This invention relates to the field of image segmentation technology, and in particular to a method, apparatus, device, and medium for extracting foreground images.

Background Technology

[0002] In the era of big data, surveillance cameras are ubiquitous in people's public lives. How to effectively utilize and analyze this camera data is a crucial issue for industry. Foreground extraction, a traditional problem in computer vision, has wide applications in important fields such as video surveillance, traffic motion analysis, and video summarization. The increasing variety of applications also presents various challenges to the efficiency and accuracy of foreground extraction algorithms.

[0003] Traditional image processing-based foreground extraction methods typically use pixel-value-based machine learning models to model the background image, then calculate the difference between the input image and the modeled background image as the motion component in the image, and binarize this motion component to obtain the foreground mask. This pixel-value-based model cannot separate moving objects similar to the background, resulting in poor accuracy of the extracted foreground image.

Summary of the Invention

[0004] This invention provides a method, apparatus, device, and medium for extracting foreground images, in order to solve the problem of poor accuracy in the extracted foreground images.

[0005] In a first aspect, embodiments of this application provide a method for extracting a foreground image, the method comprising:

[0006] Obtaining a first image from which the foreground image is to be extracted;

[0007] Inputting the first image, as the input image, into a semantic segmentation model;

[0008] Outputting, based on the semantic segmentation model, the segmentation result of the foreground image in the first image.

[0009] In a second aspect, embodiments of this application also provide a foreground image extraction apparatus, the apparatus comprising:

[0010] An acquisition unit, used to acquire a first image from which the foreground image is to be extracted;

[0011] An input unit, used to input the first image into the semantic segmentation model;

[0012] An output unit, used to output the segmentation result of the foreground image in the first image based on the semantic segmentation model.

[0013] In a third aspect, embodiments of this application also provide an electronic device, which includes at least a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the steps of the foreground image extraction method as described in any of the above.

[0014] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the foreground image extraction method as described in any of the above.

[0015] In this application, a first image from which the foreground image is to be extracted is obtained, and the first image is input into a semantic segmentation model. Based on the semantic segmentation model, the foreground image can be segmented from the first image. Because the semantic segmentation model takes into account the semantic information in the image, it can extract a foreground that is similar to the background, thereby improving the accuracy of foreground image extraction.

Attached Figure Description

[0016] To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings described below show only some embodiments of this application. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0017] Figure 1 is a schematic diagram of a foreground image extraction process provided by some embodiments of this application;

[0018] Figure 2 is a schematic flowchart of a foreground image extraction method provided by some embodiments of this application;

[0019] Figure 3 is a schematic diagram of a foreground image extraction device provided by some embodiments of this application;

[0020] Figure 4 is a schematic diagram of an electronic device structure provided by some embodiments of this application.

Detailed Implementation

[0021] To make the objectives and implementation methods of this application clearer, the exemplary implementation methods of this application will be clearly and completely described below with reference to the accompanying drawings of the exemplary embodiments of this application. Obviously, the exemplary embodiments described are only some embodiments of this application, and not all embodiments.

[0022] It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood in their ordinary and common meaning.

[0023] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar or related objects or entities, and do not necessarily imply a specific order or sequence, unless otherwise specified. It should be understood that such terms are interchangeable where appropriate.

[0024] The terms “comprising” and “having”, and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device that includes a series of components is not necessarily limited to the components that are clearly listed, but may include other components that are not clearly listed or that are inherent to such product or device.

[0025] The term "module" refers to any known or subsequently developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and / or software code that is capable of performing the functions associated with that element.

[0026] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

[0027] For ease of explanation, the above description has been provided in conjunction with specific embodiments. However, the above exemplary discussion is not intended to be exhaustive or to limit the embodiments to the specific forms disclosed above. Various modifications and variations can be obtained based on the above teachings. The selection and description of the above embodiments are for the purpose of better explaining the principles and practical applications, thereby enabling those skilled in the art to better utilize the described embodiments and various different variations of embodiments suitable for specific use considerations.

[0028] This application provides a method, apparatus, device, and medium for extracting foreground images. In this method, a first image from which the foreground image is to be extracted is obtained, and the first image is input into a semantic segmentation model. Based on the semantic segmentation model, the foreground image can be segmented from the first image. Because the semantic segmentation model considers the semantic information in the image, it can extract a foreground that is similar to the background, thereby improving the accuracy of foreground image extraction.

[0029] To improve the accuracy of foreground image extraction, this application provides a method, apparatus, device, and medium for foreground image extraction.

[0030] Example 1:

[0031] Figure 1 is a schematic diagram of a foreground image extraction process provided by some embodiments of this application. The process includes:

[0032] S101: Obtain a first image from which the foreground image is to be extracted.

[0033] The foreground image extraction method provided in this application embodiment is applied to an electronic device, which may be a PC (personal computer), server, image acquisition device, etc.

[0034] In this process, the first image from which the foreground image is to be extracted can be an image acquired in real time; that is, the first image can be the current frame image. If the electronic device is not an image acquisition device, then in step S101 the electronic device can obtain the first image acquired in real time from the image acquisition device.

[0035] Alternatively, the first image from which the foreground image is to be extracted can be a historically acquired image; that is, the foreground image can be extracted from historically acquired images. For example, an electronic device may store historically acquired images.

[0036] S102: Input the first image as the input image into the semantic segmentation model.

[0037] In this step, the input image for the semantic segmentation model includes the first image.

[0038] For example, the input image can be the first image.

[0039] As another example, the input image may include a first image and second images at multiple time scales. For instance, the input image may be a stitched image of the first image and second images at multiple time scales.

[0040] As another example, the input image may include a first image and a preprocessed image obtained by preprocessing the first image. For example, the input may include a stitched image of the first image and the preprocessed image. Optionally, the preprocessing may be instance segmentation processing, and the preprocessed image is an instance segmentation image.

[0041] As another example, the input image may include a first image, second images at multiple time scales, and a preprocessed image.

[0042] The semantic segmentation model can be a pre-trained model. Typically, a semantic segmentation model divides an image into regions with specific semantic meanings and identifies the semantic category of each region, thus obtaining a semantically annotated segmented image. In this embodiment, the semantic segmentation model is therefore used to extract foreground images. This lets the model attend to the semantic information in the input image and thereby separate foregrounds that are similar to the background. Furthermore, the model is not limited to particular object categories; it can separate objects of specific categories as well as objects of unspecified categories (such as debris spilled on highways or objects thrown from heights). This embodiment is therefore applicable to a variety of scenarios, including static foreground extraction, dynamic foreground extraction, and foreground extraction involving specific or unspecified categories.

[0043] S103: Based on the semantic segmentation model, output the segmentation result of the foreground image in the first image.

[0044] The segmentation result output by this step is the foreground image obtained by segmenting the first image based on the semantic information in the first image.

[0045] The segmentation result may include the foreground image, or it may be an image obtained by labeling the foreground image in the first image. There are no restrictions on the labeling method.
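The two result forms mentioned above can be sketched as follows. This is only an illustration: it assumes the segmentation result is available as a binary mask (1 for foreground), and the function names and the red marking color are hypothetical choices, since the application places no restriction on the labeling method.

```python
import numpy as np

def foreground_only(image, mask):
    """Keep foreground pixels and zero out the background."""
    return image * mask[..., None]

def label_foreground(image, mask, color=(255, 0, 0)):
    """Return a copy of the image with foreground pixels painted in `color`."""
    labeled = image.copy()
    labeled[mask.astype(bool)] = color
    return labeled

# Toy 2x2 RGB image with a single foreground pixel at the top-left corner.
image = np.full((2, 2, 3), 50, dtype=np.uint8)
mask = np.array([[1, 0], [0, 0]], dtype=np.uint8)
cutout = foreground_only(image, mask)    # foreground image on a black background
labeled = label_foreground(image, mask)  # first image with the foreground marked
```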

[0046] In this embodiment of the application, a first image from which the foreground image is to be extracted is obtained, and the first image is input into a semantic segmentation model. Based on the semantic segmentation model, the foreground image can be segmented from the first image. Because the semantic segmentation model takes into account the semantic information in the image, it can extract a foreground that is similar to the background, thereby improving the accuracy of foreground image extraction.

[0047] Example 2:

[0048] To improve the accuracy of foreground image extraction, based on the above embodiments, the method in this application embodiment further includes:

[0049] For each of multiple time scales, extract at least one frame of image at that time scale from the video stream, determine the pixel value statistics of the at least one frame of image, and determine the second image at that time scale based on the pixel value statistics.

[0050] Inputting the first image, as the input image, into the semantic segmentation model includes:

[0051] Inputting the stitched image of the first image and the second images at multiple time scales into the semantic segmentation model as the input image.

[0052] A time scale is defined relative to the current frame image: it lies before the current frame image and is separated from it by a certain time interval. For example, the time scale can be a few seconds, a few minutes, or a few hours before the current frame image. There are no restrictions here.

[0053] Because the frame rate of the video stream is constant, the image corresponding to a time scale can be located in the video stream from the time scale and the frame rate. For example, the frame rate can be, but is not limited to, 25 frames per second or 30 frames per second. Of course, the time scale and the image corresponding to the time scale in the video stream can also be determined according to the user's actual needs.

[0054] When extracting at least one frame of image at a time scale from a video stream, it is possible to extract multiple frames that are consecutive in time for that time scale, to extract multiple frames that are not consecutive in time (for example, at a set interval or randomly), or to extract a single frame.
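As a rough illustration of the paragraph above, a time scale can be turned into a frame offset by multiplying it by the frame rate. The function names and the clamping at index 0 are assumptions made for this sketch and are not taken from the application.

```python
def frames_back(time_scale_seconds, fps=25.0):
    """Number of frames between the current frame and the given time scale."""
    return round(time_scale_seconds * fps)

def frame_index_at(current_index, time_scale_seconds, fps=25.0):
    """Index of the frame `time_scale_seconds` before the current frame."""
    # Clamp at 0 in case the stream is shorter than the requested time scale.
    return max(0, current_index - frames_back(time_scale_seconds, fps))
```

For instance, at 25 frames per second a 2-second time scale lies 50 frames before the current frame, and a 2-minute time scale lies 3000 frames before it.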

[0055] The pixel value statistics of at least one frame of an image may include, but are not limited to, the mean pixel value, the median pixel value, or the mode pixel value.

[0056] When determining the second image at a time scale based on pixel value statistics, an image matching the pixel value statistics can be extracted from the video stream as the second image; or a new image can be generated based on the pixel value statistics as the second image, such as using the statistical value of each pixel position in the pixel value statistics as the pixel value of the corresponding pixel position in the new image.
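The second option, generating a new image from per-pixel statistics, might look like the sketch below. It assumes 8-bit frames of identical shape; the function name `second_image` and the tie-breaking rule for the mode (smallest value wins) are illustrative assumptions.

```python
import numpy as np

def second_image(frames, statistic="mean"):
    """Build a second image from per-pixel statistics of uint8 frames."""
    stack = np.stack(frames)  # shape (T, H, W) or (T, H, W, C)
    if statistic == "mean":
        result = stack.mean(axis=0)
    elif statistic == "median":
        result = np.median(stack, axis=0)
    elif statistic == "mode":
        # Most frequent value at each pixel position; ties pick the smallest.
        result = np.apply_along_axis(
            lambda values: np.bincount(values, minlength=256).argmax(), 0, stack
        )
    else:
        raise ValueError(f"unknown statistic: {statistic}")
    return np.rint(result).astype(np.uint8)

# Three toy frames whose pixels all hold 10, 10, and 40 respectively.
frames = [np.full((2, 2), v, dtype=np.uint8) for v in (10, 10, 40)]
mean_img = second_image(frames, "mean")
median_img = second_image(frames, "median")
mode_img = second_image(frames, "mode")
```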

[0057] When determining the images to be stitched, the first image and multiple second images at different time scales can be stitched together along the image channel dimension, or the first image and multiple second images at different time scales can be stitched together along the time dimension.

[0058] To illustrate with a concrete example, an image acquisition device reads video stream frame data in real time, storing the acquired historical frame data and the current frame image data in a memory buffer. The device uses the historical data stored in the memory buffer to calculate the pixel-wise average of the video two seconds prior to the current frame, determining a second image at a 2-second timescale. Then, it uses the historical data stored in the memory buffer to calculate the pixel-wise average of the video two minutes prior to the current frame, determining a second image at a 2-minute timescale. Finally, the current frame image, the second image at the 2-second timescale, and the second image at the 2-minute timescale are stitched together along the image channel dimension. Here, multiple second images determined at different timescales serve as historical information for the video stream data, providing a reference for the current frame image, i.e., the first image.
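The example above can be sketched with a frame buffer and channel-wise concatenation. This is a minimal sketch under stated assumptions: a constant 25 fps, RGB frames in height-width-channel layout, and "the video two seconds prior" read as the frames buffered during the last two seconds. `FPS`, `mean_image`, and `stitch_channels` are hypothetical names.

```python
from collections import deque
import numpy as np

FPS = 25  # assumed constant frame rate

def mean_image(buffer, seconds, fps=FPS):
    """Pixel-wise mean of the frames buffered during the last `seconds`."""
    count = min(len(buffer), int(seconds * fps))
    recent = list(buffer)[-count:]
    return np.mean(np.stack(recent), axis=0).astype(np.float32)

def stitch_channels(current, buffer, time_scales=(2, 120)):
    """Concatenate the current frame with one mean image per time scale."""
    images = [current.astype(np.float32)]
    images += [mean_image(buffer, seconds) for seconds in time_scales]
    return np.concatenate(images, axis=-1)  # (H, W, 3 * (1 + len(time_scales)))

# Toy stream: 100 buffered frames whose pixel values count up from 0 to 99.
buffer = deque(maxlen=120 * FPS)
for value in range(100):
    buffer.append(np.full((4, 4, 3), value, dtype=np.uint8))
current = np.full((4, 4, 3), 100, dtype=np.uint8)
stitched = stitch_channels(current, buffer)
```

With two time scales and RGB frames, the stitched input has nine channels: three for the current frame and three for each mean image. When the buffer holds less history than a time scale asks for, this sketch simply averages what is available.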

[0059] In this embodiment, for multiple time scales, at least one frame of image at each time scale is extracted, the pixel value statistics are determined, and then the second image at the corresponding time scale is determined based on the pixel value statistics. The stitched image of the first image and the second images at multiple time scales is used as the input of the semantic segmentation model. The input image of the semantic segmentation model contains richer semantic information, which is beneficial to improving the accuracy of foreground image extraction.

[0060] Example 3:

[0061] To reduce the complexity of foreground image extraction and improve its accuracy, based on the above embodiments, the method in this application embodiment further includes:

[0062] The first image is input into the instance segmentation model;

[0063] Based on the instance segmentation model, output the instance segmentation image of the first image;

[0064] The step of inputting the first image as the input image into the semantic segmentation model includes:

[0065] The stitched image of the first image and the instance segmentation image is used as the input image to the semantic segmentation model.

[0066] Instance segmentation models can be pre-trained models. Typically, instance segmentation models use object detection methods to outline different instances in an image, and then use semantic segmentation methods to label each instance region pixel by pixel to obtain an instance segmentation image.

[0067] When determining the images to be stitched, the first image and the instance segmentation image can be stitched together in the image channel dimension, or the first image and the instance segmentation image can be stitched together in the time dimension. Alternatively, the first image, multiple second images at different time scales, and the instance segmentation image can be stitched together in the image channel dimension, or the first image, multiple second images at different time scales, and the instance segmentation image can be stitched together in the time dimension.

[0068] To illustrate with a specific example, the image acquisition device can stitch together the current frame image, the second image at the 2-second time scale, the second image at the 2-minute time scale, and the instance segmentation image along the image channel dimension to obtain a stitched image, and then input the stitched image into the semantic segmentation model.
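A hedged sketch of this stitching step follows. It assumes the instance segmentation image is a single-channel integer label map (0 for no instance), which the application does not specify, and the function name is hypothetical.

```python
import numpy as np

def stitch_with_instances(first_image, second_images, instance_map):
    """Channel-wise stitching of the frame, the mean images, and the instance map."""
    parts = [first_image.astype(np.float32)]
    parts += [image.astype(np.float32) for image in second_images]
    # Add a trailing channel axis so the label map can be concatenated.
    parts.append(instance_map.astype(np.float32)[..., None])
    return np.concatenate(parts, axis=-1)

first_image = np.zeros((4, 4, 3), dtype=np.uint8)
mean_2s = np.ones((4, 4, 3), dtype=np.float32)
mean_2min = np.full((4, 4, 3), 2, dtype=np.float32)
instance_map = np.zeros((4, 4), dtype=np.int32)
instance_map[1:3, 1:3] = 1  # one instance occupying the center
stitched = stitch_with_instances(first_image, [mean_2s, mean_2min], instance_map)
```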

[0069] Optionally, the semantic segmentation model and the instance segmentation model are stored in an electronic device.

[0070] In this embodiment of the application, the instance segmentation model preprocesses the first image to obtain the instance segmentation image in advance, thereby making the semantic information more salient. This can reduce the complexity of foreground extraction to a certain extent and also helps improve the accuracy of the foreground extraction result.

[0071] Example 4:

[0072] To improve the accuracy of foreground extraction results, based on the above embodiments, in this embodiment, outputting the segmentation result of the foreground image in the first image based on the semantic segmentation model includes:

[0073] Based on the semantic segmentation model, feature encoding is performed on the input image to obtain the image features of the input image;

[0074] Based on the semantic segmentation model, feature decoding is performed on image features to obtain feature images;

[0075] Based on the semantic segmentation model, the feature image is binarized to determine the segmentation result of the foreground image in the first image.

[0076] Semantic segmentation models can implement feature encoding. For example, a semantic segmentation model can include a feature encoding unit, which encodes the features of the input image to obtain image features. For instance, an electronic device inputs an input image into the feature encoding unit, and after the image passes through multiple convolutional layers and multiple pooling layers, small-sized encoded image features are obtained.

[0077] Semantic segmentation models can perform feature decoding. For example, a semantic segmentation model may include a feature decoding unit, which decodes the image features to obtain a feature image. For instance, an electronic device inputs image features into the feature decoding unit, which then processes them through multiple convolutional layers and upsampling layers to further extract features, increase the image size, and output an image with the same size as the original input image.

[0078] The electronic device binarizes the feature image to highlight the features of interest (foreground features) and ignore unwanted parts (background features), thus reducing noise from these unwanted parts. Specifically, the electronic device can binarize the feature image based on a stored threshold, which can be set manually or obtained using a predefined algorithm such as Otsu's method. Optionally, this binarization process is performed pixel by pixel.
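For readers unfamiliar with Otsu's method, the sketch below picks the threshold that maximizes the between-class variance of the histogram and then binarizes pixel by pixel. It assumes the feature image has been scaled to 8-bit values (0 to 255); the function names are illustrative and the application does not prescribe this implementation.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method for a uint8 image: maximize the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # probability of the class <= t
    mu = np.cumsum(prob * np.arange(256))  # cumulative first moment up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0  # thresholds that leave one class empty
    return int(np.argmax(sigma_b))

def binarize(gray, threshold):
    """Pixel-by-pixel binarization: 1 is foreground, 0 is background."""
    return (gray > threshold).astype(np.uint8)

# Toy feature image: a dark left half and a bright right half.
gray = np.array([[20, 20, 200, 200]] * 4, dtype=np.uint8)
threshold = otsu_threshold(gray)
mask = binarize(gray, threshold)
```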

[0079] In this embodiment of the application, the feature image is binarized, so the channel dimension of the output feature image is usually 1, and after binary classification and logistic regression, the value of each pixel is between 0 and 1. If the number of channels of the output feature image is not 1, an additional 1x1 convolutional layer can be added (for example only) to make the number of channels of the output feature 1.
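The channel reduction and the mapping to values between 0 and 1 can be illustrated as below. This sketch reads "logistic regression" as applying a sigmoid to each pixel, which is an interpretation rather than something the application states; the weights are placeholders, not trained parameters.

```python
import numpy as np

def conv1x1(features, weights, bias=0.0):
    """1x1 convolution: a per-pixel linear map from C channels to 1 channel."""
    # features has shape (H, W, C); weights has shape (C,).
    return features @ weights + bias

def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

features = np.zeros((2, 2, 4))
features[0, 0] = [1.0, 2.0, 3.0, 4.0]
weights = np.ones(4)  # placeholder weights for illustration
probabilities = sigmoid(conv1x1(features, weights))
```

The result has one value per pixel, so a single threshold can then be applied to obtain the binary segmentation result.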

[0080] The following specific example illustrates this; see Figure 2. The electronic device acquires a first image, extracts at least one frame from the video stream at each time scale, determines the mean of the at least one frame, and determines a second image at each time scale (i.e., the first-scale mean image, ..., the Nth-scale mean image). Based on the first image, the electronic device also acquires an instance segmentation image of the first image. The electronic device inputs the stitched image of the first image, the first-scale mean image, ..., the Nth-scale mean image, and the instance segmentation image into the feature encoding unit of the semantic segmentation model to obtain image features. These image features are then input into the feature decoding unit of the semantic segmentation model to obtain a feature image. Finally, the electronic device binarizes the feature image to obtain the image binarization result, which is the segmentation result of the foreground image in the first image.

[0081] In this embodiment, the electronic device inputs the input image into the semantic segmentation model, performs feature encoding to obtain the image features of the input image, performs feature decoding on the image features to obtain the feature image, and then binarizes the feature image to obtain the foreground image and background image of the current frame image. This realizes the segmentation of the foreground image in the first image. It effectively utilizes the semantic information in the image and improves the extraction of objects similar to the background, thus improving the accuracy of foreground image extraction.

[0082] Example 5:

[0083] To improve the accuracy of foreground extraction results, based on the above embodiments, in this embodiment, feature encoding is performed on the first image based on the semantic segmentation model to obtain image features of the first image, including:

[0084] Based on the first and second neural network layers of the feature encoding unit in the semantic segmentation model, the first image is feature-encoded to obtain the image features of the first image.

[0085] The feature encoding unit includes, but is not limited to, a first neural network layer and a second neural network layer. Feature encoding of an image can be achieved based on the first neural network layer and the second neural network layer.

[0086] For example, the first neural network layer includes, but is not limited to, one or more convolutional layers, and the second neural network layer includes, but is not limited to, one or more pooling layers.

[0087] Feature encoding units typically employ an image pyramid structure. The input image passes through a first neural network layer and a second neural network layer. The first neural network layer includes multiple convolutional layers; the channel dimension of the feature image increases with each pass through the first layer. The second neural network layer includes multiple pooling layers; the width and height of the feature image decrease with each pass through the second layer, effectively reducing the size of the feature image. The output image features are small-sized encoded data. Multiple convolutional and pooling layers can effectively extract semantic information from the image, improving the extraction performance for objects similar to the background.
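The shape changes described above can be traced with a small sketch. To stay short it stands in for each convolutional layer with a 1x1 channel mixing plus ReLU and uses random, untrained weights; real encoders such as VGGNet or ResNet use larger kernels and learned parameters. The channel plan (16, 32, 64) is an arbitrary assumption.

```python
import numpy as np

def conv_channels(x, weights):
    """Stand-in for a convolutional layer: 1x1 channel mixing followed by ReLU."""
    return np.maximum(x @ weights, 0.0)  # (H, W, C_in) @ (C_in, C_out)

def max_pool2x2(x):
    """2x2 max pooling with stride 2; H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def encode(x, channel_plan=(16, 32, 64), seed=0):
    """Alternate convolution (more channels) and pooling (smaller H and W)."""
    rng = np.random.default_rng(seed)
    shapes = [x.shape]
    for c_out in channel_plan:
        x = conv_channels(x, rng.standard_normal((x.shape[-1], c_out)))
        x = max_pool2x2(x)
        shapes.append(x.shape)
    return x, shapes

pooled = max_pool2x2(np.arange(16, dtype=np.float64).reshape(4, 4, 1))
features, shapes = encode(np.ones((32, 32, 9)))
```

Starting from a 32x32 input with 9 channels, each stage halves the width and height while the channel count grows, ending at a 4x4 feature map with 64 channels.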

[0088] Specifically, in the embodiments of this application, the feature encoding unit structure includes fully convolutional backbone structures such as VGGNet (deep learning network) and ResNet (deep residual network).

[0089] In this embodiment, the electronic device inputs the input image into the feature encoding unit. After passing through the convolutional layer and pooling layer, the semantic information in the image can be fully extracted, improving the extraction effect of objects similar to the background, thereby improving the accuracy of foreground image extraction.

[0090] Example 6:

[0091] To improve the accuracy of foreground extraction results, based on the above embodiments, in this embodiment, image features are decoded using a semantic segmentation model to obtain a feature image, including:

[0092] Based on the third and fourth neural network layers of the feature decoding unit in the semantic segmentation model, feature decoding is performed on image features to obtain a feature image.

[0093] The feature decoding unit includes, but is not limited to, the third neural network layer and the fourth neural network layer, and feature decoding can be achieved based on the third neural network layer and the fourth neural network layer.

[0094] For example, the third neural network layer includes, but is not limited to, one or more convolutional layers, and the fourth neural network layer includes, but is not limited to, one or more upsampling layers and / or one or more deconvolutional layers.

[0095] For example, the feature decoding unit uses convolutional layers and upsampling layers as the network structure. The input image features are further extracted by passing through multiple convolutional layers; the input image features are increased in size by passing through multiple upsampling layers, and finally the output feature image is obtained with the same size as the original input image, i.e., the first image or the stitched image.
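A minimal sketch of such a decoder is given below, assuming nearest-neighbor upsampling by a factor of 2 and, as a stand-in for real convolutional layers, a 1x1 channel mixing with random untrained weights. The channel plan and function names are illustrative only.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor upsampling: double H and W of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decode(features, channel_plan=(32, 16, 1), seed=0):
    """Alternate a stand-in convolution with 2x upsampling."""
    rng = np.random.default_rng(seed)
    for c_out in channel_plan:
        weights = rng.standard_normal((features.shape[-1], c_out))
        features = np.maximum(features @ weights, 0.0)  # stand-in convolution
        features = upsample2x(features)
    return features

small = np.array([[1, 2], [3, 4]], dtype=np.float32)[..., None]
enlarged = upsample2x(small)
feature_image = decode(np.ones((4, 4, 64)))
```

Three upsampling steps bring a 4x4 feature map back to 32x32, matching the size of an input that was pooled three times during encoding.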

[0096] For example, the feature decoding unit uses convolutional and deconvolutional layers as its network structure. The input image features are further extracted by passing through multiple convolutional layers; the input feature image passes through multiple deconvolutional layers, increasing the image scale and the dimension of the feature image, ultimately resulting in an output feature image with the same size as the original input image, i.e., the first image or the stitched image.
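To show how a deconvolutional (transposed convolution) layer enlarges a feature map, the sketch below handles the simplest case: a single channel, a 2x2 kernel, and stride 2, so the kernel copies do not overlap. These settings and the function name are assumptions chosen for brevity; practical layers have many channels and learned kernels.

```python
import numpy as np

def transposed_conv2x(x, kernel):
    """Stride-2 transposed convolution of a single-channel map with a 2x2 kernel."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=np.float64)
    for i in range(2):
        for j in range(2):
            # Each input pixel writes a kernel-weighted copy into its 2x2 block.
            out[i::2, j::2] = x * kernel[i, j]
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
kernel = np.array([[1.0, 0.5], [0.5, 0.25]])
enlarged = transposed_conv2x(x, kernel)
```

In this non-overlapping case the result equals the Kronecker product of the input and the kernel, and an all-ones kernel reproduces nearest-neighbor upsampling.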

[0097] In this embodiment, the electronic device inputs image features into the feature decoding unit. After passing through convolutional layers and upsampling / deconvolutional layers, the semantic information in the image can be fully extracted, improving the extraction effect of objects similar to the background, thereby improving the accuracy of foreground image extraction.

[0098] Example 7:

[0099] To reduce the complexity of foreground image extraction, based on the above embodiments, in this embodiment, the input of the third neural network layer includes the output of the fourth neural network layer and / or the feature image.

[0100] In the convolutional layer of the third neural network layer, in addition to using the result of the upsampling layer of the fourth neural network layer as input, the feature image from the feature encoding unit whose size corresponds to that of the third neural network layer can also be used as input. This eliminates the need for several intermediate steps of convolutional layers, pooling layers, and upsampling layers, which can reduce the complexity of foreground image extraction. Furthermore, after further feature extraction through the convolutional layer and an increase of the image scale through the upsampling layer, the output image has the same scale as the original input image.
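One common way to feed an encoder feature image into a decoder layer is a skip connection that concatenates it with the upsampled decoder features, as in U-Net-style networks. The sketch below assumes that design, which the application does not name explicitly; the 1x1 channel mixing again stands in for a real convolution, and the weights are random placeholders.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor upsampling of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decode_step(decoder_features, encoder_features, weights):
    """Upsample, concatenate the same-sized encoder features, then convolve."""
    up = upsample2x(decoder_features)
    if up.shape[:2] != encoder_features.shape[:2]:
        raise ValueError("encoder features must match the upsampled size")
    merged = np.concatenate([up, encoder_features], axis=-1)
    return np.maximum(merged @ weights, 0.0)  # stand-in convolution with ReLU

rng = np.random.default_rng(0)
decoder_features = np.ones((4, 4, 64))
encoder_features = np.ones((8, 8, 32))   # encoder output at the matching size
weights = rng.standard_normal((96, 16))  # 64 + 32 input channels, 16 outputs
out = decode_step(decoder_features, encoder_features, weights)
```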

[0101] Example 8:

[0102] Based on the same technical concept and the above embodiments, this application provides a foreground image extraction device. Figure 3 is a schematic diagram of a foreground image extraction device provided by some embodiments of this application. As shown in Figure 3, the device includes:

[0103] The acquisition module 301 is used to acquire a first image from which the foreground image is to be extracted;

[0104] Input module 302 is used to input the first image as the input image into the semantic segmentation model;

[0105] Output module 303 is used to output the segmentation result of the foreground image in the first image based on the semantic segmentation model.

[0106] In one possible implementation, the device further includes:

[0107] The determination module 304 is used, for each of multiple time scales, to extract at least one frame of image at that time scale from a video stream, determine the pixel value statistics of the at least one frame of image, and determine a second image at that time scale based on the pixel value statistics.

[0108] The input module 302 is specifically used to input the stitched image of the first image and the second images at multiple time scales as the input image into the semantic segmentation model.
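
The construction of the second images and of the stitched input can be sketched as follows. This is a hedged example, not the method of this application: it assumes that the pixel value statistic is the per-pixel median, that a time scale is modelled as a sampling stride over the most recent frames, and that stitching means stacking images as channels. The helper names `sample_frames`, `second_image` and `stitch` are hypothetical.

```python
from statistics import median


def sample_frames(video, stride, count):
    """Pick the `count` most recent frames, stepping back `stride` frames."""
    return video[::-1][::stride][:count][::-1]


def second_image(frames):
    """Assumed statistic: per-pixel median over equally sized grayscale frames."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(w)] for y in range(h)]


def stitch(first_image, second_images):
    """Assumed stitching: stack the first image and second images as channels."""
    return [first_image] + list(second_images)


# Eight 1x2 frames; the left pixel changes over time, the right one is static.
video = [[[i, 10]] for i in range(8)]
first = video[-1]
strides = (1, 2)   # two hypothetical time scales
seconds = [second_image(sample_frames(video, s, count=3)) for s in strides]
input_image = stitch(first, seconds)
```

With these toy values the short time scale yields a median of 6 for the changing pixel and the longer one yields 5, while the static pixel stays at 10 in every channel.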

[0109] In one possible implementation, the determination module 304 is used to input the first image into the instance segmentation model and, based on the instance segmentation model, output an instance segmentation image of the first image.

[0110] The input module 302 is specifically used to input the stitched image of the first image and the instance segmentation image into the semantic segmentation model as the input image.
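
A sketch of stitching the first image with an instance segmentation image is given below. It assumes that the instance segmentation image is an integer label map of the same size as the first image (0 for background, 1..N for instances), that labels are scaled to [0, 1], and that stitching means stacking channels. None of these details is fixed by the text, and `stitch_with_instances` is a hypothetical name.

```python
def stitch_with_instances(first_image, instance_labels):
    """Stack the first image with a normalised instance label map."""
    h, w = len(first_image), len(first_image[0])
    if len(instance_labels) != h or len(instance_labels[0]) != w:
        raise ValueError("instance segmentation image must match the first image")
    # Avoid division by zero when no instance was detected.
    max_id = max(max(row) for row in instance_labels) or 1
    instance_channel = [[v / max_id for v in row] for row in instance_labels]
    return [first_image, instance_channel]


first_image = [[0.2, 0.8], [0.5, 0.1]]
instance_labels = [[0, 1], [2, 2]]   # stand-in for an instance model's output
input_image = stitch_with_instances(first_image, instance_labels)
```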

[0111] In one possible implementation, the output module 303 is specifically used to encode the input image based on the semantic segmentation model to obtain the image features of the input image; to decode the image features based on the semantic segmentation model to obtain the feature image; and to perform binarization processing on the feature image based on the semantic segmentation model to determine the segmentation result of the foreground image in the first image.
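
The final binarization step can be sketched as below. The text does not specify how the feature image is binarized, so this example assumes that the feature image holds raw scores, that a sigmoid maps them to probabilities, and that a threshold of 0.5 separates foreground (1) from background (0). The function name `binarize` is hypothetical.

```python
import math


def binarize(feature_image, threshold=0.5):
    """Assumed rule: sigmoid(score) >= threshold means foreground (1)."""
    return [
        [1 if 1.0 / (1.0 + math.exp(-v)) >= threshold else 0 for v in row]
        for row in feature_image
    ]


feature_image = [[-2.0, 0.3], [1.5, -0.1]]   # toy decoder output
mask = binarize(feature_image)
```

A higher threshold trades missed foreground pixels for fewer false positives; the value would normally be chosen on validation data.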

[0112] In one possible implementation, the output module 303 is specifically used to encode the features of the first image based on the first neural network layer and the second neural network layer of the feature encoding unit in the semantic segmentation model, thereby obtaining the image features of the first image, wherein the first neural network layer includes a convolutional layer and the second neural network layer includes a pooling layer.
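
One encoding stage, a convolutional layer followed by a pooling layer, can be illustrated with the plain-Python sketch below. The kernel values, the absence of padding, and the 2x2 max pooling are assumptions made for the example; a real model would use learned kernels and an array library. The names `conv2d_valid` and `max_pool_2x2` are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Single-channel cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(w - kw + 1)
        ]
        for y in range(h - kh + 1)
    ]


def max_pool_2x2(image):
    """2x2 max pooling with stride 2 (halves each spatial dimension)."""
    return [
        [max(image[y][x], image[y][x + 1], image[y + 1][x], image[y + 1][x + 1])
         for x in range(0, len(image[0]) - 1, 2)]
        for y in range(0, len(image) - 1, 2)
    ]


image = [[5 * y + x for x in range(5)] for y in range(5)]   # 5x5 ramp
kernel = [[1, 0], [0, 1]]                                   # toy kernel
features = max_pool_2x2(conv2d_valid(image, kernel))        # 5x5 -> 4x4 -> 2x2
```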

[0113] In one possible implementation, the output module 303 is specifically used to perform feature decoding on image features based on the third and fourth neural network layers of the feature decoding unit in the semantic segmentation model to obtain a feature image, wherein the third neural network layer includes a convolutional layer and the fourth neural network layer includes an upsampling layer and / or a deconvolutional layer.
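
The relationship between an upsampling layer and a deconvolutional (transposed convolution) layer can be illustrated as follows. This is a sketch under assumptions: a single channel, stride 2, and a fixed 2x2 kernel of ones, in which case the transposed convolution reproduces nearest-neighbour upsampling. A learned kernel would generally give a different result. The function name `transposed_conv2d` is hypothetical.

```python
def transposed_conv2d(image, kernel, stride=2):
    """Single-channel transposed convolution without padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(h):
        for x in range(w):
            # Each input pixel scatters a scaled copy of the kernel.
            for i in range(kh):
                for j in range(kw):
                    out[y * stride + i][x * stride + j] += image[y][x] * kernel[i][j]
    return out


coarse = [[1, 2], [3, 4]]
ones = [[1, 1], [1, 1]]
upsampled = transposed_conv2d(coarse, ones)   # 2x2 -> 4x4
```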

[0114] In one possible implementation, the input to the third neural network layer includes the output of the fourth neural network layer and / or the feature image.

[0115] Example 9:

[0116] Based on the same technical concept, this application also provides an electronic device. Figure 4 is a schematic diagram of the structure of an electronic device provided by this application. As shown in Figure 4, the electronic device includes: processor 401, communication interface 402, memory 403 and communication bus 404, wherein processor 401, communication interface 402 and memory 403 communicate with each other through communication bus 404.

[0117] The memory 403 stores a computer program. When the program is executed by the processor 401, the processor 401 performs the following steps:

[0118] Obtain the first image of the foreground image to be extracted;

[0119] The first image is used as the input image and fed into the semantic segmentation model;

[0120] Based on the semantic segmentation model, the segmentation result of the foreground image in the first image is output.

[0121] In one possible implementation, processor 401 is specifically configured to acquire a first image of the foreground image to be extracted;

[0122] The first image is used as the input image and fed into the semantic segmentation model;

[0123] Based on the semantic segmentation model, the segmentation result of the foreground image in the first image is output.

[0124] Furthermore, processor 401 is also used for:

[0125] For multiple time scales, extract at least one frame image of the time scale from the video stream, determine the pixel value statistics of the at least one frame image, and determine the second image of the time scale based on the pixel value statistics.

[0126] The stitched image of the first image and the second images at multiple time scales is input into the semantic segmentation model.

[0127] Furthermore, processor 401 is also used for:

[0128] The first image is input into the instance segmentation model;

[0129] Based on the instance segmentation model, output the instance segmentation image of the first image;

[0130] The stitched image of the first image and the instance segmentation image is used as the input image to the semantic segmentation model.

[0131] Furthermore, the processor 401 is specifically used to: perform feature encoding on the input image based on a semantic segmentation model to obtain the image features of the input image;

[0132] Based on the semantic segmentation model, feature decoding is performed on image features to obtain feature images;

[0133] Based on the semantic segmentation model, the feature image is binarized to determine the segmentation result of the foreground image in the first image.

[0134] Furthermore, the processor 401 is specifically used to: encode the features of the first image based on the first neural network layer and the second neural network layer of the feature encoding unit in the semantic segmentation model to obtain the image features of the first image, wherein the first neural network layer includes a convolutional layer and the second neural network layer includes a pooling layer.

[0135] Furthermore, the processor 401 is specifically used to: perform feature decoding on image features based on the third and fourth neural network layers of the feature decoding unit in the semantic segmentation model to obtain a feature image, wherein the third neural network layer includes a convolutional layer and the fourth neural network layer includes an upsampling layer and / or a deconvolutional layer.

[0136] Furthermore, the input to the third neural network layer includes the output of the fourth neural network layer and / or the feature image.

[0137] The communication bus mentioned in the above electronic devices can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0138] Communication interface 402 is used for communication between the above-mentioned electronic device and other devices.

[0139] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0140] The processors mentioned above can be general-purpose processors, including central processing units, network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits, field-programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.

[0141] Example 10:

[0142] Based on the same technical concept, embodiments of this application provide a computer-readable storage medium storing a computer program executable by an electronic device. When the program is run on the electronic device, the electronic device performs the following steps:

[0143] Obtain the first image of the foreground image to be extracted;

[0144] The first image is used as the input image and fed into the semantic segmentation model;

[0145] Based on the semantic segmentation model, the segmentation result of the foreground image in the first image is output.

[0146] In one possible implementation, the steps further include:

[0147] For multiple time scales, extract at least one frame image of the time scale from the video stream, determine the pixel value statistics of the at least one frame image, and determine the second image of the time scale based on the pixel value statistics.

[0148] Inputting the first image as the input image into the semantic segmentation model includes:

[0149] The stitched image of the first image and the second images at multiple time scales is input into the semantic segmentation model.

[0150] In one possible implementation, the steps further include:

[0151] The first image is input into the instance segmentation model;

[0152] Based on the instance segmentation model, output the instance segmentation image of the first image;

[0153] Inputting the first image as the input image into the semantic segmentation model includes:

[0154] The stitched image of the first image and the instance segmentation image is used as the input image to the semantic segmentation model.

[0155] In one possible implementation, outputting the segmentation result of the foreground image in the first image based on the semantic segmentation model includes:

[0156] Based on the semantic segmentation model, feature encoding is performed on the input image to obtain the image features of the input image;

[0157] Based on the semantic segmentation model, feature decoding is performed on image features to obtain feature images;

[0158] Based on the semantic segmentation model, the feature image is binarized to determine the segmentation result of the foreground image in the first image.

[0159] In one possible implementation, performing feature encoding on the first image based on the semantic segmentation model to obtain the image features of the first image includes:

[0160] Based on the first and second neural network layers of the feature encoding unit in the semantic segmentation model, the first image is feature-encoded to obtain the image features of the first image, wherein the first neural network layer includes a convolutional layer and the second neural network layer includes a pooling layer.

[0161] In one possible implementation, performing feature decoding on the image features based on the semantic segmentation model to obtain a feature image includes:

[0162] Based on the third and fourth neural network layers of the feature decoding unit in the semantic segmentation model, feature decoding is performed on image features to obtain a feature image. The third neural network layer includes a convolutional layer, and the fourth neural network layer includes an upsampling layer and / or a deconvolutional layer.

[0163] In one possible implementation, the input to the third neural network layer includes the output of the fourth neural network layer and / or the feature image.

[0164] The aforementioned computer-readable storage medium can be any available medium or data storage device that can be accessed by the processor in an electronic device, including but not limited to magnetic storage such as floppy disks, hard disks, magnetic tapes, magneto-optical disks (MO), optical storage such as CDs, DVDs, BDs, HVDs, etc., and semiconductor storage such as ROMs, EPROMs, EEPROMs, non-volatile memory (NAND flash), solid-state drives (SSDs), etc.

[0165] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0166] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to this application. It should be understood that each process and / or block of the flowchart illustrations and / or block diagrams, and combinations of processes and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0167] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device, which implements the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0168] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0169] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A method for extracting a foreground image, characterized in that the method includes: obtaining the first image of the foreground image to be extracted; for multiple time scales, extracting at least one frame image at each time scale from a video stream, determining the pixel value statistics of the at least one frame image, determining a second image at each time scale based on the pixel value statistics, and inputting the stitched image of the first image and the second images at the multiple time scales into a semantic segmentation model; or inputting the first image into an instance segmentation model, outputting the instance segmentation image of the first image based on the instance segmentation model, and inputting the stitched image of the first image and the instance segmentation image into the semantic segmentation model; and outputting the segmentation result of the foreground image in the first image based on the semantic segmentation model.

2. The method as described in claim 1, characterized in that the step of outputting the segmentation result of the foreground image in the first image based on the semantic segmentation model includes: performing feature encoding on the input image based on the semantic segmentation model to obtain the image features of the input image; performing feature decoding on the image features based on the semantic segmentation model to obtain a feature image; and binarizing the feature image based on the semantic segmentation model to determine the segmentation result of the foreground image in the first image.

3. The method as described in claim 2, characterized in that the step of performing feature encoding on the first image based on the semantic segmentation model to obtain the image features of the first image includes: performing feature encoding on the first image based on the first neural network layer and the second neural network layer of the feature encoding unit in the semantic segmentation model to obtain the image features of the first image, wherein the first neural network layer includes a convolutional layer and the second neural network layer includes a pooling layer.

4. The method as described in claim 2, characterized in that the step of performing feature decoding on the image features based on the semantic segmentation model to obtain a feature image includes: performing feature decoding on the image features based on the third neural network layer and the fourth neural network layer of the feature decoding unit in the semantic segmentation model to obtain a feature image, wherein the third neural network layer includes a convolutional layer and the fourth neural network layer includes an upsampling layer and / or a deconvolutional layer.

5. The method as described in claim 4, characterized in that the input to the third neural network layer includes the output of the fourth neural network layer and / or the feature image.

6. A foreground image extraction device, characterized in that the device includes: an acquisition module, used to acquire the first image of the foreground image to be extracted; an input module, used to, for multiple time scales, extract at least one frame image at each time scale from a video stream, determine the pixel value statistics of the at least one frame image, determine a second image at each time scale based on the pixel value statistics, and input a stitched image of the first image and the second images at the multiple time scales as an input image into a semantic segmentation model; or to input the first image into an instance segmentation model, output an instance segmentation image of the first image based on the instance segmentation model, and input a stitched image of the first image and the instance segmentation image as an input image into the semantic segmentation model; and an output module, used to output the segmentation result of the foreground image in the first image based on the semantic segmentation model.

7. An electronic device, characterized in that the electronic device includes at least a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the steps of the foreground image extraction method as described in any one of claims 1-5.

8. A computer storage medium, characterized in that it stores a computer program executable by an electronic device, which, when run on the electronic device, causes the electronic device to perform the steps of the foreground image extraction method as described in any one of claims 1-5.