Image processing method and device, electronic equipment and computer readable storage medium

By generating a second depth image that matches the target style and combining it with a foreground recognition model, the problems of accuracy and adaptability in image foreground region recognition are solved, achieving a more efficient foreground recognition effect.

CN116152320B · Active · Publication Date: 2026-06-23 · GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD
Filing Date: 2021-11-23
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing technologies for image foreground region recognition suffer from insufficient accuracy and poor adaptability, especially in image beautification and blurring processes where it is difficult to effectively identify the foreground and background.

Method used

By acquiring the image to be processed and its corresponding first depth image, a second depth image matching the target style is generated, and the trained foreground recognition model is used for processing. The depth information is combined to perform foreground recognition, thereby improving the recognition accuracy and adaptability.

Benefits of technology

It improves the accuracy and adaptability of image foreground region recognition, expands application scenarios, and solves the problems of insufficient semantic estimation ability and insufficient generalization performance of neural networks for unseen objects.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application disclose an image processing method and device, electronic equipment and a computer readable storage medium. The method comprises: acquiring a to-be-processed image and a first depth image corresponding to the to-be-processed image through an image acquisition device; generating a second depth image matching a target style according to the to-be-processed image and the first depth image; processing the to-be-processed image and the second depth image through a trained foreground recognition model to obtain a foreground recognition result; wherein the foreground recognition model is trained according to a training data set, the training data set comprises multiple frames of sample images, a sample depth image corresponding to each frame of the sample images, and a labeled foreground result corresponding to each frame of the sample images, and the style of the sample depth image is the target style. The image processing method, device, electronic equipment and computer readable storage medium described above can improve the recognition accuracy of the foreground area of the image and have stronger adaptability.

Description

Technical Field

[0001] This application relates to the field of imaging technology, specifically to an image processing method, apparatus, electronic device, and computer-readable storage medium.

Background Technology

[0002] In the field of imaging technology, image recognition problems that require distinguishing between the foreground and background of an image are frequently encountered. For example, before beautifying a portrait image, it is necessary to first identify the foreground portrait area and then apply beautification to that area. Similarly, before blurring an image, it is usually necessary to first identify the background area and then blur it. Currently, there is still a need to optimize the identification of the foreground region of an image.

Summary of the Invention

[0003] This application discloses an image processing method, apparatus, electronic device, and computer-readable storage medium, which can improve the accuracy of foreground region recognition in an image and has stronger adaptability.

[0004] This application discloses an image processing method, including:

[0005] An image to be processed and a first depth image corresponding to the image to be processed are acquired by an image acquisition device. The first depth image is used to describe the depth information of the image to be processed.

[0006] Based on the image to be processed and the first depth image, a second depth image matching the target style is generated; depth images obtained through different depth estimation methods correspond to different styles.

[0007] The image to be processed and the second depth image are processed by a trained foreground recognition model to obtain a foreground recognition result. The foreground recognition result is used to describe the image location of the foreground region in the image to be processed. The foreground recognition model is trained based on a training dataset, which includes multiple frames of sample images, sample depth images corresponding to each frame of the sample images, and labeled foreground results corresponding to each frame of the sample images. The style of the sample depth images is the target style.

[0008] This application discloses an image processing apparatus, including:

[0009] An image acquisition module is used to acquire an image to be processed and a first depth image corresponding to the image to be processed through an image acquisition device. The first depth image is used to describe the depth information of the image to be processed.

[0010] The style adaptation module is used to generate a second depth image that matches the target style based on the image to be processed and the first depth image; the depth images obtained by different depth estimation methods correspond to different styles;

[0011] The foreground recognition module is used to process the image to be processed and the second depth map using a trained foreground recognition model to obtain a foreground recognition result. The foreground recognition result is used to describe the foreground region in the image to be processed. The foreground recognition model is trained based on a training dataset, which includes multiple frames of sample images, sample depth images corresponding to each frame of the sample images, and labeled foreground results corresponding to each frame of the sample images. The style of the sample depth images is the target style.

[0012] This application discloses an electronic device, including a memory and a processor. The memory stores a computer program, and when the computer program is executed by the processor, the processor performs the method described above.

[0013] This application discloses a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the method described above.

[0014] The image processing method, apparatus, electronic device, and computer-readable storage medium provided in the embodiments of this application acquire an image to be processed and a first depth image corresponding to the image to be processed through an image acquisition device. The first depth image is used to describe the depth information of the image to be processed. Based on the image to be processed and the first depth image, a second depth image matching the target style is generated. Then, a trained foreground recognition model processes the image to be processed and the second depth image to obtain the foreground recognition result. In the embodiments of this application, when the foreground recognition model performs foreground recognition on the image to be processed, combining depth information allows the semantic information of the image to be processed to be estimated better. This addresses the insufficient semantic estimation ability of neural networks and their insufficient generalization to unseen objects in related technologies, and improves the accuracy of foreground region recognition. Moreover, the first depth image acquired by the image acquisition device is converted into a second depth image whose style is consistent with that of the sample depth images, making the second depth image suitable for the foreground recognition model, improving the adaptability of the foreground recognition model, and expanding the application scenarios.

Description of the Drawings

[0015] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a block diagram of an image processing circuit in one embodiment;

[0017] Figure 2 is a flowchart of an image processing method in one embodiment;

[0018] Figure 3A is a flowchart for generating a second depth image that matches the target style in one embodiment;

[0019] Figure 3B is a schematic diagram illustrating the generation of a second depth image that matches the target style in one embodiment;

[0020] Figure 4 is a flowchart of an image processing method in another embodiment;

[0021] Figure 5 is a schematic diagram of obtaining the foreground mask of the image to be processed in one embodiment;

[0022] Figure 6 is a flowchart of an image processing method in yet another embodiment;

[0023] Figure 7 is a block diagram of an image processing apparatus in one embodiment;

[0024] Figure 8 is a structural block diagram of an electronic device in one embodiment.

Detailed Description

[0025] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0026] It should be noted that the terms "comprising" and "having," and any variations thereof, in the embodiments and accompanying drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units listed, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.

[0027] It is understood that the terms "first," "second," etc., used in this application may be used to describe various elements herein, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of this application, a first depth image may be referred to as a second depth image, and similarly, a second depth image may be referred to as a first depth image. Both the first depth image and the second depth image are depth images, but they are not the same depth image. Furthermore, it should be noted that the terms "multiple," etc., used in the embodiments of this application refer to two or more.

[0028] This application provides an electronic device, which may include, but is not limited to, mobile phones, smart wearable devices, tablet computers, PCs (Personal Computers), vehicle terminals, digital cameras, etc., and this application does not limit the scope of the devices. The electronic device includes an image processing circuit, which can be implemented using hardware and/or software components and may include various processing units that define an ISP (Image Signal Processing) pipeline. Figure 1 is a block diagram of an image processing circuit in one embodiment. For ease of explanation, Figure 1 shows only the aspects of the image processing techniques relevant to embodiments of this application.

[0029] As shown in Figure 1, the image processing circuit includes an ISP processor 140 and a control logic unit 150. Image data captured by the imaging device 110 is first processed by the ISP processor 140, which analyzes the image data to capture image statistics that can be used to determine one or more control parameters of the imaging device 110. The imaging device 110 may include one or more lenses 112 and an image sensor 114. The image sensor 114 may include a color filter array (such as a Bayer filter), and can acquire light intensity and wavelength information captured by each imaging pixel, providing a set of raw image data that can be processed by the ISP processor 140. An attitude sensor 120 (such as a three-axis gyroscope, Hall sensor, accelerometer, etc.) can provide the acquired image processing parameters (such as image stabilization parameters) to the ISP processor 140 based on the attitude sensor 120 interface type. The attitude sensor 120 interface can be an SMIA (Standard Mobile Imaging Architecture) interface, other serial or parallel camera interfaces, or a combination of the above interfaces.

[0030] It should be noted that, although only one imaging device 110 is shown in Figure 1, this embodiment may include at least two imaging devices 110. Each imaging device 110 may correspond to one image sensor 114, or multiple imaging devices 110 may correspond to one image sensor 114; this is not limited here. For the operation of each imaging device 110, refer to the description above.

[0031] In addition, image sensor 114 can also send raw image data to attitude sensor 120. Attitude sensor 120 can provide raw image data to ISP processor 140 based on attitude sensor 120 interface type, or attitude sensor 120 can store raw image data in image memory 130.

[0032] The ISP processor 140 processes raw image data pixel by pixel in various formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits. The ISP processor 140 may perform one or more image processing operations on the raw image data and collect statistical information about the image data. The image processing operations may be performed with the same or different bit depth precision.

[0033] The ISP processor 140 can also receive image data from the image memory 130. For example, the attitude sensor 120 interface sends raw image data to the image memory 130, and the raw image data in the image memory 130 is then provided to the ISP processor 140 for processing. The image memory 130 may be part of a memory device, a storage device, or a separate dedicated memory within an electronic device, and may include DMA (Direct Memory Access) features.

[0034] Upon receiving raw image data from the image sensor 114 interface, the attitude sensor 120 interface, or the image memory 130, the ISP processor 140 may perform one or more image processing operations, such as temporal filtering. The processed image data may be sent to the image memory 130 for further processing before display. The ISP processor 140 receives processed data from the image memory 130 and performs image data processing on the processed data in the raw domain and in the RGB and YCbCr color spaces. The processed image data may be output to the display 160 for user viewing and/or further processed by a graphics engine or GPU (Graphics Processing Unit). Furthermore, the output of the ISP processor 140 may also be sent to the image memory 130, and the display 160 may read image data from the image memory 130. In one embodiment, the image memory 130 may be configured to implement one or more frame buffers.

[0035] The statistical data determined by the ISP processor 140 can be sent to the control logic unit 150. For example, the statistical data may include image sensor 114 statistics such as gyroscope vibration frequency, automatic exposure, automatic white balance, automatic focus, flicker detection, black level compensation, and lens 112 shading correction. The control logic unit 150 may include a processor and/or microcontroller executing one or more routines (such as firmware) that determine control parameters for the imaging device 110 and the ISP processor 140 based on the received statistical data. For example, the control parameters for the imaging device 110 may include attitude sensor 120 control parameters (e.g., gain, integration time for exposure control, image stabilization parameters, etc.), camera flash control parameters, camera image stabilization shift parameters, lens 112 control parameters (e.g., focal length for focusing or zooming), or combinations of these parameters. The ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (e.g., during RGB processing), and lens 112 shading correction parameters.

[0036] As an example, the image processing method provided in the embodiments of this application is illustrated below with reference to the image processing circuit of Figure 1. The imaging device 110 can acquire raw image data and send it to the ISP processor 140. The ISP processor 140 can process the raw image data to obtain an image to be processed and a first depth image corresponding to the image to be processed. The first depth image is used to describe the depth information of the image to be processed. The ISP processor 140 can generate a second depth image matching the target style based on the image to be processed and the first depth image. Then, it processes the image to be processed and the second depth image using a trained foreground recognition model to obtain the foreground recognition result.

[0037] It should be noted that the image processing method provided in the embodiments of this application can also be implemented in other processors of electronic devices, such as CPU (central processing unit) and GPU (graphics processing unit), and is not limited to the ISP processor mentioned above.

[0038] As shown in Figure 2, in one embodiment, an image processing method is provided, which can be applied to the above-described electronic device. The method may include the following steps:

[0039] Step 210: Acquire the image to be processed and the first depth image corresponding to the image to be processed through the image acquisition device.

[0040] The first depth image can be used to describe the depth information of the image to be processed. The first depth image may include the depth values of each pixel in the image to be processed. The electronic device can acquire the image to be processed through an image acquisition device (such as a camera) and determine the first depth image corresponding to the image to be processed through the image acquisition device. The image to be processed may be in RGB format or YUV format, etc., which is not limited here.

[0041] The first depth image can be obtained by depth estimation using hardware devices. Optionally, depth estimation using hardware devices may include, but is not limited to, depth estimation using multiple cameras (e.g., dual cameras), depth estimation using structured light, and depth estimation using TOF (Time of Flight). Taking depth estimation using multiple cameras as an example, the image acquisition device may include two cameras, which can acquire two images respectively (the image to be processed can be either one of the images or an image obtained by fusing the two images). The two images can be matched, and depth estimation can be performed based on the matched pixel pairs to obtain the depth value corresponding to each pixel.
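As a rough illustration of the multi-camera case, the sketch below converts a per-pixel disparity map (the horizontal offset between matched pixel pairs) into depth using the standard rectified-stereo relation depth = focal length × baseline / disparity. The function name, the parameter names, and the assumption of a rectified pinhole camera pair are illustrative and are not taken from this application.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) into a depth map (in metres).

    Assumes a rectified stereo pair (hypothetical setup, not from the source).
    Pixels with no valid match (disparity <= eps) are given infinite depth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > eps
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example: a 2 x 2 disparity map from two cameras 5 cm apart.
example_depth = disparity_to_depth(
    np.array([[10.0, 20.0], [0.0, 5.0]]), focal_length_px=1000.0, baseline_m=0.05
)
```

Larger disparities correspond to closer objects, so the resulting map has the per-pixel depth values that the first depth image is described as containing.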

[0042] Step 220: Generate a second depth image that matches the target style based on the image to be processed and the first depth image.

[0043] Besides hardware-based depth estimation, there are also software-based methods. Software-based depth estimation methods include, but are not limited to, using neural networks such as depth estimation models. These models are trained on a depth training set, which may include multiple training images and their corresponding depth images. For the same image, depth images obtained using different depth estimation methods may differ significantly, thus corresponding to different styles. For example, the depth image estimated by a depth estimation model for the same image will have a different style than the depth image estimated by multiple cameras; similarly, the depth image estimated by multiple cameras will have a different style than the depth image estimated by Time-of-Flight (TOF) method.

[0044] The electronic device can adaptively transform a first depth image into a second depth image that matches the target style, making the second depth image suitable for a foreground recognition model. In this embodiment, the target style refers to the style of the depth image generated by a target depth estimation method, which may refer to a depth estimation method that performs depth estimation on a sample image to obtain a sample depth image. The sample image and the sample depth image are used to train the foreground recognition model.

[0045] Step 230: The foreground recognition model obtained through training is used to process the image to be processed and the second depth map to obtain the foreground recognition result.

[0046] Foreground recognition results can be used to describe the image location of foreground regions in the image to be processed. As one implementation, the foreground recognition results may include a foreground mask, which can label pixels in the image to be processed that belong to the foreground region. Different pixel values can be used in the foreground mask to represent whether a pixel belongs to the foreground or background region. For example, pixels belonging to the foreground region may have a pixel value of 1, while pixels belonging to the background region may have a pixel value of 0; or pixels belonging to the foreground region may have a pixel value of 255, while pixels belonging to the background region may have a pixel value of 0, etc. The foreground mask can also use different pixel values to represent the probability that a pixel belongs to the foreground region; the larger the pixel value, the greater the probability that the pixel belongs to the foreground region.

[0047] Specifically, the foreground mask can be an alpha image, where the pixel value of each pixel can be between 0 and 255. A pixel value of 255 indicates that the pixel is completely opaque in the foreground region of the image to be processed, meaning that the pixel belongs to the foreground region. A pixel value of 0 indicates that the pixel is completely transparent in the foreground region of the image to be processed, meaning that the pixel belongs to the background region. Optionally, pixels with a pixel value greater than 0 and less than 255 can be defined as the transition region between the foreground region and the background region.
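The three pixel classes of an alpha-style foreground mask described above can be sketched as follows. The function name and the use of NumPy boolean arrays are illustrative assumptions; the thresholds 0 and 255 come from the text.

```python
import numpy as np

def classify_alpha_mask(alpha):
    """Split an 8-bit alpha mask into foreground, background and transition regions."""
    alpha = np.asarray(alpha)
    foreground = alpha == 255                   # fully opaque: foreground region
    background = alpha == 0                     # fully transparent: background region
    transition = (alpha > 0) & (alpha < 255)    # in-between values: transition region
    return foreground, background, transition

example_alpha = np.array([[0, 128], [255, 255]], dtype=np.uint8)
fg, bg, tr = classify_alpha_mask(example_alpha)
```

Every pixel falls into exactly one of the three boolean maps, so they can be used directly as index masks on the image to be processed.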

[0048] The trained foreground recognition model processes the image to be processed and the second depth map to obtain the foreground recognition result. This foreground recognition model can be a foreground segmentation model or a foreground matting model, etc. The foreground recognition model is trained on a training dataset, which may include multiple frames of sample images, sample depth images corresponding to each frame of sample images, and labeled foreground results corresponding to each frame of sample images. The style of the sample depth images is the target style, that is, the sample depth images are obtained by estimating the depth of the sample images using the target depth estimation method.

[0049] In one specific implementation, the foreground recognition model may include an encoder and a decoder. The image to be processed and a second depth map can be input into the foreground recognition model. The foreground recognition model can extract feature information from the image to be processed and the second depth map through the encoder to obtain a feature image, and process the feature image through the decoder to obtain a foreground mask.

[0050] Optionally, the foreground recognition model can be a U-Net architecture model, where the encoder may include multiple downsampling layers and the decoder may include multiple upsampling layers. The foreground recognition model can first perform multiple downsampling convolutions on the image to be processed and the second depth map through multiple downsampling layers of the encoder to obtain a feature image, and then perform multiple upsampling processes on the feature image through multiple upsampling layers of the decoder to obtain a foreground mask.
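The data flow of such an encoder-decoder can be sketched without any learned weights. The example below assumes that the image and the depth image are combined by channel-wise concatenation (the text does not specify how the two inputs are combined) and uses average pooling and nearest-neighbour upsampling as stand-ins for the learned downsampling and upsampling layers. All function names are hypothetical, and the sketch only shows how the spatial resolution changes; it does not produce a real foreground mask.

```python
import numpy as np

def stack_inputs(image_rgb, depth):
    """Stack an H x W x 3 image and an H x W depth map into an H x W x 4 array."""
    image = np.asarray(image_rgb, dtype=np.float32) / 255.0
    depth = np.asarray(depth, dtype=np.float32)
    span = depth.max() - depth.min()
    depth_norm = (depth - depth.min()) / span if span > 0 else np.zeros_like(depth)
    return np.concatenate([image, depth_norm[..., None]], axis=-1)

def downsample2x(features):
    """2 x 2 average pooling, standing in for a learned downsampling layer."""
    h, w, c = features.shape
    cropped = features[: h - h % 2, : w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2x(features):
    """Nearest-neighbour upsampling, standing in for a learned upsampling layer."""
    return features.repeat(2, axis=0).repeat(2, axis=1)

# Encoder: two downsampling stages; decoder: two upsampling stages.
inputs = stack_inputs(np.zeros((8, 8, 3), dtype=np.uint8), np.arange(64).reshape(8, 8))
encoded = downsample2x(downsample2x(inputs))
decoded = upsample2x(upsample2x(encoded))
```

In a real U-Net-style model each stage would be a trained convolution block and the decoder would end in a single-channel output that is interpreted as the foreground mask.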

[0051] Alternatively, the foreground recognition model can also be a model based on a self-attention mechanism. This application does not limit the specific model architecture of the foreground recognition model.

[0052] In some embodiments, after obtaining the foreground recognition result of the image to be processed, the image to be processed can be further processed according to the foreground recognition result to obtain the target image. For example, the background region of the image to be processed can be determined according to the foreground recognition result, and the background region of the image to be processed can be blurred to obtain a blurred image; or, for example, the foreground region of the image to be processed can be determined according to the foreground recognition result, and the foreground region of the image to be processed can be brightened, its saturation adjusted, etc., to obtain the target image, but these are not limited to these examples. Since the foreground recognition result obtained by the foreground recognition model can accurately describe the foreground region of the image to be processed, the image effect of the target image can also be improved.
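A minimal sketch of the background-blur case follows, assuming an 8-bit alpha-style foreground mask and a simple box blur. The blur kernel, the function names, and the linear blending by alpha value are illustrative choices that the text does not prescribe.

```python
import numpy as np

def box_blur(image, radius):
    """Blur an H x W x C image with a (2*radius+1)^2 box filter and edge padding."""
    image = np.asarray(image, dtype=np.float64)
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(image)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)

def blur_background(image, alpha_mask, radius=2):
    """Keep foreground pixels sharp and blend in a blurred copy elsewhere."""
    image = np.asarray(image, dtype=np.float64)
    weight = (np.asarray(alpha_mask, dtype=np.float64) / 255.0)[..., None]
    blurred = box_blur(image, radius)
    result = weight * image + (1.0 - weight) * blurred
    return np.clip(np.rint(result), 0, 255).astype(np.uint8)

# Example: an 8 x 8 image whose central 4 x 4 block is marked as foreground.
rng = np.random.default_rng(0)
example_image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
example_mask = np.zeros((8, 8), dtype=np.uint8)
example_mask[2:6, 2:6] = 255
blurred_image = blur_background(example_image, example_mask)
```

Pixels with mask value 255 are copied unchanged, pixels with value 0 are fully replaced by the blurred copy, and transition values blend the two.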

[0053] Compared to directly estimating the depth of the image to be processed using target depth estimation methods and then using the depth estimation results to identify the foreground region, this embodiment first uses hardware to obtain a more accurate first depth image, and then adapts the first depth image to a second depth image with a uniform style. This ensures both the accuracy of the depth information in the second depth image and its adaptability to the foreground recognition model, thereby improving the accuracy of the foreground recognition results. The image processing method is applicable to various types of depth maps (e.g., obtained through different depth estimation methods), expanding its application scenarios. The computer program implementing this image processing method can be deployed on any hardware device, demonstrating stronger adaptability.

[0054] In this embodiment, when the foreground recognition model performs foreground recognition on the image to be processed, combining depth information can better estimate the semantic information of the image to be processed, solving the problems of insufficient semantic estimation ability of neural networks and insufficient generalization performance for unseen objects in related technologies, and improving the accuracy of foreground region recognition of the image; moreover, the first depth image obtained by the image acquisition device is converted into a second depth image with the same style as the sample depth image, making the second depth image suitable for the foreground recognition model, improving the adaptability of the foreground recognition model and expanding the application scenarios.

[0055] As shown in Figure 3A, in one embodiment, the step of generating a second depth image that matches the target style based on the image to be processed and the first depth image may include the following steps:

[0056] Step 302: Use the target depth estimation method to estimate the depth of the image to be processed, and obtain a depth estimation result that matches the target style; wherein the target depth estimation method is the depth estimation method used to generate the sample depth images.

[0057] The style of the depth image obtained through depth estimation using a target depth estimation method is the target style. This target depth estimation method can be used to estimate the depth of the image to be processed, obtaining a depth estimation result that matches the target style. In some embodiments, the target depth estimation method can be performed using a depth estimation model. The image to be processed can be input into the depth estimation model, which extracts image features and obtains the depth estimation result based on these features. This depth estimation model is also used during the training of a foreground recognition model to estimate the depth of sample images, obtaining sample depth images that match the target style.

[0058] It should be noted that the target depth estimation method can also be other depth estimation methods, not limited to the depth estimation method through the depth estimation model mentioned above. For example, it can also be other software depth estimation methods, or other depth estimation methods combined with hardware, etc., which are not limited here.

[0059] Step 304: Determine the constraints based on the first depth image, and generate the second depth image based on the constraints and the depth estimation results.

[0060] To ensure the semantic accuracy of the second depth image input to the foreground recognition model, constraints can be determined based on the first depth image. Optionally, these constraints can be used to constrain the depth value order of each pixel in the second depth image, ensuring that the depth value order of each pixel in the second depth image is the same as that in the first depth image. For example, if the depth value of pixel (x, y) in the first depth image ranks k-th in descending order, the depth value of pixel (x, y) in the second depth image also ranks k-th, thus guaranteeing the semantic accuracy of the second depth image.

[0061] In some embodiments, determining constraints based on a first depth image may include: determining the vector index corresponding to each pixel in the first depth image according to the depth value of each pixel in descending or ascending order of depth value; and generating constraints based on the vector index corresponding to each pixel.

[0062] Each pixel in the first depth image can be represented by a first vector. The pixels can be arranged in descending or ascending order of depth values, and each pixel can be assigned a vector index, starting from 0 and increasing sequentially according to the arrangement. Constraints generated based on the vector indices of each pixel can be used to constrain the relationship between the vector indices and the aforementioned order.
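The assignment of vector indices can be sketched as follows, assuming the depth image is a NumPy array and that the pixels are ranked in descending order of depth value. The function name and the stable tie-breaking rule are illustrative assumptions.

```python
import numpy as np

def depth_rank_indices(depth):
    """Return each pixel's vector index (0-based) in descending order of depth.

    Also returns `order`, the flattened pixel positions listed from the
    largest depth value to the smallest.
    """
    flat = np.asarray(depth, dtype=np.float64).ravel()
    order = np.argsort(-flat, kind="stable")   # pixel positions, largest depth first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(flat.size)        # invert the permutation
    return ranks.reshape(np.shape(depth)), order

example_first_depth = np.array([[0.5, 2.0], [1.0, 3.0]])
example_ranks, example_order = depth_rank_indices(example_first_depth)
```

Here the pixel with depth 3.0 receives vector index 0 and the pixel with depth 0.5 receives vector index 3; reusing `order` for the second depth image keeps the indices of co-located pixels identical.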

[0063] For example, a first vector $D_c$ can be used to represent the pixels in the first depth image. The pixels in the first depth image can be sorted in descending order of their depth values and assigned vector indices sequentially. Then, for the first depth image, the following condition holds: $D_c^{i} \ge D_c^{i+1}$, $1 \le i \le n-1$; where $D_c^{i}$ represents the depth value of the pixel with vector index $i$ in the first depth image, $D_c^{i+1}$ represents the depth value of the pixel with vector index $i+1$ in the first depth image, and $n$ represents the number of pixels in the first depth image. A second vector $D_a$ can be used to represent the pixels in the second depth image. For pixels at the same location, the vector index in the second depth image can be the same as the vector index in the first depth image. Therefore, the constraint condition can be: $D_a^{i} \ge D_a^{i+1}$, $1 \le i \le n-1$; that is, the depth values of the second depth image, taken in vector-index order, should also be in descending order, so as to ensure that the depth value order of each pixel in the second depth image is consistent with the depth value order of each pixel in the first depth image.

[0064] After generating the constraints, these constraints can be used to adapt the first depth image into a second depth image whose style is as similar as possible to the depth estimation result. In some embodiments, the second depth image is obtained by minimizing the mean square error between the second depth image and the depth estimation result while satisfying the constraints. Specifically, the calculation of the second depth image can be transformed into a quadratic programming problem, which can be represented by formula (1):

[0065] $$\min_{D_a} \; \frac{1}{n} \sum_{i=1}^{n} \left( D_a^{i} - D_e^{i} \right)^{2} \quad \text{s.t. } C \tag{1}$$

[0066] where C represents the constraint condition, and "s.t. C" indicates that constraint condition C is satisfied; $D_a^{i}$ represents the depth value of the pixel with vector index i in the second depth image; and $D_e^{i}$ (the depth estimation result is written here as $D_e$) represents the depth value of the pixel with vector index i in the depth estimation result. The above quadratic programming problem can be solved to obtain the second depth image. Methods such as OSQP (Operator Splitting Quadratic Program, an open-source quadratic programming solver) can be used to solve the above quadratic programming problem, but the solution is not limited to these methods.
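As a minimal sketch of how such an order-constrained least-squares problem can be solved, the Python below uses the pool-adjacent-violators algorithm (PAVA) instead of a general QP solver such as OSQP. PAVA solves exactly this special case (squared error under a chain of "greater than or equal" constraints); it is a swapped-in technique chosen so that the example needs no external library, not the method named in the text. The function names and the flat-list image representation are hypothetical.

```python
def fit_non_increasing(target):
    """Least-squares projection of `target` onto non-increasing sequences
    using the pool-adjacent-violators algorithm (PAVA)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in target:
        blocks.append([float(value), 1])
        # Merge while the last block's mean exceeds the previous block's mean
        # (compared by cross-multiplication to avoid divisions).
        while len(blocks) > 1 and (
            blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def adapt_depth(first_depth, estimate):
    """first_depth, estimate: flat lists with one depth value per pixel.
    Returns a second depth image that is as close as possible to `estimate`
    while keeping the descending depth order of `first_depth`."""
    # Vector order: pixel positions sorted by first depth, largest first.
    order = sorted(range(len(first_depth)), key=lambda j: -first_depth[j])
    fitted = fit_non_increasing([estimate[j] for j in order])
    second = [0.0] * len(first_depth)
    for k, j in enumerate(order):
        second[j] = fitted[k]
    return second
```

Where the depth estimation result already agrees with the order of the first depth image, its values are kept unchanged; where it disagrees, the conflicting pixels are pooled to a common value.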

[0067] Figure 3B is a schematic diagram illustrating the generation of a second depth image that matches the target style in one embodiment. As shown in Figure 3B, the electronic device acquires the image to be processed 310 and the corresponding first depth image 320 through an image acquisition device. The first depth image 320 may include the depth values of each pixel in the image to be processed 310. The depth of the image to be processed 310 can be estimated by a target depth estimation method to obtain a depth estimation result 330 that matches the target style. The depth estimation result 330 can provide a style adaptation target. According to the order of the depth values of each pixel in the first depth image 320, a second depth image 340 is generated, so that the second depth image 340 maintains the same semantic information as the first depth image 320 and is as close as possible to the target style, which can further improve the accuracy of the foreground recognition result obtained by the foreground recognition model.

[0068] In this embodiment, constraints are generated based on the first depth image, and a second depth image matching the target style is generated under the premise of satisfying the constraints. This not only ensures the accuracy of the semantic information in the second depth image, but also makes the style of the second depth image closer to the target style, which is suitable for the foreground recognition model and further improves the accuracy of the foreground recognition result obtained by the foreground recognition model.

[0069] As shown in Figure 4, in another embodiment, an image processing method is provided, which can be applied to the above-described electronic device. The method may include the following steps:

[0070] Step 402: Acquire the image to be processed and the first depth image corresponding to the image to be processed through the image acquisition device.

[0071] Step 404: Generate a second depth image that matches the target style based on the image to be processed and the first depth image.

[0072] Step 406: Process the image to be processed and the second depth image through the trained foreground recognition model to obtain the foreground recognition result, which includes the foreground mask.

[0073] The descriptions of steps 402 to 406 can be found in the relevant descriptions in the above embodiments, and will not be repeated here.

[0074] Step 408: Perform connected component detection on the foreground mask and remove image noise from the foreground mask based on the detection results.

[0075] As one implementation, connected component detection can be performed on the foreground mask based on the pixel values of each pixel in the foreground mask. Optionally, pixels belonging to the same connected component can have the same pixel value, such as all pixel values being 255 or all pixel values being 0, or the difference between the pixel values of pixels belonging to the same connected component can be less than a preset threshold. After determining each connected component in the foreground mask, connected components with an area smaller than an area threshold can be deleted to remove image noise from the foreground mask. The area of a connected component can be represented by the number of pixels contained in the connected component.

[0076] As another implementation, a depth-bounded search algorithm can be used to perform connected component detection on the foreground mask. Each pixel in the foreground mask can be traversed, and it can be determined whether the currently visited pixel has a corresponding connected component label. This connected component label can be used to identify the connected component to which the pixel belongs. If the current pixel has a corresponding connected component label, it means that the current pixel has been assigned to a certain connected component, and the next pixel can be visited. If the current pixel does not have a corresponding connected component label, a connected component label corresponding to the current pixel can be assigned. The assigned connected component label corresponding to the current pixel is different from other existing connected component labels. For example, if the existing connected component labels in the current foreground mask include label 1 and label 2, where label 1 indicates belonging to connected component 1 and label 2 indicates belonging to connected component 2, then if the current pixel does not have a corresponding connected component label, a new label 3 can be assigned to the current pixel.

[0077] After assigning a connected component label to the current pixel, reachable pixels within the reachable region of the current pixel in the foreground mask can be detected, and connected component labels can be assigned to each reachable pixel. The connected component labels assigned to each reachable pixel are the same as those assigned to the current pixel. For example, if label 3 is assigned to the current pixel, then label 3 can be assigned to each reachable pixel within the reachable region of the current pixel.

[0078] In this context, the reachable region refers to the region that the current pixel can connect to. A reachable pixel can be any pixel that can be found using a depth-bounded search algorithm, starting from the current pixel. The image region formed by the current pixel and the reachable pixels within the current pixel's reachable region is the connected component corresponding to the assigned connected component label. For example, if the current pixel is assigned label 3, then each reachable pixel within the current pixel's reachable region can be assigned label 3, and the region formed by the current pixel and each reachable pixel is the connected component corresponding to label 3. After finding all reachable pixels corresponding to the current pixel and assigning corresponding connected component labels, the next pixel can be visited, until all pixels in the foreground mask have been traversed. Thus, each pixel in the foreground mask has a corresponding connected component label.

[0079] The number of pixels corresponding to each connected component label can be counted. The connected component label with the largest number of corresponding pixels corresponds to the largest connected component in the foreground mask. A connected component whose label is a target connected component label, that is, a label whose number of corresponding pixels is less than a number threshold, can be deleted from the foreground mask, thus removing image noise. Further, such a connected component can be deleted by changing the pixel values of the pixels corresponding to the target connected component label to 0 (i.e., consistent with the pixel value representing a background region), ensuring the accuracy of the foreground region in the foreground mask.
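The traversal, labeling, and deletion steps above can be sketched as follows. The sketch makes several assumptions that the text leaves open: the search is realized as a depth-first traversal with an explicit stack, two pixels are connected when they are 4-neighbours with equal pixel values, and the function names are hypothetical.

```python
def label_components(mask):
    """Assign a connected component label (1, 2, 3, ...) to every pixel of a
    2-D mask given as a list of rows."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]  # 0 means "no label yet"
    next_label = 1
    for r0 in range(h):
        for c0 in range(w):
            if labels[r0][c0]:
                continue  # already assigned to a connected component
            labels[r0][c0] = next_label
            stack = [(r0, c0)]  # explicit stack: iterative depth-first search
            while stack:
                r, c = stack.pop()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < h and 0 <= nc < w and not labels[nr][nc]
                            and mask[nr][nc] == mask[r0][c0]):
                        labels[nr][nc] = next_label  # same label as the seed
                        stack.append((nr, nc))
            next_label += 1
    return labels


def remove_small_components(mask, number_threshold):
    """Set every component with fewer than `number_threshold` pixels to 0."""
    labels = label_components(mask)
    counts = {}
    for row in labels:
        for lab in row:
            counts[lab] = counts.get(lab, 0) + 1
    return [[0 if counts[labels[r][c]] < number_threshold else mask[r][c]
             for c in range(len(mask[0]))] for r in range(len(mask))]
```

Because background pixels already have the value 0, setting a small background component to 0 leaves it unchanged, so only small foreground components are effectively removed.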

[0080] In some embodiments, if the number of pixels corresponding to every connected component label is less than the number threshold, the connected component labels can be sorted in descending order of their corresponding number of pixels, the connected components corresponding to the first two labels can be retained, and all other connected components can be deleted, thereby further ensuring the accuracy of the foreground mask after noise reduction.
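The deletion rule, including the fallback for the case where every component is below the threshold, can be expressed as a small selection function. This is a sketch under assumptions: `pixel_counts` is a hypothetical dictionary that maps each foreground connected component label to its pixel count, and the parameter name `keep_if_all_small` is introduced here for illustration.

```python
def labels_to_delete(pixel_counts, number_threshold, keep_if_all_small=2):
    """Return the set of connected component labels that should be deleted."""
    small = {lab for lab, n in pixel_counts.items() if n < number_threshold}
    if len(small) < len(pixel_counts):
        # Usual case: at least one component is large enough, so only the
        # components below the threshold are deleted.
        return small
    # Every component is below the threshold: keep the largest ones
    # (two by default) and delete the rest.
    ranked = sorted(pixel_counts, key=pixel_counts.get, reverse=True)
    return set(ranked[keep_if_all_small:])
```

Keeping the largest components in the fallback case avoids returning an empty foreground mask when the subject itself is small.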

[0081] Figure 5 is a schematic diagram illustrating the foreground mask of the image to be processed in one embodiment. As shown in Figure 5, the electronic device can acquire the image to be processed 502 and the first depth image 504, perform style adaptation on the first depth image 504, and obtain a second depth image 506 that matches the target style. The image to be processed 502 and the second depth image 506 can be input into the foreground recognition model 500, and the foreground mask 508 corresponding to the image to be processed 502 can be obtained through the foreground recognition model 500. Then, the image noise in the foreground mask 508 is removed to obtain the denoised foreground mask 510.

[0082] In this embodiment, after obtaining the foreground mask through the foreground recognition model, the foreground mask is further post-processed to remove image noise from the foreground mask, thereby further improving the accuracy of the foreground mask.

[0083] As shown in Figure 6, in another embodiment, an image processing method is provided, which can be applied to the above-described electronic device. The method may include the following steps:

[0084] Step 602: Obtain multiple sample images and the labeled foreground results corresponding to each sample image.

[0085] A large number of sample images can be acquired, and annotation tools can be used to annotate the foreground region of each frame of the sample image, resulting in an annotated foreground result for each frame. Optionally, the acquired sample images can be images of the same type, for example, all sample images can be portrait images. The foreground region in a portrait image is the portrait region, which allows the foreground recognition model trained using sample images to focus more on recognizing the portrait region in portrait images, thus improving accuracy. Optionally, the acquired sample images can also include various different types of images, such as portrait images, animal images, and architectural images, which can improve the adaptability of the foreground recognition model.

[0086] In some embodiments, the labeled foreground result corresponding to each frame of sample image may include a labeled foreground mask corresponding to each frame of sample image. The labeled foreground mask can be used to determine the image position of the foreground region in the corresponding sample image. For example, the labeled foreground mask may also be an alpha image, etc. The alpha image may include the alpha value corresponding to each pixel in the corresponding sample image. The alpha value can be used to represent the transparency of the pixel. Optionally, the greater the transparency of the pixel (i.e., the greater the alpha value), the higher the probability that the pixel belongs to the foreground region.

[0087] Step 604: Use the target depth estimation method to perform depth estimation on each frame of sample image to obtain the sample depth image corresponding to each frame of sample image.

[0088] A target depth estimation method can be used to estimate the depth of each sample image frame to obtain a sample depth image that matches the target style. In some embodiments, the target depth estimation method can be to perform depth estimation through a depth estimation model. The sample image can be input into the depth estimation model, and the sample depth image corresponding to the sample image can be obtained through the depth estimation model. The sample depth image can be used to describe the depth information of the sample image, and the sample depth image can include the depth value of each pixel in the sample image.

[0089] Step 606: Train the foreground recognition model to be trained based on multiple sample images, the labeled foreground results corresponding to each sample image, and the sample depth image corresponding to each sample image, until the foreground recognition model converges.

[0090] In some embodiments, a single sample image, its corresponding annotated foreground result, and its corresponding sample depth image can be input into the foreground recognition model to be trained at a time. Alternatively, a batch of sample images, the sample depth image corresponding to each sample image, and the annotated foreground result corresponding to each sample image can be input into the foreground recognition model to be trained simultaneously. The foreground recognition model to be trained can output corresponding results based on the input sample images and their corresponding sample depth images. The parameters of the foreground recognition model to be trained can be adjusted based on the output results, and the parameters can be iteratively updated until the foreground recognition model converges.

[0091] The foreground recognition model under training processes the current frame sample image and its corresponding depth image to obtain the predicted foreground recognition result. Then, based on the labeled foreground result and the predicted foreground recognition result of the current frame sample image, the target loss is determined, and the parameters of the foreground recognition model under training are updated according to the target loss. If the foreground recognition model under training has not yet reached convergence, the next frame sample image and its corresponding depth image can be input into the foreground recognition model under training to continue training.

[0092] Optionally, the above convergence conditions may include, but are not limited to, any one of the following: the number of iterations of the parameters of the foreground recognition model to be trained reaches a threshold, the determined target loss is less than a loss threshold, or the determined target loss no longer decreases within a target number of iteration cycles.
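The three convergence conditions listed above can be sketched as a single check. The function name `has_converged` and its parameters are hypothetical; the sketch tests all three conditions, whereas an embodiment may use only one of them.

```python
def has_converged(iteration, loss_history, max_iterations, loss_threshold,
                  patience):
    """Return True if any of the three convergence conditions is met.

    iteration      -- number of parameter updates performed so far
    loss_history   -- target loss recorded after each iteration cycle
    max_iterations -- threshold on the number of iterations
    loss_threshold -- threshold below which the target loss counts as converged
    patience       -- target number of cycles without any decrease
    """
    # Condition 1: the number of iterations reaches the threshold.
    if iteration >= max_iterations:
        return True
    # Condition 2: the latest target loss is below the loss threshold.
    if loss_history and loss_history[-1] < loss_threshold:
        return True
    # Condition 3: no loss in the last `patience` cycles is lower than the
    # best loss seen before them.
    if len(loss_history) > patience:
        best_before = min(loss_history[:-patience])
        if min(loss_history[-patience:]) >= best_before:
            return True
    return False
```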

[0093] In some embodiments, the foreground recognition model to be trained processes the current frame sample image and the corresponding sample depth image to obtain a predicted foreground recognition result. This predicted foreground recognition result may include a first foreground mask, a second foreground mask, and a third foreground mask, wherein the accuracy of the first foreground mask is less than the accuracy of the second foreground mask, and the accuracy of the second foreground mask is less than the accuracy of the third foreground mask. Further, the first foreground mask may be a downsampled and detail-blurred foreground mask of a first image size. The second foreground mask, the third foreground mask, and the current frame sample image are all of a second image size, where the first image size is smaller than the second image size. The second foreground mask may be a foreground mask that is accurate in edge regions but does not guarantee accuracy in non-edge regions. The third foreground mask can be understood as the final output of the foreground recognition model, that is, a standard foreground mask.

[0094] Electronic devices can decompose the training of a foreground recognition model into three tasks: overall training, detailed training, and training of the final output result. Therefore, losses can be calculated separately for the first, second, and third foreground masks mentioned above, improving training effectiveness. The labeled foreground result corresponding to the current frame sample image can be processed to obtain a coarse foreground mask, whose accuracy is lower than that of the labeled foreground result. Then, the first loss between the first foreground mask and the coarse foreground mask, the second loss between the second foreground mask and the labeled foreground result at the edges, and the third loss between the third foreground mask and the labeled foreground result are calculated separately. The target loss is then determined based on the first, second, and third losses.

[0095] In some embodiments, the coarse foreground mask may be similar to the first foreground mask, that is, it may also be a downsampled and detail-blurred foreground mask of the first image size. The electronic device may downsample the labeled foreground result corresponding to the current frame sample image to obtain a downsampled mask of the first image size, and then blur the downsampled mask to obtain the coarse foreground mask. The blurring process can omit details in the downsampled mask. Optionally, the blurring process may include, but is not limited to, Gaussian blur, median blur, mean blur, etc. The first image size may be 1/16 or 1/8 of the second image size, etc., and is not limited in this embodiment.
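A minimal sketch of the downsample-then-blur step is given below. It assumes average pooling for the downsampling and a mean blur for the blurring (one of the options listed above); the function names are hypothetical. The default factor of 4 per side yields 1/16 of the original area, which is one possible reading of the size ratio, since the text does not say whether the ratio refers to area or side length.

```python
def downsample(mask, factor):
    """Average-pool a 2-D mask by an integer factor per side
    (height and width are assumed to be divisible by `factor`)."""
    h, w = len(mask) // factor, len(mask[0]) // factor
    return [[sum(mask[r * factor + dr][c * factor + dc]
                 for dr in range(factor) for dc in range(factor)) / factor ** 2
             for c in range(w)] for r in range(h)]


def mean_blur(mask, radius=1):
    """Mean blur with a (2 * radius + 1)-wide square window, clipped at the
    image borders."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [mask[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            out[r][c] = sum(window) / len(window)
    return out


def coarse_mask(labeled_mask, factor=4, radius=1):
    """Downsample the labeled foreground result, then blur away the details."""
    return mean_blur(downsample(labeled_mask, factor), radius)
```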

[0096] In some embodiments, N sample images and the corresponding labeled foreground results for each sample image can be selected at a time. For example, 16, 12, or 15 sample images can be selected, but this is not limited to these. Depth estimation can be performed on the N sample images to obtain the sample depth image corresponding to each sample image. The N sample images and the corresponding sample depth images can be input into the foreground recognition model to be trained, and the foreground recognition model can obtain the first foreground mask, second foreground mask, and third foreground mask corresponding to each sample image. The electronic device can also process the labeled foreground results corresponding to the N sample images respectively to obtain the coarse foreground mask corresponding to each sample image. Then, based on the first foreground mask, second foreground mask, and third foreground mask corresponding to each sample image, as well as the coarse foreground mask and labeled foreground results corresponding to each sample image, the first loss, second loss, and third loss corresponding to each sample image can be calculated, thereby determining the target loss.

[0097] Specifically, the target loss can be calculated according to formula (2):

[0098]

[0099] where N represents the number of sample images selected. The symbols in formula (2) denote, for the n-th frame sample image among the N frames of sample images, the first foreground mask, the second foreground mask, the third foreground mask, the coarse foreground mask, and the annotated foreground result, respectively. m refers to the edge mask, used to determine the edge region of the n-th frame sample image, and m can be obtained by dilating the boundary line (e.g., dilating by 26 pixels). This boundary line can include the boundary line between the foreground region and the transition region, as well as the boundary line between the background region and the transition region. The transition region refers to the area where the foreground region and the background region meet. Using an edge mask allows the second loss to focus more on edge detail features.

[0100] Formula (2) further contains, for the n-th frame sample image, the first loss, the second loss, and the third loss. The target loss L can be obtained by summing the first, second, and third losses corresponding to each frame sample image and then averaging the sums over the N frames (i.e., dividing the total by N).
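The aggregation described above (three losses per sample image, summed and then averaged over the N sample images) can be sketched as follows. The per-pixel loss function is not recoverable from the text, so a mean absolute difference is assumed here for all three losses purely for illustration; the dictionary keys and function names are likewise hypothetical.

```python
def mean_abs(a, b, weight=None):
    """Mean absolute difference between two equally sized 2-D arrays,
    optionally restricted to pixels where `weight` is non-zero."""
    total, count = 0.0, 0
    for r in range(len(a)):
        for c in range(len(a[0])):
            if weight is None or weight[r][c]:
                total += abs(a[r][c] - b[r][c])
                count += 1
    return total / count if count else 0.0


def target_loss(batch):
    """batch: list of N dicts with the keys
    'first', 'second', 'third' (predicted masks), 'coarse' (coarse mask),
    'label' (annotated foreground result) and 'edge' (edge mask m)."""
    total = 0.0
    for s in batch:
        # First loss: low-resolution prediction vs. coarse mask.
        first_loss = mean_abs(s["first"], s["coarse"])
        # Second loss: compared with the label only inside the edge region.
        second_loss = mean_abs(s["second"], s["label"], weight=s["edge"])
        # Third loss: final prediction vs. the full annotated result.
        third_loss = mean_abs(s["third"], s["label"])
        total += first_loss + second_loss + third_loss
    return total / len(batch)  # average over the N sample images
```

Restricting the second loss to the edge mask means that errors of the second foreground mask in non-edge regions do not contribute to the target loss.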

[0101] After determining the target loss, if the foreground recognition model to be trained has not yet met the convergence condition, the next batch of sample images can be selected for the next training iteration until the foreground recognition model meets the convergence condition. The foreground recognition model that meets the convergence condition can be directly used as the trained foreground recognition model, and the parameters of the foreground recognition model that meets the convergence condition are the parameters of the trained foreground recognition model.

[0102] In some embodiments, the parameters of the trained foreground recognition model can also be reselected. After updating the parameters of the foreground recognition model to be trained according to the target loss, the current parameters of the foreground recognition model to be trained can be saved. Optionally, the parameters of the foreground recognition model to be trained can be saved every time they are updated; alternatively, the current parameters of the foreground recognition model to be trained can be saved at fixed time intervals, such as every 3 minutes, 5 minutes, or 10 minutes.

[0103] After the foreground recognition model meets the convergence condition, the parameters saved each time can be tested using a validation dataset. Based on the test results, the parameters that meet the accuracy requirements are selected as the parameters for the trained foreground recognition model. The validation dataset includes multiple validation images, the labeled foreground results for each validation image, and the corresponding depth image for each validation image. Since the foreground recognition model is continuously trained using sample images from the training dataset, the later updated parameters may become overly fitted to the training dataset. To select better model parameters, the saved parameters can be tested using a validation dataset that differs from the training dataset.

[0104] In one specific implementation, the parameters of the current test can be substituted into the foreground recognition model, and the validation dataset can be input into the foreground recognition model. The foreground recognition model processes each frame of validation image and the corresponding depth image in the validation dataset with the parameters of the current test to obtain the predicted foreground result corresponding to each frame of validation image. Based on the predicted foreground result and the labeled foreground result corresponding to each frame of validation image, the target loss corresponding to the parameters of the current test is calculated. Optionally, the target loss can be the average loss of all validation images included in the validation dataset, or the median of the loss, etc.

[0105] The target loss corresponding to each saved parameter can be compared, and the parameters whose target loss satisfies the accuracy condition can be selected as the parameters of the trained foreground recognition model. Optionally, the accuracy condition may include any one of the following: minimum target loss, target loss less than a preset threshold, etc. In other embodiments, the accuracy condition may also be that the loss corresponding to each frame of the validation dataset is the most stable. By selecting the best-performing parameters from the validation dataset as the parameters of the trained foreground recognition model, it is ensured that the trained foreground recognition model is more accurate and has higher precision, thereby improving the accuracy of foreground recognition.
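Selecting among the saved parameters can be sketched as below, assuming the "minimum target loss" accuracy condition and a per-image loss that has already been computed on the validation dataset for each saved parameter set. The function name `select_parameters` and its arguments are hypothetical.

```python
from statistics import mean, median


def select_parameters(checkpoints, validation_losses, reduce="mean"):
    """Pick the saved parameters with the smallest validation target loss.

    checkpoints       -- list of saved parameter sets (any objects)
    validation_losses -- for each checkpoint, the list of per-image losses
    reduce            -- "mean" for the average loss, otherwise the median
    """
    reducer = mean if reduce == "mean" else median
    scores = [reducer(losses) for losses in validation_losses]
    best = min(range(len(checkpoints)), key=scores.__getitem__)
    return checkpoints[best], scores[best]
```

A threshold-based condition would instead return the first checkpoint whose score falls below a preset value.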

[0106] Step 608: Acquire the image to be processed and the first depth image corresponding to the image to be processed through the image acquisition device.

[0107] Step 610: Generate a second depth image that matches the target style based on the image to be processed and the first depth image.

[0108] Step 612: Process the image to be processed and the second depth image through the trained foreground recognition model to obtain the foreground recognition result, which includes the foreground mask.

[0109] Step 614: Perform connected component detection on the foreground mask and remove image noise from the foreground mask based on the detection results.

[0110] The descriptions of steps 608 to 614 can be found in the relevant descriptions in the above embodiments, and will not be repeated here.

[0111] In this embodiment, the foreground recognition model can be trained using a training dataset, which includes multiple sample images, sample depth images corresponding to each sample image, and labeled foreground results corresponding to each sample image. This solves the problems of insufficient semantic estimation ability and insufficient generalization performance for unseen objects in related technologies, thereby improving the accuracy and precision of the foreground recognition model. Furthermore, the training is divided into three different tasks, making the information flow of the foreground recognition model training more efficient.

[0112] As shown in Figure 7, in one embodiment, an image processing device 700 is provided, which can be applied to the above-mentioned electronic device. The image processing device 700 may include an image acquisition module 710, a style adaptation module 720, and a foreground recognition module 730.

[0113] The image acquisition module 710 is used to acquire an image to be processed and a first depth image corresponding to the image to be processed through an image acquisition device. The first depth image is used to describe the depth information of the image to be processed.

[0114] The style adaptation module 720 is used to generate a second depth image that matches the target style based on the image to be processed and the first depth image; the depth images obtained by different depth estimation methods correspond to different styles.

[0115] The foreground recognition module 730 is used to process the image to be processed and the second depth image through the trained foreground recognition model to obtain the foreground recognition result, which is used to describe the foreground region in the image to be processed. The foreground recognition model is trained based on the training dataset, which includes multiple frames of sample images, sample depth images corresponding to each frame of sample images, and labeled foreground results corresponding to each frame of sample images. The style of the sample depth images is the target style.

[0116] In this embodiment, when the foreground recognition model performs foreground recognition on the image to be processed, combining depth information can better estimate the semantic information of the image to be processed, solving the problems of insufficient semantic estimation ability of neural networks and insufficient generalization performance for unseen objects in related technologies, and improving the accuracy of foreground region recognition of the image; moreover, the first depth image obtained by the image acquisition device is converted into a second depth image with the same style as the sample depth image, making the second depth image suitable for the foreground recognition model, improving the adaptability of the foreground recognition model and expanding the application scenarios.

[0117] In one embodiment, the style adaptation module 720 includes a depth estimation unit and a constraint unit.

[0118] The depth estimation unit is used to perform depth estimation on the image to be processed using a target depth estimation method to obtain a depth estimation result that matches the target style; wherein, the target depth estimation method is the depth estimation method of the generated sample depth image.

[0119] The constraint unit is used to determine the constraint conditions based on the first depth image and generate a second depth image based on the constraint conditions and the depth estimation results.

[0120] In one embodiment, the first depth image includes depth values corresponding to each pixel in the image to be processed. The constraint unit is further configured to determine the vector index corresponding to each pixel according to the depth values in the first depth image, in descending or ascending order of depth values; and to generate constraint conditions according to the vector indexes corresponding to each pixel, the constraint conditions being used to constrain the relationship between the vector indexes and the order of each pixel.

[0121] In one embodiment, the constraint unit is further configured to minimize the mean square error between the second depth image and the depth estimation result while satisfying the constraint condition, thereby obtaining the second depth image.

[0122] In this embodiment, constraints are generated based on the first depth image, and a second depth image matching the target style is generated under the premise of satisfying the constraints. This not only ensures the accuracy of the semantic information in the second depth image, but also makes the style of the second depth image closer to the target style, which is suitable for the foreground recognition model and further improves the accuracy of the foreground recognition result obtained by the foreground recognition model.

[0123] In one embodiment, the image processing apparatus 700 includes an image acquisition module 710, a style adaptation module 720, and a foreground recognition module 730, as well as a noise reduction module.

[0124] The noise reduction module is used to perform connected component detection on the foreground mask and remove image noise from the foreground mask based on the detection results.

[0125] In one embodiment, the noise reduction module is further configured to traverse each pixel in the foreground mask; if the current pixel does not have a corresponding connected component label, assign a connected component label corresponding to the current pixel, the connected component label being used to identify the connected component to which the pixel belongs; detect reachable pixels in the foreground mask within the reachable region of the current pixel, and assign connected component labels to each reachable pixel; and delete connected components in the foreground mask corresponding to the target connected component label, wherein the number of pixels in the foreground mask corresponding to the target connected component label is less than a number threshold.

[0126] In this embodiment, after obtaining the foreground mask through the foreground recognition model, the foreground mask is further post-processed to remove image noise from the foreground mask, thereby further improving the accuracy of the foreground mask.

[0127] In one embodiment, the image processing apparatus 700 includes an image acquisition module 710, a style adaptation module 720, a foreground recognition module 730, and a noise reduction module, as well as a training module.

[0128] The training module is used to train the foreground recognition model to be trained based on the training dataset.

[0129] The training module includes an acquisition unit, a depth estimation unit, and a training unit.

[0130] The acquisition unit is used to acquire multiple frames of sample images and the labeled foreground results corresponding to each frame of sample images.

[0131] The depth estimation unit is used to perform depth estimation on each frame of sample image using the target depth estimation method, so as to obtain the sample depth image corresponding to each frame of sample image.

[0132] The training unit is used to train the foreground recognition model to be trained based on multiple sample images, the labeled foreground results corresponding to each sample image, and the sample depth image corresponding to each sample image, until the foreground recognition model converges.

[0133] In one embodiment, the training unit is further configured to process the current frame sample image and the corresponding sample depth image through the foreground recognition model to be trained to obtain the predicted foreground recognition result; and to determine the target loss based on the labeled foreground result and the predicted foreground recognition result corresponding to the current frame sample image, and update the parameters of the foreground recognition model to be trained based on the target loss.
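
The predict, compute-loss, update cycle described above can be sketched as follows. The source does not specify the network architecture or the loss at this point, so this sketch substitutes a deliberately tiny stand-in: a per-pixel logistic model with three parameters (image weight, depth weight, bias) trained with binary cross-entropy. All names (`predict`, `training_step`, `train`, `learning_rate`, `tolerance`) are hypothetical.

```python
import numpy as np


def predict(params, image, depth):
    """Stand-in model: per-pixel logistic function of image intensity and depth."""
    logits = params[0] * image + params[1] * depth + params[2]
    return 1.0 / (1.0 + np.exp(-logits))


def training_step(params, image, depth, label, learning_rate=0.5):
    """One update: predict, compute the loss against the label, adjust parameters."""
    predicted = predict(params, image, depth)
    eps = 1e-7
    loss = -np.mean(label * np.log(predicted + eps)
                    + (1 - label) * np.log(1 - predicted + eps))
    error = predicted - label  # gradient of the cross-entropy w.r.t. the logits
    grad = np.array([np.mean(error * image), np.mean(error * depth), np.mean(error)])
    return params - learning_rate * grad, float(loss)


def train(samples, max_epochs=2000, tolerance=1e-6):
    """Iterate over (image, depth, label) samples until the mean loss stops changing."""
    params = np.zeros(3)
    previous = np.inf
    mean_loss = np.inf
    for _ in range(max_epochs):
        total = 0.0
        for image, depth, label in samples:
            params, loss = training_step(params, image, depth, label)
            total += loss
        mean_loss = total / len(samples)
        if abs(previous - mean_loss) < tolerance:  # simple convergence check
            break
        previous = mean_loss
    return params, mean_loss
```

The convergence test used here (change in mean loss below a tolerance) is one common choice; the source only states that training continues until the model converges.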

[0134] In one embodiment, the predicted foreground recognition result includes a first foreground mask, a second foreground mask, and a third foreground mask, wherein the accuracy of the first foreground mask is less than the accuracy of the second foreground mask, and the accuracy of the second foreground mask is less than the accuracy of the third foreground mask.

[0135] The training unit is also used to process the labeled foreground result corresponding to the current frame sample image to obtain a coarse foreground mask. The accuracy of the coarse foreground mask is less than that of the labeled foreground result. The unit calculates the first loss between the first foreground mask and the coarse foreground mask, the second loss between the second foreground mask and the labeled foreground result at the edge, and the third loss between the third foreground mask and the labeled foreground result. The unit then determines the target loss based on the first loss, the second loss, and the third loss.
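
The combination of the three losses can be sketched as follows. The source does not state which loss functions, edge definition, or weighting are used, so everything specific here is an assumption: mean squared error for the first loss, mean absolute error for the second and third losses, an edge region defined as pixels whose 3x3 neighborhood contains both foreground and background labels, and a weighted sum as the target loss. The names `edge_region`, `target_loss`, and `weights` are hypothetical.

```python
import numpy as np


def edge_region(label, radius=1):
    """Boolean map of pixels whose neighborhood contains both label classes."""
    binary = np.asarray(label) > 0.5
    padded = np.pad(binary, radius, mode="edge")
    height, width = binary.shape
    local_max = np.zeros_like(binary)
    local_min = np.ones_like(binary)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            window = padded[dy:dy + height, dx:dx + width]
            local_max |= window
            local_min &= window
    return local_max & ~local_min


def target_loss(first_mask, second_mask, third_mask, coarse_mask, label,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the coarse, edge, and full-resolution losses."""
    first_mask = np.asarray(first_mask, dtype=np.float64)
    second_mask = np.asarray(second_mask, dtype=np.float64)
    third_mask = np.asarray(third_mask, dtype=np.float64)
    coarse_mask = np.asarray(coarse_mask, dtype=np.float64)
    label = np.asarray(label, dtype=np.float64)

    # First loss: low-resolution prediction against the coarse mask.
    first_loss = float(np.mean((first_mask - coarse_mask) ** 2))
    # Second loss: full-resolution prediction against the label, edge pixels only.
    edge = edge_region(label)
    second_loss = float(np.mean(np.abs(second_mask - label)[edge])) if edge.any() else 0.0
    # Third loss: full-resolution prediction against the label, all pixels.
    third_loss = float(np.mean(np.abs(third_mask - label)))

    w1, w2, w3 = weights
    total = w1 * first_loss + w2 * second_loss + w3 * third_loss
    return total, (first_loss, second_loss, third_loss)
```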

[0136] In one embodiment, the first foreground mask is a first image size, and the second foreground mask, the third foreground mask, and the current frame sample image are all a second image size, wherein the first image size is smaller than the second image size.

[0137] The training unit is also used to downsample the labeled foreground result corresponding to the current frame sample image to obtain a downsampled mask of the first image size; and to blur the downsampled mask to obtain a coarse foreground mask.
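
The downsample-then-blur step can be sketched as follows. The source does not name the downsampling or blurring method, so this sketch assumes block averaging by an integer factor and a box blur with edge padding. The names `downsample_mask`, `box_blur`, `coarse_foreground_mask`, `factor`, and `blur_radius` are hypothetical.

```python
import numpy as np


def downsample_mask(label, factor):
    """Reduce the labeled mask to the first image size by averaging factor x factor blocks."""
    label = np.asarray(label, dtype=np.float64)
    height, width = label.shape
    if height % factor or width % factor:
        raise ValueError("mask size must be divisible by factor")
    blocks = label.reshape(height // factor, factor, width // factor, factor)
    return blocks.mean(axis=(1, 3))


def box_blur(mask, radius=1):
    """Average each pixel with its (2*radius+1)^2 neighborhood, replicating borders."""
    mask = np.asarray(mask, dtype=np.float64)
    padded = np.pad(mask, radius, mode="edge")
    height, width = mask.shape
    size = 2 * radius + 1
    total = np.zeros_like(mask)
    for dy in range(size):
        for dx in range(size):
            total += padded[dy:dy + height, dx:dx + width]
    return total / (size * size)


def coarse_foreground_mask(label, factor=4, blur_radius=1):
    """Downsample the labeled foreground result, then blur it."""
    return box_blur(downsample_mask(label, factor), blur_radius)
```

The result is a soft mask at the smaller size, which is deliberately less accurate than the labeled foreground result.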

[0138] In one embodiment, the training module further includes a validation unit.

[0139] The validation unit is used to save the current parameters of the foreground recognition model to be trained after the training unit updates the parameters of the foreground recognition model to be trained according to the target loss; and to test the saved parameters with the validation dataset after the foreground recognition model converges, and select the parameters that meet the accuracy conditions as the parameters of the trained foreground recognition model based on the test results; wherein, the validation dataset includes multiple frames of validation images, the labeled foreground results corresponding to each frame of validation images, and the depth image corresponding to each frame of validation images.
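
The selection among saved parameters can be sketched as follows. The source does not define the accuracy condition, so this sketch assumes mean intersection-over-union (IoU) on the validation dataset with a minimum threshold, and keeps the best-scoring parameters that meet it. The names `mask_iou`, `select_parameters`, `predict`, and `min_mean_iou` are hypothetical, and `predict` stands for whatever function runs the model with a given set of parameters.

```python
import numpy as np


def mask_iou(predicted, labeled, threshold=0.5):
    """Intersection-over-union of two masks after thresholding."""
    p = np.asarray(predicted) > threshold
    t = np.asarray(labeled) > threshold
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(p, t).sum() / union)


def select_parameters(saved_parameters, predict, validation_set, min_mean_iou):
    """Test each saved parameter set and return the best one that meets the threshold.

    validation_set is a list of (image, depth, label) tuples;
    predict(params, image, depth) returns a foreground mask.
    """
    best, best_score = None, -1.0
    for params in saved_parameters:
        scores = [mask_iou(predict(params, image, depth), label)
                  for image, depth, label in validation_set]
        score = float(np.mean(scores))
        if score >= min_mean_iou and score > best_score:
            best, best_score = params, score
    return best, best_score
```

If no saved parameter set reaches the threshold, the function returns `None`, leaving the fallback policy to the caller.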

[0140] In this embodiment, the foreground recognition model can be trained using a training dataset, which includes multiple sample images, sample depth images corresponding to each sample image, and labeled foreground results corresponding to each sample image. This solves the problems of insufficient semantic estimation ability and insufficient generalization performance for unseen objects in related technologies, thereby improving the accuracy and precision of the foreground recognition model. Furthermore, the training is divided into three different tasks, making the information flow of the foreground recognition model training more efficient.

[0141] Figure 8 is a structural block diagram of an electronic device in one embodiment. As shown in Figure 8, the electronic device 800 may include one or more of the following components: a processor 810 and a memory 820 coupled to the processor 810, wherein the memory 820 may store one or more computer programs, which may be configured to implement the methods described in the above embodiments when executed by the one or more processors 810.

[0142] The processor 810 may include one or more processing cores. The processor 810 connects to various parts within the electronic device 800 using various interfaces and lines, and performs various functions and processes data of the electronic device 800 by running or executing instructions, programs, code sets, or instruction sets stored in the memory 820, and by calling data stored in the memory 820. Optionally, the processor 810 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 810 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 810 and may be implemented separately using a communication chip.

[0143] The memory 820 may include random access memory (RAM) or read-only memory (ROM). The memory 820 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 820 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), and instructions for implementing the various method embodiments described above. The data storage area may also store data created by the electronic device 800 during use.

[0144] It can be understood that the electronic device 800 may include more or fewer structural elements than those shown in the above block diagram, such as a power module, physical buttons, a WiFi (Wireless Fidelity) module, a speaker, a Bluetooth module, and sensors, and is not limited thereto.

[0145] This application discloses a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method described in the above embodiments.

[0146] This application discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program can be executed by a processor to implement the methods described in the above embodiments.

[0147] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, ROM, etc.

[0148] Any references to memory, storage, databases, or other media used herein may include non-volatile and / or volatile memory. Suitable non-volatile memory may include ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which is used as an external cache. By way of illustration and not limitation, RAM may take many forms, such as Static RAM (SRAM), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and Direct Rambus DRAM (DRDRAM).

[0149] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Those skilled in the art should also recognize that the embodiments described in the specification are optional embodiments, and the actions and modules involved are not necessarily essential to this application.

[0150] In the various embodiments of this application, it should be understood that the sequence number of each process does not necessarily imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0151] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0152] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0153] The foregoing has provided a detailed description of an image processing method, apparatus, electronic device, and computer-readable storage medium disclosed in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. An image processing method, characterized in that the method comprises: acquiring, through an image acquisition device, an image to be processed and a first depth image corresponding to the image to be processed, wherein the first depth image is used to describe the depth information of the image to be processed; performing depth estimation on the image to be processed using a target depth estimation method to obtain a depth estimation result that matches a target style, wherein the target depth estimation method is the depth estimation method used to generate sample depth images, and depth images obtained by different depth estimation methods correspond to different styles; determining constraints based on the first depth image, and generating a second depth image based on the constraints and the depth estimation result; and processing the image to be processed and the second depth image through a trained foreground recognition model to obtain a foreground recognition result, wherein the foreground recognition result is used to describe the image location of the foreground region in the image to be processed, the foreground recognition model is trained based on a training dataset, the training dataset includes multiple frames of sample images, a sample depth image corresponding to each frame of sample image, and a labeled foreground result corresponding to each frame of sample image, and the style of the sample depth image is the target style;
wherein the first depth image includes the depth value corresponding to each pixel in the image to be processed, and the step of determining the constraints based on the first depth image includes: determining the vector index corresponding to each pixel, in descending or ascending order of depth value, based on the depth value corresponding to each pixel in the first depth image; and generating the constraints based on the vector index corresponding to each pixel, wherein the constraints are used to constrain the relationship between the vector index corresponding to each pixel and the order of the vectors.

2. The method according to claim 1, characterized in that the step of generating a second depth image based on the constraints and the depth estimation result includes: minimizing, under the condition that the constraints are satisfied, the mean square error with respect to the depth estimation result to obtain the second depth image.

3. The method according to claim 1, characterized in that the foreground recognition result includes a foreground mask, and after the foreground recognition result is obtained, the method further includes: performing connected component detection on the foreground mask, and removing image noise from the foreground mask based on the detection results.

4. The method according to claim 3, characterized in that the step of performing connected component detection on the foreground mask and removing image noise from the foreground mask based on the detection results includes: traversing each pixel in the foreground mask; if the current pixel does not have a corresponding connected component label, assigning a connected component label to the current pixel, wherein the connected component label is used to identify the connected component to which the pixel belongs; detecting, in the foreground mask, reachable pixels within the reachable region of the current pixel, and assigning the connected component label to each reachable pixel; and deleting the connected components in the foreground mask that correspond to a target connected component label, wherein the number of pixels in the foreground mask that correspond to the target connected component label is less than a number threshold.

5. The method according to claim 1, characterized in that the foreground recognition model is trained through the following steps: obtaining multiple frames of sample images and the labeled foreground result corresponding to each frame of sample image; performing depth estimation on each frame of sample image using the target depth estimation method to obtain a sample depth image corresponding to each frame of sample image; processing the current frame sample image and the corresponding sample depth image through the foreground recognition model to be trained to obtain a predicted foreground recognition result; and determining a target loss based on the labeled foreground result corresponding to the current frame sample image and the predicted foreground recognition result, and updating the parameters of the foreground recognition model to be trained based on the target loss until the foreground recognition model converges.

6. The method according to claim 5, characterized in that the predicted foreground recognition result includes a first foreground mask, a second foreground mask, and a third foreground mask, wherein the accuracy of the first foreground mask is less than the accuracy of the second foreground mask, and the accuracy of the second foreground mask is less than the accuracy of the third foreground mask; and the step of determining the target loss based on the labeled foreground result corresponding to the current frame sample image and the predicted foreground recognition result includes: processing the labeled foreground result corresponding to the current frame sample image to obtain a coarse foreground mask, wherein the accuracy of the coarse foreground mask is less than the accuracy of the labeled foreground result; and calculating a first loss between the first foreground mask and the coarse foreground mask, a second loss between the second foreground mask and the labeled foreground result at the edge, and a third loss between the third foreground mask and the labeled foreground result, and determining the target loss based on the first loss, the second loss, and the third loss.

7. The method according to claim 6, characterized in that the first foreground mask has a first image size, the second foreground mask, the third foreground mask, and the current frame sample image all have a second image size, and the first image size is smaller than the second image size; and the step of processing the labeled foreground result corresponding to the current frame sample image to obtain a coarse foreground mask includes: downsampling the labeled foreground result corresponding to the current frame sample image to obtain a downsampled mask of the first image size; and blurring the downsampled mask to obtain the coarse foreground mask.

8. The method according to claim 6 or 7, characterized in that, after updating the parameters of the foreground recognition model to be trained according to the target loss, the method further includes: saving the current parameters of the foreground recognition model to be trained; and after the foreground recognition model converges, the method further includes: testing the parameters saved each time using a validation dataset, and selecting, based on the test results, the parameters that meet the accuracy condition as the parameters of the trained foreground recognition model, wherein the validation dataset includes multiple frames of validation images, the labeled foreground result corresponding to each frame of validation image, and the depth image corresponding to each frame of validation image.

9. An image processing apparatus, characterized in that the apparatus comprises: an image acquisition module, used to acquire, through an image acquisition device, an image to be processed and a first depth image corresponding to the image to be processed, wherein the first depth image is used to describe the depth information of the image to be processed; a style adaptation module, used to generate a second depth image that matches a target style based on the image to be processed and the first depth image, wherein depth images obtained by different depth estimation methods correspond to different styles; and a foreground recognition module, used to process the image to be processed and the second depth image through a trained foreground recognition model to obtain a foreground recognition result, wherein the foreground recognition result is used to describe the foreground region in the image to be processed, the foreground recognition model is trained based on a training dataset, the training dataset includes multiple frames of sample images, a sample depth image corresponding to each frame of sample image, and a labeled foreground result corresponding to each frame of sample image, and the style of the sample depth image is the target style; wherein the style adaptation module includes a depth estimation unit and a constraint unit; the depth estimation unit is used to perform depth estimation on the image to be processed using a target depth estimation method to obtain a depth estimation result that matches the target style, wherein the target depth estimation method is the depth estimation method used to generate sample depth images; the constraint unit is used to determine constraint conditions based on the first depth image, and to generate the second depth image based on the constraint conditions and the depth estimation result; and the first depth image includes the depth value corresponding to each pixel in the image to be processed;
wherein the constraint unit is further configured to determine the vector index corresponding to each pixel, in descending or ascending order of depth value, according to the depth value corresponding to each pixel in the first depth image; and to generate the constraint conditions according to the vector index corresponding to each pixel, wherein the constraint conditions are used to constrain the relationship between the vector index corresponding to each pixel and the order.

10. An electronic device, characterized in that the electronic device includes a memory and a processor, wherein the memory stores a computer program that, when executed by the processor, causes the processor to perform the method as described in any one of claims 1 to 8.

11. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method as described in any one of claims 1 to 8.
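
By way of illustration only, and not as part of the claims, the ordering constraint and mean-square-error minimization recited in claims 1 and 2 can be read as an isotonic regression problem. The following sketch rests on that reading, which is an assumption: pixels are sorted in ascending order of their first-depth value, and the second depth image is the non-decreasing sequence along that order that is closest, in the least-squares sense, to the depth estimation result. It uses the standard pool-adjacent-violators procedure. The names `isotonic_fit` and `second_depth` are hypothetical.

```python
import numpy as np


def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares closest non-decreasing sequence."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])


def second_depth(first_depth, estimate):
    """Adjust the estimate so it is monotone in the pixel order given by first_depth."""
    first_depth = np.asarray(first_depth, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Index of each pixel in ascending order of measured depth.
    order = np.argsort(first_depth.ravel(), kind="stable")
    fitted = isotonic_fit(estimate.ravel()[order])
    result = np.empty(estimate.size)
    result[order] = fitted  # scatter fitted values back to pixel positions
    return result.reshape(estimate.shape)
```

In this sketch, pixels with equal first-depth values keep their original relative order, and the estimate is assumed to increase with measured depth; a style that encodes inverse depth would need the descending order instead.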