A deep learning-based image processing method, electronic device, and storage medium
By using a deep learning-based object detection and image segmentation network model, target regions in images are identified and filled, solving the problems of low accuracy and high computational cost in existing technologies, and achieving efficient and accurate image filling results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2021-02-04
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies have low accuracy in determining the filling region in an image by combining 3D point cloud and depth information, while semantic segmentation is computationally intensive and inefficient, resulting in poor image filling effects.
A deep learning-based approach is adopted, which uses an object detection network model to determine the target region image and uses an image segmentation network model for semantic segmentation, processing only the target region to determine the filling region.
It improves the efficiency and accuracy of image filling, avoids the huge computational load caused by semantic segmentation of the entire image, and achieves efficient and accurate filling results.
Figure CN113570615B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to an image processing method, electronic device, and storage medium based on deep learning.
Background Technology
[0002] With the rapid advancement of science and technology, the performance of hardware computing units has significantly improved, enabling the rapid development of artificial intelligence technology, primarily deep learning, and leading to increasingly diverse understandings and applications of the three-dimensional world. Among these, entertainment using the cameras of smartphones and other electronic devices as input media is currently the mainstream choice for consumers seeking to leverage artificial intelligence technology.
[0003] Among various applications, virtual reality (VR) technology has risen significantly in recent years compared with traditional beautification and face-changing technologies, and its applications are becoming increasingly widespread. VR applications mostly involve filling controllable materials (such as images and expressions) into specific areas of an image, essentially covering a portion of the image with those materials. This area could be the location of objects such as windows or doors. Currently, the filling area in an image is typically determined either by combining 3D point clouds with depth information or by semantic segmentation. However, combining 3D point clouds with depth information often extracts only the approximate location of the filling area, resulting in low accuracy and poor results during replacement. Semantic segmentation usually requires a large amount of pixel-level annotation data, is computationally intensive and inefficient, and produces results that are prone to issues such as a mottled appearance and unclear edges. Therefore, how to efficiently and accurately extract the filling area from an image has become a pressing problem.
Summary of the Invention
[0004] This invention provides a deep learning-based image processing method, electronic device, and storage medium that can efficiently and accurately extract filling regions from images, improving the effect and accuracy of image filling.
[0005] On one hand, embodiments of the present invention provide an image processing method based on deep learning, the method comprising:
[0006] Obtain the image to be processed.
[0007] The object detection network model is invoked to determine the target region image in the image to be processed, the target region image including the image of the region where the target object is located.
[0008] An image segmentation network model is invoked to perform semantic segmentation on the target region image to obtain the semantic segmentation result of the target region image. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the border region of the target detection object.
[0009] Based on the semantic segmentation result, a filling region is determined from the target region image, and a source image is filled into the filling region. The filling region includes the region image of the area where the target detection object is located, excluding the border region.
[0010] On the other hand, embodiments of the present invention provide an image processing apparatus, the apparatus comprising:
[0011] The acquisition module is used to acquire the image to be processed.
[0012] The determination module is used to call the object detection network model to determine the target region image in the image to be processed, wherein the target region image includes the image of the region where the target object is located.
[0013] The processing module is used to call an image segmentation network model to perform semantic segmentation processing on the target region image, and obtain the semantic segmentation result of the target region image. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the border region of the target detection object.
[0014] The determination module is further used to determine a filling region from the target region image based on the semantic segmentation result, wherein the filling region includes the region image of the region where the target detection object is located, excluding the border region.
[0015] The processing module is also used to fill the filling area with source images.
[0016] In another aspect, embodiments of the present invention provide an electronic device, the electronic device including a processor and a storage device, the processor and the storage device being interconnected, wherein the storage device is used to store a computer program, the computer program including program instructions, and the processor is configured to call the program instructions to execute the above-described deep learning-based image processing method.
[0017] In another aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program, the computer program including program instructions, which are executed by a processor to perform the above-described deep learning-based image processing method.
[0018] In another aspect, this invention discloses a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the aforementioned deep learning-based image processing method.
[0019] In this embodiment of the invention, by calling an object detection network model, a target region image can be determined in the image to be processed. The target region image includes the image of the region where the target object is located. By calling an image segmentation network model to perform semantic segmentation processing on the target region image, a semantic segmentation result of the target region image can be obtained. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the border region of the target object. Performing semantic segmentation only on the target region image avoids the huge amount of computation caused by performing semantic segmentation on the entire image. Based on the semantic segmentation result, a filling region is determined from the target region image. The filling region refers to the region image in the image where the target object is located, excluding the border region. The source image can be used for filling, which can efficiently and accurately extract the filling region in the image, such as the filling region of windows and doors, and quickly complete the image filling, improving the effect and accuracy of image filling.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a schematic diagram of the structure of an image processing framework provided in an embodiment of the present invention;
[0022] Figure 2 is a flowchart illustrating an image processing method based on deep learning provided in an embodiment of the present invention;
[0023] Figure 3a is a schematic diagram of an image to be processed according to an embodiment of the present invention;
[0024] Figure 3b is a schematic diagram of a target area detection result provided in an embodiment of the present invention;
[0025] Figure 3c is a schematic diagram of an image processing effect provided by an embodiment of the present invention;
[0026] Figure 4 is a flowchart illustrating another deep learning-based image processing method provided in an embodiment of the present invention;
[0027] Figure 5a is a schematic diagram of candidate boxes for an image to be processed, provided in an embodiment of the present invention;
[0028] Figure 5b is a schematic diagram of a target candidate box provided in an embodiment of the present invention;
[0029] Figure 5c is a schematic diagram of an expanded target candidate box provided in an embodiment of the present invention;
[0030] Figure 5d is a schematic diagram of another image to be processed provided in an embodiment of the present invention;
[0031] Figure 5e is a schematic diagram illustrating another image processing effect provided by an embodiment of the present invention;
[0032] Figure 6 is a schematic diagram of the structure of an image processing device provided in an embodiment of the present invention;
[0033] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0034] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0035] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.
[0036] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.
[0037] Computer vision (CV) is the science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in recognizing and measuring targets, and then performs image processing to create images more suitable for human observation or transmission to instruments. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content / behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), and common biometric recognition technologies such as facial recognition and fingerprint recognition.
[0038] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and instructional learning.
[0039] The solutions provided in this application mainly involve artificial intelligence technologies such as machine learning and computer vision, which are specifically illustrated through the following embodiments:
[0040] Please refer to Figure 1, which is a schematic diagram of the structure of an image processing framework provided in an embodiment of the present invention. The basic structure of the image processing framework in this embodiment of the present invention can be as follows:
[0041] (1) Obtain the input image.
[0042] (2) The input image is fed into the object detection network model for processing to identify the specified detection objects in the input image and extract the corresponding region images, thereby obtaining the local image in the input image. The local image can be the region where the specified detection object is located in the input image. The specified detection object can specifically refer to the object surrounded by the outer border and the middle region, such as a window or door. The window can be regarded as being composed of the outer border (i.e., the window frame) and the glass surrounded by the outer border.
[0043] (3) The local image is fed into the image segmentation network model for semantic segmentation. Semantic segmentation classifies each pixel in the local image and finds the region surrounded by the outer border of the specified detection object in the local image based on the classification result, and uses it as the filling region.
[0044] (4) Fill the filling region with a source image, for example one that matches the current festive atmosphere, to obtain the filled image.
[0045] As can be seen, this application can first perform object detection on the image using the above image processing framework, find the region in the image that contains the detected object, and then perform semantic segmentation on the region, avoiding the huge amount of computation caused by semantic segmentation of the entire image. After determining the filling region from the region where the detected object is located based on the semantic segmentation result, the source image can be used for filling. It can efficiently and accurately extract the filling region in the image and quickly complete the image filling, achieving a good filling effect.
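The four-stage framework above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `detect` and `segment` are hypothetical stand-ins for the object detection network model and the image segmentation network model, images are plain nested lists of pixel values, and the source image is assumed to have the same size as the local image.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # rows of pixel values (assumed representation)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Extract the local image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def fill_target_region(
    image: Image,
    source: Image,
    detect: Callable[[Image], Box],
    segment: Callable[[Image], List[List[int]]],
) -> Image:
    """Detect, crop, segment only the crop, then fill the non-border pixels."""
    box = detect(image)                 # stage 2: locate the detection object
    left, top, _, _ = box
    local = crop(image, box)            # local image containing the object
    mask = segment(local)               # stage 3: 1 = outer border, 0 = not border
    result = [row[:] for row in image]  # copy so the input stays untouched
    for y, mask_row in enumerate(mask):
        for x, label in enumerate(mask_row):
            if label == 0:              # stage 4: cover non-border pixels
                result[top + y][left + x] = source[y][x]
    return result


# Hypothetical stand-ins for the two network models.
def fake_detect(image: Image) -> Box:
    return (1, 1, 4, 4)


def fake_segment(local: Image) -> List[List[int]]:
    h, w = len(local), len(local[0])
    return [[1 if y in (0, h - 1) or x in (0, w - 1) else 0 for x in range(w)]
            for y in range(h)]


image = [[0] * 5 for _ in range(5)]
source = [[9] * 3 for _ in range(3)]
filled = fill_target_region(image, source, fake_detect, fake_segment)
```

Only the 3x3 local image is passed to `segment`, which mirrors the point made above: the per-pixel classification cost scales with the detected region rather than with the whole image.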
[0046] Please refer to Figure 2, which is a flowchart illustrating an image processing method based on deep learning provided in an embodiment of the present invention. The image processing method of this embodiment includes the following steps:
[0047] 201. Obtain the image to be processed.
[0048] Specifically, the image to be processed can be obtained in real time by the shooting device, or it can be an image stored in a local image library or cloud image library. This embodiment of the invention does not limit this.
[0049] Taking the example of using a camera to acquire an image to be processed, when a user wants to experience the virtual reality function provided by the target application, they can point the camera of the electronic device at the object to be captured, such as a window or door. After the virtual reality function of the target application is turned on, the user can trigger a shooting command. The electronic device responds to the shooting command, starts the camera, and can acquire the image captured by the camera. The image captured by the camera can include the image displayed in the preview window. The image captured by the camera is used as the image to be processed.
[0050] It should be noted that electronic devices can specifically include smartphones, tablets, laptops, in-vehicle terminals, smart wearable devices, etc.
[0051] 202. Call the object detection network model to determine the target region image in the image to be processed, wherein the target region image includes the image of the region where the target object is located.
[0052] Among them, the object detection network model can be used to identify a specified target object and determine the location of the target object in the image.
[0053] Specifically, an object detection network model can be invoked to detect target objects in the image to be processed, thereby determining the location of the target objects in the image, and based on this location, a target region image including the area where the target objects are located can be determined. The target objects can be pre-defined objects to be filled in, such as windows or doors.
[0054] Taking a window as the target detection object as an example, as shown in Figure 3a, the image to be processed 10 includes the region image of window 20 and other background region images. Specifically, the region image of window 20 includes the border region (i.e., the region image of window frame 21) and the region image of glass 22. The object detection network model is used to detect the window in the image to be processed; as shown in Figure 3b, the detection result is a target region image 30 that includes the area where the window is located.
[0055] 203. Call the image segmentation network model to perform semantic segmentation processing on the target region image to obtain the semantic segmentation result of the target region image. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the border region of the target detection object.
[0056] The image segmentation network model can be used to classify pixels in an image. In this invention, the image segmentation network model can perform binary classification of pixels, that is, divide the pixels in the image into one of two categories.
[0057] Specifically, the target region image can be semantically segmented by calling an image segmentation network model to obtain the semantic segmentation result of each pixel in the target region image. The semantic segmentation result indicates whether each pixel belongs to the border region of the target detection object.
[0058] 204. Determine a filling region from the target region image based on the semantic segmentation result, and fill the filling region with a source image. The filling region includes the region image of the area where the target detection object is located, excluding the border region.
[0059] Specifically, based on the semantic segmentation result, the region to which each pixel in the target region image belongs can be determined. Then, based on the region to which each pixel belongs, the filling region to be covered (or replaced) is determined. This filling region refers to the area of the image where the target detection object is located, excluding the aforementioned border region. The corresponding source image is then filled into the filling region. Taking Figure 3b as an example, calling the image segmentation network model yields the semantic segmentation result of each pixel in the target region image 30. The semantic segmentation result indicates whether each pixel belongs to the border region of window 20 (i.e., window frame 21). Based on this result, the region in the target region image 30 surrounded by window frame 21 can be obtained, which is the region where glass 22 is located. Of course, if window 20 is open, the region surrounded by window frame 21 includes not only the region where glass 22 is located, but also background objects seen through window 20, such as distant buildings and landscapes. The region surrounded by window frame 21 can be used as the filling region. The filling region is then filled with the source image. The effect is that the glass in the region surrounded by window frame 21 and the background objects seen through glass 22 are covered by the source image, as shown in Figure 3c. The source image can be randomly selected from a resource library, or it can be an image selected by the user from a resource library.
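One possible way to recover "the region surrounded by the window frame" from a border mask is sketched below. The text does not specify how this region is derived, so the approach here is an assumption: non-border pixels that can be reached from the edge of the target region image without crossing the border are treated as outside the frame, and the remaining non-border pixels form the filling region. The function name `enclosed_region` is hypothetical.

```python
from collections import deque
from typing import List


def enclosed_region(border_mask: List[List[int]]) -> List[List[int]]:
    """Mark non-border pixels that cannot reach the image edge.

    border_mask uses 1 for border (frame) pixels and 0 otherwise.
    The result uses 1 for pixels inside the frame and 0 elsewhere.
    """
    h, w = len(border_mask), len(border_mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the search with every non-border pixel on the image edge.
    for y in range(h):
        for x in range(w):
            on_edge = y in (0, h - 1) or x in (0, w - 1)
            if on_edge and border_mask[y][x] == 0:
                outside[y][x] = True
                queue.append((y, x))
    # Breadth-first flood fill through 4-connected non-border pixels.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and border_mask[ny][nx] == 0 and not outside[ny][nx]):
                outside[ny][nx] = True
                queue.append((ny, nx))
    return [[1 if border_mask[y][x] == 0 and not outside[y][x] else 0
             for x in range(w)] for y in range(h)]


frame = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
region = enclosed_region(frame)
```

In this toy mask the frame is a closed ring, so only the single pixel inside it is selected. A frame with a gap in the predicted mask would leak, which is a known limitation of this kind of flood fill.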
[0060] In some feasible implementations, the current time information can be obtained, and then images matching the current time information can be determined from a resource library. For example, if the current date is December 25th, the festive atmosphere can be determined to be Christmas, and images related to Christmas can be found in the resource library and filled into the filling region. As shown in Figure 3c, the filled image is of Santa Claus riding a reindeer; for Halloween, the image could be of a monster perched on a window, and so on. This allows for a virtual display effect that matches the current holiday atmosphere, enhancing the fun and playability of image processing.
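The date-based selection described above can be illustrated with a small lookup. The resource library layout, the file names, and the fallback image are all hypothetical; the text only states that images matching the current time information are chosen from a resource library.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical resource library keyed by (month, day).
RESOURCE_LIBRARY: Dict[Tuple[int, int], List[str]] = {
    (12, 25): ["santa_reindeer.png"],   # Christmas
    (10, 31): ["window_monster.png"],   # Halloween
}
DEFAULT_IMAGES: List[str] = ["default.png"]  # assumed fallback when no holiday matches


def select_source_images(today: date) -> List[str]:
    """Return the source images that match the given date."""
    return RESOURCE_LIBRARY.get((today.month, today.day), DEFAULT_IMAGES)


christmas_images = select_source_images(date(2021, 12, 25))
```

A caller would typically pass `date.today()` and then pick one of the returned images at random or let the user choose.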
[0061] In this embodiment of the invention, by calling an object detection network model, a target region image can be determined in the image to be processed. The target region image includes the image of the region where the target object is located. By calling an image segmentation network model to perform semantic segmentation processing on the target region image, a semantic segmentation result of the target region image can be obtained. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the border region of the target object. Performing semantic segmentation only on the target region image avoids the huge amount of computation caused by performing semantic segmentation on the entire image. Based on the semantic segmentation result, a filling region is determined from the target region image. The filling region refers to the region image in the image where the target object is located, excluding the border region. The source image can be used for filling, which can efficiently and accurately extract the filling region in the image, such as the filling region of windows and doors, and quickly complete the image filling, improving the effect and accuracy of image filling.
[0062] Please refer to Figure 4, which is a flowchart illustrating another deep learning-based image processing method provided in an embodiment of the present invention. The image processing method of this embodiment includes the following steps:
[0063] 401. Obtain the image to be processed.
[0064] The specific implementation of step 401 can be found in the relevant description of step 201 in the foregoing embodiments, and will not be repeated here.
[0065] 402. Input the image to be processed into the object detection network model to obtain target candidate boxes including the target detection object.
[0066] Specifically, by inputting the image to be processed into an object detection network model, target detection objects in the image can be detected, and target candidate boxes containing the target detection objects can be obtained. For example, as shown in Figure 3b, the boundary of the target region image 30 in the image to be processed can be used as the target candidate box including window 20.
[0067] In some feasible implementations, the specific implementation of calling the object detection network model to obtain target candidate boxes may include:
[0068] First, the image to be processed is input into the object detection network model, which outputs multiple candidate boxes for the image, along with the probability distribution, location information, and size information of each candidate box. The probability distribution indicates the probability that the candidate box includes the target object. The location information refers to the offset of the candidate box's center position from the target object's center position. The size information refers to the width and height of the candidate box. For example, for a given candidate box, the output of the object detection network model can include two parts: one is the probability (p0, p1) that the candidate box includes the target object, where p1 represents the probability that the candidate box includes the target object, and p0 represents the probability that the candidate box does not include the target object; the sum of p0 and p1 is 1. The other part is the location and size information (Δcx, Δcy, w, h), where Δcx and Δcy represent the offset of the candidate box's center position (cx, cy) from the target object's center position, and w and h represent the candidate box's width and height.
[0069] Then, based on the probability distribution, at least one candidate box is determined from the plurality of candidate boxes. For example, the probability of each candidate box can be compared with a probability threshold of 0.6, and boxes that reach this probability threshold are retained. This determines at least one candidate box that is retained. After determining the at least one retained candidate box based on the probability, a second filtering can be performed based on the probability distribution, position information, and size information of the at least one candidate box, thereby determining the target candidate box that includes the target detection object from the at least one candidate box.
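The first filtering step can be sketched as a simple threshold on p1. The 0.6 threshold comes from the example above; the tuple layout and the name `retain_confident` are assumptions made for illustration.

```python
from typing import List, Tuple

# Each entry: (box name, (p0, p1)), where p1 is the probability that the box
# includes the target detection object and p0 + p1 == 1.
Scored = Tuple[str, Tuple[float, float]]


def retain_confident(candidates: List[Scored], threshold: float = 0.6) -> List[str]:
    """Keep the boxes whose p1 reaches the probability threshold."""
    return [name for name, (_p0, p1) in candidates if p1 >= threshold]


candidates = [("A", (0.35, 0.65)), ("H", (0.7, 0.3)), ("E", (0.05, 0.95))]
retained = retain_confident(candidates)
```

Boxes that survive this step are then passed to the second filtering stage, which uses position and size information as well.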
[0070] It should be noted that the number of target candidate boxes is the same as the number of target detection objects in the image to be processed. If there is only one target detection object in the image, there is only one target candidate box; if there are k target detection objects in the image, there are k target candidate boxes, where k is an integer greater than 1. Taking windows as an example, if the image to be processed contains one window, the target candidate box corresponding to that window is obtained; if it contains three windows, the target candidate box corresponding to each of the three windows is obtained.
[0071] In some feasible implementations, the specific implementation of obtaining the target candidate box by secondary filtering based on the probability distribution, position information, and size information of the at least one candidate box may include:
[0072] First, based on the position and size information of each of the at least one candidate box, a predicted position of the target detection object is calculated, yielding at least one piece of predicted position information. For example, if the position and size information of a candidate box is (Δcx, Δcy, w, h), and the center position of the candidate box is (cx, cy), then the predicted position information of the target detection object is calculated as (rx, ry, rw, rh), where rx = cx + Δcx, ry = cy + Δcy, rw = w, and rh = h.
[0073] Then, the at least one piece of predicted position information is filtered using the non-maximum suppression (NMS) strategy to obtain the most accurate target predicted position information, and the candidate box from which that target predicted position information was calculated is used as the target candidate box.
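The position calculation above is small enough to write out directly. The function name `predict_position` is hypothetical; the arithmetic follows the formulas rx = cx + Δcx, ry = cy + Δcy, rw = w, rh = h given in the text.

```python
from typing import Tuple


def predict_position(
    cx: float, cy: float, dcx: float, dcy: float, w: float, h: float
) -> Tuple[float, float, float, float]:
    """Return (rx, ry, rw, rh) from a candidate box center and its offsets."""
    return (cx + dcx, cy + dcy, w, h)


predicted = predict_position(100.0, 80.0, 4.0, -2.0, 50.0, 60.0)
```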
[0074] For example, as shown in Figure 5a, five candidate boxes A, B, C, D, and E are retained based on the probability distribution. Arranged in ascending order of the probability of including the window, they are A, B, C, D, and E. Starting from candidate box E, which has the highest probability, it is determined whether the overlap between each of A through D and E is greater than a set threshold. The overlap can be evaluated by the intersection over union (IoU). If the overlap between each of A through D and E exceeds the threshold, A through D are removed and candidate box E is marked as retained. In this way, the candidate box with the most accurate prediction is found, and candidate box E is taken as the target candidate box including the window.
[0075] Understandably, if there are multiple windows in the image to be processed, a corresponding number of candidate boxes will be retained. For example, suppose the image to be processed includes two windows, denoted window 1 and window 2. Window 1 has four retained candidate boxes (A, B, C, D), and window 2 has three retained candidate boxes (E, F, G), arranged in ascending order of probability as A, B, E, G, D, C, F. Starting with candidate box F, which has the highest probability, it is determined whether the overlap between each of A, B, E, G, D, C and F is greater than a set threshold. If the overlap between E, G and F exceeds the threshold, E and G are discarded, and candidate box F is marked as retained. Then, from the remaining candidate boxes A, B, C, D, candidate box C with the highest probability is selected, and the overlap between each of A, B, D and C is determined. If each overlap exceeds the set threshold, A, B, and D are discarded, and candidate box C is marked as retained. Thus, target candidate box C is obtained for window 1 and target candidate box F for window 2.
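The two-window walk-through above can be reproduced with a short greedy NMS sketch. The box coordinates, the scores, and the IoU threshold of 0.5 are illustrative assumptions; the text only says "a certain set threshold". Boxes use the (rx, ry, rw, rh) center format from the earlier paragraph.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (rx, ry, rw, rh): center x, center y, width, height


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Dict[str, Box], scores: Dict[str, float],
        iou_threshold: float = 0.5) -> List[str]:
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = sorted(boxes, key=lambda name: scores[name], reverse=True)
    kept: List[str] = []
    while order:
        best = order.pop(0)
        kept.append(best)
        # Discard every remaining box whose overlap with `best` exceeds the threshold.
        order = [n for n in order if iou(boxes[n], boxes[best]) <= iou_threshold]
    return kept


# Window 1 is covered by A, B, C, D; window 2 by E, F, G (made-up coordinates).
boxes = {
    "A": (48.0, 50.0, 40.0, 40.0), "B": (52.0, 50.0, 40.0, 40.0),
    "C": (50.0, 50.0, 40.0, 40.0), "D": (50.0, 52.0, 40.0, 40.0),
    "E": (198.0, 50.0, 40.0, 40.0), "F": (200.0, 50.0, 40.0, 40.0),
    "G": (202.0, 50.0, 40.0, 40.0),
}
scores = {"A": 0.61, "B": 0.65, "E": 0.70, "G": 0.75, "D": 0.80, "C": 0.90, "F": 0.95}
target_boxes = nms(boxes, scores)
```

With these values F is selected first and suppresses E and G, then C is selected and suppresses A, B, and D, matching the order of operations in the example.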
[0076] 403. Determine the target region image from the image to be processed based on the position and size information of the target candidate box.
[0077] Specifically, based on the position and size information of the target candidate box, a region image of the corresponding position and size can be extracted from the image to be processed and used as the target region image including the target detection object.
[0078] In some feasible implementations, considering that the image to be processed may have irregularities such as rotation or stretching introduced during shooting, and that shooting devices are diverse, after determining the target candidate box, the width and height of the target candidate box can be expanded by a preset ratio (e.g., 10%) to obtain the expanded target candidate box. Then, based on the position and size information of the expanded target candidate box, the target region image at the corresponding position and size is extracted from the image to be processed. Expanding the candidate box helps ensure that the whole target detection object in the image is covered, avoiding the omission of part of the target detection object due to model instability. As shown in Figure 5b, because the image to be processed was rotated during shooting, the obtained target candidate box cannot completely cover the target detection object (i.e., the window). The target candidate box can be expanded according to the preset ratio. As shown in Figure 5c, the expanded target candidate box covers the target detection object more completely, which helps to accurately extract the filling region in the image and improve the image filling effect.
[0079] In some feasible implementations, the specific implementation of expanding the target candidate box to obtain the expanded target candidate box may include:
[0080] The boundary of the target candidate box is moved in the direction of expanding the target candidate box according to a preset ratio to obtain the expanded boundary. If the expanded boundary exceeds the boundary of the image to be processed, the boundary of the image to be processed is used as the boundary of the expanded target candidate box, and then the expanded target candidate box is determined according to the boundary of the expanded target candidate box.
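The expansion and clamping steps can be sketched as follows. This is only an illustration under stated assumptions: the box is given as (x, y, w, h) in pixels with the origin at the top-left corner, the growth is split evenly between opposite sides, and the image is a plain list of pixel rows. The function names `expand_box` and `crop` are hypothetical.

```python
from typing import List, Tuple


def expand_box(x: int, y: int, w: int, h: int,
               img_w: int, img_h: int,
               ratio: float = 0.10) -> Tuple[int, int, int, int]:
    """Return (x, y, w, h) of the expanded box, clamped to the image bounds."""
    dx = round(w * ratio / 2)   # assumption: width growth is split left/right
    dy = round(h * ratio / 2)   # assumption: height growth is split top/bottom
    left = max(0, x - dx)                # clamp to the left image boundary
    top = max(0, y - dy)                 # clamp to the top image boundary
    right = min(img_w, x + w + dx)       # clamp to the right image boundary
    bottom = min(img_h, y + h + dy)      # clamp to the bottom image boundary
    return left, top, right - left, bottom - top


def crop(image: List[list], box: Tuple[int, int, int, int]) -> List[list]:
    """Extract the target region image for a box given as (x, y, w, h)."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


# A box well inside a 640x480 image grows by 10% in each dimension.
print(expand_box(100, 100, 200, 100, 640, 480))
# A box near the bottom-right corner is clipped to the image boundary.
print(expand_box(600, 400, 40, 80, 640, 480))
```

Clamping each side independently means a box near an image edge grows only on the sides where room remains, which matches the rule of using the image boundary as the boundary of the expanded box.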
[0081] 404. Input the target region image into the image segmentation network model for binarization segmentation to obtain the classification labels of the pixels in the target region image.
[0082] Specifically, the image segmentation network model is called to perform binarization segmentation, that is, binary classification, on the pixels in the target region image, yielding the classification labels (also known as the segmentation mask) of those pixels. A classification label indicates whether the pixel belongs to the border region of the target detection object. For example, a classification label of 1 indicates that the pixel belongs to the border region of the target detection object, and a classification label of 0 indicates that it does not.
[0083] 405. Determine the semantic segmentation result of the target region image based on the classification labels of the pixels in the target region image.
[0084] 406. Based on the semantic segmentation result, obtain the pixels in the image of the region where the target detection object is located that do not belong to the border region of the target detection object.
[0085] 407. Use the second region image composed of pixels that do not belong to the border region of the target detection object as the filling region, and fill the filling region with the source image.
[0086] Specifically, after the semantic segmentation result is obtained, the pixels in the image of the region where the target detection object is located that do not belong to the border region of the target detection object can be obtained based on the pixel classification labels. These pixels are used as the filling region, and the source image is used to fill it. For example, as shown in Figure 3b, the pixels in the image of the region where the window is located that do not belong to the border region of window 20 can be obtained according to the pixel classification labels and used as the filling region.
[0087] In some feasible implementations, when determining the filling region, the pixels in the target region image that belong to the border region of the target detection object can be obtained based on the pixel classification labels. A first region image composed of these border pixels is then determined, and the part of the image of the region where the target detection object is located that lies outside the first region image is determined as the filling region. For example, as shown in Figure 3b, the pixels belonging to the border region of window 20 in the target region image 30 can be obtained according to the pixel classification labels, the first region image composed of these pixels (i.e., the region image corresponding to the border region) can be determined, and the part of the image of the region where the window is located other than the first region image can then be determined as the filling region.
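The two ways of deriving the filling region described above (selecting the non-border pixels directly, or removing the border pixels from the whole region) can be sketched on a toy mask. The sketch assumes label 1 marks border pixels and label 0 marks everything else, that the region is a plain list of pixel rows, and that the source image has already been resized to the region size; the function names are hypothetical, and string values stand in for pixel colors.

```python
def filling_pixels(mask):
    """Directly collect the coordinates whose label is 0 (non-border)."""
    return {(r, c)
            for r, row in enumerate(mask)
            for c, label in enumerate(row) if label == 0}


def filling_pixels_by_complement(mask):
    """Collect border pixels first, then take everything outside them."""
    all_pixels = {(r, c) for r, row in enumerate(mask) for c in range(len(row))}
    border = {(r, c)
              for r, row in enumerate(mask)
              for c, label in enumerate(row) if label == 1}
    return all_pixels - border


def fill(region, mask, source):
    """Copy source pixels into the filling region and keep border pixels."""
    return [[source[r][c] if mask[r][c] == 0 else region[r][c]
             for c in range(len(row))]
            for r, row in enumerate(region)]


# Toy 4x4 window: a one-pixel frame (label 1) around a 2x2 pane (label 0).
mask = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
region = [["frame"] * 4 for _ in range(4)]   # stand-in pixel values
source = [["sky"] * 4 for _ in range(4)]     # stand-in source image

filled = fill(region, mask, source)
for row in filled:
    print(row)
```

Both selection functions return the same set of coordinates, so either formulation can feed the same fill step.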
[0088] In some feasible implementations, object detection network models and image segmentation network models can be trained as follows: Specifically, a training sample set is obtained, comprising multiple images and annotation information. The annotation information includes bounding boxes for the detected objects and semantic segmentation results. A bounding box is a rectangular box in the image that includes the detected object. Taking a window as an example, the multiple images can include images labeled as windows from public datasets such as SUNRGBD and COCO, images of indoor windows captured by cameras, and negative examples of indoor furniture, elevators, floors, and stair lights captured by cameras. These negative examples help avoid false positives during window detection and semantic segmentation. Then, the neural network is trained using the bounding boxes from these multiple images and annotation information to obtain the object detection network model; the image segmentation network model is trained using the semantic segmentation results from these multiple images and annotation information.
[0089] Among them, object detection network models and image segmentation network models can adopt the structure of neural networks such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), which can specifically include convolutional layers, pooling layers, non-linear activation functions, and upsampling layers.
[0090] In training the image segmentation network model, the cross-entropy loss function can be used for supervision. The loss function is L = -log(P_i), where P_i is the probability that the prediction assigns to the labeled true classification label of pixel i, and lies between 0 and 1.
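As a worked illustration of this loss, the sketch below evaluates L = -log(P_i) for individual pixels and averages it over a small mask. It assumes the model outputs, for each pixel, the probability of label 1; averaging over pixels and the small `eps` floor that avoids log(0) are assumptions not stated in the text, and the function names are hypothetical.

```python
import math


def pixel_loss(p_label1: float, label: int, eps: float = 1e-12) -> float:
    """Cross-entropy for one pixel: -log of the probability of the true label."""
    p_correct = p_label1 if label == 1 else 1.0 - p_label1
    return -math.log(max(p_correct, eps))   # eps floor is an assumption


def mean_loss(probs, labels) -> float:
    """Average the per-pixel loss over a 2D mask (averaging is an assumption)."""
    losses = [pixel_loss(p, y)
              for p_row, y_row in zip(probs, labels)
              for p, y in zip(p_row, y_row)]
    return sum(losses) / len(losses)


probs = [[0.9, 0.2], [0.6, 0.1]]    # predicted probability of label 1
labels = [[1, 0], [1, 0]]           # labeled true classification labels
print(mean_loss(probs, labels))
```

The loss is zero when the true label receives probability 1 and grows without bound as that probability approaches 0, so confident wrong predictions are penalized most.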
[0091] In some feasible implementations, the image to be processed may contain multiple target detection objects. As shown in Figure 5d, the image to be processed contains three windows, so the determined filling regions include regions 41, 42, and 43. The source image is then used to fill these three regions, and the filling effect is shown in Figure 5e. Using a single source image to fill the three regions can enhance the overall cohesiveness of the image. Of course, three source images can also be used, with each image corresponding to one region; this embodiment of the invention does not limit this.
[0092] In some feasible implementations, for the overlapping parts of multiple target candidate boxes, an "OR" logic can be used. That is, as long as the semantic segmentation result of one target candidate box indicates that a certain pixel in the overlapping part belongs to the target detection object, then this pixel in the final detection result belongs to the target detection object.
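The "OR" rule for overlapping candidate boxes can be sketched by pasting each box's mask into a full-image mask. The sketch assumes each detection is a tuple (x, y, mask), where (x, y) is the top-left corner of the box in image coordinates and label 1 means the pixel belongs to the target detection object; the function name `merge_masks` is hypothetical.

```python
def merge_masks(img_h, img_w, detections):
    """Combine per-box masks into one image-sized mask using logical OR."""
    merged = [[0] * img_w for _ in range(img_h)]
    for x, y, mask in detections:
        for r, row in enumerate(mask):
            for c, label in enumerate(row):
                if label == 1:
                    # a positive label from any one box is enough
                    merged[y + r][x + c] = 1
    return merged


# Two 2x2 boxes in a 2x3 image that overlap in image column 1.
det_a = (0, 0, [[1, 0],
                [0, 0]])
det_b = (1, 0, [[0, 1],
                [1, 0]])

merged = merge_masks(2, 3, [det_a, det_b])
for row in merged:
    print(row)
```

In the overlapping column, the pixel at row 1 is labeled 0 by the first box and 1 by the second, and the merged mask keeps the 1.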
[0093] In this embodiment of the invention, by calling the object detection network model, target candidate boxes including the target detection object can be obtained. Based on the position and size information of the target candidate boxes, the target region image with the corresponding position and size is extracted from the image to be processed. By calling the image segmentation network model to perform binarization segmentation on the target region image, the classification labels of the pixels in the target region image can be obtained. Based on the classification labels, the region composed of pixels in the target region image that do not belong to the border region of the target detection object is used as the filling region, such as the filling region of a window or door. The material image is then filled into the filling region. Semantic segmentation is performed only on the target region image, avoiding the huge computational load caused by semantic segmentation of the entire image. The filling region in the image can be extracted efficiently and accurately, and the image filling can be completed quickly, improving the effect and accuracy of image filling.
[0094] Please refer to Figure 6, which is a schematic diagram of the structure of an image processing device according to an embodiment of the present invention. The device includes:
[0095] The acquisition module 601 is used to acquire the image to be processed.
[0096] The determination module 602 is used to call the object detection network model to determine the target region image in the image to be processed, wherein the target region image includes the image of the region where the target object is located.
[0097] The processing module 603 is used to call an image segmentation network model to perform semantic segmentation processing on the target region image to obtain the semantic segmentation result of the target region image. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the bounding box region of the target detection object.
[0098] The determining module 602 is further configured to determine a filling region from the target region image based on the semantic segmentation result, wherein the filling region includes the region image of the region where the target detection object is located, excluding the border region.
[0099] The processing module 603 is also used to fill the filling area with material images.
[0100] Optionally, the semantic segmentation result includes a classification label for the pixel, which indicates whether the pixel belongs to the bounding box region of the target detection object.
[0101] Optionally, the determining module 602 is specifically used for:
[0102] The pixels belonging to the border region of the target detection object in the target region image are obtained based on the semantic segmentation result.
[0103] A first region image is determined, consisting of pixels belonging to the border region of the target detection object.
[0104] The region of the image in the area where the target object is located, excluding the first region image, is determined as the filling region.
[0105] Optionally, the determining module 602 is specifically used for:
[0106] Based on the semantic segmentation result, obtain the pixels in the image of the region where the target detection object is located that do not belong to the border region of the target detection object.
[0107] The second region image, composed of pixels that do not belong to the border region of the target detection object, is used as the filling region.
[0108] Optionally, the processing module 603 is specifically used for:
[0109] The target region image is input into an image segmentation network model for binarization segmentation to obtain the classification labels of the pixels in the target region image.
[0110] The semantic segmentation result of the target region image is determined based on the classification labels of the pixels in the target region image.
[0111] Optionally, the determining module 602 is specifically used for:
[0112] The image to be processed is input into an object detection network model to obtain a target candidate box that includes the target detection object.
[0113] The target region image is determined from the image to be processed based on the position and size information of the target candidate box.
[0114] Optionally, the determining module 602 is specifically used for:
[0115] The target candidate box is expanded according to a preset ratio to obtain the expanded target candidate box.
[0116] Based on the position and size information of the expanded target candidate box, the image region in the image to be processed that corresponds to the position and size of the expanded target candidate box is taken as the target region image.
[0117] Optionally, the determining module 602 is specifically used for:
[0118] The boundary of the target candidate box is moved in the direction of expanding the target candidate box according to a preset ratio to obtain the expanded boundary.
[0119] If the expanded boundary exceeds the boundary of the image to be processed, then the boundary of the image to be processed is used as the boundary of the expanded target candidate box.
[0120] The expanded target candidate box is determined based on the boundary of the expanded target candidate box.
[0121] Optionally, the determining module 602 is specifically used for:
[0122] The image to be processed is input into an object detection network model to obtain the probability distribution, position information, and size information of multiple candidate boxes of the image to be processed. The probability distribution is used to indicate the probability that the candidate box includes the target object.
[0123] Based on the probability distribution of the plurality of candidate boxes, at least one candidate box is determined from the plurality of candidate boxes.
[0124] Based on the probability distribution, position information, and size information of the at least one candidate box, a target candidate box including the target detection object is determined from the at least one candidate box.
[0125] Optionally, the determining module 602 is specifically used for:
[0126] Based on the position and size information of the at least one candidate box, at least one predicted position information of the target detection object is calculated.
[0127] The at least one predicted location information is filtered using a non-maximum suppression filtering strategy to obtain the target predicted location information.
[0128] The candidate boxes for which the predicted location information of the target is calculated in the at least one candidate box are used as target candidate boxes that include the target detection object.
[0129] Optionally, the acquisition module 601 is further configured to acquire a training sample set, which includes multiple images and annotation information, including bounding boxes for the target detection object and semantic segmentation results.
[0130] The processing module 603 is further configured to train the neural network using the multiple images and the bounding boxes in the annotation information to obtain an object detection network model.
[0131] The processing module 603 is further configured to train the neural network using the semantic segmentation results from the multiple images and the annotation information to obtain an image segmentation network model.
[0132] Optionally, the processing module 603 is specifically used for:
[0133] Get the current time information.
[0134] Select a material image from the material library that matches the current time information.
[0135] Fill the filling region with the selected material image.
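One way the time-based selection could work is sketched below. The text does not specify how a material image is matched to the current time, so the hour ranges, the file names, and the `MATERIAL_LIBRARY` structure are all hypothetical.

```python
from datetime import datetime

# Hypothetical material library: (start_hour, end_hour, material image name).
MATERIAL_LIBRARY = [
    (6, 18, "daytime_sky.png"),
    (18, 24, "night_sky.png"),
    (0, 6, "night_sky.png"),
]


def select_material(now: datetime, library=MATERIAL_LIBRARY) -> str:
    """Return the material image whose hour range contains the current hour."""
    for start_hour, end_hour, name in library:
        if start_hour <= now.hour < end_hour:
            return name
    raise LookupError("no material image matches the current time")


print(select_material(datetime(2021, 2, 4, 9, 0)))    # a morning timestamp
print(select_material(datetime(2021, 2, 4, 23, 30)))  # a late-evening timestamp
```

In practice `datetime.now()` would supply the current time; fixed timestamps are used here so the example is reproducible.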
[0136] Optionally, the acquisition module 601 is specifically used for:
[0137] The camera is activated in response to a shooting command triggered by the target application.
[0138] The image captured by the shooting device is acquired, including the image displayed in the preview window.
[0139] The image captured by the shooting device is used as the image to be processed.
[0140] It should be noted that the functions of each functional module of the image processing device in the embodiments of the present invention can be specifically implemented according to the methods in the above method embodiments. The specific implementation process can be referred to the relevant descriptions in the above method embodiments, and will not be repeated here.
[0141] Please refer to Figure 7, which is a schematic diagram of the structure of an electronic device according to an embodiment of the present invention. In addition to structures such as a power supply module, the electronic device includes a processor 701, a storage device 702, a user interface 703, and a shooting device 704. The processor 701, the storage device 702, the user interface 703, and the shooting device 704 can exchange data.
[0142] The storage device 702 may include volatile memory, such as random-access memory (RAM); the storage device 702 may also include non-volatile memory, such as flash memory, solid-state drive (SSD), etc.; the storage device 702 may also include a combination of the above types of memory.
[0143] The user interface 703 may include a display, a touch panel, etc., for outputting data such as images and detecting user touch operations.
[0144] The shooting device 704 may include a camera, such as a front camera, a rear camera, a single-lens camera, or a multi-lens camera.
[0145] The processor 701 may be a central processing unit (CPU). In one embodiment, the processor 701 may also be a graphics processing unit (GPU). The processor 701 may also be a combination of a CPU and a GPU. In one embodiment, the storage device 702 is used to store program instructions. The processor 701 can invoke the program instructions to perform the following operations:
[0146] Obtain the image to be processed.
[0147] The object detection network model is invoked to determine the target region image in the image to be processed, the target region image including the image of the region where the target object is located.
[0148] An image segmentation network model is invoked to perform semantic segmentation on the target region image to obtain the semantic segmentation result of the target region image. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the bounding box region of the target detection object.
[0149] Based on the semantic segmentation result, a filling region is determined from the target region image, and a source image is filled into the filling region. The filling region includes the region image of the area where the target detection object is located, excluding the border region.
[0150] Optionally, the semantic segmentation result includes a classification label for the pixel, which indicates whether the pixel belongs to the bounding box region of the target detection object.
[0151] Optionally, the processor 701 is specifically used for:
[0152] The pixels belonging to the border region of the target detection object in the target region image are obtained based on the semantic segmentation result.
[0153] A first region image is determined, consisting of pixels belonging to the border region of the target detection object.
[0154] The region of the image in the area where the target object is located, excluding the first region image, is determined as the filling region.
[0155] Optionally, the processor 701 is specifically used for:
[0156] Based on the semantic segmentation result, obtain the pixels in the image of the region where the target detection object is located that do not belong to the border region of the target detection object.
[0157] The second region image, composed of pixels that do not belong to the border region of the target detection object, is used as the filling region.
[0158] Optionally, the processor 701 is specifically used for:
[0159] The target region image is input into an image segmentation network model for binarization segmentation to obtain the classification labels of the pixels in the target region image.
[0160] The semantic segmentation result of the target region image is determined based on the classification labels of the pixels in the target region image.
[0161] Optionally, the processor 701 is specifically used for:
[0162] The image to be processed is input into an object detection network model to obtain a target candidate box that includes the target detection object.
[0163] The target region image is determined from the image to be processed based on the position and size information of the target candidate box.
[0164] Optionally, the processor 701 is specifically used for:
[0165] The target candidate box is expanded according to a preset ratio to obtain the expanded target candidate box.
[0166] Based on the position and size information of the expanded target candidate box, the image region in the image to be processed that corresponds to the position and size of the expanded target candidate box is taken as the target region image.
[0167] Optionally, the processor 701 is specifically used for:
[0168] The boundary of the target candidate box is moved in the direction of expanding the target candidate box according to a preset ratio to obtain the expanded boundary.
[0169] If the expanded boundary exceeds the boundary of the image to be processed, then the boundary of the image to be processed is used as the boundary of the expanded target candidate box.
[0170] The expanded target candidate box is determined based on the boundary of the expanded target candidate box.
[0171] Optionally, the processor 701 is specifically used for:
[0172] The image to be processed is input into an object detection network model to obtain the probability distribution, position information, and size information of multiple candidate boxes of the image to be processed. The probability distribution is used to indicate the probability that the candidate box includes the target object.
[0173] Based on the probability distribution of the plurality of candidate boxes, at least one candidate box is determined from the plurality of candidate boxes.
[0174] Based on the probability distribution, position information, and size information of the at least one candidate box, a target candidate box including the target detection object is determined from the at least one candidate box.
[0175] Optionally, the processor 701 is specifically used for:
[0176] Based on the position and size information of the at least one candidate box, at least one predicted position information of the target detection object is calculated.
[0177] The at least one predicted location information is filtered using a non-maximum suppression filtering strategy to obtain the target predicted location information.
[0178] The candidate boxes for which the predicted location information of the target is calculated in the at least one candidate box are used as target candidate boxes that include the target detection object.
[0179] Optionally, the processor 701 is further configured to:
[0180] Obtain a training sample set, which includes multiple images and annotation information, including bounding boxes for the target detection object and semantic segmentation results.
[0181] The neural network is trained using the multiple images and the bounding boxes in the annotation information to obtain an object detection network model.
[0182] The neural network is trained using the semantic segmentation results from the multiple images and the labeled information to obtain an image segmentation network model.
[0183] Optionally, the processor 701 is specifically used for:
[0184] Get the current time information.
[0185] Select a material image from the material library that matches the current time information.
[0186] Fill the filling region with the selected material image.
[0187] Optionally, the processor 701 is specifically used for:
[0188] In response to a shooting command triggered by the target application, the shooting device 704 is activated.
[0189] The image captured by the shooting device 704 is acquired, including the image displayed in the preview window.
[0190] The image captured by the shooting device 704 is used as the image to be processed.
[0191] In specific implementations, the processor 701, storage device 702, user interface 703, and shooting device 704 described in the embodiments of the present invention can execute the implementations described in the embodiments of the deep learning-based image processing method provided in Figure 2 and Figure 4, and can also execute the implementations described in the embodiment of the image processing apparatus provided in Figure 6, which will not be repeated here.
[0192] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program includes one or more instructions and can be stored in a computer storage medium. When executed, the program can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0193] This application also provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the steps described in the embodiments of the above methods.
[0194] The above-disclosed embodiments are only some of the embodiments of this application, and should not be construed as limiting the scope of this application. Therefore, any equivalent changes made in accordance with the claims of this application shall still fall within the scope of this application.
Claims
1. A deep learning-based image processing method, characterized in that, The method includes: When the virtual reality function of the target application is enabled, in response to a shooting command triggered by the target application, an image captured by a shooting device is acquired, the image captured by the shooting device including the image displayed in the preview window; The image captured by the shooting device is used as the image to be processed; An object detection network model is invoked to determine a target region image in the image to be processed. The target region image includes the image of the region where the target object is located. The target object includes an object consisting of an outer border and a middle region surrounded by the outer border. The object includes a window or a door. The image segmentation network model is invoked to perform semantic segmentation on the target region image to obtain the semantic segmentation result of the target region image. The semantic segmentation result is used to indicate whether the pixels in the target region image belong to the bounding box region of the target detection object. Based on the semantic segmentation result, a filling region is determined from the target region image, and a source image is filled into the filling region. The filling region includes the region image of the area where the target detection object is located, excluding the border region. The step of filling the filled area with a material image includes: When the image to be processed includes multiple target detection objects, the determined filling region includes the filling region corresponding to each of the multiple target detection objects; The filling region corresponding to each of the multiple target detection objects is filled using different image regions of the same source image.
2. The method according to claim 1, characterized in that, The semantic segmentation result includes a classification label for the pixel, which indicates whether the pixel belongs to the bounding box region of the target detection object.
3. The method according to claim 1 or 2, characterized in that, Determining the filling region from the target region image based on the semantic segmentation result includes: Based on the semantic segmentation result, obtain the pixels of the border region belonging to the target detection object in the target region image; Determine a first region image composed of pixels belonging to the border region of the target detection object; The region of the image in the area where the target object is located, excluding the first region image, is determined as the filling region.
4. The method according to claim 1 or 2, characterized in that, Determining the filling region from the target region image based on the semantic segmentation result includes: Based on the semantic segmentation result, obtain the pixels in the image of the region where the target detection object is located that do not belong to the bounding region of the target detection object; The second region image, composed of pixels that do not belong to the border region of the target detection object, is used as the filling region.
5. The method according to claim 1, characterized in that, The step of calling an image segmentation network model to perform semantic segmentation processing on the target region image to obtain the semantic segmentation result of the target region image includes: The target region image is input into an image segmentation network model for binarization segmentation to obtain the classification labels of the pixels in the target region image; The semantic segmentation result of the target region image is determined based on the classification labels of the pixels in the target region image.
6. The method according to claim 1, characterized in that, The object detection network model determines the target region image in the image to be processed, including: The image to be processed is input into an object detection network model to obtain a target candidate box that includes the target detection object; The target region image is determined from the image to be processed based on the position and size information of the target candidate box.
7. The method according to claim 6, characterized in that, The step of determining the target region image from the image to be processed based on the position and size information of the target candidate box includes: The target candidate box is expanded according to a preset ratio to obtain the expanded target candidate box; Based on the position and size information of the expanded target candidate box, the image region in the image to be processed that corresponds to the position and size of the expanded target candidate box is taken as the target region image.
8. The method according to claim 7, characterized in that, The step of expanding the target candidate box according to a preset ratio to obtain the expanded target candidate box includes: The boundary of the target candidate box is moved in the direction of expanding the target candidate box according to a preset ratio to obtain the expanded boundary; If the expanded boundary exceeds the boundary of the image to be processed, then the boundary of the image to be processed is taken as the boundary of the expanded target candidate box. The expanded target candidate box is determined based on the boundary of the expanded target candidate box.
9. The method according to any one of claims 6 to 8, characterized in that, The step of inputting the image to be processed into an object detection network model to obtain target candidate boxes including the target detection object includes: The image to be processed is input into an object detection network model to obtain the probability distribution, position information and size information of multiple candidate boxes of the image to be processed. The probability distribution is used to indicate the probability that the candidate box includes the target detection object. Based on the probability distribution of the plurality of candidate boxes, at least one candidate box is determined from the plurality of candidate boxes; Based on the probability distribution, position information, and size information of the at least one candidate box, a target candidate box including the target detection object is determined from the at least one candidate box.
10. The method according to claim 9, characterized in that, The step of determining a target candidate box including the target detection object from the at least one candidate box based on the probability distribution, position information, and size information of the at least one candidate box includes: Based on the position and size information of the at least one candidate box, at least one predicted position information of the target detection object is calculated; The at least one predicted location information is filtered using a non-maximum suppression filtering strategy to obtain the target predicted location information; The candidate boxes for which the predicted location information of the target is calculated in the at least one candidate box are used as target candidate boxes that include the target detection object.
11. The method according to claim 1, characterized in that, before the object detection network model is invoked to determine the target region image in the image to be processed, the method further includes: obtaining a training sample set that includes multiple images and annotation information, the annotation information including bounding boxes of the target detection object and semantic segmentation results; training a neural network using the multiple images and the bounding boxes in the annotation information to obtain the object detection network model; and training a neural network using the multiple images and the semantic segmentation results in the annotation information to obtain the image segmentation network model.
12. The method according to claim 1, characterized in that the step of filling the filling region with a source image includes: obtaining current time information; determining, from a resource library, a source image that matches the current time information; and filling the filling region with the source image.
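The time-based selection in claim 12 can be sketched as a lookup from the current time into a resource library. The library contents, the file names, and the rule of matching by hour of day are all hypothetical; the claim does not state how a source image is matched to the time information.

```python
from datetime import datetime

# Hypothetical resource library: (start_hour, end_hour, source image name).
RESOURCE_LIBRARY = [
    (0, 6, "night_sky.png"),
    (6, 18, "daytime_sky.png"),
    (18, 24, "night_sky.png"),
]


def pick_source_image(now, library=RESOURCE_LIBRARY):
    """Return the source image whose hour range contains `now`, else None."""
    for start_hour, end_hour, name in library:
        if start_hour <= now.hour < end_hour:
            return name
    return None
```

In use, the caller would pass `datetime.now()` and then fill the filling region with the returned image. A real library might equally match on date, season, or holiday.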
13. An image processing apparatus, characterized in that the apparatus includes: an acquisition module configured to, when the virtual reality function of a target application is enabled and in response to a shooting command triggered by the target application, acquire an image captured by a shooting device, the image captured by the shooting device including the image displayed in the preview window, and to use the image captured by the shooting device as the image to be processed; a determination module configured to call the object detection network model to determine the target region image in the image to be processed, wherein the target region image includes the image of the region where the target detection object is located, the target detection object includes an object composed of an outer border and a middle region surrounded by the outer border, and the object includes a window or a door; and a processing module configured to call an image segmentation network model to perform semantic segmentation on the target region image to obtain a semantic segmentation result of the target region image, wherein the semantic segmentation result indicates whether each pixel in the target region image belongs to the border region of the target detection object;
wherein the determination module is further configured to determine a filling region from the target region image based on the semantic segmentation result, the filling region including the image of the region where the target detection object is located, excluding the border region; the processing module is further configured to fill the filling region with a source image; and specifically, the processing module is configured to: when the image to be processed includes multiple target detection objects, determine that the filling region includes the filling region corresponding to each of the multiple target detection objects, and fill the filling regions corresponding to the multiple target detection objects with different image areas of the same source image, respectively.
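The multi-object filling in claim 13 can be illustrated with a minimal sketch in which every filling region copies the pixels of one shared source image at its own coordinates, so that each region shows a different area of that image. The function name `fill_regions`, the nested-list image layout, and the assumption that the source image has already been scaled to the size of the image to be processed are illustrative choices, not details from the claim.

```python
def fill_regions(image, fill_masks, source):
    """Fill each masked region with the co-located pixels of one source image.

    image, source -- equal-sized 2D lists of pixel values (assumed layout)
    fill_masks    -- one 2D list of booleans per target detection object,
                     True for pixels inside that object's filling region
    """
    height, width = len(image), len(image[0])
    if len(source) != height or len(source[0]) != width:
        raise ValueError("source must already match the image size")

    # Work on a copy so the image to be processed is left unchanged.
    result = [row[:] for row in image]
    for mask in fill_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # Border and background pixels are never touched.
                    result[y][x] = source[y][x]
    return result
```

Because each mask selects a different set of coordinates, two windows in the same photograph end up showing two different parts of the same scene, which is one way to make them look like openings onto a single shared background.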
14. An electronic device, characterized in that the electronic device includes a processor and a storage device that are interconnected, wherein the storage device is used to store a computer program that includes program instructions, and the processor is configured to invoke the program instructions to execute the deep learning-based image processing method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that includes program instructions which, when executed by a processor, perform the deep learning-based image processing method according to any one of claims 1 to 12.
16. A computer program product comprising computer instructions, characterized in that, when the computer instructions are executed by a computer processor, they implement the deep learning-based image processing method according to any one of claims 1 to 12.