Image depth map and normal map generation method, virtual live streaming method and device

CN115511937B · Active · Publication Date: 2026-06-16 · GUANGZHOU FANGSI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: GUANGZHOU FANGSI INFORMATION TECH CO LTD
Filing Date: 2022-09-29
Publication Date: 2026-06-16

AI Technical Summary

⚠Technical Problem

Existing deep learning methods suffer from large reconstruction errors in image depth and normal estimation, resulting in low accuracy of the generated depth and normal maps, which affects the performance of applications such as 3D reconstruction, autonomous driving, fine image segmentation, and lighting effect rendering.

⚗Method used

By acquiring the first sample image dataset, depth estimation maps and normal estimation maps are generated using the trained depth map and normal map prediction models. Second sample images that meet the preset conditions are selected for training. The second depth map and normal map prediction models using the U-Net network structure are trained under supervision to generate high-accuracy depth maps and normal maps.

🎯Benefits of technology

It improves the accuracy of depth map and normal map generation, enhances the model's generalization ability and robustness, making it suitable for real-time applications on mobile devices and improving the effects of 3D reconstruction and lighting rendering.

✦ Generated by Eureka AI based on patent content.

Abstract

The application relates to the technical field of computer vision and discloses a method for generating a depth map and a normal map of an image, a virtual live streaming method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring a first sample image dataset; inputting each first sample image into a trained first depth map and normal map prediction model to obtain a depth estimation map and a normal estimation map; inputting each depth estimation map into a trained depth map to normal map model to obtain a first normal map; obtaining, from the first sample image dataset, a plurality of second sample images that satisfy a preset condition; training a second depth map and normal map prediction model on each second sample image together with its corresponding second depth map and second normal map; acquiring an image to be predicted; and inputting the image to be predicted into the trained second depth map and normal map prediction model to obtain a depth map and a normal map. The method improves the accuracy of the generated depth map and normal map.

Description

Technical Field

[0001] This application relates to the fields of computer vision and live streaming technology, and particularly to a method for generating depth maps and normal maps of images, a virtual live streaming method, an apparatus, a computer device, and a storage medium.

Background Technology

[0002] Depth estimation and normal estimation are fundamental techniques in computer vision, widely applied in 3D reconstruction, autonomous driving, fine image segmentation, lighting effects rendering, and facial animation. Depth estimation predicts the distance of each pixel in an RGB image to the camera plane, essentially predicting the depth value of each pixel. Based on the depth values of each pixel, a depth map of the RGB image is obtained, which visually reflects the geometry of object surfaces and the relative positions of objects. Normal estimation predicts the direction of the normal to the plane containing each pixel in an RGB image, i.e., predicting the normal vector value of each pixel, resulting in a normal map of the RGB image. The normal map can be used to calculate the direction of light reflection.
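To make concrete the remark that a normal map can be used to calculate the direction of light reflection, the following is a minimal sketch of the standard mirror-reflection formula r = d - 2(d·n)n. The function name `reflect` and the example vectors are illustrative assumptions and do not come from the patent text.

```python
import numpy as np

def reflect(incident, normal):
    """Mirror an incident light direction about a surface normal."""
    n = normal / np.linalg.norm(normal)  # make sure the normal has unit length
    return incident - 2.0 * np.dot(incident, n) * n

# Light travelling straight down onto an upward-facing surface bounces straight up.
down = np.array([0.0, 0.0, -1.0])
up_normal = np.array([0.0, 0.0, 1.0])
reflected = reflect(down, up_normal)

# Light arriving at an angle keeps its tangential component and flips the normal one.
slanted = reflect(np.array([1.0, 0.0, -1.0]), up_normal)
```

In a renderer, the normal would be read per pixel from the normal map rather than supplied by hand as it is here.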

[0003] Currently, image depth and normal estimation mainly rely on deep learning methods. Deep learning methods acquire multiple frames of images, calculate camera pose transformations based on adjacent frames, and then reconstruct adjacent frames using the depth map predicted from a single frame. The reconstruction error is then used to train a neural network model. However, camera pose calculations based on adjacent frames have significant errors, resulting in inaccurate reconstruction errors and low accuracy of the trained neural network model, ultimately leading to low accuracy in the output depth and normal maps.

Summary of the Invention

[0004] This application provides a method for generating depth maps and normal maps of images, a virtual live streaming method, an apparatus, a computer device, and a storage medium, which improves the accuracy of generating depth maps and normal maps. The technical solution is as follows:

[0005] In a first aspect, embodiments of this application provide a method for generating a depth map and a normal map of an image, comprising the steps of:

[0006] Obtain a first sample image dataset; the first sample image dataset includes several first sample images;

[0007] Each first sample image is input into a trained first depth map and normal map prediction model to obtain a depth estimation map and a normal estimation map corresponding to each first sample image.

[0008] Each of the depth estimation maps is input into a trained depth map to normal map model to obtain a first normal map corresponding to each of the first sample images;

[0009] Based on the normal estimation map and the first normal map, several second sample images that meet preset conditions are obtained from the first sample image dataset, and a second depth map and a second normal map corresponding to the several second sample images are obtained.

[0010] Each second sample image, the second depth map and the second normal map corresponding to the second sample image are input into the second depth map and normal map prediction model for training, and the trained second depth map and normal map prediction model is obtained.

[0011] Obtain the image to be predicted;

[0012] The image to be predicted is input into the trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the image to be predicted.

[0013] In a second aspect, embodiments of this application provide a virtual live streaming method, which includes the following steps:

[0014] Acquire a scene image, and generate a depth map and normal map corresponding to the scene image using the image depth map and normal map generation method described above;

[0015] Acquire a live stream image, perform foreground segmentation on the live stream image, and obtain the anchor image;

[0016] The anchor image is fused with the depth map and normal map corresponding to the scene image to obtain a fused image;

[0017] The fused image is rendered and displayed in real time.

[0018] In a third aspect, embodiments of this application provide an apparatus for generating depth maps and normal maps of images, comprising:

[0019] The dataset acquisition module is used to acquire a first sample image dataset; the first sample image dataset includes several first sample images;

[0020] The sample image input module is used to input each of the first sample images into the trained first depth map and normal map prediction model to obtain the depth estimation map and normal estimation map corresponding to each of the first sample images.

[0021] The depth estimation map input module is used to input each of the depth estimation maps into a trained depth map to normal map model to obtain a first normal map corresponding to each of the first sample images;

[0022] The second sample image acquisition module is used to obtain a number of second sample images that meet preset conditions from the first sample image dataset based on the normal estimation map and the first normal map, and to obtain a second depth map and a second normal map corresponding to the number of second sample images.

[0023] The model training module is used to input each second sample image, the second depth map and the second normal map corresponding to the second sample image into the second depth map and normal map prediction model for training, and obtain the trained second depth map and normal map prediction model.

[0024] The image acquisition module is used to acquire the image to be predicted.

[0025] The depth map acquisition module is used to input the image to be predicted into the trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the image to be predicted.

[0026] In a fourth aspect, embodiments of this application provide a virtual live streaming device, comprising:

[0027] The scene image acquisition module is used to acquire scene images and generate depth maps and normal maps corresponding to the scene images using the depth map and normal map generation method described above.

[0028] The live streaming room image acquisition module is used to acquire live streaming room images, perform foreground segmentation on the live streaming room images, and obtain the anchor image;

[0029] The image fusion module is used to fuse the anchor image with the depth map and normal map corresponding to the scene image to obtain a fused image;

[0030] The image rendering module is used to render and display the fused image in real time.

[0031] In a fifth aspect, embodiments of this application provide a computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method as described in the first or second aspect.

[0032] In a sixth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method as described in the first or second aspect.

[0033] This application embodiment obtains a first sample image dataset, which includes several first sample images. Each first sample image is input into a trained first depth map and normal map prediction model to obtain a depth estimation map and a normal estimation map corresponding to each first sample image. Each depth estimation map is input into a trained depth map to normal map model to obtain a first normal map corresponding to each first sample image. Based on the normal estimation map and the first normal map, several second sample images that meet preset conditions are obtained from the first sample image dataset, and second depth maps and second normal maps corresponding to the several second sample images are obtained. Each second sample image, the second depth map, and the second normal map corresponding to the second sample image are input into the second depth map and normal map prediction model for training to obtain a trained second depth map and normal map prediction model. An image to be predicted is obtained. The image to be predicted is input into the trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the image to be predicted. In this embodiment, a second sample image is obtained from a first sample image dataset. The second depth map and the second normal map corresponding to the second sample image are used as pseudo-labels for the depth map and normal map. A prediction model for the second depth map and normal map is trained based on the pseudo-labels for the depth map and normal map, thereby obtaining a well-trained prediction model for the second depth map and normal map, which improves the accuracy of generating depth maps and normal maps.

[0034] To better understand and implement this application, the technical solution is described in detail below with reference to the accompanying drawings.

Attached Figure Description

[0035] Figure 1 is a schematic diagram of an application scenario of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0036] Figure 2 is a schematic flowchart of the method for generating depth maps and normal maps of images provided in the first embodiment of this application;

[0037] Figure 3 is a schematic flowchart of step S40 of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0038] Figure 4 is a schematic flowchart of step S401 of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0039] Figure 5 is a schematic flowchart of step S402 of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0040] Figure 6 is a schematic flowchart of step S50 of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0041] Figure 7 is a schematic flowchart of step S200 of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0042] Figure 8 is a schematic flowchart of step S503 of the method for generating depth maps and normal maps of images provided in an embodiment of this application;

[0043] Figure 9 is a schematic flowchart of the virtual live streaming method provided in the second embodiment of this application;

[0044] Figure 10 is a schematic diagram of the structure of the apparatus for generating depth maps and normal maps of images provided in the third embodiment of this application;

[0045] Figure 11 is a schematic diagram of the structure of the virtual live streaming device provided in the fourth embodiment of this application;

[0046] Figure 12 is a schematic diagram of the structure of a computer device provided in the fifth embodiment of this application.

Detailed Implementation

[0047] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0048] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0049] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0050] Those skilled in the art will understand that the terms "client," "terminal," and "terminal device" as used in this application cover both devices that contain only a wireless signal receiver, with no transmission capability, and devices that contain receiving and transmitting hardware capable of bidirectional communication over a bidirectional communication link. Such devices may include: cellular or other communication devices, such as personal computers or tablets, with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which can combine voice, data processing, fax, and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar, and/or GPS (Global Positioning System) receiver; and conventional laptop and/or handheld computers or other devices that have and/or include a radio frequency receiver. As used herein, a "client," "terminal," or "terminal device" can be portable, transportable, installed in a means of transportation (air, sea, and/or land), or suitable and/or configured to operate locally and/or in a distributed manner at any location on Earth and/or in space. It can also be a communication terminal, an Internet access terminal, or a music/video playback terminal, such as a PDA, a MID (Mobile Internet Device), and/or a mobile phone with music/video playback capabilities, or a device such as a smart TV or set-top box.

[0051] The hardware referred to by the names "server," "client," and "service node" in this application is essentially a computer device with capabilities equivalent to those of a personal computer. It is a hardware device with the components required by the von Neumann architecture, such as a central processing unit (including an arithmetic logic unit and a control unit), memory, input devices, and output devices. The computer program is stored in its memory; the central processing unit loads the program from secondary storage into main memory, executes the instructions in the program, and interacts with the input and output devices to complete specific functions.

[0052] It should be noted that the concept of "server" used in this application can also be extended to the case of server clusters. Based on the network deployment principles understood by those skilled in the art, the servers should be logically divided. Physically, these servers can be independent of each other but accessible through interfaces, or they can be integrated into a single physical computer or a computer cluster. Those skilled in the art should understand this flexibility and should not use it to constrain the implementation of the network deployment method in this application.

[0053] The depth map and normal map generation method provided in this application can be used to generate depth maps and normal maps for any three-dimensional image. Specifically, it can be used for three-dimensional reconstruction of three-dimensional images and lighting effect rendering based on depth maps and normal maps. This application embodiment takes the generation of depth maps and normal maps for live streaming images, specifically the background lighting effect rendering of live streaming images, as an example for illustration.

[0054] Please refer to Figure 1, which is a schematic diagram of an application scenario for the method of generating depth maps and normal maps of images provided in an embodiment of this application. The application scenario includes the broadcaster client 101, the server 102, and the viewer client 103. The broadcaster client 101 and the viewer client 103 interact through the server 102.

[0055] The broadcaster client 101 is the end that sends the live video, usually the client used by the broadcaster (i.e., the live-streaming user) during a live broadcast.

[0056] Viewer client 103 refers to the end that receives and watches live online videos. It is usually the client used by viewers (i.e., live stream viewers) watching videos in a live stream.

[0057] The hardware underlying the broadcaster client 101 and the viewer client 103 is essentially computer equipment. Specifically, as shown in Figure 1, it can be a computer device such as a smartphone, smart interactive whiteboard, or personal computer. Both the broadcaster client 101 and the viewer client 103 can access the Internet through known network access methods and establish a data communication link with the server 102.

[0058] Server 102, acting as a business server, can further connect to related audio data servers, video streaming servers, and other servers providing related support, thus forming a logically interconnected service cluster that serves related terminal devices, such as the broadcaster client 101 and the viewer client 103 shown in Figure 1.

[0059] In this embodiment, the broadcaster client 101 and the viewer client 103 can join the same live broadcast room (i.e., live broadcast channel). The aforementioned live broadcast room refers to a chat room implemented using Internet technology, which typically has audio and video playback control functions. The broadcaster user conducts live broadcasts in the live broadcast room through the broadcaster client 101, and the viewers of the viewer client 103 can log in to the server 102 to enter the live broadcast room to watch the live broadcast.

[0060] Within a live streaming room, broadcasters and viewers can interact through well-known online communication methods such as voice, video, and text. Typically, the broadcaster performs programs for the audience in the form of audio and video streams. During the interaction, resource exchange can also occur; for example, the viewer client 103 can send virtual gifts to the broadcaster client 101 in the same live streaming room. Of course, the application of live streaming rooms is not limited to online entertainment; it can be extended to other related scenarios, such as user matching and interaction, video conferencing, online teaching, product promotion and sales, and any other scenario requiring similar interaction.

[0061] Specifically, the process of watching the live stream is as follows: Viewers can click to access the live streaming application installed on the viewer client 103 and select to enter any live stream room, triggering the viewer client 103 to load the live stream room interface for the viewer. The live stream room interface includes several interactive components. By loading these interactive components, viewers can watch the live stream in the live stream room and engage in various online interactions.

[0062] Currently, in online live streaming, the depth map and normal map of RGB images can be generated and applied to scenarios such as 3D lighting, virtual-real interaction, and various AR effects. This can reduce the cost and complexity of starting a live stream for the broadcaster, produce high-quality interactive content efficiently, and improve the retention rate of viewers in the live stream.

[0063] However, the depth maps and normal maps generated by existing technologies are not very accurate, resulting in poor effects when applied to scenarios such as 3D lighting, virtual-real interaction, and various AR effects, which affects the broadcaster's experience and the viewer's experience.

[0064] Please refer to Figure 2, which is a schematic flowchart of the method for generating depth maps and normal maps of images according to the first embodiment of this application. The method includes the following steps:

[0065] S10: Obtain the first sample image dataset; the first sample image dataset includes several first sample images.

[0066] In this embodiment, the first sample image dataset can be the COCO dataset or the Places2 dataset, or a subset of images selected from either of them. The COCO dataset is a large-scale dataset for object detection, segmentation, and captioning. The Places2 dataset contains over 10 million images across more than 400 unique scene categories. Each category has 5,000 to 30,000 training images, consistent with the frequency of scenes in the real world.

[0067] S20: Input each first sample image into the trained first depth map and normal map prediction model to obtain the depth estimation map and normal estimation map corresponding to each first sample image.

[0068] The trained first depth map and normal map prediction model can output a depth estimate and a normal estimate for any input image. Specifically, it can be a dense prediction transformer (DPT) model. Because DPT models have a large number of parameters and high computational complexity, they consume significant memory and computational resources, which makes them unsuitable for mobile applications. However, the image training sets used by DPT models cover a wide range of business scenarios, so DPT models have strong generalization capabilities.

[0069] In this embodiment, only the depth map and normal map output by the DPT model are used as supervision labels for subsequent model training, so that the trained model can cover more business scenarios, improve the model's generalization ability, and enhance the model's robustness.

[0070] Specifically, by inputting the first sample image into the trained first depth map and normal map prediction model, a depth estimation map and a normal estimation map corresponding to the first sample image can be obtained.

[0071] S30: Input each depth estimation map into the trained depth map to normal map model to obtain the first normal map corresponding to each first sample image.

[0072] The trained depth map to normal map model can be a machine learning model or a deep neural network learning model, which can output a normal map based on any input depth map.

[0073] In this embodiment of the application, the depth estimation map of the first sample image is input into a trained depth map to normal map model to obtain a first normal map corresponding to the first sample image.
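The source specifies a trained machine learning or deep neural network model for step S30. Purely to illustrate what a depth-to-normal mapping computes, the sketch below swaps in a classical geometric approximation based on finite differences of the depth map. It is not the patented model; the function name `depth_to_normal` is hypothetical, and the sketch assumes unit pixel spacing and ignores camera intrinsics.

```python
import numpy as np

def depth_to_normal(depth):
    """Approximate per-pixel unit normals from a depth map of shape (H, W)."""
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # An unnormalised normal of the surface z = depth(x, y) is (-dz/dx, -dz/dy, 1).
    normals = np.dstack((-dz_dx, -dz_dy, np.ones_like(dz_dx)))
    norms = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / norms

# A fronto-parallel plane has normals pointing straight at the camera.
flat_normals = depth_to_normal(np.full((4, 5), 2.0))

# A plane whose depth grows by one unit per column is tilted by 45 degrees.
ramp_normals = depth_to_normal(np.tile(np.arange(5.0), (4, 1)))
```

Whatever model is actually used, its output plays the same role: a second, depth-derived normal map that can be compared against the normal estimation map from step S20.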

[0074] S40: Based on the normal estimation map and the first normal map, obtain several second sample images that meet the preset conditions from the first sample image dataset, and obtain the second depth map and the second normal map corresponding to the several second sample images.

[0075] In this embodiment, the similarity between the normal estimation map and the first normal map corresponding to each first sample image can be measured to obtain a similarity measurement result. Based on the similarity measurement result and preset conditions, several second sample images are selected from the first sample image dataset. Specifically, based on the similarity measurement result and preset conditions, it is determined whether the depth estimation map and normal estimation map corresponding to the first sample image are of high quality. First sample images with high-quality depth estimation maps and normal estimation maps are selected as second sample images, thus obtaining second depth maps and second normal maps corresponding to several second sample images.

[0076] S50: Input each second sample image, the corresponding second depth map, and the second normal map into the second depth map and normal map prediction model for training, and obtain the trained second depth map and normal map prediction model.

[0077] In this embodiment of the application, several second sample images and the second depth map and second normal map corresponding to the several second sample images are used as the training set of the second depth map and normal map prediction model. The second depth map and normal map prediction model is trained to obtain the trained second depth map and normal map prediction model.

[0078] The second depth map and normal map prediction model has a small number of parameters and a low computational cost, so it consumes little memory and few computational resources, making it suitable for mobile devices. Specifically, the second depth map and normal map prediction model adopts a U-Net network structure comprising an encoder, a decoder, a depth map prediction head network, and a normal map prediction head network. The MobileNet V3 network is used as the encoder, and the encoder's output serves as the input to the decoder. The decoder's output serves as the input to both the depth map prediction head network and the normal map prediction head network, each of which consists of a convolutional network layer and a ReLU layer. The trained second depth map and normal map prediction model has a simple structure with relatively few convolutional and ReLU layers, allowing it to run in real time on mobile devices and output highly accurate depth and normal maps.
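As a rough illustration of the two prediction heads sharing one decoder output, the sketch below applies a convolution followed by a ReLU to the same feature tensor twice, once per head. It is a shape-level sketch only: the 1x1 kernel size, the 16-channel feature width, the random weights, and the names `conv1x1_relu` and `predict_heads` are all assumptions, and the encoder and decoder are replaced by a random stand-in tensor.

```python
import numpy as np

def conv1x1_relu(features, weight, bias):
    """Apply a 1x1 convolution followed by ReLU to (H, W, C_in) features."""
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.maximum(features @ weight + bias, 0.0)

def predict_heads(decoder_features, depth_params, normal_params):
    """Feed the same decoder output to the depth head and the normal head."""
    depth = conv1x1_relu(decoder_features, *depth_params)    # (H, W, 1)
    normal = conv1x1_relu(decoder_features, *normal_params)  # (H, W, 3)
    return depth, normal

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 16))  # stand-in for the decoder output
depth_params = (rng.standard_normal((16, 1)), np.zeros(1))
normal_params = (rng.standard_normal((16, 3)), np.zeros(3))
depth, normal = predict_heads(features, depth_params, normal_params)
```

Sharing the encoder and decoder between the two tasks means only the small heads are duplicated, which is consistent with the low parameter count the paragraph emphasizes.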

[0079] S60: Obtain the image to be predicted.

[0080] In this embodiment of the application, the image to be predicted can be any RGB image input by the user.

[0081] S70: Input the image to be predicted into the trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the image to be predicted.

[0082] In the embodiments of this application, the trained second depth map and normal map prediction model can obtain the corresponding depth map and normal map based on any input image to be predicted.

[0083] By applying the embodiments of this application, a depth estimation map and a normal estimation map are obtained through a trained first depth map and normal map prediction model. A first normal map is obtained using the depth estimation map and a trained depth map to normal map model. Second sample images are then selected based on the first normal map and the normal estimation map. The second depth map and normal map prediction model is then trained under supervision using the second depth map and the second normal map corresponding to the second sample image, resulting in a well-trained second depth map and normal map prediction model. This improves the model's generalization ability and enhances its robustness. Furthermore, compared to training the model by calculating the camera pose of adjacent frames to obtain reconstruction errors, this application improves the model's training accuracy through supervised training, thereby increasing the accuracy of generating depth maps and normal maps.
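The data-selection part of the flow (steps S20 to S40) can be sketched as a simple filtering loop. This is a minimal sketch under stated assumptions: `teacher`, `depth_to_normal`, and `is_confident` are hypothetical callables standing in for the first prediction model, the depth map to normal map model, and the preset condition, and the sketch assumes that the pseudo-labels kept for a selected image are the teacher's own depth and normal estimates.

```python
def build_pseudo_labels(images, teacher, depth_to_normal, is_confident):
    """Return (image, depth, normal) triples that pass the consistency check."""
    selected = []
    for image in images:
        depth_est, normal_est = teacher(image)        # S20: teacher estimates
        first_normal = depth_to_normal(depth_est)     # S30: normals from depth
        if is_confident(normal_est, first_normal):    # S40: keep consistent samples
            selected.append((image, depth_est, normal_est))
    return selected

# Toy stand-ins: integers play the role of images and maps.
def toy_teacher(image):
    return image * 10, image * 10

def toy_depth_to_normal(depth):
    # Agrees with the teacher's normal only for multiples of 20.
    return depth if depth % 20 == 0 else -depth

def toy_is_confident(normal_est, first_normal):
    return normal_est == first_normal

training_set = build_pseudo_labels(
    [1, 2, 3, 4], toy_teacher, toy_depth_to_normal, toy_is_confident
)
```

The resulting triples would then serve as the supervised training set for the lightweight second model in step S50.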

[0084] In an optional embodiment, referring to Figure 3, step S40 (obtaining several second sample images that meet preset conditions from the first sample image dataset based on the normal estimation map and the first normal map, and obtaining the second depth map and second normal map corresponding to the several second sample images) includes steps S401 to S402, as follows:

[0085] S401: Perform a similarity measurement on the normal estimation map and the first normal map corresponding to each first sample image to obtain the first confidence map corresponding to each first sample image;

[0086] S402: Based on the first confidence map, obtain several second sample images that meet the preset conditions from the first sample image dataset, and obtain the second depth map and the second normal map corresponding to the several second sample images.

[0087] In this embodiment, the similarity measure can be calculated using cosine similarity, in which case the value of each pixel in the first confidence map is the cosine similarity value. Alternatively, it can be calculated using structural similarity (SSIM), in which case the value of each pixel in the first confidence map is the structural similarity value. The preset condition can be that the average pixel value over all pixels in the first confidence map is greater than a preset threshold, or that the variance of the pixel values over all pixels is less than a preset threshold.

[0088] By measuring the similarity between the normal estimation map and the first normal map corresponding to each first sample image, several second sample images can be automatically and quickly selected from the sample image dataset.
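The preset condition on the first confidence map can be sketched as a small predicate. The threshold values below are placeholders chosen for illustration; the source does not state concrete thresholds, and the function name `meets_preset_condition` is hypothetical.

```python
import numpy as np

def meets_preset_condition(confidence, mean_threshold=0.9, var_threshold=None):
    """Check a confidence map against a mean or a variance threshold."""
    if var_threshold is not None:
        # Variant 2: pixel values must vary less than the threshold.
        return float(np.var(confidence)) < var_threshold
    # Variant 1: the average pixel value must exceed the threshold.
    return float(np.mean(confidence)) > mean_threshold

consistent = np.full((4, 4), 0.98)      # normals agree almost everywhere
checkerboard = np.indices((4, 4)).sum(axis=0) % 2.0  # agreement alternates 0 and 1

keep_consistent = meets_preset_condition(consistent)
keep_checkerboard = meets_preset_condition(checkerboard)
keep_by_variance = meets_preset_condition(consistent, var_threshold=0.01)
```

Images whose confidence map passes the check would be retained as second sample images; the rest are discarded.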

[0089] In an optional embodiment, referring to Figure 4, step S401, in which a similarity measurement is performed on the normal estimation map and the first normal map corresponding to each first sample image to obtain the first confidence map corresponding to each first sample image, includes steps S4011 to S4013, as follows:

[0090] S4011: Obtain the normal vector value of each pixel in the normal estimation map corresponding to each first sample image and the normal vector value of the corresponding pixel in the first normal map corresponding to each first sample image;

[0091] S4012: Calculate the cosine of the angle between the normal vector value of each pixel in the normal estimation map and the normal vector value of the corresponding pixel in the first normal map;

[0092] S4013: Use the cosine of the included angle as the pixel value of each pixel to obtain the first confidence map corresponding to each first sample image.

[0093] Cosine similarity uses the cosine of the angle between two vectors as a measure of how much they differ. The closer the cosine is to 1 (an angle close to 0 degrees), the more similar the two vectors are; the closer the cosine is to 0 (an angle close to 90 degrees), the less similar they are.
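
As a worked form of the definition above, the cosine of the included angle between two normal vectors can be written as follows. The symbols n_1 and n_2 are illustrative names for the two normal vectors at one pixel and do not appear in the original text.

```latex
\cos\theta = \frac{\mathbf{n}_1 \cdot \mathbf{n}_2}{\lVert \mathbf{n}_1 \rVert \, \lVert \mathbf{n}_2 \rVert}
```

For unit-length normals the denominator equals 1, so the cosine reduces to the dot product of the two vectors.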

[0094] In this embodiment, a first confidence map is obtained by calculating the cosine similarity between the normal estimation map and the first normal map. Specifically, the cosine of the angle between the normal vector value of each pixel in the normal estimation map and the normal vector value of the corresponding pixel in the first normal map is calculated, and the cosine of the angle is used as the pixel value of the corresponding pixel in the first confidence map.

[0095] By calculating the cosine similarity between the normal estimation map and the first normal map, the first confidence map can be obtained automatically and quickly.
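
Steps S4011 to S4013 can be sketched as follows. This is a minimal illustration, assuming that normal maps are stored as NumPy arrays of shape (H, W, 3); the function name `cosine_confidence_map` and the small `eps` guard against zero-length vectors are assumptions, not part of the original text.

```python
import numpy as np

def cosine_confidence_map(normal_est, normal_ref, eps=1e-8):
    """Per-pixel cosine of the angle between two (H, W, 3) normal maps."""
    # S4011/S4012: dot product of corresponding normal vectors, divided by their lengths.
    dot = np.sum(normal_est * normal_ref, axis=-1)
    norms = np.linalg.norm(normal_est, axis=-1) * np.linalg.norm(normal_ref, axis=-1)
    # S4013: the cosine becomes the pixel value of the confidence map.
    return dot / np.maximum(norms, eps)

# Toy data: all estimated normals point along +z.
est = np.zeros((2, 2, 3))
est[..., 2] = 1.0
ref = est.copy()
ref[0, 0] = [1.0, 0.0, 0.0]  # one reference normal is perpendicular to the estimate

conf = cosine_confidence_map(est, ref)
```

In this toy case the confidence is 0 at the perpendicular pixel and 1 wherever the two maps agree.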

[0096] In an optional embodiment, referring to Figure 5, step S402, in which several second sample images that meet preset conditions are obtained from the first sample image dataset based on the first confidence map, and the second depth map and second normal map corresponding to the several second sample images are obtained, includes steps S4021 to S4022, as follows:

[0097] S4021: Average the pixel values of all pixels in the first confidence map corresponding to each first sample image to obtain the global confidence value corresponding to each first sample image;

[0098] S4022: Iterate through each global confidence value. If the current global confidence value is greater than or equal to a preset threshold, take the first sample image corresponding to the current global confidence value as the second sample image, and obtain the second depth map and the second normal map corresponding to the second sample image.

[0099] In this embodiment, the global confidence value corresponding to each first sample image is compared with a preset threshold, and the second sample image is selected based on the comparison result. Specifically, the first sample image corresponding to a global confidence value greater than or equal to the preset threshold is selected as the second sample image.

[0100] By comparing the global confidence value corresponding to each first sample image with a preset threshold, the second sample image can be determined automatically and quickly.
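
Steps S4021 and S4022 can be sketched as follows. This is a minimal illustration, assuming each first confidence map is a NumPy array; the function name `select_second_samples` and the threshold value used in the example are hypothetical.

```python
import numpy as np

def select_second_samples(confidence_maps, threshold):
    """Return the indices of samples whose global confidence is >= threshold."""
    selected = []
    for index, conf in enumerate(confidence_maps):
        # S4021: average all pixel values to get the global confidence value.
        global_confidence = float(np.mean(conf))
        # S4022: keep the sample if it reaches the preset threshold.
        if global_confidence >= threshold:
            selected.append(index)
    return selected

# Three toy confidence maps with uniform values.
maps = [np.full((4, 4), 0.95), np.full((4, 4), 0.40), np.full((4, 4), 0.85)]
kept = select_second_samples(maps, threshold=0.8)
```

Here the first and third samples are kept and the second is discarded.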

[0101] In an optional embodiment, referring to Figure 6, step S50, in which each second sample image and its corresponding second depth map and second normal map are input into the second depth map and normal map prediction model for training to obtain the trained second depth map and normal map prediction model, includes steps S501 to S504, as follows:

[0102] S501: Input each second sample image into the second depth map and normal map prediction model to obtain the predicted depth map and predicted normal map corresponding to each second sample image;

[0103] S502: Input the predicted depth map and the second depth map into the trained depth map to normal map model respectively to obtain the third normal map and the fourth normal map;

[0104] S503: Measure the similarity between the predicted normal map and the third normal map to obtain the corresponding second confidence map.

[0105] In this embodiment, the normal vector value of each pixel in the predicted normal map corresponding to each second sample image and the normal vector value of the corresponding pixel in the third normal map corresponding to each second sample image are obtained. The cosine value of the angle between the normal vector value of each pixel in the predicted normal map and the normal vector value of the corresponding pixel in the third normal map is calculated. The cosine value of the angle is used as the pixel value of each pixel to obtain the second confidence map corresponding to each second sample image.

[0106] S504: Based on the second depth map, second normal map, third normal map, fourth normal map, predicted depth map, predicted normal map, and second confidence map, train the second depth map and normal map prediction model to obtain the trained second depth map and normal map prediction model.

[0107] In this embodiment of the application, the second depth map and normal map prediction model and the trained depth map to normal map model are jointly trained to obtain the trained second depth map and normal map prediction model.

[0108] Specifically, a loss function can be constructed based on the second depth map, second normal map, third normal map, fourth normal map, predicted depth map, predicted normal map, and second confidence map. Backpropagation is performed using the loss function value to update, by gradient descent, the weight parameters of the encoder, decoder, depth map prediction head network, and normal map prediction head network in the second depth map and normal map prediction model, thereby obtaining the trained second depth map and normal map prediction model.

[0109] In an optional embodiment, before step S501, which inputs each second sample image into the second depth map and normal map prediction model to obtain the predicted depth map and predicted normal map corresponding to each second sample image, steps S100 to S200 are included, as follows:

[0110] S100: Input each second sample image into the first neural network learning model to obtain the initial depth map and initial normal map corresponding to each second sample image;

[0111] S200: Train the first neural network learning model based on the initial depth map, initial normal map, second depth map, and second normal map corresponding to each second sample image to obtain the second depth map and normal map prediction model.

[0112] In this embodiment, the second depth map and the second normal map are used as labels for the first neural network learning model. A loss function can be constructed based on the difference between the initial depth map and the second depth map, and the difference between the initial normal map and the second normal map. The loss function value is used for backpropagation to update the weight parameters of the first neural network learning model in the manner of gradient descent, thereby obtaining the prediction model of the second depth map and the normal map.

[0113] In an optional embodiment, referring to Figure 7, step S200, in which the first neural network learning model is trained based on the initial depth map, initial normal map, second depth map, and second normal map corresponding to each second sample image to obtain the second depth map and normal map prediction model, includes steps S201 to S205, as follows:

[0114] S201: Average the depth values of each pixel in the initial depth map to obtain the third average depth value; obtain the third regularized depth value based on the depth values of each pixel in the initial depth map and the third average depth value.

[0115] S202: Average the depth values of each pixel in the second depth map to obtain the fourth average depth value; obtain the fourth regularized depth value based on the depth values of each pixel in the second depth map and the fourth average depth value.

[0116] S203: The average of the difference between the third and fourth regularized depth values is used to obtain the eighth loss function; the gradient of the difference between the depth value of each pixel in the initial depth map and the depth value of the corresponding pixel in the second depth map is used to obtain the ninth loss function.

[0117] In this embodiment, the expression for the eighth loss function is as follows:

[0118]

[0119]

[0120]

[0121] where the regularized depth terms are the third regularized depth value (from the initial depth map) and the fourth regularized depth value (from the second depth map); t(d) represents the third average depth value; t(d*) represents the fourth average depth value; d represents the depth value of each pixel in the initial depth map; d* represents the depth value of each pixel in the second depth map; and N represents the number of pixels.

[0122] S204: The tenth loss function is obtained by averaging the difference between the normal vector value of each pixel in the initial normal map and the normal vector value of the corresponding pixel in the second normal map; the eleventh loss function is obtained by averaging the cosine of the angle between the normal vector value of each pixel in the initial normal map and the normal vector value of the corresponding pixel in the second normal map.

[0123] S205: Train the first neural network learning model based on the eighth, ninth, tenth, and eleventh loss functions to obtain the second depth map and normal map prediction model.

[0124] In this embodiment, the eighth, ninth, tenth, and eleventh loss functions are calculated using the initial depth map, the initial normal map, the second depth map, and the second normal map, thereby optimizing the network parameters of the first neural network learning model and obtaining the second depth map and normal map prediction model.
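
The four loss terms of steps S201 to S205 can be sketched as follows. This is only an illustration under stated assumptions: the exact regularization formula is not preserved in the text, so subtracting the image-level mean depth is assumed here; absolute differences, finite-difference gradients, writing the cosine term as 1 minus the mean cosine, and the function names are likewise assumptions.

```python
import numpy as np

def regularized_depth(depth):
    # Assumed form of the regularized depth: subtract the average depth t(d).
    return depth - depth.mean()

def depth_losses(initial_depth, second_depth):
    # Eighth loss: mean absolute difference of the two regularized depth maps.
    loss_8 = np.mean(np.abs(regularized_depth(initial_depth) - regularized_depth(second_depth)))
    # Ninth loss: mean absolute gradient of the per-pixel depth difference,
    # approximated with finite differences along both image axes.
    diff = initial_depth - second_depth
    loss_9 = np.abs(np.diff(diff, axis=1)).mean() + np.abs(np.diff(diff, axis=0)).mean()
    return loss_8, loss_9

def normal_losses(initial_normal, second_normal, eps=1e-8):
    # Tenth loss: mean absolute difference of corresponding normal vectors.
    loss_10 = np.mean(np.abs(initial_normal - second_normal))
    # Eleventh loss: based on the mean cosine of the included angle (1 - cos assumed).
    dot = np.sum(initial_normal * second_normal, axis=-1)
    norms = np.linalg.norm(initial_normal, axis=-1) * np.linalg.norm(second_normal, axis=-1)
    loss_11 = 1.0 - np.mean(dot / np.maximum(norms, eps))
    return loss_10, loss_11

rng = np.random.default_rng(0)
second_depth = rng.uniform(1.0, 5.0, size=(8, 8))
initial_depth = second_depth + 2.0  # prediction that is off by a constant only
loss_8, loss_9 = depth_losses(initial_depth, second_depth)

normals = np.zeros((8, 8, 3))
normals[..., 2] = 1.0
loss_10, loss_11 = normal_losses(normals, normals)

total_loss = loss_8 + loss_9 + loss_10 + loss_11  # equal weighting is an assumption
```

Under the assumed mean-subtraction form, a prediction that differs from the label only by a constant offset yields depth losses of approximately zero, which illustrates why an average-based regularization makes the depth comparison insensitive to a global shift.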

[0125] In an optional embodiment, referring to Figure 8, step S504, in which the second depth map and normal map prediction model is trained based on the second depth map, second normal map, third normal map, fourth normal map, predicted depth map, predicted normal map, and second confidence map to obtain the trained second depth map and normal map prediction model, includes steps S5031 to S5039, as follows:

[0126] S5031: The first loss function is obtained by averaging the product of the difference between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the third normal map, and the pixel value of the corresponding pixel in the second confidence map.

[0127] S5032: The second loss function is obtained by averaging the product of the cosine of the angle between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the third normal map and the pixel value of the corresponding pixel in the second confidence map.

[0128] S5033: The average of the difference between the normal vector value of each pixel in the third normal map and the normal vector value of the corresponding pixel in the fourth normal map is used to obtain the third loss function;

[0129] S5034: The average of the difference between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the predicted normal map is used to obtain the fourth loss function;

[0130] S5035: The fifth loss function is obtained by averaging the cosine of the angle between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the predicted normal map.

[0131] S5036: Average the depth values of each pixel in the predicted depth map to obtain a first average depth value; obtain a first regularized depth value based on the depth values of each pixel in the predicted depth map and the first average depth value.

[0132] S5037: Average the depth values of each pixel in the second depth map to obtain the second average depth value; obtain the second regularized depth value based on the depth values of each pixel in the second depth map and the second average depth value.

[0133] S5038: The average of the difference between the first regularized depth value and the second regularized depth value is used to obtain the sixth loss function; the gradient of the difference between the depth value of each pixel in the predicted depth map and the depth value of the corresponding pixel in the second depth map is used to obtain the seventh loss function.

[0134] In the embodiments of this application, for the process of computing the sixth and seventh loss functions in steps S5036 to S5038, refer to the process of computing the eighth and ninth loss functions in steps S201 to S203; it is not repeated here.

[0135] S5039: Train the second depth map and normal map prediction model based on the first loss function, the second loss function, the third loss function, the fourth loss function, the fifth loss function, the sixth loss function, and the seventh loss function to obtain the trained second depth map and normal map prediction model.

[0136] In this embodiment, the first loss function, the second loss function, and the third loss function are used to update the weight parameters of the trained depth map to normal map model and of the encoder, decoder, depth map prediction head network, and normal map prediction head network in the second depth map and normal map prediction model. The weight parameters of the encoder, decoder, and depth map prediction head network in the second depth map and normal map prediction model are updated using the fourth loss function and the fifth loss function. The weight parameters of the encoder, decoder, and normal map prediction head network in the second depth map and normal map prediction model are updated using the sixth loss function and the seventh loss function, thereby obtaining the trained second depth map and normal map prediction model.
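
The confidence-weighted terms of steps S5031 to S5033 can be sketched as follows. This is an illustration under stated assumptions: normal maps are (H, W, 3) NumPy arrays, the first loss is read as the mean of the per-pixel difference multiplied by the confidence (by analogy with S5032), the cosine term is written as 1 minus the cosine, and the confidence is clipped to [0, 1]. The function names are hypothetical.

```python
import numpy as np

def cosine_map(a, b, eps=1e-8):
    """Per-pixel cosine of the angle between two (H, W, 3) normal maps."""
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dot / np.maximum(norms, eps)

def confidence_weighted_losses(second_normal, third_normal, fourth_normal, predicted_normal):
    # Second confidence map (S503): cosine between predicted and third normal maps.
    # Clipping to [0, 1] is an assumption so that weights are never negative.
    confidence = np.clip(cosine_map(predicted_normal, third_normal), 0.0, 1.0)
    # First loss (S5031): confidence-weighted mean absolute normal difference.
    loss_1 = np.mean(confidence[..., None] * np.abs(second_normal - third_normal))
    # Second loss (S5032): confidence-weighted cosine term, written as 1 - cos (assumed).
    loss_2 = np.mean(confidence * (1.0 - cosine_map(second_normal, third_normal)))
    # Third loss (S5033): mean absolute difference between third and fourth normal maps.
    loss_3 = np.mean(np.abs(third_normal - fourth_normal))
    return loss_1, loss_2, loss_3

# Toy data: label and prediction point along +z; the third normal map
# disagrees at a single pixel.
up = np.zeros((2, 2, 3))
up[..., 2] = 1.0
third = up.copy()
third[0, 0] = [1.0, 0.0, 0.0]

loss_1, loss_2, loss_3 = confidence_weighted_losses(
    second_normal=up, third_normal=third, fourth_normal=third, predicted_normal=up)
```

In this toy case the disagreeing pixel has zero confidence, so it contributes nothing to the first and second loss functions, which is the intended effect of weighting by the second confidence map.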

[0137] In an optional embodiment, before step S30, in which each depth estimation map is input into the trained depth map to normal map model to obtain the first normal map corresponding to each first sample image, steps S301 to S302 are included, as follows:

[0138] S301: Obtain the second sample image dataset; the second sample image dataset includes several third sample depth maps and corresponding third sample normal maps;

[0139] S302: Input each third sample depth map into the second neural network learning model to obtain the corresponding sample normal map. Construct a loss function using the sample normal map and the third sample normal map. Update the weight parameters of the second neural network learning model according to the loss function to obtain the trained depth map to normal map model.

[0140] In this embodiment, the second sample image dataset can be the Taskonomy indoor dataset, which contains 136 indoor models and 1 million data pairs, including third sample depth maps and corresponding third sample normal maps. A second neural network learning model is trained based on the Taskonomy indoor dataset to obtain a depth map-to-normal map model that predicts normal maps based on depth maps.

[0141] Specifically, the third sample depth map is input into the second neural network learning model to obtain the sample normal map. A loss function is constructed using the sample normal map and the third sample normal map. Based on the loss function, the weight parameters of the second neural network learning model are updated, thereby obtaining the trained depth map to normal map model.

[0142] Specifically, the twelfth loss function is obtained by averaging the difference between the normal vector value of each pixel in the sample normal map and the normal vector value of the corresponding pixel in the third sample normal map; the thirteenth loss function is obtained by averaging the cosine of the angle between the normal vector value of each pixel in the sample normal map and the normal vector value of the corresponding pixel in the third sample normal map; and the weight parameters of the second neural network learning model are updated based on the twelfth and thirteenth loss functions to obtain the trained depth map to normal map model.

[0143] The second neural network learning model is built from partial convolution (Partial Conv) layers arranged in a simple U-Net network structure. After training, it is evaluated on a test set to ensure that the depth map to normal map model is usable and robust.
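
The text names Partial Conv layers without describing them. The following sketch shows the commonly used formulation of a partial convolution, in which only valid (unmasked) pixels contribute, the result is rescaled by the fraction of valid pixels in the window, and the validity mask is updated. The single-channel input, "valid" padding, absence of bias, and the function name `partial_conv2d` are all assumptions for illustration and are not taken from the original.

```python
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Single-channel partial convolution without padding, stride, or bias."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    new_mask = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window_mask = mask[i:i + kh, j:j + kw]
            valid = window_mask.sum()
            if valid > 0:
                # Only valid pixels contribute; rescale by (window size / valid count).
                window = x[i:i + kh, j:j + kw] * window_mask
                out[i, j] = np.sum(kernel * window) * (kh * kw / valid)
                # The output pixel is valid if any input pixel in the window was valid.
                new_mask[i, j] = 1.0
    return out, new_mask

x = np.ones((4, 4))
mask = np.ones((4, 4))
mask[1, 1] = 0.0            # one missing input pixel
kernel = np.ones((3, 3))
out, new_mask = partial_conv2d(x, mask, kernel)
```

Because of the rescaling, the hole at position (1, 1) does not lower the response: every output equals the value an unmasked 3x3 sum over ones would give.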

[0144] Please refer to Figure 9, which is a flowchart of the virtual live streaming method provided in the second embodiment of this application. The method can be executed by a broadcaster client and includes the following steps:

[0145] S100: Obtain the scene image, and generate the depth map and normal map corresponding to the scene image using the depth map and normal map generation methods described above.

[0146] The scene image can be a scene image captured by the broadcaster using the broadcaster's client camera, or a scene image pre-stored by the broadcaster's client. Specifically, the scene image can be an indoor scene image, including lighting, people, tables, chairs, and sofas, or an outdoor scene image, including natural light, buildings, mountains, and rivers. In this embodiment, the broadcaster's client acquires the scene image and runs a trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the scene image.

[0147] S200: Acquire the live stream image, perform foreground segmentation on the live stream image, and obtain the anchor image.

[0148] Foreground segmentation separates the foreground from the background in an image. Foreground segmentation methods are existing technologies and will not be elaborated upon here. In this embodiment, the live stream image can be a cropped image of the live stream during the broadcast, or a cropped preview image of the live stream before the broadcast begins. Foreground segmentation is performed on the live stream image to obtain the anchor image.

[0149] S300: Fuse the anchor image with the depth map and normal map corresponding to the scene image to obtain a fused image;

[0150] S400: Render and display the fused image in real time.

[0151] In this embodiment, the fused image is rendered in real time using a renderer, which can simulate the realistic effects in the scene image. Specifically, if the scene image includes lights, the actual lighting effects can be simulated, including the color, direction, and type of the lights.
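
The text does not specify the renderer's shading model. As one way a normal map can drive simulated lighting with a given color and direction, the following sketch applies Lambertian diffuse shading with a single directional light. The shading model, the use of a per-pixel albedo image, and the function name `lambert_shade` are assumptions; the depth map is not used in this simplified example.

```python
import numpy as np

def lambert_shade(albedo, normals, light_dir, light_color):
    """Diffuse shading of an (H, W, 3) albedo image using an (H, W, 3) normal map."""
    light_dir = np.asarray(light_dir, dtype=float)
    light_dir = light_dir / np.linalg.norm(light_dir)
    # Cosine between each normal and the light direction; back-facing pixels get 0.
    n_dot_l = np.clip(np.sum(normals * light_dir, axis=-1), 0.0, None)
    return albedo * n_dot_l[..., None] * np.asarray(light_color, dtype=float)

albedo = np.full((2, 2, 3), 0.5)
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0
normals[0, 0] = [0.0, 0.0, -1.0]   # this pixel faces away from the light

shaded = lambert_shade(albedo, normals, light_dir=[0, 0, 1], light_color=[1.0, 0.8, 0.6])
```

Pixels facing the light take on the tinted light color scaled by the albedo, while the pixel facing away stays black.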

[0152] The broadcaster client can apply the depth map and normal map corresponding to the scene image to virtual live streaming scenarios such as 3D lighting, virtual-real interaction, and various AR effects. With 3D lighting, the broadcaster does not need to set up real background lights, which reduces the cost and complexity of starting a live stream, enables high-quality and efficient interactive content, and improves viewer retention in the live stream room.

[0153] Please refer to Figure 10, which is a schematic diagram of the structure of the image depth map and normal map generation apparatus provided in the third embodiment of this application. This apparatus can be implemented as all or part of a computer device through software, hardware, or a combination of both. The apparatus 9 includes:

[0154] The dataset acquisition module 91 is used to acquire the first sample image dataset; the first sample image dataset includes several first sample images;

[0155] The sample image input module 92 is used to input each first sample image into the trained first depth map and normal map prediction model to obtain the depth estimation map and normal estimation map corresponding to each first sample image;

[0156] The depth estimation map input module 93 is used to input each depth estimation map into the trained depth map to normal map model to obtain the first normal map corresponding to each first sample image;

[0157] The second sample image acquisition module 94 is used to obtain several second sample images that meet preset conditions from the first sample image dataset based on the normal estimation map and the first normal map, and to obtain the second depth map and the second normal map corresponding to the several second sample images.

[0158] The model training module 95 is used to input each second sample image, the second depth map and the second normal map corresponding to the second sample image into the second depth map and normal map prediction model for training, and to obtain the trained second depth map and normal map prediction model.

[0159] The image acquisition module 96 is used to acquire the image to be predicted.

[0160] The depth map acquisition module 97 is used to input the image to be predicted into the trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the image to be predicted.
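
The data flow through modules 92 to 94 can be sketched as follows. This is a rough illustration only: the models are passed in as plain callables, the toy stand-ins below are hypothetical, and using the estimation maps of a selected image as its second depth map and second normal map is an assumption, since the text does not state where these labels come from.

```python
import numpy as np

def cosine_map(a, b, eps=1e-8):
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dot / np.maximum(norms, eps)

def build_training_set(images, first_model, depth_to_normal, threshold):
    selected = []
    for image in images:
        depth_est, normal_est = first_model(image)           # module 92
        first_normal = depth_to_normal(depth_est)            # module 93
        confidence = cosine_map(normal_est, first_normal)    # module 94
        if confidence.mean() >= threshold:
            # Assumption: the estimation maps serve as the training labels.
            selected.append((image, depth_est, normal_est))
    return selected

# Hypothetical stand-ins for the two trained models.
def toy_first_model(image):
    depth = image.mean(axis=-1)
    normal = np.zeros(image.shape[:2] + (3,))
    normal[..., 2] = 1.0
    return depth, normal

def toy_depth_to_normal(depth):
    # Agrees with toy_first_model only where the depth exceeds 0.5.
    normal = np.zeros(depth.shape + (3,))
    normal[..., 2] = (depth > 0.5).astype(float)
    normal[..., 0] = (depth <= 0.5).astype(float)
    return normal

images = [np.full((4, 4, 3), 0.9), np.full((4, 4, 3), 0.1)]
training_set = build_training_set(images, toy_first_model, toy_depth_to_normal, threshold=0.8)
```

Only the first toy image passes the consistency check, so the resulting training set contains a single (image, depth, normal) triple that could then be passed to the model training module 95.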

[0161] It should be noted that the image depth map and normal map generation apparatus provided in the above embodiments is only illustrated by the division of the above functional modules when executing the image depth map and normal map generation method. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the image depth map and normal map generation apparatus and the image depth map and normal map generation method provided in the above embodiments belong to the same concept, and the implementation process is detailed in the method embodiments, which will not be repeated here.

[0162] Please refer to Figure 11, which is a schematic diagram of the structure of the virtual live streaming device provided in the fourth embodiment of this application. This device can be implemented as all or part of a computer device through software, hardware, or a combination of both. The device 10 includes:

[0163] The lighting scene image acquisition module 101 is used to acquire a lighting scene image and input the lighting scene image into a trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the lighting scene image; wherein, the trained second depth map and normal map prediction model is the trained second depth map and normal map prediction model in the image depth map and normal map generation method of any one of claims 1 to 4 or claims 6 to 9.

[0164] The live room image acquisition module 102 is used to acquire live room images, perform foreground segmentation on the live room images, and obtain the anchor image;

[0165] The image fusion module 103 is used to fuse the depth map and normal map corresponding to the anchor image and the lighting scene image to obtain a fused image;

[0166] The image rendering module 104 is used to perform real-time lighting rendering on the fused image.

[0167] It should be noted that the virtual live streaming device provided in the above embodiments is only illustrated by the division of the above functional modules when executing the virtual live streaming method. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the virtual live streaming device and the virtual live streaming method provided in the above embodiments belong to the same concept, and the implementation process can be found in the method embodiments, which will not be repeated here.

[0168] Please refer to Figure 12, which is a schematic diagram of the structure of the computer device provided in the fifth embodiment of this application. As shown in Figure 12, the computer device 21 may include: a processor 210, a memory 211, and a computer program 212 stored in the memory 211 and capable of running on the processor 210, such as a live streaming control program for team interaction; when the processor 210 executes the computer program 212, it implements the steps in the above embodiments.

[0169] The processor 210 may include one or more processing cores. The processor 210 connects to various parts within the computer device 21 using various interfaces and lines. It executes various functions of the computer device 21 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 211, and by accessing data in the memory 211. Optionally, the processor 210 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 210 may integrate one or more of the following: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content required to be displayed on the touch screen; and the modem handles wireless communication. It is understood that the modem may also be implemented as a separate chip without being integrated into the processor 210.

[0170] The memory 211 may include random access memory (RAM) or read-only memory. Optionally, the memory 211 may include a non-transitory computer-readable storage medium. The memory 211 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 211 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions), instructions for implementing the various method embodiments described above, etc.; the data storage area may store data involved in the various method embodiments described above, etc. Optionally, the memory 211 may also be at least one storage device located remotely from the aforementioned processor 210.

[0171] This application also provides a computer storage medium that can store multiple instructions. These instructions are suitable for being loaded by a processor to execute the method steps of the above embodiments. For details of the execution process, please refer to the specific description of the above embodiments, which will not be repeated here.

[0172] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0173] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0174] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0175] In the embodiments provided by this invention, it should be understood that the disclosed apparatus / terminal devices and methods can be implemented in other ways. For example, the apparatus / terminal device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0176] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0177] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0178] If integrated modules / units are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms.

[0179] This invention is not limited to the above-described embodiments. If any modifications or variations to this invention do not depart from the spirit and scope of this invention, and if such modifications and variations fall within the scope of the claims and equivalent technologies of this invention, then this invention also intends to include such modifications and variations.

Claims

1. A method for generating a depth map and normal map of an image, characterized in that, The method includes the following steps: Obtain a first sample image dataset; the first sample image dataset includes several first sample images; Each first sample image is input into a trained first depth map and normal map prediction model to obtain a depth estimation map and a normal estimation map corresponding to each first sample image. Each of the depth estimation maps is input into a trained depth map to normal map model to obtain a first normal map corresponding to each of the first sample images; Based on the normal estimation map and the first normal map, several second sample images that meet preset conditions are obtained from the first sample image dataset, and a second depth map and a second normal map corresponding to the several second sample images are obtained, including: performing a similarity measurement on the normal estimation map and the first normal map corresponding to each first sample image to obtain a first confidence map corresponding to each first sample image; based on the first confidence map, several second sample images that meet preset conditions are obtained from the first sample image dataset, and a second depth map and a second normal map corresponding to the several second sample images are obtained; Each second sample image, the second depth map and the second normal map corresponding to the second sample image are input into the second depth map and normal map prediction model for training, and the trained second depth map and normal map prediction model is obtained. Obtain the image to be predicted; The image to be predicted is input into the trained second depth map and normal map prediction model to obtain the depth map and normal map corresponding to the image to be predicted.

2. The method for generating a depth map and a normal map of an image according to claim 1, characterized in that the step of performing a similarity measurement on the normal estimation map and the first normal map corresponding to each first sample image to obtain a first confidence map corresponding to each first sample image includes: obtaining the normal vector value of each pixel in the normal estimation map corresponding to each first sample image and the normal vector value of the corresponding pixel in the first normal map corresponding to that first sample image; calculating the cosine of the angle between the normal vector value of each pixel in the normal estimation map and the normal vector value of the corresponding pixel in the first normal map; and using the cosine of the angle as the pixel value of each pixel to obtain the first confidence map corresponding to each first sample image.
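
By way of non-limiting illustration only, the per-pixel similarity measurement of claim 2 may be sketched as follows. The sketch assumes that each normal map is held as a NumPy array of shape (H, W, 3); the function name `confidence_map`, the `eps` guard against zero-length vectors, and the toy input values are hypothetical and do not appear in the claims.

```python
import numpy as np

def confidence_map(normal_est, normal_ref, eps=1e-8):
    """Per-pixel cosine of the angle between two (H, W, 3) normal maps.

    normal_est: normal estimation map from the first prediction model.
    normal_ref: first normal map derived from the depth estimation map.
    eps: hypothetical guard so that zero-length vectors do not divide by zero.
    """
    dot = np.sum(normal_est * normal_ref, axis=-1)
    norms = np.linalg.norm(normal_est, axis=-1) * np.linalg.norm(normal_ref, axis=-1)
    return dot / np.maximum(norms, eps)

# Toy 1x2 maps: the first pixel agrees exactly, the second is perpendicular.
est = np.array([[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]])
ref = np.array([[[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]])
conf = confidence_map(est, ref)
```

In this sketch a pixel value near 1 indicates that the two normal maps agree at that pixel, and a value near 0 or below indicates disagreement.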

3. The method for generating a depth map and a normal map of an image according to claim 1, characterized in that the step of obtaining, based on the first confidence map, several second sample images that meet preset conditions from the first sample image dataset, and obtaining a second depth map and a second normal map corresponding to the several second sample images, includes: averaging the pixel values of all pixels in the first confidence map corresponding to each first sample image to obtain a global confidence value corresponding to each first sample image; and iterating through the global confidence values and, if the current global confidence value is greater than or equal to a preset threshold, using the first sample image corresponding to the current global confidence value as a second sample image, so as to obtain the second depth map and the second normal map corresponding to the second sample image.
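
By way of non-limiting illustration only, the sample selection of claim 3 may be sketched as follows. The function name `select_samples`, the threshold value 0.9, and the toy confidence maps are assumptions made for illustration; the claims state only that a preset threshold is used and do not give its value.

```python
import numpy as np

def select_samples(confidence_maps, threshold=0.9):
    """Return indices of first sample images kept as second sample images.

    confidence_maps: list of (H, W) arrays, one per first sample image.
    threshold: hypothetical preset threshold on the global confidence value.
    """
    selected = []
    for idx, conf in enumerate(confidence_maps):
        # Global confidence value: mean of all pixel values in the map.
        global_conf = float(np.mean(conf))
        if global_conf >= threshold:
            selected.append(idx)
    return selected

# Three toy confidence maps with global confidence 0.95, 0.5 and 1.0.
maps = [np.full((2, 2), 0.95), np.full((2, 2), 0.5), np.ones((2, 2))]
kept = select_samples(maps, threshold=0.9)
```

Here the second toy map falls below the assumed threshold, so only the first and third images would be carried forward as second sample images.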

4. The method for generating a depth map and a normal map of an image according to any one of claims 1 to 3, characterized in that the step of inputting each second sample image, together with the corresponding second depth map and second normal map, into the second depth map and normal map prediction model for training, to obtain the trained second depth map and normal map prediction model, includes: inputting each second sample image into the second depth map and normal map prediction model to obtain a predicted depth map and a predicted normal map corresponding to each second sample image; inputting the predicted depth map and the second depth map, respectively, into the trained depth map to normal map model to obtain a third normal map and a fourth normal map; performing a similarity measurement on the predicted normal map and the third normal map to obtain a corresponding second confidence map; and training the second depth map and normal map prediction model based on the second depth map, the second normal map, the third normal map, the fourth normal map, the predicted depth map, the predicted normal map, and the second confidence map, to obtain the trained second depth map and normal map prediction model.

5. The method for generating a depth map and a normal map of an image according to claim 4, characterized in that the step of training the second depth map and normal map prediction model based on the second depth map, the second normal map, the third normal map, the fourth normal map, the predicted depth map, the predicted normal map, and the second confidence map, to obtain the trained second depth map and normal map prediction model, includes: obtaining a first loss function by averaging the product of (i) the difference between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the third normal map and (ii) the pixel value of the corresponding pixel in the second confidence map; obtaining a second loss function by averaging the product of (i) the cosine of the angle between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the third normal map and (ii) the pixel value of the corresponding pixel in the second confidence map; obtaining a third loss function by averaging the difference between the normal vector value of each pixel in the third normal map and the normal vector value of the corresponding pixel in the fourth normal map; obtaining a fourth loss function by averaging the difference between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the predicted normal map; obtaining a fifth loss function by averaging the cosine of the angle between the normal vector value of each pixel in the second normal map and the normal vector value of the corresponding pixel in the predicted normal map; averaging the depth values of the pixels in the predicted depth map to obtain a first average depth value, and obtaining a first regularized depth value based on the depth value of each pixel in the predicted depth map and the first average depth value; averaging the depth values of the pixels in the second depth map to obtain a second average depth value, and obtaining a second regularized depth value based on the depth value of each pixel in the second depth map and the second average depth value; obtaining a sixth loss function by averaging the difference between the first regularized depth value and the second regularized depth value; obtaining a seventh loss function by calculating the gradient of the difference between the depth value of each pixel in the predicted depth map and the depth value of the corresponding pixel in the second depth map; and training the second depth map and normal map prediction model based on the first loss function, the second loss function, the third loss function, the fourth loss function, the fifth loss function, the sixth loss function, and the seventh loss function, to obtain the trained second depth map and normal map prediction model.
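
By way of non-limiting illustration only, some of the loss terms of claim 5 may be sketched as follows. The claims do not fix the exact formulas, so the sketch makes several assumptions that are not taken from the claims: the "difference" is taken as an absolute (L1) difference, the cosine terms are turned into a quantity to be minimized as 1 minus the cosine, the regularized depth value is taken as the depth divided by the average depth, and the gradient is approximated by finite differences. All function names and toy values are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Per-pixel cosine between two (H, W, 3) normal maps.
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dot / np.maximum(norms, eps)

def weighted_l1_normal_loss(n_a, n_b, conf):
    # First-loss style term: mean of (per-pixel L1 difference * confidence).
    diff = np.sum(np.abs(n_a - n_b), axis=-1)
    return float(np.mean(diff * conf))

def weighted_cosine_loss(n_a, n_b, conf):
    # Second-loss style term. Assumption: 1 - cos, so agreement gives 0.
    return float(np.mean((1.0 - cosine(n_a, n_b)) * conf))

def regularized_depth(depth, eps=1e-8):
    # Assumption: regularize by dividing each depth value by the mean depth.
    return depth / max(float(np.mean(depth)), eps)

def regularized_depth_loss(d_pred, d_ref):
    # Sixth-loss style term: mean absolute difference of regularized depths.
    return float(np.mean(np.abs(regularized_depth(d_pred) - regularized_depth(d_ref))))

def gradient_loss(d_pred, d_ref):
    # Seventh-loss style term: finite-difference gradient of the depth difference.
    diff = d_pred - d_ref
    gx = np.abs(np.diff(diff, axis=1))  # horizontal neighbours
    gy = np.abs(np.diff(diff, axis=0))  # vertical neighbours
    return float(np.mean(gx) + np.mean(gy))

# Toy data: the predicted depth is the reference depth scaled by 2.
d_ref = np.array([[1.0, 2.0], [3.0, 4.0]])
d_pred = 2.0 * d_ref
n = np.zeros((2, 2, 3))
n[..., 2] = 1.0          # every normal points along +z
conf = np.ones((2, 2))   # full confidence at every pixel
```

Under the division-by-mean assumption, the regularized depth term is unchanged when the predicted depth is a uniformly scaled copy of the reference, while the gradient term still penalizes the scaled prediction. How the seven terms are weighted and combined is not stated in the claims and is left open here.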

6. The method for generating a depth map and a normal map of an image according to claim 5, characterized in that, before the step of inputting each second sample image into the second depth map and normal map prediction model to obtain the predicted depth map and the predicted normal map corresponding to each second sample image, the method further includes the following steps: inputting each second sample image into a first neural network learning model to obtain an initial depth map and an initial normal map corresponding to each second sample image; and training the first neural network learning model based on the initial depth map, the initial normal map, the second depth map, and the second normal map corresponding to each second sample image, to obtain the second depth map and normal map prediction model.

7. The method for generating a depth map and a normal map of an image according to claim 6, characterized in that the step of training the first neural network learning model based on the initial depth map, the initial normal map, the second depth map, and the second normal map corresponding to each second sample image, to obtain the second depth map and normal map prediction model, includes: averaging the depth values of the pixels in the initial depth map to obtain a third average depth value, and obtaining a third regularized depth value based on the depth value of each pixel in the initial depth map and the third average depth value; averaging the depth values of the pixels in the second depth map to obtain a fourth average depth value, and obtaining a fourth regularized depth value based on the depth value of each pixel in the second depth map and the fourth average depth value; obtaining an eighth loss function by averaging the difference between the third regularized depth value and the fourth regularized depth value; obtaining a ninth loss function by calculating the gradient between the depth value of each pixel in the initial depth map and the depth value of the corresponding pixel in the second depth map; obtaining a tenth loss function by averaging the difference between the normal vector value of each pixel in the initial normal map and the normal vector value of the corresponding pixel in the second normal map; obtaining an eleventh loss function by averaging the cosine of the angle between the normal vector value of each pixel in the initial normal map and the normal vector value of the corresponding pixel in the second normal map; and training the first neural network learning model based on the eighth loss function, the ninth loss function, the tenth loss function, and the eleventh loss function, to obtain the second depth map and normal map prediction model.

8. The method for generating a depth map and a normal map of an image according to claim 1, characterized in that, before the step of inputting each first depth map into a trained depth map to normal map model to obtain the second normal map corresponding to each first sample image, the method further includes the following steps: obtaining a second sample image dataset, the second sample image dataset including several third sample depth maps and corresponding third sample normal maps; inputting each third sample depth map into a second neural network learning model to obtain a corresponding sample normal map; constructing a loss function from the sample normal map and the third sample normal map; and updating the weight parameters of the second neural network learning model based on the loss function, to obtain the trained depth map to normal map model.

9. A virtual live streaming method, characterized in that the method comprises the following steps: acquiring a scene image, and generating a depth map and a normal map corresponding to the scene image using the method for generating a depth map and a normal map of an image according to any one of claims 1 to 8; acquiring a live streaming room image, and performing foreground segmentation on the live streaming room image to obtain an anchor image; fusing the anchor image with the depth map and the normal map corresponding to the scene image to obtain a fused image; and rendering and displaying the fused image in real time.

10. An apparatus for generating a depth map and a normal map of an image, characterized in that it comprises: a dataset acquisition module, used to acquire a first sample image dataset, the first sample image dataset including several first sample images; a sample image input module, used to input each first sample image into a trained first depth map and normal map prediction model to obtain a depth estimation map and a normal estimation map corresponding to each first sample image; a depth estimation map input module, used to input each depth estimation map into a trained depth map to normal map model to obtain a first normal map corresponding to each first sample image; a second sample image acquisition module, used to obtain, based on the normal estimation map and the first normal map, several second sample images that meet preset conditions from the first sample image dataset, and to obtain a second depth map and a second normal map corresponding to the several second sample images, including: performing a similarity measurement on the normal estimation map and the first normal map corresponding to each first sample image to obtain a first confidence map corresponding to each first sample image; and obtaining, based on the first confidence map, the several second sample images that meet the preset conditions from the first sample image dataset, and obtaining the second depth map and the second normal map corresponding to the several second sample images; a model training module, used to input each second sample image, together with the second depth map and the second normal map corresponding to that second sample image, into a second depth map and normal map prediction model for training, to obtain a trained second depth map and normal map prediction model; an image acquisition module, used to acquire an image to be predicted; and a depth map acquisition module, used to input the image to be predicted into the trained second depth map and normal map prediction model to obtain the depth map and the normal map corresponding to the image to be predicted.

11. A virtual live streaming device, characterized in that it comprises: a scene image acquisition module, used to acquire a scene image and to generate a depth map and a normal map corresponding to the scene image using the method for generating a depth map and a normal map of an image according to any one of claims 1 to 8; a live streaming room image acquisition module, used to acquire a live streaming room image and to perform foreground segmentation on the live streaming room image to obtain an anchor image; an image fusion module, used to fuse the anchor image with the depth map and the normal map corresponding to the scene image to obtain a fused image; and an image rendering module, used to render and display the fused image in real time.

12. A computer device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8 or the method according to claim 9.

13. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8 or the method according to claim 9.