Image processing method, apparatus, device, and storage medium
By using an automatic generation method for depth estimation and 3D mesh reconstruction, the problems of insufficient multi-view content and inadequate depth estimation accuracy in naked-eye 3D display technology have been solved, achieving efficient and accurate naked-eye 3D image generation and improving stereoscopic effect and image quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SUZHOU ZHIJUXINLIAN MICROELECTRONICS CO LTD
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
AI Technical Summary
Existing glasses-free 3D display technology lacks high-quality multi-view content. Traditional production methods are cumbersome and costly, making them difficult to promote. Furthermore, depth estimation methods are not accurate enough in complex scenes, resulting in distorted 3D effects and noise.
Through depth estimation, optimization processing, and 3D mesh reconstruction, a pre-trained deep learning model is used to generate a depth map. Combined with camera spatial inverse transformation, an interlaced map for naked-eye 3D display is automatically generated, including non-uniform mapping and neighborhood threshold segmentation, to improve the depth accuracy and sense of hierarchy in the foreground region.
It achieves fully automated generation from a single 2D image to a naked-eye 3D interlaced image, improving processing efficiency and the accuracy of 3D meshes, enhancing stereoscopic effect and image precision, reducing rendering artifacts, and improving the visual effect of naked-eye 3D display.
Figure CN121908001B_ABST
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and in particular to an image processing method, apparatus, device, and storage medium.
Background Technology
[0002] With the rapid development of display technology, glasses-free 3D technology has attracted much attention for its ability to provide users with an immersive visual experience without the need for any auxiliary devices (such as 3D head-mounted displays). This technology presents multi-view images with parallax on glasses-free 3D display devices, allowing the user's left and right eyes to receive different images, thereby forming stereoscopic vision in the brain.
[0003] Currently, most mainstream glasses-free 3D display devices rely on lenticular lens technology, which depends on specially processed multi-view image sequences or interlaced images. However, the scarcity of high-quality multi-view content has become a key bottleneck restricting the popularization and application of glasses-free 3D technology. Traditional 3D content production mainly relies on manually designed and produced virtual scenes, such as 3D modeling and rendering using engines like Unity, or on shooting with expensive multi-view cameras. The process is cumbersome and costly, limiting its promotion in ordinary consumer applications.
Summary of the Invention
[0004] In view of the above, embodiments of this application provide an image processing method, apparatus, device, and storage medium to solve at least one problem existing in the prior art.
[0005] In a first aspect, an image processing method is provided, the method comprising:
[0006] Depth estimation is performed on the input two-dimensional image to generate a first depth map, wherein the first depth map is used to characterize the distance of the spatial point corresponding to each pixel in the two-dimensional image relative to the camera that acquired the two-dimensional image;
[0007] The first depth map is optimized to obtain an optimized depth map;
[0008] Based on the inverse camera space transformation, a 3D mesh is reconstructed using the optimized depth map;
[0009] Based on the three-dimensional mesh, an interlaced map is generated for playback on glasses-free three-dimensional display devices.
[0010] In conjunction with the first aspect, in an optional implementation, the step of performing depth estimation on the input two-dimensional image to generate a first depth map includes:
[0011] A pre-trained deep learning model is used to predict the depth information of each pixel in the two-dimensional image, and the prediction results are linearly normalized to a predetermined interval to obtain a first depth map.
[0012] During the training process of the deep learning model, the depth information predicted by the model and the depth information in the training labels are both linearly normalized to a predetermined interval and then subjected to the same non-uniform mapping before being input into the loss function.
[0013] The non-uniform mapping is such that, for any two equal-length intervals of depth information within the predetermined interval, the interval closer to the camera is mapped to a longer interval than the interval farther from the camera.
[0014] In conjunction with the first aspect, in one optional implementation, the depth information is an inverse depth value that is negatively correlated with the depth value, and the larger the inverse depth value, the closer the distance to the camera;
[0015] The non-uniform mapping uses a non-linear function to map depth values that are transformed from inverse depth values and normalized to the predetermined interval.
[0016] The nonlinear function is a monotonically increasing power function. During the training phase, the power exponent of the power function is set to a predetermined value that is greater than 0 and less than 1; during the inference phase, the power exponent of the power function is set to 1.
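To make the effect of the power function concrete, the following minimal sketch compares how two equal-length depth intervals are mapped. It assumes depth values already normalized to [0, 1], with 0 nearest the camera. The names `power_map` and `mapped_length` and the training exponent 0.5 are illustrative assumptions; the text only requires an exponent greater than 0 and less than 1.

```python
def power_map(x: float, gamma: float) -> float:
    """Monotonically increasing power function on the normalized interval [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be normalized to [0, 1]")
    return x ** gamma


TRAIN_GAMMA = 0.5  # assumed example value in (0, 1); not given in the source
INFER_GAMMA = 1.0  # exponent 1 leaves the values unchanged at inference


def mapped_length(lo: float, hi: float, gamma: float) -> float:
    """Length of the interval [lo, hi] after the power mapping."""
    return power_map(hi, gamma) - power_map(lo, gamma)


# Two intervals of equal length 0.1: one near the camera, one far from it.
near = mapped_length(0.0, 0.1, TRAIN_GAMMA)
far = mapped_length(0.9, 1.0, TRAIN_GAMMA)
print(f"near interval -> {near:.3f}, far interval -> {far:.3f}")
```

With the assumed exponent of 0.5, the near interval is stretched to a length of about 0.316 while the far interval is compressed to about 0.051. With an exponent of 1, both keep their original length.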
[0017] In conjunction with the first aspect, in an optional implementation, the optimization processing of the first depth map to obtain an optimized depth map includes:
[0018] For each pixel in the first depth map, a neighborhood threshold is determined based on the statistics of the depth information in the neighborhood of the pixel.
[0019] Based on the neighborhood threshold, the first depth map is divided into a foreground region and a background region to obtain a second depth map;
[0020] Image completion processing is performed on the background hole region after removing the foreground region in the second depth map to obtain the optimized depth map.
[0021] In conjunction with the first aspect, in an optional implementation, the depth information is an inverse depth value; the step of dividing the first depth map into a foreground region and a background region based on the neighborhood threshold to obtain a second depth map includes:
[0022] For each pixel in the first depth map, based on the neighborhood threshold corresponding to the pixel, it is determined whether the pixel belongs to the foreground region or the background region, and the pixel determined to belong to the foreground region is assigned the maximum inverse depth value in the neighborhood where the pixel is located, thus obtaining the third depth map.
[0023] For each pixel in the third depth map, if there is a pixel belonging to the foreground region in the neighborhood of the pixel, then the inverse depth value of all pixels in the neighborhood of the pixel is adjusted to the maximum inverse depth value of the neighborhood, so as to expand the foreground region.
[0024] Gradient constraint optimization is performed on the expanded third depth map to obtain a second depth map; the gradient constraint optimization is used to minimize the difference in depth information of each pixel between the adjusted depth map and the unadjusted depth map, and at the same time minimize the gradient magnitude of the adjusted depth map in the neighborhood of each pixel.
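One possible reading of the gradient constraint optimization is a quadratic energy with a data term and a smoothness term. The sketch below works on a single row of depth values and minimizes the sum of (x_i - d_i)^2 plus `lam` times the sum of (x_{i+1} - x_i)^2 with Gauss-Seidel sweeps. The quadratic form, the weight `lam`, the iteration count, and the function names are all assumptions made for illustration and are not taken from the source.

```python
def smooth_with_gradient_constraint(depth, lam=1.0, iterations=200):
    """Gauss-Seidel sweeps for the assumed 1D energy
    sum_i (x_i - depth_i)^2 + lam * sum_i (x_{i+1} - x_i)^2."""
    x = list(depth)
    n = len(x)
    for _ in range(iterations):
        for i in range(n):
            # Neighbors inside the row; border pixels have only one neighbor.
            neighbors = [x[j] for j in (i - 1, i + 1) if 0 <= j < n]
            # Closed-form minimizer of the energy with respect to x_i alone.
            x[i] = (depth[i] + lam * sum(neighbors)) / (1.0 + lam * len(neighbors))
    return x


def max_step(values):
    """Largest absolute difference between adjacent values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


row = [0.2, 0.2, 0.2, 0.9, 0.9, 0.9]  # hard step left by foreground expansion
smoothed = smooth_with_gradient_constraint(row, lam=1.0)
print([round(v, 3) for v in smoothed])
```

A larger `lam` favors a flatter result, while a smaller `lam` keeps the adjusted values closer to the unadjusted ones.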
[0025] In conjunction with the first aspect, in an optional implementation, for each pixel in the first depth map, the difference between the maximum and minimum inverse depth values in the neighborhood of the pixel is determined as the neighborhood range corresponding to the pixel.
[0026] The step of determining whether a pixel belongs to a foreground region or a background region based on a neighborhood threshold corresponding to the pixel includes:
[0027] If the neighborhood range of a pixel is greater than a preset range threshold, the pixel is determined to belong to either the foreground region or the background region based on the neighborhood threshold.
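The range check and the threshold decision can be sketched as follows. The source only says that the neighborhood threshold is derived from neighborhood statistics, so using the neighborhood mean here is an assumption, as are the square window, the value of `range_threshold`, and returning `None` for pixels whose neighborhood is too flat to be classified by this step.

```python
from statistics import mean


def neighborhood(inv_depth, r, c, radius=1):
    """Inverse depth values in a square window around (r, c), clipped at the border."""
    rows, cols = len(inv_depth), len(inv_depth[0])
    return [inv_depth[i][j]
            for i in range(max(0, r - radius), min(rows, r + radius + 1))
            for j in range(max(0, c - radius), min(cols, c + radius + 1))]


def classify(inv_depth, r, c, range_threshold=0.3, radius=1):
    """Return 'foreground', 'background', or None when the range check fails."""
    values = neighborhood(inv_depth, r, c, radius)
    neighborhood_range = max(values) - min(values)
    if neighborhood_range <= range_threshold:
        return None  # assumption: flat neighborhoods are left undecided here
    threshold = mean(values)  # assumption: the statistic is the neighborhood mean
    # A larger inverse depth value means the point is closer to the camera.
    return "foreground" if inv_depth[r][c] > threshold else "background"


inv_depth = [
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
print(classify(inv_depth, 1, 2), classify(inv_depth, 1, 1), classify(inv_depth, 1, 0))
```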
[0028] In conjunction with the first aspect, in an optional implementation, the reconstruction of the 3D mesh based on the camera space inverse transform and the optimized depth map includes:
[0029] Based on the camera space inverse transform, the optimized depth map is converted into a distance map; wherein, the distance map is used to characterize the distance from the three-dimensional spatial point corresponding to each pixel to the camera position along the camera ray direction;
[0030] Based on the distance values of each pixel in the distance map, the position of the three-dimensional spatial point corresponding to each pixel on the camera ray is determined, and the mesh vertices are obtained;
[0031] A foreground 3D mesh is constructed based on the mesh vertices belonging to the foreground region, and a background 3D mesh is constructed based on the mesh vertices belonging to the background region.
[0032] In conjunction with the first aspect, in an optional implementation, the step of converting the optimized depth map into a distance map based on the camera space inverse transform includes:
[0033] Acquire the intrinsic parameter information of the camera that acquires the two-dimensional image and the size of the two-dimensional image, wherein the intrinsic parameter information includes the camera position and the camera field of view.
[0034] For each pixel, perform the following operations:
[0035] Based on the position of the pixel in the two-dimensional image, the size of the two-dimensional image, and the camera field of view, a camera ray is determined with the camera position as the starting point.
[0036] Based on the intersection point of the camera ray and the reference plane, a first distance from the camera position to the intersection point is determined; wherein, the reference plane is perpendicular to the camera optical axis and corresponds to the plane where the naked-eye 3D screen is located;
[0037] Based on the depth information of the pixel in the optimized depth map, determine the second distance from the spatial point corresponding to the pixel to the reference plane along the camera ray;
[0038] Subtract the first distance from the second distance to obtain the distance value corresponding to the pixel;
[0039] The distance map is generated based on the distance values corresponding to all pixels.
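The ray construction and the first distance can be illustrated with a simple pinhole model. The sketch assumes a camera looking along the +z axis, a vertical field of view given in degrees, pixel centers at half-integer offsets, and a reference plane at `plane_depth` along the optical axis. These conventions and all function names are assumptions. Combining the result with the second distance is deliberately left out, because it depends on the sign convention used for the depth information.

```python
import math


def camera_ray(u, v, width, height, fov_y_deg):
    """Unit direction of the ray through pixel (u, v), in camera coordinates."""
    half = math.tan(math.radians(fov_y_deg) / 2.0)
    aspect = width / height
    x = (2.0 * (u + 0.5) / width - 1.0) * half * aspect
    y = (1.0 - 2.0 * (v + 0.5) / height) * half
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def first_distance(direction, plane_depth):
    """Distance along the ray from the camera to a plane perpendicular to the optical axis."""
    return plane_depth / direction[2]


def point_on_ray(origin, direction, distance):
    """Point reached after travelling `distance` along the ray from `origin`."""
    return tuple(o + distance * d for o, d in zip(origin, direction))


camera_position = (0.0, 0.0, 0.0)
center = camera_ray(1, 1, 3, 3, 60.0)   # ray through the central pixel
corner = camera_ray(0, 0, 3, 3, 60.0)   # ray through the top-left pixel
hit = point_on_ray(camera_position, corner, first_distance(corner, 2.0))
print(center, first_distance(center, 2.0), first_distance(corner, 2.0), hit)
```

The central ray reaches the plane after exactly `plane_depth`, while oblique rays travel farther, which is why the conversion is done per pixel along the ray rather than along the optical axis.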
[0040] In conjunction with the first aspect, in an optional implementation, generating an interlaced map for playback on a glasses-free 3D display device based on the three-dimensional mesh includes:
[0041] Based on the display parameters of the target glasses-free 3D display device, the foreground 3D mesh and the background 3D mesh are rendered from multiple perspectives, and the rendered images from multiple perspectives are combined into an interlaced image for playback on the target glasses-free 3D display device.
[0042] In a second aspect, an image processing apparatus is provided, the apparatus comprising:
[0043] The first generation module is used to perform depth estimation on the input two-dimensional image and generate a first depth map, wherein the first depth map is used to characterize the distance of the spatial point corresponding to each pixel in the two-dimensional image relative to the camera that acquired the two-dimensional image.
[0044] An optimization processing module is used to optimize the first depth map to obtain an optimized depth map;
[0045] The 3D reconstruction module is used to reconstruct a 3D mesh based on the inverse camera space transformation and the optimized depth map.
[0046] The second generation module is used to generate an interlaced map for playback on a naked-eye 3D display device based on the three-dimensional mesh.
[0047] In a third aspect, an electronic device is provided, including a processor, a memory, and an executable program stored in the memory and operable by the processor, wherein the processor, when running the executable program, performs the steps of the image processing method provided in any implementation of the first aspect.
[0048] In a fourth aspect, a computer-readable storage medium is provided having an executable program stored thereon, which, when executed by a processor, implements the steps of the image processing method provided in any implementation of the first aspect.
[0049] This application provides an image processing method, apparatus, device, and storage medium. Through depth estimation, optimization processing, 3D mesh reconstruction, and interlaced map generation, a complete automated pipeline from 2D images to 3D display is formed. Compared to traditional 3D imaging algorithms, this embodiment only requires a 2D image as input and does not rely on manual intervention or repair. It can achieve fully automated generation of a 3D interlaced map from a single 2D image and can generate interlaced maps in batches, resulting in high processing efficiency. At the same time, through depth optimization and mesh reconstruction based on the inverse camera space transformation, the accuracy of the generated 3D mesh is improved, thereby enhancing the stereoscopic effect and image precision of the 3D image.
Attached Figure Description
[0050] Figure 1 is a first schematic flowchart of an image processing method according to an embodiment;
[0051] Figure 2 is a second schematic flowchart of an image processing method according to an embodiment;
[0052] Figure 3 is a comparison diagram of depth estimation results for scenes with a large depth of field;
[0053] Figure 4 is a third schematic flowchart of an image processing method according to an embodiment;
[0054] Figure 5 is a comparison diagram of foreground-background connection artifacts and their optimization effects;
[0055] Figure 6 is a fourth schematic flowchart of an image processing method according to an embodiment;
[0056] Figure 7 is a comparison diagram of jagged depth map edges and their optimization effects;
[0057] Figure 8 is a comparison diagram of the effect of different color transition directions on the degree of artifacts in the foreground-background boundary area;
[0058] Figure 9 is a fifth schematic flowchart of an image processing method according to an embodiment;
[0059] Figure 10 is a sixth schematic flowchart of an image processing method according to an embodiment;
[0060] Figure 11 is a diagram illustrating the geometric relationships and the calculation of vertex positions when geometry is generated using the inverse camera parameters;
[0061] Figure 12 is a side-view comparison between geometry generated directly using depth map coordinates and geometry generated using inverse camera parameters;
[0062] Figure 13 is a diagram illustrating two camera array modes in multi-view rendering for a naked-eye 3D screen;
[0063] Figure 14 is a seventh schematic flowchart of an image processing method according to an embodiment;
[0064] Figure 15 is a schematic structural diagram of an image processing apparatus according to an embodiment.
Detailed Implementation
[0065] To make the technical solution and beneficial effects of this application more apparent and understandable, a detailed description is provided below by listing specific embodiments. The accompanying drawings are not necessarily drawn to scale, and local features may be enlarged or reduced to more clearly show the details of the local features; unless otherwise defined, the technical and scientific terms used herein have the same meanings as those in the technical field to which this application pertains.
[0066] The embodiments in this application are not exhaustive, but merely illustrative of some embodiments, and are not intended to limit the scope of protection of this disclosure. Unless otherwise specified, each step in a particular embodiment can be implemented as an independent embodiment, and the steps can be arbitrarily combined. For example, a solution after removing some steps in a particular embodiment can also be implemented as an independent embodiment, and the order of the steps in a particular embodiment can be arbitrarily interchanged. Furthermore, the optional implementation methods in a particular embodiment can be arbitrarily combined; moreover, the embodiments can be arbitrarily combined, for example, some or all steps of different embodiments can be arbitrarily combined, and a particular embodiment can be arbitrarily combined with the optional implementation methods of other embodiments.
[0067] In each embodiment of this application, unless otherwise specified or in case of logical conflict, the terminology and / or descriptions of the embodiments are consistent and can be referenced by each other. Technical features in different embodiments can be combined to form new embodiments based on their inherent logical relationships.
[0068] In the description of the embodiments of this application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, features defined with "first" and "second" may explicitly or implicitly include one or more of the stated features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0069] In existing technologies, researchers have explored various methods to transform 2D images into content suitable for glasses-free 3D devices. One type of method relies on manual, interactive depth map drawing and scene layering, requiring professionals to perform depth annotation and region segmentation on the images. While this method can achieve good stereoscopic effects, it is highly dependent on manual intervention, has low processing efficiency, and is difficult to support large-scale applications. Another type of method automatically estimates depth based on traditional computer vision algorithms such as stereo matching or Structure from Motion (SfM). However, these methods often lack sufficient accuracy and robustness in depth estimation when dealing with complex scenes, weakly textured regions, or occlusions, easily generating noise and holes. This results in visual defects such as distortion and jitter in the final 3D effect, making it difficult to meet the requirements of high-quality glasses-free 3D displays.
[0070] To facilitate understanding, a brief explanation of some of the terms used in this application will be provided first.
[0071] Visual large model: A deep learning model with a huge number of parameters that can learn general visual features and knowledge from a large amount of image or video data to complete a variety of complex visual understanding and generation tasks.
[0072] Depth map optimization: Enhancing and correcting the original depth estimation map through algorithms to improve its accuracy, smoothness, and fit with the scene structure, thereby obtaining higher quality depth information.
[0073] Mesh reconstruction: The process of generating a three-dimensional surface mesh model composed of vertices, edges and faces based on input data such as point clouds, depth maps or multi-view images; it is a key step in three-dimensional digitization.
[0074] Naked-eye 3D: A display technology that allows viewers to directly perceive three-dimensional images with a stereoscopic depth effect without wearing any special glasses or helmets or other auxiliary equipment.
[0075] Figure 1 is a flowchart illustrating an image processing method according to one embodiment. As shown in Figure 1, the method includes the following steps:
[0076] S110: Perform depth estimation on the input two-dimensional image to generate a first depth map, wherein the first depth map is used to characterize the distance of the spatial point corresponding to each pixel in the two-dimensional image relative to the camera that acquired the two-dimensional image;
[0077] S120: Optimize the first depth map to obtain an optimized depth map;
[0078] S130: Reconstruct a 3D mesh based on the inverse camera space transformation and the optimized depth map;
[0079] S140: Generate, based on the 3D mesh, an interlaced map for playback on glasses-free 3D display devices.
[0080] The image processing method provided in this application can be applied to any electronic device with image processing capabilities, such as a server or a terminal with image processing software installed. The server can be a single server, a server cluster consisting of multiple servers, or a cloud computing service center; the terminal can be a PC (Personal Computer), smartphone, PDA (Personal Digital Assistant), tablet computer, or glasses-free 3D display device. This application does not impose any limitations on the specific type of the aforementioned electronic device.
[0081] In some examples, electronic devices can acquire two-dimensional images in various ways. For instance, in response to a user upload operation, the electronic device can acquire a single image uploaded by the user or multiple two-dimensional images imported in batches. Another example is that a two-dimensional image can be any frame from a video; the electronic device can import a user-specified video, decode it, and extract the image sequence frame by frame for processing. Additionally, the electronic device can connect to a camera to acquire two-dimensional images in real time and process the acquired images in real time.
[0082] It is worth noting that this application does not require stereo matching or multi-view reconstruction based on two-dimensional images from different perspectives, but can automatically generate naked-eye 3D effects with only a single two-dimensional image.
[0083] Depth estimation is used to assign depth information to each pixel to generate a first depth map. In step S110 above, a pre-trained deep learning model can be used to estimate the depth of the two-dimensional image. The model input is a two-dimensional image, and the output is a depth information map with the same size as the input image. The depth information output by the model can be a depth value or an inverse depth value, depending on the depth representation used during model training. The depth value represents the actual distance between the spatial point corresponding to the pixel and the camera; a smaller depth value indicates that the point is closer to the camera. The inverse depth value is negatively correlated with the depth value and can be the reciprocal of the depth value or a linear transformation thereof; a larger inverse depth value indicates that the spatial point corresponding to the pixel is closer to the camera.
[0084] Considering that the depth information output by the model may have inconsistent ranges (for example, the maximum and minimum values of different images may differ significantly), it needs to be normalized. For example, the original depth information (e.g., inverse depth value) output by the model is linearly normalized and mapped to a predetermined interval (e.g., [0,1]) to obtain the first depth map.
[0085] It should be noted that if the model has already been trained and directly outputs the normalized depth information, then step S110 can directly use the depth map output by the model as the first depth map without additional normalization processing.
[0086] In step S120 above, to improve the quality of the depth map, the first depth map can be optimized to provide more accurate input for subsequent 3D reconstruction. For example, smoothing algorithms such as median filtering and Gaussian filtering can be used to remove isolated noise points; or image inpainting algorithms can be used to fill holes in the depth map caused by occlusion or weak texture areas. Through one or more of the above optimization methods, an optimized depth map with improved quality is obtained.
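As an illustration of the smoothing option mentioned above, the following is a minimal median filter over a depth map stored as a list of rows. The window radius, the clipped-window border handling, and the name `median_filter` are assumptions chosen for brevity. It is a sketch of one listed option, not the optimization the application ultimately relies on.

```python
from statistics import median


def median_filter(depth, radius=1):
    """Replace each value by the median of its (2*radius+1)^2 window, clipped at borders."""
    rows, cols = len(depth), len(depth[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [depth[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            out[r][c] = median(window)
    return out


# A flat depth map with one isolated noisy pixel in the center.
noisy = [
    [0.5, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5],
]
filtered = median_filter(noisy)
print(filtered)
```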
[0087] Inverse camera spatial transformation is the process of using the camera imaging parameters when acquiring a 2D image to map the pixel coordinates in the 2D image into a spatial ray (i.e., a camera ray) originating from the camera position.
[0088] In step S130 above, the spatial ray corresponding to each two-dimensional pixel can be obtained based on the inverse camera spatial transformation. Combined with the depth information in the optimized depth map, the two-dimensional pixels are mapped to three-dimensional space to construct a renderable three-dimensional mesh model. For example, a spatial ray originating from the camera position can be determined for each pixel. Combining the depth information of each pixel in the optimized depth map, the position of each pixel in three-dimensional space can be calculated. The three-dimensional spatial points corresponding to all pixels are connected according to pixel adjacency relationships to form a triangular mesh, generating a continuous three-dimensional mesh model.
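Connecting the per-pixel points by adjacency can be sketched as splitting every 2x2 block of neighboring pixels into two triangles. The row-major vertex order, the triangle winding, and the helper names below are assumptions. The rays and distances are taken as given inputs rather than computed here.

```python
def vertices_from_distances(origin, rays, distances):
    """Place one vertex per pixel at origin + distance * ray direction."""
    return [tuple(o + t * d for o, d in zip(origin, ray))
            for ray, t in zip(rays, distances)]


def grid_triangles(width, height):
    """Two triangles per pixel quad; indices refer to a row-major vertex list."""
    faces = []
    for r in range(height - 1):
        for c in range(width - 1):
            top_left = r * width + c
            top_right = top_left + 1
            bottom_left = top_left + width
            bottom_right = bottom_left + 1
            faces.append((top_left, bottom_left, top_right))
            faces.append((top_right, bottom_left, bottom_right))
    return faces


# One vertex on the optical axis, two units from a camera at the origin.
single = vertices_from_distances((0.0, 0.0, 0.0), [(0.0, 0.0, 1.0)], [2.0])
faces_2x2 = grid_triangles(2, 2)
print(single, faces_2x2, len(grid_triangles(3, 2)))
```

An image of width W and height H yields 2 * (W - 1) * (H - 1) triangles under this scheme.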
[0089] In step S140 above, the reconstructed 3D mesh can be rendered from multiple perspectives to generate multiple viewpoint images. Based on the optical parameters of the naked-eye 3D screen, these multiple viewpoint images are synthesized into an interlaced image, which can be used by the naked-eye 3D display device to play stereoscopic effects. The optical parameters of the naked-eye 3D screen may include key imaging parameters such as screen resolution, multi-view camera arrangement, number of viewpoints, optimal viewing distance, and / or lens spacing. In this embodiment, the naked-eye 3D screen is the screen of the naked-eye 3D display device.
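The synthesis of viewpoint images into one interlaced image can be illustrated with the simplest possible scheme, in which output column x is taken from view x mod N. This column-wise rule and the name `interlace_columns` are assumptions made for illustration only. A real glasses-free 3D panel would typically need a device-specific mapping derived from its optical parameters, which the source does not detail here.

```python
def interlace_columns(views):
    """Interleave N equally sized views column by column (view index = x mod N)."""
    n = len(views)
    height, width = len(views[0]), len(views[0][0])
    for view in views:
        if len(view) != height or any(len(row) != width for row in view):
            raise ValueError("all views must have the same size")
    return [[views[x % n][y][x] for x in range(width)] for y in range(height)]


# Two tiny 1x4 "images" whose pixel labels show which view each column came from.
view_a = [["a0", "a1", "a2", "a3"]]
view_b = [["b0", "b1", "b2", "b3"]]
interlaced = interlace_columns([view_a, view_b])
print(interlaced)
```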
[0090] The image processing method provided in this application forms a complete automated processing pipeline from 2D images to naked-eye 3D display through depth estimation, optimization processing, 3D mesh reconstruction, and interlacing map generation. Compared with traditional naked-eye 3D imaging algorithms, this embodiment only requires a 2D image as input, without relying on manual intervention or repair. It can achieve fully automated generation from a single 2D image to a naked-eye 3D interlacing map, and can generate interlacing maps in batches, resulting in high processing efficiency. At the same time, through depth optimization and mesh reconstruction based on camera space inverse transformation, the accuracy of generating 3D meshes is improved, thereby enhancing the stereoscopic effect and image precision of naked-eye 3D images.
[0091] With the development of deep learning technology, especially the widespread adoption of large-scale visual models, new solutions have been provided for depth estimation of single images. These models, trained on large-scale datasets, are able to predict more accurate and dense depth information from a single image. Depth estimation in related technologies mainly relies on two methods: one is a regression-based end-to-end visual model that directly maps the input image to a depth map; the other is a generative method based on diffusion models, which learns a mapping from the image distribution to the conditional distribution of the depth map. However, these methods share a common problem: insufficient ability to distinguish subtle depth changes in foreground regions (such as faces), resulting in foreground objects often appearing flat and lacking a sense of depth in 3D reconstruction results.
[0092] The inventors' analysis revealed that the root cause of the problem lies in the fact that during training, the significant depth difference between the foreground and background dominates the optimization direction of the loss function, while the depth difference between adjacent regions within the foreground is relatively small, making it easy for the network to ignore these local details. Some depth estimation methods use inverse depth as the fitting target during training, which improves the prediction accuracy of the foreground region to some extent. However, when the scene has a large depth range, especially when it contains complex elements such as ground and sky, the estimation of foreground depth still has significant bias, and the foreground's layer richness is poor.
[0093] In glasses-free 3D display applications, the foreground region, which produces a prominent visual effect on the screen, is the part that viewers are most sensitive to in perceiving 3D space. A large error in foreground depth estimation directly affects the perceived stereoscopic realism. One possible implementation is to amplify the foreground error by assigning higher weights in the loss function to regions closer to the camera (i.e., pixels with larger inverse depth values), allowing the neural network to focus more on improving foreground estimation accuracy during training. However, this method faces a new challenge: the model typically predicts a relative inverse depth, the numerical range of which is not fixed across different images. For glasses-free 3D displays, the distinction between foreground and background is a relative concept, prevalent in various visual scenes from microscopic to macroscopic. Therefore, the goal of algorithm design should be to adaptively identify and accurately reconstruct foreground and background layers in any image, independent of the absolute physical scale of the scene, thus possessing good versatility.
[0094] To improve the depth accuracy and layer richness of the foreground region, which is most sensitive to 3D perception, this application has made targeted optimizations to the depth estimation method based on the deep learning large vision model.
[0095] In some embodiments, as shown in Figure 2, an image processing method is provided, which may include the following steps:
[0096] S200: A pre-trained deep learning model is used to predict the depth information of each pixel in a two-dimensional image, and the prediction results are linearly normalized to a predetermined interval to obtain the first depth map. During the training process of the deep learning model, the depth information predicted by the model and the depth information in the training labels are both linearly normalized to a predetermined interval and then subjected to the same non-uniform mapping before being input into the loss function.
[0097] The non-uniform mapping ensures that, for any two equal-length intervals of depth information, the interval closer to the camera is mapped to a longer interval than the interval farther from the camera.
[0098] In some examples, step S200 above can serve as an optional implementation of step S110 in Figure 1, or can be implemented separately as an independent embodiment. For descriptions of the two-dimensional image, the first depth map, and so on, refer to the relevant parts of the embodiment of Figure 1; they are not repeated here.
[0099] Deep learning models can employ encoder-decoder structures, such as those based on convolutional neural networks or Transformers. The prediction results include depth information predicted for each pixel in a two-dimensional image. The depth information can be depth values or inverse depth values, depending on the requirements of the model task.
[0100] Linear normalization is used to map the prediction result to a predetermined interval, such as [0, 1], through a linear transformation.
[0101] As an example, linear normalization can be implemented using min-max normalization. For instance, if the model outputs inverse depth values, then for each pixel the depth value $I$ (i.e., the original depth value) converted from the inverse depth value is calculated, and the linearly normalized depth value of that pixel is then calculated as $I_{norm} = (I - I_{min}) / (I_{max} - I_{min})$, where $I_{min}$ and $I_{max}$ are the minimum and maximum values, respectively, among the depth values converted from the inverse depth values. If the model outputs depth values, linear normalization can be applied directly to map them to the predetermined interval.
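The min-max normalization described above can be sketched as follows. This is a minimal illustration on plain Python lists, not the applicant's implementation; the function name `normalize_depth` and the guard value `eps` are assumptions introduced here.

```python
def normalize_depth(inverse_depth, eps=1e-6):
    """Convert inverse depth to depth, then min-max normalize to [0, 1]."""
    # Depth is the reciprocal of inverse depth; eps (assumed) guards against division by zero.
    depth = [[1.0 / max(v, eps) for v in row] for row in inverse_depth]
    d_min = min(min(row) for row in depth)
    d_max = max(max(row) for row in depth)
    span = max(d_max - d_min, eps)  # avoids 0/0 on a constant map
    return [[(d - d_min) / span for d in row] for row in depth]


# Inverse depths 1, 0.5, 0.25, 0.2 correspond to depths 1, 2, 4, 5.
norm = normalize_depth([[1.0, 0.5], [0.25, 0.2]])
# norm is approximately [[0.0, 0.25], [0.75, 1.0]]
```

The nearest pixel maps to 0 and the farthest to 1, so smaller normalized values correspond to the foreground.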
[0102] During the training phase, the depth information predicted by the model and the depth information in the training labels are first linearly normalized to a predetermined interval, then subjected to the same nonlinear mapping, and then the loss function is calculated. The network parameters are then optimized through backpropagation.
[0103] Training labels can be acquired by sensors (such as depth cameras or LiDAR) or generated by computer graphics rendering. The specific form of the depth information in the training labels is consistent with the specific form of the depth information predicted by the model. Specifically, if the model predicts a depth value, the training labels use the corresponding depth map; if the model predicts an inverse depth value, the original depth map obtained from acquisition or rendering needs to be transformed by taking the inverse of each pixel to obtain the training labels. The loss function is used to measure the difference between the depth predicted by the model and the true depth, and can be L1 loss, L2 loss, scale-invariant loss, or gradient matching loss.
[0104] Considering the limited effective depth range of naked-eye 3D screens, areas near the background struggle to achieve a realistic sense of depth, resulting in a visual effect closer to planar textures. Therefore, in naked-eye 3D applications, the foreground region is far more important than the background. To address this, this embodiment introduces nonlinear mapping during the model training phase to enhance the model's ability to perceive subtle depth changes in the foreground region. Nonlinear mapping transforms the data using a nonlinear function, causing changes in the foreground region in the input space to be non-uniformly stretched in the output space, while changes in the background region are non-uniformly compressed in the output space.
[0105] Here, the foreground region is the set of pixels in the 2D image corresponding to spatial points relatively close to the camera. In depth space, the foreground region consists of pixels with smaller depth values; in inverse depth space, it consists of pixels with larger inverse depth values. The background region is the set of pixels in the 2D image corresponding to spatial points relatively far from the camera, and is the opposite of the foreground region.
[0106] In this embodiment, the non-uniform mapping has a greater local magnification factor in regions with smaller values (when the depth information is a depth value, smaller values correspond to the vicinity of the camera) or regions with larger values (when the depth information is an inverse depth value, larger values correspond to the vicinity of the camera). This causes depth value changes closer to the camera to occupy a longer interval in the mapped space, while depth value changes farther away from the camera are relatively compressed. Through this non-uniform mapping, subtle depth differences in the foreground region can be effectively amplified in the loss function, thereby guiding the network to pay more attention to the detailed learning of the foreground region during training.
[0107] As an example, the depth information uses inverse depth values. Since the inverse depth value itself is negatively correlated with distance (the larger the value, the closer the distance), combined with the characteristic that the non-uniform mapping has a greater local magnification in the foreground region (i.e., the region with a larger inverse depth value), it can further enhance the weight of depth differences in the foreground region during the training process, thereby effectively highlighting the three-dimensional sense of foreground objects in naked-eye 3D display.
[0108] It should be noted that the aforementioned nonlinear mapping is only used in the calculation of the loss function during the training phase to guide the model's optimization direction. During the inference phase (e.g., executing step S110), only the depth information output by the model is linearly normalized to obtain the depth map; the nonlinear mapping is no longer applied. This asymmetric design between training and inference ensures both the model's ability to learn foreground details and suppresses geometric distortion of depth values during inference.
[0109] In some embodiments, the depth information is an inverse depth value that is negatively correlated with the depth value, and the larger the inverse depth value, the closer it is to the camera; the non-uniform mapping uses a non-linear function to map the depth values that are converted and normalized to a predetermined interval from the inverse depth value; wherein, the non-linear function is a monotonically increasing power function, and during the training phase, the power exponent of the power function is set to a predetermined value that is greater than 0 and less than 1; during the inference phase, the power exponent of the power function is set to 1.
[0110] In this embodiment, during the training phase, the inverse depth values of each pixel predicted by the model are first converted into depth values and linearly normalized, mapping the depth values to the [0, 1] interval. Then, a monotonically increasing power function is used to map the linearly normalized depth values. This function must satisfy the following conditions: it is sensitive to changes in input values near the minimum depth value (i.e., the foreground), allocating a larger share of the output interval to them; and it is insensitive to changes in input values near the maximum depth value (i.e., the background), compressing their output interval. Through this nonlinear mapping, subtle depth differences in the foreground region are amplified during training, thereby guiding the network to pay more attention to learning foreground details. For example, the power function is $y = x^{\gamma}$, where $x$ is the linearly normalized depth value, $y$ is the mapped value, and $\gamma$ is the power exponent; for example, $\gamma$ is 0.3.
[0111] When the power exponent $\gamma = 1$, the power function degenerates into standard linear normalization, and the foreground and background are mapped uniformly. When $0 < \gamma < 1$, the power function is nonlinear. As $\gamma$ decreases, the curve becomes steeper near the minimum depth value, meaning that small depth changes in the foreground are amplified and occupy a longer span within the predetermined interval, while large changes in the background area are compressed into a smaller output interval.
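The effect of the power mapping on the training loss can be illustrated with a short sketch. It assumes the mapping has the form x ** gamma on depth values already normalized to [0, 1], and it uses an L1 loss, which is one of the loss options the description lists. The names `power_map` and `mapped_l1_loss` are hypothetical.

```python
def power_map(x, gamma):
    """Monotonically increasing power function on a normalized depth x in [0, 1]."""
    return x ** gamma


def mapped_l1_loss(pred, label, gamma=0.3):
    """Mean absolute error after applying the same mapping to prediction and label."""
    diffs = [abs(power_map(p, gamma) - power_map(t, gamma))
             for p, t in zip(pred, label)]
    return sum(diffs) / len(diffs)


# The same input step of 0.05 occupies a longer output span near the camera.
near = power_map(0.10, 0.3) - power_map(0.05, 0.3)
far = power_map(0.95, 0.3) - power_map(0.90, 0.3)
assert near > far

# With gamma = 1 (inference setting) the mapping is the identity.
assert abs(power_map(0.4, 1.0) - 0.4) < 1e-12
```

Because a foreground error of a given size produces a larger mapped difference than a background error of the same size, the loss weights foreground errors more heavily.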
[0112] Through the training mechanism described above, the deep learning model can significantly improve the depth level richness of the foreground region while maintaining the overall depth estimation accuracy, providing higher quality depth input for subsequent 3D reconstruction and naked-eye 3D display.
[0113] In this embodiment, without changing the existing network training architecture, only a nonlinear mapping needs to be introduced at the output to improve foreground accuracy. During the training phase, a small power exponent (e.g., 0.3) is used to effectively amplify the depth variations in the foreground region within the normalized interval, thereby guiding the network to focus more on learning foreground details. During the inference phase, the power exponent is directly set to 1, causing the nonlinear mapping to degenerate into a linear identity mapping. Only the depth values output by the model are linearly normalized, avoiding depth distortion caused by nonlinear stretching. This approach does not require adding additional trainable network layers or retraining the network to adapt to new data distributions. By simply fine-tuning the output of the existing depth estimation model, the depth accuracy and layer richness of the foreground region can be effectively improved, resulting in a superior visual effect for the generated naked-eye 3D compared to existing solutions.
[0114] Figure 3 compares depth estimation results for scenes with a large depth of field. From left to right: the original image, the result generated by current mainstream depth estimation methods, the depth estimation result obtained by the proposed solution after foreground accuracy optimization, and the corresponding magnified images of their respective local regions. To protect privacy, the eye area of the face in the original image has been pixelated. The comparison shows that the proposed solution achieves higher depth contrast on faces, effectively enhancing the sense of depth.
[0115] In some embodiments, an image processing method is provided for optimizing a depth map to suppress rendering artifacts introduced by depth jumps between the foreground and background. As shown in Figure 4, the method includes the following steps:
[0116] S121: For each pixel in the first depth map, determine the neighborhood threshold based on the statistics of the depth information in the neighborhood of that pixel.
[0117] In the process of generating 3D meshes, adjacent pixels in the image can be used as vertices to construct triangular patches. For example, a 2×2 pixel neighborhood can be regarded as the vertices of two triangles, and combined with the depth information of the pixels to form 3D coordinates. However, when there is a significant depth jump between the foreground and background regions, the 3D mesh that directly connects the two regions will produce severe rendering artifacts.
[0118] To identify and process the aforementioned depth jumps, a pixel-by-pixel neighborhood analysis can be performed on the first depth map to determine the neighborhood threshold corresponding to each pixel. The neighborhood threshold corresponding to any pixel is used to determine whether the pixel belongs to the foreground region or the background region.
[0119] The depth information in the first depth map can be either inverse depth values or depth values. For example, when using inverse depth values, a larger inverse depth value indicates that the spatial point corresponding to the pixel is closer to the camera, while a smaller inverse depth value indicates a greater distance. For each pixel, the inverse depth values of all pixels within its preset-size neighborhood (e.g., a 3×3 window or a 5×5 window) can be obtained, and a statistic of this neighborhood can be calculated as the neighborhood threshold for that pixel. The statistic can be the average or median of the inverse depth values within the neighborhood, or can be calculated from the maximum and minimum values. For example, the neighborhood threshold for the pixel can be calculated from the maximum value B and the minimum value A among the inverse depth values of all pixels within the neighborhood of the pixel, together with an adjustment coefficient whose value is greater than 0 and less than 0.5; for example, the adjustment coefficient is 0.2.
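A per-pixel neighborhood threshold of this kind might be computed as in the sketch below. The exact formula is not recoverable from the text, so the form T = A + alpha * (B - A) used here is an assumption chosen only to be consistent with a coefficient between 0 and 0.5. The names `window_values`, `neighborhood_threshold`, `alpha`, and `radius` are also hypothetical.

```python
def window_values(img, r, c, radius=1):
    """Values in the (2*radius+1) x (2*radius+1) window around (r, c), clipped at borders."""
    rows, cols = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - radius), min(rows, r + radius + 1))
            for j in range(max(0, c - radius), min(cols, c + radius + 1))]


def neighborhood_threshold(inv_depth, alpha=0.2, radius=1):
    """Per-pixel threshold T = A + alpha * (B - A); this formula is an assumption."""
    out = []
    for r in range(len(inv_depth)):
        row = []
        for c in range(len(inv_depth[0])):
            vals = window_values(inv_depth, r, c, radius)
            a, b = min(vals), max(vals)  # A = neighborhood minimum, B = maximum
            row.append(a + alpha * (b - a))
        out.append(row)
    return out


inv_depth = [[0.1, 0.1, 0.9],
             [0.1, 0.5, 0.9],
             [0.1, 0.1, 0.9]]
thresholds = neighborhood_threshold(inv_depth)
# Center pixel: A = 0.1, B = 0.9, so T is approximately 0.26.
```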
[0120] S122: Divide the first depth map into foreground and background regions based on the neighborhood threshold to obtain the second depth map.
[0121] For each pixel, its inverse depth value is compared with its corresponding neighborhood threshold. If the inverse depth value is greater than the neighborhood threshold, the pixel is determined to belong to the foreground region; if the inverse depth value is less than or equal to the neighborhood threshold, the pixel is determined to belong to the background region. By comparing pixels one by one, all pixels in the first depth map are divided into foreground and background regions, resulting in a second depth map. Here, the second depth map can be a marker map, where foreground and background pixels are assigned different marker values.
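The pixel-by-pixel comparison can be sketched as a marker map, assuming the per-pixel thresholds have already been computed. The function name `segment` and the marker values 1 and 0 are illustrative choices.

```python
def segment(inv_depth, thresholds):
    """Marker map: 1 for foreground (inverse depth > threshold), 0 for background."""
    return [[1 if v > t else 0 for v, t in zip(v_row, t_row)]
            for v_row, t_row in zip(inv_depth, thresholds)]


markers = segment([[0.1, 0.5, 0.9]], [[0.26, 0.26, 0.9]])
# markers == [[0, 1, 0]]; the last pixel equals its threshold, so it is background.
```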
[0122] S123: Perform image completion processing on the background hole area after removing the foreground area in the second depth map to obtain the optimized depth map.
[0123] In this embodiment, threshold segmentation breaks the connection between the foreground and background regions, eliminating connection artifacts caused by depth jumps. Although connection artifacts are eliminated, holes in the background region are exposed from new perspectives. Therefore, image inpainting techniques can be used to fill these holes. Image inpainting can infer and fill the depth information within the holes based on the background information around them, maintaining the continuity and integrity of the background region.
[0124] In some examples, deep learning-based image inpainting models can be used to fill in background holes. For instance, image inpainting models such as Stable Diffusion can be used, taking a second depth map and a foreground region mask (a foreground region mask is used to identify the pixel positions in the image that belong to the foreground region, such as a binary image where foreground pixels are marked as 1 and background pixels as 0) as input, to generate inpainted background depth information. This image inpainting model, through depth distribution priors learned from a large amount of image data, can generate depth information for hole regions that is semantically consistent and texturally continuous with the surrounding background.
[0125] Figure 5 compares foreground-background connection artifacts with their optimized result. The left image shows the rendering artifacts resulting from a direct connection between the foreground and background regions, with noticeable stringy artifacts appearing at the foreground-background boundary. The right image shows the effect after disconnecting the foreground and background using the proposed solution and filling in the background holes. To protect privacy, the eye area of the face in the original image has been pixelated. The comparison demonstrates that the artifacts are effectively eliminated, the background area remains intact and continuous, and the overall visual effect is significantly improved.
[0126] In some examples, steps S121 to S123 above may serve as an optional implementation of step S120 in Figure 1, or may be implemented separately to constitute an independent embodiment. For descriptions of pixels, the first depth map, and so on, refer to the related parts of the embodiments of Figure 1 and Figure 2; they are not repeated here.
[0127] In this embodiment, the first depth map is divided into foreground and background regions using a threshold method, eliminating connection artifacts caused by depth jumps. Then, image inpainting technology is used to fill the background holes after the foreground is removed, effectively filling the holes and obtaining a higher quality optimized depth map, which provides a more accurate depth input for subsequent 3D mesh reconstruction.
[0128] While thresholding solves the artifacts between foreground and background, two problems remain: first, jagged edges appear at foreground boundaries; second, the foreground region itself exhibits noticeable stringy artifacts. Optimizing the generated geometry is very time-consuming and often yields unsatisfactory results. The inventors observed that, from an imaging perspective, a real natural scene is effectively passed through a low-pass filter when it is captured, so the pixel colors transition gradually at edges without obvious jaggedness. Depth behaves in the same way: whether a depth map is captured by a sensor or rendered synthetically, the depth values at edges also transition gradually. When converted to a 3D mesh, this gradual depth change leads to jagged artifacts at the edges.
[0129] Therefore, in some embodiments, an image processing method is provided for optimizing the depth map to suppress jagged edges and stringy artifacts in the foreground region. As shown in Figure 6, the method includes the following steps:
[0130] S1211: For each pixel in the first depth map, determine whether the pixel belongs to the foreground region or the background region based on the neighborhood threshold corresponding to the pixel, and assign the maximum inverse depth value in the neighborhood of the pixel to the pixel determined to belong to the foreground region, so as to obtain the third depth map.
[0131] To suppress the aforementioned jagged artifacts, a threshold filtering process can be applied to the first depth map. The depth information in the first depth map is the inverse depth value; a larger inverse depth value indicates that the spatial point corresponding to the pixel is closer to the camera, and a smaller inverse depth value indicates a greater distance. For each pixel, its inverse depth value is compared with its corresponding neighborhood threshold: if the pixel's inverse depth value is greater than the neighborhood threshold, the pixel is determined to belong to the foreground region; if the inverse depth value is less than or equal to the neighborhood threshold, the pixel is determined to belong to the background region. For pixels determined to be in the background region, their original depth values can be kept unchanged, or they can be assigned values as needed for subsequent processing. After all pixels undergo the above determination and assignment, a third depth map is obtained.
[0132] In this embodiment, through the above operations, the depth value of the boundary region between the foreground and background (i.e., the edge region) is uniformly increased to the foreground level, so that the originally slowly transitioning depth boundary becomes a step change, thereby suppressing jagged artifacts when generating the mesh.
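Step S1211 can be sketched as follows, assuming per-pixel thresholds are supplied and that background pixels keep their original values. The function name `threshold_filter` and the returned foreground mask are illustrative additions.

```python
def threshold_filter(inv_depth, thresholds, radius=1):
    """Raise each foreground pixel to the maximum inverse depth in its neighborhood."""
    rows, cols = len(inv_depth), len(inv_depth[0])
    out = [row[:] for row in inv_depth]          # background pixels stay unchanged
    mask = [[0] * cols for _ in range(rows)]     # 1 marks pixels judged as foreground
    for r in range(rows):
        for c in range(cols):
            if inv_depth[r][c] > thresholds[r][c]:
                # Maxima are read from the unmodified input, so the result
                # does not depend on the scan order.
                out[r][c] = max(
                    inv_depth[i][j]
                    for i in range(max(0, r - radius), min(rows, r + radius + 1))
                    for j in range(max(0, c - radius), min(cols, c + radius + 1)))
                mask[r][c] = 1
    return out, mask


# A gradual edge (0.1 -> 0.3 -> 0.6 -> 0.9) becomes noticeably sharper.
third, fg_mask = threshold_filter([[0.1, 0.3, 0.6, 0.9, 0.9]], [[0.26] * 5])
# third == [[0.1, 0.6, 0.9, 0.9, 0.9]]
```

A single pass with a 3×3 window steepens the transition; a larger window or a lower threshold moves more of the transition band up to the foreground level.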
[0133] Figure 7 compares jagged edges in depth maps with their optimized result. The left side shows the original depth map, where the slow transition of depth values at the foreground-background boundary results in noticeable jagged edges when generating a 3D mesh. The right side shows the optimized depth map after threshold filtering, and the mesh generated based on this optimized depth map. The boundary changes from a slow transition to a step change, eliminating the obvious jagged artifacts in the generated 3D mesh. To protect privacy, the eye area of the face in the original image of Figure 7 has been pixelated.
[0134] If the depth itself is not accurate enough before thresholding, the result of thresholding will still be inaccurate. Edge regions are inherently areas of depth uncertainty: low-pass filtering yields a pixel depth that is a weighted blend of the foreground and background depths. This uncertainty results in noticeable wire-like artifacts when converted to a mesh. The inventors discovered that the severity of wire-like artifacts and geometric distortion artifacts is closely related to the direction of color transition between the foreground and background. Figure 8 illustrates the impact of different color transition directions on the degree of artifacts at the foreground-background boundary. As shown in Figure 8, the left side shows the case where the foreground and background colors intermingle; the middle shows the case where the foreground color transitions to the background, in which the artifacts are extremely severe; and the right side shows the case where the background color transitions to the foreground, in which the artifacts are almost invisible. This phenomenon indicates that the severity of artifacts is closely related to the direction of color transition at the boundary where the depth value jumps from low to high.
[0135] To suppress drastic changes in the depth map, steps S1212 and S1213 can be used to expand the foreground region and optimize the gradient constraint.
[0136] S1212: For each pixel in the third depth map, if there is a pixel belonging to the foreground region in the neighborhood of the pixel, then the inverse depth value of all pixels in the neighborhood of the pixel is adjusted to the maximum inverse depth value of the neighborhood in order to expand the foreground region.
[0137] For example, for each pixel, it can be detected whether there are pixels that have been marked as foreground regions (i.e., pixels that were determined to be foreground and assigned a value in step S1211) within the N×N neighborhood of that pixel. If so, the inverse depth values of all pixels in the entire neighborhood of that pixel are adjusted to the maximum inverse depth value within that neighborhood. If there are no pixels belonging to the foreground region in the neighborhood of that pixel, no outward expansion processing is performed on that pixel and its neighborhood, and the original inverse depth value is maintained.
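The outward expansion can be sketched as below. The text does not say how overlapping neighborhoods are combined, so this sketch assumes that neighborhood maxima are read from the input map and that each output pixel keeps the largest value written to it. The name `expand_foreground` is hypothetical.

```python
def expand_foreground(inv_depth, fg_mask, radius=1):
    """Raise whole neighborhoods that contain a foreground pixel to their maximum."""
    rows, cols = len(inv_depth), len(inv_depth[0])
    out = [row[:] for row in inv_depth]
    for r in range(rows):
        for c in range(cols):
            win = [(i, j)
                   for i in range(max(0, r - radius), min(rows, r + radius + 1))
                   for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            if any(fg_mask[i][j] for i, j in win):
                peak = max(inv_depth[i][j] for i, j in win)
                for i, j in win:
                    # Keep the largest value written so far (assumed combination rule).
                    out[i][j] = max(out[i][j], peak)
    return out


expanded = expand_foreground([[0.1, 0.1, 0.1, 0.9, 0.9]], [[0, 0, 0, 1, 1]])
# expanded == [[0.1, 0.9, 0.9, 0.9, 0.9]]
```

Under this reading, the foreground grows by up to two window radii, because every pixel whose window touches the foreground raises its entire window.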
[0138] By expanding the foreground region outwards, this operation raises the inverse depth value of part of the background region to the foreground level. Background pixels that were originally located near the foreground edge are included in the foreground range, so that the depth transition area between the foreground and background is covered by the foreground depth information. This morphological operation effectively eliminates the wire-like artifacts caused by areas with uncertain depth.
[0139] S1213: Perform gradient constraint optimization on the expanded third depth map to obtain the second depth map; gradient constraint optimization is used to minimize the difference in depth information between the adjusted depth map and the unadjusted depth map at each pixel, and at the same time minimize the gradient magnitude of the adjusted depth map in the neighborhood of each pixel.
[0140] While the outward expansion process eliminates wire-like artifacts, it may introduce local abrupt changes in depth. To maintain the smoothness and geometric consistency of the depth map, this step performs gradient constraint optimization on the expanded depth map.
[0141] In one alternative implementation, the gradient constraint optimization can be achieved by minimizing the following energy function $E(D)$:
[0142] $E(D) = \sum_{p \in \Omega} \left[ (D_p - D^0_p)^2 + \lambda \sum_{q \in N(p)} (D_p - D_q)^2 \right]$;
[0143] where $D_p$ represents the depth value at pixel p in the adjusted depth map; $D^0_p$ represents the inverse depth value at pixel p in the depth map before adjustment (i.e., the third depth map after expansion); $\Omega$ represents the entire image domain; $N(p)$ is the neighborhood of pixel p (e.g., a 3×3 or 5×5 window); q is a pixel within the neighborhood $N(p)$ of pixel p; and $\lambda$ is the penalty coefficient, which can be adjusted according to the actual depth range. When the depth information is mapped to the range [0, 1], $\lambda$ can be 0.0015.
[0144] The data fidelity term of the energy function minimizes the difference in depth information at each pixel between the adjusted depth map and the depth map before adjustment, ensuring that the adjusted depth map stays close to it in pixel values, that is, preserving the original geometric details as much as possible. The smoothness constraint term minimizes the gradient magnitude of the adjusted depth map within the neighborhood of each pixel, ensuring smooth changes in depth information within local regions. This suppresses potential local abrupt changes introduced by the outward expansion of the foreground region and maintains the geometric continuity of the depth map. Thus, by minimizing this energy function, a second depth map that retains the outward expansion effect while exhibiting good local consistency can be obtained.
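One way to minimize an energy of this kind is sketched below. It assumes a quadratic energy, namely the sum over pixels of (D_p - D0_p)^2 plus lam times the sum over neighbors q of (D_p - D_q)^2, with a 4-connected neighborhood, and it solves the resulting linear system with Jacobi iterations. The quadratic form, the 4-neighborhood, the solver, and the name `smooth_depth` are all assumptions rather than details confirmed by the text.

```python
def smooth_depth(d0, lam=0.0015, iters=200):
    """Jacobi iterations for the assumed quadratic fidelity-plus-smoothness energy."""
    rows, cols = len(d0), len(d0[0])
    d = [row[:] for row in d0]
    for _ in range(iters):
        new = [row[:] for row in d]
        for r in range(rows):
            for c in range(cols):
                nbrs = [d[i][j]
                        for i, j in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= i < rows and 0 <= j < cols]
                # Setting dE/dD_p = 0 gives this update; the factor 2 appears
                # because each neighbor pair is counted from both sides.
                new[r][c] = (d0[r][c] + 2 * lam * sum(nbrs)) / (1 + 2 * lam * len(nbrs))
        d = new
    return d


# A large lam is used here only to make the smoothing visible on a tiny example.
smoothed = smooth_depth([[0.0, 0.0, 1.0, 1.0]], lam=0.5)
# smoothed is approximately [[1/7, 2/7, 5/7, 6/7]]
```

With the small coefficient mentioned in the description (0.0015 on a [0, 1] range), the result stays very close to the input and only sharp local spikes are attenuated.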
[0145] The right side of Figure 8 shows the effect after optimization using steps S1211 to S1213 above. It can be seen that the wire-like artifacts in the boundary area between the foreground and background are significantly weakened. This optimization effectively improves the visual effect of naked-eye 3D display.
[0146] In some examples, steps S1211 to S1213 above may serve as an optional implementation of step S122 in Figure 4 (dividing the first depth map into foreground and background regions based on the neighborhood threshold), or may be implemented independently to form a separate embodiment. For descriptions of pixels, the first depth map, and the neighborhood threshold, refer to the related parts of the embodiments of Figure 1, Figure 2, and Figure 4; they are not repeated here.
[0147] In this embodiment, depth map optimization, combining threshold filtering, foreground expansion, and gradient constraint optimization, effectively eliminates jagged edges in the foreground region during subsequent mesh generation, as well as wire-like artifacts caused by abrupt depth changes between the foreground and background. Simultaneously, gradient constraint optimization suppresses local distortions introduced by expansion while preserving geometric details. This optimization provides a high-quality depth data foundation for subsequent 3D mesh reconstruction and naked-eye 3D display, contributing to improved stereoscopic visual effects in the final display.
[0148] In some embodiments, to improve processing efficiency, step S1211 may include:
[0149] For each pixel in the first depth map, the difference between the maximum and minimum inverse depth values in the pixel's neighborhood is determined as the neighborhood range of the pixel. If the neighborhood range of the pixel is greater than a preset range threshold, the pixel is determined to belong to the foreground region or the background region based on the neighborhood threshold.
[0150] If the neighborhood range of a pixel is greater than a preset range threshold, it indicates that the pixel is located in a region with drastic depth changes and requires threshold filtering. If the neighborhood range of a pixel is less than or equal to the preset range threshold, it indicates that the region where the pixel is located is flat and threshold filtering is not required. This ensures that for flat regions (such as large walls, the sky, or the ground), the depth changes are gradual and there are no significant depth jumps, thus avoiding jagged artifacts during mesh generation. By introducing a pre-judgment of the neighborhood range into the threshold filtering process, the computational load can be effectively reduced while maintaining optimization effectiveness, allowing processing resources to be concentrated on the edge regions that truly need optimization, thereby improving overall processing efficiency.
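The range pre-check amounts to a single comparison per pixel, as in the sketch below. The name `needs_threshold_filter` and the default value of `range_threshold` are assumptions; the description does not give a concrete threshold.

```python
def needs_threshold_filter(window_values, range_threshold=0.05):
    """True when the neighborhood range (max - min) exceeds the preset range threshold."""
    return max(window_values) - min(window_values) > range_threshold


flat_wall = [0.30, 0.31, 0.30, 0.31]   # gradual change: skipped
object_edge = [0.10, 0.50, 0.90]       # depth jump: processed
assert not needs_threshold_filter(flat_wall)
assert needs_threshold_filter(object_edge)
```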
[0151] Natural images are obtained through perspective projection transformation by the camera, a process that inherently involves a non-linear mapping. If the reconstructed mesh model is then directly rendered using perspective projection again, it's equivalent to applying two perspective transformations to real-world objects, resulting in geometric distortion in the final image. Furthermore, neglecting perspective projection compensation means that when the background fills the entire screen, foreground objects protruding forward will exceed the boundary of the projection frustum, causing parts of them to remain unrendered and disrupting the visual integrity of the image. Simultaneously, this distortion amplifies the foreground's occlusion effect on the background, impacting the scene's stereoscopic presentation.
[0152] To eliminate perspective projection distortion and adapt to multi-view rendering requirements, some embodiments provide an image processing method. As shown in Figure 9, the method includes the following steps:
[0153] S131: Based on the camera space inverse transformation, the optimized depth map is converted into a distance map; whereby the distance map is used to represent the distance from the 3D spatial point corresponding to each pixel to the camera position along the camera ray direction.
[0154] Based on the inverse camera space transform, the depth information (e.g., inverse depth value) in the optimized depth map is converted into the actual distance along the camera ray direction to generate a distance map. Here, the camera ray is a ray originating from the camera position and pointing to the spatial direction corresponding to the pixel; the actual distance along the camera ray direction to the camera position is the distance from the camera position to the spatial point corresponding to that pixel along the ray direction.
[0155] S132: Based on the distance values of each pixel in the distance map, determine the position of the 3D spatial point corresponding to each pixel on the camera ray, and obtain the mesh vertices.
[0156] In the generation of multi-view images for naked-eye 3D displays, camera arrays typically employ two modes: lens-shifted and orientation-shifted. For standardized processing, a reference plane (called the zero plane) is defined, with the camera positioned on the negative half-axis of the optical axis at a preset distance from the zero plane. The coverage area of the camera's field of view within the zero plane can be calculated based on the camera's field of view angle.
[0157] To ensure that the generated geometry, when rendered from the normal (frontal) viewpoint, matches the original image, the distance of each camera ray is calculated, the depth of the corresponding pixel is subtracted from this distance, and the camera position is then moved along the normalized ray direction by the distance so obtained for that pixel to obtain the position P of the mesh vertex.
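The vertex computation can be sketched as a point on the camera ray. The sketch assumes the ray direction is already normalized and that the pixel depth is expressed as a distance along the ray measured from the zero plane toward the camera. The name `mesh_vertex` and its parameters are hypothetical.

```python
def mesh_vertex(cam_pos, ray_dir, ray_distance, depth):
    """Point reached by moving from cam_pos along ray_dir by (ray_distance - depth)."""
    t = ray_distance - depth
    return tuple(o + t * d for o, d in zip(cam_pos, ray_dir))


# Camera 2 units behind the zero plane (Z = 0), looking along +Z.
# A pixel protruding 0.5 units out of the screen lands at Z = -0.5.
p = mesh_vertex((0.0, 0.0, -2.0), (0.0, 0.0, 1.0), ray_distance=2.0, depth=0.5)
# p == (0.0, 0.0, -0.5)
```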
[0158] S133: Construct a foreground 3D mesh based on the mesh vertices belonging to the foreground region, and construct a background 3D mesh based on the mesh vertices belonging to the background region.
[0159] After obtaining the mesh vertices corresponding to all pixels, the vertices are divided into a foreground vertex set and a background vertex set according to the pre-determined foreground and background regions (e.g., based on the threshold segmentation result in step S122); triangular meshes are constructed for the foreground vertices and background vertices according to the pixel adjacency relationship to obtain the foreground 3D mesh and the background 3D mesh.
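Building separate triangle meshes from pixel adjacency might look like the sketch below. It assumes each 2×2 pixel quad yields two triangles, and that a quad is used only when all four of its pixels carry the same region label, so that no triangle spans the foreground-background boundary. That rule and the name `build_triangles` are assumptions.

```python
def build_triangles(mask, label):
    """Vertex-index triples for 2x2 quads whose four pixels all carry `label`."""
    rows, cols = len(mask), len(mask[0])
    tris = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            quad = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
            if all(mask[i][j] == label for i, j in quad):
                # Row-major vertex indices: top-left, top-right, bottom-left, bottom-right.
                v00, v01, v10, v11 = (i * cols + j for i, j in quad)
                tris.append((v00, v01, v10))
                tris.append((v01, v11, v10))
    return tris


mask = [[1, 1, 0],
        [1, 1, 0]]
foreground_tris = build_triangles(mask, 1)
background_tris = build_triangles(mask, 0)
# foreground_tris == [(0, 1, 3), (1, 4, 3)]; the mixed quad produces no triangles.
```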
[0160] In some examples, steps S131 to S133 above may serve as an optional implementation of step S130 in Figure 1, or may be implemented separately to constitute an independent embodiment. For descriptions of the camera space inverse transformation, pixels, depth maps, and so on, refer to the related parts of the embodiments of Figure 1, Figure 2, Figure 4, and Figure 6; they are not repeated here.
[0161] In this embodiment, by reconstructing the foreground and background as independent meshes, it is easier to perform differentiated rendering of the foreground and background in subsequent processing. For example, in naked-eye 3D display, a stronger out-of-screen effect can be applied to the foreground while maintaining the continuity of the background. In addition, the structure of independent meshes facilitates parallel computing, which helps to improve rendering efficiency.
[0162] In some embodiments, as shown in Figure 10, step S131 above may include the following steps:
[0163] S1311: Obtain the intrinsic parameter information of the camera that acquires the 2D image and the size of the 2D image. The intrinsic parameter information includes the camera position and the camera field of view.
[0164] The camera position characterizes the location in 3D space of the camera that acquires the 2D image. Starting from this position, the camera ray direction corresponding to each pixel can be determined. The camera field of view characterizes the angular range covered by the camera, and the image size is used to map pixel coordinates to spatial directions.
[0165] S1312: For each pixel, execute steps S13121 to S13124.
[0166] S13121: Determine the camera ray originating from the camera position based on the pixel's position in the 2D image, the size of the 2D image, and the camera's field of view.
[0167] The pixel coordinates are converted to normalized coordinates based on the pixel's position in the 2D image and the image's dimensions. The deflection angle of the ray relative to the camera's optical axis is then calculated using the camera's field of view, yielding the direction vector of the camera ray. Each camera ray corresponds to a spatial straight line originating from the camera position and passing through the corresponding point on the imaging plane of that pixel.
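The conversion from pixel position to camera ray direction can be sketched as follows. This is an illustrative sketch under stated assumptions, none of which are fixed by the text above: a pinhole camera, a vertical field of view, the optical axis along +Z, and pixel centers at half-integer coordinates. The function name `camera_ray_directions` is hypothetical.

```python
import numpy as np

def camera_ray_directions(width, height, fov_y_deg):
    """Unit ray directions, shape (height, width, 3), for a camera looking along +Z."""
    # Pixel centers -> normalized coordinates in [-1, 1].
    u = (np.arange(width) + 0.5) / width * 2.0 - 1.0
    v = 1.0 - (np.arange(height) + 0.5) / height * 2.0   # image rows grow downward
    # The half field of view gives the extent of the imaging plane at unit distance.
    tan_half = np.tan(np.radians(fov_y_deg) / 2.0)
    aspect = width / height
    x = u[None, :] * tan_half * aspect
    y = v[:, None] * tan_half
    dirs = np.stack([np.broadcast_to(x, (height, width)),
                     np.broadcast_to(y, (height, width)),
                     np.ones((height, width))], axis=-1)
    # Normalize so that each direction is a unit vector.
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

ray_dirs = camera_ray_directions(3, 3, 90.0)
```

The ray through the central pixel coincides with the optical axis, and rays through off-center pixels deflect away from it by an angle that grows with the normalized coordinate.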
[0168] S13122: Determine the first distance from the camera position to the intersection point based on the intersection point of the camera ray and the reference plane; wherein, the reference plane is perpendicular to the camera optical axis and corresponds to the plane where the naked-eye 3D screen is located.
[0169] By solving for the intersection point of the camera ray and the reference plane, the distance from the camera position to that intersection point can be obtained, denoted as the first distance. For example, let the reference plane be the plane with Z=0, which can be called the "zero plane". The camera can be located on the negative half-axis of the Z-axis, and the length of the camera ray from the camera position to the zero plane is the first distance.
[0170] S13123: Based on the depth information of the pixel in the optimized depth map, determine the second distance from the spatial point corresponding to the pixel to the reference plane along the camera ray.
[0171] For any pixel, the depth information of that pixel is used to determine the distance of the corresponding spatial point relative to the camera. This distance is the distance between the spatial point and the camera position (i.e., the camera origin).
[0172] Based on the depth information of the pixel, combined with the angle between the camera ray and the camera optical axis and the position of the reference plane, a second distance from the spatial point corresponding to the pixel to the reference plane along the camera ray is obtained through geometric transformation.
[0173] Specifically, the angle between the camera ray corresponding to the pixel and the camera optical axis can be determined based on the pixel's position in the image and the camera's field of view. The difference between the distance from the reference plane to the camera and the distance of the spatial point corresponding to the pixel relative to the camera gives the offset of the spatial point relative to the reference plane along the optical axis. The offset is then projected onto the camera ray direction using the angle to obtain the second distance from the spatial point to the reference plane along the camera ray.
[0174] S13124: Subtract the second distance from the first distance to obtain the distance value corresponding to the pixel.
[0175] Subtracting the second distance from the first distance yields the actual distance from the camera position along the camera ray direction to the corresponding 3D point in space. This subtraction offsets the perspective projection effect of the initial imaging, ensuring the geometric accuracy of subsequent rendering.
[0176] S1313: Generate a distance map based on the distance values corresponding to all pixels.
[0177] A distance map is constructed based on the distance values corresponding to all pixels. This distance map records the actual distance from the 3D spatial point corresponding to each pixel to the camera position along the camera ray direction, providing an accurate data foundation for subsequent mesh vertex generation.
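Steps S13122 to S1313 can be sketched as follows. This is a minimal sketch under assumptions that the text does not fix: the depth is measured along the optical axis in scene units, the second distance is taken as positive for points in front of the reference plane (toward the camera), and projecting the axial offset onto the ray means dividing by the cosine of the angle to the optical axis. The function name `distance_map` and the parameter `plane_dist` are hypothetical.

```python
import numpy as np

def distance_map(z_depth, ray_dirs, plane_dist):
    """Per-pixel distance from the camera position along each camera ray.

    z_depth    : (H, W) depth along the optical axis, in scene units (assumed)
    ray_dirs   : (H, W, 3) unit ray directions, optical axis along +Z
    plane_dist : distance from the camera position to the reference plane
    """
    cos_theta = ray_dirs[..., 2]          # cosine of the angle to the optical axis
    first = plane_dist / cos_theta        # camera -> reference plane, along the ray
    offset = plane_dist - z_depth         # assumed sign: positive in front of the plane
    second = offset / cos_theta           # point -> reference plane, along the ray
    return first - second                 # distance value for each pixel

# Two pixels: one on the optical axis, one on a ray tilted away from it.
dirs = np.array([[[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]]])
dist = distance_map(np.array([[2.0, 2.0]]), dirs, plane_dist=5.0)
```

Under these assumptions the two pixels share the same axial depth, but the tilted ray receives a longer distance value, which is the effect that compensates for perspective projection.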
[0178] Figure 11 is a schematic diagram of the geometric relationships in inverse camera parameter generation and of the vertex position calculation. As shown in Figure 11, the plane containing the screen is the reference plane (i.e., the plane where Z=0), and the camera is located on the negative half of the Z-axis. For any pixel in the image, its corresponding camera ray originates from the camera position and passes through the corresponding point on the screen where that pixel is located; it is denoted as ray $d_c$. The total length of the ray (i.e., the distance from the camera position to the intersection of the ray and the reference plane) is calculated and denoted as the first distance. The distance from the spatial point corresponding to this pixel to the reference plane along the camera ray is denoted as the second distance. Subtracting the second distance from the first distance yields the actual distance value $d_s$ corresponding to that pixel. Moving from the camera position along the unit direction of ray $d_c$ by the distance $d_s$ gives the location of the three-dimensional mesh vertex corresponding to the pixel.
[0179] For example, the formula for calculating the coordinates $P$ of the 3D mesh vertex corresponding to pixel i is: $P = O + d_s \cdot d_c$, where $O$ represents the camera position, $d_c$ is the unit direction vector of the camera ray, and $d_s$ is the distance value corresponding to that pixel.
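The vertex calculation can be sketched for a whole image at once. This is an illustrative sketch: the function name `mesh_vertices` and the sample values are hypothetical, and the camera is assumed to sit on the negative Z-axis as in the example above.

```python
import numpy as np

def mesh_vertices(camera_pos, ray_dirs, dist_map):
    """Move from the camera position along each unit ray by its distance value.

    camera_pos : (3,) camera position O
    ray_dirs   : (H, W, 3) unit direction vectors of the camera rays
    dist_map   : (H, W) distance values
    Returns an (H, W, 3) array of mesh vertex coordinates.
    """
    return np.asarray(camera_pos, dtype=float) + dist_map[..., None] * ray_dirs

O = np.array([0.0, 0.0, -5.0])                          # camera on the negative Z-axis
dirs = np.array([[[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]]])   # two sample unit rays
dist = np.array([[2.0, 2.5]])                           # sample distance values
verts = mesh_vertices(O, dirs, dist)

# Viewed from the original camera position, each vertex lies on its own ray.
reproj = (verts - O) / np.linalg.norm(verts - O, axis=-1, keepdims=True)
```

Since every pixel is handled independently, the computation vectorizes directly, which is consistent with the parallel execution described in the text.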
[0180] The method of generating geometry by inversely using camera imaging parameters is called inverse camera parameter generation. The entire inverse camera parameter generation process can be executed in parallel, which has high computational efficiency.
[0181] Figure 12 presents a side-view comparison of geometry generated directly from depth map coordinates and geometry generated using inverse camera parameters. The left side shows the geometric model generated by using only depth values as vertex coordinates without considering camera parameter mapping, which shows obvious geometric distortion in the side view. The right side shows the geometric model obtained using the inverse camera parameter generation method of this application. When the rendering view is the same as the original camera view, the rendered image is completely consistent with the original image, effectively ensuring that no serious distortion occurs in multi-view rendering.
[0182] This embodiment effectively counteracts the perspective projection effect in the original imaging process by introducing an inverse camera parameter generation method, avoiding geometric distortion caused by two perspective transformations. By converting depth information into actual distances along the camera ray direction, consistency between the rendered image and the original image is ensured under the original viewpoint. Simultaneously, constructing the foreground and background as independent meshes facilitates subsequent parallel processing and improves processing efficiency.
[0183] In some embodiments, after obtaining the foreground 3D mesh and the background 3D mesh, they need to be rendered into multi-view images and synthesized into an interlaced image that can be directly played by a naked-eye 3D device. The step S140 above, which generates an interlaced image for playback by a naked-eye 3D display device based on the 3D mesh, may include:
[0184] Based on the display parameters of the target glasses-free 3D display device, the foreground 3D mesh and the background 3D mesh are rendered from multiple perspectives, and the rendered images from multiple perspectives are combined into an interlaced image for playback on the target glasses-free 3D display device.
[0185] Specifically, the display parameters of the target glasses-free 3D screen can be obtained, including at least screen resolution, number of viewpoints, number of lines, and tilt angle. The reconstructed foreground and background 3D meshes are input into a renderer, and images corresponding to each camera viewpoint are rendered according to the display parameters, resulting in multiple viewpoint images. An interlacing algorithm is used to synthesize the multiple viewpoint images into an interlaced image based on the number of viewpoints, number of lines, and tilt angle in the screen parameters. This interlaced image can be directly input into the target glasses-free 3D display device for playback. For example, the interlacing algorithm can be a linear interlacing algorithm based on grating parameters, a weighted interlacing algorithm based on viewpoint mapping, or a parallax interlacing algorithm based on depth maps. These algorithms rearrange the image pixels from different viewpoints according to the grating period of the glasses-free 3D screen to generate an interlaced image suitable for a specific glasses-free 3D device.
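A linear interlacing step of this kind can be sketched as follows. This is a simplified illustration only, not the algorithm of any particular device: it works on whole pixels rather than sub-pixels, and the parameters `pitch_px` (grating period in pixels) and `tilt_deg` (grating tilt angle), like the function name `interlace`, are assumptions introduced for the example. A real display would require its own calibrated mapping.

```python
import numpy as np

def interlace(views, pitch_px, tilt_deg):
    """Pick, for every output pixel, the view selected by its position under the grating.

    views : (N, H, W, C) array holding N rendered viewpoint images
    Returns an (H, W, C) interlaced image.
    """
    n, h, w, _ = views.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Position of each pixel within one grating period; the tilt shifts it row by row.
    phase = np.mod(xs + ys * np.tan(np.radians(tilt_deg)), pitch_px) / pitch_px
    view_idx = np.minimum((phase * n).astype(int), n - 1)
    return views[view_idx, ys, xs]

# Four hypothetical views, each filled with its own index so the pattern is visible.
views = np.stack([np.full((2, 8, 1), k) for k in range(4)])
out = interlace(views, pitch_px=4, tilt_deg=0.0)
```

With a period of four pixels and no tilt, the output columns cycle through views 0 to 3, which is the rearrangement by grating period described above in its simplest form.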
[0186] In this embodiment, the multi-view rendering process can employ different camera array modes, including a lens-shifting mode or an orientation-shifting mode. In lens-shifting mode, the optical center position of each virtual camera is fixed, and different viewpoints are rendered by shifting the imaging plane. In orientation-shifting mode, the optical center positions of the virtual cameras are arranged at equal intervals along the horizontal direction, and the optical axis of each camera points to the scene center or a preset reference point. Figure 13 shows the two camera array modes used in multi-view rendering for naked-eye 3D screens: the left side shows the orientation-shifting mode, and the right side shows the lens-shifting mode.
[0187] In some examples, different rendering parameters can be applied to the foreground and background meshes during rendering (e.g., applying a stronger stereoscopic effect to the foreground) to enhance the visual effect of naked-eye 3D displays.
[0188] In this embodiment, the foreground 3D mesh and the background 3D mesh are automatically converted into an interlaced image that conforms to the physical characteristics of the target naked-eye 3D screen. The rendering and interlacing strategies can be flexibly adjusted according to the device's physical parameters, resulting in good device adaptability.
[0189] The various embodiments or implementation methods described in this specification are presented in a progressive manner. Each embodiment focuses on the differences from other embodiments, and the same or similar parts between the embodiments can be referred to each other.
[0190] It is worth noting that, without contradiction, the various embodiments of the image processing method can be arbitrarily combined with each other. For example, some or all of the steps of different embodiments can be arbitrarily combined, and one embodiment can be arbitrarily combined with the optional implementations of other embodiments.
[0191] Next, the image processing method provided in the embodiments of this application will be further described with reference to Figure 14.
[0192] Figure 14 is a flowchart illustrating an image processing method according to one embodiment. As shown in Figure 14, the method includes steps S210 to S280.
[0193] S210: Input a two-dimensional image;
[0194] S220: Perform monocular depth estimation on the two-dimensional image to obtain the corresponding depth map;
[0195] S230: Based on the depth map, perform foreground / background separation and edge extraction, and depth map optimization;
[0196] S240: Remove the foreground portion from the background area and perform image restoration to obtain a complete background image;
[0197] S250: Convert the depth map into 3D meshes in parallel, reconstructing the foreground and background separately;
[0198] S260: The generated mesh data is input into the renderer to render the corresponding images from various viewpoints;
[0199] S270: Using an interlacing algorithm, synthesize the images from the various viewpoints into an interlaced image that can be played and displayed on glasses-free 3D devices;
[0200] S280: Play and display the interlaced image on the glasses-free 3D device.
[0201] In summary, the technical solution provided in this application offers a complete pipeline for converting 2D images into high-quality glasses-free 3D effects. It can automatically convert user-uploaded 2D images into effects suitable for glasses-free 3D screen display without manual intervention. By optimizing the reconstruction scheme for foreground depth accuracy, the stereoscopic effect of the foreground (generally the most important area) is enhanced, reducing distortion caused by depth estimation errors. Optimizing the depth map significantly reduces streaking artifacts in the depth reconstruction, thereby improving the quality of depth map-based 3D reconstruction. Furthermore, through inverse camera parameter generation, any input image can be automatically reconstructed, and the rendering results of the multi-view images meet the requirements of glasses-free 3D display. This reconstruction scheme avoids the distortion caused by traditional methods that directly convert depth maps to point clouds and then reconstruct them.
[0202] Figure 15 is a schematic diagram illustrating the structure of an image processing apparatus according to one embodiment. As shown in Figure 15, the image processing apparatus 100 includes:
[0203] The first generation module 101 is used to perform depth estimation on the input two-dimensional image and generate a first depth map, wherein the first depth map is used to characterize the distance of the spatial point corresponding to each pixel in the two-dimensional image relative to the camera that acquired the two-dimensional image.
[0204] The optimization processing module 102 is used to optimize the first depth map to obtain an optimized depth map.
[0205] The 3D reconstruction module 103 is used to reconstruct a 3D mesh based on camera spatial inverse transformation and the optimized depth map;
[0206] The second generation module 104 is used to generate an interlaced map for playback on a naked-eye 3D display device based on the three-dimensional mesh.
[0207] In some embodiments, the first generation module 101 is configured to:
[0208] A pre-trained deep learning model is used to predict the depth information of each pixel in the two-dimensional image, and the prediction results are linearly normalized to a predetermined interval to obtain a first depth map.
[0209] During the training process of the deep learning model, the depth information predicted by the model and the depth information in the training labels are both linearly normalized to a predetermined interval and then subjected to the same non-uniform mapping before being input into the loss function.
[0210] The non-uniform mapping is such that, for any two depth information intervals of equal length, the length of the interval formed by the corresponding values closer to the camera after mapping within the predetermined interval is greater than the length of the interval formed by the corresponding values farther from the camera after mapping.
[0211] In some embodiments, the depth information is an inverse depth value that is negatively correlated with the depth value, and the larger the inverse depth value, the closer it is to the camera; the non-uniform mapping uses a non-linear function to map the depth values converted from the inverse depth value and normalized to the predetermined interval; wherein, the non-linear function is a monotonically increasing power function, and during the training phase, the power exponent of the power function is set to a predetermined value that is greater than 0 and less than 1; during the inference phase, the power exponent of the power function is set to 1.
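The effect of the power-function mapping can be sketched as follows. This is an illustrative sketch under assumptions: the predetermined interval is taken as [0, 1], and the conversion from normalized inverse depth to depth is taken as `1 - value`, which the text does not specify. The function names `normalize` and `nonuniform_map` are hypothetical.

```python
import numpy as np

def normalize(x, lo=0.0, hi=1.0):
    """Linearly rescale x to the interval [lo, hi]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, lo)
    return lo + (x - x.min()) / span * (hi - lo)

def nonuniform_map(inv_depth, gamma):
    """Normalize inverse depth, convert it to depth (small = near), apply depth ** gamma."""
    depth = 1.0 - normalize(inv_depth)   # assumed conversion from inverse depth to depth
    return depth ** gamma

inv_depth = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # larger = closer to the camera
train = nonuniform_map(inv_depth, gamma=0.5)        # training: exponent between 0 and 1
infer = nonuniform_map(inv_depth, gamma=1.0)        # inference: exponent equal to 1

near_len = train[3] - train[4]   # mapped length of the interval nearest the camera
far_len = train[0] - train[1]    # mapped length of the equally long farthest interval
```

With an exponent below 1, the interval nearest the camera occupies a larger share of the mapped range than an equally long far interval, so the loss is more sensitive to errors in the foreground. With an exponent of 1, the mapping is the identity.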
[0212] In some embodiments, the optimization processing module 102 includes:
[0213] A threshold determination unit is used to determine a neighborhood threshold for each pixel in the first depth map based on statistics of depth information in the neighborhood of the pixel.
[0214] A segmentation processing unit is used to divide the first depth map into a foreground region and a background region based on the neighborhood threshold to obtain a second depth map;
[0215] The completion processing unit is used to perform image completion processing on the background hole region after removing the foreground region in the second depth map to obtain an optimized depth map.
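The per-pixel neighborhood threshold can be sketched as follows. This is a minimal sketch, assuming that the statistic is the neighborhood mean and that a pixel is foreground when its inverse depth exceeds that mean; the text leaves the choice of statistic open. The window size `k` (assumed odd), the edge padding, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def neighborhood_windows(depth, k):
    """Return a (H, W, k, k) view holding the k x k neighborhood of every pixel."""
    padded = np.pad(depth, k // 2, mode="edge")   # repeat border values at the edges
    return sliding_window_view(padded, (k, k))

def segment_foreground(inv_depth, k=3):
    """Mark a pixel as foreground if its inverse depth exceeds its neighborhood mean."""
    threshold = neighborhood_windows(inv_depth, k).mean(axis=(-2, -1))
    return inv_depth > threshold

# A single near pixel (large inverse depth) on a far, flat background.
inv_depth = np.zeros((5, 5))
inv_depth[2, 2] = 1.0
fg_mask = segment_foreground(inv_depth)
```

Because the threshold is computed separately for each neighborhood, it adapts to local depth variation instead of applying one global cutoff to the whole image.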
[0216] In some embodiments, the depth information is an inverse depth value; the segmentation processing unit is used to:
[0217] For each pixel in the first depth map, based on the neighborhood threshold corresponding to the pixel, it is determined whether the pixel belongs to the foreground region or the background region, and the pixel determined to belong to the foreground region is assigned the maximum inverse depth value in the neighborhood where the pixel is located, thus obtaining the third depth map.
[0218] For each pixel in the third depth map, if there is a pixel belonging to the foreground region in the neighborhood of the pixel, then the inverse depth value of all pixels in the neighborhood of the pixel is adjusted to the maximum inverse depth value of the neighborhood, so as to expand the foreground region.
[0219] Gradient constraint optimization is performed on the expanded third depth map to obtain a second depth map; the gradient constraint optimization is used to minimize the difference in depth information of each pixel between the adjusted depth map and the unadjusted depth map, and at the same time minimize the gradient magnitude of the adjusted depth map in the neighborhood of each pixel.
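One way to write the gradient constraint optimization as an objective is given below. This is an assumed form for illustration: the squared penalties and the weight $\lambda$ are not specified in the text. Here $\tilde{D}$ denotes the unadjusted depth map (the expanded third depth map), $D$ the adjusted depth map, $p$ a pixel, and $\nabla D_p$ the gradient of $D$ in the neighborhood of $p$.

```latex
\hat{D} \;=\; \arg\min_{D} \;
\sum_{p} \bigl( D_p - \tilde{D}_p \bigr)^2
\;+\; \lambda \sum_{p} \bigl\lVert \nabla D_p \bigr\rVert^2
```

The first term keeps the result close to the unadjusted depth map, and the second term penalizes the gradient magnitude. A larger $\lambda$ gives a smoother second depth map at the cost of fidelity to the unadjusted one.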
[0220] In some embodiments, the segmentation processing unit is used to:
[0221] For each pixel in the first depth map, the difference between the maximum and minimum inverse depth values in the neighborhood of the pixel is determined as the range.
[0222] The step of determining whether the pixel belongs to the foreground region or the background region is performed when the range is greater than a preset range threshold.
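The range check can be sketched as follows. This is an illustrative sketch: the window size `k` (assumed odd), the edge padding, the threshold value, and the function names are assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def neighborhood_range(inv_depth, k=3):
    """Maximum minus minimum inverse depth in the k x k neighborhood of every pixel."""
    padded = np.pad(inv_depth, k // 2, mode="edge")
    win = sliding_window_view(padded, (k, k))
    return win.max(axis=(-2, -1)) - win.min(axis=(-2, -1))

def needs_classification(inv_depth, range_threshold, k=3):
    """True where the neighborhood range exceeds the preset range threshold."""
    return neighborhood_range(inv_depth, k) > range_threshold

# Only pixels near the depth discontinuity around (2, 2) pass the check.
inv_depth = np.zeros((5, 5))
inv_depth[2, 2] = 1.0
gate = needs_classification(inv_depth, range_threshold=0.5)
```

Pixels in flat regions have a range near zero and are skipped, so the foreground/background decision is only evaluated near depth discontinuities.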
[0223] In some embodiments, the three-dimensional reconstruction module 103 is used for:
[0224] Based on the camera space inverse transform, the optimized depth map is converted into a distance map; wherein, the distance map is used to characterize the distance from the three-dimensional spatial point corresponding to each pixel to the camera position along the camera ray direction;
[0225] Based on the distance values of each pixel in the distance map, the position of the three-dimensional spatial point corresponding to each pixel on the camera ray is determined, and the mesh vertices are obtained;
[0226] A foreground 3D mesh is constructed based on the mesh vertices belonging to the foreground region, and a background 3D mesh is constructed based on the mesh vertices belonging to the background region.
[0227] In some embodiments, the three-dimensional reconstruction module 103 is specifically used for:
[0228] Acquire the intrinsic parameter information of the camera that acquires the two-dimensional image and the size of the two-dimensional image, wherein the intrinsic parameter information includes the camera position and the camera field of view.
[0229] For each pixel, perform the following operations:
[0230] Based on the position of the pixel in the two-dimensional image, the size of the two-dimensional image, and the camera field of view, a camera ray is determined with the camera position as the starting point.
[0231] Based on the intersection point of the camera ray and the reference plane, a first distance from the camera position to the intersection point is determined; wherein, the reference plane is perpendicular to the camera optical axis and corresponds to the plane where the naked-eye 3D screen is located;
[0232] Based on the depth information of the pixel in the optimized depth map, determine the second distance from the spatial point corresponding to the pixel to the reference plane along the camera ray;
[0233] Subtract the second distance from the first distance to obtain the distance value corresponding to the pixel;
[0234] The distance map is generated based on the distance values corresponding to all pixels.
[0235] In some embodiments, the second generation module 104 is used to:
[0236] Based on the display parameters of the target glasses-free 3D display device, the foreground 3D mesh and the background 3D mesh are rendered from multiple perspectives, and the rendered images from multiple perspectives are combined into an interlaced image for playback on the target glasses-free 3D display device.
[0237] The image processing apparatus provided in this application belongs to the same concept as the image processing method provided in the above-mentioned embodiments of this application. It can execute the image processing method provided in any of the above-mentioned embodiments of this application and has the corresponding functional modules and beneficial effects for executing the image processing method. Technical details not described in detail in this embodiment can be found in the specific processing content of the image processing method provided in the above-mentioned embodiments of this application, and will not be repeated here.
[0238] It should be understood that one or more modules of the above image processing apparatus can be integrated into the image processing apparatus in the foregoing embodiments. Each module in the image processing apparatus can be implemented as software called by a processor, or each module can be implemented as a hardware circuit. The functions of some or all modules can be achieved through the design of the hardware circuit, which can be understood as one or more processors.
[0239] This application also provides an electronic device, including a processor, a memory, and an executable program stored in the memory and executable by the processor. When the processor runs the executable program, it performs the steps of the image processing method provided in any of the foregoing embodiments.
[0240] This application also provides a computer-readable storage medium having an executable program stored thereon, which, when executed by a processor, implements the steps of the image processing method provided in any of the foregoing embodiments.
[0241] For ease of understanding, the following focuses on explaining the terminology used in this embodiment:
[0242] In this application embodiment, a processor is a circuit with signal processing capabilities. In one implementation, the processor can be a circuit with instruction read and execute capabilities, such as a Central Processing Unit (CPU), a microprocessor, a graphics processing unit (GPU) (which can be understood as a type of microprocessor), or a digital signal processor (DSP). In another implementation, the processor can implement certain functions through the logical relationships of hardware circuits. The logical relationships of the aforementioned hardware circuits are fixed or reconfigurable. For example, the processor is a hardware circuit implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), such as an FPGA. In a reconfigurable hardware circuit, the process of the processor loading a configuration document and configuring the hardware circuit can be understood as the process of the processor loading instructions to implement the functions of some or all of the above units or modules. In addition, it can also be a hardware circuit designed for artificial intelligence, which can be understood as an ASIC, such as a Neural Network Processing Unit (NPU), a Tensor Processing Unit (TPU), a Deep Learning Processing Unit (DPU), etc.
[0243] The computer-readable storage medium provided in this embodiment can execute the image processing method of the above embodiment. Its implementation principle and technical effect are similar to those of the above embodiment, and will not be repeated here.
[0244] The aforementioned computer-readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The readable storage medium can be any available medium accessible to a general-purpose or special-purpose computer.
[0245] An exemplary readable storage medium is coupled to a processor, enabling the processor to read information from and write information to the readable storage medium. Of course, the readable storage medium can also be a component of the processor. The processor and the readable storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and the readable storage medium can exist as discrete components in an electronic device or a host device.
[0246] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0248] In the description of this specification, references to "one embodiment," "some embodiments," "illustrative embodiment," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with an embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0249] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. An image processing method, characterized in that the method includes:
A pre-trained deep learning model is used to estimate the depth of the input two-dimensional image to generate a first depth map, wherein the first depth map is used to characterize the distance of the spatial point corresponding to each pixel in the two-dimensional image relative to the camera that acquired the two-dimensional image;
The first depth map is optimized to obtain an optimized depth map;
Based on the camera space inverse transformation, a 3D mesh is reconstructed using the optimized depth map;
Based on the three-dimensional mesh, an interlaced map is generated for playback on glasses-free three-dimensional display devices;
The step of optimizing the first depth map to obtain an optimized depth map includes: for each pixel in the first depth map, determining a neighborhood threshold based on the statistical value of the depth information in the neighborhood of the pixel; dividing the first depth map into a foreground region and a background region based on the neighborhood threshold to obtain a second depth map; and performing image completion processing on the background hole region after removing the foreground region in the second depth map to obtain the optimized depth map;
The reconstruction of the 3D mesh based on the camera space inverse transform and the optimized depth map includes: based on the camera space inverse transform, the optimized depth map is converted into a distance map, wherein the distance map is used to characterize the distance from the three-dimensional spatial point corresponding to each pixel to the camera position along the camera ray direction; based on the distance values of each pixel in the distance map, the position of the three-dimensional spatial point corresponding to each pixel on the camera ray is determined, and the mesh vertices are obtained; a foreground 3D mesh is constructed based on the mesh vertices belonging to the foreground region, and a background 3D mesh is constructed based on the mesh vertices belonging to the background region.
2. The image processing method according to claim 1, characterized in that the step of performing depth estimation on the input two-dimensional image to generate a first depth map includes:
A pre-trained deep learning model is used to predict the depth information of each pixel in the two-dimensional image, and the prediction results are linearly normalized to a predetermined interval to obtain the first depth map;
During the training process of the deep learning model, the depth information predicted by the model and the depth information in the training labels are both linearly normalized to a predetermined interval and then subjected to the same non-uniform mapping before being input into the loss function;
The non-uniform mapping is such that, for any two depth information intervals of equal length, the length of the interval formed by the corresponding values closer to the camera after mapping within the predetermined interval is greater than the length of the interval formed by the corresponding values farther from the camera after mapping.
3. The image processing method according to claim 2, characterized in that the depth information is an inverse depth value that is negatively correlated with the depth value, and the larger the inverse depth value, the closer the distance to the camera;
The non-uniform mapping uses a non-linear function to map depth values that are converted from inverse depth values and normalized to the predetermined interval;
The non-linear function is a monotonically increasing power function; during the training phase, the power exponent of the power function is set to a predetermined value that is greater than 0 and less than 1; during the inference phase, the power exponent of the power function is set to 1.
4. The image processing method according to claim 1, characterized in that the depth information is an inverse depth value, and the step of dividing the first depth map into a foreground region and a background region based on the neighborhood threshold to obtain a second depth map includes:
For each pixel in the first depth map, based on the neighborhood threshold corresponding to the pixel, it is determined whether the pixel belongs to the foreground region or the background region, and the pixel determined to belong to the foreground region is assigned the maximum inverse depth value in the neighborhood where the pixel is located, thus obtaining a third depth map;
For each pixel in the third depth map, if there is a pixel belonging to the foreground region in the neighborhood of the pixel, then the inverse depth value of all pixels in the neighborhood of the pixel is adjusted to the maximum inverse depth value of the neighborhood, so as to expand the foreground region;
Gradient constraint optimization is performed on the expanded third depth map to obtain the second depth map, wherein the gradient constraint optimization is used to minimize the difference in depth information between the adjusted depth map and the unadjusted depth map at each pixel, while minimizing the gradient magnitude of the adjusted depth map in the neighborhood of each pixel.
5. The image processing method according to claim 4, characterized in that the method further includes: for each pixel in the first depth map, determining the difference between the maximum and minimum inverse depth values in the neighborhood of the pixel as the neighborhood range corresponding to the pixel; and determining whether the pixel belongs to the foreground region or the background region based on the neighborhood threshold corresponding to the pixel includes: if the neighborhood range of the pixel is greater than a preset range threshold, determining, based on the neighborhood threshold, whether the pixel belongs to the foreground region or the background region.
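The range gate of claim 5 can be illustrated with the following non-limiting sketch. The names `neighborhood_range` and `pixels_to_classify`, the square window, and the `radius` parameter are assumptions. The claim does not state how pixels that fail the gate are handled, so the sketch only reports which pixels pass it.

```python
def neighborhood_range(inv_depth, r, c, radius=1):
    """Difference between the largest and smallest inverse depth around (r, c)."""
    rows, cols = len(inv_depth), len(inv_depth[0])
    values = [inv_depth[i][j]
              for i in range(max(0, r - radius), min(rows, r + radius + 1))
              for j in range(max(0, c - radius), min(cols, c + radius + 1))]
    return max(values) - min(values)


def pixels_to_classify(inv_depth, range_threshold, radius=1):
    """True where the neighborhood range exceeds the preset range threshold."""
    rows, cols = len(inv_depth), len(inv_depth[0])
    return [[neighborhood_range(inv_depth, r, c, radius) > range_threshold
             for c in range(cols)]
            for r in range(rows)]


# A flat 5x5 map with one near pixel: only windows touching it have a large range.
depth_map = [[1.0] * 5 for _ in range(5)]
depth_map[2][2] = 5.0
gate = pixels_to_classify(depth_map, range_threshold=2.0)
```

Only the nine pixels whose window contains the near pixel pass the gate here; flat areas are skipped, so the foreground/background decision is made only where the depth actually varies.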
6. The image processing method according to claim 1, characterized in that converting the optimized depth map into a distance map based on the camera space inverse transformation includes: acquiring intrinsic parameter information of the camera that acquired the two-dimensional image and the size of the two-dimensional image, wherein the intrinsic parameter information includes the camera position and the camera field of view; for each pixel, performing the following operations: determining, based on the position of the pixel in the two-dimensional image, the size of the two-dimensional image, and the camera field of view, a camera ray that starts at the camera position; determining, based on the intersection point of the camera ray and a reference plane, a first distance from the camera position to the intersection point, wherein the reference plane is perpendicular to the camera optical axis and corresponds to the plane where the naked-eye 3D screen is located; determining, based on the depth information of the pixel in the optimized depth map, a second distance from the spatial point corresponding to the pixel to the reference plane along the camera ray; and subtracting the first distance from the second distance to obtain the distance value corresponding to the pixel; and generating the distance map based on the distance values corresponding to all pixels.
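The ray construction and the first distance of claim 6 could be sketched as follows, without limitation. The sketch assumes a pinhole camera looking along +z with x to the right and y up, treats the field of view as the vertical one, and places the reference plane at a hypothetical `plane_offset` in front of the camera; none of these conventions is stated in the claim. The second distance is left out because the claim does not specify how it is derived from the depth information.

```python
import math


def camera_ray(u, v, width, height, fov_deg):
    """Unit direction of the ray through the center of pixel (u, v).

    Assumptions: pinhole camera looking along +z, x to the right, y up,
    and `fov_deg` is the vertical field of view.
    """
    half = math.tan(math.radians(fov_deg) / 2.0)
    aspect = width / height
    x = (2.0 * (u + 0.5) / width - 1.0) * half * aspect
    y = (1.0 - 2.0 * (v + 0.5) / height) * half
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def first_distance(direction, plane_offset):
    """Distance along the ray from the camera position to the reference plane.

    The plane is perpendicular to the optical axis, `plane_offset` in front
    of the camera, so the ray reaches it when its z coordinate equals the offset.
    """
    return plane_offset / direction[2]


center = camera_ray(1, 1, 3, 3, fov_deg=90.0)  # middle pixel of a 3x3 image
edge = camera_ray(2, 1, 3, 3, fov_deg=90.0)    # pixel to its right
d_center = first_distance(center, plane_offset=2.0)
d_edge = first_distance(edge, plane_offset=2.0)
```

The central ray meets the plane at exactly the plane offset, while oblique rays travel farther to reach the same plane, which is why the first distance has to be computed per pixel.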
7. The image processing method according to claim 1 or 6, characterized in that generating an interlaced map for playback on a naked-eye 3D display device based on the three-dimensional mesh includes: based on the display parameters of a target naked-eye 3D display device, rendering the foreground 3D mesh and the background 3D mesh from multiple viewpoints, and combining the rendered images from the multiple viewpoints into an interlaced map for playback on the target naked-eye 3D display device.
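The combining step of claim 7 can be illustrated by a deliberately simplified, non-limiting sketch. It assumes plain column interleaving, where output column x is taken from view x mod N; an actual device would derive the view assignment from its display parameters (for example lens pitch and slant), which are not given here. The function name `interlace_columns` is hypothetical, and the rendering of the meshes is not shown.

```python
def interlace_columns(views):
    """Combine equally sized views so that output column x comes from view x mod N."""
    n = len(views)
    rows, cols = len(views[0]), len(views[0][0])
    if any(len(v) != rows or any(len(row) != cols for row in v) for v in views):
        raise ValueError("all views must have the same size")
    return [[views[x % n][y][x] for x in range(cols)] for y in range(rows)]


# Two 2x4 "renders", one per viewpoint, marked with letters for readability.
left = [["L"] * 4 for _ in range(2)]
right = [["R"] * 4 for _ in range(2)]
interlaced = interlace_columns([left, right])
```

With two views the result alternates columns from the left and right renders, so each eye sees only its own columns once the optical layer of the display separates them.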
8. An image processing apparatus, characterized in that the apparatus includes: a first generation module, used to perform depth estimation on an input two-dimensional image using a pre-trained deep learning model to generate a first depth map, wherein the first depth map characterizes the distance of the spatial point corresponding to each pixel in the two-dimensional image relative to the camera that acquired the two-dimensional image; an optimization processing module, used to optimize the first depth map to obtain an optimized depth map; a three-dimensional reconstruction module, used to reconstruct a three-dimensional mesh based on camera space inverse transformation and the optimized depth map; and a second generation module, used to generate, based on the three-dimensional mesh, an interlaced map for playback on a naked-eye 3D display device; wherein the optimization processing module includes: a threshold determination unit, used to determine, for each pixel in the first depth map, a neighborhood threshold based on statistics of the depth information in the neighborhood of the pixel; a segmentation processing unit, used to divide the first depth map into a foreground region and a background region based on the neighborhood threshold to obtain a second depth map; and a completion processing unit, used to perform image completion on the background hole region that remains after the foreground region is removed from the second depth map, to obtain the optimized depth map; and wherein the three-dimensional reconstruction module is used to: convert the optimized depth map into a distance map based on the camera space inverse transformation, wherein the distance map characterizes the distance from the three-dimensional spatial point corresponding to each pixel to the camera position along the camera ray direction; determine, based on the distance value of each pixel in the distance map, the position on the camera ray of the three-dimensional spatial point corresponding to the pixel, to obtain mesh vertices; and construct a foreground 3D mesh from the mesh vertices belonging to the foreground region and a background 3D mesh from the mesh vertices belonging to the background region.
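The work of the three-dimensional reconstruction module in claim 8 could be sketched, without limitation, as placing one vertex per pixel on its camera ray and splitting a regular grid triangulation by region. The names `vertex_on_ray` and `split_meshes`, the two-triangles-per-pixel-quad layout, and the rule that triangles mixing both regions are dropped are assumptions. The sketch also ignores the completed background behind the foreground, which the completion processing unit would supply.

```python
def vertex_on_ray(camera_position, direction, distance):
    """Point at `distance` from the camera position along a unit ray direction."""
    return tuple(p + distance * d for p, d in zip(camera_position, direction))


def split_meshes(mask):
    """Triangulate a pixel grid and split the triangles by region.

    `mask[r][c]` is True for foreground pixels. Vertex indices are r * cols + c.
    """
    rows, cols = len(mask), len(mask[0])
    foreground, background = [], []
    for r in range(rows - 1):
        for c in range(cols - 1):
            a, b = r * cols + c, r * cols + c + 1
            d, e = (r + 1) * cols + c, (r + 1) * cols + c + 1
            for tri in ((a, d, b), (b, d, e)):
                labels = {mask[i // cols][i % cols] for i in tri}
                if labels == {True}:
                    foreground.append(tri)
                elif labels == {False}:
                    background.append(tri)
                # Mixed triangles are dropped (assumption), which keeps the
                # two meshes disconnected at region boundaries.
    return foreground, background


mask = [[True, True, False, False],
        [True, True, False, False]]
fg_faces, bg_faces = split_meshes(mask)
point = vertex_on_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 2.5)
```

Dropping the mixed triangles avoids stretched faces that would otherwise bridge the depth jump between foreground and background when the meshes are rendered from a shifted viewpoint.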
9. An electronic device comprising a processor, a memory, and an executable program stored in the memory and executable by the processor, characterized in that when the processor runs the executable program, it performs the steps of the image processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium having an executable program stored thereon, characterized in that when the executable program is executed by a processor, it implements the steps of the image processing method according to any one of claims 1 to 7.
Citation Information
Patent Citations
CN106604013A · Image and depth 3D image format and multi-viewpoint naked-eye 3D display method thereof
CN118474328A · Naked eye three-dimensional video generation method and device, computer equipment and medium