A sparse view scene super-resolution reconstruction method based on three-dimensional Gaussian representation and wavelet domain constraint
The super-resolution reconstruction method for sparse view scenes using 3D Gaussian representation and wavelet domain constraints solves the reconstruction challenges under sparse view and low resolution conditions, and achieves high-quality 3D scene reconstruction and rendering.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INFORMATION SCI & TECH UNIV
- Filing Date
- 2025-12-12
- Publication Date
- 2026-06-16
AI Technical Summary
Under sparse view and low resolution conditions, existing technologies struggle to achieve high-quality 3D scene reconstruction, exhibiting problems such as structural instability, lack of detail, cross-view inconsistency, and artifact accumulation.
A sparse view scene super-resolution reconstruction method based on 3D Gaussian representation and wavelet domain constraints is adopted. Through Gaussian densification, multi-source constraint mechanism and iterative optimization, high-resolution and cross-view consistent 3D scene reconstruction results are generated.
It effectively alleviates the problems of structural instability, lack of detail and artifact accumulation, improves reconstruction quality and rendering effect, and achieves high-resolution reconstruction with coherent structure and clear details.
Figure CN121616461B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and image processing, and in particular to a method for super-resolution reconstruction of sparse view scenes based on three-dimensional Gaussian representation and wavelet domain constraints.
Background Technology
[0002] With the development of deep learning and 3D reconstruction, the demand for multi-view scene reconstruction and novel view synthesis is increasing in applications such as virtual reality, augmented reality, robot navigation, digital twins, and film and television production. 3D Gaussian representation, owing to its fast reconstruction and efficient rendering, is widely used for real-time or near-real-time high-fidelity rendering. However, in actual data acquisition, limitations in shooting conditions and equipment often yield only sparse, low-resolution multi-view images, which makes high-quality reconstruction harder. When sparse views and low resolution coexist, scene reconstruction faces several challenges:
[0003] 1. Insufficient structural sampling and lack of texture evidence: Low resolution weakens high-frequency cues, sparse viewpoints reduce parallax, and the lack of texture evidence and cross-view constraints during the optimization stage leads to weakened geometry and easy texture degradation.
[0004] 2. Divide-and-conquer approaches struggle when both problems coexist: Existing work usually handles "sparse views" and "low resolution" separately. Sparse-view methods focus on introducing geometric priors or pseudo-views, while low-resolution methods rely on external super-resolution networks; handling the two separately is prone to failure when both conditions occur at the same time.
[0005] 3. Inconsistency and artifact accumulation across different viewpoints: While super-resolution priors and artifact supervision can improve details, they can introduce inconsistent textures and artifacts across different viewpoints, typically manifested as edge ringing, bright edges, and texture drift.
[0006] 4. Computational overhead and reliability issues: Strong reliance on two-dimensional priors usually leads to heavy computational costs, while maintaining consistency across viewpoints remains difficult.
Summary of the Invention
[0007] The purpose of this invention is to propose a super-resolution reconstruction method for 3D Gaussian scenes with sparse viewpoints and low-resolution inputs, in order to alleviate the problems of structural instability, lack of detail, cross-viewpoint inconsistency and artifact accumulation that are prone to occur in existing technologies, and to obtain reconstruction and rendering results with coherent structure, clear details and improved resolution.
[0008] To achieve the above objectives, this invention provides a sparse view scene super-resolution reconstruction method based on 3D Gaussian representation and wavelet domain constraints, comprising:
[0009] Initialize a 3D Gaussian representation of a 3D scene using sparse, low-resolution multi-view images as input;
[0010] The three-dimensional Gaussian representation is subjected to Gaussian densification;
[0011] On the target high-resolution canvas, the densified 3D Gaussians are trained and rendered to generate a high-resolution rendered image and depth information;
[0012] A multi-source constraint mechanism is constructed, wherein the multi-source constraint mechanism includes appearance constraints, geometric constraints, and frequency domain constraints;
[0013] Based on the multi-source constraint mechanism, the parameters of the three-dimensional Gaussian are jointly optimized and iteratively updated until the convergence condition is met, outputting a high-resolution and cross-viewpoint consistent three-dimensional scene reconstruction result.
[0014] Preferably, the Gaussian densification process includes:
[0015] Initial Gaussian atoms are generated based on sparse point clouds, and the position, scale, orientation, opacity and color parameters of the Gaussian atoms are initialized.
[0016] Several child Gaussian atoms are generated in the local neighborhood of each parent Gaussian atom. The child Gaussian atoms are selected according to the minimum spacing and view coverage criteria, and Gaussian atoms that can improve the expression of local details are retained to form a compacted Gaussian set.
[0017] Preferably, the appearance constraint is a pairing supervision between a high-resolution reference image generated based on a pre-trained super-resolution model and the high-resolution rendered image, combined with a low-resolution consistency constraint.
[0018] The geometric constraints are used to render and pair virtual views generated by interpolation of adjacent viewpoints, combined with depth alignment constraints.
[0019] The frequency domain constraint is a constraint that supervises the consistency of sub-bands obtained by performing a stationary wavelet transform on the brightness channels of the paired images.
[0020] Preferably, the appearance constraint is achieved by a weighted combination of the first appearance loss and the second appearance loss;
[0021] Wherein, the first appearance loss is calculated based on the difference between the high-resolution rendered image and the high-resolution reference image;
[0022] The second appearance loss is calculated based on the difference between the low-resolution rendered image obtained by downsampling the high-resolution rendered image and the original low-resolution input image.
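[0023] Preferably, the first appearance loss is:
[0024] $$\mathcal{L}_{\mathrm{app}}^{(1)} = \frac{1}{|\Omega_{\mathrm{HR}}|} \sum_{p \in \Omega_{\mathrm{HR}}} \left\lVert \hat{I}^{\mathrm{HR}}(p) - I^{\mathrm{ref}}(p) \right\rVert$$ ;
[0025] The second appearance loss is:
[0026] $$\mathcal{L}_{\mathrm{app}}^{(2)} = \frac{1}{|\Omega_{\mathrm{LR}}|} \sum_{p \in \Omega_{\mathrm{LR}}} \left\lVert \hat{I}^{\mathrm{LR}}(p) - I^{\mathrm{LR}}(p) \right\rVert$$ ;
[0027] The appearance constraints are as follows:
[0028] $$\mathcal{L}_{\mathrm{app}} = \lambda_1 \mathcal{L}_{\mathrm{app}}^{(1)} + \lambda_2 \mathcal{L}_{\mathrm{app}}^{(2)}$$ ;
[0029] In the formulas, $\hat{I}^{\mathrm{HR}}$ and $I^{\mathrm{ref}}$ are the high-resolution rendered image and the reference image, respectively; $\hat{I}^{\mathrm{LR}}$ and $I^{\mathrm{LR}}$ are the low-resolution rendered image and the input image, respectively; $\Omega_{\mathrm{HR}}$ and $\Omega_{\mathrm{LR}}$ are the sets of valid pixels at the corresponding scales; $\lVert\cdot\rVert$ measures the per-pixel difference; $\lambda_1$ and $\lambda_2$ are weighting coefficients; $\mathcal{L}_{\mathrm{app}}^{(1)}$ is the first appearance loss, $\mathcal{L}_{\mathrm{app}}^{(2)}$ is the second appearance loss, and $\mathcal{L}_{\mathrm{app}}$ is the appearance loss.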
[0030] Preferably, the geometric constraints are achieved by a weighted combination of the virtual view consistency loss and the depth alignment consistency loss;
[0031] Wherein, the virtual view consistency loss is calculated based on the difference between the image rendered in the virtual view and the corresponding reference image; the depth alignment consistency loss is calculated based on the difference between the depth map estimated from the low-resolution input and the high-resolution rendered depth map after downsampling to the low-resolution scale.
[0032] Preferably, the virtual view consistency loss is:
[0033] $$\mathcal{L}_{\mathrm{vv}} = \frac{1}{|\Omega_{v}|} \sum_{p \in \Omega_{v}} \left\lVert \hat{I}^{v}(p) - I^{v}_{\mathrm{ref}}(p) \right\rVert$$ ;
[0034] The depth alignment consistency loss is:
[0035] $$\mathcal{L}_{\mathrm{depth}} = \frac{1}{|\Omega_{d}|} \sum_{p \in \Omega_{d}} \left\lVert \hat{D}^{\downarrow}(p) - D^{\mathrm{est}}(p) \right\rVert$$ ;
[0036] The geometric constraints are:
[0037] $$\mathcal{L}_{\mathrm{geo}} = \mu_1 \mathcal{L}_{\mathrm{vv}} + \mu_2 \mathcal{L}_{\mathrm{depth}}$$ ;
[0038] In the formulas, $\hat{I}^{v}$ and $I^{v}_{\mathrm{ref}}$ are the rendered image and the reference image of the virtual view, respectively; $\Omega_{v}$ is the effective comparison area; $\hat{D}^{\downarrow}$ is the rendered depth map after downsampling to the input scale; $D^{\mathrm{est}}$ is the depth map estimated from the low-resolution input; $\Omega_{d}$ is the effective depth region; $\lVert\cdot\rVert$ measures the per-pixel difference; $\mu_1$ and $\mu_2$ are weighting coefficients; $\mathcal{L}_{\mathrm{vv}}$ is the virtual view consistency loss, $\mathcal{L}_{\mathrm{depth}}$ is the depth alignment consistency loss, and $\mathcal{L}_{\mathrm{geo}}$ is the geometric loss.
[0039] Preferably, the frequency domain constraint includes:
[0040] By performing stationary wavelet transforms on the brightness channels of the low-resolution rendered image and the low-resolution input image respectively, low-frequency sub-bands and several high-frequency sub-bands are obtained.
[0041] Based on the differences between corresponding subbands in the low-frequency subband and the high-frequency subband, the wavelet domain subband loss is calculated.
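[0042] Preferably, the wavelet domain subband loss is calculated as follows:
[0043] $$\mathcal{L}_{\mathrm{swt}} = \sum_{b \in \{\mathrm{LL}, \mathrm{LH}, \mathrm{HL}, \mathrm{HH}\}} w_b \, \frac{1}{|\Omega_b|} \sum_{p \in \Omega_b} \left\lVert \mathcal{W}_b\big(Y(\hat{I}^{\mathrm{LR}})\big)(p) - \mathcal{W}_b\big(Y(I^{\mathrm{LR}})\big)(p) \right\rVert$$ ;
[0044] In the formula, $\hat{I}^{\mathrm{LR}}$ and $I^{\mathrm{LR}}$ are the low-resolution rendered image and the low-resolution input image, respectively; $Y(\cdot)$ denotes luminance channel extraction; $\mathcal{W}_b(\cdot)$ is the stationary wavelet response on subband $b$; $\Omega_b$ is the effective comparison region for that subband; $w_b$ is the weight of each subband; $\lVert\cdot\rVert$ measures the per-pixel difference; $\mathcal{L}_{\mathrm{swt}}$ is the stationary wavelet loss; and LL, LH, HL, and HH denote the different subbands.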
[0045] Preferably, the joint optimization is to combine the appearance loss, geometric loss and wavelet domain subband loss into a total loss function according to a preset weight, and to iteratively update the parameters of the three-dimensional Gaussian using a gradient-based optimization algorithm;
[0046] The parameters include the center position of the Gaussian, scale, rotation direction, opacity, and color.
[0047] Compared with the prior art, the present invention has the following advantages and technical effects:
[0048] This invention addresses sparse viewpoints and insufficient resolution within the same training process, avoiding the cross-viewpoint inconsistencies and artifact accumulation caused by treating the two problems separately. By combining Gaussian densification with training on a high-resolution canvas, it increases the initial sampling density and the capacity to carry detail, providing a foundation for reconstructing high-frequency information. In appearance supervision, a two-route strategy of high-resolution reference pairing and low-resolution alignment balances detail supplementation against observation consistency and stabilizes the rendered appearance. In geometric supervision, virtual views interpolated from adjacent viewpoints and depth alignment constraints strengthen shape constraints and reduce deformation and drift under sparse viewpoints. In frequency domain supervision, the brightness channel undergoes stationary wavelet decomposition and sub-band consistency constraints are applied, so that large-scale structure and high-frequency detail are modeled separately and constrained jointly, which suppresses artifacts such as ringing and bright edges and improves boundary and texture clarity. By jointly optimizing the appearance, geometry, and wavelet domain losses, cross-viewpoint consistency and high-resolution reconstruction quality are improved while training and inference efficiency are maintained.
Description of the Drawings
[0049] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0050] Figure 1 is a schematic diagram of the process structure of a sparse view scene super-resolution reconstruction method based on three-dimensional Gaussian representation and wavelet domain constraints according to an embodiment of the present invention.
[0051] Figure 2 is a schematic diagram of the training process according to an embodiment of the present invention.
Detailed Implementation
[0052] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0053] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0054] This embodiment proposes a sparse view scene super-resolution reconstruction method based on 3D Gaussian representation and wavelet domain constraints which, as shown in Figures 1-2, comprises:
[0055] Initialize a 3D Gaussian representation of a 3D scene using sparse, low-resolution multi-view images as input;
[0056] The three-dimensional Gaussian representation is subjected to Gaussian densification;
[0057] On the target high-resolution canvas, the densified 3D Gaussians are trained and rendered to generate a high-resolution rendered image and depth information;
[0058] A multi-source constraint mechanism is constructed, wherein the multi-source constraint mechanism includes appearance constraints, geometric constraints, and frequency domain constraints;
[0059] Based on the multi-source constraint mechanism, the parameters of the three-dimensional Gaussian are jointly optimized and iteratively updated until the convergence condition is met, outputting a high-resolution and cross-viewpoint consistent three-dimensional scene reconstruction result.
[0060] Specifically, this embodiment first designs and constructs a 3D Gaussian reconstruction training process for sparse viewpoints and low-resolution inputs, including:
[0061] Gaussian densification: Based on the sparse point cloud and camera parameters obtained by structure from motion (SfM), an initial 3D Gaussian representation is generated. Each Gaussian is replicated in multiple directions along its local principal directions and scaled down, forming a high-density Gaussian set with controlled growth that improves sampling sufficiency for subsequent rendering and optimization.
[0062] 3D Gaussian Training: Receives the current batch of views and camera parameters, and iteratively updates the center, scale (covariance-related parameters), direction (rotation), opacity, and color of the Gaussian; it only focuses on the above learnable parameters and does not construct or maintain additional data structures.
[0063] 3D Gaussian rendering: Perform 3D Gaussian-based volume rendering on the target high-resolution canvas according to the camera parameters, and output a high-resolution color rendering and a high-resolution depth map for each training view; then apply area downsampling to the color rendering to obtain a low-resolution counterpart consistent with the input observation.
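The downsampling of the high-resolution rendering to the input scale can be illustrated with a minimal NumPy sketch. It assumes that the downsampling is a plain box filter (the mean of non-overlapping blocks) with an integer scale factor; the function name `area_downsample` and the `factor` parameter are illustrative and do not come from the patent text.

```python
import numpy as np

def area_downsample(img, factor):
    """Average non-overlapping factor x factor blocks (box filter).

    img: (H, W) or (H, W, C) array; H and W must be divisible by factor.
    """
    h, w = img.shape[:2]
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by factor")
    # Split each spatial axis into (blocks, pixels-per-block) and average the latter.
    blocks = img.reshape(h // factor, factor, w // factor, factor, *img.shape[2:])
    return blocks.mean(axis=(1, 3))
```

Because every low-resolution pixel is the mean of a fixed block, the pixel correspondence between the two scales is explicit, which is convenient for the pairing supervision described in this embodiment.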
[0064] Super-resolution loss optimization: Low-resolution input is fed into the pre-trained super-resolution model to obtain a high-resolution reference image, which is then paired one-to-one with the high-resolution rendering result; at the same time, the downsampled image of the high-resolution rendering result is paired one-to-one with the original low-resolution input to form two-way appearance supervision: high-resolution and low-resolution.
[0065] Geometric loss optimization: virtual views are generated based on interpolation of adjacent training views and the corresponding images are rendered; high-resolution rendered depth maps are aligned to low-resolution scales and paired with depth maps estimated from low-resolution inputs for geometric constraints.
[0066] Stationary wavelet loss optimization: Perform stationary wavelet transform on the brightness channels of the two paired images at a low resolution scale to obtain stationary wavelet subbands, and establish subband-level consistency constraints accordingly.
[0067] Further, the Gaussian densification includes:
[0068] Initial Gaussian atoms are generated based on sparse point clouds, and the position, scale, orientation, opacity and color parameters of the Gaussian atoms are initialized.
[0069] Several child Gaussian atoms are generated in the local neighborhood of each parent Gaussian atom. The child Gaussian atoms are selected according to the minimum spacing and view coverage criteria, and Gaussian atoms that can improve the expression of local details are retained to form a compacted Gaussian set.
[0070] Specifically, this step obtains camera parameters and a sparse point cloud from structure from motion (SfM), generates an initial 3D Gaussian representation, and performs local multi-directional replication and scale shrinking to obtain a densified Gaussian set for subsequent training and rendering.
[0071] The sparse low-resolution multi-view images are unified to a fixed resolution and color space. SfM is used to solve for the camera intrinsic and extrinsic parameters and the sparse 3D point cloud, and scale normalization and coordinate conventions are fixed; these serve as the geometric baseline for subsequent representation and training.
[0072] Using sparse point cloud as anchor points, Gaussian atoms are generated for each 3D point. The center, scale (or equivalent covariance related parameters), orientation, opacity and color are initialized to form an initial Gaussian set that can be used for training and rendering, while retaining the association information with the input view.
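The initialization of Gaussian atoms from the sparse point cloud might look like the following sketch. The patent does not state the initial values, so everything numeric here is an assumption: the isotropic scale is set to the mean distance to the `k` nearest neighbours (a common heuristic in 3D Gaussian pipelines), the orientation is the identity quaternion, and `init_opacity` is an arbitrary constant. The function name and dictionary keys are hypothetical, and at least two points are required.

```python
import numpy as np

def init_gaussians(points, colors, k=3, init_opacity=0.1):
    """Create one Gaussian atom per sparse 3D point (needs at least 2 points).

    points: (N, 3) positions; colors: (N, 3) RGB values in [0, 1].
    """
    n = len(points)
    # Pairwise distances; the diagonal is excluded from the neighbour search.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    k = min(k, n - 1)
    radius = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    return {
        "center": points.astype(float).copy(),
        "scale": np.repeat(radius[:, None], 3, axis=1),      # isotropic start (assumption)
        "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),   # identity quaternion (w, x, y, z)
        "opacity": np.full(n, init_opacity),                 # assumed constant
        "color": colors.astype(float).copy(),
    }
```

The pairwise distance matrix costs O(N²) memory, which is acceptable for sparse point clouds; a spatial index would be the natural replacement for larger inputs.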
[0073] To refine the spatial distribution, several child atoms are generated in the local neighborhood of each parent atom, with the scale of each child atom shrunk proportionally. The orientation and color are inherited from the parent atom or slightly perturbed to avoid overlap;
[0074] The newly generated child atoms are screened based on criteria such as minimum spacing and view coverage. Only candidates that improve local coverage and detail expression are written into the set. The densified Gaussian set is then output as the input for subsequent 3D Gaussian training and rendering.
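[0075] The center of each child atom is:
[0076] $$\mathbf{c}_{i,k} = \mathbf{c}_i + \delta \, \mathbf{d}_k$$ ;
[0077] In the formula, $\mathbf{c}_i$ is the center of the parent atom; $\delta$ is the step size, which is related to the scene scale or target resolution; and $\mathbf{d}_k$ is a preset unit direction vector (e.g., the eight directions of a cube).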
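The child-atom generation and minimum-spacing screening can be sketched as follows. This is a simplified illustration, not the patented procedure: it places children at parent center plus step times a unit vector toward each cube corner, shrinks their scale by an assumed factor `shrink`, and greedily rejects any child closer than `min_spacing` to an already accepted atom. The view coverage criterion is omitted, and all parameter names and defaults are hypothetical.

```python
import numpy as np

def densify(centers, scales, step, shrink=0.5, min_spacing=None):
    """Spawn child Gaussians around each parent and keep well-separated ones.

    centers: (N, 3) parent centers; scales: (N, 3) per-axis parent scales.
    Returns the parents followed by the accepted children.
    """
    # Eight unit vectors pointing toward the corners of a cube.
    corners = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                       dtype=float)
    dirs = corners / np.linalg.norm(corners, axis=1, keepdims=True)

    # Child center = parent center + step * unit direction.
    child_centers = (centers[:, None, :] + step * dirs[None, :, :]).reshape(-1, 3)
    child_scales = np.repeat(scales * shrink, len(dirs), axis=0)

    if min_spacing is None:
        min_spacing = 0.5 * step  # assumed default

    kept_centers = [c for c in centers]
    kept_scales = [s for s in scales]
    for c, s in zip(child_centers, child_scales):
        # Greedy screening: reject candidates too close to any accepted atom.
        dists = np.linalg.norm(np.asarray(kept_centers) - c, axis=1)
        if dists.min() >= min_spacing:
            kept_centers.append(c)
            kept_scales.append(s)
    return np.asarray(kept_centers), np.asarray(kept_scales)
```

The greedy loop is quadratic in the number of atoms and is meant only to make the selection rule concrete; it also shows how the spacing test bounds the growth of the set when neighbouring parents propose overlapping children.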
[0078] Furthermore, the appearance constraint is a pairing supervision between the high-resolution reference image generated based on the pre-trained super-resolution model and the high-resolution rendered image, combined with the constraint of low-resolution consistency.
[0079] The geometric constraints are used to render and pair virtual views generated by interpolation of adjacent viewpoints, combined with depth alignment constraints.
[0080] The frequency domain constraint is a constraint that supervises the consistency of sub-bands obtained by performing a stationary wavelet transform on the brightness channels of the paired images.
[0081] Specifically, in 3D Gaussian training, the parameters of the densified 3D Gaussians are optimized in batches of views. The parameters to be learned include the Gaussian center, scale (or equivalent covariance-related parameters), orientation, opacity, and color. Forward and backward updates are completed using the camera intrinsic and extrinsic parameters of the corresponding view. In 3D Gaussian rendering, forward rendering is performed for each training view on the target high-resolution canvas based on the camera parameters, producing a high-resolution color image and a high-resolution depth image consistent with that view, and the color image is downsampled to obtain a low-resolution counterpart consistent with the input observation. At the same time, the size alignment and pixel correspondence are recorded for subsequent pairing supervision, so that the high-resolution rendering result can be paired with the super-resolution reference image, its downsampled result can be paired with the low-resolution input, and the depth result can be aligned with the depth estimate at the low-resolution scale, thereby providing a unified input for the calculation of the appearance, geometry and wavelet domain losses.
[0082] Furthermore, the appearance constraint is achieved by weighted combination of the first appearance loss and the second appearance loss;
[0083] Wherein, the first appearance loss is calculated based on the difference between the high-resolution rendered image and the high-resolution reference image;
[0084] The second appearance loss is calculated based on the difference between the low-resolution rendered image obtained by downsampling the high-resolution rendered image and the original low-resolution input image.
[0085] Specifically, this step, under unified alignment and color space, constructs two appearance pairings: "high-resolution reference - high-resolution rendering" and "low-resolution input - low-resolution rendering," to simultaneously constrain detail and observation consistency. This includes:
[0086] The corresponding low-resolution input image is fed into the pre-trained super-resolution network to obtain a high-resolution reference image. During inference, the network is kept in evaluation mode and random seeds are fixed so that no training-time noise is introduced. The reference image and the high-resolution rendered image are aligned in size, color gamut and gamma space, and share the same effective comparison area through a valid-region mask. To reduce distribution shift, the reference image and the input image are, if necessary, uniformly mean-subtracted and variance-normalized, and the pairing results are cached per batch for direct reading during loss calculation.
[0088] At the high-resolution scale, the appearance pairing loss is calculated between the high-resolution rendered image and the high-resolution reference image; at the low-resolution scale, the consistency pairing loss is calculated between the low-resolution rendered image and the original low-resolution input; the two signals are weighted to obtain the total appearance loss, which is used as the output of this step.
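[0089] Furthermore, the first appearance loss (i.e., the high-scale appearance pairing loss) is:
[0090] $$\mathcal{L}_{\mathrm{app}}^{(1)} = \frac{1}{|\Omega_{\mathrm{HR}}|} \sum_{p \in \Omega_{\mathrm{HR}}} \left\lVert \hat{I}^{\mathrm{HR}}(p) - I^{\mathrm{ref}}(p) \right\rVert$$ ;
[0091] The second appearance loss (i.e., low-scale consistency pairing loss) is:
[0092] $$\mathcal{L}_{\mathrm{app}}^{(2)} = \frac{1}{|\Omega_{\mathrm{LR}}|} \sum_{p \in \Omega_{\mathrm{LR}}} \left\lVert \hat{I}^{\mathrm{LR}}(p) - I^{\mathrm{LR}}(p) \right\rVert$$ ;
[0093] The weighted sum of the appearance losses from both routes is as follows:
[0094] $$\mathcal{L}_{\mathrm{app}} = \lambda_1 \mathcal{L}_{\mathrm{app}}^{(1)} + \lambda_2 \mathcal{L}_{\mathrm{app}}^{(2)}$$ ;
[0095] In the formulas, $\hat{I}^{\mathrm{HR}}$ and $I^{\mathrm{ref}}$ are the high-resolution rendered image and the reference image, respectively; $\hat{I}^{\mathrm{LR}}$ and $I^{\mathrm{LR}}$ are the low-resolution rendered image and the input image, respectively; $\Omega_{\mathrm{HR}}$ and $\Omega_{\mathrm{LR}}$ are the sets of valid pixels at the corresponding scales; $\lVert\cdot\rVert$ measures the per-pixel difference; $\lambda_1$ and $\lambda_2$ are weighting coefficients; $\mathcal{L}_{\mathrm{app}}^{(1)}$ is the first appearance loss, $\mathcal{L}_{\mathrm{app}}^{(2)}$ is the second appearance loss, and $\mathcal{L}_{\mathrm{app}}$ is the appearance loss.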
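The two-route appearance supervision can be illustrated with the short NumPy sketch below. It assumes a mean absolute difference as the per-pixel penalty (the exact penalty is not legible in this text), block averaging as the downsampling operator, and images of shape (H, W, 3) with boolean (H, W) masks. The names `masked_l1`, `appearance_loss`, `w_hr` and `w_lr` are illustrative only.

```python
import numpy as np

def masked_l1(a, b, mask):
    """Mean absolute difference over pixels where mask is True."""
    mask = mask.astype(bool)
    if not mask.any():
        return 0.0
    return float(np.abs(a - b)[mask].mean())

def appearance_loss(render_hr, ref_hr, input_lr, mask_hr, mask_lr, factor,
                    w_hr=1.0, w_lr=1.0):
    """Weighted sum of the high-resolution and low-resolution pairing losses.

    render_hr, ref_hr: (H, W, 3); input_lr: (H/factor, W/factor, 3).
    """
    h, w = render_hr.shape[:2]
    # Bring the rendering down to the input scale by block averaging.
    render_lr = render_hr.reshape(h // factor, factor, w // factor, factor, -1).mean(axis=(1, 3))
    loss_hr = masked_l1(render_hr, ref_hr, mask_hr)    # rendering vs. super-resolved reference
    loss_lr = masked_l1(render_lr, input_lr, mask_lr)  # downsampled rendering vs. observation
    return w_hr * loss_hr + w_lr * loss_lr, loss_hr, loss_lr
```

The low-resolution term ties the rendering to what was actually observed, so hallucinated detail in the super-resolved reference cannot pull the reconstruction arbitrarily far from the input.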
[0096] Furthermore, the geometric constraints are achieved by a weighted combination of the virtual view consistency loss and the depth alignment consistency loss;
[0097] Wherein, the virtual view consistency loss is calculated based on the difference between the image rendered in the virtual view and the corresponding reference image; the depth alignment consistency loss is calculated based on the difference between the depth map estimated from the low-resolution input and the high-resolution rendered depth map after downsampling to the low-resolution scale.
[0098] Specifically, this step generates and renders virtual views using interpolation from adjacent training views on a unified geometric baseline. Simultaneously, the high-resolution rendered depth is aligned with the depth estimated from the input at a low-resolution scale, forming two types of geometric supervision: "virtual view pairing" and "depth alignment pairing." These are used to constrain shape consistency and scale alignment relationships, specifically including:
[0099] The camera parameters of the virtual view are obtained by interpolation of adjacent training viewpoints. A 3D Gaussian is then rendered forward from this viewpoint to obtain the virtual view rendering image and the corresponding rendering depth. If there is a reference or synthesized virtual view image, it is aligned with the rendering result in terms of size, color gamut and effective area.
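One way to interpolate the camera parameters of a virtual view between two adjacent training views is sketched below. The patent does not specify the interpolation scheme; this sketch assumes camera-to-world poses given as a rotation matrix and a camera center, interpolates the rotation along the geodesic on SO(3) (equivalent to quaternion slerp) and the center linearly, and assumes shared intrinsics and a relative rotation below 180 degrees. The function names are hypothetical.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about unit `axis`."""
    x, y, z = axis
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def interpolate_pose(R0, c0, R1, c1, t):
    """Pose at fraction t in [0, 1] between two camera-to-world poses."""
    R_rel = R0.T @ R1  # rotation taking view 0 to view 1, expressed in frame 0
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        R_t = R0.copy()  # the two orientations coincide
    else:
        # Rotation axis from the antisymmetric part of R_rel.
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]]) / (2.0 * np.sin(theta))
        R_t = R0 @ rodrigues(axis, t * theta)
    c_t = (1.0 - t) * np.asarray(c0, dtype=float) + t * np.asarray(c1, dtype=float)
    return R_t, c_t
```

Interpolating the rotation on the manifold keeps the result a valid rotation matrix, which element-wise blending of two rotation matrices would not guarantee.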
[0100] The high-resolution rendered depth map is downsampled to the input scale according to a fixed protocol to obtain the low-resolution rendered depth; the low-resolution depth map obtained by the low-resolution input through the depth estimation network is used as the pairing target, and occlusion and invalid pixel masking are completed in the same effective area;
[0101] In the virtual view channel, the consistency pairing loss is calculated for the virtual view rendering map and its reference (or composite) map within the effective area; in the depth channel, the alignment pairing loss is calculated for the rendered depth map downsampled to the input scale and the depth map estimated by the low-resolution input within the effective area; the two geometric signals are weighted to obtain the total geometric loss, which is used as the output of this step.
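[0102] Furthermore, the virtual view consistency loss is:
[0103] $$\mathcal{L}_{\mathrm{vv}} = \frac{1}{|\Omega_{v}|} \sum_{p \in \Omega_{v}} \left\lVert \hat{I}^{v}(p) - I^{v}_{\mathrm{ref}}(p) \right\rVert$$ ;
[0104] The depth alignment consistency loss is:
[0105] $$\mathcal{L}_{\mathrm{depth}} = \frac{1}{|\Omega_{d}|} \sum_{p \in \Omega_{d}} \left\lVert \hat{D}^{\downarrow}(p) - D^{\mathrm{est}}(p) \right\rVert$$ ;
[0106] Geometric loss weighted summation:
[0107] $$\mathcal{L}_{\mathrm{geo}} = \mu_1 \mathcal{L}_{\mathrm{vv}} + \mu_2 \mathcal{L}_{\mathrm{depth}}$$ ;
[0108] In the formulas, $\hat{I}^{v}$ and $I^{v}_{\mathrm{ref}}$ are the rendered image and the reference image of the virtual view, respectively; $\Omega_{v}$ is the effective comparison area; $\hat{D}^{\downarrow}$ is the rendered depth map after downsampling to the input scale; $D^{\mathrm{est}}$ is the depth map estimated from the low-resolution input; $\Omega_{d}$ is the effective depth region; $\lVert\cdot\rVert$ measures the per-pixel difference; $\mu_1$ and $\mu_2$ are weighting coefficients; $\mathcal{L}_{\mathrm{vv}}$ is the virtual view consistency loss, $\mathcal{L}_{\mathrm{depth}}$ is the depth alignment consistency loss, and $\mathcal{L}_{\mathrm{geo}}$ is the geometric loss.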
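The geometric supervision can be sketched in NumPy as follows, under stated assumptions: the per-pixel penalty is a mean absolute difference, the rendered depth is brought to the input scale by averaging only the valid pixels of each block, and the estimated depth is assumed to be already expressed in the same scale as the rendered depth. All function and parameter names are illustrative.

```python
import numpy as np

def masked_mean_abs(a, b, mask):
    """Mean absolute difference over pixels where mask is True."""
    mask = mask.astype(bool)
    return float(np.abs(a - b)[mask].mean()) if mask.any() else 0.0

def downsample_depth(depth_hr, valid_hr, factor):
    """Block-average depth over valid pixels only; also return the LR validity mask."""
    h, w = depth_hr.shape
    shape = (h // factor, factor, w // factor, factor)
    total = np.where(valid_hr, depth_hr, 0.0).reshape(shape).sum(axis=(1, 3))
    count = valid_hr.reshape(shape).sum(axis=(1, 3))
    valid_lr = count > 0
    depth_lr = np.where(valid_lr, total / np.maximum(count, 1), 0.0)
    return depth_lr, valid_lr

def geometric_loss(render_v, ref_v, mask_v, depth_hr, valid_hr,
                   depth_est_lr, valid_est_lr, factor, w_view=1.0, w_depth=1.0):
    """Weighted sum of virtual view consistency and depth alignment losses."""
    loss_view = masked_mean_abs(render_v, ref_v, mask_v)
    depth_lr, valid_lr = downsample_depth(depth_hr, valid_hr, factor)
    # Compare depth only where both the rendering and the estimate are valid.
    loss_depth = masked_mean_abs(depth_lr, depth_est_lr, valid_lr & valid_est_lr)
    return w_view * loss_view + w_depth * loss_depth, loss_view, loss_depth
```

Averaging only valid pixels prevents holes or occluded samples in the high-resolution depth from biasing the low-resolution value. If the depth estimation network outputs relative depth, a scale and shift alignment would be needed before the comparison; that step is not shown here.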
[0109] Furthermore, the frequency domain constraints include:
[0110] By performing stationary wavelet transforms on the brightness channels of the low-resolution rendered image and the low-resolution input image respectively, low-frequency sub-bands and several high-frequency sub-bands are obtained.
[0111] Based on the differences between corresponding subbands in the low-frequency subband and the high-frequency subband, the wavelet domain subband loss is calculated.
[0112] Specifically, this step performs a stationary wavelet transform on the brightness channels of the paired images at the low-resolution scale, separating structure and detail into four sub-bands: LL, LH, HL, and HH. Consistency constraints are then established on these sub-bands to provide a supplementary frequency domain supervision signal. This includes:
[0113] Low-resolution rendered images and corresponding low-resolution input images are selected as one-to-one paired frequency domain supervision inputs, unified to the same color and dynamic range, and the luminance channel is extracted as the input for wavelet decomposition.
[0114] Perform a single-layer stationary wavelet transform (without downsampling) on the two brightness images to obtain LL (low frequency) and LH, HL, and HH (three high frequency) subbands, keeping them the same size as the original images for pixel-by-pixel alignment and comparison.
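[0115] Furthermore, the weighted consistency loss is calculated on the four corresponding sub-bands and summed as the output of this step:
[0116] $$\mathcal{L}_{\mathrm{swt}} = \sum_{b \in \{\mathrm{LL}, \mathrm{LH}, \mathrm{HL}, \mathrm{HH}\}} w_b \, \frac{1}{|\Omega_b|} \sum_{p \in \Omega_b} \left\lVert \mathcal{W}_b\big(Y(\hat{I}^{\mathrm{LR}})\big)(p) - \mathcal{W}_b\big(Y(I^{\mathrm{LR}})\big)(p) \right\rVert$$ ;
[0117] In the formula, $\hat{I}^{\mathrm{LR}}$ and $I^{\mathrm{LR}}$ are the low-resolution rendered image and the low-resolution input image, respectively; $Y(\cdot)$ denotes luminance channel extraction; $\mathcal{W}_b(\cdot)$ is the stationary wavelet response on subband $b$; $\Omega_b$ is the effective comparison region for that subband; $w_b$ is the weight of each subband; $\lVert\cdot\rVert$ measures the per-pixel difference; $\mathcal{L}_{\mathrm{swt}}$ is the stationary wavelet loss; and LL, LH, HL, and HH denote the different subbands.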
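A single-level stationary (undecimated) wavelet loss can be sketched with NumPy alone. The patent does not name the wavelet, so this sketch assumes a Haar filter pair with averaging normalization and periodic boundaries, BT.601 weights for luminance extraction, and a mean absolute difference per subband. The assignment of the names LH and HL to the two mixed subbands follows one common convention and may differ from the source; all function names are hypothetical.

```python
import numpy as np

def luminance(rgb):
    """BT.601 luma from an (H, W, 3) RGB image."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def haar_swt2(y):
    """Single-level undecimated Haar transform with periodic boundaries.

    Returns four subbands, each with the same shape as y.
    """
    # Low-pass and high-pass along the horizontal direction (axis=1).
    lo_w = 0.5 * (y + np.roll(y, -1, axis=1))
    hi_w = 0.5 * (y - np.roll(y, -1, axis=1))
    # Then low-pass and high-pass along the vertical direction (axis=0).
    return {
        "LL": 0.5 * (lo_w + np.roll(lo_w, -1, axis=0)),
        "LH": 0.5 * (lo_w - np.roll(lo_w, -1, axis=0)),
        "HL": 0.5 * (hi_w + np.roll(hi_w, -1, axis=0)),
        "HH": 0.5 * (hi_w - np.roll(hi_w, -1, axis=0)),
    }

def swt_loss(render_lr, input_lr, weights=None, mask=None):
    """Weighted sum of per-subband mean absolute differences."""
    weights = weights or {"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}
    a = haar_swt2(luminance(render_lr))
    b = haar_swt2(luminance(input_lr))
    if mask is None:
        mask = np.ones(a["LL"].shape, dtype=bool)
    return sum(w * float(np.abs(a[k] - b[k])[mask].mean()) for k, w in weights.items())
```

Because no downsampling takes place, every subband stays pixel-aligned with the input image, so the same validity mask can be reused across subbands. Giving the high-frequency subbands their own weights lets edge and texture agreement be emphasized separately from the coarse structure carried by LL.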
[0118] Furthermore, the joint optimization is achieved by combining the appearance loss, geometric loss, and wavelet domain subband loss with preset weights into a total loss function, and then using a gradient-based optimization algorithm to iteratively update the parameters of the three-dimensional Gaussian.
[0119] The parameters include the center position of the Gaussian, scale, rotation direction, opacity, and color.
[0120] Specifically, this step jointly solves the three types of supervision (appearance, geometry, and wavelet domain) within the same training round, updating the center, scale (or equivalent covariance-related parameters), orientation, opacity, and color of the 3D Gaussians. As long as neither the convergence criterion is met nor the preset maximum number of iterations is reached, training continues with the next round. This includes:
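[0121] The appearance loss $\mathcal{L}_{\mathrm{app}}$, the geometric loss $\mathcal{L}_{\mathrm{geo}}$ and the wavelet domain loss $\mathcal{L}_{\mathrm{swt}}$ are combined by weighted summation to form the overall objective function $\mathcal{L}_{\mathrm{total}}$ for this round of training:
[0122] $$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{app}} + \lambda_{\mathrm{geo}} \mathcal{L}_{\mathrm{geo}} + \lambda_{\mathrm{swt}} \mathcal{L}_{\mathrm{swt}}$$ ;
[0123] In the formula, $\lambda_{\mathrm{geo}}$ and $\lambda_{\mathrm{swt}}$ are weighting coefficients used to balance the proportions of geometric and frequency domain supervision in the overall objective.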
[0124] Using a three-dimensional Gaussian parameter set as the variable to be optimized, a gradient-based first-order optimization method (such as Adam or its variants) is used to update the parameters in batches. Each training round executes a closed-loop process of "rendering - loss calculation - backpropagation - parameter update". The learning rate can be scheduled using strategies such as piecewise decay or cosine annealing, and combined with conventional training techniques such as gradient clipping, weight decay and early stopping to stabilize the optimization.
[0125] Training terminates and the final model is output when the convergence criterion is met or the maximum number of iterations is reached; if the termination condition is not met, high-resolution rendering and various loss optimizations continue until the termination condition is met.
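The closed loop of loss calculation, gradient step and termination check can be illustrated on a toy problem. The sketch below is not a Gaussian splatting trainer: the three loss terms are replaced by stand-in quadratics with an analytic gradient so that the example runs without a renderer. It shows a NumPy Adam update with cosine-annealed learning rate, gradient norm clipping and a simple convergence test on the change in loss; all names and default values are assumptions.

```python
import numpy as np

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate, from lr_max at step 0 to lr_min at total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + np.cos(np.pi * step / total_steps))

def toy_total_loss(theta, lam_geo=0.5, lam_swt=0.1):
    """Stand-in objective: three quadratic terms sharing one minimizer."""
    err = theta - np.array([1.0, 2.0])
    l_app = l_geo = l_swt = float(err @ err)
    total = l_app + lam_geo * l_geo + lam_swt * l_swt
    grad = 2.0 * (1.0 + lam_geo + lam_swt) * err
    return total, grad

def adam_optimize(loss_and_grad, theta, max_iters=500, lr_max=0.1, tol=1e-10,
                  clip_norm=10.0, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam until the loss change falls below tol or max_iters is reached."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    prev = np.inf
    for t in range(1, max_iters + 1):
        loss, grad = loss_and_grad(theta)
        if abs(prev - loss) < tol:  # convergence criterion
            break
        prev = loss
        norm = np.linalg.norm(grad)
        if norm > clip_norm:        # gradient clipping
            grad = grad * (clip_norm / norm)
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)   # bias-corrected moments
        v_hat = v / (1 - b2 ** t)
        lr = cosine_lr(t - 1, max_iters, lr_max)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, loss              # loss is the last evaluated value
```

A call such as `adam_optimize(toy_total_loss, np.zeros(2))` drives the toy parameters toward the shared minimizer. In an actual pipeline, `loss_and_grad` would wrap rendering, the three losses and backpropagation through a differentiable rasterizer, typically via an automatic differentiation framework rather than hand-written gradients.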
[0126] This embodiment first constructs a unified training process, including Gaussian densification, 3D Gaussian training and high-resolution rendering, super-resolution loss optimization, geometric loss optimization, and stationary wavelet loss optimization. Under sparse low-resolution multi-view input, camera parameters and a sparse point cloud are first obtained by structure from motion (SfM) to generate an initial 3D Gaussian representation. Subsequently, local multi-directional replication and scale shrinking are used to increase the sampling density. In appearance supervision, a pre-trained super-resolution model is introduced to generate a high-resolution reference that is paired with the rendering result; at the same time, the rendering result is downsampled and paired with the low-resolution input, forming a two-route constraint on detail and observation consistency. In geometric supervision, virtual views are obtained by interpolating adjacent viewpoints and are rendered and paired; at the same time, the high-resolution rendered depth map is aligned with the estimated depth at the input scale. In frequency domain supervision, stationary wavelet decomposition is performed on the brightness channels of the paired images at the low-resolution scale, and consistency constraints are established on the LL, LH, HL, and HH subbands. Finally, the appearance, geometric, and wavelet domain losses are jointly minimized to update the 3D Gaussian parameters, resulting in a high-resolution scene reconstruction with coherent structure, clear details, and cross-view consistency.
[0127] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for super-resolution reconstruction of sparse view scenes based on 3D Gaussian representation and wavelet domain constraints, characterized in that it comprises:
initializing a 3D Gaussian representation of a 3D scene using sparse, low-resolution multi-view images as input;
performing Gaussian densification processing on the 3D Gaussian representation;
training and rendering the densified 3D Gaussians on a target high-resolution canvas to generate a high-resolution rendered image and depth information;
constructing a multi-source constraint mechanism, wherein the multi-source constraint mechanism includes appearance constraints, geometric constraints, and frequency domain constraints;
the appearance constraint is paired supervision between a high-resolution reference image generated by a pre-trained super-resolution model and the high-resolution rendered image, combined with a low-resolution consistency constraint;
the geometric constraint is paired supervision on rendered virtual views generated by interpolating adjacent viewpoints, combined with a depth alignment constraint;
the frequency domain constraint supervises the consistency of the sub-bands obtained by performing a stationary wavelet transform on the luminance channels of the paired images;
the appearance constraint is implemented as a weighted combination of a first appearance loss and a second appearance loss, wherein the first appearance loss is calculated from the difference between the high-resolution rendered image and the high-resolution reference image, and the second appearance loss is calculated from the difference between the low-resolution rendered image, obtained by downsampling the high-resolution rendered image, and the original low-resolution input image;
the geometric constraint is implemented as a weighted combination of a virtual view consistency loss and a depth alignment consistency loss, wherein the virtual view consistency loss is calculated from the difference between the image rendered in the virtual view and the corresponding reference image, and the depth alignment consistency loss is calculated from the difference between the high-resolution rendered depth map, after downsampling to the low-resolution scale, and the depth map estimated from the low-resolution input;
the frequency domain constraint includes: performing a stationary wavelet transform on the luminance channels of the low-resolution rendered image and the low-resolution input image respectively to obtain a low-frequency sub-band and several high-frequency sub-bands, and calculating a wavelet domain sub-band loss from the differences between the corresponding sub-bands of the two images, over the low-frequency sub-band and the high-frequency sub-bands; and
jointly optimizing and iteratively updating the parameters of the 3D Gaussians based on the multi-source constraint mechanism until a convergence condition is met, and outputting a high-resolution, cross-view-consistent 3D scene reconstruction result.
2. The sparse view scene super-resolution reconstruction method based on 3D Gaussian representation and wavelet domain constraints according to claim 1, characterized in that the Gaussian densification processing includes: generating initial Gaussian atoms from a sparse point cloud and initializing the position, scale, orientation, opacity and color parameters of the Gaussian atoms; generating several child Gaussian atoms in the local neighborhood of each parent Gaussian atom; and screening the child Gaussian atoms according to minimum-spacing and view-coverage criteria, retaining the Gaussian atoms that improve the representation of local detail to form a densified Gaussian set.
3. The sparse view scene super-resolution reconstruction method based on three-dimensional Gaussian representation and wavelet domain constraints according to claim 1, characterized in that the first appearance loss is: [formula not reproduced]; the second appearance loss is: [formula not reproduced]; the appearance constraint is: [formula not reproduced]; in the formulas, the symbols denote, in order: the high-resolution rendered image and the high-resolution reference image; the low-resolution rendered image and the low-resolution input image; the sets of valid pixels at the corresponding scales; the weighting coefficients; the first appearance loss; the second appearance loss; and the appearance loss.
4. The sparse view scene super-resolution reconstruction method based on three-dimensional Gaussian representation and wavelet domain constraints according to claim 1, characterized in that the virtual view consistency loss is: [formula not reproduced]; the depth alignment consistency loss is: [formula not reproduced]; the geometric constraint is: [formula not reproduced]; in the formulas, the symbols denote, in order: the rendered image and the reference image of the virtual view; the effective comparison region; the rendered depth map after downsampling to the input scale; the depth map estimated from the low-resolution input; the effective depth region; the weighting coefficients; the virtual view consistency loss; the depth alignment consistency loss; and the geometric loss.
5. The sparse view scene super-resolution reconstruction method based on 3D Gaussian representation and wavelet domain constraints according to claim 1, characterized in that the wavelet domain sub-band loss is calculated as: [formula not reproduced]; in the formula, the symbols denote, in order: the low-resolution rendered image and the low-resolution input image; luminance channel extraction; the stationary wavelet response in a given sub-band; the effective comparison region for that sub-band; the weight of each sub-band; and the stationary wavelet loss; LL, LH, HL, and HH denote the different sub-bands.
6. The sparse view scene super-resolution reconstruction method based on three-dimensional Gaussian representation and wavelet domain constraints according to claim 1, characterized in that the joint optimization combines the appearance loss, the geometric loss, and the wavelet domain sub-band loss into a total loss function using preset weights, and then iteratively updates the parameters of the three-dimensional Gaussians with a gradient-based optimization algorithm, the parameters including the center position, scale, rotation, opacity, and color of each Gaussian.