A 3DGS viewpoint synthesis method based on multi-view enhancement

By constructing a view-dependent color model through multi-view enhancement technology and global multi-view self-attention mechanism, the problems of background depth information distortion and neglect of viewpoint lighting changes in traditional 3DGS methods are solved, and high-quality new viewpoint image generation is achieved.

CN122199776A · Pending · Publication Date: 2026-06-12 · FUZHOU GAOTU INFORMATION TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
FUZHOU GAOTU INFORMATION TECH
Filing Date
2026-03-17
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Traditional 3DGS methods struggle to accurately estimate background depth information and ignore color changes caused by light reflection and shadows from different viewpoints, resulting in insufficient rendered image quality.

Method used

Prior depth information is obtained through multi-view enhancement technology, and a view-dependent color model is constructed by combining a global multi-view self-attention mechanism with MLP. The Gaussian function parameters are updated, and rasterization technology is used to generate new viewpoint images.

Benefits of technology

It significantly improves the depth accuracy, color fidelity, and scene detail reproduction of rendered images, and has superior real-time high-quality rendering capabilities.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The application discloses a 3DGS view synthesis method based on multi-view enhancement, which comprises the following steps: collecting multi-view observation data; obtaining prior depth through multi-view feature extraction, homographic transformation and cost volume depth estimation; and updating Gaussian function depth in a three-dimensional progressive propagation strategy supervision; then extracting color information containing view direction through a global multi-view self-attention mechanism, combining the Gaussian function to construct a color model by using an MLP, finally updating the Gaussian function parameters and generating a new view image through rasterization projection. The technical scheme can effectively solve the problems of traditional 3DGS background depth distortion and neglecting the view angle related light changes, improve the depth accuracy, color authenticity and detail restoration degree of the new view image, and enhance the real-time rendering quality.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, specifically to a 3DGS viewpoint synthesis method based on multi-view enhancement.

Background Technology

[0002] Understanding the 3D structure of real-world scenes is a crucial step in various advanced computer vision tasks, and the 3D Gaussian splatting (3DGS) method has significantly advanced this field. Although related work has made great strides in 3DGS model building and inference, 3DGS struggles to accurately estimate background depth and ignores the changes in surface light reflection that occur when an object is viewed from different perspectives.

[0003] Because 3DGS is limited by insufficient initial background depth information, it can only roughly estimate the background depth, and subsequent training cannot make up for this missing depth information. Furthermore, when rendering images from the target viewpoint, 3DGS simply projects the 3D Gaussian functions onto a 2D plane. Because of the inherent properties of the Gaussian function, the color features of objects in the rendered image do not change with the viewpoint, so the variations in color caused by light reflection and shadows at different viewpoints are ignored. Ultimately, this results in insufficient rendered image quality.

Summary of the Invention

[0004] In view of the above problems, this application provides a 3DGS viewpoint synthesis method based on multi-view enhancement to solve the problems of distorted background depth information and missing lighting effects on object surfaces in rendered images, and to achieve better real-time rendering of high-quality images.

[0005] To achieve the above objectives, this application provides a 3DGS viewpoint synthesis method based on multi-view enhancement, comprising the following steps:

[0006] Step S1: Collect multi-view observation data in the scene. The multi-view observation data includes camera data, raw image data, and SfM point cloud data. The camera data includes camera intrinsic and extrinsic parameter matrices and camera pose data. The raw image data includes multi-view images taken from different perspectives. The SfM point cloud data includes sparse point cloud data representing the 3D scene.

[0007] Step S2: The original image data from the collected multi-view observation data is fed into the multi-view feature extraction network to extract the feature map of each multi-view image. Then, prior depth information is obtained from the feature information through homography transformation and cost volume depth estimation. Subsequently, a three-dimensional progressive propagation strategy is adopted, and the prior depth information is used as supervision to continuously update the depth information of the Gaussian function.

[0008] Step S3: Use the global multi-view self-attention mechanism module to extract color information containing view direction information from the feature information obtained from multiple views. Then, input the color information obtained from the global multi-view self-attention mechanism module and the Gaussian function into the MLP to construct a color model. The color model outputs color features and volume density update values that are dependent on view direction.

[0009] Step S4: Based on the 3DGS training process, combined with the depth information of the Gaussian function obtained in step S2 and the color features and volume density update values obtained in step S3, update the mean, covariance matrix, color parameters, and opacity parameters of the Gaussian function. Project all the updated Gaussian functions onto the two-dimensional plane using splatting rasterization, blend the color information and opacity information of all overlapping Gaussian functions, and finally obtain the new viewpoint image.

[0010] Furthermore, the specific implementation of step S2 includes:

[0011] S21. For N multi-view images, first, stratified random sampling is performed on the ray emanating from the target viewpoint to obtain M 3D sampling points in the scene. Then, using the transformation matrix formed by the virtual camera pose of the target viewpoint and the camera poses of the S source view images, the 3D sampling points are sequentially projected onto the camera plane where the source view is located. The feature query position of the 3D sampling points is obtained through bilinear interpolation. Finally, the source views from different perspectives are fed into the multi-view feature extraction network to obtain dense feature maps.
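The random layered (stratified) sampling along the target ray in S21 can be illustrated with a minimal sketch. The function name `stratified_ray_samples`, the `near` and `far` bounds, and the choice of equal-width depth bins are assumptions made for illustration; the application does not specify them.

```python
import numpy as np

def stratified_ray_samples(origin, direction, near, far, num_samples, rng):
    """Draw one random depth inside each of `num_samples` equal bins of [near, far]."""
    edges = np.linspace(near, far, num_samples + 1)          # bin boundaries
    lower, upper = edges[:-1], edges[1:]
    t = lower + (upper - lower) * rng.random(num_samples)    # one jittered depth per bin
    direction = direction / np.linalg.norm(direction)        # unit ray direction
    points = origin[None, :] + t[:, None] * direction[None, :]  # (M, 3) sample points
    return t, points

# Hypothetical ray: starts at the origin and looks along +z.
rng = np.random.default_rng(0)
t, pts = stratified_ray_samples(np.zeros(3), np.array([0.0, 0.0, 2.0]), 1.0, 5.0, 8, rng)
```

Because each depth is drawn from its own bin, the M samples cover the whole ray segment while still varying between iterations.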

[0012] S22. After obtaining feature information from multiple views, use homography transformation to transform the feature maps from different perspectives to the target perspective, and use the method of constructing cost volume to estimate depth, thereby obtaining prior depth information.

[0013] S23. Using the progressive propagation method, homography transformation is performed between pixel k and each candidate plane to find possible corresponding pixels in adjacent views. With color consistency as the benchmark, the plane with the highest color consistency between pixel k and its possible paired pixels is selected as the solution. The selected candidate plane is used to update the depth information of pixel k.

[0014] Furthermore, the specific implementation of step S3 includes:

[0015] S31. Preprocess the two-dimensional image features by concatenating the C-dimensional 2D feature vectors of the corresponding pixels of the 3D sampling points in space with the RGB values of the image pixels. Encode the spatial coordinates of the 3D sampling points into a high-dimensional position encoding vector. Add the high-dimensional position encoding vector to the concatenated feature vector element by element to obtain the input token. Multiply the input token by H weight matrices through a linear layer to obtain the Query, Key, and Value of each head in the multi-head self-attention block. Encode the relative direction vector between the target viewpoint and the source viewpoint into a high-dimensional vector as an additional input in the attention calculation process. Obtain H attention matrices through multi-head self-attention calculation. Concatenate the H attention matrices and multiply them by the weight matrix to obtain the attention matrix of the same dimension as the original input. Add the attention matrix element by element to the original input feature matrix and pass it through the convolution, ReLU activation, and LayerNorm normalization operations of the feedforward neural network. Finally, feed it into a fully connected network with Sigmoid as the activation function for normalization to obtain the RGB-related features of the 3D sampling points on the ray.

[0016] S32. The RGB-related features obtained in step S31, along with the covariance matrix of the Gaussian function and the spherical harmonic function, are fed into the MLP to train the MLP to adjust the color features and opacity information according to the target viewpoint direction. The output of the MLP is the volume density update value and the color features that include the viewpoint direction dependence, thus completing the construction of the color model.

[0017] Furthermore, in step S21, the projection of the 3D sampling points onto the camera plane where the source view is located satisfies:

$$u_{i,j} = \Pi\left(x_j;\ \theta_i,\ \theta_t\right)$$

where $u_{i,j}$ denotes the coordinates of the 3D sampling point on the target ray projected onto the $i$-th source view; $x_j$ denotes the 3D coordinates of the $j$-th 3D sampling point on the ray from the target viewpoint; $\theta$ denotes a camera intrinsic and extrinsic parameter matrix, with $\theta_i$ corresponding to the position of the $i$-th source view and $\theta_t$ corresponding to the target viewpoint position; and the function $\Pi$ projects 3D sampling points onto the source view plane according to the camera's spatial pose definition, satisfying the following correspondence:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

[0018] Where (X,Y,Z) are the position coordinates of the 3D sampling points, (u,v) are the projected pixel coordinates, $Z_c$ is the depth of the point in the camera frame, K is the camera intrinsic parameter matrix, R is the rotation matrix, and T is the translation matrix.
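As a concrete illustration of projecting 3D sampling points with K, R, and T, the following minimal sketch assumes the standard pinhole model with the world-to-camera convention `x_cam = R @ x_world + T`. The function name `project_points` and the numeric camera values are hypothetical.

```python
import numpy as np

def project_points(points_world, K, R, T):
    """Project (M, 3) world points to pixel coordinates with a pinhole camera."""
    cam = points_world @ R.T + T          # world frame -> camera frame, (M, 3)
    uvw = cam @ K.T                       # camera frame -> homogeneous pixel coordinates
    depth = uvw[:, 2]                     # depth along the camera's principal axis
    uv = uvw[:, :2] / depth[:, None]      # perspective divide
    return uv, depth

# Hypothetical source camera: focal length 500 px, principal point (320, 240), identity pose.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.zeros(3)
uv, depth = project_points(np.array([[0.0, 0.0, 2.0], [0.2, -0.1, 4.0]]), K, R, T)
# uv is [[320.0, 240.0], [345.0, 227.5]]
```

The resulting continuous pixel positions are the locations at which the source-view feature maps would then be queried.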

[0019] Furthermore, in step S22, the constructed cost volume dimension is (X, C, D, H, W), where X represents the number of feature maps, C represents the number of channels per feature map, D represents the depth dimension, H represents the feature map height, and W represents the feature map width; the homography transformation at depth d satisfies:

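[0020] $$H_i(d) = K_i R_i \left(I - \frac{(t_1 - t_i)\, n_1^{\top}}{d}\right) R_1^{\top} K_1^{-1}$$

[0021] where $K_i$, $R_i$, and $t_i$ are the intrinsic parameter matrix, rotation matrix, and translation matrix of the i-th camera, respectively; $n_1$ is the principal axis of the reference camera; and $I$ is the identity matrix.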
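A plane-sweep homography at depth d can be checked numerically with a short sketch. The version below is written in relative-pose form under one explicit convention (`x_cam = R @ x_world + t`, plane `z = d` in the reference camera frame); it is an illustrative assumption rather than a transcription of the formula in the text, and pose and sign conventions differ between implementations. The names `plane_sweep_homography` and `project` are hypothetical.

```python
import numpy as np

def plane_sweep_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, d):
    """Map reference-view pixels to source-view pixels via the plane z = d
    of the reference camera. Assumed convention: x_cam = R @ x_world + t."""
    R_rel = R_src @ R_ref.T                  # relative rotation, reference -> source
    t_rel = t_src - R_rel @ t_ref            # relative translation
    n = np.array([0.0, 0.0, 1.0])            # principal axis in the reference frame
    return K_src @ (R_rel + np.outer(t_rel, n) / d) @ np.linalg.inv(K_ref)

def project(K, R, t, X):
    """Pinhole projection of a single world point X to pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_ref, t_ref = np.eye(3), np.zeros(3)
c, s = np.cos(0.1), np.sin(0.1)
R_src = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # small rotation about y
t_src = np.array([-0.3, 0.0, 0.05])
d = 4.0
X = np.array([0.5, -0.2, d])                 # a point lying on the plane z = d

H = plane_sweep_homography(K, R_ref, t_ref, K, R_src, t_src, d)
q = H @ np.append(project(K, R_ref, t_ref, X), 1.0)
warped = q[:2] / q[2]                        # reference pixel warped into the source view
expected = project(K, R_src, t_src, X)       # direct projection into the source view
```

For a point that really lies on the hypothesized plane, `warped` and `expected` coincide; for a wrong depth hypothesis they differ, which is what the cost volume measures.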

[0022] Furthermore, in step S23, the homography transformation satisfies:

[0023] $$H_{P}(d) = K \left(R - \frac{T\, s^{\top}}{d}\right) K^{-1}$$

[0024] Where P is the target view; K, R, and T represent the camera's intrinsic parameter matrix, rotation matrix, and translation matrix, respectively; s represents the camera's principal axis; and d represents the depth.

[0025] Furthermore, in step S22, when obtaining prior depth information, the method of MVSNet is followed. After calculating the variance of the feature maps under different viewpoints, the cost volume is regularized using a 3D CNN. Finally, the expectation of each pixel is calculated along the depth d direction to obtain the initial depth value of the corresponding pixel, thereby obtaining the depth map under the target viewpoint.
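The variance-based cost and the expectation along the depth direction can be sketched as follows. This is a simplified illustration under stated assumptions: the 3D CNN regularization is replaced by a plain mean over channels, and the toy features are constructed so that all views agree only at depth index 2. The names `variance_cost` and `expected_depth` are hypothetical.

```python
import numpy as np

def variance_cost(warped_features):
    """warped_features: (X, C, D, H, W). Returns the variance over the X views, (C, D, H, W)."""
    mean = warped_features.mean(axis=0)
    return ((warped_features - mean) ** 2).mean(axis=0)

def expected_depth(cost, depth_values):
    """cost: (D, H, W), lower is better. Returns the softmax-weighted mean depth, (H, W)."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)                # probability over depth hypotheses
    return (prob * depth_values[:, None, None]).sum(axis=0)

# Toy warped features with X=3 views, C=2 channels, D=4 depths, H=W=2.
views = np.arange(3.0)[:, None, None, None, None]
chans = np.arange(1.0, 3.0)[None, :, None, None, None]
offset = np.abs(np.arange(4.0) - 2.0)[None, None, :, None, None]  # zero at depth index 2
feats = views * chans * offset * np.ones((1, 1, 1, 2, 2))

cost = variance_cost(feats).mean(axis=0)    # channel mean stands in for the 3D CNN (assumption)
depths = np.array([1.0, 2.0, 3.0, 4.0])
depth_map = expected_depth(cost, depths)    # close to 3.0 at every pixel
```

Taking the expectation rather than the arg-min keeps the depth estimate differentiable and allows sub-hypothesis depth values.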

[0026] Furthermore, in step S31, the number of multi-head self-attention blocks is 8, and the number of TransformerLayer blocks used in alternating stacking is 4.

[0027] Furthermore, in step S4, the calculation process of mixing the color information and opacity information of all overlapping Gaussian functions satisfies:

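[0028] $$C(p) = \sum_{i=1}^{N} c_i\, \alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right)$$

[0029] where $C(p)$ is the color information of pixel p in the new viewpoint image, $c_i$ is the color information of the i-th Gaussian function, $\alpha_i$ is the opacity of the i-th Gaussian function, and $\alpha_j$ is the opacity of the j-th Gaussian function; the sum runs over the N overlapping Gaussian functions sorted by depth.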
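The blending of color and opacity over overlapping Gaussian functions can be illustrated with a minimal front-to-back compositing sketch for a single pixel. It assumes the Gaussians are already sorted from nearest to farthest and that each opacity has already been evaluated at the pixel; the function name `composite` and the sample values are hypothetical.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted Gaussians covering one pixel."""
    pixel = np.zeros(3)
    transmittance = 1.0                       # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= (1.0 - a)            # product of (1 - alpha_j) over earlier Gaussians
    return pixel

colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
alphas = [0.5, 0.5, 1.0]
pixel = composite(colors, alphas)             # [0.5, 0.25, 0.25]
```

The running `transmittance` is exactly the product term in the blending sum, so nearer Gaussians attenuate the contribution of those behind them.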

[0030] Furthermore, the Gaussian function satisfies:

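[0031] $$G(x) = \exp\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\, \Sigma^{-1}\, (x-\mu)\right)$$

[0032] where μ is the mean of the Gaussian function and Σ is the covariance matrix of the Gaussian function, which satisfies $\Sigma = R S S^{\top} R^{\top}$; R is the orthogonal rotation matrix, and S is the diagonal scaling matrix.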
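The relationship between the mean μ, the covariance Σ, the rotation R, and the scaling S can be made concrete with a small sketch. It assumes the usual 3DGS parameterization in which the covariance is built from R and S so that it stays symmetric positive semi-definite; the function names and numeric values are hypothetical.

```python
import numpy as np

def covariance_from_rs(R, scales):
    """Build the covariance from a rotation matrix and per-axis scales."""
    S = np.diag(scales)
    return R @ S @ S.T @ R.T

def gaussian_value(x, mu, Sigma):
    """Evaluate the unnormalized 3D Gaussian at point x."""
    diff = x - mu
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)))

theta = np.pi / 4                             # rotate 45 degrees about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
scales = np.array([2.0, 1.0, 0.5])
Sigma = covariance_from_rs(R, scales)
mu = np.array([1.0, 0.0, 0.0])

peak = gaussian_value(mu, mu, Sigma)                       # 1.0 at the mean
one_sigma = gaussian_value(mu + 2.0 * R[:, 0], mu, Sigma)  # exp(-0.5) one scale along axis 0
```

Optimizing R and S instead of Σ directly means every gradient step still yields a valid covariance matrix.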

[0033] Furthermore, in step S4, when updating the Gaussian function parameters, an adaptive density control strategy is adopted: when the scaling factor of a Gaussian function exceeds the threshold, it is split into smaller Gaussian functions; when the scaling factor is less than the threshold, the current Gaussian function is cloned; finally, Gaussian functions whose opacity is too low or whose scaling factor is too high are pruned.
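The split, clone, and prune rules can be sketched in a few lines. This is a deliberately simplified illustration: each Gaussian carries a single scalar scale, the thresholds (`split_threshold`, `min_opacity`, `max_scale`) and the shrink factor of 1.6 are assumed values, and any additional trigger used by the 3DGS training process to decide which Gaussians are candidates for densification is omitted because this paragraph does not describe it.

```python
from dataclasses import dataclass, replace

@dataclass
class Gaussian:
    scale: float      # largest axis length, simplified to one scalar (assumption)
    opacity: float

def densify_and_prune(gaussians, split_threshold, min_opacity, max_scale, shrink=1.6):
    """Apply the split / clone / prune rules once and return the new list."""
    out = []
    for g in gaussians:
        if g.scale > split_threshold:
            # Split: two smaller Gaussians replace the large one.
            out += [replace(g, scale=g.scale / shrink), replace(g, scale=g.scale / shrink)]
        elif g.scale < split_threshold:
            # Clone: keep the original and add a copy.
            out += [g, replace(g)]
        else:
            out.append(g)
    # Prune Gaussians that are nearly transparent or still too large.
    return [g for g in out if g.opacity >= min_opacity and g.scale <= max_scale]

gaussians = [Gaussian(4.0, 0.9), Gaussian(0.5, 0.8), Gaussian(0.5, 0.001), Gaussian(100.0, 0.9)]
result = densify_and_prune(gaussians, split_threshold=1.0, min_opacity=0.005, max_scale=10.0)
```

In this toy run the first Gaussian is split, the second is cloned, and the last two are removed for low opacity and excessive scale, leaving four Gaussians.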

[0034] Unlike existing technologies, the above-mentioned technical solution is based on a multi-view enhanced 3DGS viewpoint synthesis method. This includes acquiring multi-view observation data, obtaining prior depth through multi-view feature extraction, homography transformation, and cost volume depth estimation, and using a 3D progressive propagation strategy to supervise the updating of the Gaussian function depth. Then, a global multi-view self-attention mechanism is used to extract color information containing the viewpoint direction, and a color model is constructed using an MLP combined with the Gaussian function. Finally, the Gaussian function parameters are updated, and a new viewpoint image is generated through rasterization projection. This technical solution can effectively solve the problems of background depth distortion and neglect of viewpoint-related lighting changes in traditional 3DGS, improve the depth accuracy, color fidelity, and detail reproduction of the new viewpoint image, and enhance real-time rendering quality.

[0035] The above description of the invention is merely an overview of the technical solution of this application. In order to enable those skilled in the art to better understand the technical solution of this application and to implement it based on the description and drawings, and to make the above-mentioned objectives and other objectives, features and advantages of this application easier to understand, the following description is provided in conjunction with the specific embodiments and drawings of this application.

Description of the Drawings

[0036] The accompanying drawings are only used to illustrate the principles, implementation methods, applications, features, and effects of specific embodiments of the present invention and other related contents, and should not be considered as limitations on this application.

[0037] In the accompanying drawings of the specification:

[0038] Figure 1 is a flowchart illustrating the overall process of the 3DGS viewpoint synthesis method based on multi-view enhancement according to an embodiment of the present invention;

[0039] Figure 2 is a schematic diagram corresponding to Figure 1;

[0040] Figure 3 is a flowchart of step S2;

[0041] Figure 4 is a diagram of the progressive depth propagation module for multi-view information enhancement in an embodiment of the present invention;

[0042] Figure 5 is a schematic diagram of target viewpoint reprojection after sampling in an embodiment of the present invention;

[0043] Figure 6 is a schematic diagram of the multi-view feature extraction module in an embodiment of the present invention;

[0044] Figure 7 is a schematic diagram of the global multi-view self-attention mechanism module in an embodiment of the present invention.

Detailed Implementation

[0045] To illustrate in detail the possible application scenarios, technical principles, implementable specific solutions, and achievable objectives and effects of this application, a detailed explanation is provided below in conjunction with the listed specific embodiments and accompanying drawings. The embodiments described herein are merely illustrative of the technical solutions of this application and are therefore not intended to limit the scope of protection of this application.

[0046] In this document, the term "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The term "embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment, nor does it specifically limit its independence or connection with other embodiments. In principle, in this application, as long as there are no technical contradictions or conflicts, the technical features mentioned in each embodiment can be combined in any way to form corresponding implementable technical solutions.

[0047] Unless otherwise defined, the technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the use of related terms herein is merely for the purpose of describing particular embodiments and is not intended to limit this application.

[0048] In the description of this application, the term "and / or" describes the logical relationship between objects and indicates that three relationships can exist. For example, A and / or B means: A exists alone, B exists alone, or A and B exist simultaneously. Additionally, the character " / " in this document generally indicates that the preceding and following objects have an "or" logical relationship.

[0049] In this application, terms such as “first” and “second” are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual quantity, hierarchy or order relationship between these entities or operations.

[0050] Without further limitations, the use of terms such as “comprising,” “including,” “having,” or other similar open-ended expressions in this application is intended to cover non-exclusive inclusion, which does not exclude the presence of additional elements in a process, method, or product that includes the stated elements, such that a process, method, or product that includes a list of elements may include not only those defined elements but also other elements not expressly listed, or elements inherent to such a process, method, or product.

[0051] As understood in the Examination Guidelines, in this application, expressions such as "greater than," "less than," and "exceeding" are understood to exclude the stated number; expressions such as "above," "below," and "within" are understood to include the stated number. Furthermore, in the description of the embodiments in this application, "multiple" means two or more (including two), and similar expressions related to "multiple" are also understood in this way, such as "multiple groups" and "multiple times," unless otherwise explicitly specified.

[0052] In the description of the embodiments of this application, the space-related expressions used, such as "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "vertical," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," and "circumferential," indicate the orientation or positional relationship based on the orientation or positional relationship shown in the specific embodiments or drawings. They are only for the purpose of describing the specific embodiments of this application or for the reader's understanding, and do not indicate or imply that the device or component referred to must have a specific position, a specific orientation, or be constructed or operated in a specific orientation. Therefore, they should not be construed as limitations on the embodiments of this application.

[0053] Unless otherwise expressly specified or limited, the terms "installation," "connection," "linking," "fixing," and "setting," as used in the description of the embodiments of this application, should be interpreted broadly. For example, "connection" can be a fixed connection, a detachable connection, or an integral setting; it can be a mechanical connection, an electrical connection, or a communication connection; it can be a direct connection or an indirect connection through an intermediate medium; it can be the internal connection of two components or the interaction between two components. For those skilled in the art to which this application pertains, the specific meaning of the above terms in the embodiments of this application can be understood according to the specific circumstances.

[0054] Please refer to Figures 1 to 7. This embodiment provides a 3DGS viewpoint synthesis method based on multi-view enhancement, applicable to viewpoint synthesis tasks in fields such as virtual reality, augmented reality, digital twins, and 3D scene reconstruction. Traditional 3DGS methods have difficulty accurately estimating background depth information and ignore color changes caused by light reflection and shadows from different viewpoints, resulting in insufficient rendered image quality. To address these technical problems, this method obtains prior depth through multi-view enhancement technology and uses it to supervise Gaussian function depth updates. It combines a global multi-view self-attention mechanism and an MLP to construct a viewpoint-dependent color model, and ultimately generates new viewpoint images. This effectively improves the depth accuracy, color realism, and scene detail reproduction of rendered images, providing superior real-time high-quality rendering capabilities.

[0055] As shown in Figure 1 and Figure 2, in this embodiment, the 3DGS viewpoint synthesis method based on multi-view enhancement includes the following steps:

[0056] Step S1: Collect multi-view observation data in the scene. The multi-view observation data includes camera data, raw image data, and SfM point cloud data. The camera data includes camera intrinsic and extrinsic parameter matrices and camera pose data. The raw image data includes multi-view images taken from different perspectives. The SfM point cloud data includes sparse point cloud data representing the three-dimensional scene.

[0057] Step S2: The original image data from the collected multi-view observation data is sent to the multi-view feature extraction network to extract the feature map of each multi-view image. Then, prior depth information is obtained from the feature information through homography transformation and cost volume depth estimation. Subsequently, a three-dimensional progressive propagation strategy is adopted and the prior depth information is used as supervision to continuously update the depth information of the Gaussian function.

[0058] Step S3: Use the global multi-view self-attention mechanism module to extract color information containing viewpoint direction information from the feature information obtained from multiple views. Then, input the color information obtained from the global multi-view self-attention mechanism module along with the Gaussian function into the MLP to construct a color model. The color model outputs color features and volume density update values that depend on the viewpoint direction. Figure 7 shows a schematic of the global multi-view self-attention mechanism module.

[0059] Step S4: Based on the 3DGS training process, combined with the depth information of the Gaussian function obtained in step S2 and the color features and volume density update values obtained in step S3, update the mean, covariance matrix, color parameters, and opacity parameters of the Gaussian function. Project all the updated Gaussian functions onto the two-dimensional plane using splatting rasterization, blend the color information and opacity information of all overlapping Gaussian functions, and finally obtain the new viewpoint image.

[0060] 3DGS represents 3D scenes using anisotropic 3D Gaussian distributions and renders them as images using a splatting-based rasterization technique. Each 3D Gaussian function is represented by a covariance matrix Σ and a mean μ, and the Gaussian function satisfies:

[0061] $$G(x) = \exp\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\, \Sigma^{-1}\, (x-\mu)\right)$$

[0062] where μ is the mean of the Gaussian function and Σ is the covariance matrix of the Gaussian function. To ensure effective optimization by gradient descent, the covariance is parameterized as $\Sigma = R S S^{\top} R^{\top}$, where R is the orthogonal rotation matrix and S is the diagonal scaling matrix.

[0063] As shown in Figure 3, Figure 4, Figure 5, and Figure 6, step S2 specifically includes:

[0064] S21. For N multi-view images, first, stratified random sampling is performed on the ray emanating from the target viewpoint to obtain M 3D sampling points in the scene. Then, using the transformation matrix formed by the virtual camera pose of the target viewpoint and the camera poses of the S source view images, the 3D sampling points are sequentially projected onto the camera plane where the source view is located. The feature query position of the 3D sampling points is obtained through bilinear interpolation. Finally, the source views from different perspectives are fed into the multi-view feature extraction network to obtain dense feature maps. Figure 6 is a schematic diagram of the feature extraction process.
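Since a projected sampling point generally lands between pixel centers, the feature query in S21 relies on bilinear interpolation. The following minimal sketch samples an (H, W, C) feature map at a continuous pixel position; the function name `bilinear_sample`, the border clamping, and the toy feature map are assumptions for illustration.

```python
import numpy as np

def bilinear_sample(feature_map, u, v):
    """Sample an (H, W, C) feature map at continuous pixel position (u, v)."""
    H, W, _ = feature_map.shape
    u = min(max(u, 0.0), W - 1.0)             # clamp to the image border
    v = min(max(v, 0.0), H - 1.0)
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = u - x0, v - y0                   # fractional offsets inside the pixel cell
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

# Toy single-channel feature map whose value is 4 * row + column.
fmap = np.arange(12, dtype=float).reshape(3, 4, 1)
feat = bilinear_sample(fmap, 1.5, 0.5)        # [3.5]
```

Because the toy map is linear in row and column, the interpolated value equals the exact value of that linear function at (1.5, 0.5).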

[0065] S22. After obtaining feature information from multiple views, homography transformation is used to transform the feature maps from different perspectives to the target perspective, and depth estimation is performed by constructing cost volume to obtain prior depth information.

[0066] S23. Using the progressive propagation method, homography is performed between pixel k and each candidate plane to find possible corresponding pixels in adjacent views. Based on color consistency, the plane with the highest color consistency between pixel k and its possible paired pixels is selected as the solution. The selected candidate plane is used to update the depth information of pixel k.
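The selection of the best candidate plane for pixel k can be sketched as a loop over depth hypotheses. This is a heavily simplified illustration under stated assumptions: candidate planes are fronto-parallel, only one adjacent view is used, color consistency is measured as the absolute difference of a single grayscale pixel rather than a patch-based score, and the homography uses the convention `x_cam = R @ x_world + t`. All names and values are hypothetical.

```python
import numpy as np

def homography_for_depth(K, R_rel, t_rel, d):
    """Plane-induced homography for the plane z = d in the reference camera frame."""
    n = np.array([0.0, 0.0, 1.0])
    return K @ (R_rel + np.outer(t_rel, n) / d) @ np.linalg.inv(K)

def select_depth(pixel, ref_img, src_img, K, R_rel, t_rel, candidates):
    """Return the candidate depth whose warped pixel best matches the reference color."""
    u, v = pixel
    best_d, best_err = None, np.inf
    for d in candidates:
        q = homography_for_depth(K, R_rel, t_rel, d) @ np.array([u, v, 1.0])
        us, vs = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
        if not (0 <= us < src_img.shape[1] and 0 <= vs < src_img.shape[0]):
            continue                          # hypothesis projects outside the adjacent view
        err = abs(float(ref_img[v, u]) - float(src_img[vs, us]))
        if err < best_err:
            best_d, best_err = d, err
    return best_d

# Synthetic pair: a textured plane at depth 5 seen by two cameras one unit apart along x.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
R_rel, t_rel = np.eye(3), np.array([-1.0, 0.0, 0.0])
ref_img = np.tile(np.arange(64.0), (4, 1))    # intensity equals the column index
src_img = ref_img + 20.0                      # same texture shifted by the 20 px disparity
best = select_depth((40, 1), ref_img, src_img, K, R_rel, t_rel, [2.0, 4.0, 5.0, 10.0])
```

Only the correct hypothesis (depth 5) warps pixel k onto a source pixel with the same color, so it is selected and would then be used to update the depth of pixel k.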

[0067] In this embodiment, by standardizing the calculation logic of projecting 3D sampling points onto the source view and clarifying the cost volume dimension and the homography transformation formula at depth d, dense feature maps of multiple views are first accurately obtained through stratified random sampling and bilinear interpolation. Then, the features of different viewpoints are aligned and the cost volume is constructed through homography transformation. Combined with the MVSNet approach, accurate prior depth estimation is achieved. Finally, depth information is updated through progressive propagation based on color consistency, forming a complete closed loop of "feature extraction-depth estimation-depth optimization". This effectively solves the problems of background depth information distortion, insufficient initial depth, and difficulty in optimization in traditional 3DGS, significantly improves the accuracy and update stability of Gaussian function depth parameters, and provides a reliable depth foundation for subsequent high-quality new viewpoint image rendering.

[0068] In step S21, for N multi-viewpoint images, first, stratified random sampling is performed on the ray emanating from the target viewpoint to obtain M 3D sampling points in the scene. Then, using the transformation matrix formed by the virtual camera pose of the target viewpoint and the camera poses of the S source viewpoint images, the sampling points are sequentially projected onto the camera plane where the source viewpoints are located. The projection of the 3D sampling points onto the camera plane where the source viewpoints are located satisfies:

$$u_{i,j} = \Pi\left(x_j;\ \theta_i,\ \theta_t\right)$$

where $u_{i,j}$ denotes the coordinates of the 3D sampling point on the target ray projected onto the $i$-th source view; $x_j$ denotes the 3D coordinates of the $j$-th 3D sampling point on the ray from the target viewpoint; $\theta$ denotes a camera intrinsic and extrinsic parameter matrix, with $\theta_i$ corresponding to the position of the $i$-th source view and $\theta_t$ corresponding to the target viewpoint position; and the function $\Pi$ projects 3D sampling points onto the source view plane according to the camera's spatial pose definition, satisfying the following correspondence:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

[0069] Where (X,Y,Z) are the position coordinates of the 3D sampling points, (u,v) are the projected pixel coordinates, $Z_c$ is the depth of the point in the camera frame, K is the camera intrinsic parameter matrix, R is the rotation matrix, and T is the translation matrix. After obtaining feature information from multiple views, homography transformation is used to transform the feature maps from different viewpoints to the target viewpoint, and depth estimation is performed by constructing a cost volume to obtain prior depth information.

[0070] In step S22, a 3D scene is constructed based on the extracted feature information and the input cameras. Let $I_i$ denote the i-th image (i=1,2,…,X), where X is the number of images and $I_1$ is set as the reference image. The images from the different viewpoints are warped to the target viewpoint using homography transformation. The constructed cost volume dimension is (X,C,D,H,W), where X represents the number of feature maps, C represents the number of channels in each feature map, D represents the depth dimension, H represents the feature map height, and W represents the feature map width. The homography transformation at depth d satisfies:

[0071] $$H_i(d) = K_i R_i \left(I - \frac{(t_1 - t_i)\, n_1^{\top}}{d}\right) R_1^{\top} K_1^{-1}$$

[0072] where $K_i$, $R_i$, and $t_i$ are the intrinsic parameter matrix, rotation matrix, and translation matrix of the i-th camera, respectively; $n_1$ is the principal axis of the reference camera; and $I$ is the identity matrix. After obtaining the feature maps under the target viewpoint, the MVSNet method is followed: the variance of the feature maps under different viewpoints is calculated, the cost volume is regularized using a 3D CNN, and finally the expectation of each pixel along the depth d direction is calculated to obtain the initial depth value of the corresponding pixel, thereby obtaining the depth map under the target viewpoint.

[0073] In step S23, to select the best candidate plane for pixel k during propagation, a progressive propagation method is used to perform homography transformation between pixel k and each candidate plane, thereby finding possible corresponding pixels in adjacent views. Using color consistency as a benchmark, the plane with the highest color consistency between k and its possible paired pixels is selected as the solution. The selected candidate plane is then used to update the depth information of pixel k. The homography transformation satisfies:

[0074] $$H_{P}(d) = K \left(R - \frac{T\, s^{\top}}{d}\right) K^{-1}$$

[0075] Where P is the target view; K, R, and T represent the camera's intrinsic parameter matrix, rotation matrix, and translation matrix, respectively; s represents the camera's principal axis; and d represents the depth. This step, through a complete closed loop of "feature extraction - depth estimation - depth optimization," effectively solves the problems of distorted background depth information, insufficient initial depth, and difficulty in optimization in traditional 3DGS, significantly improving the accuracy and update stability of the Gaussian function depth parameters.

[0076] In this embodiment, step S3 uses a global multi-view self-attention mechanism module to extract color information containing viewpoint direction information from the feature information obtained from multiple views. This color information, together with the Gaussian function, is fed into an MLP to construct a color model, which outputs color features and volume density update values that depend on the viewpoint direction. Step S3 comprises two sub-steps, S31 and S32:

[0077] S31. Preprocess the two-dimensional image features by concatenating the C-dimensional 2D feature vectors of the pixels corresponding to the 3D sampling points in space with the RGB values of those image pixels. Encode the spatial coordinates of the 3D sampling points into a high-dimensional position encoding vector, and add it element by element to the concatenated feature vector to obtain the input token. Multiply the input token by H weight matrices in a linear layer to obtain the Query, Key, and Value of each head in the multi-head self-attention block. Encode the relative direction vector between the target viewpoint and the source viewpoint into a high-dimensional vector, which serves as an additional input to the attention calculation. Multi-head self-attention then yields H attention matrices, which are concatenated and multiplied by a weight matrix to obtain an attention matrix of the same dimension as the original input. Add this attention matrix element by element to the original input feature matrix and pass the result through the convolution, ReLU activation, and LayerNorm normalization operations of the feedforward neural network. Finally, feed the output into a fully connected network with Sigmoid as the activation function for normalization to obtain the RGB-related features of the 3D sampling points on the ray. In step S31, the number of multi-head self-attention blocks is 8, and the number of alternately stacked TransformerLayer blocks is 4. In other embodiments, both numbers can be increased or decreased as needed.

[0078] The 2D image features are preprocessed by concatenating, across the N source views, the C-dimensional 2D feature vectors of the pixels corresponding to the 3D sampling points in space with the RGB values of those image pixels. Deep networks tend to fit low-frequency functions and ignore high-frequency information, so directly inputting position information and ray direction results in a blurry image. Therefore, this embodiment first encodes the spatial coordinates of the 3D sampling points into high-dimensional vectors, then adds them element-wise to the concatenated feature vector, and feeds the result into the MVT module. The preprocessing process is shown in the following formula:

[0079] ;

[0080] where the two input terms are, respectively, the feature vector and the RGB value mapped from the j-th sampling point on the target ray to the i-th source view. The processed input token, denoted T, is multiplied by H weight matrices in the linear layer to obtain the Query, Key, and Value for each head in the multi-head self-attention block, as shown in the following equation:

[0081] ;
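The preprocessing of paragraphs [0078] to [0080] (concatenate feature and RGB, encode the sampling-point coordinates, add element-wise) can be sketched as follows. This is an illustration under assumptions, not the applicant's formula: the text only says the coordinates are encoded into a high-dimensional vector, so the sine/cosine frequency encoding used here is an assumed choice, and `positional_encoding` and `build_token` are hypothetical names.

```python
import math

def positional_encoding(xyz, dim):
    """Encode a 3D point into `dim` values using sin/cos at doubling frequencies.

    Assumption: a frequency encoding; the source does not specify the scheme.
    """
    out = []
    freq = 1.0
    while len(out) < dim:
        for coord in xyz:
            out.append(math.sin(freq * coord))
            out.append(math.cos(freq * coord))
        freq *= 2.0
    return out[:dim]  # truncate so the width matches the concatenated vector

def build_token(feature, rgb, xyz):
    """Concatenate the C-dim feature with RGB, then add the encoding element-wise."""
    concat = list(feature) + list(rgb)
    enc = positional_encoding(xyz, len(concat))
    return [a + b for a, b in zip(concat, enc)]

# One sampling point seen in one source view: C = 5 feature channels plus RGB.
token = build_token([0.1, 0.2, 0.3, 0.4, 0.5], [0.9, 0.5, 0.1], (0.0, 1.0, 2.0))
print(len(token))  # C + 3 values per token
```

Adding the encoding rather than appending it keeps the token width at C + 3, which matches the element-wise addition described in the text.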

[0082] Simultaneously, the relative direction vector between the target viewpoint and the source viewpoint is encoded into a high-dimensional vector, which serves as an additional input to the attention calculation and enhances the model's ability to analyze differences in image features across viewpoints. The attention calculation process is shown in the following equation:

[0083] The H attention matrices obtained after multi-head self-attention computation are concatenated and multiplied by the weight matrix to obtain an attention matrix of the same dimension as the original input. This attention matrix is then added element-wise to the original input feature matrix and passed sequentially through the convolution, ReLU activation, and LayerNorm normalization operations of a feedforward neural network. In this embodiment, there are 8 multi-head self-attention blocks and 4 alternately stacked Transformer Layer blocks. Finally, the output tensor is fed into a fully connected network with Sigmoid activation for normalization, yielding the RGB-related features of the 3D sampling points on the ray.
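The attention path described above (per-head Query/Key/Value projections, H heads concatenated, an output projection back to the input width, and a residual addition) can be sketched in plain Python as below. Several points are assumptions rather than the applicant's design: the text does not say how the encoded relative direction enters the attention calculation, so here it is simply added to the tokens used for Query and Key; the feedforward stage and the final Sigmoid network are omitted; all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(tokens, dir_enc, wq, wk, wv):
    """One head: scaled dot-product attention over the source-view tokens."""
    # Assumption: the direction encoding is added to the Query/Key inputs.
    guided = [[t + d for t, d in zip(tok, enc)]
              for tok, enc in zip(tokens, dir_enc)]
    q, k, v = matmul(guided, wq), matmul(guided, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
              for qi in q]
    return matmul([softmax(r) for r in scores], v)

def multi_head_block(tokens, dir_enc, heads, wo):
    """Concatenate all heads, project to the input width, add the residual."""
    outs = [attention_head(tokens, dir_enc, *w) for w in heads]
    concat = [sum((o[i] for o in outs), []) for i in range(len(tokens))]
    projected = matmul(concat, wo)
    return [[t + p for t, p in zip(tok, row)]
            for tok, row in zip(tokens, projected)]

# Two source-view tokens of width 2, two heads with identity projections.
eye = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
dirs = [[0.1, 0.0], [0.0, 0.1]]
wo = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # (2 heads * 2) x 2
print(multi_head_block(tokens, dirs, [(eye, eye, eye), (eye, eye, eye)], wo))
```

The output projection `wo` is what lets the concatenated heads be added element-wise to the original input, since it restores the input width.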

[0084] S32. The RGB-related features obtained in step S31, along with the covariance matrix of the Gaussian function and the spherical harmonic function, are fed into the MLP to train the MLP to adjust the color features and opacity information according to the target viewpoint direction. The output of the MLP is the volume density update value and the color features that include the viewpoint direction dependence, thus completing the construction of the color model.

[0085] The H attention matrices obtained after multi-head self-attention computation are concatenated and multiplied by the weight matrix to obtain an attention matrix of the same dimension as the original input. This attention matrix is then added element-wise to the original input feature matrix and passed through the convolution, ReLU activation, and LayerNorm normalization layers of a feedforward neural network. This embodiment uses 8 heads in the multi-head self-attention module and alternately stacks 4 Transformer Layer blocks. Finally, the output tensor is fed into a fully connected network with Sigmoid activation to obtain the RGB values of the 3D sampling points on the ray.

[0086] After extracting color information related to the viewpoint direction from the feature map extracted by the 2D image feature extraction network, it is fed into the MLP to train the model to change the color features and opacity information according to the target viewpoint direction. The final MLP model is shown in the following equation.

[0087] ;

[0088] Given the covariance matrix of the Gaussian function, the spherical harmonic function c, and the target viewpoint direction d, the MLP returns the updated volume density value together with the color features that depend on the viewpoint direction.

[0089] Traditional 3DGS color modeling does not incorporate viewpoint information, and directly inputting position information and ray direction causes the deep network to ignore high-frequency information, resulting in blurred rendered images. It also cannot reproduce the color changes on object surfaces caused by light reflection and shadow under different viewpoints, so the color features lack viewpoint adaptability.

[0090] In this embodiment, feature concatenation and high-dimensional position encoding are used in preprocessing to avoid the loss of high-frequency information and improve image detail. By introducing a global multi-view self-attention mechanism with relative direction vectors, the viewpoint differences between multiple views are captured, and color features containing viewpoint direction information are extracted. Then, by combining the covariance matrix of the Gaussian function and the spherical harmonic function, a color model is constructed through the MLP to adjust the color features and opacity information according to the viewpoint. As a result, the color of the rendered image better matches the scene as seen from the target viewpoint, light reflection and shadow effects are more realistic, and color realism and viewpoint consistency are improved.

[0091] Step S4 generates the new viewpoint image. Based on the 3DGS training process, and combining the depth information of the Gaussian function obtained in step S2 with the color features and volume density update values obtained in step S3, the mean, covariance matrix, color parameters, and opacity parameters of the Gaussian function are updated. An adaptive density control strategy is employed during the update: when the scaling of a Gaussian function exceeds a threshold, it is split into smaller Gaussian functions; when the scaling is below the threshold, the current Gaussian function is cloned; finally, Gaussian functions with excessively low opacity or excessively large scaling are pruned so that the Gaussian functions remain reasonably distributed. All updated Gaussian functions are projected onto a two-dimensional plane using splatting-based rasterization, and the color and opacity information of all overlapping Gaussian functions is blended. The color blending calculation satisfies:

[0092] $C(p) = \sum_{i} c_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$;

[0093] where $C(p)$ is the color information of pixel p in the new viewpoint image, $c_i$ is the color information of the i-th Gaussian function, $\alpha_i$ is the opacity of the i-th Gaussian function, and $\alpha_j$ is the opacity of the j-th Gaussian function. Meanwhile, 3DGS uses an adaptive density control module to improve rendering quality: when the scaling of a Gaussian function exceeds a threshold, it is split into smaller Gaussian functions; when the scaling is below the threshold, the current Gaussian function is cloned; finally, Gaussian functions with excessively low opacity or excessively large scaling are pruned.
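The blending of overlapping Gaussian functions at one pixel can be illustrated with the short sketch below. It assumes the standard front-to-back compositing used in 3DGS, in which each Gaussian's color is weighted by its opacity times the product of (1 - opacity) over the Gaussians in front of it, and it assumes the inputs are already sorted by depth. The function name `blend_pixel` is hypothetical.

```python
def blend_pixel(colors, alphas):
    """Front-to-back alpha blending of depth-sorted Gaussians covering one pixel.

    colors: list of (r, g, b) tuples, nearest Gaussian first.
    alphas: list of opacities in [0, 1], same order.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # product of (1 - alpha_j) over all nearer Gaussians
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return tuple(pixel)

# A half-transparent red Gaussian in front of a half-transparent green one.
print(blend_pixel([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5]))
# (0.5, 0.25, 0.0)
```

Tracking the running transmittance avoids recomputing the product for every Gaussian, and a fully opaque Gaussian drives it to zero so that everything behind it contributes nothing.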

[0094] In step S2 above, a multi-view feature extraction module is designed for the three-dimensional data collected in step S1 to obtain feature information from multiple views. The feature maps from the different perspectives are transformed to the target viewpoint by homography transformation, and prior depth information is obtained by cost-volume depth estimation. A progressive depth propagation strategy transfers depth information between adjacent Gaussian functions, and the prior depth information obtained from the multiple views supervises this transfer so that the background depth information is represented accurately.

[0095] In step S3, a global multi-view self-attention mechanism module is designed to extract, from the multi-view feature information, color features that include target viewpoint direction information. The color feature information generated by the neural network and the color information of the Gaussian function itself are fed into the neural network for feature fusion. As a result, light reflection and shadows on object surfaces are rendered more realistically in the image under the target viewpoint.

[0096] In step S4, after the 3D scene has been modeled, rendering pixel p from a given viewpoint requires blending all Gaussian functions that overlap that location to obtain the final color information. When the Gaussian function parameters are updated in step S4, an adaptive density control strategy is adopted: when the scaling factor of a Gaussian function exceeds a threshold, it is split into smaller Gaussian functions; when the scaling factor is below the threshold, the current Gaussian function is cloned; finally, Gaussian functions with excessively low opacity or excessively large scaling factors are pruned.
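The split, clone, and prune rules above can be sketched as a single pass over a list of Gaussian functions. This is a simplified illustration, not the applicant's procedure: each Gaussian is reduced to a scalar scale and an opacity, the `densify` flag is a hypothetical stand-in for whatever criterion selects a Gaussian for densification (the original 3DGS training loop gates this on a view-space positional gradient, which this text does not describe), and the shrink factor of 1.6 is an assumed value.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Gaussian:
    scale: float     # largest axis scale (hypothetical scalar summary)
    opacity: float
    densify: bool    # hypothetical flag: selected for densification this pass

def density_control(gaussians, split_scale, min_opacity, max_scale, shrink=1.6):
    """One pass of split / clone, followed by pruning."""
    grown = []
    for g in gaussians:
        if not g.densify:
            grown.append(g)
        elif g.scale > split_scale:
            # Split: replace one large Gaussian with two smaller ones.
            small = replace(g, scale=g.scale / shrink, densify=False)
            grown.extend([small, small])
        else:
            # Clone: keep the small Gaussian and add a copy of it.
            grown.extend([replace(g, densify=False)] * 2)
    # Prune last: drop nearly transparent or oversized Gaussians.
    return [g for g in grown
            if g.opacity >= min_opacity and g.scale <= max_scale]

scene = [
    Gaussian(4.0, 0.9, True),     # large: split into two smaller Gaussians
    Gaussian(0.5, 0.9, True),     # small: cloned
    Gaussian(0.5, 0.001, False),  # nearly transparent: pruned
    Gaussian(1.0, 0.9, False),    # left unchanged
]
result = density_control(scene, split_scale=2.0, min_opacity=0.01, max_scale=10.0)
print(len(result))  # 5
```

Pruning after the split and clone step follows the order given in the text; a real implementation would also offset the positions of the new Gaussians, which this sketch does not model.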

[0097] In this embodiment, the "feature extraction-depth estimation-depth optimization" closed loop in step S2 effectively solves the problems of distorted background depth information, insufficient initial depth, and difficulty in optimization in traditional 3DGS, significantly improving the accuracy and update stability of the Gaussian function depth parameters. Through feature preprocessing and view-dependent color modeling in step S3, high-frequency information loss is avoided, and viewpoint differences between multiple views are accurately captured, restoring the light reflection and shadow effects on the object surface under different viewpoints. Finally, through parameter updating and rasterization projection in step S4, the generated new viewpoint image has significantly improved depth accuracy, color fidelity, and scene detail restoration, possessing superior real-time high-quality rendering capabilities.

[0098] Finally, it should be noted that although the above embodiments have been described in the text and drawings of this application, this should not limit the scope of patent protection of this application. Any technical solutions that are based on the essential concept of this application and utilize the content described in the text and drawings of this application, resulting in equivalent structural or procedural substitutions or modifications, as well as the direct or indirect application of the technical solutions of the above embodiments to other related technical fields, are all included within the scope of patent protection of this application.

Claims

1. A 3DGS viewpoint synthesis method based on multi-view enhancement, characterized in that it includes the following steps:
Step S1: Collect multi-view observation data in the scene. The multi-view observation data includes camera data, raw image data, and SfM point cloud data. The camera data includes camera intrinsic and extrinsic parameter matrices and camera pose data. The raw image data includes multi-view images taken from different perspectives. The SfM point cloud data includes sparse point cloud data representing the 3D scene.
Step S2: Feed the raw image data in the collected multi-view observation data into the multi-view feature extraction network to extract the feature map of each multi-view image. Then obtain prior depth information from the feature information by homography transformation and cost-volume depth estimation. Subsequently, adopt a three-dimensional progressive propagation strategy, with the prior depth information as supervision, to continuously update the depth information of the Gaussian function.
Step S3: Use the global multi-view self-attention mechanism module to extract color information containing view direction information from the feature information obtained from multiple views. Then input the color information obtained from the global multi-view self-attention mechanism module and the Gaussian function into the MLP to construct a color model. The color model outputs color features and volume density update values that depend on the view direction.
Step S4: Based on the 3DGS training process, and combining the depth information of the Gaussian function obtained in step S2 with the color features and volume density update values obtained in step S3, update the mean, covariance matrix, color parameters, and opacity parameters of the Gaussian function. Project all the updated Gaussian functions onto the two-dimensional plane using splatting-based rasterization, blend the color information and opacity information of all overlapping Gaussian functions, and finally obtain the new viewpoint image.

2. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 1, characterized in that step S2 includes the following steps:
S21. For N multi-view images, first perform random stratified sampling on the ray emitted from the target viewpoint to obtain M 3D sampling points in the scene. Then, using the transformation matrix formed by the virtual camera pose of the target viewpoint and the camera poses of the S source view images, project the 3D sampling points in turn onto the camera plane where each source view is located. Obtain the feature query positions of the 3D sampling points through bilinear interpolation. Finally, feed the source views from the different perspectives into the multi-view feature extraction network to obtain dense feature maps.
S22. After obtaining feature information from multiple views, use homography transformation to transform the feature maps from the different perspectives to the target perspective, and estimate depth by constructing a cost volume, thereby obtaining prior depth information.
S23. Using the progressive propagation method, perform a homography transformation between pixel k and each candidate plane to find the possible corresponding pixels in adjacent views. Based on color consistency, select as the solution the plane with the highest color consistency between pixel k and its possible paired pixels, and use the selected candidate plane to update the depth information of pixel k.

3. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 2, characterized in that, in step S21, the calculation for projecting the 3D sampling points onto the camera plane where the source view is located satisfies: ; where the quantities are, in order: the coordinates of the 3D sampling point on the target ray projected onto the source view; the 3D coordinates of the j-th 3D sampling point on the ray from the target viewpoint; and $\theta$, a camera intrinsic and extrinsic parameter matrix, with $\theta_i$ being that of the i-th source view position and $\theta_t$ being that of the target viewpoint position. The projection function projects the 3D sampling points onto the source view plane according to the camera's spatial pose definition, satisfying the following correspondence: ; where (X, Y, Z) are the position coordinates of the 3D sampling point, K is the camera intrinsic parameter matrix, R is the rotation matrix, and T is the translation matrix.

4. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 2, characterized in that, in step S22, the constructed cost volume has dimensions (X, C, D, H, W), where X represents the number of feature maps, C represents the number of channels per feature map, D represents the depth dimension, H represents the feature map height, and W represents the feature map width; the homography transformation at depth d satisfies: $H_i(d) = K_i R_i \left( I - \frac{(t_1 - t_i)\, n_1^{T}}{d} \right) R_1^{T} K_1^{-1}$; where $K_i$, $R_i$, and $t_i$ are the intrinsic parameter matrix, rotation matrix, and translation matrix of the i-th camera, respectively, $n_1$ is the principal axis of the reference camera, and $I$ is the identity matrix.

5. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 2, characterized in that, in step S23, the homography transformation satisfies: ; where P is the target view; the matrix symbols denote the camera's intrinsic parameter matrix, rotation matrix, and translation matrix, respectively; s denotes the camera's principal axis; and d denotes the depth.

6. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 2, characterized in that, in step S22, the prior depth information is obtained following the MVSNet method: after the variance of the feature maps under different viewpoints is calculated, the cost volume is regularized using a 3D CNN, and the expectation along the depth d direction is then calculated for each pixel to obtain the initial depth value of the corresponding pixel, thereby obtaining the depth map under the target viewpoint.

7. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 2, characterized in that the specific implementation of step S3 includes:
S31. Preprocess the two-dimensional image features by concatenating the C-dimensional 2D feature vectors of the pixels corresponding to the 3D sampling points in space with the RGB values of those image pixels. Encode the spatial coordinates of the 3D sampling points into a high-dimensional position encoding vector. Add the high-dimensional position encoding vector to the concatenated feature vector element by element to obtain the input token. Multiply the input token by H weight matrices in a linear layer to obtain the Query, Key, and Value of each head in the multi-head self-attention block. Encode the relative direction vector between the target viewpoint and the source viewpoint into a high-dimensional vector as an additional input to the attention calculation. Obtain H attention matrices through multi-head self-attention calculation. Concatenate the H attention matrices and multiply them by the weight matrix to obtain an attention matrix of the same dimension as the original input. Add the attention matrix element by element to the original input feature matrix and pass the result through the convolution, ReLU activation, and LayerNorm normalization operations of the feedforward neural network. Finally, feed the output into a fully connected network with Sigmoid as the activation function for normalization to obtain the RGB-related features of the 3D sampling points on the ray.
S32. Feed the RGB-related features obtained in step S31, along with the covariance matrix of the Gaussian function and the spherical harmonic function, into the MLP, and train the MLP to adjust the color features and opacity information according to the target viewpoint direction. The output of the MLP is the volume density update value and the color features that depend on the viewpoint direction, thus completing the construction of the color model.

8. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 7, characterized in that, in step S31, the number of multi-head self-attention blocks is 8, and the number of alternately stacked TransformerLayer blocks is 4.

9. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 1, characterized in that, in step S4, the calculation for blending the color information and opacity information of all overlapping Gaussian functions satisfies: $C(p) = \sum_{i} c_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$; where $C(p)$ is the color information of pixel p in the new viewpoint image, $c_i$ is the color information of the i-th Gaussian function, $\alpha_i$ is the opacity of the i-th Gaussian function, and $\alpha_j$ is the opacity of the j-th Gaussian function.

10. The 3DGS viewpoint synthesis method based on multi-view enhancement according to claim 1, characterized in that, in step S4, when the Gaussian function parameters are updated, an adaptive density control strategy is adopted: when the scaling factor of a Gaussian function exceeds the threshold, it is split into smaller Gaussian functions; when the scaling factor is below the threshold, the current Gaussian function is cloned; finally, Gaussian functions with excessively low opacity or excessively large scaling factors are pruned.