Robust viewpoint compositing for unconstrained image data
A machine-learned viewpoint synthesis model with static and transient content sections and uncertainty handling addresses variable lighting and occlusions in uncontrolled images, enhancing realism and user control while reducing computational needs.
Patent Information
- Authority / Receiving Office: JP · JP
- Patent Type: Applications
- Current Assignee / Owner: GOOGLE LLC
- Filing Date: 2026-03-30
- Publication Date: 2026-07-02
AI Technical Summary
Existing viewpoint synthesis techniques struggle to model ubiquitous real-world phenomena such as variable lighting and transient occlusions in uncontrolled images, limiting their effectiveness in generating realistic composite images.
A machine-learned viewpoint synthesis model is developed with a base for static content and a transient content section, incorporating generative embeddings and uncertainty values to handle inconsistencies in uncontrolled images, allowing adjustments in camera parameters and modeling both static and transient content.
The model enhances robustness to uncontrolled images, improving realism, consistency, and user controllability in viewpoint synthesis, with faster convergence and reduced computational resources.
Description
[Technical Field]
Related Applications
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/059,322, filed on 31 July 2020. U.S. Provisional Patent Application No. 63/059,322 is incorporated herein by reference in its entirety.
[0002] This disclosure generally relates to systems and methods for modeling scenes to facilitate, for example, the performance of scene viewpoint synthesis. More specifically, this disclosure relates to machine learning models that provide improved robustness to ubiquitous real-world phenomena in uncontrolled images, such as variable lighting and transient occlusions.
[Background Art]
[0003] The task of viewpoint synthesis aims to generate new views of a particular subject (e.g., a scene or an object), starting from several photographs taken from given viewpoints. As an example, given several images of a particular subject captured from specific points with specific camera settings and orientations, a viewpoint synthesis system attempts to generate a composite image that appears as if it were captured from a virtual camera placed at a different point and having a given setting.
[0004] While several viewpoint synthesis techniques leverage neural radiance fields to effectively learn volumetric scene density and radiance from images captured in controlled environments, these techniques cannot model many ubiquitous real-world phenomena in uncontrolled images, such as variable lighting and transient occlusions.
[Prior Art Documents]
[Non-Patent Literature]
[0005] [Non-Patent Document 1] Representing Scenes as Neural Radiance Fields for View Synthesis by Mildenhall et al. (arXiv:2003.08934v1)
[Summary of the Invention]
[Means for Solving the Problem]
[0006] Aspects and advantages of the embodiments of this disclosure are partially described in the following description, or can be learned from the description, or can be learned through the implementation of the embodiments.
[0007] An exemplary aspect of the present disclosure relates to a computing system for generating a composite image of a scene. The computing system comprises one or more processors and one or more non-transitory computer-readable media that collectively store a machine-learned viewpoint synthesis model comprising a base, a static content section for modeling static content in the scene, and a transient content section for modeling transient content in the scene, and instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations include obtaining a position in three-dimensional space, processing data describing the position using the base of the machine-learned viewpoint synthesis model to generate a static opacity and a latent representation, processing the latent representation using the static content section of the machine-learned viewpoint synthesis model to generate a static color, processing the latent representation using the transient content section of the machine-learned viewpoint synthesis model to generate a transient opacity and a transient color, and performing volume rendering to generate composite pixel colors for composite pixels in the composite image from the static opacity, static color, transient opacity, and transient color.
[0008] Another exemplary aspect of this disclosure relates to a computer-implemented method for viewpoint synthesis having user-specifiable properties. The method includes obtaining, by a computing system including one or more computing devices, a desired location in three-dimensional space and user-specified generative embeddings, the generative embeddings encoding one or more visual properties of the resulting composite image. The method includes processing, by the computing system, data describing the location using the base of a machine-learned viewpoint synthesis model to generate opacity and latent representations. The method includes processing, by the computing system, the latent representations and generative embeddings using the content portion of the machine-learned viewpoint synthesis model to generate color. The method includes performing, by the computing system, volume rendering to generate composite pixel colors for composite pixels of the composite image from the opacity and color, the composite image exhibiting the one or more visual properties encoded by the generative embeddings.
[0009] Another exemplary aspect of the disclosure relates to one or more non-transitory computer-readable media that collectively store instructions which, when executed by one or more processors, cause a computing system to perform operations. The operations include obtaining a position in three-dimensional space and one or more camera parameters associated with an existing training image. The operations include processing data describing the position and the one or more camera parameters using the base of a machine-learned viewpoint synthesis model to generate opacity and latent representations. The operations include processing the latent representations using the content portion of the machine-learned viewpoint synthesis model to generate colors. The operations include performing volume rendering to generate composite pixel colors for composite pixels in a composite image from the opacity and colors. The operations include evaluating a loss function that compares the composite pixel colors to ground truth pixel colors for training pixels contained in the existing training image. The operations include modifying one or more values of the camera parameters based at least in part on the loss function.
[0010] Another exemplary aspect of the present disclosure relates to a computing system for generating a composite image of a scene. The computing system comprises one or more processors and one or more non-transitory computer-readable media that collectively store a machine-learned viewpoint synthesis model having a base and a static content section that models static content in a scene, the base and static content section having been trained together with a transient content section that models transient content in a scene, and instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations include obtaining a position in three-dimensional space, processing data describing the position using the base of the machine-learned viewpoint synthesis model to generate a static opacity and a latent representation, processing the latent representation using the static content section of the machine-learned viewpoint synthesis model to generate a static color, and performing volume rendering to generate composite pixel colors for composite pixels in the composite image from the static opacity and static color.
[0011] Other aspects of this disclosure cover a variety of systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
[0012] These and other features, aspects, and advantages of the various embodiments of this disclosure will be better understood by referring to the following description and the appended claims. The appended drawings incorporated herein and forming part thereof illustrate exemplary embodiments of this disclosure and, together with this description, are useful for illustrating the relevant principles.
[0013] A detailed description of embodiments intended for those skilled in the art is provided herein, and this specification refers to the accompanying figures.
[Brief Description of the Drawings]
[0014] [Figure 1] A block diagram of an exemplary process for training a machine-learned viewpoint synthesis model according to an exemplary embodiment of the present disclosure and then using the machine-learned viewpoint synthesis model to perform viewpoint synthesis.
[Figure 2] A block diagram of an exemplary process for training a machine-learned viewpoint synthesis model according to an exemplary embodiment of the present disclosure.
[Figure 3] A block diagram of an exemplary process for using a machine-learned viewpoint synthesis model according to an exemplary embodiment of the present disclosure.
[Figure 4] A block diagram of an exemplary machine-learned viewpoint synthesis model according to an exemplary embodiment of the present disclosure.
[Figure 5A] A block diagram of an exemplary computing system according to an exemplary embodiment of the present disclosure.
[Figure 5B] A block diagram of an exemplary computing device according to an exemplary embodiment of the present disclosure.
[Figure 5C] A block diagram of an exemplary computing device according to an exemplary embodiment of the present disclosure.
Mode for Carrying Out the Invention
[0015] Reference numerals repeated across multiple figures identify the same features in various implementations.
Overview
[0016] Generally, the present disclosure is directed to systems and methods for synthesizing novel views of complex scenes (e.g., outdoor scenes). In some implementations, the systems and methods can include or use a machine-learned model that can learn, for example, from unstructured and/or unconstrained collections of images such as "wild" photographs. In particular, exemplary implementations of the present disclosure can learn volumetric scene density and radiance represented by a machine-learned model such as one or more multilayer perceptrons (MLPs), other neural networks, or other machine-learned models.
[0017] More specifically, several techniques, such as those described in Mildenhall et al., NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (arXiv:2003.08934v1) (hereafter "NeRF"), are effective for images captured in a controlled environment. However, these techniques cannot model many ubiquitous real-world phenomena in uncontrolled images, such as variable lighting and transient occlusions.
[0018] Specifically, NeRF is built on the assumption of 3-D consistency, that is, any two photographs taken of the same scene are assumed to be consistent with each other. However, in uncontrolled environments (e.g., "wild" photographs), two images of the same scene often have inconsistencies, including differences in lighting or other visual characteristics and/or the presence or absence of transient content (e.g., each of two photographs of a landmark may have a different set of tourists positioned in front of the landmark).
[0019] NeRF learns a per-scene model described by a (learned) function. This function takes a location (x, y, z) and a viewing direction (theta, phi) and produces an RGB color and an opacity (sigma). To generate a render, a ray is traced through each pixel of the camera, and the colors along the ray, weighted by the accumulated opacity, are integrated. The NeRF model is trained to minimize the squared error between the predicted RGB values and the ground truth image. Therefore, both photographs and camera parameters are needed to train NeRF. The photograph indicates what color each pixel should be, and the camera parameters indicate where in 3-D space the ray emanating from that pixel lies.
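As an illustration of the rendering step described in this paragraph, the integral along a ray can be approximated by sampling a finite number of points. The following is a minimal Python sketch under stated assumptions, not the implementation of this disclosure: it uses the piecewise-constant quadrature described in the cited NeRF paper, and the function name `render_ray` and all sample values are hypothetical.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Composite per-sample opacity and color along one ray.

    sigmas: opacity (density) at each sample, ordered from near to far
    colors: (r, g, b) tuple at each sample
    deltas: length of the ray segment represented by each sample
    Returns the composited (r, g, b) pixel color.
    """
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # absorption within this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(pixel)


# Hypothetical ray: empty space, then a dense red surface, then a blue one behind it.
pixel = render_ray(
    sigmas=[0.0, 50.0, 50.0],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[1.0, 1.0, 1.0],
)
```

In this example the red surface absorbs almost all of the light, so the resulting pixel is essentially red and the blue surface behind it contributes almost nothing.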
[0020] This disclosure provides several techniques for improving the NeRF model or a similar model. These additions enable the proposed model to become more robust to cluttered real-world photographs.
[0021] In particular, according to one aspect of this disclosure, while training a machine-learned viewpoint synthesis model on a set of training images, each training image may be assigned a generative embedding. The generative embedding can serve to encode the visual characteristics of the image, such as lighting characteristics (e.g., white balance), content characteristics (e.g., time of day, weather, season), and style characteristics (e.g., photo editing software settings).
[0022] Therefore, in some implementations, each training image may be assigned a small generative embedding (e.g., a vector of 8-32 floats), indexed, for example, by the image ID. The use of such generative embeddings allows the model to account for variations between individual images, such as white balance post-processing and stylization.
[0023] These generative embeddings can be learned / updated as part of the model. For example, generative embeddings can be treated as model parameters and updated during model training (e.g., during or as a result of backpropagation of the loss function). Through such a process, generative embeddings for images with similar properties can be moved closer to each other.
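The idea of treating embeddings as model parameters can be illustrated with a deliberately reduced toy example. In the sketch below, which is an assumption made for illustration and not the disclosed model, each image has a single scalar embedding that is added to a shared scene value, and both are updated by plain gradient descent on a squared-error loss. The function name `fit_scene_and_embeddings`, the brightness values, and the learning rate (chosen for this three-image example) are hypothetical.

```python
def fit_scene_and_embeddings(observed, steps=200, learning_rate=0.1):
    """Jointly fit a shared scene value and one scalar embedding per image.

    observed: one brightness value per training image (toy stand-in for pixels).
    The prediction for image i is scene + embeddings[i]; the loss is the
    summed squared error over all images.
    """
    scene = 0.0
    embeddings = [0.0] * len(observed)
    for _ in range(steps):
        residuals = [scene + e - y for e, y in zip(embeddings, observed)]
        # Gradients of the summed squared error with respect to each parameter.
        scene_grad = 2.0 * sum(residuals)
        embedding_grads = [2.0 * r for r in residuals]
        scene -= learning_rate * scene_grad
        embeddings = [e - learning_rate * g for e, g in zip(embeddings, embedding_grads)]
    loss = sum((scene + e - y) ** 2 for e, y in zip(embeddings, observed))
    return scene, embeddings, loss


# Hypothetical brightness of the same surface in three differently exposed photos.
observed = [0.50, 0.70, 0.40]
scene, embeddings, loss = fit_scene_and_embeddings(observed)
```

In this toy setting, images with similar observed brightness end up with similar embedding values, which mirrors the behavior described above for images with similar properties.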
[0024] Next, during viewpoint synthesis, desired generative embeddings that encode desired characteristics (e.g., desired lighting characteristics, content characteristics, style characteristics, etc.) may be specified by the user and supplied to the model. In response, the model can generate a composite image having the desired characteristics encoded by the desired generative embeddings. In one example, the desired generative embeddings may be obtained by interpolating two or more generative embeddings learned for two or more training images selected by the user (for example, because they exhibit / represent the desired characteristics).
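The interpolation mentioned in this paragraph could, for example, be a simple linear blend. The sketch below assumes linear interpolation between two embeddings, which the disclosure does not mandate; the function name `interpolate_embeddings` and the embedding values are hypothetical.

```python
def interpolate_embeddings(emb_a, emb_b, weight):
    """Linearly blend two generative embeddings.

    weight=0.0 returns emb_a, weight=1.0 returns emb_b.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - weight) * a + weight * b for a, b in zip(emb_a, emb_b)]


# Hypothetical learned embeddings for two user-selected training images.
sunny = [0.8, -0.2, 0.1, 0.5]
dusk = [-0.4, 0.6, 0.3, -0.1]
halfway = interpolate_embeddings(sunny, dusk, 0.5)
```

The blended vector would then be supplied to the model as the desired generative embedding in place of an embedding learned for any single training image.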
[0025] Therefore, the model can use generative embeddings to capture differences in image properties that do not reflect actual differences within the underlying scene. For example, differences in post-processing settings for a landmark image do not actually reflect differences in the landmark itself. This allows the trained model to handle inconsistencies between training images better, which in turn makes the model more robust to unconstrained sets of training images.
[0026] In another exemplary aspect, some implementations of the present disclosure directly incorporate uncertainty into the model's volume rendering pipeline. For example, in addition to static opacity and static color, the model can also predict an uncertainty value "β", a transient opacity, and a transient color. These values can be integrated along each ray in the same way as the static data to obtain a learned weight multiplier for each ray. Intuitively, this allows the model to lower the priority of reconstructing the "hard" parts of an image that do not match other photographs (e.g., often corresponding to transient and inconsistent occlusions in the image).
[0027] Accordingly, exemplary implementations of this disclosure have a novel model architecture comprising a base, a static content section for modeling static content in a scene, and a transient content section for modeling transient content in a scene. In some implementations, the loss function on which the model is trained may result in a reduced amount of loss for instances where the difference between the predicted composite color and the ground truth color arises from transient content modeled by the viewpoint synthesis model. Thus, the model has the ability to deviate from the ground truth image when the ground truth image represents transient content or, in some cases, is associated with high uncertainty. This allows the trained model to better handle inconsistencies between training images, which in turn makes the model more robust to unconstrained sets of training images.
[0028] In another exemplary aspect, several implementations of this disclosure may directly fine-tune the camera parameters of each training image within the model itself. Exemplary camera parameters include orientation, location, focal length, principal point, skew, radial distortion, tangential distortion, and/or various other camera intrinsics. This allows the model to adjust the camera parameters (e.g., within certain thresholds or tolerances) to best match the reconstructed scene, without relying on the sparse image features commonly used in 3-D registration. In other words, by allowing the camera parameters for each training image to be modified, some of the noise present in the training dataset can be removed. This allows the trained model to better handle inconsistencies between training images, which results in the model being more robust to unconstrained sets of training images.
[0029] The systems and methods of this disclosure offer several technical effects and benefits. For example, the systems and methods described herein enable models to become more robust to uncontrolled images. This results in improved model performance when synthesizing viewpoints of a scene based on uncontrolled training datasets. Thus, the systems and methods of this disclosure improve the realism, consistency, and user controllability of viewpoint synthesis systems.
[0030] Another exemplary technical effect and benefit of providing a model with a clear mechanism for identifying and/or correcting inconsistencies within the training dataset is that the model can converge to the optimal solution faster (e.g., with fewer training iterations and/or requiring fewer training images). Faster convergence can result in savings of computational resources, such as reduced processor usage, memory usage, and/or bandwidth usage.
[0031] Accordingly, this disclosure provides systems and methods for solving problems related to unconstrained images and/or providing additional user controllability. Exemplary experimental data contained in U.S. Provisional Patent Application No. 63/059,322 demonstrate the effectiveness of the proposed technique through comprehensive synthetic and real-world experiments. The exemplary experiments apply an exemplary implementation of the system described herein to multiple landmarks using both high-resolution captures and internet photographs, resulting in photorealistic reconstructions and results that significantly surpass previous works.
[0032] While the exemplary implementations of this disclosure are described in relation to a single frame of an image, the systems and methods of this disclosure may equally apply to video or other multi-frame datasets. For example, a video showing "flying over" a modeled scene may be synthesized. The frames of the video may be temporally and stylistically consistent (for example, through the consistent use of the same desired generative embedding for all frames of the synthesized video).
[0033] In addition, the frames of the synthesized image may have the same or different resolution as the training images on which the model is trained. For example, a model may be trained on low-resolution images and then used to generate high-resolution synthesized images.
[0034] Color data used and/or generated by the systems and methods of this disclosure (e.g., input data colors, static colors, transient colors, composite colors, etc.) may be represented in any color space, including, for example, RGB space (e.g., RGB, RGBA), CIE space, HSV and HSL space, CMYK space, grayscale, and/or other color spaces.
[0035] Next, exemplary embodiments of this disclosure will be described in more detail with reference to the figures.
Exemplary Techniques
[0036] This section describes an exemplary implementation of the proposed method, referred to in some implementations as NeRF-W. The described model is well suited to reconstructing 3-D scenes from "wild" photo collections and improves upon the NeRF model described by Mildenhall et al.
[0037] At its core, NeRF relies on multi-view consistency, meaning that every point in 3-D space must appear the same from all (unobstructed) viewpoints. While this condition is relaxed by conditioning on the viewing direction, NeRF cannot capture significant color variations. In particular, NeRF generally handles color variations by attributing them to the viewing angle. Therefore, a single image may look normal, but validation views or sets of multiple images from different viewing angles (e.g., a fly-through video) may be inconsistent.
[0038] When photographs are captured in a controlled setting by a single person, the assumptions of NeRF are satisfied, and realistic reconstruction is achievable. However, unconstrained photo collections, such as amateur photographs of famous landmarks, present many challenges that NeRF cannot handle.
[0039] For example, unconstrained images may show the same scene but in different weather conditions. Photographs are taken at different times under variable lighting conditions. In outdoor photography, time and weather directly affect the colors of all objects in the scene. Furthermore, the sky itself changes over time.
[0040] As another example, an unconstrained image may show the same scene but with different post-processing, and post-processing of a photograph, including exposure and white balance adjustments, is not constrained at all and further affects the color of all objects in the scene.
[0041] As yet another example, unconstrained images may depict the same scene but with different transient objects. Since the photographs are not captured at a single point in time, transient objects, including people, flags, and foliage, vary from image to image.
[0042] In the following description (and also in the summary above), this disclosure proposes several enhancements that are directly designed to address these phenomena.
[0043] Figure 1 shows a block diagram of an exemplary process for training a machine-learned viewpoint synthesis model and then performing viewpoint synthesis using the machine-learned viewpoint synthesis model, according to an exemplary embodiment of the present disclosure.
[0044] Referring to Figure 1, the training dataset 12 may include existing training images representing a scene. As described herein, the training images are not constrained and may exhibit various inconsistencies with one another. As shown at 14, the computing system can perform a model optimization or training process on the training dataset 12 to generate a machine-learned viewpoint synthesis model 16 (see, for example, Figure 2). After training, a desired location 18 for a synthesized image may be provided to the model 16. In response, the model 16 can generate a synthesized image 20 representing the scene from location 18 (see, for example, Figure 3).
[0045] Figure 2 shows a block diagram of an exemplary process for training a machine-learned viewpoint synthesis model according to an exemplary embodiment of the present disclosure. In some implementations, the process shown in Figure 2 may be performed for each pixel of each training image.
[0046] Referring to Figure 2, the training position 22 of existing training images may be provided to the machine-learned viewpoint synthesis model 24. The position 22 may include the location and orientation of the camera that took the training images. In addition, according to aspects of this disclosure, in some implementations, one or more camera parameters 25 for the training images and/or training image embeddings 26 for the training images may be provided to the machine-learned viewpoint synthesis model 24. For example, the additional camera parameters 25 may include focal length, principal point, skew, radial distortion, tangential distortion, and/or various camera intrinsics. The training image embeddings 26 may be generative embeddings assigned to the training images.
[0047] In particular, at the heart of the challenges presented by "wild" images is image-to-image color variation: the 3-D geometry of the scene is assumed to be identical across all images, while little color consistency can be expected due to variations in lighting and in camera settings such as exposure.
[0048]-[0052] To solve this problem, in some implementations, each image in the training set can be assigned a unique embedding e^(g) 26. These embeddings can be optimized during training in conjunction with the model parameters.
[0053] Referring again to Figure 2, the machine-learned viewpoint synthesis model 24 can process the input data to generate opacity and color data 27. For example, in some implementations, the machine-learned viewpoint synthesis model 24 can generate only a single set of opacity and color data 27, or in other implementations, it can generate both a static set of opacity and color data for the static content of the scene and a transient set of opacity and color data 27 for the transient content of the scene.
[0054] As an example, the differential opacity σ(r) and color c(r, d) 27 can be predicted by a multilayer perceptron (MLP) or other model (e.g., a neural network or some other form of machine-learned model) given a 3-D location r(t) and viewing direction d. This MLP or other model can be explicitly designed to ensure that the viewing direction d does not affect the differential opacity σ. For example, the base of the model can predict opacity from location only, while color can be predicted from both location and viewing direction. In some implementations, the input to this MLP can additionally be augmented with an embedding e^(g):
(c, σ) = MLP(r(t), d, e^(g))
Here, e^(g) is the generative embedding corresponding to the image being rendered. As with the viewing direction d, some exemplary implementations ensure that the generative embedding e^(g) does not affect the differential opacity σ.
[0055]-[0056] By augmenting the MLP input with the embedding e^(g), the proposed model can directly change the color and lighting of a scene based on the image's identity without modifying its 3-D geometry.
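The separation described above, in which opacity depends on location only while color also depends on the viewing direction and the generative embedding, can be sketched as a small two-stage network. The code below is an untrained toy with random weights, written under assumptions that are not taken from this disclosure: the class name `TinyViewModel`, the layer sizes, the activation functions, and the representation of the viewing direction as a 3-D vector are all hypothetical.

```python
import math
import random


def linear(inputs, layer):
    """Apply one fully connected layer given as (weights, biases)."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(rng, n_in, n_out):
    """Random, untrained weights; a real model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.1] * n_out


class TinyViewModel:
    def __init__(self, embed_dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.base = init_layer(rng, 3, hidden)            # sees the position only
        self.opacity_head = init_layer(rng, hidden, 1)    # fed by the base only
        self.color_head = init_layer(rng, hidden + 3 + embed_dim, 3)

    def __call__(self, position, direction, embedding):
        z = relu(linear(position, self.base))             # latent representation
        sigma = relu(linear(z, self.opacity_head))[0]     # non-negative opacity
        color_in = z + list(direction) + list(embedding)  # color also sees d and e^(g)
        color = [sigmoid(v) for v in linear(color_in, self.color_head)]
        return color, sigma


model = TinyViewModel()
point = [0.2, -0.1, 0.7]
color_a, sigma_a = model(point, [0.0, 0.0, 1.0], [0.1, 0.2, 0.3, 0.4])
color_b, sigma_b = model(point, [1.0, 0.0, 0.0], [0.9, -0.5, 0.0, 0.2])
```

Because the opacity head never receives the direction or the embedding, the two calls return the same opacity for the same point while the colors are free to differ.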
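[0057] At 28, volume rendering techniques may be used to generate a composite pixel color from the opacity and color data 27. For example, for a single set of opacity and color data 27, the composite pixel color may be obtained by integrating along the ray emitted from the camera:
[0058]
$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(t)\,c(t)\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(s)\,ds\right) \tag{1}$$
where σ(t) and c(t) denote the opacity and color at the point r(t) along the ray, and t_n and t_f are the near and far bounds of integration.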
[0059] In another exemplary aspect, while GLO (generative latent optimization, that is, the generative embeddings described above) can capture variable lighting and post-processing, it cannot model variations in 3-D geometry. Therefore, some implementations of this disclosure feature a two-headed model that includes both a static part for modeling static content and a transient part for modeling transient content. Unlike implementations that produce a single tuple (σ, c) for each position in 3-D space, the proposed model has two heads: a "static" head that produces (σ_s, c_s), common to all images, and a "transient" head that produces (σ_t, c_t), specific to a particular image. These quantities can be combined with a modified version of the volume rendering equation presented in Equation 1. Further, the proposed model outputs an uncertainty estimate β that is used to modulate the loss function for each pixel. Figure 4 provides a diagram of the architecture of the proposed model.
[0060] Specifically, referring now to Figure 4, a block diagram of an exemplary machine-learned viewpoint synthesis model according to an exemplary embodiment of the present disclosure is provided. The model can include a base 34, a static portion 36, and a transient portion 38.
[0061] Given a 3-D point r(t) 40, a GLO embedding e^(g) 48, and an uncertainty embedding e^(u) 52, the model shown in Figure 4 produces differential opacities σ_s 44 and σ_t 54, colors c_s 45 and c_t 56, and a differential uncertainty β 58. The positional encoding, the viewing direction, and the nonlinearities are omitted for clarity. As described above, additional camera parameters 42 can optionally be provided as well.
[0062] In some implementations, the base 34 of the exemplary proposed model includes an MLP applied to the 3-D point r(t) 40. Other models may be used as well. This MLP outputs a differential opacity σ_s 44 and a latent representation z 46. The latent representation z 46 is employed in two ways. First, the static portion 36 (e.g., which can include a 4-layer MLP or other model) is applied to z 46 and the viewing direction d to produce the color c_s 45, similar to the MLP of NeRF. Second, the transient portion 38 (e.g., which can include a second 4-layer MLP or other model) is applied to z 46 augmented with a per-image embedding e^(u) 52 to produce three quantities: a transient differential opacity σ_t 54, a transient color c_t 56, and an uncertainty value β 58. In some implementations, L1 regularization can be applied to σ_t to encourage sparsity.
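[0063]-[0066] Referring to Figures 2 and 4 together, the following variation of the volume rendering equation can be used at 28 to produce the color for a single pixel from both the static and transient data 27:
$$C(r) = \int_{t_n}^{t_f} T(t)\,\bigl(\sigma_s(t)\,c_s(t) + \sigma_t(t)\,c_t(t)\bigr)\,dt$$
$$T(t) = \exp\left(-\int_{t_n}^{t} \bigl(\sigma_s(s) + \sigma_t(s)\bigr)\,ds\right)$$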
[0067] Note that, relative to Equation 1, the opacity-weighted color σc is replaced by the linear combination σ_s c_s + σ_t c_t, and the accumulated opacity σ is replaced by the sum σ_s + σ_t.
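One way to discretize the combined static and transient rendering described above is to treat each sample as a segment of constant opacity and color. The sketch below is an assumption made for illustration, not the implementation of this disclosure: the function name `render_static_transient`, the `include_transient` flag, and the sample values are hypothetical.

```python
import math


def render_static_transient(sig_s, col_s, sig_t, col_t, deltas, include_transient=True):
    """Composite static and transient samples along one ray.

    sig_s, sig_t: static / transient opacity per sample, near to far
    col_s, col_t: static / transient (r, g, b) color per sample
    deltas: length of the ray segment represented by each sample
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for s_s, c_s, s_t, c_t, delta in zip(sig_s, col_s, sig_t, col_t, deltas):
        if not include_transient:
            s_t = 0.0  # drop the transient head entirely
        total = s_s + s_t  # transmittance decays with the summed opacity
        alpha = 1.0 - math.exp(-total * delta)
        if total > 0.0:
            # Split the absorbed light between the heads in proportion to opacity.
            for k in range(3):
                mixed = (s_s * c_s[k] + s_t * c_t[k]) / total
                pixel[k] += transmittance * alpha * mixed
        transmittance *= 1.0 - alpha
    return tuple(pixel)


# Hypothetical ray: a transient red occluder in front of a static white wall.
ray = dict(
    sig_s=[0.0, 50.0], col_s=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    sig_t=[50.0, 0.0], col_t=[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    deltas=[1.0, 1.0],
)
with_occluder = render_static_transient(**ray)
static_only = render_static_transient(**ray, include_transient=False)
```

With the transient head included, the pixel takes the occluder's red color; with it disabled, the same ray shows the white wall, which corresponds to rendering only the static content of the scene.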
[0068] In some implementations, such as the example shown in Figure 4, the model is allowed to emit an uncertainty estimate β 58. At training time, the accumulated opacity equation can be used to obtain the uncertainty for the corresponding predicted color C(r).
[0069] [Equation not reproduced in the source text: the accumulated uncertainty for the predicted color C(r).]
[0070] As an example, the loss for a single pixel having ground truth color y can therefore be given by the following formula.
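[0071] [Equation not reproduced in the source text: the loss for a single pixel in terms of the predicted color C(r), the ground truth color y, and the uncertainty β.]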
[0072] Intuitively, a larger value of β allows the model to reduce the weight of outliers, which are generally transient or moving objects such as people, grass, or clouds. The logarithmic term prevents β from growing without bound and can be derived directly from the likelihood of a normal distribution. In some implementations, a hyperparameter β_min ≥ 0 may be used to prevent the model from concentrating a large portion of the loss on a small number of pixels.
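The weighting behavior described in this paragraph can be illustrated with a loss modeled on the negative log-likelihood of a normal distribution whose standard deviation is β. The sketch below is an assumption about the general form, not the formula of this disclosure: the function name `uncertainty_weighted_loss` is hypothetical, and treating β_min as an additive floor on the predicted uncertainty is a guess made only for illustration.

```python
import math


def uncertainty_weighted_loss(predicted, target, raw_beta, beta_min=0.0):
    """Squared error scaled by 1 / (2 * beta^2), plus a log(beta) penalty.

    predicted, target: (r, g, b) colors
    raw_beta: non-negative uncertainty predicted for this pixel
    beta_min: assumed additive floor on the uncertainty (hypothetical usage)
    """
    beta = beta_min + raw_beta
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    squared_error = sum((p - y) ** 2 for p, y in zip(predicted, target))
    # The first term shrinks as beta grows; the log term grows with beta,
    # so the model cannot reduce the loss by inflating beta indefinitely.
    return squared_error / (2.0 * beta ** 2) + math.log(beta)


# Hypothetical outlier pixel: the model predicts red where the photo is blue.
outlier_confident = uncertainty_weighted_loss((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.1)
outlier_uncertain = uncertainty_weighted_loss((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0)
# Hypothetical well-explained pixel: prediction equals the ground truth.
match_confident = uncertainty_weighted_loss((0.5, 0.5, 0.5), (0.5, 0.5, 0.5), 0.1)
match_uncertain = uncertainty_weighted_loss((0.5, 0.5, 0.5), (0.5, 0.5, 0.5), 1.0)
```

Under these assumptions, a high uncertainty lowers the loss for the mismatched pixel, while for the matching pixel a low uncertainty is rewarded, so the model is only encouraged to raise β where the photograph genuinely disagrees with the reconstruction.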
[0073] More generally, referring again to Figure 2, the loss function 30 can evaluate the difference between the composite pixel color generated at 28 and the ground truth pixel color 32 of the existing training image. For example, the squared error between pixel colors represented in RGB or some other color space may be used.
[0074] The loss function 30 can be backpropagated to train the machine-learned viewpoint synthesis model 24. In addition, in some implementations, the training image embeddings 26, training positions 22, and/or camera parameters 25 can also be updated based on the loss function 30 (for example, by continuing to backpropagate the loss through and beyond the model 24).
[0075] At test time, the model 24 can be used to render the static geometry common to all photos in the training set. In some implementations, images can be rendered by omitting σ_t, c_t, and β entirely.
[0076] As an example, Figure 3 shows an exemplary use of a trained machine-learned viewpoint synthesis model. Specifically, a desired position 40 (e.g., location and orientation) for the synthesized image of the scene is provided. Optionally, desired camera parameters 42 and/or desired generative embeddings 44 may also be provided.
[0077] The machine-learned viewpoint synthesis model 24 can process these inputs to generate opacity and color data 27 (for example, a single set of opacity and color data, both static and transient opacity and color data, or only static opacity and color data). Volume rendering 28 can be performed on the opacity and color data (for example, on the static data only) to generate composite pixel colors for the pixels of the composite image.
[0078] The process shown in Figure 3 can be performed for each pixel of the composite image.
Exemplary devices and systems
[0079] Figure 5A shows a block diagram of an exemplary computing system 100 according to an exemplary embodiment of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150, which are communicably coupled via a network 180.
[0080] The user computing device 102 may be any type of computing device, such as a personal computing device (e.g., a laptop or desktop), a mobile computing device (e.g., a smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
[0081] The user computing device 102 includes one or more processors 112 and memory 114. The one or more processors 112 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.) and may be one processor or multiple processors that are operatively connected. The memory 114 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, and combinations thereof. The memory 114 can store data 116 and instructions 118 that are executed by the processors 112 to cause the user computing device 102 to perform operations.
[0082] In some implementations, the user computing device 102 may store or contain one or more machine learning models 120. For example, the machine learning models 120 may be, or may otherwise include, a variety of machine learning models, such as neural networks (e.g., deep neural networks) or other types of machine learning models, including nonlinear and/or linear models. Neural networks may include feedforward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks. Exemplary machine learning models 120 are discussed with reference to Figures 1 to 4.
[0083] In some implementations, one or more machine learning models 120 may be received from a server computing system 130 via a network 180, stored in user computing device memory 114, and then used or otherwise implemented by one or more processors 112. In some implementations, the user computing device 102 may implement multiple parallel instances of a single machine learning model 120 (for example, to perform parallel viewpoint synthesis across multiple examples of the same or different scenes).
[0084] Additionally or alternatively, one or more machine learning models 140 may be included in, or otherwise stored and implemented by, a server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, a machine learning model 140 may be implemented by the server computing system 130 as part of a web service (e.g., a viewpoint synthesis service). Thus, one or more models 120 may be stored and implemented at the user computing device 102, and/or one or more models 140 may be stored and implemented at the server computing system 130.
[0085] The user computing device 102 may also include one or more user input components 122 that receive user input. For example, the user input component 122 may be a touch-sensitive component (e.g., a touch-sensitive display screen or touchpad) that is sensitive to the touch of a user input object (e.g., a finger or stylus). The touch-sensitive component may be useful for implementing a virtual keyboard. Other exemplary user input components include a microphone, a conventional keyboard, or other means by which the user can provide user input.
[0086] The server computing system 130 includes one or more processors 132 and memory 134. The one or more processors 132 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.) and may be one processor or multiple processors that are operatively connected. The memory 134 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, and combinations thereof. The memory 134 can store data 136 and instructions 138 that are executed by the processors 132 to cause the server computing system 130 to perform operations.
[0087] In some implementations, the server computing system 130 includes one or more server computing devices, or is otherwise implemented by one or more server computing devices. In cases where the server computing system 130 includes multiple server computing devices, such server computing devices can operate according to a sequential computing architecture, a parallel computing architecture, or any combination thereof.
[0088] As described above, the server computing system 130 may store or otherwise include one or more machine learning models 140. For example, the models 140 may be various machine learning models or otherwise include them. Exemplary machine learning models include neural networks or other multilayer nonlinear models. Exemplary neural networks include feedforward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Exemplary models 140 are discussed with reference to Figures 1 to 4.
[0089] The user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 by interacting with a training computing system 150 that is communicatively connected via the network 180. The training computing system 150 may be separate from the server computing system 130 or may be part of the server computing system 130.
[0090] The training computing system 150 includes one or more processors 152 and memory 154. The one or more processors 152 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.) and may be one processor or multiple processors that are operatively connected. The memory 154 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, and combinations thereof. The memory 154 can store data 156 and instructions 158 that are executed by the processors 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes one or more server computing devices, or is otherwise implemented by one or more server computing devices.
[0091] The training computing system 150 may include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as backpropagation. For example, a loss function may be backpropagated through the model to update one or more parameters of the model (for example, based on the gradient of the loss function). Various loss functions may be used, such as mean squared error, likelihood loss, cross-entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques may be used to iteratively update the parameters over a number of training iterations.
[0092] In some implementations, performing backpropagation may include performing truncated backpropagation through time. The model trainer 160 may apply a number of generalization techniques (e.g., weight decay, dropout, etc.) to improve the generalization ability of the model being trained.
[0093] In particular, the model trainer 160 can train the machine-learned models 120 and/or 140 based on a training data set 162. The training data 162 may include unconstrained image data, such as "in-the-wild" photographs.
[0094] In some implementations, if the user gives consent, training examples may be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some cases, this process is referred to as model personalization.
[0095] The model trainer 160 includes computer logic used to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general-purpose processor. For example, in some implementations, the model trainer 160 includes program files that are stored on a storage device, loaded into memory, and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions stored on a tangible computer-readable storage medium, such as RAM, a hard disk, or an optical or magnetic medium.
[0096] The network 180 may be any type of communication network, such as a local area network (e.g., an intranet), a wide area network (e.g., the Internet), or any combination thereof, and may include any number of wired or wireless links. Generally, communication over the network 180 can be carried over any type of wired and/or wireless connection using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
[0097] Figure 5A shows one exemplary computing system that may be used to implement the present disclosure. Other computing systems may be used as well. For example, in some implementations, the user computing device 102 may include the model trainer 160 and the training data set 162. In such implementations, the model 120 can be both trained and used locally at the user computing device 102. In some such implementations, the user computing device 102 may implement the model trainer 160 to personalize the model 120 based on user-specific data.
[0098] Figure 5B shows a block diagram of an exemplary computing device 10 that operates according to an exemplary embodiment of the present disclosure. The computing device 10 may be a user computing device or a server computing device.
[0099] The computing device 10 includes several applications (for example, applications 1 to N). Each application includes its own machine learning library and pre-trained models. For example, each application may include a pre-trained model. Exemplary applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, and so on.
[0100] As shown in Figure 5B, each application can communicate with several other components of the computing device, such as one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.
[0101] Figure 5C shows a block diagram of an exemplary computing device 50 that operates according to an exemplary embodiment of the present disclosure. The computing device 50 may be a user computing device or a server computing device.
[0102] The computing device 50 contains several applications (for example, applications 1 through N). Each application communicates with a central intelligence layer. Exemplary applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, and a browser application. In some implementations, each application can communicate with the central intelligence layer (and the models stored within it) using an API (for example, a common API across all applications).
[0103] The central intelligence layer contains several machine learning models. For example, as shown in Figure 5C, a respective machine learning model (e.g., a model) may be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications may share a single machine learning model. For example, in some implementations, the central intelligence layer may provide a single model (e.g., a single model) for all of the applications. In some implementations, the central intelligence layer is included in, or otherwise implemented by, the operating system of the computing device 50.
[0104] The central intelligence layer can communicate with a central device data layer. The central device data layer may be a centralized repository of data for the computing device 50. As shown in Figure 5C, the central device data layer can communicate with several other components of the computing device, such as one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
Additional Disclosure
[0105] The technologies discussed herein refer to servers, databases, software applications, and other computer-based systems, as well as the actions performed and the information transmitted to and from such systems. The inherent flexibility of computer-based systems allows for a wide variety of possible configurations, combinations, and divisions of tasks and functions among their components. For example, the processes described herein may be implemented using a single device or component, or multiple devices or components working in combination. Databases and applications may be implemented on a single system or distributed across multiple systems. Distributed components may operate sequentially or in parallel.
[0106] While the subject matter has been described in detail with respect to various specific exemplary embodiments thereof, each example is given for illustrative purposes only and not as a limitation of the disclosure. Those skilled in the art, upon understanding the foregoing, will readily be able to create modifications, variations, and equivalents of such embodiments. Therefore, the disclosure does not preclude the inclusion of such modifications, variations, and/or additions to the subject matter, as will be readily apparent to those skilled in the art. For example, a feature shown or described as part of one embodiment may be used in conjunction with another embodiment to yield a further embodiment. Thus, the disclosure is intended to cover such modifications, variations, and equivalents.
[Explanation of Symbols]
[0107]
10 Computing device
12 Training data set
14 Model optimization or training process
16 Machine-learned viewpoint synthesis model
18 Composite image
20 Composite images
22 Training positions
24 Machine-learned viewpoint synthesis model
25 Camera parameters
26 Training image embeddings
27 Opacity and color data
28 Volume rendering
30 Loss function
32 Ground truth pixel colors
34 Base
36 Static part
38 Transient part
40 3-D points
42 Additional camera parameters
44 Differential opacity
46 Colors
48 GLO embedding
50 Computing device
52 Uncertainty embedding
54 Differential opacity
56 Colors
58 Differential uncertainty
100 Computing system
102 User computing device
112 Processors
114 Memory
116 Data
118 Instructions
120 Machine learning models
122 User input components
130 Server computing system
132 Processors
134 Memory
136 Data
138 Instructions
140 Machine learning models
150 Training computing system
152 Processors
154 Memory
156 Data
158 Instructions
160 Model trainer
162 Training data set
180 Network
Claims
1. A computing system for generating composite images of scenes, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store: a machine-learned viewpoint synthesis model comprising a base, a static content portion that models static content within the scene, and a transient content portion that models transient content within the scene; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a position in three-dimensional space; processing data describing the position using the base of the machine-learned viewpoint synthesis model to generate a static opacity and a latent representation; processing the latent representation using the static content portion of the machine-learned viewpoint synthesis model to generate a static color; processing the latent representation using the transient content portion of the machine-learned viewpoint synthesis model to generate a transient opacity and a transient color; and performing volume rendering to generate a composite pixel color for a composite pixel of a composite image from the static opacity, the static color, the transient opacity, and the transient color.
2. The computing system according to claim 1, wherein the position in three-dimensional space comprises a capture position associated with an existing training image, and wherein the operations further comprise: evaluating a loss function that compares the composite pixel color with a ground truth pixel color of a training pixel included in the existing training image; and modifying one or more parameter values for one or more parameters of the machine-learned viewpoint synthesis model based at least in part on the loss function.
3. The computing system according to claim 2, wherein processing the latent representation using the transient content portion of the machine-learned viewpoint synthesis model further generates an uncertainty value, and wherein the loss function includes a pixel loss term that provides a negative correlation between the magnitude of the loss and the uncertainty value.
4. The computing system according to any one of claims 1 to 3, wherein the operations further comprise inputting an uncertainty embedding into the transient content portion of the machine-learned viewpoint synthesis model alongside the latent representation to generate the transient opacity and the transient color.
5. The computing system according to claim 2, 3, or 4, wherein the operations further comprise inputting a generative embedding into the static content portion of the machine-learned viewpoint synthesis model alongside the latent representation to generate the static color.
6. The computing system according to claim 5, wherein the generative embedding is associated with the existing training image, and wherein the operations further comprise modifying one or more values of the generative embedding based at least in part on the loss function.
7. The computing system according to any one of claims 2 to 6, wherein the operations further comprise: inputting one or more camera parameters into the base of the machine-learned viewpoint synthesis model alongside the position to generate the static opacity and the latent representation; and modifying one or more of the camera parameters based at least in part on the loss function.
8. The computing system according to any one of claims 1 to 7, wherein each of the operations is performed for each individual pixel of the composite image.
9. The computing system according to any one of claims 1 to 8, wherein the position includes location and orientation.
10. The computing system according to claim 1, 8, or 9, wherein the position comprises a new position that is not included in the training set on which the machine-learned viewpoint synthesis model was trained.
11. The computing system according to any one of claims 1 to 10, wherein each of the base, the static content portion, and the transient content portion of the machine-learned viewpoint synthesis model comprises a multilayer perceptron.
12. A computer-implemented method for viewpoint synthesis having user-specifiable characteristics, the method comprising: obtaining, by a computing system comprising one or more computing devices, a desired position in three-dimensional space and a user-specified generative embedding, wherein the generative embedding encodes one or more visual characteristics to be exhibited by a composite image; processing, by the computing system, data describing the position using a base of a machine-learned viewpoint synthesis model to generate an opacity and a latent representation; processing, by the computing system, the latent representation and the generative embedding using a content portion of the machine-learned viewpoint synthesis model to generate a color; and performing, by the computing system, volume rendering to generate a composite pixel color for a composite pixel of the composite image from the opacity and the color, wherein the composite image exhibits the one or more visual characteristics encoded by the generative embedding.
13. The computer-implemented method according to claim 12, wherein: the opacity generated by the base comprises a static opacity; processing, by the computing system, the latent representation and the generative embedding using the content portion of the machine-learned viewpoint synthesis model to generate the color comprises: processing, by the computing system, the latent representation and the generative embedding using a static content portion of the machine-learned viewpoint synthesis model to generate a static color; and processing, by the computing system, the latent representation using a transient content portion of the machine-learned viewpoint synthesis model to generate a transient opacity and a transient color; and performing, by the computing system, volume rendering to generate the composite pixel color for the composite pixel of the composite image from the opacity and the color comprises performing, by the computing system, volume rendering to generate the composite pixel color for the composite pixel of the composite image from the static opacity, the static color, the transient opacity, and the transient color.
14. The computer-implemented method according to claim 13, further comprising inputting, by the computing system, an uncertainty embedding into the transient content portion of the machine-learned viewpoint synthesis model alongside the latent representation to generate the transient opacity and the transient color.
15. The computer-implemented method according to any one of claims 12 to 14, wherein the generative embedding comprises an interpolated embedding generated by interpolating respective image embeddings associated with two or more existing images selected by the user.
16. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more processors, cause a computing system to perform operations, the operations comprising: obtaining, by the computing system, a position in three-dimensional space and a training embedding associated with an existing training image, wherein the training embedding encodes one or more visual characteristics of the existing training image; processing, by the computing system, data describing the position using a base of a machine-learned viewpoint synthesis model to generate an opacity and a latent representation; processing, by the computing system, the latent representation and the training embedding using a content portion of the machine-learned viewpoint synthesis model to generate a color; performing, by the computing system, volume rendering to generate a composite pixel color for a composite pixel of a composite image from the opacity and the color; evaluating a loss function that compares the composite pixel color with a ground truth pixel color of a training pixel included in the existing training image; and modifying one or more values of the training embedding based at least in part on the loss function.
17. The one or more non-transitory computer-readable media according to claim 16, wherein: the opacity generated by the base comprises a static opacity; processing, by the computing system, the latent representation and the training embedding using the content portion of the machine-learned viewpoint synthesis model to generate the color comprises: processing, by the computing system, the latent representation and the training embedding using a static content portion of the machine-learned viewpoint synthesis model to generate a static color; and processing, by the computing system, the latent representation using a transient content portion of the machine-learned viewpoint synthesis model to generate a transient opacity and a transient color; and performing, by the computing system, volume rendering to generate the composite pixel color for the composite pixel of the composite image from the opacity and the color comprises performing, by the computing system, volume rendering to generate the composite pixel color for the composite pixel of the composite image from the static opacity, the static color, the transient opacity, and the transient color.
18. The one or more non-transitory computer-readable media according to claim 17, wherein processing the latent representation using the transient content portion of the machine-learned viewpoint synthesis model further generates an uncertainty value, and wherein the loss function includes a pixel loss term that provides a negative correlation between the magnitude of the loss and the uncertainty value.
19. The one or more non-transitory computer-readable media according to any one of claims 16 to 18, wherein the operations further comprise: inputting one or more camera parameters associated with a camera that captured the existing training image into the base of the machine-learned viewpoint synthesis model alongside the position to generate the opacity and the latent representation; and modifying one or more of the camera parameters based at least in part on the loss function.
20. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more processors, cause a computing system to perform operations, the operations comprising: obtaining, by the computing system, a position in three-dimensional space and one or more camera parameters associated with an existing training image; processing, by the computing system, data describing the position and the one or more camera parameters using a base of a machine-learned viewpoint synthesis model to generate an opacity and a latent representation; processing, by the computing system, the latent representation using a content portion of the machine-learned viewpoint synthesis model to generate a color; performing, by the computing system, volume rendering to generate a composite pixel color for a composite pixel of a composite image from the opacity and the color; evaluating a loss function that compares the composite pixel color with a ground truth pixel color of a training pixel included in the existing training image; and modifying one or more values of the camera parameters based at least in part on the loss function.
21. A computing system for generating composite images of scenes, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store: a machine-learned viewpoint synthesis model comprising a base and a static content portion that models static content within the scene, wherein the base and the static content portion have been trained together with a transient content portion that models transient content within the scene; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a position in three-dimensional space; processing data describing the position using the base of the machine-learned viewpoint synthesis model to generate a static opacity and a latent representation; processing the latent representation using the static content portion of the machine-learned viewpoint synthesis model to generate a static color; and performing volume rendering to generate a composite pixel color for a composite pixel of a composite image from the static opacity and the static color.