Learning representations of data items and generating new views of data items using a diffusion model
By combining encoder neural networks and denoising decoder neural networks, and utilizing self-supervised training and diffusion models, the problems of high-level semantic capture and high computational resource consumption when generating new views of data items in existing technologies are solved, thus achieving high-quality image processing tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2024-11-22
- Publication Date
- 2026-06-19
Abstract
Description
[0001] Cross-references to related applications
[0002] This application claims priority to GB Application No. 2318012.8, filed on November 24, 2023. The disclosure of the prior application is considered part of the disclosure of this application and is incorporated herein by reference.
Background
[0003] This specification relates to processing data using machine learning models.
[0004] A neural network is a machine learning model that uses one or more layers of non-linear units to predict an output for a received input. In addition to an output layer, some neural networks include one or more hidden layers. The output of each hidden layer is used as input to the next layer in the network (i.e., the next hidden layer or the output layer). Each layer of the network generates an output from a received input in accordance with the current values of its respective set of parameters.
Summary of the Invention
[0005] This specification describes systems and methods for training neural network systems and for using trained components of such systems, the system and methods being implemented as computer programs on one or more computers in one or more locations. The neural network system can be trained using self-supervised training, and the trained components of the system can be used to perform a wide range of tasks, such as perception, reconstruction, and editing tasks. For example, an implementation of the trained system can generate new views of a scene given one or more views of that scene, such as views from a new perspective.
[0006] In a first aspect, a computer-implemented method for training a neural network system is described. The neural network system includes an encoder neural network. The neural network system also includes a denoising decoder neural network for generating the output data item by incrementally reducing the level of noise in the output data item at multiple time steps. The output data item may include, for example, an image or audio.
[0007] The method generally involves obtaining multiple training data items, each including at least one source data item and a target data item. Data items may include images, audio, or other data items. The source and target data items represent views of an object or scene, such as an image or audio representation of that object or scene.
[0008] At each of multiple training iterations, the method obtains a training data item from the training data items and processes the source data item in that training data item using the encoder neural network to generate at least one latent vector representing the source data item. A time value for one of the time steps is obtained, and a noisy version of the target data item, an embedding of the time value, and the latent vector are processed using the denoising decoder neural network to generate a denoised output including an estimated noise data item for that time step. The estimated noise data item can be used to compensate for noise in the noisy version of the target data item.
[0009] The denoising decoder neural network and the encoder neural network are trained by backpropagating gradients of an objective function that depends on how accurately the estimated noise data item estimates the noise in the noisy version of the target data item (e.g., image).
[0010] A computer-implemented method for generating output data items (e.g., output images) by incrementally reducing the level of noise in the output data items at multiple time steps is also described. One or more characteristics of the output data items may be obtained, for example, from a user, and the output data items may be generated using these characteristics, such that the output data items represent these characteristics. In implementations where the output data items include output images, the generated output image may be a 2D or 3D image also defined by the target viewpoint.
[0011] A system is also described, comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers. The storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the described methods.
[0012] A non-transitory computer storage medium is further described, storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the described methods.
[0013] The subject matter described in this specification can be implemented in specific embodiments to achieve one or more of the following advantages.
[0014] The described techniques can be used to generate representations of data items that capture their high-level semantics in a particularly useful and efficient manner. The described techniques can also be used to generate new examples of data items that are semantically similar to existing data items, such as new views of 2D or 3D images, or new perspective views of 3D objects.
[0015] Specifically, the system is implemented using a denoising decoder neural network, which enables the generation of data items based on their latent representations without requiring a complete definition of all their characteristics. In other words, the described method allows the system, particularly the latent vectors, to represent only the most salient or descriptive qualities of the data items. For example, in the case of images, this could include the high-level semantics of the image, while the generation of local and high-frequency details is delegated to the denoising decoder neural network.
[0016] The latent representation of a data item as one or more latent vectors provides a bottleneck between the encoder neural network and the denoising decoder neural network, which prompts the system to learn latent vector representations that typically capture the key properties and semantics of the data items used in training in an explicit and interpretable manner.
[0017] Generally, the system can be implemented both to generate and to modify data items, such as images, audio, and other data items. A denoising decoder neural network trained as described in this specification can produce very high-quality output data items.
[0018] The system can learn latent vector representations that are disentangled, for example as measured by a disentanglement score; that is, the factors responsible for variations in the content of data items (such as the appearance of an image) can be separated, especially when the data items have a real-world source. This facilitates manipulation and control of the generated data items, for example to generate modified versions of the data items. In the case of images, for example, this could enable altering the hue, sharpness, or lighting conditions, or changing characteristics of the objects represented in the image, such as the size or material of the objects, or the age or gender of people in the image.
[0019] The representations learned by this system can also be used for downstream prediction tasks, such as data item (e.g., image) classification, and can provide improved accuracy compared to some other techniques.
[0020] The system described is implemented in a way that allows for training using self-supervised methods without the need for labeled training data.
[0021] The described technique can be used to generate output data items with higher quality than some other techniques and with reduced computational resource usage. For example, in one implementation, high-quality images can be generated using as few as 20 denoising steps, with computational requirements significantly lower than some other methods (e.g., NeRF-based methods).
[0022] Details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the following description. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Description of the Drawings
[0023] Figure 1 shows an example training system for training a neural network system that includes an encoder neural network and a denoising decoder neural network.
[0024] Figure 2 is a flowchart of an example process for training a neural network system that includes an encoder neural network and a denoising decoder neural network.
[0025] Figure 3 shows different views of objects.
[0026] Figure 4 shows an example noise schedule.
[0027] Figure 5 shows examples of the types of feature that can be represented by subvectors of the latent vector.
[0028] Figure 6 illustrates an example of selecting scaling factors.
[0029] Figure 7 is a flowchart of an example process for generating output data items.
[0030] Figure 8 is a flowchart of an example process for performing data item processing tasks.
[0031] Figures 9A and 9B show examples of modified images generated using the described techniques.
[0032] Figures 10A and 10B show examples of generating new 3D views of objects using the described techniques.
[0033] Like reference numerals and names in the various figures indicate like elements.
Detailed Description
[0034] The described techniques generally involve encoder neural networks and denoising decoder neural networks. The encoder neural network is configured to encode data items, such as images, into one or more latent vectors representing those data items, which are then used by the denoising decoder neural network to guide the synthesis of new but relevant data items.
[0035] The denoising decoder neural network is configured to implement an iterative denoising process conditioned on latent vectors, such as a diffusion model or a consistency model. Both the encoder neural network and the denoising decoder neural network can be trained using self-supervised methods, and the latent vectors learned in this way capture visual semantics useful for reconstruction, editing, and synthesis tasks, as well as downstream perceptual tasks.
[0036] As a specific example, a neural network system comprising an encoder neural network and a denoising decoder neural network can implement a self-supervised diffusion model. This system uses latent vectors to provide a bottleneck between the encoder and denoising decoder neural networks, and during training, it learns compact and useful representations.
[0037] Some implementations of the system are trained on images, but the described techniques are not limited to images. In one implementation, the system is trained to generate new views of objects or scenes as a self-supervised objective, which helps the system capture visual semantics in an unsupervised manner. The representations learned in this way, which are typically disentangled with respect to different factors of variation in the training data, can be used for many image processing tasks, such as image reconstruction, editing, and synthesis. Generally, all of these image processing tasks involve generating an image, such as a reconstructed or denoised version of an input image, an edited version of an input image, or a synthesized version of an input image, for example one representing a 3D object or scene from a new pose or perspective. Figure 10 (described later) illustrates the generation of new 3D views of some objects.
[0038] The example system uses an image encoder to encode an input view into a latent vector (i.e., a low-dimensional representation), which is then used to guide the synthesis of a new output view. The views can be any collection of images that maintain some relationship (visual or semantic) with each other, such as various augmentations or distortions of an original image, different poses or perspectives of a 3D object, or simply images that share the same semantic category. Figure 3 (described later) shows some examples of different views of various objects.
[0039] The denoising decoder neural network performs image denoising conditioned on the latent vectors rather than simple reconstruction. This allows the encoder to focus on the most distinctive and descriptive qualities of the image, rather than compressing all information about the image into its representation. The encoder neural network learns to encode a source view into latent vectors, which the denoising decoder neural network uses to generate a target view that is visually or semantically related to the source view. The learned representation (i.e., the latent vectors) thereby captures the most prominent commonalities between the source and target views.
[0040] As described later, several additional techniques can be incorporated into the system to improve these representations. For example, Figure 5 (and the small diagram at the top of Figure 1), described in more detail later, illustrates a “layer modulation” technique that facilitates the generation of disentangled representations. This technique partitions the latent vector into subvectors, each of which modulates a corresponding pair of layers (a downsampling layer and an upsampling layer) of a U-Net-type denoising decoder neural network, thereby promoting specialization among the latent subvectors, as indicated in Figure 5. This can be particularly useful for image editing, for example; for novel 3D view synthesis, the cross-attention-based methods described later may work even better.
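Merely as an illustrative sketch, and not as the described implementation, the partitioning used by layer modulation could be expressed in Python as follows. The function names, the equal-sized split, and the `down_i` / `up_i` layer labels are assumptions introduced here for illustration only.

```python
from typing import Dict, List, Tuple


def split_latent(z: List[float], num_layer_pairs: int) -> List[List[float]]:
    """Partition a latent vector into equal-sized subvectors, one per layer pair.

    The equal-sized split is an assumption made for this sketch.
    """
    if num_layer_pairs <= 0 or len(z) % num_layer_pairs != 0:
        raise ValueError("latent size must be divisible by the number of layer pairs")
    size = len(z) // num_layer_pairs
    return [z[i * size:(i + 1) * size] for i in range(num_layer_pairs)]


def modulation_plan(z: List[float], num_layer_pairs: int) -> Dict[Tuple[str, str], List[float]]:
    """Assign each (downsampling, upsampling) layer pair the subvector that modulates it."""
    subvectors = split_latent(z, num_layer_pairs)
    # Hypothetical layer labels: pair i consists of "down_i" and its mirror "up_i".
    return {(f"down_{i}", f"up_{i}"): sub for i, sub in enumerate(subvectors)}
```

For instance, `modulation_plan(z, 4)` on an 8-dimensional latent vector assigns two latent dimensions to each of four layer pairs, so that each pair sees only its own slice of the latent vector.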
[0041] Figure 1 shows an example training system for training a neural network system. The training system of Figure 1 can be implemented as computer programs on one or more computers in one or more locations.
[0042] The neural network system includes an encoder neural network 120 and a denoising decoder neural network 130. Generally, the encoder neural network 120 and the denoising decoder neural network 130 can have any suitable neural network architecture, including, for example, one or more feedforward neural network layers, or convolutional neural network layers or attention neural network layers.
[0043] The encoder neural network 120 is configured to process a source data item 110, in accordance with learnable parameters (e.g., weights) of the encoder neural network, to generate at least one latent vector 122 representing the source data item. In some implementations, as described later, the encoder neural network 120 is configured to process multiple source data items 110, for example by aggregating their corresponding latent vectors.
[0044] As an example, if the source data item 110 includes an image, the encoder neural network 120 may include a ResNet block (He et al., “Deep residual learning for image recognition”, Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016). Figure 1 shows, as an example, a source image including a first, source view of a tiger. The encoder neural network 120 can then encode the source data item 110 into a single d-dimensional latent vector. As another example, the denoising decoder neural network 130 can have a ViT (Vision Transformer) architecture (Dosovitskiy et al., arXiv:2010.11929, 2021). In some implementations, the encoder neural network 120 can encode the source data item 110 into multiple latent vectors.
[0045] As another example, where the source data item 110 includes audio data, the encoder neural network 120 may include an audio language model or an audio encoder of a speech recognition system such as BEST-RQ (Chiu et al., arXiv:2202.01855).
[0046] Generally, the number of dimensions d of a latent vector will depend on the nature of the training data, for example on how many different factors of variation in the training data are expected to be represented. Merely as an example, d can be between 32 and 8096.
[0047] The denoising decoder neural network 130 is configured to process, in accordance with learnable parameters (e.g., weights) of the denoising decoder neural network, a noisy data item 132 (e.g., pixel values of an image, as shown in the thumbnail), an embedding 124 of a time value t that identifies a denoising time step, and the latent vector 122, to generate a denoised output 134.
[0048] As described in more detail later, the encoder neural network and the denoising decoder neural network are trained using a target data item (e.g., a target image). Figure 1 shows, as an example, a target image including a second, target view of the tiger. In implementations, the denoising decoder neural network 130 is trained to generate the denoised output based on how accurately it estimates the noise in a noisy version of the target data item (e.g., the target image). The denoising decoder neural network 130 can estimate the noise either by explicitly generating an estimate of the noise (at the denoised output 134) or by estimating a denoised version of the noisy data item 132 (at the denoised output 134).
[0049] Therefore, generally, the denoised output 134 includes an estimated noise data item for that time step, which is suitable for compensating for noise in a noisy version of the target data item, i.e., suitable for reducing the noise level in the noisy version of the target data item, either by explicitly estimating the noise or by estimating a denoised version of the noisy version of the target data item. The denoising decoder neural network is actually used to reduce the noise level during inference, in a data item generation (e.g., image generation) task.
[0050] Generally, the denoising decoder neural network 130 can be configured to implement a diffusion model or a consistency model, and the denoised output 134 can include an estimate of the noise in the noisy data item 132 or an estimate of a denoised version of the noisy data item 132.
[0051] As an example, in inference, the current noisy data item 132 at a time step can be processed to generate a denoised output 134, wherein the estimated noisy data item for that time step includes a prediction of the noise, which can be combined with the current noisy data item at that time step to obtain an updated denoised version of the current noisy data item for use in the next denoising iteration time step. For example, the estimated noisy data item, in particular a scaled version of the estimated noisy data item, can be subtracted from the current noisy data item, for example, as described later.
[0052] As another example, in inference, the current noisy data item 132 at a time step can be processed to generate a denoised output 134, wherein the estimated noisy data item for that time step includes a prediction of the denoised version of the noisy data item at that time step. This can then be used as an updated denoised version of the current noisy data item for the next denoising iteration time step.
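Merely as an illustrative sketch of the noise-prediction case, one update of the current noisy data item could be written as below. The sketch assumes a particular, commonly used form that this specification does not mandate: the noisy item is taken to be sqrt(alpha_t) times the clean item plus sqrt(1 - alpha_t) times the noise, and the update is a deterministic DDIM-style step. The names `ddim_style_step`, `alpha_t`, and `alpha_prev` are hypothetical.

```python
import math
from typing import List


def ddim_style_step(x_t: List[float], eps_hat: List[float],
                    alpha_t: float, alpha_prev: float) -> List[float]:
    """One deterministic update from the current time step to the next, less noisy one.

    Assumption (not from the specification): x_t = sqrt(alpha_t) * x_0
    + sqrt(1 - alpha_t) * eps, with 0 < alpha_t <= 1 and 0 <= alpha_prev <= 1.
    """
    if not 0.0 < alpha_t <= 1.0 or not 0.0 <= alpha_prev <= 1.0:
        raise ValueError("alpha_t must lie in (0, 1] and alpha_prev in [0, 1]")
    noise_scale = math.sqrt(1.0 - alpha_t)
    signal_scale = math.sqrt(alpha_t)
    # Subtract the scaled noise estimate from the current noisy item,
    # then undo the signal scaling to get an estimate of the clean item.
    x0_hat = [(x - noise_scale * e) / signal_scale for x, e in zip(x_t, eps_hat)]
    # Re-noise the clean estimate to the lower noise level of the next time step.
    return [math.sqrt(alpha_prev) * x0 + math.sqrt(1.0 - alpha_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]
```

In the alternative case, where the denoised output is itself a prediction of the denoised data item, no subtraction is needed and the prediction can be carried forward directly to the next time step.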
[0053] In the implementation, the denoising decoder neural network 130 has an architecture that allows the neural network to map inputs of a given dimension to outputs of the same dimension (i.e., to map noisy data items 132 to estimated noisy data items).
[0054] As an example, the denoising decoder neural network 130 can have a U-Net architecture (Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv:1505.04597). This architecture can include, for example, one or more ResNet blocks and / or one or more self-attention layers, as well as one or more skip connections between neural network layers of corresponding resolution. For example, U-Net can be implemented as a stack of residual layers, convolutional layers, and downsampling or upsampling layers (in the encoding and decoding parts of the U-Net, respectively), which are further linked by symmetric skip connections.
[0055] However, more generally, the denoising decoder neural network for the denoising process can operate in the space of the data items (e.g., in the image space) or in the latent space. Thus, as some other examples, the denoising decoder neural network 130 may include a diffusion transformer (DiT) (Peebles et al., “Scalable Diffusion Models with Transformers” arXiv: 2212.09748, 2023), or a Transformer backbone, or U-ViT (Hoogeboom et al., arXiv: 2301.11093, 2023).
[0056] Throughout this specification, “embedding” of an entity means representing an entity as an ordered collection of numerical values, such as a vector or matrix of numerical values.
[0057] In implementations, the embedding of the time value t that identifies the denoising time step, or (later) the embedding of one or more components or coordinates of a viewpoint, can be determined by encoding it as a d-dimensional vector. Any suitable encoding (embedding) can be used. As an example, a sinusoidal embedding can be used, in which the value in each dimension i is sin(ωt) for even i and cos(ωt) for odd i, where the frequency ω depends on i and on a large number N, such as 10,000, and where t is the time value (or a coordinate of the viewpoint).
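Merely as an illustrative sketch, a sinusoidal embedding of this kind could be computed as follows. The exact frequency formula is not given in the text above; the sketch assumes the form commonly used for Transformer position encodings, with ω equal to N raised to the power -2⌊i/2⌋/d, and the function name is hypothetical.

```python
import math
from typing import List


def sinusoidal_embedding(t: float, d: int, n: float = 10000.0) -> List[float]:
    """Encode a scalar t (a time value or a viewpoint coordinate) as a d-dimensional vector.

    Even dimensions use sin(omega * t) and odd dimensions use cos(omega * t).
    The frequency omega = n ** (-2 * (i // 2) / d) is an assumed, commonly
    used choice; the specification does not state the formula.
    """
    embedding = []
    for i in range(d):
        omega = n ** (-(2 * (i // 2)) / d)
        value = math.sin(omega * t) if i % 2 == 0 else math.cos(omega * t)
        embedding.append(value)
    return embedding
```

Under this assumed formula, each sine and cosine pair shares one frequency, and the frequencies decrease geometrically across the d dimensions.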
[0058] The denoising decoder neural network 130 is configured to generate the denoised output 134 conditioned on the embedding 124 of the time value t and on the latent vector 122. In some implementations, the embedding of the time value and the latent vector are combined, for example by concatenating or summing them, and the denoising decoder neural network 130 is conditioned on the combination. In other implementations, the denoising decoder neural network 130 is conditioned separately on the embedding of the time value and on the latent vector.
[0059] There are many ways to condition the denoising decoder neural network 130 on the embedding of the time value and on the latent vector. As an example, this can be done by incorporating a FiLM (Feature-wise Linear Modulation) layer (Perez et al., arXiv:1709.07871) into the neural network. As another example, this can be done by including cross-attention blocks in the neural network, in which queries derived from activations of the denoising decoder neural network attend to keys and values derived from the conditioning information. Some particularly useful ways of conditioning the denoising decoder neural network 130 are described later.
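Merely as an illustrative sketch of FiLM-style conditioning, the following Python shows per-channel scale and shift parameters being computed from a conditioning vector and applied to feature channels. The helper names and the weight matrices are hypothetical; in a real network the weights would be learned.

```python
from typing import List


def linear(vec: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer producing one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features: List[List[float]], cond: List[float],
         w_gamma: List[List[float]], b_gamma: List[float],
         w_beta: List[List[float]], b_beta: List[float]) -> List[List[float]]:
    """Feature-wise linear modulation: scale and shift each channel using the condition.

    `features` holds one list of activations per channel. `cond` could be, for
    example, the time embedding concatenated with the latent vector.
    """
    gamma = linear(cond, w_gamma, b_gamma)  # per-channel scale
    beta = linear(cond, w_beta, b_beta)     # per-channel shift
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]
```

With Python lists, a combined condition can be formed by concatenation, for example `cond = time_embedding + latent_vector`, which corresponds to the concatenation option mentioned above.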
[0060] In implementations of the trained system for image generation, generation of the denoised output 134 can also be conditioned on viewpoint data defining a target viewpoint for the generated image. For example, if the source data item 110 includes a source image 110A and the noisy data item 132 includes a noisy image 132A, this can be accomplished by determining viewpoint embeddings 110B, 132B and processing them using the encoder neural network 120 and / or the denoising decoder neural network 130, respectively, each viewpoint embedding representing the coordinates of the viewpoint for each pixel in the corresponding source image 110A and / or noisy image 132A.
[0061] The viewpoint embedding for a pixel of an image can be combined with the pixel values of the image, for example by concatenation or addition, e.g., pixel-wise. Alternatively, the viewpoint embedding for a pixel can be combined with a representation of the image within the encoder neural network 120 and / or the denoising decoder neural network 130. For example, the viewpoint embedding for a pixel can be combined with a representation of the image (which may, but need not, preserve the image size) at or after the output of the first neural network layer of the respective neural network, for example by concatenating a sinusoidal position embedding to the linearly mapped RGB channels of the image after the first (input) layer of the encoder neural network 120 and / or after the first (input) layer of the denoising decoder neural network 130.
[0062] The training system includes a training engine for training the encoder neural network 120 and the denoising decoder neural network 130.
[0063] Figure 2 is a flowchart of an example process for training a neural network system comprising an encoder neural network and a denoising decoder neural network; for convenience, the example process is described with reference to Figure 1. The process of Figure 2 can be implemented by one or more computers in one or more locations. One or both of the encoder neural network and the denoising decoder neural network can be pre-trained, or they can be trained from scratch.
[0064] The method involves obtaining multiple training data items (step 200). Each training data item includes at least one source data item 110 and a target data item. The source data item and the target data item represent views of objects or scenes.
[0065] Here, "view" can include observations of an object or scene, which can be a real, tangible object or scene, or an intangible object, such as a data object. Generally, an intangible object can represent any type of entity. Examples of source and target data items representing entities in their views are given later.
[0066] As an example, the source and target data items can represent different views of the same object; for example, one could be a modified (“augmented”) or distorted view of the object, while the other could be an unmodified view. As another example, the source and target data items can include images that are views of the same physical object from different viewpoints. As a further example, the source and target data items can represent different views of objects of the same type or category; for example, they could represent views of cats, but not necessarily views of the same cat. That is, the source and target data items can both represent views of the same (type of) object but can include different instances of that object, i.e., they can be semantically related. In some implementations, such as when the system is being trained for a reconstruction task, the source and target data items can represent the same view, such as the same view of an object or scene. For example, each row of Figure 3 shows three different views of an object: image crops (top), image augmentations and distortions (middle), and camera viewpoints (bottom).
[0067] The method involves performing multiple training iterations.
[0068] At each iteration, the process obtains, for example by sampling, one of the training data items (step 202), and processes the one or more source data items 110 in the training data item using the encoder neural network 120 to generate one or more latent vectors 122 representing the one or more source data items (step 204).
[0069] The process also obtains a (random) time value for one of a plurality of data generation time steps, for example by sampling from a distribution (e.g., a uniform distribution) (step 206). Generally, the time steps span a range between one endpoint time step (e.g., at t = 1 or t = 0, i.e., where the time value is 1 or 0) and another endpoint time step (e.g., at t = T, where T is an integer). In other words, in implementations the time value is an integer time index.
[0070] In some implementations, during inference (i.e., when the denoising decoder neural network generates an output data item by incrementally reducing the level of noise in the output data item over a series of time steps), the time steps (i.e., time values) can count down from an initial time step at t = T to a final time step at t = 1 or t = 0. However, the direction, and which is the initial time step and which is the final time step, are arbitrary choices.
[0071] In some implementations, the inference process can be strided; that is, while the denoising decoder neural network is generating an output data item, an update step as described later can be performed every S time steps rather than at every time step.
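Merely as an illustrative sketch, the time indices visited by such a strided inference process could be listed as follows, assuming integer time indices that count down from an initial index towards 1. The function name and the choice to stop before index 0 are assumptions made for this sketch.

```python
from typing import List


def strided_time_steps(initial_step: int, stride: int) -> List[int]:
    """Time indices at which an update is performed when counting down with a stride.

    Assumes integer time indices running from `initial_step` down towards 1.
    """
    if initial_step <= 0 or stride <= 0:
        raise ValueError("initial_step and stride must be positive")
    return list(range(initial_step, 0, -stride))
```

For instance, with 1000 time steps and a stride of 50, updates would be performed at only 20 of the time indices.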
[0072] The method also involves using a denoising decoder neural network 130 to process a noisy version of the target data item 132, an embedding 124 of the time value, and a latent vector 122 to generate a denoised output 134 that includes the estimated noisy data item for that time step (step 208).
[0073] Generally, the noisy version of the target data item 132 and the estimated noise data item each have one or more dimensions that match those of the target data item (and, as described later, those of the current data item in the case of inference). For example, if the target data item comprises an image represented by an array of given dimensions, the noisy version of the target data item and the estimated noise data item can each also be represented by an array of the same dimensions.
[0074] In implementations, the denoising decoder neural network and the encoder neural network are trained (e.g., jointly) by backpropagating gradients of an objective function that depends on the accuracy of the estimate of the noise in the noisy version of the target data item (step 210).
[0075] In other words, in implementations, gradients of the objective function can be backpropagated through the denoising decoder neural network 130 and into the encoder neural network 120 to update the trainable parameters (e.g., weights) of each of these neural networks.
[0076] In some implementations, only a portion of the denoising decoder neural network 130 and / or the encoder neural network 120 is trained, such as an adapter neural network portion of one or both of these neural networks, which is used to adapt a pre-trained portion of the corresponding neural network for which parameters (e.g., weights) are fixed during training.
[0077] This training can be performed using any suitable gradient descent optimization algorithm, such as Adam or another optimization algorithm.
[0078] The following example illustrates one implementation of the denoising diffusion model, but the methods and systems described herein can be used with variations of this method or other diffusion or consistency modeling techniques.
[0079] In some implementations, the noisy version of the target data item is obtained by adding a noise data item, representing noise, to the target data item (or by subtracting it from the target data item), and the value of the objective function can then depend on a difference between the noise data item and the estimated noise data item.
[0080] For example, in some implementations, the noisy version of the target data item can be obtained by sampling a noise data item from a noise distribution (e.g., a Gaussian noise distribution, a mixture of Gaussian noise distributions, or a gamma noise distribution). The noise data item can then be used to determine the noisy version of the target data item, for example after being scaled by a scaling factor that depends on the time value. For example, the noisy version can be derived from the target data item, or from a scaled version of the target data item (e.g., scaled by a scaling factor that depends on the time value), by adding the (scaled) noise data item to, or subtracting it from, the (scaled) target data item. In some implementations, the value of the objective function can then depend on a difference between the noise data item and the estimated noise data item.
[0081] As an example, if in inference, the time value comes from Starting from the initial time step and counting backwards, the scaling factor can decrease over a series of time steps. In some implementations, the scaling factor is determined by... The value definition is given by, where Depending on the time value, for example, where It is equal to or greater than zero and equal to or less than 1. Then, The value of can increase as the time value (time index) decreases, that is, making For example, at the initial time step, the scaling factor can be 1; at the final time step, the scaling factor can be reduced to 0.
[0082] The noisy version of the target data item, $x_t$, can be determined, for example, as $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$, where $x_0$ is the target data item, $\alpha_t$ depends on the time value $t$, and $\epsilon$ is the noise data item, for example $\epsilon \sim \mathcal{N}(0, I)$. The value of the objective function can depend on the difference between the noise data item $\epsilon$ and the estimated noise data item $\epsilon_\theta$ from the denoising decoder neural network 130, for example on an L1 or L2 norm or another measure of the difference. For example, the value of the objective function can be determined by estimating $\lVert \epsilon - \epsilon_\theta \rVert$, averaged over samples from the noise distribution and over values of $t$. The value of $t$ can, for example, be sampled from a uniform distribution across the defined time steps (values of $t$).
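The noising and objective described above can be sketched as follows. This is a minimal illustration only: it assumes the common form in which the target is scaled by sqrt(alpha_t) and Gaussian noise by sqrt(1 - alpha_t), and the names `noisy_version`, `noise_loss`, the cosine-style value of `alpha_t`, and the zero placeholder for the decoder's estimate are hypothetical and not taken from the specification.

```python
import numpy as np

def noisy_version(x0, eps, alpha_t):
    """Noisy target: sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps (assumed form)."""
    return np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps

def noise_loss(eps, eps_hat):
    """L2-style objective: mean squared difference between true and estimated noise."""
    return float(np.mean((eps - eps_hat) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # stand-in target image
eps = rng.standard_normal(x0.shape)          # Gaussian noise data item
t = rng.uniform(0.0, 1.0)                    # time value drawn uniformly
alpha_t = np.cos(0.5 * np.pi * t) ** 2       # hypothetical schedule value in [0, 1]
x_t = noisy_version(x0, eps, alpha_t)
eps_hat = np.zeros_like(eps)                 # placeholder for the decoder's estimate
loss = noise_loss(eps, eps_hat)
```

In a real training loop `eps_hat` would come from the denoising decoder, and the loss would be averaged over a batch of sampled noise items and time values.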
[0083] As illustrative technical background, in an example implementation of a diffusion model, the diffusion process is modeled as $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$, where $\beta_t$ defines a variance schedule, and the denoising process can involve sampling a value for $x_{t-1}$ from $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big)$, where $\sigma_t^2$ is a fixed or learned variance term (here $\alpha_t$ can be taken as $\prod_{s=1}^{t}(1-\beta_s)$). The encoder neural network 120 can determine the latent vector as $z = E(x', c')$, where $x'$ is a clean (noise-free) source data item and $c'$ is optional conditioning data, such as viewpoint (embedding) data for the source data item; and the denoising decoder neural network 130 can determine the estimated noise data item as $\epsilon_\theta(x_t, t, z, c)$, where $c$ represents optional conditioning data, such as viewpoint (embedding) data for the target data item.
[0084] The conditioning data for the denoising decoder neural network 130 can be included, for example, with the noisy version of the target data item. In the case of an image, this can be done by concatenating the viewpoint embedding 132B with the noisy version 132A of the target data item (i.e., the target image), or by otherwise combining them.
[0085] As previously mentioned, the noisy version of the target data item can be determined using (e.g., by adding) a noise data item scaled by a scaling factor. In some implementations, the scaling factor is defined by the value of $\sqrt{1-\alpha_t}$, where $\alpha_t$ depends on the time value. The method can then involve determining $\alpha_t$ such that the gradient of its value with respect to time has a vertical asymptote at the time value of the initial time step and at the time value of the final time step of the series of time steps. More generally, in implementations the scaling factor as a function of time has a vertical asymptote at one or both of the time value of the initial time step and the time value of the final time step. In other words, the noise schedule curve, which represents the change of the scaling factor as a function of time, can have a gradient of less than one over a range of time values around a central time value between the initial and final values, where this range is less than 50%, 30%, or 20% of the total range. That is, the schedule of changes in $\alpha_t$ can be arranged to emphasize training with noisy versions of target data items that have a moderate level of noise, compared with higher or lower noise levels. It should be noted that the way $\alpha_t$, and therefore the scaling factor, changes does not depend on whether the target data item is scaled.
[0086] Figure 4 shows an example of a noise schedule that can be used in an implementation of the described training technique. Specifically, Figure 4 shows a graph of $\alpha_t$, which defines the scaling factor, against time step (normalized to a range between 0 and 1).
[0087] In more detail, Figure 4 illustrates a conventional noise schedule 400 (a cosine schedule) and the inverted noise schedule 410 described above. The inverted noise schedule prioritizes medium noise levels over high or low noise levels, which can aid representation learning. For example, too little noise would not present a sufficiently challenging task to the denoising decoder neural network 130, reducing its reliance on the encoder neural network 120, while too much noise might require the latent vector to encode the fine details of the generated data item, so that denoising becomes a simple reconstruction.
[0088] As an example only, a schedule similar to the inverted noise schedule 410 can be obtained by inverting a curve from a set of curves that is parameterized by a start hyperparameter and an end hyperparameter and defined using the sigmoid function. As an example, the hyperparameters can be set and the resulting function then inverted (i.e., the inverse function of the resulting function is determined). In this example, the resulting curve is $\alpha_t = -0.0833\,\log\big(0.0025\,(-1 - 1.0050 / (-1.0025 + t))\big)$. The inverse function can be determined using appropriate computer software (e.g., a computer algebra system such as SymPy).
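The example curve above can be evaluated numerically to check that it behaves as an inverted schedule. This sketch assumes that the stray coefficient -0.0833 in the passage multiplies the logarithm, and that time is normalized to the range 0 to 1; the cosine form `cosine_alpha` is included only as an assumed stand-in for a conventional schedule, and the function names are hypothetical.

```python
import math

def inverted_alpha(t):
    """Example inverted schedule on t in [0, 1] (assumed reading of the formula)."""
    return -0.0833 * math.log(0.0025 * (-1.0 - 1.0050 / (-1.0025 + t)))

def cosine_alpha(t):
    """Assumed conventional cosine-style schedule, for comparison only."""
    return math.cos(0.5 * math.pi * t) ** 2

def mid_slope(schedule, lo=0.4, hi=0.6):
    """Average magnitude of the slope over the central part of the time range."""
    return abs(schedule(hi) - schedule(lo)) / (hi - lo)

inv_mid = mid_slope(inverted_alpha)
cos_mid = mid_slope(cosine_alpha)
```

Under these assumptions the curve runs from roughly 1 at t = 0 to roughly 0 at t = 1 and is flatter than the cosine curve around the middle of the range, so uniformly sampled time values land more often on moderate noise levels.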
[0089] As previously mentioned, in some implementations, the denoising decoder neural network is configured to perform the denoising process based on a diffusion model, e.g., a DDPM (Denoising Diffusion Probabilistic Model) or a DDIM (Denoising Diffusion Implicit Model).
[0090] In some implementations, the denoising decoder neural network is configured to generate output data items by: at each time step in a series of time steps, processing the current data item, the embedding of the time value for that time step, and the latent vector using the denoising decoder neural network to generate a denoised output including an estimated noise data item for that time step; and updating the current data item using the estimated noise data item for that time step to compensate for the noise in the current data item, resulting in an updated version of the current data item. The output data item can then be obtained as the updated version of the current data item at the final time step in the series. The current data item used for the initial step of this process can be, for example, a random data item sampled from a noise distribution. In these implementations, the denoising process can operate in the space of the data items (e.g., image space).
[0091] In implementations, the denoising process uses strided time steps; that is, instead of updating the current data item at each time step, the total number of time steps used during training is divided into a reduced number of steps, and the current data item is updated at each of these reduced steps.
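One simple way to pick strided time steps is sketched below. It is an illustration under assumptions: the helper name `strided_time_steps`, the evenly spaced selection, and the example counts (1000 training steps, 10 sampling steps) are hypothetical choices, not values from the specification.

```python
def strided_time_steps(num_train_steps, num_sample_steps):
    """Return evenly spaced time indices, from noisiest to cleanest."""
    if num_train_steps % num_sample_steps != 0:
        raise ValueError("num_sample_steps must divide num_train_steps")
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1]

# Example: visit 10 of the 1000 training time steps during denoising.
steps = strided_time_steps(1000, 10)
```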
[0092] In some implementations, the source and target data items for the training data items can be the same. In this case, the neural network system can be trained as an autoencoder and used for data item reconstruction. In other implementations, as previously described, the source and target data items for the training data items are different.
[0093] In cases where both the source and target data items include images, these images may include various enhancements or distortions of the original image, or they may depict 3D objects from different poses and perspectives, or they may simply share the same semantic category with each other.
[0094] In implementations, source data items can be obtained by modifying target data items, or vice versa. For example, obtaining multiple training data items may involve: obtaining multiple target data items; and, for each of the multiple target data items, generating at least one source data item by modifying the target data item. Alternatively, a target data item can be generated by modifying at least one source data item.
[0095] As an example, source data items can be obtained by randomly "augmenting" the target data items, or vice versa. Generally, one or both of the source and target data items can be randomly augmented. In some implementations, one or more source data items in the training data can be augmented by adding noise to the source data items (before processing them using the encoder neural network). This can improve the performance of the trained system on later tasks.
[0096] When the source and target data items include image data items representing an image, modifications may involve one or more of the following: randomly resizing the image, randomly cropping the image, randomly flipping the image in space, and applying RandAugment (Cubuk et al., arXiv:1909.13719). Random cropping may involve selecting random patches of the image and then expanding those patches to the original size of the image. Flipping the image may involve applying a horizontal or vertical flip to the image.
[0097] For images, further possibilities include one or more of the following: color jittering, color dropping, Gaussian blur, solarization, rotation, masking a portion of the image, and adversarial perturbations. Color jittering can include altering one or more of the brightness, contrast, saturation, and hue of some or all pixels of an image by random offsets. Color dropping can include converting the image to grayscale. Gaussian blur can include applying a Gaussian blur kernel to the image; other types of kernels can be used for other types of filtering. Solarization can include applying a solarization color transformation to the image; other color transformations can be used. Masking can include setting the pixels of random patches of the image to a uniform value, such as zero. Applying adversarial perturbations can include applying a perturbation that increases the likelihood that the encoder neural network 120 will generate an incorrect latent vector representation (e.g., the perturbation is determined so as to maximize the error when performing a task using the latent vector).
[0098] For other types of data items, corresponding modifications can be applied. That is, resizing, cropping, flipping, jittering of data item values (local modification), global modification of data item values, and so on can be applied to any type of data item. As an example, for audio data items, example modifications could further include: modifications to amplitude, such as randomly increasing or decreasing the amplitude of the audio; or modifications to the frequency characteristics of the audio, such as randomly filtering the audio.
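A few of the image modifications listed above can be sketched with plain array operations. This is a simplified illustration: the function names, the nearest-neighbour resize, the minimum crop fraction, and the 4-pixel mask size are all assumptions chosen for brevity.

```python
import numpy as np

def random_crop_resize(img, rng, min_frac=0.5):
    """Crop a random patch and resize it back to the original size."""
    h, w = img.shape[:2]
    ch = int(rng.integers(int(min_frac * h), h + 1))
    cw = int(rng.integers(int(min_frac * w), w + 1))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    patch = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h  # nearest-neighbour row indices
    cols = np.arange(w) * cw // w  # nearest-neighbour column indices
    return patch[rows][:, cols]

def random_flip(img, rng):
    """Flip horizontally with probability 0.5."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def color_drop(img):
    """Convert to grayscale while keeping the channel count."""
    gray = img.mean(axis=-1, keepdims=True)
    return np.repeat(gray, img.shape[-1], axis=-1)

def mask_patch(img, rng, size=4):
    """Set a random square patch to zero."""
    out = img.copy()
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    out[top:top + size, left:left + size] = 0.0
    return out

rng = np.random.default_rng(0)
target = rng.random((16, 16, 3))  # stand-in target image
source = mask_patch(color_drop(random_flip(random_crop_resize(target, rng), rng)), rng)
```

Here a source image is derived from a target image by chaining the modifications; in practice each one could be applied with its own probability.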
[0099] When the training data items include multiple source data items, each source data item can be processed by an encoder neural network to generate a corresponding latent vector representing that source data item. These latent vectors can then be aggregated to obtain an aggregated latent vector. The aggregated latent vector, the noisy version of the target data item, and the embedding 124 of the time value can then be processed by a denoising decoder neural network 130 to generate a denoised output.
[0100] In some implementations, aggregating latent vectors may involve determining the mean of the latent vectors. In other implementations, aggregating latent vectors may involve processing the latent vectors using a Transformer neural network, for example, by processing a sequence of latent vectors to generate an output representing the aggregation of latent vectors.
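The mean-based aggregation can be sketched in a few lines. The latent size and the random stand-in latents are hypothetical; in practice each latent would come from the encoder neural network applied to one source data item.

```python
import numpy as np

def aggregate_latents(latents):
    """Aggregate per-source latent vectors by taking their element-wise mean."""
    return np.mean(np.stack(latents, axis=0), axis=0)

rng = np.random.default_rng(0)
latents = [rng.standard_normal(16) for _ in range(3)]  # stand-ins for encoder outputs
z_agg = aggregate_latents(latents)
```

A mean is invariant to the order of the source data items, which is one reason it is a natural baseline before considering a learned aggregator such as a Transformer.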
[0101] Transformer networks can be characterized by a series of self-attention neural network layers. Each self-attention neural network layer has an attention layer input for each element of the input and is configured to apply an attention mechanism to the attention layer input to generate an attention layer output for each element of the input.
[0102] In implementations, the denoising decoder neural network 130 includes a series of neural network layers. The latent vector can be partitioned into a set of subvectors (see below), and the subvectors can be used to modulate the (output) activations of neurons in the neural network layers when the denoising decoder neural network 130 is conditioned on the latent vector 122.
[0103] For example, each subvector can be used to modulate the activations of a corresponding layer in the denoising decoder neural network, or, for a U-Net, the activations of a corresponding layer pair in the denoising decoder neural network. In the case of a U-Net (or its variants), the layer pair can be corresponding layers in the contracting and expanding paths of the U-Net, i.e., corresponding downsampling and upsampling layers. For example, the latent vector can be divided into a number of subvectors that depends on the number of layers in the U-Net, and each subvector can be used to modulate a corresponding layer pair.
[0104] Since the different layers of a U-Net correspond to different resolutions of the feature representation of data items, this can promote specialization among the latent subvectors. Figure 5 (and the inset in Figure 1) illustrates that, when this type of modulation is used, different layers of the U-Net, and therefore different subvectors of the latent vector, can represent different types of features. For example, outer layers (i.e., layers near the input and output) can represent relatively high-frequency features, corresponding to latent subvectors representing, for example, color and texture, while inner layers can represent relatively low-frequency features, corresponding to latent subvectors representing, for example, object or scene pose or structure. Intermediate layers can represent intermediate-frequency features, corresponding to latent subvectors representing, for example, object details.
[0105] Some training implementations involve randomly setting subvectors in the set to zero before modulating the activations of the neural network layers using the set of subvectors, for example by randomly zeroing out a subset of the subvectors. This can be called "layer masking", and it effectively allows two versions of the denoising model, one conditional and one unconditional, to be implemented in the same denoising decoder neural network and trained jointly. Optionally, at inference time (i.e., when denoising), the denoised output 134 can be modified to compensate, for example by adapting the output; as before, the conditioning data is optional. As an example, the layer masking rate can be in the range of 0.01 to 0.5, for example approximately 0.1.
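Layer masking can be sketched as follows. This is a minimal illustration that assumes the latent vector is split into equal-sized subvectors and that each subvector is zeroed independently with the masking rate; the function names and sizes are hypothetical.

```python
import numpy as np

def split_latent(z, num_subvectors):
    """Partition a latent vector into equal-sized subvectors."""
    return np.split(z, num_subvectors)

def layer_mask(subvectors, rate, rng):
    """Zero each subvector independently with probability `rate`."""
    keep = rng.random(len(subvectors)) >= rate
    return [s if k else np.zeros_like(s) for s, k in zip(subvectors, keep)]

rng = np.random.default_rng(0)
z = np.arange(1.0, 13.0)         # stand-in 12-dimensional latent vector
subvectors = split_latent(z, 4)  # one subvector per modulated layer (pair)
masked = layer_mask(subvectors, rate=0.1, rng=rng)
```

With a rate of 1.0 every subvector is zeroed, which corresponds to the fully unconditional case, and with a rate of 0.0 the decoder is always fully conditioned.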
[0106] Layer masking can reduce the dependence of the denoising decoder neural network on specific subvectors, allowing subvectors to be decoupled and to specialize independently, thereby promoting disentanglement between the representations encoded by the subvectors. This can facilitate image editing and style mixing because it allows the decoder to be selectively conditioned at a chosen level of granularity, such as the structure or location of the image, while allowing other aspects such as lighting, texture, or color palette to vary unconditionally.
[0107] When layer masking is not used, training conditioned on the latent vector can involve so-called classifier-free training. This can involve randomly masking, or otherwise removing, the entire latent vector from the input of the denoising decoder neural network 130. This allows the neural network to be trained to generate denoised outputs in both cases, with and without guidance from the conditioning data, as described, for example, in Ho and Salimans, arXiv:2207.12598.
[0108] In some implementations, modulating the activations of a neural network layer using a subvector (when conditioning the denoising decoder neural network 130 on the latent vector 122) involves using the subvector to perform an affine transformation of the activations of the layer.
[0109] In some implementations, modulating the activations of a neural network layer involves normalizing the activations to obtain normalized activations, and then scaling and / or shifting the normalized activations using subvectors. Normalization is optional; many different normalization techniques exist, such as batch normalization, layer normalization, or group normalization. In some implementations, the subvectors can be processed by one or more linear layers to obtain scaled modulation subvectors and / or bias modulation subvectors that can be used to perform the corresponding scaling and / or shifting. Shifting can refer to adding or subtracting values.
[0110] As an example only, the activations at a modulated layer can be determined using group normalization (Wu et al., “Group normalization”, arXiv:1803.08494, 2018), by determining the modulated activations for the layer as $z_s \cdot \mathrm{GroupNorm}(h) + z_b$, to control the scale and bias of the normalization, where $h$ denotes the activations and $z_s$ and $z_b$ are each obtained from the subvector by a linear projection. In some implementations, $z_s$ and $z_b$ are combined with further scale and bias terms and used to modulate $h$.
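[0111] In some implementations, a two-stage modulation technique is used, in which the activations are scaled and / or shifted using the time-value embedding before being scaled and / or shifted using the subvector. For example, this can be achieved by determining the modulated activations for a layer as $z_s\,(t_s \cdot \mathrm{GroupNorm}(h) + t_b) + z_b$ to include the time embedding 124, where $h$ denotes the activations, $z_s$ and $z_b$ are a scale and bias each obtained from the subvector by a linear projection, and $t_s$ and $t_b$ are each obtained by a linear projection of a sinusoidal embedding of the time value $t$.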
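The two-stage modulation can be sketched as below. The sketch assumes the order of operations described above (normalize, apply the time-derived scale and shift, then apply the subvector-derived scale and shift); the parameter names `W_ts`, `W_tb`, `W_zs`, `W_zb`, the bias-free linear projections, and all sizes are hypothetical.

```python
import numpy as np

def group_norm(h, num_groups, eps=1e-5):
    """Normalize activations of shape (channels, height, width) per channel group."""
    c, height, width = h.shape
    g = h.reshape(num_groups, c // num_groups, height, width)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, height, width)

def modulate(h, t_emb, z_sub, params, num_groups=2):
    """Two-stage modulation: time scale/shift first, then subvector scale/shift."""
    t_s = (params["W_ts"] @ t_emb)[:, None, None]  # per-channel time scale
    t_b = (params["W_tb"] @ t_emb)[:, None, None]  # per-channel time shift
    z_s = (params["W_zs"] @ z_sub)[:, None, None]  # per-channel latent scale
    z_b = (params["W_zb"] @ z_sub)[:, None, None]  # per-channel latent shift
    return z_s * (t_s * group_norm(h, num_groups) + t_b) + z_b

rng = np.random.default_rng(0)
channels, t_dim, z_dim = 4, 6, 3
params = {name: rng.standard_normal((channels, dim))
          for name, dim in [("W_ts", t_dim), ("W_tb", t_dim),
                            ("W_zs", z_dim), ("W_zb", z_dim)]}
h = rng.standard_normal((channels, 5, 5))
out = modulate(h, rng.standard_normal(t_dim), rng.standard_normal(z_dim), params)
```

The scale and shift are broadcast over the spatial dimensions, so the modulation acts globally on each channel of the layer.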
[0112] Alternatively, instead of using the subvectors to modify the activations of a layer globally, the activations can be modified locally. This can involve using a cross-attention block to process the activations of one or more layers in the neural network, where the cross-attention block uses attention over the set of subvectors to update the activations. Many different types of attention mechanisms exist. The output of a cross-attention block can be fed into the next (subsequent) layer of the denoising decoder neural network.
[0113] As an example, a cross-attention block can be configured to apply QKV attention by computing the similarity between a query (Q) and a set of key (K)-value (V) pairs. The set of key-value pairs can be determined from the subvectors, and the query from the activations. The output of the cross-attention block can include a weighted sum of the values, weighted by a similarity function between the query and each corresponding key. The similarity function can include, for example, a dot product, cosine similarity, or another similarity measure; the query, keys, and values can all be vectors. For example, a query transformation (e.g., defined by a matrix $W_Q$), a key transformation (e.g., defined by a matrix $W_K$), and a value transformation (e.g., defined by a matrix $W_V$) can each be applied to the corresponding input of the cross-attention block (the subvectors can be used for both the key and value inputs) to derive the corresponding query vector $Q$, key vector $K$, and value vector $V$. These vectors are used to determine the attention-processed output.
[0114] Updating the activations using attention over the set of subvectors can involve using QKV attention, in which the query vectors are derived from the activations and the key vectors and value vectors are derived from the set of subvectors.
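The cross-attention variant can be sketched with scaled dot-product attention. This is an illustration under assumptions: the matrix names `W_q`, `W_k`, `W_v`, the softmax weighting, the scaling by the square root of the query size, and all shapes are hypothetical choices rather than details from the specification.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(acts, subvectors, W_q, W_k, W_v):
    """Queries from activations (n, d_a); keys and values from subvectors (m, d_z)."""
    Q = acts @ W_q
    K = subvectors @ W_k
    V = subvectors @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n, m) attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
acts = rng.standard_normal((10, 8))       # e.g. 10 spatial positions, 8 channels
subvectors = rng.standard_normal((4, 6))  # 4 subvectors of size 6
W_q = rng.standard_normal((8, 5))
W_k = rng.standard_normal((6, 5))
W_v = rng.standard_normal((6, 8))
out, weights = cross_attention(acts, subvectors, W_q, W_k, W_v)
```

Because each spatial position forms its own query, different positions can attend to different subvectors, which is what makes this a local rather than global modification.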
[0115] Using cross-attention to modify the activation of the layer can perform better when generating new 3D views; for image editing, reconstruction and representation learning, two-stage modulation can perform better.
[0116] In some implementations, the encoder neural network 120 is trained at a higher learning rate than the denoising decoder neural network 130. This allows the encoder neural network to adapt to the training data faster than the denoising decoder neural network, which can help the encoder neural network guide the denoising decoder neural network during training.
[0117] One way to achieve this is to reduce the initialized weights of the encoder by a factor of k (by reducing the standard deviation of the initialization distribution), and then to multiply these weights back up by k during training. This effectively scales the encoder gradients by a factor of k. As an example, k can be greater than 1 and less than 10.
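The effect on the gradient can be checked with a scalar toy example. This sketch assumes one reading of the technique: the stored parameter `v` is initialized k times smaller and the forward pass uses the effective weight `w = k * v`. The quadratic loss, the value of `k`, and all names are hypothetical.

```python
def loss(w, x=2.0, y=1.0):
    """Toy squared-error loss for a single weight."""
    return 0.5 * (w * x - y) ** 2

def grad_w(w, x=2.0, y=1.0):
    """Analytic gradient of the toy loss with respect to the effective weight."""
    return (w * x - y) * x

k = 4.0
w_init = 0.3
v = w_init / k          # stored parameter, initialized k times smaller
w = k * v               # effective weight used in the forward pass
grad_v = k * grad_w(w)  # chain rule: dL/dv = k * dL/dw

# Finite-difference check of the gradient with respect to the stored parameter.
h = 1e-6
grad_v_fd = (loss(k * (v + h)) - loss(k * (v - h))) / (2.0 * h)
```

The gradient seen by the stored parameter is k times the gradient with respect to the effective weight, which matches the statement that the encoder gradients are scaled by k.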
[0118] In some implementations, the training data items can be preprocessed to resize and / or normalize them. In other implementations, the output data items can be resized, for example, to increase their resolution, after they have been generated.
[0119] As previously mentioned, in some implementations, the training data items include image data items. The source data items, target data items, noise data items, and output data items then each include pixel values for pixels of a still or moving image. The image can be a 2D or 3D image, in color or monochrome. As defined herein, an "image" includes, for example, a point cloud from a LiDAR system, a "pixel" includes a point of the point cloud, and references to moving images or videos include a time series of point clouds. Images can include real-world images, such as those captured by a camera.
[0120] In some implementations, each training data item includes multiple source data items, each including a corresponding source image, and the target data item includes a target image. The source images and target images each include corresponding views of the same scene. Then, the source images (and optionally one or more target images) may each include different corresponding views of the scene. The denoising decoder neural network 130 can then be trained to generate output data items that include an output image, which is a further, different (i.e., new) view of the scene.
[0121] This could involve determining a viewpoint embedding for each pixel in each source image, representing the coordinates of the viewpoint for the pixel in the corresponding view of the scene. The encoder neural network can then process each source data item in the source data items and the viewpoint embedding for the pixels of the corresponding source image to generate a latent vector representing the source image.
[0122] For 2D images, a view can include a translation of the image, such as in the x or y direction. The viewpoint coordinates for a pixel can include the x and y coordinates of that pixel. For example, for each source image view 110A, each pixel of the image can have associated 2D coordinates, which are used to determine the viewpoint embedding 110B. To determine the viewpoint embedding, the coordinates can be normalized to a defined range.
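A per-pixel 2D coordinate grid can be sketched as follows. The normalization range of -1 to 1 and the helper name `pixel_coordinate_grid` are assumptions made for illustration.

```python
import numpy as np

def pixel_coordinate_grid(height, width):
    """Return an (H, W, 2) grid of (x, y) coordinates normalized to [-1, 1]."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1)

grid = pixel_coordinate_grid(4, 6)
```

A translated view could then be represented by offsetting these coordinates before they are embedded.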
[0123] The system can be trained using multiple views of an object or scene (i.e., source images with different 2D coordinates), and the target image, more specifically a noisy version of the target image, can similarly have associated 2D coordinates used to determine the viewpoint embedding for the target image, which is another view of the object or scene.
[0124] In the case of 3D images, a view can include a perspective view of the object or scene.
[0125] As an example, the viewpoint coordinates for a pixel can include the direction of a ray, in particular the viewing direction for that pixel, corresponding to the direction of a ray entering the scene from that pixel, and its spatial position (location), for example the location of the origin of the ray. That is, the direction of the ray can define the direction along which the pixel views the 3D scene (generally, different pixels will have slightly different viewing directions).
[0126] As another example, the viewpoint coordinates for a pixel can include the camera pose of a real or imaginary camera viewing the object or scene. This camera pose can be represented as a single vector capturing its position and orientation in polar coordinates. As an example, the vector can be determined from the relative camera transformation from the source image to the target image; as another example, the vector can be determined by concatenating the two absolute viewpoints. The vector can be encoded using a sinusoidal embedding as previously described.
[0127] When the coordinates of a viewpoint relative to a pixel include the direction of a ray, there are many ways to define and represent such a ray. For example, in one approach, the image can be viewed as defined on the focal plane of an imaginary camera: the camera translates the angle between the incident light and its optical axis into a displacement on the focal plane. In another approach, the ray can be projected onto a sphere, for example. The direction can be represented in, for example, polar or Cartesian coordinates.
[0128] As a specific example, a ray can be represented by its position (origin) and direction, for example as $r = (o, d)$. Each pixel of an image can have an associated ray position (origin) and direction, such that the (2D) image has an associated grid of ray positions and directions representing a 3D object or scene. The image can be a source image, a target image, or, in inference, the current image (data item). The origin of the ray can correspond to the position of the pixel in the focal plane of the camera that captures the image of the object or scene.
[0129] An embedding can be generated for each of the ray position and direction; for example, a sinusoidal embedding as previously described can be used (regardless of how the viewpoint is represented). To determine the viewpoint embedding, the ray position and direction can be normalized to a defined range.
[0130] For example, the position can be defined by three coordinates (for example, with one coordinate fixed) or by two coordinates, for example normalized to a defined range, and the direction can be defined, for example, as a two-dimensional vector of angles in a spherical coordinate system, or as a three-dimensional unit vector. Therefore, for example, each pixel of an image can have associated 3D viewpoint coordinates (with 4, 5, or 6 dimensions), used to determine the corresponding viewpoint embedding. In some implementations, the ray position $o$ and direction $d$ can be concatenated, for example as $(o, d)$. In some implementations, the ray position and direction can be combined into a parametric sum, for example $o + s\,d$, where $s$ is a scaling factor hyperparameter. The scaling factor $s$ can be chosen in various ways, for example empirically, or so as to normalize to unit length, or so that the resulting point is projected onto the image plane or onto a sphere centered on the object / scene.
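The alternative ray representations can be sketched as below. The helper names, the use of `o` and `d` for the ray origin and direction, and the spherical-angle convention are assumptions made for illustration.

```python
import numpy as np

def concat_rays(origins, directions):
    """Concatenate per-pixel ray origins and directions along the last axis."""
    return np.concatenate([origins, directions], axis=-1)

def parametric_sum(origins, directions, s):
    """Combine origin and direction into a single point o + s * d."""
    return origins + s * directions

def to_spherical(directions):
    """Unit 3-D directions -> (theta, phi) angles, a 2-D alternative encoding."""
    theta = np.arccos(np.clip(directions[..., 2], -1.0, 1.0))
    phi = np.arctan2(directions[..., 1], directions[..., 0])
    return np.stack([theta, phi], axis=-1)

origins = np.zeros((2, 2, 3))        # stand-in 3-D origins for a 2x2 image
directions = np.zeros((2, 2, 3))
directions[..., 2] = 1.0             # every ray points along +z
six_d = concat_rays(origins, directions)                 # 3 + 3 dimensions
five_d = concat_rays(origins, to_spherical(directions))  # 3 + 2 dimensions
points = parametric_sum(origins, directions, s=2.0)
```

Using 2-D origins with 2-D spherical directions would give the 4-dimensional variant mentioned above.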
[0131] When the source image is a real-world image captured using a camera, the position and direction of the ray for each pixel can be determined using known techniques based on the camera's so-called intrinsic parameters.
[0132] Generally, intrinsic parameters define the mapping from the image sensor's coordinate frame to pixel coordinates in the image captured by the image sensor. For example, intrinsic parameters can define an intrinsic transformation matrix that transforms a point from the image sensor's coordinate frame to the pixel coordinate frame. Examples of intrinsic parameters include focal length, aperture, field of view, and resolution. Broadly speaking, the camera translates the angle of incident light (relative to the camera's optical axis) into a displacement on the focal plane. For example, given a camera pose, closed-form calculations can be used to derive a 2D grid of rays, each including an origin and a direction. This can be done using standard software packages such as COLMAP.
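A closed-form derivation of a per-pixel ray grid can be sketched for an idealized pinhole camera. The sketch makes several assumptions that are not from the specification: a single focal length in pixels, a principal point at the image center, a camera looking along +z, a 4x4 camera-to-world pose matrix, and the camera center used as the origin of every ray. The name `pixel_rays` is hypothetical.

```python
import numpy as np

def pixel_rays(height, width, focal, cam_to_world):
    """Return per-pixel ray origins and unit directions, each of shape (H, W, 3)."""
    j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Camera-frame directions through pixel centers for a pinhole camera.
    dirs_cam = np.stack([(i + 0.5 - 0.5 * width) / focal,
                         (j + 0.5 - 0.5 * height) / focal,
                         np.ones_like(i, dtype=float)], axis=-1)
    rot, trans = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rot.T  # rotate directions into the world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(trans, dirs_world.shape).copy()
    return origins, dirs_world

pose = np.eye(4)  # identity pose: camera at the world origin
ray_o, ray_d = pixel_rays(4, 4, focal=2.0, cam_to_world=pose)
```

Real cameras generally need the full intrinsic matrix and possibly lens distortion terms, which calibration or structure-from-motion tools can estimate.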
[0133] When using a sinusoidal position embedding, for either 2D or 3D coordinates, the argument of sin(ωt) and cos(ωt) can optionally be scaled by a scaling factor, which is a hyperparameter. The scaling factor is chosen to increase the distinctiveness between embeddings at different locations, that is, to associate different locations with different embeddings. Figure 6 illustrates the selection of such a scaling factor, with the position within a range represented on the x-axis and the frequency component represented on the y-axis, for both the sine and cosine components of the embedding. The graphs illustrate an excessively high scaling factor (left), a satisfactory scaling factor (middle), and an excessively low scaling factor (right). When a scaling factor is used, its value can be determined empirically.
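A scaled sinusoidal embedding can be sketched as follows. The geometric frequency spacing, the number of frequencies, and the name `sinusoidal_embedding` are assumptions; only the idea of multiplying the sine and cosine arguments by a scaling factor comes from the text.

```python
import numpy as np

def sinusoidal_embedding(coords, num_freqs=4, scale=1.0):
    """Embed coords of shape (..., d) into shape (..., 2 * d * num_freqs)."""
    freqs = scale * (2.0 ** np.arange(num_freqs))  # assumed geometric frequencies
    angles = coords[..., None] * freqs             # (..., d, num_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*coords.shape[:-1], -1)

coords = np.zeros((4, 4, 2))  # stand-in per-pixel 2D coordinates
emb = sinusoidal_embedding(coords, num_freqs=4, scale=1.0)
emb_a = sinusoidal_embedding(np.array([[0.1]]))
emb_b = sinusoidal_embedding(np.array([[0.2]]))
```

If the scaling factor is too low, nearby positions receive nearly identical embeddings; if it is too high, the components oscillate so quickly that distant positions can alias onto each other.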
[0134] Generally, viewpoint coordinates can be provided for the source image, the target image, or both. The viewpoint can be an absolute viewpoint, or a relative viewpoint of the source image with respect to the target image, or vice versa. The viewpoint embedding can be any embedding of the viewpoint coordinates, such as the sinusoidal embedding previously described.
[0135] As previously described, some implementations of the training technique include: determining a viewpoint embedding for each pixel in the target image; and using the denoising decoder neural network to process the viewpoint embeddings for the pixels of the target image, the noisy version of the target data item, the embedding of the time value, and the latent vector to generate a denoised output.
[0136] This training can involve randomly masking the viewpoint embeddings, for example, by setting them to zero, before processing them using the encoder neural network 120, and again before processing them using the denoising decoder neural network. This can help the system use only partial information during inference.
[0137] For example, the trained denoising decoder neural network can be used to generate an image in a requested pose (e.g., without specifying an object), or to generate a defined image (e.g., of a requested object) in an arbitrary pose (view). That is, during inference, information such as the latent vector can be used to specify the object in the image, and / or pose information can be used to specify the pose of the object.
[0138] The training process described above can use unlabeled training data. Each training data item only needs to include one or more source data items and target data items, such as source images and target images, as described above. A wide range of publicly available datasets exist that can be used to train neural network systems. Generally, training data items can correspond to the types of data items that will be processed and / or generated after training.
[0139] As some illustrative examples, datasets that can be used include: ImageNet (objects); CelebA-HQ (Liu et al., 2018; people); ShapeNet (Chang et al., arXiv:1512.03012, 2015; 3D objects); Google Scanned Objects (GSO; Downs et al., arXiv:2204.11918, 2022; 3D objects); Co3D (Reizenstein et al., “Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction”, arXiv:2109.00512, https://ai.meta.com/datasets/co3d-dataset/; 3D multi-view images of common real-world objects); Caltech-UCSD Birds (CUB-200-2011; Welinder P. et al., “Caltech-UCSD Birds200”, California Institute of Technology, CNS-TR-2010-001, 2010, 2011; can be used to evaluate disentanglement).
[0140] Furthermore, to generate new 2D or 3D views of objects or scenes, source and target images can be captured simply by moving a camera around an indoor or outdoor environment (such as an office or a building), recording the camera position, and knowing the camera's intrinsic parameters, particularly the optical center and focal length. Once the system is trained, new 2D or 3D images of that environment can be generated.
[0141] Once trained, tasks can be performed using only the encoder neural network, only the denoising decoder neural network, or both. Examples of visual tasks and other tasks that can be performed are described later.
[0142] Figure 7 is a flowchart of an example process for generating output data items. The process generally operates by incrementally reducing the level of noise in the output data item over multiple time steps. The process uses a denoising decoder neural network; for convenience, the process is described with reference to the denoising decoder neural network 130 of Figure 1. The process of Figure 7 can be performed by one or more computers in one or more locations.
[0143] In this implementation, the process includes obtaining an initial version of the current data item (step 700), for example, by sampling from a noise distribution. The initial version of the current data item may simply include noise.
[0144] At each of a series of time steps (which may be strided time steps), the denoising decoder neural network (e.g., denoising decoder neural network 130), trained as described above, is used to process the current data item and an embedding of the time value for that time step to generate a denoised output 134 (step 702). The denoised output 134 includes the estimated noise data item for that time step.
[0145] Then, the current data item can be updated using the estimated noise data item for that time step (step 704). As an example, the current data item can be updated using the estimated noise data item to compensate for noise in the current data item, for example, by using the estimated noise data item as the current data item. As another example, the current data item can be updated, for example, by subtracting the estimated noise data item from the current data item (where the estimated noise data item includes an estimate of the noise in the current data item). One or both of the current data item and the estimated noise data item can be scaled. In some cases, such as those utilizing a nondeterministic mapping like DDPM, noise can be added to the updated current data item (effectively sampling from a distribution whose mean depends on the estimated noise data item). In this way, an updated version of the current data item is obtained.
[0146] Generally, the update steps can be performed based on known diffusion or consistency model techniques; for example, DDPM or DDIM updates can be used. As an example only, a DDPM update can involve determining …, where the value of … is as described above.
[0147] The output data item may include (or be) the updated version of the current data item at the final time step (step 706). Generally, the final time step is a time step after a predetermined number of denoising time steps, as previously described.
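As an illustration of steps 700 to 706, the following is a minimal sketch of a DDPM-style ancestral sampling loop in Python with NumPy. It is an example under stated assumptions and not necessarily the exact update used in the described system: the linear noise schedule, the function names (`make_schedule`, `toy_denoiser`, `ddpm_sample`), and the placeholder denoiser are hypothetical, and the integer time step stands in for the time-value embedding.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x, t, latent=None):
    """Hypothetical stand-in for the trained denoising decoder.

    A real model would be a neural network that returns an estimate of the
    noise in x, given the time step and (optionally) a latent vector.
    """
    return 0.1 * x

def ddpm_sample(denoiser, shape, num_steps=50, latent=None, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)            # step 700: start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t, latent)      # step 702: estimate the noise
        # Step 704: subtract the scaled noise estimate (DDPM posterior mean).
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Nondeterministic update: add fresh noise on all but the last step.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x                                  # step 706: final updated data item

sample = ddpm_sample(toy_denoiser, shape=(2, 4, 4))
```

Passing `latent=None` corresponds to unconditional generation in this sketch; a conditional variant would pass a latent vector through to the denoiser.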
[0148] In some implementations, the output data items are generated unconditionally, for example, by omitting or masking (e.g., setting to zero) the latent vector input to the denoising decoder neural network. In other implementations, the output data items are generated conditionally, for example, based on the latent vector provided to the denoising decoder neural network or a partially masked latent vector.
[0149] Some implementations of this method involve obtaining a latent vector representing one or more properties of the output data item, such as the latent vectors described above. The latent vector is then processed by the denoising decoder neural network to generate an output data item with the specified properties.
[0150] As an example, obtaining a latent vector may involve: obtaining a data item (e.g., an image); and processing the data item using an encoder neural network 120 to generate a latent vector representing the data item. Then, one or more elements of the latent vector representing the data item can be modified to obtain the latent vector for processing by the denoising decoder neural network 130.
[0151] When trained as described above, the encoder neural network can generate disentangled latent variable representations of the data items, which makes it easier to modify aspects of the output data items by modifying the latent vectors processed by the denoising decoder neural network.
[0152] In some implementations, particularly where training involves layer masking, generating the output data item may involve processing the current data item, the time-value embedding, and a masked version of the latent vector using the denoising decoder neural network to generate a second denoised output that includes a second estimated noise data item for that time step. The current data item can then be updated using a combination of the estimated noise data item for that time step and the second estimated noise data item for that time step. For example, the current data item can be updated using the difference between the estimated noise data item and the second estimated noise data item, for example, to determine …, as described above.
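One common way to combine two such noise estimates is the classifier-free guidance form, sketched below in Python with NumPy. This is offered as an assumption, since the exact expression is not reproduced in this passage: the guidance weight, the zero-masking of the latent vector, and the names `guided_noise` and `toy_denoiser` are hypothetical.

```python
import numpy as np

def toy_denoiser(x, t, latent):
    """Hypothetical stand-in for the denoising decoder (not a trained model)."""
    return 0.1 * x + latent.mean()

def guided_noise(denoiser, x, t, latent, guidance_weight=2.0):
    """Combine the conditional and masked-latent noise estimates."""
    eps_cond = denoiser(x, t, latent)                    # estimate with the latent vector
    eps_masked = denoiser(x, t, np.zeros_like(latent))   # second estimate, latent masked to zero
    # Move from the masked estimate along the difference between the two.
    return eps_masked + guidance_weight * (eps_cond - eps_masked)

x = np.ones((2, 2))
latent = np.array([1.0, 3.0])
eps = guided_noise(toy_denoiser, x, t=0, latent=latent)
```

With a weight of 1 the result is the conditional estimate, with 0 it is the masked estimate, and weights above 1 push the update further toward the properties encoded in the latent vector.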
[0153] As previously described, in some applications, the current data item includes the current image, the output data item includes the output image, and generating the output data item includes determining the pixel values for the current image and the pixel values for the output image.
[0154] In some applications, latent vectors can represent compressed versions of the input image. For example, a latent vector representing a compressed version of the input image may have been obtained by processing the input image using an encoder neural network to generate the latent vector. This compressed latent vector can, for example, be stored or transmitted (e.g., via a network). Generating output data items that include the output image can reconstruct a version of the input image, for example, representing the corresponding semantic content.
[0155] When the output data item includes an image, generating the image may involve: obtaining viewpoint data that defines a target viewpoint for the output image; determining a pixel viewpoint for each pixel in the current image from the target viewpoint; and determining a viewpoint embedding for each pixel in the current image. The denoising decoder neural network 130 can then process the pixel values for the current image, the viewpoint embeddings for the current image's pixels, the embedding of the time value for that time step, and (optionally) the latent vector to generate the output image. The pixel viewpoints can be obtained as described above; the viewpoint embeddings can be determined and processed as described above.
[0156] As an example, for a new 2D image, the target viewpoint (view) for the output image can define the 2D location of the image, and the pixel viewpoint for each pixel can include a … coordinate for that pixel.
[0157] As another example, for a new 3D image, the target viewpoint (view) for the output image can define the pose or orientation of the imaginary camera capturing the output image. The pixel viewpoint for each pixel can include a … ray for that pixel, or a camera pose vector for each pixel, as previously described.
[0158] Generally, generating a new view of an output data item (such as an image of an object or scene) involves obtaining one or more source data items (e.g., images) and corresponding viewpoint embeddings (e.g., for the pixels of the source images), which represent the respective views of the source data items. The source data item, or each source data item (e.g., an image), is processed by the encoder neural network 120 to generate a latent vector representing that data item. Where there are multiple source data items, the latent vectors can be aggregated as previously described. The (aggregated) latent vectors are then used to conditionally generate new data items (e.g., new images) with new viewpoints from the trained denoising decoder neural network 130.
[0159] Figure 8 is a flowchart of an example process for performing data item processing tasks; for convenience, the process is described with reference to the encoder neural network 120 of Figure 1. The process of Figure 8 can be performed by one or more computers in one or more locations.
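To make the idea of a per-pixel ray concrete, the following Python sketch computes one ray per pixel from a camera pose and intrinsic parameters. It assumes a simple pinhole camera looking along the +z axis with a single focal length; the ray parameterization actually used is not reproduced in this passage, and the function name `pixel_rays` and its arguments are hypothetical.

```python
import numpy as np

def pixel_rays(height, width, focal, cx, cy, cam_to_world_rot, cam_position):
    """Return per-pixel ray origins and unit directions in world coordinates.

    Assumes a pinhole camera: (cx, cy) is the optical center in pixels,
    `focal` the focal length in pixels, `cam_to_world_rot` a 3x3 rotation.
    """
    # v indexes rows, u indexes columns; both have shape (height, width).
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    dirs_cam = np.stack(
        [(u - cx) / focal, (v - cy) / focal, np.ones_like(u, dtype=float)],
        axis=-1,
    )
    # Rotate camera-frame directions into the world frame and normalize.
    dirs_world = dirs_cam @ cam_to_world_rot.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Every ray starts at the camera position (read-only broadcast view).
    origins = np.broadcast_to(cam_position, dirs_world.shape)
    return origins, dirs_world

origins, directions = pixel_rays(3, 3, focal=1.0, cx=1.0, cy=1.0,
                                 cam_to_world_rot=np.eye(3),
                                 cam_position=np.zeros(3))
```

The resulting origin and direction for each pixel could then be embedded (for example with a positional encoding) to form the viewpoint embedding for that pixel.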
[0160] The process involves, in particular, receiving latent vectors from the encoder neural network 120 after training as described above (step 800); and performing data item processing tasks using the latent vectors (step 802).
[0161] As an example, a data item processing task may include an image processing task. The data item then includes an image, the latent vector of which has been generated by an encoder neural network by processing the pixels of the image, and performing an image processing task may include using the latent vector, for example, to predict one or more properties of the image, such as, in a classification task or other prediction task, predicting the classification or category or property (e.g., color / color scheme, shape, texture, pose, lighting, semantic attributes, etc.) of the image or one or more objects in the image.
[0162] In another example, a data item processing task may include a control task. For instance, a data item may include an image, the latent vector of which has been generated by an encoder neural network by processing the pixels of the image, and performing a data item processing task may include using the latent vector to control a mechanical agent acting in a real-world environment to perform mechanical tasks.
[0163] The following describes several example uses of the techniques and neural networks described herein for data item generation and editing.
[0164] The method for generating output data items described above can be used to generate data items of any type, either unconditionally or conditioned on a latent vector representation that defines properties of the data item. Since the latent vector representation is usually disentangled, the values of the different elements of the latent vector can represent semantically meaningful variations of the data item and can be chosen to define selected properties of the data item.
[0165] As described above, data items can include images, and latent vectors can define, for example, objects and / or properties of objects to be represented by the image. In the case of a moving image, this can be generated to represent the motion characteristics of objects in the image, such as the motion of vehicles on a road or human motion such as walking or running.
[0166] As another example, data items can include audio data items representing an audio signal, for example as values of a digitized waveform of the audio signal, or as a spectrogram such as a mel-spectrogram. Latent vectors can define the content of the audio signal; for example, they can define words in natural language to be represented by the audio signal as speech, and / or other characteristics such as emotion, speaker characteristics such as age or gender, etc. As another example, latent vectors can define musical notes or other sounds to be generated by an instrument and / or the type of instrument.
[0167] As another example, a data item can represent one or more chemical molecules, such as one or more proteins or ligands, for example, represented as a point cloud. A latent vector can define one or more properties of a chemical molecule, such as in terms of its or their physical or chemical structure or properties. Data items can be used to determine the 3D structure of a chemical molecule, for example, to identify ligands such as drugs or one or more binding sites for that ligand. This can be used as part of a screening process to identify another chemical molecule that binds to one chemical molecule. Such a screening process can involve evaluating the interaction of one or more candidate ligands with the structure of a target (e.g., a target protein), and then selecting one or more candidate ligands from the candidate ligands based on the results of the evaluation. For example, the target can include a receptor or an enzyme, and the ligand can be an agonist or antagonist of the receptor or enzyme. The ligand can be a drug or a ligand for an industrial enzyme. This process can also involve: synthesizing the molecule (e.g., the ligand) identified by the screening process; and optionally also testing the activity (e.g., biological activity) of the molecule (e.g., the ligand) in vitro and / or in vivo.
[0168] As another example, a data item can represent the output of a scientific or medical instrument, such as an electrocardiograph or a body scanner like an MRI machine. A latent vector can define one or more properties of the data item, such as whether the data item represents a signal from a healthy body or a diseased body. The data item can then be compared to a corresponding data item obtained from the patient, and the comparison can be made to identify the possible presence or absence of a disease.
[0169] In any of the above applications, the latent vector can, but does not need to, be obtained by processing data items of a similar type to the output data items using the trained encoder neural network 120, and then modifying the latent vector according to the desired properties. For example, the values of some elements of the latent vector (e.g., elements corresponding to the desired properties or characteristics of the data items to be generated) can be retained, and the values of other elements of the latent vector can be modified.
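The retain-and-modify step described above can be sketched as a masked replacement on the latent vector. This is a minimal Python illustration under assumptions: which elements correspond to which properties depends on the trained encoder, and the function name `edit_latent` and the example values are hypothetical.

```python
import numpy as np

def edit_latent(latent, target_values, edit_mask):
    """Replace the elements selected by edit_mask and keep the others unchanged."""
    latent = np.asarray(latent, dtype=float)
    target_values = np.asarray(target_values, dtype=float)
    edit_mask = np.asarray(edit_mask, dtype=bool)
    return np.where(edit_mask, target_values, latent)

# Hypothetical latent vector from the encoder; only element 1 is modified.
z = np.array([0.2, -1.0, 0.7])
z_edited = edit_latent(z, target_values=[0.0, 1.5, 0.0],
                       edit_mask=[False, True, False])
```

The edited vector would then be supplied to the denoising decoder in place of the original latent vector when generating the output data item.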
[0170] More specifically, the described technique involves training an encoder neural network 120 to generate latent vectors. The latent vector includes human-interpretable variation factors, meaning that the values of the vector's elements correspond to specific characteristics of the generated data item. For example, in the case of an image, variation factors could correspond to physical aspects of the image (such as viewpoint or lighting) or semantic aspects of the image (such as maturity (e.g., kitten vs. adult cat)).
[0171] As previously described, the generated data item can be a reconstructed or edited version of the data item encoded by the (trained) encoder neural network 120 with the values of one or more elements of the latent vector representation of the data item modified.
[0172] In the case of image data items, this can be used to modify the semantic content of the image, and / or the view represented by the image (e.g., a perspective view), and / or the attributes of one or more objects represented in the image, such as shape, color, pose, etc.
[0173] In the case of audio data items, this can be similarly used to modify the semantic content of the audio, and / or characteristics of the audio such as pitch or timbre, and / or attributes of one or more objects represented in the audio, etc.
[0174] When a data item represents one or more chemical molecules, this can be used to modify the physical (e.g., structure) or chemical properties of the molecules, for example, to increase or decrease the interaction between the molecule and another molecule or atom.
[0175] Figure 9 shows an example of a modified image generated by the denoising decoder neural network 130 when trained and used as described above.
[0176] Specifically, Figure 9A shows examples of images generated by linearly traversing the latent space from one latent vector … to another latent vector … according to … (where … is in …), which exemplify smooth transitions between semantic attributes.
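A linear traversal between two latent vectors can be sketched as follows in Python with NumPy. The interpolation weights in the range 0 to 1 and the function name `interpolate_latents` are assumptions for illustration; each interpolated vector would be passed to the denoising decoder to generate one image in the sequence.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_points=5):
    """Return latent vectors evenly spaced on the line from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    weights = np.linspace(0.0, 1.0, num_points)
    return np.stack([(1.0 - w) * z_start + w * z_end for w in weights])

path = interpolate_latents([0.0, 0.0], [2.0, 4.0], num_points=5)
```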
[0177] Figure 9B shows examples of images generated by changing elements of the latent vector after training on a set of images …, illustrating that these elements correspond to meaningful disentangled variation factors, which in the examples are maturity, expression, and color. The most meaningful elements can be found, for example, by applying principal component analysis to the latent vectors … to identify the latent directions of greatest variation; in the figure, examples of these latent directions are traversed according to ….
[0178] Figure 10 illustrates the generation of a new 3D view of an object from a requested camera viewpoint using the techniques described above, given one or more source views conditioned on a viewpoint. Generally, only 1 to 10 source views are needed, for example, 1 to 3 source views. Figure 10A shows an example in which a new 3D perspective view of an object is generated from two different source image views. Figure 10B shows examples of new 3D perspective views of various objects, each generated from a single source image view.
[0179] The following describes several example uses of the representations generated using the techniques described herein.
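The principal component analysis step can be sketched with a singular value decomposition of the centered latent vectors. This is a minimal Python illustration under assumptions: the function names `principal_directions` and `traverse`, the number of directions, and the traversal scales are hypothetical, and the random array stands in for latent vectors produced by the encoder.

```python
import numpy as np

def principal_directions(latents, num_directions=3):
    """Return the mean, the top principal directions, and their variances."""
    mean = latents.mean(axis=0)
    centered = latents - mean
    # Rows of vt are unit-norm directions ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values**2 / (len(latents) - 1)
    return mean, vt[:num_directions], variances[:num_directions]

def traverse(z, direction, scales):
    """Move a latent vector along one direction by each scale in turn."""
    return np.stack([z + s * direction for s in scales])

# Stand-in latent vectors with most variance along the first element.
rng = np.random.default_rng(0)
latents = rng.standard_normal((200, 4)) * np.array([10.0, 1.0, 0.1, 0.1])
mean, directions, variances = principal_directions(latents, num_directions=2)
edits = traverse(mean, directions[0], scales=[-2.0, 0.0, 2.0])
```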
[0180] The trained encoder neural network 120 can be used to process data items to generate a representation of the data item as a latent vector, which can then be used in subsequent processing tasks. Many known techniques and systems exist for processing representations of data items to perform specific tasks; the trained encoder neural network 120 can be used to provide a front end for preprocessing data for any of these techniques / systems.
[0181] When the data item includes an image, the latent vector representation of the image can then be processed to classify the image or one or more objects in the image, or to identify the presence of one or more objects in the image, for example, by determining a score for each of the possible classification categories for the image. If the image is a moving image, this can be correspondingly used to identify the presence of one or more actions in the image, or to classify actions in the image, or to predict actions or events in the image. Other tasks can be performed in a similar manner, such as 3D pose estimation tasks to estimate the pose of one or more objects represented in an image, shape estimation tasks, tasks involving recognizing aspects of an image using color, counting tasks involving counting objects or objects of a particular type, or tasks involving understanding spatial relationships between objects or object attributes. Generally, the output of a system used to perform such tasks can include any continuous or discrete representation of the desired output data.
[0182] As another example, where the data item includes images (e.g., images captured by a camera) as observations of a real-world environment, the latent vector representation of these images can be used as input to a control system of a mechanical agent, such as a robot or vehicle operating in a real-world environment. The latent vectors can provide a compact representation of varying factors related to the mechanical agent's operation, thereby facilitating agent control. For example, latent vectors can represent relevant objects in the real-world environment, as well as aspects of the real-world environment, in a compact form. Therefore, latent vectors can be used by the control system to make decisions and / or take actions in the environment, such as to complete tasks performed by the agent or to control the agent's direction of movement and / or speed.
[0183] When the data items include audio data items (e.g., a digitized audio waveform or spectrogram representation of an audio signal), the latent vector representation of the audio can then be processed to classify the audio or one or more sound objects in the audio, or to identify the presence of one or more sound objects in the audio. As some examples, the system can perform recognition or classification tasks (such as speech or voice recognition tasks, voice or speaker classification tasks, audio labeling tasks (in which case the output may be a category score or label for a data item or a segment of a data item)), or similarity determination tasks (e.g., audio copy detection or search tasks), in which case the output may be a similarity score.
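As one possible illustration of a similarity determination task, the following Python sketch scores two latent vectors with cosine similarity. The choice of cosine similarity and the function name `cosine_similarity` are assumptions; the text does not specify how the similarity score is computed.

```python
import numpy as np

def cosine_similarity(z_a, z_b, eps=1e-12):
    """Cosine of the angle between two latent vectors, in the range [-1, 1]."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    # eps guards against division by zero for all-zero vectors.
    return float(z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b) + eps))

score = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a copy detection or search setting, a score close to 1 for the latent vectors of two audio data items would suggest near-duplicate content, subject to a threshold chosen for the application.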
[0184] As a further example, the system for which the trained encoder neural network 120 provides a front end can be a multimodal machine learning model, such as a vision-language model. The trained encoder neural network can process image data items or audio data items to provide latent vector representations for further processing by such a model.
[0185] The model output from such a model can include any form of output appropriate for the machine learning task performed by the multimodal machine learning model. For example, the model output can include text in natural or computer language that defines the outcome of the task, such as for tasks like image captioning, visual question answering, or object detection or instance segmentation. Alternatively, the model output can include data that defines image, video, or audio objects, for example, in generative tasks; or the model output can include non-textual action selection data for selecting actions to be performed by an agent controlled by the model.
[0186] Some example multimodal machine learning models in which the trained encoder neural network 120 can be used include: Flamingo (Alayrac et al., arXiv:2204.14198); ALIGN (Jia et al., arXiv:2102.05918); PaLI (Chen et al., arXiv:2209.06794); and PaLI-X (Chen et al., arXiv:2305.18565). Some examples of multimodal machine learning models for controlling an agent in which the trained encoder neural network can be used are described in: PaLM-E (Driess et al., arXiv:2303.03378); RT-1 (Brohan et al., arXiv:2212.06817); and RT-2 (Brohan et al., arXiv:2307.15818).
[0187] As an example of the quality of the representations learned by the trained encoder neural network 120, a linear probing classification task was performed after training on ImageNet-1K (1000 object classes), in which a linear classifier fitted on the latent vectors … is used to predict the object class. The described technique achieves an accuracy of 72% (top-1 prediction), which is a high score for representations learned using unsupervised learning.
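A linear probe of this kind can be sketched as softmax regression fitted on frozen latent vectors. The Python example below is a toy illustration under assumptions, not the evaluation protocol behind the reported figure: the optimizer settings, the function names `fit_linear_probe` and `top1_accuracy`, and the synthetic latent vectors are hypothetical.

```python
import numpy as np

def fit_linear_probe(latents, labels, num_classes, lr=0.1, steps=300):
    """Fit a linear softmax classifier on fixed latent vectors by gradient descent."""
    n, d = latents.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = latents @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n                  # cross-entropy gradient w.r.t. logits
        weights -= lr * latents.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def top1_accuracy(latents, labels, weights, bias):
    """Fraction of items whose highest-scoring class matches the label."""
    preds = np.argmax(latents @ weights + bias, axis=1)
    return float(np.mean(preds == labels))

# Synthetic stand-in for encoder outputs: three well-separated clusters.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
latents = 5.0 * np.eye(3)[labels] + 0.5 * rng.standard_normal((300, 3))
weights, bias = fit_linear_probe(latents, labels, num_classes=3)
accuracy = top1_accuracy(latents, labels, weights, bias)
```

Only the linear layer is trained in this sketch; the encoder that produced the latent vectors stays fixed, which is what makes the probe a measure of representation quality. A real evaluation would report accuracy on held-out data rather than on the training set.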
[0188] In this specification, the term "configured" is used in connection with computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or combinations thereof that enable it to perform specific operations or actions during operation. For example, configuring a system may involve installing a software library with specific algorithms, updating firmware with new instructions for data manipulation, or adding hardware components to gain enhanced processing power. Similarly, one or more computer programs are "configured" to perform those intended operations or actions when they contain instructions that, when executed by a computing device or hardware, cause the device to perform specific operations or actions.
[0189] The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuit systems, software, firmware, computer hardware (covering the disclosed structures and their equivalents), or any combination thereof. The subject matter can be implemented as one or more computer programs, which are essentially modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by a computing device or hardware or for controlling the operation of a computing device or hardware. The storage medium can be a storage device such as a hard disk drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination thereof. Additionally or alternatively, program instructions can be encoded on transmitted signals such as machine-generated electrical, optical, or electromagnetic signals, designed to carry information for transmission to a receiving device or system for execution by the computing device or hardware. Furthermore, implementations can leverage emerging technologies such as quantum computing or neuromorphic computing for specific applications and can be deployed in distributed or cloud-based environments where components reside on different machines or within cloud infrastructure.
[0190] The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In addition to hardware, computing devices or hardware may also include code that creates the execution environment for computer programs. This code can take the form of processor firmware, protocol stacks, database management systems, operating systems, or combinations of these elements. In a general-purpose computing on graphics processing units (GPGPU) environment, where code specifically designed for GPU execution (often referred to as kernels or shaders) is employed, embodiments can particularly benefit from leveraging the parallel processing capabilities of the GPU. Similarly, TPUs excel at running optimized tensor operations that are crucial to many machine learning algorithms. By utilizing these accelerators and their specialized programming models, the system can achieve significant acceleration and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in fields such as computer vision, natural language processing, and robotics.
[0191] Computer programs (also known as software, applications, modules, scripts, code, or simply programs) can be written in any programming language, including compiled or interpreted languages, as well as declarative or procedural languages. They can be deployed in various forms, such as standalone programs, modules, components, subroutines, or any other unit suitable for use in a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., a script within a markup language document), residing in a dedicated file, or being distributed across multiple coordinated files (e.g., files storing modules, subroutines, or code segments). Computer programs can execute on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected via data communication networks. The specific implementation of a computer program can involve a combination of traditional programming languages and programming for GPGPUs or TPUs utilizing specialized languages or libraries designed for them, depending on the chosen hardware platform and desired performance characteristics.
[0192] In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which may be located at a single site or distributed across multiple locations. In some cases, one or more dedicated computers may be used for a particular engine, while in others, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning may include data preprocessing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of an engine will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors.
[0193] The processes and logic flows described in this specification can be executed by one or more programmable computers that run one or more computer programs to perform functions by manipulating input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be used to implement aspects of these processes and logic flows concurrently, thereby significantly accelerating execution. This approach offers significant advantages for computationally intensive tasks common in AI and machine learning applications, such as matrix multiplication, convolution, and other operations exhibiting high parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedup and efficiency gains can be achieved compared to relying solely on CPUs. Alternatively, or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), to achieve even higher performance or energy efficiency in specific use cases.
[0194] Computers capable of executing computer programs can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators typically work in conjunction with the CPU, handling specialized computations, while the CPU manages overall system operation and other tasks. Typically, the CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both. The basic components of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of the processing unit and memory will depend on factors such as the complexity of the AI model, the amount of data being processed, and the desired performance and latency requirements. Implementations can be carried out on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. Systems may include storage devices such as hard disks, SSDs, or flash memory for persistent data storage.
[0195] Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and storage devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); optical media; and optical discs such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or durability.
[0196] To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on computing devices equipped with display devices (such as liquid crystal displays (LCDs) or organic light-emitting diode (OLED) displays) for presenting information to users. Input can be provided by the user through various means, including keyboards, touchscreens, voice commands, gesture recognition, or other input modalities, depending on the specific device and application. Additional input methods may include sound, voice, or tactile input, while feedback to the user may take the form of visual, auditory, or tactile feedback. Furthermore, the computer can interact with the user by exchanging documents with the user's device or application. This may involve sending web content or data in response to a request, or sending and receiving text messages or other forms of messages via mobile devices or messaging platforms. The choice of input and output modalities will depend on the specific application and the desired form of user interaction.
[0197] Machine learning models can be implemented and deployed using machine learning frameworks such as TensorFlow or JAX. These frameworks provide comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models.
[0198] Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These components may include: backend components, such as backend servers or cloud-based infrastructure; optional middleware components, such as middleware servers or application programming interfaces (APIs), for facilitating communication and data exchange; and frontend components, such as client devices having a user interface, a web browser, or an app through which users can interact with the implemented subject matter. For example, the described functionality may be implemented only on the client device (e.g., for on-device machine learning) or deployed as a combination of frontend and backend components for more complex applications. Where present, these components can be interconnected using any form or medium of digital data communication, for example communication networks such as local area networks (LANs) or wide area networks (WANs), including the Internet. The specific system architecture and component selection will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience.
[0199] Computing systems can include geographically separated clients and servers that interact via communication networks. The specific type of network (such as a local area network (LAN), wide area network (WAN), or the Internet) will depend on the accessibility and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. Depending on the nature of the data being exchanged and the system's security requirements, these protocols may include HTTP, TCP/IP, or other specialized protocols. In some embodiments, the server transmits data or instructions to a user device acting as a client, such as a computer, smartphone, or tablet. The client device can then process the received information, display the results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interaction between the user and the system, enabling a wide range of applications and functionalities.
[0200] While this specification contains numerous details of specific implementations, these details should not be construed as limiting the scope of any invention or the scope of any claims, but rather as descriptions of features that may be characteristic of particular embodiments of a particular invention. Certain features described in this specification within the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments. Furthermore, although features may be described above as operating in certain combinations and even initially claimed in this way, in some cases one or more features from the claimed combination may be removed from the combination, and the claimed combination may involve sub-combinations or variations thereof.
[0201] Similarly, although operations are depicted in the accompanying drawings and described in a specific order in the claims, this should not be construed as requiring such operations to be performed in the specific order shown or in sequential order, or requiring the performance of all illustrated operations to achieve the desired result. In some contexts, multitasking and parallel processing may be advantageous. Furthermore, the separation of the various system modules and components in the embodiments described above should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0202] Specific embodiments of this subject matter have been described. Other embodiments are within the scope of the appended claims. For example, the actions recited in the claims can be performed in a different order and still achieve the desired result. As an example, the processes depicted in the drawings do not necessarily require a specific order or sequence to achieve the desired result. In some cases, multitasking and parallel processing can be advantageous.
Claims
1. A computer-implemented method for training a neural network system, the neural network system comprising an encoder neural network and a denoising decoder neural network, the denoising decoder neural network being used to generate output data items by incrementally reducing the level of noise in the output data items at multiple time steps, the method comprising: obtaining multiple training data items, each of which includes at least one source data item and a target data item, wherein the source data item and the target data item represent views of an object; and, for each training iteration in multiple training iterations: obtaining one of the training data items; processing the at least one source data item in the training data item using the encoder neural network to generate at least one latent vector representing the at least one source data item; obtaining the time value of one of the time steps; processing a noisy version of the target data item, an embedding of the time value, and the latent vector using the denoising decoder neural network to generate a denoised output including an estimated noise data item for that time step, for compensating for the noise in the noisy version of the target data item; and training the denoising decoder neural network and the encoder neural network by backpropagating the gradient of an objective function that depends on the accuracy with which the estimated noise data item estimates the noise in the noisy version of the target data item.
2. The method of claim 1, further comprising: sampling a noise data item from a noise distribution; and determining the noisy version of the target data item using the noise data item; wherein the noise data item is scaled by a scaling factor that depends on the time value; and wherein the objective function depends on the difference between the noise data item and the estimated noise data item.
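The noising and objective of claims 1 and 2 can be illustrated with a minimal sketch. It assumes a cosine noise schedule (`alpha_bar`) and a mean-squared-error objective, neither of which is specified by the claims; all function names are hypothetical, and in a real system the noise estimate would come from the denoising decoder neural network conditioned on the latent vector rather than from the stand-in values used here.

```python
import math
import random

def alpha_bar(t):
    """Assumed cosine schedule: share of signal variance kept at time t in [0, 1]."""
    return math.cos(0.5 * math.pi * t) ** 2

def add_noise(target, noise, t):
    """Noisy version of the target; the noise is scaled by a time-dependent factor."""
    signal_scale = math.sqrt(alpha_bar(t))
    noise_scale = math.sqrt(1.0 - alpha_bar(t))
    return [signal_scale * x + noise_scale * e for x, e in zip(target, noise)]

def noise_prediction_loss(noise, estimated_noise):
    """Assumed objective: mean squared difference between sampled and estimated noise."""
    return sum((e - p) ** 2 for e, p in zip(noise, estimated_noise)) / len(noise)

rng = random.Random(0)
target = [0.5, -1.0, 2.0]                       # toy target data item
noise = [rng.gauss(0.0, 1.0) for _ in target]   # noise data item from a Gaussian
t = rng.random()                                # time value for this iteration
noisy_target = add_noise(target, noise, t)

# Stand-ins for decoder outputs: a perfect estimate and an all-zero estimate.
loss_perfect = noise_prediction_loss(noise, noise)
loss_zero_guess = noise_prediction_loss(noise, [0.0] * len(noise))
```

A perfect estimate gives zero loss, so minimizing this quantity rewards both the decoder and, through the latent vector, the encoder for making the noise predictable.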
3. The method of claim 2, wherein the scaling factor varies with the time value and has a vertical asymptote at one or both of the time value of the initial time step and the time value of the final time step of the time steps.
4. The method of claim 1, 2, or 3, wherein the denoising decoder neural network is configured to generate the output data item by, at each of the time steps: processing the current data item, the embedding of the time value for that time step, and the latent vector using the denoising decoder neural network to generate the denoised output including the estimated noise data item for that time step; and updating the current data item using the estimated noise data item for that time step to compensate for the noise in the current data item, so as to obtain an updated version of the current data item; wherein the output data item includes the updated version of the current data item at the final time step.
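The iterative generation of claim 4 can be sketched as a loop that estimates the noise and then updates the current data item. The claim does not fix an update rule; this sketch substitutes a deterministic DDIM-style update and a clipped cosine schedule, both of which are assumptions. The `oracle` function is a hypothetical stand-in for a trained denoising decoder: it computes the exact noise relative to a known clean item and ignores the latent vector.

```python
import math
import random

def alpha_bar(t):
    # Assumed cosine schedule, clipped so the division below stays finite.
    return min(1.0, max(1e-4, math.cos(0.5 * math.pi * t) ** 2))

def sample(estimate_noise, latent, initial, num_steps):
    """DDIM-style loop: estimate the noise, then update the current data item."""
    current = list(initial)
    for i in range(num_steps, 0, -1):
        t, t_next = i / num_steps, (i - 1) / num_steps
        a, a_next = alpha_bar(t), alpha_bar(t_next)
        eps = estimate_noise(current, t, latent)
        # Remove the estimated noise to obtain a clean estimate ...
        clean = [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
                 for x, e in zip(current, eps)]
        # ... then re-noise that estimate to the (lower) next noise level.
        current = [math.sqrt(a_next) * c + math.sqrt(1.0 - a_next) * e
                   for c, e in zip(clean, eps)]
    return current

true_item = [0.5, -1.0, 2.0]

def oracle(current, t, latent):
    """Hypothetical perfect decoder: returns the exact noise in `current`."""
    a = alpha_bar(t)
    return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a)
            for x, c in zip(current, true_item)]

rng = random.Random(0)
initial = [rng.gauss(0.0, 1.0) for _ in true_item]  # start from pure noise
generated = sample(oracle, latent=None, initial=initial, num_steps=10)
```

With a perfect noise estimate the loop recovers `true_item` up to floating-point error; a trained decoder would only approximate this.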
5. The method of any one of claims 1 to 4, wherein obtaining the plurality of training data items comprises: obtaining multiple target data items, and, for each of the multiple target data items: generating the at least one source data item from the target data item by modifying the target data item, or generating the target data item from the at least one source data item by modifying the at least one source data item.
6. The method of any one of claims 1 to 5, further comprising: adding noise to the at least one source data item before processing it using the encoder neural network.
7. The method of any one of claims 1 to 6, wherein one or more of the training data items comprise a plurality of the source data items, the method further comprising: processing the plurality of source data items using the encoder neural network to generate a corresponding latent vector representing each of the source data items; aggregating the latent vectors to obtain an aggregated latent vector; and processing the noisy version of the target data item, the embedding of the time value, and the aggregated latent vector using the denoising decoder neural network to generate the denoised output.
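One simple way to aggregate several latent vectors, as in claim 7, is an element-wise mean. The claim does not name an aggregation operation, so the mean is an assumption chosen for illustration, and `aggregate_latents` is a hypothetical helper.

```python
def aggregate_latents(latents):
    """Element-wise mean of equally sized latent vectors (assumed aggregation)."""
    count = len(latents)
    return [sum(values) / count for values in zip(*latents)]

# Two toy latent vectors, one per source data item.
aggregated = aggregate_latents([[1.0, 2.0], [3.0, 4.0]])
```

A mean keeps the aggregated vector the same size regardless of how many source data items are supplied, so the decoder input does not change shape.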
8. The method of any one of claims 1 to 7, wherein the denoising decoder neural network comprises a series of neural network layers, and wherein processing the noisy version of the target data item, the embedding of the time value, and the latent vector using the denoising decoder neural network to generate the denoised output comprises: dividing the latent vector into a set of subvectors; and using the set of subvectors to modulate the activations of the neural network layers.
9. The method of claim 8, wherein modulating the activations of the neural network layers using the set of subvectors comprises: for one or more of the neural network layers, normalizing the activations of the neural network layer to obtain normalized activations, and modulating the normalized activations by scaling and/or shifting them using one of the subvectors.
10. The method of claim 9, comprising: scaling and/or shifting the normalized activations using the embedding of the time value before scaling and/or shifting them using the subvector.
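The layer modulation of claims 8 to 10 can be sketched as follows. Several simplifications are assumptions: each subvector is used directly, with its first half as a per-channel scale and its second half as a per-channel shift, whereas a real model would typically apply learned projections; `time_scale` and `time_shift` stand in for values derived from the time embedding; and all function names are hypothetical.

```python
import math

def split_latent(latent, num_layers):
    """Divide the latent vector into one equally sized subvector per layer."""
    size = len(latent) // num_layers
    return [latent[i * size:(i + 1) * size] for i in range(num_layers)]

def normalize(activations, eps=1e-5):
    """Zero-mean, unit-variance normalization over one layer's activations."""
    mean = sum(activations) / len(activations)
    var = sum((a - mean) ** 2 for a in activations) / len(activations)
    return [(a - mean) / math.sqrt(var + eps) for a in activations]

def modulate(activations, time_scale, time_shift, subvector):
    """Normalize, modulate with time-derived values, then with the subvector."""
    half = len(subvector) // 2
    z_scale, z_shift = subvector[:half], subvector[half:]
    h = normalize(activations)
    h = [ts * x + tb for x, ts, tb in zip(h, time_scale, time_shift)]
    return [(1.0 + zs) * x + zb for x, zs, zb in zip(h, z_scale, z_shift)]

subvectors = split_latent([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.5, 0.5], num_layers=2)
layer_1 = modulate([1.0, 3.0], [1.0, 1.0], [0.0, 0.0], subvectors[0])
layer_2 = modulate([1.0, 3.0], [1.0, 1.0], [0.0, 0.0], subvectors[1])
```

An all-zero subvector leaves the normalized activations unchanged in this formulation, which is what makes randomly zeroing subvectors a clean way to switch off conditioning for a layer.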
11. The method of any one of claims 8 to 10, further comprising: before modulating the activations of the neural network layers using the set of subvectors, randomly setting subvectors in the set of subvectors to zero.
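The random zeroing of claim 11 can be sketched as an independent coin flip per subvector. The drop probability and the per-subvector independence are assumptions, and `randomly_zero_subvectors` is a hypothetical helper.

```python
import random

def randomly_zero_subvectors(subvectors, drop_probability, rng):
    """Replace each subvector with zeros with the given probability."""
    return [
        [0.0] * len(s) if rng.random() < drop_probability else list(s)
        for s in subvectors
    ]

rng = random.Random(0)
subvectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
all_dropped = randomly_zero_subvectors(subvectors, 1.0, rng)
none_dropped = randomly_zero_subvectors(subvectors, 0.0, rng)
```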
12. The method of claim 8, wherein modulating the activations of the neural network layers using the set of subvectors comprises: for one or more of the neural network layers, processing the activations of that neural network layer using a cross-attention block, the cross-attention block attending over the set of subvectors to update the activations.
13. The method of any one of claims 1 to 12, wherein training the denoising decoder neural network and the encoder neural network by backpropagating the gradient of the objective function comprises: training the encoder neural network with a higher learning rate than the denoising decoder neural network.
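The separate learning rates of claim 13 amount to applying a different step size to each parameter group. The sketch below assumes plain gradient descent and illustrative rate values; neither the optimizer nor the values come from the claims.

```python
def sgd_step(params, grads, learning_rate):
    """One plain gradient-descent update (assumed optimizer)."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

ENCODER_LR = 2e-4  # hypothetical value, higher than the decoder's
DECODER_LR = 1e-4  # hypothetical value

encoder_params = sgd_step([1.0], [1.0], ENCODER_LR)
decoder_params = sgd_step([1.0], [1.0], DECODER_LR)
```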
14. The method of any one of claims 1 to 13, wherein the training data item comprises an image data item, and wherein the at least one source data item, the target data item, and the output data item each comprise pixel values for pixels of the image.
15. The method of claim 14, wherein one or more of the training data items comprise a plurality of source data items each comprising a corresponding source image, wherein the target data item comprises a target image, wherein the source images and the target image comprise corresponding views of the same scene, wherein at least the source images, and wherein the denoising decoder neural network is trained such that the denoising decoder neural network is capable of generating an output data item comprising an output image, the output image being a further different view of the same scene; the method further comprising: for each pixel in each source image, determining a viewpoint embedding that represents the coordinates of the viewpoint of that pixel in the corresponding view of the scene; wherein processing the at least one source data item in the training data item using the encoder neural network comprises: processing each source image in the training data item and the viewpoint embeddings for the pixels of the source image using the encoder neural network to generate the latent vector representing the source image.
16. The method of claim 15, further comprising: determining a viewpoint embedding for each pixel in the target image; wherein processing the noisy version of the target data item, the embedding of the time value, and the latent vector using the denoising decoder neural network to generate the denoised output comprises: processing the viewpoint embeddings for the pixels of the target image, the noisy version of the target data item, the embedding of the time value, and the latent vector using the denoising decoder neural network to generate the denoised output.
17. The method of claim 15 or 16, wherein the source image and the target image comprise 3D images of the scene; and wherein the coordinates of the viewpoint of the pixel in a corresponding view of the scene comprise: i) a spatial position and a viewing direction corresponding to the direction of the ray entering the scene from the pixel, or ii) a vector representing the camera pose of the camera viewing the scene.
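Option i) of claim 17 can be illustrated with a per-pixel ray and an embedding of its coordinates. The sketch assumes a pinhole camera with a known focal length, a camera-to-world rotation matrix, and a sinusoidal embedding; none of these choices is stated in the claims, and the function names are hypothetical.

```python
import math

def pixel_ray(u, v, width, height, focal, rotation, camera_position):
    """Origin and unit direction of the ray through pixel (u, v), pinhole model."""
    d_cam = [(u + 0.5 - width / 2) / focal, (v + 0.5 - height / 2) / focal, 1.0]
    # Rotate the camera-frame direction into the scene (world) frame.
    d_world = [sum(rotation[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(d * d for d in d_world))
    return list(camera_position), [d / norm for d in d_world]

def sinusoidal_embedding(values, num_frequencies):
    """Assumed embedding: sine and cosine of each value at doubling frequencies."""
    out = []
    for x in values:
        for k in range(num_frequencies):
            out += [math.sin(2 ** k * x), math.cos(2 ** k * x)]
    return out

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = pixel_ray(1, 1, 3, 3, 1.0, identity, [0.0, 0.0, 0.0])
viewpoint_embedding = sinusoidal_embedding(origin + direction, num_frequencies=4)
```

For the center pixel of a 3 by 3 image with an identity rotation, the ray points straight along the camera's viewing axis.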
18. The method of any one of claims 15 to 17, further comprising: randomly masking the viewpoint embedding before processing using the encoder neural network, or, when dependent on claim 16, randomly masking the viewpoint embedding before processing using the denoising decoder neural network.
19. A computer-implemented method for training a neural network system, the neural network system comprising an encoder neural network and a denoising decoder neural network, the encoder neural network for encoding an image, and the denoising decoder neural network for generating an output image by incrementally reducing the level of noise in the output image at multiple time steps, the method comprising: obtaining a plurality of training images, each of the plurality of training images including at least one source image and a target image, wherein the source image and the target image represent views of an object; and, for each training iteration in multiple training iterations: obtaining one of the training images; processing the at least one source image in the training image using the encoder neural network to generate at least one latent vector representing the at least one source image; obtaining the time value of one of the time steps; processing a noisy version of the target image, an embedding of the time value, and the latent vector using the denoising decoder neural network to generate a denoised output including an estimated noise image for that time step, for compensating for noise in the noisy version of the target image; and training the denoising decoder neural network and the encoder neural network by backpropagating the gradient of an objective function that depends on the accuracy with which the estimated noise image estimates the noise in the noisy version of the target image.
20. A computer-implemented method for generating an output data item by incrementally reducing the level of noise in the output data item at multiple time steps, the method comprising: obtaining an initial version of a current data item, and, at each time step in a series of time steps: processing the current data item and an embedding of the time value for that time step using a denoising decoder neural network to generate a denoised output including an estimated noise data item for that time step, the denoising decoder neural network having been trained by performing the corresponding operations of the method of any one of claims 1 to 18; and updating the current data item using the estimated noise data item for that time step to compensate for the noise in the current data item, so as to obtain an updated version of the current data item; wherein the output data item includes the updated version of the current data item at the final time step.
21. The method of claim 20, wherein obtaining the initial version of the current data item comprises sampling the initial version of the current data item from a noise distribution.
22. The method of claim 20 or 21, further comprising: obtaining a latent vector representing one or more characteristics of the output data item; and processing the current data item, the embedding of the time value for the time step, and the latent vector using the denoising decoder neural network to generate the denoised output.
23. The method of any one of claims 20 to 22, further comprising: processing the current data item, the embedding of the time value for the time step, and a masked version of the latent vector using the denoising decoder neural network to generate a second denoised output that includes a second estimated noise data item for the time step; and updating the current data item using a combination of the estimated noise data item for that time step and the second estimated noise data item for that time step to compensate for the noise in the current data item.
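Claim 23 combines a noise estimate made with the latent vector and a second estimate made with a masked latent vector. The claim does not specify the combination; the sketch below assumes the linear extrapolation commonly used for classifier-free guidance, with a hypothetical `guidance_weight` parameter.

```python
def combine_estimates(conditional, masked, guidance_weight):
    """Assumed combination: move from the masked estimate toward the conditional one."""
    return [m + guidance_weight * (c - m) for c, m in zip(conditional, masked)]

conditional = [1.0, 2.0]  # toy estimate using the latent vector
masked = [0.5, 1.0]       # toy estimate using the masked latent vector
guided = combine_estimates(conditional, masked, guidance_weight=2.0)
```

A weight of 1 reproduces the conditional estimate and a weight of 0 the masked one; weights above 1 push the update further in the direction implied by the latent vector.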
24. The method of any one of claims 20 to 23, wherein the current data item includes a current image, wherein the output data item includes an output image, and wherein generating the output data item includes determining pixel values for the current image and pixel values for the output image.
25. The method of claim 24 when dependent on claim 22, wherein the latent vector representing one or more characteristics of the output data item includes a latent vector representing a compressed version of the input image, and wherein generating the output data item including the output image reconstructs a version of the input image.
26. The method of claim 24 or 25, further comprising: obtaining viewpoint data that defines a target viewpoint for the output image; determining, from the target viewpoint, a pixel viewpoint for each pixel in the current image; for each pixel in the current image, determining a viewpoint embedding for that pixel's viewpoint; and processing the pixel values of the current image, the viewpoint embeddings of the pixels in the current image, and the embedding of the time value for the time step using the denoising decoder neural network to generate the denoised output, or, when dependent on claim 22, processing the pixel values of the current image, the viewpoint embeddings of the pixels in the current image, the embedding of the time value for the time step, and the latent vector using the denoising decoder neural network to generate the denoised output.
27. A computer-implemented method for performing a data item processing task, the method comprising: receiving a latent vector from an encoder neural network that has been trained by performing the corresponding operations of the method of any one of claims 1 to 19; and using the latent vector to perform the data item processing task.
28. The method of claim 27, wherein the data item comprises an image, wherein the latent vector has been generated by the encoder neural network by processing the pixels of the image, and wherein performing the data item processing task includes using the latent vector to predict one or more features of the image.
29. The method of claim 27, wherein the data item comprises an image, wherein the latent vector has been generated by the encoder neural network by processing the pixels of the image, and wherein performing the data item processing task includes using the latent vector to control the actions of a mechanical agent acting in a real-world environment to perform mechanical tasks.
30. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of a corresponding method of any one of claims 1 to 29.
31. A non-transitory computer storage medium storing one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of a corresponding method of any one of claims 1 to 29.