Novel pose synthesis
By using the DRAW architecture and encoder-decoder architecture, the challenge of object pose manifold learning under sparse view conditions in natural image datasets is solved, enabling high-quality synthesis of new views in the natural image domain while maintaining the consistency of object identity.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MICROSOFT TECHNOLOGY LICENSING LLC
- Filing Date
- 2020-04-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing techniques struggle to effectively learn the pose manifold of objects in natural image datasets, especially under sparse view conditions, making it difficult to preserve object identity across views and synthesize novel views.
The Domain Transfer View Composition (DRAW) architecture is adopted. By converting the reference image of the object into a depth map, and using a depth rotator and identity restoration module to synthesize a new view in the natural image domain, the encoder-decoder architecture decomposes and reassembles shape and appearance information to achieve consistent object identity across views.
Under sparse view conditions, high-quality new views are successfully synthesized, maintaining the consistency of object identity, and applicable to natural image datasets without dense view supervision.
Figure CN113906478B_ABST
Abstract
Description
Background Technology
[0001] Objects viewed from different angles can traverse manifolds in image space. Characterizing these manifolds can be useful in computer vision, including 3D scene understanding and view-invariant object recognition. However, these manifolds can be difficult to learn.
Summary of the Invention
[0002] Examples are disclosed that relate to computing devices and methods for synthesizing novel poses of an object. One example provides a method that includes receiving a reference image of an object corresponding to an original viewpoint; converting the reference image of the object into a depth map of the object; synthesizing a new depth map of the object corresponding to a new viewpoint; and generating a new image of the object from the new viewpoint based on the new depth map of the object and the reference image of the object.
[0003] This Summary is provided to introduce, in simplified form, a selection of concepts that are further described in the detailed description below. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all of the disadvantages noted in any part of this disclosure.
Attached Figure Description
[0004] Figure 1 shows examples of two objects with the same shape but different appearances, and their trajectories in image space as a function of viewpoint.
[0005] Figure 2 illustrates an example of a computer-implemented Domain Transfer View Composition (DRAW) architecture.
[0006] Figure 3 shows an example of a domain transfer model.
[0007] Figure 4 shows an example of an identity recovery model, which includes encoder-decoder architectures for various combinations of domains and views.
[0008] Figure 5 shows example inputs and outputs of the domain transfer model.
[0009] Figure 6 shows example images of rotated depth maps compared to ground truth images.
[0010] Figure 7 shows an example comparison of L1 and SSIM scores for a depth rotator with and without 3D refinement.
[0011] Figure 8 shows examples of a simple image-to-image hallucination (HAL) model and a weak identity recovery (WIR) module.
[0012] Figure 9 shows two examples comparing the source image (reference image), the target depth map (new depth map), and the prediction of the object (new image) with the ground truth image of the object.
[0013] Figure 10 shows a qualitative comparison between synthesized images output by DRAW and images output by other models.
[0014] Figure 11 shows DRAW view synthesis results on table images.
[0015] Figure 12 shows a comparison of view synthesis results for chairs from ShapeNet.
[0016] Figures 13A, 13B, and 13C illustrate a flowchart depicting an example method for synthesizing a new pose of an object.
[0017] Figure 14 shows a block diagram of an example computing system.
Detailed Implementation
[0018] As mentioned above, objects viewed from different angles can traverse manifolds in image space. Characterizing these manifolds can be useful in computer vision, including 3D scene understanding and view-invariant object recognition.
[0019] However, these manifolds can be difficult to learn. For example, many object datasets do not contain a dense sampling of different object views. Some popular datasets, such as ImageNet (accessible from image-net.org) and Common Objects in Context (COCO, accessible from cocodataset.org), favor diversity of instances within each object class rather than diversity of views of any individual object, because capturing a dense collection of views of every object can be labor-intensive. In contrast, synthetic image datasets, such as ModelNet (“3D ShapeNets: A Deep Representation for Volumetric Shapes” by Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912-1920, 2015) or ShapeNet (“ShapeNet: An Information-Rich 3D Model Repository” by A. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al., arXiv: 1512.03012, 2015), may contain a large number of object views. However, such synthetic image datasets may differ significantly from the natural image domain. Therefore, models learned from synthetic data may be difficult to apply to natural images.
[0020] Since it can be difficult to obtain natural image datasets with dense view sampling, one possible solution for pose manifold modeling could be based on domain adaptation or transfer. Domain transfer methods have been used to solve problems such as object recognition, image segmentation, or image-to-image translation. However, domain transfer typically focuses on transferring images, image regions, or features across domains.
[0021] Therefore, a potential solution to the pose manifold modeling problem might be to combine these techniques with methods for novel view synthesis, which have shown promise on synthetic data. This approach could include generating pose trajectories in the synthetic domain using view synthesis, and then using domain transfer techniques to map the individual views to the natural image domain. However, these techniques may fail to preserve object identity across views. This problem is illustrated in Figure 1, which shows a first object 104 and a second object 108 that have the same shape but different appearances. Figure 1 also shows a first trajectory 112 of the pose of the first object 104 and a second trajectory 116 of the pose of the second object 108 in the natural image space as a function of viewpoint. A synthetic trajectory 120 traversed by a synthetic image 124 of a CAD model of the object is also shown. A set of densely sampled synthetic object views can be transferred to the natural image space of objects 104 and 108, where only sparse views are available. However, since the CAD model may not be able to characterize the appearance of each natural object, the trajectory of each object 104, 108 can be mapped to a single trajectory 128 in the synthetic domain, as shown by the dashed line in the natural image trajectory. Therefore, views synthesized along trajectory 128 may oscillate between depicting objects of similar shape but different appearance, as shown at the bottom of Figure 1.
[0022] To overcome these issues, a consistent object identity can be achieved by transferring the entire pose trajectory across domains, rather than transferring individual views. This approach may be similar to techniques that hallucinate view changes on scene images. However, methods for hallucinating view changes may assume dense view supervision, which may only be available in the synthetic or video domain.
[0023] Therefore, examples are disclosed that use pose trajectory transfer to perform novel view synthesis when only one or a few images of the target object are available, as is common in natural image datasets. The disclosed examples can allow the generation of novel views based on sparse target object image data, while preserving identity across views.
[0024] The novel view synthesis disclosed herein can be considered a special case of image-to-image transfer, where the source and target images represent different views. Several methods for novel view synthesis have been proposed. However, these methods can explicitly infer shapes from 2D image data. Furthermore, while image transfer may aim to synthesize styles or textures, view transfer may "hallucinate" unseen shape information.
[0025] Several methods for domain adaptation in visual tasks have also been proposed. General domain adaptation may aim to bridge synthetic and natural domains by aligning their statistics. Some examples of domain adaptation schemes fuse color and depth features used for pose estimation. In contrast, as described in more detail below, the examples disclosed herein utilize viewpoint supervision through image-to-image transfers, which can be used to separate appearance and shape and recover object identity. In this way, unsupervised domain transfers can be achieved, leveraging depth information to bridge natural and synthetic domains and perform bidirectional transfers.
[0026] Beyond the examples presented herein, other methods for novel view synthesis have been proposed. Some methods generate pixels in the target view using autoencoders or recurrent networks. To eliminate some artifacts of these schemes, appearance flow-based transfer modules can reconstruct the target view using pixels from the source view and a dense flow map. However, such methods may fail to hallucinate pixels missing from the source view. Other methods can use image completion modules applied after flow-based image reconstruction to compensate for pixels missing from the source view, as well as separate modules to predict dense flow and pixel hallucination. However, these methods rely on training sets with dense pose trajectories, such as large sets of views of the same object. For example, some such methods may assume that views are available at 16 or 18 azimuth angles and may utilize additional 3D supervision. This may limit the applicability of such methods, as acquiring and annotating pose labels on natural images can be both time-consuming and expensive. To avoid this labeling process, novel view synthesis methods can be trained and applied on ShapeNet. However, when view synthesis is applied to natural scenes, such as those in KITTI (“Vision meets Robotics: The KITTI Dataset” by A. Geiger, P. Lenz, C. Stiller and R. Urtasun in International Journal of Robotics Research 32(11):1231–1237, 2013), view changes may be limited to a few frames and may still rely on viewpoint supervision.
[0027] Recent work has addressed human pose transfer, where the goal is to transfer a person across poses. These examples may leverage the availability of multi-pose datasets, such as DeepFashion (Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations”, IEEE Conference on Computer Vision and Pattern Recognition Proceedings 2016, pp. 1096–1104). However, beyond the viewpoint, these methods may assume keypoint supervision or utilize pre-trained dense human pose estimation networks. Therefore, these methods may require additional supervision and may be applicable only to human poses.
[0028] Other examples attempt to reconstruct 3D information from 2D images using large-scale 3D CAD datasets, such as ShapeNet and Pix3D (X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2974–2983). However, such 3D reconstructions may presuppose 3D supervision.
[0029] In contrast to these methods, and as mentioned above, the disclosed examples synthesize new poses of objects, by executing instructions on a computing system, when only one or a few images of each object are available, as is common in natural image datasets. This is illustrated in Figure 1, where only a few images are available for each natural image trajectory. The disclosed examples utilize densely populated synthetic pose trajectories to generate dense trajectories in the image domain. As described in more detail below, given a reference image x0 corresponding to an object from the original viewpoint, a synthetic depth map s0 is generated. A complete pose trajectory (e.g., s1, s2, ..., s_N) is then generated in the latent space of CAD-based depth maps. Pose trajectories in the CAD-based depth map space can be used to provide cross-modal guidance for modeling pose trajectories in image space, thereby synthesizing new views of objects in image space. Figure 2 illustrates an example architecture of a computer-implemented novel view synthesis system. The example architecture of Figure 2 may also be referred to as a Domain Transfer View Composition (DRAW) architecture. In the depicted architecture, the domain transfer module 204 is first used to convert the reference image x0 of the object into a reference depth map s0 of the object. Then, a depth rotator 208 is applied to synthesize a new depth map s_p of the object corresponding to the new viewpoint. However, identity recovery may bring new challenges. For example, as shown in Figure 2, identity recovery can include decomposing the shape and appearance components of the reference image x0, and combining the appearance information with the object shape of the synthesized new depth map s_p.
[0030] Therefore, the identity recovery network 212 takes the reference image x0 and the new depth map s_p as input and makes multiple predictions of object views across combinations of domain (image and depth) and viewpoint (reference and p-th view). Making multiple predictions may force the network to decompose shape and appearance information more effectively, thus enabling the synthesis of more realistic views of the reference object in new poses. This may allow DRAW to synthesize new views of natural objects without using a training set of natural images with dense views.
[0031] A new view of an object can be synthesized relative to a reference viewpoint, for example one defined by azimuth and elevation angles in a spherical coordinate system in which the object is located at the center of the view sphere. In some examples, given a reference image x0, N-1 consecutive views x_p can be sampled at a fixed elevation angle, with azimuth angles spaced 2π/N radians apart. While the examples disclosed herein refer to generating views at a single elevation, it should be understood that these examples can be extended to the synthesis of images at different elevations.
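As a minimal sketch of the view sampling described above, the following Python snippet lists the azimuth angles of N views at a fixed elevation, assuming a spacing of 2π/N radians between consecutive views. The function name `view_azimuths` and the zero reference azimuth are illustrative assumptions and are not part of the disclosure.

```python
import math

def view_azimuths(n_views, reference_azimuth=0.0):
    """Return the azimuth angles (radians) of n_views equally spaced views.

    Index 0 is the reference view; indices 1..n_views-1 are the novel views.
    """
    step = 2.0 * math.pi / n_views
    return [(reference_azimuth + p * step) % (2.0 * math.pi)
            for p in range(n_views)]

# 18 views, matching the number of azimuth angles used in the evaluation.
angles = view_azimuths(18)
spacing_degrees = math.degrees(angles[1] - angles[0])
```

With 18 views the spacing works out to 20 degrees, so the 17 novel views cover the rest of the view circle around the reference view.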
[0032] Real and synthetic data can be combined by using depth as a representation that bridges the natural and synthetic domains. For example, DRAW can use RGBD datasets such as Pix3D or RGB-D to learn the mapping between the image and depth domains. DRAW can also use the aforementioned synthetic datasets to learn how to synthesize new views. Rather than directly converting the reference view x0 into the desired view x_p, an intermediate representation is introduced that includes the depth map s_p of each view x_p. However, image and depth representations may not be paired. For example, the depth maps s_p used to learn how to rotate objects in 3D may not have a one-to-one mapping to the images x_p. Instead, each depth map s_p may be derived from one or more CAD models of objects of the same or a similar class under the same or a similar view.
[0033] Therefore, this problem can be viewed as one of domain adaptation, where data from a source domain (CAD-based depth maps), for which viewpoint annotations are available, is used to improve the performance of a task (view synthesis) in a target domain (images), where such data is inaccessible. Therefore, as shown in the example of Figure 2, view generation can be broken down into simpler tasks: a domain adaptation component that maps an image to a depth map and vice versa, and a geometry component that is implemented as a 3D rotation of an object.
[0034] In the example shown in Figure 2, three modules are proposed to implement these tasks: the domain transfer module 204, the depth rotator module 208, and the identity recovery module 212. The domain transfer module 204 establishes a mapping from the target domain of natural images to the source domain of depth maps, as follows:
[0035] (1)
[0036] (2)
[0037] Where x0 and s0 are a reference image and a depth map with the same azimuth angle, respectively.
[0038] For p = 1, ..., N-1, the depth rotator module 208 can achieve the following:
[0039] (3)
[0040] The depth rotator module 208 takes the depth map s0 associated with the original reference view and synthesizes the depth maps of all other N-1 views. As shown in the example of Figure 2, this can be implemented through two sub-modules: a recurrent rotator 216 and a refinement operator 220. The recurrent rotator 216 generates novel depth map views. The refinement operator 220 can utilize information from all the synthesized depth maps to refine each of them. The identity recovery module 212 implements the following:
[0041] (4)
[0042] (5)
[0043] The identity recovery module 212 takes the original reference view x0 and the synthesized depth map s_p as input to generate a synthesized view. In this way, the identity recovery module 212 can recover the identity of x0 under the viewpoint of s_p.
[0044] Referring again to equations (1) and (2), the domain transfer model can be learned using a dataset such as Pix3D that contains paired images and depth maps. This makes learning the domain transfer model resemble a standard domain transfer problem, in which the model receives a natural image (e.g., in RGB) and outputs a depth map. Such a transfer can be performed using any suitable image style transfer model.
[0045] Figure 3 shows an example architecture of the computer-implemented domain transfer module 304 and 3D refinement module 308. The architecture shown in Figure 3 is a fully convolutional neural network, which in some examples can be implemented using ResNet blocks. The domain transfer module 304 outputs a depth map and a foreground mask identifying pixels associated with the object. An object depth map 312 can be obtained by combining the depth map and the foreground mask. In some examples, using a foreground mask can result in a sharper object depth map, which can lead to improved depth rotation performance.
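The text does not specify how the depth map and the foreground mask are combined. One plausible reading, sketched below with NumPy, is to keep depth values on pixels the mask marks as foreground and set all other pixels to a background value. The function name `object_depth_map`, the 0.5 threshold, and the zero background value are assumptions made for illustration only.

```python
import numpy as np

def object_depth_map(depth, mask_prob, threshold=0.5, background=0.0):
    """Keep depth on foreground pixels; set every other pixel to `background`.

    depth:     (H, W) array of predicted depth values.
    mask_prob: (H, W) array of predicted foreground probabilities.
    """
    foreground = mask_prob >= threshold
    return np.where(foreground, depth, background)

# Toy 2x2 example: the right-hand column is predicted to be background.
depth = np.array([[0.2, 0.8], [0.5, 0.9]])
mask_prob = np.array([[0.9, 0.1], [0.7, 0.3]])
obj = object_depth_map(depth, mask_prob)
```

Masking in this way removes background depth values entirely, which is one way a sharper object boundary could arise.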
[0046] In some examples, the quality of the synthesized depth map can be evaluated using an L1 (least absolute deviations) loss, such as:
[0047] (6)
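For illustration, an L1 loss between a synthesized depth map and its target can be sketched as the mean absolute per-pixel difference. The exact form of equation (6) is not reproduced in this text, so the mean normalization and the function name `l1_loss` below are assumptions.

```python
import numpy as np

def l1_loss(predicted, target):
    """Mean absolute difference between two depth maps of equal shape."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    if predicted.shape != target.shape:
        raise ValueError("depth maps must have the same shape")
    return float(np.mean(np.abs(predicted - target)))

# Toy 2x2 depth maps: absolute differences are 0.5, 0, 0 and 2.
loss = l1_loss([[0.0, 1.0], [2.0, 3.0]], [[0.5, 1.0], [2.0, 1.0]])
```

Lower values indicate a synthesized depth map that is closer to the target.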
[0048] In some examples, the L1 loss can be supplemented with an adversarial loss that distinguishes between synthetic and real depth maps. The adversarial loss can be implemented using a pairwise discriminator D between the original (target) depth map s0 and the synthesized depth map, conditioned on x0. The domain transfer module can be trained by iterating between learning the discriminator and learning the mapping function, each with its own loss. For example, equation (7) shows an example of the discriminator loss function:
[0049] (7)
[0050] Equation (8) shows an example of a loss function for the mapping function:
[0051] (8)
[0052] In equation (8), the multiplier can be selected to balance the contribution of each loss component.
[0053] Adding an adversarial loss while learning the discriminator and the mapping function can help enhance the sharpness of the output and the consistency between the input and the output. Therefore, this scheme can be applied to learning any suitable module described herein.
[0054] Introducing depth as an intermediate representation for image transformation converts view rotation into a geometric operation that can be learned from a dataset of CAD models. Instead of reconstructing pixel depth from appearance views, such as using a dense appearance flow model, a novel depth view can be synthesized from a reference depth view s0. This scheme can leverage a dataset of CAD data with multiple views for each object, each with a known perspective.
[0055] After performing domain transfer, novel depth views can be generated using a combination of a depth map generator and a 3D refinement module. The depth map generator may be based on a recurrent network that takes the original depth map s0 as input and outputs a series of depth maps, as follows:
[0056] (9) For p = 1, ..., N-1
[0057] Equation (9) represents the depth map generator function, where p denotes the azimuth angle. The depth generator can be implemented in any suitable manner. For example, it could be based on a ConvLSTM with skip connections between one or more input and output layers, as described in “Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence” by S.-H. Sun, M. Huh, Y.-H. Liao, N. Zhang, and J. J. Lim, pp. 155-171, Proceedings of the European Conference on Computer Vision (ECCV) 2018. Given a set of depth maps {s0, s1, ..., s_{N-1}} from N viewpoints, the depth generator may aim to minimize a loss. An example loss function for the depth generator is shown below:
[0058] (10)
[0059] The refinement module 220 can enhance the consistency between adjacent views by using a 3D convolutional neural network to refine each synthesized view with information from nearby synthesized views. For example, the N depth maps synthesized by the depth rotator module can be stacked into a 3D volume s′, as shown in equation (11):
[0060] (11)
[0061] Equation (11) denotes concatenation along the third dimension. In some examples, refinement of the end views (e.g., s_{N-1}) can be ensured by using circular padding along the third dimension. The volume s′ can be processed by equation (12):
[0062] (12)
[0063] This can be implemented using multi-layer 3D convolutions with skip connections to produce a 3D volume of concatenated refined depth maps:
[0064] (13)
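The stacking and padding steps that precede the 3D convolutions can be sketched with NumPy as follows. This is an illustrative sketch only: the function name `stack_with_circular_padding`, the (H, W, N) axis order, and the padding width of one view per side are assumptions, and the 3D convolutional refinement network itself is not shown.

```python
import numpy as np

def stack_with_circular_padding(depth_maps, pad=1):
    """Stack N depth maps of shape (H, W) into an (H, W, N + 2 * pad) volume.

    The view axis is padded by wrapping around, so the first and last views
    become neighbours, as they are on the view circle.
    """
    volume = np.stack(depth_maps, axis=2)
    return np.pad(volume, ((0, 0), (0, 0), (pad, pad)), mode="wrap")

# Toy depth maps s_0..s_3, each filled with its own view index.
views = [np.full((2, 2), float(p)) for p in range(4)]
padded = stack_with_circular_padding(views)
view_order = padded[0, 0].tolist()  # view indices along the padded axis
```

Here `view_order` is `[3.0, 0.0, 1.0, 2.0, 3.0, 0.0]`, so a convolution kernel centred on the first or last view sees the view on the opposite end of the trajectory rather than zeros.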
[0065] 3D refinement can be supervised by an L1 loss function, such as:
[0066] (14)
[0067] The L1 loss can be supplemented by an adversarial loss implemented using a pairwise volume discriminator D_V between the CAD-based depth map volume s and the synthesized volume s″, conditioned on s′. Equation (15) gives an example of the discriminator loss:
[0068] (15)
[0069]
[0070] The depth rotator components can be supervised by the following:
[0071] (16)
[0072] In past domain transfer methods, as described above, the source and target domains can be mapped one-to-one, where each example in the source domain produces a different image in the target domain. In the disclosed examples, this is not the case for the transfer between images and depth maps because, as shown in Figure 1, objects of the same shape may have different appearances. Therefore, the mapping between images and depth maps may not be bijective. While this may not pose a problem for the domain transfer module, which implements a many-to-one mapping, it can mean that it is difficult to uniquely recover the object's identity from its depth map. Therefore, in addition to the depth map s_p, the identity recovery model can also access the original reference image x0 to implement the mapping of equation (4).
[0073] In a supervised regression setting, this mapping can be learned from triplets (x0, s_p, x_p). However, such datasets can be difficult to locate or assemble. It may even be difficult to find datasets containing multiple views of the same object with viewpoint annotations. For example, datasets such as Pix3D may only have a few views per object, and the views may not be aligned (i.e., the available views may change from object to object). Given the lack of such data, the mapping is learned from unpaired data. This is more challenging than image-to-image transfer because it involves decomposing the appearance and shape information of x0 and combining the appearance information with the shape information of s_p.
[0074] To perform this decomposition and combination, an encoder-decoder architecture can be employed. The encoder decomposes its input into a pair of shape and appearance parameters via a combination of a structure predictor and an appearance predictor. The structure predictor implements the mapping from the input image x to the shape parameter p as shown in equation (17):
[0075] (17)
[0076] Similarly, the appearance predictor can achieve the mapping from the input image x to the appearance parameter a as shown in equation (18):
[0077] (18)
[0078] The decoder takes a vector of concatenated appearance and shape parameters and decodes this latent representation into an image, combining these parameters to reconstruct its output.
[0079] While the shape of an object is captured by both the image and the depth map, its appearance is captured only by the image. This difference can be used to force a decomposition. For example, shape information derived from domain A and appearance information derived from domain B can be combined to reconstruct an image of the object in domain B under the view available in domain A. Therefore, taking the image domain and the depth map domain as A and B, images can be synthesized using four possible combinations of domains (image and depth map) and views (reference and target). By matching each of these four types of synthesized images with the corresponding four types of real images, the network can learn to decompose and combine shape and appearance representations.
[0080] In the multi-view setting, not all four combinations may be available, because x_p is the target to be predicted. However, this idea can be implemented using the remaining three: the reference image (x0), the reference depth map (s0), and the target view depth map (s_p). Figure 4 shows an example of a computer-implemented identity recovery model, which includes an encoder-decoder architecture for various combinations of domains and views. In the example of Figure 4, dashed lines indicate the data flow during training, and solid arrows indicate the data flow during inference. The encoder shown in Figure 4 can be applied to the reference image x0 and the depth map s_p. This may generate a pair of shape and appearance parameters for each input:
[0081] (19)
[0082] (20)
[0083] (21)
[0084] (22)
[0085] The decoder can then be applied to the four possible combinations of these parameter vectors to synthesize the following four images:
[0086] (23)
[0087] (24)
[0088] (25)
[0089] (26)
[0090] As shown on the right of Figure 4, equations (23)-(26) represent all possible combinations of the shape and appearance from the real reference image x0 and the shape and appearance from the corresponding depth map. To force the decomposition into shape and appearance, the structure predictor, appearance predictor, and decoder share parameters. This means that a single encoder and a single decoder are effectively learned. During inference, the target image x_p is obtained as follows:
[0091] (27)
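The recombination of shape and appearance codes can be illustrated with a toy Python sketch. The `encode` and `decode` functions below are hypothetical stand-ins for the learned shared encoder and decoder: they operate on small dictionaries rather than images, and serve only to show how the four pairings arise and which pairing is used at inference.

```python
from itertools import product

def encode(sample):
    """Toy stand-in for the shared encoder: returns (shape, appearance) codes."""
    return sample["shape"], sample["appearance"]

def decode(shape_code, appearance_code):
    """Toy stand-in for the shared decoder: recombines the two codes."""
    return {"shape": shape_code, "appearance": appearance_code}

# Toy inputs. A depth map carries shape but no appearance ("none" here).
reference_image = {"shape": "view_0", "appearance": "red_fabric"}  # x0
target_depth = {"shape": "view_p", "appearance": "none"}           # s_p

codes = {"x0": encode(reference_image), "sp": encode(target_depth)}

# All four (shape source, appearance source) pairings.
combos = {
    (shape_src, app_src): decode(codes[shape_src][0], codes[app_src][1])
    for shape_src, app_src in product(codes, repeat=2)
}

# Inference-time pairing: shape from the target depth map, appearance from x0.
novel_view = combos[("sp", "x0")]
```

In this toy setting `novel_view` carries the target view's shape and the reference image's appearance, which mirrors the description of combining the appearance of x0 with the shape of s_p.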
[0092] A hybrid of supervised and unsupervised learning can be used to train the identity recovery model. Because x0, s0, and s_p are available, they can provide direct supervision for the corresponding synthesized combinations. This can be encoded into a supervised loss function:
[0093] (28)
[0094] In some examples, equation (28) can be supplemented by an adversarial loss, in which the synthesized combinations are treated as fake pairs that should be indistinguishable from the real pair (x0, s0). Such a pairwise discriminator can be trained using the following loss function:
[0095] (29)
[0096] The encoder and decoder can be learned using the loss:
[0097] (30)
[0098] DRAW can be trained in two stages to decouple domain transfer and viewpoint synthesis. First, the depth rotator module and its discriminator D_V can be optimized. Any suitable 2D image reconstruction loss can be used, such as the loss shown in equation (10) above. Once trained, the depth rotator module can be frozen and embedded in the system of Figure 2.
[0099] The training of the domain transfer and identity recovery modules can then be performed in an end-to-end manner, using equation (31) as the loss function to train the discriminator and equation (32) as the loss function to train the domain transfer and identity recovery parts.
[0100] (31)
[0101] (32)
[0102] The DRAW model was evaluated using a combination of the natural image Pix3D dataset and the synthetic 3D CAD ShapeNet dataset. To ensure diversity of viewpoints and identities in both datasets, DRAW was evaluated on two categories: chairs and tables.
[0103] First, each module of DRAW is evaluated individually on the chair category. Domain transfer is evaluated between Pix3D and ShapeNet, view synthesis is evaluated on ShapeNet, and identity restoration is evaluated on Pix3D. L1 and structural similarity metrics (SSIM) are used as quantitative synthesis metrics.
[0104] As described in more detail below with reference to Figure 10, the performance of the entire DRAW model trained using Pix3D and ShapeNet is compared with three other view synthesis models trained on ShapeNet and fine-tuned on Pix3D. Model 10-1 is described in "Single-view to Multi-view: Reconstructing Unseen Views with a Convolutional Network" by M. Tatarchenko, A. Dosovitskiy, and T. Brox, arXiv: 1511.06702, 6, 2015. Model 10-2 is described in "View Synthesis by Appearance Flow" by T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, Springer, pp. 286-301, European Conference on Computer Vision 2016. Model 10-3 is described in S.-H. Sun, M. Huh, Y.-H. Liao, N. Zhang and J. J. Lim, “Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence”, pages 155-171 of the proceedings of the 2018 European Conference on Computer Vision (ECCV).
[0105] Inception scores over poses are used to quantify the quality of synthesized images. An inception score can be calculated from the KL divergence between the conditional label distribution and the marginal label distribution, to assess the quality and diversity of generated images across categories. An Inception network trained for classification provides the label distribution. For view synthesis, the goal is to provide pose diversity rather than category diversity. Therefore, the Inception model is trained to classify 18 different azimuth angles on ShapeNet, and the following inception score is calculated using the distribution of predicted pose labels:
[0106] (33)
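As a hedged illustration, the commonly used form of the inception score is the exponential of the average KL divergence between each image's predicted label distribution p(y|x) and the marginal distribution p(y). The sketch below applies that form to pose labels; whether equation (33) matches it exactly is an assumption, as are the function name `pose_inception_score` and the small `eps` used to avoid taking the logarithm of zero.

```python
import math

def pose_inception_score(predictions, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    `predictions` is a list of per-image probability vectors over pose classes.
    """
    n_images = len(predictions)
    n_classes = len(predictions[0])
    marginal = [sum(p[k] for p in predictions) / n_images
                for k in range(n_classes)]
    mean_kl = 0.0
    for p in predictions:
        mean_kl += sum(
            p[k] * (math.log(p[k] + eps) - math.log(marginal[k] + eps))
            for k in range(n_classes)
        )
    return math.exp(mean_kl / n_images)

# Confident predictions spread evenly over 3 poses versus a single pose.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.0, 0.0]] * 3
score_diverse = pose_inception_score(diverse)
score_collapsed = pose_inception_score(collapsed)
```

Under this form, confident predictions spread evenly over K pose classes give a score close to K, while predictions that collapse onto one pose give a score of 1, so a higher score reflects both recognizable and diverse poses.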
[0107] On ShapeNet, 72 images of size 256×256 were synthesized for each CAD model, using 18 azimuth angles and elevation angles of {0°, 10°, 20°, 30°}. For training, 558 objects were used, while 140 objects were used for testing. Pix3D combines 2D natural images and 3D CAD models. Images and depth maps from Pix3D were cropped and resized to 256×256. It will be appreciated that, while DRAW may not require multiple aligned images of each object, these can be helpful in evaluating identity recovery. The training and test sets were split by object to ensure that images of the same object did not appear in both training and testing. This resulted in 758 training images from 150 objects and 140 test images from 26 objects.
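An object-level split of the kind described above can be sketched as follows. The function name `split_by_object`, the test fraction, the random seed, and the toy image and object identifiers are all hypothetical; the text does not describe how objects were assigned to each subset.

```python
import random

def split_by_object(image_to_object, test_fraction=0.15, seed=0):
    """Split image ids so that every object lands entirely in one subset."""
    objects = sorted(set(image_to_object.values()))
    rng = random.Random(seed)
    rng.shuffle(objects)
    n_test = max(1, round(len(objects) * test_fraction))
    test_objects = set(objects[:n_test])
    train = [img for img, obj in image_to_object.items()
             if obj not in test_objects]
    test = [img for img, obj in image_to_object.items()
            if obj in test_objects]
    return train, test

# Toy annotations: image id -> object id (hypothetical names).
image_to_object = {
    "img_a": "chair_1", "img_b": "chair_1",
    "img_c": "chair_2",
    "img_d": "chair_3", "img_e": "chair_3",
}
train_images, test_images = split_by_object(image_to_object, test_fraction=0.34)
```

Splitting on object identifiers rather than on images is what prevents views of a test object from leaking into training.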
[0108] Referring now to Figure 5, examples are given illustrating some of the inputs and outputs of the computer-implemented domain transfer module. Figure 5 shows the source image, predicted depth map, and target depth map for two images of chairs. As shown by example in Figure 5, the predicted depth map output by the domain transfer module may be very close to the ground truth (e.g., the source image).
[0109] A depth rotator with 3D refinement is compared to a depth rotator without 3D refinement. Both models are trained on 18 ShapeNet views. Given a reference depth map, the model's task is to synthesize the remaining 17 depth maps.
[0110] Figure 6 shows example images of computer-generated rotated depth maps compared to the ground truth. As shown by example in Figure 6, the depth map output by the depth rotator may be close to the ground truth, but refinement can improve the rendering of fine shape details.
[0111] Figure 7 shows an example comparison of L1 and SSIM scores for computer-implemented depth rotators with and without 3D refinement. For L1, a lower value indicates a higher-quality depth map. For SSIM, a higher value indicates a higher-quality depth map. As shown by example in Figure 7, refinement can improve both metrics across all views.
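As a non-limiting illustration of the two metrics, the following sketch computes a mean L1 error and a simplified SSIM between a predicted and a ground-truth depth map, each flattened to a list of values in [0, 1]. The SSIM here is evaluated once over the whole map rather than over sliding local windows as in the usual formulation, so its values are indicative only; the constants 0.01 and 0.03 are the customary SSIM stabilizers. The function names are hypothetical.

```python
def l1_error(pred, target):
    """Mean absolute difference; lower means a closer depth map."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def global_ssim(pred, target, data_range=1.0):
    """Single-window SSIM over the whole map; higher means more similar."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


# Toy flattened depth maps (hypothetical values).
toy_target = [0.0, 0.25, 0.5, 1.0]
toy_pred = [0.0, 0.5, 0.5, 0.75]
```

A prediction identical to the target yields an L1 error of 0 and an SSIM of 1, which matches the directions of improvement stated above.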
[0112] Referring now to Figure 8, two baseline identity recovery models are compared with the identity recovery model of Figure 4. Figure 8 shows an example of a simple computer-implemented image-to-image hallucination (HAL) model and an example of a computer-implemented weak identity recovery (WIR) model. The HAL model simply treats identity recovery as an image-to-image translation problem. The HAL model may only have access to the object's depth map s_p. Therefore, the HAL model may hallucinate the appearance of the object. The WIR model is a simple variant of the identity recovery model of Figure 4. The WIR model receives x_0 and s_p. However, compared with the identity recovery model of Figure 4, it has fewer and/or weaker disentanglement constraints because it may not need to synthesize all combinations of shape and appearance.
[0113] All models are trained on pairs of RGB-D images corresponding to different viewpoints of the same object in Pix3D. During inference, the RGB image from the first view and the depth map from the second view are used to predict the RGB image from the second view. Due to the lack of supervision over the target RGB image, HAL and WIR are optimized using only adversarial loss.
[0114] Figure 8 also provides a quantitative comparison of all three identity recovery models. As shown in Figure 8, HAL has weaker performance. Of the two identity recovery models, the additional disentanglement constraints of the identity recovery model of Figure 4 result in improved performance compared to WIR.
[0115] Figure 9 shows two examples of a source image of the object (reference image), a computer-generated depth map of the object, a new depth map of the object, and a computer-generated prediction of the object synthesized through identity recovery (new image), compared with a ground truth image of the object. As shown by example in Figure 9, the synthesis quality across large viewpoint changes can be very high.
[0116] Referring now to Figure 10, the synthesized images output by DRAW are qualitatively compared with images output by models 10-1, 10-2, and 10-3 mentioned above. DRAW is optimized on Pix3D images, while its shape rotator is trained on ShapeNet. Model 10-3 was trained and tested using multiple views, whereas DRAW generates images from a single image. Figure 10 shows the entire trajectory of synthesized images.
[0117] Since there is no target image in this example, L1 and SSIM are not calculated. Instead, the models are compared via Inception scores, with the results for the table category listed in Table 1. View synthesis results using DRAW on table images are shown in Figure 11.
[0118] Table 1: Inception scores of models 10-1, 10-2, 10-3 and DRAW on the table images from Pix3D.
[0119]
| Model | 10-1 | 10-2 | 10-3 | DRAW |
| --- | --- | --- | --- | --- |
| Inception score | 9.77 | 9.24 | 9.78 | 10.21 |
[0120] DRAW was compared with a pixel generation method (10-1), an appearance flow method (10-2), and another recent approach (10-3). These models were trained on ShapeNet and applied to Pix3D test images.
[0121] As shown in Table 1, DRAW achieved the highest Inception score. Model 10-2 yielded relatively poor synthesis results. This is likely due to the challenging lighting and textures of natural images, which make dense appearance flow mapping fundamentally different from flow mapping in the synthetic domain. Applying previous methods to natural image datasets may require fine-tuning with viewpoint annotations, which DRAW does not utilize.
[0122] For comparison on ShapeNet, DRAW was trained using domain images and depth maps extracted from ShapeNet. The remaining methods were tested on the same images as described above. Figure 12 shows a comparison of view synthesis results for chairs from ShapeNet. As shown by example in Figure 12, all results are comparable.
[0123] In summary, DRAW synthesizes the pose trajectories of objects from reference images. This can be achieved using cross-modal pose trajectory transfer, based on i) mapping the RGB image to a 2D depth map, ii) transforming the depth map to simulate 3D object rotation, and iii) remapping to image space. DRAW can be trained using sets of real images with sparse views, such as Pix3D, and ShapeNet. The pose trajectories can be synthesized in the synthetic domain and transferred to image space in a manner that achieves object identity consistency. An identity recovery network capable of disentangling and recombining appearance and shape information helps achieve this consistency. Comparisons with other view synthesis methods show that DRAW can produce images with better quality, structural integrity, and instance identity.
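As a non-limiting illustration, the three stages i) to iii) summarized above can be sketched as a composition of three callables. The function `synthesize_views` and the toy stand-ins below are hypothetical; in particular, the stand-ins are trivial arithmetic placeholders that only show the data flow and do not implement the trained domain transfer, depth rotator, or identity recovery models.

```python
def synthesize_views(reference_image, to_depth, rotate_depth,
                     recover_identity, num_views):
    """Run the three stages and return one new image per new viewpoint."""
    # i) Map the RGB reference image to a depth map.
    reference_depth = to_depth(reference_image)
    # ii) Produce one rotated depth map per requested viewpoint index.
    new_depths = [rotate_depth(reference_depth, k)
                  for k in range(1, num_views + 1)]
    # iii) Map each new depth map back to image space, reusing the
    # appearance of the reference image.
    return [recover_identity(reference_image, d) for d in new_depths]


# Toy stand-ins (hypothetical placeholders, not the disclosed models).
def toy_to_depth(image):
    return [[1.0 - v for v in row] for row in image]


def toy_rotate(depth, k):
    # Cyclic column shift standing in for a learned rotation.
    n = len(depth[0])
    s = k % n
    return [row[n - s:] + row[:n - s] for row in depth]


def toy_recover(image, depth):
    mean = sum(sum(row) for row in image) / (len(image) * len(image[0]))
    return [[mean * d for d in row] for row in depth]


toy_image = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_views = synthesize_views(toy_image, toy_to_depth, toy_rotate,
                             toy_recover, num_views=3)
```

Passing the reference image to the final stage alongside each new depth map reflects the point that appearance is taken from the reference view while shape is taken from the rotated depth map.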
[0124] Figures 13A, 13B and 13C show a flowchart illustrating an example of a computer-implemented method 1300 for synthesizing a novel pose of an object. The following description of method 1300 is provided with reference to the components described above and shown in Figures 1-12 and 14, but it should be understood that method 1300 may also be executed in other contexts using other suitable components.
[0125] Referring to Figure 13A, at 1302, method 1300 includes receiving a reference image of an object corresponding to the original viewpoint. At 1304, method 1300 includes converting the reference image of the object into a reference depth map of the object. As shown at 1306, in some examples, converting the reference image of the object into a reference depth map of the object may include inputting the reference image of the object into a domain transfer module and receiving the reference depth map of the object from the domain transfer module.
[0126] At 1308, method 1300 may include receiving a foreground mask from the domain transfer module, the foreground mask identifying pixels associated with an object. As shown at 1310, in some examples, the domain transfer module includes a domain transfer model. At 1312, method 1300 may include training the domain transfer model on a dataset of paired images and depth maps.
[0127] Referring now to Figure 13B, at 1314, method 1300 includes synthesizing a new depth map of the object corresponding to the new viewpoint. As shown at 1316, synthesizing a new depth map of the object corresponding to the new viewpoint may include inputting a reference depth map of the object into a depth map generator, receiving a new depth map of the object from the depth map generator, and refining the new depth map of the object using a 3D depth refinement module. At 1318, method 1300 may include receiving a sequence of new depth maps from the depth map generator. In such an example, refining the new depth maps may include using a 3D convolutional neural network to enforce consistency among the sequence of new depth maps.
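As a non-limiting illustration of why a 3D convolution can relate neighboring depth maps in a sequence, the following sketch stacks the depth maps along a view axis and applies a single naive 3D convolution with zero padding. The fixed smoothing kernel is an assumption standing in for learned weights, and a single convolution is far simpler than a trained 3D convolutional neural network; the names `conv3d_same` and `VIEW_AXIS_KERNEL` are hypothetical.

```python
def conv3d_same(volume, kernel):
    """Naive zero-padded 3D cross-correlation over volume[view][row][col]."""
    views, rows, cols = len(volume), len(volume[0]), len(volume[0][0])
    kv, kr, kc = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = [[[0.0] * cols for _ in range(rows)] for _ in range(views)]
    for v in range(views):
        for r in range(rows):
            for c in range(cols):
                acc = 0.0
                for dv in range(kv):
                    for dr in range(kr):
                        for dc in range(kc):
                            vv = v + dv - kv // 2
                            rr = r + dr - kr // 2
                            cc = c + dc - kc // 2
                            inside = (0 <= vv < views and 0 <= rr < rows
                                      and 0 <= cc < cols)
                            if inside:
                                acc += kernel[dv][dr][dc] * volume[vv][rr][cc]
                out[v][r][c] = acc
    return out


# Assumed 3x1x1 kernel that mixes each view with its two neighbors.
VIEW_AXIS_KERNEL = [[[0.25]], [[0.5]], [[0.25]]]

# Toy sequence (hypothetical): the middle depth map disagrees with its neighbors.
toy_sequence = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
toy_refined = conv3d_same(toy_sequence, VIEW_AXIS_KERNEL)
```

Because the kernel spans the view axis, each output depth map depends on the adjacent views as well as on its own pixels, which is the property that lets a 3D network reduce inconsistencies between consecutive depth maps.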
[0128] Referring now to Figure 13C, at 1320, method 1300 includes generating a new image of the object from a new viewpoint based on a new depth map of the object and a reference image of the object. In some examples, at 1322, generating a new image of the object may include mapping the reference image of the object to appearance parameters, mapping the new depth map of the object to shape parameters, and combining the shape parameters and appearance parameters to generate a new image of the object from a new viewpoint.
[0129] As shown at 1324, generating a new image of the object may include inputting a reference image of the object and a new depth map of the object into the identity recovery model and receiving the new image of the object from the identity recovery model. In some examples, as shown at 1328, method 1300 may include training the identity recovery model on unpaired depth and image data. At 1330, method 1300 may include training the identity recovery model by: mapping the reference image to a reference shape parameter using a first structural encoder, mapping the new depth map to a new shape parameter using a second structural encoder, mapping the reference image to a reference appearance parameter using a first appearance encoder, mapping the new depth map to a new appearance parameter using a second appearance encoder, and combining each shape parameter with an appearance parameter to generate an image. At 1332, method 1300 may include training the identity recovery model using supervised learning and unsupervised learning. In some examples, at 1334, method 1300 may include directly supervising the training using a reference image of the object, a reference depth map, and a new depth map.
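As a non-limiting illustration of step 1330, the following sketch encodes the reference image and the new depth map with two structural encoders and two appearance encoders, then decodes every pairing of a shape parameter with an appearance parameter. The function name, the dictionary keys, and the toy encoders and decoder are hypothetical placeholders; the losses applied to each decoded output are not shown.

```python
from itertools import product


def decode_all_combinations(ref_image, new_depth,
                            shape_enc_image, shape_enc_depth,
                            app_enc_image, app_enc_depth, decoder):
    """Decode each (shape, appearance) pairing; returns a dict of outputs."""
    shapes = {
        "ref_shape": shape_enc_image(ref_image),   # first structural encoder
        "new_shape": shape_enc_depth(new_depth),   # second structural encoder
    }
    appearances = {
        "ref_appearance": app_enc_image(ref_image),  # first appearance encoder
        "new_appearance": app_enc_depth(new_depth),  # second appearance encoder
    }
    return {
        (s, a): decoder(shapes[s], appearances[a])
        for s, a in product(shapes, appearances)
    }


# Toy stand-ins (hypothetical): encoders tag their input, decoder pairs codes.
toy_outputs = decode_all_combinations(
    "x0", "sp",
    shape_enc_image=lambda x: ("S", x),
    shape_enc_depth=lambda d: ("S", d),
    app_enc_image=lambda x: ("A", x),
    app_enc_depth=lambda d: ("A", d),
    decoder=lambda shape, appearance: (shape, appearance),
)
```

Of the four pairings, the one keyed `("new_shape", "ref_appearance")` corresponds to the inference-time case of step 1322, in which shape comes from the new depth map and appearance comes from the reference image.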
[0130] Figure 14 schematically shows an example of a computing system 1400 that can perform one or more of the methods and processes described above. The computing system 1400 is shown in simplified form. The computing system 1400 may take the form of one or more personal computers, server computers, tablet computers, home entertainment computers, network computing devices, gaming devices, mobile computing devices, mobile communication devices (e.g., smartphones), and/or other computing devices.
[0131] The computing system 1400 includes a logic machine 1402 and a storage machine 1404. The computing system 1400 may optionally include a display subsystem 1406, an input subsystem 1408, a communication subsystem 1410, and/or other components not shown in Figure 14.
[0132] The logic machine 1402 includes one or more physical devices configured to execute instructions. For example, a logic machine may be configured to execute instructions that are part of one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform tasks, implement data types, transition the state of one or more components, achieve technical effects, or otherwise achieve desired results.
[0133] The logic machine may include one or more processors configured to execute software instructions. Additionally or alternatively, the logic machine may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. The processors of the logic machine may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Components of the logic machine may optionally be distributed across two or more separate devices that may be remotely located and/or configured for coordinated processing. Aspects of the logic machine may be virtualized and executed by remotely accessible networked computing devices configured in a cloud computing configuration.
[0134] Storage machine 1404 includes one or more physical devices configured to hold instructions executable by the logic machine to implement the methods and processes described herein. When such methods and processes are implemented, the state of storage machine 1404 can be changed, for example, to hold different data.
[0135] Storage machine 1404 may include removable and / or built-in devices. Storage machine 1404 may include optical memory (e.g., CD, DVD, HD-DVD, Blu-ray disc, etc.), semiconductor memory (e.g., RAM, EPROM, EEPROM, etc.), and / or magnetic memory (e.g., hard disk drive, floppy disk drive, tape drive, MRAM, etc.). Storage machine 1404 may include volatile, non-volatile, dynamic, static, read / write, read-only, random access, sequential access, location-addressable, file-addressable, and / or content-addressable devices.
[0136] It should be understood that the storage machine 1404 includes one or more physical devices. However, aspects of the instructions described herein may alternatively be propagated by a communication medium (e.g., electromagnetic signals, optical signals, etc.) that is not maintained by the physical device for a finite duration.
[0137] Aspects of logic machine 1402 and storage machine 1404 may be integrated together into one or more hardware logic components. Such hardware logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC / ASICs), program- and application-specific standard products (PSSP / ASSPs), system-on-a-chip (SOC) devices, and complex programmable logic devices (CPLDs).
[0138] The term "program" can be used to describe an aspect of computing system 1400 implemented to perform a specific function. In some cases, a program can be instantiated via logic machine 1402 executing instructions held by storage machine 1404. It should be understood that different programs can be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Similarly, the same program can be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The term "program" can encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
[0139] It should be understood that, as used herein, a "service" is an application that can be executed across multiple user sessions. A service may be available to one or more system components, programs, and / or other services. In some implementations, a service may run on one or more server computing devices.
[0140] When included, the display subsystem 1406 can be used to present a visual representation of the data stored by the storage machine 1404. This visual representation may take the form of a graphical user interface (GUI). Since the methods and processes described herein change the data held by the storage machine, thereby changing the state of the storage machine, the state of the display subsystem 1406 can also be transformed to visually represent changes in the underlying data. The display subsystem 1406 may include one or more display devices using virtually any type of technology. Such display devices may be combined with the logic machine 1402 and / or the storage machine 1404 in a shared enclosure, or such display devices may be peripheral display devices.
[0141] When included, the input subsystem 1408 may include or interface with one or more user input devices, such as a keyboard, mouse, touchscreen, or game controller. In some embodiments, the input subsystem may include or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on-board or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereo, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; and electric-field sensing componentry for assessing brain activity.
[0142] When included, the communication subsystem 1410 can be configured to communicatively couple the computing system 1400 to one or more other computing devices. The communication subsystem 1410 may include wired and / or wireless communication devices compatible with one or more different communication protocols. As a non-limiting example, the communication subsystem may be configured to communicate via a wireless telephone network or a wired or wireless local area network or wide area network. In some embodiments, the communication subsystem may allow the computing system 1400 to send and / or receive messages to and / or from other devices via a network such as the Internet.
[0143] Another example provides a method for synthesizing a novel pose of an object, executed on a computing system, the method comprising: receiving a reference image of the object corresponding to an original viewpoint; converting the reference image of the object into a reference depth map of the object; synthesizing a new depth map of the object corresponding to a new viewpoint; and generating a new image of the object from the new viewpoint based on the new depth map of the object and the reference image of the object. Converting the reference image of the object into a reference depth map of the object may additionally or alternatively include inputting the reference image of the object into a domain transfer module; and receiving the reference depth map of the object from the domain transfer module. The method may additionally or alternatively include receiving a foreground mask from the domain transfer module, the foreground mask identifying pixels associated with the object. The domain transfer module may additionally or alternatively include a domain transfer model, and the method may additionally or alternatively include training the domain transfer model on a dataset of paired images and depth maps. Synthesizing a new depth map of the object corresponding to the new viewpoint may additionally or alternatively include inputting the reference depth map of the object into a depth map generator; receiving the new depth map of the object from the depth map generator; and refining the new depth map of the object using a 3D depth refinement module. The method may additionally or alternatively include receiving a sequence of new depth maps from the depth map generator, and refining the new depth maps may additionally or alternatively include using a 3D convolutional neural network to enforce consistency among the sequence of new depth maps. 
Generating a new image of the object may additionally or alternatively include mapping a reference image of the object to appearance parameters; mapping a new depth map of the object to shape parameters; and combining the shape parameters and appearance parameters to generate a new image of the object from a new viewpoint. Generating a new image of the object may additionally or alternatively include inputting a reference image of the object and a new depth map of the object into an identity recovery model; and receiving a new image of the object from the identity recovery model. The method may additionally or alternatively include training the identity recovery model on unpaired depth and image data. Training the identity recovery model may additionally or alternatively include mapping the reference image to a reference shape parameter using a first structural encoder; mapping the new depth map to a new shape parameter using a second structural encoder; mapping the reference image to a reference appearance parameter using a first appearance encoder; mapping the new depth map to a new appearance parameter using a second appearance encoder; and combining each of the reference shape parameter and the new shape parameter with one of the reference appearance parameter and the new appearance parameter to generate an image. The method may additionally or alternatively include training the identity recovery model using supervised and unsupervised learning. The method may additionally or alternatively include directly supervising the training using a reference image of the object, a reference depth map, and a new depth map.
[0144] Another example provides a computing device including: a processor; and a storage device storing instructions executable by the processor to receive a reference image of an object corresponding to an original viewpoint; convert the reference image of the object into a reference depth map of the object; synthesize a new depth map of the object corresponding to a new viewpoint; and generate a new image of the object from the new viewpoint based on the new depth map of the object and the reference image of the object. Converting the reference image of the object into a reference depth map of the object may additionally or alternatively include inputting the reference image of the object into a domain transfer module; and receiving the reference depth map of the object from the domain transfer module. The instructions may additionally or alternatively be executable to receive a foreground mask from the domain transfer module, the foreground mask identifying pixels associated with the object. Generating a new image of the object may additionally or alternatively include mapping the reference image of the object to appearance parameters; mapping the new depth map of the object to shape parameters; and combining the shape parameters and appearance parameters to generate a new image of the object from the new viewpoint. Generating a new image of the object may additionally or alternatively include inputting the reference image of the object and the new depth map of the object into an identity recovery model; and receiving the new image of the object from the identity recovery model. 
Training the identity recovery model may additionally or alternatively include mapping a reference image to a reference shape parameter using a first structural encoder; mapping a new depth map to a new shape parameter using a second structural encoder; mapping a reference image to a reference appearance parameter using a first appearance encoder; mapping a new depth map to a new appearance parameter using a second appearance encoder; and combining each of the reference shape parameter and the new shape parameter with one of the reference appearance parameter and the new appearance parameter to generate an image. Instructions may additionally or alternatively be executable to directly supervise training using the reference image, reference depth map, and new depth map of the object.
[0145] Another example provides a computing device including: a processor; and a storage device storing processor-executable instructions to receive a reference image of an object corresponding to an original viewpoint; convert the reference image of the object into a reference depth map of the object; synthesize a new depth map of the object corresponding to a new viewpoint; map the reference image of the object to appearance parameters; map the new depth map of the object to shape parameters; and combine the shape parameters and appearance parameters to generate a new image of the object from the new viewpoint.
[0146] It should be understood that the configurations and / or schemes described herein are exemplary in nature, and these specific embodiments or examples should not be considered limiting, as many variations are possible. The specific procedures or methods described herein may represent one or more of any number of processing strategies. Therefore, the various actions shown and / or described may be performed in the order shown and / or described, in another order, in parallel, or omitted. Similarly, the order of the above processes may be changed.
[0147] The subject matter of this disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations disclosed herein, as well as any and all equivalents thereof.
Claims
1. A method, enacted on a computing system, for synthesizing a novel pose of an object, the method comprising: receiving a reference image of an object corresponding to an original viewpoint, the original viewpoint being an original perspective view of the object; converting the reference image of the object into a reference depth map of the object, the reference depth map including shape information of the object corresponding to the original viewpoint; synthesizing a new depth map of the object corresponding to a new viewpoint of the object by applying a depth rotator to the reference depth map of the object; and inputting the reference image of the object and the new depth map of the object into an identity recovery model to generate a new image of the object from the new viewpoint based on appearance information of the object obtained from the reference image and shape information obtained from the new depth map.
2. The method of claim 1, wherein converting the reference image of the object into the reference depth map of the object comprises: inputting the reference image of the object into a domain transfer module; and receiving the reference depth map of the object from the domain transfer module.
3. The method of claim 2, further comprising receiving a foreground mask from the domain transfer module, the foreground mask identifying pixels associated with the object.
4. The method of claim 2, wherein the domain transfer module includes a domain transfer model, and the method further includes training the domain transfer model on a dataset of paired images and depth maps.
5. The method of claim 1, wherein synthesizing the new depth map of the object corresponding to the new viewpoint comprises: inputting the reference depth map of the object into a depth map generator; receiving the new depth map of the object from the depth map generator; and refining the new depth map of the object using a 3D depth refinement module.
6. The method of claim 5, further comprising receiving a sequence of new depth maps from the depth map generator, wherein refining the new depth map includes using a 3D convolutional neural network to enforce consistency among the sequence of new depth maps.
7. The method of claim 1, wherein generating the new image of the object comprises: mapping the reference image of the object to appearance parameters; mapping the new depth map of the object to shape parameters; and combining the shape parameters and the appearance parameters to generate the new image of the object from the new viewpoint.
8. The method of claim 1, further comprising training the identity recovery model on unpaired depth and image data.
9. The method of claim 1, further comprising training the identity recovery model by: mapping the reference image to reference shape parameters using a first structural encoder; mapping the new depth map to new shape parameters using a second structural encoder; mapping the reference image to reference appearance parameters using a first appearance encoder; mapping the new depth map to new appearance parameters using a second appearance encoder; and combining each of the reference shape parameters and the new shape parameters with one of the reference appearance parameters and the new appearance parameters to generate an image.
10. The method of claim 1, further comprising training the identity recovery model using supervised learning and unsupervised learning.
11. The method of claim 10, further comprising using the reference image of the object, the reference depth map, and the new depth map to directly supervise the training.
12. A computing device, comprising: a processor; and a storage device storing instructions executable by the processor to: receive a reference image of an object corresponding to an original viewpoint, the original viewpoint being an original perspective view of the object; convert the reference image of the object into a reference depth map of the object, the reference depth map including shape information of the object corresponding to the original viewpoint; synthesize a new depth map of the object corresponding to a new viewpoint of the object by applying a depth rotator to the reference depth map of the object; and input the reference image of the object and the new depth map of the object into an identity recovery model to generate a new image of the object from the new viewpoint based on appearance information of the object obtained from the reference image and shape information obtained from the new depth map.
13. The computing device of claim 12, wherein the instructions are further executable to convert the reference image of the object into the reference depth map of the object by: inputting the reference image of the object into a domain transfer module; and receiving the reference depth map of the object from the domain transfer module.
14. The computing device of claim 13, wherein the instructions are further executable to receive a foreground mask from the domain transfer module, the foreground mask identifying pixels associated with the object.
15. The computing device of claim 12, wherein the instructions are further executable to generate the new image of the object by: mapping the reference image of the object to appearance parameters; mapping the new depth map of the object to shape parameters; and combining the shape parameters and the appearance parameters to generate the new image of the object from the new viewpoint.
16. The computing device of claim 12, wherein the instructions are further executable to train the identity recovery model by: mapping the reference image to reference shape parameters using a first structural encoder; mapping the new depth map to new shape parameters using a second structural encoder; mapping the reference image to reference appearance parameters using a first appearance encoder; mapping the new depth map to new appearance parameters using a second appearance encoder; and combining each of the reference shape parameters and the new shape parameters with one of the reference appearance parameters and the new appearance parameters to generate an image.
17. The computing device of claim 16, wherein the instructions are further executable to directly supervise the training using the reference image, the reference depth map, and the new depth map of the object.
18. A computing device, comprising: a processor; and a storage device storing instructions executable by the processor to: receive a reference image of an object corresponding to an original viewpoint, the original viewpoint being an original perspective view of the object; convert the reference image of the object into a reference depth map of the object, the reference depth map including shape information of the object corresponding to the original viewpoint; synthesize a new depth map of the object corresponding to a new viewpoint of the object by applying a depth rotator to the reference depth map of the object; and input the reference image of the object and the new depth map of the object into an identity recovery model to generate a new image of the object from the new viewpoint based on appearance information of the object obtained from the reference image and shape information obtained from the new depth map, wherein the instructions are further executable to generate the new image of the object by: mapping the reference image of the object to appearance parameters; mapping the new depth map of the object to shape parameters; and combining the shape parameters and the appearance parameters to generate the new image of the object from the new viewpoint.