Information processing system, information processing method, and program

By converting 3D geometry to two-dimensional maps and training a DNN 3D representation model to minimize rendering errors, the method addresses the limitations of existing technologies, achieving high-resolution and smooth 3D geometry for diverse objects.

WO2026140904A1 · PCT designated stage · Publication Date: 2026-07-02 · SONY GROUP CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SONY GROUP CORP
Filing Date
2025-12-11
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing methods for obtaining high-resolution and smooth three-dimensional geometry for various subjects, such as photogrammetry, DNN 3D representation models, 3D scanners, and 3D generation AI models, face limitations in handling objects with uniform colors, reflective surfaces, translucent materials, fine structures, and biased training data, leading to low-quality geometry.

Method used

A method involving converting 3D geometry into two-dimensional depth and normal maps for spatial correlation, training a DNN 3D representation model to minimize rendering errors, and integrating different geometry generation processes to achieve high-resolution and smooth 3D geometry.

Benefits of technology

The approach enables the generation of high-resolution and smooth 3D geometry for a wider variety of objects by considering 3D consistency and spatial correlation, overcoming limitations of existing methods.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure JP2025043241_02072026_PF_FP_ABST

Abstract

The present disclosure relates to an information processing system, an information processing method, and a program with which it is possible to obtain high-resolution and smooth three-dimensional geometry for various subjects (objects). In the present disclosure, a plurality of first RGB images relating to an object are acquired, a plurality of imaging parameters corresponding to the plurality of first RGB images are extracted, a first two-dimensional geometry of the object is generated on the basis of the object, a second two-dimensional geometry is generated on the basis of inference results from a three-dimensional representation model in which the plurality of imaging parameters are used as an input, and the three-dimensional representation model is trained so that any error between the first two-dimensional geometry and the second two-dimensional geometry and between the first RGB images of which the imaging parameters correspond to each other is minimized. The present disclosure can be applied to an information processing system that generates three-dimensional geometry.

Description

Information Processing System, Information Processing Method, and Program

[0001] The present disclosure relates to an information processing system, an information processing method, and a program, and particularly relates to an information processing system, an information processing method, and a program that are configured to obtain high-resolution and smooth three-dimensional geometry for various subjects (objects).

[0002] In recent years, in video production and game production, the need for technologies to estimate the 3D shape (geometry) of real-world objects has been increasing.

[0003] For example, a technique has been proposed to obtain a 3D shape (geometry) by aligning the geometry of a 3D scanner and the RGB image of a camera through segmentation (see Patent Document 1).

[0004] Patent Document 1: Japanese Patent Application Laid-Open No. 2021-189600

[0005] However, the technique of Patent Document 1 obtains a 3D shape by attaching captured image data to 3D point cloud data, and it cannot obtain high-resolution and smooth geometry for various subjects.

[0006] The present disclosure has been made in view of such a situation, and particularly enables obtaining high-resolution and smooth three-dimensional geometry for various subjects.

[0007] An information processing system according to one aspect of the present disclosure includes a circuit configured to acquire a plurality of first RGB images related to an object, extract a plurality of imaging parameters corresponding to the plurality of first RGB images, generate a first two-dimensional geometry for each of the imaging parameters based on the object, generate a second two-dimensional geometry of the object based on an inference result of a three-dimensional representation model that takes the plurality of imaging parameters as an input, and train the three-dimensional representation model so as to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

[0008] One aspect of the information processing method of this disclosure is an information processing method that includes: acquiring a plurality of first RGB images relating to an object and performing an extraction process to extract a plurality of imaging parameters corresponding to the plurality of first RGB images; performing a first two-dimensional geometry generation process to generate a first two-dimensional geometry of the object for each of the imaging parameters based on the object; performing a second two-dimensional geometry generation process to generate a second two-dimensional geometry of the object based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input; and performing a learning process to train the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

[0009] A program in one aspect of this disclosure acquires a plurality of first RGB images relating to an object, extracts a plurality of imaging parameters corresponding to the plurality of first RGB images, generates a first two-dimensional geometry for each of the imaging parameters based on the object, generates a second two-dimensional geometry of the object based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input, and causes a computer to function to train the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

[0010] In one aspect of this disclosure, a plurality of first RGB images of an object are acquired, a plurality of imaging parameters corresponding to the plurality of first RGB images are extracted, a first two-dimensional geometry is generated for each of the imaging parameters based on the object, a second two-dimensional geometry of the object is generated based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input, and the three-dimensional representation model is trained to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

[0011] Figure 1 illustrates an overview of the present disclosure. Figure 2 illustrates an example configuration of the first embodiment of the information processing system of the present disclosure. Figure 3 is a flowchart illustrating the learning process by the information processing system of Figure 2. Figure 4 illustrates an information processing system that infers a three-dimensional geometry using a three-dimensional representation model trained by the information processing system of Figure 2. Figure 5 is a flowchart illustrating the inference process by the information processing system of Figure 4. Figure 6 illustrates an example configuration of the second embodiment of the information processing system of the present disclosure. Figure 7 illustrates an example of correction processing by the correction processing unit of the information processing system of Figure 6. Figure 8 is a flowchart illustrating the learning process by the information processing system of Figure 6. Figure 9 illustrates an example configuration of the third embodiment of the information processing system of the present disclosure. Figure 10 is a flowchart illustrating the learning process by the information processing system of Figure 9. Figure 11 illustrates an example configuration of the fourth embodiment of the information processing system of the present disclosure. Figure 12 is a flowchart illustrating the learning process by the information processing system of Figure 11. Figure 13 illustrates an example configuration of a general-purpose computer.

[0012] Preferred embodiments of this disclosure will be described in detail below with reference to the attached drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant descriptions will be omitted.

[0013] The following describes the configurations for implementing this technology. The explanation will proceed in the following order.

[0014] 1. Overview of this Disclosure
2. First Embodiment
3. Second Embodiment
4. Third Embodiment
5. Fourth Embodiment
6. Example of Execution by Software

[0015] <<1. Overview of this Disclosure>> This disclosure aims to make it possible to obtain high-resolution and smooth three-dimensional geometry for a variety of subjects (objects). We will therefore first explain the overview of this disclosure.

[0016] In video production and game development, there is a growing need for technology that estimates (infers) the 3D shape (geometry) of subjects (objects) that consist of real-world objects.

[0017] Estimation (inference) methods include photogrammetry, DNN 3D representation models, 3D scanners using specialized equipment, and 3D generation AI models.

[0018] Photogrammetry is a method of estimating the geometry of a subject (object) based on the correspondence of feature points in RGB images taken of the subject from multiple imaging locations.

[0019] DNN (Deep Neural Network) 3D representation models, such as NeRF (Neural Radiance Fields) / 3DGS (3D Gaussian Splatting), are known to determine the geometry of an object by optimizing the rendered image and performing multi-view correlation.

[0020] A 3D scanner uses specialized equipment to scan an object in three dimensions and determine its geometry.

[0021] 3D generation AI (Artificial Intelligence) models generate the geometry of an object from images and text related to that object.

[0022] However, for each of the geometry estimation methods described above, there are subjects (objects) whose geometry the method is fundamentally unable to determine appropriately, which makes it difficult to obtain high-resolution and smooth geometry for a wide variety of subjects (objects).

[0023] For example, photogrammetry is known to produce poor-quality geometry for subjects (objects) with uniform colors or patterns, from which feature points are difficult to extract, and for highly reflective subjects (objects).

[0024] Furthermore, it is known that DNN 3D representation models such as NeRF / 3DGS produce low-quality geometry from translucent subjects (objects) and subjects (objects) with fine structures such as hair and trees.

[0025] Furthermore, it is known that 3D scanners produce low-quality geometry from black or highly reflective objects, and that the size of objects from which high-quality geometry can be obtained is limited depending on the scanner method.

[0026] Furthermore, it is known that 3D generation AI models are limited in their generated shapes due to bias in the training data, sometimes outputting geometry that differs from the actual subject (object), and also have low resolution.

[0027] Furthermore, there are methods that combine (fuse) the 3D geometries that are the final outputs of these various methods, but a simple combination can inherit the disadvantages of each method, making a successful fusion difficult.

[0028] Furthermore, in rendering image optimization methods, approaches that utilize 3D geometry for regularization (e.g., point cloud constraints) have been shown to have difficulty considering 3D spatial correlations, making it difficult to achieve maximum performance.

[0029] Therefore, in this disclosure, a three-dimensionally consistent three-dimensional geometry is converted into a two-dimensional geometry (depth map and normal map) that is easier to spatially correlate, and then used for regularization of a DNN three-dimensional representation model that obtains the three-dimensional geometry while performing multi-view correlation during rendering image optimization.

[0030] More specifically, as shown in the upper left of Figure 1, the object (subject) Pt is imaged by multi-view cameras Cm0 to Cm8, and the multi-view camera image (RGB image) Pi and camera parameters Cp consisting of the imaging position and angle (imaging direction) of cameras Cm0 to Cm8 are obtained.

[0031] Furthermore, while camera Cmi (i = 0 to 8) represents nine camera positions and angles, any other number of positions and angles may be used. Also, the multi-view camera image Pi may be captured by cameras with different camera parameters Cp, or it may be captured by the same camera while changing the camera parameters Cp.

[0032] Furthermore, 3D reconstruction using multi-view camera images Pi is performed to generate a 3D geometry that is consistent in three dimensions. Then, rendering processing using the generated 3D geometry generates a depth map (DMi) and a normal map (NMi) that are 2D geometries that are easy to spatially correlate.
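The conversion from 3D geometry to a depth map and a normal map can be illustrated with a minimal sketch. It is not the renderer of the disclosure: it assumes a point-cloud geometry already expressed in camera coordinates, a pinhole camera with a single focal length `f` in pixels, and a principal point at the image center. The function names `render_depth` and `normals_from_depth` are hypothetical.

```python
import numpy as np

def render_depth(points_cam, f, width, height):
    """Z-buffer point splatting (illustrative assumption, not the patent's renderer).

    points_cam: (N, 3) points in camera coordinates, +z pointing forward.
    Returns a (height, width) depth map; pixels hit by no point stay at inf.
    """
    depth = np.full((height, width), np.inf)
    in_front = points_cam[:, 2] > 0
    x, y, z = points_cam[in_front].T
    # Pinhole projection with the principal point at the image center.
    u = np.round(f * x / z + (width - 1) / 2).astype(int)
    v = np.round(f * y / z + (height - 1) / 2).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Keep the nearest depth when several points land on the same pixel.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    return depth

def normals_from_depth(depth, f):
    """Approximate a normal map from a hole-free depth map by finite differences."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in camera coordinates.
    pts = np.stack([(us - (w - 1) / 2) * depth / f,
                    (vs - (h - 1) / 2) * depth / f,
                    depth], axis=-1)
    d_du = np.gradient(pts, axis=1)  # change along image columns
    d_dv = np.gradient(pts, axis=0)  # change along image rows
    # Cross-product order chosen so normals point back toward the camera (-z).
    n = np.cross(d_dv, d_du)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

Because both maps live on the pixel grid of the corresponding RGB image, neighboring pixels can be compared directly, which is the spatial-correlation property the paragraph above relies on.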

[0033] On the other hand, the DNN 3D representation model M to be trained is a model that infers 3D geometry I3dg based on camera parameters Cp.

[0034] Here, rendering is performed using the 3D geometry I3dg inferred by the DNN 3D representation model M from the camera parameters Cp, so that multi-view camera images Po, depth maps DMo, and normal maps NMo are generated for the same camera parameters Cp as those of cameras Cm0 to Cm8.

[0035] Then, the DNN 3D representation model M is trained to minimize the rendering error (difference between the two) between the 2D geometry Ii2dg, which consists of a multi-view camera image Pi, a depth map DMi, and a normal map NMi, and the 2D geometry Io2dg, which consists of a multi-view camera image Po, a depth map DMo, and a normal map NMo, thereby regularizing the DNN 3D representation model M.
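One way to write the rendering error described above as a single objective is sketched below. The source states only that the difference between Ii2dg and Io2dg is minimized; the squared norm, the weights λ, the camera index k, and the symbol θ for the parameters of model M are all assumptions introduced here for illustration.

```latex
\mathcal{L}(\theta) = \sum_{k}
  \Big(
    \lambda_{\mathrm{rgb}} \,\bigl\lVert P^{o}_{k}(\theta) - P^{i}_{k} \bigr\rVert^{2}
  + \lambda_{\mathrm{d}}   \,\bigl\lVert DM^{o}_{k}(\theta) - DM^{i}_{k} \bigr\rVert^{2}
  + \lambda_{\mathrm{n}}   \,\bigl\lVert NM^{o}_{k}(\theta) - NM^{i}_{k} \bigr\rVert^{2}
  \Big)
```

Here k runs over the camera parameters Cp of cameras Cm0 to Cm8, the superscript i marks the reference side (Pi, DMi, NMi), and the superscript o marks the maps rendered from the inferred geometry I3dg (Po, DMo, NMo). The depth and normal terms act as the regularizer, while the RGB term is the usual rendered-image error.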

[0036] By regularizing the DNN 3D representation model M through this learning process, it becomes possible to optimize it by considering 3D consistency and spatial correlation of adjacent regions at the resolution level of the RGB images that constitute the multi-view camera image Pi. As a result, the DNN 3D representation model M can obtain high-resolution and smooth 3D geometry of the object (subject) for each input camera parameter Cp.

[0037] Furthermore, by training the DNN 3D representation model M using 2D geometry Ii2dg obtained from 3D geometry reconstructed from multi-view camera images Pi by a process with different characteristics from the DNN 3D representation model M, it becomes possible to appropriately fuse 3D geometry generation processes with different characteristics from the DNN 3D representation model M. This makes it possible to obtain high-resolution and smooth 3D geometry for a wider variety of objects (subjects).

[0038] Furthermore, when fusing 3D geometry generation processes that have different characteristics from the DNN 3D representation model M, a higher quality fusion can be achieved by using RGB images to apply correction processing to the 2D geometry generated by rendering from the 3D geometry that has different characteristics from the DNN 3D representation model M.

[0039] <<2. First Embodiment>> Next, with reference to Figure 2, an example of the configuration of the first embodiment of an information processing system that enables obtaining high-resolution and smooth three-dimensional geometry for various objects (subjects) of the present disclosure will be described.

[0040] The information processing system 31 in Figure 2 trains a DNN 3D representation model 61 to estimate the 3D geometry of an object (subject) based on camera images taken of the object from multiple positions and angles (angles of the imaging direction).

[0041] More specifically, the information processing system 31 in Figure 2 consists of a camera parameter acquisition unit 51, a 3D reconstruction unit 52, a rendering processing unit 53, a DNN 3D representation model learning unit 54, and a rendering error calculation unit 55.

[0042] The camera parameter acquisition unit 51 acquires camera parameters consisting of the camera's imaging position and angle (imaging direction) relative to the object, which are included in the multi-view camera images taken of the object (subject) from multiple viewpoints, and outputs them to the 3D reconstruction unit 52, the rendering processing unit 53, and the DNN 3D representation model learning unit 54.
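As an illustration of what camera parameters consisting of an imaging position and an imaging direction can look like in practice, the following sketch stores them in a small data class and converts them into a world-to-camera rotation and translation. The class `CameraParams`, the function `world_to_camera`, and the fixed world up vector are assumptions for this example; the source does not specify a representation.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class CameraParams:
    """Hypothetical container for one view's camera parameters."""
    position: np.ndarray   # camera center in world coordinates, shape (3,)
    direction: np.ndarray  # imaging direction in world coordinates, shape (3,)

def world_to_camera(cam, up=(0.0, 1.0, 0.0)):
    """Build rotation R and translation t so that x_cam = R @ x_world + t.

    Assumes the imaging direction is not parallel to the world up vector.
    """
    forward = cam.direction / np.linalg.norm(cam.direction)
    right = np.cross(np.asarray(up, dtype=float), forward)
    right /= np.linalg.norm(right)
    true_up = np.cross(forward, right)
    # Rows of R are the camera axes expressed in world coordinates.
    R = np.stack([right, true_up, forward])
    t = -R @ cam.position
    return R, t

# Usage: a camera 5 units from the origin on the +x axis, looking at the origin.
cam = CameraParams(position=np.array([5.0, 0.0, 0.0]),
                   direction=np.array([-1.0, 0.0, 0.0]))
R, t = world_to_camera(cam)
origin_in_camera = R @ np.zeros(3) + t  # the object center seen from this view
```

In this convention the object center ends up on the camera's +z axis at a distance equal to the camera's distance from the origin.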

[0043] The 3D reconstruction unit 52 reconstructs 3D data of an object based on a plurality of multi-view camera images and corresponding camera parameters by a method different in characteristics from a DNN 3D representation model 61 to be learned, such as photogrammetry processing or a 3D generation AI model, and outputs the reconstructed data to the rendering processing unit 53.

[0044] The rendering processing unit 53 executes rendering processing based on the 3D data of the object, generates a depth map and a normal map for each camera parameter, and outputs them as 2D geometry to the rendering error calculation unit 55.

[0045] The DNN 3D representation model learning unit 54 learns a DNN 3D representation model 61 that infers the 3D geometry of an object based on the camera parameters when capturing multi-view camera images.

[0046] More specifically, the DNN 3D representation model learning unit 54 includes a DNN 3D representation model 61 and a rendering processing unit 62.

[0047] The DNN 3D representation model 61 is, for example, NeRF (Neural Radiance Fields) / 3DGS (3D Gaussian Splatting), etc. It infers the 3D geometry of an object based on the camera parameters when capturing multi-view camera images, and outputs the 3D geometry as an inference result to the rendering processing unit 62 during learning.

[0048] Note that the DNN 3D representation model 61 is a learning target, and at the initial stage of learning, even if it infers the 3D geometry of an object based on the camera parameters when capturing multi-view camera images, it cannot make appropriate inferences.

[0049] The rendering processing unit 62 generates an RGB image, a depth map, and a normal map for each camera parameter when capturing multi-view camera images by rendering processing based on the 3D geometry that is the inference result of the DNN 3D representation model 61, and outputs them as 2D geometry to the rendering error calculation unit 55.

[0050] The rendering error calculation unit 55 acquires, as two-dimensional geometry based on the multi-view camera images, the combination of the multiple RGB images that constitute the multi-view camera images and include the camera parameters, together with the corresponding depth maps and normal maps supplied by the rendering processing unit 53.

[0051] Furthermore, the rendering error calculation unit 55 acquires the RGB images, depth maps, and normal maps for each camera parameter, supplied by the rendering processing unit 62 of the DNN 3D representation model learning unit 54, as 2D geometry based on the inference results of the DNN 3D representation model 61.

[0052] The rendering error calculation unit 55 then calculates the rendering error difference for each of the following: the RGB image, depth map, and normal map that constitute the two-dimensional geometry based on multi-view camera images with the same camera parameters, and the RGB image, depth map, and normal map that constitute the two-dimensional geometry based on the inference results of the DNN three-dimensional representation model 61. This difference is then supplied to the DNN three-dimensional representation model learning unit 54 as loss.
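The per-component difference computed by the rendering error calculation unit 55 could be sketched as follows for a single camera parameter. The mean squared error, the equal default weights, and the dictionary keys `rgb`, `depth`, and `normal` are assumptions made for illustration; the source only says that a difference is calculated for each of the RGB image, depth map, and normal map and supplied as loss.

```python
import numpy as np

def rendering_loss(reference, inferred, weights=None):
    """Weighted sum of per-component mean squared errors for one view.

    reference: maps based on the multi-view camera images
               (RGB image plus depth and normal maps from unit 53).
    inferred:  maps rendered from the model's inferred 3D geometry (unit 62).
    Both are dicts of equally shaped arrays keyed by 'rgb', 'depth', 'normal'.
    """
    if weights is None:
        weights = {"rgb": 1.0, "depth": 1.0, "normal": 1.0}
    terms = {
        key: float(np.mean((inferred[key] - reference[key]) ** 2))
        for key in weights
    }
    total = sum(weights[key] * terms[key] for key in weights)
    return total, terms

# Usage with tiny synthetic maps: only the depth map differs, by 0.5 everywhere.
ref = {"rgb": np.zeros((2, 2, 3)), "depth": np.ones((2, 2)), "normal": np.zeros((2, 2, 3))}
out = {"rgb": np.zeros((2, 2, 3)), "depth": np.full((2, 2), 1.5), "normal": np.zeros((2, 2, 3))}
total, terms = rendering_loss(ref, out)
```

Returning the individual terms alongside the total makes it easy to see which component (appearance, depth, or surface orientation) dominates the error for a given view.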

[0053] The DNN 3D representation model learning unit 54 trains the DNN 3D representation model 61 to minimize the loss (rendering error) supplied by the rendering error calculation unit 55.

[0054] In other words, the DNN 3D representation model learning unit 54 can train the DNN 3D representation model 61 by considering 3D consistency and spatial correlation of adjacent regions at the resolution level of RGB images based on multi-view camera images. As a result, once the DNN 3D representation model 61 is trained, it becomes possible to obtain a high-resolution and smooth 3D geometry of an object for each camera parameter.

[0055] Furthermore, by using a 3D reconstruction unit 52 that employs a 3D geometry generation method having different characteristics from the DNN 3D representation model 61, such as photogrammetry or a 3D generation AI model, and training the DNN 3D representation model 61 with 2D geometry obtained from the 3D data reconstructed from the multi-view camera images, it becomes possible to appropriately fuse 3D geometry generation methods having different characteristics from the DNN 3D representation model 61. This makes it possible to obtain high-resolution and smooth 3D geometry for a wider variety of objects (subjects).

[0056] <Learning process by the information processing system in Figure 2> Next, referring to the flowchart in Figure 3, we will explain the learning process of the DNN 3D representation model 61 by the information processing system 31 in Figure 2.

[0057] In step S31, the camera parameter acquisition unit 51, the 3D reconstruction unit 52, and the rendering error calculation unit 55 acquire a multi-view camera image consisting of an RGB image that includes camera parameters, which consist of information on the camera's imaging position and angle (imaging direction), captured by the camera from multiple viewpoint positions relative to the object.

[0058] In step S32, the camera parameter acquisition unit 51 extracts camera parameters included in the multi-view camera image and supplies them to the 3D reconstruction unit 52, the rendering processing unit 53, and the DNN 3D representation model learning unit 54.

[0059] In step S33, the 3D reconstruction unit 52 reconstructs the object (subject) in three dimensions based on the multi-view camera images and their respective camera parameters, generates 3D data, and outputs it to the rendering processing unit 53.

[0060] In step S34, the rendering processing unit 53 generates a depth map and a normal map for each camera parameter through rendering processing based on the reconstructed 3D data, and outputs them as 2D geometry to the rendering error calculation unit 55.

[0061] In step S35, the DNN 3D representation model 61 in the DNN 3D representation model learning unit 54 infers the 3D geometry for each camera parameter of the multi-view camera image and outputs the resulting 3D geometry to the rendering processing unit 62.

[0062] In step S36, the rendering processing unit 62 generates RGB images, depth maps, and normal maps corresponding to multi-view camera images for each camera parameter by rendering based on the 3D geometry which is the inference result of the DNN 3D representation model 61, and outputs these to the rendering error calculation unit 55 as 2D geometry based on the 3D geometry which is the inference result of the DNN 3D representation model 61.

[0063] In step S37, the rendering error calculation unit 55 calculates a rendering error as a loss for each camera parameter, which consists of the difference between the 2D geometry based on the multi-view camera images, including the multi-view camera images, and the 2D geometry based on the inferred 3D geometry, and supplies it to the DNN 3D representation model learning unit 54.

[0064] In step S38, the DNN 3D representation model learning unit 54 trains the DNN 3D representation model 61 to minimize the loss resulting from rendering errors.

[0065] After the above series of learning processes are repeated a predetermined number of times, or until the rendering error becomes smaller than a predetermined value, the learning of the DNN 3D representation model 61 is completed.
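The termination rule above (a predetermined number of repetitions, or a rendering error below a predetermined value) can be sketched with a toy optimization loop. This is not the training procedure of the disclosure: the "model" here is just a depth map updated by plain gradient descent, and the names `train`, `max_iters`, `tol`, and `lr` are hypothetical.

```python
import numpy as np

def train(reference, max_iters=1000, tol=1e-6, lr=0.5):
    """Fit an estimate to a reference map, stopping on either criterion.

    Returns (estimate, final_loss, number_of_updates_performed).
    """
    estimate = np.zeros_like(reference, dtype=float)
    for step in range(max_iters):
        residual = estimate - reference
        loss = float(np.mean(residual ** 2))
        if loss < tol:
            # Rendering error is below the predetermined value: stop early.
            return estimate, loss, step
        # Gradient step on 0.5 * squared error per pixel (gradient = residual).
        estimate -= lr * residual
    # Predetermined number of repetitions reached.
    final_loss = float(np.mean((estimate - reference) ** 2))
    return estimate, final_loss, max_iters

reference_depth = np.full((4, 4), 2.0)
estimate, loss, updates = train(reference_depth)
```

In a real system the update would be a backpropagation step through the renderer and the 3D representation model, but the two stopping conditions would wrap that step in the same way.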

[0066] Through the above processing, it becomes possible to train the DNN 3D representation model 61 considering 3D consistency and spatial correlation of adjacent regions at the resolution level of RGB images based on multi-view camera images. As a result, it becomes possible to obtain high-resolution and smooth 3D geometry of objects according to the camera parameters.

[0067] Furthermore, through the above processing, the 3D reconstruction unit 52, which has different characteristics from the DNN 3D representation model 61 (for example, NeRF or 3DGS), reconstructs 3D data based on the RGB images that make up the multi-view camera images. Since the 2D geometry obtained from this 3D data is used for training, it becomes possible to appropriately fuse 3D geometry generation methods that have different characteristics from the DNN 3D representation model 61. This makes it possible to obtain high-resolution and smooth 3D geometry for a wider variety of subjects.

[0068] <Information processing system using a DNN 3D representation model trained through the learning process> Next, we will describe an example of the configuration of an information processing system using the DNN 3D representation model 61 that has been trained through the learning process described above.

[0069] After training, the DNN 3D representation model 61 can infer 3D geometry for each camera parameter of multiple RGB images that make up the multi-view camera images used in the training process.

[0070] Figure 4 shows an example of the configuration of an information processing system using a DNN 3D representation model 61 that has been trained through a learning process. Specifically, in the information processing system of Figure 4, the camera parameter acquisition unit 51 extracts camera parameters for each of the multiple RGB images that make up the multi-view camera image used in the learning process.

[0071] Then, when, for example, the user selects an arbitrary camera parameter from the extracted camera parameters, the DNN 3D representation model 61 infers and outputs the 3D geometry as seen from the camera's imaging position and angle (imaging direction) identified by the selected camera parameter.

[0072] The DNN 3D representation model 61 is trained through the learning process performed by the information processing system 31 in Figure 2, enabling the creation of high-resolution and smooth 3D geometry for a variety of subjects (objects) according to camera parameters.

[0073] <Inference processing by the information processing system in Figure 4> Next, referring to the flowchart in Figure 5, we will explain the inference processing by the DNN 3D representation model 61 that has been trained by the training process in Figure 3.

[0074] In step S51, the camera parameter acquisition unit 51 acquires a multi-view camera image consisting of RGB images taken with a multi-view camera of the object used in the learning process.

[0075] In step S52, the camera parameter acquisition unit 51 extracts camera parameters consisting of information on the camera's imaging position and angle from each of the acquired multi-view camera images, which are composed of RGB images.

[0076] In step S53, the DNN 3D representation model 61 infers and outputs the 3D geometry at the viewpoint position corresponding to the camera imaging position and angle, for any selected camera parameter from among the camera parameters extracted by the camera parameter acquisition unit 51.

[0077] As a result, the DNN 3D representation model 61 has been trained through the learning process described above, making it possible to infer high-resolution and smooth 3D geometry.

[0078] Although Figure 4 shows an example of an information processing system consisting only of a trained DNN 3D representation model 61 and a camera parameter acquisition unit 51, the same functionality can be achieved in the information processing system 31 of Figure 2 by operating only the camera parameter acquisition unit 51 and the DNN 3D representation model 61.

[0079] <<3. Second Embodiment>> In the above, we have described an example in which the DNN 3D representation model 61 is trained to minimize the loss resulting from the rendering error between the 2D geometry, which consists of a depth map and a normal map, generated by rendering based on the 3D data of the object reconstructed by the 3D reconstruction unit 52, and the 2D geometry based on the 3D geometry which is the inference result of the DNN 3D representation model 61.

[0080] However, the two-dimensional geometry consisting of a depth map and a normal map generated by rendering based on the 3D data of the object reconstructed by the 3D reconstruction unit 52 may be corrected to achieve more accurate learning using a more accurate two-dimensional geometry.

[0081] Figure 6 shows an example configuration of an information processing system 31A that corrects the two-dimensional geometry, consisting of a depth map and a normal map, generated by rendering based on the 3D data of the object reconstructed by the 3D reconstruction unit 52, in order to achieve more accurate learning using a more accurate two-dimensional geometry.

[0082] In the configuration of the information processing system 31A in Figure 6, components that have the same functions as those in the information processing system 31 in Figure 2 are given the same names and reference numerals, and their explanations are omitted.

[0083] In other words, the difference between the information processing system 31A in Figure 6 and the information processing system 31 in Figure 2 is that a correction processing unit 71 has been newly provided between the rendering processing unit 53 and the rendering error calculation unit 55.

[0084] The correction processing unit 71 consists of a generative AI model that has acquired a latent space (prior knowledge) capable of generating diverse and high-quality data through learning based on a large-scale dataset, and corrects the two-dimensional geometry consisting of a depth map and a normal map using multi-view camera images as guide information.

[0085] Here, using the image P1 of the motorcycle as the object shown in the left part of Figure 7 as an example of a multi-view camera image, we will explain the correction process performed by the correction processing unit 71.

[0086] When rendering is performed based on the 3D data reconstructed by the 3D reconstruction unit 52 from this image P1, parts of the object are likely to be lost, as shown in image P2: the backrest and mirror stays made of glossy materials such as metal (regions Z1 and Z2), the all-black seat area (region Z3), and the transparent parts of the lights and turn signals (regions Z4 and Z5).

[0087] Therefore, by applying correction using a correction processing unit 71 consisting of a generative AI model that has acquired a latent space (prior knowledge) capable of generating diverse and high-quality data through learning using a large dataset, it may be possible to correct the missing areas corresponding to regions Z1 to Z5 that occurred in image P2, as shown in image P3 of Figure 7.

[0088] In other words, the correction processing unit 71, which consists of a generative AI model trained on large-scale motorcycle image data, uses the RGB images that make up the multi-view camera images as guide information to correct the defects that occur in image P2, which is the rendering result based on the 3D data reconstructed by the 3D reconstruction unit 52. These defects include the missing backrest and stays, the all-black seat area, and the transparent parts of the lights and turn signals, and the corrected result is shown in image P3.

[0089] In this way, by correcting the rendering processing results based on the 3D data reconstructed by the 3D reconstruction unit 52, it becomes possible to train a DNN 3D representation model 61 with higher accuracy.

[0090] <Learning process by the information processing system in Figure 6> Next, the learning process by the information processing system 31A in Figure 6 will be explained with reference to the flowchart in Figure 8. Note that steps S71 to S79 in the flowchart in Figure 8 are the same as the processing described with reference to the flowchart in Figure 3, except for step S75, so their detailed explanation will be omitted.

[0091] In other words, in the processing of steps S71 to S74, the object (subject) is reconstructed based on the multi-view camera images and their respective camera parameters, and 3D data is generated. Based on the reconstructed 3D data, a rendering process is performed to generate a depth map and a normal map for each camera parameter, and these are output as 2D geometry.

[0092] In step S75, the correction processing unit 71 corrects the two-dimensional geometry consisting of a depth map and a normal map generated for each camera parameter and outputs it to the rendering error calculation unit 55.
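The position of the correction in the processing flow can be illustrated with a minimal sketch. It assumes that the 2D geometry for each camera parameter is held as a dictionary and that the correction is an optional stage applied between rendering and error calculation. The names `build_reference_geometry`, `fill_missing_depth`, and `toy_render` are hypothetical, and the hole-filling rule is only a placeholder; it is not the generative AI model of the correction processing unit 71.

```python
def fill_missing_depth(geometry, guide_image):
    """Placeholder correction (NOT a generative model): fill missing depth
    values (None) with the mean of the valid ones. The guide image is accepted
    only to mirror the interface; a real model would condition on it."""
    valid = [d for row in geometry["depth"] for d in row if d is not None]
    fallback = sum(valid) / len(valid) if valid else 0.0
    depth = [[fallback if d is None else d for d in row]
             for row in geometry["depth"]]
    return {**geometry, "depth": depth}


def build_reference_geometry(render_fn, camera_params, guide_images,
                             correct_fn=None):
    """Render 2D geometry for each camera parameter, optionally correcting it."""
    references = []
    for cam, guide in zip(camera_params, guide_images):
        geometry = render_fn(cam)
        if correct_fn is not None:
            # Corresponds to inserting a correction stage before the
            # rendering error calculation.
            geometry = correct_fn(geometry, guide)
        references.append(geometry)
    return references


def toy_render(cam):
    # Hypothetical renderer output with two missing depth pixels.
    return {"depth": [[1.0, None], [3.0, None]], "normal": None}


uncorrected = build_reference_geometry(toy_render, ["cam0"], ["rgb0"])
corrected = build_reference_geometry(toy_render, ["cam0"], ["rgb0"],
                                     fill_missing_depth)
```

Passing `correct_fn=None` reproduces the flow without a correction stage, so the same function covers both configurations.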

[0093] Furthermore, in steps S76 to S79, a 3D geometry is inferred for each camera parameter of the multi-view camera image, and a 2D geometry based on the 3D geometry, which is the inference result of the DNN 3D representation model 61, is output to the rendering error calculation unit 55.

[0094] Then, the rendering error calculation unit 55 calculates, for each camera parameter, a rendering error as a loss, namely the difference between the corrected 2D geometry derived from the multi-view camera images, together with the multi-view camera images themselves, and the 2D geometry based on the inferred 3D geometry. This loss is supplied to the DNN 3D representation model learning unit 54, and the DNN 3D representation model 61 is trained to minimize the loss resulting from the rendering error.

[0095] Through the above series of learning processes, the DNN 3D representation model 61 is trained so as to minimize, for each camera parameter, the loss given by the rendering error, that is, the difference between the corrected 2D geometry derived from the multi-view camera images, together with the multi-view camera images themselves, and the 2D geometry based on the inferred 3D geometry.

[0096] As a result of the correction, training can be performed using higher-quality 2D geometry derived from the multi-view camera images, together with the multi-view camera images themselves. Therefore, the DNN 3D representation model 61 can infer higher resolution and smoother 3D geometry through training.

[0097] <<4. Third Embodiment>> In the above, the 3D reconstruction unit 52, which is realized by methods such as photogrammetry processing and 3D generation AI modeling, reconstructs the 3D data of the object based on the multiple multi-view camera images of the object and the corresponding camera parameters, and then generates 2D geometry by rendering.

[0098] However, since it is sufficient to generate 3D data for generating the 2D geometry of the object, the 3D data may be generated by something other than the 3D reconstruction unit 52, which consists of a photogrammetry processing unit and a 3D generation AI model. For example, an existing 3D scanner may be used.

[0099] Figure 9 shows an example configuration of an information processing system 31B in which a 3D scanner is provided instead of the 3D reconstruction unit 52 to obtain 3D data of an object. The difference between the information processing system 31B in Figure 9 and the information processing system 31 in Figure 2 is that a 3D scanner 81 is provided instead of the 3D reconstruction unit 52.

[0100] The 3D scanner 81 generates 3D data of an object by scanning it in three dimensions and outputs it to the rendering processing unit 53. In this process, a correspondence is required between the scanning position of the 3D scanner 81 relative to the object and the camera parameters consisting of the imaging position and angle of the camera that images the object. Therefore, prior alignment is necessary.

[0101] <Learning process by the information processing system in Figure 9> Next, the learning process by the information processing system 31B in Figure 9 will be explained with reference to the flowchart in Figure 10.

[0102] In step S101, the 3D scanner 81 generates 3D data by scanning the object in three dimensions.

[0103] In step S102, the camera parameter acquisition unit 51 and the rendering error calculation unit 55 acquire multi-view camera images of the object captured by a multi-view camera. Each multi-view camera image consists of an RGB image that includes camera parameters, namely information on the camera's imaging position and angle (imaging direction).

[0104] In step S103, the camera parameter acquisition unit 51 extracts camera parameters included in the multi-view camera image.

[0105] In step S104, the camera parameter acquisition unit 51 aligns with the 3D scanner 81, adjusts the camera parameters, and supplies them to the rendering processing unit 53 and the DNN 3D representation model learning unit 54. For example, the camera parameter acquisition unit 51 adjusts the camera parameters so that the coordinate system of the 3D scanner 81 and the coordinate system of the camera parameters are unified.
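As an illustration of unifying the coordinate systems, the following minimal sketch assumes that the alignment has already produced a rigid transform (a rotation matrix and a translation vector) from the camera coordinate system to the scanner coordinate system. The function `align_camera` and the numeric values are hypothetical; how the transform itself is estimated is not specified here.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def align_camera(position, direction, rotation, translation):
    """Map an imaging position and imaging direction into the scanner frame."""
    new_position = [p + t for p, t in zip(mat_vec(rotation, position), translation)]
    # Directions are rotated only; translation does not apply to them.
    new_direction = mat_vec(rotation, direction)
    return new_position, new_direction


# Hypothetical alignment result: the scanner frame is rotated 90 degrees
# about the z axis and shifted by 1 along x relative to the camera frame.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]

pos, view_dir = align_camera([2.0, 0.0, 0.5], [1.0, 0.0, 0.0], R, t)
```

After this adjustment, a depth map rendered from the scanner's 3D data and an RGB image captured by the camera refer to the same viewpoint.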

[0106] In step S105, the rendering processing unit 53 generates a depth map and a normal map for each camera parameter by rendering based on the 3D data generated by the 3D scanner 81, and outputs them as 2D geometry to the rendering error calculation unit 55.
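One simple way to obtain a depth map from 3D data can be sketched as follows, under the assumptions that the 3D data is a point cloud already expressed in the camera coordinate system and that the camera is an ideal pinhole camera. The function `render_depth`, the focal length, and the image size are hypothetical; the rendering processing unit 53 is not limited to this approach.

```python
def render_depth(points, width, height, focal):
    """Project camera-space points (x, y, z) into a depth map using a z-buffer.

    Pixels that no point projects onto stay None (no geometry observed).
    """
    depth = [[None] * width for _ in range(height)]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # principal point at image centre
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest surface along each pixel's viewing ray.
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth


# Two points on the same viewing ray, one point to the side, one behind the camera.
cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
depth_map = render_depth(cloud, width=5, height=5, focal=1.0)
```

The z-buffer test is what keeps only the visible surface when several points fall on the same pixel.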

[0107] In step S106, the DNN 3D representation model 61 in the DNN 3D representation model learning unit 54 infers the 3D geometry for each camera parameter of the multi-view camera image and outputs the inference result to the rendering processing unit 62.

[0108] In step S107, the rendering processing unit 62 generates RGB images, depth maps, and normal maps for each camera parameter by rendering based on the 3D geometry which is the inference result of the DNN 3D representation model 61, and outputs these to the rendering error calculation unit 55 as 2D geometry based on the 3D geometry which is the inference result of the DNN 3D representation model 61.
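As a reference for how a normal map relates to a depth map, the following sketch estimates per-pixel normals from depth by central finite differences. This is a common approximation and is an assumption here, not necessarily how the rendering processing unit 62 computes normals; the function `normals_from_depth`, the treatment of depth as a height field in pixel units, and the sign convention are all hypothetical.

```python
import math


def normals_from_depth(depth):
    """Estimate unit normals from a dense depth map by central differences.

    Border pixels keep the default normal (0, 0, 1).
    """
    h, w = len(depth), len(depth[0])
    normals = [[(0.0, 0.0, 1.0)] * w for _ in range(h)]
    for v in range(1, h - 1):
        for u in range(1, w - 1):
            dzdx = (depth[v][u + 1] - depth[v][u - 1]) / 2.0
            dzdy = (depth[v + 1][u] - depth[v - 1][u]) / 2.0
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            normals[v][u] = tuple(c / length for c in n)
    return normals


flat = [[5.0] * 3 for _ in range(3)]                            # constant depth
slope = [[float(u) for u in range(3)] for _ in range(3)]        # depth grows along x
flat_normals = normals_from_depth(flat)
slope_normals = normals_from_depth(slope)
```

Because each normal depends on neighbouring depth values, comparing normal maps penalizes local surface roughness, which is one way the spatial correlation of adjacent regions can enter the loss.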

[0109] In step S108, the rendering error calculation unit 55 calculates a rendering error as a loss for each camera parameter, which is the difference between the two-dimensional geometry based on the 3D data generated by the 3D scanner 81, including multi-view camera images, and the two-dimensional geometry based on the inferred three-dimensional geometry, and supplies this to the DNN three-dimensional representation model learning unit 54.
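The loss calculation can be sketched as follows. The sketch assumes a mean absolute error for each of the RGB image, the depth map, and the normal map, and a weighted sum over the three terms; the error metric, the weights, the dictionary keys, and the function names are all assumptions, since the disclosure only specifies that the difference is used as the loss.

```python
def flatten(values):
    """Yield the scalar entries of arbitrarily nested lists or tuples."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield v


def mean_abs_error(a, b):
    """Mean absolute difference between two maps of identical shape."""
    fa, fb = list(flatten(a)), list(flatten(b))
    if len(fa) != len(fb):
        raise ValueError("maps must have the same number of elements")
    return sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)


def rendering_loss(views, w_rgb=1.0, w_depth=1.0, w_normal=1.0):
    """Average the weighted per-view rendering errors over all camera parameters."""
    total = 0.0
    for view in views:
        total += w_rgb * mean_abs_error(view["captured_rgb"], view["rendered_rgb"])
        total += w_depth * mean_abs_error(view["ref_depth"], view["rendered_depth"])
        total += w_normal * mean_abs_error(view["ref_normal"], view["rendered_normal"])
    return total / len(views)


# One hypothetical view with a single pixel.
views = [{
    "captured_rgb": [[(0.9, 0.0, 0.0)]], "rendered_rgb": [[(0.6, 0.0, 0.0)]],
    "ref_depth": [[2.0]], "rendered_depth": [[1.5]],
    "ref_normal": [[(0.0, 0.0, 1.0)]], "rendered_normal": [[(0.0, 0.0, 1.0)]],
}]
loss = rendering_loss(views)
```

Here the reference depth and normal maps come from the 3D data, while the captured RGB image is compared with the RGB image rendered from the inferred 3D geometry for the same camera parameter.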

[0110] In step S109, the DNN 3D representation model learning unit 54 trains the DNN 3D representation model 61 to minimize the loss resulting from rendering errors.

[0111] After the above series of learning processes are repeated a predetermined number of times, or until the rendering error becomes smaller than a predetermined value, the learning of the DNN 3D representation model 61 is completed.
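The termination condition can be sketched as a loop that stops either after a predetermined number of iterations or once the loss falls below a predetermined value. The class `ToyModel` is a hypothetical stand-in with a single scalar parameter updated by gradient descent on a squared error; it only illustrates the stopping logic and is not the DNN 3D representation model 61.

```python
class ToyModel:
    """Stand-in model: one scalar depth estimate fitted to a target value."""

    def __init__(self, depth=0.0, target=2.0, learning_rate=0.25):
        self.depth = depth
        self.target = target
        self.learning_rate = learning_rate

    def step(self):
        """Apply one gradient-descent update and return the loss after it."""
        error = self.depth - self.target
        self.depth -= self.learning_rate * 2.0 * error  # gradient of error**2
        return (self.depth - self.target) ** 2


def train(step_fn, max_iterations, loss_threshold):
    """Repeat step_fn until the iteration budget is used up or the loss is small."""
    loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        loss = step_fn()
        if loss < loss_threshold:
            return iteration, loss
    return max_iterations, loss


iterations, final_loss = train(ToyModel().step, max_iterations=100,
                               loss_threshold=1e-3)
```

In practice both limits are useful together: the threshold ends training early when the rendering error is already small, and the iteration count bounds the cost when it is not.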

[0112] Through the above processing, it becomes possible to train the DNN 3D representation model 61 considering 3D consistency and spatial correlation of adjacent regions at the resolution level of RGB images based on multi-view camera images. As a result, it becomes possible to obtain high-resolution and smooth 3D geometry of objects according to the camera parameters.

[0113] Furthermore, by training the DNN 3D representation model 61 using the 2D geometry obtained from the 3D data generated by the 3D scanner 81, the sophisticated 3D geometry generation process performed by the 3D scanner 81 can be appropriately integrated (fused) into the DNN 3D representation model 61. This makes it possible to obtain high-resolution and smooth 3D geometry for a wider variety of objects (subjects).

[0114] <<5. Fourth Embodiment>> In the above, we have described an example in which 3D data of an object is generated by using a 3D scanner 81 instead of a 3D reconstruction unit 52. Furthermore, the 2D geometry based on the 3D data generated by the 3D scanner 81 may be corrected using the correction processing unit 71 in the information processing system 31A in Figure 6.

[0115] Figure 11 shows an example configuration of an information processing system 31C in which 3D data of an object is generated using a 3D scanner 81, and then the 2D geometry based on the 3D data generated by the 3D scanner 81 is corrected using the correction processing unit 71 shown in Figure 6.

[0116] In other words, the difference between the information processing system 31C in Figure 11 and the information processing system 31B in Figure 9 is that the correction processing unit 71 in the information processing system 31A in Figure 6 is located between the rendering processing unit 53 and the rendering error calculation unit 55.

[0117] This allows for correction of the rendering results based on the 3D data of the object generated by the 3D scanner 81, thereby enabling the training of a more accurate DNN 3D representation model 61.

[0118] <Learning process by the information processing system in Figure 11> Next, the learning process by the information processing system 31C in Figure 11 will be explained with reference to the flowchart in Figure 12. Note that steps S121 to S130 in the flowchart in Figure 12 are the same as the processing described with reference to the flowchart in Figure 10, except for step S126, so their detailed explanation will be omitted.

[0119] In other words, in the processing of steps S121 to S125, a depth map and a normal map are generated for each camera parameter through rendering processing based on the camera parameters of the multi-view camera image and the 3D data generated by the 3D scanner 81, and output as a 2D geometry.

[0120] In step S126, the correction processing unit 71 corrects the two-dimensional geometry consisting of a depth map and a normal map, which are generated for each camera parameter, and outputs it to the rendering error calculation unit 55.

[0121] Furthermore, in steps S127 to S130, a 3D geometry is inferred for each camera parameter of the multi-view camera image, and a 2D geometry based on the 3D geometry, which is the inference result of the DNN 3D representation model 61, is output to the rendering error calculation unit 55.

[0122] Then, the rendering error calculation unit 55 calculates, for each camera parameter, a rendering error as a loss, namely the difference between the corrected 2D geometry based on the 3D data generated by the 3D scanner 81, together with the multi-view camera images, and the 2D geometry based on the inferred 3D geometry. This loss is supplied to the DNN 3D representation model learning unit 54, and the DNN 3D representation model 61 is trained to minimize the loss resulting from the rendering error.

[0123] Through the above series of learning processes, the DNN 3D representation model 61 is trained so as to minimize, for each camera parameter, the rendering error, that is, the difference between the corrected 2D geometry based on the 3D data generated by the 3D scanner 81, together with the multi-view camera images, and the 2D geometry based on the inferred 3D geometry.

[0124] As a result of the correction, training can be performed using higher-quality 2D geometry based on the 3D data generated by the 3D scanner 81, together with the multi-view camera images. Therefore, the DNN 3D representation model 61 can infer higher resolution and smoother 3D geometry through training.

[0125] <<6. Example of execution by software>> Incidentally, the series of processes described above can be executed by hardware, but they can also be executed by software. When the series of processes is executed by software, the programs that make up the software are installed from a storage medium onto a computer built into dedicated hardware, or onto a general-purpose computer that can perform various functions by installing various programs.

[0126] Figure 13 shows an example of the configuration of a general-purpose computer. This computer has a built-in processing circuit 1001. An input / output interface 1005 is connected to the processing circuit 1001 via a bus 1004. A ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are connected to the bus 1004.

[0127] The input / output interface 1005 is connected to an input unit 1006 consisting of input devices such as a keyboard and mouse for the user to input operation commands, an output unit 1007 that outputs images of the processing operation screen and processing results to a display device, a storage unit 1008 consisting of a hard disk drive for storing programs and various data, and a communication unit 1009 consisting of a LAN (Local Area Network) adapter for performing communication processing via a network such as the Internet. In addition, a drive 1010 is connected, which reads and writes data from and to removable storage media 1011 such as magnetic disks (including flexible disks), optical disks (including CD-ROMs (Compact Disc-Read Only Memory) and DVDs (Digital Versatile Discs)), magneto-optical disks (including MDs (Mini Discs)), or semiconductor memory.

[0128] The processing circuit 1001 executes various processes according to a program stored in the ROM 1002, or according to a program that is read from a removable storage medium 1011 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003. The RAM 1003 also appropriately stores data necessary for the processing circuit 1001 to execute various processes.

[0129] In a computer configured as described above, the processing circuit 1001 loads, for example, a program stored in the storage unit 1008 into the RAM 1003 via the input / output interface 1005 and the bus 1004, and executes it, thereby performing the series of processes described above.

[0130] The program executed by the computer (processing circuit 1001) can be provided by recording it on a removable storage medium 1011, such as a packaged media. The program can also be provided via wired or wireless transmission media, such as a local area network, the internet, or digital satellite broadcasting.

[0131] In a computer, a program can be installed in the storage unit 1008 via the input / output interface 1005 by inserting the removable storage medium 1011 into the drive 1010. Alternatively, a program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. Furthermore, programs can be pre-installed in the ROM 1002 or the storage unit 1008.

[0132] The programs executed by the computer may be programs that are processed chronologically in the order described herein, or they may be programs that are processed in parallel or at necessary times, such as when a call is made.

[0133] Furthermore, when the computer in Figure 13 functions as any of the information processing systems 31 and 31A to 31C according to the embodiments of this disclosure, the computer's processing circuit 1001 functions as a camera parameter acquisition unit 51, a 3D reconstruction unit 52, a rendering processing unit 53, a DNN 3D representation model learning unit 54, a rendering error calculation unit 55, and a correction processing unit 71 by executing a program loaded onto the RAM 1003. In addition, the storage unit 1008 and the removable storage medium 1011 that constitute the secondary storage device store the information processing program according to this disclosure and various data. The processing circuit 1001 reads and executes the program data from the storage unit 1008 and the removable storage medium 1011, but as another example, these programs may be obtained from other devices via the communication unit 1009. In other words, the secondary storage device is not limited to being inside the computer in Figure 13, such as the storage unit 1008 and the removable storage medium 1011, but may also be located outside the computer in Figure 13. The processing circuit 1001 is an example of an integrated circuit, and a CPU (Central Processing Unit), MPU (Micro Processing Unit), GPU (Graphics Processing Unit), APU (Accelerated Processing Unit), ASIC (Application Specific Integrated Circuit), and FPGA (Field Programmable Gate Array) can all be considered integrated circuits.

[0134] Furthermore, in this specification, a system means a collection of multiple components (devices, modules (parts), etc.), regardless of whether all components are located in the same enclosure or not. Therefore, multiple devices housed in separate enclosures and connected via a network, and a single device in which multiple modules are housed in one enclosure, are both considered systems.

[0135] Furthermore, the embodiments of this disclosure are not limited to those described above, and various modifications are possible without departing from the gist of this disclosure.

[0136] For example, this disclosure can take the form of cloud computing, in which a single function is shared and processed collaboratively by multiple devices via a network.

[0137] Furthermore, each step described in the flowchart above can be performed by a single device, or it can be divided and performed by multiple devices.

[0138] Furthermore, if a single step includes multiple processes, those processes can be executed by a single device or shared among multiple devices.

[0139] Furthermore, among the processes described in the embodiments of this disclosure described above, all or part of the processes described as being performed automatically may be performed manually, or all or part of the processes described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above document and drawings may be changed at will unless otherwise specified. For example, the various information shown in each figure is not limited to the information shown.

[0140] Furthermore, each component of the illustrated device is a functional concept and does not necessarily have to be physically configured as shown. In other words, the specific forms of distribution and integration of each device are not limited to those shown, and all or part of them can be functionally or physically distributed and integrated in any unit according to various loads and usage conditions.

[0141] Furthermore, the embodiments of this disclosure described above can be combined as appropriate in areas that do not contradict the processing content. Also, the steps shown in the sequence diagram or flowchart of this embodiment can be changed in order as appropriate. For example, each step may be processed chronologically, repeatedly, or partially in parallel.

[0142] Furthermore, this disclosure may also take the following configurations:
<1> An information processing system including a circuit that acquires a plurality of first RGB images relating to an object, extracts a plurality of imaging parameters corresponding to the plurality of first RGB images, generates a first two-dimensional geometry for each of the imaging parameters based on the object, generates a second two-dimensional geometry of the object based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input, and trains the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.
<2> The information processing system according to <1>, wherein the first two-dimensional geometry includes a first depth map and a first normal map of the object for each of the plurality of imaging parameters, and the second two-dimensional geometry includes a second RGB image, a second depth map, and a second normal map for each of the plurality of imaging parameters based on the three-dimensional geometry inferred for each of the plurality of imaging parameters by the three-dimensional representation model.
<3> The information processing system according to <1>, wherein the plurality of imaging parameters include information regarding the imaging position and imaging direction for imaging the object.
<4> The information processing system according to <1>, wherein 3D data of the object is generated, and the first 2D geometry is generated based on the 3D data of the object.
<5> The information processing system according to <4>, wherein the 3D data of the object is generated by a generation method having different characteristics from the 3D representation model.
<6> The information processing system according to <5>, wherein the 3D data is generated by photogrammetry processing or a 3D generation AI model based on the plurality of first RGB images and the plurality of imaging parameters.
<7> The information processing system according to <5>, wherein the 3D data is generated by a 3D scan using a 3D scanner.
<8> The information processing system according to <7>, wherein the plurality of imaging parameters are adjusted by alignment with the 3D scanner.
<9> The information processing system according to <1>, wherein the first two-dimensional geometry is corrected by a correction process using the plurality of first RGB images as guide information, and the three-dimensional representation model is trained to minimize the error between the first RGB images and the first two-dimensional geometry corrected by the correction process, and the second two-dimensional geometry, for which the imaging parameters correspond to each other.
<10> The information processing system according to <9>, wherein the correction process is realized by a generative AI model that acquires a latent space (prior knowledge) relating to the object by learning based on a large-scale dataset.
<11> The information processing system according to <1>, wherein the three-dimensional representation model is a DNN (Deep Neural Network) three-dimensional representation model.
<12> The information processing system according to <11>, wherein the DNN three-dimensional representation model is NeRF (Neural Radiance Fields) or 3DGS (3D Gaussian Splatting).
<13> An information processing method comprising: acquiring a plurality of first RGB images relating to an object and performing an extraction process to extract a plurality of imaging parameters corresponding to the plurality of first RGB images; performing a first two-dimensional geometry generation process to generate a first two-dimensional geometry of the object for each of the imaging parameters based on the object; performing a second two-dimensional geometry generation process to generate a second two-dimensional geometry of the object based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input; and performing a learning process to train the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.
<14> The information processing method according to <13>, wherein the first two-dimensional geometry includes a first depth map and a first normal map for each of the plurality of imaging parameters, and the second two-dimensional geometry includes a second RGB image, a second depth map, and a second normal map for each of the plurality of imaging parameters, based on the three-dimensional geometry inferred for each of the plurality of imaging parameters by the three-dimensional representation model.
<15> The information processing method according to <13>, wherein the plurality of imaging parameters include information regarding the imaging position and imaging direction for imaging the object.
<16> The information processing method according to <13>, wherein 3D data of the object is generated, and the first 2D geometry of the object is generated based on the 3D data.
<17> The information processing method according to <16>, wherein the 3D data is generated by a generation method having different characteristics from the 3D representation model.
<18> The information processing method according to <17>, wherein the 3D data is generated by photogrammetry processing or a 3D generation AI model based on the plurality of first RGB images and the plurality of imaging parameters.
<19> The information processing method according to <17>, wherein the 3D data is generated by a 3D scan using a 3D scanner.
<20> A program that causes a computer to acquire a plurality of first RGB images relating to an object, extract a plurality of imaging parameters corresponding to the plurality of first RGB images, generate a first two-dimensional geometry of the object for each of the imaging parameters based on the object, generate a second two-dimensional geometry based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input, and train the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

[0143] 31, 31A-31C Information processing system, 51 Camera parameter acquisition unit, 52 3D reconstruction unit, 53 Rendering processing unit, 54 DNN 3D representation model learning unit, 55 Rendering error calculation unit, 61 DNN 3D representation model, 62 Rendering processing unit, 71 Correction processing unit, 81 3D scanner

Claims

1. An information processing system including a circuit that acquires a plurality of first RGB images relating to an object, extracts a plurality of imaging parameters corresponding to the plurality of first RGB images, generates a first two-dimensional geometry for each of the imaging parameters based on the object, generates a second two-dimensional geometry of the object based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input, and trains the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

2. The information processing system according to claim 1, wherein the first two-dimensional geometry includes a first depth map and a first normal map of the object for each of the plurality of imaging parameters, and the second two-dimensional geometry includes a second RGB image, a second depth map, and a second normal map for each of the plurality of imaging parameters, based on the three-dimensional geometry inferred for each of the plurality of imaging parameters by the three-dimensional representation model.

3. The information processing system according to claim 1, wherein the plurality of imaging parameters include information regarding the imaging position and imaging direction for imaging the object.

4. The information processing system according to claim 1, wherein 3D data of the object is generated, and the first 2D geometry is generated based on the 3D data of the object.

5. The information processing system according to claim 4, wherein the 3D data of the object is generated by a generation method having different characteristics from the three-dimensional representation model.

6. The information processing system according to claim 5, wherein the 3D data is generated by photogrammetry processing or a 3D generation AI model based on the plurality of first RGB images and the plurality of imaging parameters.

7. The information processing system according to claim 5, wherein the 3D data is generated by a three-dimensional scan using a 3D scanner.

8. The information processing system according to claim 7, wherein the plurality of imaging parameters are adjusted by alignment with the 3D scanner.

9. The information processing system according to claim 1, wherein the first two-dimensional geometry is corrected by a correction process using the plurality of first RGB images as guide information, and the three-dimensional representation model is trained to minimize the error between the first RGB images and the first two-dimensional geometry corrected by the correction process, and the second two-dimensional geometry, for which the imaging parameters correspond to each other.

10. The information processing system according to claim 9, wherein the correction process is realized by a generative AI model that acquires a latent space (prior knowledge) relating to the object through learning based on a large-scale dataset.

11. The information processing system according to claim 1, wherein the three-dimensional representation model is a DNN (Deep Neural Network) three-dimensional representation model.

12. The information processing system according to claim 11, wherein the DNN three-dimensional representation model is NeRF (Neural Radiance Fields) or 3DGS (3D Gaussian Splatting).

13. An information processing method comprising: acquiring a plurality of first RGB images relating to an object and performing an extraction process to extract a plurality of imaging parameters corresponding to the plurality of first RGB images; performing a first two-dimensional geometry generation process to generate a first two-dimensional geometry of the object for each of the imaging parameters based on the object; performing a second two-dimensional geometry generation process to generate a second two-dimensional geometry of the object based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input; and performing a learning process to train the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.

14. The information processing method according to claim 13, wherein the first two-dimensional geometry includes a first depth map and a first normal map for each of the plurality of imaging parameters, and the second two-dimensional geometry includes a second RGB image, a second depth map, and a second normal map for each of the plurality of imaging parameters, based on a three-dimensional geometry inferred for each of the plurality of imaging parameters by the three-dimensional representation model.

15. The information processing method according to claim 13, wherein the plurality of imaging parameters include information regarding the imaging position and imaging direction for imaging the object.

16. The information processing method according to claim 13, wherein 3D data of the object is generated, and the first 2D geometry of the object is generated based on the 3D data.

17. The information processing method according to claim 16, wherein the 3D data is generated by a generation method having different characteristics from the three-dimensional representation model.

18. The information processing method according to claim 17, wherein the 3D data is generated by photogrammetry processing or a 3D generation AI model based on the plurality of first RGB images and the plurality of imaging parameters.

19. The information processing method according to claim 17, wherein the 3D data is generated by a three-dimensional scan using a 3D scanner.

20. A program that acquires a plurality of first RGB images relating to an object, extracts a plurality of imaging parameters corresponding to the plurality of first RGB images, generates a first two-dimensional geometry of the object for each of the imaging parameters based on the object, generates a second two-dimensional geometry based on the inference result of a three-dimensional representation model that takes the plurality of imaging parameters as input, and causes the computer to function in such a way that it trains the three-dimensional representation model to minimize the error between the first RGB images and the first two-dimensional geometry and the second two-dimensional geometry for which the imaging parameters correspond to each other.