Information processing apparatus, information processing method, and program

JP2024173036A5 · Pending · Publication Date: 2026-06-26 · CANON KK

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: CANON KK
Filing Date: 2023-06-01
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Simply applying deep-learning 3D reconstruction methods to virtual viewpoint image generation, which is conventionally implemented with geometric methods such as visual hull, makes the computational cost enormous.

Method used

An information processing device uses a deep neural network (DNN) to estimate three-dimensional shapes from two-dimensional images captured synchronously by multiple cameras, and uses a discriminant function to select the optimal camera for reconstruction, reducing calculation cost while maintaining accuracy.

Benefits of technology

Achieves highly accurate three-dimensional reconstruction with reduced computational costs by optimizing camera selection and network architecture, enabling efficient processing of large-scale events like sports matches.

✦ Generated by Eureka AI based on patent content.

Abstract

PROBLEM TO BE SOLVED: To achieve accurate three-dimensional reconstruction while reducing calculation cost.

SOLUTION: An information processing apparatus photographs a subject using one or more cameras and reconstructs the subject as a three-dimensional shape from the photographed images. The apparatus has: determination means that determines the one or more cameras to be used for the three-dimensional reconstruction with reference to the state of the subject; and generation means that generates three-dimensional shape data by using an estimator that estimates a three-dimensional shape with the images photographed by the determined cameras as input.

SELECTED DRAWING: Figure 6

Description

[Technical field]

[0001] The present invention relates to an information processing device, an information processing method, and a program for three-dimensional reconstruction based on captured images.

[Background technology]

[0002] During television broadcasts of games such as baseball or rugby, a technology that instantly reconstructs the players' movements in 3D to provide viewers with footage seen from perspectives that cannot normally be captured is attracting attention (hereinafter referred to as virtual viewpoint image generation technology).

[0003] In addition, there are technologies that use deep learning to perform 3D reconstruction. It is expected that the use of deep learning will enable the acquisition of 3D reconstruction results at a level that was previously unachievable.

[0004] Non-Patent Document 1 discloses that SMPL is used to detect a human area in an input 2D image, and then the human pose and shape parameters are estimated by deep learning to reconstruct the target person in 3D.

[0005] Non-Patent Document 2 discloses a virtual viewpoint image generation technology called NeRF. It discloses that a deep network is used to represent a scene, 3D reconstruction is performed from a sparse set of input images from multiple cameras, and the scene is then rendered into a 2D image seen from a virtual camera.

[Prior art documents]

[Non-patent literature]

[0006] [Non-Patent Document 1] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM TOG, vol. 34, no. 6, p. 248, 2015.
[Non-Patent Document 2] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," in European Conference on Computer Vision, pages 405-421, Springer, 2020.

Summary of the Invention

[Problem to be solved by the invention]

[0007] However, if the methods described in Non-Patent Documents 1 and 2 are simply applied to a virtual viewpoint video generation technique implemented by a conventional geometric method such as visual hull, the computational cost becomes enormous.

[0008] Therefore, an object of the present invention is to realize highly accurate three-dimensional reconstruction while suppressing calculation costs.

[Means for solving the problem]

[0009] The information processing device of the present invention is an information processing device that photographs a subject with one or more cameras and reconstructs the subject in 3D from the photographed images. It has a determination means that determines one or more cameras to be used for 3D reconstruction based on the state of the subject, and a generation means that generates 3D shape data using an estimator that takes an image photographed by the determined camera as input and estimates the 3D shape.

Effect of the Invention

[0010] According to the present invention, it is possible to realize highly accurate three-dimensional reconstruction while suppressing calculation costs.

[Brief description of the drawings]

[0011]
[Figure 1] A diagram illustrating an example of a system configuration according to the first embodiment.
[Figure 2] A diagram illustrating an example of a camera arrangement according to the first embodiment.
[Figure 3] An example of a schematic diagram of a DNN according to the first embodiment.
[Figure 4] A diagram illustrating an example of a virtual camera position according to the first embodiment.
[Figure 5] A diagram showing an example of a human model according to the first embodiment.
[Figure 6] An example of a flowchart of a learning method and three-dimensional reconstruction according to the first embodiment.
[Figure 7] An example of a flowchart of three-dimensional reconstruction according to the fifth embodiment.
[Figure 8] An example of a flowchart of three-dimensional reconstruction according to the sixth embodiment.
[Figure 9] A diagram showing an example of a rendering result according to the sixth embodiment.
[Figure 10] A diagram showing an example of a person estimation result according to the seventh embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0012] Hereinafter, embodiments of the present invention will be described with reference to the drawings. The following embodiments do not limit the present invention, and not all of the combinations of features described in the present embodiments are necessarily essential to the solution of the present invention. The configurations of the embodiments may be appropriately modified or changed depending on the specifications of the device to which the present invention is applied and various conditions (conditions of use, environment of use, etc.). In the following embodiments, the same or similar configurations are given the same reference symbols, and duplicated descriptions are omitted.

[0013] <First embodiment> In this embodiment, the surroundings of a sports stadium are photographed with N cameras, and the state of the players playing the game, which is the subject, is reconstructed in 3D using the camera settings of virtual viewpoint image generation technology. The 3D reconstruction is performed using a DNN (Deep Neural Network) that estimates a 3D shape from a single 2D image, which will be described below. Here, the DNN is an example of an estimator.

[0014] An example of the camera arrangement used in this embodiment is shown in FIG. 2. FIG. 2 is a diagram showing a plurality of cameras 21 arranged around a stadium 20. The plurality of cameras 21 synchronously capture the competition field. By arranging and capturing images in this manner, it is possible to perform a three-dimensional reconstruction of the play by athletes within the competition field. An example of a system configuration including the plurality of cameras 21 arranged in this case is shown in FIG. 1(A). As shown in FIG. 1(A), the system according to this embodiment has an information processing device 10, a capture group 11, a capture group 12, ..., a capture group 13, and a clock generator 14.

[0015] The information processing device 10 comprises a CPU 101, a RAM 102, a ROM 103, a large-capacity storage device 104, an operation unit 105, and a display unit 106.

[0016] The CPU 101 executes various processes using computer programs and data stored in the RAM 102 and the ROM 103. In this way, the CPU 101 controls the operation of the entire information processing device 10 and executes or controls the various processes described as being performed by the information processing device 10.

[0017] The RAM 102 has an area for storing computer programs and data loaded from the ROM 103 or the mass storage device 104, and an area for storing data received from the multiple capture groups. The RAM 102 also has a work area used when the CPU 101 executes various processes. In this way, the RAM 102 can provide various areas as needed.

[0018] The ROM 103 stores setting data for the information processing device 10, computer programs and data relating to startup, computer programs and data relating to basic operations, and the like.

[0019] The mass storage device 104 is a hard disk drive device, etc. The mass storage device 104 stores an OS (operating system) and computer programs and data for causing the CPU 101 to execute or control various processes described as being performed by the information processing device 10. The data stored in the mass storage device 104 also includes data related to a DNN model for performing 3D reconstruction.

[0020] Computer programs and data stored in the mass storage device 104 are loaded into the RAM 102 as appropriate under the control of the CPU 101 and are processed by the CPU 101.

[0021] The operation unit 105 is a user interface such as a keyboard, a mouse, a touch panel, etc., and the user can input various instructions to the CPU 101 by operating it.

[0022] The display unit 106 has a screen such as a liquid crystal screen or a touch panel screen, and can display the results of processing by the CPU 101 as images, characters, etc. The display unit 106 may be a projection device such as a projector that projects images and characters.

[0023] The CPU 101, RAM 102, ROM 103, mass storage device 104, operation unit 105, and display unit 106 are all connected to a system bus 107. Note that the configuration of the information processing device 10 is not limited to the configuration shown in FIG.

[0024] The information processing device 10 is a computer device such as a PC (personal computer), a smartphone, or a tablet terminal device that includes a set of input / output devices.

[0025] Next, the capture groups 11, 12, and 13 will be described. In this embodiment, as many capture groups are prepared as there are cameras 21 in Fig. 2. Note that Figs. 1(A) and (B) show only three capture groups as an example. A single capture group may also include multiple cameras.

[0026] The capture group 11 has a control box 111 and a camera 112. The capture group 12 has a control box 121 and a camera 122. The capture group 13 has a control box 131 and a camera 132. The control boxes 111, 121, and 131 receive the time issued by the clock generator 14 and assign the time to the images captured by the cameras 112, 122, and 132. By referring to the assigned time, the information processing device 10 can perform 3D reconstruction while maintaining synchronization after receiving each image.
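The time-stamping scheme in [0026] can be illustrated with a minimal sketch. It assumes each frame arrives as a record of camera ID, assigned time, and image; the `Frame` type and the `group_by_time` helper are hypothetical names introduced here, not part of the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    camera_id: int   # ID of the capturing camera (hypothetical field)
    timestamp: int   # time assigned by the control box from the clock generator
    image: str       # stand-in for pixel data


def group_by_time(frames, num_cameras):
    """Return {timestamp: {camera_id: Frame}} for timestamps seen by every camera."""
    buckets = defaultdict(dict)
    for f in frames:
        buckets[f.timestamp][f.camera_id] = f
    # Keep only complete, synchronized sets of frames.
    return {t: cams for t, cams in buckets.items() if len(cams) == num_cameras}


frames = [
    Frame(1, 100, "a"), Frame(2, 100, "b"), Frame(3, 100, "c"),
    Frame(1, 101, "d"), Frame(3, 101, "e"),  # camera 2 is missing at t=101
]
synced = group_by_time(frames, num_cameras=3)
print(sorted(synced))  # [100]
```

Dropping incomplete timestamps is only one possible policy; a single-image reconstruction as in this embodiment could equally proceed with whichever cameras did deliver a frame.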

[0027] As a modification of Fig. 1(A), the configuration shown in Fig. 1(B) may be used. The difference from Fig. 1(A) is that the capture groups 11, 12, ..., 13 have node PCs 110, 120, ..., 130. In particular, when 3D reconstruction is performed from one input image as in this embodiment, 3D reconstruction is performed for each camera within its capture group, and the information processing device 10 determines the virtual camera position for rendering. The information processing device 10 may then select the optimal capture group and receive its output as the 3D reconstruction result to be adopted at that time.

[0028] Next, an example of the configuration of a DNN that performs 3D reconstruction from one input image used in this embodiment is shown in Figure 3(A). In the example shown in Figure 3(A), the input 30 is one 2D image captured by a camera, and the output 32 represents a 3D reconstructed object (3D shape data). The input 30 may be the captured image itself, or it may be an image in which only the human region is cut out and the background is removed by performing semantic region segmentation on the input image using a method such as Mask-RCNN.

[0029] The DNN 31 used for this conversion between the two is represented as an Hourglass network, which is a pair of a downsampling network and an upsampling network. The Hourglass network is an example of an architecture for converting important structures from an input image into a low-level representation and then converting the representation into a high-level representation for 3D reconstruction. Other architectures may be used. The DNN 31 receives the two-dimensional image of the input 30, estimates volume data of a three-dimensional object, and predicts the inside / outside judgment of the volume data as a continuous random field to obtain a mesh representation of the estimation target.

[0030] When a target shape is given as ground truth, the coordinates of the object are learned by defining the inside of the target object as 1 and the outside as 0, and determining a continuous 3D occupancy field, as shown in Equation 1. In this case, the purpose is achieved by setting the output surface of the target object as a surface with the above inside / outside determination value of 0.5.

[0031]

[Equation 1]

[0032] Here, $X_w$ is a three-dimensional vector representing world coordinates, and $f_s^*$ is a function that returns 0 or 1 based on the ground truth (whereas the function written $f_s$ estimates continuous values).
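The image of Equation 1 did not survive extraction. Based only on the surrounding text (inside defined as 1, outside as 0, with the output surface at the value 0.5), one plausible reading is sketched below; the notation in the original filing may differ.

```latex
f_s^{*}(X_w) =
\begin{cases}
1, & \text{if } X_w \text{ is inside the target object} \\
0, & \text{otherwise}
\end{cases}
```

Under this reading, the estimated surface is the set of points where the continuous estimate $f_s$ takes the value 0.5.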

[0033] The loss used in learning, $L_s$, is calculated using the formula shown in Equation 2. In Equation 2, each input 2D image is associated with its corresponding correct 3D mesh data, and the score is the sum of the squared errors between the inside / outside determination value of the 3D volume data estimated from the input image and the inside / outside determination value of the correct 3D mesh data. This correct 3D mesh data is an example of correct data.

[0034]

[Equation 2]

[0035] In Equation 2, P is the number of points whose coordinates are used in the inside / outside determination; it may be all the voxels occupying the space to be estimated or a number sampled from them. i represents the ID of the observation point. $x_{ci}$ represents the coordinates on the two-dimensional image of the input 30 captured by the camera with camera ID c, and corresponds to the three-dimensional point $X_{wi}$. $g(I(x_{ci}))$ represents a feature amount calculated from the two-dimensional image of the input 30 captured by the camera with camera ID c. In this embodiment, the camera ID c is a unique number (a number greater than or equal to 1 and no greater than the number of cameras) assigned to each camera.

[0036] Also, $h_c(X_{wi})$ represents the transformation of the 3D point $X_{wi}$ in world coordinates into the camera coordinate system defined by the camera with camera ID c that captured the 2D image of the input 30. The squared error is therefore calculated as a value that depends on the capturing camera c.
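The image of Equation 2 is likewise missing from the extracted text. Using only the symbols defined in [0032] to [0036], a form consistent with the description "sum of the squared error" over P points would be the following. The argument order of $f_s$ and the absence of a normalizing factor are assumptions.

```latex
L_s(c) = \sum_{i=1}^{P}
\left| f_s\!\left( h_c(X_{wi}),\; g\!\left(I(x_{ci})\right) \right) - f_s^{*}(X_{wi}) \right|^{2}
```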

[0037] As described above, a DNN is trained to estimate a 3D object from a 2D image input by training to minimize the loss calculated by Equation 2.
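As a rough illustration of the loss described above, the sketch below evaluates a sum of squared occupancy errors over sampled points. It is a toy under stated assumptions: a sphere stands in for the target object, and a sigmoid of signed distance stands in for the DNN estimate; `gt_occupancy`, `est_occupancy`, and `occupancy_loss` are hypothetical names.

```python
import math
import random


def gt_occupancy(p, radius=1.0):
    """Ground-truth value: 1 inside a sphere (stand-in target object), 0 outside."""
    return 1.0 if math.dist(p, (0.0, 0.0, 0.0)) < radius else 0.0


def est_occupancy(p, radius, sharpness=10.0):
    """Toy continuous estimate: a sigmoid of the signed distance to a sphere surface."""
    d = radius - math.dist(p, (0.0, 0.0, 0.0))  # positive inside, negative outside
    return 1.0 / (1.0 + math.exp(-sharpness * d))


def occupancy_loss(points, est_radius):
    """Sum of squared errors between estimated and ground-truth occupancy."""
    return sum((est_occupancy(p, est_radius) - gt_occupancy(p)) ** 2 for p in points)


random.seed(0)
points = [tuple(random.uniform(-1.5, 1.5) for _ in range(3)) for _ in range(500)]
good = occupancy_loss(points, est_radius=1.0)  # estimate matches the true radius
bad = occupancy_loss(points, est_radius=0.5)   # estimate is too small
print(good < bad)
```

Note that the toy estimate equals 0.5 exactly on its sphere surface, mirroring the 0.5 level used for the output surface in [0030].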

[0038] A single trained DNN model is held for the cameras shown in FIG. 2. Even when the same DNN model is shared by all cameras, the 3D reconstruction results are unlikely to be exactly the same across cameras, even for images taken at exactly the same moment. In particular, when the estimation target is a person, the shape, posture, and orientation change from moment to moment, so it is necessary to determine the optimal input according to the state of the estimation target at each time. The following explains how to select the optimal input image at any time to improve the accuracy of 3D reconstruction by the trained DNN.

[0039] First, for simplicity, a case where the estimation target is a person and there is only one person will be described as an example. In this case, it is assumed that the person to be 3D reconstructed is located in the stadium described in FIG. 2. It is assumed that the person is present at the position shown by 44 in FIG. 4(A). The coordinates of the person in the stadium are defined by the x and y coordinates of the x axis shown by 45 and the y axis shown by 46. Then, the moving path of the virtual camera (VC) when performing video rendering after the 3D reconstruction is determined as 47. Then, it is assumed that the virtual camera (VC) moves in the order of 41, 42, and 43 on the path in chronological order, and outputs an image with a field of view captured while facing the center direction of the target person 44.

[0040] In FIG. 4(A), for the sake of simplicity, the position of the person is indicated by only 44, but the person is assumed to be moving freely within the stadium.

[0041] Now, since obviously there are no real cameras at the positions of virtual cameras (VC) 41, 42, and 43, the optimal camera for generating the image that should be visible from each virtual camera (VC) position is selected from the cameras installed around the stadium.

[0042] In order to select the optimal camera for real-time 3D reconstruction, it is necessary to learn a discriminant for selection in advance. The steps involved in learning are explained using the flowchart in Figure 6(A).

[0043] In a stadium, the space to be reconstructed is limited. In the example of Fig. 4(A), it is only the inside of the stadium. The number of cameras used for shooting is also limited to the number of cameras installed around the stadium. The number of cameras installed is set to M (M is an integer equal to or greater than 2).

[0044] Therefore, in S601 in FIG. 6(A), a stadium and cameras matching the real environment are reproduced in 3D CG rendering software. At this time, the internal parameters, external parameters, and distortion coefficients of each camera are matched to those of the real environment as closely as possible, so that the images captured in the real environment are reproduced in the virtual environment. Next, in S602, a 3D human model is placed in the CG-reproduced stadium while various parameters are varied. The parameters here are the person's orientation ρ, posture θ, shape β, and coordinates (x, y). The person's orientation ρ is the angle of the person's front-facing direction, where the front-facing direction is defined as the direction the person faces when the 3D human model is converted into a standard pose, often called the canonical T-pose. For example, the front-facing direction of the human model in the state of FIG. 5(A) can be defined as the front-facing direction 50 when this human model is reverse-transformed into the canonical T-pose in FIG. 5(B). The angle between this front-facing direction and the x-axis direction defined by 45 in Fig. 4(A) is the orientation of the person, and its range is 0≦ρ<2π. The posture θ and the shape β may be any parameters that define the posture and shape of the person. In the optimal embodiment, a template mesh is stored as a base for the shape of the person. The template mesh is deformed by β to fit various body shapes and is further deformed by θ to fit various postures.

[0045] Figure 5(C) shows an example of the results of deforming a simple template mesh using β and θ for the input in Figure 5(A). When deforming a template mesh, it is sufficient to roughly approximate the shape. β defined here is a parameter that makes each part of the human shape thicker or thinner, longer or shorter, so a parameter of about 10 dimensions is assumed. In addition, θ, which defines the posture, depends on the number of joints and degrees of freedom of the assumed 3D human model, but is assumed to be less than 100 dimensions.

[0046] Above, ρ, θ, and β were described as parameters for expressing a three-dimensional human model by deformation of a template mesh. These parameters are for determining the optimal camera for three-dimensional reconstruction as a result of various deformations of the template mesh, and therefore are low-level expressions and simpler than the level of detail of the three-dimensional object that is the final desired three-dimensional reconstruction result. Here, ρ, θ, and β are examples of low-level parameters. An important parameter other than ρ, θ, and β is the coordinate (x, y). This is a two-dimensional parameter of (x, y) represented by two-dimensional coordinates defined by the x-axis 45 and y-axis 46 in FIG. 4(A) already described.

[0047] Therefore, a person photographed by a camera installed in the stadium can be defined by the person's orientation ρ, posture θ, shape β, and coordinates (x, y) described above.

[0048] Next, in S603, the 3D human model parameters (ρ, θ, β, x, y) described above are changed in appropriate steps, and 2D images of the 3D human model captured by all cameras are rendered. Then, each image is assigned the ID (1≦c≦M) of the capturing camera and the 3D human model parameters described above and saved.
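The sweep in S603 can be sketched as a Cartesian product over camera IDs and stepped parameters. This is only an illustration: posture θ and shape β are omitted for brevity, the step sizes are arbitrary, and `sweep_parameters` is a hypothetical helper that yields the metadata that would be saved with each rendered image.

```python
import itertools
import math


def sweep_parameters(num_cameras, rho_steps, xs, ys):
    """Yield one record per (camera, orientation, position) combination."""
    # Orientation steps cover 0 <= rho < 2*pi.
    rhos = [2 * math.pi * k / rho_steps for k in range(rho_steps)]
    camera_ids = range(1, num_cameras + 1)  # 1 <= c <= M
    for c, rho, x, y in itertools.product(camera_ids, rhos, xs, ys):
        yield {"camera_id": c, "rho": rho, "x": x, "y": y}


records = list(sweep_parameters(num_cameras=4, rho_steps=8,
                                xs=[0.0, 10.0], ys=[0.0, 5.0, 10.0]))
print(len(records))  # 4 * 8 * 2 * 3 = 192
```

The record count grows multiplicatively with each swept parameter, which is one reason the text keeps ρ, θ, and β low-dimensional.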

[0049] In S604, 3D reconstruction is performed by inputting all the stored 2D images to the trained DNN illustrated in Fig. 3(A). This makes it possible to compare the estimated 3D person model (3D shape data) with a 3D person model that is the ground truth and can be reproduced from the 3D person model parameters assigned to the input 2D images.

[0050] In S605, the reconstruction result is compared with the ground-truth object to calculate the difference. The difference is calculated for every combination of camera ID and 3D human model parameters changed in appropriate steps. The difference here is the score $L_s$ defined in Equation 2. The camera ID can be any ID that simply identifies the camera, but in order to simplify subsequent calculations, it is preferable to assign consecutive numbers in the order in which the cameras are installed, for example, clockwise or counterclockwise around the stadium.

[0051] The camera whose input image yields a reconstructed 3D shape with a small difference between the estimated 3D person model and the ground-truth 3D person model is the camera that matches the shooting conditions of the person. Therefore, the camera that matches the shooting conditions can be determined by obtaining a discriminant function Q that takes the 3D person model parameters (ρ, θ, β, x, y) of the person being photographed as input and returns the camera ID c that minimizes the difference score $L_s$ with respect to the ground truth. This can be defined as Equation 3.

[0052]

[Equation 3]
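The image of Equation 3 is not present in the extracted text. From the description of Q as returning the camera ID that minimizes the score, a consistent form would be the following, where M is the number of installed cameras; the exact notation of the filing is an assumption.

```latex
Q(\rho, \theta, \beta, x, y) = \operatorname*{arg\,min}_{c \in \{1, \dots, M\}} L_s(c)
```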

[0053] The score $L_s(c)$ is already defined in Equation 2, and its value is determined by the inside / outside determination of the 3D person model. As already explained, the 3D person model is determined by the person model parameters (ρ, θ, β, x, y), so the discriminant function Q can be found. The input person model parameters are reduced in dimension to a one-dimensional camera ID. Since the right-hand side of Equation 3 can be calculated uniquely, a discriminant for finding c, the appropriate camera ID for the input, can be obtained by performing general supervised dimensionality reduction and finding an appropriate scale space. For example, Q(ρ, θ, β, x, y) can be found by Fisher discriminant analysis (FDA).
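To make the FDA step concrete, the following is a minimal two-class Fisher linear discriminant in pure Python. It is a toy, not the method of the filing: it assumes only two cameras and a two-dimensional parameter vector (here the coordinates x and y), with each training sample already labeled by the camera that gave the smaller score. The helper names and sample values are hypothetical.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(2)]


def scatter(vectors, m):
    """2x2 scatter matrix of the vectors around their mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        d = [v[0] - m[0], v[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fit_fisher(class1, class2):
    """Return (w, threshold) with w = Sw^-1 (m2 - m1)."""
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m2[0] - m1[0], m2[1] - m1[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Decision threshold: midpoint of the two projected class means.
    p1 = w[0] * m1[0] + w[1] * m1[1]
    p2 = w[0] * m2[0] + w[1] * m2[1]
    return w, (p1 + p2) / 2


def Q(params, w, threshold):
    """Toy discriminant: returns camera ID 1 or 2 for a 2-D parameter vector."""
    return 2 if w[0] * params[0] + w[1] * params[1] > threshold else 1


# Hypothetical (x, y) positions where camera 1 or camera 2 had the lower score.
best_cam1 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)]
best_cam2 = [(6.0, 5.0), (7.0, 6.5), (6.5, 5.5), (8.0, 6.0)]
w, threshold = fit_fisher(best_cam1, best_cam2)
print(Q((1.0, 1.0), w, threshold), Q((7.0, 6.0), w, threshold))  # 1 2
```

With M cameras and the full parameter vector (ρ, θ, β, x, y), a multi-class discriminant analysis would be needed; the two-class case is shown only because it fits in a few lines.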

[0054] Next, the flow during actual operation will be explained. Because the function Q automatically determines the camera to be used for 3D reconstruction once the person model parameters of the person to be reconstructed are known, estimation during actual operation follows the flowchart in Figure 6(B).

[0055] In S611, the orientation ρ, posture θ, shape β, and coordinates (x, y) of the person 44 to be reconstructed in 3D at the estimated time t are obtained. At this time, when ρ, θ, and β are obtained for the input of a 2D image, the network architecture illustrated in FIG. 3(B) is used. FIG. 3(B) and FIG. 3(A) are similar, but different. FIG. 3(A), which has already been explained, is an hourglass-shaped DNN (31) that outputs a 3D person model (32) with a detailed resolution level from a 2D image input (30). In contrast, FIG. 3(B) has the same input as FIG. 3(A), as shown in 33, but the information output through the encoder shown in 34 is lower-level information compared to FIG. 3(A), and outputs the orientation ρ (35), shape β (36), and posture θ (37) as is. The reason for using a DNN with a simple output here is that this step is performed for the purpose of obtaining information for determining the optimal camera for 3D reconstruction, and is for performing high-speed processing while maintaining high generalization performance. Therefore, the output in Fig. 3(B) is simplified and is the template mesh deformed by ρ, β, and θ, so it remains at the level shown in 38 compared to the high-level output result 32 in Fig. 3(A). Furthermore, the person coordinates (x, y) are found by a general person detector.

[0056] The above describes the methods for determining ρ, θ, β, x, and y. The accuracy of this estimation does not change significantly from camera to camera as long as the target is captured in the camera's image. Therefore, it is advisable to determine each parameter, for example, by using the median of the values calculated from several randomly selected cameras.
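The median-over-cameras idea can be sketched as follows. The per-camera estimates and the `robust_parameters` helper are hypothetical. One caveat: a plain median ignores the wraparound of the orientation ρ at 2π, so this sketch is shown only for the coordinates.

```python
import random
import statistics


def robust_parameters(estimates_by_camera, sample_size, rng):
    """Median of each parameter over a random subset of cameras."""
    chosen = rng.sample(sorted(estimates_by_camera), sample_size)
    keys = estimates_by_camera[chosen[0]].keys()
    return {k: statistics.median(estimates_by_camera[c][k] for c in chosen)
            for k in keys}


# Hypothetical per-camera estimates; camera 4 is an outlier.
estimates = {
    1: {"x": 10.0, "y": 5.0},
    2: {"x": 10.2, "y": 5.1},
    3: {"x": 9.9, "y": 4.9},
    4: {"x": 30.0, "y": 20.0},
    5: {"x": 10.1, "y": 5.0},
}
result = robust_parameters(estimates, sample_size=3, rng=random.Random(0))
print(result)
```

Because at most one of the three sampled cameras can be the outlier, the median always lands on one of the consistent estimates in this example.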

[0057] Although the above describes a method for estimating the posture and shape of a person using DNN as shown in Fig. 3(B), other methods that can perform similar estimation may be used. For example, posture estimation using a two-dimensional image input may be performed using skeleton estimation, and the objective may be achieved by associating it with optimal shooting conditions.

[0058] Next, in S612, the human model parameters ρ, θ, β, x, and y are input to the already acquired discriminant function Q to determine the camera ID.

[0059] Finally, in S613, the image captured by the determined camera is input to a DNN for 3D reconstruction to estimate a 3D person model.
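Steps S611 to S613 form a three-stage pipeline, sketched below with placeholder callables. All function names and the toy stand-ins are assumptions for illustration; in the embodiment the first stage would be the lightweight network of Fig. 3(B) plus a person detector, the second the learned discriminant Q, and the third the reconstruction DNN of Fig. 3(A).

```python
def run_reconstruction(images_by_camera, estimate_params, discriminant, reconstruct):
    """Run S611 (parameters), S612 (camera choice), S613 (3D reconstruction)."""
    params = estimate_params(images_by_camera)        # S611
    camera_id = discriminant(params)                  # S612
    shape = reconstruct(images_by_camera[camera_id])  # S613
    return camera_id, shape


# Toy stand-ins (hypothetical) so the flow can be executed end to end.
def estimate(images):
    return {"rho": 0.0, "x": 12.0, "y": 3.0}


def choose(params):
    return 2 if params["x"] > 10 else 1


def rebuild(image):
    return f"mesh-from-{image}"


images = {1: "img-1", 2: "img-2", 3: "img-3"}
camera_id, shape = run_reconstruction(images, estimate, choose, rebuild)
print(camera_id, shape)  # 2 mesh-from-img-2
```

Only the image from the selected camera reaches the expensive reconstruction stage, which is where the saving in calculation cost comes from.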

[0060] As a result of the above, a highly accurate estimate of the shape of the 3D human model is obtained. The final rendering result is then obtained by attaching to the target object textures created from cameras whose direction angles are close to that of the virtual camera as it moves along the path in Fig. 4(A) from 41 to 42 to 43.
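Choosing texture sources by angular closeness can be sketched as below. The sketch assumes each real camera is summarized by a single viewing angle around the stadium and that eight cameras are evenly spaced; both are simplifications introduced here, as are the helper names.

```python
import math


def angular_distance(a, b):
    """Smallest absolute difference between two angles in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def nearest_cameras(virtual_angle, camera_angles, k=2):
    """IDs of the k real cameras whose direction is closest to the virtual camera."""
    ranked = sorted(camera_angles,
                    key=lambda c: angular_distance(camera_angles[c], virtual_angle))
    return ranked[:k]


# Eight cameras, evenly spaced: camera 1 at 0 degrees, camera 8 at 315 degrees.
camera_angles = {c: 2 * math.pi * (c - 1) / 8 for c in range(1, 9)}
print(nearest_cameras(math.radians(350), camera_angles))  # [1, 8]
```

Taking the distance modulo 2π matters here: a virtual camera at 350 degrees is closest to the camera at 0 degrees, not to the one at 315 degrees.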

[0061] The above has demonstrated a method for achieving accurate 3D reconstruction using DNN.

[0062] In addition to the above method of determining the texture, the objective may be achieved by using a DNN that estimates color information for each point on the surface of each estimation object.

[0063] In addition, in this embodiment, only a two-dimensional image is input, but similar results can be obtained by estimating a visual hull, which can be implemented using the camera configuration shown in Fig. 2, and inputting the three-dimensional point cloud estimated here. Alternatively, this can be achieved by acquiring and inputting a depth map, and learning that links the depth information with the three-dimensional shape to be output.

[0064] <Second embodiment> The second embodiment is a modification of the first embodiment. In the first embodiment, a method for determining the camera ID used for 3D reconstruction from the estimation result of only the human model parameters has been described. In the present embodiment, a method for determining the camera ID by referring to not only the human model parameters but also parameters related to the virtual camera during video rendering is adopted.

[0065] The system configuration to which the information processing device according to the second embodiment is applied is the same as that of the first embodiment and can be represented in FIG. 1, and the specific stadium and camera settings are also the same as the example shown in FIG. 2.

[0066] Hereinafter, only the differences from the first embodiment will be described. In the first embodiment, the inside / outside determination of a target object is defined by Equation 1, and a method for learning the coordinates of the object by determining a continuous 3D occupancy field has been described. In this case, Equation 3 shows an equation for selecting a camera that minimizes the error as the optimal camera for shape estimation by defining a score for evaluating an error during estimation by comparing the 3D shape of the ground truth with the estimated 3D shape by Equation 2. In this embodiment, since the difference when rendering is performed with a virtual camera is calculated, the score obtained by comparing 3D shapes is not used, and the shape estimation error during 2D video rendering after 3D reconstruction is considered.

[0067] In two-dimensional video rendering, only the error after conversion into a two-dimensional image from the viewpoint of the virtual camera (VC) is considered. Therefore, the 3D object information is rendered using the shooting parameters determined at each moment by the virtual camera (VC): its three-dimensional coordinates $(x_v, y_v, z_v)$, defined by the x-axis shown by 45 in FIG. 4(A), the y-axis shown by 46, and the z-axis perpendicular to these axes, and the angles of its optical axis center $(\rho_v, \phi_v)$. If the function that projects the 3D object information onto a 2D plane is written $\pi_{vc}(\rho_v, \phi_v, x_v, y_v, z_v)$, the two-dimensional planar error can be defined by Equation 4.

[0068]

[Equation 4]

[0069] In this case, the camera ID used for 3D reconstruction is obtained not only from the parameters of the 3D reconstruction target itself defined in Equation 3, but also from the parameters defining the position and orientation of the virtual camera (VC). It is acquired by obtaining the discriminant function Q shown in Equation 5. Since the number of virtual camera (VC) parameters added in Equation 5 is small compared with the number of parameters related to the target person, general supervised dimensionality reduction is performed to obtain an appropriate scale space by the same method as described in the first embodiment. This allows the discriminant function Q for obtaining an appropriate camera ID to be obtained.

[0070]

[Equation 5]

[0071] During estimation, as the virtual camera (VC) position moves through 41, 42, and 43 on the camera path 47 shown in Fig. 4, the virtual camera parameters $(\rho_v, \phi_v, x_v, y_v, z_v)$ are determined automatically. Therefore, by determining the target person model parameters (ρ, θ, β, x, y) described in the first embodiment, all the variables on the right side of Equation 5 are determined. This makes it possible to determine an appropriate camera ID by the discriminant function Q during estimation.
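The images of Equations 4 and 5 are missing from the extracted text. One reading consistent with the description is given below, offered only as an assumption. Here $\hat{S}_c$ denotes the shape estimated from the image of camera c, $S^{*}$ the ground-truth shape, and $L_{vc}$ the two-dimensional planar error; all three symbols are hypothetical names introduced for this sketch, and the sum runs over the pixels p of the rendered image.

```latex
L_{vc}(c) = \sum_{p}
\left| \pi_{vc}(\rho_v, \phi_v, x_v, y_v, z_v)\big(\hat{S}_c\big)(p)
     - \pi_{vc}(\rho_v, \phi_v, x_v, y_v, z_v)\big(S^{*}\big)(p) \right|^{2}

Q(\rho, \theta, \beta, x, y, \rho_v, \phi_v, x_v, y_v, z_v)
  = \operatorname*{arg\,min}_{c \in \{1, \dots, M\}} L_{vc}(c)
```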

[0072] <Third embodiment> In the first and second embodiments, the DNN model that performs 3D reconstruction when images captured by M cameras (M is an integer equal to or greater than 2) installed around the stadium shown in Fig. 2 are input is the same. In other words, it has been explained that a person is reconstructed using the same general-purpose DNN model regardless of which camera is used to capture the image. However, in some cases, such as when the area captured by each camera is small, bias may occur in the capture angles of the person captured by each camera.

[0073] For example, a camera shooting directly down from the ceiling of a stadium is unlikely to capture an image of a person observed from the side. In such a case, the accuracy of 3D reconstruction is improved by using a DNN model trained according to the tendency of the images that the camera actually supplies as input, rather than a single DNN model that gives good overall estimation accuracy for input images showing the target person from all angles.

[0074] Such factors may bias the way people appear in the images captured by each of the M cameras shown in Fig. 2. Therefore, the third embodiment describes a method of training and preparing a separate DNN model for each of the M cameras, and how to use these models in actual operation. The description below follows the flowchart in Fig. 6(C).

[0075] In S620, a loop is executed M times so that a different DNN model is generated for the input from each of the M cameras. Within this loop, in S621, training of a DNN model for a specific camera c (1≦c≦M) is started. As the initial DNN model for the camera with camera ID c, parameters that generalize well across all input images are set. Next, in S622, 3D human models are placed in a CG-reproduced stadium while the human model parameters (ρ, θ, β, x, y) are varied, in the same way as in S602 of the first embodiment.

[0076] Next, in S623, the various 3D human models are rendered as images captured by camera c, and 3D reconstruction is performed from each rendered image using the current DNN model to estimate the human shape. The placed 3D human model, which serves as the Ground Truth, is then compared with the estimated human shape, and the difference between them is calculated. An average difference is calculated over the results obtained with the variously changed human model parameters. Next, in S624, it is determined whether the average difference is equal to or less than a preset threshold. If it is not, in S625 the parameter set of the DNN model is updated slightly so that the difference is reduced, and the process returns to S622. If the average difference is equal to or less than the threshold in S624, the process for camera c ends in S626, c is set to c+1, and the process returns to S621. These processes are repeated until c=M.

[0077] This process results in obtaining DNN models corresponding to M cameras.
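The control flow of S620 to S626 can be sketched with a deliberately tiny stand-in for the DNN. In the sketch below, each "model" is a single scalar weight and each camera is reduced to a hypothetical gain that distorts a scalar "shape" value. A real system would render CG images and train a network, which is not reproduced here. The function name, the gains, the learning rate, and the iteration cap are all assumptions made for illustration.

```python
import random

def train_per_camera(camera_gains, threshold=0.01, lr=0.1, seed=0, max_iters=10000):
    """Toy version of S620-S626: one scalar 'model' per camera.

    camera_gains: {camera_id: gain}, a stand-in for how camera c renders a
    shape value (hypothetical; a real system renders CG images).
    """
    rng = random.Random(seed)
    models = {}
    for c, gain in camera_gains.items():           # S620: loop over cameras
        w = 1.0                                     # S621: generic initial model
        for _ in range(max_iters):
            # S622: place human models with varied parameters (toy values).
            shapes = [rng.uniform(0.5, 2.0) for _ in range(32)]
            # S623: render via camera c, estimate, compare with ground truth.
            diffs = [w * gain * s - s for s in shapes]
            mean_abs = sum(abs(d) for d in diffs) / len(diffs)
            if mean_abs <= threshold:               # S624: threshold check
                break                               # S626: camera c is done
            # S625: small gradient step on the mean squared difference.
            grad = sum(2 * d * gain * s for d, s in zip(diffs, shapes)) / len(shapes)
            w -= lr * grad
        models[c] = w
    return models

models = train_per_camera({1: 0.5, 2: 1.5})
print(models)  # each weight ends up close to 1 / gain for its camera
```

The point of the sketch is the loop structure: every camera starts from the same generic model and is refined only on data rendered from that camera's viewpoint, so the resulting models differ per camera.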

[0078] Next, the flow at the time of estimation will be described with reference to the flowchart of FIG. 6(D). In S627, the person model parameters (ρ, θ, β, x, y) of the person to be reconstructed in 3D at time t are acquired by the method already described in the first embodiment. Next, in S628, the camera ID suitable for reconstruction is determined by the discriminant function Q already described. The discriminant function Q used here can be derived with the same formulas and method as in the first embodiment. That is, the error between the Ground Truth and the estimation result defined by Equation 2 or Equation 4 can be evaluated in the same manner, because the difference between the DNN models is represented by the camera ID c, which keeps the same meaning. In S629, the DNN linked to the camera ID determined in S628 is selected, and in S630, the image captured by that camera is input to the selected DNN, which finally performs 3D reconstruction of the target person at time t.
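The estimation flow of S628 to S630 amounts to a lookup from camera ID to model. The following minimal sketch assumes that the person model parameters from S627 are already available. The function name, the threshold-based stand-in for Q, and the placeholder "models" are hypothetical.

```python
def reconstruct_at_time_t(images_by_cam, person_params, q, models_by_cam):
    """Toy version of S628-S630 for one target person at time t."""
    cam_id = q(person_params)              # S628: camera ID from Q
    model = models_by_cam[cam_id]          # S629: DNN linked to that camera ID
    shape = model(images_by_cam[cam_id])   # S630: 3D reconstruction
    return cam_id, shape

# Hypothetical stand-ins: Q thresholds on x, and each "DNN" tags its input.
q = lambda p: 1 if p["x"] < 50.0 else 2
models = {1: lambda img: ("mesh_from_model_1", img),
          2: lambda img: ("mesh_from_model_2", img)}
images = {1: "image_cam1", 2: "image_cam2"}
print(reconstruct_at_time_t(images, {"x": 72.0}, q, models))
```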

[0079] <Fourth embodiment> In the first, second and third embodiments, for simplicity, the input image is assumed to be taken by a single camera, as shown in FIG. 3(A). However, when the purpose is 3D reconstruction, accuracy is expected to improve when multiple images are input. In particular, as shown in FIG. 2, the camera setup assumed here has multiple cameras installed around the stadium, and these cameras capture images synchronously under the system shown in FIG. 1. Therefore, selecting multiple images from these synchronously captured images and inputting them to the DNN is expected to yield accurate 3D reconstruction.

[0080] For example, the relationship between the input and output for the DNN when three images are input is shown in Fig. 3(C). In this figure, a person to be 3D reconstructed is photographed by different cameras in synchronization, and each input image is represented by 301, 302, and 303. Now, the relationship between these three cameras is fixed. The images photographed by the three cameras are input to a DNN 304 equivalent to that already described in the first embodiment, and a 3D reconstruction result 305 is output.

[0081] As stated above, the three cameras whose images are input to the DNN are fixed. In other words, the DNN 304 has already learned the relationship between the three cameras. It is therefore difficult to obtain appropriate 3D reconstruction by selecting three different cameras and inputting their images into the DNN 304. Moreover, since the person to be reconstructed in 3D moves while changing posture, it is difficult to keep the whole body captured by all cameras of the same three-camera set at all times. It is therefore necessary to estimate the person model parameters (ρ, θ, β, x, y) of the target person and perform 3D reconstruction using a DNN model trained for a set of three cameras that capture the person under favorable conditions for these parameters (especially x and y).

[0082] Even when the number of input cameras is three, 3D reconstruction can be performed in the same flow as in the embodiments with one input camera by assigning a camera set ID that is uniquely defined for each camera set. For example, the number of ways of selecting three cameras from M cameras is M(M-1)(M-2) / (3×2×1). Therefore, when IDs are assigned to all combinations, a DNN is trained for each combination according to the flow shown in the third embodiment, and the loss in each training is calculated by the score shown in Equation 2 or Equation 4. In this case, the parameter c is treated as the camera set ID. A discriminant function for discriminating the optimal camera set ID based on the losses thus calculated is obtained by the formula defined in Equation 3 or Equation 5.
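The assignment of camera set IDs described above can be sketched with the standard library. The helper name `camera_set_ids`, the 1-based camera numbering, and the zero-based set IDs are assumptions; the source only requires that each combination receive a unique ID.

```python
from itertools import combinations
from math import comb

def camera_set_ids(m, k=3):
    """Map each k-camera combination of cameras 1..m to a unique set ID."""
    return dict(enumerate(combinations(range(1, m + 1), k)))

sets = camera_set_ids(6)
print(len(sets), comb(6, 3))   # both equal 6*5*4 / (3*2*1)
print(sets[0], sets[19])       # first and last camera sets
```

Since one DNN is trained per camera set, the number of models grows with the number of combinations. For M = 30 and three cameras per set, for example, `comb(30, 3)` gives 4060 sets.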

[0083] In the fourth embodiment, a specific DNN configuration was explained based on an example of taking pictures with three cameras as a method for inputting multiple images taken in a synchronized manner. However, the number of images input here is not limited to three, and the purpose can be achieved by using any number of images in a similar manner.

[0084] Alternatively, when photographing an object with a single camera, multiple frames that are consecutive in time series can be input into a DNN with the architecture illustrated in Figure 3(C).

[0085] <Fifth embodiment> In the first to fourth embodiments, in order to simplify the explanation, an example in which only one person is the subject of 3D reconstruction is given as shown in Fig. 4(A). In the present embodiment, a case in which multiple people are the subject of 3D reconstruction will be explained.

[0086] In this embodiment, the system configuration example is the same as that in FIG. 1, and the stadium and camera settings are the same as those shown in FIG.

[0087] A specific example of a case where multiple people appear in a stadium is shown in Fig. 4(B). The movement path of the virtual camera (VC) during video rendering is set as 407, and the virtual camera (VC) moves along this path through positions 401, 402, and 403 in chronological order. Meanwhile, the people 404, 405, and 406 to be photographed change their postures and coordinates. What differs from the first embodiment is that the number of people subject to video rendering increases or decreases as the virtual camera position changes, and that the presence of multiple people causes occlusion in the images of the cameras installed in the stadium.

[0088] Therefore, in addition to the processing of the first embodiment, it is necessary to identify the people to be rendered and perform 3D reconstruction only for them. The purpose is achieved by also changing the priority of the cameras selected for 3D reconstruction when occlusion occurs.

[0089] Hereinafter, the processing flow according to this embodiment will be described with reference to the flowchart shown in FIG.

[0090] First, in S701, the coordinates (x, y) of all people to be reconstructed in 3D are estimated. This step may use any of the images captured by the M cameras, or may be determined by a single camera capturing the entire image.

[0091] Next, in S702, the virtual camera parameters (ρ_v, φ_v, x_v, y_v, z_v) are acquired, and the people to be drawn within the angle of view of the virtual camera determined by these parameters are identified. For example, when the virtual camera position is 401, only person 404 is to be drawn, and when the virtual camera position is 402, people 404, 405, and 406 are to be drawn. When the virtual camera position moves to 403, only people 404 and 405 are to be drawn, and so on.
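The selection in S702 can be illustrated with a ground-plane test that keeps only the people whose bearing from the virtual camera falls inside its horizontal angle of view. This is a simplified sketch: it ignores tilt, height, and distance limits, and the function name, the heading convention, and the coordinates are hypothetical.

```python
import math

def people_in_view(people_xy, cam_xy, cam_heading, half_fov):
    """Return IDs of people whose ground position lies inside the view cone.

    people_xy: {person_id: (x, y)}; cam_heading and half_fov in radians.
    """
    visible = []
    for pid, (x, y) in people_xy.items():
        bearing = math.atan2(y - cam_xy[1], x - cam_xy[0])
        # Wrap the angular offset into [-pi, pi) before comparing.
        offset = (bearing - cam_heading + math.pi) % (2 * math.pi) - math.pi
        if abs(offset) <= half_fov:
            visible.append(pid)
    return visible

people = {404: (10.0, 0.0), 405: (10.0, 10.0), 406: (-5.0, 0.0)}
print(people_in_view(people, (0.0, 0.0), 0.0, math.radians(30)))
print(people_in_view(people, (0.0, 0.0), 0.0, math.radians(50)))
```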

[0092] Next, in S703, the person model parameters (ρ, θ, β, x, y) already described are estimated for all the identified target persons. Next, in S704, the camera ID used to reconstruct each target person is determined by the discriminant function Q described in the first and second embodiments. However, as already described, the discriminant function Q is learned as a function that specifies a camera ID suitable for reconstruction when a single person model exists in the stadium. In the present embodiment, there are multiple people in the stadium, so several people may appear in the image taken by the camera with the camera ID determined by the discriminant function Q, and the person to be reconstructed may be occluded by another person. Therefore, in S705, it is determined whether the person to be reconstructed is occluded by any of the other people in the image taken by the camera with the determined camera ID.

[0093] Whether the person is occluded is determined by detecting the human regions using Mask R-CNN or the like and checking whether the human regions overlap. If the person is determined to be occluded, the 3D reconstruction result from that camera is unlikely to be good. Therefore, in S706, the candidate camera ID based on the discriminant function is changed to the next-best ID, and the process returns to S705. If the person is determined not to be occluded, the camera to be used for 3D reconstruction is decided in S707, and 3D reconstruction is performed by inputting its image into the DNN for 3D reconstruction.
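The loop through S705, S706, and S707 can be sketched as a walk over camera candidates ranked by the discriminant function Q. In this sketch, axis-aligned bounding boxes stand in for the human regions that the source detects with Mask R-CNN or the like. Returning `None` when every candidate is occluded is an assumption, since the source does not describe that case. All names and box values are hypothetical.

```python
def boxes_overlap(a, b):
    """Axis-aligned boxes (x0, y0, x1, y1) overlap if they share area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def pick_unoccluded_camera(ranked_cams, boxes_by_cam, target):
    """Walk Q's ranked camera IDs until the target is not overlapped.

    boxes_by_cam: {camera_id: {person_id: box}} from a hypothetical detector.
    """
    for cam in ranked_cams:                        # S706: next-best candidate
        boxes = boxes_by_cam[cam]
        if target not in boxes:
            continue
        others = [b for pid, b in boxes.items() if pid != target]
        # S705: the target counts as occluded if any other region overlaps it.
        if not any(boxes_overlap(boxes[target], b) for b in others):
            return cam                             # S707: camera decided
    return None                                    # assumption: no usable camera

boxes_by_cam = {
    3: {404: (0, 0, 10, 20), 405: (5, 0, 15, 20)},    # 405 overlaps 404
    7: {404: (0, 0, 10, 20), 405: (30, 0, 40, 20)},   # clear view of 404
}
print(pick_unoccluded_camera([3, 7], boxes_by_cam, 404))
```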

[0094] <Sixth embodiment> The first to fifth embodiments described estimation methods for reconstructing a target person in 3D, including a method that obtains optimal results by changing the input camera ID as needed and a method that obtains optimal results by changing the trained DNN model depending on the camera ID.

[0095] On the other hand, in each embodiment, the basic assumption was that the DNN architecture used would be fixed.

[0096] This embodiment describes a method of changing the DNN architecture for each 3D reconstruction according to the angle of view of the virtual camera, which changes over time during rendering, and the state of the people within that angle of view. It also describes a method that dynamically switches to non-DNN geometric methods such as Visual Hull, so that 3D reconstruction is performed with the most suitable method at each moment.

[0097] The processing flow in this case will be described with reference to the flowchart in FIG. 8. First, in S801, the coordinates (x, y) of all people to be reconstructed at time t, when 3D reconstruction is performed, are estimated. Next, in S802, the virtual camera parameters (ρ_v, φ_v, x_v, y_v, z_v) are acquired, and the people to be rendered within the angle of view of the virtual camera are identified by the method described in the fifth embodiment. Then, in S803, the person model parameters of all the identified people to be reconstructed in 3D are estimated. Then, in S804, these person model parameters and the virtual camera parameters are used to determine the method and camera ID to be used when reconstructing each person in 3D. Note that information not referred to in the first to fifth embodiments is taken into consideration when determining the method. An example of the appearance of multiple people rendered by the virtual camera is shown in FIG. 9 and is described below.

[0098] In Fig. 9, two people are drawn within the virtual camera angle of view determined by the virtual camera parameters (ρ_v, φ_v, x_v, y_v, z_v). The two people drawn here are people 92 and 93. As already explained, when reconstructing each person in 3D, the optimal camera for estimating the 3D shape of each person is usually different. The optimal camera for reconstructing each person in 3D can be selected using the discriminant function Q already explained.

[0099] However, the discriminant function Q was obtained as a criterion for improving the 3D reconstruction result of each person to be drawn. When drawing with a virtual camera, the reconstruction result of person 92, who is far from the virtual camera, does not need to be estimated with the same precision as that of person 91. If person 91 is reconstructed with an error of one pixel or less at rendering time, person 92 is drawn at half that scale or less, so it is sufficient to perform 3D reconstruction with a precision that matches the drawing scale.

[0100] Therefore, in this case, the drawing scale of each person within the camera angle of view is calculated from (x_v, y_v, z_v) of the virtual camera parameters and (x, y) of the person model parameters. For example, consider a case where DNN models that take different numbers of input images, such as the DNN 31 in FIG. 3(A) and the DNN 304 in FIG. 3(C), are stored together in advance and used for 3D reconstruction. In this case, when person 91, who is close to the virtual camera, is reconstructed in 3D, a DNN model with a large number of input images is used to estimate the shape and convert it into mesh information. When person 92, who is far away, is reconstructed in 3D, a DNN model with a small number of input images, estimation by a simple network, or a visual hull that can be calculated from contour information alone is used to estimate the shape and convert it into mesh information. These choices are preferably defined in advance in, for example, a decision tree or lookup table that associates the target scale at rendering time with the calculation cost and accuracy of the various methods, so that the method can be determined automatically.
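A lookup table of the kind mentioned above might look like the following sketch. The pixel thresholds, the method names, the assumed person height, and the pinhole approximation of the drawing scale are all hypothetical. The source only states that the scale is derived from the virtual camera position and the person's position, and that a table or decision tree maps it to a method.

```python
import math

# Hypothetical table: (minimum on-screen height in pixels, method name),
# ordered from the most accurate and costly method to the cheapest.
METHOD_TABLE = [
    (400.0, "dnn_multi_image"),
    (150.0, "dnn_single_image"),
    (0.0, "visual_hull"),
]

def rendering_scale(person_xy, vc_xyz, person_height=1.7, focal=1000.0):
    """Approximate on-screen height in pixels under a pinhole model."""
    dist = math.dist((person_xy[0], person_xy[1], 0.0), vc_xyz)
    return focal * person_height / dist

def choose_method(person_xy, vc_xyz):
    """Pick the first table entry whose minimum scale is satisfied."""
    scale = rendering_scale(person_xy, vc_xyz)
    for min_pixels, method in METHOD_TABLE:
        if scale >= min_pixels:
            return method
    return METHOD_TABLE[-1][1]

vc = (0.0, 0.0, 0.0)
for xy in [(3.0, 0.0), (10.0, 0.0), (40.0, 0.0)]:
    print(xy, choose_method(xy, vc))
```

Keeping the thresholds in a table rather than in code makes it straightforward to retune the trade-off between calculation cost and accuracy without touching the selection logic.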

[0101] Finally, in S805, final 3D reconstruction is performed for the number of people to be rendered using the method determined in S804 and one or more images captured by the camera, completing the 3D reconstruction processing required for rendering the scene at time t.

[0102] <Seventh embodiment> The first to sixth embodiments described methods of performing reconstruction by determining a camera suitable for 3D reconstruction from among images captured synchronously from multiple viewpoints. This makes it possible to obtain high-quality results for each frame of the video rendered after 3D reconstruction. On the other hand, when the per-frame 3D reconstruction results, each obtained with a different method, are arranged in chronological order and viewed, the various kinds of degradation that arise at reconstruction time may appear as phenomena such as flickering, resulting in video that is undesirable to the viewer.

[0103] A simple concrete example is given in FIG. 10. Consider the case where a person 1000 to be reconstructed in 3D exists as shown in FIG. 10(A), and the person is reconstructed in 3D as seen from a virtual camera 1006 over a scene of a certain length in the time-series direction. For simplicity, the virtual camera position is fixed at 1006, but the person 1000 changes posture and so on from moment to moment, so the person model parameters (ρ, θ, β, x, y) change over time. In accordance with the method already described, the optimal camera for 3D reconstruction is selected by the discriminant function Q according to the person model parameters (ρ, θ, β, x, y), or according to the person model parameters (ρ, θ, β, x, y) together with the virtual camera parameters (ρ_v, φ_v, x_v, y_v, z_v) of the virtual camera 1006. The selected camera then switches between camera 1004 and camera 1005 as needed, depending on the behavior of person 1000. An example of an actual image taken by camera 1004 is the image with the angle of view shown in 1001, and an example of an actual image taken by camera 1005 is the image with the angle of view shown in 1003. When the image 1002 seen from virtual camera 1006 is estimated by the trained DNN model from the image input from each camera, the estimate cannot completely match the true value, so each estimate differs slightly from the true correct value.

[0104] An example of such differing estimation results is shown in FIG. 10(B). Normally, since no real camera is installed at the position of the virtual camera 1006, a true answer image cannot be obtained, but 10001 shows an example of what such an image would be if it could be obtained. The answer image is an example of correct data. In contrast, 10002 shows an example of the result of performing 3D reconstruction using an image captured by the camera 1004 as input and rendering the image seen from the position of the virtual camera 1006. The shaded area 10004 is the area where the estimation result differs from the true answer image 10001. This area differs in shape from the true answer image because it corresponds to the object surface estimated for a region not actually observed by the camera 1004.

[0105] Similarly, 10003 shows an example of the result of performing 3D reconstruction using an image captured by camera 1005 as input and rendering the image viewed from the position of virtual camera 1006. The shaded area 10005 is the area where the estimation result differs from the true answer image 10001. This area differs in shape from the true answer image because it corresponds to the object surface estimated for a region not actually observed by camera 1005.

[0106] In practice, if the image used for 3D reconstruction when the position of virtual camera 1006 is fixed is always the image captured by camera 1004, then errors after reconstruction will occur only in the area exemplified by 10004, and are unlikely to be a problem when viewed. Similarly, even if images captured by camera 1005 are always used, errors after reconstruction will always occur consistently only in the area exemplified by 10005, and so it is unlikely that these errors will cause problems when viewed.

[0107] However, if the parameters of the target person 1000, such as posture, change over time and the camera selected by the discriminant function Q alternates between 1004 and 1005, the error tendency becomes inconsistent. As the selection switches from frame to frame, frames in which the region 10004 is enlarged alternate repeatedly with frames in which the region 10005 is enlarged, which may give the viewer an unpleasant impression of flickering in the reconstructed shape.

[0108] Therefore, an example of an optimal DNN architecture to suppress this phenomenon is shown in Figure 3(D).

[0109] FIG. 3(D) shows an architecture intended to counter the above-mentioned adverse effect in a setting that includes all of the first to sixth embodiments. It shows how unstable fluctuation in degradation between successive 3D reconstructions is suppressed by propagating information between the DNNs at each time t as the time-series processing of 3D reconstruction proceeds through t=T-1, T, T+1. At t=T-1, one or more optimal camera IDs are selected, and an encoder 311 converts the corresponding images into the information representation required for 3D reconstruction. Then, via an intermediate feature representation 313 that summarizes the 3D reconstruction, the decoder 312 converts it into a high-level feature representation and finally produces a 3D reconstruction result 314. The intermediate feature representation 313 may be any representation learned through the encoder and decoder, or it may be a representation whose learning is guided to match quantities that explicitly express the shape of a human model, such as ρ, θ, and β.

[0110] At t=T, just as at t=T-1, images captured by one or more cameras are input to encoder 315. They are then converted into intermediate feature representation 317, to which feature representation 313 at time t=T-1 is input. The feature representation from t=T-1 and the input image at t=T are referenced, and the result is passed through decoder 316, which finally outputs the 3D reconstruction result shown in 318.

[0111] Similarly, at t=T+1, the intermediate feature representation 317 at the time point t=T is input in the middle of the DNN, and these intermediate features are propagated in time series, and reference to the estimation results in past frames is repeated. As a result, compared to the case where independent 3D reconstruction is performed in each frame as assumed in the first to fifth embodiments, past information is referenced in time series, and an effect of suppressing adverse effects such as image flickering due to inconsistent fluctuations can be obtained.
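The effect of propagating the intermediate feature representation from one frame to the next can be sketched without any neural network. In the sketch below, a fixed weighted blend stands in for the learned fusion of the previous feature (such as 313) with the current one (such as 317), and `encode` and `decode` are hypothetical callables that stand in for the encoder and decoder. The blend weight and the toy frames are assumptions.

```python
def run_sequence(frames, encode, decode, blend=0.5):
    """Propagate an intermediate feature from frame t-1 into frame t."""
    prev = None
    outputs = []
    for image in frames:
        feat = encode(image)
        if prev is not None:
            # Stand-in for the learned fusion of past and current features.
            feat = [blend * p + (1.0 - blend) * f for p, f in zip(prev, feat)]
        outputs.append(decode(feat))
        prev = feat
    return outputs

# Toy input whose per-frame estimate alternates, as when the selected
# camera flips between two cameras with different error tendencies.
frames = [[0.0], [1.0], [0.0], [1.0]]
independent = run_sequence(frames, list, lambda f: f[0], blend=0.0)
propagated = run_sequence(frames, list, lambda f: f[0], blend=0.5)
print(independent)
print(propagated)
```

With independent per-frame estimation the output jumps by the full difference between the two estimates on every frame, whereas with propagation the frame-to-frame change in this toy example is at most half as large.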

[0112] Although FIG. 3(D) shows a structure in which part of the same network is fed back as input at the next time step, as in a general RNN (Recurrent Neural Network), other methods may be used. For example, even if the input camera ID or the DNN model changes at every step as the time series proceeds through t=T-1, T, T+1, as explained in the first to sixth embodiments, this architecture can be operated by adjusting the intermediate feature representation. The number of input images may also change at every step. For example, even if the number of input images is 3 at t=T-1 and 1 at t=T, the same effect can be achieved by performing 3D reconstruction using a DNN model trained for each input configuration and propagating each intermediate feature representation in time series. Even when a person far from the virtual camera at time t is reconstructed in 3D using a visual hull or the like, the same can be achieved by training in advance a DNN that converts the 3D reconstruction result into a low-order intermediate feature representation (especially ρ, θ, β).

Claims

1. An information processing device that photographs a subject with one or more cameras and reconstructs the subject in three dimensions using the captured images, the information processing device characterized by having: an acquisition means for acquiring low-order parameters representing the state of the subject; a determination means for determining one or more cameras to be used for 3D reconstruction, using the low-order parameters and a function obtained based on the difference between the ground truth data of the subject and the 3D shape estimated by an estimator for estimating the 3D shape; and a generation means for generating three-dimensional shape data using an estimator that estimates the three-dimensional shape with the image captured by the determined camera as input.

2. The information processing apparatus according to Claim 1, characterized in that the function is obtained based on the difference between an image rendered from a virtual camera using the three-dimensional shape data estimated by the estimator and an image rendered from the virtual camera using the correct data of the subject.

3. The information processing apparatus according to claim 1, characterized in that the lower-order parameters representing the state of the subject are estimated by means different from the estimator used for three-dimensional reconstruction.

4. The information processing device according to claim 1, characterized in that the lower-order parameters representing the state of the subject are parameters representing the posture, shape, and direction of the person.

5. The information processing apparatus according to claim 1, characterized in that the determination means determines the state of the virtual camera used for video rendering after the three-dimensional shape data has been generated, in addition to the lower-order parameters that represent the state of the subject.

6. The information processing apparatus according to claim 1, characterized in that the determination means determines whether an obstruction occurs between the virtual camera and the subject from the viewpoint of the virtual camera, in addition to a lower-order parameter representing the state of the subject and a parameter representing the state of the virtual camera.

7. The information processing apparatus according to claim 1, characterized in that when the generation means performs three-dimensional reconstruction using the camera selected by the determination means, it uses an estimator that has been trained so that the estimator used for three-dimensional reconstruction differs according to the trend of the image captured by each camera.

8. The information processing apparatus according to claim 1, characterized in that when the generation means performs three-dimensional reconstruction using the camera selected by the determination means, it uses intermediate features previously estimated in order to stabilize the estimation result over time.

9. The information processing apparatus according to claim 1, characterized in that the image captured by the camera determined above is a two-dimensional image.

10. The information processing apparatus according to claim 1, characterized in that the generation means generates the three-dimensional shape data using a three-dimensional point cloud estimated from a depth map acquired based on an image captured by the determined camera as input.

11. The information processing apparatus according to claim 1, characterized in that the determination means determines, as the camera to be used for three-dimensional reconstruction, a camera set consisting of multiple cameras that capture images synchronously, and the generation means inputs a plurality of images captured by the plurality of cameras included in the camera set to the estimator to generate the three-dimensional shape data.

12. The information processing apparatus according to claim 11, wherein the determination means determines a camera set ID to be used for generating the three-dimensional shape data from among the camera set IDs associated with each of the plurality of camera sets, based on a low-order parameter representing the state of the subject.

13. The information processing apparatus according to claim 1, wherein the determination means obtains the rendering scale of the subject at the time of rendering based on the position of the subject and the position of the virtual camera that renders the three-dimensional shape data, and determines an estimation method for generating the three-dimensional shape data based on the rendering scale.

14. The information processing apparatus according to claim 1, wherein the generation means selects either an estimation method using the estimator or an estimation method using a geometric method based on the state of the subject, the state of the virtual camera, or the rendering scale of the subject, and generates the three-dimensional shape data using the selected estimation method.

15. The information processing apparatus according to claim 1, wherein the acquisition means acquires the function using an image rendered by a virtual camera corresponding to the one or more cameras while changing the state of a three-dimensional model corresponding to the subject, in a virtual environment that reproduces at least one of the intrinsic parameters, extrinsic parameters, and distortion coefficients of the one or more cameras in the real environment, and ground truth data based on the three-dimensional model.

16. An information processing method of photographing a subject with one or more cameras and reconstructing the subject in three dimensions using the captured images, the information processing method characterized by having: an acquisition step of acquiring low-order parameters representing the state of the subject; a determination step of determining one or more cameras to be used for 3D reconstruction, using the low-order parameters and a function obtained based on the difference between the ground truth data of the subject and the 3D shape estimated by an estimator for estimating the 3D shape; and a generation step of generating 3D shape data using an estimator that takes images captured by the determined camera as input.