Image processing system, image processing method, and program

The image processing system enhances endoscope imaging by training a NeRF model with customized loss functions to generate high-precision images from arbitrary viewpoints, addressing the limitations of current systems and improving medical examination efficiency.

JP2026110351A · Pending · Publication Date: 2026-07-02 · INTERNATIONAL UNIVERSITY OF HEALTH AND WELFARE +2

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
INTERNATIONAL UNIVERSITY OF HEALTH AND WELFARE
Filing Date
2024-12-20
Publication Date
2026-07-02

Figure 2026110351000001_ABST

Abstract

[Problem] To provide a novel technique for observing an examination target from a free viewpoint. [Solution] The image processing system 1 includes: a pose generation unit 103 that generates a second type of camera pose, which is a new camera pose, based on a first type of camera pose, which is an estimated camera pose; and a model learning unit 104 that trains a NeRF model using a first loss function for color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose, a second loss function for depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose, and a third loss function for depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose.

Description

[Technical Field]

[0001] This disclosure relates to an image processing system, an image processing method, and a program.

[Background Technology]

[0002] A well-known examination method involves using an endoscope to capture images of organs such as the stomach, and then using those images to check for the presence of cancer or other lesions in those organs. Examinations using images captured by an endoscope greatly contribute to the detection of lesions. However, there are limitations to the position and orientation of the endoscope when moving it within the patient's body to capture images. In other words, the viewpoint of the endoscope is limited, making it difficult to observe the subject of examination from a free viewpoint.

[0003] In this regard, Non-Patent Document 1 discloses a technique for reconstructing the pose of the endoscope camera and a 3D (three-dimensional) model of the entire stomach from images taken with an endoscope camera of the stomach, using SfM (Structure from Motion).

[0004] Furthermore, as a technology related to this disclosure, Non-Patent Document 2 discloses an algorithm called NeRF (Neural Radiance Fields). Non-Patent Document 3 discloses the introduction of KL divergence loss into the learning process of NeRF.

[Prior Art Documents]

[Non-Patent Literature]

[0005]
[Non-Patent Document 1] A. R. Widya, Y. Monno, K. Imahori, M. Okutomi, S. Suzuki, T. Gotoda, and K. Miki, “3D Reconstruction of Whole Stomach from Endoscope Video Using Structure-from-Motion”, in Proc. EMBC, 2019.
[Non-Patent Document 2] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, in Proc. ECCV, 2020.
[Non-Patent Document 3] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, “Depth-supervised NeRF: Fewer Views and Faster Training for Free”, in Proc. CVPR, 2022.

[Summary of the Invention]

[Problems to Be Solved by the Invention]

[0006] By further applying Poisson surface reconstruction to a 3D model obtained using the technology described in Non-Patent Document 1, it is possible to generate textured meshes. While it is possible to synthesize images from novel viewpoints using these reconstructed 3D textured meshes, it is not possible to synthesize high-precision images.

[0007] Therefore, there is a need for a novel technique that allows an examination target to be observed from a free viewpoint.

[Means for Solving the Problem]

[0008] The image processing system according to this disclosure includes:
an image acquisition unit that acquires a plurality of images of an organ captured by an endoscope camera;
an image processing unit that uses the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
a pose generation unit that generates, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
a model learning unit that performs machine learning of a NeRF (Neural Radiance Fields) model.
The model learning unit trains the NeRF model using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.

[0009] The image processing method according to this disclosure includes:
acquiring a plurality of images of an organ captured by an endoscope camera;
using the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
generating, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
performing machine learning of a NeRF model.
In the machine learning of the NeRF model, the NeRF model is trained using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.

[0010] The program according to this disclosure causes a computer to execute:
an image acquisition step of acquiring a plurality of images of an organ captured by an endoscope camera;
an image processing step of using the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
a pose generation step of generating, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
a model learning step of performing machine learning of a NeRF model.
In the model learning step, the NeRF model is trained using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.

[Effects of the Invention]

[0011] This disclosure provides a novel technique for observing the examination target from a free viewpoint.

[Brief Description of the Drawings]

[0012]
[Figure 1] A block diagram showing an example of the configuration of the image processing system according to Embodiment 1.
[Figure 2] A schematic diagram illustrating the processing using the NeRF model.
[Figure 3] A flowchart showing an example of the flow of the learning process of the NeRF model by the information processing device according to Embodiment 1.
[Figure 4] A flowchart showing an example of the flow of the free-viewpoint image generation operation by the information processing device according to the embodiment.
[Figure 5] A block diagram showing an example of the configuration of the image processing system according to Embodiment 2.
[Figure 6] A figure showing examples of images before and after conversion by the image conversion unit.
[Figure 7] A schematic diagram illustrating CycleGAN.
[Figure 8] A flowchart showing an example of the flow of the learning process of the NeRF model by the information processing device according to Embodiment 2.
[Figure 9] A schematic diagram of the NeRF model according to a modified example of Embodiment 2.
[Figure 10] A block diagram showing an example of a computer configuration.

[Modes for Carrying Out the Invention]

[0013] The embodiments will be described below with reference to the drawings. For clarity of explanation, the following descriptions and drawings have been omitted and simplified as appropriate. In addition, the same elements are denoted by the same reference numerals in each drawing, and redundant explanations have been omitted where necessary.

[0014] <Embodiment 1> Figure 1 is a block diagram showing an example of the configuration of an image processing system 1 according to Embodiment 1. The image processing system 1 shown in Figure 1 includes an information processing device 100, an endoscope camera 400, an input device 200, and a display device 300.

[0015] The endoscope camera 400 is a camera that images the organs to be examined inside the body of the patient. In this embodiment, as an example, the endoscope camera 400 specifically images the lumen of the stomach, but it may also image the lumen of other organs. For example, the endoscope camera 400 may image the esophagus, duodenum, small intestine, or large intestine. That is, in this embodiment, the organ may specifically be the esophagus, duodenum, small intestine, or large intestine. The endoscope camera 400 captures moving images (video) at a predetermined frame rate. That is, the endoscope camera 400 continuously captures multiple images at a predetermined frame rate. While imaging is taking place, the position and orientation (direction) of the endoscope camera 400 may be changed by a user, such as a doctor. That is, images are continuously captured while the position and orientation of the endoscope camera 400 are changing. However, even though the position and orientation of the endoscope camera 400 are changing, the images are not necessarily captured from the viewpoint desired by the user. In the following description, the position and orientation of the camera will be referred to as the camera pose. The endoscope camera 400 includes, for example, various lenses, an imaging sensor, a signal processing circuit, etc. As the imaging sensor, for example, a sensor such as a CCD (Charge Coupled Device) or CMOS (Complementary Metal-Oxide Semiconductor) is used. The image set 90, consisting of multiple images captured by the endoscope camera 400, is input to the information processing device 100. The image set 90 may be input directly from the endoscope camera 400 to the information processing device 100, or the image set 90 may be input to the information processing device 100 via another device or another communication network.

[0016] The input device 200 is a device for inputting data to the information processing device 100 and is communicably connected to the information processing device 100. The input device 200 is, for example, a pointing device or a keyboard. Examples of pointing devices include, but are not limited to, a mouse and a trackball.

[0017] The display device 300 is communicatively connected to the information processing device 100 and displays an image based on the display control of the information processing device 100. The display device 300 can be any known display capable of displaying an image, such as a liquid crystal display, plasma display, or organic EL (Electro-Luminescence) display. The input device 200 and the display device 300 may be integrated as a touch panel.

[0018] Next, the information processing device 100 will be described. The information processing device 100 generates a NeRF (Neural Radiance Fields) model using the image set 90 of an organ of a particular patient captured by the endoscope camera 400. The NeRF model is a model for reconstructing a 3D representation of a scene and is generated by deep learning. Then, based on the generated NeRF model, the information processing device 100 generates an image of the organ of the patient from an arbitrary viewpoint. With NeRF, it is possible to synthesize an image viewed from a novel viewpoint based on 2D images captured from various viewpoints. In contrast to conventional 3D reconstruction methods, which generate explicit and discrete 3D representations such as point clouds and meshes, NeRF learns an implicit and continuous 3D representation using a NeRF model composed of a neural network such as a multilayer perceptron (MLP).

[0019] Figure 2 is a schematic diagram illustrating the processing using the NeRF model. The NeRF model F_Θ is a neural network model that takes as input the 3D coordinate x of a point on a ray 50 emitted from a viewpoint and the direction d of the ray 50, and outputs the corresponding RGB color value c and volume density σ. Using the NeRF model F_Θ, RGB color values c and volume densities σ are obtained for a number of points (a predetermined number of sample points) on the ray 50. Volume rendering 52 is then performed using these values to synthesize the pixel color 53 or depth information 54 of the image seen when the examination target is viewed along the ray 50 as the line of sight. By setting various rays 50 from a single viewpoint according to the camera pose and performing the above processing, a two-dimensional image (specifically, a color image or a depth image) of the examination target as seen from a desired viewpoint can be synthesized.

[0020] In this embodiment, in order to accurately synthesize free-viewpoint images of organs such as the stomach, the NeRF model F_Θ is generated using a learning method that improves on conventional NeRF. In particular, as shown in Figure 2, the information processing device 100 generates the NeRF model F_Θ using not only rays from the actually observed field of view 51a but also rays from the field of view 51b, which was not actually observed. That is, the model learning unit 104 of the information processing device 100, described later, performs the learning process of the NeRF model F_Θ using not only the rays corresponding to the images actually captured by the endoscope camera 400 but also rays corresponding to fields of view other than those at the time of imaging by the endoscope camera 400. The details of the information processing device 100 according to this embodiment will be described below.

[0021] As shown in Figure 1, the information processing device 100 includes an image acquisition unit 101, an image processing unit 102, a pose generation unit 103, a model learning unit 104, a model storage unit 105, a pose reception unit 106, an image generation unit 107, and a display control unit 108. Note that some of the components of the information processing device 100 shown in Figure 1 may be implemented by other information processing devices. That is, the configuration shown as components of the information processing device 100 in Figure 1 may be distributed and processed by multiple information processing devices. Furthermore, the information processing device including the model learning unit 104 may be referred to as a learning device or the like.

[0022] The image acquisition unit 101 acquires multiple images of organs captured by the endoscope camera 400. More specifically, the image acquisition unit 101 acquires a set of images 90 of the lumen of the same subject being examined (organ) captured by the endoscope camera 400.

[0023] The image processing unit 102 uses the set of organ images 90 acquired by the image acquisition unit 101 to estimate, for each image, the camera pose of the endoscope camera 400 at the time of imaging. The image processing unit 102 also generates point cloud data of the examination target (organ), more specifically point cloud data of feature points of the lumen of the organ. Specifically, the image processing unit 102 applies SfM to the image set 90 acquired by the image acquisition unit 101 to estimate the camera pose of the endoscope camera 400 at the time of imaging for each image and to generate the point cloud data. Since it is widely known from Non-Patent Document 1 and elsewhere that these can be obtained by SfM, a detailed explanation of the processing of the image processing unit 102 is omitted. Hereafter, the camera pose estimated by the image processing unit 102 is also referred to as the first type of camera pose. The first type of camera pose can be said to be the camera pose corresponding to an image actually captured by the endoscope camera 400.

[0024] The pose generation unit 103 generates, based on the camera pose estimated by the image processing unit 102 (the first type of camera pose), a second type of camera pose, which is a new camera pose different from the first type of camera pose. The second type of camera pose can be described as a camera pose that does not correspond to any of the images actually captured by the endoscope camera 400.

[0025] In this embodiment, as an example, the pose generation unit 103 generates the second type of camera pose as follows. For any two images acquired by the image acquisition unit 101, the pose generation unit 103 generates k (k is an integer greater than or equal to 1) second-type camera poses that interpolate between the two corresponding first-type camera poses. Here, as an example, the two images are images of two consecutive frames, but they do not necessarily have to be. The camera pose is denoted P and is defined as in Equation 1 below. In Equation 1, z is the three-dimensional real-valued coordinate of the camera's viewpoint, and q is the unit quaternion representing the camera's orientation.

[0026] <Equation 1>
$$P = (z, q)$$

[0027] The pose generation unit 103 takes two first-type camera poses P_b = (z_b, q_b) and P_{b+1} = (z_{b+1}, q_{b+1}) and generates k new camera poses P_k = (z_k, q_k) that interpolate between them. Here, z_k and q_k are given by Equation 2 below. In Equation 2, α_k is a numerical value randomly sampled from the range from 0 to 1.

[0028] <Equation 2> TIFF2026110351000003.tif24110
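
As a non-limiting illustration, the pose generation of paragraph [0027] can be sketched in Python as follows. The sketch assumes that the position z_k is linearly interpolated and the orientation q_k is spherically interpolated (slerp) with the weight α_k. This is one common way to interpolate a pose and is not confirmed to match Equation 2. The function names lerp, slerp, and generate_novel_poses and the (w, x, y, z) quaternion layout are hypothetical.

```python
import math
import random


def lerp(z_a, z_b, alpha):
    """Linear interpolation between two 3D positions (assumed form for z_k)."""
    return tuple((1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b))


def slerp(q_a, q_b, alpha):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q_a, q_b))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q_b = tuple(-c for c in q_b)
        dot = -dot
    dot = min(dot, 1.0)
    theta = math.acos(dot)
    if theta < 1e-6:
        # Nearly identical orientations: fall back to linear blending.
        q = tuple((1.0 - alpha) * a + alpha * b for a, b in zip(q_a, q_b))
    else:
        s = math.sin(theta)
        w_a = math.sin((1.0 - alpha) * theta) / s
        w_b = math.sin(alpha * theta) / s
        q = tuple(w_a * a + w_b * b for a, b in zip(q_a, q_b))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def generate_novel_poses(pose_b, pose_b1, k, rng=random):
    """Return k poses (z_k, q_k) lying between two estimated poses."""
    poses = []
    for _ in range(k):
        alpha = rng.random()  # alpha_k, sampled from the range 0 to 1
        z_k = lerp(pose_b[0], pose_b1[0], alpha)
        q_k = slerp(pose_b[1], pose_b1[1], alpha)
        poses.append((z_k, q_k))
    return poses
```

Each pose is passed as a pair (z, q), so two consecutive-frame poses and a value of k yield k second-type camera poses whose positions lie on the segment between the two viewpoints.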

[0029] The model learning unit 104 performs machine learning on the NeRF model F_Θ. Specifically, the model learning unit 104 performs machine learning (deep learning) on the NeRF model F_Θ using the loss function described below, thereby generating the trained NeRF model F_Θ. The model learning unit 104 performs this learning process according to a predetermined learning algorithm, such as backpropagation.

[0030] The details of the learning process of the NeRF model F_Θ by the model learning unit 104 are described below. In this embodiment, the model learning unit 104 repeatedly performs the learning process using two types of rays from the camera (camera rays), called color rays and depth rays.

[0031] A color ray is a ray from the camera's origin in the direction corresponding to a pixel that makes up the captured two-dimensional image, and the color of the ground truth for the color ray is the color of that pixel in the image. In other words, for the color ray, the color information of the pixels in the image that was actually captured is referenced as training data. A depth ray is a ray from the camera's origin in the direction of a point that makes up the point cloud data generated by the image processing unit 102 and is within the camera's field of view. For the depth ray, depth information calculated using the point cloud data generated by the image processing unit 102 is referenced as training data. Here, the depth information can also be said to be the distance from the camera origin, which is determined by the camera pose, to a point that exists on the depth ray. Alternatively, the depth information can also be said to be the distance between the camera origin, which is determined by the camera pose, and an object that exists on the ray.
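
As a non-limiting illustration, the two kinds of camera rays in paragraph [0031] can be sketched in Python as follows. The depth ray follows directly from the description: its direction points from the camera origin to a point of the point cloud data, and its reference depth is the distance between the two. The color ray additionally assumes a pinhole camera looking along +z with intrinsics fx, fy, cx, cy and a camera-to-world rotation matrix; these parameters and all function names are hypothetical and do not appear in the description.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def depth_ray(origin, point):
    """Ray from the camera origin toward a point-cloud point, with its reference depth."""
    diff = tuple(p - o for p, o in zip(point, origin))
    depth = math.sqrt(sum(c * c for c in diff))  # distance from origin to the point
    return origin, normalize(diff), depth


def color_ray(origin, rot_c2w, u, v, fx, fy, cx, cy):
    """Ray through pixel (u, v) for an assumed pinhole camera looking along +z."""
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the camera-frame direction into the world frame.
    d_world = tuple(
        sum(rot_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)
    )
    return origin, normalize(d_world)
```

The training target of a color ray is the color of pixel (u, v) in the captured image, while the training target of a depth ray is the depth value returned by depth_ray.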

[0032] Color ground truth is available only for the field of view corresponding to a captured image, i.e., the observed field of view. Therefore, for color rays, only the color rays corresponding to the observed field of view are sampled for training the NeRF model F_Θ. In other words, only the color rays corresponding to the first type of camera pose are used in the training process. For depth rays, on the other hand, both the depth rays corresponding to the observed field of view, i.e., the first type of camera pose, and the depth rays corresponding to the unobserved field of view, i.e., the second type of camera pose, are sampled for training the NeRF model F_Θ. In the following, the set of color rays used in the learning process of the NeRF model F_Θ is denoted R_c^ob. Of the set of depth rays used in the learning process of the NeRF model F_Θ, the set of depth rays corresponding to the observed field of view is denoted R_d^ob, and the set of depth rays corresponding to unobserved fields of view is denoted R_d^nv.

[0033] Furthermore, using the 3D coordinates o of the camera origin, the 3D unit direction vector d of the camera ray, and the parameter t that specifies the distance, the camera ray r(t) will be expressed as shown in Equation 3 below. In other words, the position r(t) along the camera ray is expressed as shown in Equation 3 below. Note that the 3D unit direction vector d will be simply referred to as "direction d" below.

[0034] <Equation 3>
$$r(t) = o + t\,d$$

[0035] The model learning unit 104 and the image generation unit 107, described later, perform volume rendering of color information for the camera ray shown in Equation 3 using Equation 4, as in Non-Patent Document 2. That is, in the learning phase, the model learning unit 104 performs volume rendering using the output of the NeRF model F_Θ according to Equation 4. In Equation 4, C^(r) represents the color rendered for the camera ray. T(t) is the cumulative transmittance defined in Equation 5 below; it accounts for how much light is absorbed by particles in the medium as the ray travels from the camera origin to the position specified by the value of t. t_n and t_f are predetermined values that define the range of t: t_n is the boundary value of t on the side closer to the camera origin, and t_f is the boundary value of t on the side farther from the camera origin. σ(r(t)) is the volume density at position r(t), and c(r(t), d) is the color at position r(t) with respect to direction d. σ(r(t)) and c(r(t), d) are the pair of volume density and color output by the NeRF model F_Θ, which is a multilayer perceptron network, when the input (r(t), d) is given.

[0036] <Equation 4>
$$\hat{C}(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt$$

[0037] <Equation 5>
$$T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$$

[0038] Furthermore, the model learning unit 104 and the image generation unit 107, described later, perform volume rendering of depth information for the camera ray shown in Equation 3 using Equation 6 below. That is, in the learning phase, the model learning unit 104 performs volume rendering using the output of the NeRF model F_Θ according to Equation 6. In Equation 6, D^(r) represents the rendered depth corresponding to the camera ray.

[0039] <Equation 6>
$$\hat{D}(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,t\,dt$$
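
As a non-limiting illustration, the volume rendering of Equations 3 to 6 can be sketched in Python as follows. The sketch assumes the piecewise-constant quadrature commonly used with NeRF, in which each sample at distance t_i covers the segment up to the next sample and contributes with the weight T_i (1 - exp(-σ_i δ_i)). The function names point_on_ray and render_ray, and the choice to close the last segment at t_far, are hypothetical.

```python
import math


def point_on_ray(origin, direction, t):
    """Equation 3: position r(t) = o + t * d on a camera ray."""
    return tuple(o + t * d for o, d in zip(origin, direction))


def render_ray(t_vals, sigmas, colors, t_far):
    """Approximate the color and depth integrals for one ray.

    t_vals: sorted sample distances in [t_n, t_f]
    sigmas: volume density at each sample
    colors: RGB tuple at each sample
    t_far:  far bound t_f, used to close the last segment
    """
    transmittance = 1.0  # T: fraction of light not yet absorbed
    color = [0.0, 0.0, 0.0]
    depth = 0.0
    weights = []
    for i, (t, sigma, c) in enumerate(zip(t_vals, sigmas, colors)):
        t_next = t_vals[i + 1] if i + 1 < len(t_vals) else t_far
        delta = t_next - t
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha               # contribution of this sample
        weights.append(w)
        color = [acc + w * ch for acc, ch in zip(color, c)]
        depth += w * t
        transmittance *= 1.0 - alpha
    return tuple(color), depth, weights
```

In use, the NeRF model would be queried at point_on_ray(o, d, t_i) for each sample to obtain sigmas and colors. The returned weights describe how the density is distributed along the ray, which is the quantity constrained by a unimodality loss.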

[0040] In this embodiment, in order to calculate the loss functions described below, the model learning unit 104 renders color information and depth information for camera rays belonging to the set R_c^ob (i.e., color rays), and renders depth information for camera rays belonging to the set R_d^ob or the set R_d^nv (i.e., depth rays).

[0041] Next, the loss functions calculated by the model learning unit 104 will be explained. The model learning unit 104 calculates the following loss functions using the rendering results described above for the camera rays in R_c^ob, R_d^ob, and R_d^nv. Specifically, the model learning unit 104 calculates a color-based loss and a geometry-based loss.

[0042] First, the color-based loss will be explained. For camera rays belonging to the set R_c^ob, the model learning unit 104 calculates the loss function L_color^ob expressed by Equation 7 below. Here, C^(r) is the rendered color for a camera ray belonging to the set R_c^ob, and C(r) is the ground truth corresponding to C^(r), namely the color identified from the captured image. Thus, the model learning unit 104 calculates a loss function that represents the difference between the color information estimated based on the output of the NeRF model F_Θ for rays corresponding to the first type of camera pose (observed field of view) and the color information that is the training data identified from the captured image.

[0043] <Equation 7> TIFF2026110351000008.tif23102

[0044] Next, the geometry-based loss for the observed field of view will be explained. For camera rays belonging to the set R_d^ob, the model learning unit 104 calculates the loss function l_d(r) expressed by Equation 8 below. Here, D^(r) is the rendered depth value for a camera ray belonging to the set R_d^ob, and D_pc(r) is the reference value corresponding to D^(r), namely the depth value identified based on the point cloud data generated by the image processing unit 102. In this way, the model learning unit 104 calculates a loss function that represents the difference between the depth information estimated based on the output of the NeRF model F_Θ for rays corresponding to the first type of camera pose (observed field of view) and the depth information that is the training data identified from the point cloud data. As will be described later, a similar loss function is also calculated for rays corresponding to the second type of camera pose (unobserved field of view).

[0045] <Equation 8> TIFF2026110351000009.tif1485
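
As a non-limiting illustration, the color-based loss and the depth loss described in paragraphs [0042] and [0044] can be sketched in Python as follows. The description states only that each loss represents a difference between a rendered value and its training data, so the squared-error form used here is an assumption and may differ from Equations 7 and 8. The function names color_loss and depth_loss are hypothetical.

```python
def color_loss(rendered_colors, target_colors):
    """Sum of squared RGB errors over a batch of color rays (assumed form)."""
    return sum(
        sum((c_hat - c_gt) ** 2 for c_hat, c_gt in zip(pred, gt))
        for pred, gt in zip(rendered_colors, target_colors)
    )


def depth_loss(rendered_depth, point_cloud_depth):
    """Squared error between rendered depth and point-cloud depth for one ray (assumed form)."""
    return (rendered_depth - point_cloud_depth) ** 2
```

The same depth_loss can be evaluated for rays of the first type of camera pose and for rays of the second type of camera pose, since both are supervised by depth values derived from the point cloud data.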

[0046] Furthermore, in this embodiment, the model learning unit 104 also calculates a smoothness loss. The inner surface of organs such as the stomach is typically smooth and continuous. For this reason, the model learning unit 104 calculates the following smoothness loss so that the NeRF model F_Θ is trained under the constraint that the rendered depth for camera rays belonging to the set R_c^ob or R_d^ob is locally smooth. Specifically, for camera rays belonging to the set R_c^ob or R_d^ob, the model learning unit 104 calculates the loss function l_s(r) expressed by Equation 9 below. l_s(r) can also be said to represent the smoothness of the organ surface represented by the NeRF model F_Θ. In Equation 9, ∂_u represents the differential operation with respect to the horizontal direction (first direction) in the image plane, and ∂_v represents the differential operation with respect to the vertical direction (the second direction, orthogonal to the first direction) in the image plane. Thus, in this embodiment, the model learning unit 104 trains the NeRF model F_Θ further using, as a loss function, the smoothness of the organ surface calculated from the depth information (a set of depth information) estimated based on the output of the NeRF model F_Θ for rays corresponding to the first type of camera pose (observed field of view). As will be described later, a similar loss function is also calculated for rays corresponding to the second type of camera pose (unobserved field of view).

[0047] <Equation 9> TIFF2026110351000010.tif1386
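
As a non-limiting illustration, a smoothness loss built from the horizontal and vertical derivatives of the rendered depth can be sketched in Python as follows. The sketch assumes that depths are rendered on a small grid of neighboring rays, that the derivatives ∂_u and ∂_v are approximated by forward differences, and that their absolute values are averaged. The exact form of Equation 9 may differ, and the function name smoothness_loss is hypothetical.

```python
def smoothness_loss(depth_patch):
    """Mean absolute forward difference over a 2D grid of rendered depths.

    depth_patch[v][u] is the rendered depth of the ray at row v, column u.
    The patch must be at least 2 x 2.
    """
    rows, cols = len(depth_patch), len(depth_patch[0])
    total, count = 0.0, 0
    for v in range(rows - 1):
        for u in range(cols - 1):
            du = depth_patch[v][u + 1] - depth_patch[v][u]  # horizontal derivative
            dv = depth_patch[v + 1][u] - depth_patch[v][u]  # vertical derivative
            total += abs(du) + abs(dv)
            count += 1
    return total / count
```

A patch of constant depth yields a loss of zero, while abrupt depth changes between neighboring rays increase the loss, which matches the stated assumption that the inner surface of the organ is smooth and continuous.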

[0048] Furthermore, in this embodiment, the model learning unit 104 also calculates the KL divergence loss introduced in Non-Patent Document 3. When the endoscope camera 400 images the lumen of an organ, it is assumed that the only object present along the direction of travel of a camera ray is the inner surface of the organ. That is, the camera ray should not intersect any object in front of or behind the inner surface of the organ. For this reason, for camera rays belonging to the set R_d^ob, the model learning unit 104 calculates the KL divergence loss l_KL(r) so that the NeRF model F_Θ is trained under the constraint that the distribution of the volume density output from the NeRF model F_Θ is unimodal. For details of the KL divergence loss, refer to Non-Patent Document 3. Thus, in this embodiment, the model learning unit 104 trains the NeRF model F_Θ further using, as a loss function, the degree of unimodality of the distribution of the volume density output from the NeRF model F_Θ for rays corresponding to the first type of camera pose (observed field of view). As will be described later, a similar loss function is also calculated for rays corresponding to the second type of camera pose (unobserved field of view).
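
As a non-limiting illustration, a KL-divergence-style loss that encourages a unimodal density distribution along a ray can be sketched in Python as follows. The sketch assumes the general form used for depth supervision in Non-Patent Document 3, in which the negative log of each rendering weight is penalized in proportion to a narrow Gaussian centered on the reference depth. The function name kl_depth_loss, the default value of sigma, and the small constant inside the logarithm are hypothetical.

```python
import math


def kl_depth_loss(weights, t_vals, t_far, ref_depth, sigma=0.1):
    """Penalize rendering weights that are small near the reference depth.

    weights:   per-sample rendering weights along one ray
    t_vals:    sorted sample distances along the ray
    t_far:     far bound, used to close the last segment
    ref_depth: reference depth D_pc(r) from the point cloud data
    sigma:     assumed width of the Gaussian around the reference depth
    """
    total = 0.0
    for i, (w, t) in enumerate(zip(weights, t_vals)):
        t_next = t_vals[i + 1] if i + 1 < len(t_vals) else t_far
        delta = t_next - t
        gauss = math.exp(-((t - ref_depth) ** 2) / (2.0 * sigma ** 2))
        # 1e-10 avoids log(0) for samples with zero weight.
        total += -math.log(w + 1e-10) * gauss * delta
    return total
```

The loss is small when the rendering weight is concentrated at the reference depth and grows when the weight is placed elsewhere along the ray.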

[0049] In this embodiment, the final geometry-based loss function L_depth^ob for the observed field of view is the sum of the losses described above and is expressed as Equation 10 below. Note that λ_d, λ_KL, and λ_s are hyperparameters. For example, λ_d = 10, λ_KL = 0.1, and λ_s = 10 may be used, but other values may be used as the specific values of the hyperparameters.

[0050] <Equation 10> TIFF2026110351000011.tif50129

[0051] The geometry-based loss for the observed field of view has been explained above. The model learning unit 104 also calculates a geometry-based loss function for the unobserved field of view in the same manner. That is, the model learning unit 104 calculates a loss function representing the difference between the depth information estimated based on the output of the NeRF model F_Θ for rays corresponding to the second type of camera pose (unobserved field of view) and the depth information that is the training data identified from the point cloud data. The model learning unit 104 also uses, as a loss function, the smoothness of the organ surface calculated from the depth information (a set of depth information) estimated based on the output of the NeRF model F_Θ for rays corresponding to the second type of camera pose (unobserved field of view). In addition, the model learning unit 104 uses, as a loss function, the degree of unimodality of the distribution of the volume density output from the NeRF model F_Θ for rays corresponding to the second type of camera pose (unobserved field of view).

[0052] In this embodiment, the final geometry-based loss function L_depth^nv for the unobserved field of view is the sum of the losses described above and is expressed as Equation 11 below. For example, λ_d = 10, λ_KL = 0.1, and λ_s = 10 may be set, but other values may be used as the specific values of the hyperparameters.

[0053] <Equation 11> [equation image: TIFF2026110351000012.tif]

[0054] In this embodiment, the final loss function L_total used by the model learning unit 104 for training the NeRF model F_Θ is expressed as Equation 12 below. That is, the model learning unit 104 repeats the learning process until the value of the loss function L_total shown in Equation 12 becomes sufficiently small.

[0055] <Equation 12> [equation image: TIFF2026110351000013.tif]
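The way the individual losses may be combined can be sketched as follows. Because the equation images are not available in this text, the sketch rests on assumptions: each geometry-based loss is taken to be a weighted sum in which λ_d weights the depth term, λ_KL weights the KL divergence term, and λ_s weights the smoothness term, and the total loss is taken to be a plain sum of the color loss and the two geometry-based losses. The function and argument names are hypothetical.

```python
def geometry_loss(l_depth, l_kl, l_smooth, lam_d=10.0, lam_kl=0.1, lam_s=10.0):
    """Weighted sum of depth, KL divergence, and smoothness terms (assumed pairing)."""
    return lam_d * l_depth + lam_kl * l_kl + lam_s * l_smooth

def total_loss(l_color, l_depth_ob, l_depth_nv):
    """Assumed unweighted sum of the color loss and both geometry-based losses."""
    return l_color + l_depth_ob + l_depth_nv

# Example values for one training step (arbitrary numbers for illustration).
l_ob = geometry_loss(l_depth=0.02, l_kl=0.5, l_smooth=0.01)  # observed field of view
l_nv = geometry_loss(l_depth=0.03, l_kl=0.7, l_smooth=0.02)  # unobserved field of view
l_total = total_loss(l_color=0.1, l_depth_ob=l_ob, l_depth_nv=l_nv)
```

Training would then continue until `l_total` is judged sufficiently small, with the threshold left to the implementer.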

[0056] When the model learning unit 104 completes training of the NeRF model F_Θ using the loss functions described above, it stores the generated trained NeRF model F_Θ in the model storage unit 105. That is, data such as trained parameters representing the trained model generated by the machine learning processing of the model learning unit 104 is stored in the model storage unit 105. The trained model (NeRF model F_Θ) generated by the model learning unit 104 can be used as a computer program module that causes a computer to output data used for rendering free-viewpoint images (i.e., images depicting organs as seen from a specified camera pose).

[0057] The above describes the configuration of the information processing device 100, particularly the configuration related to the NeRF model learning phase. Next, we will describe the configuration of the information processing device 100, particularly the configuration related to the display of free-viewpoint images using the NeRF model.

[0058] The pose reception unit 106 receives the camera pose specification. The user inputs an instruction to the information processing device 100 via the input device 200, specifying a camera pose corresponding to the desired viewpoint, in order to have the information processing device 100 generate an image of the subject's organs as seen from that viewpoint. The pose reception unit 106 receives this input.

[0059] The image generation unit 107 generates a new image of the organ with the camera pose received by the pose reception unit 106 as the viewpoint. Specifically, the image generation unit 107 generates the new image based on the output of the NeRF model obtained by inputting the camera pose specified by the user into the trained NeRF model. More specifically, the image generation unit 107 casts M × N rays (where M and N are integers) within the field of view of the specified camera pose, and estimates the volume density and color of each point on each ray using the NeRF model. Then, for each ray, the image generation unit 107 estimates the color of the one pixel corresponding to that ray by performing volume rendering according to Equation 4. By performing this process for each ray, the image generation unit 107 generates a two-dimensional color image of M × N pixels. The image generation unit 107 may also generate a depth image as a new image of the organ with the specified camera pose as the viewpoint. In this case, the image generation unit 107 estimates the volume density of each point on each ray using the NeRF model, and then estimates the depth of the one pixel corresponding to each ray by performing volume rendering according to Equation 6. By performing this process for each ray, the image generation unit 107 generates a two-dimensional depth image of M × N pixels. In this way, in the inference phase, the image generation unit 107 performs volume rendering according to Equation 4 or Equation 6 using the output of the NeRF model F_Θ. The image generation unit 107 may generate either a two-dimensional color image or a two-dimensional depth image for the camera pose received by the pose reception unit 106, or it may generate both.
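The per-ray rendering described above can be sketched as follows. This is a minimal example that assumes Equations 4 and 6 follow the standard discrete NeRF quadrature, in which each sample is weighted by its accumulated transmittance and opacity; the equations themselves are not reproduced in this text. The names `render_ray`, `render_image`, and `wall_field` are hypothetical, and `wall_field` is only a stand-in for the trained NeRF model.

```python
import numpy as np

def render_ray(sigma, rgb, t):
    """Composite the color and expected depth of one ray from S samples."""
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))  # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                 # opacity of each sample
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    w = trans * alpha                                    # contribution of each sample
    return (w[:, None] * rgb).sum(axis=0), (w * t).sum()

def render_image(field, origins, dirs, t):
    """origins, dirs: (M, N, 3). field(points, direction) -> (sigma (S,), rgb (S, 3))."""
    m, n, _ = dirs.shape
    color = np.zeros((m, n, 3))
    depth = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            points = origins[i, j] + t[:, None] * dirs[i, j]
            sigma, rgb = field(points, dirs[i, j])
            color[i, j], depth[i, j] = render_ray(sigma, rgb, t)
    return color, depth

def wall_field(points, direction):
    # Hypothetical stand-in for the trained model: an opaque red wall at z >= 1.
    sigma = np.where(points[:, 2] >= 1.0, 1e3, 0.0)
    rgb = np.tile(np.array([1.0, 0.0, 0.0]), (len(points), 1))
    return sigma, rgb

t = np.linspace(0.1, 2.0, 96)
origins = np.zeros((2, 3, 3))                              # M = 2, N = 3 rays
dirs = np.tile(np.array([0.0, 0.0, 1.0]), (2, 3, 1))      # all rays look along +z
color_img, depth_img = render_image(wall_field, origins, dirs, t)
```

The nested loop keeps the correspondence of one ray to one pixel explicit; a practical implementation would more likely batch all M × N rays into a single array operation.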

[0060] The display control unit 108 performs the process of displaying the image on the display device 300. In particular, in this embodiment, the display control unit 108 displays the image generated by the image generation unit 107 on the display device 300. This allows the user to view images of the subject's organs from a desired viewpoint.

[0061] Next, we will explain the operation flow of the information processing device 100. Figure 3 is a flowchart showing an example of the learning operation flow of the NeRF model by the information processing device 100. The learning operation flow of the NeRF model will be explained below in accordance with Figure 3.

[0062] In step S10, the image acquisition unit 101 acquires a set of images of organs captured by the endoscope camera 400. Next, in step S11, the image processing unit 102 uses the image set acquired in step S10 to estimate the camera pose of the endoscope camera 400 at the time of imaging for each image and to generate point cloud data of the imaged object. Next, in step S12, the pose generation unit 103 generates a new camera pose (an unobserved field of view). Next, in step S13, the model learning unit 104 performs machine learning on the NeRF model using the image set obtained in step S10, the point cloud data obtained in step S11, and the camera poses obtained in steps S11 and S12. As a result, a trained NeRF model is generated and stored in the model storage unit 105.

[0063] Figure 4 is a flowchart illustrating an example of the flow of free-viewpoint image generation by the information processing device 100. The flow of free-viewpoint image generation will be explained below in accordance with Figure 4.

[0064] In step S20, the pose reception unit 106 receives a camera pose specification from the user. Next, in step S21, the image generation unit 107 generates an image in which the object to be inspected is drawn using the camera pose received in step S20 as the viewpoint, using the trained NeRF model. Next, in step S22, the display control unit 108 causes the image generated in step S21 to be displayed on the display device 300.

[0065] Embodiment 1 has been described above. This embodiment provides a novel technique for observing the subject of examination from various viewpoints. As a result, users such as doctors can observe the subject of examination from various viewpoints, enabling them to appropriately detect lesions such as cancer. In particular, since the NeRF model is trained using not only the pose estimated by the image processing unit 102 but also the pose generated by the pose generation unit 103, a more accurate model can be generated compared to the case where only the pose estimated by the image processing unit 102 is used. Furthermore, since observation from various viewpoints is possible any number of times after imaging of the subject by the endoscope camera 400 is completed, the burden on the subject can be reduced while improving the convenience of the user (such as a doctor). In particular, by using the NeRF model trained using the loss function described above, it is possible to generate images with higher accuracy compared to the case where training is not performed using the loss function.

[0066] <Embodiment 2> Next, Embodiment 2 will be described. This embodiment differs from Embodiment 1 in that image conversion processing is performed on the images acquired by the image acquisition unit 101. Hereinafter, descriptions of configurations and processes similar to those in Embodiment 1 will be omitted as appropriate, and configurations and processes different from Embodiment 1 will be described in detail. In this embodiment as well, the organ is the stomach as an example, but it may be another organ, such as the esophagus, duodenum, small intestine, or large intestine.

[0067] As described above, in Embodiment 1, the image processing unit 102 uses the set of organ images acquired by the image acquisition unit 101 to estimate the camera pose of the endoscope camera 400 at the time of imaging for each image and to generate point cloud data of the organs. At this time, if the set of images acquired by the image acquisition unit 101 is an image of an organ that has not been stained with a dye (staining solution), the texture (features) of the organ surface will be poor, and there is a risk that the camera pose estimation and point cloud data generation will not be performed appropriately in the processing of the image processing unit 102. In order for the processing of the image processing unit 102 to be performed appropriately, it is preferable that the images used for processing be images of an organ that has been stained with a dye (staining solution). This is because staining the surface of the organ with a dye emphasizes the unevenness and color tone of the organ surface (inner surface). However, in order to actually color an organ with a dye, it is necessary to spray the dye onto the organ. Therefore, the burden on the subject to be sprayed with the dye is significant, and it also takes a long time to obtain the image.

[0068] Therefore, in this embodiment, the image conversion unit 121, described later, converts the unstained organ image acquired by the image acquisition unit 101 into an image of an organ that has been virtually stained with a predetermined dye, using CycleGAN (Cycle-Consistent Generative Adversarial Networks) image conversion.

[0069] Figure 5 is a block diagram showing an example of the configuration of the image processing system 1a according to Embodiment 2. The image processing system 1a shown in Figure 5 differs from the image processing system 1 shown in Figure 1 in that the information processing device 100 is replaced by the information processing device 100a. Furthermore, the information processing device 100a differs from the information processing device 100 according to Embodiment 1 in that an image conversion unit 121 and an image selection receiving unit 122 are added. In this embodiment, the image set (multiple images) acquired by the image acquisition unit 101 is assumed to be images of organs that have not been stained with dye.

[0070] The image conversion unit 121 uses a pre-trained CycleGAN model to convert each of the multiple images acquired by the image acquisition unit 101 into an image of an organ stained with a predetermined dye. In this embodiment, as an example, the predetermined dye is indigo carmine dye, but other dyes used for organ staining may be used as the predetermined dye. Figure 6 shows an example of image 90a before conversion by the image conversion unit 121 and image 90b after conversion by the image conversion unit 121.

[0071] The CycleGAN model may be stored in the model storage unit 105 beforehand, or it may be generated by the model learning unit 104. The learning process (generation process) of the CycleGAN model by the model learning unit 104 will be described below.

[0072] Figure 7 is a schematic diagram illustrating CycleGAN. CycleGAN is a network that includes two sets of Generative Adversarial Networks (GANs). In Figure 7, Generator G1 and Discriminator D1 constitute the first Generative Adversarial Network. Generator G2 and Discriminator D2 constitute the second Generative Adversarial Network. In this embodiment, Generator G1 is the CycleGAN model used by the image conversion unit 121. In this disclosure, "CycleGAN model" may mean any one of Generator G1, Generator G2, Discriminator D1, or Discriminator D2, or two or more of them, or the whole of them.

[0073] Generator G1 is a neural network that generates an output image from an input image. Specifically, when an image is input to generator G1, it outputs an image of an organ stained with a predetermined dye (e.g., indigo carmine dye). For example, when an image x_r of an organ that is not stained with dye, actually captured by the endoscope camera 400, is input, generator G1 outputs a fake image y_f of the organ stained with the predetermined dye (that is, an image of the organ that has been virtually stained with the predetermined dye). In other words, in this case, generator G1 outputs the image y_f obtained by virtually staining the input image x_r with the predetermined dye. As mentioned above, generator G1 corresponds to the CycleGAN model used by the image conversion unit 121.

[0074] Hereafter, images of organs that are not stained with a dye will be referred to as unstained images. Images of organs that have been stained with a predetermined dye will be referred to as stained images. Specifically, stained images actually captured by the endoscope camera 400 (i.e., images of the organ after it has actually been stained with the predetermined dye) will be referred to as real stained images, and unstained images actually captured by the endoscope camera 400 (i.e., images of the organ captured without any staining) will be referred to as real unstained images. Furthermore, fake images of organs stained with the predetermined dye, i.e., fake stained images, will also be referred to as virtual stained images. Likewise, fake images of organs that are not stained with the predetermined dye, i.e., fake unstained images, will also be referred to as virtual unstained images.

[0075] Generator G2 is also a neural network that generates an output image from an input image. Specifically, when an image is input to generator G2, it outputs an unstained image. For example, when a real stained image y_r actually captured by the endoscope camera 400 is input, generator G2 outputs a fake unstained image x_f (that is, an image of the organ from which the staining with the predetermined dye has been virtually removed). In other words, in this case, generator G2 outputs the image x_f obtained by virtually removing the dye from the input image y_r.

[0076] Thus, generator G1 converts images belonging to the first domain X (the domain of unstained images) to images belonging to the second domain Y (the domain of stained images), while generator G2 converts images belonging to the second domain Y to images belonging to the first domain X.

[0077] The discriminator D1 is a neural network that determines whether the input image is a real image or a fake image generated by the generator G1. Specifically, the discriminator D1 receives as input either a fake stained image y_f generated by the generator G1 or a real stained image y_r actually captured by the endoscope camera 400, and outputs a discrimination result indicating which of the two the input image is.

[0078] The discriminator D2 is a neural network that determines whether the input image is a real image or a fake image generated by the generator G2. Specifically, the discriminator D2 receives as input either a fake unstained image x_f generated by the generator G2 or a real unstained image x_r actually captured by the endoscope camera 400, and outputs a discrimination result indicating which of the two the input image is.

[0079] For machine learning of generators G1 and G2 and discriminators D1 and D2, the image acquisition unit 101 acquires two types of image sets. The first image set is composed of multiple real unstained images x_r. Note that the first image set may consist of images of organs from different subjects. The second image set is composed of multiple real stained images y_r. The second image set may also consist of images of organs from different subjects. Here, it is not necessary for the images belonging to the first image set and the images belonging to the second image set to correspond to each other. In other words, the model learning unit 104 uses these two unpaired image sets as training data to perform machine learning on the CycleGAN model.

[0080] The model learning unit 104 performs machine learning on generator G1, generator G2, discriminator D1, and discriminator D2 using the two image sets described above, according to the CycleGAN learning algorithm. As part of the learning process, the model learning unit 104 updates the parameters of each CycleGAN model, for example, as follows. The model learning unit 104 calculates the so-called adversarial loss for the first generative adversarial network (generator G1 and discriminator D1) and updates the parameters of generator G1 and discriminator D1 based on the calculated loss. Similarly, the model learning unit 104 calculates the adversarial loss for the second generative adversarial network (generator G2 and discriminator D2) and updates the parameters of generator G2 and discriminator D2 based on the calculated loss. In addition, the model learning unit 104 updates the parameters of generator G1 and generator G2 by calculating the so-called cycle consistency loss. When the model learning unit 104 finishes training the CycleGAN model, it stores the generated trained CycleGAN model in the model storage unit 105. That is, data such as trained parameters representing the trained model generated by the machine learning process of the model learning unit 104 are stored in the model storage unit 105. In this embodiment, the image conversion unit 121 uses only generator G1 among generator G1, generator G2, discriminator D1, and discriminator D2, so the CycleGAN model stored in the model storage unit 105 may consist only of generator G1. The trained model (generator G1) generated by the model learning unit 104 can be used as a computer program module that causes a computer to function to convert a real unstained image x_r into a virtual stained image y_f.
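The two kinds of loss mentioned above can be illustrated with the following sketch. It assumes the L1 form of the cycle consistency loss and the log-likelihood form of the adversarial loss that are commonly used with CycleGAN; this disclosure does not specify which variants the model learning unit 104 uses. The generators and the discriminator are replaced by simple hypothetical functions so that the example runs without a deep learning framework.

```python
import numpy as np

def cycle_consistency_loss(g1, g2, x_real, y_real):
    """L1 cycle loss: G2(G1(x)) should recover x, and G1(G2(y)) should recover y."""
    forward = np.mean(np.abs(g2(g1(x_real)) - x_real))
    backward = np.mean(np.abs(g1(g2(y_real)) - y_real))
    return forward + backward

def adversarial_loss(d, real, fake, eps=1e-8):
    """Discriminator-side GAN loss; d returns the probability that an image is real."""
    return -np.mean(np.log(d(real) + eps)) - np.mean(np.log(1.0 - d(fake) + eps))

rng = np.random.default_rng(0)
x_real = rng.random((4, 8, 8, 3))   # stand-in for real unstained images x_r
y_real = rng.random((4, 8, 8, 3))   # stand-in for real stained images y_r (unpaired)

g1 = lambda x: x + 0.2              # hypothetical "virtual staining" generator
g2_exact = lambda y: y - 0.2        # exact inverse of g1
g2_biased = lambda y: y - 0.1       # imperfect inverse of g1
d_undecided = lambda images: np.full(images.shape[0], 0.5)

loss_exact = cycle_consistency_loss(g1, g2_exact, x_real, y_real)
loss_biased = cycle_consistency_loss(g1, g2_biased, x_real, y_real)
loss_adv = adversarial_loss(d_undecided, y_real, g1(x_real))
```

When the two generators are exact inverses of each other, the cycle consistency loss is essentially zero, which is the property that lets the two unpaired image sets be used for training.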

[0081] In this embodiment, it is sufficient that the image conversion unit 121 can utilize the generator G1, and the model learning unit 104 does not necessarily have to perform the CycleGAN learning process. That is, the image conversion unit 121 may use a generator G1 that has been pre-trained by another information processing device. The image conversion unit 121 converts each of the multiple real unstained images x_r acquired by the image acquisition unit 101 into a virtual stained image y_f. Specifically, the image conversion unit 121 inputs each image acquired by the image acquisition unit 101 into the pre-trained CycleGAN model (generator G1) stored in the model storage unit 105 and obtains the virtual stained image y_f output from the generator G1, thereby converting the image into a virtual stained image y_f, which is an image of the organ that has been virtually stained with the predetermined dye.

[0082] In this embodiment, the image processing unit 102 uses the set of virtual stained images y_f converted by the image conversion unit 121 to estimate the camera pose of the endoscope camera 400 at the time of imaging for each image acquired by the image acquisition unit 101, and to generate point cloud data of the object to be examined (organ). In other words, in this embodiment, the image processing unit 102 performs SfM processing on the image set converted by the image conversion unit 121. This allows the image processing unit 102 to perform processing using an image set in which the unevenness and color tone of the organ's surface (inner surface) are emphasized. Therefore, the camera pose estimation and point cloud data generation can be performed appropriately. As a result, in this embodiment, it is possible to generate a NeRF model with higher accuracy compared to when the image processing unit 102 performs processing using real unstained images.

[0083] Incidentally, for training the NeRF model, either a set of real unstained images acquired by the image acquisition unit 101 or a set of virtual stained images converted by the image conversion unit 121 may be used. When a set of real unstained images acquired by the image acquisition unit 101 is used for training the NeRF model, the image generation unit 107 can generate a two-dimensional color image of the unstained organ corresponding to a specified viewpoint. On the other hand, when a set of virtual stained images converted by the image conversion unit 121 is used for training the NeRF model, the image generation unit 107 can generate a two-dimensional color image of the stained organ corresponding to a specified viewpoint. For this reason, in this embodiment, the model learning unit 104 may perform both training of a NeRF model using the set of real unstained images acquired by the image acquisition unit 101 and training of a NeRF model using the set of virtual stained images converted by the image conversion unit 121, thereby generating two NeRF models. However, the model learning unit 104 may perform only one of these two training processes, thereby generating a single NeRF model. Since stained images emphasize the contours and color tones of organs more than unstained images, stained images are preferable for generating free-viewpoint images for observing lesions. For this reason, it is preferable that the stained images converted by the image conversion unit 121 be used for training the NeRF model by the model learning unit 104.

[0084] The image selection receiving unit 122 receives an instruction to select the NeRF model to be used by the image generation unit 107. In other words, the image selection receiving unit 122 receives an instruction to select either the generation of a stained image or the generation of an unstained image. For example, the user inputs an instruction to the information processing device 100a via the input device 200, specifying whether a stained image or an unstained image is to be displayed. The image selection receiving unit 122 receives this input. When the image selection receiving unit 122 receives an instruction, the image generation unit 107 uses the NeRF model corresponding to that instruction to generate a new image of the organ with the camera pose received by the pose reception unit 106 as the viewpoint. Specifically, if the image selection receiving unit 122 receives an instruction to select the generation of an unstained image, the image generation unit 107 generates a free-viewpoint image using the NeRF model trained with the set of real unstained images acquired by the image acquisition unit 101. On the other hand, if the image selection receiving unit 122 receives an instruction to select the generation of a stained image, the image generation unit 107 generates a free-viewpoint image using the NeRF model trained with the set of virtual stained images converted by the image conversion unit 121. Note that if only one type of NeRF model is generated, the information processing device 100a does not need to have the image selection receiving unit 122.

[0085] Next, the operation flow of the information processing device 100a according to this embodiment will be described. Figure 8 is a flowchart showing an example of the learning operation flow of a NeRF model by the information processing device 100a. The learning operation flow of the NeRF model according to this embodiment will be described below in accordance with Figure 8. The flowchart shown in Figure 8 differs from the flowchart shown in Figure 3 in that step S30 is added. The differences from the flowchart shown in Figure 3 will be explained below.

[0086] In the flowchart shown in Figure 8, after the image acquisition unit 101 acquires a set of images in step S10, the process moves to step S30. It is assumed that a set of real, unstained images is acquired in step S10.

[0087] In step S30, the image conversion unit 121 uses a trained CycleGAN model to convert the image set acquired in step S10 into a set of images of organs stained with a predetermined dye (indigo carmine). After step S30, the process moves to step S11. In step S11, the image set obtained in step S30 is used to estimate the camera pose of each image and to generate point cloud data of the imaged object.

[0088] Furthermore, in the flowchart shown in Figure 8, in step S13, the model learning unit 104 performs machine learning on the first NeRF model using the image set obtained in step S10, the point cloud data obtained in step S11, and the camera poses obtained in steps S11 and S12. The model learning unit 104 also performs machine learning on the second NeRF model using the image set obtained in step S30, the point cloud data obtained in step S11, and the camera poses obtained in steps S11 and S12. As a result, two trained NeRF models are generated and stored in the model storage unit 105. Note that in step S13, only one of the NeRF models may be generated.

[0089] The flow of the free-viewpoint image generation operation by the information processing device 100a differs from Embodiment 1 (see Figure 4) in that, when two NeRF models are generated, it is possible to select the model to be used for generating the free-viewpoint image (i.e., select the type of free-viewpoint image). However, other operations are the same, so a detailed explanation is omitted.

[0090] Embodiment 2 has been described above. In this embodiment, the image conversion unit 121 converts a real unstained image into a stained image. Therefore, the processing of the image processing unit 102 can be performed using the stained image. As a result, camera pose estimation and point cloud data generation can be performed with higher accuracy compared to when an unstained image is used. In particular, in this embodiment, the image conversion unit 121 acquires a virtual stained image by image conversion using CycleGAN. Therefore, since it is not necessary to actually spray dye on the subject's organs, the burden on the subject is reduced, and the time required to obtain the image is also reduced.

[0091] Furthermore, this embodiment allows for the training of two types of NeRF models. Therefore, images of unstained organs and stained organs can be generated as free-viewpoint images.

[0092] <Modified form of Embodiment 2> In the above-described Embodiment 2, a method for generating stained and unstained images as free-viewpoint images was shown, which involves generating two NeRF models. It is also possible to achieve this with a single NeRF model. The differences from Embodiment 2 will be described in detail below, and explanations of configurations and processes similar to those in Embodiment 2 will be omitted as appropriate.

[0093] Figure 9 is a schematic diagram of the NeRF model according to the modified form of Embodiment 2. As shown in Figure 9, the NeRF model F_Θ in this modified form takes the same input as the NeRF model in Embodiment 1 or Embodiment 2, but its output is different. In the NeRF model in Embodiment 1 or Embodiment 2, only one value c was output as an RGB color value, but in this modified form, not only c but also c1 is output as an RGB color value. Thus, in this modified form, the NeRF model is a model that outputs a first color and a second color. More specifically, the NeRF model in this modified form is a neural network model that takes the 3D coordinate x of a point on a ray emitted from a viewpoint and the direction d of the ray as input, and outputs the RGB color values c and c1 and the volume density σ corresponding to that point. Here, the RGB color value c represents the color of an unstained organ, and the RGB color value c1 represents the color of a stained organ. Such a NeRF model can be generated by adding another color-based loss and performing the learning process. In other words, the model learning unit 104 uses two color-based loss functions: the loss function of Equation 7 calculated using the set of real unstained images acquired by the image acquisition unit 101 as training data, and the loss function of Equation 7 calculated using the set of virtual stained images converted by the image conversion unit 121 as training data. The training of the NeRF model in this modified form is the same as in the embodiments described above, except that another color-based loss is added. That is, in this modified form, the model learning unit 104 further trains the NeRF model using a loss function that represents the difference between the color information estimated based on the output of the NeRF model F_Θ for rays corresponding to the first type of camera pose (observed field of view) and the color information, which is training data, identified from the images converted by the image conversion unit 121.
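The addition of a second color-based loss can be sketched as follows. The sketch assumes that Equation 7 is a squared-error loss between the rendered pixel color and the training pixel color, as in the standard NeRF formulation; Equation 7 itself is not reproduced in this text. It further assumes that both colors are composited with the same per-sample rendering weights, since c and c1 share one volume density σ. All names are hypothetical.

```python
import numpy as np

def two_color_loss(weights, c_unstained, c_stained, target_unstained, target_stained):
    """Sum of two squared-error color losses that share one set of rendering weights.

    weights: (R, S) rendering weights for R rays with S samples each.
    c_unstained, c_stained: (R, S, 3) per-sample outputs c and c1 of the model.
    target_unstained, target_stained: (R, 3) training pixel colors.
    """
    pred_u = (weights[..., None] * c_unstained).sum(axis=1)  # rendered unstained color
    pred_s = (weights[..., None] * c_stained).sum(axis=1)    # rendered stained color
    loss_u = np.mean(np.sum((pred_u - target_unstained) ** 2, axis=-1))
    loss_s = np.mean(np.sum((pred_s - target_stained) ** 2, axis=-1))
    return loss_u + loss_s

rng = np.random.default_rng(0)
weights = np.zeros((2, 4))
weights[:, 1] = 1.0                      # each ray terminates at its second sample
c = rng.random((2, 4, 3))                # per-sample unstained colors
c1 = rng.random((2, 4, 3))               # per-sample stained colors
loss_match = two_color_loss(weights, c, c1, c[:, 1], c1[:, 1])
loss_offset = two_color_loss(weights, c, c1, c[:, 1], c1[:, 1] + 0.5)
```

Because the rendering weights are shared, the two losses constrain a single geometry while supervising the two color outputs separately.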

[0094] In this modified example, the image generation unit 107 can generate two types of images using a camera pose received by the pose reception unit 106 as the viewpoint, using the output of a single NeRF model generated by the model learning unit 104. That is, in this modified example, the image generation unit 107 can generate a two-dimensional color image of an unstained image corresponding to a specified viewpoint, and a two-dimensional color image of a stained image corresponding to a specified viewpoint. The display control unit 108 displays each of the images generated by the image generation unit 107 on the display device 300. This allows the user to be provided with an unstained image and a stained image viewed from a desired viewpoint. The image generation unit 107 may also generate a selected image according to instructions received by the image selection reception unit 122. For example, the image generation unit 107 may generate either an unstained image or a stained image selected by the instructions. The image selection reception unit 122 may also receive instructions to select both the generation of a stained image and the generation of an unstained image. In this case, the image generation unit 107 may generate both images using the NeRF model. In this modified example, as in Embodiments 1 and 2, the image generation unit 107 may also generate a two-dimensional depth image corresponding to a specified viewpoint.

[0095] The above describes the modified form. This modified form can also generate free-viewpoint unstained and stained images.

[0096] The above-described functions (processing) of the information processing device 100 and the information processing device 100a may be implemented by a computer 500 having, for example, the following configuration.

[0097] Figure 10 is a block diagram showing an example configuration of a computer 500 that implements the processing of the information processing device 100 and the information processing device 100a. As shown in Figure 10, the computer 500 includes an input / output interface 501, memory 502, and a processor 503.

[0098] The input / output interface 501 is an interface for connecting to other devices (for example, the input device 200, the display device 300, the endoscope camera 400, etc.).

[0099] Memory 502 is composed of, for example, a combination of volatile memory and non-volatile memory. Memory 502 is used to store software (computer programs) containing one or more instructions executed by the processor 503, and data used for various processes. The model storage unit 105 can be implemented by, for example, memory 502, but may also be implemented by any storage device other than memory 502.

[0100] The processor 503 reads the software (computer programs) from the memory 502 and executes it to perform the processing described above for the information processing device 100 or the information processing device 100a. The processor 503 may be, for example, a microprocessor, an MPU (Micro Processing Unit), a CPU (Central Processing Unit), or a GPU (Graphics Processing Unit). The processor 503 may include multiple processors.

[0101] The program includes a set of instructions (or software code) that, when loaded into a computer, causes the computer to perform one or more of the functions described in the embodiments. The program may be stored on a non-transitory computer-readable medium or a tangible storage medium. Examples of such media include, but are not limited to, random-access memory (RAM), read-only memory (ROM), flash memory, solid-state drive (SSD) or other memory technologies, CD-ROM, digital versatile disc (DVD), Blu-ray® disc or other optical disc storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices. The program may also be transmitted over a transitory computer-readable medium or a communication medium. Examples of transitory computer-readable media or communication media include, but are not limited to, electrical, optical, acoustic, or other forms of propagated signals.

[0102] It should be noted that the present invention is not limited to the embodiments described above, and can be modified as appropriate without departing from the spirit of the invention. For example, the model learning unit 104 does not have to use all of the loss functions described in the embodiments when learning the NeRF model. That is, the model learning unit 104 may learn the NeRF model using only some of the loss functions described in the embodiments. For example, the model learning unit 104 may perform the NeRF model learning process without using the smoothness loss function, or without using the unimodal loss function. Also, in Embodiment 1, the set of images acquired by the image acquisition unit 101 does not have to be a set of images of organs that have not been sprayed with dye, but may be a set of images of organs that have actually been sprayed with a predetermined dye and are captured by the endoscope camera 400.
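As an illustration only, training with a subset of the loss functions can be sketched as a weighted sum in which an omitted term simply has no weight. The sketch assumes that each "difference" is measured as a mean squared error and that the terms are combined by a weighted sum; the embodiments do not fix either choice here. The names mse, total_loss, and the term keys ("color_first", "depth_first", "depth_second", "smoothness", "unimodal") are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences (assumed metric)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(terms, weights):
    """Weighted sum of the loss terms.

    terms:   dict mapping a term name to its already-computed scalar value.
    weights: dict mapping a term name to its weight; a term that is missing
             from `weights` contributes nothing, i.e. it is not used in training.
    """
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())


# Example values only; real values would come from rendering the light rays.
terms = {
    "color_first": mse([0.2, 0.4], [0.2, 0.5]),  # first loss: color, first-type poses
    "depth_first": mse([1.0, 2.0], [1.0, 2.5]),  # second loss: depth, first-type poses
    "depth_second": mse([1.5], [1.0]),           # third loss: depth, second-type poses
    "smoothness": 0.3,                           # optional surface-smoothness term
    "unimodal": 0.1,                             # optional unimodality term
}

# Train with every term, or leave out the smoothness term by omitting its weight.
all_terms = total_loss(terms, {name: 1.0 for name in terms})
without_smoothness = total_loss(
    terms, {name: 1.0 for name in terms if name != "smoothness"})
```

Keeping the enabled terms in a weight table rather than in the training code makes it straightforward to compare configurations, such as training with and without the unimodal loss function.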

[0103] Furthermore, some or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to them.
(Note 1) An image processing system comprising:
an image acquisition unit that acquires a plurality of images of an organ captured by an endoscope camera;
an image processing unit that uses the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
a pose generation unit that generates, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
a model learning unit that performs machine learning of a NeRF (Neural Radiance Fields) model,
wherein the model learning unit trains the NeRF model using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.
(Note 2) The image processing system according to Note 1, wherein the model learning unit further trains the NeRF model using, as loss functions, the smoothness of the organ's surface calculated from depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose, and the smoothness of the organ's surface calculated from depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose.
(Note 3) The image processing system according to Note 1 or 2, wherein the model learning unit further trains the NeRF model using, as loss functions, the degree of unimodality of the volume density distribution output from the NeRF model for light rays corresponding to the first type of camera pose, and the degree of unimodality of the volume density distribution output from the NeRF model for light rays corresponding to the second type of camera pose.
(Note 4) The image processing system according to any one of Notes 1 to 3, wherein the plurality of images acquired by the image acquisition unit are images of the organ that has not been stained with a predetermined dye, the image processing system further comprises an image conversion unit that uses a pre-trained CycleGAN (Cycle-Consistent Generative Adversarial Networks) model to convert each of the plurality of images acquired by the image acquisition unit into an image of the organ stained with the predetermined dye, and the image processing unit uses the plurality of images converted by the image conversion unit to estimate the camera pose and generate the point cloud data.
(Note 5) The image processing system according to Note 4, wherein the model learning unit trains the NeRF model using the plurality of images converted by the image conversion unit.
(Note 6) The image processing system according to Note 4, wherein the model learning unit trains the NeRF model using the set of images acquired by the image acquisition unit and trains the NeRF model using the set of images converted by the image conversion unit, thereby generating two trained NeRF models.
(Note 7) The image processing system according to Note 4, wherein the NeRF model is a model that outputs a first color and a second color, and the model learning unit further trains the NeRF model using a fourth loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images converted by the image conversion unit.
(Note 8) The image processing system according to any one of Notes 1 to 7, further comprising:
a pose reception unit that receives a specification of a camera pose; and
an image generation unit that generates a new image of the organ with the specified camera pose as the viewpoint, based on the output of the trained NeRF model obtained by inputting the specified camera pose into the trained NeRF model.
(Note 9) The image processing system according to any one of Notes 1 to 8, wherein the organ is the stomach, esophagus, duodenum, small intestine, or large intestine.
(Note 10) An image processing method comprising:
acquiring a plurality of images of an organ captured by an endoscope camera;
using the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
generating, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
performing machine learning of a NeRF model,
wherein, in the machine learning of the NeRF model, the NeRF model is trained using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.
(Note 11) A program that causes a computer to execute:
an image acquisition step of acquiring a plurality of images of an organ captured by an endoscope camera;
an image processing step of using the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
a pose generation step of generating, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
a model learning step of performing machine learning of a NeRF model,
wherein, in the model learning step, the NeRF model is trained using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.
[Explanation of Symbols]

[0104]
1, 1a Image processing system
50 Light ray
51a, 51b Field of view
52 Volume rendering
53 Color
54 Depth information
90 Image set
90a, 90b Image
100, 100a Information processing device
101 Image acquisition unit
102 Image processing unit
103 Pose generation unit
104 Model learning unit
105 Model storage unit
106 Pose reception unit
107 Image generation unit
108 Display control unit
121 Image conversion unit
122 Image selection reception unit
200 Input device
300 Display device
400 Endoscope camera
500 Computer
501 Input / output interface
502 Memory
503 Processor
D1, D2 Discriminator
G1, G2 Generator
x_f Virtual unstained image
x_r Real unstained image
y_f Virtual stained image
y_r Real stained image

Claims

1. An image processing system comprising:
an image acquisition unit that acquires a plurality of images of an organ captured by an endoscope camera;
an image processing unit that uses the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
a pose generation unit that generates, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
a model learning unit that performs machine learning of a NeRF (Neural Radiance Fields) model,
wherein the model learning unit trains the NeRF model using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.

2. The image processing system according to claim 1, wherein the model learning unit further trains the NeRF model using, as loss functions, the smoothness of the organ's surface calculated from depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose, and the smoothness of the organ's surface calculated from depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose.

3. The image processing system according to claim 1 or 2, wherein the model learning unit further trains the NeRF model using, as loss functions, the degree of unimodality of the volume density distribution output from the NeRF model for light rays corresponding to the first type of camera pose, and the degree of unimodality of the volume density distribution output from the NeRF model for light rays corresponding to the second type of camera pose.

4. The image processing system according to claim 1, wherein the plurality of images acquired by the image acquisition unit are images of the organ that has not been stained with a predetermined dye, the image processing system further comprises an image conversion unit that uses a pre-trained CycleGAN (Cycle-Consistent Generative Adversarial Networks) model to convert each of the plurality of images acquired by the image acquisition unit into an image of the organ stained with the predetermined dye, and the image processing unit uses the plurality of images converted by the image conversion unit to estimate the camera pose and generate the point cloud data.

5. The image processing system according to claim 4, wherein the model learning unit trains the NeRF model using the plurality of images converted by the image conversion unit.

6. The image processing system according to claim 4, wherein the model learning unit trains the NeRF model using the set of images acquired by the image acquisition unit and trains the NeRF model using the set of images converted by the image conversion unit, thereby generating two trained NeRF models.

7. The image processing system according to claim 4, wherein the NeRF model is a model that outputs a first color and a second color, and the model learning unit further trains the NeRF model using a fourth loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images converted by the image conversion unit.

8. The image processing system according to claim 1, further comprising:
a pose reception unit that receives a specification of a camera pose; and
an image generation unit that generates a new image of the organ with the specified camera pose as the viewpoint, based on the output of the trained NeRF model obtained by inputting the specified camera pose into the trained NeRF model.

9. The image processing system according to claim 1, wherein the organ is the stomach, esophagus, duodenum, small intestine, or large intestine.

10. An image processing method comprising:
acquiring a plurality of images of an organ captured by an endoscope camera;
using the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
generating, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
performing machine learning of a NeRF model,
wherein, in the machine learning of the NeRF model, the NeRF model is trained using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.

11. A program that causes a computer to execute:
an image acquisition step of acquiring a plurality of images of an organ captured by an endoscope camera;
an image processing step of using the plurality of images to estimate a camera pose, which is the position and orientation of the endoscope camera at the time each image was captured, and to generate point cloud data of the organ;
a pose generation step of generating, based on a first type of camera pose, which is the estimated camera pose, a second type of camera pose, which is a new camera pose different from the first type of camera pose; and
a model learning step of performing machine learning of a NeRF model,
wherein, in the model learning step, the NeRF model is trained using at least:
a first loss function representing the difference between color information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and color information that is training data identified from the images;
a second loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the first type of camera pose and depth information that is training data identified from the point cloud data; and
a third loss function representing the difference between depth information estimated based on the output of the NeRF model for light rays corresponding to the second type of camera pose and depth information that is training data identified from the point cloud data.