Information processing system and information processing method
The information processing system addresses latency-induced positional discrepancies in VST-HMDs by dynamically adjusting virtual image synthesis to align object positions, improving immersion and reducing VR-related issues.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SONY GROUP CORP
- Filing Date
- 2025-12-04
- Publication Date
- 2026-06-25
AI Technical Summary
In video see-through head-mounted displays (VST-HMDs), combining real-world footage with virtual images results in discrepancies in object positional relationships due to latency issues, leading to decreased immersion and potential VR sickness.
The information processing system generates virtual images based on real-time object recognition and adjusts the synthesis mode to minimize latency and align object positions, predicting the recognition results of predictable and stationary objects to reduce positional discrepancies.
The system effectively controls latency and maintains realistic object positioning in composite images, enhancing user immersion and reducing VR-related discomfort.
Description
[0001] This technology relates to an information processing system and an information processing method, and more particularly to an information processing system and an information processing method that can control the delay time of captured images in a composite image while suppressing the displacement of the positional relationships of objects in the composite image.
[0002] In VST-HMDs (Video See-Through Head Mounted Displays), captured real-world video and virtual images such as computer graphics (CG) are combined and displayed. Because these virtual images are generated using information from the time of capture, there is a delay between when the video is captured and when the virtual images are generated, combined with the captured video, and displayed. If this delay between capture and display is too long, problems such as decreased immersion, VR sickness, and delayed danger-avoidance actions can occur.
[0003] Therefore, a method has been devised to display the captured video with low latency by combining the latest captured video (captured up to the time of compositing) with a virtual image. However, with this method, because the time of the captured video and the time of the virtual image are different, a discrepancy occurs in the positional relationship of objects in the composite image, and the composite image may become unnatural and lack realism.
[0004] Based on the above, an image processing device has been devised that controls the delay time of the captured video in the composite video by selectively using the captured video corresponding to the virtual video and the most recent captured video captured at the time of synthesis as the captured video to be synthesized with the virtual video (see, for example, Patent Document 1).
[0005] However, in this image processing device, if the most recently captured video footage is selected as the video footage to be composited with the virtual image, there is a possibility that the positional relationships of objects in the composite image may be misaligned.
[0006] Patent Document 1: Japanese Patent Publication No. 2023-88667
[0007] Therefore, it is desirable to control the delay time of the captured footage within the composite image while suppressing the discrepancy in the positional relationships of objects within the composite image.
[0008] This technology was developed in light of these circumstances, and aims to suppress discrepancies in the positional relationships of objects within the composite image while controlling the delay time of the captured images within the composite image.
[0009] One aspect of this technology is an information processing system that includes a circuit configured to generate a first virtual image, which is a virtual image for the first captured image, or a second virtual image, which is a virtual image for the second captured image, which is a captured image for a second frame after the first frame, based on the object recognition result of the first captured image, which is a captured image for a first frame, and to generate a first composite image by combining the first captured image and the first virtual image, or a second composite image by combining the second captured image and the second virtual image, depending on the synthesis mode.
[0010] One aspect of this technology is an information processing method in which an information processing system generates a first virtual image, which is a virtual image for the first captured image, or a second virtual image, which is a virtual image for the second captured image, which is a captured image for a second frame after the first frame, based on the object recognition result of the first captured image, which is a captured image for a first frame, and generates a first composite image by combining the first captured image and the first virtual image, or a second composite image by combining the second captured image and the second virtual image, depending on the synthesis mode.
[0011] In one aspect of this technology, based on the object recognition result of the first captured image, which is the captured image of the first frame, a first virtual image is generated for the first captured image, or a second virtual image is generated for the second captured image, which is the captured image of a second frame that is after the first frame. Depending on the synthesis mode, a first composite image is generated by combining the first captured image and the first virtual image, or a second composite image is generated by combining the second captured image and the second virtual image.
[0012] An information processing system may be a standalone device or a module incorporated into another device.
[0013] Figure 1 shows an example of an ideal composite image.
Figure 2 shows an example of an actual composite image.
Figure 3 is a block diagram illustrating the configuration of the first embodiment of an information processing system to which this technology is applied.
Figure 4 is a diagram explaining the synthesis modes.
Figure 5 is a diagram illustrating the processing flow of the information processing system in low-latency mode in the first embodiment.
Figure 6 shows an example of a composite image in low-latency mode in the first embodiment.
Figure 7 is a diagram illustrating the processing flow of the information processing system in delay-tolerant mode in the first embodiment.
Figure 8 shows an example of a composite image in delay-tolerant mode in the first embodiment.
Figure 9 is a flowchart illustrating the composite image display processing by the image processing unit in Figure 3.
Figure 10 is a block diagram illustrating the configuration of the second embodiment of an information processing system to which this technology is applied.
Figure 11 is a diagram illustrating the processing flow of the information processing system in low-latency mode in the second embodiment.
The subsequent figures show, in order: a flowchart illustrating the composite image display processing by the image processing unit in Figure 10; a block diagram illustrating the configuration of the third embodiment of an information processing system to which this technology is applied; a diagram illustrating processing according to the operating mode; a flowchart illustrating the display processing; a block diagram illustrating the configuration of the fourth embodiment; a block diagram illustrating the configuration of the fifth embodiment; a diagram illustrating the processing flow of the information processing system; and a block diagram illustrating an example of the configuration of computer hardware.
[0014] The following describes the embodiments for implementing this technology. The explanation will be given in the following order: 1. Ideal vs. Reality of Synthetic Images 2. First Embodiment (Information Processing System for Generating Synthetic Images According to the Synthesis Mode) 3. Second Embodiment (Information Processing System for Correcting the Viewpoint of Synthetic Images) 4. Third Embodiment (Information Processing System for Immediately Displaying Captured Images in Emergencies) 5. Fourth Embodiment (Information Processing System for Changing the Synthesis Mode Based on Prediction Errors of Object Recognition Results) 6. Fifth Embodiment (Information Processing System for Gradual Changes in the Synthesis Mode) 7. Description of a Computer Applying This Technology
[0015] <1. Ideal vs. Reality of Composite Images> <Example of an Ideal Composite Image> Figure 1 shows an example of an ideal composite image.
[0016] As shown in Figure 1, if a video 11-1 including a vase 21 and a hand 22 is acquired at time t1, ideally, a CG video 12-1 including a CG of a bird 31 is immediately generated based on this video 11-1, at a position corresponding to the position of the hand 22 in the video 11-1.
[0017] For example, if the hand 22 moves to the upper right between time t1 and time t2, then at time t2, a video 11-2 is obtained that includes the vase 21 at the same position as in video 11-1, and the hand 22 that has moved to the upper right.
[0018] In this case, ideally, based on the captured image 11-2, a CG image 12-2 including a CG of a bird 31 is immediately generated at a position corresponding to the position of the hand 22 in the captured image 11-2. Then, by combining the captured image 11-2 at time t2 and the CG image 12-2 at time t2, an ideal composite image 13 is generated.
[0019] As described above, since the time of the filmed video 11-2 and the time of the CG video 12-2 are the same time t2, no discrepancy occurs in the relative positions of the bird's CG 31 and the hand 22 in the ideal composite video 13. Because the CG video 12-2 is generated immediately based on the filmed video 11-2, the delay time of the filmed video 11-2 in the ideal composite video 13 is small.
[0020] <Example of actual composite image> Figure 2 shows an example of an actual composite image.
[0021] In Figure 2, components identical to those in Figure 1 are given the same reference numerals.
[0022] As shown in Figure 2, when the captured video 11-1 is acquired at time t1, object recognition of the hand 22 is in practice performed on this captured video 11-1, and the position of the hand 22 in the captured video 11-1 is obtained as an object recognition result. Then, based on this object recognition result, drawing of the bird CG 31 begins in the CG video 12-1, at the position corresponding to the hand 22 in the captured video 11-1. Therefore, a predetermined delay occurs between the acquisition of the captured video 11-1 and the completion of the CG video 12-1 in which the bird CG 31 is drawn. In the example in Figure 2, this delay is from time t1 to time t2.
[0023] As shown in Figure 2, when a composite image is generated by combining the latest captured video footage and CG footage taken at the time of compositing, the composite image 51 at time t2 is a composite image of the captured video 11-2 at time t2 and the CG footage 12-1 corresponding to the captured video 11-1 at time t1. Therefore, the position of the bird's CG 31 in the composite image 51 does not correspond to the position of the hand 22 in the captured video 11-2 at time t2, but rather to the position of the hand 22 in the captured video 11-1 at time t1. As a result, a discrepancy occurs in the positional relationship between the bird's CG 31 and the hand 22 in the composite image 51, and the bird's CG 31 is placed next to the hand 22, not on top of it. Consequently, the composite image 51 becomes an unnatural image lacking realism. However, since the captured video 11-2 in the composite image 51 is the latest captured video taken at the time of compositing, the delay time of the captured video 11-2 in the composite image 51 is small.
[0024] In contrast, although not shown in the diagram, when the CG image 12-1 and the filmed image 11-1 used to generate the CG image 12-1 are combined at time t2, the time of the combined CG image 12-1 and the time of the filmed image 11-1 are both time t1. Therefore, no discrepancy occurs in the positional relationship between the bird's CG 31 and the hand 22 in the combined image. However, since the filmed image 11-1 in the combined image was filmed at time t1, which is before time t2, the delay time of the filmed image in the combined image is large.
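The trade-off described in paragraphs [0023] and [0024] can be illustrated numerically. The following is a minimal sketch only: the pixel coordinates, the 33 ms interval between t1 and t2, and the names `Frame` and `evaluate` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    time_ms: float   # capture time of the captured video
    hand_xy: tuple   # position of the hand 22 in that video (pixels, assumed)

def evaluate(video: Frame, cg_source: Frame, now_ms: float):
    """Return (positional discrepancy in pixels, delay of the captured video in ms).

    `video` is the captured video used for compositing; `cg_source` is the
    captured video whose recognition result determined where the CG was drawn.
    """
    dx = cg_source.hand_xy[0] - video.hand_xy[0]
    dy = cg_source.hand_xy[1] - video.hand_xy[1]
    discrepancy = (dx * dx + dy * dy) ** 0.5
    return discrepancy, now_ms - video.time_ms

# Hypothetical values: the hand moves to the upper right between t1 and t2.
frame_t1 = Frame(time_ms=0.0, hand_xy=(100, 200))
frame_t2 = Frame(time_ms=33.0, hand_xy=(130, 160))

# Compositing at t2 with the latest video (Figure 2): small delay, misaligned CG.
latest = evaluate(frame_t2, frame_t1, now_ms=33.0)
# Compositing at t2 with the video the CG was generated from: aligned CG, large delay.
matched = evaluate(frame_t1, frame_t1, now_ms=33.0)
```

With these assumed numbers, the first case has zero delay but a nonzero discrepancy, and the second case has zero discrepancy but a delay equal to the CG generation time.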
[0025] <2. First Embodiment> <Example of Information Processing System Configuration> Figure 3 is a block diagram showing an example of the configuration of the first embodiment of an information processing system to which this technology is applied.
[0026] The information processing system 100 in Figure 3 comprises an imaging unit 101, an image processing unit 102, and a display unit 103. The information processing system 100 combines the captured video acquired by the imaging unit 101 with CG video and displays the result on the display unit 103.
[0027] Specifically, the imaging unit 101 of the information processing system 100 captures the surroundings using a rolling shutter method and acquires frame-by-frame captured video (VST video). The imaging unit 101 then supplies the captured video to the image processing unit 102.
[0028] The image processing unit 102 includes an ISP (Image Signal Processing) processing unit 110, a control unit 111, a recognition processing unit 112, a CG image generation unit 113, a synthesis mode setting unit 114, a synthesis unit 115, and a display control unit 116.
[0029] The ISP processing unit 110 performs ISP processing on the captured video supplied from the imaging unit 101. The ISP processing unit 110 then supplies the captured video after ISP processing to the recognition processing unit 112 and the synthesis unit 115.
[0030] The control unit 111 controls the generation of one or more CG images by controlling the recognition processing unit 112 and the CG image generation unit 113. Specifically, the control unit 111 supplies drawing position information regarding the drawing position of the CG image to the recognition processing unit 112. The drawing position information includes reference presence / absence information indicating whether the drawing position of the CG image is determined based on the position of one or more objects in the captured image. If the reference presence / absence information indicates that the drawing position of the CG image is determined based on the position of one or more objects in the captured image, the drawing position information also includes object information representing each of those one or more objects as a reference object. The control unit 111 supplies CG drawing information, which is information regarding the drawing of CG, to the CG image generation unit 113.
[0031] If the reference presence / absence information indicates that the drawing position of the CG image is determined based on the position of one or more objects in the captured image, the recognition processing unit 112 performs object recognition processing for each frame of the captured image supplied by the ISP processing unit 110 to recognize each reference object represented by the object information. As a result, the recognition processing unit 112 obtains the position of the reference object as the object recognition result.
[0032] If all reference objects represented by the object information are predictable objects or stationary objects, the recognition processing unit 112 predicts the object recognition result of the prediction frame based on the object recognition result of the target frame, which is the frame to be processed by the object recognition process.
[0033] Specifically, for reference objects that are predictable objects, the recognition processing unit 112 predicts the object recognition result of the prediction frame from the object recognition result of the target frame, according to the prediction algorithm. For reference objects that are stationary objects, the recognition processing unit 112 uses the object recognition result of the target frame as is as the predicted object recognition result of the prediction frame.
[0034] A predictable object is a moving object whose object recognition result can be predicted easily and accurately with the prediction algorithm. The prediction frame (second frame) is a frame one or more frames later than the target frame (first frame); here, it is the frame of the latest captured video whose capture and ISP processing have been completed at the time the synthesis unit 115 performs the synthesis corresponding to the target frame. By predicting the object recognition result of the prediction frame, the recognition processing unit 112 can therefore compensate for the delay in the object recognition result caused by the object recognition processing and by the rendering of CG images in the CG image generation unit 113, which will be described later. The recognition processing unit 112 supplies object recognition result prediction information, which represents the predicted object recognition result, to the CG image generation unit 113.
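The per-object prediction in paragraphs [0032] to [0034] can be sketched as follows. The source does not specify the prediction algorithm, so constant-velocity extrapolation from two consecutive recognition results is used here purely as a stand-in assumption; `ObjectType` and `predict_position` are hypothetical names.

```python
from enum import Enum

class ObjectType(Enum):
    PREDICTABLE = "predictable"
    DIFFICULT_TO_PREDICT = "difficult_to_predict"
    STATIONARY = "stationary"

def predict_position(obj_type, current_xy, previous_xy, frames_ahead):
    """Predict a reference object's position in the prediction frame.

    Assumption: constant velocity between frames. The actual prediction
    algorithm is not given in the source.
    """
    if obj_type is ObjectType.STATIONARY:
        # A stationary object keeps the target-frame result as is.
        return current_xy
    if obj_type is ObjectType.PREDICTABLE:
        vx = current_xy[0] - previous_xy[0]
        vy = current_xy[1] - previous_xy[1]
        return (current_xy[0] + vx * frames_ahead,
                current_xy[1] + vy * frames_ahead)
    # Difficult-to-predict objects are not predicted; the caller falls back
    # to the target-frame recognition result.
    return None

moved = predict_position(ObjectType.PREDICTABLE, (110, 190), (100, 200), 2)
fixed = predict_position(ObjectType.STATIONARY, (50, 60), (50, 60), 2)
```

Returning `None` for a difficult-to-predict object is a design choice of this sketch; it mirrors the source's behavior of supplying the unpredicted target-frame result in that case.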
[0035] On the other hand, if at least one reference object represented by the object information is a moving object that is not a predictable object (hereinafter referred to as a difficult-to-predict object), the recognition processing unit 112 supplies object recognition result information representing the object recognition result of the target frame to the CG image generation unit 113. The recognition processing unit 112 supplies reference presence / absence information and object type information indicating whether each reference object is a predictable object, a difficult-to-predict object, or a stationary object to the synthesis mode setting unit 114.
[0036] If the recognition processing unit 112 does not supply anything, that is, if the reference presence / absence information indicates that the drawing position of the CG image is not determined based on the position of one or more objects in the captured image, the CG image generation unit 113 draws the CG at a predetermined position based on the CG drawing information. As a result, the CG image generation unit 113 generates a fixed CG image which is a CG image containing each CG at the predetermined position.
[0037] When object recognition result information is supplied from the recognition processing unit 112, the CG image generation unit 113 draws CG at the position based on the object recognition result information, based on the CG drawing information. In this way, the CG image generation unit 113 generates a target CG image (first virtual image), which is a CG image for the captured image (first captured image) of the target frame.
[0038] When object recognition result prediction information is supplied from the recognition processing unit 112, the CG image generation unit 113 draws CG at the position based on the object recognition result prediction information, based on the CG drawing information. In this way, the CG image generation unit 113 generates a predicted CG image (second virtual image), which is a CG image for the captured image (second captured image) of the prediction frame. The CG image generation unit 113 supplies the generated fixed CG image, target CG image, or predicted CG image to the synthesis unit 115.
[0039] The synthesis mode setting unit 114 sets a synthesis mode based on the reference presence / absence information and object type information supplied from the recognition processing unit 112. The synthesis modes include a low-latency mode and a delay-tolerant mode. The low-latency mode (second image synthesis mode) uses the captured image of the prediction frame for synthesis. The delay-tolerant mode (first image synthesis mode) uses the captured image of the target frame for synthesis. The synthesis mode setting unit 114 supplies the set synthesis mode to the synthesis unit 115.
[0040] The synthesis unit 115 combines the captured image supplied from the ISP processing unit 110 with the fixed CG image, target CG image, or predicted CG image supplied from the CG image generation unit 113, according to the synthesis mode supplied from the synthesis mode setting unit 114.
[0041] Specifically, when the synthesis mode is the low-latency mode, the synthesis unit 115 combines the captured image of the prediction frame with the fixed CG image or the predicted CG image to generate a composite image (second composite image). On the other hand, when the synthesis mode is the delay-tolerant mode, the synthesis unit 115 combines the captured image of the target frame with the target CG image to generate a composite image (first composite image). The synthesis unit 115 supplies the generated composite image to the display control unit 116.
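The mode-dependent pairing of captured image and CG image in paragraphs [0040] and [0041] can be sketched as below. This is an illustration under assumptions: images are modeled as nested lists, a `None` CG pixel is treated as transparent, and `overlay` and `synthesize` are hypothetical names rather than anything defined in the source.

```python
def overlay(captured, cg):
    """Overlay a CG image on a captured image; None in `cg` means transparent."""
    return [[c if g is None else g for c, g in zip(c_row, g_row)]
            for c_row, g_row in zip(captured, cg)]

def synthesize(mode, target_image, latest_image,
               target_cg=None, predicted_cg=None, fixed_cg=None):
    """Pick the inputs according to the synthesis mode and combine them."""
    if mode == "low_latency":
        # Latest (prediction-frame) captured image + predicted or fixed CG image.
        cg = predicted_cg if predicted_cg is not None else fixed_cg
        base = latest_image
    elif mode == "delay_tolerant":
        # Target-frame captured image + target CG image.
        cg = target_cg
        base = target_image
    else:
        raise ValueError(f"unknown synthesis mode: {mode}")
    if cg is None:
        raise ValueError("no CG image available for this mode")
    return overlay(base, cg)

# Tiny 1x2 "images" with placeholder pixel labels.
target_image = [["T", "T"]]
latest_image = [["L", "L"]]
second_composite = synthesize("low_latency", target_image, latest_image,
                              predicted_cg=[[None, "P"]])
first_composite = synthesize("delay_tolerant", target_image, latest_image,
                             target_cg=[["C", None]])
```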
[0042] The display control unit 116 supplies the composite image supplied from the synthesis unit 115 to the display unit 103 for display.
[0043] The display unit 103 is configured by an HMD (Head Mounted Display) or the like. The display unit 103 displays (presents) the composite image supplied from the display control unit 116.
[0044] <Explanation of Synthesis Modes> Figure 4 is a diagram illustrating the synthesis modes set by the synthesis mode setting unit 114 in Figure 3. Specifically, Figure 4 is a table that associates the synthesis modes with the characteristics and setting conditions of those synthesis modes.
[0045] As shown in Figure 4, when the synthesis mode is low latency mode, the most recently captured video footage taken at the time of synthesis is used for synthesis, so the delay time of the captured video footage in the synthesized video, i.e., the display delay time of the captured video footage (Motion to Photon), is small.
[0046] When the synthesis mode is set to low latency mode, the CG image synthesized with the captured video is either a predicted CG image from the same time as the captured video, or a fixed CG image where the rendering position of the CG is independent of the position of objects in the captured video. Therefore, the positional difference between the reference object and the CG in the synthesized image, which is a synthesis of the captured video and the predicted CG image, is small. The synthesized image, which is a synthesis of the captured video and the fixed CG image, is not affected by the time difference between the captured video and the fixed CG image. As a result, when the synthesis mode is set to low latency mode, the synthesized image becomes a realistic and natural image.
[0047] The synthesis mode setting unit 114 sets the synthesis mode to low latency mode if the reference presence / absence information indicates that the drawing position of the CG image is not determined based on the position of one or more objects in the captured image, that is, if the position of objects in the captured image and the position of the CG are unrelated. An example of such a situation is when the user is about to leave the playable zone or about to collide with a real object while the control unit 111 is running a VR application. In this case, the control unit 111 controls the generation of the CG image so that the CG is positioned at a predetermined position in the composite image in order to display the captured image on the display unit 103.
[0048] The synthesis mode setting unit 114 also sets the synthesis mode to low latency mode if the object type information indicates that all reference objects are stationary or predictable objects. An example of a stationary reference object, that is, a case in which the reference object that causes interaction between the real space and the virtual space is stationary, is a CG ball thrown against the wall of a real room and bouncing back. An example of a predictable reference object, that is, a case in which the movement of the reference object that causes interaction between the real space and the virtual space can be predicted easily and with high accuracy, is a user operating a menu with their hands.
[0049] When the synthesis mode is set to the delay-tolerant mode, the captured video of the target frame is used for synthesis, resulting in a larger display delay of the captured video. However, in this case, the CG image combined with the captured video of the target frame is the target CG image, which is from the same time as the captured video. Therefore, there is no discrepancy in the positional relationship between the reference object and the CG in the synthesized image. As a result, even when the synthesis mode is set to the delay-tolerant mode, the synthesized image becomes a realistic and natural image.
[0050] The synthesis mode setting unit 114 sets the synthesis mode to the delay-tolerant mode when the object type information indicates that at least one reference object is a difficult-to-predict object, that is, when there is high interaction between the real space and the virtual space but it is difficult to predict the movement of the reference object. Examples of such cases include those in which the reference object is a ball or a dog, which is not rigid and tends to move quickly.
[0051] As described above, unless the object type information indicates that at least one reference object is a difficult-to-predict object, the synthesis mode setting unit 114 sets the synthesis mode to low-latency mode. Therefore, unless at least one reference object is a difficult-to-predict object, the delay time of the captured images in the synthesized image can be reduced while keeping the synthesized image natural.
[0052] On the other hand, if the object type information indicates that at least one reference object is a difficult-to-predict object, the synthesis mode setting unit 114 sets the synthesis mode to the delay-tolerant mode. Therefore, when at least one reference object is a difficult-to-predict object, the delay time of the captured images in the synthesized image will be large, but the synthesized image can still be made to look natural.
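The setting conditions in paragraphs [0047] to [0052] reduce to a small decision rule, sketched below. The enum and function names are hypothetical; only the rule itself is taken from the text.

```python
from enum import Enum

class ObjectType(Enum):
    PREDICTABLE = "predictable"
    DIFFICULT_TO_PREDICT = "difficult_to_predict"
    STATIONARY = "stationary"

class SynthesisMode(Enum):
    LOW_LATENCY = "low_latency"
    DELAY_TOLERANT = "delay_tolerant"

def select_synthesis_mode(has_reference, reference_object_types=()):
    """Choose the synthesis mode from the reference presence / absence
    information and the object type information."""
    if not has_reference:
        # CG position is unrelated to objects in the captured image.
        return SynthesisMode.LOW_LATENCY
    if any(t is ObjectType.DIFFICULT_TO_PREDICT for t in reference_object_types):
        # At least one reference object cannot be predicted reliably.
        return SynthesisMode.DELAY_TOLERANT
    # All reference objects are stationary or predictable.
    return SynthesisMode.LOW_LATENCY

menu_mode = select_synthesis_mode(True, [ObjectType.PREDICTABLE, ObjectType.STATIONARY])
dog_mode = select_synthesis_mode(True, [ObjectType.STATIONARY, ObjectType.DIFFICULT_TO_PREDICT])
fixed_mode = select_synthesis_mode(False)
```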
[0053] <Explanation of the processing flow in low-latency mode> Figure 5 is a diagram illustrating the processing flow of the information processing system 100 in low-latency mode.
[0054] In Figure 5, the horizontal axis represents time. In the example in Figure 5, the reference object is a predictable object. The same applies to Figures 11 and 19, which will be discussed later.
[0055] As shown in Figure 5, the imaging unit 101 captures video frame by frame, and the ISP processing unit 110 performs ISP processing on the captured video of each frame. The recognition processing unit 112 performs object recognition processing of a reference object on the captured video after ISP processing, and predicts the object recognition result of the prediction frame based on the object recognition result thus obtained. In the example in Figure 5, the prediction frame is the frame two frames after the target frame. The CG image generation unit 113 generates a predicted CG video based on the CG drawing information and the object recognition result prediction information representing the prediction result of the recognition processing unit 112.
[0056] By this time, the imaging unit 101 has completed the capture of the prediction frame two frames after the target frame, and the ISP processing unit 110 has completed ISP processing on the captured video of that prediction frame.
[0057] Therefore, the synthesis unit 115 combines the latest captured video after ISP processing, i.e., the captured video of the prediction frame two frames after the target frame, with the predicted CG video. As a result, the display unit 103 displays a composite video in which the captured video of the prediction frame and the predicted CG video are combined.
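How far ahead the prediction frame lies depends on how long recognition and CG rendering take relative to the frame period. The sketch below assumes a simplified timing model that is not stated in the source: every frame has the same ISP latency, and frames arrive at a fixed period. The function name and the example timings are hypothetical.

```python
import math

def prediction_horizon_frames(processing_ms, frame_period_ms):
    """Number of frames between the target frame and the prediction frame.

    Assumed model: frame k finishes ISP processing at k * frame_period_ms plus
    a constant ISP latency, and recognition plus CG rendering for the target
    frame (k = 0) takes processing_ms after its ISP processing completes. The
    prediction frame is the latest frame whose ISP processing has completed by
    then, so the constant ISP latency cancels out.
    """
    if frame_period_ms <= 0:
        raise ValueError("frame_period_ms must be positive")
    if processing_ms < 0:
        raise ValueError("processing_ms must be non-negative")
    return math.floor(processing_ms / frame_period_ms)

# With an assumed 16 ms frame period and 40 ms of recognition plus rendering,
# the prediction frame is two frames after the target frame, as in Figure 5.
horizon = prediction_horizon_frames(40.0, 16.0)
```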
[0058] <Example of composite image in low-latency mode> Figure 6 shows an example of composite image in low-latency mode.
[0059] In Figure 6, elements identical to those in Figure 1 are given the same reference numerals. In the example in Figure 6, the hand 22, which is the reference object, is the predictable object.
[0060] As shown in Figure 6, when the captured video 11-1 is acquired by the imaging unit 101 at time t1, the recognition processing unit 112 performs object recognition processing of the hand 22 based on the captured video 11-1 after ISP processing by the ISP processing unit 110. As a result, the position of the hand 22 on the captured video 11-1 is obtained as an object recognition result.
[0061] Based on this object recognition result, the recognition processing unit 112 predicts the object recognition result in the latest captured video 121, i.e., the captured video 121 of the prediction frame, which has had its ISP processing completed by time t12 when the predicted CG video 122 is synthesized. In the example in Figure 6, the recognition processing unit 112 predicts that the hand 22 will move to the upper right between time t1 and time t12, and generates the position of the hand 22 after the movement as the predicted object recognition result.
[0062] Based on the predicted object recognition result, the CG image generation unit 113 starts drawing the CG 31 of the bird in the predicted CG image 122, at a position above the position corresponding to the predicted position of the hand 22 in the captured image 121.
[0063] At time t12, once the generation of the predicted CG image 122, in which the bird's CG 31 is drawn, is complete, the compositing unit 115 generates a composite image 123 by compositing the latest captured image 121, which has been captured and processed by ISP at that time, with the predicted CG image 122. Therefore, both frames (times) of the captured image 121 and the predicted CG image 122 in the composite image 123 become predicted frames. As a result, in the composite image 123, the positional discrepancy between the bird's CG 31 and the hand 22 is small, and the bird's CG 31 is positioned above the hand 22. Consequently, the composite image 123 becomes a realistic and natural image. Furthermore, since the captured image 121 in the composite image 123 is the latest captured image taken at the time of compositing, the delay time is small.
[0064] <Explanation of the processing flow in delay-tolerant mode> Figure 7 is a diagram illustrating the processing flow of the information processing system 100 in delay-tolerant mode.
[0065] In Figure 7, the horizontal axis represents time. This is also true for Figure 12, which will be discussed later.
[0066] As shown in Figure 7, the imaging unit 101 captures video frame by frame, and the ISP processing unit 110 performs ISP processing on the captured video of each frame. The recognition processing unit 112 performs object recognition processing of a reference object on the captured video after ISP processing. The CG image generation unit 113 generates a target CG video based on the CG drawing information and the object recognition result information obtained by the object recognition processing. The synthesis unit 115 combines the ISP-processed captured video of the target frame with the target CG video. As a result, the display unit 103 displays a composite video in which the captured video of the target frame and the target CG video are combined.
[0067] In the example shown in Figure 7, by the time the target CG image is composited, the shooting and ISP processing of the frame two frames after the target frame have already been completed. Therefore, the captured image in the composite image displayed on the display unit 103 is the image taken two frames before the most recent captured image, resulting in a delay in the display of the captured image.
[0068] <Example of composite video in delay-tolerant mode> Figure 8 shows an example of composite video in delay-tolerant mode.
[0069] In Figure 8, the same reference numerals are used for elements identical to those in Figure 1. In the example in Figure 8, the hand 22, which is the reference object, is the unpredictable object.
[0070] As shown in Figure 8, when the captured video 11-1 is acquired by the imaging unit 101 at time t1, the recognition processing unit 112 performs object recognition processing of the hand 22 based on the captured video 11-1 after ISP processing by the ISP processing unit 110. As a result, the position of the hand 22 on the captured video 11-1 is obtained as an object recognition result.
[0071] Based on the object recognition result, the CG image generation unit 113 starts drawing the bird's CG 31 in the target CG image 141, at a position above the position corresponding to the position of the hand 22 in the captured image 11-1.
[0072] At time t22, once the generation of the target CG image 141, in which the bird's CG 31 is drawn, is complete, the compositing unit 115 generates a composite image 142 by compositing the captured image 11-1 and the target CG image 141. Therefore, the frame of both the captured image 11-1 and the target CG image 141 in the composite image 142 is the target frame. As a result, there is no misalignment in the positional relationship between the bird's CG 31 and the hand 22 in the composite image 142, and the bird's CG 31 is positioned above the hand 22. Consequently, the composite image 142 becomes a realistic and natural image.
[0073] On the other hand, between time t1 and time t22, one or more frames of captured video are acquired and ISP processing is performed. Therefore, the latest captured video 143 after ISP processing at time t22 is different from captured video 11-1. However, the captured video used for synthesis at time t22 is the captured video 11-1 taken at time t1, and the delay time of the captured video in the synthesized video 142 is large.
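The difference in captured-image delay between the two modes can be illustrated with a small timing model. The following is only a sketch and is not taken from the specification: it assumes that frames arrive at a fixed period and that object recognition plus CG generation take a fixed time, and the names `frame_period_ms` and `cg_pipeline_ms` are hypothetical.

```python
import math


def captured_frame_lag(mode: str, frame_period_ms: float, cg_pipeline_ms: float) -> int:
    """Return how many frames the captured image in the composite lags
    behind the newest ISP-processed frame at compositing time.

    Assumption: fixed frame period and fixed recognition + CG time.
    """
    if mode == "low_latency":
        # The newest captured frame (the predicted frame) is composited.
        return 0
    if mode == "delay_tolerant":
        # The target frame is held until its CG is ready; each full frame
        # period that elapses in the meantime adds one frame of lag.
        return math.floor(cg_pipeline_ms / frame_period_ms)
    raise ValueError(f"unknown synthesis mode: {mode}")
```

With a hypothetical 10 ms frame period and 25 ms of recognition and CG generation, the delay-tolerant mode composites a captured image that is two frames old, which corresponds to the situation drawn in Figure 7.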
[0074] <Explanation of Composite Image Display Processing> Figure 9 is a flowchart illustrating the composite image display processing performed by the image processing unit 102 in Figure 3. This composite image display processing is performed frame by frame, for example, when the input of frame-by-frame captured video from the shooting unit 101 begins, with each frame being treated as a target frame in sequence.
[0075] In step S11 of Figure 9, the ISP processing unit 110 of the image processing unit 102 performs ISP processing on the captured video of the target frame supplied from the shooting unit 101, and supplies the captured video after ISP processing to the recognition processing unit 112 and the synthesis unit 115.
[0076] In step S12, the recognition processing unit 112 determines whether or not to perform object recognition processing based on the reference presence / absence information supplied from the control unit 111. Specifically, if the reference presence / absence information indicates that the drawing position of the CG image is determined based on the position of one or more objects in the captured image, the recognition processing unit 112 determines to perform object recognition processing. On the other hand, if the reference presence / absence information indicates that the drawing position of the CG image is not determined based on the position of one or more objects in the captured image, the recognition processing unit 112 determines not to perform object recognition processing.
[0077] If it is determined in step S12 that object recognition processing should be performed, the process proceeds to step S13. In step S13, the recognition processing unit 112 performs object recognition processing on the captured video of the target frame for which ISP processing was performed in step S11, recognizing each reference object represented by the object information, and obtains the position of the reference object as the object recognition result.
[0078] In step S14, the recognition processing unit 112 determines whether all reference objects represented by the object information are predictable objects or stationary objects. If it is determined in step S14 that all reference objects are predictable objects or stationary objects, the process proceeds to step S15.
[0079] In step S15, the recognition processing unit 112 predicts the object recognition result of the predicted frame based on the object recognition result of the target frame and supplies the predicted object recognition result information to the CG image generation unit 113. The recognition processing unit 112 also supplies reference presence / absence information and object type information to the synthesis mode setting unit 114.
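This section does not state how the object recognition result of the predicted frame is computed. As one possible illustration only, the sketch below assumes constant-velocity extrapolation from the two most recent recognized positions. The function `predict_position` and its parameters are hypothetical and not part of the described system.

```python
from typing import Tuple

Point = Tuple[float, float]


def predict_position(prev: Point, curr: Point, frames_ahead: int) -> Point:
    """Extrapolate a reference object's position to a future frame.

    Assumption: the object keeps the per-frame displacement observed
    between the previous frame and the target frame. A stationary object
    (prev == curr) is predicted to stay where it is.
    """
    vx = curr[0] - prev[0]
    vy = curr[1] - prev[1]
    return (curr[0] + vx * frames_ahead, curr[1] + vy * frames_ahead)
```

For example, a hand recognized at (100, 200) and then at (110, 195) would be predicted at (130, 185) two frames later under this assumption.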
[0080] In step S16, the CG image generation unit 113 generates a predicted CG image by drawing CG, in accordance with the CG drawing information supplied by the control unit 111, at positions based on the predicted object recognition result information supplied by the recognition processing unit 112. The CG image generation unit 113 supplies the generated predicted CG image to the synthesis unit 115, and the process proceeds to step S18.
[0081] On the other hand, if it is determined in step S12 that object recognition processing is not to be performed, the process proceeds to step S17. In step S17, the CG image generation unit 113 generates a fixed CG image by drawing CG at predetermined positions based on the CG drawing information supplied from the control unit 111. The CG image generation unit 113 supplies the generated fixed CG image to the synthesis unit 115, and the process proceeds to step S18.
[0082] In step S18, the synthesis mode setting unit 114 sets the synthesis mode to low-latency mode based on the reference presence / absence information and object type information supplied from the recognition processing unit 112, and supplies it to the synthesis unit 115. Then, the process proceeds to step S21.
[0083] On the other hand, if it is determined in step S14 that not all reference objects are predictable objects or stationary objects, that is, if at least one reference object is an unpredictable object, the recognition processing unit 112 generates object recognition result information representing the object recognition result obtained in step S13. The recognition processing unit 112 then supplies this object recognition result information to the CG image generation unit 113, and also supplies reference presence / absence information and object type information to the synthesis mode setting unit 114.
[0084] Then, in step S19, the CG image generation unit 113 generates a target CG image by drawing CG at positions based on object recognition result information supplied by the recognition processing unit 112, based on CG drawing information supplied by the control unit 111. The CG image generation unit 113 then supplies the generated target CG image to the synthesis unit 115.
[0085] In step S20, the synthesis mode setting unit 114 sets the synthesis mode to a delay-tolerant mode based on the reference presence / absence information and object type information supplied from the recognition processing unit 112, and supplies it to the synthesis unit 115. Then, the process proceeds to step S21.
[0086] In step S21, the compositing unit 115 combines the captured video after ISP processing of the target frame or predicted frame with the fixed CG video, target CG video, or predicted CG video, according to the compositing mode set in step S18 or S20, to generate a composite video. The compositing unit 115 then supplies the composite video to the display control unit 116.
[0087] In step S22, the display control unit 116 supplies the composite image generated in step S21 to the display unit 103 for display. The composite image display process then ends.
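The branching in steps S12 to S20 can be summarized as a small decision function. The following is a minimal sketch under stated assumptions: the enum and function names are hypothetical, and only the decision logic, not the image processing itself, is modeled.

```python
from enum import Enum
from typing import List, Tuple


class ObjectType(Enum):
    PREDICTABLE = "predictable"
    STATIONARY = "stationary"
    UNPREDICTABLE = "unpredictable"


class SynthesisMode(Enum):
    LOW_LATENCY = "low_latency"
    DELAY_TOLERANT = "delay_tolerant"


def plan_frame(uses_reference: bool,
               reference_types: List[ObjectType]) -> Tuple[str, SynthesisMode]:
    """Return (kind of CG image, synthesis mode) for one target frame."""
    if not uses_reference:
        # S12 "no": fixed CG image (S17), low-latency mode (S18).
        return ("fixed", SynthesisMode.LOW_LATENCY)
    if all(t is not ObjectType.UNPREDICTABLE for t in reference_types):
        # S14 "yes": predicted CG image (S15, S16), low-latency mode (S18).
        return ("predicted", SynthesisMode.LOW_LATENCY)
    # S14 "no": target CG image (S19), delay-tolerant mode (S20).
    return ("target", SynthesisMode.DELAY_TOLERANT)
```

A single unpredictable reference object is enough to select the delay-tolerant mode, even when the other reference objects are predictable or stationary.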
[0088] As described above, the CG image generation unit 113 of the information processing system 100 generates a target CG image for the target frame's captured image, or a predicted CG image for the predicted frame's captured image, based on the object recognition result of the target frame's captured image. Then, the synthesis unit 115 generates a composite image by combining the target frame's captured image and the target CG image, or a composite image by combining the predicted frame's captured image and the predicted CG image, depending on the synthesis mode. Therefore, it is possible to control the delay time of the captured image in the composite image while suppressing the discrepancy in the positional relationship between the reference object and the CG in the composite image.
[0089] Furthermore, the synthesis mode setting unit 114 dynamically switches the synthesis mode based on the reference object. For example, the synthesis mode setting unit 114 sets the synthesis mode to a delay-tolerant mode when at least one reference object is an unpredictable object, and sets the synthesis mode to a low-latency mode otherwise. As a result, when none of the reference objects is an unpredictable object, the captured image can be displayed immediately while suppressing the discrepancy in the positional relationship between the reference object and the CG in the synthesized image. On the other hand, when at least one reference object is an unpredictable object, a delay occurs in the display of the captured image, but the discrepancy in the positional relationship between the reference object and the CG in the synthesized image can be suppressed.
[0090] The synthesis mode setting unit 114 may also be configured to set the synthesis mode based on instructions from the control unit 111. In this case, for example, the developer of the application executed by the control unit 111 can control the switching of the synthesis mode. As a result, the user can experience a synthesized image synthesized in the synthesis mode intended by the application developer.
[0091] After setting the synthesis mode to one of the low-latency mode and the delay-tolerant mode (the first mode), the synthesis mode setting unit 114 may switch to the other mode only after a predetermined time has elapsed since the first mode was set. Specifically, when the reference presence / absence information and object type information call for a change of synthesis mode, the synthesis mode setting unit 114 keeps the first mode as the final synthesis mode supplied to the synthesis unit 115 if the predetermined time has not yet elapsed since the first mode was set. On the other hand, if the predetermined time has elapsed since the first mode was set, the synthesis mode setting unit 114 sets the final synthesis mode to the new synthesis mode.
[0092] In this case, frequent switching of the synthesis mode can be suppressed, reducing user discomfort and unease. Specifically, when the synthesis mode is switched, the timing of the captured video and CG video in the composite video is delayed or advanced by the number of frames from the target frame to the predicted frame, causing the displayed composite video to behave unnaturally. Therefore, if the synthesis mode is switched frequently, the composite video will repeatedly exhibit unnatural behavior in a short period of time, which may cause the user to feel uncomfortable or uneasy. Thus, the synthesis mode setting unit 114 can reduce user discomfort and unease by suppressing frequent switching of the synthesis mode.
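The minimum hold time described in paragraphs [0091] and [0092] can be sketched as follows. This is an illustration only: the class name `ModeHold`, the use of seconds as the time unit, and the string mode labels are assumptions.

```python
class ModeHold:
    """Keeps the current synthesis mode until it has been held for
    at least `hold_s` seconds (hypothetical helper)."""

    def __init__(self, initial_mode: str, hold_s: float, now_s: float = 0.0):
        self.mode = initial_mode
        self.hold_s = hold_s
        self.set_at_s = now_s  # time at which the current mode was set

    def update(self, requested_mode: str, now_s: float) -> str:
        """Return the final synthesis mode to supply to the synthesis unit."""
        held_long_enough = now_s - self.set_at_s >= self.hold_s
        if requested_mode != self.mode and held_long_enough:
            self.mode = requested_mode
            self.set_at_s = now_s
        return self.mode
```

With a hold time of one second, a request to switch 0.5 seconds after the current mode was set is ignored, while the same request at 1.0 seconds is accepted and restarts the hold period.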
[0093] <3. Second Embodiment> <Example of Information Processing System Configuration> Figure 10 is a block diagram showing an example of the configuration of a second embodiment of an information processing system to which this technology is applied.
[0094] In the information processing system 200 of Figure 10, the parts corresponding to the information processing system 100 of Figure 3 are denoted by the same reference numerals. Therefore, explanations of those parts will be omitted as appropriate, and the explanation will focus on the parts that differ from the information processing system 100. The information processing system 200 differs from the information processing system 100 in that the image processing unit 102 is replaced by the image processing unit 202, and otherwise it is configured the same as the information processing system 100. The information processing system 200 performs SLAM (Simultaneous Localization and Mapping) processing and corrects the viewpoint of the synthesized image based on the self-position estimated as a result.
[0095] Specifically, the image processing unit 202 differs from the image processing unit 102 in that it newly includes a self-position estimation unit 211, a pre-combination correction unit 212, and a post-combination correction unit 214, and replaces the combination unit 115 with a combination unit 213. Otherwise, it is configured the same as the image processing unit 102.
[0096] The self-position estimation unit 211 performs SLAM processing based on multiple captured video frames, including the target frame after ISP processing by the ISP processing unit 110, to estimate the self-position of the imaging unit 101. At this time, the self-position estimation unit 211 performs extrapolation processing on the self-position estimation result based on motion information of the imaging unit 101 detected by an IMU (Inertial Measurement Unit) and an acceleration sensor (not shown). As a result, the self-position estimation unit 211 can obtain the self-position estimation result of the imaging unit 101 at a high frequency and predict the future self-position of the imaging unit 101. The self-position estimation unit 211 supplies the estimated or predicted self-position of the imaging unit 101 to the pre-combination correction unit 212 and the post-combination correction unit 214.
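The extrapolation step can be pictured with a deliberately reduced model. The sketch below is not the method of the specification: it assumes a pose with a single translation axis and a yaw angle, and assumes that a linear velocity and a yaw rate have already been derived from the motion information. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    t_s: float      # timestamp in seconds
    x_m: float      # position along one axis, in meters
    yaw_rad: float  # heading in radians


def extrapolate(last_slam: Pose, vel_mps: float, yaw_rate_rps: float, t_s: float) -> Pose:
    """Extrapolate the last SLAM pose to time t_s, assuming constant
    linear velocity and constant yaw rate over the interval."""
    dt = t_s - last_slam.t_s
    return Pose(t_s, last_slam.x_m + vel_mps * dt, last_slam.yaw_rad + yaw_rate_rps * dt)
```

Calling `extrapolate` with times between SLAM updates yields poses at a higher rate than SLAM alone, and calling it with a time in the future yields a predicted pose, which mirrors the two uses described above.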
[0097] The pre-compositing correction unit 212 is supplied with captured video that has undergone ISP processing by the ISP processing unit 110, and fixed CG video, target CG video, or predicted CG video generated by the CG video generation unit 113. The pre-compositing correction unit 212 is also supplied with the synthesis mode set by the synthesis mode setting unit 114.
[0098] If the synthesis mode is low latency mode, the pre-combination correction unit 212 performs pre-combination time warp processing on the captured video and predicted CG video of the predicted frame after ISP processing, based on the self-position supplied by the self-position estimation unit 211.
[0099] Time warp processing is a process that compensates for the difference between the viewpoint at the time of shooting or rendering and the viewpoint at the time of display. Specifically, pre-compositing time warp processing is a process that corrects the captured video and predicted CG video by reprojecting objects in the captured video and CG in the predicted CG video so that they appear as they would from the self-position at the time the composite video is displayed. Therefore, the captured video after pre-compositing time warp processing is a captured video in which the viewpoint of the shooting unit 101 at the time of shooting has been converted to a predicted value of the viewpoint of the shooting unit 101 at the time the composite video is displayed. The predicted CG video after pre-compositing time warp processing is a predicted CG video in which the viewpoint at the time the predicted CG video is rendered, that is, the viewpoint of the shooting unit 101 at the time of shooting the captured video of the target frame, has been converted to a predicted value of the viewpoint of the shooting unit 101 at the time the composite video is displayed.
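The specification does not give the reprojection formula. As one common simplification, the sketch below assumes a pinhole camera and a viewpoint change that is a pure rotation, in which case a pixel can be remapped with the homography K R K^-1. The intrinsics `f`, `cx`, `cy` and all function names are hypothetical, and translation and scene depth are ignored.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def yaw_rotation(theta: float) -> Matrix:
    """Rotation about the camera's vertical (y) axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def warp_homography(f: float, cx: float, cy: float, rot: Matrix) -> Matrix:
    """Build H = K * R * K^-1, where R maps source-camera coordinates
    (shooting or rendering time) to display-time camera coordinates."""
    k = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    return matmul(matmul(k, rot), k_inv)


def warp_pixel(h: Matrix, u: float, v: float) -> Tuple[float, float]:
    """Apply the homography to pixel (u, v) and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return (x / w, y / w)
```

Under these assumptions, a zero rotation leaves every pixel in place, and a small yaw moves the image center horizontally by f·tan(θ) pixels. A full implementation would also account for translation of the self-position, which requires depth information.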
[0100] As described above, the self-position estimation unit 211 obtains self-position estimation results at a high frequency, so the pre-composition correction unit 212 can perform high-precision pre-composition time warp processing based on the latest self-position at the time of shooting or rendering. The self-position estimation unit 211 predicts the self-position, so the pre-composition correction unit 212 can perform pre-composition time warp processing based on the predicted value of the self-position at the time of display.
[0101] When the synthesis mode is low latency mode, the pre-synthesis correction unit 212 supplies the captured video of the predicted frame after the pre-synthesis time warp processing and the predicted CG video or fixed CG video after the pre-synthesis time warp processing to the synthesis unit 213.
[0102] If the synthesis mode is the delay-tolerant mode, the pre-synthesis correction unit 212 supplies the captured video and target CG video of the target frame after ISP processing directly to the synthesis unit 213.
[0103] The compositing unit 213 combines the captured video of the predicted frame after time warp processing supplied from the pre-compositing correction unit 212 with the predicted CG video or fixed CG video after time warp processing supplied from the pre-compositing correction unit 212 to generate a composite video. The compositing unit 213 also combines the captured video of the target frame after ISP processing supplied from the pre-compositing correction unit 212 with the target CG video to generate a composite video. The compositing unit 213 supplies the composite video to the post-compositing correction unit 214.
[0104] If the synthesis mode set by the synthesis mode setting unit 114 is the low-latency mode, the post-composite correction unit 214 supplies the synthesized video supplied from the synthesis unit 213 to the display control unit 116 as is.
[0105] When the synthesis mode is the delay-tolerant mode, the post-synthesis correction unit 214 performs a post-synthesis time warp process on the synthesized image supplied by the synthesis unit 213 based on the self-position supplied by the self-position estimation unit 211. The post-synthesis time warp process is a process that corrects the synthesized image by reprojecting objects and CG in the synthesized image so that they appear as they would appear from the self-position at the time the synthesized image is displayed. Therefore, the synthesized image after the post-synthesis time warp process is a synthesized image in which the viewpoint at the time of shooting of the captured image included in the synthesized image has been converted into a predicted value of the viewpoint of the shooting unit 101 at the time the synthesized image is displayed.
[0106] As described above, when the synthesis mode is the delay-tolerant mode, the viewpoint at the time of shooting of the captured video included in the synthesized video and the viewpoint at the time of rendering of the target CG video are the same, so the time warp process is performed on the synthesized video after synthesis. However, even in the delay-tolerant mode, the time warp process may be performed separately on the captured video and the target CG video before synthesis, as in the low-latency mode. The post-synthesis correction unit 214 supplies the synthesized video after the time warp process to the display control unit 116.
[0107] <Explanation of the processing flow in low-latency mode> Figure 11 is a diagram illustrating the processing flow of the information processing system 200 in low-latency mode.
[0108] In Figure 11, the processes other than those of the self-position estimation unit 211, the pre-combination correction unit 212, and the combination unit 213 are the same as those in Figure 5, so the explanation of those processes will be omitted as appropriate.
[0109] As shown in Figure 11, the self-position estimation unit 211 performs SLAM processing based on multiple captured video frames, including the target frame after ISP processing. Based on the motion information of the imaging unit 101, the self-position estimation unit 211 performs extrapolation processing on the self-position estimation results for each frame obtained as a result of SLAM processing. In this way, the self-position estimation unit 211 estimates or predicts the self-position of the imaging unit 101 at predetermined time intervals, including the time each frame was captured.
[0110] The pre-composition correction unit 212 uses, for example, the self-position at the time of the pre-composition time warp processing (TW), which is predicted by the self-position estimation unit 211, as the self-position at the time of display. Based on the self-position at the display time and the self-position of the predicted frame, the pre-composition correction unit 212 performs the pre-composition time warp processing on the captured video of the predicted frame. The pre-composition correction unit 212 also performs the pre-composition time warp processing on the predicted CG video based on the self-position at the time of display and the self-position of the target frame.
[0111] As described above, when the synthesis mode is low latency mode, the warp source time in the time warp processing of the captured video is the predicted frame, while the warp source time in the time warp processing of the predicted CG video is the target frame. Therefore, before the composite of the captured video and the predicted CG video, time warp processing is performed separately on the captured video and the predicted CG video.
[0112] The synthesis unit 213 synthesizes the captured video and predicted CG video of the predicted frame after the time warp processing by the pre-synthesis correction unit 212. As a result, the display unit 103 displays the synthesized video from the viewpoint at the time of display. This reduces the display delay experienced by the user.
[0113] <Explanation of the processing flow in delay-tolerant mode> Figure 12 is a diagram illustrating the processing flow of the information processing system 200 in delay-tolerant mode.
[0114] In Figure 12, the processing other than that of the self-position estimation unit 211 and the post-combination correction unit 214 is the same as that of Figure 7, so the explanation of that processing will be omitted as appropriate.
[0115] As shown in Figure 12, the self-position estimation unit 211 estimates or predicts the self-position of the imaging unit 101 at predetermined time intervals, including the time each frame was captured, as in the case of Figure 11.
[0116] The post-compositing correction unit 214 uses, for example, the self-position at the time of the post-compositing time warp processing, which is predicted by the self-position estimation unit 211, as the self-position at the time of display. Based on the self-position at the time of display and the self-position of the target frame, the post-compositing correction unit 214 performs post-compositing time warp processing on the composite image.
[0117] As described above, when the synthesis mode is the delay-tolerant mode, both the warp source time in the time warp processing of the captured video and the warp source time in the time warp processing of the target CG video are the target frame. Therefore, after the captured video and the target CG video are synthesized, the time warp processing is performed on the synthesized video.
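The rule for ordering time warp and synthesis follows from whether the two inputs share a warp source time. The sketch below only illustrates that rule: the function `warp_plan` and its string step labels are hypothetical and perform no image processing.

```python
from typing import List


def warp_plan(captured_source: str, cg_source: str) -> List[str]:
    """Return the processing order given the warp source frame of the
    captured video and of the CG video."""
    if captured_source == cg_source:
        # Same source viewpoint: composite first, then warp once.
        return [
            "composite(captured, cg)",
            f"time_warp(composite, {captured_source} -> display)",
        ]
    # Different source viewpoints: warp each input separately, then composite.
    return [
        f"time_warp(captured, {captured_source} -> display)",
        f"time_warp(cg, {cg_source} -> display)",
        "composite(captured, cg)",
    ]
```

Here `warp_plan("target", "target")` corresponds to the delay-tolerant mode, and `warp_plan("predicted", "target")` corresponds to the low-latency mode, where the captured video comes from the predicted frame and the predicted CG video is rendered from the target frame's viewpoint.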
[0118] The display unit 103 displays the composite image after the time warp processing performed by the post-compositing correction unit 214, that is, the composite image from the viewpoint at the time of display. This reduces the display delay experienced by the user.
[0119] <Explanation of Composite Image Display Processing> Figure 13 is a flowchart illustrating the composite image display processing performed by the image processing unit 202 in Figure 10. This composite image display processing is performed frame by frame, for example, when the input of frame-by-frame captured video from the shooting unit 101 begins, with each frame being treated as a target frame in sequence.
[0120] In step S31 of Figure 13, the ISP processing unit 110 of the image processing unit 202 performs ISP processing on the captured video of the target frame supplied from the shooting unit 101. The ISP processing unit 110 supplies the captured video after ISP processing to the recognition processing unit 112, the self-position estimation unit 211, and the pre-composition correction unit 212.
[0121] In step S32, the self-position estimation unit 211 performs SLAM processing based on multiple captured video frames, including the target frame after ISP processing by the ISP processing unit 110. Based on the motion information of the imaging unit 101, the self-position estimation unit 211 performs extrapolation processing on the self-position estimation result of the imaging unit 101 obtained by SLAM processing. The self-position estimation unit 211 supplies the resulting self-position of the imaging unit 101 to the pre-combination correction unit 212 and the post-combination correction unit 214.
[0122] The processes in steps S33 to S38 are the same as those in steps S12 to S16 and S18 in Figure 9, so their explanation is omitted. After the process in step S38, the process proceeds to step S39.
[0123] In step S39, the pre-combination correction unit 212 performs pre-combination time warp processing on the captured video of the predicted frame after ISP processing and the predicted CG video generated in step S37, based on the self-position supplied by the self-position estimation unit 211. The pre-combination correction unit 212 supplies the captured video of the predicted frame and the predicted CG video after the pre-combination time warp processing to the combination unit 213, and the process proceeds to step S43.
[0124] On the other hand, if it is determined in step S33 that object recognition processing is not to be performed, the process proceeds to step S40. In step S40, the CG image generation unit 113 generates a fixed CG image in the same manner as in step S17 and supplies it to the pre-composition correction unit 212.
[0125] In step S41, the synthesis mode setting unit 114 sets the synthesis mode to low latency mode and supplies it to the pre-synthesis correction unit 212 and the post-synthesis correction unit 214. In step S42, the pre-synthesis correction unit 212 performs pre-synthesis time warp processing on the captured video of the predicted frame after ISP processing, based on the self-position supplied from the self-position estimation unit 211. The pre-synthesis correction unit 212 supplies the captured video of the predicted frame after pre-synthesis time warp processing and the fixed CG video generated in step S40 to the synthesis unit 213 and proceeds to step S43.
[0126] In step S43, the synthesis unit 213 synthesizes the captured video of the predicted frame supplied from the pre-synthesis correction unit 212 with the fixed CG video or the predicted CG video to generate a composite video. The synthesis unit 213 supplies the composite video to the display control unit 116 via the post-synthesis correction unit 214, and proceeds to step S48.
[0127] On the other hand, if it is determined in step S35 that not all reference objects are predictable objects or stationary objects, that is, if at least one reference object is an unpredictable object, the recognition processing unit 112 generates object recognition result information representing the object recognition result obtained in step S34. The recognition processing unit 112 then supplies this object recognition result information to the CG image generation unit 113, and also supplies reference presence / absence information and object type information to the synthesis mode setting unit 114. The process then proceeds to step S44.
[0128] The processes in steps S44 and S45 are the same as those in steps S19 and S20, so their explanation will be omitted.
[0129] In step S46, the compositing unit 213 combines the captured video of the target frame that underwent ISP processing in step S31 with the target CG video generated in step S44 to generate a composite image. The compositing unit 213 then supplies the composite image to the post-compositing correction unit 214.
[0130] In step S47, the post-compositing correction unit 214 performs post-compositing time warp processing on the composite image generated in step S46 based on the self-position supplied by the self-position estimation unit 211. The post-compositing correction unit 214 supplies the composite image after post-compositing time warp processing to the display control unit 116, and the process proceeds to step S48.
[0131] In step S48, the display control unit 116 supplies the composite image supplied from the post-composite correction unit 214 to the display unit 103 for display. Then the composite image display process is completed.
[0132] As described above, the image processing unit 202 of the information processing system 200 dynamically switches the synthesis mode, similar to the image processing unit 102, and generates a synthesized image according to the synthesis mode. Therefore, the same effect as the image processing unit 102 can be obtained. The image processing unit 202 also generates a synthesized image of the viewpoint at the time of display by performing time warp processing before or after synthesis. Therefore, the display delay experienced by the user can be reduced.
[0133] <4. Third Embodiment> <Example of Information Processing System Configuration> Figure 14 is a block diagram showing an example of the configuration of a third embodiment of an information processing system to which this technology is applied.
[0134] In the information processing system 300 in Figure 14, the same reference numerals are used for parts corresponding to the information processing system 100 in Figure 3. Therefore, explanations of those parts will be omitted as appropriate, and the explanation will focus on the parts that differ from the information processing system 100. The information processing system 300 differs from the information processing system 100 in that the imaging unit 101 and the image processing unit 102 are replaced by the imaging unit 301 and the image processing unit 302, respectively, but otherwise it is configured the same as the information processing system 100. The information processing system 300 immediately displays captured video in emergencies, such as when the user is about to collide with a wall.
[0135] Specifically, the imaging unit 301 of the information processing system 300 photographs the surroundings according to the operating mode supplied by the control unit 111 and acquires captured images in frame units. Specifically, when the operating mode is normal mode, the imaging unit 301 acquires captured images of a predetermined resolution by sequentially starting exposure for a predetermined time for each horizontal line. On the other hand, when the operating mode is emergency mode, the imaging unit 301 sequentially starts exposure for a shorter time than in normal mode for each N horizontal lines (where N is an integer of 2 or more) and reads the charge for each N horizontal line. As a result, the imaging unit 301 acquires captured images with lower exposure and lower resolution than in normal mode. Therefore, although the exposure and resolution of the captured images in emergency mode are lower than in normal mode, the captured images in emergency mode are acquired at a higher speed than the captured images in normal mode. The imaging unit 301 supplies the acquired captured images to the image processing unit 302. Hereafter, captured images with lower exposure and lower resolution than in normal mode will be referred to as low-quality captured images.
[0136] The image processing unit 302 differs from the image processing unit 102 in that the ISP processing unit 110, control unit 111, synthesis unit 115, and display control unit 116 are replaced by the ISP processing unit 310, control unit 311, synthesis unit 315, and display control unit 316, respectively. Otherwise, it is configured the same as the image processing unit 302.
[0137] The ISP processing unit 310 performs ISP processing on the captured video supplied from the imaging unit 301 according to the operating mode supplied from the control unit 311. Specifically, when the operating mode is normal mode, the ISP processing unit 310 performs ISP processing, including noise reduction processing, on captured video of a predetermined resolution. On the other hand, when the operating mode is emergency mode, the ISP processing unit 310 performs simplified ISP processing, which omits noise reduction processing and other parts of the ISP processing, on low-quality captured video. Therefore, the processing time of the ISP processing unit 310 in emergency mode is shorter than the processing time in normal mode. The ISP processing unit 310 supplies the captured video after ISP processing to the recognition processing unit 112 and the synthesis unit 315.
[0138] The control unit 311 sets the operating mode to either normal mode or emergency mode. For example, if the user needs to take action to avoid danger, the control unit 311 sets the operating mode to emergency mode. The control unit 311 supplies the set operating mode to the imaging unit 301, the ISP processing unit 310, the synthesis unit 315, and the display control unit 316. When the operating mode is set to normal mode, the control unit 311 controls the generation of CG images in the same way as the control unit 111.
[0139] If the operating mode supplied by the control unit 311 is the emergency mode, the synthesis unit 315 supplies the low-quality captured video supplied by the ISP processing unit 310 directly to the display control unit 316. On the other hand, if the operating mode is the normal mode, the synthesis unit 315, like the synthesis unit 115, generates a synthesized video according to the synthesis mode and supplies it to the display control unit 316.
[0140] The display control unit 316 sets the display method according to the operating mode supplied by the control unit 311. Specifically, when the operating mode is emergency mode, the display control unit 316 sets the display method to the high-speed display method, which displays for every N horizontal lines of the display unit 103. On the other hand, when the operating mode is normal mode, the display control unit 316 sets the display method to the normal display method, which displays for every one horizontal line of the display unit 103. The display control unit 316 outputs the low-quality captured video or synthesized video supplied by the synthesis unit 315 to the display unit 103 and controls the display unit 103 to display it using the set display method.
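The way a single operating mode drives the imaging, ISP, synthesis, and display settings described above can be illustrated with a minimal Python sketch. This is not the implementation of the information processing system 300. The names `OperatingMode`, `PipelineSettings`, and `settings_for`, the stage name "demosaic", and all numeric defaults are hypothetical and chosen only for illustration; only the omission of noise reduction, the grouping of N horizontal lines, the shorter exposure, and the bypass of CG synthesis in emergency mode come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class OperatingMode(Enum):
    NORMAL = "normal"
    EMERGENCY = "emergency"


@dataclass(frozen=True)
class PipelineSettings:
    lines_per_readout: int       # 1 in normal mode, N (>= 2) in emergency mode
    exposure_us: int             # exposure time per readout group (assumed unit)
    isp_steps: tuple             # ISP stages that are executed (names are hypothetical)
    synthesize_cg: bool          # False: captured video is passed through as is
    display_lines_per_step: int  # display lines driven per display step


def settings_for(mode, n=4, normal_exposure_us=8000, emergency_exposure_us=2000):
    """Return the settings every unit would derive from the operating mode."""
    if n < 2:
        raise ValueError("N must be an integer of 2 or more")
    if emergency_exposure_us >= normal_exposure_us:
        raise ValueError("emergency exposure must be shorter than normal exposure")
    if mode is OperatingMode.EMERGENCY:
        # Simplified ISP: noise reduction is omitted; no CG synthesis.
        return PipelineSettings(n, emergency_exposure_us, ("demosaic",), False, n)
    return PipelineSettings(
        1, normal_exposure_us, ("demosaic", "noise_reduction"), True, 1
    )
```

Deriving all settings from one function mirrors the fact that the control unit 311 supplies the same operating mode to every unit, so the units cannot disagree about which mode is active.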
[0141] <Explanation of processing according to the operating mode> Figure 15 is a diagram that explains the processing according to the operating mode.
[0142] In Figure 15, the horizontal axis represents time, and the vertical axis represents a horizontal line. The upper graph in Figure 15 represents the processing time of the imaging unit 301, and the lower graph in Figure 15 represents the processing time of the display unit 103.
[0143] As shown on the left side of Figure 15, when the operating mode is normal mode, the imaging unit 301 sequentially starts exposure for a predetermined time E_n for each horizontal line and reads out the accumulated charge, thereby acquiring a captured image of a predetermined resolution. As a result, the time required to acquire one frame of captured image is time C_n.
[0144] In contrast, as shown on the right side of Figure 15, when the operating mode is emergency mode, the imaging unit 301 sequentially starts exposure for a time E_e, which is shorter than time E_n, for every N horizontal lines, and reads out the charge for every N horizontal lines. As a result, the resolution of the captured image acquired by the imaging unit 301 is lower than in normal mode, but the time required to acquire one frame of captured image is time C_e, which is shorter than time C_n.
[0145] As described above, when the operating mode is emergency mode, the captured video is acquired at a faster speed compared to when it is normal mode.
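As a rough numerical illustration of why the acquisition time shrinks, the following sketch uses a simplified rolling-shutter model that is an assumption of this example and is not stated in the description: each readout group starts its exposure a fixed interval after the previous group, and the frame is complete when the last group finishes its exposure. The function name `frame_acquisition_time` and all numeric values are hypothetical.

```python
import math


def frame_acquisition_time(total_lines, lines_per_group, exposure, group_interval):
    """Time from the first exposure start to the end of the last group's exposure.

    Assumed model: groups start exposure one after another, `group_interval`
    apart, and each group is exposed for `exposure` (same time unit).
    """
    groups = math.ceil(total_lines / lines_per_group)
    return (groups - 1) * group_interval + exposure


# Hypothetical values in microseconds for a 1080-line sensor.
c_n = frame_acquisition_time(1080, 1, exposure=8000, group_interval=10)  # normal mode
c_e = frame_acquisition_time(1080, 4, exposure=2000, group_interval=10)  # emergency, N = 4
```

Under these assumed values, `c_n` is 18790 and `c_e` is 4690, so both the shorter exposure and the smaller number of readout groups contribute to C_e being shorter than C_n.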
[0146] As shown on the left side of Figure 15, when the operating mode is normal mode, ISP processing, object recognition processing, CG rendering, synthesis, etc. are performed on the acquired captured video, and the resulting synthesized video is supplied to the display control unit 316. Therefore, a predetermined delay time occurs from the time the captured video is acquired until the display starts. The display control unit 316 then displays the synthesized video on the display unit 103 using the normal display method. Therefore, the synthesized video is displayed for each horizontal line on the display unit 103, and the time required to display one frame is time D_n.
[0147] In contrast, as shown on the right side of Figure 15, when the operating mode is emergency mode, only simplified ISP processing is performed on the acquired low-quality captured video, and the captured video after simplified ISP processing is supplied to the display control unit 316. Therefore, the delay time from when the captured video is acquired until the display starts is shorter than in normal mode. The display control unit 316 also displays the low-quality captured video after simplified ISP processing on the display unit 103 using the high-speed display method. As a result, the low-quality captured video is displayed for every N horizontal lines of the display unit 103, that is, for each horizontal line of the low-quality captured video. Consequently, the time required to display one frame is time D_e, which is shorter than time D_n.
[0148] As described above, when the operation mode is the emergency mode, the time required for shooting, the delay time from when the captured video is acquired until the display starts, and the time required for display are shorter than in the normal mode. Therefore, in this case, although the quality is lower than in the normal mode, the captured video can be displayed on the display unit 103 with low latency.
[0149] <Explanation of display processing> FIG. 16 is a flowchart for explaining the display processing by the information processing system 300 when the operation mode is the emergency mode. This display processing is performed, for example, in units of frames.
[0150] In step S61 of FIG. 16, the imaging unit 301 of the information processing system 300 acquires a captured video of lower quality than in the normal mode and supplies it to the image processing unit 302.
[0151] In step S62, the ISP processing unit 310 of the image processing unit 302 performs simple ISP processing on the captured video of lower quality acquired in step S61. The ISP processing unit 310 supplies the captured video of lower quality after the simple ISP processing to the recognition processing unit 112 and the synthesis unit 315.
[0152] In step S63, the synthesis unit 315 outputs the captured video of lower quality on which the simple ISP processing has been performed in step S62 as it is to the display control unit 316. In step S64, the display control unit 316 sets the display method to the high-speed display method. In step S65, the display control unit 316 supplies the captured video of lower quality output in step S63 to the display unit 103 and causes the display unit 103 to display it in the high-speed display method. Then, the display processing ends.
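Steps S61 to S65 can be traced with a small pure-Python sketch. It is an illustration under stated assumptions, not the processing of the information processing system 300: a frame is modeled as a list of rows of pixel values, the readout of N horizontal lines at a time is modeled as a column-wise average (an assumption; the description only says the charge is read for every N lines), and the simplified ISP is reduced to value clamping. All function names are hypothetical.

```python
def capture_low_quality(scene, n):
    """S61: produce one row per group of n sensor lines (assumed: column-wise average)."""
    rows = []
    for start in range(0, len(scene), n):
        group = scene[start:start + n]
        rows.append([sum(col) / len(group) for col in zip(*group)])
    return rows


def simple_isp(frame):
    """S62: simplified ISP without noise reduction (here: only clamp to 0..255)."""
    return [[min(255.0, max(0.0, v)) for v in row] for row in frame]


def synthesize_emergency(frame):
    """S63: pass the low-quality captured video through without adding CG."""
    return frame


def display_high_speed(frame, n, display_lines):
    """S64, S65: each low-quality row drives n display lines at once."""
    shown = []
    for row in frame:
        shown.extend([row] * n)
    return shown[:display_lines]


def emergency_display(scene, n):
    """Run S61 to S65 for one frame and return the rows shown on the display."""
    low = capture_low_quality(scene, n)
    return display_high_speed(synthesize_emergency(simple_isp(low)), n, len(scene))
```

For example, `emergency_display([[0, 0], [2, 2], [4, 4], [6, 6]], 2)` yields four display rows in which the first two rows show the average of sensor lines 0 and 1 and the last two rows show the average of sensor lines 2 and 3.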
[0153] Note that the display processing when the operation mode is the normal mode consists of imaging processing by the imaging unit 301, which is the same as that of the imaging unit 101, and the composite video display processing of FIG. 9, so the description is omitted. In this display processing, the display control unit 316 sets the display method to the normal display method before the processing of step S22 in FIG. 9.
[0154] As described above, in the information processing system 300, when the operating mode is normal mode, the image processing unit 302 dynamically switches the synthesis mode, similar to the image processing unit 102, and generates a synthesized image according to the synthesis mode. Therefore, the same effect as the image processing unit 102 can be obtained.
[0155] In the information processing system 300, the display control unit 316 outputs either low-quality captured video or a composite video depending on the operating mode. Therefore, for example, when the operating mode is emergency mode, the display control unit 316 outputs low-quality captured video to the display unit 103 for display, thereby enabling the display of captured video with low latency. As a result, for example, if the control unit 311 sets the operating mode to emergency mode when the user needs to take action to avoid danger, the user can quickly grasp the current situation from the display on the display unit 103 and take action to avoid danger. Therefore, injuries and accidents to the user can be prevented, and safety can be enhanced.
[0156] <5. Fourth Embodiment> <Example of Information Processing System Configuration> Figure 17 is a block diagram showing an example of the configuration of the fourth embodiment of an information processing system to which this technology is applied.
[0157] In the information processing system 400 shown in Figure 17, the parts corresponding to those in the information processing system 100 shown in Figure 3 are denoted by the same reference numerals. Therefore, explanations of those parts will be omitted as appropriate, and the explanation will focus on the parts that differ from the information processing system 100. The information processing system 400 differs from the information processing system 100 in that the image processing unit 102 is replaced by the image processing unit 402, and otherwise it is configured in the same way as the information processing system 100. The information processing system 400 changes the synthesis mode based on the prediction error of the object recognition result.
[0158] Specifically, the image processing unit 402 differs from the image processing unit 102 in that it includes a recognition processing unit 412 and a synthesis mode setting unit 414 instead of the recognition processing unit 112 and the synthesis mode setting unit 114. Otherwise, it is configured the same as the image processing unit 102.
[0159] The recognition processing unit 412, like the recognition processing unit 112, generates object recognition result information or object recognition result prediction information and supplies it to the CG image generation unit 113. The recognition processing unit 412 supplies reference presence / absence information and object type information to the synthesis mode setting unit 414.
[0160] The recognition processing unit 412 also calculates the difference between the object recognition result of a frame that was previously designated as a predicted frame in the object recognition process and the predicted result of the object recognition result of that frame, and uses this difference as the prediction error for that frame. The recognition processing unit 412 then supplies this prediction error to the synthesis mode setting unit 414.
[0161] The synthesis mode setting unit 414, similar to the synthesis mode setting unit 114, sets the synthesis mode based on the reference presence / absence information and object type information supplied from the recognition processing unit 412, and supplies it to the synthesis unit 115. The synthesis mode setting unit 414 also changes the synthesis mode to a delay-tolerant mode if the number of consecutive frames in which the prediction error supplied from the recognition processing unit 412 is greater than or equal to a predetermined value is greater than or equal to a predetermined number. The synthesis mode setting unit 414 supplies the changed synthesis mode to the synthesis unit 115.
[0162] As described above, the image processing unit 402 of the information processing system 400 dynamically switches the synthesis mode, similar to the image processing unit 102, and generates a synthesized image according to the synthesis mode. Therefore, the same effect as the image processing unit 102 can be obtained.
[0163] The image processing unit 402 also changes the synthesis mode to a delay-tolerant mode based on the prediction error. For example, if the number of consecutive frames in which the prediction error is greater than or equal to a predetermined value is greater than or equal to a predetermined number, the image processing unit 402 changes the synthesis mode to a delay-tolerant mode. This means that, for example, if the reference object is a predictable object, but the actual movement of the reference object is difficult to predict, and the prediction error becomes large, the synthesis mode is changed to a delay-tolerant mode. As a result, it is possible to prevent discrepancies in the positional relationship between the reference object and the CG in the synthesized image.
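The rule "change to the delay-tolerant mode when the prediction error stays at or above a predetermined value for a predetermined number of consecutive frames" can be sketched as follows. This is a minimal sketch under assumptions: the object recognition result is reduced to a 2D position, the prediction error is taken to be the Euclidean distance between the recognized and predicted positions, and the class name `PredictionErrorMonitor` and the threshold values are hypothetical. The description does not specify how the difference is computed.

```python
import math


class PredictionErrorMonitor:
    """Counts consecutive frames whose prediction error is at or above a threshold."""

    def __init__(self, error_threshold, frame_count_threshold):
        self.error_threshold = error_threshold
        self.frame_count_threshold = frame_count_threshold
        self.consecutive = 0

    def update(self, recognized_pos, predicted_pos):
        """Feed one frame; return True if the mode should become delay-tolerant."""
        # Assumed error metric: Euclidean distance between the two positions.
        error = math.dist(recognized_pos, predicted_pos)
        if error >= self.error_threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # the run of large errors is broken
        return self.consecutive >= self.frame_count_threshold
```

Requiring a run of consecutive frames, rather than reacting to a single large error, keeps one outlier frame from forcing a mode change.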
[0164] <6. Fifth Embodiment> <Example of Information Processing System Configuration> Figure 18 is a block diagram showing an example of the configuration of the fifth embodiment of an information processing system to which this technology is applied.
[0165] In the information processing system 500 in Figure 18, the parts corresponding to the information processing system 100 in Figure 3 are denoted by the same reference numerals. Therefore, explanations of those parts will be omitted as appropriate, and the explanation will focus on the parts that differ from the information processing system 100. The information processing system 500 differs from the information processing system 100 in that the image processing unit 102 is replaced by the image processing unit 502, and otherwise it is configured the same as the information processing system 100. The information processing system 500 gradually changes the synthesis mode from one of the low-latency mode and the delay-tolerant mode to the other.
[0166] Specifically, the image processing unit 502 differs from the image processing unit 102 in that it includes a recognition processing unit 512 and a CG image generation unit 513 instead of the recognition processing unit 112 and the CG image generation unit 113. The image processing unit 502 also differs from the image processing unit 102 in that it includes a synthesis mode setting unit 514 and a synthesis unit 515 instead of the synthesis mode setting unit 114 and the synthesis unit 115. Otherwise, it is configured the same as the image processing unit 102.
[0167] The recognition processing unit 512 performs object recognition processing in the same manner as the recognition processing unit 112 and obtains an object recognition result. If the reference object is a predictable object or a stationary object, and the current synthesis mode supplied by the synthesis mode setting unit 514 is a delay-tolerant mode, the recognition processing unit 512 predicts the object recognition result of the intermediate frame in the same manner as the recognition processing unit 112. That is, if the synthesis mode is subsequently changed to an intermediate mode, the recognition processing unit 512 predicts the object recognition result of the intermediate frame. An intermediate mode is a mode that is temporarily set when the synthesis mode is changed from one of the low-latency mode and the delay-tolerant mode to the other, before the change to the other mode is made. An intermediate frame is a frame between the target frame and the prediction frame.
[0168] If the reference object is a predictable object or a stationary object, and the current synthesis mode is intermediate mode, the recognition processing unit 512 predicts the object recognition result for the predicted frame, similar to the recognition processing unit 112. That is, if the synthesis mode is subsequently changed to low-latency mode, the recognition processing unit 512 predicts the object recognition result for the predicted frame. The recognition processing unit 512 supplies the object recognition result prediction information for the intermediate frame or predicted frame to the CG image generation unit 513.
[0169] On the other hand, if the reference object is an object that is difficult to predict, the recognition processing unit 512 supplies object recognition result information to the CG image generation unit 513. The recognition processing unit 512 supplies reference presence / absence information and object type information to the synthesis mode setting unit 514.
[0170] The CG image generation unit 513 generates a fixed CG image, a target CG image, and a predicted CG image, similar to the CG image generation unit 113. When object recognition result prediction information for an intermediate frame is supplied from the recognition processing unit 512, the CG image generation unit 513 draws CG at the position based on the object recognition result prediction information, based on the CG drawing information. As a result, the CG image generation unit 513 generates an intermediate CG image, which is a CG image for the captured image of the intermediate frame. The CG image generation unit 513 supplies the generated fixed CG image, target CG image, predicted CG image, or intermediate CG image to the synthesis unit 515.
[0171] The synthesis mode setting unit 514, similar to the synthesis mode setting unit 114, determines the synthesis mode to be either a low-latency mode or a delay-tolerant mode based on the reference presence / absence information and object type information supplied from the recognition processing unit 512. If the currently set synthesis mode is either a low-latency mode or a delay-tolerant mode, and the determined synthesis mode is the other, the synthesis mode setting unit 514 sets the final synthesis mode to an intermediate mode. That is, when changing the synthesis mode from one of the low-latency mode or a delay-tolerant mode to the other, the synthesis mode setting unit 514 sets the final synthesis mode to an intermediate mode. On the other hand, if the currently set synthesis mode is an intermediate mode and the determined synthesis mode is either a low-latency mode or a delay-tolerant mode, the synthesis mode setting unit 514 sets the final synthesis mode to the determined synthesis mode. The synthesis mode setting unit 514 supplies the final synthesis mode to the recognition processing unit 512 and the synthesis unit 515.
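The rule by which the final synthesis mode is chosen from the currently set mode and the determined mode can be written as a small state-transition function. The sketch below covers the case of a single intermediate mode; the names `SynthesisMode` and `final_mode` are hypothetical and are used only for illustration.

```python
from enum import Enum


class SynthesisMode(Enum):
    LOW_LATENCY = "low_latency"
    DELAY_TOLERANT = "delay_tolerant"
    INTERMEDIATE = "intermediate"


def final_mode(current, determined):
    """Return the final synthesis mode for this frame.

    `determined` is the mode decided from the reference presence / absence
    information and object type information, so it is never INTERMEDIATE.
    """
    if determined is SynthesisMode.INTERMEDIATE:
        raise ValueError("the determined mode must be low-latency or delay-tolerant")
    if current is SynthesisMode.INTERMEDIATE:
        # The temporary mode has been shown; now complete the change.
        return determined
    if current is not determined:
        # A direct jump is never made; pass through the intermediate mode first.
        return SynthesisMode.INTERMEDIATE
    return current
```

Called once per frame with the previous final mode as `current`, a change from the delay-tolerant mode to the low-latency mode therefore takes two calls: the first returns the intermediate mode and the second returns the low-latency mode.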
[0172] If the synthesis mode supplied by the synthesis mode setting unit 514 is the low-latency mode or the delay-tolerant mode, the synthesis unit 515 synthesizes the captured video with the fixed CG video, the target CG video, or the predicted CG video, similar to the synthesis unit 115. On the other hand, if the synthesis mode is the intermediate mode, the synthesis unit 515 synthesizes the captured video of the intermediate frame with the target CG video or intermediate CG video supplied by the CG image generation unit 513 to generate a synthesized video. The synthesis unit 515 supplies the generated synthesized video to the display control unit 116.
[0173] <Explanation of the processing flow when changing from delay-tolerant mode to intermediate mode> Figure 19 is a diagram illustrating the processing flow of the information processing system 500 when changing from delay-tolerant mode to intermediate mode.
[0174] As shown in Figure 19, the imaging unit 101 captures images frame by frame, and the ISP processing unit 110 performs ISP processing on the captured video of each frame obtained as a result. The recognition processing unit 512 performs object recognition processing of a reference object on the captured video after ISP processing, and predicts the object recognition result of the intermediate frame based on the object recognition result obtained as a result. In the example in Figure 19, the predicted frame is a frame two frames after the target frame, and the intermediate frame is a frame between the target frame and the predicted frame, one frame after the target frame. The CG image generation unit 513 generates an intermediate CG video based on the object recognition result prediction information representing the prediction result by the recognition processing unit 512 and CG drawing information.
[0175] At this time, the imaging unit 101 has completed imaging of a predicted frame two frames after the target frame, and the ISP processing unit 110 has completed ISP processing on the image of that predicted frame.
[0176] However, the synthesis unit 515 synthesizes the captured video and intermediate CG video of the intermediate frame supplied by the ISP processing unit 110. As a result, the display unit 103 displays a composite image, which is a combination of the captured video and intermediate CG video of the intermediate frame.
[0177] Furthermore, the predicted frame can be a frame M (where M is 3 or more) frames after the target frame, meaning that the number of frames between the target frame and the predicted frame is M-1. In this case, the number of types of intermediate frames is M-1.
[0178] Specifically, before switching from one of the low-latency mode and the delay-tolerant mode to the other, M-1 intermediate modes with different intermediate frames are set. When switching from delay-tolerant mode to low-latency mode, the intermediate frames of these M-1 intermediate modes are set in order, one frame at a time, starting from the frame one frame after the target frame and continuing up to the frame one frame before the predicted frame. On the other hand, when switching from low-latency mode to delay-tolerant mode, the intermediate frames of the M-1 intermediate modes are set in order, one frame at a time, starting from the frame one frame before the predicted frame and continuing back to the frame one frame after the target frame.
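The order in which the M-1 intermediate frames are used can be expressed as a list of frame offsets from the target frame. The following is a minimal sketch; the function name `intermediate_offsets` is hypothetical, and offsets are counted in frames after the target frame, so the target frame is offset 0 and the predicted frame is offset M.

```python
def intermediate_offsets(m, to_low_latency):
    """Offsets (in frames after the target frame) of the M-1 intermediate frames.

    m: number of frames between the target frame and the predicted frame
       (the example of Figure 19 corresponds to m = 2).
    to_low_latency: True when switching from delay-tolerant to low-latency mode,
       False when switching in the opposite direction.
    """
    if m < 2:
        raise ValueError("at least one intermediate frame requires m >= 2")
    offsets = list(range(1, m))  # 1, 2, ..., m - 1
    return offsets if to_low_latency else offsets[::-1]
```

With m = 4, switching toward the low-latency mode steps through offsets 1, 2, 3 before the predicted frame (offset 4) is used, and switching toward the delay-tolerant mode steps through 3, 2, 1 before the target frame (offset 0) is used, so the captured video shifts by one frame per step.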
[0179] As described above, the image processing unit 502 of the information processing system 500 dynamically switches the synthesis mode, similar to the image processing unit 102, and generates a synthesized image according to the synthesis mode. Therefore, the same effect as the image processing unit 102 can be obtained. The image processing unit 502 also temporarily sets the synthesis mode to an intermediate mode before changing the synthesis mode from one of the low-latency mode and the delay-tolerant mode to the other. That is, the image processing unit 502 changes the synthesis mode from one of the low-latency mode and the delay-tolerant mode to the other in a stepwise (seamless) manner. This reduces the user's discomfort or unease when switching synthesis modes.
[0180] Specifically, as mentioned above, when switching between synthesis modes, the behavior of the displayed synthesized image becomes unnatural. However, the image processing unit 502 can gradually switch between the low-latency mode and the delay-tolerant mode, thereby delaying or speeding up the time of the captured video within the synthesized image by one frame at a time. Therefore, the unnaturalness of the synthesized image's behavior can be reduced, and the user's sense of unease or discomfort can be alleviated.
[0181] <7. Description of a computer using this technology>
[0182] The series of processes described above can be executed by hardware or by software. When the series of processes are executed by software, the programs that make up that software are installed on a computer. Here, a computer includes computers built into dedicated hardware, as well as general-purpose personal computers, for example, that can perform various functions by installing various programs.
[0183] Figure 20 is a block diagram showing an example of the hardware configuration of a computer that executes the series of processes described above using a program.
[0184] In a computer, the processing circuit 901, ROM (Read Only Memory) 902, and RAM (Random Access Memory) 903 are interconnected by a bus 904.
[0185] An input / output interface 905 is further connected to the bus 904. An input unit 906, an output unit 907, a storage unit 908, a communication unit 909, and a drive 910 are connected to the input / output interface 905.
[0186] The input unit 906 may include physical or virtual operating means that the user operates to input information, such as a keyboard, mouse, or touch panel, as well as means that the user inputs information through voice, eye gaze, etc. Furthermore, the input unit 906 may include sensors for inputting various physical quantities to the computer. For example, the input unit 906 may include sensors that acquire physical quantities such as light (including infrared light other than visible light) or sound, such as a camera or microphone. Also, for example, the input unit 906 may include sensors that acquire other physical quantities such as temperature, moisture content, acceleration, distance, etc. The output unit 907 may include means that present information to the user by stimulating the user's perception, such as a display, speaker, or haptic device. The storage unit 908 is composed of a hard disk, non-volatile or volatile memory, etc., and stores various types of information (including programs). The communication unit 909 is a network interface, etc., and performs wired or wireless communication with the outside. The drive 910 drives removable media 911 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory.
[0187] The processing circuit 901 includes a processor that executes programs such as a CPU (Central Processing Unit) and a DSP (Digital Signal Processor). The processing circuit 901 (its processor) performs the above-described series of processes by loading the program stored in the storage unit 908 into the RAM 903 via the input / output interface 905 and the bus 904 and executing it. The processing circuit 901 can output the processing results of the series of processes from the output unit 907 via the bus 904 and the input / output interface 905 as needed. The processing circuit 901 can also store the processing results in the storage unit 908 or transmit them from the communication unit 909.
[0188] The program executed by the computer (processing circuit 901) can be provided by recording it on a removable medium 911, such as a package medium. The program can also be provided via wired or wireless transmission media, such as a local area network, the internet, or digital satellite broadcasting.
[0189] In a computer, a program can be installed in the storage unit 908 via the input / output interface 905 by inserting a removable media 911 into the drive 910. Alternatively, a program can be received by the communication unit 909 from another device, such as a server, via a wired or wireless transmission medium, and installed in the storage unit 908. Furthermore, programs can be pre-installed in the ROM 902 or the storage unit 908.
[0190] The programs executed by the computer may be programs that are processed chronologically in the order described herein, or they may be programs that are processed in parallel or at necessary times, such as when a call is made.
[0191] The processes that a computer performs according to a program do not necessarily have to follow the order described in the flowchart. In other words, the processes that a computer performs according to a program include processes that are executed in parallel or individually (e.g., parallel processing and object-based processing).
[0192] The program may be processed by a single computer (processor), or it may be processed in a distributed manner by multiple computers. Furthermore, the program may be transferred to a remote computer and executed there.
[0193] When the computer executes a program to perform the above-described series of processes, the input unit 906 functions as the imaging unit 101 (301). The processing circuit 901 (its processor) executes a program to function as the image processing unit 102 (202, 302, 402, 502), and the output unit 907 functions as the display unit 103.
[0194] In this specification, a system means one component or a collection of multiple components (devices, modules (parts), etc.). Therefore, one or more components of a computer, for example, only the processor, or a combination of the processor and memory (for example, only the processing circuit 901, or a combination of the processing circuit 901 to the bus 904, etc.), constitute a system. Regarding a collection of multiple components, it does not matter whether all components reside in the same enclosure. Therefore, multiple devices housed in separate enclosures and connected via a network, and a single device containing multiple modules within a single enclosure, are both systems. Furthermore, for example, the entire computer, or a combination of a computer and other devices such as a server (not shown), also constitutes a system.
[0195] The components (blocks) of the apparatus illustrated in this specification are functional conceptual blocks, and the actual apparatus does not need to have the illustrated configuration. That is, the apparatus can have any configuration in which the functions of the illustrated components are divided and / or integrated into any unit, for example, a configuration having one block in which the functions of all components are integrated.
[0196] The embodiments of this technology are not limited to those described above, and various modifications are possible without departing from the spirit of this technology.
[0197] For example, a combination of all or some of the above-described embodiments can be adopted.
[0198] For example, this technology can be configured as cloud computing, where a single function is shared and processed collaboratively by multiple devices via a network.
[0199] Furthermore, each step described in the flowchart above can be performed by a single device, or it can be divided and performed by multiple devices.
[0200] Furthermore, if a single step includes multiple processes, those processes can be executed by a single device or shared among multiple devices.
[0201] The effects described herein are merely illustrative and not limited to those described herein; other effects may also occur.
[0202] The technology can take the following configurations:
(1) An information processing system including a circuit that generates a first virtual image which is a virtual image for a first captured image, or a second virtual image which is a virtual image for a second captured image which is a captured image of a second frame after a first frame, based on the object recognition result of the first captured image which is a captured image of the first frame, and generates a first composite image which is a composite of the first captured image and the first virtual image, or a second composite image which is a composite of the second captured image and the second virtual image, depending on a synthesis mode.
(2) The information processing system according to (1), wherein the circuit is further configured to set the synthesis mode to a second image synthesis mode which uses the second captured image for synthesis, if the object recognized in the first captured image is an object for which the object recognition result of the second captured image can be predicted, and the circuit generates the second composite image which is a composite of the second captured image and the second virtual image, when the synthesis mode is the second image synthesis mode.
(3) The information processing system according to (1) or (2), wherein the circuit is further configured to correct the second captured image and the second virtual image based on a self-position estimated from the captured images of a plurality of frames including the first frame, and the circuit combines the corrected second captured image and second virtual image to generate the second composite image.
(4) The information processing system according to any one of (1) to (3), wherein the circuit is further configured to correct the first composite image based on a self-position estimated from the captured images of a plurality of frames including the first frame.
(5) The information processing system according to any one of (1) to (4), wherein the circuit is further configured to output the first composite image or the second composite image, or output a low-quality first captured image, depending on the operating mode.
(6) The information processing system according to (1), wherein the circuit is further configured to set the synthesis mode based on an instruction.
(7) The information processing system according to any one of (1) to (6), wherein the circuit is further configured to change the synthesis mode to a first image synthesis mode that uses the first captured image for synthesis, based on the predicted object recognition result for the second captured image and the object recognition result of the second captured image.
(8) The information processing system according to any one of (1) to (7), wherein the circuit is further configured to change the synthesis mode to the other after a predetermined time has elapsed since it was set to one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis.
(9) The information processing system according to any one of (1) to (8), wherein the circuit is further configured to gradually change the synthesis mode from one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis to the other.
(10) The information processing system according to any one of (1) to (9), comprising the circuit and a display that presents the first composite image or the second composite image generated by the circuit.
(11) The information processing system according to any one of (1) to (10), comprising the circuit and an imaging unit for acquiring the first captured image and the second captured image.
(12) An information processing method comprising: an information processing system generating a first virtual image which is a virtual image for a first captured image, or a second virtual image which is a virtual image for a second captured image which is a captured image of a second frame after a first frame, based on the object recognition result of the first captured image which is a captured image of the first frame; and generating a first composite image which is a composite of the first captured image and the first virtual image, or a second composite image which is a composite of the second captured image and the second virtual image, depending on a synthesis mode.
(13) The information processing method according to (12), further comprising setting the synthesis mode to a second image synthesis mode in which the second captured image is used for synthesis, if the object recognized in the first captured image is an object for which the object recognition result of the second captured image can be predicted, and, if the synthesis mode is the second image synthesis mode, combining the second captured image and the second virtual image to generate the second composite image.
(14) The information processing method according to (12) or (13), further comprising correcting the second captured image and the second virtual image based on a self-position estimated from the captured images of a plurality of frames including the first frame, and combining the corrected second captured image and second virtual image to generate the second composite image.
(15) The information processing method according to any one of (12) to (14), further comprising correcting the first composite image based on a self-position estimated from the captured images of a plurality of frames including the first frame.
(16) The information processing method according to any one of (12) to (15), further comprising outputting the first composite image or the second composite image, or outputting a low-quality first captured image, depending on the operating mode.
(17) The information processing method according to (12), further comprising setting the synthesis mode based on an instruction.
(18) The information processing method according to any one of (12) to (17), further comprising changing the synthesis mode to a first image synthesis mode that uses the first captured image for synthesis, based on the predicted object recognition result for the second captured image and the object recognition result of the second captured image.
(19) The information processing method according to any one of (12) to (18), further comprising setting the synthesis mode to one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis, and then changing it to the other after a predetermined amount of time has elapsed.
(20) The information processing method according to any one of (12) to (19), further comprising gradually changing the synthesis mode from one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis to the other.
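The mode selection in configurations (1), (2), and (7) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the implementation of this technology: the names `SynthesisMode`, `Recognition`, `select_mode`, `predict_position`, `compose`, and `revise_mode` are hypothetical, the constant-velocity prediction is one possible way to predict a recognition result, and the pixel tolerance used to compare the predicted and recognized positions is an assumed parameter.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

Point = Tuple[float, float]


class SynthesisMode(Enum):
    FIRST_IMAGE = 1   # composite uses the first captured image (the recognized frame)
    SECOND_IMAGE = 2  # composite uses the second captured image (a later frame)


@dataclass
class Recognition:
    position: Point    # recognized object position in the first frame (pixels)
    velocity: Point    # assumed per-frame motion; (0, 0) for a stationary object
    predictable: bool  # hypothetical flag: can the next recognition result be predicted?


def select_mode(rec: Recognition) -> SynthesisMode:
    """Configuration (2): use the later frame only for predictable objects."""
    return SynthesisMode.SECOND_IMAGE if rec.predictable else SynthesisMode.FIRST_IMAGE


def predict_position(rec: Recognition, frames_ahead: int = 1) -> Point:
    """Assumed constant-velocity prediction of the recognition result."""
    x, y = rec.position
    vx, vy = rec.velocity
    return (x + vx * frames_ahead, y + vy * frames_ahead)


def compose(mode: SynthesisMode, first_frame, second_frame, rec: Recognition):
    """Configuration (1): return the captured frame to use and the virtual image anchor."""
    if mode is SynthesisMode.SECOND_IMAGE:
        return second_frame, predict_position(rec)
    return first_frame, rec.position


def revise_mode(mode: SynthesisMode, predicted: Point, observed: Point,
                tolerance_px: float = 5.0) -> SynthesisMode:
    """Configuration (7): fall back to the first image mode when the prediction
    and the recognition result of the second frame disagree (assumed threshold)."""
    if mode is SynthesisMode.SECOND_IMAGE and math.dist(predicted, observed) > tolerance_px:
        return SynthesisMode.FIRST_IMAGE
    return mode
```

In this sketch, a predictable object moving 2 pixels per frame from (100, 50) is composited onto the second frame with its virtual image anchored at (102, 50); if the second frame's recognition result later deviates from that prediction by more than the tolerance, the mode reverts to the first image synthesis mode.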
[0203] 100 Information processing system, 101 Imaging unit, 102 Image processing unit, 103 Display unit, 200 Information processing system, 202 Image processing unit, 300 Information processing system, 301 Imaging unit, 302 Image processing unit, 400 Information processing system, 402 Image processing unit, 500 Information processing system, 502 Image processing unit, 901 Processing circuit
Claims
1. An information processing system including a circuit that generates a first virtual image, which is a virtual image of the first captured image, or a second virtual image, which is a virtual image of the second captured image, which is a captured image of a second frame after the first frame, based on the object recognition result of the first captured image, which is a captured image of the first frame; and generates a first composite image by combining the first captured image and the first virtual image, or a second composite image by combining the second captured image and the second virtual image, depending on the synthesis mode.
2. The information processing system according to claim 1, wherein the circuit is further configured to set the synthesis mode to a second image synthesis mode that uses the second captured image for synthesis when the object recognized in the first captured image is an object for which the object recognition result of the second captured image can be predicted, and the circuit generates the second composite image by combining the second captured image and the second virtual image when the synthesis mode is the second image synthesis mode.
3. The information processing system according to claim 1, wherein the circuit is further configured to correct the second captured image and the second virtual image based on a self-position estimated from the captured images of a plurality of frames including the first frame, and the circuit combines the corrected second captured image and second virtual image to generate the second composite image.
4. The information processing system according to claim 1, wherein the circuit is further configured to correct the first composite image based on a self-position estimated from the captured images of a plurality of frames including the first frame.
5. The information processing system according to claim 1, wherein the circuit is further configured to output the first composite image or the second composite image, or to output a low-quality first captured image, depending on the operating mode.
6. The information processing system according to claim 1, wherein the circuit is further configured to set the synthesis mode based on an instruction.
7. The information processing system according to claim 1, wherein the circuit is further configured to change the synthesis mode to a first image synthesis mode in which the first captured image is used for synthesis, based on the predicted object recognition result for the second captured image and the object recognition result of the second captured image.
8. The information processing system according to claim 1, wherein the circuit is further configured to change the synthesis mode to the other after a predetermined time has elapsed since it was set to one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis.
9. The information processing system according to claim 1, wherein the circuit is further configured to change the synthesis mode in steps from one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis to the other.
10. The information processing system according to claim 1, comprising the circuit and a display for displaying the first composite image or the second composite image generated by the circuit.
11. The information processing system according to claim 1, further comprising the circuit and an imaging unit for acquiring the first captured image and the second captured image.
12. An information processing method comprising: an information processing system generating a first virtual image, which is a virtual image for the first captured image, or a second virtual image, which is a virtual image for a second captured image, which is a captured image of a second frame after the first frame, based on the object recognition result of the first captured image, which is a captured image of a first frame; and generating a first composite image by combining the first captured image and the first virtual image, or a second composite image by combining the second captured image and the second virtual image, depending on the synthesis mode.
13. The information processing method according to claim 12, further comprising setting the synthesis mode to a second image synthesis mode that uses the second captured image for synthesis, if the object recognized in the first captured image is an object for which the object recognition result of the second captured image can be predicted, and, if the synthesis mode is the second image synthesis mode, combining the second captured image and the second virtual image to generate the second composite image.
14. The information processing method according to claim 12, further comprising correcting the second captured image and the second virtual image based on the self-position estimated based on the captured images of a plurality of frames including the first frame, and generating the second composite image by combining the corrected second captured image and the second virtual image.
15. The information processing method according to claim 12, further comprising correcting the first composite image based on the self-position estimated based on the captured images of a plurality of frames including the first frame.
16. The information processing method according to claim 12, further comprising outputting the first composite image or the second composite image, or outputting the first captured image of low quality, depending on the operating mode.
17. The information processing method according to claim 12, further comprising setting the synthesis mode based on instructions.
18. The information processing method according to claim 12, further comprising changing the synthesis mode to a first image synthesis mode in which the first captured image is used for synthesis, based on the predicted object recognition result for the second captured image and the object recognition result of the second captured image.
19. The information processing method according to claim 12, further comprising setting the synthesis mode to one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis, and then changing it to the other after a predetermined period of time has elapsed.
20. The information processing method according to claim 12, further comprising changing the synthesis mode in steps from one of a first image synthesis mode that uses the first captured image for synthesis and a second image synthesis mode that uses the second captured image for synthesis to the other.
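The timed and stepwise mode changes recited in claims 8, 9, 19, and 20 can be sketched as follows. The claims do not state how the stepwise change is carried out, so the cross-fade between the two composite images shown here is purely an assumption, and the names `should_switch`, `transition_weights`, and `blend` are hypothetical helpers introduced for illustration only.

```python
from typing import List


def should_switch(elapsed_s: float, hold_time_s: float) -> bool:
    """Timed change: switch to the other mode once the predetermined time has elapsed."""
    return elapsed_s >= hold_time_s


def transition_weights(num_steps: int) -> List[float]:
    """Weights of the target composite over a stepwise transition, from 0.0 to 1.0."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [i / num_steps for i in range(num_steps + 1)]


def blend(first_pixels: List[float], second_pixels: List[float], weight: float) -> List[float]:
    """Assumed cross-fade: weight 0.0 gives the first composite, 1.0 gives the second."""
    return [(1.0 - weight) * a + weight * b
            for a, b in zip(first_pixels, second_pixels)]
```

With four steps, the weight of the target composite rises through 0.0, 0.25, 0.5, 0.75, and 1.0 on successive frames, so the displayed image moves from one mode's output to the other's without an abrupt jump. Other realizations, such as changing which frame is used in stages, would equally fit the claim wording.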