Spatiotemporal inference program, method and device

By converting spatiotemporal information into tabular log data and using a language model, the technique addresses the limitations of conventional methods, enabling efficient and accurate spatiotemporal inference for long-duration videos, including three-dimensional object positioning.

WO2026126348A1 · PCT designated stage · Publication Date: 2026-06-18 · FUJITSU LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
FUJITSU LTD
Filing Date
2024-12-10
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Conventional spatiotemporal inference methods are not applicable to long-duration videos, such as those captured by CCTV cameras spanning multiple days, and do not account for spatial information indicating the three-dimensional position of objects in real space.

Method used

The technique converts spatiotemporal information from video frames into tabular log data, incorporating three-dimensional positional information, and performs inference using a natural language question and log data with a language model.

Benefits of technology

Enables efficient spatiotemporal inference on a real-space scale for long-duration video, allowing for accurate analysis and response to user queries regarding object positions and activities over extended periods.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure JP2024043673_18062026_PF_FP_ABST

Abstract

This spatiotemporal inference device is configured to: convert spatiotemporal information that includes the position of an object in a real space at each time, said object being detected from an image of each frame of a video, into log data in a table format; and execute spatiotemporal inference on the video on the basis of the log data and a question in a natural language sentence.

Description

Spatio-temporal Inference Program, Method, and Apparatus 【0001】 The disclosed technology relates to a spatio-temporal inference program, a spatio-temporal inference method, and a spatio-temporal inference apparatus. 【0002】 Conventionally, CCTV (closed-circuit television) cameras observe objects such as people and vehicles from above an observation target area, such as a factory, supermarket, or intersection, and the actions of those objects are analyzed. In addition, a spatio-temporal inference agent extracts information from the video captured by a CCTV camera in response to a question from a user, performs spatio-temporal inference, and answers the question. 【0003】 A technology has also been proposed that applies a VLM (Vision and Language Model) trained on spatial inference data to VQA (Visual Question Answering). This technology assumes that the limited spatial inference capability of VLMs stems from a lack of 3D spatial knowledge in the training data, and addresses that lack by training the VLM on internet-scale spatial inference data. 【0004】 LongVLM, a video LLM built on a large language model (LLM) for long-term video understanding, has also been proposed. This technology decomposes a long video into multiple short segments and encodes the local features of each segment via a hierarchical token merging module. 【0005】 An LLM-based large multimodal model designed for efficient long-term video understanding has also been proposed. Instead of trying to process more frames simultaneously, this model processes the video online and stores past video information in a memory bank. As a result, the model can refer to past video content for long-term analysis without exceeding the context length limit of the LLM or the GPU memory limit. 
【0006】 Furthermore, a recursive video captioning model has been proposed that can output video captions at multiple hierarchical levels. This model uses a curriculum learning scheme that starts with clip-level captions describing atomic actions, then moves on to segment-level descriptions, and finally generates a summary of an hour-long video, thereby learning the hierarchical structure of the video. 【0007】
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia, "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities", arXiv:2401.12168 [cs.CV], 22 Jan 2024.
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang, "LongVLM: Efficient Long Video Understanding via Large Language Models", arXiv:2404.03384 [cs.CV], 4 Apr 2024.
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim, "MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding", arXiv:2404.05726 [cs.CV], 8 Apr 2024.
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius, "Video ReCap: Recursive Captioning of Hour-Long Videos", arXiv:2402.13250 [cs.CV], 20 Feb 2024.
【0008】 However, conventional spatiotemporal inference methods for video either are not applicable to long-duration video or, at best, handle videos lasting a few minutes to a few hours. It is therefore difficult to apply spatiotemporal inference to videos spanning multiple days, such as those captured by CCTV cameras. 【0009】 Furthermore, conventional spatiotemporal inference does not take into account spatial information indicating the three-dimensional position of an object in real space. Consequently, conventional methods cannot efficiently perform spatiotemporal inference on a real-space scale for long-duration video spanning multiple days. 
【0010】 In one aspect, the disclosed technology aims to efficiently perform spatiotemporal inference on a real-space scale for long-duration video. 【0011】 In one embodiment, the disclosed technology converts spatiotemporal information, including the position in real space at each time of an object detected from the image of each frame of a video, into tabular log data. The disclosed technology then performs spatiotemporal inference on the video based on a natural language question and the log data. 【0012】 In one aspect, this enables efficient spatiotemporal inference on a real-space scale for long-duration video. 【0013】 Figure 1 is a functional block diagram of a spatiotemporal inference device. Figure 2 is a diagram showing an example of a timestamp table. Figure 3 is a diagram showing an example of a person table. Figure 4 is a diagram showing an example of an object table. Figure 5 is a diagram showing an example of an auxiliary data table. Figure 6 is a diagram showing an example of a scene layout table. Figure 7 is a diagram for explaining the inference unit. Figure 8 is a block diagram showing the schematic configuration of a computer that functions as a spatiotemporal inference device. Figure 9 is a flowchart showing an example of spatiotemporal inference processing. Figure 10 is a diagram showing an example of a response output from a spatiotemporal inference device. Figure 11 is a diagram showing another example of a response output from a spatiotemporal inference device. 【0014】 An example of an embodiment of the disclosed technology is described below with reference to the drawings. 【0015】 As shown in Figure 1, the spatiotemporal inference device 10 functionally includes a conversion unit 12 and an inference unit 16. A spatiotemporal log 14 is stored in a predetermined storage area of the spatiotemporal inference device 10. 【0016】 The conversion unit 12 acquires the video input to the spatiotemporal inference device 10. 
The video is, for example, footage taken by a CCTV camera of an observation target area such as a factory, supermarket, or intersection. The video contains multiple frames, and each frame is assigned a timestamp indicating the date and time it was taken. 【0017】 Furthermore, the conversion unit 12 acquires auxiliary data that is input to the spatiotemporal inference device 10. The auxiliary data is, for example, sensing data other than video detected at each time in the real space including the video shooting range by one or more sensors. The sensing data may be, for example, environmental information such as temperature and humidity in the real space, or detection information of events such as abnormalities or failures of machines installed in the real space. Alternatively, data from a POS (Point of Sale) system may be acquired as sensing data. Each piece of sensing data is given a timestamp indicating the date and time the data was acquired. The conversion unit 12 may also acquire layout information of the real space as auxiliary data. 【0018】 The conversion unit 12 converts spatiotemporal information, including the position of the object in real space at each time point detected from the image of each frame of the video, into log data in a table format. 【0019】 Specifically, the conversion unit 12 uses methods such as object detection and segmentation (for example, Reference 1) to detect people and objects as targets from the image of each frame of the video, and also identifies the type of object. The conversion unit 12 also acquires external characteristics such as clothing for each of the detected people. 【0020】Furthermore, the conversion unit 12 tracks people by assigning a person ID to detected people using an object tracking method (for example, Reference 2) and assigning the same person ID to the same person across frames. 
Similarly, the conversion unit 12 tracks objects by assigning an object ID to detected objects and assigning the same object ID to objects that are the same instance across frames. 【0021】 Furthermore, the conversion unit 12 estimates the posture of each person detected from the image using skeletal recognition technology or the like (for example, Reference 3). The conversion unit 12 also estimates the movements of the people using the estimated posture and a machine learning model that has been pre-trained to estimate the movements of the people (for example, Reference 4). 【0022】 Furthermore, the conversion unit 12 performs calibration of the internal and external parameters of the camera that captured the video (for example, reference 5). The conversion unit 12 also uses the calibrated camera parameters to convert the two-dimensional positions in the image into three-dimensional positions in real space, thereby calculating the positional information of the detected people and objects. 【0023】References: 1. Tianhe Ren et al., "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", 2024. 2. Fan Yang et al., "Hard to track objects with irregular motions and similar appearances? make it easier by buffering the matching space", 2023. 3. Tao Jiang et al., "RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation", 2024. 4. Chaitanya Ryali et al., "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", 2023. 5. Alexander Veicht et al., "GeoCalib: Single-image Calibration with Geometric Optimization", 2024. 6. Fan Yang et al., "YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras", 2024. 【0024】 The conversion unit 12 stores the information of people and objects obtained through the above-described processes in the tables of the spatiotemporal log 14, associating it with the timestamps attached to the frames in which the people and objects were detected. 
Figures 2 to 6 show examples of the tables included in the spatiotemporal log 14. 【0025】 The table shown in Figure 2 is an example of a timestamp table 14A. The conversion unit 12 associates the timestamp assigned to each frame with the person IDs of the people and the object IDs of the objects detected from the image of that frame, and stores them in the timestamp table 14A. 【0026】 The table shown in Figure 3 is an example of a person table 14B. The conversion unit 12 associates the timestamp assigned to each frame with the person ID of each person detected from the image of that frame, the location information of that person, and the attributes of that person, and stores them in the person table 14B. In the example in Figure 3, appearance features and actions are the attributes of the person. The location information is the three-dimensional position of the person in real space; here, it is represented as two-dimensional coordinate values on a plane, obtained by converting the position of the person's feet in the image into a three-dimensional position. 【0027】 The table shown in Figure 4 is an example of an object table 14C. The conversion unit 12 associates the timestamp assigned to each frame with the object ID and type of each object detected from the image of that frame, the location information of that object, and the attributes of that object, and stores them in the object table 14C. In the example in Figure 4, the contents are the attribute of the object. The contents are, for example, information about the items inside the object "cart" assumed in the example in Figure 4. The location information, like that of a person, is represented as two-dimensional coordinate values on a plane, obtained by converting the position of the object's contact point into a three-dimensional position. The contact point may be, for example, the midpoint of the bottom edge of the bounding box that indicates the area of the object in the image. 
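As an illustration only, the contact-point rule and the storage of one row of the person table 14B might be sketched in Python as follows. The sketch assumes that the mapping from image coordinates to the ground plane is available as a 3x3 homography H, which is one common way of using calibrated camera parameters for points on the floor; the embodiment does not specify this. The function names, column names, and numeric values are hypothetical.

```python
def contact_point(bbox):
    """Midpoint of the bottom edge of a bounding box.

    bbox is (x_min, y_min, x_max, y_max) in pixels, with y increasing downward.
    """
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2.0, y_max)


def image_to_ground(point, H):
    """Apply a 3x3 homography H (image plane -> ground plane) to a pixel."""
    u, v = point
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return (x / w, y / w)


def person_row(timestamp, person_id, bbox, appearance, action, H):
    """Build one person-table row (assumed column names)."""
    x, y = image_to_ground(contact_point(bbox), H)
    return {
        "timestamp": timestamp,
        "person_id": person_id,
        "x": round(x, 2),  # ground-plane coordinates in metres
        "y": round(y, 2),
        "appearance": appearance,
        "action": action,
    }


# Hypothetical homography: a pure scaling of 0.01 m per pixel.
H = [[0.01, 0.0, 0.0],
     [0.0, 0.01, 0.0],
     [0.0, 0.0, 1.0]]

row = person_row("2024-12-10 10:00:00", "P001", (100, 50, 200, 400),
                 ["hat", "jacket"], "walking", H)
print(row)
```

A real deployment would derive H, or a full projection model, from the calibration step described in paragraph 【0022】 rather than from a fixed scale factor.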
【0028】 Furthermore, the spatiotemporal log 14 also stores auxiliary data and layout information. The table shown in Figure 5 is an example of an auxiliary data table 14D. In the example in Figure 5, events that occurred on machines installed in real space are stored in association with timestamps indicating the time of occurrence of those events. 【0029】The table shown in Figure 6 is an example of a scene layout table 14E. For example, if the real space being observed is a factory, the conversion unit 12 assigns a scene ID to each of the following based on the acquired layout information: Factory, Workshop, Production Machine, etc. The conversion unit 12 associates the scene ID with the information related to the scene indicated by that scene ID and stores it in the scene layout table 14E. In Figure 6, an example of scene information is shown in the scene layout table 14E, which includes location information, a higher-level scene, and a lower-level scene. 【0030】 The positional information here can be, for example, a series of coordinates of the vertices of a polygon when the area surrounding each scene is represented by a polygon. The parent-level and child-level scenes indicate the inclusion relationship of each scene. In the scene layout table 14E of Figure 6, it is shown that the workshop "Workshop A-1" is contained within the factory "Factory X". It is also shown that the production machines "Production Machine A", "Production Machine B", and "Production Machine F" are installed within the workshop "Workshop A-1". 【0031】 By associating the layout's positional information with the individual positional information of people and objects, it is possible to determine the positions of people and objects within the layout. 【0032】 The inference unit 16 performs spatiotemporal inference on the video based on a natural language question input by the user and the spatiotemporal log 14. 
As shown in Figure 7, the inference unit 16 performs spatiotemporal inference using the LLM 16A. The LLM 16A plans tasks to output answers to questions, generates code to execute tasks, executes the code on log data, organizes the execution results to generate answers, and outputs them. The LLM 16A may perform these processes using Text2Code or the like. 【0033】Furthermore, as shown in Figure 7, LLM16A performs spatiotemporal inference by referring to external information 16B. That is, it applies RAG (Retrieval-Augmented Generation). External information 16B includes at least one of a predetermined function for generating code, prior knowledge about the real space, and a predetermined prompt for outputting the answer. An example of a function for generating code is shown below. 【0034】 The first example is the function check_location(object, area), which calculates whether an object is within a target area. In this function, an arbitrary range is set as the range of the target area in "area," and the location information of the person ID or object ID specified in "object" is obtained from the person table 14B or object table 14C. Then, it is determined whether the obtained location information is included within the range of the target area. This function can be used for various tasks, for example, in a factory or warehouse, such as determining whether a worker has entered a restricted area or whether a shopping cart is correctly placed in its designated location. This function can also be used for tasks such as determining how long each customer stays in front of a product shelf, for example, in a supermarket. 【0035】 The second example is the function check_distance(objectA, objectB) which calculates the relative distance between objects. In this function, the location information of the person ID or object ID specified by "objectA" and "objectB" is obtained from the person table 14B or object table 14C, and the distance between the two locations is calculated. 
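The two functions described so far can be sketched in Python as follows. Only the names check_location and check_distance and their two-argument form come from the description; the function bodies, the extra positions parameter, the ray-casting test, and the sample data are assumptions made for illustration. The target area is given as a list of polygon vertices, in line with the positional information of the scene layout table 14E.

```python
import math

# Hypothetical positions at one timestamp, as would be read from the
# person table 14B or object table 14C: ID -> (x, y) in metres.
positions = {
    "P001": (2.0, 3.0),
    "P002": (8.0, 3.0),
    "O001": (2.0, 7.0),
}


def check_location(obj_id, area, positions=positions):
    """Return True if the position of obj_id lies inside the polygon `area`.

    Uses the ray-casting rule; points exactly on the boundary are not
    treated specially.
    """
    x, y = positions[obj_id]
    inside = False
    n = len(area)
    for i in range(n):
        x1, y1 = area[i]
        x2, y2 = area[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def check_distance(obj_a, obj_b, positions=positions):
    """Euclidean distance in metres between two tracked IDs."""
    ax, ay = positions[obj_a]
    bx, by = positions[obj_b]
    return math.hypot(ax - bx, ay - by)


# Hypothetical restricted area: a 5 m x 5 m square.
restricted_area = [(0.0, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 5.0)]

print(check_location("P001", restricted_area))  # P001 stands inside the square
print(check_distance("P001", "P002"))
```

In this sketch the positions are passed in as a snapshot for a single timestamp; a fuller version would take a timestamp argument and look up the row in the log.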
The check_distance function can be used, for example, to determine whether a worker is too close to a moving forklift, or whether at least two workers are present when performing dangerous tasks such as working on a ladder. It can also be used, for example, to determine the degree of congestion in a supermarket. 【0036】 The third example is the function count_num(object_list), which counts the number of person IDs or object IDs specified by "object_list" in the timestamp table 14A. This function can be used, for example, to check the number of objects such as carts, pallets, and boxes in a warehouse or factory, to track the number of products on a supermarket shelf, or to count the number of products assembled by a worker. It can also be used to determine, for example, how many minutes a worker has performed heavy labor that requires a break, by counting the number of timestamps for the same person ID. 【0037】 The fourth example is the function fuzzy_match(attribute_stringA, attribute_stringB), which determines whether attributes match. In this function, an arbitrary attribute is specified in one of "attribute_stringA" and "attribute_stringB", an attribute obtained from the person table 14B or object table 14C is specified in the other, and whether the two attributes match is determined. This function is used to check whether the attributes of an object match the required attributes. For example, suppose the appearance features representing standard personal protective clothing for factory workers are {"hat", "jacket", "long trousers"}. By determining whether these appearance features match the appearance features of a person obtained from the person table 14B, it is possible to determine whether the person is a worker, or whether a worker is wearing the correct protective clothing. Note that fuzzy matching may be performed, in which a match is determined not only when the two attributes match exactly but also when their similarity is above a predetermined value. 
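A minimal Python sketch of the counting and matching functions follows. It rests on several assumptions: count_num is interpreted as counting, for each listed ID, the timestamps at which that ID appears; fuzzy_match uses the character-level similarity ratio of the standard-library difflib.SequenceMatcher with an arbitrary threshold of 0.6; and the table excerpt is hypothetical. The description itself does not specify the similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of the timestamp table 14A, sampled once per minute:
# timestamp -> IDs detected in that frame.
timestamp_table = {
    "10:00": ["P001", "P002", "O001"],
    "10:01": ["P001", "O001"],
    "10:02": ["P001", "P002"],
}


def count_num(object_list, table=timestamp_table):
    """For each listed ID, count the timestamps at which it was detected."""
    return {
        obj_id: sum(obj_id in ids for ids in table.values())
        for obj_id in object_list
    }


def fuzzy_match(attribute_a, attribute_b, threshold=0.6):
    """Return True if two attribute strings are similar enough.

    The similarity is difflib's ratio in [0, 1]; 1.0 means identical strings.
    """
    ratio = SequenceMatcher(None, attribute_a.lower(), attribute_b.lower()).ratio()
    return ratio >= threshold


counts = count_num(["P001", "P002"])
print(counts)  # with one-minute sampling, each count approximates minutes present
print(fuzzy_match("jacket", "blue jacket"))
print(fuzzy_match("hat", "long blue trousers"))
```

A purely lexical measure such as this one cannot equate different words for similar items; a similarity computed on text embeddings would be one alternative where that matters.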
For example, if the appearance characteristics of a person obtained from person table 14B are {"white helmet", "blue jacket", "long blue trousers"}, the string is slightly different from the standard appearance characteristics mentioned above, but it may still be determined to be a match. 【0038】The fifth example is the function cluster(object_list) which clusters spatiotemporal information. This function can be used to investigate the spatiotemporal relationships of operations, such as counting periodic operations. 【0039】 Furthermore, the prior knowledge regarding the physical space included in the external information 16B includes various types of information such as work manuals, workflows, and work schedules. For example, the operation workflow for product A may include a description of what operations are performed in each work area in order of process, or a description of the time from the start to the end of work for each day of the week for each type of work, such as day shifts and night shifts. 【0040】 Furthermore, prompts for outputting answers contained in external information 16B include instructions to be given to LLM 16A in order to properly perform the task of outputting answers to user questions. For example, instructions such as "You are an expert in generating graphs of tasks for spatiotemporal calculations based on the factory's production schedule and rules. When a user gives you a question about the factory's operations, your task is to understand the intent of the question and generate a graph that answers the question using the following functions" may be included. Prompts may also be included to instruct which functions to execute and in what order. 【0041】 LLM16A uses the external information 16B described above to plan a task to output an answer to a question, as described above, and generates code to execute the task. 
When LLM16A executes the generated code, the spatiotemporal information specified by the function is obtained from the spatiotemporal log 14, and calculations such as counting and comparison are performed. LLM16A aggregates the results of the code execution, graphs them, and creates text based on the execution results to generate and output the answer. LLM16A is a language model such as a transformer that has been trained to estimate and output the next token from an input token sequence. 【0042】The spatio-temporal inference device 10 may be implemented by, for example, a computer 40 shown in FIG. 8. The computer 40 includes a CPU (Central Processing Unit) 41, a GPU (Graphics Processing Unit) 42, a memory 43 as a temporary storage area, and a non-volatile storage device 44. The computer 40 also includes input / output devices 45 such as an input device and a display device, and an R / W (Read / Write) device 46 that controls reading and writing of data to and from a storage medium 49. The computer 40 further includes a communication I / F (Interface) 47 connected to a network such as the Internet. The CPU 41, GPU 42, memory 43, storage device 44, input / output devices 45, R / W device 46, and communication I / F 47 are connected to each other via a bus 48. 【0043】 The storage device 44 is, for example, an HDD (Hard Disk Drive), SSD (Solid State Drive), flash memory, or the like. A spatio-temporal inference program 50 for causing the computer 40 to function as the spatio-temporal inference device 10 is stored in the storage device 44 as a storage medium. The spatio-temporal inference program 50 has a conversion process control instruction 52 and an inference process control instruction 56. The storage device 44 also has an information storage area 60 in which information constituting the spatio-temporal log 14 is stored. 
【0044】 The CPU 41 reads the spatio-temporal inference program 50 from the storage device 44 and expands it in the memory 43, and sequentially executes the control instructions included in the spatio-temporal inference program 50. By executing the conversion process control instruction 52, the CPU 41 operates as the conversion unit 12 shown in FIG. 1. By executing the inference process control instruction 56, the CPU 41 operates as the inference unit 16 shown in FIG. 1. The CPU 41 also reads information from the information storage area 60 and expands the spatio-temporal log 14 in the memory 43. As a result, the computer 40 that executes the spatio-temporal inference program 50 functions as the spatio-temporal inference device 10. Note that the CPU 41 that executes the program is hardware. A part of the program may also be executed by the GPU 42. 【0045】Note that the functions realized by the spatio-temporal inference program 50 may be realized by, for example, a semiconductor integrated circuit, more specifically, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like. 【0046】 Next, the operation of the spatio-temporal inference device 10 according to this embodiment will be described. When video and auxiliary data, and a question from a user are input to the spatio-temporal inference device 10, the spatio-temporal inference process shown in FIG. 9 is executed in the spatio-temporal inference device 10. Note that the spatio-temporal inference process is an example of the spatio-temporal inference method of the disclosed technology. 【0047】 In step S10, the conversion unit 12 acquires the video and auxiliary data input to the spatio-temporal inference device 10, and the question from the user. Next, in step S12, the conversion unit 12 converts the video and auxiliary data into log data in table format and stores it in the spatio-temporal log 14. 
【0048】 Next, in step S14, the inference unit 16 inputs the spatio-temporal log 14 and the question to the LLM 16A. Next, in step S16, the LLM 16A refers to the external information 16B, plans a task for outputting an answer to the question, generates code for executing the task, executes the code on the log data, organizes the execution results to generate an answer, and outputs it. Then, the spatio-temporal inference process ends. 【0049】 As described above, the spatio-temporal inference device according to this embodiment converts spatio-temporal information, including the position in real space at each time of an object detected from the image of each frame of the video, into log data in table format. The spatio-temporal inference device then performs spatio-temporal inference on the video based on a natural language question and the log data. As a result, spatio-temporal inference on a real-space scale can be performed efficiently for long-duration video. 【0050】 Examples of applications of the spatio-temporal inference device according to this embodiment are described below. 【0051】 The first example is an application for analyzing customer activity in a supermarket. In this example, the LLM 16A uses the coordinates of the supermarket layout from the scene layout table 14E, and the timestamps and all person IDs and object IDs at each timestamp from the timestamp table 14A. The LLM 16A also uses the trajectories of all people from the person table 14B. A trajectory is the position information of a person at each timestamp, arranged in chronological order. 【0052】 By using the spatiotemporal logs described above, for example, in response to a user's question such as "How many customers enter the supermarket between 10:00 AM and 11:00 AM?", the spatiotemporal inference device 10 can output an answer such as "13". 
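The flow of generating code for a question and executing it on the log data can be illustrated with the following Python sketch. The language model is replaced by a stand-in function that returns canned code for one question, so the sketch shows only the execution step; the table excerpt, the function names, and the simplification of "entering" to "observed within the time window" are all assumptions.

```python
# Hypothetical excerpt of the person table 14B: time of day and person ID.
person_table = [
    {"time": "09:58", "person_id": "P001"},
    {"time": "10:05", "person_id": "P002"},
    {"time": "10:06", "person_id": "P002"},
    {"time": "10:40", "person_id": "P003"},
    {"time": "11:10", "person_id": "P004"},
]


def generate_code(question):
    """Stand-in for the language model: returns canned code for one question.

    A real system would obtain this code from the model instead.
    """
    return (
        "seen = {row['person_id'] for row in person_table\n"
        "        if '10:00' <= row['time'] < '11:00'}\n"
        "result = len(seen)\n"
    )


def answer(question, table):
    """Generate code for the question, run it on the log data, return text."""
    code = generate_code(question)
    namespace = {"person_table": table}
    exec(code, namespace)  # the generated code reads the table, sets `result`
    return str(namespace["result"])


reply = answer(
    "How many customers enter the supermarket between 10:00 AM and 11:00 AM?",
    person_table,
)
print(reply)
```

Because the code being executed is model-generated, a practical system would likely run it in a restricted environment rather than with a bare exec call.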
Also, for example, in response to a user's question such as "Where did customers spend the most time in the supermarket between 10:00 AM and 12:00 PM?", the spatiotemporal inference device 10 can output an answer such as "The grocery section". 【0053】 In addition to the spatiotemporal logs mentioned above, LLM16A can also utilize weather data for the supermarket's location and sales data for the supermarket as supplementary data. In this case, for example, suppose a user asks, "How does the weather affect customer behavior and sales?" The spatiotemporal inference device 10 can then output an answer such as, "Compared to sunny days, fewer customers visit the supermarket, but those who do visit tend to stay longer and move more slowly within the supermarket..." 【0054】 Furthermore, in response to a user's question, for example, "Can you provide advice on optimizing shelf replenishment?", the spatiotemporal inference device 10 can output a response such as, "Yes. 1. Before it rains, we replenish the chilled noodles less. We have 50 packs of chilled noodles in stock for the day, but on rainy days, we only sell an average of 5 packs. 2. ..." 【0055】As a second example, we will describe an application for analyzing sports (in this case, basketball). In this example, the coordinates of the basketball court are used in LLM16A from the scene layout table 14E, and the timestamps and all person IDs and object IDs at each timestamp are used in LLM16A from the timestamp table 14A. In addition, the trajectories and attributes of all players are used in LLM16A from the person table 14B, and the trajectories and attributes of the basketballs used in the game are used in LLM16A from the object table 14C. From all the people, LLM16A identifies players by team based on their attributes, and identifies the basketballs used in the game from all the objects. 
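Because a trajectory is a chronological list of positions per timestamp, kinematic quantities such as speed can be derived from it directly. The following Python sketch shows one way to do so; the function names, the ISO 8601 timestamp format, and the sample values are assumptions, and positions are taken to be ground-plane coordinates in metres.

```python
import math
from datetime import datetime


def segment_speeds(trajectory):
    """Speeds in m/s between consecutive samples of one trajectory.

    trajectory is a chronological list of (timestamp, x, y) tuples.
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
        dt = (datetime.fromisoformat(t1) - datetime.fromisoformat(t0)).total_seconds()
        if dt > 0:  # skip duplicate timestamps
            speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds


def average_speed(trajectory):
    """Total path length divided by total elapsed time, in m/s."""
    length = sum(
        math.hypot(x1 - x0, y1 - y0)
        for (_, x0, y0), (_, x1, y1) in zip(trajectory, trajectory[1:])
    )
    elapsed = (
        datetime.fromisoformat(trajectory[-1][0])
        - datetime.fromisoformat(trajectory[0][0])
    ).total_seconds()
    return length / elapsed


# Hypothetical trajectory of one player.
player = [
    ("2024-12-10T10:00:00", 0.0, 0.0),
    ("2024-12-10T10:00:01", 3.0, 4.0),   # 5 m in 1 s
    ("2024-12-10T10:00:03", 3.0, 10.0),  # 6 m in 2 s
]

max_speed = max(segment_speeds(player))
avg_speed = average_speed(player)
print(max_speed, avg_speed)
```

Averaging over all players would repeat this per person ID; raw positions from detection are typically noisy, so some smoothing before differencing may be needed in practice.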
【0056】 By using the spatiotemporal logs described above, for example, in response to a user's question such as "What was the average speed of all players during the game?", the spatiotemporal inference device 10 can output an answer such as "2.1 m/s". Also, for example, in response to a user's question such as "What was the maximum speed of all players during the game?", the spatiotemporal inference device 10 can output an answer such as "7.0 m/s". Furthermore, for example, in response to a user's question such as "During the first quarter, how long was the basketball placed on the left side of the court and on the right side, respectively?", the spatiotemporal inference device 10 can output an answer such as "During the first quarter, the basketball was placed on the left side of the court for 7.3 minutes and on the right side for 5.7 minutes." 【0057】 The third example is a factory management application. In this example, the LLM 16A uses the factory layout and production equipment coordinates from the scene layout table 14E, and the timestamps and all person IDs for each timestamp from the timestamp table 14A. The LLM 16A also uses the trajectories and attributes of all workers from the person table 14B. 【0058】 By using the spatiotemporal log described above, the spatiotemporal inference device 10 can output an answer like the one shown in Figure 10 to a user's question such as, "How long does it take for a worker to complete the assembly work? Please answer this question in an organized table." 【0059】 In addition to the spatiotemporal log described above, the LLM 16A can also utilize the trajectories and attributes of all movable carts from the object table 14C, as well as, as auxiliary data, events indicating the occurrence of abnormalities reported by production equipment at each timestamp. 
In this case, for example, in response to a question from a user such as, "How long did it take the workers to deal with the abnormal cases reported by equipment A and B during today's production? Please answer this question in an organized table," the spatiotemporal inference device 10 can output an answer as shown in Figure 11. 【0060】 Furthermore, the spatiotemporal inference device according to the above embodiment can also be used for data integration with other systems. For example, by inputting a spatiotemporal log converted from video and auxiliary data, and a prompt instructing conversion to the data structure of another system, the spatiotemporal log can be converted to a data structure usable by the other system. 【0061】 In the above embodiment, the spatiotemporal inference program is pre-stored (installed) in the storage device, but the invention is not limited to this. The program relating to the disclosed technology may be provided in a form stored on a storage medium such as a CD-ROM, DVD-ROM, or USB memory. 【0062】
10 Spatiotemporal Inference Device
12 Conversion Unit
14 Spatiotemporal Log
14A Timestamp Table
14B Person Table
14C Object Table
14D Auxiliary Data Table
14E Scene Layout Table
16 Inference Unit
16A LLM
16B External Information
40 Computer
41 CPU
42 GPU
43 Memory
44 Storage Device
45 Input/Output Device
46 R/W Device
47 Communication Interface
48 Bus
49 Storage Medium
50 Spatiotemporal Inference Program
52 Conversion Process Control Instruction
56 Inference Process Control Instruction
60 Information Storage Area

Claims

1. A spatiotemporal inference program for causing a computer to perform a process that includes converting spatiotemporal information, including the position of an object in real space at each time, detected from the image of each frame of a video, into log data in a table format, and performing spatiotemporal inference on the video based on a question in natural language and the log data.

2. The spatiotemporal inference program according to claim 1, wherein the log data further comprises at least one of external characteristics of a person who is the object, actions of the person, and the type of an item that is the object.

3. The spatiotemporal inference program according to claim 1 or 2, wherein the log data further includes spatiotemporal information converted from sensing data, other than the video, detected at each time in the real space by one or more sensors.

4. The spatiotemporal inference program according to claim 1 or 2, wherein the log data includes layout information of the real space with which the position of the object in the real space is associated.

5. The spatiotemporal inference program according to claim 1, wherein the spatiotemporal inference is performed by a large-scale language model.

6. The spatiotemporal inference program according to claim 5, wherein the large-scale language model plans a task to output an answer to the question, generates code to perform the task, and outputs the answer by executing the code on the log data.

7. The spatiotemporal inference program according to claim 6, wherein the large-scale language model performs spatiotemporal inference by referring to external information including at least one of a predetermined function for generating the code, prior knowledge about the real space, and a predetermined prompt for outputting the answer.

8. A spatiotemporal inference method performed by a computer, which includes converting spatiotemporal information, including the position of an object in real space at each time, detected from the image of each frame of a video, into log data in a table format, and performing spatiotemporal inference on the video based on a question in natural language and the log data.

9. The spatiotemporal inference method according to claim 8, wherein the log data further comprises at least one of external characteristics of a person who is the object, actions of the person, and the type of an item that is the object.

10. The spatiotemporal inference method according to claim 8 or 9, wherein the log data further includes spatiotemporal information converted from sensing data, other than the video, detected at each time in the real space by one or more sensors.

11. The spatiotemporal inference method according to claim 8 or 9, wherein the log data includes layout information of the real space with which the position of the object in the real space is associated.

12. The spatiotemporal inference method according to claim 8, wherein the spatiotemporal inference is performed by a large-scale language model.

13. The spatiotemporal inference method according to claim 12, wherein the large-scale language model plans a task to output an answer to the question, generates code to perform the task, and outputs the answer by executing the code on the log data.

14. The spatiotemporal inference method according to claim 13, wherein the large-scale language model performs spatiotemporal inference by referring to external information including at least one of a predetermined function for generating the code, prior knowledge about the real space, and a predetermined prompt for outputting the answer.

15. A spatiotemporal inference device comprising one or more processors, wherein the processors perform a process that includes converting spatiotemporal information, including the position of an object in real space at each time, detected from the image of each frame of a video, into log data in a table format, and performing spatiotemporal inference on the video based on a question in natural language and the log data.

16. The spatiotemporal inference device according to claim 15, wherein the log data further comprises at least one of external characteristics of a person who is the object, actions of the person, and the type of an item that is the object.

17. The spatiotemporal inference device according to claim 15 or 16, wherein the log data further includes spatiotemporal information converted from sensing data, other than the video, detected at each time in the real space by one or more sensors.

18. The spatiotemporal inference device according to claim 15 or 16, wherein the log data includes layout information of the real space with which the position of the object in the real space is associated.

19. The spatiotemporal inference device according to claim 15, wherein the spatiotemporal inference is performed by a large-scale language model.

20. The spatiotemporal inference device according to claim 19, wherein the large-scale language model plans a task for outputting an answer to the question, generates code for executing the task, and outputs the answer by executing the code on the log data.