Monitoring system, monitoring method, and program
The monitoring system improves vehicle item identification accuracy by extracting partial images from difference images and using dual pre-trained models for enhanced detection and verification.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- TOYOTA JIDOSHA KK
- Filing Date
- 2024-12-19
- Publication Date
- 2026-07-01
Abstract
Description
Technical Field
[0001] The present disclosure relates to a monitoring system, a monitoring method, and a program.
Background Art
[0002] Patent Document 1 describes generating a difference image between an image captured before a user gets on a vehicle and an image captured after the user gets off the vehicle, and performing pattern matching processing on the difference image to identify the name of an item left in the vehicle after getting off.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] With conventional configurations for identifying objects inside a vehicle, items have sometimes been difficult to identify because of factors such as sunlight, the angle at which the forgotten item lies, and its color. Thus, conventional configurations leave room for improvement in the accuracy of identifying items inside the vehicle.
[0005] An object of the present disclosure is to enable identification of items inside a vehicle with higher accuracy.
Means for Solving the Problems
[0006] A monitoring system according to one embodiment of the present disclosure is a monitoring system that monitors the inside of a vehicle using a camera unit, and extracts a partial image from the second image that includes a difference region, which is a region in which a significant difference with a certain extent is observed in a difference image, which is the difference between a first image taken at a first time and a second image taken at a second time inside the vehicle, and inputs the partial image and a first prompt that queries for an item included in the partial image to a first pre-trained model that has been trained to output the name of an item captured in the image when at least the image and the first prompt are input, thereby obtaining a first output result regarding the item included in the second image, and inputs the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second pre-trained model that has been trained to output whether or not an item identified by the item name is captured in the image when at least the image and the second prompt including the item name are input, thereby outputting a second output result regarding the correctness of the first output result.
[0007] A monitoring system according to one embodiment of the present disclosure is a monitoring system that monitors the inside of a vehicle using a camera unit, and extracts a partial image from the second image that includes a difference region in the difference image between a first image taken at a first time and a second image taken at a second time inside the vehicle, inputs the partial image and a first prompt that queries the articles included in the partial image to a first trained model to obtain a first output result regarding the articles included in the second image, and inputs the second image and a second prompt that queries whether the second image contains the articles indicated by the first output result to a second trained model to output a second output result regarding the correctness of the first output result.
[0008] A monitoring method according to one embodiment of the present disclosure is a monitoring method for a monitoring system that monitors the inside of a vehicle using a camera unit, and includes: extracting a partial image from a second image that includes a difference region in a difference image between a first image taken at a first time and a second image taken at a second time inside the vehicle; inputting the partial image and a first prompt for querying the articles contained in the partial image into a first trained model to obtain a first output result regarding the articles contained in the second image; and inputting the second image and a second prompt for querying whether the second image contains the articles indicated by the first output result into a second trained model to output a second output result regarding the correctness of the first output result.
[0009] A program according to one embodiment of the present disclosure is a program for controlling a monitoring system that monitors the inside of a vehicle, and causes a processor to perform the following actions: extract a partial image from a second image that includes a difference region in a difference image between a first image taken at a first time and a second image taken at a second time inside the vehicle; input the partial image and a first prompt that queries for an item contained in the partial image to a first trained model to obtain a first output result regarding an item contained in the second image; and input the second image and a second prompt that queries whether the second image contains the item indicated by the first output result to a second trained model to output a second output result regarding the correctness of the first output result.
Effects of the Invention
[0010] According to one embodiment of this disclosure, items inside a vehicle can be identified with higher accuracy.
Brief Description of the Drawings
[0011]
[Figure 1] A diagram showing an example configuration of a monitoring system according to one embodiment.
[Figure 2] A schematic diagram showing an example of a vehicle equipped with a terminal device.
[Figure 3] A block diagram showing an example configuration of the terminal device in Figure 1.
[Figure 4] A block diagram showing an example configuration of the server device in Figure 1.
[Figure 5] A flowchart showing an example of the operating procedure of the monitoring system.
[Figure 6] A diagram showing an example of the first image.
[Figure 7] A diagram showing an example of the second image.
Modes for Carrying Out the Invention
[0012] Hereinafter, an embodiment of the present disclosure will be described with reference to the drawings. In each drawing, parts having the same configuration or function are denoted by the same reference numerals. In the description of this embodiment, redundant descriptions of the same parts may be omitted or simplified as appropriate.
[0013] (Summary of the embodiment) Figure 1 shows an example of the configuration of a monitoring system 1 according to one embodiment. The monitoring system 1 identifies items such as lost or forgotten items left in the vehicle 40 (see Figure 2) based on images taken inside the vehicle 40.
[0014] The monitoring system 1 comprises a terminal device 10 and a server device 20. The terminal device 10 and the server device 20 are connected to each other via a network N so that they can communicate with one another. The network N may include at least one of the following: the internet, an intranet, a mobile communication network, etc.
[0015] The terminal device 10 is a computer that acquires a captured image taken inside the vehicle 40, analyzes the captured image, and identifies an article left inside the vehicle 40. The terminal device 10 is, for example, a general-purpose computer such as a PC (Personal Computer) or a tablet terminal, but may be configured as any dedicated electronic device. As will be described later with reference to Figure 2, in the present embodiment, the terminal device 10 is provided as an in-vehicle device mounted inside the vehicle 40. However, the terminal device 10 is not limited to an in-vehicle device. For example, the terminal device 10 may acquire a captured image taken inside the vehicle 40 by receiving it from another device via the network N or by reading it from a recording medium such as a USB (Universal Serial Bus) memory.
[0016] The server device 20 is a computer that manages information on articles such as forgotten items left inside the vehicle 40 and verifies the result of the analysis of the captured image by the terminal device 10. The server device 20 is, for example, a general-purpose computer such as a WS (Workstation) or a PC, but may be configured as any dedicated electronic device.
[0017] Figure 2 is a schematic diagram showing an example of the vehicle 40 in which the terminal device 10 is mounted. As shown in Figure 2, a camera 30 as a photographing unit that photographs a predetermined photographing range 30A and acquires a photographed image is provided inside the vehicle 40. The camera 30 is connected to the terminal device 10 and transmits the acquired photographed image to the terminal device 10.
[0018] In this embodiment, the camera 30 may be, for example, a depth camera that acquires a depth image. A depth image is an image in which the distance from the camera 30 is expressed for each pixel of the image, for example, by luminance or the like. The depth camera is configured by, for example, existing technologies such as a stereo camera, a ToF (Time of Flight) camera, or a structured light camera, but the operating principle of the depth camera is arbitrary. Hereinafter, an example in which the camera 30 is a depth camera that acquires a depth image will be mainly described, but the type of the camera 30 and the type of the image acquired by the camera 30 are arbitrary. For example, the camera 30 may be a device that acquires a black-and-white or color two-dimensional image based on incident light.
[0019] Note that the vehicle 40 is, for example, a bus, a minibus, a shared taxi, etc. that unspecified people get on and off, but the type and use of the vehicle 40 are not limited to these. For example, the vehicle 40 may be a private car owned by an individual or a rental car, etc.
[0020] In such a configuration, the monitoring system 1 uses a large language model (LLM: Large Language Model) including a pre-trained vision and language model (VLM: Vision and Language Model) to identify lost items from the image captured by the camera 30. The vision and language model is a generative model that generates a text output in response to the input of an image and a text.
[0021] Specifically, a difference image that is the difference between the first image captured at the first time and the second image captured at the second time within the vehicle 40 is acquired. The difference image is calculated, for example, by obtaining the difference in depth for each pixel of the first image and the second image. The first image and the second image are images obtained by the camera 30 capturing the same shooting range 30A. The first time may be, for example, the time before the passenger gets on the vehicle 40. The second time may be, for example, the time after the passenger gets off the vehicle 40.
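The per-pixel depth difference described above can be sketched as follows. This is a minimal illustration only, assuming that a depth image is held as a row-major list of lists of distances and that the absolute value of the difference is used; the names `DepthImage` and `depth_difference` are hypothetical and do not appear in the disclosure.

```python
from typing import List

# Assumed representation: one depth value (distance from the camera) per pixel.
DepthImage = List[List[float]]


def depth_difference(first: DepthImage, second: DepthImage) -> DepthImage:
    """Return the per-pixel absolute depth difference of two same-sized images."""
    same_rows = len(first) == len(second)
    if not same_rows or any(len(a) != len(b) for a, b in zip(first, second)):
        raise ValueError("first and second images must have the same shape")
    return [
        [abs(b - a) for a, b in zip(row_first, row_second)]
        for row_first, row_second in zip(first, second)
    ]


# Example: an object 0.5 units closer to the camera appears in the middle column.
first_image = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
second_image = [[2.0, 1.5, 2.0], [2.0, 1.5, 2.0]]
difference_image = depth_difference(first_image, second_image)
```

Taking the absolute value is a design assumption made here so that both objects that appear and objects that disappear produce a positive difference.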
[0022] Monitoring system 1 inputs a partial image containing a region with a significant difference having a certain extent in the difference image, and a first prompt querying the items included in the partial image, into the first trained model to obtain a first output result regarding the items included in the second image. Monitoring system 1 inputs the second image and a second prompt querying whether the items indicated by the first output result are included in the second image into the second trained model to verify the correctness of the first output result. If the first output result is determined to be correct, monitoring system 1 registers the items indicated by the first output result. The first and second prompts are instruction statements for giving instructions or queries to the trained model.
[0023] In this way, the monitoring system 1 determines the region where an item is thought to be contained based on the difference image, obtains a first output result using the first machine learning model for the partial image containing that region, and further verifies the appropriateness of the first output result using the second machine learning model. By using two pre-trained models, the monitoring system 1 can detect lost items and other items with high accuracy.
[0024] (Terminal device 10) Figure 3 is a block diagram showing an example configuration of the terminal device 10 shown in Figure 1. The terminal device 10 comprises a control unit 11, a storage unit 12, and a communication unit 13.
[0025] The control unit 11 includes one or more processors, one or more programmable circuits, one or more dedicated circuits, or a combination thereof. The processor is, for example, a general-purpose processor such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), or a dedicated processor specialized for a specific process, but is not limited to these. The programmable circuit is, for example, an FPGA (Field-Programmable Gate Array), but is not limited to this. The dedicated circuit is, for example, an ASIC (Application Specific Integrated Circuit), but is not limited to this. The control unit 11 controls the operation of the entire terminal device 10.
[0026] The storage unit 12 includes one or more memories. The memories are, for example, semiconductor memories, magnetic memories, or optical memories, but are not limited to these. Each memory included in the storage unit 12 may function as, for example, a main memory, an auxiliary memory, or a cache memory. The storage unit 12 stores any information used for the operation of the terminal device 10. For example, the storage unit 12 may store system programs, application programs, and embedded software. The information stored in the storage unit 12 may be updateable with information obtained from the network N via, for example, the communication unit 13.
[0027] The communication unit 13 includes one or more communication interfaces connected to the network N. The communication interface supports, but is not limited to, mobile communication standards, wired LAN (Local Area Network) standards, or wireless LAN standards, and may support any communication standard. The mobile communication standard is, for example, 4G (4th Generation) or 5G (5th Generation), but is not limited to these. In this embodiment, the terminal device 10 communicates with the server device 20 via the communication unit 13 and the network N.
[0028] The functions of the terminal device 10 can be realized by executing a computer program (program) according to this embodiment on the processor included in the control unit 11. In other words, the functions of the terminal device 10 may be realized by software. The computer program causes the computer to execute the processing of the steps included in the operation of the terminal device 10, thereby realizing the functions corresponding to the processing of each step. In other words, the computer program is a program that causes the computer to function as the terminal device 10 according to this embodiment. The computer program may be recorded on a computer-readable recording medium. The term "program" here also covers information that is used for processing by a computer and is equivalent to a program. For example, data that is not a direct instruction to the computer but has the nature of defining the processing of the computer falls under "information equivalent to a program".
[0029] Computer programs can be recorded on computer-readable recording media. Computer-readable recording media include, for example, magnetic recording devices, optical discs, magneto-optical recording media, or semiconductor memory. Programs can be distributed, for example, by selling, transferring, or leasing portable recording media such as DVDs (Digital Versatile Discs) or CD-ROMs (Compact Disc Read Only Memory) on which the programs are recorded. Programs may also be distributed by storing them in server storage and transferring them from the server to other computers via a network. Programs may also be provided as program products.
[0030] A computer may, for example, temporarily store a program recorded on a portable storage medium or a program transferred from a server in its main memory. The computer may then read the program stored in the main memory with its processor and execute the processing according to the read program. The computer may also directly read a program from a portable storage medium and execute the processing according to the program. The computer may sequentially execute the processing according to the received program each time a program is transferred to it from a server. Such processing may also be performed by a so-called ASP (Application Service Provider) type service, which does not transfer programs from the server to the computer, but realizes its function only through execution instructions and result acquisition.
[0031] Some or all of the functions of the terminal device 10 may be implemented by a dedicated circuit included in the control unit 11. In other words, some or all of the functions of the terminal device 10 may be implemented by hardware. The terminal device 10 may be implemented by a single computer, or by the cooperation of multiple computers that can communicate with each other.
[0032] (Server device 20) Figure 4 is a block diagram showing an example configuration of the server device 20 in Figure 1. The server device 20 comprises a control unit 21, a storage unit 22, and a communication unit 23. The control unit 21, storage unit 22, and communication unit 23 of the server device 20 are the same as the control unit 11, storage unit 12, and communication unit 13 of the terminal device 10, so a detailed explanation is omitted. As with the terminal device 10, the functions of the server device 20 may be implemented by software or by hardware. The server device 20 may be implemented by a single computer or by the cooperation of multiple computers that can communicate with each other.
[0033] (Example of operation) An example of the operation of the monitoring system 1 described above will be explained with reference to Figures 5 to 7. Figure 5 is a flowchart of an example of the operation procedure of the monitoring system 1. Figure 6 is a diagram showing an example of the first image. Figure 7 is a diagram showing an example of the second image.
[0034] The operation of the monitoring system 1, described with reference to Figures 5 to 7, may correspond to one of the monitoring methods of the monitoring system 1. The operation of each step in Figures 5 to 7 may be performed based on control by the control unit 11 of the terminal device 10 or the control unit 21 of the server device 20.
[0035] The following describes an example of operation in which terminal device 10 obtains a first output result using a first machine learning model for a partial image of the second image that includes an area where an item is thought to be contained, and server device 20 verifies the appropriateness of the first output result using a second machine learning model. Note that the division of processing between terminal device 10 and server device 20 is not limited to this. For example, either terminal device 10 or server device 20 may perform all the processing.
[0036] In S1, the control unit 11 of the terminal device 10 receives a first image taken at a first time and a second image taken at a second time inside the vehicle 40. The first time may be, for example, the time before the occupants board the vehicle 40. The second time may be, for example, the time after the occupants disembark from the vehicle 40. The first and second images are acquired by the camera 30 by capturing the same shooting range 30A.
[0037] Figure 6 shows an example of the first image 51. Figure 7 shows an example of the second image 53. Figure 7 shows an example of an umbrella 54 being left behind as lost property.
[0038] In S2, the control unit 11 estimates the depth from the first image 51 and the second image 53 acquired in S1. If the first image 51 and the second image 53 are depth images, each pixel in the first image 51 and the second image 53 indicates the depth. If the first image 51 and the second image 53 are ordinary two-dimensional images in black and white or color, the control unit 11 may estimate the depth based, for example, on the brightness and saturation of the first image 51 and the second image 53. For example, the control unit 11 may estimate the depth by referring to a predetermined correspondence between brightness and saturation and depth.
[0039] In S3, the control unit 11 calculates a difference image between the first image 51 and the second image 53, and determines the difference region from the second image 53. The difference region is an area in the difference image where a significant difference with a certain extent is observed.
[0040] Specifically, the control unit 11 calculates, for each pixel, the difference in pixel values (or in depth, in the case of a depth image) between the first image 51 and the second image 53, and obtains a difference image. Next, the control unit 11 determines a region in the difference image where a significant difference with a certain extent is observed. For example, the control unit 11 may determine, as difference pixels, pixels where the difference in pixel values between corresponding pixels in the first image 51 and the second image 53 is greater than or equal to a certain value. The control unit 11 may then determine, as a difference region, a set of adjacent difference pixels whose area is greater than or equal to a predetermined threshold. In other words, even if multiple difference pixels are adjacent, the control unit 11 may discard them as noise if the area of the region they form is smaller than the area expected when an object is captured. The difference region determined in this way corresponds to the area occupied by the object (umbrella 54 in the example of Figure 7) in the second image 53.
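One possible way to realize the determination of difference regions is sketched below. It assumes 4-neighbor adjacency, a breadth-first flood fill, and an area measured as a pixel count; none of these choices is specified in the disclosure, and the names `find_difference_regions`, `pixel_threshold`, and `min_area` are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def find_difference_regions(
    diff: List[List[float]], pixel_threshold: float, min_area: int
) -> List[Set[Pixel]]:
    """Group adjacent difference pixels; keep groups of at least min_area pixels."""
    rows = len(diff)
    cols = len(diff[0]) if rows else 0
    seen: Set[Pixel] = set()
    regions: List[Set[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            # Skip pixels already grouped or below the per-pixel threshold.
            if (r, c) in seen or diff[r][c] < pixel_threshold:
                continue
            region: Set[Pixel] = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and (ny, nx) not in seen
                        and diff[ny][nx] >= pixel_threshold
                    ):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            # Groups smaller than min_area are treated as noise and dropped.
            if len(region) >= min_area:
                regions.append(region)
    return regions


example_diff = [
    [0.0, 0.6, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.9],
    [0.0, 0.5, 0.0, 0.0],
]
example_regions = find_difference_regions(example_diff, pixel_threshold=0.5, min_area=2)
```

In this example the three vertically adjacent pixels in column 1 form one difference region, while the isolated pixel at row 1, column 3 is removed as noise.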
[0041] In S4, the control unit 11 crops (extracts) a partial image from the second image 53 that includes the difference region determined in S3. For example, the control unit 11 may crop a rectangular partial image from the second image 53 that circumscribes the difference region determined in S3. In Figure 7, region 55 is a rectangular region cropped based on the difference region corresponding to the area occupied by the umbrella 54. If there are multiple difference regions determined in S3, the control unit 11 crops a partial image from the second image 53 that includes each of the multiple difference regions. The monitoring system 1 can improve the accuracy of object detection by detecting objects in such partial image units.
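The rectangular crop in S4 can be illustrated with the following sketch. It assumes that a difference region is given as a set of (row, column) pixel coordinates and that the image is a row-major list of lists; `bounding_box` and `crop` are hypothetical helper names.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]            # (row, column)
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), all inclusive


def bounding_box(region: Set[Pixel]) -> Box:
    """Return the smallest axis-aligned rectangle that contains every pixel."""
    if not region:
        raise ValueError("region must contain at least one pixel")
    rows = [r for r, _ in region]
    cols = [c for _, c in region]
    return min(rows), min(cols), max(rows), max(cols)


def crop(image: List[list], box: Box) -> List[list]:
    """Cut the rectangle described by box out of a row-major image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


example_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
example_region = {(0, 1), (1, 1), (2, 1), (1, 2)}
example_box = bounding_box(example_region)
partial_image = crop(example_image, example_box)
```

When several difference regions are found, the same two helpers could simply be applied to each region in turn.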
[0042] In S5, the control unit 11 inputs the partial image acquired in S4 and a first prompt to query the items contained in the partial image to the first model, which is the first trained model, and obtains a first output result regarding the items contained in the second image. The control unit 11 may encode the partial image using an encoding scheme according to the model's specifications when inputting it to the first model. The control unit 11 transmits the second image and the first output result to the server device 20 via the communication unit 13 and the network N.
[0043] The first model is a large language model that includes a pre-trained vision and language model. The vision and language model is a machine learning model (multimodal model) trained by mapping the features of visual input and the features of language input into a single feature space. The first model is pre-trained on large-scale data of images and documents so that, when at least an image and a first prompt are input, it outputs the names of the objects depicted in the image. The first model may also be pre-trained using existing techniques and large-scale datasets, for example, UNITER (UNiversal Image-TExt Representation Learning), COCO (Common Objects in Context), Visual Genome, etc.
[0044] The first prompt may be a text-based instruction such as, "What is in this image?" The first output may be a text-based answer such as, "The image shows an umbrella."
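As an illustration of S5, the following sketch packages a partial image together with the first prompt and pulls an item name out of a text answer. The JSON payload layout, the use of Base64 as the encoding scheme, and the regular expression are all assumptions made for this example; the disclosure does not define a request format or how the answer is parsed.

```python
import base64
import json
import re
from typing import Optional

FIRST_PROMPT = "What is in this image?"


def build_first_request(partial_image_bytes: bytes) -> str:
    """Serialize the image and the first prompt (hypothetical payload layout)."""
    payload = {
        "prompt": FIRST_PROMPT,
        # Base64 is assumed here; the actual scheme depends on the model.
        "image_base64": base64.b64encode(partial_image_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def extract_item_name(answer: str) -> Optional[str]:
    """Extract the item from answers shaped like 'The image shows an umbrella.'"""
    match = re.search(
        r"\bshows\s+(?:an?|the)\s+([A-Za-z][A-Za-z \-]*?)\s*\.?\s*$",
        answer.strip(),
    )
    return match.group(1).lower() if match else None


request_text = build_first_request(b"\x00\x01")
item_name = extract_item_name("The image shows an umbrella.")
```

A real answer may be phrased quite differently, so a deployed system would likely need more robust parsing or a prompt that constrains the answer format.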
[0045] In S6, the control unit 21 of the server device 20 inputs the second image and the second prompt to the second model, which is the second trained model, to verify the correctness of the first output result. When inputting to the second model, the control unit 21 may encode the second image using an encoding scheme according to the model's specifications.
[0046] The second model is a large language model that includes a pre-trained vision and language model. The second model is pre-trained on large-scale data of images and documents so that, when at least an image and a second prompt containing an item name are input, it outputs whether or not the item identified by the item name is depicted in the image.
[0047] The second prompt is an instruction that asks whether the item indicated by the first output is included in the second image. The second prompt may be a text-based instruction, such as "Does this image include an umbrella?". The second output may be a text-based answer, such as "Yes." or "No.".
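The construction of the second prompt and the interpretation of a "Yes." or "No." answer might look like the sketch below. The article selection rule and the treatment of any other answer as undetermined are assumptions of this example; `build_second_prompt` and `parse_yes_no` are hypothetical names.

```python
import re
from typing import Optional


def build_second_prompt(item_name: str) -> str:
    """Embed the item name from the first output result in a yes/no question."""
    if not item_name:
        raise ValueError("item_name must not be empty")
    # Simplified English article rule, assumed for illustration only.
    article = "an" if item_name[0].lower() in "aeiou" else "a"
    return f"Does this image include {article} {item_name}?"


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map the answer to True or False, or None when it is neither."""
    match = re.match(r"(yes|no)\b", answer.strip().lower())
    if match is None:
        return None
    return match.group(1) == "yes"


second_prompt = build_second_prompt("umbrella")
verdict = parse_yes_no("Yes.")
```

Returning `None` for an unclear answer leaves it to the caller to decide whether to retry the query or to treat the item as unverified.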
[0048] In S7, the control unit 21 determines, based on the second output result obtained in S6, whether the item indicated by the first output result is included in the second image. If it is (Yes in S7), the control unit 21 proceeds to S8; otherwise (No in S7), it proceeds to S9.
[0049] In S8, the control unit 21 registers the item determined to be included in the second image as a lost item in the system. For example, the control unit 21 may register the item name and the second image in a database constructed in the storage unit 22, associating them with relevant information such as the identification information of the vehicle 40, the date and time, and the location (e.g., GPS information).
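The registration in S8 could be sketched with Python's built-in `sqlite3` module as follows. The table name, the column layout, and the use of SQLite are assumptions for illustration; the disclosure only states that the item name and the second image are registered together with related information.

```python
import sqlite3
from typing import Optional


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) lost_items table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS lost_items (
               id INTEGER PRIMARY KEY,
               item_name TEXT NOT NULL,
               image BLOB NOT NULL,
               vehicle_id TEXT NOT NULL,
               found_at TEXT NOT NULL,
               latitude REAL,
               longitude REAL
           )"""
    )
    return conn


def register_lost_item(
    conn: sqlite3.Connection,
    item_name: str,
    image: bytes,
    vehicle_id: str,
    found_at: str,
    latitude: Optional[float] = None,
    longitude: Optional[float] = None,
) -> int:
    """Insert one lost item and return the identifier of the new row."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO lost_items "
            "(item_name, image, vehicle_id, found_at, latitude, longitude) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (item_name, image, vehicle_id, found_at, latitude, longitude),
        )
    return cursor.lastrowid


registry = open_registry()
row_id = register_lost_item(
    registry, "umbrella", b"second-image-bytes", "vehicle-40", "2024-12-19T08:30:00"
)
stored = registry.execute("SELECT item_name, vehicle_id FROM lost_items").fetchall()
```

The identifiers and timestamp in the usage lines are placeholder values, not data from the disclosure.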
[0050] The control unit 21 may also input additional prompts to the second model to profile lost and forgotten items and register the results. For example, if the item in the second image is an umbrella, the control unit 21 may input a prompt such as "What color is this umbrella?" to the second model and output the color of the umbrella. Alternatively, the control unit 21 may input a prompt such as "What type of umbrella is this?" to the second model and output the type of umbrella (e.g., folding umbrella, walking stick type, plastic umbrella, parasol, etc.). Alternatively, the control unit 21 may input a prompt such as "Is this umbrella for a child?" to the second model and output whether it is for a child or an adult. Alternatively, the control unit 21 may input a prompt such as "Where was this umbrella found?" to the second model and output the location within the vehicle 40 where the umbrella was found (e.g., near the door, on the back seat, aisle, etc.). Alternatively, for example, the control unit 21 may prompt the second model for information such as the date and time and the location of the vehicle 40. The control unit 21 may also register these output results in the database in association with the item name.
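The additional profiling prompts can be organized as a small table of templates, as in the following sketch. The template keys and the `model` callable, which takes image bytes and a prompt and returns text, are hypothetical; a stand-in function is used here instead of a real second model.

```python
from typing import Callable, Dict

# Hypothetical model interface: (image bytes, prompt text) -> answer text.
Model = Callable[[bytes, str], str]

# Prompt templates based on the example questions in the description.
PROFILE_PROMPTS: Dict[str, str] = {
    "color": "What color is this {item}?",
    "type": "What type of {item} is this?",
    "for_child": "Is this {item} for a child?",
    "location": "Where was this {item} found?",
}


def profile_item(image: bytes, item_name: str, model: Model) -> Dict[str, str]:
    """Ask every profiling question and collect the answers by attribute."""
    return {
        attribute: model(image, template.format(item=item_name))
        for attribute, template in PROFILE_PROMPTS.items()
    }


def echo_model(image: bytes, prompt: str) -> str:
    """Stand-in for the second model: returns the prompt it received."""
    return prompt


profile = profile_item(b"second-image-bytes", "umbrella", echo_model)
```

The resulting dictionary could then be stored alongside the item name, for example as extra columns or as a serialized field.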
[0051] After executing S8, the control unit 21 terminates the processing of the flowchart.
[0052] In S9, the control unit 21 feeds the second output result back to the first model. Specifically, the control unit 21 of the server device 20 notifies the terminal device 10 that the second output result indicates that the item indicated by the first output result is not included in the second image. Upon receiving the notification from the server device 20, the control unit 11 of the terminal device 10 may, in order to reduce false detections by the first model, retrain the first model or input a prompt to the first model indicating that the item indicated by the first output result is not included in the second image. After executing S9, the control unit 21 terminates the processing of the flowchart.
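The branch from S5 through S9 can be summarized in a short sketch. The models are represented by plain callables, the item name is taken naively as the last word of the first answer, and registration and feedback are reduced to callbacks; all of these are simplifying assumptions, and `run_s5_to_s9` is a hypothetical name.

```python
from typing import Callable, List

# Hypothetical model interface: (image bytes, prompt text) -> answer text.
Model = Callable[[bytes, str], str]


def run_s5_to_s9(
    partial_image: bytes,
    second_image: bytes,
    first_model: Model,
    second_model: Model,
    register: Callable[[str], None],
    feed_back: Callable[[str], None],
) -> str:
    """Run identification and verification; return the step that was reached."""
    # S5: query the first model with the partial image.
    first_answer = first_model(partial_image, "What is in this image?")
    # Naive parsing (assumption): the item name is the last word of the answer.
    item_name = first_answer.strip().rstrip(".").split()[-1].lower()
    # S6: query the second model with the whole second image.
    second_answer = second_model(
        second_image, f"Does this image include the following item: {item_name}?"
    )
    # S7: branch on the verification result.
    if second_answer.strip().lower().startswith("yes"):
        register(item_name)   # S8
        return "S8"
    feed_back(item_name)      # S9
    return "S9"


registered: List[str] = []
rejected: List[str] = []


def says_umbrella(image: bytes, prompt: str) -> str:
    return "The image shows an umbrella."


step_yes = run_s5_to_s9(b"crop", b"full", says_umbrella,
                        lambda image, prompt: "Yes.", registered.append, rejected.append)
step_no = run_s5_to_s9(b"crop", b"full", says_umbrella,
                       lambda image, prompt: "No.", registered.append, rejected.append)
```

In the embodiment the S5 call runs on the terminal device 10 and the S6 to S9 steps on the server device 20; the single function here ignores that split for brevity.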
[0053] As described above, the monitoring system 1 monitors the inside of the vehicle 40 with its imaging unit. Specifically, the monitoring system 1 extracts a partial image from the second image that includes a difference region, which is a region in the difference image (the difference between the first image taken at the first time and the second image taken at the second time) that has a significant difference with a certain extent. The monitoring system 1 then inputs the partial image and a first prompt that queries the items contained in the partial image to the first model. The first model is a pre-trained model that, when at least an image and a first prompt are input, outputs the names of the items depicted in the image. The monitoring system 1 obtains the first output result from the first model regarding the items included in the second image. The monitoring system 1 inputs the second image and a second prompt that queries whether the second image contains the items indicated by the first output result to the second model. The second model is a pre-trained model that, when at least an image and a second prompt containing an item name are input, outputs whether or not the item identified by the item name is depicted in the image. The monitoring system 1 outputs the second output result from the second model regarding the correctness of the first output result.
[0054] In this way, the monitoring system 1 can improve the accuracy of item identification by giving different prompts to the trained machine learning model.
[0055] The second model may be the same as the first model, or it may be a different model. Even if the first and second models are the same model, the first output result can be verified with high accuracy because different prompts are used for querying. For example, the first model used in terminal device 10 may be a lightweight model capable of high-speed processing. The second model used in server device 20 may be a model that produces higher-precision output at the cost of heavier computation. In this way, by using trained models that are appropriate to the respective resources of terminal device 10 and server device 20, the first model can be easily integrated and implemented in terminal device 10, and both the computational load on terminal device 10 and the amount of communication between terminal device 10 and server device 20 can be reduced. Furthermore, by performing high-precision calculations in server device 20, which has more resources, it is possible to produce high-precision output. For example, terminal device 10 may immediately notify the occupants of vehicle 40 when it detects that an item has been left behind.
[0056] Furthermore, the first and second images may be depth images that represent the distance from the camera 30 for each pixel of the image. The difference image may be an image that shows the difference in the depth direction between the first and second images. In this way, by using depth images as the first and second images, the monitoring system 1 can accurately identify lost items inside the vehicle 40 regardless of the lighting conditions inside the vehicle 40 and the conditions of light entering from outside the vehicle.
[0057] Furthermore, while the monitoring system 1 identifies items that appear in only one of the first and second images taken at different times, its use is not limited to identifying lost items. For example, if the second time is earlier than the first time, and an item that appears in the second image taken at the second time is not present in the first image taken at the first time, the monitoring system 1 may determine that the item has been stolen.
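The distinction between an item that was left behind and an item that may have been taken can be expressed as a comparison of capture times, as in this sketch. The labels `"left_behind"`, `"removed"`, and `"no_change"` and the function name `classify_change` are hypothetical, and treating "present only in the earlier image" as a possible theft follows the example in the paragraph above.

```python
from datetime import datetime


def classify_change(
    first_time: datetime,
    second_time: datetime,
    in_first_image: bool,
    in_second_image: bool,
) -> str:
    """Classify an item that appears in only one of the two images."""
    if first_time == second_time:
        raise ValueError("the two capture times must differ")
    if in_first_image == in_second_image:
        return "no_change"
    # Work out whether the later of the two images contains the item.
    in_later_image = in_second_image if second_time > first_time else in_first_image
    # Present only in the later image: left behind.
    # Present only in the earlier image: removed (possibly stolen).
    return "left_behind" if in_later_image else "removed"


before_boarding = datetime(2024, 12, 19, 8, 0)
after_alighting = datetime(2024, 12, 19, 9, 0)

lost_case = classify_change(before_boarding, after_alighting, False, True)
theft_case = classify_change(after_alighting, before_boarding, False, True)
```

The dates used in the example are arbitrary placeholder values.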
[0058] Furthermore, monitoring system 1 outputs a second output result regarding the correctness of the first output result using the second model, but the second output result is not limited to a choice between "correct" and "incorrect". For example, monitoring system 1 may output a numerical value indicating the degree of correctness of the first output result (e.g., a value between 0 and 1), or one of multiple levels (e.g., A, B, C, etc.) as the second output result.
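One way a numerical degree of correctness could be mapped to levels is sketched below. The thresholds 0.8 and 0.5 and the function name are assumptions chosen for illustration; the disclosure does not specify them.

```python
def grade_from_score(score: float) -> str:
    """Map a correctness score in [0, 1] to a level A, B, or C."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.8:  # hypothetical threshold
        return "A"
    if score >= 0.5:  # hypothetical threshold
        return "B"
    return "C"


grades = [grade_from_score(s) for s in (0.9, 0.5, 0.2)]
```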
[0059] This disclosure is not limited to the embodiments described above. For example, multiple blocks shown in the block diagram may be combined, or a single block may be divided. Multiple steps shown in the flowchart may be performed in parallel or in a different order, depending on the processing capacity of the device performing each step, or as necessary, instead of being performed in chronological order as described. Other modifications are possible without departing from the spirit of this disclosure.
[0060] Some embodiments of the present disclosure are set out below as notes. However, the embodiments of the present disclosure are not limited to these.
[Note 1] A monitoring system that monitors the inside of a vehicle using a camera unit, wherein the monitoring system: extracts, from a second image, a partial image including a difference region, which is a region of a certain extent in which a significant difference is observed in a difference image, the difference image being the difference between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputs the partial image and a first prompt that queries for an item included in the partial image to a first pre-trained model that has been trained to output the name of an item captured in an image when at least the image and the first prompt are input, thereby obtaining a first output result regarding the item included in the second image; and inputs the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second pre-trained model that has been trained to output whether an item identified by an item name is captured in an image when at least the image and the second prompt including the item name are input, thereby outputting a second output result regarding the correctness of the first output result.
[Note 2] A monitoring system that monitors the inside of a vehicle using a camera unit, wherein the monitoring system: extracts, from a second image, a partial image including a difference region in a difference image between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputs the partial image and a first prompt that queries for an item included in the partial image to a first trained model, thereby obtaining a first output result regarding the item included in the second image; and inputs the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second trained model, thereby outputting a second output result regarding the correctness of the first output result.
[Note 3] The monitoring system according to [Note 2], wherein the first trained model and the second trained model are identical.
[Note 4] The monitoring system according to [Note 2] or [Note 3], wherein the second trained model is a trained model that outputs results with higher accuracy than the first trained model.
[Note 5] The monitoring system according to any one of [Note 2] to [Note 4], comprising a server device and a terminal device mounted on the vehicle that are capable of communicating with each other, wherein the terminal device executes a process of inputting the partial image and the first prompt to the first trained model to obtain the first output result regarding the item included in the second image, and the server device executes a process of inputting the second image and the second prompt to the second trained model to output the second output result regarding the correctness of the first output result.
[Note 6] The monitoring system according to any one of [Note 2] to [Note 5], wherein pixels for which the difference in pixel value between corresponding pixels of the first image and the second image is greater than or equal to a predetermined value are determined to be difference pixels, a region made up of the difference pixels and having an area greater than or equal to a predetermined threshold is determined to be the difference region, and an image including the determined difference region is extracted from the second image as the partial image.
[Note 7] The monitoring system according to any one of [Note 2] to [Note 6], wherein, if the second output result indicates that the first output result is correct, the item indicated by the first output result is registered in the storage unit.
[Note 8] The monitoring system according to any one of [Note 2] to [Note 7], wherein the first image and the second image are depth images that represent, for each pixel, the distance from the camera, and the difference image is an image showing the difference in the depth direction between the first image and the second image.
[Note 9] A monitoring method for a monitoring system that monitors the inside of a vehicle using a camera unit, the monitoring method including: extracting, from a second image, a partial image including a difference region in a difference image between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputting the partial image and a first prompt that queries for an item included in the partial image to a first trained model to obtain a first output result regarding the item included in the second image; and inputting the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second trained model to output a second output result regarding the correctness of the first output result.
[Note 10] The monitoring method according to [Note 9], wherein the first trained model and the second trained model are identical.
[Note 11] The monitoring method according to [Note 9] or [Note 10], wherein the second trained model is a trained model that outputs results with higher accuracy than the first trained model.
[Note 12] The monitoring method according to any one of [Note 9] to [Note 11], wherein the monitoring system comprises a server device and a terminal device mounted on the vehicle that are capable of communicating with each other, the terminal device executes a process of inputting the partial image and the first prompt to the first trained model to obtain the first output result regarding the item included in the second image, and the server device executes a process of inputting the second image and the second prompt to the second trained model to output the second output result regarding the correctness of the first output result.
[Note 13] The monitoring method according to any one of [Note 9] to [Note 12], wherein pixels for which the difference in pixel value between corresponding pixels of the first image and the second image is greater than or equal to a predetermined value are determined to be difference pixels, a region made up of the difference pixels and having an area greater than or equal to a predetermined threshold is determined to be the difference region, and an image including the determined difference region is extracted from the second image as the partial image.
[Note 14] The monitoring method according to any one of [Note 9] to [Note 13], wherein, if the second output result indicates that the first output result is correct, the item indicated by the first output result is registered in the storage unit.
[Note 15] A program that controls a monitoring system that monitors the inside of a vehicle, the program causing a processor to execute: extracting, from a second image, a partial image including a difference region in a difference image between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputting the partial image and a first prompt that queries for an item included in the partial image to a first trained model to obtain a first output result regarding the item included in the second image; and inputting the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second trained model to output a second output result regarding the correctness of the first output result.
[Note 16] The program according to [Note 15], wherein the first trained model and the second trained model are identical.
[Note 17] The program according to [Note 15] or [Note 16], wherein the second trained model is a trained model that outputs results with higher accuracy than the first trained model.
[Note 18] The program according to any one of [Note 15] to [Note 17], wherein pixels for which the difference in pixel value between corresponding pixels of the first image and the second image is greater than or equal to a predetermined value are determined to be difference pixels, a region made up of the difference pixels and having an area greater than or equal to a predetermined threshold is determined to be the difference region, and an image including the determined difference region is extracted from the second image as the partial image.
[Note 19] The program according to any one of [Note 15] to [Note 18], wherein, if the second output result indicates that the first output result is correct, the item indicated by the first output result is registered in the storage unit.
[Note 20] The program according to any one of [Note 15] to [Note 19], wherein the first image and the second image are depth images that represent, for each pixel, the distance from the camera, and the difference image is an image showing the difference in the depth direction between the first image and the second image.
[Explanation of Symbols]
[0061]
1: Monitoring system
10: Terminal device
11: Control unit
12: Storage unit
13: Communication unit
20: Server device
21: Control unit
22: Storage unit
23: Communication unit
30: Camera
30A: Imaging range
40: Vehicle
51, 53: Captured images
54: Umbrella
55: Area
N: Network
Claims
1. A monitoring system that monitors the inside of a vehicle using a camera unit, wherein the monitoring system: extracts, from a second image, a partial image including a difference region, which is a region of a certain extent in which a significant difference is observed in a difference image, the difference image being the difference between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputs the partial image and a first prompt that queries for an item included in the partial image to a first pre-trained model that has been trained to output the name of an item captured in an image when at least the image and the first prompt are input, thereby obtaining a first output result regarding the item included in the second image; and inputs the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second pre-trained model that has been trained to output whether an item identified by an item name is captured in an image when at least the image and the second prompt including the item name are input, thereby outputting a second output result regarding the correctness of the first output result.
2. A monitoring system that monitors the inside of a vehicle using a camera unit, wherein the monitoring system: extracts, from a second image, a partial image including a difference region in a difference image between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputs the partial image and a first prompt that queries for an item included in the partial image to a first trained model, thereby obtaining a first output result regarding the item included in the second image; and inputs the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second trained model, thereby outputting a second output result regarding the correctness of the first output result.
3. The monitoring system according to claim 2, wherein the second trained model is a trained model that outputs results with higher accuracy than the first trained model.
4. The monitoring system according to claim 2, wherein the first trained model and the second trained model are identical.
5. The monitoring system according to claim 2, comprising a server device and a terminal device mounted on the vehicle that are capable of communicating with each other, wherein the terminal device executes a process of inputting the partial image and the first prompt to the first trained model to obtain the first output result regarding the item included in the second image, and the server device executes a process of inputting the second image and the second prompt to the second trained model to output the second output result regarding the correctness of the first output result.
6. The monitoring system according to claim 2, wherein pixels for which the difference in pixel value between corresponding pixels of the first image and the second image is greater than or equal to a predetermined value are determined to be difference pixels, a region made up of the difference pixels and having an area greater than or equal to a predetermined threshold is determined to be the difference region, and an image including the determined difference region is extracted from the second image as the partial image.
7. The monitoring system according to claim 2, wherein if the second output result indicates that the first output result is correct, the item indicated by the first output result is registered in the storage unit.
8. The monitoring system according to any one of claims 2 to 7, wherein the first image and the second image are depth images that represent, for each pixel, the distance from the camera, and the difference image is an image showing the difference in the depth direction between the first image and the second image.
9. A monitoring method for a monitoring system that monitors the inside of a vehicle using a camera unit, the monitoring method including: extracting, from a second image, a partial image including a difference region in a difference image between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputting the partial image and a first prompt that queries for an item included in the partial image to a first trained model to obtain a first output result regarding the item included in the second image; and inputting the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second trained model to output a second output result regarding the correctness of the first output result.
10. The monitoring method according to claim 9, wherein the second trained model is a trained model that outputs results with higher accuracy than the first trained model.
11. The monitoring method according to claim 9, wherein the first trained model and the second trained model are identical.
12. The monitoring method according to claim 9, wherein the monitoring system comprises a server device and a terminal device mounted on the vehicle that are capable of communicating with each other, the terminal device executes a process of inputting the partial image and the first prompt to the first trained model to obtain the first output result regarding the item included in the second image, and the server device executes a process of inputting the second image and the second prompt to the second trained model to output the second output result regarding the correctness of the first output result.
13. The monitoring method according to claim 9, wherein pixels for which the difference in pixel value between corresponding pixels of the first image and the second image is greater than or equal to a predetermined value are determined to be difference pixels, a region made up of the difference pixels and having an area greater than or equal to a predetermined threshold is determined to be the difference region, and an image including the determined difference region is extracted from the second image as the partial image.
14. The monitoring method according to claim 9, wherein if the second output result indicates that the first output result is correct, the item indicated by the first output result is registered in the storage unit.
15. A program that controls a monitoring system that monitors the inside of a vehicle, the program causing a processor to execute: extracting, from a second image, a partial image including a difference region in a difference image between a first image captured inside the vehicle at a first time and the second image captured inside the vehicle at a second time; inputting the partial image and a first prompt that queries for an item included in the partial image to a first trained model to obtain a first output result regarding the item included in the second image; and inputting the second image and a second prompt that queries whether the item indicated by the first output result is included in the second image to a second trained model to output a second output result regarding the correctness of the first output result.
16. The program according to claim 15, wherein the second trained model is a trained model that outputs results with higher accuracy than the first trained model.
17. The program according to claim 15, wherein the first trained model and the second trained model are identical.
18. The program according to claim 15, wherein pixels for which the difference in pixel value between corresponding pixels of the first image and the second image is greater than or equal to a predetermined value are determined to be difference pixels, a region made up of the difference pixels and having an area greater than or equal to a predetermined threshold is determined to be the difference region, and an image including the determined difference region is extracted from the second image as the partial image.
19. The program according to claim 15, wherein if the second output result indicates that the first output result is correct, the program registers the item indicated by the first output result in the storage unit.
20. The program according to claim 15, wherein the first image and the second image are depth images that represent, for each pixel, the distance from the camera, and the difference image is an image showing the difference in the depth direction between the first image and the second image.