Class definition supporting system, class definition supporting method, annotation system, and annotation method
The class definition support system and annotation method leverage a language model to assist users in defining classes and labeling images, addressing the expertise requirement and effort burden in creating action recognition models, enhancing efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- PANASONIC INTELLECTUAL PROPERTY MANAGEMENT CO LTD
- Filing Date
- 2025-11-14
- Publication Date
- 2026-07-02
AI Technical Summary
Existing technologies require expertise and significant user effort for defining classes in action recognition models, leading to a heavy burden in creating labeled data for object and action recognition, especially for inexperienced users.
A class definition support system and method that utilizes a language model to assist users in generating class definition information through natural language dialogue, and an annotation system that automatically labels images based on user selection, reducing the burden and improving efficiency.
Enables inexperienced users to create accurate recognition models with reduced effort by supporting class definition and annotation processes, allowing efficient labeling of large image datasets.
Figure JP2025039953_02072026_PF_FP_ABST
Description
Class Definition Support System, Class Definition Support Method, Annotation System, and Annotation Method
[0001] The present disclosure relates to a class definition support system, a class definition support method, an annotation system, and an annotation method used in training to generate an action recognition model from images captured at a site.
[0002] In manufacturing, logistics, and similar workplaces, work efficiency can be improved by analyzing the working conditions of workers and reviewing how the work is carried out based on the analysis results. Using a recognition model generated by machine learning for this analysis is expected to improve the accuracy of work analysis and reduce the burden it places on users. In this setting, machine learning requires a large amount of labeled data to improve the accuracy of object recognition and action recognition.
[0003] As a conventional technique for work analysis using such a recognition model, it is known to display an annotation screen on a terminal device, have the user specify an object to be recognized on a frame image and an action to be recognized on an interval video, and thereby create the data used in training to generate an object recognition model (object detection model) and an action recognition model (action classification model) (see Patent Document 1).
[0004] Patent Document 1: Japanese Patent Application Laid-Open No. 2024-81132
[0005] According to the conventional technology, a well-designed GUI allows the user operations for annotation to proceed efficiently, such as the operation of designating the image area of an object (object image) on a frame image of a video captured at a site, and the operation of entering the boundaries of an interval in which a predetermined action is observed in that video.
[0006] On the other hand, when extracting action images (image regions of human action scenes) from captured images, it is necessary to properly define classes, which are effective combinations of multiple objects related to the action to be recognized, that is, the person as the subject of the action and the objects that are the target of the action (including people). However, when the correlation between multiple objects related to the action becomes complex, properly defining classes requires expertise from the user, and it has been difficult for inexperienced users to properly define classes.
[0007] Furthermore, with conventional technologies, generating large amounts of labeled data to improve the accuracy of object recognition and action recognition requires users to visually inspect frame images, identify object images (image regions of objects) and action images (image regions of human action scenes) contained within the frame images, and then perform operations to assign recognition class labels (object class, action class) to those object images and action images. As a result, the burden on users performing the annotation work cannot be sufficiently reduced. In particular, when performing annotation work on all frame images of the entire recorded video, operations on a large number of images are required, resulting in the problem of requiring an enormous amount of time for the annotation work.
[0008] Therefore, the main purpose of this disclosure is to provide a class definition support system and a class definition support method that enable even inexperienced users to appropriately create the class definition information necessary for annotation and training in order to generate a recognition model.
[0009] A further main purpose of this disclosure is to provide an annotation system and an annotation method that significantly reduce the burden on users performing annotation work to generate recognition models, enabling them to easily create recognition models with sufficient accuracy in a short amount of time.
[0010] The class definition support system of this disclosure is a class definition support system that supports user operations for obtaining class definition information relating to an event to be recognized, and comprises a terminal device in which the user performs an operation to input information relating to an event to be recognized that describes the content of the event to be recognized, and a server device connected to the terminal device via a network and which obtains the class definition information based on the information relating to the event to be recognized, wherein the server device is configured to cause a language model to perform a process to collect the information relating to the event to be recognized necessary for obtaining the class definition information while engaging in natural language dialogue with the user, and to obtain the class definition information generated by the language model.
[0011] Furthermore, the class definition support method of this disclosure is a class definition support method that causes a processor to execute a process to support user operations for obtaining class definition information relating to an event to be recognized, and in obtaining the class definition information based on the event to be recognized information that describes the content of the event to be recognized, the method is configured to cause a language model to perform a process to collect the event to be recognized information necessary for obtaining the class definition information while engaging in natural language dialogue with the user, and to obtain the class definition information generated by the language model.
[0012] Furthermore, the annotation system of this disclosure is an annotation system that performs processing related to annotation of target images used for learning to generate a recognition model, which are extracted from on-site captured images, and comprises a terminal device in which a user performs an operation to select a target image that corresponds to a recognition class to be recognized, and a server device connected to the terminal device via a network, which performs annotation processing to assign a recognition class label in response to the user's operation to select a target image, wherein the server device extracts target images from captured images, classifies the extracted target images into one or more image groups, extracts a representative image from the image group based on predetermined extraction conditions, presents the representative image to the user as a candidate image, and, in response to the user's operation to select a candidate image that corresponds to a recognition class, assigns the same recognition class label to all target images included in the image group to which the selected candidate image belongs as the annotation processing.
[0013] Furthermore, the annotation method disclosed herein is an annotation method that causes a processor to perform annotation processing on target images used for training to generate a recognition model, which are extracted from on-site captured images, and comprises extracting target images from captured images, classifying the extracted target images into one or more image groups, extracting representative images from the image groups based on predetermined extraction conditions, presenting the representative images to the user as candidate images, and, in response to the user's operation to select a candidate image that corresponds to a recognition class, assigning the same recognition class label to all target images included in the image group to which the selected candidate image belongs as an annotation process.
[0014] According to this disclosure, the user's work in creating class definition information is appropriately supported, so even users unfamiliar with the process can perform the task of creating class definition information properly.
[0015] Furthermore, according to this disclosure, by simply selecting some object images that belong to an object class, all object images in the same cluster as the selected object images will be labeled at once. This allows users to perform annotation work on a large number of object images efficiently and without much effort.
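The collective labeling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `propagate_labels` and the dictionary layout (image id mapped to cluster id) are hypothetical and chosen only for this example.

```python
def propagate_labels(cluster_of, selected, class_label, labels=None):
    """Assign class_label to every image in the clusters of the selected images.

    cluster_of : dict mapping image id -> cluster id (hypothetical layout)
    selected   : iterable of image ids the user picked from the candidates
    labels     : existing image id -> label dict to extend (optional)
    """
    labels = dict(labels or {})
    chosen_clusters = {cluster_of[image_id] for image_id in selected}
    for image_id, cluster_id in cluster_of.items():
        if cluster_id in chosen_clusters:
            labels[image_id] = class_label
    return labels


# Five object images grouped into two clusters; the user selects only "img_a".
cluster_of = {"img_a": 0, "img_b": 0, "img_c": 0, "img_d": 1, "img_e": 1}
labels = propagate_labels(cluster_of, selected=["img_a"], class_label="tablet device")
print(labels)
```

In this sketch a single selection labels the three images of cluster 0, while the images of cluster 1 stay unlabeled until the user selects a candidate from that cluster.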
[0016]
- Overall configuration diagram of the model generation system according to this embodiment
- Block diagram showing the schematic configuration of the model generation server
- Sequence diagram showing an overview of the processing related to class definition performed by the model generation server, user terminal, and LLM server
- Explanatory diagram showing an example of class definition information
- Explanatory diagram showing an example of class definition information
- Explanatory diagram showing an overview of the object image extraction processing and object feature extraction processing performed by the model generation server
- Explanatory diagram showing an overview of the total combination extraction processing, effective combination extraction processing, action image extraction processing, and action feature extraction processing performed by the model generation server
- Explanatory diagram showing an overview of the clustering processing and representative image extraction processing performed by the model generation server
- Explanatory diagram showing the class definition mode screen displayed on the user terminal
- Explanatory diagram showing the class definition mode screen when a model is selected on the user terminal
- Explanatory diagram showing the object recognition mode screen displayed on the user terminal
- Explanatory diagram showing the action recognition mode screen displayed on the user terminal
- Explanatory diagram showing the inference mode screen displayed on the user terminal
- Flowchart showing the processing steps performed on the model generation server
- Flowchart showing the input video processing steps performed on the model generation server
- Flowchart showing the class definition relay processing steps performed on the model generation server
- Flowchart showing the object candidate extraction processing steps performed on the model generation server
- Flowchart showing the object recognition annotation processing steps performed on the model generation server
- Flowchart showing the action candidate extraction processing steps performed on the model generation server
- Flowchart showing the action recognition annotation processing steps performed on the model generation server
[0017] The first invention made to solve the aforementioned problems is a class definition support system that assists a user in acquiring class definition information relating to an event to be recognized, comprising: a terminal device in which the user performs an operation to input recognition target information that describes the content of the event to be recognized; and a server device connected to the terminal device via a network and acquiring the class definition information based on the recognition target information, wherein the server device is configured to have a language model perform a process to collect the recognition target information necessary for acquiring the class definition information while engaging in natural language dialogue with the user, and to acquire the class definition information generated by the language model.
[0018] According to this, the user's work in creating class definition information is properly supported, so even users unfamiliar with the process can perform the task of creating class definition information correctly.
[0019] Furthermore, the second invention is configured such that the server device acquires information defining the object class corresponding to the action class as the class definition information relating to the action class.
[0020] According to this, based on class definition information, it is possible to appropriately extract action images from captured images that will be used to train the action recognition model.
[0021] Furthermore, the third invention is configured such that the server device acquires information regarding effective combinations of an action class and the object class it is targeting, as class definition information.
[0022] According to this, it is possible to extract more appropriate action images from captured images.
[0023] Furthermore, the fourth invention is configured such that the server device displays a screen on the terminal device that visualizes the class definition information, which represents the correspondence between object classes and action classes, in a tabular format.
[0024] According to this, class definition information is visualized and presented to the user in a tabular format, making it easy for the user to determine whether the acquired class definition information is appropriate or not.
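As an illustration of such a tabular view, the following sketch renders the correspondence between action classes (rows) and object classes (columns) as plain text. The function name `render_class_table`, the mapping layout, and the `o` / `-` cell markers are assumptions made for this example; the class names are taken from the examples given later in the description.

```python
def render_class_table(object_classes, action_to_objects):
    """Return a text table: rows are action classes, columns are object classes."""
    name_width = max(len(action) for action in action_to_objects)
    header = " " * name_width + " | " + " | ".join(object_classes)
    rows = [header]
    for action, objects in action_to_objects.items():
        # Mark "o" where the object class belongs to the action, "-" otherwise.
        cells = [
            ("o" if obj in objects else "-").center(len(obj))
            for obj in object_classes
        ]
        rows.append(action.ljust(name_width) + " | " + " | ".join(cells))
    return "\n".join(rows)


object_classes = ["person", "driver", "tablet device"]
action_to_objects = {
    "driver fastening action": {"person", "driver"},
    "tablet device touch operation": {"person", "tablet device"},
}
table = render_class_table(object_classes, action_to_objects)
print(table)
```

A user reviewing this table can see at a glance which object classes each action class is defined over, and spot a missing or superfluous combination.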
[0025] Furthermore, the fifth invention is configured such that the server device is connected via a network to a language model device equipped with the language model, and acquires the class definition information in cooperation with the language model device.
[0026] According to this, class definition information can be obtained inexpensively by using a language model device operated by another company.
[0027] Furthermore, the sixth invention is configured such that the server device acquires instruction text in natural language entered by a user in the terminal device from the terminal device, transmits the instruction text to the language model device to instruct it to execute class definition processing, and acquires the class definition information based on the instruction text from the language model device.
[0028] According to this, appropriate class definition information can be obtained by using a language model device operated by another company.
[0029] Furthermore, the seventh invention is configured such that the server device transmits the captured image along with the instruction text to the language model device to instruct it to execute the class definition process, and obtains the class definition information based on the instruction text and the captured image from the language model device.
[0030] According to this, the language model device can collect information missing from the instruction text from the captured image, thereby generating more appropriate class definition information and further reducing the burden on the user who inputs the instruction text.
[0031] Furthermore, the eighth invention is a class definition support method that causes a processor to execute a process to support user operations for acquiring class definition information relating to an event to be recognized, wherein, in acquiring the class definition information based on recognition target information that describes the content of the event to be recognized, the method is configured to cause a language model to perform a process to collect the recognition target information necessary for acquiring the class definition information while engaging in natural language dialogue with the user, and to acquire the class definition information generated by the language model.
[0032] According to this, similar to the first invention, the user's work in creating class definition information is appropriately supported, so even users unfamiliar with the process of creating class definition information can perform it properly.
[0033] The ninth invention, made to solve the aforementioned problems, is an annotation system that performs annotation processing on target images used for learning to generate a recognition model extracted from on-site captured images, comprising: a terminal device in which a user performs an operation to select a target image that corresponds to a recognition class to be recognized; and a server device connected to the terminal device via a network, which performs annotation processing to assign a recognition class label in response to the user's operation to select a target image, wherein the server device extracts target images from captured images, classifies the extracted target images into one or more image groups, extracts a representative image from the image group based on predetermined extraction conditions, presents the representative image to the user as a candidate image, and, in response to the user's operation to select a candidate image that corresponds to a recognition class, assigns the same recognition class label to all target images included in the image group to which the selected candidate image belongs as the annotation processing.
[0034] According to this system, users simply select a subset of target images that belong to a recognition class, and all target images in the same cluster as the selected images are automatically labeled. This allows users to efficiently perform annotation work on a large number of target images without requiring much effort.
[0035] Furthermore, the tenth invention is configured such that the server device displays an annotation screen showing a list of candidate images on the terminal device, and executes the annotation process in response to the user's operation of selecting a candidate image corresponding to a recognition class on the annotation screen.
[0036] This makes it easier for users to select candidate images that match the recognition class.
[0037] Furthermore, the eleventh invention is configured such that, in response to a user operation specifying a recognition class, the server device displays a list of candidate images that are assumed to correspond to the specified recognition class on the annotation screen, and displays a correct / incorrect input unit on the annotation screen for the user to input whether or not the displayed candidate images are appropriate as target images corresponding to the specified recognition class.
[0038] According to this, for each candidate image in the displayed list, the user only needs to input whether it is appropriate as a target image corresponding to the specified recognition class, which makes selecting candidate images that correspond to the recognition class even easier. Note that the correct / incorrect input unit may be, for example, a checkbox.
[0039] Further, in the 12th invention, the server device is configured to detect an object from a captured image using an object detection model generated by machine learning, and extract an object image from the captured image based on the detection result.
[0040] According to this, an object image can be appropriately extracted from the captured image.
[0041] Further, in the 13th invention, the server device is configured to extract an action image including an image area of an object related to an action from a captured image based on class definition information defining an object class corresponding to an action class.
[0042] According to this, an action image can be appropriately extracted from the captured image.
[0043] Further, in the 14th invention, the server device is configured to extract feature amounts from a target image and classify the target image into an image group based on the feature amounts of the target image.
[0044] According to this, target images with similar features can be appropriately grouped into one cluster. As a result, target images with features similar to the target image selected by the user can be labeled collectively.
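The grouping of target images by feature amount can be sketched as below. The disclosure does not name a specific clustering algorithm at this point, so plain k-means is used here purely as a stand-in; the function name `cluster_features`, the two-dimensional feature vectors, and the farthest-point initialization are all assumptions made for illustration.

```python
import math


def cluster_features(features, k, iterations=20):
    """Group feature vectors into k clusters with a plain k-means loop.

    Returns a list giving the cluster index of each feature vector.
    """
    # Farthest-point initialization keeps this example deterministic.
    centroids = [features[0]]
    while len(centroids) < k:
        farthest = max(
            features, key=lambda f: min(math.dist(f, c) for c in centroids)
        )
        centroids.append(farthest)

    assignment = [0] * len(features)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(f, centroids[j]))
            for f in features
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        for j in range(k):
            members = [f for f, a in zip(features, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


# Toy feature amounts: three images near the origin, two far away.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
groups = cluster_features(features, k=2)
print(groups)
```

In practice the feature vectors would come from an image feature extractor and have many more dimensions; the grouping step itself is unchanged.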
[0045] Further, in the 15th invention, the server device is configured to extract a representative image from an image group based on extraction conditions specified by a user.
[0046] According to this, by appropriately specifying extraction conditions, the user can appropriately extract a representative image from the image group and present it as a candidate image.
[0047] Further, in the sixteenth invention, when the server device cannot obtain the accuracy that satisfies the user with the recognition model generated by learning using the target image labeled by the annotation process, the server device uses the previous processing result to extract the target image from the captured image again, reclassifies the re-extracted target image into the image groups, re-extracts the representative image from the re-classified image groups, and re-presents the re-extracted representative image to the user as a candidate image.
[0048] According to this, by performing additional learning of the recognition model, the accuracy of the recognition model can be improved. In this case, for example, as the previous processing result, the recognition model generated previously may be used to re-extract the target image from the captured image.
[0049] Further, in the seventeenth invention, the server device is configured to classify the target image into the number of image groups corresponding to the specified number of candidate images according to the user operation that specifies the number of candidate images to be presented.
[0050] According to this, the user can adjust the number of target images to be presented as candidate images. In this case, the number of representative images corresponding to the specified number of candidate images is extracted from the image groups and presented to the user as candidate images. In the present disclosure, the number of image groups to be generated is specified by the user, but as the extraction condition for extracting the representative image from the image groups, the number of representative images to be extracted from one image group may be specified by the user.
[0051] Further, in the eighteenth invention, the server device is configured to extract the representative image from the image groups based on the specified rank according to the user operation that specifies the rank regarding the position from the center of gravity of the image groups as the extraction condition.
[0052] According to this, by appropriately specifying a rank based on the center of gravity of an image group, the user can have an appropriate representative image (candidate image) extracted from the image group and presented. The rank regarding the position from the center of gravity may be, for example, the rank assigned in ascending order of distance from the center of gravity of the image group.
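A rank-based extraction of this kind might look like the following sketch. It assumes that the center of gravity is the arithmetic mean of the feature vectors in the group and that distance is Euclidean; neither detail is stated in the text, and the function name `representative_by_rank` is hypothetical.

```python
import math


def representative_by_rank(features, rank):
    """Return the index of the image whose distance from the group's
    center of gravity has the given rank (1 = closest)."""
    dims = len(features[0])
    centroid = [sum(f[d] for f in features) / len(features) for d in range(dims)]
    # Sort image indices by ascending distance from the center of gravity.
    order = sorted(
        range(len(features)), key=lambda i: math.dist(features[i], centroid)
    )
    return order[rank - 1]


# One image group with three feature vectors; the mean lies at about (1.67, 0).
group = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
closest = representative_by_rank(group, rank=1)
farthest = representative_by_rank(group, rank=3)
print(closest, farthest)
```

A low rank yields a typical member of the group, while a high rank yields a member near the group boundary, which can help the user judge whether the whole group really belongs to one class.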
[0053] Furthermore, the 19th invention is configured such that the server device randomly extracts representative images from the image group as the extraction condition.
[0054] According to this, the process of extracting a representative image from a set of images can be easily performed.
[0055] Furthermore, the 20th invention is an annotation method that causes a processor to perform annotation processing on target images used for learning to generate a recognition model extracted from on-site captured images, wherein the method extracts target images from captured images, classifies the extracted target images into one or more image groups, extracts a representative image from the image group based on predetermined extraction conditions, presents the representative image to the user as a candidate image, and, in response to the user's operation to select a candidate image corresponding to a recognition class, the annotation process collectively assigns the same recognition class label to all target images included in the image group to which the selected candidate image belongs.
[0056] According to this, similar to the ninth invention, by simply having the user select some target images that belong to a recognition class, target images included in the same cluster as the selected target images are labeled all at once. Therefore, users can perform annotation work on a large number of target images efficiently and without much effort.
[0057] The embodiments of this disclosure will be described below with reference to the drawings.
[0058] Figure 1 is an overall diagram of the model generation system according to this embodiment.
[0059] This system (class definition support system, object recognition annotation system, action recognition annotation system) comprises a model generation server 1 (server device, class definition support device, object recognition annotation device, action recognition annotation device), a user terminal 2 (terminal device), and an LLM server 3 (language model device). The model generation server 1 and the user terminal 2 are connected via a network. The model generation server 1 and the LLM server 3 are also connected via a network.
[0060] The model generation server 1 performs processes such as acquiring training images (object images, action images) necessary for generating recognition models (object recognition models, action recognition models), annotation to assign recognition classes (object classes, action classes) to the training images, machine learning to generate recognition models using the training images, and verifying the accuracy of the generated recognition models.
[0061] User terminal 2 displays an operation screen where the user performs operations necessary for generating recognition models (object recognition models, action recognition models) based on the control of model generation server 1. On the operation screen, the user performs operations such as inputting training videos and verification videos, inputting processing conditions related to the processing performed by model generation server 1, and checking the accuracy of the generated recognition model.
[0062] The LLM server 3 includes a language model. The language model is used to obtain the information necessary for generating training images (class definition information) through dialogue with the user. The LLM server 3 may include a model in which processing functions for data other than language are added to the language model, for example, a multimodal LLM (Large Language Model). Alternatively, the LLM server 3 may include a model that has both image processing and language processing functions, for example, a VLM (Vision-Language Model).
[0063] The model generation server 1 may consist of a single information processing device, or it may consist of multiple information processing devices. For example, it may consist of a front-end server that handles the GUI for annotation, and a back-end server that performs processing such as annotation and training to generate object recognition models and action recognition models.
[0064] Next, we will describe the general configuration of the model generation server 1. Figure 2 is a block diagram showing the general configuration of the model generation server 1.
[0065] The model generation server 1 comprises a communication unit 11, a storage unit 12, and a processor 13.
[0066] The communication unit 11 communicates with the user terminal 2 and the LLM server 3.
[0067] The storage unit 12 stores the programs executed by the processor 13 and other data.
[0068] Furthermore, the storage unit 12 stores training videos (frame images) and verification videos (frame images). The training videos and verification videos are recordings of workers performing their work. The training videos are used for training to generate object recognition models and action recognition models. The verification videos are used to verify the accuracy of the generated object recognition models and action recognition models.
[0069] Furthermore, the storage unit 12 stores object images and action images. Object images are extracted from the frame image as rectangular image regions surrounding the detected object. Object images are used as training images to generate an object recognition model. Action images are extracted from the frame image as rectangular image regions containing multiple objects related to a person's actions. Action images are used as training images to generate an action recognition model.
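One simple way to obtain a rectangular region containing several related objects is to take the smallest rectangle that encloses all of their bounding boxes. The sketch below shows this; the text only states that the action image contains the related objects, so the choice of the tightest enclosing rectangle, the `(x1, y1, x2, y2)` box format, and the function name `enclosing_box` are assumptions.

```python
def enclosing_box(boxes):
    """Smallest rectangle (x1, y1, x2, y2) that contains every given box."""
    x1 = min(box[0] for box in boxes)
    y1 = min(box[1] for box in boxes)
    x2 = max(box[2] for box in boxes)
    y2 = max(box[3] for box in boxes)
    return (x1, y1, x2, y2)


# Hypothetical detections in one frame, in pixel coordinates.
person_box = (100, 50, 300, 400)
tool_box = (280, 200, 340, 260)
action_region = enclosing_box([person_box, tool_box])
print(action_region)
```

Cropping the frame to `action_region` would then give one candidate action image covering both the person and the object being handled.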
[0070] Furthermore, the storage unit 12 stores class definition information, object recognition annotation information, and action recognition annotation information. Class definition information is information that defines the event class to be recognized. Specifically, class definition information is information regarding the correspondence between object classes and action classes, that is, information regarding effective combinations of action classes and the object classes they target. Class definition information is used in the process of extracting action images from frame images. Object recognition annotation information is information regarding the object class assigned to the object image and is used for training to generate the object recognition model. Action recognition annotation information is information regarding the action class assigned to the action image and is used for training to generate the action recognition model.
[0071] The object class represents the type of object to be recognized (e.g., person, driver, tablet device). The action class represents the type of action to be recognized (e.g., driver fastening action, tablet device touch operation).
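To make the role of class definition information concrete, the following sketch stores it as a mapping from each action class to the object classes involved, and uses it to decide which action classes are worth considering for a frame given the object classes detected in it. The data layout, the inclusion of `person` in every combination, and the function name `candidate_actions` are assumptions for illustration only.

```python
# Hypothetical class definition: action class -> object classes it involves.
CLASS_DEFINITION = {
    "driver fastening action": {"person", "driver"},
    "tablet device touch operation": {"person", "tablet device"},
}


def candidate_actions(detected_classes, class_definition=CLASS_DEFINITION):
    """List the action classes whose object classes all appear in the frame."""
    detected = set(detected_classes)
    return [
        action
        for action, required in class_definition.items()
        if required <= detected  # every required object class was detected
    ]


with_tablet = candidate_actions(["person", "tablet device"])
tool_only = candidate_actions(["driver"])
print(with_tablet, tool_only)
```

Frames in which no effective combination is present yield an empty list and can be skipped when extracting action images.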
[0072] Furthermore, the storage unit 12 stores an object recognition model and an action recognition model. The object recognition model recognizes objects contained in the video to be processed and is generated by the processor 13. The action recognition model recognizes actions contained in the video to be processed and is generated by the processor 13.
[0073] The processor 13 performs various processes by executing programs stored in the storage unit 12. In this embodiment, the processor 13 performs video acquisition processing, condition setting processing, GUI processing, class definition relay processing, object detection processing, object candidate extraction processing, object recognition annotation processing, object recognition model learning processing, object recognition inference processing, action candidate extraction processing, action recognition annotation processing, action recognition model learning processing, and action recognition inference processing. Note that the processor 13 can distinguish and execute the object detection processing and the object recognition inference processing by using different recognition models for each.
[0074] In the input video processing, the processor 13 acquires the training video that the user has input on the user terminal 2 and obtains the frame images that make up that training video. Likewise, the processor 13 acquires the verification video that the user has input on the user terminal 2 and obtains the frame images that make up that verification video.
[0075] In this embodiment, a video file is input to the user terminal 2 in response to user operations, and the video file is transmitted from the user terminal 2 to the model generation server 1. However, the video file may also be transmitted to the model generation server 1 from an external device (server, online storage, etc.). Alternatively, the video file may be input to the model generation server 1 offline via a suitable storage medium on which the video file is stored.
[0076] In the condition setting process, the processor 13 sets processing conditions related to object candidate extraction, object recognition annotation, object recognition inference, action candidate extraction, action recognition annotation, and action recognition inference, in response to user input operations on the user terminal 2.
[0077] In GUI processing, the processor 13 generates a screen to be displayed on the user terminal 2 and acquires input information corresponding to the user's operations on the screen. Specifically, the class definition mode screen 101 (see Figure 8), the object recognition mode screen 201 (see Figure 10), the action recognition mode screen 301 (see Figure 11), and the inference mode screen 401 (see Figure 12) are displayed on the user terminal 2.
[0078] In the class definition relay process, the processor 13 relays the communication regarding class definition between the user terminal 2 and the LLM server 3, and causes the language model to collect, through natural language dialogue with the user, the recognition target information necessary for obtaining class definition information. The class definition information generated by the LLM server 3 is presented to the user on the user terminal 2. When the user is satisfied with the class definition information and instructs completion of the class definition process, the class definition information is stored in the memory unit 12.
[0079] In the object detection process, processor 13 detects objects from the frame image using an object detection model generated by machine learning. During this process, object classification is performed, and the object class for the detected object is predicted. Furthermore, in the object detection process, a zero-shot learning object detection model is used for the first run, and a few-shot learning object detection model is used for subsequent runs. Other general-purpose object detection methods may also be used in the object detection process.
[0080] In the object candidate extraction process, the processor 13 extracts object images that are the target of the user's labeling operation as candidate images. Specifically, the processor 13 extracts object images (rectangular image regions surrounding objects) from the frame image based on the detection results of the object detection process, extracts feature quantities for each object image, then classifies the object images into multiple clusters (groups of object images) based on the feature quantities for each object image, and extracts representative images for each cluster as candidate images. At this time, the object images are classified into object classes, and the object candidate extraction process is performed for each object class. In addition, by classifying object images into clusters based on feature quantities, object images with similar features are grouped into one cluster.
[0081] In the object recognition annotation process, the processor 13 presents representative images for each cluster extracted in the object candidate extraction process to the user as candidate images. In response to the user's operation to select a candidate image corresponding to a specified object class, the processor 13 assigns the label of the specified object class to all object images included in the cluster to which the selected candidate image (representative image) belongs.
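The label assignment described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `assign_cluster_label` and the dictionary layout for clusters and representatives are assumptions made for the example.

```python
def assign_cluster_label(clusters, representatives, selected_image, label):
    """Give `label` to every image in the cluster whose representative was selected.

    clusters:        hypothetical mapping cluster_id -> list of image ids
    representatives: hypothetical mapping cluster_id -> representative image id
    """
    labels = {}
    for cluster_id, representative in representatives.items():
        if representative == selected_image:
            # One user selection labels the whole cluster at once.
            for image_id in clusters[cluster_id]:
                labels[image_id] = label
    return labels


clusters = {0: ["img_a", "img_b", "img_c"], 1: ["img_d", "img_e"]}
representatives = {0: "img_b", 1: "img_d"}
labels = assign_cluster_label(clusters, representatives, "img_b", "screwdriver")
```

Here a single selection of `img_b` labels three images, which is how the cluster-based approach reduces the number of labeling operations.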
[0082] In the object recognition model training process, the processor 13 performs machine learning using the labeled object images obtained in the object recognition annotation process to generate an object recognition model.
[0083] In the object recognition inference process, the processor 13 performs inference using the object recognition model to derive object recognition results from the verification video. The inference results from the object recognition inference process are presented to the user, who can then verify the accuracy of the object recognition model based on the inference results.
[0084] In the action candidate extraction process, the processor 13 extracts action images that are the target of the user's labeling operation as candidate images. Specifically, the processor 13 extracts valid combinations of a person (subject of the action) and an object (object of the action) corresponding to a given action based on the class definition information, and then extracts action images that include the image regions of the person and object that make up each valid combination. The processor 13 then extracts feature quantities for each action image, classifies the action images into multiple clusters (groups of action images) based on those feature quantities, and extracts representative images for each cluster as candidate images. At this time, the action images are classified into action classes, and the action candidate extraction process is performed for each action class.
[0085] In the action recognition annotation process, the processor 13 presents representative images for each cluster extracted in the action candidate extraction process to the user as candidate images. In response to the user's operation to select a candidate image corresponding to a specified action class, the processor 13 assigns the label of the specified action class to all action images included in the cluster to which the selected candidate image (representative image) belongs.
[0086] In the action recognition model learning process, the processor 13 generates an action recognition model by performing machine learning using the labeled action images obtained in the action recognition annotation process.
[0087] In the action recognition inference process, the processor 13 performs inference to derive action recognition results from the verification video using the action recognition model. The inference results from the action recognition inference process are presented to the user, and the user can verify the accuracy of the action recognition model based on the inference results.
[0088] Next, we will explain the processes related to class definition performed on the model generation server 1, user terminal 2, and LLM server 3. Figure 3 is a sequence diagram showing an overview of the processes related to class definition. In the example shown in Figure 3, the action of tightening screws on an engine using a screwdriver in a factory is the subject of class definition.
[0089] In this embodiment, the model generation server 1, the user terminal 2, and the LLM server 3 work together to generate class definition information. The class definition information is information regarding the correspondence between object classes and action classes, that is, information regarding valid combinations of action classes and the object classes they target.
[0090] Model generation server 1 performs class definition relay processing. In class definition relay processing, communication regarding class definitions between user terminal 2 and LLM server 3 is relayed.
[0091] First, the learning video (frame images) acquired from the user terminal 2 and stored in the memory unit 12 is sent to the LLM server 3.
[0092] Furthermore, user terminal 2 processes input of instruction text and learning videos in response to user operations. At this time, the user performs an operation to input instruction text that includes a sentence describing the content of the action to be recognized (recognition target information). The input instruction text is sent to LLM server 3 via model generation server 1.
[0093] Next, when the LLM server 3 receives training videos from the model generation server 1 and instruction text from the user terminal 2 via the model generation server 1, it performs dialogue processing and class definition processing.
[0094] In the dialogue process, natural language response text (response information) is generated in response to natural language instruction text (instruction information) entered by the user. Through repeated natural language dialogue with the user, the information necessary for class definition is collected. If any necessary information for class definition is missing, response text is generated asking for the missing information.
[0095] In the class definition process, the content of the action to be recognized, as instructed by the user in the instruction text, is interpreted, and the information necessary for class definition is extracted from the training video, and the class definition information is output. At this time, the training video and instruction text are input to the multimodal LLM, and the response text and class definition information are output from the multimodal LLM. The response text and class definition information are sent to the user terminal 2 via the model generation server 1.
[0096] Next, the user terminal 2 processes the output of the response text and class definition information provided by the LLM server 3. During the output process, the response text and class definition information are displayed on the screen and presented to the user. At this point, the user determines whether the presented class definition information is at a satisfactory level.
[0097] The above procedure is repeated until class definition information that satisfies the user is obtained and the user instructs the user terminal 2 to complete the class definition process. When the user gives this instruction, the class definition information is stored in the memory unit 12 of the model generation server 1.
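The repeated exchange described above can be pictured as a simple relay loop. The following is only a sketch under stated assumptions: `run_class_definition_dialogue`, the `llm` callable, and `fake_llm` are hypothetical stand-ins, and no real LLM API is called.

```python
def run_class_definition_dialogue(llm, user_turns, frames):
    """Relay instruction text to the language model until the user is satisfied.

    llm:        hypothetical callable (frames, history) -> (response_text, class_definition)
    user_turns: iterable of (instruction_text, satisfied_after_reply) pairs
    """
    history = []
    for instruction, satisfied in user_turns:
        history.append(("user", instruction))
        response_text, class_definition = llm(frames, history)
        history.append(("llm", response_text))
        if satisfied:
            # The user instructed completion: this result would be stored.
            return class_definition, history
    return None, history  # the user never confirmed a definition


def fake_llm(frames, history):
    """Stand-in model: asks for missing information on the first turn."""
    user_messages = [text for role, text in history if role == "user"]
    if len(user_messages) < 2:
        return "Which tool does the worker use?", {}
    return "Here is the class definition.", {"screw tightening": ["screwdriver", "engine"]}


stored, log = run_class_definition_dialogue(
    fake_llm,
    [("Recognize screw tightening on an engine.", False),
     ("The worker uses a screwdriver.", True)],
    frames=["frame0", "frame1"],
)
```

The first reply asks for the missing information, and the second turn yields a definition that the user accepts.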
[0098] Furthermore, the LLM server 3 is instructed in advance (prompted) to perform class definition processing in response to a class definition request from the model generation server 1. In other words, when training videos or instruction texts are sent to the LLM server 3, it is instructed in advance to perform class definition processing based on those training videos and instruction texts. The LLM server 3 is also instructed in advance to specify the output format of the class definition information. For example, the LLM server 3 is instructed to output in tabular format (see Figures 4A and 4B).
[0099] In this embodiment, information necessary for class definition is collected through natural language interaction between the user and the LLM server 3, and appropriate class definition information is obtained. This appropriately supports the user's class definition work, and even novice users can obtain appropriate class definition information.
[0100] In this embodiment, class definition processing is performed on the LLM server 3 based on the interaction between the user and the LLM server 3, and on the video provided by the user. However, various processes can be performed by utilizing the functions of the LLM server 3. For example, the LLM server 3 may perform processing to extract problematic actions that become bottlenecks in the work, based on the interaction between the user and the LLM server 3, and on the video provided by the user.
[0101] Next, we will explain class definition information. Figures 4A and 4B are explanatory diagrams showing examples of class definition information. In Figures 4A and 4B, possible combinations of a person's actions and the objects they interact with are represented by the symbol "○".
[0102] As shown in Figures 4A and 4B, the class definition information presents the user with a table showing the correspondence between object classes and action classes, that is, valid combinations of actions and the objects they target. Note that the relationship between object classes and action classes in the class definition information stored in the memory unit 12 may be represented using binary values, 0 and 1.
[0103] The example shown in Figure 4A is for work in the engine assembly process. In this case, screw tightening and touch are set as the action classes to be recognized. In addition, person, screwdriver, engine, and tablet terminal are set as the object classes for the objects on which the actions are performed. In the screw tightening action class, since a screwdriver is used to tighten screws on the engine, the corresponding object classes are screwdriver and engine. In the touch action class, since the worker performs a touch operation on the tablet terminal, the corresponding object class is tablet terminal.
[0104] The example shown in Figure 4B is for work at a truck berth. In this case, loading, unloading, moving, and talking are set as the action classes to be recognized. In addition, person, truck, cart (cage cart), and cardboard box (luggage) are set as the object classes for the objects that are the targets of the actions. In the loading action class, since loading is performed on a truck, the corresponding object class is truck. In the unloading action class, since unloading is performed onto a cart, the corresponding object class is cart. In the moving action class, since carts and cardboard boxes are moved, the corresponding object classes are cart and cardboard box. In the talking action class, since the conversation partner is a person, the corresponding object class is person.
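As an illustration, the truck berth example can be encoded as a mapping from action classes to target object classes and converted into the binary (0/1) representation mentioned earlier. The data layout and the names `CLASS_DEFINITION`, `to_binary_table`, and `is_valid` are assumptions for this sketch; the disclosure does not prescribe a storage format beyond binary values.

```python
# Column order of the table (object classes from the truck berth example).
OBJECT_CLASSES = ["person", "truck", "cart", "cardboard box"]

# Action class -> object classes it validly targets (the marked cells of the table).
CLASS_DEFINITION = {
    "loading": {"truck"},
    "unloading": {"cart"},
    "moving": {"cart", "cardboard box"},
    "talking": {"person"},
}


def to_binary_table(definition, object_classes):
    """Return {action: [0 or 1 per object class]} in the given column order."""
    return {
        action: [1 if obj in targets else 0 for obj in object_classes]
        for action, targets in definition.items()
    }


def is_valid(definition, action, obj):
    """True when `obj` is a valid target of `action` under the class definition."""
    return obj in definition.get(action, set())


table = to_binary_table(CLASS_DEFINITION, OBJECT_CLASSES)
```

With this encoding, `table["moving"]` has a 1 in the cart and cardboard box columns and 0 elsewhere.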
[0105] Next, we will explain the object image extraction process and object feature extraction process performed on the model generation server 1. Figure 5 is an explanatory diagram showing an overview of the object image extraction process and object feature extraction process. Figure 5 is an example of work at the truck berth.
[0106] In the model generation server 1, the object candidate extraction process, that is, the process of extracting object images to be presented to the user as candidate object images (candidate images) to be used for training to generate an object recognition model, involves both an object image extraction process and an object feature extraction process.
[0107] In the object image extraction process, rectangular image regions containing objects are extracted from the frame image as object images based on the detection results of the object detection process. In the example shown in Figure 5, the image regions of the person, cart, truck, and cardboard box are extracted as object images. Note that the object image extraction process may also utilize the object detection results obtained by the LLM server 3 during the class definition process.
[0108] In the object feature extraction process, feature quantities are extracted from object images using a feature extraction model. Feature extraction models such as convolutional neural networks (CNNs) are used in this process. The features extracted during the object feature extraction process are then referenced during the clustering process of object images by object class.
[0109] In object detection processing, objects are detected from frame images using an object detection model. Initially, a zero-shot learning object detection model is used, while subsequent attempts use an object detection model tuned with few-shot learning. A zero-shot learning object detection model can perform object detection even without explicitly training the target class beforehand. Note that any model that can perform object detection without pre-specifying the target object can be used instead of a zero-shot learning model. In few-shot learning tuning, the model is updated with a small amount of training data, based on either the zero-shot learning object detection model or the previously tuned model. For example, few-shot learning is performed using object recognition annotation information obtained in the object recognition annotation process, i.e., information about the object class attached to the object image. This gradually reduces false positives in the object detection process, improving the accuracy of the object detection process, and further improving the accuracy of the object image extraction process performed based on the detection results of the object detection process.
[0110] Next, we will explain the total combination extraction process, valid combination extraction process, action image extraction process, and action feature extraction process performed on the model generation server 1. Figure 6 is an explanatory diagram showing an overview of the total combination extraction process, valid combination extraction process, action image extraction process, and action feature extraction process. The example shown in Figure 6 is for work at a truck berth.
[0111] In the model generation server 1, the action candidate extraction process, that is, the process of extracting action images to be presented to the user as candidates (candidate images) for action images (images containing scenes of human action) used for training to generate an action recognition model, consists of the following processes: total combination extraction process, valid combination extraction process, action image extraction process, and action feature extraction process.
[0112] In the total combination extraction process, all combinations of people (the subject of the action) and objects (the object of the action) are extracted based on the object recognition annotation information obtained in the object recognition annotation process (see Figures 16A and 16B), that is, information on which object class labels (person, screwdriver, etc.) are assigned to object images extracted from frame images.
[0113] In the valid combination extraction process, valid combinations of a person (subject of the action) and an object (object of the action) that match the class definition are extracted from all combinations of a person (subject of the action) and an object (object of the action), based on class definition information (see Figures 4A and 4B), that is, information regarding the correspondence between action classes and the object classes that are the targets of the action.
[0114] Here, we will explain an example of a class definition for operations at a truck berth (see Figure 4B). For example, if the object is a truck, it corresponds to a loading action, but if the object is a cart, it does not correspond to a loading action. On the other hand, if the object is a truck, it does not correspond to an unloading action, but if the object is a cart, it corresponds to an unloading action.
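The two extraction steps can be sketched as follows for the truck berth example. The function names, the `(object_id, object_class)` tuple layout, and the rule that a person is never paired with itself are assumptions for illustration only.

```python
# Class definition for the truck berth example (action -> valid target object classes).
CLASS_DEFINITION = {
    "loading": {"truck"},
    "unloading": {"cart"},
    "moving": {"cart", "cardboard box"},
    "talking": {"person"},
}


def total_combinations(detections):
    """Pair every person (subject) with every other labeled object in the frame."""
    people = [d for d in detections if d[1] == "person"]
    return [(p, o) for p in people for o in detections if o[0] != p[0]]


def valid_combinations(pairs, class_definition):
    """Keep only pairs whose object class is a valid target of some action class."""
    result = []
    for person, obj in pairs:
        for action, targets in class_definition.items():
            if obj[1] in targets:
                result.append((action, person[0], obj[0]))
    return result


# Hypothetical labeled objects in one frame: (object_id, object_class).
detections = [(0, "person"), (1, "truck"), (2, "cart"), (3, "cardboard box")]
pairs = total_combinations(detections)
valid = valid_combinations(pairs, CLASS_DEFINITION)
```

In this frame the person-truck pair is kept only for loading, while the person-cart pair is kept for both unloading and moving.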
[0115] In the action image extraction process, a rectangular image region containing both a person (the subject of the action) and an object (the object of the action) that can be combined in accordance with a given action is extracted from the frame image as an action image (an image containing a scene of a person's action). In the example shown in Figure 6, a person is moving a cart and cardboard boxes. In this case, a rectangular image region containing the person and the cardboard boxes that can be combined in accordance with the action of "movement" is extracted as an action image. Also, a rectangular image region containing the person and the cart that can be combined in accordance with the action of "movement" is extracted as an action image. Furthermore, in the action image extraction process, an action class is set for the extracted action images.
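One simple way to obtain a rectangle containing both the person and the object is the smallest axis-aligned box enclosing the two detection boxes. This is an assumption for illustration; the disclosure only requires that the region contain both. The `(x1, y1, x2, y2)` box format and the name `union_box` are likewise hypothetical.

```python
def union_box(person_box, object_box):
    """Smallest axis-aligned rectangle enclosing both boxes, as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = person_box
    ox1, oy1, ox2, oy2 = object_box
    return (min(px1, ox1), min(py1, oy1), max(px2, ox2), max(py2, oy2))


# Hypothetical pixel coordinates for a person and the cart being moved.
person = (10, 20, 50, 120)
cart = (40, 80, 90, 130)
action_region = union_box(person, cart)
```

The resulting region would then be cropped from the frame image and tagged with the action class, here "moving".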
[0116] In the action feature extraction process, feature quantities are extracted from action images using a feature extraction model. For example, a convolutional neural network (CNN) is used in the action feature extraction process. The features extracted in the action feature extraction process are then referenced during the clustering process of action images by action class.
[0117] Next, we will explain the clustering process and representative image extraction process performed on the model generation server 1. Figure 7 is an explanatory diagram showing an overview of the clustering process and representative image extraction process. In the example shown in Figure 7, the feature space is represented in two dimensions. Also, the feature quantities of the images (object images, action images) extracted from the frame images are represented by rectangles.
[0118] Model generation server 1 performs clustering processing for object classes and clustering processing for action classes, respectively. In object class clustering processing, for each object class, each object image is classified into a cluster based on the features obtained for that object image in the object feature extraction processing. In action class clustering processing, for each action class, each action image is classified into a cluster based on the features obtained for that action image in the action feature extraction processing.
[0119] Furthermore, clustering is performed separately for each class (object class, action class). Specifically, in clustering for object classes, clusters (groups of object images) are created for each object class (e.g., screwdriver, engine, etc.). In clustering for action classes, clusters are created for each action class (e.g., screw tightening, touch operation, etc.).
[0120] Furthermore, the clustering process generates a number of clusters corresponding to the number of candidate images specified by the user. In other words, object images and action images are classified into a number of clusters corresponding to the number of candidate images specified by the user.
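The disclosure does not name a clustering algorithm. As one possible illustration, the sketch below uses a plain k-means loop in which the number of clusters `k` plays the role of the user-specified number of candidate images. The deterministic initialization from the first `k` points and the function name `kmeans` are assumptions chosen to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Very small k-means: returns (cluster index per point, list of centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic initialization
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each feature vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return assignment, centroids


# Hypothetical 2-D feature quantities for six images of one class.
features = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignment, centroids = kmeans(features, k=2)
```

Images with similar feature quantities end up sharing a cluster index, so each cluster can later be labeled in one operation.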
[0121] Next, the model generation server 1 performs representative image extraction processing. In this representative image extraction processing, a representative image for each cluster is extracted from the images (object images, action images) contained in the cluster based on predetermined extraction conditions. In addition, one representative image is extracted from each cluster during the representative image extraction processing. In the example shown in Figure 7, the number of candidate images (number of clusters) is set to 6, so 6 clusters are created and the 6 representative images, one for each cluster, are presented to the user as candidate images. The number of candidate images (number of clusters) can be specified by the user. Furthermore, the number of candidate images (number of clusters) can be set individually for each class.
[0122] In this embodiment, images (object images, action images) included in a cluster are assigned a rank based on their position (distance) from the cluster's centroid, and the user can specify this rank as an extraction criterion when extracting a representative image. Specifically, the images are assigned a rank (centroid distance rank) in order of increasing distance from the cluster's centroid. For example, if the user specifies a rank of 1, the image closest to the centroid is extracted as the representative image. If the user specifies a rank of 2, the image second closest to the centroid is extracted as the representative image. The cluster's centroid is the average of the feature quantities of the images included in the cluster, so the image closest to the centroid has the most typical features within the cluster.
[0123] The extraction criteria for extracting representative images can be set individually for each class. Specifically, in clustering processes related to object classes, extraction criteria are set for each object class (e.g., screwdriver, engine, etc.). Similarly, in clustering processes related to action classes, extraction criteria are set for each action class (e.g., screw tightening, touch operation, etc.).
[0124] In this embodiment, representative images are extracted from the images (object images, action images) included in each cluster based on the centroid distance ranking. However, the extraction criteria for extracting representative images are not limited to the centroid distance ranking. For example, representative images may be randomly selected from each cluster.
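The centroid distance rank can be sketched as below. Euclidean distance, the function names, and the dictionary of feature vectors are assumptions for the example; the disclosure specifies only that images are ranked by increasing distance from the centroid.

```python
import math


def centroid(features):
    """Mean of the feature vectors, computed per dimension."""
    count = len(features)
    return [sum(values) / count for values in zip(*features)]


def representative_by_rank(cluster, rank=1):
    """Return the image id at the given centroid distance rank (1 = closest).

    cluster: hypothetical mapping image_id -> feature vector
    """
    center = centroid(list(cluster.values()))
    ordered = sorted(cluster, key=lambda image_id: math.dist(cluster[image_id], center))
    return ordered[rank - 1]


# Hypothetical cluster: the centroid lies at (5/3, 0), nearest to image "b".
cluster = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [4.0, 0.0]}
first = representative_by_rank(cluster, rank=1)
second = representative_by_rank(cluster, rank=2)
```

Specifying rank 1 picks the most typical image, while a higher rank lets the user look at a less central member of the same cluster.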
[0125] In this embodiment, one representative image is extracted from each cluster, but multiple representative images (for example, two) may be extracted from a single cluster.
[0126] Next, we will describe the class definition mode screen 101 displayed on the user terminal 2. Figure 8 is an explanatory diagram showing the class definition mode screen 101.
[0127] The Class Definition mode screen 101 is provided with tabs 102 (mode selection section) for "Class Definition," "Object Recognition," "Action Recognition," and "Inference." The Class Definition mode screen 101 is displayed first, and also appears when the user operates the "Class Definition" tab 102 on other screens 201, 301, and 401 (see Figures 10, 11, and 12). When the user operates the "Object Recognition" tab 102, the screen transitions to the Object Recognition mode screen 201 (see Figure 10). When the user operates the "Action Recognition" tab 102, the screen transitions to the Action Recognition mode screen 301 (see Figure 11). When the user operates the "Inference" tab 102, the screen transitions to the Inference mode screen 401 (see Figure 12).
[0128] Furthermore, the class definition mode screen 101 is provided with a video input unit 103. The video input unit 103 allows the user to input learning videos of an operator's work. Specifically, learning videos can be input by, for example, moving the icon of a video file (a file containing the learning video) onto the video input unit 103 using a mouse operation (drag and drop operation), or by selecting a video file on a file selection screen (not shown) that appears when the mouse is right-clicked on the video input unit 103 using a mouse operation (click operation).
[0129] The input training videos are uploaded from the user terminal 2 to the model generation server 1, and then transferred to the LLM server 3 for use in class definition processing. Furthermore, the training videos are used in machine learning for generating recognition models (object recognition models, action recognition models) performed on the model generation server 1.
[0130] Furthermore, in this embodiment, a video file (a file containing learning videos) is input, but files containing various types of data other than videos may also be input. For example, files containing data such as still images, audio, and text may be input. These types of data other than videos can also be used in the class definition processing in the LLM server 3.
[0131] Furthermore, the class definition mode screen 101 is provided with a dialogue unit 104. In the dialogue unit 104, dialogue takes place between the user and the LLM server 3, and through this dialogue, information necessary for class definition is collected from the user. In the example shown in Figure 8, the action of tightening screws on an engine using a screwdriver in a factory is the subject of class definition.
[0132] The dialogue unit 104 displays message boxes 111 and 112, an input box 113, and a "Send" button 114. The user can enter instruction text (prompt) in the input box 113. When the user enters instruction text and operates the "Send" button 114, the entered instruction text is sent from the user terminal 2 to the LLM server 3 via the model generation server 1.
[0133] Message box 111 displays the instruction text previously entered by the user. Message box 112 displays the response text generated by the LLM server 3 in response to the instruction text. Message boxes 111 and 112 scroll as the dialogue progresses, with the instruction text and response text being exchanged. This dialogue can be repeated until the user determines that an appropriate class definition has been obtained.
[0134] Furthermore, the class definition mode screen 101 is provided with a definition result output unit 105. The definition result output unit 105 displays the class definition information generated by the LLM server 3. In the example shown in Figure 8, the correspondence between action classes and object classes is displayed as class definition information in a table format, with action classes in the rows and object classes in the columns. In addition, prompts specifying the output conditions for the definition result output unit 105 are input to the LLM server 3.
[0135] Furthermore, the class definition mode screen 101 is provided with a "Class Definition Complete" button 106. When the user confirms that the class definition information displayed on the definition result output unit 105 is correct, they operate the "Class Definition Complete" button 106. As a result, the model generation server 1 confirms the class definition information and stores it in the memory unit 12.
[0136] Next, we will describe the class definition mode screen 101 displayed on the user terminal 2 when selecting a model. Figure 9 is an explanatory diagram showing the class definition mode screen 101 when selecting a model.
[0137] The class definition mode screen 101 is provided with a model selection unit 107. The model selection unit 107 is also provided on the object recognition mode screen 201 (see Figure 10), the action recognition mode screen 301 (see Figure 11), and the inference mode screen 401 (see Figure 12).
[0138] When the user operates the model selection unit 107, a drop-down list 121 is displayed. In the drop-down list 121, the user can select a recognition model (object recognition model, action recognition model) that has been created in the past. In the example shown in Figure 9, the serial number and creation date and time are displayed for each recognition model as identification information. The drop-down list 121 also has a "Create New" item, and the user can create a new recognition model by selecting the "Create New" item.
[0139] When the user selects a recognition model (object recognition model, action recognition model) from the dropdown list 121 of the model selection unit 107, the selected recognition model, the training video used to generate the selected recognition model, and the class definition information created when generating the selected recognition model are read out.
[0140] This allows for the additional training and retraining of previously created recognition models (object recognition models, action recognition models). Furthermore, previously created recognition models can be used in class definitions. Inference processing (object recognition inference processing, action recognition inference processing) can also be performed using previously created recognition models. Therefore, the accuracy of previously created recognition models can be improved when targeting the same work site as the current one. Additionally, when repurposing a previously created recognition model for a different work site (work process) to the current work site (work process), the accuracy of the recognition model can be improved, making the repurposing of recognition models easier.
[0141] Next, we will explain the object recognition mode screen 201 (annotation screen) displayed on the user terminal 2. Figure 10 is an explanatory diagram showing the object recognition mode screen 201. On the object recognition mode screen 201, the user performs operations related to annotation and training for generating the object recognition model.
[0142] The object recognition mode screen 201 is provided with a learning count display unit 202. The learning count display unit 202 displays the number of learning sessions, that is, the number of times learning (including additional learning and retraining) has been performed in the past for the creation and updating of the object recognition model.
[0143] Furthermore, the object recognition mode screen 201 is equipped with an "Object Candidate Extraction" button 203. When the user operates the "Object Candidate Extraction" button 203, a process (object candidate extraction process) is executed to extract object images as candidate images for all object classes (people, drivers, etc.) that are the target of the user's labeling operation. In the case of additional training and retraining, when the user operates the "Object Candidate Extraction" button 203, the object candidate extraction process is executed based on the updated class definition information and training videos.
[0144] Furthermore, the object recognition mode screen 201 is provided with a class designation unit 204. When the user operates the class designation unit 204, a dropdown list (not shown) is displayed. The user can select an object class (person, driver, etc.) from the dropdown list.
[0145] Furthermore, the object recognition mode screen 201 is provided with an image number specification unit 205. The image number specification unit 205 consists of a slider bar. By moving the slider, the user can increase or decrease the number of candidate images displayed in the candidate image display unit 207. In the example shown in Figure 10, the number of candidate images is set to 10, so 10 clusters are generated and one representative image per cluster, 10 in total, is displayed as a candidate image. Note that if one representative image is selected as a candidate image for each cluster, the number of candidate images matches the number of clusters generated in the clustering process. If multiple (e.g., two) representative images are extracted from each cluster, the number of clusters generated is the number of candidate images divided by the number of representative images extracted from one cluster. In this case, if the candidate images cannot be divided evenly among the clusters, a larger number of candidate images may be extracted from a particular cluster.
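The relationship between the number of candidate images and the number of clusters described above can be sketched as follows. This is only an illustration: the function name `plan_clusters`, the rounding rule, and the choice to give any remainder to the first clusters are assumptions, not details stated in this disclosure.

```python
def plan_clusters(num_candidates: int, reps_per_cluster: int) -> list[int]:
    """Return how many representative images to take from each cluster.

    Assumptions (not from the disclosure): the cluster count is the candidate
    count divided by the representatives per cluster, rounded down (at least
    one cluster), and any remainder is handed to the first clusters one image
    at a time.
    """
    if num_candidates < 1 or reps_per_cluster < 1:
        raise ValueError("both arguments must be positive")
    num_clusters = max(1, num_candidates // reps_per_cluster)
    base, extra = divmod(num_candidates, num_clusters)
    # The first `extra` clusters contribute one additional candidate image.
    return [base + 1 if i < extra else base for i in range(num_clusters)]


# 10 candidates, one representative per cluster: 10 clusters of 1 image each.
print(plan_clusters(10, 1))
# 11 candidates, two per cluster: 5 clusters, one of which contributes 3 images.
print(plan_clusters(11, 2))
```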
[0146] In the clustering process, the object images may be sorted into clusters with optimal characteristics based on the distribution of features in each object image, and the optimal number of candidate images corresponding to the number of clusters may be presented to the user. Alternatively, the image number specification unit 205 may present the optimal number of candidate images (number of clusters) to the user as an initial value.
[0147] Furthermore, the object recognition mode screen 201 is provided with an extraction condition specification unit 206. When the user operates the extraction condition specification unit 206, a drop-down list (not shown) is displayed. In the drop-down list, the user can specify the extraction conditions for extracting a representative image from the cluster. Specifically, the user can specify the centroid distance rank (a rank assigned in order of proximity to the centroid) as the extraction condition. In the example shown in Figure 10, since the centroid distance rank specified by the user is 1, the object image closest to the centroid is extracted as the representative image (candidate image).
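As a minimal sketch of extraction by centroid distance rank, the following assumes that feature quantities are plain numeric vectors and that distance is Euclidean; the disclosure does not specify the distance measure, and the function names are hypothetical.

```python
import math


def centroid(features):
    """Mean of the feature vectors, computed per dimension."""
    return [sum(column) / len(features) for column in zip(*features)]


def representative_by_rank(features, rank=1):
    """Return the index of the image with the given centroid distance rank.

    Rank 1 is the image closest to the cluster centroid. Euclidean distance
    is an assumption made for this sketch.
    """
    if not 1 <= rank <= len(features):
        raise ValueError("rank must be between 1 and the cluster size")
    center = centroid(features)
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], center))
    return order[rank - 1]


# Hypothetical one-cluster example with two-dimensional feature vectors.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(representative_by_rank(cluster, rank=1))
```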
[0148] Furthermore, the object recognition mode screen 201 is provided with a candidate image display unit 207. In the candidate image display unit 207, object images (thumbnails) that are assumed to belong to the object class specified in the class designation unit 204 are displayed in a row as candidate images. The candidate images are representative images for each cluster classified based on the feature quantities of the object images. In addition, the candidate images are extracted based on the object class provisionally set for the object images.
[0149] The candidate image display unit 207 is provided with a checkbox 208 (correct / incorrect input unit) for each candidate image. The user visually inspects each candidate image and determines whether it is appropriate as an object image corresponding to the object class specified in the class designation unit 204. The user then performs an operation (labeling operation) to check the checkbox 208 of each candidate image determined to be appropriate as an object image corresponding to the specified object class. In the example shown in Figure 10, the object class of "person" is specified in the class designation unit 204, and the user checks the checkbox 208 of each candidate image containing a person in the candidate image display unit 207.
[0150] At this time, the model generation server 1 targets the clusters to which the candidate images checked by the user in checkbox 208 belong (three checkboxes are checked in this example), that is, the clusters whose representative images correspond to the specified object class, and performs a process (labeling process) to assign the label of the specified object class to all object images included in those clusters.
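The labeling process described above, in which checking one representative image labels every image in its cluster, might look like the following sketch. The data layout (a mapping from image ID to cluster ID) and all names are assumptions made for illustration.

```python
def propagate_labels(cluster_of, checked_candidates, class_name, labels=None):
    """Assign class_name to every image in the clusters of the checked images.

    cluster_of: dict mapping image ID -> cluster ID (hypothetical layout).
    checked_candidates: IDs of the candidate images whose checkbox is checked.
    labels: existing labels to extend; a new dict is returned either way.
    """
    labels = dict(labels or {})
    selected_clusters = {cluster_of[image_id] for image_id in checked_candidates}
    for image_id, cluster_id in cluster_of.items():
        if cluster_id in selected_clusters:
            labels[image_id] = class_name
    return labels


# Hypothetical example: images a and b share cluster 0; a is the checked candidate.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 2}
print(propagate_labels(cluster_of, {"a"}, "person"))
```

Images in unchecked clusters receive no label, which corresponds to excluding them from the training images for that class.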
[0151] In the candidate image display unit 207, object images that are assumed to belong to the object class specified in the class designation unit 204 are displayed as candidate images based on the provisional object class set for each object image. However, since there may be errors in the provisional object class set for each object image, the candidate image display unit 207 also displays inappropriate candidate images that do not belong to the object class specified in the class designation unit 204. Therefore, by visually inspecting each candidate image and checking only the checkbox 208 corresponding to the candidate image that is judged to be appropriate as an object image belonging to the object class specified in the class designation unit 204, inappropriate object images that do not belong to the specified object class can be excluded from the object images for training.
[0152] Furthermore, the operation of checking the checkbox 208 of each candidate image that corresponds to the object class specified in the class designation unit 204 (labeling operation) is performed for each object class. In other words, the user switches the object class of the candidate images displayed in the candidate image display unit 207 by selecting a different object class from the drop-down list of the class designation unit 204, and by repeating the labeling operation for each object class, the labeling operation for all necessary object classes is completed.
[0153] Furthermore, the object recognition mode screen 201 is provided with an "Object Recognition Learning" button 209. When the user operates the "Object Recognition Learning" button 209, the model generation server 1 executes a process (object recognition model learning process) to generate an object recognition model using training object images.
[0154] Next, we will explain the behavior recognition mode screen 301 (annotation screen) displayed on the user terminal 2. Figure 11 is an explanatory diagram showing the behavior recognition mode screen 301. On the behavior recognition mode screen 301, the user performs operations related to annotation and training for generating the behavior recognition model.
[0155] The screen 301 in the behavior recognition mode is provided with a learning count display unit 302. The learning count display unit 302 displays the number of learning sessions, that is, the number of times learning (including additional learning and retraining) has been performed in the past for the creation and updating of the behavior recognition model.
[0156] Furthermore, the screen 301 in action recognition mode is equipped with an "Action Candidate Extraction" button 303. When the user operates the "Action Candidate Extraction" button 303, a process (action candidate extraction process) is executed to extract action images as candidate images for all action classes (screw tightening, touch, etc.) that are the target of the user's labeling operation. In the case of additional training and retraining, when the user operates the "Action Candidate Extraction" button 303, the action candidate extraction process based on the updated class definition information and training videos is executed.
[0157] Furthermore, the screen 301 in the action recognition mode is provided with a class designation unit 304. When the user operates the class designation unit 304, a drop-down list (not shown) is displayed. The user can select an action class (such as screw tightening or touching) from the drop-down list.
[0158] Furthermore, the screen 301 in the action recognition mode is provided with an image number specification unit 305. The image number specification unit 305 is composed of a slider bar. By moving the slider, the user can increase or decrease the number of candidate images displayed in the candidate image display unit 307. In the example shown in Figure 11, the number of candidate images is set to 10, so 10 clusters are generated and one representative image per cluster, 10 in total, is displayed as a candidate image. Note that if one representative image is selected as a candidate image for each cluster, the number of candidate images matches the number of clusters generated in the clustering process. If multiple (for example, two) representative images are extracted from each cluster, the number of clusters generated is the number of candidate images divided by the number of representative images extracted from one cluster. In this case, if the candidate images cannot be divided evenly among the clusters, a larger number of candidate images may be extracted from a particular cluster.
[0159] In the clustering process, the behavioral images may be sorted into optimal clusters based on the distribution of features in each behavioral image, and the optimal number of candidate images corresponding to the number of clusters may be presented to the user. Alternatively, the image number specification unit 305 may present the optimal number of candidate images (number of clusters) to the user as an initial value.
[0160] Furthermore, the screen 301 in the behavior recognition mode is provided with an extraction condition specification unit 306. When the user operates the extraction condition specification unit 306, a drop-down list (not shown) is displayed. In the drop-down list, the user can specify the extraction conditions for extracting a representative image from the cluster. Specifically, the user can specify the centroid distance rank (a rank assigned in order of proximity to the centroid) as the extraction condition. In the example shown in Figure 11, since the centroid distance rank specified by the user is 1, the behavior image closest to the centroid is extracted as the representative image (candidate image).
[0161] Furthermore, the screen 301 in the behavior recognition mode is provided with a candidate image display unit 307. In the candidate image display unit 307, behavior images (thumbnails) that are assumed to belong to the behavior class specified in the class designation unit 304 are displayed in a row as candidate images. The candidate images are representative images for each cluster classified based on the feature quantities of the behavior images. In addition, the candidate images are extracted based on the behavior class provisionally set for the behavior images.
[0162] The candidate image display unit 307 is provided with a checkbox 308 (correct / incorrect input unit) for each candidate image. The user visually inspects each candidate image and determines whether each candidate image is appropriate as an action image corresponding to the action class specified in the class designation unit 304. The user then performs an operation (labeling operation) to check the checkbox 308 corresponding to the candidate image that is determined to be appropriate as an action image corresponding to the specified action class. In the example shown in Figure 11, the action class for touch is specified in the class designation unit 304, and the user performs an operation to check the checkbox 308 corresponding to the candidate image that includes a touch action scene in the candidate image display unit 307.
[0163] At this time, the model generation server 1 targets the clusters to which the candidate images checked by the user in checkbox 308 belong (three checkboxes are checked in this example), that is, the clusters whose representative images correspond to the specified behavior class, and performs a process (labeling process) to assign the label of the specified behavior class to all behavior images included in those clusters.
[0164] In the candidate image display unit 307, based on the provisional action class set for each action image, action images that are assumed to belong to the action class specified by the class designation unit 304 are displayed as candidate images. However, since there may be errors in the provisional action class set for each action image, the candidate image display unit 307 also displays inappropriate candidate images that do not belong to the action class specified by the class designation unit 304. Therefore, by visually inspecting each candidate image and checking only the checkbox 308 corresponding to the candidate image that is judged to be appropriate as an action image belonging to the action class specified by the class designation unit 304, inappropriate action images that do not belong to the specified action class can be excluded from the action images for training.
[0165] Furthermore, the operation of checking the checkbox 308 of each candidate image that corresponds to the behavior class specified in the class designation unit 304 (labeling operation) is performed for each behavior class. In other words, the user switches the behavior class of the candidate images displayed in the candidate image display unit 307 by selecting a different behavior class from the drop-down list of the class designation unit 304, and by repeating the labeling operation for each behavior class, the labeling operation for all necessary behavior classes is completed.
[0166] Furthermore, the screen 301 in the behavior recognition mode is provided with a "Behavior Recognition Learning" button 309. When the user operates the "Behavior Recognition Learning" button 309, the model generation server 1 executes a process (behavior recognition model learning process) to generate a behavior recognition model using training behavior images.
[0167] Next, we will describe the inference mode screen 401 displayed on the user terminal 2. Figure 12 is an explanatory diagram showing the inference mode screen 401.
[0168] The inference mode screen 401 is provided with a threshold setting unit 402 for object recognition and a threshold setting unit 403 for action recognition. The threshold setting units 402 and 403 are composed of sliders. The thresholds can be adjusted by the user by moving the sliders in the threshold setting units 402 and 403. The thresholds are compared with scores generated by the recognition models (object recognition model, action recognition model) to determine whether or not a predetermined event (object, action) can be recognized.
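The comparison between the threshold and the scores generated by a recognition model can be illustrated as follows. Whether the comparison is strict is not stated in this disclosure, so the use of `>=` here is an assumption, as are the function name and the example scores.

```python
def recognized_events(scores, threshold):
    """Return the class names whose score reaches the threshold.

    Assumption: an event counts as recognized when score >= threshold;
    the disclosure does not say whether the comparison is strict.
    """
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical scores for one frame of a verification video.
object_scores = {"person": 0.92, "driver": 0.41}
print(recognized_events(object_scores, threshold=0.5))
```

Lowering the threshold with the slider admits more detections at the cost of more false positives; raising it does the opposite.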
[0169] Furthermore, the inference mode screen 401 is provided with a video input unit 404. The video input unit 404 allows the user to input verification videos of the worker's work. Specifically, for example, verification videos can be input by moving the icon of a video file (a file containing verification videos) onto the video input unit 404 using a mouse operation (drag and drop operation), or by selecting a video file on a file selection screen (not shown) that appears when the mouse is right-clicked on the video input unit 404 using a mouse operation (click operation).
[0170] Here, it is preferable that a different video from the training video input to the video input unit 103 in the class definition mode screen 101 (see Figure 8) is input as the verification video. This allows the user to appropriately verify the recognition accuracy of the object recognition model and the action recognition model.
[0171] Furthermore, the inference mode screen 401 is equipped with an "Object Recognition Inference" button 405 and a "Behavior Recognition Inference" button 406. When the user operates the "Object Recognition Inference" button 405, inference is performed by the object recognition model on the verification video input to the video input unit 404. When the user operates the "Behavior Recognition Inference" button 406, inference is performed by the behavior recognition model on the verification video input to the video input unit 404.
[0172] Furthermore, the inference mode screen 401 is provided with a recognition result output unit 407. The recognition result output unit 407 displays a recognition result video. The recognition result video is a video in which images representing the recognition results are superimposed on the verification video. Specifically, rectangular frames surrounding objects detected from each frame image of the verification video are superimposed on the verification video. In addition, characters representing the recognized objects and characters representing the recognized actions are superimposed on the verification video. Information regarding the processing status may also be superimposed on the verification video.
[0173] Here, the user can view the recognition result video displayed in the recognition result output unit 407 to check the accuracy of the recognition model (object recognition model, action recognition model), and if the accuracy of the recognition model is insufficient, they can perform additional training or retraining of the recognition model.
[0174] In other words, if the accuracy of the object recognition model is insufficient, the user can return to the object recognition mode screen 201 (see Figure 10) by operating the "Object Recognition" tab 102, redo the object candidate extraction process by operating the "Object Candidate Extraction" button 203, redo the object recognition annotation process by performing a labeling operation on the presented candidate images, and redo the object recognition model training process by operating the "Object Recognition Learning" button 209.
[0175] Furthermore, if the accuracy of the behavior recognition model is insufficient, the user can return to the behavior recognition mode screen 301 (see Figure 11) by operating the "Behavior Recognition" tab 102, redo the behavior candidate extraction process by operating the "Action Candidate Extraction" button 303, redo the behavior recognition annotation process by performing a labeling operation on the presented candidate images, and redo the behavior recognition model learning process by operating the "Behavior Recognition Learning" button 309.
[0176] In this embodiment, the user can improve the accuracy of the object recognition model by redoing the object candidate extraction process, object recognition annotation process, and object recognition model training process. Furthermore, the user can improve the accuracy of the action recognition model by redoing the action candidate extraction process, action recognition annotation process, and action recognition model training process. In addition, since the user can specify various processing conditions on the object recognition mode screen 201 (see Figure 10) or the action recognition mode screen 301 (see Figure 11), the recognition model (object recognition model, action recognition model) can be customized to enable appropriate recognition processing according to the user's requests and the situation on site.
[0177] Next, we will explain the procedure for processing performed on the model generation server 1. Figure 13 is a flowchart showing the procedure for processing performed on the model generation server 1.
[0178] In the model generation server 1, the processor 13 first acquires the training video that the user has input on the user terminal 2 and obtains the frame images that make up the training video (input video processing) (ST101). At this time, the user inputs a video file (a file containing the training video) on the video input unit 103 of the class definition mode screen 101 (see Figure 8) displayed on the user terminal 2.
[0179] Next, the processor 13 works in cooperation with the LLM server 3 to obtain class definition information (class definition relay processing) (ST102). The class definition information is information regarding the correspondence between object classes and action classes, that is, valid combinations of action classes and the object classes they target.
[0180] Next, the processor 13 detects objects from the frame images of the training video (object detection process) (ST103). The first time this process is performed, an object detection model based on zero-shot learning is used; from the second time onward, an object detection model tuned by few-shot learning is used.
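One way to picture the class definition information is as a mapping from each action class to the object classes it targets. The sketch below is an illustrative data structure only; the class name `ClassDefinition`, its methods, and the object class "workpiece" are hypothetical and do not appear in this disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class ClassDefinition:
    """Valid combinations of action classes and the object classes they target."""
    # action class -> object classes that the action may target
    targets: dict[str, set[str]] = field(default_factory=dict)

    def is_valid(self, action_class: str, object_class: str) -> bool:
        """True if the action class is defined as targeting the object class."""
        return object_class in self.targets.get(action_class, set())

    def object_classes(self) -> set[str]:
        """All object classes referenced by any action class."""
        return set().union(*self.targets.values())


# "workpiece" is a made-up object class used only for this example.
definition = ClassDefinition({
    "screw tightening": {"driver"},
    "touch": {"workpiece"},
})
print(definition.is_valid("screw tightening", "driver"))
```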
[0181] Next, when the user operates the "Object Candidate Extraction" button 203 on the object recognition mode screen 201 (see Figure 10), the processor 13 extracts object images from the frame images based on the detection results of the object detection process, extracts feature quantities for each object image, classifies the object images into multiple clusters based on those feature quantities, and extracts a representative image for each cluster as a candidate image (object candidate extraction process) (ST104).
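The clustering and representative selection in the candidate extraction process can be sketched with a plain k-means loop. The disclosure does not name a clustering algorithm, so k-means, the deterministic initialization from the first k vectors, the fixed iteration count, and all function names are assumptions made for this example.

```python
import math


def kmeans(features, k, iterations=20):
    """Plain k-means on numeric feature vectors.

    Assumption: the initial centroids are simply the first k feature vectors,
    which keeps this sketch deterministic.
    """
    centroids = [list(f) for f in features[:k]]
    assignment = [0] * len(features)
    for _ in range(iterations):
        # Assign each feature vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(f, centroids[c]))
            for f in features
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(features, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def extract_candidates(features, k):
    """Cluster the images and pick the image nearest each centroid."""
    assignment, centroids = kmeans(features, k)
    candidates = {}
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            candidates[c] = min(
                members, key=lambda i: math.dist(features[i], centroids[c])
            )
    return assignment, candidates


# Hypothetical two-dimensional feature quantities for six object images.
features = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignment, candidates = extract_candidates(features, k=2)
print(assignment, candidates)
```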
[0182] Next, the processor 13 presents representative images for each cluster extracted in the object candidate extraction process to the user as candidate images, and in response to the user's operation to select a candidate image corresponding to a specified object class, it assigns the label of the specified object class to all object images included in the cluster to which the selected candidate image (representative image) belongs (object recognition annotation process) (ST105).
[0183] Next, when the user operates the "Object Recognition Learning" button 209 on the object recognition mode screen 201 (see Figure 10), the processor 13 performs machine learning using the labeled object images obtained in the object recognition annotation process to generate an object recognition model (object recognition model learning process) (ST106). The generated object recognition model is stored in the storage unit 12.
[0184] Next, the processor 13 acquires a verification video created by the user's input operations on the user terminal 2 and obtains frame images that make up the verification video (input video processing) (ST107). At this time, the user performs an operation to input a video file (a file containing the verification video) in the video input unit 404 on the inference mode screen 401 (see Figure 12) displayed on the user terminal 2.
[0185] Next, on the inference mode screen 401 (see Figure 12) displayed on the user terminal 2, when the user operates the "Object Recognition Inference" button 405, the processor 13 performs inference to derive object recognition results from the verification video using the object recognition model (object recognition inference process) (ST108).
[0186] Next, the processor 13 determines whether or not to redo the object recognition model in response to the user's operation on the user terminal 2 (ST109). At this time, the user visually checks the recognition result video (inference result) on the inference mode screen 401 (see Figure 12) displayed on the user terminal 2 and determines whether or not the recognition accuracy of the object recognition model is at a satisfactory level. If the recognition accuracy of the object recognition model is not at a satisfactory level, the user operates the "Object Recognition" tab 102, and it is determined that the object recognition model should be redone.
[0187] If the object recognition model needs to be redone (Yes in ST109), the process returns to ST103, and the object detection process (ST103), object candidate extraction process (ST104), object recognition annotation process (ST105), and object recognition model training process (ST106) are performed again. At this time, the object candidate extraction process (ST104) is performed using the object recognition model generated in the previous object recognition model training process, so the object images, clusters, and representative images (candidate images) will differ from those generated in the previous object candidate extraction process.
[0188] On the other hand, if the object recognition model is not being redone, that is, if the user operates the "Action Recognition" tab 102 on the inference mode screen 401 (see Figure 12) displayed on the user terminal 2 (No in ST109), the process proceeds to ST110.
[0189] Next, when the user operates the "Action Candidate Extraction" button 303 on the action recognition mode screen 301 (see Figure 11), the processor 13 extracts valid combinations of a person (subject of the action) and an object (object of the action) corresponding to a predetermined action based on the class definition information acquired in the class definition relay processing (ST102). Next, the processor 13 extracts action images that include the image regions of the person and object that make up each valid combination, extracts feature quantities for each action image, classifies the action images into multiple clusters based on the feature quantities for each action image, and extracts a representative image for each cluster as a candidate image (action candidate extraction processing) (ST110).
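The extraction of action images from valid person and object combinations might be sketched as below. The assumptions here are that detections are axis-aligned boxes given as (x1, y1, x2, y2), that the action image region is the smallest rectangle enclosing both the person and the object, and that every person is paired with every matching object; none of these details is stated in the disclosure, and the names are hypothetical.

```python
def union_box(a, b):
    """Smallest axis-aligned box (x1, y1, x2, y2) enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def action_regions(detections, valid_targets, person_class="person"):
    """Return (action class, region) pairs for each valid person/object pair.

    detections: list of (class name, box) tuples for one frame image.
    valid_targets: dict mapping action class -> set of target object classes,
        standing in for the class definition information.
    """
    people = [box for cls, box in detections if cls == person_class]
    regions = []
    for action, object_classes in valid_targets.items():
        for cls, object_box in detections:
            if cls not in object_classes:
                continue  # this object is not a valid target of the action
            for person_box in people:
                regions.append((action, union_box(person_box, object_box)))
    return regions


# Hypothetical detections; "box" is a made-up class with no valid action.
detections = [
    ("person", (0, 0, 10, 20)),
    ("driver", (8, 5, 15, 9)),
    ("box", (50, 50, 60, 60)),
]
print(action_regions(detections, {"screw tightening": {"driver"}}))
```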
[0190] Next, the processor 13 presents representative images for each cluster extracted in the action candidate extraction process to the user as candidate images, and in response to the user's operation to select a candidate image corresponding to a specified action class, it assigns the label of the specified action class to all action images included in the cluster to which the selected candidate image (representative image) belongs (action recognition annotation process) (ST111).
[0191] Next, when the user operates the "Action Recognition Learning" button 309 on the action recognition mode screen 301 (see Figure 11), the processor 13 performs machine learning using the labeled action images obtained in the action recognition annotation process to generate an action recognition model (action recognition model learning process) (ST112). The generated action recognition model is stored in the storage unit 12.
[0192] Next, on the inference mode screen 401 (see Figure 12) displayed on the user terminal 2, when the user operates the "Action Recognition Inference" button 406, the processor 13 performs inference to derive action recognition results from the verification video using the action recognition model (Action Recognition Inference Processing) (ST113).
[0193] Next, the processor 13 determines whether to redo the action recognition model or the object recognition model in response to the user's operation on the user terminal 2 (ST114). At this time, the user visually inspects the recognition result video (inference result) on the inference mode screen 401 (see Figure 12) displayed on the user terminal 2 and determines whether the recognition accuracy of the action recognition model and the recognition accuracy of the object recognition model are at a satisfactory level. If the recognition accuracy of the action recognition model is not at a satisfactory level, the user operates the "Action Recognition" tab 102, and it is determined that the action recognition model should be redone. Similarly, if the recognition accuracy of the object recognition model is not at a satisfactory level, the user operates the "Object Recognition" tab 102, and it is determined that the object recognition model should be redone.
[0194] If the behavior recognition model needs to be redone, the process returns to ST110, and the behavior candidate extraction process (ST110), behavior recognition annotation process (ST111), and behavior recognition model learning process (ST112) are performed again.
[0195] Furthermore, if the object recognition model needs to be redone, the process returns to ST103, and the object detection process (ST103), object candidate extraction process (ST104), object recognition annotation process (ST105), and object recognition model training process (ST106) are performed again.
[0196] On the other hand, if the action recognition model and object recognition model are not being redone, the process is terminated.
[0197] Thus, in this embodiment, if the accuracy of the recognition model (object recognition model, action recognition model) is insufficient, the recognition model can be repeatedly subjected to additional training and retraining, making it easy to improve the accuracy of the recognition model.
[0198] Next, we will explain the input video processing performed by the model generation server 1 (ST101 and ST107 in Figure 13). Figure 14 is a flowchart showing the procedure for input video processing.
[0199] In the model generation server 1, the communication unit 11 first receives the video file transmitted (uploaded) from the user terminal 2 (ST201). At this time, the user terminal 2 inputs the video file according to the user's operation, and that video file is transmitted from the user terminal 2 to the model generation server 1.
[0200] Next, the processor 13 acquires the video (image) stored in the received video file (ST202). At this time, the processor 13 acquires a training video in the case of the first input video processing (ST101 in Figure 13), and acquires a verification video in the case of the next input video processing (ST107 in Figure 13).
[0201] Next, the processor 13 extracts frame images from the video (training video, verification video) (ST203). Then, the processor 13 stores the frame images in the storage unit 12 (ST204).
[0202] Next, we will explain the procedure for the class definition relay processing (ST102 in Figure 13) performed on the model generation server 1. Figure 15 is a flowchart showing the procedure for the class definition relay processing.
[0203] In the model generation server 1, the processor 13 first transmits training videos (frame images) from the communication unit 11 to the LLM server 3 (ST301). The training videos are acquired from the user terminal 2 by input video processing (Figure 14) and stored in the storage unit 12.
[0204] Furthermore, when the communication unit 11 receives instruction text (prompt) entered by the user at the user terminal 2, the processor 13 transmits the received instruction text from the communication unit 11 to the LLM server 3 (ST302).
[0205] At this time, on the user terminal 2, when the user inputs instruction text containing recognition target information that describes the content of the action to be recognized on the class definition mode screen 101 (see Figure 8), the instruction text is sent to the model generation server 1. On the LLM server 3, upon receiving the training video and instruction text from the model generation server 1, a dialogue process is performed to generate a response text in response to the instruction text, and a class definition process is performed to generate class definition information based on the training video and instruction text.
[0206] Next, in the model generation server 1, when the communication unit 11 receives the response text and class definition information from the LLM server 3, the processor 13 transmits the received response text and class definition information from the communication unit 11 to the user terminal 2 (ST303).
[0207] At this time, when the user terminal 2 receives the response text and class definition information from the model generation server 1, the received response text and class definition information are displayed on the class definition mode screen 101 (see Figure 8). The user checks the displayed class definition information and determines whether satisfactory class definition information has been obtained. The user also checks the displayed response text, and if the response text indicates that the information necessary for the class definition is missing, the user inputs instruction text to supplement the necessary information.
[0208] Next, in the model generation server 1, the processor 13 determines whether the user has instructed the completion of the class definition process based on the user's operation on the user terminal 2 (ST304). At this time, if the user is satisfied with the contents of the class definition information, they operate the "Complete Class Definition" button 106 on the class definition mode screen 101 (see Figure 8). On the other hand, if the user is not satisfied with the contents of the class definition information, they input new instruction text (prompt) in the dialogue unit 104 of the class definition mode screen 101 (see Figure 8).
[0209] If the user then instructs the class definition process to be completed (Yes in ST304), the class definition information is stored in the storage unit 12.
[0210] On the other hand, if the user does not instruct the completion of the class definition process (No in ST304), the process returns to ST302. At this time, on the user terminal 2, the instruction text (prompt) newly entered by the user on the class definition mode screen 101 (see Figure 8) is sent to the model generation server 1.
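The loop from ST301 to ST304 can be pictured with a minimal Python sketch. The names `run_class_definition_dialogue`, `stub_llm`, and the dictionary shape of the class definition are illustrative assumptions and do not appear in the disclosure. The stub stands in for the LLM server 3, and a `None` entry in the prompt sequence stands in for operating the "Complete Class Definition" button 106. For brevity the frames are passed on every call, whereas the described flow transmits them once in ST301.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical signature for the call to the LLM server 3:
# (frames, dialogue history) -> (response text, class definition information)
LlmCall = Callable[[List[str], List[str]], Tuple[str, dict]]


def run_class_definition_dialogue(
    frames: List[str],
    prompts: Iterable[Optional[str]],
    llm_call: LlmCall,
) -> Tuple[Optional[dict], List[str]]:
    """Relay instruction text to the language model until the user finishes."""
    history: List[str] = []
    class_definition: Optional[dict] = None
    for prompt in prompts:
        if prompt is None:        # Yes in ST304: user signals completion
            break
        history.append(prompt)    # ST302: forward the instruction text
        # ST303: the response text and class definition come back for display
        response, class_definition = llm_call(frames, history)
        history.append(response)
    # The caller would store class_definition (storage unit 12 in the text)
    return class_definition, history


def stub_llm(frames: List[str], history: List[str]) -> Tuple[str, dict]:
    """Stand-in for the LLM server 3; it only echoes the user's instructions."""
    user_turns = history[0::2]    # user prompts sit at even positions
    definition = {"actions": list(user_turns)}
    return f"Received {len(user_turns)} instruction(s).", definition
```

Each pass through the loop corresponds to one "No in ST304" branch, so the user can keep refining the definition until the displayed result is satisfactory.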
[0211] Next, the object candidate extraction process (ST104 in Figure 13) and object recognition annotation process (ST105 in Figure 13) performed on the model generation server 1 will be described. Figure 16A is a flowchart showing the procedure for the object candidate extraction process. Figure 16B is a flowchart showing the procedure for the object recognition annotation process.
[0212] As shown in Figure 16A, in the object candidate extraction process, first, the processor 13 extracts an object image (a rectangular image region surrounding an object) from the frame image based on the detection result of the object detection process (ST103 in Figure 13) (object image extraction process) (ST401).
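As a rough illustration of ST401, the sketch below crops rectangular regions from a frame given detection boxes. The frame is modeled as a nested list of pixel values, and the box format `(x_min, y_min, x_max, y_max)` with exclusive maxima is an assumption made for this example; the disclosure does not specify a data layout.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x_min, y_min, x_max, y_max), max exclusive


def crop_object_images(frame: Sequence[Sequence[int]], boxes: Sequence[Box]) -> List[List[List[int]]]:
    """Return one cropped region per detection box, clamped to the frame bounds."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    crops = []
    for x0, y0, x1, y1 in boxes:
        # Clamp the box so detections that spill over the frame edge still crop
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x0 >= x1 or y0 >= y1:
            continue  # box lies entirely outside the frame
        crops.append([list(row[x0:x1]) for row in frame[y0:y1]])
    return crops
```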
[0213] Next, the processor 13 extracts feature quantities for each object image using a feature extraction model (object feature extraction process) (ST402). The feature quantities extracted in this object feature extraction process are used in the clustering process, which groups the object images for each object class.
[0214] Next, the processor 13 classifies each object image into a cluster (group of object images) based on the features of each object image (clustering process) (ST403). The clustering process is performed for each provisional object class.
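The disclosure does not name a clustering algorithm. As one possible reading of ST403, the following sketch uses plain k-means on the extracted feature vectors, with the number of clusters `k` supplied by the caller. The function name `kmeans`, the evenly spaced initialization, and the fixed iteration count are all assumptions chosen to keep the example deterministic and dependency-free.

```python
import math
from typing import List, Sequence


def kmeans(features: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Plain k-means; returns one cluster index per feature vector."""
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and the number of feature vectors")
    # Deterministic initialization for this sketch: k evenly spaced samples
    step = len(features) / k
    centroids = [list(features[int(i * step)]) for i in range(k)]
    assignments = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid
        assignments = [
            min(range(k), key=lambda c: math.dist(vec, centroids[c]))
            for vec in features
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [vec for vec, a in zip(features, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments
```

In practice a library implementation with better initialization would likely be preferable; the point here is only that each object image ends up with a cluster index.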
[0215] Next, the processor 13 extracts a representative image for each cluster from the object images contained in each cluster (representative image extraction process) (ST404). At this time, for example, if the centroid distance ranking specified by the user is 1, the object image closest to the centroid of the cluster is extracted as the representative image.
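The centroid distance ranking in ST404 can be sketched as follows: within each cluster, images are ordered by their distance to the cluster centroid, and the image at the user-specified rank is returned. The function name `representative_by_rank` and the clamping of the rank for small clusters are assumptions for illustration, not details given in the disclosure.

```python
import math
from typing import Dict, List, Sequence


def representative_by_rank(
    features: Sequence[Sequence[float]],
    assignments: Sequence[int],
    rank: int = 1,
) -> Dict[int, int]:
    """Map each cluster index to the index of its representative image.

    rank=1 selects the image closest to the centroid, rank=2 the next closest, and so on.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    clusters: Dict[int, List[int]] = {}
    for idx, cluster in enumerate(assignments):
        clusters.setdefault(cluster, []).append(idx)
    representatives = {}
    for cluster, members in clusters.items():
        centroid = [sum(col) / len(members) for col in zip(*(features[i] for i in members))]
        ordered = sorted(members, key=lambda i: math.dist(features[i], centroid))
        # Assumption: clamp the rank so small clusters still yield a representative
        representatives[cluster] = ordered[min(rank, len(ordered)) - 1]
    return representatives
```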
[0216] As shown in Figure 16B, in the object recognition annotation process, the processor 13 first presents the representative images for each cluster extracted in the representative image extraction process (ST404 in Figure 16A) to the user as candidate images (ST501). Specifically, the representative images for each cluster extracted are displayed as a list of candidate images on the candidate image display unit 207 of the object recognition mode screen 201 (see Figure 10).
[0217] At this time, on the object recognition mode screen 201 (see Figure 10), the user specifies an object class, and a list of candidate images that are expected to belong to the specified object class is displayed. The user visually inspects the displayed candidate images, determines whether each candidate image belongs to the specified object class, and checks the checkbox 208 for the candidate images that belong to the specified object class.
[0218] Next, the processor 13, in response to user operations on the user terminal 2, obtains information on candidate images checked by the user in checkbox 208, i.e., candidate images corresponding to the specified object class, and identifies the cluster to which the candidate image belongs (ST502).
[0219] Next, the processor 13 targets the cluster to which the candidate images corresponding to the specified object class belong, and assigns the label of the specified object class to all object images included in the cluster (labeling process) (ST503). Note that in the labeling process, binary labels, that is, labels indicating whether or not the specified object is included, may be assigned.
[0220] This object recognition annotation process provides object recognition annotation information, in which object class labels are assigned to object images extracted from frame images. Furthermore, object images included in the cluster to which candidate images (representative images) deemed appropriate by the user for a specified object class belong are labeled collectively. This allows users to efficiently label a large number of object images. Additionally, because appropriate training object images are obtained, the number of object images required to generate an object recognition model with sufficient accuracy can be significantly reduced.
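The collective labeling of ST502 and ST503 amounts to looking up the clusters of the checked candidate images and copying the specified class label to every member of those clusters. The sketch below assumes cluster assignments are held as a list of cluster indices; the function name `propagate_labels` and the use of `None` for unlabeled images are illustrative choices.

```python
from typing import Iterable, List, Optional, Sequence, Union


def propagate_labels(
    assignments: Sequence[int],
    checked_candidates: Iterable[int],
    class_label: str,
    binary: bool = False,
) -> List[Union[Optional[str], bool]]:
    """Label every image whose cluster contains a checked candidate image."""
    # ST502: identify the clusters the checked candidate images belong to
    selected_clusters = {assignments[i] for i in checked_candidates}
    labels: List[Union[Optional[str], bool]] = []
    # ST503: assign the label to all images in those clusters
    for cluster in assignments:
        hit = cluster in selected_clusters
        if binary:
            labels.append(hit)  # binary variant: specified class present or not
        else:
            labels.append(class_label if hit else None)
    return labels
```

With this structure a single checkbox operation labels an entire cluster, which is where the reduction in user effort comes from.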
[0221] Next, the action candidate extraction process (ST110 in Figure 13) and the action recognition annotation process (ST111 in Figure 13) performed on the model generation server 1 will be described. Figure 17A is a flowchart showing the procedure for the action candidate extraction process. Figure 17B is a flowchart showing the procedure for the action recognition annotation process.
[0222] As shown in Figure 17A, in the action candidate extraction process, the processor 13 first extracts all combinations of people (subjects of the action) and objects (objects of the action) based on the object recognition annotation information obtained in the object recognition annotation process (see Figure 16B), that is, information on which object class labels (person, driver, etc.) have been assigned to object images extracted from frame images (total combination extraction process) (ST601).
[0223] Next, the processor 13 extracts combinations that match the class definition from all combinations of person (subject of action) and object (object of action), based on the class definition information obtained in the class definition relay process (see Figure 15), that is, information on valid combinations of action classes and the object classes that are the targets of the action (valid combination extraction process) (ST602). As a result, impossible combinations of action and object (combinations without the "○" symbol in the examples of Figures 4A and 4B) are excluded from the subsequent processing.
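ST601 and ST602 can be sketched as a Cartesian product followed by a filter. The example assumes the class definition information is held as a mapping from each action class to the set of object classes marked valid for it; this representation, the function name `valid_combinations`, and the class names "tighten", "carry", and "box" are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple


def valid_combinations(
    labels: Sequence[str],
    class_definition: Dict[str, Set[str]],
    action: str,
) -> List[Tuple[int, int]]:
    """Return (person index, object index) pairs that are valid for the action."""
    persons = [i for i, label in enumerate(labels) if label == "person"]
    objects = [i for i, label in enumerate(labels) if label != "person"]
    # ST601: every combination of a person and an object
    all_pairs = product(persons, objects)
    # ST602: keep only pairs whose object class is marked valid for the action
    allowed = class_definition.get(action, set())
    return [(p, o) for p, o in all_pairs if labels[o] in allowed]
```

For instance, with labels `["person", "driver", "box", "person"]` and a definition that allows only "driver" for "tighten", the pairs involving the box are dropped before any action image is extracted.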
[0224] Next, the processor 13 extracts an action image from the frame image that includes a person (subject of the action) and an object (object of the action) whose combination is valid for a predetermined action, that is, a rectangular image region surrounding the image regions of both the person and the object (action image extraction process) (ST603).
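One way to read the "rectangular image region surrounding" both regions in ST603 is the smallest axis-aligned rectangle that encloses the person box and the object box. The sketch below takes that reading as an assumption; the box format `(x_min, y_min, x_max, y_max)` and the function name `union_box` are likewise illustrative.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x_min, y_min, x_max, y_max)


def union_box(person_box: Box, object_box: Box) -> Box:
    """Smallest axis-aligned rectangle enclosing both input boxes."""
    px0, py0, px1, py1 = person_box
    ox0, oy0, ox1, oy1 = object_box
    return (min(px0, ox0), min(py0, oy0), max(px1, ox1), max(py1, oy1))
```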
[0225] Next, the processor 13 extracts features from each action image using a feature extraction model (action feature extraction process) (ST604). The features extracted in this action feature extraction process are referenced during the clustering process of action images by action class.
[0226] Next, the processor 13 classifies each action image into a cluster (group of action images) based on the features of each action image (clustering process) (ST605). The clustering process is performed for each provisional action class.
[0227] Next, the processor 13 extracts a representative image for each cluster from the action images contained in each cluster (representative image extraction process) (ST606). At this time, for example, if the centroid distance ranking specified by the user is 1, the action image closest to the centroid of the cluster is extracted as the representative image.
[0228] As shown in Figure 17B, in the action recognition annotation process, the processor 13 first presents the representative images for each cluster extracted in the representative image extraction process (ST606 in Figure 17A) to the user as candidate images (ST701). Specifically, the extracted representative images for each cluster are displayed as a list of candidate images on the candidate image display unit 307 of the action recognition mode screen 301 (see Figure 11).
[0229] At this time, on the action recognition mode screen 301 (see Figure 11), the user specifies an action class, and a list of candidate images that are expected to belong to the specified action class is displayed. The user visually inspects the displayed candidate images, determines whether each candidate image belongs to the specified action class, and checks the checkbox 308 for the candidate images that belong to the specified action class.
[0230] Next, the processor 13, in response to user operations on the user terminal 2, obtains information on candidate images that the user has checked in checkbox 308, i.e., candidate images that correspond to the specified action class, and identifies the cluster to which each such candidate image belongs (ST702).
[0231] Next, the processor 13 targets the cluster to which the candidate images corresponding to the specified action class belong, and assigns the label of the specified action class to all action images included in the cluster (labeling process) (ST703). Note that in the labeling process, binary labels, that is, labels indicating whether or not the specified action is included, may be assigned.
[0232] This action recognition annotation process yields action recognition annotation information, in which action images extracted from frame images are labeled with action classes. Furthermore, the action images included in the cluster containing a candidate image (representative image) that the user has judged to belong to the specified action class are labeled collectively. This allows users to efficiently label a large number of action images. Additionally, because appropriate training action images are obtained, the number of action images required to generate a sufficiently accurate action recognition model can be significantly reduced.
[0233] As described above, embodiments have been explained as examples of the technology disclosed in this application. However, the technology in this disclosure is not limited to these embodiments and can be applied to embodiments that have been modified, replaced, added, or omitted. Furthermore, it is possible to create new embodiments by combining the components described in the above embodiments.
[0234] For example, in this embodiment, the model generation server 1 generates an object recognition model and also generates an action recognition model based on that object recognition model, but it is also possible to generate only the object recognition model.
[0235] The class definition support system and class definition support method relating to this disclosure have the effect of enabling even inexperienced users to appropriately create the class definition information necessary for annotation and training to generate a recognition model, and are useful as a class definition support system, class definition support device, and class definition support method that support user operations to acquire class definition information related to the events to be recognized.
[0236] Furthermore, the annotation system and annotation method relating to this disclosure have the effect of significantly reducing the burden on users who perform annotation work to generate recognition models, enabling the creation of recognition models with sufficient accuracy quickly and easily. They are useful as annotation systems and annotation methods that perform processing related to annotation of target images extracted from on-site captured images and used for training to generate recognition models.
[0237]
1: Model generation server (server device, class definition support device, object recognition annotation device, action recognition annotation device)
2: User terminal (terminal device)
3: LLM server (language model device)
11: Communication unit
12: Storage unit
13: Processor
101: Class definition mode screen
102: Tab
103: Video input unit
104: Dialogue unit
105: Definition result output unit
106: Button
107: Model selection unit
111, 112: Message box
113: Input box
114: Button
121: Drop-down list
201: Object recognition mode screen
202: Training count display unit
203: Button
204: Class specification unit
205: Image count specification unit
206: Extraction condition specification unit
207: Candidate image display unit
208: Checkbox (correct / incorrect input unit)
209: Button
301: Action recognition mode screen
302: Training count display unit
303: Button
304: Class specification unit
305: Image count specification unit
306: Extraction condition specification unit
307: Candidate image display unit
308: Checkbox (correct / incorrect input unit)
309: Button
401: Inference mode screen
402, 403: Threshold setting unit
404: Video input unit
405, 406: Button
407: Recognition result output unit
Claims
1. A class definition support system that assists a user in obtaining class definition information relating to an event to be recognized, comprising: a terminal device in which the user performs an operation to input information relating to an event to be recognized that describes the content of the event to be recognized; and a server device connected to the terminal device via a network and which obtains the class definition information based on the information relating to the event to be recognized, wherein the server device causes a language model to perform a process to collect the information relating to the event to be recognized necessary for obtaining the class definition information while engaging in natural language dialogue with the user, and obtains the class definition information generated by the language model.
2. The class definition support system according to claim 1, characterized in that the server device acquires information defining an object class corresponding to an action class as class definition information relating to an action class.
3. The class definition support system according to claim 2, characterized in that the server device acquires information regarding effective combinations of an action class and the object class to which it is applied as class definition information.
4. The class definition support system according to claim 3, characterized in that the server device displays on the terminal device a screen that visualizes the class definition information representing the correspondence between object classes and action classes in a tabular format.
5. The class definition support system according to claim 1, characterized in that the server device is connected via a network to a language model device equipped with the language model, and acquires the class definition information in cooperation with the language model device.
6. The class definition support system according to claim 5, characterized in that the server device acquires instruction text in natural language entered by a user at the terminal device, transmits the instruction text to the language model device to instruct it to execute class definition processing, and acquires the class definition information based on the instruction text from the language model device.
7. The class definition support system according to claim 6, characterized in that the server device transmits the captured image along with the instruction text to the language model device to instruct it to execute the class definition process, and obtains the class definition information based on the instruction text and the captured image from the language model device.
8. A class definition support method that causes a processor to execute a process to assist a user in obtaining class definition information relating to an event to be recognized, characterized in that, when obtaining the class definition information based on information relating to an event to be recognized that describes the content of the event to be recognized, the method causes a language model to perform a process to collect the information relating to an event to be recognized necessary for obtaining the class definition information while engaging in a natural language dialogue with the user, and obtains the class definition information generated by the language model.
9. An annotation system that performs annotation processing on target images used for training to generate a recognition model extracted from on-site captured images, comprising: a terminal device in which a user performs an operation to select target images that belong to a recognition class to be recognized; and a server device connected to the terminal device via a network, which performs annotation processing to assign a recognition class label in response to the user's operation to select a target image, wherein the server device extracts target images from captured images, classifies the extracted target images into one or more image groups, extracts a representative image from the image group based on predetermined extraction conditions, presents the representative image to the user as a candidate image, and, in response to the user's operation to select a candidate image that belongs to a recognition class, assigns the same recognition class label to all target images included in the image group to which the selected candidate image belongs as the annotation processing.
10. The annotation system according to claim 9, characterized in that the server device displays an annotation screen that lists candidate images on the terminal device, and executes the annotation process in response to a user's operation to select a candidate image corresponding to a recognition class on the annotation screen.
11. The annotation system according to claim 10, characterized in that the server device displays a list of candidate images that are expected to correspond to the specified recognition class on the annotation screen in response to an operation by a user specifying a recognition class, and displays a correct / incorrect input section on the annotation screen for the user to input whether or not the displayed candidate images are appropriate as target images corresponding to the specified recognition class.
12. The annotation system according to claim 9, characterized in that the server device detects objects from captured images using an object detection model generated by machine learning, and extracts target images from the captured images based on the detection results.
13. The annotation system according to claim 9, characterized in that the server device extracts a target image from a captured image that includes an image region of an object related to an action, based on class definition information that defines an object class corresponding to an action class.
14. The annotation system according to claim 9, characterized in that the server device extracts features from the target image and classifies the target image into a group of images based on the features of the target image.
15. The annotation system according to claim 9, characterized in that the server device extracts representative images from a group of images based on extraction conditions specified by the user.
16. The annotation system according to claim 15, characterized in that, if the recognition model generated by learning using the target images to which labels have been assigned by the annotation process does not yield an accuracy that satisfies the user, the server device uses the results of the previous processing to extract the target images again from the captured images, reclassifies the re-extracted target images into a group of images, re-extracts a representative image from the reclassified group of images, and re-presents the re-extracted representative image to the user as a candidate image.
17. The annotation system according to claim 9, characterized in that the server device classifies the target images into a number of image groups corresponding to the number of candidate images specified by the user in response to the user's operation of specifying the number of candidate images to be presented.
18. The annotation system according to claim 9, characterized in that the server device extracts a representative image from the image group based on the specified rank in response to a user operation specifying a rank regarding the position of the centroid of the image group as the extraction condition.
19. The annotation system according to claim 9, characterized in that the server device randomly extracts representative images from the image group as the extraction condition.
20. An annotation method that causes a processor to perform annotation processing on target images used for training to generate a recognition model extracted from on-site captured images, characterized in that: a target image is extracted from captured images; the extracted target image is classified into one or more image groups; a representative image is extracted from the image groups based on predetermined extraction conditions; the representative image is presented to the user as a candidate image; and, in response to the user's operation to select a candidate image corresponding to a recognition class, the same recognition class label is collectively assigned to all target images included in the image group to which the selected candidate image belongs.