Generation device, generation system, generation method, and program
The generation device tracks and summarizes objects in images to generate precise answer texts by identifying objects, generating state information, and selecting relevant details based on question context, addressing the inaccuracies in existing video surveillance search systems.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NEC CORP
- Filing Date
- 2025-12-10
- Publication Date
- 2026-07-02
AI Technical Summary
Existing video surveillance search systems struggle to accurately identify and summarize objects in images, leading to incomplete or inappropriate answer texts when generating responses to search queries.
A generation device and method that tracks the appearance time and position of objects in multiple images, generates state information for each object, and selects relevant information based on the question text to create appropriate answer texts using a language model.
Enables the generation of accurate and relevant answer texts by focusing on specific objects and scenes within images, improving the precision of responses to search queries.
Figure JP2025043039_02072026_PF_FP_ABST
Description
Generation Device, Generation System, Generation Method, and Program
[0001] The present disclosure relates to a generation device, a generation system, a generation method, and a program.
[0002] Techniques for obtaining an answer text for a question text by utilizing a language model have been developed. Generally, when generating an answer text by utilizing a language model, the question text and generation information are input into the language model. The input generation information is generated based on a summary sentence created from a plurality of images.
[0003] Here, Patent Document 1 discloses a video surveillance search system that displays an image including an object that matches a search condition.
[0004] Patent Document 1: Japanese Patent Application Laid-Open No. 2007-280043
[0005] In the video surveillance search system disclosed in Patent Document 1, it is not always possible to obtain an image that contains all of the objects matching the search condition and only the target objects. Therefore, a summary sentence focusing on each object cannot be created, and as a result, an appropriate answer text may not be obtained.
[0006] An object of the present disclosure is to provide a generation device, a generation system, a generation method, and a program for obtaining an appropriate answer text in view of the above-described problems.
[0007] The generation device according to the present disclosure is a generation device that generates the generation information used in a model for generating an answer text from a question text and generation information, and includes: an identification unit that identifies each object included in a plurality of images by tracking the appearance time and position of each object; a state information generation unit that generates state information of each identified object from the plurality of images; and a generation information generation unit that generates the generation information based on the state information of each object and the question text.
[0008] The generation device relating to this disclosure is a generation device that generates generated information used in a model that generates answer text from question text and generated information, and comprises: a state information generation unit that generates state information of an object included in a plurality of images; and a generation information generation unit that generates generated information by selecting state information related to the question text from the state information of the object.
[0009] The generation method relating to this disclosure is a generation method for generating generated information used in a model that generates answer text from question text and generated information, wherein a computer performs the following processes: identifying each object by tracking the appearance time and position of each object included in a plurality of images; generating state information for each identified object from the plurality of images; and generating the generated information based on the state information for each object and the question text.
[0010] The program relating to this disclosure is a program for generating generated information used in a model that generates answer text from question text and generated information, and causes a computer to perform the following processes: identify each object by tracking the appearance time and position of each object included in a plurality of images; generate state information for each identified object from the plurality of images; and generate generated information based on the state information for each object and the question text.
[0011] This disclosure provides a generation device, generation system, generation method, and program for obtaining appropriate response text.
[0012] Figure 1 is a block diagram illustrating the generation device related to this disclosure. Figure 2 is a flowchart showing the information processing method related to this disclosure. Figure 3 is a block diagram illustrating the generation device related to this disclosure. Figure 4 is a flowchart showing the information processing method related to this disclosure. Figure 5 is a block diagram illustrating the generation system related to this disclosure. Figure 6 is an overview diagram of the generation device and model device. Figures 7, 8, and 9 are diagrams each showing an example of the operation of the generation system related to this disclosure. Figure 10 is a diagram showing an example of the display unit of a terminal equipped with the generation device and model device related to this disclosure. Figure 11 is a block diagram showing an example of the configuration of the generation device related to this disclosure.
[0013] The present disclosure will be described below through embodiments, but this does not limit the claims to the following embodiments. Furthermore, not all of the configurations described in the embodiments are necessarily essential for solving the problem. In each drawing, the same elements are denoted by the same reference numerals, and redundant explanations are omitted where necessary.
[0014] <Embodiment 1> <Generation Device> The configuration of the generation device according to this disclosure will be described below with reference to Figure 1. Figure 1 is a block diagram illustrating the generation device according to this disclosure. As shown in Figure 1, the generation device 10 comprises an identification unit 11, a state information generation unit 12, and a generation information generation unit 13. The generation device 10 generates generation information used in a model that generates an answer text from a question text and generation information.
[0015] <Question Text and Answer Text> First, the question text and the answer text will be explained. The question text and the answer text consist of words and sentences. Each text consists of one or more sentences. For example, the question text may consist of one sentence and the answer text of multiple sentences, or the reverse. The question text and answer text can be input and output, for example, by voice. Alternatively, the question text may be entered as text, and the answer text may be displayed. Text can be entered, for example, by keyboard input or touch panel input.
[0016] Here are some examples of question and answer texts. For example, the question might be, "What color clothes was the first person to enter the store wearing?" The answer would be, "The first person to enter the store was wearing red clothes." Another example is, "What color clothes was the first person to enter wearing?" The answer would be, "The first person to enter was wearing red clothes." Yet another example is, "What color clothes was the first person to enter wearing? What did they buy?" The answer would be, "The first person to enter was wearing red clothes. They bought a drink."
[0017] As a further example, the question may be, "Where were the people from ward XX between 3 PM and 4 PM?" The answer would be, "They were in the dining hall from 3 PM to 3:30 PM. They were in the rehabilitation room from 3:30 PM to 4 PM."
[0018] <Identification Unit> The identification unit 11 shown in Figure 1 identifies each object by tracking the appearance time and location of each object included in multiple images. Objects include people and things. For example, if the multiple images include three people, A, B, and C, the identification unit 11 identifies each of A, B, and C by tracking the appearance time and location of A, B, and C in each image. Images include both moving and still images.
[0019] <State Information Generation Unit> The state information generation unit 12 shown in Figure 1 generates state information for each identified object from multiple images. The state information is information that indicates the state of each object and consists of words and sentences. The text included in the state information consists of one or more sentences.
[0020] When the object is a person, the state information includes, for example, information about the person's actions, gestures, height and build, location, belongings, and clothing. When the object is a thing, the state information includes, for example, information about its location, color, shape, and size.
[0021] State information can be, for example, "Standing in front of a vending machine with a red bag on the back," "Entered a store," or "Placed on a yellow shelf." It can also consist of multiple sentences, such as, "Parked the car in the parking lot. Waiting inside the car."
[0022] The state information generation unit 12 will be explained using the example of the identification unit 11 identifying three people, A, B, and C, from multiple images. For example, the state information generation unit 12 generates state information for person A such as "Purchased a drink" or "Left the store." Similarly, the state information generation unit 12 generates state information for person B such as "Purchased a rice ball" or "Paying at the store." Similarly, the state information generation unit 12 generates state information for person C such as "Purchased sweets" or "Paying at the store."
[0023] In this way, the state information generation unit 12 generates state information for each of the identified targets A, B, and C. The state information generation unit 12 may also associate time information with the state information of each identified target. In this case, the state information may be a list organized by time of day or period.
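As a minimal sketch of state information associated with time information, the following shows one way such a per-object list organized by time period could be represented. The class `StateInfo`, the functions `timeline` and `format_timeline`, and the use of seconds as the time unit are assumptions made for illustration and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateInfo:
    """One piece of state information for one identified object (hypothetical layout)."""
    object_id: str   # e.g. "A", "B", "C"
    start: float     # start of the period, in seconds from the first image (assumed unit)
    end: float       # end of the period, in seconds
    text: str        # e.g. "Purchased a drink"


def timeline(states, object_id):
    """Return one object's state information ordered by time period."""
    own = [s for s in states if s.object_id == object_id]
    return sorted(own, key=lambda s: (s.start, s.end))


def format_timeline(states, object_id):
    """Render the ordered list as text, one period per line."""
    return "\n".join(
        f"[{s.start:g}-{s.end:g}s] {s.text}" for s in timeline(states, object_id)
    )
```

Keeping the period next to each sentence lets later steps order or filter the state information by time without re-reading the images.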
[0024] <Generated Information Generation Unit> The generated information generation unit 13 shown in Figure 1 generates generated information based on the state information of each target and the question text. Generated information is information used in the learning model that generates the answer text. The generated information generation unit 13 will be explained in detail below.
[0025] <Generation Example> An example of generation of generated information by the generated information generation unit 13 will be explained. For example, the generated information generation unit 13 generates generated information by selecting the target's state information according to the question text from among the state information of each target. The generated information generation unit 13 will be explained assuming that the state information generation unit 12 has generated the state information of three people, A, B, and C.
[0026] Here, the state information for object A is assumed to be "Purchased a drink" and "Left the store." The state information for object B is assumed to be "Purchased a rice ball" and "Paying at the store." The state information for object C is assumed to be "Purchased sweets" and "Paying at the store."
[0027] For example, suppose the question text is "What did the first person to leave the store in the image purchase?" In this case, from the state information of the three subjects A, B, and C, the generated information generation unit 13 selects the state information of subject A, "Purchased a drink" and "Left the store," because subject A corresponds to "the first person to leave the store in the image" in the question text. As a result, the generated information generation unit 13 generates the generated information "Purchased a drink" and "Left the store." Based on this generated information and the question text, the model is used to obtain the answer text "A purchased a drink first."
[0028] In this way, the generation device 10 generates state information for each target by tracking and identifying the target. Then, the generation device 10 generates generated information based on the state information of each target and the question text. With this configuration, appropriate generated information can be input to the learning model. Therefore, appropriate answer text can be obtained.
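The selection in the example with subjects A, B, and C can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the source does not say how the question text is interpreted, so the event phrase is passed in explicitly, and the timestamps and the function name `select_first_object` are hypothetical.

```python
def select_first_object(states_by_object, event):
    """Pick the object that reaches `event` earliest and return its state information.

    states_by_object maps an object id to a list of (time, text) pairs.
    Returns (object_id, [text, ...]) or (None, []) if no object has the event.
    """
    first_id, first_time = None, None
    for object_id, states in states_by_object.items():
        for time, text in states:
            if text == event and (first_time is None or time < first_time):
                first_id, first_time = object_id, time
    if first_id is None:
        return None, []
    # Return the selected object's state information in time order.
    return first_id, [text for _, text in sorted(states_by_object[first_id])]


# Hypothetical timestamps (seconds) attached to the state information from the example.
states_by_object = {
    "A": [(12, "Purchased a drink"), (20, "Left the store")],
    "B": [(15, "Purchased a rice ball"), (22, "Paying at the store")],
    "C": [(18, "Purchased sweets"), (25, "Paying at the store")],
}
selected_id, generated_information = select_first_object(states_by_object, "Left the store")
```

With this data, only subject A has the state "Left the store," so A's two pieces of state information become the generated information.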
[0029] <Generation Method> Next, the generation method related to this disclosure will be explained. Figure 2 is a flowchart showing the generation method related to this disclosure. The generation method related to this disclosure generates generation information used in the model that generates the response text.
[0030] First, as shown in Figure 2, the identification unit 11 identifies each object by tracking the appearance time and location of each object included in multiple images (step ST11).
[0031] Next, as shown in Figure 2, the state information generation unit 12 generates state information for each identified object from multiple images (step ST12).
[0032] Next, as shown in Figure 2, the generated information generation unit 13 generates generated information based on the state information of each target and the question text (step ST13). With this configuration, appropriate generated information can be input to the learning model. Therefore, appropriate answer text can be obtained.
[0033] <Embodiment 2> <Generation Device> The configuration of the generation device according to this disclosure will be described below with reference to Figure 3. Figure 3 is a block diagram illustrating the generation device according to this disclosure. As shown in Figure 3, the generation device 20 comprises a state information generation unit 22 and a generation information generation unit 23. The generation device 20 is a device that generates generation information used in a model that generates answer text from question text and generation information.
[0034] The state information generation unit 22 shown in Figure 3 generates state information of an object included in multiple images. The generated information generation unit 23 shown in Figure 3 generates generated information by selecting state information related to the question text from the state information of the object.
[0035] <Generation Example> Here, we will explain the generation device 20 shown in Figure 3 while comparing it with the generation device 10 shown in Figure 1. The generation device 10 shown in Figure 1 generates state information for each identified target A, B, and C. For example, the generation device 10 generates "Purchased a drink." and "Left the store." as state information for target A. The generation device 10 generates "Purchased a rice ball." and "Paying at the store." as state information for target B. The generation device 10 generates "Purchased sweets." and "Paying at the store." as state information for target C.
[0036] In contrast, the generation device 20 shown in Figure 3 works as follows. The state information generation unit 22 does not generate state information for each of the targets A, B, and C. Instead, it generates state information for the targets (persons) as a whole, such as "Purchased a drink," "Left the store," "Purchased a rice ball," "Purchased sweets," and "Paying at the store." In other words, because the generation device 20 shown in Figure 3 does not have an identification unit 11 like the generation device 10 shown in Figure 1, it generates state information for the targets (persons) collectively rather than for each of the targets A, B, and C.
[0037] Then, the generated information generation unit 23 selects state information related to the question text from the state information of the target (person) and generates generated information.
[0038] In this way, the generation device 20 generates state information of the object contained in multiple images. Then, the generation device 20 generates generated information by selecting state information related to the question text from the state information of the object. With this configuration, generated information related to the question text can be input to the learning model. Therefore, an appropriate answer text can be obtained.
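The source does not specify how state information is judged to be "related to the question text." As a stand-in, the sketch below uses simple keyword overlap over the pooled state information, without any per-object identity. The stopword list and the names `keywords` and `select_related` are assumptions for illustration only.

```python
import re

# Hypothetical stopword list; a real system would likely use a richer relevance measure.
STOPWORDS = {"a", "an", "the", "what", "who", "did", "was", "were", "in", "at", "of", "to", "is"}


def keywords(text):
    """Lower-case content words of a sentence."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def select_related(state_info, question):
    """Keep the state information that shares at least one keyword with the question."""
    question_words = keywords(question)
    return [s for s in state_info if keywords(s) & question_words]


pooled = [
    "Purchased a drink",
    "Left the store",
    "Purchased a rice ball",
    "Purchased sweets",
    "Paying at the store",
]
generated_information = select_related(pooled, "Who left the store?")
```

Keyword overlap is deliberately coarse: here it keeps every sentence mentioning the store, which illustrates the selection step but not the quality of selection a practical system would need.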
[0039] <Generation Method> Next, the generation method related to this disclosure will be explained. Figure 4 is a flowchart showing the generation method related to this disclosure. The generation method related to this disclosure generates generation information used in the model that generates the response text.
[0040] First, as shown in Figure 4, the state information generation unit 22 generates state information for the objects included in multiple images (step ST22).
[0041] Next, as shown in Figure 4, the generated information generation unit 23 generates generated information by selecting state information related to the question text from the target state information (step ST33). With this configuration, generated information related to the question text can be input to the learning model. Therefore, an appropriate answer text can be obtained.
[0042] <Embodiment 3> <Information Processing System> The configuration of the generation system according to the present disclosure will be described below with reference to FIG. 5. FIG. 5 is a block diagram illustrating the generation system according to the present disclosure. As shown in FIG. 5, the generation system 50 includes a generation device 100 and a model device 200.
[0043] An overview of the generation device 100 and the model device 200 will be described. FIG. 6 is a schematic diagram of the generation device and the model device. As shown in FIG. 6, the generation device 100 generates generation information using a plurality of images as inputs. As shown in FIG. 6, the model device 200 generates an answer text from a question text and the generation information using a language model LM (language models). The language model LM is an example of a learning model.
[0044] As described above, the generation device 100 generates the generation information used for the language model LM of the model device 200. Hereinafter, each configuration of the generation device 100 will be described, and the operation of the generation device 100 will be described with specific examples.
[0045] <Generation Device (Identification Unit)> The identification unit 101 shown in FIG. 5 identifies each object by tracking the appearance time and position of each object included in a plurality of images. The identification unit 101 may use a known technique for performing tracking based on the appearance time and position of each object.
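The source leaves the tracking technique open. As one hedged illustration, the sketch below uses greedy nearest-neighbour association on appearance time and position. This is a simple stand-in and not necessarily the technique intended by the disclosure. The names `Detection`, `Track`, `identify_objects`, `max_dist`, and `max_gap` are hypothetical.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Detection:
    time: float  # appearance time in seconds
    x: float     # centre position of the object in the image
    y: float


@dataclass
class Track:
    object_id: str
    detections: list = field(default_factory=list)


def identify_objects(detections, max_dist=50.0, max_gap=1.0):
    """Group detections into tracks, one track per identified object.

    A detection joins the closest track whose last detection is strictly
    earlier, at most `max_gap` seconds old, and at most `max_dist` pixels away.
    Otherwise it starts a new track.
    """
    tracks = []
    for det in sorted(detections, key=lambda d: d.time):
        best, best_dist = None, max_dist
        for track in tracks:
            last = track.detections[-1]
            gap = det.time - last.time
            dist = hypot(det.x - last.x, det.y - last.y)
            if 0 < gap <= max_gap and dist <= best_dist:
                best, best_dist = track, dist
        if best is None:
            best = Track(object_id=f"obj{len(tracks) + 1}")
            tracks.append(best)
        best.detections.append(det)
    return tracks
```

Requiring a strictly positive time gap prevents two detections from the same image from being merged into one object.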
[0046] <Generation Device (State Information Generation Unit)> The state information generation unit 102 shown in FIG. 5 includes a division unit 1021. The division unit 1021 generates divided images. The divided image is an image obtained by dividing each of the input plurality of images so that each object identified by the identification unit 101 is individually included. The division unit 1021 generates a divided image by dividing it into an arbitrary size so as to include the periphery of each object. In other words, the division unit 1021 performs cutout editing so that each object identified by the identification unit 101 is individually included.
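The cutout described above can be sketched as a crop rectangle that is enlarged around the object's bounding box and clamped to the image. This is a minimal sketch under assumptions: the bounding-box format `(left, top, right, bottom)`, the relative `margin` parameter, and the functions `crop_box` and `crop` are hypothetical, and the image is modelled as a plain list of pixel rows to keep the example dependency-free.

```python
def crop_box(bbox, image_size, margin=0.25):
    """Expand a bounding box by `margin` of its size and clamp it to the image.

    bbox is (left, top, right, bottom) in pixels; image_size is (width, height).
    """
    left, top, right, bottom = bbox
    width, height = image_size
    pad_x = (right - left) * margin
    pad_y = (bottom - top) * margin
    return (
        max(0, int(left - pad_x)),
        max(0, int(top - pad_y)),
        min(width, int(right + pad_x)),
        min(height, int(bottom + pad_y)),
    )


def crop(image, box):
    """Cut the box out of an image given as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Applying `crop_box` and `crop` to every image in which an identified object appears would yield a per-object sequence of divided images that includes the object's surroundings.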
[0047] Based on information regarding the object type included in the question text, the state information generation unit 102 may generate, from the divided images, state information only for those identified objects that have that object type. The object type indicates the kind of object, such as a person or a thing.
[0048] For example, when the question text is "Who was the first person to enter the store?", the question text includes "person". Therefore, the state information generation unit 102 generates state information for each identified object having the object type of "person" from the divided images. To explain more specifically, assume that each identified object is person E, person F, person G, and object H. In this case, the state information generation unit 102 generates state information for person E, person F, and person G having the object type of "person" from the divided images among the identified person E, person F, person G, and object H.
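The example with person E, person F, person G, and object H can be sketched as a lookup of object types named in the question text. The type vocabulary `KNOWN_TYPES`, the word-matching rule, and the fallback used when no type is named are assumptions for illustration; the source does not describe how the object type is extracted from the question text.

```python
KNOWN_TYPES = ("person", "object")  # hypothetical vocabulary of object types


def types_in_question(question, known_types=KNOWN_TYPES):
    """Return the object types that appear as words in the question text."""
    words = question.lower().replace("?", " ").split()
    return [t for t in known_types if t in words]


def objects_to_describe(identified, question):
    """Select the identified objects whose type is named in the question.

    `identified` maps an object name to its object type.
    """
    wanted = types_in_question(question)
    if not wanted:
        # Assumption: if the question names no type, describe every object.
        return list(identified)
    return [name for name, kind in identified.items() if kind in wanted]


identified = {"E": "person", "F": "person", "G": "person", "H": "object"}
targets = objects_to_describe(identified, "Who was the first person to enter the store?")
```

Restricting state information generation to the selected objects avoids describing objects the question does not ask about.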
[0049] In addition, when the generation device 100 does not include the identification unit 101, the division unit 1021 divides each image into an appropriate size instead of dividing it so that each object is individually included. The state information generation unit 102 may then be configured to generate the state information of the object from the divided images of this appropriate size.
[0050] <Generation device (generation information generation unit)> The generation information generation unit 103 includes a target filter unit 1031 and a related filter unit 1032. The target filter unit 1031 selects the state information of the object corresponding to the question text from the state information of each object.
[0051] The related filter unit 1032 further selects the state information related to the question text from the state information selected by the target filter unit 1031. More specifically, the related filter unit 1032 selects the state information related to the scene related to the question text from the selected state information.
[0052] In other words, in the generation information generation unit 103, the state information is filtered for the object corresponding to the question text using the target filter unit 1031, and further filtered for the scene related to the question text using the related filter unit 1032. Thereby, the generation information generation unit 103 generates the generation information. Details of the target filter unit 1031 and the related filter unit 1032 will be described later.
[0053] In this example, the generated information generation unit 103 is shown to include a target filter unit 1031 and a related filter unit 1032, but any configuration including at least one of the target filter unit 1031 and the related filter unit 1032 is acceptable.
[0054] In other words, if the generated information generation unit 103 does not have a related filter unit 1032 but does have a target filter unit 1031, it selects the state information of the target corresponding to the question text from among the state information of each target. If the generated information generation unit 103 does not have a target filter unit 1031 but does have a related filter unit 1032, it selects the state information related to the question text from among the state information of each target.
[0055] Furthermore, if the generation system 50 does not include a target filter unit 1031 but does include a related filter unit 1032, it may be configured not to include an identification unit 101. In other words, the generation system 50 may be configured to generate generated information by selecting state information related to the question text from the state information of the target, without identifying the target.
[0056] In this way, the generation information generation unit 103 filters the state information, allowing appropriate generation information to be input to the model device 200. As a result, appropriate response text can be obtained using the model device 200.
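The two-stage filtering, with either stage optional but at least one present, can be sketched as below. The function `generate_information`, the dictionary layout of a piece of state information, and the two example predicates are hypothetical; how each filter decides relevance is not specified in the source.

```python
def generate_information(states, question, target_filter=None, related_filter=None):
    """Apply the target filter, then the related (scene) filter, to state information.

    Each filter is a callable (state, question) -> bool. At least one is required.
    """
    if target_filter is None and related_filter is None:
        raise ValueError("at least one of target_filter and related_filter is required")
    selected = list(states)
    if target_filter is not None:
        selected = [s for s in selected if target_filter(s, question)]
    if related_filter is not None:
        selected = [s for s in selected if related_filter(s, question)]
    return selected


# Hypothetical predicates standing in for the two filter units.
def is_target_a(state, question):
    return state["object"] == "A"


def is_after_entry(state, question):
    return state["scene"] == "after entry"


states = [
    {"object": "A", "scene": "before entry", "text": "Walked toward the entrance"},
    {"object": "A", "scene": "after entry", "text": "Picked up a drink"},
    {"object": "B", "scene": "after entry", "text": "Picked up a rice ball"},
]
both = generate_information(states, "What did A buy in the store?", is_target_a, is_after_entry)
```

Passing only `related_filter` corresponds to the configuration without a target filter unit, in which the objects need not be identified.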
[0057] <Filtering> Next, an example of obtaining answer text from question text will be explained with reference to Figures 7 and 8. Figures 7 and 8 are diagrams showing an example of the operation of the generation system related to this disclosure. Hereafter, the reference numerals shown in Figures 5 and 6 will be used as appropriate.
[0058] <Filtering based on the target> Referring to Figure 7, an example of filtering based on the target according to the question text to obtain the answer text will be explained. In this case, the generation device 100 shown in Figure 5 is configured to include an identification unit 101, a state information generation unit 102 (division unit 1021), and a generation information generation unit 103 (target filter unit 1031).
[0059] As shown in Figure 7, the generation system 50 receives multiple images I1 to In (where n is a natural number) and a question text Q1 as input.
[0060] As shown in Figure 7, the identification unit 101 identifies each object by tracking the appearance time and location of each object included in the multiple images I1 to In (step S1). Here, it is assumed that the multiple images I1 to In include objects A and B.
[0061] Next, as shown in Figure 7, the division unit 1021 divides the multiple images I1 to In so that each of the objects (A and B) identified by the identification unit 101 is individually included, and generates divided images IDA and IDB (step S2). As shown in Figure 7, divided image IDA is an image divided so that object A is included, and object A is included in each of the images IDA1 to IDAn (where n is a natural number). Divided image IDB is an image divided so that object B is included, and has a similar structure to divided image IDA.
[0062] Next, as shown in Figure 7, the state information generation unit 102 generates state information for each identified object from the divided images based on the instruction text (step S3). The instruction text is, for example, an instruction based on information about the object type contained in the question text.
[0063] As a result, as shown in Figure 7, the state information generation unit 102 generates state information DA for target A and state information DB for target B. State information DA1 shown in Figure 7 is, for example, "entered the store at the beginning of the video," and state information DA2 is, for example, "picked up a drink." In other words, state information DA for target A contains multiple pieces of state information DA1 to DAn (where n is a natural number). The same applies to state information DB for target B. In other words, as shown in Figure 7, the generation system 50 generates a list of state information for each target.
[0064] Next, as shown in Figure 7, the generated information generation unit 103 uses the target filter unit 1031 to select the state information of the target corresponding to the question text from the state information of each target and generates generated information (step S4). This filters the state information of each target down to the state information of target A or B corresponding to the question text. The generated information generation unit 103 then uses the filtered state information as generated information.
[0065] Next, as shown in Figure 7, the model device 200 takes the question text Q1 and the generated information from step S4 as input and generates the answer text A1 using the language model LM (step S5). This provides the answer text A1 for the question text Q1.
[0066] In this way, the generation system 50 focuses on a specific object among the objects contained in multiple images, filters them, and creates generated information. The model device 200 uses this generated information to generate the response text. With this configuration, even if an object is included across multiple images, the object can be identified and state information for each identified object can be generated. Therefore, it is possible to generate generated information by focusing on a specific object. As a result, the generation system 50 inputs the filtered generated information into the language model LM, and can obtain an appropriate response text.
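The final step, in which the question text and the filtered generated information are input to the language model LM, can be sketched as follows. The prompt layout and the names `build_prompt` and `answer` are assumptions; the source states only that both are input to the model. The `language_model` argument is a placeholder for any callable that maps a prompt string to text and does not refer to a specific library.

```python
def build_prompt(question, generation_info):
    """Combine the filtered state information and the question into one prompt."""
    facts = "\n".join(f"- {line}" for line in generation_info)
    return f"Observations:\n{facts}\n\nQuestion: {question}\nAnswer:"


def answer(question, generation_info, language_model):
    """Ask the language model for an answer text.

    `language_model` is any callable taking a prompt string and returning text.
    """
    return language_model(build_prompt(question, generation_info))


prompt = build_prompt(
    "What did the first person to enter the store pick up?",
    ["entered the store at the beginning of the video", "picked up a drink"],
)
```

Because only the filtered state information reaches the prompt, the model is not asked to sift through descriptions of unrelated objects.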
[0067] <Filtering related to relevant scenes> Referring to Figure 8, an example of filtering related to scenes in the question text to obtain an answer will be explained. In this case, the generation device 100 shown in Figure 5 will be configured to include a state information generation unit 102 (division unit 1021) and a generation information generation unit 103 (related filter unit 1032). Furthermore, the generation device 100 will be described as a configuration that does not include an identification unit 101.
[0068] As shown in Figure 8, the generation system 50 receives multiple images I1 to In (where n is a natural number) and question text Q2 as input.
[0069] As shown in Figure 8, since the objects have not been identified, the division unit 1021 divides the multiple images into appropriate sizes and generates divided images ID (step S22). As shown in Figure 8, the divided images ID include multiple images ID1 to IDn (where n is a natural number). In other words, as shown in Figure 8, the division unit 1021 divides the input images I1 to In (where n is a natural number) into short clip images ID1 to IDn (where n is a natural number).
[0070] Next, as shown in Figure 8, the state information generation unit 102 generates state information about the objects from the divided images based on the instruction text (step S33). The instruction text is, for example, an instruction based on information about the object type contained in the question text.
[0071] Unlike Figure 7, in Figure 8 the objects are not identified, so state information is not generated for each object. Instead, state information D is generated for all objects included in images ID1 to IDn. State information D contains multiple pieces of state information D1 to Dn (where n is a natural number).
[0072] Next, as shown in Figure 8, the generated information generation unit 103 uses the related filter unit 1032 to select state information related to the scene associated with the question text from the target state information and generates generated information (step S44). This filters the target state information D to include scenes related to the question text. The generated information generation unit 103 then uses the filtered state information as generated information.
[0073] This will be explained in more detail. State information D includes state information before entering the store and state information after entering the store. For example, if the question text is "Who bought coffee after entering the store?", the scene related to the question text is "after entering the store." In this case, from the state information before entering the store and the state information after entering the store, the generated information generation unit 103 selects the state information after entering the store and uses it as generated information. The generated information generation unit 103 may also associate the start time and end time of the related scene with the state information after entering the store and use that as generated information. That is, the generated information generation unit 103 may generate generated information that includes the start time and end time of the related scene.
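The "after entering the store" example can be sketched as a time-based selection that also records the start time and end time of the related scene. This is a sketch under assumptions: the state information is assumed to carry timestamps, the scene is assumed to begin at the entry event and to include it, and the function name `select_scene`, the returned dictionary layout, and the sample sentences are hypothetical.

```python
def select_scene(states, scene_start_text):
    """Select the state information from the scene that begins at a given event.

    `states` is a list of (time, text) pairs. Returns a dict holding the scene's
    start time, end time, and state information, or None if the event is absent.
    """
    ordered = sorted(states)
    start = next((t for t, text in ordered if text == scene_start_text), None)
    if start is None:
        return None
    # Assumption: the entry event itself belongs to the scene.
    selected = [(t, text) for t, text in ordered if t >= start]
    return {
        "start": start,
        "end": selected[-1][0],
        "states": [text for _, text in selected],
    }


# Hypothetical timestamps in seconds.
states = [
    (3, "Walked through the parking lot"),
    (8, "Entered the store"),
    (15, "Picked up a coffee"),
    (21, "Paid at the register"),
]
scene = select_scene(states, "Entered the store")
```

Carrying the start and end times alongside the selected sentences also makes it straightforward to locate the corresponding images if the related scene is to be shown to a user.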
[0074] Although not shown in Figure 8, the generation device 100 may also be configured to search for scenes related to the question text from multiple images in step S44 and extract those related scenes. This allows the user to confirm the extracted related scenes.
[0075] Next, as shown in Figure 8, the model device 200 takes the question text Q2 and the generated information from step S44 as input and generates the answer text A2 using the language model LM (step S5). This provides the answer text A2 for the question text Q2.
[0076] In this way, the generation system 50 creates generated information by filtering and focusing on scenes related to the question text. The model device 200 generates the answer text using the generated information filtered to scenes related to the question text. Therefore, the generation system 50 can obtain highly accurate answer text.
[0077] <Example Answer> Next, a concrete example of obtaining answer text from question text will be explained with reference to Figure 9. Figure 9 is a diagram showing an example of the operation of the generation system according to this disclosure. The example in Figure 9 performs both filtering by target and filtering by related scene.
[0078] In this case, the generation device 100 shown in Figure 5 is configured to include an identification unit 101, a state information generation unit 102 (dividing unit 1021), and a generation information generation unit 103 (target filter unit 1031 and related filter unit 1032).
[0079] As shown in Figure 9, the generation system 50 receives an image group I11 consisting of multiple images and a question text Q100 as input. The image group I11 includes pre-entry images captured before the store is entered and images of the store interior, and it includes target persons U1, U2, and U3. The question text Q100 is "What did the first person to enter the store in the image purchase?"
[0080] As shown in Figure 9, the identification unit 101 identifies each of the target persons U1, U2, and U3 included in the image group I11 by tracking their appearance time and position (step S1). For example, within the image group I11, the identification unit 101 determines that the person in the pre-entry image shown in the upper part of Figure 9 and the person in the post-entry image shown in the lower part of Figure 9 are the same person, and identifies that person as target person U1. The same applies to target persons U2 and U3.
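As a non-limiting illustration, identification by tracking appearance time and position can be sketched as a greedy nearest-neighbour association. This particular matching rule, the thresholds `max_gap` and `max_dist`, and the sample detections are assumptions made for this sketch; the disclosure does not state which tracking method the identification unit 101 uses.

```python
import math


def identify_objects(detections, max_gap=2.0, max_dist=50.0):
    """Assign an identifier to each detection (time, x, y).

    A detection joins the nearest existing track whose last detection is
    earlier by at most max_gap and closer than max_dist; otherwise it
    starts a new track.
    """
    tracks = []  # last known (time, x, y) of each track
    labelled = []
    for t, x, y in sorted(detections):
        best, best_d = None, max_dist
        for i, (lt, lx, ly) in enumerate(tracks):
            d = math.hypot(x - lx, y - ly)
            # 0 < gap keeps two detections in the same frame in separate tracks
            if 0 < t - lt <= max_gap and d <= best_d:
                best, best_d = i, d
        if best is None:
            tracks.append((t, x, y))
            best = len(tracks) - 1
        else:
            tracks[best] = (t, x, y)
        labelled.append((t, x, y, f"U{best + 1}"))
    return labelled


# Hypothetical detections: (time in seconds, x, y) of person centres.
detections = [
    (0, 10, 10), (0, 200, 100),
    (1, 12, 10), (1, 205, 100),
    (2, 14, 11),
    (5, 300, 300),
]
labels = [label for _, _, _, label in identify_objects(detections)]
```

A production tracker would typically also use appearance features so that a person can be re-identified after a long gap, such as between the pre-entry and post-entry images.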
[0081] Next, as shown in Figure 9, the division unit 1021 divides the image group I11 so that each of the target persons U1, U2, and U3 identified by the identification unit 101 is individually included, and generates a divided image group I12 (step S2). In the divided image group I12 shown in Figure 9, the divided images in the upper, middle, and lower parts are those of target persons U1, U2, and U3, respectively.
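As a non-limiting illustration, dividing an image so that each identified person is individually included can be sketched as cropping one region per person. The bounding-box representation and the nested-list image are assumptions made for this sketch; the disclosure does not specify how the division unit 1021 determines the regions.

```python
def divide_image(image, boxes):
    """Crop one sub-image per identified object.

    image: 2D list of pixel rows.
    boxes: {object_id: (top, left, bottom, right)}, bottom/right exclusive.
    """
    return {
        obj_id: [row[left:right] for row in image[top:bottom]]
        for obj_id, (top, left, bottom, right) in boxes.items()
    }


# Hypothetical 4x6 image whose pixel value encodes its row and column.
image = [[10 * r + c for c in range(6)] for r in range(4)]
# Hypothetical bounding boxes for two identified persons.
boxes = {"U1": (0, 0, 2, 2), "U2": (1, 3, 4, 6)}

divided = divide_image(image, boxes)
```

Applying the same function to every image in the image group would yield one sequence of divided images per person.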
[0082] Next, as shown in Figure 9, the state information generation unit 102 generates state information for each of the identified target persons U1, U2, and U3 from the divided image group I12, based on the instruction text (step S3).
[0083] Specifically, based on the instruction text "Describe the actions of the person in the image," the state information generation unit 102 generates state information DA10 for target person U1. The state information DA10 includes statements such as "He entered the store at the beginning of the image," "He picked up an item," "He purchased a drink," and "He was using his mobile phone in front of the store."
[0084] In Figure 9, the state information for target persons U2 and U3 is omitted, but the state information generation unit 102 generates state information for target persons U2 and U3 in the same manner. Also, although not shown in Figure 9, each piece of the state information DA10 for target person U1 may be numbered.
[0085] The state information generation unit 102 determines the instruction text, for example, as follows. Based on the information about the target species "person" included in the question text "What did the first person to enter the store in the image purchase?", the state information generation unit 102 determines the instruction text "Describe the actions of the person in the image".
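As a non-limiting illustration, deriving the instruction text from the target species in the question text can be sketched as a keyword lookup. The keyword lists, the "vehicle" entry, and the fallback instruction are hypothetical and exist only for this sketch; the disclosure does not specify how the target species is extracted.

```python
import re

# Hypothetical instruction templates per target species.
INSTRUCTION_TEMPLATES = {
    "person": "Describe the actions of the person in the image.",
    "vehicle": "Describe the movement of the vehicle in the image.",
}
# Hypothetical keywords that indicate each target species.
SPECIES_KEYWORDS = {
    "person": ("person", "people", "who"),
    "vehicle": ("car", "vehicle", "truck"),
}


def determine_instruction(question_text):
    """Return the instruction text for the target species found in the question."""
    words = re.findall(r"[a-z]+", question_text.lower())
    for species, keywords in SPECIES_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return INSTRUCTION_TEMPLATES[species]
    return "Describe what is happening in the image."


instruction = determine_instruction(
    "What did the first person to enter the store in the image purchase?"
)
```

A keyword lookup is the simplest possible choice; the same decision could instead be delegated to a language model.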
[0086] Next, as shown in Figure 9, the generated information generation unit 103 uses the target filter unit 1031 to select, from the state information of target persons U1, U2, and U3, the state information that corresponds to the question text (step S45).
[0087] Specifically, the generated information generation unit 103 determines the instruction text "Generate generated information about the person who entered the store first" based on the phrase "the first person to enter the store in the image" in the question text. As a result, the generated information generation unit 103 selects the state information of target person U1 from the state information of target persons U1, U2, and U3.
[0088] The generated information generation unit 103 thus filters by focusing on the target person U1 corresponding to the question text. Therefore, even if target persons U1, U2, and U3 appear across the images (image group I11), the unit can focus on the target person U1 corresponding to the question text.
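As a non-limiting illustration, selecting the state information of the first person to enter the store can be sketched as follows. The timestamps, the sample statements for U2 and U3, and the substring test for the entry event are assumptions made for this sketch; the disclosure does not specify how the target filter unit 1031 decides which person matches the question text.

```python
def entry_time(events):
    """Return the earliest time at which an entry event is described, or None."""
    times = [t for t, text in events if "entered the store" in text]
    return min(times) if times else None


def select_first_to_enter(state_by_person):
    """Return (person_id, state information) of the first person to enter."""
    entered = {}
    for person_id, events in state_by_person.items():
        t = entry_time(events)
        if t is not None:
            entered[person_id] = t
    if not entered:
        return None, []
    first = min(entered, key=entered.get)
    return first, state_by_person[first]


# Hypothetical state information per person as (time in seconds, statement).
state_by_person = {
    "U1": [
        (3.0, "He was using his mobile phone in front of the store."),
        (10.0, "He entered the store."),
        (25.0, "He purchased a drink."),
    ],
    "U2": [(18.0, "She entered the store.")],
    "U3": [(30.0, "He walked past the store.")],
}

target_id, target_state = select_first_to_enter(state_by_person)
```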
[0089] Next, as shown in Figure 9, the generated information generation unit 103 uses the related filter unit 1032 to select, from the state information of target person U1, the state information for the scene related to the question text, and generates generated information (step S46).
[0090] Specifically, the generated information generation unit 103 determines the instruction text "Generate generated information regarding what happened after entering the store" from the phrases "enter the store" and "purchase" in the question text. The state information of target person U1 selected in step S45 consists of "He entered the store at the beginning of the image," "He picked up an item," "He purchased a drink," and "He was using his mobile phone in front of the store."
[0091] From this state information, the generated information generation unit 103 selects the items related to what happened after entering the store, namely "He entered the store at the beginning of the image," "He picked up an item," and "He purchased a drink," and uses them as the generated information. In this way, the generated information generation unit 103 narrows the generated information down to the scenes related to the question text. The generated information generation unit 103 may also generate generated information that includes the start and end times of the related scenes.
[0092] Next, as shown in Figure 9, the model device 200 takes the question text Q100 and the generated information from step S46 as input and uses the language model LM to generate the answer text A100 (step S5). As a result, for the question text Q100 "What did the first person to enter the store in the image purchase?", the answer text A100 "The first person to enter the store in the image purchased a drink." is obtained.
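As a non-limiting illustration, the input to the model device 200 can be sketched as a prompt that combines the question text with the generated information. The prompt wording and the `placeholder_model` function are assumptions made for this sketch; the placeholder only stands in for the language model LM and does not produce a real answer.

```python
def build_model_input(question_text, generated_information):
    """Combine the generated information and the question text into one prompt."""
    context = "\n".join(f"- {item}" for item in generated_information)
    return (
        "Answer the question using only the information below.\n"
        f"Information:\n{context}\n"
        f"Question: {question_text}\n"
        "Answer:"
    )


def generate_answer(language_model, question_text, generated_information):
    """Pass the prompt to any callable that maps a prompt string to text."""
    return language_model(build_model_input(question_text, generated_information))


def placeholder_model(prompt):
    # Stand-in for the language model LM; it only reports the prompt size.
    return f"[model output for a prompt of {len(prompt.splitlines())} lines]"


question = "What did the first person to enter the store in the image purchase?"
generated_information = [
    "He entered the store at the beginning of the image.",
    "He picked up an item.",
    "He purchased a drink.",
]
prompt = build_model_input(question, generated_information)
answer = generate_answer(placeholder_model, question, generated_information)
```

Because the filtered generated information is short, the prompt contains only statements about the relevant person and scene, which is the effect the filtering steps aim for.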
[0093] In this way, the generation system 50 filters the objects included in multiple images to focus on a specific object, and further filters to focus on a scene related to the question text, to create generated information. The model device 200 uses this generated information to generate the answer text. With this configuration, even if multiple objects are included across images, it is possible to obtain highly accurate answer text related to the appropriate object.
[0094] In other words, the generation system 50 filters all the state information by referring to the question text, and generates appropriate generated information based on the target and related scenes. The generation system 50 then inputs this generated information to the model device 200. As a result, a more appropriate answer text can be obtained for the question text.
[0095] Figure 9 shows an example configuration in which the generation system 50 filters by target first and then by related scene. However, the configuration is not limited to this, and the generation system 50 may instead filter by related scene first and then by target.
[0096] Figure 9 also shows an example in which the generation system 50 obtains answer text for a question text about a store. However, the generation system 50 is not limited to this and may also be used for reports and records created in writing, such as investigation reports on facility security, work records for patient care operations, and care records.
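As a non-limiting illustration, the interchangeable order of the two filters can be sketched with two independent selection functions. The record layout, the scene labels, and the sample statement for U2 are assumptions made for this sketch, and the target person is assumed to have been determined beforehand.

```python
def target_filter(records, target_id):
    """Keep the state information of one target person."""
    return [r for r in records if r["person"] == target_id]


def scene_filter(records, scene):
    """Keep the state information of one scene."""
    return [r for r in records if r["scene"] == scene]


# Hypothetical state information with person and scene labels.
records = [
    {"person": "U1", "scene": "before_entry",
     "text": "He was using his mobile phone in front of the store."},
    {"person": "U1", "scene": "after_entry", "text": "He purchased a drink."},
    {"person": "U2", "scene": "after_entry", "text": "She picked up an item."},
]

target_then_scene = scene_filter(target_filter(records, "U1"), "after_entry")
scene_then_target = target_filter(scene_filter(records, "after_entry"), "U1")
```

Both orders give the same result here because each filter is a simple per-record test. If determining the target itself depends on events from other scenes, that determination would need to use the unfiltered state information.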
[0097] <Display Example> An example of the display of question text and answer text will be explained with reference to Figure 10. Figure 10 is a diagram showing an example of the display unit of a terminal equipped with the generation device and model device according to this disclosure. As shown in Figure 10, the display unit 500 displays an image input unit 501, a question text input unit 502, and an answer text display unit 503.
[0098] In the example shown in Figure 10, the image input unit 501 displays "Please select an image." The user can select and input an image stored in the terminal by touching the image input unit 501. The question text input unit 502 displays "Please enter the question text." The user can input the question text via the question text input unit 502, either by text or voice.
[0099] When a user inputs an image and a question text, the answer text display unit 503 displays the answer text corresponding to the question text. Furthermore, the system is not limited to a configuration where the answer text is displayed on the answer text display unit 503; it may also be configured to display a file containing the answer text, allowing the user to download that file.
[0100] <Configuration Example> Figure 11 is a block diagram showing an example configuration of the generation devices 10, 20, and 100 (hereinafter referred to as generation device 10, etc.) described above. Referring to Figure 11, generation device 10, etc. includes a network interface 1201, a processor 1202, and a memory 1203. The network interface 1201 may be used to communicate with a network node. The network interface 1201 may include, for example, a network interface card (NIC) compliant with the IEEE 802.3 series. IEEE stands for Institute of Electrical and Electronics Engineers.
[0101] The processor 1202 reads and executes software (computer programs) from the memory 1203, thereby performing the processing of the generation device 10, etc., as described using a flowchart in the above embodiment. The processor 1202 may be, for example, a microprocessor, an MPU, or a CPU. The processor 1202 may include multiple processors.
[0102] Memory 1203 is composed of a combination of volatile and non-volatile memory. Memory 1203 may include storage located away from the processor 1202. In this case, the processor 1202 may access memory 1203 via an I/O (Input/Output) interface, which is not shown.
[0103] In the example shown in Figure 11, memory 1203 is used to store a group of software modules. The processor 1202 can read these software modules from memory 1203 and execute them, thereby performing the processing of the generation device 10, etc., described in the above embodiment.
[0104] As explained using Figure 11, each processor in the generation device 10, etc., executes one or more programs that include a set of instructions for causing a computer to perform the algorithms described with reference to the drawings.
[0105] In the examples described above, the program includes a set of instructions (or software code) that, when loaded into a computer, causes the computer to perform one or more of the functions described in the embodiments. The program may be stored on a non-transitory computer-readable medium or a tangible storage medium. Examples of non-transitory computer-readable media or tangible storage media include, but are not limited to, random-access memory (RAM), read-only memory (ROM), flash memory, a solid-state drive (SSD) or other memory technologies, CD-ROM, digital versatile disc (DVD), Blu-ray® disc or other optical disc storage, and a magnetic cassette, magnetic tape, magnetic disk storage, or other magnetic storage devices. The program may also be transmitted over a transitory computer-readable medium or a communication medium. Examples of transitory computer-readable media or communication media include, but are not limited to, electrically, optically, acoustically, or otherwise propagated signals. The program may also be included in a program product.
[0106] Although the present disclosure has been described in accordance with the above embodiments, it goes without saying that the present disclosure is not limited to the configuration of the above embodiments, but also includes various modifications, alterations, and combinations that a person skilled in the art could make within the scope of the claims of the present patent application.
[0107] Each drawing is merely an example for illustrating one or more embodiments. Each drawing may be associated with one or more other embodiments, rather than being associated with only one specific embodiment. As those skilled in the art will understand, various features or steps described with reference to any one drawing can be combined with features or steps shown in one or more other drawings, for example, to create embodiments not explicitly shown or described. Not all features or steps shown in any one drawing to illustrate an exemplary embodiment are necessarily required, and some features or steps may be omitted. The order of steps described in any of the drawings may be changed as appropriate.
[0108] Some or all of the above embodiments may also be described as in the following notes, but are not limited to them.
(Note 1) A generation device for generating generated information used in a model that generates answer text from question text and the generated information, comprising: an identification unit that identifies each of the objects included in a plurality of images by tracking the appearance time and position of each object; a state information generation unit that generates state information for each of the identified objects from the plurality of images; and a generation information generation unit that generates the generated information based on the state information for each of the objects and the question text.
(Note 2) The generation device according to Note 1, wherein the generation information generation unit generates the generated information by selecting the state information for an object corresponding to the question text from the state information for each of the objects.
(Note 3) The generation device according to Note 1, wherein the state information generation unit generates divided images by dividing each of the plurality of images so that each of the identified objects is individually included, and generates the state information for each of the identified objects from the divided images.
(Note 4) The generation device according to Note 3, wherein the state information generation unit generates the state information for each of the identified objects from the divided images based on information about the target species contained in the question text.
(Note 5) The generation device according to Note 2, wherein the generation information generation unit further generates the generated information by selecting state information related to the question text from among the selected state information.
(Note 6) The generation device according to Note 5, wherein the state information related to the question text is state information relating to a scene related to the question text.
(Note 7) A generation device for generating generated information used in a model that generates answer text from question text and the generated information, comprising: a state information generation unit that generates state information for objects contained in a plurality of images; and a generation information generation unit that generates the generated information by selecting state information related to the question text from among the state information for the objects.
(Note 8) The generation device according to Note 7, wherein the state information related to the question text is state information relating to a scene related to the question text.
(Note 9) A generation system comprising: a model device that generates answer text from question text and generated information; and a generation device that generates the generated information used in the model device, wherein the generation device generates state information for each object included in a plurality of images, and generates the generated information by selecting, according to the question text, state information from among the state information for each object.
(Note 10) A generation method for generating generated information used in a model that generates answer text from question text and the generated information, wherein a computer performs processes to: identify each object by tracking the appearance time and position of each object included in a plurality of images; generate state information for each identified object from the plurality of images; and generate the generated information based on the state information for each object and the question text.
(Note 11) A program for generating generated information used in a model that generates answer text from question text and the generated information, the program causing a computer to perform the following processes: identify each object by tracking the appearance time and position of each object included in a plurality of images; generate state information for each identified object from the plurality of images; and generate the generated information based on the state information for each object and the question text.
[0109] Some or all of the elements (e.g., configuration and function) described in Notes 2 to 6, which are dependent on Note 1, may also be made dependent on Notes 7 and 9 to 11 in the same way. Some or all of the elements described in any note may be applied to various hardware, software, recording means for recording software, systems, and methods.
[0110] This application claims priority based on Japanese Patent Application No. 2024-230975, filed on 26 December 2024, and incorporates all of its disclosures herein.
[0111]
10, 100 Generation Device
11, 101 Identification Unit
12, 22, 102 State Information Generation Unit
13, 23, 103 Generation Information Generation Unit
200 Model Device
1031 Target Filter Unit
1032 Related Filter Unit
1201 Network Interface
1202 Processor
1203 Memory
A1, A2, A100 Answer Text
DA, DA10, DB, D State Information
IDA, IDB, ID Divided Image
I11 Image Group
I12 Divided Image Group
Q1, Q2, Q100 Question Text
Claims
1. A generation device for generating generated information used in a model for generating answer text from question text and generated information, comprising: an identification unit that identifies each object by tracking the appearance time and position of each object included in a plurality of images; a state information generation unit that generates state information for each of the identified objects from the plurality of images; and a generation information generation unit that generates generated information based on the state information for each of the objects and the question text.
2. The generation device according to claim 1, wherein the generation information generation unit generates the generated information by selecting the state information of an object corresponding to the question text from among the state information of each of the objects.
3. The generation device according to claim 1 or 2, wherein the state information generation unit generates divided images by dividing each of the plurality of images so that each of the identified objects is individually included, and generates the state information of each of the identified objects from the divided images.
4. The generation device according to claim 3, wherein the state information generation unit generates, from the divided images, the state information for each of the identified objects of the target species, based on information about the target species contained in the question text.
5. The generation device according to claim 2, wherein the generation information generation unit further generates the generated information by selecting state information related to the question text from among the selected state information.
6. The generation device according to claim 5, wherein the state information related to the question text is state information relating to a scene related to the question text.
7. A generation device for generating generated information used in a model that generates answer text from question text and generated information, comprising: a state information generation unit that generates state information of an object included in a plurality of images; and a generation information generation unit that generates generated information by selecting state information related to the question text from the state information of the object.
8. The generation device according to claim 7, wherein the state information related to the question text is state information relating to a scene related to the question text.
9. A generation system comprising: a model device that generates answer text from question text and generated information; and a generation device that generates the generated information used in the model device, wherein the generation device generates state information for each object included in a plurality of images, and generates the generated information by selecting the state information of each object corresponding to the question text from among the state information of each object.
10. A generation method for generating generated information used in a model that generates answer text from question text and generated information, wherein a computer performs the following processes: identifying each object by tracking the appearance time and position of each object included in a plurality of images; generating state information for each identified object from the plurality of images; and generating the generated information based on the state information for each object and the question text.
11. A program for generating generated information used in a model for generating answer text from question text and generated information, the program causing a computer to perform the following processes: identify each object by tracking the appearance time and position of each object included in a plurality of images; generate state information for each identified object from the plurality of images; and generate the generated information based on the state information for each object and the question text.