Method and system for searching for event in captured image
The method and system address the inefficiency of manual video streaming by classifying events in video data using a multimodal model, enabling efficient and user-friendly event search through embedding vector similarity matching.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- MOTOV CO LTD
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-18
Abstract
Description
Method and System for Searching for Events in Recorded Video
[0001] The present disclosure relates to a method and system for searching for events in captured video. More specifically, the present disclosure relates to a method and system for searching for predefined events in black box video.
[0002] Previously, to find the part of black box video data in which a specific event was recorded, one had to download the video data and manually play back the recordings one by one.
[0003] Accordingly, there is a need for a technology that classifies specific events from video data recorded by a black box and provides video data for a corresponding event when a user performs a search for that specific event.
[0004] The technical problem to be solved through some embodiments of the present disclosure is to provide a method and system for searching for an event in a captured video, which classifies a specific event in video data recorded by a black box and provides video data for the corresponding event when a user performs a search for the specific event.
[0005] The technical problems of the present disclosure are not limited to those mentioned above, and other unmentioned technical problems will be clearly understood by a person skilled in the art of the present disclosure from the description below.
[0006] A method for searching for events in a captured image according to some embodiments of the present disclosure for solving the aforementioned technical problem may include, in a method performed by a computing system, the steps of: acquiring a real-time captured image; acquiring an image sequence for a predefined major event from the real-time captured image; generating an event embedding vector representing data for the image sequence using a multimodal model and storing the event embedding vector in an event database; receiving an event search query in natural language form from a user terminal; performing preprocessing on the event search query and acquiring a search target in a structured format; generating a search embedding vector for the search target using the multimodal model; performing a search on a first embedding vector included in the event database using the search embedding vector; and transmitting data for an event corresponding to a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value as a result of the search to the user terminal.
[0007] In one embodiment, the step of acquiring the image sequence may include: identifying an object included in a frame, which is a captured image of the real-time video; determining whether the identified object satisfies the conditions of the major event; and tagging the frame as the major event using the result of the determination.
[0008] In one embodiment, the tagging step may include generating text data describing the frame and mapping the text data to the frame and storing it.
[0009] In one embodiment, the event embedding vector may be generated based on the visual features of the image sequence, and the search embedding vector may be generated based on the text of the event search query.
[0010] In one embodiment, the second embedding vector includes a third embedding vector and a fourth embedding vector, and the step of transmitting to the user terminal may include, when the first similarity between the search embedding vector and the third embedding vector is higher than the second similarity between the search embedding vector and the fourth embedding vector, the step of assigning a higher priority to the data for an event corresponding to the third embedding vector than to the data for an event corresponding to the fourth embedding vector.
[0011] An event search system in a captured image according to some embodiments of the present disclosure for solving the technical problem described above comprises: a communication interface; a memory into which a computer program is loaded; and one or more processors that execute the computer program, wherein the computer program may include instructions for executing: an operation of acquiring a real-time captured image; an operation of acquiring an image sequence for a predefined major event from the real-time captured image; an operation of generating an event embedding vector representing data for the image sequence using a multimodal model and storing the event embedding vector in an event database; an operation of receiving an event search query in natural language form from a user terminal; an operation of performing preprocessing on the event search query and acquiring a search target in a structured format; an operation of generating a search embedding vector for the search target using the multimodal model; an operation of performing a search on a first embedding vector included in the event database using the search embedding vector; and an operation of transmitting, to the user terminal, data for an event corresponding to a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value as a result of the search.
[0012] FIG. 1 is a system configuration diagram for explaining the configuration and operation of an event search system according to some embodiments of the present disclosure.
[0013] FIG. 2 is a flowchart for explaining a method for searching for events in a captured image according to some embodiments of the present disclosure.
[0014] FIG. 3 is a detailed flowchart for explaining a method for searching for events in a captured image according to some embodiments of the present disclosure, described with reference to FIG. 2.
[0015] FIG. 4 is a diagram illustrating a method for obtaining an image sequence of a major event according to some embodiments of the present disclosure.
[0016] FIG. 5 is a drawing for explaining a method for searching for events in a captured image according to some embodiments of the present disclosure.
[0017] FIG. 6 illustrates an exemplary computing device capable of implementing systems according to some embodiments of the present disclosure.
[0018] Hereinafter, various embodiments of the present disclosure will be described in detail with reference to the attached drawings. The advantages and features of the present disclosure and the methods for achieving them will become clear by referring to the embodiments described below in detail together with the attached drawings. However, the technical concept of the present disclosure is not limited to the following embodiments but can be implemented in various different forms. The following embodiments are provided merely to complete the technical concept of the present disclosure and to fully inform those skilled in the art of the scope of the present disclosure, and the technical concept of the present disclosure is defined only by the scope of the claims.
[0019] In describing the various embodiments of the present disclosure, if it is determined that a detailed description of related known configurations or functions could obscure the essence of the present disclosure, such detailed description is omitted.
[0020] Unless otherwise defined, terms used in the following embodiments (including technical and scientific terms) may be used in a meaning commonly understood by those skilled in the art to which this disclosure pertains, but this may vary depending on the intent of those skilled in the art, case law, the emergence of new technology, etc. The terms used in this disclosure are for describing the embodiments and are not intended to limit the scope of this disclosure.
[0021] In the following embodiments, singular expressions include plural concepts unless the context clearly specifies them as singular. Additionally, plural expressions include singular concepts unless the context clearly specifies them as plural.
[0022] In addition, terms such as first, second, A, B, (a), (b), etc. used in the following embodiments are used merely to distinguish one component from another, and the essence, order, or sequence of the said component is not limited by such terms.
[0023] Hereinafter, various embodiments of the present disclosure will be described in detail with reference to the attached drawings.
[0024] Hereinafter, with reference to FIG. 1, the configuration and operation of an event search system according to some embodiments of the present disclosure will be described. FIG. 1 is a system configuration diagram for explaining the configuration and operation of an event search system according to some embodiments of the present disclosure.
[0025] As illustrated in FIG. 1, an event search system may be configured to include a cloud system (10), an on-device device (20), and a user terminal (30). The cloud system (10) may be configured to include a multimodal system (11) and an event database (12). The on-device device (20) may be configured to include a camera module (21), a sensor module (22), and a storage module (23). However, the scope of the present disclosure is not limited thereto. In some cases, the event search system may be configured to include additional modules / devices / systems not illustrated in FIG. 1. Alternatively, the event search system may be configured in a form in which at least some of the components (10 to 30) illustrated in FIG. 1 are excluded.
[0026] The on-device device (20) can capture normal video at specific times using the built-in camera module (21). The storage module (23) can store the captured video along with metadata (e.g., time of shooting, shooting location, etc.) for the video in a storage device (not shown) built into the on-device device (20). The sensor module (22) can detect various data (e.g., sound data, etc.) regarding the video captured by the camera module (21) and store this as metadata for the captured video in a storage device (not shown) built into the on-device device (20).
[0027] The on-device device (20) can transmit real-time captured video and metadata about the real-time captured video to the cloud system (10).
[0028] The cloud system (10) can obtain real-time captured video and metadata regarding the real-time captured video from the on-device device (20). The cloud system (10) can obtain an image sequence of a predefined major event from the real-time captured video. The major event may be predefined by an administrator of the cloud system (10), etc. For example, the major event may be related to a vehicle accident, excessive vibration, a collision, etc. The image sequence may be a set of frames of video captured before and after the time when the major event occurred. The frames may be, for example, still images at one-second intervals. However, the scope of the present disclosure is not limited thereto.
[0029] The multimodal system (11) can generate an event embedding vector representing data for the image sequence using a multimodal model and store the event embedding vector in an event database (12). The multimodal system (11) can generate an event embedding vector by vectorizing the images of the image sequence using a multimodal model.
[0030] The user terminal (30) can receive an event search query in the form of natural language from the user. For example, the user can input an event search query in the form of text or voice through the user interface. The event search query may be a query to search for a specific event from a captured video. The user terminal (30) can transmit the event search query to the cloud system (10).
[0031] The cloud system (10) can perform preprocessing on the event search query and obtain a search target in a standardized format. For example, the cloud system (10) can use a Large Language Model (LLM) installed within the cloud system (10) to perform preprocessing on the event search query and obtain a search target in a standardized format. However, the scope of the present disclosure is not limited thereto, and the cloud system (10) can obtain a search target in a standardized format by performing preprocessing on the event search query in various ways. A method for obtaining a search target will be explained in detail with reference to FIG. 2.
[0032] The multimodal system (11) can generate a search embedding vector for the search target by using the same multimodal model that generated the event embedding vector. The cloud system (10) can search the first embedding vectors included in the event database (12) for a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value. The cloud system (10) can transmit data corresponding to the second embedding vector (e.g., the captured image and metadata for the captured image) to the user terminal (30).
[0033] According to the present embodiment, an image of a video captured by an on-device device and unstructured text data created by a user terminal are each vectorized using the same multimodal model, and by comparing each vector, an event corresponding to a user's request can be searched for in the captured video. That is, by converting two data of different forms into embedding vectors of the same form, comparison between the two data can be facilitated. Therefore, according to the present embodiment, since the user does not need to directly stream the captured video to find a specific event, there is an advantage in that user convenience is increased.
[0034] Each of the components (11 and 12) of the cloud system (10) described above may be implemented in at least one computing device. For example, all functions of the cloud system (10) may be implemented in one computing device, or a first function of the cloud system (10) may be implemented in a first computing device and a second function may be implemented in a second computing device. Alternatively, a specific function of the cloud system (10) may be implemented in multiple computing devices.
[0035] A computing device may include any device equipped with computing functions, and for an example of such a device, refer to FIG. 6. Since a computing device is a collection of various components (e.g., memory, processor, etc.) that interact, it may be referred to as a 'computing system' depending on the case. Of course, the term computing system may also encompass the concept of a collection of multiple computing devices that interact.
[0036] Meanwhile, in some embodiments, the components (10 to 30) of the event search system may communicate through a network. Here, the network may be implemented as any type of wired or wireless network, such as a Local Area Network (LAN), a Wide Area Network (WAN), a mobile radio communication network, or WiBro (Wireless Broadband Internet).
[0037] Up to now, the configuration and operation of an event search system according to several embodiments of the present disclosure have been described with reference to FIG. 1. Hereinafter, various methods that can be performed in the event search system described above will be described with reference to FIG. 2 and subsequent drawings.
[0038] For the sake of ease of understanding, the following description will continue by assuming that all steps / operations of the methods described below are performed in the aforementioned cloud system (10). Therefore, if the subject of a specific step / operation is omitted, it can be understood that it is performed in the cloud system (10). However, in an actual environment, some steps / operations of the methods described below may be performed on other computing devices.
[0039] Hereinafter, with reference to FIG. 2, a method for searching for an event in a captured image according to some embodiments of the present disclosure will be described. FIG. 2 is a flowchart for explaining a method for searching for an event in a captured image according to some embodiments of the present disclosure.
[0040] Referring to FIG. 2, the cloud system (10) can acquire real-time captured video from the on-device device (20) (S100). As previously described, the on-device device (20) can capture video using a camera module (21). The on-device device (20) can acquire various metadata (e.g., shooting time, shooting location, sound data, etc.) regarding the captured video using a sensor module (22). The storage module (23) can store the captured video and the metadata therefor in real-time in a storage device built into the on-device device (20) and transmit it to the cloud system (10).
[0041] Subsequently, the cloud system (10) can obtain an image sequence of a predefined major event from the real-time captured video (S200). The major event may be predefined by an administrator of the cloud system (10), etc. For example, the major event may be related to a vehicle accident, excessive vibration, a collision, etc. The image sequence may be a set of frames of video captured before and after the time when the major event occurs. The frames may be, for example, still images at one-second intervals. However, the scope of the present disclosure is not limited thereto.
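The acquisition of an image sequence around a major event can be illustrated with a minimal sketch. It assumes one still image per second, as in the example above; the `Frame` type, the `extract_image_sequence` function, and the three-second window before and after the event are hypothetical choices made only for illustration and are not specified by the present disclosure.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp_s: int   # capture time in whole seconds (one still image per second)
    image_id: str      # hypothetical handle to the stored still image


def extract_image_sequence(frames, event_time_s, before_s=3, after_s=3):
    """Return the frames captured from before_s seconds before the event
    to after_s seconds after it (both ends inclusive)."""
    start = event_time_s - before_s
    end = event_time_s + after_s
    return [f for f in frames if start <= f.timestamp_s <= end]


# Twenty seconds of video, one frame per second (assumed sampling rate).
frames = [Frame(t, f"frame_{t:04d}") for t in range(20)]

# Assume a major event was detected at t = 10 s.
sequence = extract_image_sequence(frames, event_time_s=10)
print([f.timestamp_s for f in sequence])
```

The window lengths are parameters so that a different amount of context before and after the event can be kept without changing the selection logic.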
[0042] Step S200 will be explained in more detail with reference to FIG. 3.
[0043] Subsequently, the multimodal system (11) can generate an event embedding vector representing data for the image sequence using a multimodal model and store the event embedding vector in an event database (12) (S300). The multimodal model is a model capable of processing images and text simultaneously, rather than a single type of data (e.g., text or image), and can learn the association between images and text by mapping text data and image data into a common embedding space. A CLIP model, etc., may be used as the multimodal model.
[0044] The data for the above image sequence may be data reflecting the real-time captured video and the metadata therefor. For example, the multimodal system (11) can use the multimodal model to generate an embedding vector for an image of a Model A vehicle and an embedding vector for the text "Model A vehicle" such that the two vectors are similar to each other. The multimodal system (11) can convert the text and the image into embedding vectors, respectively, and calculate the similarity between the converted embedding vectors. Therefore, even if a user inputs an event search query in text form to search for a specific event in the captured video, the cloud system (10) can easily search for the corresponding event from the real-time captured video.
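The similarity calculation between an image embedding and a text embedding can be sketched as follows. The present disclosure does not name a similarity measure; cosine similarity is assumed here because it is commonly used with CLIP-style models. The three-dimensional vectors are hand-written stand-ins, not outputs of a real multimodal model, and all names in the sketch are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for encoder outputs (NOT produced by a real model).
image_model_a = [0.9, 0.1, 0.2]   # image of a Model A vehicle
text_model_a = [0.8, 0.2, 0.1]    # text "Model A vehicle"
text_model_b = [0.1, 0.9, 0.3]    # text "Model B vehicle"

same_concept = cosine_similarity(image_model_a, text_model_a)
other_concept = cosine_similarity(image_model_a, text_model_b)

# The matching image/text pair scores higher than the mismatched pair.
print(same_concept > other_concept)
```

Because both modalities are mapped into one embedding space, the same function compares image-to-text, image-to-image, or text-to-text pairs.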
[0045] Subsequently, the cloud system (10) can receive an event search query in natural language form from a user terminal (30) (S400). For example, the user can input an event search query in text or voice form through a user interface. The event search query may be a query to search for a specific event from a captured video.
[0046] Subsequently, the cloud system (10) can perform preprocessing on the event search query and obtain a search target in a standardized format (S500). For example, the cloud system (10) can perform preprocessing on the event search query using a Large Language Model (LLM) installed within the cloud system (10) and obtain a search target in a standardized format. However, the scope of the present disclosure is not limited thereto, and the cloud system (10) can obtain a search target in a standardized format by performing preprocessing on the event search query in various ways.
[0047] For example, if the event search query is 'find a video of a truck collision at an intersection,' the cloud system (10) can perform preprocessing on the event search query using the LLM. As a result of the preprocessing, the cloud system (10) can obtain a search target for the event search query in a standardized format (e.g., JSON), such as 'Location: Intersection, Vehicle Type: Truck, Event: Collision.'
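The shape of such a search target can be illustrated with a small sketch. It deliberately replaces the LLM with a keyword lookup so that it runs on its own; the vocabularies, the field names (`location`, `vehicle_type`, `event`), and the function names are assumptions made for illustration and do not describe how the LLM-based preprocessing actually works.

```python
import json

# Hypothetical vocabularies; a real system would rely on the LLM instead.
LOCATIONS = {"intersection": "Intersection", "highway": "Highway"}
VEHICLE_TYPES = {"truck": "Truck", "bus": "Bus", "motorcycle": "Motorcycle"}
EVENTS = {"collision": "Collision", "accident": "Vehicle accident"}


def first_match(query, vocabulary):
    """Return the label of the first vocabulary keyword found in the query."""
    lowered = query.lower()
    for keyword, label in vocabulary.items():
        if keyword in lowered:
            return label
    return None


def preprocess_query(query):
    """Turn a natural-language query into a structured search target."""
    target = {
        "location": first_match(query, LOCATIONS),
        "vehicle_type": first_match(query, VEHICLE_TYPES),
        "event": first_match(query, EVENTS),
    }
    # Drop fields that the query did not mention.
    return {key: value for key, value in target.items() if value is not None}


target = preprocess_query("find a video of a truck collision at an intersection")
print(json.dumps(target))
```

Serializing the result as JSON gives the multimodal model a uniform input regardless of how the user phrased the query.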
[0048] Afterwards, the multimodal system (11) can generate a search embedding vector for the search target using the same multimodal model that generated the event embedding vector (S600).
[0049] Subsequently, the cloud system (10) can perform a search on the first embedding vectors included in the event database (12) using the search embedding vector (S700). That is, the cloud system (10) can search the first embedding vectors included in the event database (12) for a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value.
[0050] As a result of performing step S700, data (e.g., the captured image and metadata for the captured image) for an event corresponding to a second embedding vector whose similarity to the search embedding vector is greater than or equal to the threshold value can be transmitted to the user terminal (30) (S800).
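Steps S700 and S800 can be sketched as a threshold filter over the stored event embedding vectors. The sketch assumes cosine similarity and an in-memory dictionary standing in for the event database (12); the event identifiers, the vectors, and the threshold of 0.7 are hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_events(search_vector, event_db, threshold):
    """Return (event_id, similarity) pairs whose similarity is >= threshold."""
    hits = []
    for event_id, event_vector in event_db.items():
        similarity = cosine_similarity(search_vector, event_vector)
        if similarity >= threshold:
            hits.append((event_id, similarity))
    return hits


# Toy event database: event id -> event embedding vector (first embedding vectors).
event_db = {
    "event_001": [0.9, 0.1, 0.0],
    "event_002": [0.0, 1.0, 0.0],
    "event_003": [0.7, 0.7, 0.0],
}

# Events that pass the threshold correspond to the second embedding vectors.
hits = search_events([1.0, 0.0, 0.0], event_db, threshold=0.7)
print([event_id for event_id, _ in hits])
```

A linear scan is enough for a sketch; a deployment with many events would more likely use an approximate nearest-neighbor index, which the present disclosure does not discuss.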
[0051] The above event embedding vector may be generated based on the visual features of the image sequence, and the above search embedding vector may be generated based on the text of the event search query. That is, the above event embedding vector may be a vectorized image of the image sequence using a multimodal model, and the above search embedding vector may be a vectorized text of the event search query using the multimodal model. Although the image sequence and the above event search query are data of different forms, such as image and text formats, respectively, the cloud system (10) can generate a vector of the same form for each data using the same multimodal model and perform a similarity calculation based on the generated vector.
[0052] According to the present embodiment, an image of a video captured by an on-device device and unstructured text data created by a user terminal are each vectorized using the same multimodal model, and by comparing each vector, an event corresponding to a user's request can be searched for in the captured video. That is, by converting two data of different forms into embedding vectors of the same form, comparison between the two data can be facilitated. Therefore, according to the present embodiment, since the user does not need to directly stream the captured video to find a specific event, there is an advantage in that user convenience is increased.
[0053] Hereinafter, with reference to FIG. 3, a method for searching for an event in a captured image according to some embodiments of the present disclosure will be described. FIG. 3 is a detailed flowchart for describing a method for searching for an event in a captured image according to some embodiments of the present disclosure, which was described with reference to FIG. 2.
[0054] Referring to FIG. 3, the cloud system (10) can identify an object included in a frame, which is a captured image of a real-time video (S210). The frame is a still image at a specific point in time of the video, and may be a still image of the video at one-second intervals. However, the scope of the present disclosure is not limited thereto.
[0055] Afterward, the cloud system (10) can determine whether the identified object satisfies the conditions of a predefined major event (S220). The cloud system (10) can determine whether the identified object satisfies the conditions of a predefined major event by analyzing the motion of the identified object.
[0056] The cloud system (10) can tag the frame as the major event by using the result of the determination in step S220. That is, if it is determined that an object included in the frame satisfies the conditions of a predefined major event, the cloud system (10) can tag the frame as an image at the time when the major event occurred.
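The flow from object identification (S210) through condition checking (S220) to tagging can be sketched as follows. Object detection and motion analysis are outside the scope of the sketch: each detected object simply carries a hypothetical `colliding` flag that stands in for the result of that analysis, and the rule "at least two colliding objects" is an assumed example of a major-event condition.

```python
from dataclasses import dataclass, field


@dataclass
class DetectedObject:
    label: str
    colliding: bool = False   # hypothetical result of motion analysis


@dataclass
class Frame:
    timestamp_s: int
    objects: list = field(default_factory=list)   # objects identified in S210
    tags: list = field(default_factory=list)      # major events tagged on this frame


def satisfies_vehicle_accident(objects):
    """S220 stand-in: at least two identified objects are flagged as colliding."""
    return sum(1 for obj in objects if obj.colliding) >= 2


def tag_major_event(frame, event_name, condition):
    """Tag the frame with event_name when its objects satisfy the condition."""
    if condition(frame.objects):
        frame.tags.append(event_name)
    return frame


accident_frame = Frame(10, [DetectedObject("cargo truck", True),
                            DetectedObject("passenger car", True)])
quiet_frame = Frame(11, [DetectedObject("passenger car")])

for frame in (accident_frame, quiet_frame):
    tag_major_event(frame, "vehicle accident", satisfies_vehicle_accident)

print(accident_frame.tags, quiet_frame.tags)
```

Passing the condition as a function keeps the tagging step unchanged when an administrator defines a different major event.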
[0057] Hereinafter, with reference to FIG. 4, the embodiment described with reference to FIG. 3 will be further explained. FIG. 4 is a drawing for explaining a method for obtaining an image sequence of a major event according to some embodiments of the present disclosure.
[0058] Referring to FIG. 4, a frame (40) is shown as a captured image of a real-time video. By referring to the frame (40), the cloud system (10) can identify a first object (41), a 'white Model A cargo truck', and a second object (42), a 'black Model B passenger car', included in the frame (40).
[0059] The predefined major event (43) is assumed to be a 'vehicle accident'. The cloud system (10) can determine whether a vehicle accident (43) has occurred between the first object (41) and the second object (42) by analyzing the motion of the first object (41) and the second object (42). In this case, the cloud system (10) can make this determination by using not only the real-time video captured by the camera module (21) but also the sensor data detected by the sensor module (22).
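One way to combine video-derived and sensor-derived evidence is sketched below. The present disclosure does not specify the decision rule, so everything here is an assumption: motion analysis is reduced to a bounding-box overlap test in a single frame, the sensor data is reduced to one normalized `impact_level` reading, and the threshold of 0.8 is arbitrary.

```python
def boxes_overlap(box_a, box_b):
    """Boxes are (x_min, y_min, x_max, y_max) in pixels."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def is_vehicle_accident(box_a, box_b, impact_level, impact_threshold=0.8):
    """Assumed rule: the two objects touch in the image AND the sensor
    reading (normalized to 0..1) reaches the threshold."""
    return boxes_overlap(box_a, box_b) and impact_level >= impact_threshold


truck_box = (100, 100, 300, 250)   # first object (41), hypothetical coordinates
car_box = (280, 120, 450, 260)     # second object (42), hypothetical coordinates
far_box = (500, 100, 600, 200)     # an object that does not touch the truck

print(is_vehicle_accident(truck_box, car_box, impact_level=0.95))
print(is_vehicle_accident(truck_box, far_box, impact_level=0.95))
print(is_vehicle_accident(truck_box, car_box, impact_level=0.10))
```

Requiring both signals reduces false positives from vehicles that merely overlap in the image, for example when one passes in front of another.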
[0060] Frame (40) is an image of the moment when a vehicle accident (43) occurred between the first object (41) and the second object (42). Therefore, the cloud system (10) can tag Frame (40) as an event regarding the vehicle accident (43).
[0061] According to the present embodiment, by tagging the frame at the time when a predefined major event occurred, the time of occurrence of the major event can be more easily identified. Therefore, according to the present embodiment, there is an advantage in that an event corresponding to a user's event search query can be found and returned quickly.
[0062] Meanwhile, in one embodiment, the cloud system (10) can generate text data describing the frame and store the text data by mapping it to the frame.
[0063] This will be explained with reference to FIG. 4.
[0064] The cloud system (10) can generate text data describing the frame (40), such as "Vehicle type: 'white Model A cargo truck' and 'black Model B passenger car', Event: Vehicle accident". The cloud system (10) can map the text data to the frame (40) and store it as an image sequence.
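Generating the text data and mapping it to a frame can be sketched as follows, using the objects identified in FIG. 4. The function names and the in-memory dictionary that stands in for storage are hypothetical.

```python
def describe_frame(object_labels, event_name):
    """Build a text description from the identified objects and the tagged event."""
    vehicles = " and ".join(f"'{label}'" for label in object_labels)
    return f"Vehicle type: {vehicles}, Event: {event_name}"


# Hypothetical store: frame id -> text data describing that frame.
frame_descriptions = {}


def store_description(frame_id, object_labels, event_name):
    """Generate the description and map it to the frame."""
    text = describe_frame(object_labels, event_name)
    frame_descriptions[frame_id] = text
    return text


text = store_description(
    40,
    ["white Model A cargo truck", "black Model B passenger car"],
    "Vehicle accident",
)
print(text)
```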
[0065] According to the present embodiment, when the cloud system (10) generates an event embedding vector for an image sequence using a multimodal model, the event embedding vector can be generated by referencing both the image and the text data describing the image. Therefore, there is an advantage that an event embedding vector can be generated that more accurately represents the frames included in the real-time captured video.
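The present disclosure does not state how the image and its text data are combined into one event embedding vector. One simple possibility, shown here purely as an assumption, is a weighted average of the L2-normalized image and text embeddings followed by renormalization. The vectors and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (the vector must not be all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def fuse_embeddings(image_vec, text_vec, image_weight=0.5):
    """Assumed fusion: weighted average of unit vectors, renormalized.
    Fails with a division by zero if the weighted vectors cancel exactly."""
    img = l2_normalize(image_vec)
    txt = l2_normalize(text_vec)
    mixed = [image_weight * i + (1.0 - image_weight) * t for i, t in zip(img, txt)]
    return l2_normalize(mixed)


# Toy stand-ins for the image embedding and the text-description embedding.
event_vector = fuse_embeddings([2.0, 0.0], [0.0, 3.0])
print(event_vector)
```

Normalizing each input first prevents the modality with the larger raw magnitude from dominating the fused vector.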
[0066] Hereinafter, with reference to FIG. 5, a method for searching for events in a captured image according to some embodiments of the present disclosure will be described. FIG. 5 is a drawing for explaining a method for searching for events in a captured image according to some embodiments of the present disclosure.
[0067] In one embodiment, among the event embedding vectors stored in the event database (12), there may be multiple embedding vectors whose similarity to the search embedding vector generated from the event search query received from the user terminal (30) is greater than or equal to a threshold value. Among the embedding vectors included in the event database (12), let the embedding vectors whose similarity to the search embedding vector is greater than or equal to the threshold value be called the third embedding vector and the fourth embedding vector, respectively.
[0068] If the first similarity between the search embedding vector and the third embedding vector is higher than the second similarity between the search embedding vector and the fourth embedding vector, the cloud system (10) may assign a higher priority to the data for the event corresponding to the third embedding vector than to the data for the event corresponding to the fourth embedding vector. When there are multiple embedding vectors whose similarity to the search embedding vector generated from the event search query received from the user terminal (30) is greater than or equal to a threshold value, the cloud system (10) may assign the highest priority to the event embedding vector most similar to the search embedding vector and provide the user terminal (30) with a ranking result for each embedding vector.
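The priority assignment can be sketched as a descending sort by similarity. The input is assumed to be a list of (event identifier, similarity) pairs that already passed the threshold; the identifiers and similarity values are hypothetical.

```python
def rank_events(hits):
    """Sort (event_id, similarity) pairs so that the most similar event
    receives priority 1."""
    ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [
        {"priority": rank, "event_id": event_id, "similarity": similarity}
        for rank, (event_id, similarity) in enumerate(ordered, start=1)
    ]


# Both events passed the threshold; the third embedding vector is more similar.
hits = [
    ("event_for_fourth_embedding", 0.74),
    ("event_for_third_embedding", 0.91),
]
ranked = rank_events(hits)
print([entry["event_id"] for entry in ranked])
```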
[0069] Hereinafter, the above embodiment will be further explained with reference to FIG. 5. FIG. 5 is a drawing for explaining a method for searching for events in a captured image according to some embodiments of the present disclosure.
[0070] Referring to FIG. 5, a first image sequence (500) corresponding to the third embedding vector and a second image sequence (510) corresponding to the fourth embedding vector are shown.
[0071] It is assumed that the event search query received from the user terminal (30) is "find a video of a truck collision at an intersection." The search target in a standardized format for the event search query may be "Location: Intersection, Vehicle type: Truck, Event: Collision". The cloud system (10) can generate a first search embedding vector for the above search target using a multimodal model.
[0072] The first image sequence (500) is an image sequence in which a collision (503) occurs between a 'white Model A cargo truck' (501) and a 'black Model B passenger car' (502). The second image sequence (510) is an image sequence in which a collision (513) occurs between a 'black Model B passenger car' (511) and a 'black Model B passenger car' (512).
[0073] In this case, the first image sequence (500) and the second image sequence (510) are both image sequences in which a vehicle collision event occurred, but only in the first image sequence (500) is a 'truck' present among the recognized objects. Therefore, the first similarity between the first search embedding vector and the third embedding vector will be calculated to be higher than the second similarity between the first search embedding vector and the fourth embedding vector. The cloud system (10) can assign a higher priority to the data for the event corresponding to the third embedding vector than to the data for the event corresponding to the fourth embedding vector.
[0074] When there are multiple embedding vectors whose similarity to the first search embedding vector, generated from the event search query received from the user terminal (30), is greater than or equal to a threshold value, the cloud system (10) may assign the highest priority to the event embedding vector most similar to the first search embedding vector and provide the user terminal (30) with a ranking result for each embedding vector.
[0075] According to the present embodiment, events with a high probability of matching an event search query received from a user terminal can be provided as search results. Accordingly, there is an advantage that the user experience of a user utilizing the event search system according to the present disclosure can be enhanced.
[0076] FIG. 6 is a hardware configuration diagram of a computing device according to some embodiments of the present disclosure. The computing device (1000) of FIG. 6 may include one or more processors (1100), a system bus (1600), a communication interface (1200), a memory (1400) for loading a computer program (1500) executed by the processor (1100), and a storage (1300) for storing the computer program (1500).
[0077] The computing device (1000) of FIG. 6 may represent, for example, the hardware structure of one or more computing devices constituting the cloud system (10) described with reference to FIG. 1.
[0078] The processor (1100) controls the overall operation of each component of the computing device (1000). The processor (1100) may perform operations on at least one application or program for executing methods / operations according to various embodiments of the present disclosure. The memory (1400) stores various data, instructions and / or information. The memory (1400) may load one or more computer programs (1500) from the storage (1300) to execute methods / operations according to various embodiments of the present disclosure. The storage (1300) may store one or more computer programs (1500) in a non-transitory manner.
[0079] A computer program (1500) may include one or more instructions in which methods / operations according to various embodiments of the present disclosure are implemented. When the computer program (1500) is loaded into memory (1400), a processor (1100) may perform methods / operations according to various embodiments of the present disclosure by executing the one or more instructions.
[0080] In one embodiment, a computer program (1500) may include instructions for executing operations such as: acquiring a real-time captured image; acquiring an image sequence for a predefined major event from the real-time captured image; generating an event embedding vector representing data for the image sequence using a multimodal model and storing the event embedding vector in an event database; receiving an event search query in the form of natural language from a user terminal; performing preprocessing on the event search query and acquiring a search target in a structured format; generating a search embedding vector for the search target using the multimodal model; performing a search for a first embedding vector included in the event database using the search embedding vector; and transmitting data for an event corresponding to a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value as a result of the search to the user terminal.
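The store-and-search flow listed above can be sketched end to end as follows. This is only an illustration under strong assumptions: a normalized bag-of-words vector stands in for the multimodal model, a text description stands in for the image sequence, and the "structured format" of the search target is taken to be a keyword dictionary. None of these choices comes from the disclosure, and the names STOPWORDS, tokenize, preprocess_query, toy_embed, similarity, and EventDatabase are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical filler words dropped during preprocessing.
STOPWORDS = {"a", "an", "the", "of", "in", "at", "me", "show", "find", "please"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def preprocess_query(query):
    """Turn a natural-language query into a structured search target (assumed shape)."""
    return {"keywords": tokenize(query)}


def toy_embed(tokens):
    """Stand-in for the multimodal model: a sparse, L2-normalized bag of words."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(weight * v[tok] for tok, weight in u.items() if tok in v)


class EventDatabase:
    """In-memory stand-in for the event database."""

    def __init__(self):
        self._rows = []  # list of (event_id, event embedding vector)

    def store(self, event_id, description):
        # A real system would embed the image sequence itself, not a description.
        self._rows.append((event_id, toy_embed(tokenize(description))))

    def search(self, query, threshold):
        target = preprocess_query(query)
        search_vector = toy_embed(target["keywords"])
        hits = [(eid, similarity(search_vector, vec)) for eid, vec in self._rows]
        hits = [hit for hit in hits if hit[1] >= threshold]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

After storing one event described as "pedestrian crossing in front of vehicle" and another described as "rear collision at intersection", searching for "Show me the pedestrian crossing" with a threshold of 0.3 returns only the first event, because the second shares no keywords with the search target.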
[0081] Various embodiments of the present disclosure and effects according to those embodiments have been described with reference to FIGS. 1 to 6. The effects according to the technical concept of the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by a person skilled in the art from the description below.
[0082] Furthermore, although the above embodiments describe a plurality of components as being combined into one or as operating in combination, the technical concept of the present disclosure is not necessarily limited to these embodiments. That is, within the scope of the purpose of the technical concept of the present disclosure, all of the components may be selectively combined into one or more combinations and operated.
[0083] The technical concept of the present disclosure described above may be implemented as computer-readable code on a computer-readable medium. A computer program recorded on a computer-readable recording medium may be transmitted to another computing device via a network such as the Internet and installed on said other computing device, thereby being used on said other computing device.
Claims
1. A method for searching for events in recorded video, performed by a computing system, the method comprising: acquiring a real-time captured video; acquiring an image sequence for a predefined major event from the real-time captured video; generating an event embedding vector representing data for the image sequence using a multimodal model, and storing the event embedding vector in an event database; receiving an event search query in natural language form from a user terminal; performing preprocessing on the event search query and obtaining a search target in a standardized format; generating a search embedding vector for the search target using the multimodal model; performing a search for a first embedding vector included in the event database using the search embedding vector; and transmitting, to the user terminal, data for an event corresponding to a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value as a result of performing the search.
2. The method of claim 1, wherein acquiring the image sequence comprises: identifying an object included in a frame, which is a captured image of the real-time captured video; determining whether the identified object satisfies a condition of the major event; and tagging the frame as the major event using a result of the determination.
3. The method of claim 2, wherein the tagging comprises generating text data describing the frame and storing the text data by mapping it to the frame.
4. The method of claim 1, wherein the event embedding vector is generated based on visual features of the image sequence, and the search embedding vector is generated based on text of the event search query.
5. The method of claim 1, wherein the second embedding vector includes a third embedding vector and a fourth embedding vector, and wherein transmitting to the user terminal comprises, when a first similarity between the search embedding vector and the third embedding vector is higher than a second similarity between the search embedding vector and the fourth embedding vector, assigning a higher priority to the data for the event corresponding to the third embedding vector than to the data for the event corresponding to the fourth embedding vector.
6. A system for searching for events in recorded video, comprising: a communication interface; a memory into which a computer program is loaded; and one or more processors that execute the computer program, wherein the computer program includes instructions for: an operation of acquiring a real-time captured video; an operation of acquiring an image sequence for a predefined major event from the real-time captured video; an operation of generating an event embedding vector representing data for the image sequence using a multimodal model, and storing the event embedding vector in an event database; an operation of receiving an event search query in natural language form from a user terminal; an operation of performing preprocessing on the event search query and obtaining a search target in a standardized format; an operation of generating a search embedding vector for the search target using the multimodal model; an operation of performing a search for a first embedding vector included in the event database using the search embedding vector; and an operation of transmitting, to the user terminal, data for an event corresponding to a second embedding vector whose similarity to the search embedding vector is greater than or equal to a threshold value as a result of performing the search.