Electronic device, operating method of electronic device, and storage medium
The electronic device uses modality-specific multimodal AI models to efficiently analyze video content, reducing costs and improving accuracy through threshold-based agreement checks for knowledge graph generation.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SAMSUNG ELECTRONICS CO LTD
- Filing Date
- 2025-12-22
- Publication Date
- 2026-07-02
AI Technical Summary
Massive AI models required for video content analysis incur high costs and resource consumption due to the complexity and volume of data, necessitating a cost-effective analysis method.
An electronic device employing a multimodal AI model to analyze video content modality-specifically, acquiring knowledge graphs through individual modality models, with threshold-based agreement checks to ensure accuracy and efficiency.
Reduces costs and enhances analysis accuracy by leveraging modality-specific models to generate highly accurate knowledge graphs from video content.
Description
Electronic device, method of operation of electronic device, and storage medium
[0001] The present disclosure relates to an electronic device, a method of operating the electronic device, and a storage medium.
[0002] Artificial intelligence (AI) systems are computer systems that implement human-level intelligence, where machines learn and make judgments on their own, and recognition rates improve with use.
[0003] AI technology can be composed of machine learning (deep learning) technology that uses algorithms to self-classify and learn the characteristics of input data, and component technologies that utilize machine learning algorithms to mimic functions such as cognition and judgment of the human brain.
[0004] The component technologies may include, for example, linguistic understanding technology that recognizes human language / characters, visual understanding technology that recognizes objects like human vision, reasoning / prediction technology that judges information to logically reason and predict, knowledge representation technology that processes human experience information into knowledge data, and motion control technology that controls autonomous driving of vehicles and the movement of robots.
[0005] Linguistic understanding is a technology that recognizes, applies, or processes human language / text, and may include natural language processing, machine translation, dialogue systems, question answering, or speech recognition / synthesis.
[0006] Visual understanding is a technology that perceives and processes objects like human vision, and may include object recognition, object tracking, image search, person recognition, scene understanding, spatial understanding, or image enhancement.
[0007] Inference prediction is a technology that logically reasons and predicts by judging information, and may include knowledge / probability-based inference, optimization prediction, preference-based planning, or recommendation.
[0008] Knowledge representation is a technology that automatically processes human experience information into knowledge data and may include knowledge construction (data generation / classification) or knowledge management (data utilization).
[0009] The information described above may be provided as related art for the purpose of aiding understanding of this document. None of the foregoing is to be claimed as prior art related to this document, nor is it to be used to determine prior art.
[0010] Video analysis can be performed to recognize objects, backgrounds, or situations within video content, or to identify significant patterns or events. The results recognized through video analysis can be generated as a knowledge graph. A knowledge graph is structured information designed to represent the structure and relationships of data, and it can provide knowledge beyond simple information through the relationships between data. For example, knowledge graphs can be usefully employed to understand the relationships, context, and interactions between objects within video content.
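As an illustration only (not taken from the disclosure), a knowledge graph can be modeled as a set of subject-relation-object triples. The `Triple` type, the `neighbors` helper, and the sample cooking scene below are hypothetical names chosen for this sketch.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One edge of a knowledge graph: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


def neighbors(graph: set[Triple], entity: str) -> set[str]:
    """Return every entity directly linked to `entity`, in either direction."""
    linked = set()
    for t in graph:
        if t.subject == entity:
            linked.add(t.obj)
        elif t.obj == entity:
            linked.add(t.subject)
    return linked


# Hypothetical graph for a short cooking scene.
scene_graph = {
    Triple("chef", "holds", "knife"),
    Triple("chef", "stands_in", "kitchen"),
    Triple("knife", "cuts", "onion"),
}

print(sorted(neighbors(scene_graph, "knife")))  # ['chef', 'onion']
```

The relationships, rather than the isolated labels, are what let a consumer of the graph infer that the chef is involved with the onion.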
[0011] Generally, video content contains complex and vast amounts of data. Therefore, analyzing video content or generating knowledge graphs requires high-level computational power and massive AI models capable of understanding the connectivity between data. However, massive AI models pose a significant cost burden due to the high volume of training and the consumption of substantial resources. Consequently, there is a demand for cost-effective analysis methods related to video content.
[0012] One embodiment of the present disclosure may provide an electronic device, a method of operating the electronic device, and a storage medium.
[0013] One embodiment of the present disclosure may provide an electronic device for performing modality-specific analysis of video content, a method of operating the electronic device, and a storage medium.
[0014] One embodiment of the present disclosure may provide an electronic device capable of obtaining highly accurate analysis results through modality-specific analysis using individual modality models, a method of operating the electronic device, and a storage medium.
[0015] One embodiment of the present disclosure may provide an electronic device, a method of operating the electronic device, and a storage medium that can reduce costs compared to using a large AI model.
[0016] An electronic device according to one embodiment of the present disclosure may include: at least one processor comprising a processing circuit; and a memory comprising at least one storage medium for storing instructions. When the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to: acquire, through a multimodal artificial intelligence (AI) model, a knowledge graph corresponding to each of first modalities, the first modalities corresponding to at least two of image data, video data, audio data, or text data included in a first section of video content; and, if the degree of agreement between the first modalities with respect to common information in the knowledge graphs corresponding to each of the first modalities is greater than or equal to a first threshold, acquire a first knowledge graph of the video content corresponding to the first section based on the knowledge graph corresponding to each of the first modalities.
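The threshold-gated integration described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: each modality's graph is represented as a mapping from a (subject, relation) key to an object, "common information" is taken to mean keys present in every modality, and the degree of agreement is the fraction of those keys on which all modalities report the same object. The names `agreement` and `integrate` are hypothetical.

```python
from typing import Optional

# Assumed representation: (subject, relation) -> object
Graph = dict[tuple[str, str], str]


def agreement(graphs: dict[str, Graph]) -> float:
    """Fraction of keys shared by every modality on which all values match."""
    common = set.intersection(*(set(g) for g in graphs.values()))
    if not common:
        return 0.0  # assumption: no shared information counts as no agreement
    matching = sum(
        1 for key in common if len({g[key] for g in graphs.values()}) == 1
    )
    return matching / len(common)


def integrate(graphs: dict[str, Graph], first_threshold: float) -> Optional[Graph]:
    """Merge the per-modality graphs only if agreement reaches the threshold."""
    if agreement(graphs) < first_threshold:
        return None  # caller would re-analyze instead of integrating
    merged: Graph = {}
    for g in graphs.values():
        merged.update(g)  # simplification: later modalities win on conflicts
    return merged


graphs = {
    "image": {("person", "action"): "running", ("person", "location"): "park"},
    "audio": {
        ("person", "action"): "running",
        ("person", "location"): "street",
        ("dog", "action"): "barking",
    },
}
print(agreement(graphs))               # 0.5
print(integrate(graphs, 0.8) is None)  # True
```

With a first threshold of 0.5 the same input would be accepted and the audio-only fact about the dog would appear in the merged graph.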
[0017] According to one embodiment, the first section may include at least one frame among the frames included in the video content input during the set section that includes a feature different from a second knowledge graph corresponding to the video content obtained for the second section prior to the first section.
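One way to read this selection rule is as a filter over frames: keep only frames that show something the previous knowledge graph does not already cover. The sketch below assumes each frame has already been reduced to a set of feature labels by some upstream detector, which the source does not specify; `select_key_frames` and the sample labels are hypothetical.

```python
def select_key_frames(
    frames: list[set[str]], known_entities: set[str]
) -> list[int]:
    """Return indices of frames with at least one label not in known_entities."""
    return [i for i, labels in enumerate(frames) if labels - known_entities]


# Entities already present in the knowledge graph of the previous section.
known = {"chef", "knife"}

# Hypothetical per-frame labels for the frames input during the set section.
frames = [{"chef", "knife"}, {"chef", "onion"}, {"chef"}]

print(select_key_frames(frames, known))  # [1]
```

Only the second frame introduces a new feature ("onion"), so only that frame would be passed on for analysis in this simplified reading.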
[0018] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to acquire a knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities, a knowledge graph corresponding to each of the second modalities included in the second section prior to the first section, or a prompt corresponding to an instruction for analysis associated with each of the first modalities.
[0019] According to one embodiment, the multimodal AI model may include a multimodal language model for acquiring a knowledge graph corresponding to each of the first modalities as text-type information.
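If the model returns the knowledge graph as text, that text has to be parsed back into a structured form before graphs can be compared. The sketch below assumes a simple `subject | relation | object` line format; the disclosure does not specify the text format, so both the format and the `parse_triples` helper are assumptions made for illustration.

```python
def parse_triples(text: str) -> list[tuple[str, str, str]]:
    """Parse 'subject | relation | object' lines, skipping malformed ones."""
    triples = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


# Hypothetical model output containing a blank line and a stray comment.
model_output = "chef | holds | knife\n\nnot a triple\nknife | cuts | onion"

print(parse_triples(model_output))
```

Skipping malformed lines rather than raising keeps one bad line from discarding an otherwise usable response; a stricter parser may be preferable when the output feeds an agreement check.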
[0020] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to: acquire a similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities included in the second section prior to the first section, if the degree of agreement is greater than or equal to the first threshold; and acquire the first knowledge graph if the acquired similarity is greater than or equal to the second threshold.
[0021] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to: obtain a similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities included in a second section prior to the first section, and if the obtained similarity is greater than or equal to a second threshold, obtain a degree of agreement between the first modalities corresponding to the common information.
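The two embodiments above apply the same pair of checks in opposite orders. The sketch below makes that ordering explicit with precomputed scores. The `Outcome` names and the `gate` function are hypothetical, and computing both scores up front is a simplification; a real implementation would presumably skip the second computation when the first check fails.

```python
from enum import Enum


class Outcome(Enum):
    ACCEPT = "accept"
    REPROMPT_CONFLICT = "reprompt_conflict"        # modalities disagree
    REPROMPT_CONSISTENCY = "reprompt_consistency"  # differs from prior section


def gate(
    agreement: float,
    similarity: float,
    first_threshold: float,
    second_threshold: float,
    similarity_first: bool = False,
) -> Outcome:
    """Apply both threshold checks in the chosen order; report the first failure."""
    checks = [
        (agreement >= first_threshold, Outcome.REPROMPT_CONFLICT),
        (similarity >= second_threshold, Outcome.REPROMPT_CONSISTENCY),
    ]
    if similarity_first:
        checks.reverse()
    for passed, failure in checks:
        if not passed:
            return failure
    return Outcome.ACCEPT


# Both checks fail here, so the order decides which re-analysis is triggered.
print(gate(0.5, 0.4, 0.8, 0.6))                         # Outcome.REPROMPT_CONFLICT
print(gate(0.5, 0.4, 0.8, 0.6, similarity_first=True))  # Outcome.REPROMPT_CONSISTENCY
```

The acceptance condition is identical in both orders; only the failure that is reported first, and therefore the prompt that is generated, changes.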
[0022] According to one embodiment, the similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities may be a similarity corresponding to a modality-specific weight based on the degree of change of each of the second modalities.
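A weighted similarity of this kind could be sketched as a weighted average of per-modality scores. The example below uses Jaccard similarity over triple sets as the per-modality score, which is a swapped-in measure chosen for illustration; the source does not name a similarity measure. How a weight is derived from a modality's degree of change is also left open by the source, so `weights` is simply taken as an input here.

```python
Triples = set[tuple[str, str, str]]


def jaccard(a: Triples, b: Triples) -> float:
    """Size of the intersection over size of the union; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def weighted_similarity(
    current: dict[str, Triples],
    previous: dict[str, Triples],
    weights: dict[str, float],
) -> float:
    """Weighted average of per-modality Jaccard similarities."""
    total = sum(weights[m] for m in current)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(
        weights[m] * jaccard(current[m], previous.get(m, set())) for m in current
    )
    return score / total


current = {
    "image": {("chef", "holds", "knife"), ("chef", "in", "kitchen")},
    "audio": {("chef", "says", "hello")},
}
previous = {
    "image": {("chef", "holds", "knife")},
    "audio": {("chef", "says", "hello")},
}
weights = {"image": 0.25, "audio": 0.75}  # hypothetical values

print(weighted_similarity(current, previous, weights))  # 0.875
```

Here the image graph is only half similar to its predecessor, but because the image modality carries a low weight the overall score stays high.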
[0023] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to: if the similarity is less than the second threshold, generate a prompt based on different information between the knowledge graph corresponding to each of the first modalities and the knowledge graph corresponding to each of the second modalities; and reacquire the knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities, the knowledge graph corresponding to each of the second modalities, or the prompt.
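Generating a prompt from the differing information might look like the sketch below, which lists facts that appear in only one of the two graphs. The prompt wording, the `build_consistency_prompt` name, and the triple representation are all assumptions for illustration; the source does not give a prompt template.

```python
Triples = set[tuple[str, str, str]]


def build_consistency_prompt(current: Triples, previous: Triples) -> str:
    """List facts that differ between the current and previous graphs."""
    added = sorted(current - previous)
    dropped = sorted(previous - current)
    lines = ["Re-analyze the section and confirm or correct these differences:"]
    lines += [f"- new: {s} {r} {o}" for s, r, o in added]
    lines += [f"- missing: {s} {r} {o}" for s, r, o in dropped]
    return "\n".join(lines)


current = {("chef", "holds", "spoon"), ("chef", "in", "kitchen")}
previous = {("chef", "holds", "knife"), ("chef", "in", "kitchen")}

print(build_consistency_prompt(current, previous))
```

The resulting text would be passed to the model together with the first modalities and the previous graph when the knowledge graph is reacquired.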
[0024] According to one embodiment, when the instructions are executed individually or collectively by the at least one processor, the electronic device may be caused to: if the degree of agreement is less than the first threshold, reacquire a knowledge graph corresponding to each of the first modalities through the multimodal AI model based on a prompt related to different features between the first modalities and the first modalities corresponding to the common information.
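Before such a prompt can be written, the conflicting features have to be located. A minimal sketch, assuming each modality's graph maps a (subject, relation) key to an object and that a conflict is any key reported by two or more modalities with differing objects; `find_conflicts` is a hypothetical helper.

```python
Graph = dict[tuple[str, str], str]


def find_conflicts(
    graphs: dict[str, Graph],
) -> dict[tuple[str, str], dict[str, str]]:
    """Map each disputed key to the value reported by each modality."""
    conflicts = {}
    all_keys = set().union(*graphs.values())
    for key in all_keys:
        values = {m: g[key] for m, g in graphs.items() if key in g}
        if len(values) >= 2 and len(set(values.values())) > 1:
            conflicts[key] = values
    return conflicts


graphs = {
    "image": {("person", "location"): "park", ("person", "action"): "running"},
    "audio": {("person", "location"): "street", ("person", "action"): "running"},
    "text": {("person", "action"): "running"},
}

for (subject, relation), values in find_conflicts(graphs).items():
    print(subject, relation, values)
```

Each entry names the disputed fact and which modality claimed what, which is the information a re-analysis prompt would need to point the model at.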
[0025] According to one embodiment, the first threshold value may be a threshold value based on the degree of agreement between second modalities obtained in a second section prior to the first section.
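The source states only that the first threshold may depend on the agreement observed in the previous section; it does not give a formula. One hypothetical choice, shown purely as a sketch, is to tolerate a small drop from the previous agreement while clamping the result to a fixed range. The `margin`, `floor`, and `ceiling` parameters and their default values are assumptions.

```python
def adaptive_first_threshold(
    previous_agreement: float,
    margin: float = 0.1,
    floor: float = 0.5,
    ceiling: float = 0.95,
) -> float:
    """Previous agreement minus a margin, clamped to [floor, ceiling]."""
    return min(ceiling, max(floor, previous_agreement - margin))


for prev in (0.3, 0.75, 1.0):
    print(prev, adaptive_first_threshold(prev))
```

Under this choice, a section that follows a highly consistent section is held to a high bar, while a section that follows a noisy one is never held to less than the floor.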
[0026] A method of operating an electronic device according to one embodiment of the present disclosure may include: an operation of acquiring a knowledge graph corresponding to each of a first modality corresponding to at least two of image data, video data, audio data, or text data included in a first section of video content through a multimodal artificial intelligence (AI) model; and an operation of acquiring a first knowledge graph of the video content corresponding to the first section based on the knowledge graph corresponding to each of the first modalities, if the degree of agreement between the first modalities corresponding to common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold value.
[0027] According to one embodiment, the first section may include at least one frame among the frames included in the video content input during the set section that includes a feature different from a second knowledge graph corresponding to the video content obtained for the second section prior to the first section.
[0028] According to one embodiment, the operation of acquiring a knowledge graph corresponding to each of the first modalities may include acquiring a knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities, a knowledge graph corresponding to each of the second modalities included in the second section prior to the first section, or a prompt corresponding to an instruction for analysis associated with each of the first modalities.
[0029] According to one embodiment, the multimodal AI model may include a multimodal language model for acquiring a knowledge graph corresponding to each of the first modalities as text-type information.
[0030] According to one embodiment, the operation of acquiring the first knowledge graph may include: acquiring a similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities included in a second section prior to the first section, if the degree of agreement is greater than or equal to the first threshold; and acquiring the first knowledge graph if the acquired similarity is greater than or equal to the second threshold.
[0031] According to one embodiment, the operation method may further include: an operation of obtaining a similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities included in a second section prior to the first section; and, if the obtained similarity is greater than or equal to a second threshold, an operation of obtaining a degree of agreement between the first modalities corresponding to the common information.
[0032] According to one embodiment, the similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities may be a similarity corresponding to a modality-specific weight based on the degree of change of each of the second modalities.
[0033] According to one embodiment, the operation method may further include, if the similarity is less than the second threshold, generating a prompt based on different information between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities; and reacquiring a knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities, the knowledge graph corresponding to each of the second modalities, or the prompt.
[0034] According to one embodiment, the method may further include the operation of reacquiring a knowledge graph corresponding to each of the first modalities through the multimodal AI model based on a prompt related to different features between the first modalities and the first modalities corresponding to the common information, if the degree of agreement is less than the first threshold.
[0035] According to one embodiment, the first threshold value may be a threshold value based on the degree of agreement between second modalities obtained in a second section prior to the first section.
[0036] A storage medium according to one embodiment of the present disclosure may store at least one computer-readable instruction that, when executed by at least part of at least one processor of an electronic device, causes the electronic device to perform at least one operation. The at least one operation may include: an operation of obtaining a knowledge graph corresponding to each of first modalities corresponding to at least two of image data, video data, audio data, or text data included in a first section of video content through a multimodal artificial intelligence (AI) model; and an operation of obtaining a first knowledge graph of the video content corresponding to the first section based on the knowledge graph corresponding to each of the first modalities, if the degree of agreement between the first modalities corresponding to common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold value.
[0037] FIG. 1a is a block diagram showing the configuration of an electronic device according to one embodiment.
[0038] FIG. 1b is a block diagram illustrating the detailed configuration of an electronic device according to one embodiment.
[0039] FIG. 2 is a drawing illustrating a knowledge graph acquisition device included in an electronic device according to one embodiment.
[0040] FIG. 3 is a diagram illustrating the operation of a key point identification module according to one embodiment.
[0041] FIG. 4 is a diagram illustrating the operation of a first prompt configuration module and an analysis module according to one embodiment.
[0042] FIG. 5a is a drawing illustrating an image modality included in a major section according to one embodiment.
[0043] FIG. 5b is a diagram illustrating an image modality knowledge graph according to one embodiment.
[0044] FIG. 6 is a diagram illustrating collision detection between modalities by an inspection module according to one embodiment.
[0045] FIG. 7 is a diagram illustrating a knowledge consistency check by an inspection module according to one embodiment.
[0046] FIG. 8 is a diagram illustrating a knowledge graph integration operation according to one embodiment.
[0047] FIG. 9 is a flowchart illustrating the operation of an electronic device according to one embodiment.
[0048] FIG. 10 is a flowchart illustrating the operation of an electronic device according to one embodiment acquiring a knowledge graph corresponding to each of the first modalities.
[0049] FIG. 11 is a flowchart illustrating the operation of an electronic device for reacquiring a knowledge graph according to one embodiment.
[0050] FIG. 12 is a flowchart illustrating the operation of an electronic device for acquiring a knowledge graph corresponding to video content according to one embodiment.
[0051] FIG. 13 is a flowchart illustrating the operation of an electronic device that obtains a degree of agreement between modalities through a knowledge consistency check according to one embodiment.
[0052] FIG. 14 is a flowchart illustrating the operation of an electronic device for reacquiring a knowledge graph by modality according to one embodiment.
[0053] In the following description, the attached drawings are referenced, and specific examples of implementation are illustrated within the drawings. Additionally, other examples may be used and structural modifications may be made without departing from the scope of the various examples.
[0054] Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings so that those skilled in the art can easily practice them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein. In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components. Furthermore, in the drawings and related descriptions, descriptions of well-known functions and configurations may be omitted for clarity and brevity.
[0055] FIG. 1a is a block diagram showing the configuration of an electronic device according to one embodiment.
[0056] Referring to FIG. 1a, the electronic device (100) may include memory (110) and a processor (120).
[0057] According to one embodiment, the memory (110) may store various data or information used by at least one component (e.g., processor (120)) of the electronic device (100). For example, the memory (110) may store at least one program for processing and controlling the processor (120) and may store input and / or output data. The memory (110) may store at least one artificial intelligence (AI) model (e.g., multimodal AI model, multimodal language model, or language model) and may include volatile memory or non-volatile memory.
[0058] According to one embodiment, the processor (120) can control the overall operation of the electronic device (100). The processor (120) can perform operations or data processing regarding the control and / or communication of at least one other component of the electronic device (100). For example, the processor (120) can be electrically connected to the memory (110) and can perform the operations of the electronic device (100) described below by executing instructions of a program stored in the memory (110).
[0059] According to one embodiment, the processor (120) may correspond to a plurality of processors that divide a plurality of operations among the processors and perform them individually or collectively.
[0060] According to one embodiment, the processor (120) may include a processing circuit that executes instructions of a program stored in the memory (110). The processor (120) may include at least one of a CPU (central processing unit), NPU (neural processing unit), GPU (graphics processing unit), MPU (micro processing unit), MCU (micro controller unit), AP (application processor), CP (communication processor), SoC (system on chip), IC (integrated circuit), sensor hub, supplementary processor, ASIC (application specific integrated circuit), or FPGA (field programmable gate array), and may have multiple cores.
[0061] FIG. 1b is a block diagram illustrating the detailed configuration of an electronic device according to one embodiment.
[0062] Referring to FIG. 1b, the electronic device (100) may include additional components in addition to the memory (110) and processor (120) shown in FIG. 1a. For example, the additional components may include at least one of a display (130), a communication unit (140), an input / output interface (150), an operation interface (160), a speaker (170), or a microphone (180).
[0063] According to one embodiment, the display (130) may or may not be included in the electronic device (100). For example, if the electronic device (100) is a display device such as a mobile device (e.g., a smartphone or tablet), a computing device (e.g., a PC (personal computer) or a laptop), a monitor device, a wearable device (e.g., a smart watch or HMD (head mounted display)), or a home appliance (e.g., a TV (television)), the display (130) may be included in the electronic device (100). The display (130) may perform various display operations according to the function of the electronic device (100). For example, the display (130) may visually provide various information (e.g., information in the form of text, images, or graphics) and may display various screens based on the control of the processor (120).
[0064] According to one embodiment, the display (130) can be implemented in various forms based on an LCD (liquid crystal display), LED (light emitting diode), OLED (organic light emitting diode), QLED (quantum dot light emitting diode), or Micro LED. The display (130) can be implemented as a touchscreen combined with a touch sensor, a flexible display, or a three-dimensional (3D) display, and can be implemented as a plurality of displays.
[0065] According to one embodiment, the electronic device (100) may be a source device capable of providing content or information to an external display device. For example, the source device may be any one of a set-top box, a streaming device, a game console, a media player, a camera, a mobile device (e.g., a smartphone or tablet), a computing device (e.g., a PC or laptop), or an external server on a network (e.g., a content server, a broadcast server, or an application server).
[0066] According to one embodiment, when the electronic device (100) is a source device, the display (130) may or may not be included in the electronic device (100). When the display (130) is included in the electronic device (100), the display (130) may display status information of the electronic device (100) (e.g., power on / off information or operation mode information) as text, graphics, or icons. The display (130) included in the source device may be used to display relatively simple information and may be configured to be simple and small compared to a general display.
[0067] According to one embodiment, the communication unit (140) may support the establishment of a wireless communication channel or a wired communication channel between the electronic device (100) and an external electronic device (e.g., a source device, a display device, a server, or another electronic device), and the performance of communication through the established communication channel. The communication unit (140) may operate independently of the processor (120) (e.g., an application processor) and may include one or more communication processors that support wireless communication or wired communication. The communication unit (140) may include a wireless communication module or a wired communication module.
[0068] According to one embodiment, the wireless communication module may include a cellular communication module, a short-range communication module, or an infrared communication module. The cellular communication module may support communication according to various wireless communication standards such as 3G (3rd generation), 3GPP (3rd generation partnership project), LTE (long term evolution), LTE-A (LTE advanced), 4G (4th generation), 5G (5th generation), NR (new radio), or 6G (6th generation). The short-range communication module may include at least one of a Wi-Fi communication module, a Bluetooth communication module, an NFC (near field communication) communication module, a Zigbee communication module, a UWB (ultra wideband) communication module, or an RFID (radio frequency identification) communication module. The infrared communication module may perform infrared communication with an external electronic device (e.g., a remote control device) based on the IrDA (infrared data association) protocol.
[0069] According to one embodiment, the wired communication module may include at least one of a LAN (local area network) communication module, an Ethernet communication module, a power line communication (PLC) communication module, or a cable-based communication module (e.g., a pair cable, a coaxial cable, or a fiber optic cable).
[0070] According to one embodiment, the various types of communication modules described above may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips).
[0071] According to one embodiment, the input / output interface (150) may include at least one of HDMI (high-definition multimedia interface), USB (universal serial bus), DP (display port), MHL (mobile high-definition link), Thunderbolt, VGA (video graphics array) port, RGB port, D-SUB (D-subminiature), or DVI (digital visual interface). The input / output interface (150) may input and output at least one of an audio signal and a video signal.
[0072] According to one embodiment, the input / output interface (150) may include separate ports for inputting and outputting only audio signals and for inputting and outputting only video signals, or it may be implemented as a single port for inputting and outputting both audio and video signals. The electronic device (100) may transmit at least one of the audio signal and video signal to an external electronic device (e.g., an external display device or an external speaker) through the input / output interface (150). An output port included in the input / output interface (150) may be connected to an external electronic device, and the electronic device (100) may transmit at least one of the audio signal and video signal to the external electronic device through the output port.
[0073] According to one embodiment, the input / output interface (150) may be connected to or operated in association with the communication unit (140). The input / output interface (150) may transmit information received from an external electronic device to the communication unit (140) or transmit information received through the communication unit (140) to an external electronic device.
[0074] According to one embodiment, the electronic device (100) can communicate with various devices by connecting to them through a communication unit (140) or an input / output interface (150). For example, the electronic device (100) can communicate with an external server (e.g., content server, broadcast server, or application server) through a wireless communication module (e.g., Wi-Fi communication module) or a wired communication module (e.g., Ethernet module). For example, the electronic device (100) may be connected to an external electronic device (e.g., mobile device, computing device, or home appliance) through an HDMI port included in the input / output interface (150), or connected to an external speaker or camera device through a USB port.
[0075] According to one embodiment, the electronic device (100) can transmit content (e.g., video data) and a control signal for controlling the display of an image or the content to an external display device, or receive them from a source device, through the communication unit (140) or the input / output interface (150).
[0076] According to one embodiment, the operation interface (160) may receive commands or data to be used for a component of the electronic device (100) (e.g., processor (120)) from outside the electronic device (100) (e.g., user, remote control, or mobile device). The operation interface (160) may include at least one of a button, a touchpad, a mouse, a keyboard, or a stylus pen. If the display (130) is implemented as a touchscreen or if the electronic device (100) is capable of voice recognition services (e.g., voice control), the display (130) or the microphone (180) may be used as the operation interface (160).
[0077] According to one embodiment, the speaker (170) can output a sound signal or audio signal stored in the memory (110) or received through the communication unit (140) or microphone (180) to the outside of the electronic device (100). The speaker (170) may also output various notification sounds or voice messages based on the operation of the electronic device (100).
[0078] According to one embodiment, the microphone (180) can receive external sound (e.g., user voice) and convert it into audio data. For example, the microphone (180) may include various components such as an amplifier circuit that amplifies the user voice in analog form, an analog-to-digital converter that samples the amplified user voice and converts it into a digital signal, or a filter circuit that removes noise components from the converted digital signal. The microphone (180) can receive the user's voice while active and be used for voice recognition services.
[0079] According to one embodiment, the electronic device (100) can perform an operation based on a user voice signal received through a microphone (180). For example, when the electronic device (100) receives a user voice signal for displaying video content through the microphone (180), it can control the display (130) to display video content.
[0080] According to one embodiment, an electronic device (100) can control an external display device connected to the electronic device (100) based on a user voice signal received through a microphone (180). The electronic device (100) can generate a control signal to control the external display device so that an action corresponding to the user voice signal is performed on the external display device, and can transmit the generated control signal to the external display device via wireless communication (e.g., Wi-Fi communication, Bluetooth communication, or infrared communication). For example, when the electronic device (100) receives a user voice signal for displaying video content through the microphone (180), it can transmit a control signal for displaying video content to the external display device. The electronic device (100) may refer to various terminal devices capable of installing a remote control application, such as a smartphone, tablet, or AI speaker. The remote control application may support controlling the external display device from the electronic device (100).
[0081] According to one embodiment, an electronic device (100) may use a remote control device to control an external display device connected to the electronic device (100) based on a user voice signal received through a microphone (180). The electronic device (100) may transmit a control signal to the remote control device so that an operation corresponding to the user voice signal is performed on the external display device. The remote control device may transmit the control signal received from the electronic device (100) to the external display device. For example, when a user voice signal for displaying video content is received, the electronic device (100) may transmit a control signal to the remote control device to control the display of video content on the external display device, and the remote control device may transmit the received control signal to the external display device.
[0082] According to one embodiment, the electronic device (100) may receive a user voice signal through a microphone (180) included in the electronic device (100) or receive a user voice signal from an external electronic device including a microphone via wireless communication (e.g., Wi-Fi communication or Bluetooth communication). The external electronic device may refer to a remote control device or a smartphone, etc. The received user voice signal may be a digital voice signal, but may be an analog voice signal depending on the implementation example. If the electronic device (100) receives an analog voice signal, it may digitize the analog voice signal and transmit it to the processor (120) of the electronic device (100) or transmit it to an external electronic device.
[0083] According to one embodiment, an electronic device (100) can obtain text information corresponding to a user voice signal from an external server. The electronic device (100) can transmit a user voice signal (audio signal or digital signal) to an external server. The external server may include a speech recognition server. The speech recognition server can convert the user voice signal into text information using speech to text (STT). The external server can transmit the text information corresponding to the converted user voice signal to the electronic device (100).
[0084] According to one embodiment, the electronic device (100) can independently acquire text information corresponding to a user voice signal. The electronic device (100) can apply an STT function directly to a digital voice signal to convert it into text information and transmit the converted text information to an external server.
[0085] The external server can transmit information to the electronic device (100) in various ways.
[0086] According to one embodiment, an external server can transmit text information corresponding to a user voice signal to an electronic device (100). The external server may be a server that performs a voice recognition function of converting a user voice signal into text information.
[0087] According to one embodiment, an external server may transmit at least one of text information corresponding to a user voice signal or search result information corresponding to the text information to an electronic device (100). The external server may be a server that performs a search result providing function that provides search result information corresponding to text information, in addition to a voice recognition function that converts a user voice signal into text information. For example, the external server may be a server that performs both the voice recognition function and the search result providing function. Alternatively, the external server may perform only the voice recognition function, and the search result providing function may be performed on a separate server. In that case, the external server may transmit the text information to the separate server and obtain search results corresponding to the text information from the separate server.
[0088] According to one embodiment, the electronic device (100) can communicate with external electronic devices and external servers in various ways.
[0089] According to one embodiment, a communication module for communication with an external electronic device and an external server can be implemented in the same way. For example, the electronic device (100) can communicate with the external electronic device using a Bluetooth module and can also communicate with the external server using a Bluetooth module.
[0090] According to one embodiment, the communication module for communication with an external electronic device and an external server may be implemented differently. For example, the electronic device (100) may communicate with the external electronic device using a Bluetooth module and communicate with the external server using an Ethernet module or a Wi-Fi module.
[0091] FIG. 2 is a drawing illustrating a knowledge graph acquisition device included in an electronic device according to one embodiment.
[0092] Referring to FIG. 2, the electronic device (100) may include a knowledge graph acquisition device (200). According to one embodiment, the knowledge graph acquisition device (200) may be included in the processor (120) of the electronic device (100), be a component corresponding to the processor (120), or be an independent component electrically connected to the processor (120) and operating based on the control of the processor (120).
[0093] According to one embodiment, the knowledge graph acquisition device (200) may be a device that acquires or generates a knowledge graph by analyzing video content. The knowledge graph is a graph created from a knowledge base and may store interconnected descriptions of objects, such as entities (or objects), events, situations, or abstract concepts. For example, the knowledge graph of video content may represent the relationships between objects (e.g., people, things, items, or places) included in at least one frame in a graph structure. The knowledge graph of video content may be used to visualize and analyze the interrelationships between image data, video data, audio data, or text data included in the video content. The knowledge graph of video content may be used, for example, for scene search, natural language processing, or data analysis.
[0094] According to one embodiment, the knowledge graph acquisition device (200) may include a plurality of modules such as a content input module (202), a key point identification module (204), a first prompt configuration module (206), an analysis module (208), an inspection module (210), or a second prompt configuration module (218). At least two of the plurality of modules, or all of the plurality of modules, may be integrated into a single module. The operation of each module in the knowledge graph acquisition device (200) may be performed under the control of the processor (120) of the electronic device (100) and may substantially represent the operation of the processor (120) of the electronic device (100).
[0095] According to one embodiment, the content input module (202) may receive video content during a set interval. The video content may be input in set units (e.g., frame units). The video content may be stored in the memory (110) of the electronic device (100), received from an external server (e.g., content server, broadcast server, or application server) via a wireless communication module (e.g., Wi-Fi communication module) or a wired communication module (e.g., Ethernet module), or received from an external electronic device via an input / output interface (150) (e.g., HDMI port or USB port) or a wireless communication module (e.g., Wi-Fi communication module or Bluetooth communication module). For example, the external electronic device may be a display device (e.g., mobile device, computing device, wearable device, or TV) or a source device (e.g., set-top box, streaming device, game console, media player, camera, mobile device, or computing device). The video content may be content received in real time, or content that has been stored or received in advance.
[0096] Video content may include at least one of image data, video data, audio data, or text data. Image data may be associated with still images and / or videos, video data may be the video content itself, and text data may be associated with subtitles. Audio data may be associated with the voices of various characters within the video content, background music, animal sounds, or various other sounds. According to one embodiment, each of the image data, video data, audio data, or text data that may be included in the video content may be designated as a modality. According to one embodiment, if the video content includes at least two of image data, video data, audio data, or text data, the at least two data may be used as multimodal data.
[0097] According to one embodiment, a key point identification module (204) can identify a key point (or key section) requiring analysis in video content input during a set period. For example, the key point identification module (204) can identify a key point in the input video content that includes features different from the knowledge graph of the previous section. The knowledge graph of the previous section can be obtained from a knowledge graph database (220). The knowledge graph database (220) can be contained in the memory (110) of the electronic device (100) or an external server (e.g., a cloud server) and can store the knowledge graph of the previous section of the video content and the knowledge graph by modality of the previous section of the video content.
[0098] According to one embodiment, the key point identification module (204) can perform embedding to convert the knowledge graph of the previous section and the video input of the current section into data that can be compared. For example, the embedding can be performed based on an embedding model. The embedding model is an AI model (e.g., a deep learning model or a neural network model) for generating embedding vectors, and may be stored in an electronic device (100) or stored on an external server. The embedding model can compress the features of the input data into a vector and output it.
[0099] According to one embodiment, the key point identification module (204) compares a first embedding vector, which is the result of embedding a knowledge graph of a previous section, with a second embedding vector, which is the result of embedding a video input of a current section, and can identify a point in time where the first embedding vector and the second embedding vector are different as a key point. According to one embodiment, the key point identification module (204) may also compare the first embedding vector and the second embedding vector by converting them into a specific format or another dimension through a projection layer, respectively.
[0100] According to one embodiment, the key point identification module (204) may provide information regarding a first section containing a key point to the first prompt configuration module (206). For example, the first section may be set in frame units, but is not limited thereto and may also be set in various time units.
[0101] According to one embodiment, the first prompt configuration module (206) may perform a preprocessing operation for the analysis of video content corresponding to the first section. For example, the first prompt configuration module (206) may acquire first modalities corresponding to the first section in the video content, acquire a knowledge graph for each of the second modalities corresponding to the previous section (e.g., at least one second section prior to the first section), and configure a prompt for analysis. According to one embodiment, the prompt for analysis may instruct an analysis operation associated with each of the first modalities (e.g., object recognition operation, object interaction recognition operation, or scene or situation analysis operation), or instruct to output the result of the analysis operation in a set format (e.g., text format).
[0102] According to one embodiment, when the first modalities, the knowledge graph of each of the second modalities, or the prompt for analysis is configured, the first prompt configuration module (206) can provide the configured information to the analysis module (208).
[0103] According to one embodiment, the analysis module (208) can perform an analysis operation for each of the first modalities based on information provided by the first prompt configuration module (206). The analysis module (208) can perform an analysis operation for each of the first modalities using a multimodal AI model stored in the memory (110) of the electronic device (100) or on an external server. According to one embodiment, the multimodal AI model can receive various types of data (e.g., image data, video data, audio data, or text data) as input and output information of a set type. For example, the multimodal AI model may include a multimodal language model (e.g., a multimodal LLM (large language model)) for acquiring a knowledge graph corresponding to each of the first modalities as text-type information. According to one embodiment, the multimodal AI model may be stored in the electronic device (100) or on an external server.
[0104] According to one embodiment, the analysis module (208) can perform at least one of an image-language analysis operation, an audio-language analysis operation, a language-language analysis operation, or a video-language analysis operation using a multimodal AI model. The analysis module (208) can obtain analysis information (or text information) using language based on at least one analysis operation.
[0105] According to one embodiment, the analysis module (208) can obtain the analysis result of the image modality (or image data) of the first section as first analysis information through an image-language analysis operation. The first analysis information is text-type information and can correspond to a knowledge graph of the image modality of the first section.
[0106] According to one embodiment, the analysis module (208) can obtain the analysis result of the audio modality (or audio data) of the first section as second analysis information through an audio-language analysis operation. The second analysis information is text-type information and can correspond to a knowledge graph of the audio modality of the first section.
[0107] According to one embodiment, the analysis module (208) can obtain the analysis result of the text modality (or text data) of the first section as third analysis information through a language-language analysis operation. The third analysis information is text-type information and can correspond to a knowledge graph of the text modality of the first section.
[0108] According to one embodiment, the analysis module (208) can obtain the analysis result of the video modality (or video data) of the first section as fourth analysis information through a video-language analysis operation. The fourth analysis information is text-type information and can correspond to a knowledge graph of the video modality of the first section.
[0109] According to one embodiment, the inspection module (210) may perform an inspection operation on the first to fourth analysis information (or knowledge graphs corresponding to each of the first modalities) obtained by the analysis module (208). For example, the inspection operation may include an operation to perform at least one of a conflict inspection between modalities (212) or a knowledge consistency inspection (214). The conflict inspection between modalities (212) or the knowledge consistency inspection (214) may be performed based on an AI model. For example, the AI model may be an AI model stored in an electronic device (100) or stored in an external server, and may include a language model (e.g., LLM) for outputting the inspection result as language information (or text information).
[0110] According to one embodiment, the conflict check between modalities (212) may represent an operation of checking whether the degree of agreement between the first modalities regarding common information (e.g., object information, background information, or interaction information between objects) is greater than or equal to a first threshold, based on the knowledge graph corresponding to each of the first modalities obtained by the analysis module (208). According to one embodiment, the first threshold may be a threshold based on the degree of agreement between the second modalities obtained in a second section prior to the first section. The degree of agreement between the second modalities may be obtained based on the knowledge graph corresponding to each of the second modalities. According to one embodiment, the inspection module (210) may identify that the conflict check between modalities (212) has passed based on the degree of agreement between the first modalities being greater than or equal to the first threshold. According to one embodiment, the inspection module (210) may identify that the conflict check between modalities (212) has failed based on the degree of agreement between the first modalities being less than the first threshold.
[0111] According to one embodiment, the knowledge consistency check (214) may represent an operation of checking whether the similarity between the knowledge graph corresponding to each of the first modalities corresponding to the first section and the knowledge graph corresponding to each of the second modalities corresponding to at least one second section prior to the first section is greater than or equal to a second threshold. According to one embodiment, the inspection module (210) may identify that the knowledge consistency check (214) has passed based on the similarity being greater than or equal to the second threshold. According to one embodiment, the inspection module (210) may identify that the knowledge consistency check (214) has failed based on the similarity being less than the second threshold.
[0112] According to one embodiment, the inspection module (210) may perform the conflict check between modalities (212) and the knowledge consistency check (214) in parallel or simultaneously, perform the knowledge consistency check (214) after the conflict check between modalities (212), or perform the conflict check between modalities (212) after the knowledge consistency check (214). According to one embodiment, the inspection module (210) may perform the conflict check between modalities (212) and the knowledge consistency check (214) independently, or perform only one of the two. For example, the inspection module (210) may perform the knowledge consistency check (214) without the conflict check between modalities (212), or perform the conflict check between modalities (212) without the knowledge consistency check (214).
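The two threshold checks described above can be illustrated with a minimal sketch. The description states that the checks may be performed by an AI model (e.g., an LLM) and does not specify how the degree of agreement or the similarity is computed; the sketch below substitutes a simple Jaccard overlap of object names purely for illustration. The function names, the dictionary layout, and the sample values are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections; 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def modality_agreement(graphs):
    """Mean pairwise overlap of object names across per-modality graphs.

    ASSUMPTION: a stand-in for the 'degree of agreement'; not from the source.
    """
    pairs = list(combinations(graphs.values(), 2))
    if not pairs:
        return 1.0
    return sum(jaccard(g["Object"], h["Object"]) for g, h in pairs) / len(pairs)


def conflict_check(current, first_threshold):
    """Pass when agreement between the first modalities meets the first threshold."""
    return modality_agreement(current) >= first_threshold


def consistency_check(current, previous, second_threshold):
    """Pass when each modality's graph is similar enough to its previous graph."""
    shared = current.keys() & previous.keys()
    return all(
        jaccard(current[m]["Object"], previous[m]["Object"]) >= second_threshold
        for m in shared
    )


# Hypothetical per-modality graphs for the second (previous) and first (current) sections.
previous = {"image": {"Object": ["person", "TV"]}, "audio": {"Object": ["person", "TV"]}}
current = {"image": {"Object": ["person", "TV", "dog"]}, "audio": {"Object": ["person", "dog"]}}

# The first threshold is derived here from the previous section's agreement.
strict_threshold = modality_agreement(previous)
print(conflict_check(current, 0.5), conflict_check(current, strict_threshold))
print(consistency_check(current, previous, 0.3))
```

Deriving the first threshold directly from the previous section's agreement is only one possible reading of "a threshold based on the degree of agreement between the second modalities"; a scaled or smoothed value would fit the description equally well.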
[0113] According to one embodiment, the inspection module (210) can identify whether at least one of the conflict check between modalities (212) or the knowledge consistency check (214) has been passed (216). If at least one of the conflict check between modalities (212) or the knowledge consistency check (214) has been passed, the inspection module (210) can update the knowledge graph DB (220) by adding the knowledge graph obtained by the analysis module (208) (e.g., a knowledge graph corresponding to each of the first modalities) to the knowledge graph DB (220) (219). According to one embodiment, if the knowledge graph DB (220) is contained in an external server, the knowledge graph obtained by the analysis module (208) can be transmitted to the external server via wireless communication (e.g., Wi-Fi communication or Bluetooth communication).
[0114] According to one embodiment, the inspection module (210) may instruct the second prompt configuration module (218) to generate a prompt for reacquiring the knowledge graph based on at least one of the conflict check between modalities (212) or the knowledge consistency check (214) not being passed.
[0115] For example, based on the failure of the conflict check between modalities (212), the inspection module (210) may instruct the second prompt configuration module (218) to generate a prompt based on a feature (or different feature) with low agreement between the first modalities.
[0116] For example, based on the failure of the knowledge consistency check (214), the inspection module (210) may instruct the second prompt configuration module (218) to generate a prompt based on the features that differ between the knowledge graph corresponding to each of the first modalities included in the first section and the knowledge graph corresponding to each of the second modalities included in at least one second section prior to the first section.
[0117] According to one embodiment, the second prompt configuration module (218) can generate a prompt based on instructions from the inspection module (210) and output the generated prompt to the analysis module (208).
[0118] According to one embodiment, when a prompt is output from the second prompt configuration module (218), the analysis module (208) can re-perform an analysis operation for each of the first modalities based on a knowledge graph corresponding to each of the first modalities, the first modalities, or the second modalities. The analysis module (208) can re-obtain a knowledge graph corresponding to each of the first modalities by re-performing an analysis operation for each of the first modalities using a multimodal AI model. The re-obtained knowledge graph can be output to the inspection module (210) so that the aforementioned operations can be re-performed.
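The analyze, inspect, and re-prompt cycle described above can be sketched as a simple loop. The three callables and the attempt limit are hypothetical; the description does not state how many times re-acquisition may be repeated, so `max_attempts` is an assumption added only to keep the sketch from looping forever. The stub functions stand in for the multimodal AI model and the inspection module.

```python
def acquire_knowledge_graph(analyze, inspect, build_retry_prompt, prompt, max_attempts=3):
    """Run analysis, inspect the result, and re-prompt when inspection fails.

    analyze(prompt) -> per-modality graphs
    inspect(graphs) -> (passed, differing_features)
    build_retry_prompt(differing_features) -> new prompt text
    All three callables are hypothetical placeholders.
    """
    for attempt in range(1, max_attempts + 1):
        graphs = analyze(prompt)
        passed, differing = inspect(graphs)
        if passed:
            return graphs, attempt  # caller may now add the graphs to the DB
        prompt = build_retry_prompt(differing)
    return None, max_attempts


def fake_analyze(prompt):
    # Stub model: the image analysis only finds the dog once the prompt mentions it.
    image_objects = ["person", "dog"] if "dog" in prompt else ["person"]
    return {"image": {"Object": image_objects}, "audio": {"Object": ["person", "dog"]}}


def fake_inspect(graphs):
    # Stub inspection: fail when the two modalities list different objects.
    image = set(graphs["image"]["Object"])
    audio = set(graphs["audio"]["Object"])
    differing = sorted(image ^ audio)
    return (not differing, differing)


def fake_retry_prompt(differing):
    return "Re-analyze each modality, paying attention to: " + ", ".join(differing)


graphs, attempts = acquire_knowledge_graph(
    fake_analyze, fake_inspect, fake_retry_prompt, "Analyze each modality."
)
print(attempts, graphs["image"]["Object"])
```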
[0119] With reference to FIGS. 3 to 8, the operation of the key point identification module (204), the first prompt configuration module (206), and the inspection module (210) of FIG. 2 will be described in detail below.
[0120] FIG. 3 is a diagram illustrating the operation of a key point identification module according to one embodiment.
[0121] Referring to FIG. 3, the key point identification module (204) of FIG. 2 can perform the operation of identifying key points in video content (312) input during a set interval. The key point identification module (204) can identify key points of video content (312) based on video content (312) and knowledge graph (302) of the previous interval stored in the knowledge graph DB (220).
[0122] According to one embodiment, the key point identification module (204) can generate a graph embedding (306) from a prior knowledge graph (302) using a graph embedding model (304). The graph embedding model (304) can be used to generate a graph embedding (306), which is information vectorized from features of the prior knowledge graph (302) (e.g., object information and / or interaction information between objects).
[0123] According to one embodiment, the key point identification module (204) can convert the video content (312) into video embeddings (316) of a set unit using a video embedding model (314). The set unit may be a frame unit, and the video embeddings (316) may be generated for each frame. The video embedding model (314) may be used to generate the video embeddings (316), which are information in which features of the video content (312) (e.g., object information and / or interaction information between objects) are vectorized in a set unit (e.g., frame unit).
[0124] According to one embodiment, the graph embedding model (304) and the video embedding model (314) may be AI models included in the electronic device (100) or an external server. The graph embedding (306) and the video embedding (316) generated through the graph embedding model (304) and the video embedding model (314) may be input to the first projection layer (308) and the second projection layer (318), respectively.
[0125] According to one embodiment, the first projection layer (308) and the second projection layer (318) can output comparable information by converting the graph embedding (306) and the video embedding (316) respectively (e.g., changing dimensions, adjusting size or orientation, or removing unnecessary information). According to one embodiment, the first projection layer (308) can output a first projection vector by converting the graph embedding (306), and the second projection layer (318) can output a second projection vector by converting the video embedding (316).
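As a rough illustration of how two projection layers can bring embeddings of different sizes into a comparable form, the sketch below applies a linear map to each embedding and then normalizes the result to unit length. The weight matrices, embedding values, and dimensions are hypothetical; in practice such weights would typically be learned, and the description does not limit the projection to a linear map.

```python
import math


def project(vector, weights):
    """Apply a linear projection (one row of `weights` per output dimension)
    and scale the result to unit length so that vectors are comparable."""
    out = [sum(w * x for w, x in zip(row, vector)) for row in weights]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out] if norm else out


# Hypothetical embeddings of different dimensionality.
graph_embedding = [1.0, 0.0, 2.0, 1.0]   # e.g., output of the graph embedding model
video_embedding = [0.5, 1.0, 0.5]        # e.g., output of the video embedding model

# Hypothetical fixed weights mapping both embeddings into a shared 2-d space.
first_projection_weights = [[1, 0, 0, 0], [0, 0, 1, 0]]   # 2 x 4
second_projection_weights = [[1, 0, 0], [0, 0, 1]]        # 2 x 3

first_projection_vector = project(graph_embedding, first_projection_weights)
second_projection_vector = project(video_embedding, second_projection_weights)
print(first_projection_vector, second_projection_vector)
```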
[0126] According to one embodiment, the first projection vector and the second projection vector may be used for importance calculation (310). For example, the importance calculation (310) may be performed based on the following [Equation 1].
[0127]
[0128] Referring to [Equation 1], Importance represents importance, and Cosine Similarity can be calculated based on the following [Equation 2] as cosine similarity.
[0129] [Equation 2] $\text{Cosine Similarity}(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
[0130] Referring to [Equation 2], $A$ can represent the first projection vector and $B$ can represent the second projection vector. $A \cdot B$ represents the inner product of the first projection vector and the second projection vector, and $\lVert A \rVert$ and $\lVert B \rVert$ can represent the magnitudes (Euclidean norms) of the first projection vector and the second projection vector, respectively.
[0131] According to one embodiment, the importance calculation (310) may be performed on a frame-by-frame basis, and the importance (320) of each frame may be determined as a value included in a set range (e.g., 0 to 1). According to one embodiment, the key point identification module (204) may identify at least one frame corresponding to a key point based on the importance (320) of each frame. For example, the key point identification module (204) may identify at least one frame with an importance of at least a threshold value (e.g., 0.5) or a set number of frames identified in order of importance (e.g., frames corresponding to the top 25% of importance) as being included in the key point. The key point identification module (204) may identify a section corresponding to the identified at least one frame as a key section for analysis.
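The frame-by-frame importance calculation and the two selection rules (threshold and top fraction) can be sketched as follows. [Equation 1] is not reproduced in this text, so the sketch assumes that importance equals one minus the cosine similarity, clamped to the range 0 to 1; this is an assumption consistent with identifying frames that differ from the previous knowledge graph, not the formula of the disclosure. All vectors and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Inner product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def importance(graph_vec, frame_vec):
    # ASSUMPTION: importance = 1 - cosine similarity, clamped to [0, 1].
    return min(1.0, max(0.0, 1.0 - cosine_similarity(graph_vec, frame_vec)))


def key_frames_by_threshold(scores, threshold=0.5):
    """Indices of frames whose importance is at least the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def key_frames_by_top_fraction(scores, fraction=0.25):
    """Indices of the top `fraction` of frames by importance (at least one frame)."""
    count = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:count])


# Hypothetical projection vectors: one for the previous graph, one per frame.
graph_vec = [1.0, 0.0]
frame_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
scores = [importance(graph_vec, f) for f in frame_vecs]

print(key_frames_by_threshold(scores))
print(key_frames_by_top_fraction(scores))
```

Under this assumption, a frame whose projection vector points in the same direction as the previous knowledge graph scores 0 and is skipped, while frames that diverge from it score close to 1 and are selected for analysis.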
[0132] FIG. 4 is a diagram illustrating the operation of a first prompt configuration module and an analysis module according to one embodiment.
[0133] Referring to FIG. 4, a knowledge graph corresponding to a key section can be obtained through the analysis module (208). To obtain the knowledge graph by the analysis module (208), a preprocessing operation by the first prompt configuration module (206) of FIG. 2 can be performed. According to one embodiment, the first prompt configuration module (206) can perform a preprocessing operation to input, into the analysis module (208), modalities (402) included in the video content of the key section (e.g., at least two of image data, video data, audio data, or text data), a previous knowledge graph (404) (e.g., a knowledge graph and / or integrated knowledge graph corresponding to each of the modalities of the section prior to the key section), or a prompt (406) for analysis.
[0134] According to one embodiment, where the key section is a first section and the section prior to the key section is at least one second section, the first prompt configuration module (206) can perform the following operations. The first prompt configuration module (206) can acquire first modalities (e.g., image data, video data, audio data, or text data) included in the first section of the input video as the modalities (402) to be input to the analysis module (208). The first prompt configuration module (206) can input the acquired first modalities to the analysis module (208).
[0135] The first prompt configuration module (206) can obtain a knowledge graph and / or integrated knowledge graph corresponding to each of the second modalities (e.g., image data, video data, audio data, or text data) included in at least one second section from the knowledge graph DB (220) of FIG. 2. For example, the integrated knowledge graph is obtained based on the knowledge graph corresponding to each of the second modalities and may be a knowledge graph of video content for at least one second section. The knowledge graph and / or integrated knowledge graph corresponding to each of the second modalities obtained by the first prompt configuration module (206) may be input to the analysis module (208) as a previous knowledge graph (404).
[0136] The first prompt configuration module (206) can configure a prompt (406) for analysis and input it into the analysis module (208). The prompt (406) may instruct an analysis operation associated with each of the first modalities, or instruct the output of the results of the analysis operation in a set format (e.g., text format). According to one embodiment, the prompt (406) may include information instructing the recognition of an object in a set format for each of the first modalities (e.g., a format for expressing data as structured text, such as JSON (javascript object notation)), instructing the recognition of information about the area containing the object (e.g., coordinate information or bounding box information), instructing the recognition of interactions between objects, between backgrounds, or between an object and a background in a set form (e.g., tuple form), or instructing the creation of a caption in text form for a scene or situation described by each of the first modalities. For example, the prompt (406) may include information such as “analyze each modality (e.g., image, audio, subtitle, or video) using the previous interval knowledge graph” and may also include additional information about the output format.
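A minimal sketch of how a prompt for analysis might be assembled from the modalities and the previous knowledge graph is given below. The function name, the line layout of the prompt, and the use of JSON to serialize the previous knowledge graph are assumptions for illustration; the output format string follows the example format quoted elsewhere in this description.

```python
import json

# Output format string modeled on the example format given in this description.
OUTPUT_FORMAT = (
    "{'Object': [(object, coordinate)], 'Background': word, "
    "'Interaction': [(object, object, relationship)], 'Caption': sentence}"
)


def build_analysis_prompt(modalities, previous_graph):
    """Compose a text prompt from modality names and the previous knowledge graph.

    Hypothetical helper: the three-line layout is an illustrative choice.
    """
    lines = [
        "Analyze each modality (" + ", ".join(modalities)
        + ") using the previous section knowledge graph.",
        "Previous section knowledge graph: " + json.dumps(previous_graph),
        "The output format is as follows: " + OUTPUT_FORMAT,
    ]
    return "\n".join(lines)


prompt = build_analysis_prompt(
    ["image", "audio"],
    {"Background": "living room", "Caption": "person is watching TV"},
)
print(prompt)
```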
[0137] According to one embodiment, the analysis module (208) may be executed based on modalities (e.g., first modalities) (402), a prior knowledge graph (404), or a prompt (406) input by the first prompt configuration module (206). The analysis module (208) may generate knowledge graphs corresponding to the first interval based on the input modalities (402), prior knowledge graph (404), or prompt (406) using a multimodal AI model. For example, the analysis module (208) may generate a knowledge graph corresponding to each of the first modalities, such as an image modality knowledge graph (412), an audio modality knowledge graph (414), a text modality knowledge graph (416), or a video modality knowledge graph (418).
[0138] According to one embodiment, the analysis module (208) can generate each knowledge graph as text-type information or language information by using a language model included in a multimodal AI model. Each knowledge graph can be stored or output in a knowledge graph DB (e.g., the knowledge graph DB (220) of FIG. 2) as text-type, and the text-type knowledge graph can also be input / output or used among modules within the knowledge graph acquisition device (200).
[0139] The process of generating an image modality knowledge graph is described below with reference to FIGS. 5a and 5b. Although FIGS. 5a and 5b describe the process of generating an image modality knowledge graph as an example, audio modality knowledge graphs, text modality knowledge graphs, or video modality knowledge graphs can also be generated in a similar manner.
[0140] FIG. 5a is a drawing illustrating an image modality included in a key section according to one embodiment.
[0141] According to one embodiment, the video of the first section, which is the key section, may include an image modality (500) as illustrated in FIG. 5a. The analysis module (208) illustrated in FIG. 2 or FIG. 4 may generate an image modality knowledge graph (e.g., image modality knowledge graph (412) of FIG. 4) based on the image modality (500), a prior knowledge graph (e.g., prior knowledge graph (404) of FIG. 4), or a prompt (e.g., prompt (406) of FIG. 4).
[0142] According to one embodiment, the prior knowledge graph may include an image modality knowledge graph of a second section prior to the first section. For example, the image modality knowledge graph of the second section may be represented as structured information of a text type, as shown in [Table 1] below.
[0143]
[0144] Referring to [Table 1], the image modality knowledge graph of the second segment may include time information associated with the second segment (e.g., 00:20-00:30), object information included in at least one frame of the second segment, background information (e.g., living room), interaction information between objects of the second segment, or between an object and a background (e.g., person, TV, watching), or caption information associated with a scene or situation of the second segment (e.g., person is watching TV). According to one embodiment, the object information may include object name information (e.g., dog, cat, person, or TV) and object location information (e.g., coordinate information of a bounding box containing the object, or coordinate information in the form of [x, y, w, h], where x and y indicate a reference position on the X-axis and Y-axis, respectively, and w and h indicate a width and height based on the reference position, respectively).
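The structured, text-type knowledge graph described above could be held in memory as shown in the sketch below. The time, background, interaction, and caption values are the examples given in the description; the key names, the bounding box numbers, and the helper function are hypothetical. The helper assumes that the reference position (x, y) is the top-left corner of the box, which the description does not state.

```python
# Illustrative image modality knowledge graph for the second section.
previous_image_graph = {
    "Time": "00:20-00:30",
    "Object": [
        ("person", [120, 80, 200, 400]),   # [x, y, w, h]; numbers are hypothetical
        ("TV", [600, 150, 500, 300]),      # [x, y, w, h]; numbers are hypothetical
    ],
    "Background": "living room",
    "Interaction": [("person", "TV", "watching")],
    "Caption": "person is watching TV",
}


def bbox_corners(box):
    """Convert [x, y, w, h] into (x_min, y_min, x_max, y_max).

    ASSUMPTION: (x, y) is the top-left corner of the bounding box.
    """
    x, y, w, h = box
    return (x, y, x + w, y + h)


for name, box in previous_image_graph["Object"]:
    print(name, bbox_corners(box))
```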
[0145] According to one embodiment, a prompt as shown in the following [Table 2] can be input into the analysis module (208).
[0146]
[0147] Referring to [Table 2], the prompt may be pre-set in the first prompt configuration module (206) of FIG. 2 or may be entered by a user. The prompt is intended to instruct the analysis module (208) to perform an analysis operation and may include, for example, “Analyze the image using the given image and the previous section knowledge graph, and the output format is as follows: {'Object': [(object, coordinate)], 'Background': word, 'Interaction': [(object, object, relationship)], 'Caption': sentence}”.
[0148] According to one embodiment, the output format may represent a format for outputting analysis results from the analysis module (208). For example, the output format may be a text format built from several kinds of brackets. An outer bracket (e.g., curly braces '{}') may contain other brackets (e.g., square brackets '[]' and / or parentheses '()'), and a pair of parentheses may represent an ordered tuple of two or more elements.
[0149] According to one embodiment, the output format of the analysis result may be specified in a format such as {key:value}. For example, the key may represent an object, background, interaction, or caption as information to be analyzed, and the value may represent a value corresponding to the key. Such a format may facilitate calling or searching for a value (or key) corresponding to a key (or value). The value may be specified in a different format for each key. For example, a value corresponding to 'object' may be specified in the format '[(object, coordinate)]' representing an object (or object name) and coordinate information; a value corresponding to 'background' may be specified as 'word'; a value corresponding to 'interaction' may be specified in the format '[(object, object, relationship)]' representing a relationship between objects; and a value corresponding to 'caption' may be specified as 'sentence'.
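As an illustrative sketch only, and not part of the disclosed embodiments, the {key:value} output format described above can be read back into a structured object. The example assumes the analysis result is emitted as a Python-style literal. The function name `parse_analysis_output`, the constant `REQUIRED_KEYS`, and the sample values are hypothetical.

```python
import ast

# Keys named in the example prompt; the exact spelling is an assumption.
REQUIRED_KEYS = ("Object", "Background", "Interaction", "Caption")


def parse_analysis_output(text):
    """Parse a text-type analysis result and check that the four keys exist."""
    result = ast.literal_eval(text)  # safely evaluates literals only
    if not isinstance(result, dict):
        raise ValueError("analysis output must be a {key: value} mapping")
    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return result


# Hypothetical model output in the prompted format.
sample = (
    "{'Object': [('dog', [100, 50, 50, 100]), ('cat', [50, 50, 0, 50])], "
    "'Background': 'living room', "
    "'Interaction': [('dog', 'cat', 'chasing')], "
    "'Caption': 'A dog is chasing a cat in the living room.'}"
)
parsed = parse_analysis_output(sample)
```

Because each key has its own value format, a lookup such as `parsed['Interaction']` returns the list of (object, object, relationship) tuples directly.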
[0150] According to one embodiment, the analysis module (208) can generate an image modality knowledge graph of the first section based on input of the image modality (500) of the first section shown in FIG. 5a, an image modality knowledge graph of the second section as shown in [Table 1], or a prompt as shown in [Table 2]. For example, the analysis module (208) can generate text-type information as shown in the following [Table 3] using a multimodal AI model or a language model.
[0151]
[0152] Referring to [Table 3], the analysis information of the image modality of the first section may be output in the format indicated by the prompt in [Table 2]. According to one embodiment, the time information (e.g., 00:30-00:40) may not be information provided by a multimodal AI model or a language model, but may be time information corresponding to the first section that is automatically inserted when the analysis information is output.
[0153] According to one embodiment, [Table 3] is an image modality knowledge graph of the first section expressed as text-type information, and the information in [Table 3] can be converted into a knowledge graph shown in FIG. 5b.
[0154] FIG. 5b is a diagram illustrating an image modality knowledge graph according to one embodiment.
[0155] Referring to FIG. 5b, the image modality knowledge graph (502) may include object information such as a dog (504), a cat (506), a person (508), or a TV (510), location information for each object such as [100,50,50,100], [50,50,0,50], [123,456,789,12], or [255,255,255,255], background information such as a living room (512), or caption information (514).
[0156] According to one embodiment, interaction information may be indicated between a plurality of objects. For example, between a dog (504) and a cat (506), “chasing” (516) may be indicated as interaction information, and between a person (508) and a TV (510), “seeing” (518) may be indicated as interaction information. A plurality of objects in an interaction relationship may be connected by edges.
[0157] According to one embodiment, the caption information (514) may include a description of a scene or situation depicted by the image modality of the first section. For example, the caption information (514) may include text information such as “In the living room, a dog is chasing a cat and a person is watching TV.”
[0158] As described above, the image modality (500) of FIG. 5a can be generated as an image modality knowledge graph of FIG. 5b, and the generated knowledge graph is formed as a text type so that image analysis information for the first section can be managed more effectively.
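The conversion from text-type information to the graph of FIG. 5b can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `text_to_graph` and the node and edge layout are hypothetical, and the pairing of each bounding box with a particular object is assumed for the example.

```python
def text_to_graph(info):
    """Convert text-type analysis information into nodes and labeled edges."""
    nodes = {}
    for name, bbox in info["Object"]:
        nodes[name] = {"kind": "object", "bbox": list(bbox)}
    nodes[info["Background"]] = {"kind": "background"}
    # Each interaction (subject, target, relationship) becomes one labeled edge.
    edges = [(subj, rel, tgt) for subj, tgt, rel in info["Interaction"]]
    return {"nodes": nodes, "edges": edges, "caption": info["Caption"]}


# Values taken from the FIG. 5b description; the object-to-box pairing is assumed.
info = {
    "Object": [
        ("dog", [100, 50, 50, 100]),
        ("cat", [50, 50, 0, 50]),
        ("person", [123, 456, 789, 12]),
        ("TV", [255, 255, 255, 255]),
    ],
    "Background": "living room",
    "Interaction": [("dog", "cat", "chasing"), ("person", "TV", "seeing")],
    "Caption": "In the living room, a dog is chasing a cat and a person is watching TV.",
}
graph = text_to_graph(info)
```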
[0159] FIG. 6 is a diagram illustrating collision detection between modalities by an inspection module according to one embodiment.
[0160] According to one embodiment, when a knowledge graph corresponding to each of the first modalities corresponding to the first section (e.g., image modality knowledge graph (412), audio modality knowledge graph (414), text modality knowledge graph (416), or video modality knowledge graph (418) of FIG. 4) is generated by the analysis module (208) of FIG. 2 or FIG. 4, a collision check between modalities (e.g., collision check between modalities (212) of FIG. 2) may be performed by the inspection module (210) of FIG. 2. According to one embodiment, the collision check between modalities may be performed to check the degree of agreement (or information agreement) between the first modalities regarding common information in the knowledge graph corresponding to each of the first modalities.
[0161] Referring to FIG. 6, the inspection module (210) can input, into the AI model (610), the knowledge graph corresponding to each of the first modalities of the first section as a modality-specific knowledge graph (606), together with a prompt (608) for identifying common information. According to one embodiment, the AI model (610) may be an AI model stored in the electronic device (100) or on an external server, and may include a language model (e.g., an LLM) that outputs its result for the input information as text information.
[0162] According to one embodiment, a prompt (608) for identifying common information may include information such as “identify common information in a knowledge graph corresponding to each of the first modalities,” “distinguish individual information and common information in a knowledge graph corresponding to each of the first modalities,” or “distinguish whether each target (e.g., object or background) or situation (or event or incident) is information that is commonly described by all of the first modalities or information that is described by a specific modality among the first modalities.”
[0163] According to one embodiment, the AI model (610) may output at least one of individual information (612) of each modality or common information (614) of the modalities based on input information (e.g., a knowledge graph corresponding to each of the first modalities and a prompt (608) for identifying common information). For example, if the common information of the first modalities is output as the common information (614) of the modalities, the output common information may be input into the AI model (618) along with a prompt (616) for identifying (or obtaining) the degree of agreement. According to one embodiment, the AI model (618) may be the same as or different from the AI model (610) used for identifying common information. According to one embodiment, the AI model (618) may be an AI model stored in the memory (110) of the electronic device (100) or stored on an external server, and may include a language model (e.g., LLM) for outputting the output result of the input information as text information.
[0164] According to one embodiment, the prompt (616) for identifying the degree of agreement may include information instructing to predict or calculate the degree of agreement for each modality pair for common information and / or information instructing the format for displaying the degree of agreement. For example, the prompt (616) for identifying the degree of agreement may include 'determine at least one modality pair and predict the degree of agreement for the description of the common information in each modality pair. The degree of agreement for each modality pair is indicated as ○ if the descriptions match each other, as △ if there are differences in details in the descriptions, and as Х if the descriptions conflict or contradict each other.'
[0165] According to one embodiment, the AI model (618) may output degree of agreement information (620) for each pair of modalities regarding common information. For example, the degree of agreement information (620) may include information indicating the degree of agreement between different modalities. The degree of agreement may be indicated by a level, score, or ratio, or in various forms using symbols, numbers, or characters.
[0166] According to one embodiment, in the degree of agreement information (620), an image-audio modality pair or an audio-image modality pair may be indicated as a degree of agreement by Х, an audio-video modality pair or a video-audio modality pair may be indicated as a degree of agreement by △, and an image-video modality pair or a video-image modality pair may be indicated as a degree of agreement by ○. Here, Х may indicate that the degree of agreement of the modality pair is less than a first threshold or that the modality pair contains conflicting or contradictory content, △ may indicate that the degree of agreement of the modality pair is less than a first threshold or that the modality pair contains different content, and ○ may indicate that the degree of agreement of the modality pair is greater than or equal to a first threshold or that the modality pair contains identical or similar content.
[0167] According to one embodiment, in the degree of agreement information (620), modality pairs for which no common information exists, such as language-image, language-audio, or language-video modality pairs, and identical modality pairs, such as image-image, audio-audio, language-language, or video-video modality pairs, may be indicated by '-'. The '-' symbol may indicate that identifying the degree of agreement is impossible or unnecessary.
[0168] According to one embodiment, the degree of agreement indicated in the degree of agreement information (620) (e.g., degree of agreement indicated by ○, △, or Х) may be based on the degree of agreement of each modality pair in the previous section. For example, the inspection module (210) may identify the degree of agreement of each modality pair as the existing degree of agreement based on the previous knowledge graph (e.g., knowledge graph and / or integrated knowledge graph corresponding to each of the modalities of the second section) (602) (604). The inspection module (210) may determine a first threshold based on the existing degree of agreement. For example, the inspection module (210) may use the existing degree of agreement as the first threshold. The first threshold may be determined to be the same or different for each modality pair.
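A per-pair first threshold derived from the existing degree of agreement can be sketched as below. The sketch assumes the degree of agreement is a numeric score, which is only one of the forms the description allows. The names `pair_key` and `passes_collision_check`, the fallback `default_threshold`, and all score values are hypothetical.

```python
def pair_key(first, second):
    """Order-independent key, since image-audio and audio-image are the same pair."""
    return frozenset((first, second))


def passes_collision_check(current, previous, default_threshold=0.5):
    """Compare each pair's current agreement with its own first threshold.

    The existing (previous-section) agreement of a pair is used as that pair's
    threshold; default_threshold is an assumed fallback for unseen pairs.
    """
    results = {}
    for pair, score in current.items():
        threshold = previous.get(pair, default_threshold)
        results[pair] = score >= threshold
    return results


previous = {pair_key("image", "video"): 0.75, pair_key("audio", "video"): 0.5}
current = {
    pair_key("video", "image"): 0.75,
    pair_key("audio", "video"): 0.25,
    pair_key("image", "audio"): 0.5,
}
results = passes_collision_check(current, previous)
```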
[0169] According to one embodiment, when the degree of agreement is ○ (○ case), the inspection module (210) can update the knowledge graph DB to include the knowledge graph of the corresponding modalities or perform a knowledge consistency check (e.g., the knowledge consistency check (214) of FIG. 2) (622).
[0170] According to one embodiment, the inspection module (210) may instruct the second prompt configuration module (218) of FIG. 2 to generate a prompt (e.g., cross modal critic prompt) when the degree of agreement is △ or Х (△ or Х case) (624). Based on the instruction of the inspection module (210), the second prompt configuration module (218) may generate a prompt to re-acquire the knowledge graph of the corresponding modalities. For example, based on the degree of agreement being △, the second prompt configuration module (218) may generate a prompt of “You analyzed a as b through audio and analyzed a as b through video. However, the degree of agreement is too low compared to the existing degree of agreement between audio and video. Refer to this and regenerate the result.” For example, the second prompt configuration module (218) can generate a prompt for “analyzing a as b through the image and a as c through the audio. Regenerate the result by referring to which of b and c is more appropriate through the image and audio.” The prompt generated by the second prompt configuration module (218) can be input into the analysis module (208) and used to re-acquire the knowledge graph.
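The branching on ○, △, and Х described above can be sketched as follows. The function names `route_pairs` and `cross_modal_critic_prompt` are hypothetical, and the prompt wording is adapted from the example prompts in the description rather than taken from an actual implementation.

```python
# Symbols as used in the degree of agreement information (620).
MATCH, PARTIAL, CONFLICT, SKIP = "○", "△", "Х", "-"


def route_pairs(agreement):
    """Split modality pairs into those that pass and those needing regeneration."""
    passed, retry = [], []
    for pair, symbol in agreement.items():
        if symbol == MATCH:
            passed.append(pair)
        elif symbol in (PARTIAL, CONFLICT):
            retry.append(pair)
        # Pairs marked SKIP need no check and are ignored.
    return passed, retry


def cross_modal_critic_prompt(pair, target, first_result, second_result):
    """Build a regeneration prompt for one modality pair."""
    first, second = pair
    return (
        f"You analyzed {target} as {first_result} through {first} and as "
        f"{second_result} through {second}. Regenerate the result by referring "
        f"to which of {first_result} and {second_result} is more appropriate."
    )


agreement = {
    ("image", "audio"): CONFLICT,
    ("audio", "video"): PARTIAL,
    ("image", "video"): MATCH,
    ("language", "image"): SKIP,
}
passed, retry = route_pairs(agreement)
prompts = [cross_modal_critic_prompt(pair, "a", "b", "c") for pair in retry]
```

The passed pairs would proceed to the knowledge graph DB update or the knowledge consistency check, while each generated prompt would be fed back to the analysis module.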
[0171] FIG. 7 is a diagram illustrating a knowledge consistency check by an inspection module according to one embodiment.
[0172] According to one embodiment, a knowledge consistency check (e.g., the knowledge consistency check (214) of FIG. 2) may be performed after a knowledge graph (e.g., the image modality knowledge graph (412), audio modality knowledge graph (414), text modality knowledge graph (416), or video modality knowledge graph (418) of FIG. 4) corresponding to each of the first modalities corresponding to the first section is generated by the analysis module (208) of FIG. 2 or FIG. 4, or after a collision check between modalities of FIG. 6 has been passed. According to one embodiment, the knowledge consistency check may be performed to check the similarity between the knowledge graph corresponding to each of the first modalities corresponding to the first section and the knowledge graph corresponding to each of the second modalities corresponding to at least one second section prior to the first section. The knowledge consistency check may be performed by the inspection module (210) of FIG. 2.
[0173] Referring to FIG. 7, the inspection module (210) can input the previous knowledge graph (702) and the knowledge graph (704) generated by the analysis module (208) into the graph embedding model (708). According to one embodiment, the previous knowledge graph (702) may include a knowledge graph corresponding to each of the second modalities corresponding to at least one second section prior to the first section, and the knowledge graph (704) generated by the analysis module (208) may include a knowledge graph corresponding to each of the first modalities corresponding to the first section.
[0174] According to one embodiment, the graph embedding model (708) is an AI model included in an electronic device (100) or an external server, and can generate a first graph embedding (710) for each of the second modalities by vectorizing the features of the previous knowledge graph (702), and can generate a second graph embedding (712) for each of the first modalities by vectorizing the features of the knowledge graph (704) generated by the analysis module (208).
[0175] According to one embodiment, the inspection module (210) may use the first graph embedding (710) and the second graph embedding (712) for similarity calculation (716). For example, the similarity may be a cosine similarity as shown in [Equation 2] and may be a value in a set range (e.g., -1 to 1).
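For reference, the standard definition of cosine similarity between two embedding vectors can be sketched as below. This is the textbook formula and is not claimed to reproduce [Equation 2]; the function name and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(first, second):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(first) != len(second):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(first, second))
    norm_first = math.sqrt(sum(x * x for x in first))
    norm_second = math.sqrt(sum(y * y for y in second))
    if norm_first == 0 or norm_second == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_first * norm_second)


# Hypothetical graph embeddings for the previous and current sections.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The three sample calls cover the two ends and the midpoint of the -1 to 1 range mentioned above.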
[0176] According to one embodiment, the inspection module (210) can identify the degree of change per modality based on the previous knowledge graph (702) (704). The degree of change per modality may include the degree of change per time period for each of the second modalities of the second section, which is the previous section, and may be used to determine a weight value for the similarity calculation (716). For example, the weight value may be determined for each modality and may be obtained as “1 − (degree of change).” Here, the degree of change may represent the average value of the similarity (e.g., a value of 0 to 1) of the knowledge graph for the second modality between two adjacent time periods. According to one embodiment, by applying a weight to the similarity for each modality (714), the inspection may be adjusted so that modalities that continue to change over time pass the inspection even with low similarity, and modalities that do not change significantly over time pass the inspection only if they have high similarity.
[0177] According to one embodiment, the inspection module (210) identifies that the knowledge consistency check has passed when the weighted similarity is greater than or equal to a second threshold (e.g., 0) (718), and may update the knowledge graph DB to include the knowledge graph of the corresponding modality, or perform the collision check between modalities (212) of FIG. 2 or the collision check between modalities of FIG. 6 (720). The second threshold may be fixed or variable.
[0178] According to one embodiment, if the weighted similarity is less than a second threshold (e.g., 0) (722), the inspection module (210) may identify that the knowledge consistency check has not passed and instruct the second prompt configuration module (218) of FIG. 2 to generate a self-critic prompt (724). Based on the instructions of the inspection module (210), the second prompt configuration module (218) may generate a prompt based on features with low similarity. For example, the prompt may include information for reacquiring the knowledge graph, such as, “It was analyzed that a appeared in the image previously and b appears now, so re-check if b appears and generate a result.”
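The weighted pass or fail decision of the knowledge consistency check can be sketched as follows. The description does not state how the weight is applied, so multiplying the weight by the similarity is an assumption of this sketch. The function names and the default second threshold of 0 (taken from the example value above) are illustrative only.

```python
def knowledge_consistency_check(similarity, degree_of_change, second_threshold=0.0):
    """Return the weighted similarity and whether the check passes.

    Assumption: the weight (1 - degree of change) is applied by multiplication.
    """
    weight = 1.0 - degree_of_change
    weighted = weight * similarity
    return weighted, weighted >= second_threshold


def self_critic_prompt(previous_item, current_item, modality="image"):
    """Build a prompt for reacquiring the knowledge graph after a failed check."""
    return (
        f"It was analyzed that {previous_item} appeared in the {modality} "
        f"previously and {current_item} appears now, so re-check if "
        f"{current_item} appears and generate a result."
    )


weighted_ok, passed_ok = knowledge_consistency_check(0.5, 0.5)
weighted_bad, passed_bad = knowledge_consistency_check(-0.5, 0.5)
retry_prompt = None if passed_bad else self_critic_prompt("a", "b")
```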
[0179] According to one embodiment, the collision check between modalities of FIG. 6 and the knowledge consistency check of FIG. 7 may be performed simultaneously, in parallel, or sequentially. For example, the knowledge consistency check of FIG. 7 may be performed after the collision check between modalities of FIG. 6 is passed, or the collision check between modalities of FIG. 6 may be performed after the knowledge consistency check of FIG. 7 is passed. According to one embodiment, the collision check between modalities of FIG. 6 and the knowledge consistency check of FIG. 7 may each be performed independently, and either one may not be performed.
[0180] FIG. 8 is a diagram illustrating a knowledge graph integration operation according to one embodiment.
[0181] Referring to FIG. 8, a knowledge graph integration operation may be performed based on the fact that the collision check between modalities of FIG. 6 and / or the knowledge consistency check of FIG. 7 has been passed. The knowledge graph integration operation may be performed to obtain an integrated knowledge graph (808) based on the modality-specific knowledge graph (802). For example, the knowledge graph integration operation may be performed to generate an integrated knowledge graph (808) by integrating the modality-specific knowledge graphs (802).
[0182] According to one embodiment, a knowledge graph corresponding to each of the first modalities included in the first section as a modality-specific knowledge graph (802) and a prompt (804) for obtaining an integrated knowledge graph may be input to an AI model (806). The prompt (804) is information instructing the creation of an integrated knowledge graph, and may include, for example, “combine the information obtained from each modality to create one integrated knowledge graph.”
[0183] According to one embodiment, the AI model (806) may be an AI model included in an electronic device (100) or an external server, and may include a language model such as an LLM. Based on input information, the AI model (806) may integrate knowledge graphs corresponding to each of the first modalities to output an integrated knowledge graph (808) in the form of text. The integrated knowledge graph (808) may be stored in a knowledge graph DB (e.g., the knowledge graph DB (220) of FIG. 2) together with the knowledge graphs corresponding to each of the first modalities.
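How the modality-specific knowledge graphs (802) and the prompt (804) might be assembled into a single text input for the AI model (806) is sketched below. The call to the language model itself is omitted. The function name `build_integration_input`, the bracketed section labels, and the sample graph contents are hypothetical.

```python
import json

# Prompt text adapted from the example given for the prompt (804).
INTEGRATION_PROMPT = (
    "Combine the information obtained from each modality to create one "
    "integrated knowledge graph."
)


def build_integration_input(modality_graphs, prompt=INTEGRATION_PROMPT):
    """Concatenate the prompt and each text-type knowledge graph into one input."""
    sections = [
        f"[{modality}]\n{json.dumps(graph, ensure_ascii=False, sort_keys=True)}"
        for modality, graph in modality_graphs.items()
    ]
    return prompt + "\n\n" + "\n\n".join(sections)


# Hypothetical per-modality graphs for the first section.
model_input = build_integration_input(
    {
        "image": {"Background": "living room", "Caption": "A person is watching TV."},
        "audio": {"Caption": "A TV program is playing."},
    }
)
```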
[0184] According to one embodiment, the knowledge graph corresponding to each of the first modalities may include an image modality knowledge graph and an audio modality knowledge graph, as shown in the following [Table 4].
[0185]
[0186] According to one embodiment, the image modality knowledge graph and the audio modality knowledge graph shown in [Table 4] can be integrated through an AI model (806) as shown in the following [Table 5]. The integrated knowledge graph of [Table 5] can be generated in a combined form of information on image and audio modalities.
[0187]
[0188] The operation of an electronic device (e.g., the electronic device (100) of FIG. 1a or FIG. 1b) will be described below with reference to FIGS. 9 through 14. According to one embodiment, the operations illustrated in FIGS. 9 through 14 may be understood to be performed by a processor of the electronic device (e.g., the processor (120) of FIG. 1a or FIG. 1b). The operations illustrated in FIGS. 9 through 14 may be performed in various orders, not limited to the order shown. According to one embodiment, at least some of the operations illustrated in FIGS. 9 through 14 may be omitted, or more operations may be performed than those illustrated in FIGS. 9 through 14.
[0189] FIG. 9 is a flowchart illustrating the operation of an electronic device according to one embodiment.
[0190] Referring to FIG. 9, in operation 902, the electronic device can obtain a knowledge graph corresponding to each of the first modalities included in the first section of the video content through a multimodal AI model. According to one embodiment, the first modalities may correspond to at least two of image data, video data, audio data, or text data included in the first section of the video content.
[0191] According to one embodiment, the first section may include at least one frame among the frames included in the video content input during the set section that includes a feature different from a second knowledge graph corresponding to the video content obtained for the second section prior to the first section.
[0192] According to one embodiment, the multimodal AI model may include a multimodal language model for acquiring a knowledge graph corresponding to each of the first modalities as text type or text format information.
[0193] In operation 904, if the degree of agreement between the first modalities for common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold, the electronic device can obtain a first knowledge graph of the video content corresponding to the first section based on the knowledge graph corresponding to each of the first modalities.
[0194] According to one embodiment, the electronic device may obtain common information through an AI model (e.g., the AI model (610) of FIG. 6). For example, the electronic device may obtain common information through an AI model based on a knowledge graph corresponding to each of the first modalities and a prompt (e.g., identifying common information in the knowledge graph corresponding to each of the first modalities). According to one embodiment, the electronic device may obtain a degree of agreement between the first modalities to perform a collision check (212) between the modalities of FIG. 2 or a collision check between the modalities of FIG. 6.
[0195] According to one embodiment, the first threshold value may be a threshold value based on the degree of agreement between second modalities obtained in a second section prior to the first section.
[0196] According to one embodiment, an electronic device can obtain a first knowledge graph using an AI model (e.g., the AI model (806) of FIG. 8). According to one embodiment, if the degree of agreement between the first modalities is greater than or equal to a first threshold, the electronic device can identify that a collision check between the modalities has been passed and obtain a first knowledge graph.
[0197] FIG. 10 is a flowchart illustrating the operation of an electronic device according to one embodiment acquiring a knowledge graph corresponding to each of the first modalities.
[0198] According to one embodiment, the operations illustrated in FIG. 10 may be operations that can be performed in operation 902 of FIG. 9.
[0199] Referring to FIG. 10, in operation 1002, the electronic device may obtain the first modalities included in the first section, a knowledge graph corresponding to each of the second modalities included in the second section prior to the first section, or a prompt corresponding to instructions for analysis associated with each of the first modalities. According to one embodiment, the prompt may correspond to the prompt (406) of FIG. 4, and the multimodal AI model may be used in the analysis module (208) of FIG. 4.
[0200] In operation 1004, the electronic device may obtain a knowledge graph corresponding to each of the first modalities through a multimodal AI model based on the first modalities, the knowledge graph corresponding to each of the second modalities, or the prompt corresponding to instructions for analysis associated with each of the first modalities. According to one embodiment, the multimodal AI model may be included in the electronic device and run on-device, or may be stored on an external server.
[0201] FIG. 11 is a flowchart illustrating the operation of an electronic device for reacquiring a knowledge graph according to one embodiment.
[0202] Referring to FIG. 11, in operation 1102, the electronic device can obtain a knowledge graph corresponding to each of the first modalities included in the first section of the video content through a multimodal AI model. According to one embodiment, operation 1102 may correspond to operation 902 of FIG. 9.
[0203] In operation 1104, if the degree of agreement between the first modalities regarding common information in the knowledge graph corresponding to each of the first modalities is less than a first threshold, the electronic device may reacquire the knowledge graph corresponding to each of the first modalities through a multimodal AI model, based on the first modalities and on a prompt that reflects the features that differ between the first modalities with respect to the common information. According to one embodiment, the electronic device may identify that the collision check between modalities was not passed based on the degree of agreement between the first modalities being less than a first threshold, and reacquire the knowledge graph for each of the first modalities.
[0204] According to one embodiment, the electronic device can perform the collision check between modalities on a reacquired knowledge graph as well. For example, the electronic device can repeatedly perform the operation of reacquiring a knowledge graph for each of the first modalities until the degree of agreement between the first modalities becomes greater than or equal to a first threshold value.
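The repeated reacquisition described above can be sketched as a loop. The description repeats until the degree of agreement reaches the first threshold; the `max_attempts` cap is an assumption added here so the sketch always terminates. The callables `analyze`, `measure_agreement`, and `make_prompt` are hypothetical stand-ins for the analysis module, the collision check, and the second prompt configuration module.

```python
def reacquire_until_agreement(analyze, measure_agreement, make_prompt,
                              first_threshold, max_attempts=3):
    """Reacquire modality knowledge graphs until the collision check passes."""
    prompt = None  # the first pass uses no critic prompt
    for attempt in range(1, max_attempts + 1):
        graphs = analyze(prompt)
        if measure_agreement(graphs) >= first_threshold:
            return graphs, attempt
        # Check failed: build a critic prompt from the differing features.
        prompt = make_prompt(graphs)
    return None, max_attempts


# Stub components with made-up agreement scores, for illustration only.
scores = iter([0.25, 0.5, 0.75])
graphs, attempts = reacquire_until_agreement(
    analyze=lambda prompt: {"prompt_used": prompt},
    measure_agreement=lambda graphs: next(scores),
    make_prompt=lambda graphs: "cross-modal critic prompt",
    first_threshold=0.75,
)
```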
[0205] FIG. 12 is a flowchart illustrating the operation of an electronic device for acquiring a knowledge graph corresponding to video content according to one embodiment.
[0206] According to one embodiment, the operations illustrated in FIG. 12 may be operations that can be performed in operation 904 of FIG. 9.
[0207] Referring to FIG. 12, in operation 1202, if the degree of agreement between the first modalities corresponding to common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold, the electronic device can obtain a similarity between the knowledge graph for each of the first modalities included in the first interval and the knowledge graph for each of the second modalities included in the second interval prior to the first interval. For example, the similarity may be a value included in a set range (e.g., -1 to 1).
[0208] According to one embodiment, the similarity between the knowledge graph for each of the first modalities and the knowledge graph for each of the second modalities may be a similarity corresponding to a modality-specific weight based on the change of each of the second modalities. For example, the modality-specific weight may be obtained as “1 − change.” Here, the change may represent the average value of the similarity of the knowledge graph for the second modality between two adjacent time intervals within the second interval (e.g., a value of 0 to 1).
[0209] In operation 1204, if the acquired similarity is greater than or equal to a second threshold (e.g., 0), the electronic device may acquire a first knowledge graph of video content corresponding to a first interval. For example, the electronic device may acquire a first knowledge graph of video content corresponding to a first interval based on a knowledge graph corresponding to each of the first modalities.
[0210] FIG. 13 is a flowchart illustrating the operation of an electronic device that obtains a degree of agreement between modalities through a knowledge consistency check according to one embodiment.
[0211] According to one embodiment, the operations illustrated in FIG. 13 may be operations that can be performed in operation 904 of FIG. 9.
[0212] Referring to FIG. 13, in operation 1302, the electronic device can obtain a similarity between a knowledge graph corresponding to each of the first modalities included in the first interval and a knowledge graph corresponding to each of the second modalities included in the second interval prior to the first interval. For example, the similarity may be a value included in a set range (e.g., -1 to 1).
[0213] According to one embodiment, the similarity between a knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of the second modalities may be a similarity corresponding to a modality-specific weight based on the change of each of the second modalities. For example, the modality-specific weight may be obtained as “1 − change.” Here, the change may represent the average value of the similarity of the second modality-specific knowledge graph between two adjacent time intervals within the second interval (e.g., a value of 0 to 1).
[0214] In operation 1304, if the acquired similarity is greater than or equal to a second threshold, the electronic device may acquire a degree of agreement between first modalities corresponding to common information in a knowledge graph corresponding to each of the first modalities. According to one embodiment, if the acquired degree of agreement is greater than or equal to a first threshold, the electronic device may perform operation 904 of FIG. 9 or operations 1202 and 1204 of FIG. 12. According to one embodiment, if the acquired degree of agreement is less than the first threshold, the electronic device may perform operation 1104 of FIG. 11.
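The ordering of FIG. 13, in which the knowledge consistency check precedes the degree of agreement check, can be summarized by the decision sketch below. It assumes numeric similarity and agreement values. The function name `next_operation` and its string return values, which simply name the flowchart operations, are hypothetical.

```python
def next_operation(similarity, agreement, first_threshold, second_threshold=0.0):
    """Pick the next flowchart operation when the consistency check runs first."""
    if similarity < second_threshold:
        # Knowledge consistency check failed: generate a prompt and reacquire.
        return "1404"
    if agreement < first_threshold:
        # Collision check between modalities failed: reacquire via critic prompt.
        return "1104"
    # Both checks passed: obtain the first knowledge graph of the video content.
    return "904"


both_pass = next_operation(similarity=0.5, agreement=0.75, first_threshold=0.5)
low_agreement = next_operation(similarity=0.5, agreement=0.25, first_threshold=0.5)
low_similarity = next_operation(similarity=-0.5, agreement=0.75, first_threshold=0.5)
```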
[0215] FIG. 14 is a flowchart illustrating the operation of an electronic device for reacquiring a knowledge graph by modality according to one embodiment.
[0216] According to one embodiment, the operations illustrated in FIG. 14 may be operations that can be performed after operation 1202 of FIG. 12 or operation 1302 of FIG. 13.
[0217] Referring to FIG. 14, in operation 1402, the electronic device can obtain a similarity between a knowledge graph corresponding to each of the first modalities included in the first section and a knowledge graph corresponding to each of the second modalities included in the second section prior to the first section.
[0218] In operation 1404, if the acquired similarity is less than a second threshold, the electronic device may generate a prompt based on different information between the knowledge graph corresponding to each of the first modalities and the knowledge graph corresponding to each of the second modalities.
[0220] In operation 1406, the electronic device can reacquire the knowledge graph corresponding to each of the first modalities through a multimodal AI model based on the first modalities, the knowledge graph corresponding to each of the second modalities, or the prompt.
[0221] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of those embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of the items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "1st," "2nd," "first," or "second" may be used simply to distinguish one component from another and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., first) component is referred to as “coupled” or “connected” to another (e.g., second) component, with or without the terms “functionally” or “communicatively,” it means that the component may be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.
[0222] The term “module” as used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be an integrally formed component, or a minimum unit or part of such a component that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).
[0223] According to various embodiments, each of the components described above (e.g., a module or program) may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or a similar manner as they were performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
Claims
1. An electronic device comprising: at least one processor including processing circuitry; and memory including at least one storage medium storing instructions, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire, through a multimodal artificial intelligence (AI) model, a knowledge graph corresponding to each of first modalities corresponding to at least two of image data, video data, audio data, or text data included in a first section of video content; and acquire a first knowledge graph of the video content corresponding to the first section, based on the knowledge graph corresponding to each of the first modalities, if a degree of agreement between the first modalities corresponding to common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold.
2. The electronic device of claim 1, wherein the first section comprises at least one frame, among frames included in the video content input during a set section, that includes a feature different from a second knowledge graph of the video content acquired for a second section prior to the first section.
3. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire the knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities, a knowledge graph corresponding to each of second modalities included in a second section prior to the first section, or a prompt corresponding to an instruction for analysis associated with each of the first modalities.
4. The electronic device of claim 1 or 3, wherein the multimodal AI model comprises a multimodal language model for acquiring the knowledge graph corresponding to each of the first modalities as text-type information.
5. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: if the degree of agreement is greater than or equal to the first threshold, obtain a similarity between the knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of second modalities included in a second section prior to the first section; and acquire the first knowledge graph if the obtained similarity is greater than or equal to a second threshold.
6. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain a similarity between the knowledge graph corresponding to each of the first modalities and a knowledge graph corresponding to each of second modalities included in a second section prior to the first section; and obtain the degree of agreement between the first modalities corresponding to the common information if the obtained similarity is greater than or equal to a second threshold.
7. The electronic device of claim 5 or 6, wherein the similarity between the knowledge graph corresponding to each of the first modalities and the knowledge graph corresponding to each of the second modalities is a similarity corresponding to a modality-specific weight based on a degree of change of each of the second modalities.
8. The electronic device of claim 5 or 6, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: if the similarity is less than the second threshold, generate a prompt based on information that differs between the knowledge graph corresponding to each of the first modalities and the knowledge graph corresponding to each of the second modalities; and reacquire the knowledge graph corresponding to each of the first modalities through the multimodal AI model based on at least one of the knowledge graph corresponding to each of the first modalities, the knowledge graph corresponding to each of the second modalities, or the prompt.
9. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: if the degree of agreement is less than the first threshold, reacquire the knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities and a prompt related to features that differ between the first modalities corresponding to the common information.
10. The electronic device of claim 1 or 9, wherein the first threshold is a threshold based on a degree of agreement between second modalities obtained in a second section prior to the first section.
11. A method of operating an electronic device, the method comprising: acquiring, through a multimodal artificial intelligence (AI) model, a knowledge graph corresponding to each of first modalities corresponding to at least two of image data, video data, audio data, or text data included in a first section of video content; and obtaining a first knowledge graph of the video content corresponding to the first section, based on the knowledge graph corresponding to each of the first modalities, if a degree of agreement between the first modalities corresponding to common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold.
12. The method of claim 11, wherein the first section comprises at least one frame, among frames included in the video content input during a set section, that includes a feature different from a second knowledge graph of the video content obtained for a second section prior to the first section.
13. The method of claim 11, wherein acquiring the knowledge graph corresponding to each of the first modalities comprises: acquiring the knowledge graph corresponding to each of the first modalities through the multimodal AI model based on the first modalities, a knowledge graph corresponding to each of second modalities included in a second section prior to the first section, or a prompt corresponding to an instruction for analysis associated with each of the first modalities.
14. The method of claim 11 or 13, wherein the multimodal AI model comprises a multimodal language model for acquiring the knowledge graph corresponding to each of the first modalities as text-type information.
15. A storage medium storing at least one computer-readable instruction, wherein the at least one instruction, when executed by at least a part of at least one processor of an electronic device, causes the electronic device to perform at least one operation, the at least one operation comprising: acquiring, through a multimodal artificial intelligence (AI) model, a knowledge graph corresponding to each of first modalities corresponding to at least two of image data, video data, audio data, or text data included in a first section of video content; and obtaining a first knowledge graph of the video content corresponding to the first section, based on the knowledge graph corresponding to each of the first modalities, if a degree of agreement between the first modalities corresponding to common information in the knowledge graph corresponding to each of the first modalities is greater than or equal to a first threshold.