Electronic apparatus for providing content search for sentence-form utterance and method of operating same
By converting user utterances and content metadata into embedding vectors and using a large language model to generate additional metadata, the electronic device addresses the challenge of accurately reflecting user intent in digital content search, enhancing search relevance and accuracy.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SAMSUNG ELECTRONICS CO LTD
- Filing Date
- 2025-06-24
- Publication Date
- 2026-06-25
AI Technical Summary
Existing digital content search technologies struggle to accurately reflect user intent, particularly with sentence-type utterances that describe specific scenes or plots, as they lack the ability to understand and process rich metadata beyond basic keywords.
An electronic device employs a deep learning-based text encoder to convert user utterances and content metadata into embedding vectors, using cosine similarity to determine relevance, and incorporates a large language model to generate additional metadata for enhanced semantic-based search, allowing it to understand and retrieve content matching user intent.
The solution enables accurate retrieval of content highly relevant to user sentence-type queries by leveraging rich metadata, improving search accuracy and user satisfaction.
Description
Electronic device providing content search for sentence-type utterances and method of operation thereof
[0001] Various embodiments of the present disclosure relate to an electronic device that provides content retrieval for sentence-type utterances and a method of operating the same.
[0002] Digital content is diverse in type, and the online platforms providing it are also varied. For example, movies, dramas, sports, and user-generated videos are the primary types of digital content produced and consumed. Online platforms such as OTT platforms, social media, and video platforms are widely used by many people.
[0003] Technologies for searching digital content include keyword-based search and semantic-based search. Keyword-based search is fast and simple to implement, but it is difficult to reflect specific user intent. Semantic-based search is implemented by deep learning technologies utilizing natural language processing (NLP) and machine learning. For example, a transformer-based model can understand the context of a user query and search for highly relevant content.
[0004] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.
[0005] An electronic device according to one embodiment of the present disclosure comprises: a display; a microphone; and at least one processor, wherein the at least one processor receives a user voice input through the microphone, obtains a first query embedding based on the user voice input, obtains a similarity between a plurality of metadata embedding vectors included in a metadata database and the first query embedding, respectively, obtains one or more similar metadata among the plurality of metadata embedding vectors having a similarity greater than or equal to a first threshold value based on the similarity, obtains an order of the one or more similar metadata based on the similarity or a weight for one or more metadata of the same content, and controls the display to output information related to content corresponding to the one or more similar metadata in the order.
[0006] A server according to another embodiment of the present disclosure comprises: a communication circuit; a memory; and at least one processor, wherein the at least one processor receives a user voice input for a content search request from a first electronic device through the communication circuit, obtains a first query embedding based on the user voice input, obtains a similarity between a plurality of metadata embedding vectors included in a metadata database and the first query embedding, respectively, obtains one or more similar metadata among the plurality of metadata embedding vectors having a similarity greater than or equal to a first threshold value based on the similarity, obtains an order of the one or more similar metadata based on the similarity or a weight for one or more metadata of the same content, and transmits information related to content corresponding to the one or more similar metadata, including the order information, to the first electronic device through the communication circuit.
[0007] A non-volatile computer-readable storage medium may be provided that records instructions according to another embodiment of the present disclosure. When the instructions are executed by one or more processors, the one or more processors may perform the following operations: receiving a user voice input by a microphone of an electronic device; obtaining a first query embedding based on the user voice input; obtaining a similarity between a plurality of metadata embedding vectors included in a metadata database and the first query embedding, respectively; obtaining one or more similar metadata having a similarity greater than or equal to a first threshold among the plurality of metadata embedding vectors based on the similarity; obtaining an order of the one or more similar metadata based on the similarity or a weight for one or more metadata of the same content; and controlling a display of the electronic device to output information related to content corresponding to the one or more similar metadata in the order.
[0008] In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components.
[0009] FIG. 1 is an example of a search screen of an electronic device according to one embodiment of the present disclosure.
[0010] FIG. 2 is a block diagram including components of an electronic device according to one embodiment of the present disclosure.
[0011] FIG. 3 is a flowchart illustrating the operation of an electronic device performing a content search for a user's utterance according to one embodiment of the present disclosure.
[0012] FIG. 4 is a block diagram showing a search function component of an electronic device according to one embodiment of the present disclosure.
[0013] FIG. 5 is a flowchart illustrating the operation of an electronic device storing metadata of content in a database according to one embodiment of the present disclosure.
[0014] FIG. 6 is a flowchart illustrating the operation of an electronic device storing additional metadata of content in a database according to one embodiment of the present disclosure.
[0015] FIG. 7 is a block diagram showing a metadata generation function component of an electronic device according to one embodiment of the present disclosure.
[0016] FIG. 8 is an example of metadata according to an embodiment of the present disclosure.
[0017] FIG. 9 is an example of a search result table based on user utterance of an electronic device according to one embodiment of the present disclosure.
[0018] FIG. 10 is an example of a search result screen of an electronic device according to one embodiment of the present disclosure.
[0019] FIG. 11 illustrates an electronic device, a database, and a cloud server according to one embodiment of the present disclosure.
[0020] Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings so that those skilled in the art can easily practice them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein. In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components. Furthermore, in the drawings and related descriptions, descriptions of well-known functions and configurations may be omitted for clarity and brevity.
[0021] An embodiment of the present disclosure will be described below with reference to the attached drawings.
[0022] FIG. 1 is an example of a search screen of an electronic device according to one embodiment of the present disclosure.
[0023] According to one embodiment, an electronic device (101) can receive a user utterance (105) by a user (103) using an input device (e.g., a microphone), determine the meaning of the user utterance, and provide the result (112) of a search for content highly related to the meaning to the user (103) through an output device (e.g., a display).
[0024] According to one embodiment, the electronic device (101) can process sentence-type utterances containing user intent in addition to simple keywords. Sentence-type utterances include sentences for requesting a search and may include descriptions of the search target. Unlike keyword search requests that mention specific information such as the title, characters, actors, and director of the content, descriptions of the search target may describe aspects of the content such as the plot, major scenes, historical background, and relationships between characters. Accordingly, the user utterance (105) may be a sentence-type utterance that describes a specific scene or the main synopsis rather than basic information such as the movie title, characters, and director. For example, referring to FIG. 1, the user utterance (105) may be "Find me a movie where detectives sell chicken!", which describes the main scenario.
[0025] According to one embodiment, an electronic device (101) can understand user utterance (105) and search for content that matches the meaning of user utterance (105) in a content metadata DB. To compare the meaning of user utterance (105) and content metadata, the electronic device (101) can convert each into an embedding vector and calculate the similarity between the vectors to determine the association. According to one embodiment, the processor of the electronic device (101) can convert text information (e.g., user utterance, content metadata) into an embedding vector using a text encoder. The text encoder can be implemented as a deep learning-based model that learns the similarity of words and the context of sentences. The electronic device (101) can calculate a similarity value between the embedding vector for user utterance and the embedding vector for content metadata using cosine similarity, and determine whether the content is highly related to the user utterance based on the value.
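As a minimal sketch of the comparison described in this paragraph, the following Python computes the cosine similarity between two embedding vectors. The three-dimensional vectors are made-up stand-ins for text-encoder output (the disclosure does not name an encoder or a vector dimension), and the function name `cosine_similarity` is illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real text encoder would produce these.
query_vec = [0.9, 0.1, 0.3]
setting_vec = [0.8, 0.2, 0.4]      # e.g., a "main setting" metadata item
unrelated_vec = [-0.1, 0.9, -0.5]  # e.g., metadata of unrelated content

print(cosine_similarity(query_vec, setting_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

A value closer to 1 indicates that the two vectors point in nearly the same direction, which the device would interpret as a stronger association between the utterance and the metadata.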
[0026] An electronic device (101) according to one embodiment may use metadata about content for semantic-based content search. The metadata about the content may include various information about the content. For example, if the content is a movie, basic information such as the title, characters, actors, writer, director, release date, genre, synopsis, and main plot may be stored as metadata about the content. The electronic device (101) may acquire additional information beyond the original information provided by the content provider and use it as metadata.
[0027] Unlike keyword search that matches keywords included in content information, the richer the metadata information about the content, the higher the accuracy of semantic-based content search for user utterances, that is, content search with high relevance to user utterances (queries). An electronic device (101) according to one embodiment may add various information about the content as metadata by using a large language model (LLM) in addition to the basic information provided by a content provider.
[0028] The electronic device (101) can obtain additional information by searching for various items describing the content. For example, the electronic device (101) can search for reviews of the content and store some of the reviews as metadata for the content. The electronic device (101) can use a large language model (LLM) to obtain additional information. For example, the electronic device (101) can ask the LLM about various item values for the content and store the content information generated by the LLM as metadata. A single piece of content may have multiple pieces of metadata. The electronic device (101) can convert the content metadata into embedding vectors, store them in a database, and search them. The electronic device (101) can obtain search results by judging the similarity between the user utterance and the content metadata, and one or more pieces of metadata for the same content may be included in the search results. The electronic device (101) can rearrange the search results by content and provide the top-ranked content to the user.
[0029] An electronic device (101) according to one embodiment receives a sentence-type utterance (105) by a user (103) and can output the results of searching for content highly related to the sentence-type utterance (105) on a display screen (110). The display screen (110) may include a first part (111) that displays text that recognizes the sentence-type utterance (105) and a second part (112) that displays a list of content corresponding to the search results.
[0030] FIG. 2 is a block diagram including components of an electronic device according to one embodiment of the present disclosure.
[0031] According to one embodiment, an electronic device (201) (e.g., the electronic device (101) of FIG. 1) may include a processor (210), memory (220), a display (230), a connection terminal (240), a microphone (250), a speaker (260), and a communication circuit (270). According to one example, the electronic device (201) may include additional components (e.g., a camera) in addition to the components shown, or at least one of the components shown may be omitted.
[0032] The processor (210) can, for example, execute software (e.g., a program) to control at least one other component (e.g., a hardware or software component) of the electronic device (201) connected to the processor (210) and perform various data processing or operations. According to one embodiment, as at least part of the data processing or operations, the processor (210) can store commands or data received from another component (e.g., a microphone (250)) in volatile memory, process the commands or data stored in volatile memory, and store the resulting data in non-volatile memory. According to one embodiment, the processor (210) may include a main processor (e.g., a central processing unit or an application processor) or an auxiliary processor (e.g., a graphics processing unit, a neural processing unit (NPU), an image signal processor, a sensor hub processor, or a communication processor) that can operate independently or together with it. For example, if the electronic device (201) includes a main processor and an auxiliary processor, the auxiliary processor may be configured to use less power than the main processor or to be specialized for a designated function. An auxiliary processor may be implemented separately from the main processor or as part thereof. According to one embodiment, the auxiliary processor (e.g., a neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (201) itself where the artificial intelligence model is executed, or through a separate server (e.g., a cloud server). Learning algorithms may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but are not limited to the examples described above.
An artificial intelligence model may include multiple artificial neural network layers. An artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to the examples described above. In addition to the hardware structure, the artificial intelligence model may additionally or alternatively include a software structure.
[0033] Memory (220) may store various data used by at least one component of the electronic device (201) (e.g., processor (210) or display (230)). The data may include, for example, software (e.g., programs) and input data or output data for related commands. Memory (220) may include volatile memory or non-volatile memory. Memory (220) may include a database in non-volatile memory. In one embodiment, memory (220) may include at least a portion of a content metadata database or a user utterance database.
[0034] The display (230) can visually provide information to the outside (e.g., a user) of the electronic device (201). The display (230) may include, for example, a display panel, a holographic device, or a projector, and a control circuit for controlling that device. According to one embodiment, the display panel may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.
[0035] A connection terminal (240) may include a connector through which the electronic device (201) can be physically connected to an external electronic device (e.g., a separate display device or an audio output device). According to one embodiment, the connection terminal (240) may include, for example, an HDMI (high-definition multimedia interface) connector, a DP (DisplayPort) connector, a Thunderbolt connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector). The electronic device (201) may be connected to one or more external devices using the connection terminal (240). The electronic device (201) may receive or transmit at least a portion of a video signal or an audio signal through the connection terminal (240). The electronic device (201) may transmit search results based on user speech to an external display device connected through the connection terminal (240) and output them through the external display device.
[0036] The microphone (250) is an input device (sensor) that detects sound and can provide a function for recognizing voice. The electronic device (201) can recognize the user's voice through the microphone (250) and receive user speech requesting a content search.
[0037] The speaker (260) is an audio output device that enables sound to be heard along with video. The speaker (260) may include a speaker driver, an amplifier, and a sound processor. The sound processor may store and manage the sound output table of the speaker (260). The electronic device (201) may output information about the content search results through the speaker (260).
[0038] The communication circuit (270) can support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device (201) and an external electronic device (e.g., remote control, audio output device, source device, content provider server), and the performance of communication through the established communication channel. The communication circuit (270) may include one or more communication processors that operate independently of the processor (210) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication circuit (270) may include a wireless communication module (e.g., cellular communication module, short-range wireless communication module, or GNSS (global navigation satellite system) communication module) or a wired communication module (e.g., LAN (local area network) communication module, or power line communication module). The corresponding communication module among these communication modules can communicate with a remote control or a content provider server via a first network (e.g., a short-range communication network such as Bluetooth, Wi-Fi (wireless fidelity) Direct, or IrDA (infrared data association)) or a second network (e.g., a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication modules may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication module can identify or authenticate the electronic device (201) within a communication network, such as the first network or the second network, using subscriber information (e.g., International Mobile Subscriber Identity (IMSI)) stored in the subscriber identification module.
The electronic device (201) may be connected to one or more content servers, external storage devices, or cloud servers via the communication circuit (270). The electronic device (201) can request and receive content information or content from a content server. The electronic device (201) can request and receive content metadata information stored in an external storage device. The electronic device (201) can perform semantic-based content search for user utterances through a cloud server.
[0039] According to one embodiment, an electronic device (201) comprises a display (230); a microphone (250); and at least one processor (210). When receiving a user voice input through the microphone, the at least one processor (210) can obtain a similarity between each of a plurality of metadata embedding vectors included in a metadata database and a first query embedding obtained based on the user voice input, obtain one or more similar metadata corresponding to a similarity greater than or equal to a first threshold among the plurality of metadata embedding vectors, and control the display to output information related to content corresponding to the one or more similar metadata in an order based on at least one of the similarity or weights for other metadata of the same content.
[0040] According to one embodiment, the at least one processor receives information corresponding to a first content from a server, and can obtain an embedding vector obtained based on the information corresponding to the first content as metadata for the first content.
[0041] According to one embodiment, information corresponding to the first content may include at least one of a title, characters, actors, writer, director, release date, genre, synopsis, or main plot.
[0042] According to one embodiment, the at least one processor can obtain additional information corresponding to the first content through an artificial intelligence model, and obtain an embedding vector obtained based on the additional information as metadata corresponding to the first content.
[0043] According to one embodiment, the at least one processor generates a prompt corresponding to the acquisition of one or more category information corresponding to the first content, and based on the prompt, can acquire information obtained through the artificial intelligence model as the additional information corresponding to the first content.
[0044] According to one embodiment, the additional information may include at least one of the era, atmosphere, theme, setting, subject, character occupation, prominent keywords, cultural background, story classification, main audience, film industry, surrounding relationships, hidden meaning, hidden message, plot, year of release, main actors, director, or related video.
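The prompt generation described in the two paragraphs above could be sketched as follows. This is only an illustration: the prompt wording and the helper name `build_metadata_prompt` are assumptions, the category names are taken from the examples listed above, and no particular language model or API is implied (the sketch stops at building the prompt text).

```python
def build_metadata_prompt(title, categories):
    """Build a text prompt asking for one short description per category."""
    lines = [
        f'Describe the content titled "{title}".',
        "For each category below, answer in one short sentence.",
    ]
    lines += [f"- {category}" for category in categories]
    return "\n".join(lines)

# Categories drawn from the additional-information examples in the text.
prompt = build_metadata_prompt(
    "Extreme Job", ["era", "atmosphere", "character occupation"]
)
print(prompt)
```

Each answer returned for a category could then be encoded separately, so that one piece of content ends up with one embedding vector per category.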
[0045] According to one embodiment, the at least one processor can obtain the similarity between a plurality of metadata embedding vectors included in the metadata database and the first query embedding, respectively, through a cosine similarity calculation method.
[0046] According to one embodiment, the at least one processor may assign weights to the similarity values of metadata corresponding to the same content for the one or more similar metadata, obtain a final similarity for each content for the one or more similar metadata, and rearrange the one or more similar metadata according to the obtained final similarity.
[0047] According to one embodiment, the at least one processor can control the display to output information related to content corresponding to the one or more similar metadata, including a description corresponding to the category of the metadata.
[0048] According to one embodiment, if no metadata having a similarity greater than or equal to the first threshold is obtained among the plurality of metadata embedding vectors, the at least one processor can store the user voice input in a voice input database and control the display to output information corresponding to the search failure.
[0049] FIG. 3 is a flowchart illustrating the operation of an electronic device performing a content search for a user's utterance according to one embodiment of the present disclosure.
[0050] An electronic device according to one embodiment (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) receives a user utterance for content search and can search for the content sought by the user utterance by comparing metadata about the content with the user utterance. In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel.
[0051] In operation 310, an electronic device (101) according to one embodiment may receive user voice input in response to a search input request through a microphone (e.g., microphone (250) of FIG. 2). The user voice input may be a sentence-type utterance requesting a content search. According to one embodiment, if the user voice input is a sentence-type utterance, the electronic device (101) may perform a semantic-based content search on a metadata database. When the electronic device (101) receives a sentence-type utterance, it may perform operation 320.
[0052] According to one embodiment, the electronic device (101) may perform a keyword search in a content database instead of operation 320 when the user voice input is a keyword (word) or a concise sentence containing a keyword. The content database contains basic information about the content and may be stored and managed in a structure of items and item values. For example, for the first content, the content database may include text-format data such as (title, Extreme Job), (lead actor, Ryu Seong-ryong), (director, Lee Byung-hun).
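To illustrate the item and item-value structure, here is a small sketch of a keyword lookup over such a content database. The dictionary layout, the function name `keyword_search`, and the second (placeholder) entry are assumptions made for illustration; how the device decides whether an utterance is a keyword or a sentence is not covered here.

```python
# Hypothetical content database: each content is a dict of item -> item value.
CONTENT_DB = [
    {"title": "Extreme Job", "lead actor": "Ryu Seong-ryong", "director": "Lee Byung-hun"},
    # Placeholder entry, not real content data.
    {"title": "Example Movie", "lead actor": "Example Actor", "director": "Example Director"},
]

def keyword_search(keyword, content_db):
    """Return titles of contents where any item value contains the keyword."""
    needle = keyword.casefold()  # case-insensitive substring match
    return [
        content["title"]
        for content in content_db
        if any(needle in value.casefold() for value in content.values())
    ]

print(keyword_search("lee byung-hun", CONTENT_DB))
```

Because this lookup only matches literal item values, a descriptive utterance such as "a movie where detectives sell chicken" would find nothing here, which is why sentence-type utterances go through the embedding-based path instead.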
[0053] In operation 320, an electronic device (101) according to one embodiment may obtain a first query embedding based on user voice input. The electronic device (101) may obtain the first query embedding using a text encoder. The electronic device (101) may convert the user voice input into an embedding vector for similarity comparison with metadata for content in the form of an embedding vector.
[0054] In operation 330, an electronic device (101) according to one embodiment can calculate the similarity between embedding vectors of a metadata database (DB) and a first query embedding. The electronic device (101) can, for example, calculate a cosine similarity value between each metadata embedding vector and a first query embedding. The electronic device (101) can determine that the similarity between the two vectors is higher as the cosine similarity value approaches 1.
[0055] Alternatively, in one embodiment, the electronic device (201) may use various similarity determination methods. For example, the similarity determination method may be Euclidean distance, Manhattan distance, Jaccard similarity, Pearson correlation coefficient, Mahalanobis distance, Hellinger distance, or Kullback-Leibler divergence.
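For comparison with cosine similarity, three of the listed alternatives can be sketched in a few lines of Python. The function names and sample inputs are illustrative assumptions. Note that Euclidean and Manhattan distance are distances (smaller means more similar), so a threshold test would be reversed relative to cosine similarity, and that Jaccard similarity operates on sets (here, word tokens) rather than on dense vectors.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(set_a & set_b) / len(union)

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
print(jaccard_similarity(
    "detectives sell chicken".split(),
    "detectives run a chicken restaurant".split(),
))
```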
[0056] In operation 340, an electronic device (101) according to one embodiment may acquire one or more similar metadata corresponding to a similarity greater than or equal to a threshold. The degree of similarity between two vectors may be proportional to the association between the metadata of the content and the user voice input. The electronic device (101) may set a threshold corresponding to the level of accuracy expected for the search results. The electronic device (101) may modify the threshold by reflecting user feedback on the search results.
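The text does not specify how feedback changes the threshold. Purely as an assumed example, the sketch below nudges the threshold up when the user reports irrelevant results and down when too few results are returned, clamped to a fixed range. The feedback labels, step size, and bounds are all hypothetical.

```python
def adjust_threshold(threshold, feedback, step=0.02, low=0.5, high=0.95):
    """Return a new threshold after one piece of user feedback.

    feedback: "irrelevant" raises the threshold (stricter matching),
              "too_few" lowers it (looser matching); anything else is ignored.
    All labels and numeric defaults are illustrative assumptions.
    """
    if feedback == "irrelevant":
        threshold += step
    elif feedback == "too_few":
        threshold -= step
    # Clamp so that repeated feedback cannot push the threshold out of range.
    return min(high, max(low, threshold))

threshold = 0.80
threshold = adjust_threshold(threshold, "irrelevant")
print(threshold)
```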
[0057] In operation 350, an electronic device (101) according to one embodiment may obtain an order based on at least one of similarity or weights for other metadata of the same content. The electronic device (101) may rearrange the extracted metadata, i.e., the embedding vectors of the metadata. The electronic device (101) may rearrange the extracted embedding vectors based on the content. The electronic device (101) may sort the embedding vectors by content according to their similarity values and, where several belong to the same content, retain only the metadata with the highest similarity value. The electronic device (101) may also rearrange them by reflecting the weights for other metadata of the same content. If multiple metadata for the same content are extracted, the electronic device (101) may assign weights to each of the metadata and adjust the ranking of the content according to the sum of the weights. The electronic device (101) may rearrange the extracted embedding vectors by considering both similarity and weights for other metadata of the same content. For example, starting from the embedding vectors sorted by similarity, the electronic device (101) can reorder the metadata list so that the ranking reflects the weights given to multiple metadata of the same content.
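One of the options in this paragraph, retaining only the highest-similarity metadata for each content, might look like the sketch below. The tuple layout `(content_id, category, similarity)`, the identifiers, and the scores are assumptions chosen for illustration.

```python
def best_metadata_per_content(hits):
    """Keep the highest-similarity hit per content, sorted in descending order.

    hits: list of (content_id, category, similarity) tuples that already
    passed the threshold. The tuple layout is an illustrative assumption.
    """
    best = {}
    for content_id, category, similarity in hits:
        current = best.get(content_id)
        if current is None or similarity > current[2]:
            best[content_id] = (content_id, category, similarity)
    return sorted(best.values(), key=lambda hit: hit[2], reverse=True)

# Made-up scores for two contents; "extreme_job" matched on two categories.
hits = [
    ("extreme_job", "main setting", 0.91),
    ("other_movie", "plot", 0.84),
    ("extreme_job", "character occupation", 0.88),
]
print(best_metadata_per_content(hits))
```

Keeping the category of the winning hit makes it possible to show the user which aspect of the content matched the utterance.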
[0058] In operation 360, an electronic device (101) according to one embodiment may output content corresponding to one or more similar metadata in the rearranged order. The electronic device (101) may output content corresponding to the top-ranked embedding vectors among the rearranged embedding vectors as a search result. The electronic device (101) may output both basic information about the content and metadata information corresponding to the embedding vector. The electronic device (101) may make the association with the user's utterance explicit by presenting the specific details of the metadata judged to be highly related to that utterance. Referring to FIG. 1, the electronic device (101) may display the movie Extreme Job on the search result screen in response to "Find me a movie where detectives sell chicken," and may also display, as metadata regarding the main setting of the movie, the information that the detectives in Extreme Job operate a chicken restaurant for stakeout duty.
[0059] FIG. 4 is a block diagram showing a search function component of an electronic device according to one embodiment of the present disclosure.
[0060] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) may receive a user query (e.g., a sentence-type utterance) (401), search for the content that the user query describes, and output the search results through a display (450) to provide them to the user. The electronic device (101) may include a search API (410), a content metadata database (420), an encoder (430), a re-rank module (440), and a display (450) as components related to the search function. According to one example, the electronic device (101) may include additional components (e.g., a microphone) in addition to the illustrated components, or at least one of the illustrated components may be omitted. The omitted component may be provided by an external electronic device, and the electronic device (101) may transmit and receive necessary data through data communication with the external electronic device.
[0061] According to one embodiment, a search API (410) of an electronic device (101) can perform a content search for a user query. The search API (410) can calculate the similarity between the user query and the content metadata. The search API (410) can transmit the user query (401) to an encoder (430) and receive a query embedding vector containing the user query (401) from the encoder (430). The search API (410) can receive a metadata embedding vector from a content metadata database (420). The content metadata database (420) can store one or more metadata for the content in the form of embedding vectors.
[0062] The search API (410) can calculate a cosine similarity value between the query embedding vector and each metadata embedding vector, and generate an embedding list consisting of the embedding vectors whose similarity is greater than or equal to a threshold. If no embedding vector meets the threshold, the search API (410) can store the user query (401) in the user utterance DB (460). The user utterance DB (460) can store user queries for which the search failed. If a user query stored in the user utterance DB (460) is later associated with specific content by user input, the user query can be stored as metadata for that content.
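The threshold search described in paragraph [0062] can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the tuple layout of the database, the 3-dimensional vectors, and the threshold value 0.5 are all assumptions, and a plain list stands in for the user utterance DB (460).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query_text, query_vec, metadata_db, failed_queries, threshold=0.5):
    """Return (metadata_id, content_id, score) tuples at or above the threshold, best first.

    When nothing meets the threshold, the query text is appended to failed_queries,
    which stands in for the user utterance DB (460).
    """
    hits = []
    for metadata_id, content_id, vec in metadata_db:
        score = cosine_similarity(query_vec, vec)
        if score >= threshold:
            hits.append((metadata_id, content_id, score))
    hits.sort(key=lambda h: h[2], reverse=True)
    if not hits:
        failed_queries.append(query_text)
    return hits

# Hypothetical 3-dimensional embeddings, for illustration only.
metadata_db = [
    ("m1", "c1", [1.0, 0.0, 0.0]),
    ("m2", "c1", [0.6, 0.8, 0.0]),
    ("m3", "c2", [0.0, 0.0, 1.0]),
]
failed_queries = []
hits = search("detectives sell chicken", [1.0, 0.0, 0.0], metadata_db, failed_queries)
misses = search("unknown plot", [0.0, -1.0, 0.0], metadata_db, failed_queries)
```

In this sketch the first query matches two metadata entries of the same content, while the second matches nothing and is recorded as a failed query.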
[0063] The re-rank module (440) receives the embedding list extracted by the search API (410) and can re-rank the embedding vectors by considering at least one of the similarity or a weight for other metadata of the same content. The embedding list may contain embedding vectors sorted in descending order of similarity value.
[0064] The re-rank module (440) can give a content a bonus when multiple metadata of that content are included in the embedding list. For example, the re-rank module (440) can determine the similarity of the content based on the sum of the weighted similarity values of its metadata. Rather than relying on the similarity value calculated for each individual metadata, the re-rank module (440) can improve the accuracy or quality of the semantic-based search results by calculating, for each content, a final similarity that reflects the similarity values of multiple metadata.
[0065] The re-rank module (440) can re-rank in the following way. Assume that there are a total of N embedding vectors for which the similarity value between the user query embedding vector and the metadata embedding vector is greater than or equal to a threshold value. Because several metadata may belong to the same content, the same content identifier (id) may appear more than once in the list. Up to M metadata of the same content may be included. The number of metadata (M) allowed for the same content may be determined experimentally as an optimal value by considering the accuracy of the search results and the processing speed.
[0066] The weight for metadata of identical content is denoted as W, and the default value of W can be set to 0.1. The similarity score (score range) can have a value between 0 and 1. If the similarity value is 0, it is considered that there is no match at all. If the similarity value is 1, it is considered that there is an exact match. The closer the similarity value is to 1, the more similar it is judged to be. The threshold for determining similarity to the user query embedding vector can be set based on the accuracy of the result.
[0067] The reordering module (440) can calculate a final score for each item in the embedding vector list as in Equation 1. Each item in the embedding vector list represents a similarity to the content metadata, and the final score can represent a final similarity to the content.
[0068] [Equation 1]
$$\mathrm{RBF}_i = \frac{N - \mathrm{rank}_i}{N}, \qquad \mathrm{ONS}_i = \mathrm{RBF}_i \times \mathrm{score}_i$$
$$\mathrm{AccNS}_c = \sum_{i \in c} \mathrm{ONS}_i, \qquad \mathrm{FinalScore}_c = \max_{i \in c} \mathrm{score}_i + w \times \mathrm{AccNS}_c$$
[0069] In Equation 1, $\mathrm{rank}_i$ represents the rank of item i of the embedding vector list, $\mathrm{score}_i$ represents the score of item i, N represents the total number of items in the embedding vector list, and w represents the weight. $\mathrm{RBF}_i$ is the rank-based coefficient of item i, and the value obtained by multiplying the rank-based coefficient by the score of each item is called the normalized score ($\mathrm{ONS}_i$). The sum of the normalized scores over the items i belonging to the same content c becomes the Accumulated Normalized Score ($\mathrm{AccNS}_c$). When calculating the Accumulated Normalized Score, if the input value is a user query of two words or less, it is treated as an exception value (exceptionNorScore) to be processed similarly to keyword search, and the sum is calculated based on a higher threshold (e.g., 0.75). For example, the Accumulated Normalized Score for exception input values can be calculated as shown in the following Equation 2.
[0070] [Equation 2]
[0071] The final score is the value obtained by multiplying the Accumulated Normalized Score ($\mathrm{AccNS}_c$) by the weight (w) and adding the maximum similarity value ($\max_{i \in c} \mathrm{score}_i$) of the content. The reordering module (440) can re-rank based on the final score.
[0072] The reordering module (440) can output a list of contents with duplicates removed through the display (450) after reordering the embedding vectors. Alternatively, the reordering module (440) can output the contents corresponding to the reordered embedding vectors as they are through the display (450) without removing duplicates.
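The re-ranking and duplicate removal described above can be sketched as follows. The rank coefficient (N - rank) / N and the form "maximum score + w * accumulated sum" are taken from the worked expressions in Table 2 of paragraph [0114]. The function name, the input layout, and the sample scores are assumptions made for illustration, and the exception handling for short queries is not modeled.

```python
from collections import defaultdict

def rerank(items, total, weight=0.1):
    """Re-rank contents from a similarity-sorted metadata list.

    items:  list of (content_id, score) sorted by descending score; the list
            index is used as the 0-based rank.
    total:  N, the total number of items in the full embedding list.
    Returns [(content_id, final_score)] sorted by descending final score,
    with one entry per content (duplicates removed).
    """
    accumulated = defaultdict(float)  # sum of rank-weighted scores per content
    best = {}                         # maximum similarity per content
    for rank, (content_id, score) in enumerate(items):
        rank_coeff = (total - rank) / total
        accumulated[content_id] += rank_coeff * score
        best[content_id] = max(best.get(content_id, 0.0), score)
    finals = [(cid, best[cid] + weight * accumulated[cid]) for cid in best]
    finals.sort(key=lambda pair: pair[1], reverse=True)
    return finals

# Hypothetical list: content "B" appears twice just below content "A".
ranked = rerank([("A", 0.80), ("B", 0.79), ("B", 0.78), ("C", 0.50)], total=4)
```

With these sample values, content "B" overtakes content "A" because two of its metadata entries contribute to the accumulated sum, which is the bonus effect the paragraphs above describe.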
[0073] The display (450) can display, on the search result screen, a list of the search result contents and their metadata, along with a message indicating that search results for the user query (401) are being shown.
[0074] FIG. 5 is a flowchart illustrating the operation of an electronic device storing metadata of content in a database according to one embodiment of the present disclosure.
[0075] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) may store origin metadata for content provided by a content server in a content metadata DB. In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel.
[0076] In operation 510, according to one embodiment, an electronic device (101) may receive origin metadata for content from a content server. The content server may be built and managed by one or more content providers who produce and / or distribute content. The content server may store content data and information about the content as metadata. The metadata provided by the content server may be referred to as origin metadata. The origin metadata may include basic information about the content. Basic information may include, for example, title, characters, actors, writer, director, release date, genre, synopsis, and main plot. The electronic device (101) may receive origin metadata for content from one or more content servers.
[0077] In operation 520, according to one embodiment, the electronic device (101) can convert the origin metadata into an embedding vector. The electronic device (101) can embed the origin metadata using a text encoder. There may be one or more pieces of origin metadata.
[0078] In operation 530, according to one embodiment, the electronic device (101) may store the embedding vector in a content metadata DB. In one embodiment, the content metadata DB may be stored in the memory of the electronic device (101) (e.g., the memory (220) of FIG. 2). Alternatively, in one embodiment, the content metadata DB may be implemented as a separate storage device, server, or cloud server.
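Operations 510 to 530 can be sketched as a small ingestion routine. The encoder below is only a stand-in (a hashed bag-of-words vector) so that the example runs without a trained model; the disclosure uses a deep learning-based text encoder. The names `toy_encode` and `ingest_origin_metadata`, the dimension K = 8, and the sample field values are assumptions.

```python
import hashlib
import math

K = 8  # hypothetical embedding dimension; a real text encoder fixes this value

def toy_encode(text, dim=K):
    """Stand-in for a text encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def ingest_origin_metadata(content_id, origin_metadata, db):
    """Embed each non-empty origin metadata field and append it to the DB."""
    for field_name, text in origin_metadata.items():
        if not text:
            continue  # skip empty fields rather than storing zero vectors
        db.append({
            "contents_id": content_id,
            "data": field_name,
            "embedding": toy_encode(text),
        })
    return db

content_metadata_db = []
ingest_origin_metadata(
    "0001",
    {"title": "Extreme Job", "genre": "comedy", "synopsis": ""},  # sample values
    content_metadata_db,
)
```

Each stored record keeps the content identifier and the field it came from, so that a later search hit can be traced back to both the content and the type of metadata that matched.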
[0079] FIG. 6 is a flowchart illustrating the operation of an electronic device storing additional metadata of content in a database according to one embodiment of the present disclosure.
[0080] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) may store additional information obtained about the content as metadata. In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel.
[0081] In operation 610, according to one embodiment, an electronic device (101) may receive content information from a content server. The content information may be one or more pieces of information capable of identifying the content. The content information may be basic information stored by the content creator. For example, the content information may include at least some of a title, characters, actors, writer, director, release date, genre, synopsis, and main plot.
[0082] In operation 620, according to one embodiment, the electronic device (101) can generate additional metadata for the content based on a large language model (LLM). The electronic device (101) can generate a prompt requesting additional information about the content and input the prompt into the LLM to obtain additional information describing the content.
[0083] Table 1 is an example of a prompt requesting additional information about the content.
[0084] <task>You have to generate additional metadata for given movie / show</task>
<goal>Main Goal for generating [contents title] extra metadata is to enable deep and smart search for users. This data will be converted to vector embedding for KNN search with user query which means we cannot have repetition of nouns / words in our generated metadata. Each piece of information is required only once. As you know there could be multiple contents with same titles, So carefully understand the input information and If you donot know or understand about the asked content, Pls give blank output. Do Not hallucinate.</goal>
<requirements>Categories for which you need to generate content descriptors are - Time Period (displayed in story), Mood, Theme, Setting, Subject, Cultural Background (american, english, indian, african, british etc.), Story Classification [different to genres given in prompt], Film Industry (bollywood, hollywood, hallyuwood etc.), Best for which generation (Gen X, millennials, Gen Z, Gen alpha, adults, old-age, young, children, teenagers [return 1 of these values in english]), Around Relationship (signifies relations between lead characters like father-daughter, husband-wife, student-teacher etc.), Character Profession (of lead characters only), prominent keywords (most important words regarding the program not considered in any other content descriptor), Plot1 (clearly describes plot / story of program in minimum 350 words), Plot2 (Explain important scenes with information like (story classification, overall mood, themes covered, important topics covered, highlights shown). Do not directly give the words in response only information is required in minimum 300 words), plot3 (clear explains what happened in the ending / climax of the program and other key information like hidden meaning, hidden message in minimum 300 words). Related (Generate at least 10 related contents for a movie / show. Each related content must be specified only once, Related Content signifies other contents with same genre, actor, director etc. that user can watch or we can recommend. For internal purpose, we would also need release date in format YYYY-MM-DD and also its type (movie or show)). Also specify the casts and directors of the asked content.</requirements>
Also, these Plots are for semantic search so only give relevant information in a contextual way only. Do not give redundant information or words. Give the response in following [JSON] format [Must Generate values in English language] { "Content Descriptor": { "Time Period": "", "Mood": "", "Theme": "", "Setting": "", "Subject": "", "Character Profession": "", "Prominent Keywords": "", "Cultural Background": "", "Story Classification": "", "Best for which generation": "", "Film Industry": "", "Around Relationship": "", "Hidden Meaning": "", "Hidden Message": "", "Plot1": "", "Plot2": "", "Plot3": "", "Release Year": "", "Cast": "", "Director": "", "Related": [ { "Title": "", "Released Year": "", "Type": "" } ] } }
[0085] The prompt in Table 1 may consist of a task definition requesting the creation of additional metadata, a goal definition stating the purpose of creating the metadata, a requirements definition describing the categories for which content descriptors are to be created and the conditions for each category, and a data format (e.g., JSON) definition for the created metadata. The categories are intended to obtain content information item by item; in Table 1, for example, they include time period, mood, theme, setting, subject, character profession, prominent keywords, cultural background, story classification, best for which generation, film industry, around relationship, hidden meaning, hidden message, plot, release year, cast, director, and related content (title, released year, type). According to one embodiment, the electronic device (101) may generate various prompts. The prompts may also vary depending on the type of content.
[0086] According to one embodiment, the electronic device (101) can use the additional information generated by the LLM as additional metadata for the content.
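The prompt structure and the handling of the LLM's reply can be sketched as follows. This is an assumption-laden illustration: the shortened prompt wording, the function names `build_prompt` and `parse_additional_metadata`, and the sample reply are hypothetical, and no actual LLM is called. The parser treats blank or malformed output as "no metadata", in line with the prompt's instruction to return blank output for unknown content.

```python
import json

def build_prompt(title, categories):
    """Assemble a prompt with task, goal, requirements, and a JSON output skeleton."""
    skeleton = {"Content Descriptor": {name: "" for name in categories}}
    return (
        "<task>Generate additional metadata for the given movie / show</task>\n"
        f"<goal>Enable semantic search for [{title}]. "
        "If the content is unknown, give blank output.</goal>\n"
        f"<requirements>Categories: {', '.join(categories)}</requirements>\n"
        "Give the response in the following JSON format:\n"
        + json.dumps(skeleton)
    )

def parse_additional_metadata(llm_output):
    """Turn the LLM's JSON reply into {category: text}, dropping blank values.

    Returns an empty dict for blank or malformed output so that nothing is
    stored when the model declines to answer.
    """
    try:
        payload = json.loads(llm_output)
    except (json.JSONDecodeError, TypeError):
        return {}
    if not isinstance(payload, dict):
        return {}
    descriptor = payload.get("Content Descriptor", {})
    if not isinstance(descriptor, dict):
        return {}
    return {k: v for k, v in descriptor.items() if isinstance(v, str) and v.strip()}

prompt = build_prompt("Extreme Job", ["Mood", "Theme", "Setting"])
# Hypothetical reply, written by hand for this example.
reply = '{"Content Descriptor": {"Mood": "comedic", "Theme": "undercover work", "Setting": ""}}'
additional = parse_additional_metadata(reply)
nothing = parse_additional_metadata("")
```

Each surviving category/text pair can then be passed to the text encoder and stored as a separate piece of additional metadata.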
[0087] In operation 630, in one embodiment, the electronic device (101) may convert the additional metadata into an embedding vector using a text encoder. There may be one or more pieces of additional metadata and, accordingly, one or more embedding vectors.
[0088] In operation 640, according to one embodiment, the electronic device (101) may store the embedding vector in a content metadata DB. In one embodiment, the content metadata DB may be stored in the memory of the electronic device (101) (e.g., the memory (220) of FIG. 2). Alternatively, in one embodiment, the content metadata DB may be implemented as a separate storage device, server, or cloud server.
[0089] FIG. 7 is a block diagram showing a metadata generation function component of an electronic device according to one embodiment of the present disclosure.
[0090] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) may receive content information from a content provider server, generate metadata for the content, and store and manage it in a content metadata database. The electronic device (101) may include a metadata generator (710), a content server (720), an artificial intelligence model (730), an encoder (740), a content metadata database (750), and a search batch (760) as components related to the metadata generation function. According to one example, the electronic device (101) may include additional components (e.g., communication circuits) in addition to the illustrated components, or at least one of the illustrated components may be omitted. The omitted components may be provided by an external electronic device (e.g., a cloud server), and the electronic device (101) may transmit and receive necessary data through data communication with the external electronic device.
[0091] The metadata generator (710) can receive content data from the content server (720), obtain additional information about the content, and generate additional metadata. The content data may include one or more pieces of information that can identify the content provided by the content server (720). For example, it may include information on the title, characters, actors, writer, and director of the content. The metadata generator (710) can obtain additional information about the content using the AI model (730). The metadata generator (710) can send a prompt to the AI model (730) requesting the generation of additional information about the content, and receive the additional content data generated by the AI model (730).
[0092] The AI model (730) can generate additional information about the content. For example, the AI model (730) may be a large language model (LLM) that receives a prompt-style request and generates an answer to the request. An LLM refers to a language model based on an artificial neural network that has learned from a large amount of text data through pre-training. An LLM may contain far more parameters (e.g., more than 10 billion) than a conventional general language model. An LLM may use a transformer artificial neural network structure based on an attention mechanism.
[0093] The attention mechanism is a technique that helps artificial intelligence models focus on important parts within input data. The attention mechanism predicts the extent to which certain parts of time-series input data (e.g., input data such as voice or video, or input data for specific layers of a neural network) contribute to the network's intermediate or final output, and can be utilized for output data prediction. While the recurrent neural network (RNN) structure, which processes each element of a sequence sequentially, suffers from degraded prediction performance when there is information dependency over long time-series distances, the attention mechanism can account for information dependency over long time-series distances by controlling the degree of weighted attention within the context of the input data as a whole (or part thereof).
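The weighting idea in paragraph [0093] can be illustrated with scaled dot-product attention, the form commonly used in Transformer models. The disclosure does not name a specific attention variant, so this choice, the function names, and the sample vectors are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights over the input positions and the
    weighted sum of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Positions 0 and 2 carry the same key, so the query attends to them equally
# regardless of how far apart they are in the sequence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
weights, output = attention([1.0, 0.0], keys, values)
```

Because the weights depend on content rather than on position, a distant element can receive as much weight as a nearby one, which is the long-range dependency property contrasted with the RNN structure above.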
[0094] A Transformer can be composed of an encoder-decoder structure. The encoder processes input data to output compressed information (e.g., contextual representation), and the decoder processes the compressed information to output data in token units. Each of the encoder and decoder may include an independent attention network and a cross-attention network connecting the encoder and decoder.
[0095] For example, LLM training may include pre-training and / or fine-tuning. Pre-training is the process of enabling the LLM to acquire general linguistic knowledge using large amounts of text data, and may include, for example, self-supervised learning that predicts the next word from the preceding words in a text sequence. Fine-tuning is the process of training the LLM to suit a specific domain (e.g., chatbot, translation, summarization, Q&A) or task; starting from the pre-trained model, the LLM can be further trained through supervised (or adaptive) learning using a dataset tailored to the domain purpose. The LLM can perform tasks using a text input containing natural language, called a prompt.
[0096] For example, fine-tuning can be omitted during LLM training. Users can adjust the prompts input to the LLM to improve performance on desired tasks. As in in-context learning or zero-shot / few-shot learning, task examples and / or guides for performing the task can be added to the prompts. Examples of publicly available LLMs include BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
[0097] The AI model (730) can receive image information (including video) as additional input besides text. The image information can be converted into text through a separate pre-transformation (e.g., image recognition, scene recognition) and included in the prompt to generate a response. As another example, the input image can be converted by an image encoder into an image embedding aligned with text, and a response can be generated by a separately trained model (e.g., a large multimodal model) using the image embedding together with the text embedding corresponding to the input text.
[0098] The term 'LLM' can refer to the language neural network model itself, but it can also refer to models for LLM-based applications (e.g., chatbots, translation, summarization, text classification, sentence generation). For instance, an LLM-based chatbot or translator can also be referred to as 'LLM'.
[0099] 'LLM' may include an inference engine using an LLM neural network model. For example, "inputting an input prompt into the LLM" may mean "inputting an input prompt into an inference engine based on the LLM." For example, "the output of the LLM for the input prompt" may refer to the output information of the last neural network layer of the LLM (or output information modified through additional processing) obtained when the input prompt is input into an inference engine based on the LLM.
[0100] The metadata generator (710) can treat the additional content data generated by the AI model (730) as additional metadata and have it embedded and stored in the content metadata database (750). The metadata generator (710) can pass the additional metadata to the encoder (740), and the encoder (740) can convert it into an embedding vector and store it in the content metadata database (750).
[0101] A search batch (760) can receive basic information about content from a content server (720) and store it in a content metadata database (750) as origin metadata. The search batch (760) can pass the origin metadata to an encoder (740), and the encoder (740) can convert it into an embedding vector and store it in a content metadata database (750).
[0102] The content metadata database (750) can store and manage one or more metadata about the content in the form of embedding vectors.
[0103] FIG. 8 is an example of metadata according to an embodiment of the present disclosure.
[0104] According to one embodiment, a content metadata database (800) (e.g., the content metadata database (420) of FIG. 4, the content metadata database (750) of FIG. 7) may include a plurality of metadata. Each metadata (801, 802, 803, 804, 805) may include identification information (id), content identification information (contents id), content title (title), data classification (data), and embedding vector.
[0105] The first metadata (801), the second metadata (802), and the third metadata (803) relate to the first content (contents id: 0001). The fourth metadata (804) and the fifth metadata (805) relate to the second content (contents id: 0002). Metadata for the same content may be classified by data classification. The content metadata database (800) may be configured to include a limited number of metadata for each data classification according to the data management policy. For example, if the data classification is plot, up to three metadata entries for plot may be stored. In one embodiment, the data management policy may differ for each content metadata database (800), and the data classification may also be defined differently.
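The record layout of FIG. 8 and the per-classification limit can be sketched as follows. The class and field names are hypothetical, and rejecting an entry once the limit is reached is only one possible data management policy; the disclosure does not say how excess entries are handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Metadata:
    metadata_id: str
    contents_id: str
    title: str
    data: str  # data classification, e.g. "title" or "plot"
    embedding: List[float] = field(default_factory=list)

class ContentMetadataDB:
    """In-memory stand-in for the content metadata database (800)."""

    def __init__(self, limits: Optional[Dict[str, int]] = None):
        # Hypothetical policy: maximum number of entries per content
        # for each data classification, e.g. {"plot": 3}.
        self.limits = limits or {}
        self.records: List[Metadata] = []

    def add(self, record: Metadata) -> bool:
        """Store the record unless its content already holds the allowed number."""
        limit = self.limits.get(record.data)
        if limit is not None:
            count = sum(
                1 for r in self.records
                if r.contents_id == record.contents_id and r.data == record.data
            )
            if count >= limit:
                return False
        self.records.append(record)
        return True

db = ContentMetadataDB(limits={"plot": 3})
results = [db.add(Metadata(f"p{i}", "0001", "first content", "plot")) for i in range(4)]
other = db.add(Metadata("p9", "0002", "second content", "plot"))
```

The limit is counted per content, so a fourth plot entry for content 0001 is refused while the first plot entry for content 0002 is still accepted.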
[0106] The metadata embedding vector can be defined as a value embedded into K dimensions. The number of dimensions can be determined by the text encoder. The dimensions of the metadata embedding vector and the dimensions of the embedding vector embedding the user utterance can be the same.
[0107] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) receives a user utterance and, in the case of a sentence-type search request, can extract one or more contents associated with the user utterance from the metadata included in the content metadata database (800) as search results. The electronic device (101) can calculate the similarity between an embedding vector containing the user utterance and the embedding vector value of the metadata, and extract metadata having a similarity greater than or equal to a threshold value as search results. The content metadata database (800) includes the embedding vector value of each metadata so that the similarity calculation with the embedding vector value containing the user utterance can be performed quickly.
[0108] FIG. 9 is an example of a search result table based on user utterance of an electronic device according to one embodiment of the present disclosure.
[0109] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) can search for content highly associated with a user utterance in a content metadata database (e.g., the content metadata database (420) of FIG. 4, the content metadata database (750) of FIG. 7) and provide the search results to the user. The content metadata database (420) can store information about the content as metadata. The electronic device (101) can output a list of search result metadata based on the similarity search between the content metadata and the user utterance, as shown in Table 1 (910) of FIG. 9.
[0110] In Table 1 (910), the search result metadata list may be displayed sorted in descending order of similarity values (scores). The electronic device (101) or the reordering module of the electronic device (101) (the reordering module (440) in FIG. 4) may re-rank the list according to content in the search result metadata list, taking into account similarity and weights for identical content. Table 2 (920) shows the list re-ranked according to content in the content metadata search results, taking into account similarity and weights for identical content. Table 1 (910) represents a portion of the total search results, and Table 2 (920) represents the re-ranked list of the search results in Table 1 (910).
[0111] Referring to FIG. 9, when the total number of items (N) included in the metadata list is 200 and the rank index according to the similarity value (score) is from 0 to 199, Table 1 (910) represents ranks 0 to 7 according to the similarity value. Table 2 (920) represents a list of Table 1 (910) rearranged by content.
[0112] In Table 1 (910), the rank-0 metadata (metadata id=00101101) has the title "big shot", and one further metadata of the same content (content id=1) appears at rank 5 (metadata id=00110000). The rank-1 metadata (metadata id=00001011) has the title "super man", and one further metadata of the same content (content id=2) appears at rank 7 (metadata id=01011100). The rank-2 metadata (metadata id=10100110) has the title "mission impossible 1", and three metadata of the same content (content id=3) appear at ranks 2, 3, and 4.
[0113] The reordering module (440) can calculate the final score for the contents of Table 1 (910) according to Equation 1. For example, the following Table 2 shows the score calculation formula according to Equation 1.
[0114]
Content id | Title | Final score
1 | Big Shot | 0.79 + 0.1 * (0.79 * (200-0) / 200 + 0.62 * (200-5) / 200)
2 | Superman | 0.75 + 0.1 * (0.75 * (200-1) / 200 + 0.6 * (200-7) / 200)
3 | Mission Impossible1 | 0.72 + 0.1 * (0.72 * (200-2) / 200 + 0.71 * (200-3) / 200 + 0.65 * (200-4) / 200)
4 | GodFather2 | 0.61 + 0.1 * (0.61 * (200-6) / 200)
[0115] The result of reordering based on the final scores calculated as in Table 2 is shown in Table 2 (920) of FIG. 9. Unlike in Table 1 (910), the content at rank 1 in Table 2 (920) is Mission Impossible1, not Superman. This reflects the fact that three metadata entries were retrieved for Mission Impossible1 while two were retrieved for Superman.
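Evaluating the four expressions of Table 2 with N = 200 and w = 0.1 gives the values below. The arithmetic is supplied here for illustration and does not appear in the original text.

```latex
\begin{aligned}
\text{Big Shot} &: 0.79 + 0.1\,(0.79 + 0.6045) = 0.92945 \\
\text{Superman} &: 0.75 + 0.1\,(0.74625 + 0.579) = 0.882525 \\
\text{Mission Impossible1} &: 0.72 + 0.1\,(0.7128 + 0.69935 + 0.637) = 0.924915 \\
\text{GodFather2} &: 0.61 + 0.1\,(0.5917) = 0.66917
\end{aligned}
```

Sorted by final score, the order is Big Shot, Mission Impossible1, Superman, GodFather2, so Mission Impossible1 moves ahead of Superman while Big Shot stays at rank 0.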
[0116] FIG. 10 is an example of a search result screen of an electronic device according to one embodiment of the present disclosure.
[0117] According to one embodiment, an electronic device (e.g., the electronic device (101) of FIG. 1, the electronic device (201) of FIG. 2) can output a search result screen for a user utterance through a display (1001) (e.g., the display (110) of FIG. 1, the display (230) of FIG. 2).
[0118] According to one embodiment, the electronic device (101) may display a first part (1010) indicating that a search result for a search query including a user utterance is displayed, a second part (1021) indicating that content with the highest similarity to the user utterance is displayed as a search result, a third part (1022) indicating that a search result content according to a type of metadata is displayed, and a fourth part (1023) on a display (1001) screen.
[0119] The first part (1010) can directly display the text part of the user's speech that has been recognized as speech ("Find me a movie where detectives sell chicken") and, to indicate that a highly relevant result is displayed, display "a highly relevant result for 'user speech'".
[0120] The second part (1021) can display the content with the highest search ranking based on the result of reordering reflecting similarity judgment and weights for identical content. The second part (1021) can display the type of metadata (e.g., key scenes) that was judged to have high similarity, and can also display basic information about the content.
[0121] The third part (1022) and the fourth part (1023) can display the content with the next highest search rankings after similarity judgment and reordering, and can specify the type of matched metadata, which is a title for the third part (1022) and a plot for the fourth part (1023).
[0122] An electronic device (101) according to one embodiment can provide a user with an explanation of the search result process by displaying metadata types (e.g., main scene, title, plot) for the search result together.
[0123] FIG. 11 illustrates an electronic device, a database, and a cloud server according to one embodiment of the present disclosure.
[0124] According to one embodiment, an electronic device (1110) can interact with a cloud server (1130) and a content metadata database (1150) to provide content search based on user utterance.
[0125] The content metadata database (1150) can store metadata about the content in the form of embedding vectors and may be included in the memory of an electronic device (e.g., the memory (220) of FIG. 2) or implemented as a separate storage device. When the content metadata database (1150) is included in a separate storage device as in FIG. 11, the electronic device (1110) can access the content metadata of the content metadata database (1150) through a communication circuit (e.g., the communication circuit (270) of FIG. 2).
[0126] The electronic device (1110) can determine the similarity between user utterances and content metadata through the processor of the electronic device (e.g., the processor (210) of FIG. 2). Alternatively, at least some of the similarity determination and of the reordering that considers weights for identical content can be performed by a cloud server (1130) that provides semantic-based content search. The electronic device (1110) can transmit and receive data with the cloud server (1130) through the communication circuit (270) of the electronic device (1110). The cloud server (1130) can be connected to the electronic device (1110) and the content metadata database (1150) based on wireless communication.
[0127] According to one embodiment, a server (1130) comprises a communication circuit; a memory; and at least one processor, and when the at least one processor receives a user voice input for a content search request from a first electronic device through the communication circuit, it obtains a similarity between a plurality of metadata embedding vectors included in a metadata database and a first query embedding obtained based on the user voice input, obtains one or more similar metadata corresponding to a similarity greater than or equal to a first threshold among the plurality of metadata embedding vectors, and transmits information related to content corresponding to the one or more similar metadata to the first electronic device through the communication circuit, including order information based on at least one of the similarity or a weight for other metadata of the same content.
[0128] According to one embodiment, the at least one processor may receive information corresponding to a first content from an external server through the communication circuit, and store an embedding vector obtained based on the information corresponding to the first content as metadata for the first content in the metadata database.
[0129] According to one embodiment, information corresponding to the first content may include at least one of a title, characters, actors, writer, director, release date, genre, synopsis, or main plot.
[0130] According to one embodiment, the at least one processor may acquire additional information corresponding to the first content using an artificial intelligence model, and acquire an embedding vector acquired based on the additional information as metadata corresponding to the first content and store it in the metadata database.
[0131] According to one embodiment, the at least one processor generates a prompt corresponding to the acquisition of one or more category information corresponding to the first content, and based on the prompt, can acquire information obtained through the artificial intelligence model as additional information for the first content.
[0132] According to one embodiment, the additional information may include at least one of the era, atmosphere, theme, setting, subject, character occupation, prominent keywords, cultural background, story classification, main audience, film industry, surrounding relationships, hidden meaning, hidden message, plot, year of release, main actors, director, or related video.
[0133] According to one embodiment, the at least one processor may obtain the similarity between each of the plurality of metadata embedding vectors included in the metadata database and the first query embedding through a cosine similarity calculation method.
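Cosine similarity between a query embedding q and a metadata embedding m is the dot product of q and m divided by the product of their Euclidean norms. The sketch below computes it in plain Python and keeps only the entries that reach a threshold. The function names, the threshold value used in the usage note, and the convention of returning 0.0 for a zero vector are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def find_similar(
    query: Sequence[float],
    metadata_vectors: Sequence[Sequence[float]],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs at or above threshold, best first."""
    hits = []
    for index, vec in enumerate(metadata_vectors):
        score = cosine_similarity(query, vec)
        if score >= threshold:
            hits.append((index, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

For example, `find_similar([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 0.7)` keeps the vectors at indices 0 and 2 and drops the orthogonal vector at index 1.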
[0134] According to one embodiment, the at least one processor may assign weights to the similarities of metadata corresponding to the same content among the one or more similar metadata, obtain a final similarity for each content, and rearrange the one or more similar metadata according to the obtained final similarity.
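This paragraph does not fix a particular weighting formula, so the following is only one possible sketch. It assumes that the final similarity of a content is its best metadata similarity plus a small weighted bonus for every other matching metadata of the same content; the name `rerank_by_content` and the default `extra_weight` of 0.1 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (content_id, metadata category, similarity)


def rerank_by_content(hits: List[Hit], extra_weight: float = 0.1) -> List[Tuple[str, float]]:
    """Combine per-metadata similarities into one score per content, best first."""
    by_content: Dict[str, List[float]] = defaultdict(list)
    for content_id, _category, score in hits:
        by_content[content_id].append(score)

    finals = []
    for content_id, scores in by_content.items():
        scores.sort(reverse=True)
        # Assumed rule: best score, plus a weighted bonus for the other matches.
        final = scores[0] + extra_weight * sum(scores[1:])
        finals.append((content_id, final))
    return sorted(finals, key=lambda item: item[1], reverse=True)
```

Under this assumed rule, a content matched by two metadata entries at 0.80 and 0.75 receives a final similarity of 0.875 and is ranked above a content matched by a single entry at 0.82.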
[0135] According to one embodiment, if metadata having a similarity greater than or equal to the first threshold among the plurality of metadata embedding vectors is not obtained, the at least one processor may store the user voice input in a voice input database and transmit information corresponding to the search failure to the first electronic device through the communication circuit.
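The failure path can be sketched as follows, assuming a plain list stands in for the voice input database and a dictionary stands in for the message sent to the first electronic device. The field names `status` and `results` and the function name `respond_to_search` are hypothetical.

```python
from typing import Dict, List, Tuple


def respond_to_search(
    utterance: str,
    hits: List[Tuple[str, float]],
    voice_input_log: List[str],
) -> Dict[str, object]:
    """Build a response; when nothing met the threshold, log the utterance."""
    if not hits:
        voice_input_log.append(utterance)  # kept for later analysis
        return {"status": "search_failed", "results": []}
    return {"status": "ok", "results": [content_id for content_id, _ in hits]}
```

Here `hits` is expected to contain only entries that already met the first threshold, so an empty list corresponds to the search failure case.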
[0136] The embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of said embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of said items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "first" and "second," or "1st" and "2nd," may be used simply to distinguish a component from other components and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., 1st) component is referred to as “coupled” or “connected” to another (e.g., 2nd) component, with or without the terms “functionally” or “communicatively,” it means that the component may be connected to the other component directly (e.g., via a wire), wirelessly, or through a third component.
[0137] The term “module” as used in the embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).
[0138] One embodiment of the present document may be implemented as software (e.g., program (140)) comprising one or more instructions stored in a storage medium (e.g., internal memory (136) or external memory (138)) readable by a machine (e.g., electronic device (101)). For example, a processor (e.g., processor (120)) of the machine (e.g., electronic device (101)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.
[0139] According to one embodiment, the method according to the embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.
[0140] According to one embodiment, each component (e.g., module or program) of the components described above may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to one embodiment, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as they were performed by the corresponding component among the multiple components prior to integration. According to one embodiment, operations performed by the module, program, or other components may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
Claims
1. An electronic device comprising: a display; a microphone; and at least one processor, wherein the at least one processor is configured to: receive a user voice input through the microphone; obtain a first query embedding based on the user voice input; obtain a similarity between each of a plurality of metadata embedding vectors included in a metadata database and the first query embedding; obtain, based on the similarity, one or more similar metadata having a similarity greater than or equal to a first threshold among the plurality of metadata embedding vectors; obtain an order of the one or more similar metadata based on the similarity or a weight for one or more metadata of the same content; and control the display to output information related to content corresponding to the one or more similar metadata in the obtained order.
2. The electronic device of claim 1, wherein the at least one processor is configured to: receive information corresponding to a first content from a server; and obtain an embedding vector, obtained based on the information corresponding to the first content, as metadata for the first content.
3. The electronic device of claim 2, wherein the information corresponding to the first content includes at least one of a title, characters, actors, writer, director, synopsis, release date, genre, or main plot.
4. The electronic device of claim 2, wherein the at least one processor is configured to: obtain additional information corresponding to the first content using an artificial intelligence model; and obtain an embedding vector, obtained based on the additional information, as metadata corresponding to the first content.
5. The electronic device of claim 4, wherein the at least one processor is configured to: generate a prompt for acquiring information on one or more categories corresponding to the first content; and obtain information, obtained through the artificial intelligence model based on the prompt, as the additional information corresponding to the first content.
6. The electronic device of claim 4, wherein the additional information includes at least one of historical background, atmosphere, theme, setting, subject, character occupation, prominent keywords, cultural background, story classification, target audience, film industry, surrounding relationships, hidden meaning, hidden message, plot, year of release, main actors, director, or related video.
7. The electronic device of claim 1, wherein the at least one processor is configured to obtain the similarity between each of the plurality of metadata embedding vectors included in the metadata database and the first query embedding through a cosine similarity calculation method.
8. The electronic device of claim 7, wherein the at least one processor is configured to: assign a weight to the similarity of metadata belonging to the same content among the one or more similar metadata; obtain a final similarity for each content for the one or more similar metadata; and rearrange the one or more similar metadata according to the obtained final similarity.
9. The electronic device of claim 1, wherein the at least one processor is configured to control the display to output the information related to the content corresponding to the one or more similar metadata, the information including a description corresponding to a category of the metadata.
10. The electronic device of claim 1, wherein the at least one processor is configured to: based on the one or more similar metadata corresponding to a similarity less than the first threshold, store the user voice input in a voice input database; and control the display to output information corresponding to a search failure.
11. A server comprising: a communication circuit; a memory; and at least one processor, wherein the at least one processor is configured to: receive a user voice input for a content search request from a first electronic device through the communication circuit; obtain a first query embedding based on the user voice input; obtain a similarity between each of a plurality of metadata embedding vectors included in a metadata database and the first query embedding; obtain, based on the similarity, one or more similar metadata having a similarity greater than or equal to a first threshold among the plurality of metadata embedding vectors; obtain an order of the one or more similar metadata based on the similarity or a weight for one or more metadata of the same content; and transmit information related to content corresponding to the one or more similar metadata, including information on the order, to the first electronic device through the communication circuit.
12. The server of claim 11, wherein the at least one processor is configured to: receive information corresponding to a first content from an external server through the communication circuit; and store an embedding vector, obtained based on the information corresponding to the first content, in the metadata database as metadata for the first content.
13. The server of claim 12, wherein the information corresponding to the first content includes at least one of a title, characters, actors, writer, director, synopsis, release date, genre, or main plot.
14. The server of claim 12, wherein the at least one processor is configured to: obtain additional information corresponding to the first content using an artificial intelligence model; and obtain an embedding vector based on the additional information as metadata corresponding to the first content and store it in the metadata database.
15. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving a user voice input through a microphone of an electronic device; obtaining a first query embedding based on the user voice input; obtaining a similarity between each of a plurality of metadata embedding vectors included in a metadata database and the first query embedding; obtaining, based on the similarity, one or more similar metadata having a similarity greater than or equal to a first threshold among the plurality of metadata embedding vectors; obtaining an order of the one or more similar metadata based on the similarity or a weight for one or more metadata of the same content; and controlling a display of the electronic device to output information related to content corresponding to the one or more similar metadata in the obtained order.