Method for generating reaction content, and device thereof

The method and device generate personalized reaction content by using user-specific data to create prompts for generative models, addressing the challenge of time-consuming creative content production with existing AI models.

WO2026135188A1 · PCT designated stage · Publication Date: 2026-06-25 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2025-12-16
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Existing generative AI models struggle to produce personalized and creative reaction content that reflects a user's unique characteristics and preferences for video content, requiring significant time and effort from users.

Method used

A method and electronic device that obtain content-related information, including image and voice data from a video, identify user-specific scenes and inputs, and generate a prompt using a generative model to create personalized reaction content, incorporating user characteristics and preferences.

Benefits of technology

Enables quick and effective generation of diverse and creative reaction content that reflects a user's unique style and interests, reducing the time and effort required for content creation.

✦ Generated by Eureka AI based on patent content.


Abstract

A method for generating reaction content and an electronic device are provided. The method for generating reaction content comprises the steps of: acquiring content-related information including image data corresponding to a first video and voice data corresponding to the first video; acquiring, on the basis of user characteristic information related to the content-related information, at least one scene included in the first video and a user voice input corresponding to the at least one scene; acquiring a prompt on the basis of the user characteristic information related to the content-related information, the at least one scene included in the first video, and the user voice input corresponding to the at least one scene; and acquiring reaction content corresponding to the first video through a generative model by using the prompt.

Description

Method and apparatus for generating reaction content

[0001] The present disclosure relates to a method for generating reaction content and an electronic device for generating reaction content.

[0002] Generative Artificial Intelligence (GAI) is a technology that learns the structure and patterns of large-scale data and generates new synthetic data based on input data. It produces human-level results in various tasks involving text, images, voice, video, music, and more. For example, a generative model generates new data based on given data such as text, images, voice, video, and music. However, even if a generative model can generate new data containing content that matches the input based on the given data, it is difficult to obtain results specialized for the user.

[0003] Recently, with the increase in various VOD and streaming video services, users have gained easy access to a large volume of video content. Consequently, there is a growing trend of users creating "reaction content" (or "reaction recording content"), which records their impressions or reactions after watching video content. This reaction content is primarily recorded in the form of posts, reviews, or videos posted on social media platforms or blogs, and can manifest as a way for users to share their opinions or experiences regarding video content. As a form of user creative activity, reaction content can reflect the user's unique characteristics, interests, or content creation style. However, there is a limitation in that it requires a significant amount of time for users to produce more diverse and creative reaction content.

[0004] Accordingly, there is a demand for content creation technology capable of producing personalized reaction content more quickly and effectively, reflecting users' unique characteristics, preferences for video content, interests, or content creation styles.

[0005] A method for generating reaction content according to one embodiment of the present disclosure comprises: a step of obtaining content-related information including image data corresponding to a first video and voice data corresponding to the first video; a step of obtaining at least one scene included in the first video and a user voice input corresponding to the at least one scene based on user characteristic information related to the content-related information; a step of obtaining a prompt based on the user characteristic information related to the content-related information, the at least one scene included in the first video, and the user voice input corresponding to the at least one scene; and a step of obtaining reaction content corresponding to the first video through a generative model using the prompt.

[0006] An electronic device according to one embodiment of the present disclosure includes at least one processor and a memory comprising one or more storage media for storing one or more instructions.

[0007] According to one embodiment of the present disclosure, at least one processor obtains content-related information including image data corresponding to a first video and voice data corresponding to the first video.

[0008] According to one embodiment of the present disclosure, at least one processor obtains at least one scene included in the first video and a user voice input corresponding to the at least one scene, based on user characteristic information related to the content-related information.

[0009] According to one embodiment of the present disclosure, at least one processor obtains a prompt based on user characteristic information related to the content-related information, at least one scene included in the first video, and a user voice input corresponding to the at least one scene.

[0010] According to one embodiment of the present disclosure, at least one processor obtains reaction content corresponding to the first video through a generative model using the prompt.

[0011] The present disclosure can be easily understood from the combination of the following detailed description and the accompanying drawings, where reference numerals denote structural elements.

[0012] FIG. 1 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to generate a prompt related to the reaction of a user who has watched a first video.

[0013] FIG. 2 is a diagram illustrating the operation of obtaining reaction recording content from a prompt generated by an electronic device according to one embodiment of the present disclosure.

[0014] FIG. 3 is a block diagram illustrating the operation of an electronic device, a heart rate measuring device, and a server according to one embodiment of the present disclosure.

[0015] FIG. 4 is a diagram illustrating the operation of converting collected data into a vector form by an electronic device according to one embodiment of the present disclosure.

[0016] FIG. 5 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure generating a prompt through a user-generated content identification module.

[0017] FIG. 6 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to identify user-generated content related to time-series content information.

[0018] FIG. 7 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to generate a prompt through a scene and speech generation module and a user preference information identification module.

[0019] FIG. 8 is a diagram for representing data collected by an electronic device according to one embodiment of the present disclosure as time-series information according to time intervals.

[0020] FIG. 9a is an example of an operation in which an electronic device according to one embodiment of the present disclosure calculates importance based on the similarity between a user speech input and voice data of a first video / image data of a first video.

[0021] FIG. 9b is an example of an operation in which an electronic device according to one embodiment of the present disclosure calculates importance based on the similarity between user preference information and voice data of the first video / image data of the first video.

[0022] FIG. 10 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to generate a scene in a first video based on importance.

[0023] FIG. 11a is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to generate a set of consecutive images of a scene in a first video based on importance.

[0024] FIG. 11b is a diagram illustrating the method of FIG. 11a.

[0025] FIG. 12 is an example of a scene generated by an electronic device according to one embodiment of the present disclosure and a speech input corresponding to the scene.

[0026] FIG. 13 is an example showing additional comments mapped to importance values in an electronic device according to one embodiment of the present disclosure.

[0027] FIG. 14 is a flowchart illustrating a method for an electronic device to generate a prompt according to one embodiment of the present disclosure.

[0028] FIG. 15 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to obtain reaction recording content from a prompt.

[0029] FIG. 16 is a detailed block diagram of an electronic device according to one embodiment of the present disclosure.

[0030] In the present disclosure, the expression “at least one of a, b, or c” may refer to “a”, “b”, “c”, “a and b”, “a and c”, “b and c”, “a, b, and c all”, or variations thereof.

[0031] Embodiments of the present disclosure are described below in detail with reference to the attached drawings so that those skilled in the art can easily implement them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein.

[0032] The terms used in this disclosure are general terms currently in wide use, selected in consideration of the functions mentioned herein; however, they may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Accordingly, the terms used in this disclosure should not be interpreted solely by their names, but should be interpreted based on the meaning of the terms and the overall content of this disclosure.

[0033] Furthermore, the terms used in this disclosure are used merely to describe specific embodiments and are not intended to limit this disclosure.

[0034] Throughout the specification, when a part is described as being "connected" to another part, this includes not only cases where they are "directly connected," but also cases where they are "electrically connected" with other components in between.

[0035] The term “the” and similar referring expressions used in this specification, particularly in the claims, may indicate both singular and plural forms. Furthermore, unless the order of the steps of the method according to this disclosure is explicitly specified, the described steps may be performed in any suitable order. This disclosure is not limited by the order in which the steps are described.

[0036] Phrases such as "in one embodiment" appearing in various places in this specification do not necessarily refer to the same embodiment.

[0037] Some embodiments of the present disclosure may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented by various numbers of hardware and / or software configurations that execute specific functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for a specific function. Additionally, for example, the functional blocks of the present disclosure may be implemented in various programming or scripting languages. The functional blocks may be implemented as algorithms executed on one or more processors. Furthermore, the present disclosure may employ prior art for electronic configuration, signal processing, and / or data processing, etc. Terms such as “mechanism,” “element,” “means,” and “configuration” may be used broadly and are not limited to mechanical and physical configurations.

[0038] Furthermore, the connecting lines or connecting members between the components depicted in the drawings are merely illustrative of functional connections and / or physical or circuit connections. In the actual device, connections between components may be represented by various alternative or added functional connections, physical connections, or circuit connections.

[0039] Additionally, terms such as "...part," "module," etc., as described in the specification refer to a unit that processes at least one function or operation, and this may be implemented in hardware or software, or as a combination of hardware and software.

[0040] In the present disclosure, the “processor” may include various processing circuits and / or a plurality of processors. For example, the term “processor” as used herein, including in the claims, may include at least one processor and various processing circuits. In the at least one processor, one or more processors may be configured to perform the various functions described herein in a distributed manner, individually and / or collectively. As used herein, the “processor,” “at least one processor,” and “one or more processors” may be configured to perform various functions. However, these terms cover, without limitation, situations where one processor performs some of the functions and other processor(s) perform other parts of the functions, and situations where a single processor can perform all functions. Additionally, the at least one processor may include a combination of processors performing various functions of the disclosed functions in a distributed manner. The at least one processor may execute program instructions to achieve or perform various functions.

[0041] In the present disclosure, expressions such as 'first similarity,' 'second similarity,' etc., may be terms used to distinguish similarities of different meanings within a single sentence (or paragraph). However, even if such terms are used identically throughout the entire disclosure, they may not be used with the same meaning. Depending on the context, they may signify different degrees of similarity.

[0042] In the present disclosure, the term "user" refers to a person using an electronic device and may include a consumer, evaluator, viewer, administrator, or installer. Additionally, in the specification, "manufacturer" or "provider" may refer to a manufacturer that manufactures an electronic device and / or components included in the electronic device.

[0043] In the present disclosure, 'image' may include a still image, a graphic, a picture, a frame, a video composed of a plurality of consecutive still images, or a video.

[0044] In this disclosure, the term "neural network" is a representative example of an artificial neural network model that mimics brain neurons and is not limited to an artificial neural network model using a specific algorithm. The term "neural network" may also be referred to as a deep neural network.

[0045] The present disclosure will be described in detail below with reference to the attached drawings.

[0046] FIG. 1 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure generating a prompt related to the reaction of a user who has watched a first video. FIG. 2 is a diagram illustrating the operation of obtaining reaction recording content from a prompt generated by an electronic device according to one embodiment of the present disclosure.

[0047] Referring to FIGS. 1 and 2, a system for generating reaction recording content according to one embodiment of the present disclosure may include an electronic device (1000), a heart rate measuring device (2000 in FIG. 1), and a server (3000 in FIG. 2).

[0048] An electronic device (1000) according to one embodiment of the present disclosure may be implemented as an electronic device of various types and forms including a display. The electronic device (1000) may include devices capable of displaying through a display, such as a smart TV, smartphone, tablet PC, PDA (personal digital assistant), laptop PC, glasses-type display, and head-mounted display (HMD), but is not limited thereto. For example, the electronic device (1000) may be implemented as an electronic device of various types and forms capable of wired / wireless connection with a display. For example, the electronic device (1000) may include devices capable of displaying through wired / wireless connection with a display, such as a set-top box or desktop PC, but is not limited thereto.

[0049] An electronic device (1000) according to one embodiment of the present disclosure can communicate with a heart rate measuring device (2000) and a server (3000) through a network. The network can be implemented as a wired network such as a Local Area Network (LAN), a Wide Area Network (WAN), or a Value Added Network (VAN), or as any type of wireless network such as a mobile radio communication network or a satellite communication network.

[0050] In one embodiment of the present disclosure, the heart rate measuring device (2000) may be a device that measures a user's heart rate using at least one heart rate sensor. For example, the heart rate sensor may include an ECG (Electrocardiogram) sensor that detects heart rate and heart activity, a PPG (Photoplethysmogram) sensor that measures heart rate and oxygen saturation, etc. The heart rate measuring device (2000) may be implemented as a wearable device such as a smart watch or smart glasses and may be worn by a user, but is not limited thereto. The heart rate measuring device (2000) and the electronic device (1000) may be connected to the same user account. For example, the heart rate measuring device (2000) may transmit the measured user's heart rate information (2001) to the electronic device (1000).

[0051] In one embodiment of the present disclosure, the server (3000) may be a device that generates user-customized reaction recording content using a generative model. The server (3000) may be a device capable of processing complex computations and tasks using large-scale data, such as training, inference, management, and distribution of the generative model. According to one embodiment of the present disclosure, the training of the generative model executed on the server (3000) may be performed by another computing device.

[0052] An electronic device (1000) according to one embodiment of the present disclosure may generate a prompt (210), which is the input data for a generative model (3001), in order to obtain reaction recording content (220) that records the reaction of a user watching a first video (110). The electronic device (1000) may generate the prompt (210) for the first video (110) and obtain the reaction recording content (220) for the first video (110) through the generative model (3001).

[0053] In the present disclosure, a 'prompt' may be used as input information required for a generative model to perform a task. The prompt may include natural language text. The natural language text may include various information such as a task indicating the task to be performed by the generative model, and a context, intent, constraints, and example as components available when the generative model performs the task. An electronic device (1000) may process the natural language text using a Natural Language Processing (NLP) model. In the present disclosure, the prompt may be replaced with an input, command, directive, input phrase, starting sentence, task query, trigger sentence, etc.

[0054] In the present disclosure, a prompt may include a multimedia prompt that integrates various types of media elements, including text, images, voice, video, music, animation, etc. A multimedia prompt may be a combination of different types of media elements in the same situation.

[0055] In the present disclosure, "Generative Artificial Intelligence" may refer to an artificial intelligence technology capable of generating new text, images, etc. in response to input data (e.g., text, images, etc.). In the present disclosure, "Generative model" may refer to a neural network model that implements generative AI technology. A generative model can generate new data having characteristics similar to the input data or new data corresponding to the input data by learning the patterns and structures of training data. For example, a generative model can generate text from text using a text-to-text method. For example, a generative model can receive a multimedia prompt as input and generate reaction recording content containing multimedia data.

[0056] In the present disclosure, 'reaction recording content' may represent content in which a user records their impressions or reactions after using, watching, or consuming video content. As a creative activity of the user, the reaction recording content may reflect the user's unique characteristics, interests, or content creation style. In the present disclosure, the reaction recording content may be multimedia content in which various types of media elements, including text, images, voice, video, music, animation, etc., are combined in response to a multimedia prompt.

[0057] Hereinafter, with reference to FIG. 1, a method for an electronic device (1000) to automatically generate a prompt (210) for a first video (110) will be described.

[0058] An electronic device (1000) according to one embodiment of the present disclosure may obtain content-related information including speech input (user speech input) (120) of a user watching a first video (110), image data of the first video (110), and voice data of the first video (110). From among the user characteristic information (e.g., a plurality of user-generated content (130) and user preference information (140)), the electronic device (1000) may identify user characteristic information related to the content-related information (e.g., user-generated content (135) and user-preferred keywords (145)). Based on the content-related information, the electronic device (1000) may obtain at least one scene (115) within the first video (110) and a user speech clip (125) corresponding to the at least one scene (115) within the user speech input (120). The electronic device (1000) can generate a prompt (210) based on the user characteristic information related to the content-related information (e.g., user-generated content (135) and user-preferred keywords (145)), the at least one scene (115) in the first video (110), and the user speech clip (125) corresponding to the at least one scene (115).

[0059] In the present disclosure, 'content-related information' may represent various types of data generated during the playback process of video content. Content-related information may correspond to time-series content information that represents various types of data over time. Content-related information may be acquired while a user is watching a video. Content-related information may be information configured by synchronizing a specific scene of the video with the user's speech input in that scene over time, but is not limited thereto. For example, content-related information may include user speech input (120) containing the speech input of a user watching the first video (110), image data of the first video (110), and voice data of the first video (110).
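As a rough illustration of what such time-series content information could look like, the Python sketch below buckets timestamped samples of the video's image data, the video's voice data, and the user speech input into fixed-length playback intervals. The `TimeSlot` structure, the `build_timeline` function, and the fixed interval length are assumptions made for this example; the disclosure does not prescribe a data layout.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TimeSlot:
    """One playback interval (hypothetical structure, not from the disclosure)."""
    start_s: float
    end_s: float
    frame_ids: list = field(default_factory=list)    # image data of the first video
    video_audio: list = field(default_factory=list)  # voice data of the first video
    user_speech: list = field(default_factory=list)  # user speech input


def build_timeline(duration_s, interval_s, frames, video_audio, user_speech):
    """Group (timestamp_in_seconds, value) samples into slots of interval_s seconds."""
    n_slots = max(1, math.ceil(duration_s / interval_s))
    slots = [
        TimeSlot(i * interval_s, min((i + 1) * interval_s, duration_s))
        for i in range(n_slots)
    ]

    def put(samples, attr):
        for t, value in samples:
            if 0 <= t < duration_s:  # ignore samples outside the playback range
                getattr(slots[int(t // interval_s)], attr).append(value)

    put(frames, "frame_ids")
    put(video_audio, "video_audio")
    put(user_speech, "user_speech")
    return slots


timeline = build_timeline(
    duration_s=30,
    interval_s=10,
    frames=[(0.0, 0), (12.0, 1)],
    video_audio=[(11.5, "Goal!")],
    user_speech=[(12.3, "Wow, the goal went in")],
)
```

Keeping all three streams in the same slot is what makes it possible to look up what the user said during a given scene.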

[0060] In the present disclosure, the ‘first video’ may be video content intended for viewing by a user. After watching the first video (110), the user may create reaction recording content to record reactions, impressions, memorable highlight scenes, or reviews or opinions regarding the highlight scenes of the first video (110). In one embodiment of the present disclosure, the first video (110) may include image data (or video data) and voice data (or audio data).

[0061] In the present disclosure, 'image data of the first video' may include data for each frame of the first video (110). In the present disclosure, 'audio data of the first video' may include audio information included in the audio track of the first video (110), for example, narration, dialogue, background music. Each of the image data of the first video (110) and the audio data of the first video (110) may be played over time.

[0062] In one embodiment of the present disclosure, the first video (110) is exemplified as being displayed through a display provided in the electronic device (1000), but is not limited thereto. The first video (110) may also be output through a separate display connected to the electronic device (1000) via a wired or wireless connection. In this case, image data and audio data of the first video (110) may be stored in the electronic device (1000).

[0063] In the present disclosure, 'user speech input' may include speech input by a user that occurs while watching the first video (110). The user speech input may correspond to a specific time interval of the first video frame. The user speech input (120) may include speech input by a user who is the primary viewer watching the first video (110). The user speech input (120) may be in the form of voice data or in the form of text data converted from voice data. However, it is not limited thereto, and throughout the entire disclosure, the user speech input may include not only the user's speech but also various user voice inputs generated by the user. That is, 'user speech input' may be replaced with 'user voice input'.

[0064] In one embodiment of the present disclosure, user speech input (120) may be obtained through a voice input device (e.g., a microphone) provided in an electronic device (1000). For example, the electronic device (1000) may process an analog voice signal input through a microphone and store it as a digital signal. However, it is not limited thereto, and the user speech input (120) may be collected through a separate voice input device connected to the electronic device (1000) via wired or wireless connection and transmitted to the electronic device (1000).

[0065] In the present disclosure, 'at least one scene in the first video' may represent a summary of the main content of the first video (110), a specific scene in the first video (110), or a highlight scene that was impressive to the user. For example, at least one scene (115) in the first video (110) may be represented as a single frame image extracted from the first video (110). Alternatively, for example, at least one scene (115) in the first video (110) may be represented as a series of frames extracted from the first video (110), for example, a video or a GIF (Graphics Interchange Format). In the present disclosure, 'scene' may be replaced with an important scene, a main scene, a highlight scene, a summary scene, or a video clip.

[0066] In the present disclosure, a ‘user’s utterance clip’ may represent a short segment cut from the user’s entire utterance input. In one embodiment of the present disclosure, an utterance clip (125) corresponding to at least one scene (115) may include an utterance input of a time segment corresponding to at least one frame included in the at least one scene (115). For example, the utterance clip (125) may include an utterance input (e.g., ‘and went into a goal’) that falls, within the user utterance input (120), in a time segment (e.g., 1 minute 20 seconds) that is the same as or close to that of the identified at least one scene (115). In the present disclosure, an ‘utterance clip’ may be replaced with an extracted utterance input, an utterance segment, etc.
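One simple way to cut such a clip, assuming the user's utterances have already been transcribed with start and end times, is to keep every utterance whose time span overlaps the scene's time span, widened by a small tolerance to cover "the same as or close to". The `Utterance` type, the `clip_for_scene` function, and the 2-second default tolerance are hypothetical choices for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """A transcribed piece of user speech with its time span (hypothetical type)."""
    start_s: float
    end_s: float
    text: str


def clip_for_scene(utterances, scene_start_s, scene_end_s, tolerance_s=2.0):
    """Return the utterances that overlap the scene interval, widened by tolerance_s."""
    lo = scene_start_s - tolerance_s
    hi = scene_end_s + tolerance_s
    return [u for u in utterances if u.start_s <= hi and u.end_s >= lo]


utterances = [
    Utterance(10.0, 12.0, "nice pass"),
    Utterance(79.0, 82.0, "and went into a goal"),
    Utterance(200.0, 203.0, "that was close"),
]

# Scene starting at 1 minute 20 seconds (80 s) and lasting 5 seconds.
clip = clip_for_scene(utterances, scene_start_s=80.0, scene_end_s=85.0)
```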

[0067] In the present disclosure, 'user characteristic information' is unique characteristic information distinguishable from each user and may include the user's voice information, user preference information (140), and user-generated content (130), etc. The user characteristic information may be information stored in advance in the local memory of the electronic device (1000) or in an external database.

[0068] In the present disclosure, the user's voice information represents information regarding the speech characteristics of the user, who is the primary viewer watching the video, and may represent the user's voice, tone, manner of speech, speed, emotion, etc. The user's voice information may be information used in a speaker separation algorithm to extract the speech input of the user, who is the primary viewer, among a number of viewers watching the video.

[0069] In the present disclosure, user preference information (140) may represent an entire data set containing user preference keywords. In the present disclosure, user preference information (145) related to content-related information may represent at least one keyword related to a specific scene or user speech input regarding a specific scene among the user preference information.

[0070] In the present disclosure, user-generated content (130) may represent an entire set of content created by the user. User-generated content (130) may represent the style of the content created by the user, examples of texts written by the user, the user's writing style, language selection, expression method, etc. In the present disclosure, user-generated content (135) related to content-related information may represent at least one user-generated content related to a specific scene or the user's speech input regarding a specific scene among the entire user-generated content.
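To make the distinction between the full sets (130, 140) and the related subsets (135, 145) concrete, the sketch below selects preferred keywords that occur in a user speech input and ranks user-generated content by word overlap with it. Plain token matching and Jaccard overlap are stand-ins chosen for brevity; they are assumptions of this example and not the selection method of the disclosure, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def related_keywords(preferred_keywords, speech_text):
    """Keep the keywords whose words all occur in the speech text."""
    words = tokens(speech_text)
    return [k for k in preferred_keywords if tokens(k) and tokens(k) <= words]


def most_related_content(contents, speech_text, top_k=1):
    """Rank user-generated content by Jaccard word overlap with the speech text."""
    words = tokens(speech_text)

    def jaccard(content):
        t = tokens(content)
        union = t | words
        return len(t & words) / len(union) if union else 0.0

    return sorted(contents, key=jaccard, reverse=True)[:top_k]


speech = "Wow, the goal went in"
keywords = related_keywords(["goal", "player X", "offside"], speech)
examples = most_related_content(
    ["My pasta recipe review", "What a goal in the final"], speech
)
```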

[0071] Referring to FIG. 2, in one embodiment of the present disclosure, a prompt (210) for a first video (110) may include information used to generate user-customized reaction recording content. For example, the prompt (210) for the first video (110) may include an operation (212 in FIG. 2) indicating 'creating reaction recording content (220) for the first video (110) viewed by the user'. The prompt (210) for the first video (110) may include components indicating at least one scene (211 in FIG. 2) within the first video (110), at least one speech clip (214 in FIG. 2) corresponding to the at least one scene (211 in FIG. 2), and user characteristic information related to content-related information. For example, user characteristic information related to content-related information may include user-generated content related to content-related information (213 in FIG. 2) and user-preferred keywords related to content-related information (215 in FIG. 2). The components of the prompt (210) for the first video (110) may further include additional comments (216) mapped to importance values of the at least one scene (211).

[0072] Here, at least one scene (211) may be image data corresponding to at least one scene (115 in FIG. 1) within the first video (110). For example, at least one scene (211) may appear as a ‘scene of scoring a goal’. Additionally, at least one speech clip (214) may be text data corresponding to a user’s speech clip (125 in FIG. 1). For example, at least one speech clip (214) may appear as ‘What I said: Wow, the goal went in.’ Additionally, user-generated content (213) related to content-related information may be text data corresponding to user-generated content (135 in FIG. 1) related to content-related information. For example, user-generated content (213) related to content-related information may appear as ‘Example: User-generated content 1’. Additionally, the user-preferred keyword (215) related to the content-related information may be text data corresponding to the user-preferred keywords related to the content-related information (145 in FIG. 1). For example, the user-preferred keyword (215) related to the content-related information may appear as 'goal' and 'by the way, my favorite player X appeared in this situation.' The additional comment (216) may be text data. For example, the additional comment (216) may appear as 'and it was a moment I was very immersed in.'
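The components described for the prompt (210) can be sketched as a small Python structure that renders them into natural language text. This is only an illustration under stated assumptions: the `ReactionPrompt` class, its field names, and the "Task:", "Scene:", "Preferred keywords:", and "Additional comment:" labels are invented for the example, and the scene is represented by a text description although the disclosure describes it as image data. The "What I said:" and "Example:" labels follow the wording quoted above.

```python
from dataclasses import dataclass


@dataclass
class ReactionPrompt:
    """Text-only stand-in for the prompt (210); field names are hypothetical."""
    task: str                     # operation (212)
    scene_description: str        # scene (211), text stand-in for image data
    speech_clip: str              # speech clip (214)
    style_example: str            # user-generated content (213)
    preferred_keywords: list      # user-preferred keywords (215)
    additional_comment: str = ""  # additional comment (216), optional

    def render(self) -> str:
        lines = [
            f"Task: {self.task}",
            f"Scene: {self.scene_description}",
            f"What I said: {self.speech_clip}",
            f"Example: {self.style_example}",
            f"Preferred keywords: {', '.join(self.preferred_keywords)}",
        ]
        if self.additional_comment:
            lines.append(f"Additional comment: {self.additional_comment}")
        return "\n".join(lines)


prompt = ReactionPrompt(
    task="Create reaction recording content for the first video viewed by the user",
    scene_description="scene of scoring a goal",
    speech_clip="Wow, the goal went in.",
    style_example="User-generated content 1",
    preferred_keywords=["goal"],
    additional_comment="and it was a moment I was very immersed in.",
)
prompt_text = prompt.render()
```

Leaving `additional_comment` empty simply omits that line, which mirrors the statement that the additional comment (216) is a further, optional component.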

[0073] In one embodiment of the present disclosure, a generative model (3001) may receive the prompt (210) and generate response recording content (220). In one embodiment of the present disclosure, the response recording content (220) may include at least one scene (221) within the first video (110) and a user's response recording (222) regarding the at least one scene (221). The at least one scene (221) may be image data corresponding to the at least one scene (211) of the prompt (210). The user's response recording (222) is data newly generated based on the prompt (210) and may be recorded in the form of text data, but is not limited thereto and may be data composed of various types of media elements.

[0074] In one embodiment of the present disclosure, at least one scene (211) in the first video (110) and a speech clip (214) corresponding to the at least one scene (211) are used as components of the prompt (210), so that response record content (220) recording a user's response to the at least one scene (211) in the first video (110) may be generated. In one embodiment of the present disclosure, user preference information (215) related to content-related information is used as a component of the prompt (210), so that response record content (220) recording a response focused on a preferred keyword related to the first video (110) or the user speech input (120) may be generated. In one embodiment of the present disclosure, user-generated content (213) related to content-related information is used as a component of the prompt (210), so that response record content (220) reflecting the style of content previously created by the user may be generated.

[0075] An electronic device (1000) according to one embodiment of the present disclosure may transmit the generated prompt (210) for the first video (110) to a server (3000). The server (3000) according to one embodiment of the present disclosure may receive the prompt (210) for generating response recording content (220) from the electronic device (1000), which is a client device, and may generate the response recording content (220) using the generative model (3001) that receives the prompt (210) as input. The server (3000) may transmit the response recording content (220) to the electronic device (1000).

[0076] An electronic device (1000) according to one embodiment of the present disclosure may receive reaction record content (220) from a server (3000). The electronic device (1000) may automatically upload the received reaction record content (220) to a social media platform, etc., or upload it according to the user's choice. The reaction record content (220) may be displayed in the form of a graphic user interface (GUI) on the electronic device (1000).

[0077] In one embodiment of the present disclosure, at least one of the functions and operations of the server device (3000) may be implemented by the electronic device (1000). For example, the electronic device (1000) may generate a prompt and generate response record content using a generative model stored in the memory of the electronic device (1000). The generative model stored in the memory of the electronic device (1000) may be a lightweight version of a generative model previously trained on the server device (3000).

[0078] According to one embodiment of the present disclosure, an electronic device (1000) can perform a method for generating customized content that does not require separate viewer input in addition to pre-prepared user characteristic information (e.g., user's voice information, user's preference information, and user-generated content). The electronic device (1000) can automatically generate a prompt for generating response record content using video information viewed (e.g., a first video (110)), speech information regarding the video (e.g., user speech input (120)), and user characteristic information. The electronic device (1000) can obtain customized response record content based on the generated prompt. For example, the user may use the automatically generated prompt as input information for the generation model as is, or may directly modify some of the content in the automatically generated prompt and use it as input information. In this case, the prompt may also be displayed in the form of a graphical user interface on the electronic device (1000).

[0079] Content generated by the customized content generation method according to one embodiment of the present disclosure is not clichéd or awkward and can give the impression that it was created by the user themselves. Accordingly, the electronic device (1000) can more richly support the viewer's creative activities by generating reaction record content based on the viewer's reaction to the video content. The user can easily produce personalized content while saving time on content production.

[0080] Meanwhile, video service providers sometimes utilize video summary content creation technology. This technology extracts video clips containing highlight scenes from the entire video and presents them to viewers, helping them grasp the key points in a shorter amount of time. These video clips can be generated by utilizing reaction data, such as the number of viewers, voice, and emotions. Additionally, for example, video service providers may use technology to manually delete unnecessary parts from the extracted clips or to automatically delete them using a predetermined method. However, these technologies are designed to produce video clips from the perspective of the video service provider; they rely solely on viewer reactions and have limitations in that the provider cannot proactively generate content.

[0081] Furthermore, for example, video service providers sometimes utilize template-based technology that automatically applies subtitles, animation effects, and correction effects to video clips extracted by analyzing viewer response data. However, such technology is limited to the application of visual effects and has limitations in generating diverse and creative content tailored to viewers.

[0082] A method for generating customized content according to one embodiment of the present disclosure generates scenes preferred from the viewer's own perspective, and furthermore, rather than simply generating summarized video clips, it can automatically generate the viewer's reactions or reviews regarding the video content, or even multimedia data (e.g., text, graphics, music, etc.). Additionally, through the method for generating customized content, the viewer can create personalized content in a diverse and creative manner.

[0083] However, the present disclosure is not limited thereto, and the electronic device (1000), the heart rate measuring device (2000), and the server (3000) may be implemented as a single device.

[0084] FIG. 3 is a block diagram illustrating the operation of an electronic device, a heart rate measuring device, and a server according to one embodiment of the present disclosure.

[0085] Referring to FIG. 3, an electronic device (1000) according to one embodiment of the present disclosure may include a pre-processing module (310), a user-generated content identification module (320), a scene and utterance generation module (330), and a user preference information identification module (340). However, not all of the illustrated components are essential components. The electronic device (1000) may be implemented by more components than those illustrated, or by fewer components. In the present disclosure, a 'module' may be implemented by at least one processor included in the electronic device (1000) executing software such as program code, instructions, algorithms, and data structures stored in memory included in the electronic device (1000). In the following, operations described as being performed by a module of the electronic device (1000) may actually be performed by at least one processor included in the electronic device (1000).

[0086] In one embodiment of the present disclosure, an electronic device (1000) may obtain image data of a first video, voice data of the first video, and user speech input corresponding to content-related information. The electronic device (1000) may use user characteristic information stored in each of the user voice database (DB) (350), the user-generated content DB (360), and the user preference information DB (370).

[0087] The electronic device (1000) can perform preprocessing on content-related information and / or user characteristic information through a preprocessing module (310). The preprocessing module (310) can perform a preprocessing operation to convert data to be used in the user-generated content identification module (320), the scene and speech generation module (330), and the user preference information identification module (340) into a vector representation or a specific format. For example, the preprocessing module (310) can generate image embedding data by converting the image data of the first video into a vector representation. Also, for example, the preprocessing module (310) can generate voice-text embedding data by converting the voice data of the first video into voice-text data through Automatic Speech Recognition (ASR) and converting the converted voice-text data into a vector representation. In addition, for example, the preprocessing module (310) can generate speech text embedding data by converting the user speech input into speech text data through automatic speech recognition (ASR) and converting the converted speech text data into a vector representation. In addition, if the voices of multiple speakers exist in the user speech input, the preprocessing module (310) can extract the speech input of the user who is the main audience from the user speech input through speaker diarization.

[0088] For example, the preprocessing module (310) may include an encoder (410, 430 in FIG. 4) that converts image data or text data into a vector representation. For example, the preprocessing module (310) may include a speech recognition module (420 in FIG. 4) that converts voice data into text data, and also a speaker separation module (440 in FIG. 4). This is described in FIG. 4.

[0089] In the present disclosure, the image embedding data, voice text embedding data, and speech text embedding data converted from the image data of the first video, the voice data of the first video, and the user speech input, respectively, may also be referred to as content-related information that appears sequentially over time.

[0090] Additionally, for example, the preprocessing module (310) can convert user-generated content corresponding to user characteristic information and user preference information into vector representations. For example, the preprocessing module (310) can convert data (image data and / or text data) included in user-generated content into embedding data. For example, the preprocessing module (310) can convert data (image data and / or text data) included in user preference information into embedding data.

[0091] However, in one embodiment of the present disclosure, all or some of the preprocessing operations of the preprocessing module (310) may be omitted. For example, the electronic device (1000) may process content-related information and user characteristic information in a form other than a vector representation.

[0092] In one embodiment of the present disclosure, the electronic device (1000) can identify at least one user-generated content related to content-related information among a plurality of user-generated contents stored in a user-generated content DB (360) through a user-generated content identification module (320). The user-generated content identification module (320) can receive image data of a first video converted into a vector representation, voice data of a first video, and user speech input from a preprocessing module (310). Additionally, the user-generated content identification module (320) can receive a plurality of user-generated contents stored in the user-generated content DB (360).

[0093] The user-generated content identification module (320) can calculate the relevance between the user-generated content and content-related information for each of the plurality of user-generated contents in order to identify at least one user-generated content. The user-generated content identification module (320) can identify at least one user-generated content among the plurality of user-generated contents in which the relevance corresponds to a predetermined first condition. For example, the predetermined first condition may be identifying content in which the value of relevance is greater than or equal to a threshold value (e.g., p). Or, for example, the predetermined first condition may be identifying a predetermined number (e.g., top n) of content having the maximum relevance value among user-generated contents in which the relevance value is greater than or equal to a threshold value (e.g., p). The user-generated content identification module (320) may use at least one user-generated content corresponding to the predetermined first condition in the prompt.
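
The two example forms of the predetermined first condition can be sketched as follows. The function name `select_by_relevance` and the dictionary input are hypothetical; the threshold `p` and the count `top_n` correspond to the example values p and n mentioned above.

```python
from typing import Dict, List, Optional


def select_by_relevance(
    relevance: Dict[str, float], p: float, top_n: Optional[int] = None
) -> List[str]:
    """Return content IDs whose relevance is >= p, highest first.

    If top_n is given, only the top_n highest-relevance items among those
    that passed the threshold are kept.
    """
    kept = [(cid, r) for cid, r in relevance.items() if r >= p]
    kept.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        kept = kept[:top_n]
    return [cid for cid, _ in kept]
```

Calling the function without `top_n` corresponds to the threshold-only condition, and passing `top_n` corresponds to the top-n condition.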

[0094] The user-generated content identification module (320) can calculate a first similarity between the user-generated content and the image data of the first video, a second similarity between the user-generated content and the voice data of the first video, and a third similarity between the user-generated content and the user speech input in order to calculate the degree of correlation between each user-generated content and content-related information. The user-generated content identification module (320) can calculate the degree of correlation between the user-generated content and content-related information by performing a weighted sum of the first similarity, the second similarity, and the third similarity. The user-generated content identification module (320) can assign a greater weight to data that contributes more to the degree of correlation among various types of data belonging to content-related information, and assign a smaller weight to data that contributes less to the degree of correlation. The user-generated content identification module (320) is further explained with reference to FIGS. 5 and 6, Equation 1, and Equation 2.
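
The weighted sum described above can be illustrated with a short sketch. Equations 1 and 2 are not reproduced in this passage, so the exact form is not known here; the function name `weighted_relevance` and the normalization by the sum of the weights are assumptions made only for illustration.

```python
from typing import Sequence


def weighted_relevance(
    similarities: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted sum of per-modality similarities.

    similarities: (first, second, third) similarity, i.e. user-generated
    content versus image data, voice data, and user speech input.
    weights: one non-negative weight per similarity. Dividing by the sum
    of the weights is an assumption, not part of the disclosure.
    """
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, similarities)) / total
```

A larger weight makes the corresponding modality contribute more to the resulting degree of correlation, which matches the weighting behavior described above.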

[0095] In one embodiment of the present disclosure, the electronic device (1000) can identify at least one preference keyword related to content-related information among the user's preference keywords stored in the user preference information DB (370) through the user preference information identification module (340). The user preference information identification module (340) can receive image data of the first video converted into a vector representation, voice data of the first video, and user speech input from the preprocessing module (310). Additionally, the user preference information identification module (340) can receive the user's preference keywords stored in the user preference information DB (370).

[0096] The user preference information identification module (340) can calculate the similarity between the user's preference keyword and content-related information for each of the user's preference keywords in order to identify at least one user's preference keyword. For example, the user preference information identification module (340) can calculate a first similarity between the user's preference information and the image data of the first video, a second similarity between the user's preference information and the voice data of the first video, and a third similarity between the user's preference information and the user's speech input. The user preference information identification module (340) can identify at least one preference keyword whose similarity corresponds to a predetermined second condition. For example, the predetermined second condition may be identifying a preference keyword whose value of similarity is greater than or equal to a threshold value. For example, the predetermined second condition may be identifying a preference keyword whose value of similarity is greater than or equal to a threshold value among at least one of the first similarity, the second similarity, or the third similarity. The user preference information identification module (340) can use at least one user's preference keyword corresponding to the predetermined second condition in the prompt. The user preference information identification module (340) is further explained in FIG. 7 and Equation 3.
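
One of the example forms of the predetermined second condition, in which a keyword is kept when at least one of its three similarities reaches a threshold, might be sketched as follows. The function name `select_keywords` and the input layout are hypothetical, and Equation 3 is not reproduced here.

```python
from typing import Dict, List, Tuple


def select_keywords(
    similarities: Dict[str, Tuple[float, float, float]], threshold: float
) -> List[str]:
    """Keep keywords for which any similarity is >= threshold.

    Each tuple holds the (first, second, third) similarity of a keyword
    against the image data, the voice data, and the user speech input.
    """
    return [
        keyword
        for keyword, sims in similarities.items()
        if any(s >= threshold for s in sims)
    ]
```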

[0097] However, the method by which the user preference information identification module (340) identifies the user's preferred keywords is not limited to the similarity calculation method described above, and a relevance calculation method (i.e., a weighted sum after each similarity calculation) may be used, similar to the method for identifying user-generated content. Likewise, the method for identifying user-generated content of the user-generated content identification module (320) may use the method for identifying preferred keywords.

[0098] In one embodiment of the present disclosure, an electronic device (1000) can generate at least one scene in a first video included in content-related information and at least one speech clip in a user speech input through a scene and speech generation module (330). The scene and speech generation module (330) can receive image data of the first video converted into a vector representation, voice data of the first video, and user speech input from a preprocessing module (310). Additionally, the scene and speech generation module (330) can receive user preference keywords stored in a user preference information DB (370). Additionally, the scene and speech generation module (330) can receive user heart rate information from a heart rate measuring device (2000) through a communication module. Here, the user heart rate information may be data interpolated by a time interpolation module (710 in FIG. 7), and this is described in FIG. 7 and FIG. 8.

[0099] The scene and utterance generation module (330) can calculate importance for each time interval based on content-related information and user preference keywords included in user characteristic information. Here, importance may be referred to as time-series importance. The scene and utterance generation module (330) may use a different method for calculating importance for each time interval depending on whether user utterance input exists in that time interval. For example, if utterance text embedding data corresponding to user utterance input exists in a specific time interval, the scene and utterance generation module (330) can obtain the importance of the corresponding time interval through Equation 4 by considering at least one of the first similarity between the user utterance input and the image data of the first video, the second similarity between the user utterance input and the voice data of the first video, the third similarity between the user preference information and the image data of the first video, or the fourth similarity between the user preference information and the voice data of the first video. Alternatively, for example, if there is no utterance text embedding data corresponding to the user utterance input in a specific time interval, the scene and utterance generation module (330) may obtain the importance of the corresponding time interval through Equation 5 by considering at least one of the third similarity between the user's preference information and the image data of the first video or the fourth similarity between the user's preference information and the voice data of the first video. This is explained in FIGS. 9a and 9b.
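
The branching between the two importance calculations can be sketched as follows. Equations 4 and 5 are not reproduced in this passage, so the plain average used here is only a placeholder assumption and is not the formula of the disclosure; the function name `interval_importance` and its parameters are likewise hypothetical.

```python
from typing import Optional


def interval_importance(
    sim_pref_image: float,
    sim_pref_voice: float,
    sim_utt_image: Optional[float] = None,
    sim_utt_voice: Optional[float] = None,
) -> float:
    """Importance of one time interval (placeholder: mean of similarities).

    When the utterance similarities are given (the user spoke in this
    interval), all four similarities are used, standing in for Equation 4.
    Otherwise only the two preference similarities are used, standing in
    for Equation 5.
    """
    if sim_utt_image is not None and sim_utt_voice is not None:
        terms = [sim_utt_image, sim_utt_voice, sim_pref_image, sim_pref_voice]
    else:
        terms = [sim_pref_image, sim_pref_voice]
    return sum(terms) / len(terms)
```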

[0100] The scene and utterance generation module (330) can generate the at least one scene including at least one frame included in the image data of the first video based on the importance of each time interval. The scene and utterance generation module (330) can generate a scene with a frame corresponding to a time interval in which the importance value corresponds to a predetermined third condition. For example, the predetermined third condition may be identifying a time interval in which the importance is greater than or equal to a first threshold value (e.g., q1). This is explained in FIG. 10.

[0101] The scene and utterance generation module (330) may determine whether to generate a selected scene as a single image or a set of consecutive images based on importance. Here, a single image may include a single frame. A set of consecutive images may include a plurality of consecutive frames and may be represented as a video or GIF. For example, the scene and utterance generation module (330) may identify whether the importance value is greater than or equal to a second threshold (e.g., q2) for a scene where the importance value is greater than or equal to a first threshold (e.g., q1). Here, the second threshold may be a value greater than the first threshold. If the importance value is greater than or equal to the second threshold, the scene and utterance generation module (330) may generate the scene as a set of consecutive images. If the importance value is less than the second threshold, the scene and utterance generation module (330) may generate the scene as a single image. This is described in FIG. 10.
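
The two-threshold decision described above can be sketched as follows. The function name `classify_intervals` and the labels "skip", "single", and "sequence" are hypothetical; `q1` and `q2` correspond to the first and second thresholds.

```python
from typing import List


def classify_intervals(importance: List[float], q1: float, q2: float) -> List[str]:
    """Label each time interval by how it would be rendered as a scene.

    "skip":     importance < q1, not selected as a scene
    "single":   q1 <= importance < q2, generated as a single image
    "sequence": importance >= q2, generated as a set of consecutive images
    """
    if q2 <= q1:
        raise ValueError("the second threshold must be greater than the first")
    labels = []
    for value in importance:
        if value < q1:
            labels.append("skip")
        elif value >= q2:
            labels.append("sequence")
        else:
            labels.append("single")
    return labels
```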

[0102] The scene and utterance generation module (330) can identify whether the difference in importance from the previous time interval is less than a third threshold (e.g., q3) for a time interval where the importance value is greater than or equal to a second threshold value in order to generate a scene as a set of continuous images. For example, the scene and utterance generation module (330) can include the previous frame corresponding to the previous time interval as a scene if, for a time interval where the importance value is greater than or equal to a second threshold value, the difference in importance from the previous time interval is less than the third threshold (e.g., q3). The scene and utterance generation module (330) can generate only the frame corresponding to the time interval selected as a scene as a scene if the difference in importance from the previous time interval is greater than or equal to the third threshold (e.g., q3). That is, the scene and utterance generation module (330) may not generate the previous frame corresponding to the previous time interval as a scene. Accordingly, the scene and utterance generation module (330) repeatedly identifies the difference in importance between neighboring time intervals in reverse chronological order based on the time interval selected as the scene, and can use all frames corresponding to time intervals where the difference in importance is less than a third threshold value as the scene. This is explained in FIGS. 11a and 11b.
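
The backward extension of a scene can be sketched as follows. The function name `extend_scene_backward` is hypothetical, and the use of the absolute difference between neighboring importance values is an assumption, since the passage does not state whether the difference is signed.

```python
from typing import List


def extend_scene_backward(
    importance: List[float], selected: int, q3: float
) -> List[int]:
    """Return the indices of the time intervals that form one scene.

    Starting from the selected interval, earlier intervals are added one
    by one while the importance difference between neighboring intervals
    stays below q3. The first pair whose difference is >= q3 stops the walk.
    """
    start = selected
    while start > 0 and abs(importance[start] - importance[start - 1]) < q3:
        start -= 1
    return list(range(start, selected + 1))
```

If the difference to the immediately preceding interval is already at or above `q3`, only the selected interval is returned, so the scene consists of that frame alone.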

[0103] The scene and utterance generation module (330) can generate an utterance clip from the user utterance input, the utterance clip including the utterance input of a time interval corresponding to the at least one scene. That is, the scene and utterance generation module (330) can generate an utterance clip for the portion of the user utterance input that falls within the time interval corresponding to a scene.
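
Selecting the utterances that belong to a scene can be sketched as an interval-overlap test. The class `Utterance` and the function `clip_for_scene` are hypothetical, and times are assumed to be expressed in seconds from the start of the first video.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: float  # start time in seconds (assumed unit)
    end: float    # end time in seconds
    text: str     # recognized text of the user's utterance


def clip_for_scene(
    utterances: List[Utterance], scene_start: float, scene_end: float
) -> List[str]:
    """Return the texts of utterances that overlap the scene's time interval."""
    return [
        u.text
        for u in utterances
        if u.start < scene_end and u.end > scene_start
    ]
```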

[0104] The scene and utterance generation module (330) can divide images, utterance data, and time-series importance into equal time intervals. For example, the scene and utterance generation module (330) can calculate the importance of each time interval based on the frame time of the image data. Additionally, to calculate the importance of each time interval, the scene and utterance generation module (330) may use voice text embedding data located at the same time as the frame time of the image embedding data and utterance text embedding data located at the same time as the frame time of the image embedding data. However, it is not limited thereto. Accordingly, the scene and utterance generation module (330) can generate an utterance clip to be used in the prompt based on the utterance text embedding data located at the same time as the frame time of the image embedding data.

[0105] In one embodiment of the present disclosure, an electronic device (1000) may generate a prompt by obtaining at least one of at least one user-generated content identified through a user-generated content identification module (320), at least one scene in a first video generated through a scene and speech generation module (330), at least one speech clip in a user speech input, or at least one preferred keyword identified through a user preference information identification module (340). In one embodiment of the present disclosure, the prompt may include at least one of at least one user-generated content, at least one scene in a first video, at least one speech clip in a user speech input, or at least one preferred keyword as a component. For example, the prompt may include all of the components described above, or may include at least one of the components.

[0106] In one embodiment of the present disclosure, an electronic device (1000) may transmit a generated prompt to a server (3000). The prompt is transmitted to the server (3000) and may be used by the server (3000) to generate customized response record content.

[0107] The generative model (3001) can generate customized response record content from a prompt. The generative model (3001) may be an artificial intelligence model that uses a mechanism (e.g., attention, etc.) to process the correlation between prompt and text to generate text based on the prompt.

[0108] Meanwhile, information stored in each of the user voice DB (Database) (350), user-generated content DB (360), and user preference information DB (370) may be referred to as user characteristic information. User characteristic information may be information that is pre-stored in the memory of the electronic device (1000) or in an external database. For example, user characteristic information may be manually entered by the user. Alternatively, user characteristic information may be automatically collected by the electronic device (1000). For example, user voice information stored in the user voice DB (350) may be automatically generated by the electronic device (1000) by analyzing the user's voice patterns (e.g., speaking speed, intonation, pronunciation characteristics). For example, multiple user-generated contents stored in the user-generated content DB (360) may be automatically generated by the electronic device (1000) by analyzing the production style or writing style frequently used by the user. For example, multiple user preference keywords stored in the user preference information DB (370) may be automatically generated by the electronic device (1000) by analyzing the user's behavior patterns, frequently selected genres, categories, etc.

[0109] Meanwhile, user characteristic information is not limited to the examples described above and may further include viewing history information, behavioral data information during playback, location information, etc., used to analyze user reactions, etc.

[0110] FIG. 4 is a diagram illustrating the operation of converting collected data into a vector form by an electronic device according to one embodiment of the present disclosure.

[0111] With reference to FIG. 4, the data preprocessing operation of the preprocessing module (310) of FIG. 3 is described in detail. The preprocessing module (310) of FIG. 3 may include an image encoder (410), a speech recognition module (420), a text encoder (430), and a speaker separation module (440). However, not all of the illustrated components are essential components. The preprocessing module (310) may be implemented with more components than those illustrated, or with fewer components.

[0112] In the present disclosure, the 'encoder' may be trained to find relationships between text and images and to generate a common vector representation between text and images. The encoder may be implemented using a known neural network architecture capable of processing text and images, or through a variation of a known neural network architecture. For example, the encoder may be implemented based on a multimodal model, but is not limited thereto. The encoder may include an 'image encoder' for encoding image data and a 'text encoder' for encoding text data.

[0113] In the present disclosure, a 'multimodal model' may be a neural network model that simultaneously processes various types of modalities (e.g., text data, image data, voice data, video data, etc.) and learns the relationships between them. For example, in a multimodal model, if an image and text have similar meanings, the two vectors may be placed close to each other in a vector space. For example, the vector of the text 'soccer' output through a text encoder (430) and the vector of the soccer image output through an image encoder (410) may be placed close to each other.
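
The notion of vectors being placed close to each other can be illustrated with cosine similarity, which is one common closeness measure and is used here as an assumption; the passage does not name a specific measure. The three-dimensional vectors below are made-up values, not outputs of an actual text encoder (430) or image encoder (410).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up embeddings in a shared space (illustrative values only).
text_soccer = [0.9, 0.1, 0.0]     # stand-in for the text 'soccer'
image_soccer = [0.8, 0.2, 0.1]    # stand-in for a soccer image
image_cooking = [0.1, 0.2, 0.9]   # stand-in for an unrelated image

near = cosine_similarity(text_soccer, image_soccer)
far = cosine_similarity(text_soccer, image_cooking)
```

With these values, `near` is larger than `far`, which reflects the text and the image with similar meanings lying closer together in the vector space.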

[0114] The electronic device (1000) can obtain image data (402) and voice data (403) from the first video. The image data (402) and voice data (403) of the first video may be stored in a form separated into separate files, or may be stored in a form separated into separate streams within a single file.

[0115] The image data (402) of the first video can be converted into a vector representation through an image encoder (410). For example, an electronic device (1000) can generate image embedding data, which is a vector representation, from the image data (402) of the first video using the image encoder (410).

[0116] The voice data (403) of the first video can be converted into a vector representation through a voice recognition module (420) and a text encoder (430). For example, the electronic device (1000) can convert the voice data (403) of the first video into text data using the voice recognition module (420). The text data may contain voice information included in the audio track of the first video, such as narration, dialogue, and background music, and the content and timing may be arranged according to the flow of time. The voice recognition module (420) performs the operation of converting voice input into text and can be implemented through automatic speech recognition (ASR) technology, such as a speech-to-text model. For example, the electronic device (1000) can extract text such as "Player X shooting. turned around." from the voice data (403) of the first video at the "1:20" time interval through the voice recognition module (420). The electronic device (1000) can generate voice text embedding data, which is a vector representation, from voice text data using a text encoder (430).

[0117] The user speech input (404) can be converted into a vector representation through a speaker separation module (440), a speech recognition module (420), and a text encoder (430). For example, in a situation where there are multiple speakers, the electronic device (1000) can use the speaker separation module (440) to identify the speech input of the user who is the primary audience in the user speech input (404). The speaker separation module (440) can use the user's voice information stored in the user voice DB (350) to distinguish the user's voice from the voices of multiple speakers included in the user speech input, and extract speech data representing the user's speech input. The user's voice information can represent information regarding the user's speech characteristics, such as the user's voice, tone, manner of speech, speed, emotion, etc.

[0118] The electronic device (1000) can convert speech data representing the user's speech input into text data using a speech recognition module (420). In the text data, the time of the user's speech and the speech input can be aligned according to the flow of time. For example, the electronic device (1000) can extract text such as "Wow, the goal went in" at the time interval of "1:20" from the speech data representing the user's speech input through the speech recognition module (420). The electronic device (1000) can generate speech text embedding data, which is a vector representation, from the speech text data using a text encoder (430).

[0119] Meanwhile, in a situation where the primary viewer is the only speaker, the speaker separation module (440) may be omitted. Additionally, in a situation where there are multiple speakers, the electronic device (1000) may extract the speech inputs of the remaining speakers, excluding the speech input of the primary viewer. For example, the speaker separation module (440) may extract speech features from the speech data, cluster speakers with similar speech patterns based on the extracted feature vectors, and track the speech inputs of each speaker in chronological order. The speech inputs of each speaker can be converted, through the speech recognition module (420), into a structured text form containing the speech time and speech content of each speaker. Accordingly, the text can be structured to show when each speaker spoke and what each speaker said, both when multiple speakers exist and when the primary viewer is the only speaker.

[0120] For example, each frame of a video may contain metadata (e.g., a unique frame number (or index) or a timestamp assigned sequentially according to playback time). When each frame is converted into embedding data, the embedding data may be stored in the electronic device (1000) in the form of structured data along with the metadata. For example, if the embedding data of the frame with a timestamp of 1 minute 20 seconds is [0.12, -0.45, ...], the timestamp of 1 minute 20 seconds may be mapped to and stored with that embedding data. The frame time can be identified through the timestamp. However, it is not limited thereto.

[0121] Additionally, for example, the audio data of the video and user speech input may be aligned with the voice timing and voice content according to the flow of time. When the audio data of the video and user speech input are converted into embedding data, the embedding data may be stored in an electronic device (1000) along with start and end timestamps of the time interval. For example, timestamps from 1 minute 20 seconds to 1 minute 45 seconds may be mapped to and stored in the embedding data. However, it is not limited thereto.
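
The storage described in the two preceding paragraphs can be sketched as follows. This is only an illustrative sketch: the class names `FrameEmbedding` and `SegmentEmbedding`, the helper `frame_at`, and the two-dimensional vectors are assumptions made for the example and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameEmbedding:
    """Image embedding of one frame, keyed by its playback timestamp (hypothetical layout)."""
    timestamp_s: float
    vector: List[float]


@dataclass
class SegmentEmbedding:
    """Text embedding of a voice or speech segment with start and end timestamps (hypothetical layout)."""
    start_s: float
    end_s: float
    vector: List[float]


def frame_at(frames: List[FrameEmbedding], timestamp_s: float) -> Optional[FrameEmbedding]:
    """Return the frame record stored for an exact timestamp, or None if there is none."""
    for frame in frames:
        if frame.timestamp_s == timestamp_s:
            return frame
    return None


# 1 minute 20 seconds = 80 s; the vectors are truncated, made-up values.
frames = [FrameEmbedding(timestamp_s=80.0, vector=[0.12, -0.45])]
# A segment running from 1 minute 20 seconds to 1 minute 45 seconds.
segments = [SegmentEmbedding(start_s=80.0, end_s=105.0, vector=[0.30, 0.10])]
```

Keeping a single timestamp per frame and a start/end pair per audio segment mirrors the distinction the text draws between frame metadata and time-interval metadata.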

[0122] FIG. 5 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure generating a prompt through a user-generated content identification module.

[0123] With reference to FIG. 5, the operation of the user-generated content identification module (320) of FIG. 3 is described in detail. The user-generated content identification module (320) may receive image embedding data, voice text embedding data, and speech text embedding data corresponding to content-related information from the encoder (410 or 430) of FIG. 4. The user-generated content identification module (320) may receive embedding data for multiple user-generated contents stored in the user-generated content DB (360). Here, the user-generated content received by the user-generated content identification module (320) may be converted into an embedding data form by the encoder (410 or 430) of FIG. 4.

[0124] The user-generated content identification module (320) can calculate the relevance between user-generated content and content-related information through mathematical formulas 1 and 2.

[0125] [Mathematical Formula 1]

[0126]

[0127] In mathematical formula 1, $R_i$ can represent the relevance of the i-th user-generated content. α, β, and γ can represent the relevance weights for the image, voice, and utterance, respectively (α+β+γ = 1). $r_i$ can refer to the embedding data of each user-generated content. $e_{t,x}$ can refer to the embedding data of each piece of content-related information. For example, $e_{t,\text{image}}$ can refer to image embedding data, $e_{t,\text{voice}}$ can refer to voice text embedding data, and $e_{t,\text{utterance}}$ can refer to speech text embedding data.

[0128] In mathematical formula 1, $\mathrm{SIM}(r_i, e_{t,x})$ may represent the similarity between each piece of content-related information and each user-generated content, and the similarity can be calculated through mathematical formula 2.

[0129] [Mathematical Formula 2]

[0130]

[0131] Through mathematical formula 2, the cosine similarity between the two vectors $r_i$ and $e_{t,x}$ can be calculated. The similarity calculated in mathematical formula 2 may be a real number between 0 and 1, but is not limited thereto.
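
The bodies of mathematical formulas 1 and 2 are not reproduced in this text. Based only on the surrounding definitions (a weighted sum with weights α, β, γ and a cosine similarity), they plausibly take the following form. This is a reconstruction under that assumption, not the published formula; in particular, how the index t is aggregated is not stated in the text and is left open here.

```latex
% Assumed form of mathematical formula 1 (weighted sum of three similarities)
R_i = \alpha \,\mathrm{SIM}(r_i, e_{t,\text{image}})
    + \beta  \,\mathrm{SIM}(r_i, e_{t,\text{voice}})
    + \gamma \,\mathrm{SIM}(r_i, e_{t,\text{utterance}}),
\qquad \alpha + \beta + \gamma = 1

% Assumed form of mathematical formula 2 (cosine similarity)
\mathrm{SIM}(r_i, e_{t,x}) = \frac{r_i \cdot e_{t,x}}{\lVert r_i \rVert \, \lVert e_{t,x} \rVert}
```

For arbitrary real vectors the cosine similarity lies between -1 and 1; it stays between 0 and 1 when, for instance, all embedding components are non-negative.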

[0132] The electronic device (1000) can calculate a first similarity between the image data of the user-generated content and the first video, a second similarity between the voice data of the user-generated content and the first video, and a third similarity between the user-generated content and the user speech input through mathematical formula 2.

[0133] The electronic device (1000) can calculate the relevance between user-generated content and content-related information by performing a weighted sum of the first similarity, the second similarity, and the third similarity through mathematical formula 1. For example, among the various types of data belonging to content-related information, the electronic device (1000) may assign a greater weight to data that contributes more to the relevance and a smaller weight to data that contributes less. For example, if the value of γ is set to be greater than α and β, the third similarity between user-generated content and user speech input may contribute the most to the final relevance.
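
The weighted sum described above can be illustrated with a short sketch. The function names, the two-dimensional toy vectors, and the weight values are assumptions chosen for the example; the disclosure does not fix any of them.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relevance(r_i, e_image, e_voice, e_utterance, alpha, beta, gamma) -> float:
    """Weighted sum of the three similarities; the weights are expected to sum to 1."""
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("alpha + beta + gamma must equal 1")
    return (alpha * cosine_similarity(r_i, e_image)
            + beta * cosine_similarity(r_i, e_voice)
            + gamma * cosine_similarity(r_i, e_utterance))


# Toy vectors: gamma is the largest weight, so the speech similarity dominates.
score = relevance([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                  alpha=0.2, beta=0.2, gamma=0.6)
```

With these toy inputs the image similarity is 1, the voice similarity is 0, and the speech similarity is 1/√2, so the speech term accounts for most of the score.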

[0134] The operation method of the user-generated content identification module (320) is specifically described in FIG. 6. FIG. 6 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to identify user-generated content related to content-related information.

[0135] Referring to FIG. 6, in operation 610, the electronic device (1000) can calculate the relevance between user-generated content and content-related information through Equations 1 and 2 of the user-generated content identification module (320). In operation 620, the electronic device (1000) can identify whether the value of relevance is greater than or equal to a threshold value (e.g., p). Alternatively, for example, the electronic device (1000) can identify whether the user-generated content is among a predetermined number (e.g., top n) of content items having the highest relevance values among the user-generated content whose relevance value is greater than or equal to the threshold value (e.g., p). Here, p may be a real number and n may be a natural number. This may be an example of a predetermined first condition. In operation 630, the electronic device (1000) can identify (or select) user-generated content whose relevance value is greater than or equal to the threshold value. The electronic device (1000) can use at least one user-generated content item satisfying the predetermined first condition in the prompt.

[0136] Referring to example (615), assume a case where the relevance of user-generated content 1 is calculated to be 0.8, the relevance of user-generated content 2 is 0.3, and the relevance of user-generated content 3 is 0.5. Assume the threshold value is 0.8. In this case, the electronic device (1000) can identify user-generated content 1 whose relevance value is greater than or equal to the threshold value (0.8). The electronic device (1000) can use user-generated content 1 as a component of the prompt.

[0137] Alternatively, if the electronic device (1000) determines that there is no user-generated content above a threshold value, it may not use user-generated content as a prompt, but is not limited thereto.
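
The selection in operations 620 and 630 can be sketched as below, using the relevance values of example (615). The function name `select_user_content` and the content identifiers are hypothetical; the optional `top_n` argument models the "top n" variant of the first condition.

```python
from typing import Dict, List, Optional


def select_user_content(relevance_by_id: Dict[str, float],
                        threshold: float,
                        top_n: Optional[int] = None) -> List[str]:
    """Keep items whose relevance is >= threshold, highest first, optionally capped at top_n."""
    eligible = [(cid, score) for cid, score in relevance_by_id.items() if score >= threshold]
    eligible.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        eligible = eligible[:top_n]
    return [cid for cid, _ in eligible]


scores = {"content_1": 0.8, "content_2": 0.3, "content_3": 0.5}
selected = select_user_content(scores, threshold=0.8)   # only content_1 qualifies
nothing = select_user_content(scores, threshold=0.9)    # empty: no user content goes into the prompt
```

An empty result corresponds to the case in which no user-generated content is used in the prompt.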

[0138] FIG. 7 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to generate a prompt through a scene and speech generation module and a user preference information identification module.

[0139] Referring to FIG. 7, the user preference information identification module (340) of FIG. 3 is described first, and the scene and speech generation module (330) is described in detail.

[0140] The user preference information identification module (340) can receive image embedding data, voice text embedding data, and speech text embedding data corresponding to content-related information from the encoder (410 or 430) of FIG. 4. The user preference information identification module (340) can receive embedding data for user preference keywords stored in the user preference information DB (370). Here, the user preference keywords received by the user preference information identification module (340) may be converted into embedding data form by the text encoder (430) of FIG. 4.

[0141] The user preference information identification module (340) can identify at least one preference keyword related to content-related information among the user's preference keywords stored in the user preference information DB (370) through mathematical formula 3.

[0142] [Mathematical Formula 3]

[0143]

[0144] In mathematical formula 3, $e_{t,x}$ represents embedding data regarding content-related information, as explained for Equation 1. f can represent each preference keyword included in the user's preference information, where f ∈ {A, B, ...}. $\mathrm{SIM}(e_{t,x}, f)$ represents the similarity between content-related information and user preference information, and the method for calculating the similarity is as explained for Equation 2. $q_f$ is a threshold value that is compared with the similarity to identify at least one preferred keyword, and this comparison may be referred to as a predetermined second condition.

[0145] Through mathematical formula 3, the electronic device (1000) can calculate the similarity between a user's preferred keyword and the image data of the first video, the similarity between the preferred keyword and the voice data of the first video, and the similarity between the preferred keyword and the user's speech input, and can identify the preferred keyword if the maximum of these similarities is greater than a threshold value (e.g., $q_f$). However, the method of identifying the user's preferred keyword is not limited to Equation 3. The electronic device (1000) can use the identified preferred keyword in the prompt. This can correspond to the user's preferred keyword (215) of FIG. 2.
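
A minimal sketch of this keyword identification follows. It assumes a single threshold shared by all keywords and uses made-up two-dimensional embeddings; the function name `identify_preferred_keywords` and the keyword "cooking" are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_preferred_keywords(content_embeddings: Dict[str, Sequence[float]],
                                keyword_embeddings: Dict[str, Sequence[float]],
                                threshold: float) -> List[str]:
    """Select each keyword whose best similarity to image, voice, or speech exceeds the threshold."""
    selected = []
    for keyword, f in keyword_embeddings.items():
        best = max(cosine_similarity(e, f) for e in content_embeddings.values())
        if best > threshold:
            selected.append(keyword)
    return selected


content = {"image": [1.0, 0.0], "voice": [0.6, 0.8], "utterance": [0.0, 1.0]}
keywords = {"goal": [0.0, 1.0], "cooking": [-1.0, 0.0]}
matched = identify_preferred_keywords(content, keywords, threshold=0.7)
```

Taking the maximum means that a keyword is selected as soon as any one of the three modalities matches it well, even if the other two do not.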

[0146] The scene and speech generation module (330) can receive image embedding data, voice text embedding data, and speech text embedding data corresponding to content-related information from the encoder (410 or 430) of FIG. 4.

[0147] Additionally, the scene and speech generation module (330) may further receive interpolated heart rate data. The interpolated heart rate data may be obtained by the time interpolation module (710) interpolating the user heart rate information that the electronic device (1000) obtains from the heart rate measuring device (2000 in FIG. 3). The interpolated heart rate data may be time series information adjusted to match the time interval per frame of the first video image.

[0148] The electronic device (1000) may further include a time interpolation module (710) for obtaining an interpolated heart rate from the user heart rate information. The time interpolation module (710) can perform time series interpolation so that the user heart rate information and the first video image, which are collected at different intervals, are synchronized on the time axis. For example, since the user heart rate information and the first video image are each acquired at a constant rate at fixed intervals, the time interpolation module (710) can interpolate the heart rate in accordance with the frame rate of the first video image. In other words, the time interpolation module (710) can adjust the heart rate information to the time interval in which each frame of the first video image is displayed.

[0149] Additionally, the time interpolation module (710) can interpolate heart rate information over the entire first video and convert the interpolated heart rate information into a value within a predetermined range through a standardization process. For example, the standardized heart rate information may be represented as a real number greater than or equal to 0. Mean and standard deviation values may be used in the standardization process, but the process is not limited thereto.
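
The interpolation and rescaling steps can be sketched as follows. Linear interpolation is an assumption, since the text does not name an interpolation method. The text mentions mean and standard deviation for standardization; min-max scaling is swapped in here purely because it directly yields non-negative values, and it should not be read as the disclosed method. All names and sample values are illustrative.

```python
from typing import List, Sequence


def interpolate_linear(sample_times: Sequence[float],
                       sample_values: Sequence[float],
                       query_times: Sequence[float]) -> List[float]:
    """Piecewise-linear interpolation; queries outside the sampled range take the nearest sample.

    sample_times must be strictly increasing.
    """
    result = []
    for t in query_times:
        if t <= sample_times[0]:
            result.append(sample_values[0])
            continue
        if t >= sample_times[-1]:
            result.append(sample_values[-1])
            continue
        for i in range(1, len(sample_times)):
            if t <= sample_times[i]:
                t0, t1 = sample_times[i - 1], sample_times[i]
                v0, v1 = sample_values[i - 1], sample_values[i]
                result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                break
    return result


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values to the range [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


heart_times = [0.0, 1.0, 2.0]               # heart rate sampled once per second
heart_bpm = [70.0, 90.0, 80.0]
frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]     # frames displayed every 0.5 s
per_frame_bpm = interpolate_linear(heart_times, heart_bpm, frame_times)
per_frame_scaled = min_max_scale(per_frame_bpm)
```

After this step there is exactly one heart rate value per frame time, which is what allows the heart rate to be combined with per-frame embedding data.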

[0150] A specific example regarding the time series information of the present disclosure is explained in conjunction with FIG. 7 and FIG. 8. FIG. 8 is a diagram for representing data collected by an electronic device according to one embodiment of the present disclosure as time series information according to time intervals.

[0151] In FIG. 8, the image data of the first video, the voice data of the first video, and the user speech input are each shown in the form of corresponding embedding data. Additionally, the interpolated heart rate is shown aligned to correspond to the time interval per frame of the first video image. Here, the x-axis represents the time axis.

[0152] For example, image embedding data (810) corresponding to image data of the first video, voice text embedding data (830) corresponding to voice data of the first video, and speech text embedding data (840) corresponding to user speech input may be aligned according to each time interval.

[0153] For example, the time interval of the interpolated heart rate information (820) may be aligned identically to the time interval of the image embedding data (810) by the time interpolation module (710). Since the interpolated heart rate information (820) is aligned to the time interval of the image embedding data (810) by the time interpolation module (710), for example, the values 0.45, 0.68, 0.71, ..., 0.22 of the interpolated heart rate information (820) may be aligned to each frame time t ∈ {1, 2, 3, ..., k, ..., T} of the image embedding data (810).

[0154] In one embodiment of the present disclosure, different types of embedding data used in the importance calculation process described below may be used based on the same time interval. For example, an electronic device (1000) may use data of voice text embedding data (830) and speech text embedding data (840) based on the frame time of image embedding data (810). Here, the frame time may refer to a time interval during which a specific frame is maintained on the screen. For example, the electronic device (1000) may calculate the importance of each time interval based on the frame time of image embedding data (810) (e.g., the time interval from t2 to t3). Additionally, for example, the electronic device (1000) may use voice text embedding data (830) located at the same time as the frame time (e.g., time interval from t2 to t3) of the image embedding data (810) and speech text embedding data (840) located at the same time as the frame time (e.g., time interval from t2 to t3) to calculate the importance of a specific time interval. However, it is not limited thereto.

[0155] In one embodiment of the present disclosure, the amount of image embedding data (810) may be constant for each frame time (i.e., time interval). Here, the amount of image embedding data (810) corresponding to a specific frame time may be smaller or larger than the amount of voice text embedding data (830) or of speech text embedding data (840) corresponding to that frame time. For example, in the graph of FIG. 8, the image embedding data (810) uses only data located between frame times t2 and t3, but the voice text embedding data (830) may cover a time interval extending beyond t2 to t3, so that the amount of voice text embedding data may be greater than the amount of image embedding data at that frame time.
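
One way to pick the voice and speech segments that belong to a frame time is an interval-overlap test, sketched below. The tuple layout, the function name `segments_overlapping`, and the concrete times (a frame interval t2 to t3 taken as 2.0 s to 3.0 s) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, label); hypothetical layout


def segments_overlapping(segments: Sequence[Segment],
                         frame_start: float,
                         frame_end: float) -> List[Segment]:
    """Return the segments whose time span overlaps the frame interval [frame_start, frame_end)."""
    return [s for s in segments if s[0] < frame_end and s[1] > frame_start]


voice_segments = [(1.5, 3.5, "voice_a"), (3.6, 4.0, "voice_b")]
speech_segments = [(2.2, 2.8, "speech_a")]

# voice_a starts before the frame interval and ends after it, yet it is still matched.
voice_hits = segments_overlapping(voice_segments, 2.0, 3.0)
speech_hits = segments_overlapping(speech_segments, 2.0, 3.0)
```

Because a matched segment may extend beyond the frame interval, the text data associated with one frame time can cover more time than the frame itself.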

[0156] Meanwhile, in FIG. 8, user preference information (850) is shown in the form of corresponding embedding data.

[0157] Referring again to FIG. 7, the scene and speech generation module (330) may include an importance calculation module (720), a scene generation module (730), and a speech clip generation module (740). However, not all of the illustrated components are essential. The scene and speech generation module (330) may be implemented with more components than illustrated, or with fewer components.

[0158] The importance calculation module (720) can calculate time series importance through mathematical formulas 4 and 5. In the present disclosure, 'time series importance' is data according to the flow of time calculated using time series information of each time interval (e.g., content-related information and heart rate information, etc.), and the importance for each time interval may be represented as a specific value, numerical value, or level. For example, the importance calculation module (720) can calculate time series importance based on time series information corresponding to each frame time of the first video image.

[0159] [Mathematical Formula 4]

[0160]

[0161] [Mathematical Formula 5]

[0162]

[0163] In Equations 4 and 5, time series information can be used based on each frame time t of the first video image. Here, frame time may refer to the time interval during which a specific frame is maintained on the screen, and t ∈ {1, 2, 3, ..., k, ..., T}. $I_t$ can represent the importance at frame time t. $e_{t,x}$ can refer to embedding data regarding content-related information at frame time t. For example, $e_{t,\text{image}}$ can refer to image embedding data at frame time t, $e_{t,\text{voice}}$ can refer to voice text embedding data at frame time t, and $e_{t,\text{utterance}}$ can refer to speech text embedding data at frame time t. F represents user preference information, and f can represent each preference keyword included in the user preference information, where f ∈ {A, B, ...}. $\mathrm{SIM}(e_{t,x}, e_{t,x'})$ refers to the similarity between different pieces of content-related information, and $\mathrm{SIM}(e_{t,x}, f)$ can refer to the similarity between content-related information and user preference information. $h_t$ can refer to the value of the interpolated heart rate information at frame time t.

[0164] Equations 4 and 5 can be distinguished based on whether a user speech input exists in each time interval. For example, the importance calculation module (720) can calculate the time series importance using Equation 4 if speech text embedding data corresponding to a user speech input exists in a specific time interval. Also, for example, the importance calculation module (720) can calculate the time series importance using Equation 5 if no speech text embedding data corresponding to a user speech input exists in a specific time interval.

[0165] According to Equation 4, the greater at least one of the similarity $\mathrm{SIM}(e_{t,\text{image}}, e_{t,\text{utterance}})$ between image embedding data and speech text embedding data or the similarity $\mathrm{SIM}(e_{t,\text{voice}}, e_{t,\text{utterance}})$ between voice text embedding data and speech text embedding data is, the higher the importance of the corresponding time interval can be calculated. Accordingly, the more similar at least one of the image of the first video or the voice of the first video in a specific time interval is to the user's speech input, the higher the importance of the corresponding time interval can be calculated.

[0166] According to mathematical formulas 4 and 5, the larger the heart rate value $h_t$ of a given time interval included in the interpolated heart rate information, the higher the importance of the corresponding time interval can be calculated. Accordingly, as the user's heart rate increases within a specific time interval, the importance of that time interval can be calculated as higher.

[0167] The preference term of mathematical formulas 4 and 5 can represent the sum, over all preference keywords f belonging to the user's preference information F, of the maximum value obtained by comparing $\mathrm{SIM}(e_{t,\text{image}}, f)$ and $\mathrm{SIM}(e_{t,\text{voice}}, f)$. Accordingly, if either the image of the first video or the voice of the first video is similar to at least one preferred keyword, the importance of the corresponding time interval can be calculated as high.

[0168] Meanwhile, in one embodiment of the present disclosure, the method for calculating importance is not limited to Equations 4 and 5. Some of the information exemplified in Equations 4 and 5 may be omitted. For example, heart rate information may be omitted in Equations 4 and 5. Accordingly, the electronic device (1000) may calculate importance without considering the user's heart rate information. For example, the electronic device (1000) may calculate importance using only the similarity between content-related information and user preference information in Equations 4 and 5.
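
The bodies of Equations 4 and 5 are not reproduced in this text, so the sketch below is only one possible form that is consistent with the prose description: a speech term that grows with the similarity between the user speech input and the image or voice, a preference term that sums the per-keyword maximum over image and voice, and the heart rate applied as a multiplicative weight. Taking the maximum for the speech term, adding the two terms, and multiplying by the heart rate are all assumptions, as are the function and parameter names.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def importance(e_image: Sequence[float],
               e_voice: Sequence[float],
               keyword_embeddings: Dict[str, Sequence[float]],
               heart_rate: float = 1.0,
               e_utterance: Optional[Sequence[float]] = None) -> float:
    """Assumed importance of one frame time.

    With e_utterance given, this plays the role of Equation 4; without it, Equation 5.
    heart_rate=1.0 corresponds to omitting heart rate information.
    """
    preference_term = sum(
        max(cosine_similarity(e_image, f), cosine_similarity(e_voice, f))
        for f in keyword_embeddings.values()
    )
    if e_utterance is None:
        speech_term = 0.0
    else:
        speech_term = max(cosine_similarity(e_image, e_utterance),
                          cosine_similarity(e_voice, e_utterance))
    return heart_rate * (speech_term + preference_term)


keywords = {"goal": [1.0, 0.0]}
with_speech = importance([1.0, 0.0], [0.0, 1.0], keywords,
                         heart_rate=0.5, e_utterance=[0.0, 1.0])
without_speech = importance([1.0, 0.0], [0.0, 1.0], keywords, heart_rate=0.5)
```

Calling the function without `e_utterance` and with the default `heart_rate` reduces it to the preference term alone, which matches the variant in which importance uses only the similarity between content-related information and user preference information.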

[0169] In the present disclosure, the meaning that importance is calculated to be high may mean that importance corresponds to a predetermined third condition.

[0170] In conjunction with FIG. 7 and FIG. 9a, an example of an importance calculation operation utilizing user speech input is described, and in conjunction with FIG. 9b, an example of an importance calculation operation utilizing user preference information is described.

[0171] FIG. 9a is an example of an operation in which an electronic device according to one embodiment of the present disclosure calculates importance based on the similarity between a user speech input and voice data of a first video / image data of a first video.

[0172] Referring to FIG. 9a, when a user speech input exists in a specific time interval, the electronic device (1000) can calculate a first similarity between the user speech input and the voice data of the first video and a second similarity between the user speech input and the image data of the first video, and calculate the importance of the time interval based on the calculated similarities.

[0173] For example, assume that in a specific time interval, the image in the first video relates to a 'goal-scoring scene,' and the audio in the first video relates to 'Player X shooting' and 'goal.' If the user speech input in that time interval is 'Wow, the goal went in,' the first similarity and the second similarity may be 0.8. Additionally, the importance, with the heart rate $h_t$ considered as a weight, may be 0.9. On the other hand, if the user speech input in that time interval is 'What should I eat for dinner tonight?', the first and second similarities may be 0.1, and the importance, with the heart rate $h_t$ considered as a weight, may be 0.05. In other words, the electronic device (1000) can select a personalized highlight scene of the user by using the similarity between the user's speech input and the image of the first video or the voice of the first video.

[0174] FIG. 9b is an example of an operation in which an electronic device according to one embodiment of the present disclosure calculates importance based on the similarity between user preference information and voice data of the first video / image data of the first video.

[0175] Referring to FIG. 9b, the electronic device (1000) can calculate a third similarity between the user's preference information and the image data of the first video, and a fourth similarity between the user's preference information and the voice data of the first video, and calculate the importance of the corresponding time interval based on the calculated similarities. This can be utilized regardless of whether a user speech input exists in the specific time interval.

[0176] For example, it is assumed that in a specific time interval, the image of the first video is about 'rejoicing spectators,' the audio of the first video is about 'spectators rejoicing after scoring a goal,' and the user's preferred keyword is 'goal.' In this case, the third similarity between the user's preference information and the image data of the first video is low at 0.1, but the fourth similarity between the user's preference information and the audio data of the first video can be high at 0.7. The electronic device (1000) can calculate importance by considering the maximum value of 0.7 obtained by comparing the third similarity and the fourth similarity. In other words, if either the image of the first video or the audio of the first video is similar to the user's preference information, the electronic device (1000) can use it as a personalized highlight scene for the user.

[0177] Referring again to FIG. 7, the scene generation module (730) can generate a scene within the first video based on the time-series importance obtained through the importance calculation module (720). The speech clip generation module (740) can generate a speech clip corresponding to the scene within the first video based on the time-series importance obtained through the importance calculation module (720). The scene generation module (730) and the speech clip generation module (740) may use original data of content-related information to generate the scene within the first video and the speech clip corresponding to the scene. Here, the original data of content-related information may include image data of the first video, voice-text data of the first video, and speech-text data of user speech input as data prior to encoding. For example, the electronic device (1000) may use original data of content-related information of the corresponding time interval based on the importance of the corresponding time interval.

[0178] The operation of the scene generation module (730) is described in conjunction with FIG. 7 and FIG. 10. FIG. 10 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to generate a scene in a first video based on importance.

[0179] Referring to FIG. 10, in operation 1010, the electronic device (1000) can calculate time series importance through Equations 4 and 5 of the importance calculation module (720). For example, referring to example (1015), the importance of frame time t1 may be 0.2, the importance of frame time t2 may be 0.7, and the importance of frame time t3 may be 0.9.

[0180] In operations 1020 to 1030, the electronic device (1000) can generate a scene by means of a scene generation module (730) for a frame corresponding to a time interval in which the importance value corresponds to a predetermined third condition. Here, the predetermined third condition may be identifying a time interval in which the importance is greater than or equal to a first threshold value (e.g., q1), but is not limited thereto.

[0181] In operation 1020, the electronic device (1000) can identify whether the importance value is greater than or equal to a first threshold value (e.g., q1). In operation 1030, the electronic device (1000) can create a scene from a frame in a time interval where the importance value is greater than or equal to the first threshold value. In example (1015), when the first threshold value q1 is 0.6, the electronic device (1000) can select, as scenes, frame 2 corresponding to frame time t2 and frame 3 corresponding to frame time t3. The electronic device (1000) may exclude from the scene a frame in a time interval where the importance value is less than the first threshold value, for example, frame 1 corresponding to frame time t1.

[0182] In operations 1040 to 1060, the electronic device (1000) may determine whether to generate a selected scene as a single image or a set of consecutive images based on importance. Here, a single image may include a single frame. A set of consecutive images may include a plurality of consecutive frames and may be represented as a video or GIF.

[0183] In operation 1040, the electronic device (1000) can identify whether, for a scene where the importance value is greater than or equal to a first threshold, the importance value is greater than or equal to a second threshold (e.g., q2). Here, the second threshold may be a value greater than the first threshold. In operation 1050, if the importance value is greater than or equal to the second threshold, the electronic device (1000) can generate the scene as a set of consecutive images. In operation 1060, if the importance value is less than the second threshold, the electronic device (1000) can generate the scene as a single image. In example (1015), if the second threshold q2 is 0.85, the electronic device (1000) can generate frame 3 corresponding to frame time t3 as a scene of a set of consecutive images. Additionally, the electronic device (1000) can generate frame 2 corresponding to frame time t2 as a scene of a single image.
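
Operations 1020 to 1060 can be sketched with the values of example (1015). The function name `classify_frames` and the string labels are hypothetical; frames are numbered from 1 to match the example.

```python
from typing import Dict, Sequence


def classify_frames(importance: Sequence[float], q1: float, q2: float) -> Dict[int, str]:
    """Map each selected frame number (1-based) to the kind of scene generated for it."""
    if q2 <= q1:
        raise ValueError("the second threshold must be greater than the first")
    scenes = {}
    for frame_number, value in enumerate(importance, start=1):
        if value < q1:
            continue  # below the first threshold: the frame is not part of any scene
        scenes[frame_number] = "consecutive_images" if value >= q2 else "single_image"
    return scenes


# Importance of frame times t1, t2, t3 from example (1015).
scene_types = classify_frames([0.2, 0.7, 0.9], q1=0.6, q2=0.85)
```

Frame 1 is dropped, frame 2 becomes a single image, and frame 3 becomes a set of consecutive images, matching the outcome described for example (1015).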

[0184] A detailed description of operation 1050 is given with reference to FIGS. 11a and 11b. FIG. 11a is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to generate a scene in a first video as a set of consecutive images based on importance. FIG. 11b is a diagram illustrating the method of FIG. 11a.

[0185] In operations 1110 and 1120, the electronic device (1000) may include the previous frame corresponding to the previous time interval as a scene if, based on the time interval corresponding to the frame selected as a scene, the difference in importance with respect to the previous time interval is less than a third threshold value (e.g., q3). The electronic device (1000) may repeatedly identify the difference in importance between neighboring time intervals in reverse order based on the time interval selected as a scene. For example, the electronic device (1000) may consider the importance of time intervals in reverse order for a predetermined number (e.g., k). The electronic device (1000) may create a scene consisting of a set of consecutive images including a maximum number (e.g., k) of previous frames and a current frame for which the difference in importance between neighboring time intervals is less than the third threshold value.

[0186] In operations 1110 and 1130, the electronic device (1000) may generate only the frame corresponding to the time interval corresponding to the frame selected as the scene as the scene if the difference in importance from the previous time interval is greater than or equal to a third threshold value, based on the time interval corresponding to the frame selected as the scene. That is, the electronic device (1000) may not generate the previous frame corresponding to the previous time interval as the scene.

[0187] Referring to FIG. 11b, for an importance $I_t$ that is greater than or equal to q2, the electronic device (1000) can determine the frames to include in the scene by considering the importance $I_{t'}$ ($t' < t$) in reverse chronological order, starting at time t and proceeding back to time t'. For k starting from 0, t' can be determined according to the maximum k for which $I_{t-k} - I_{t-k-1} < q_3$ is continuously satisfied, consistent with the condition that a previous frame is included when the difference in importance is less than the third threshold value. That is, t' can represent the last time reached in the reverse search, which is the time of the earliest frame to be included in the scene. k can represent the number of time intervals to be considered in reverse order. The electronic device (1000) can generate a scene extending back to the frame corresponding to time t'.

[0188] For example, assume that the third threshold is 0.15 and the importance of the current frame time t is 0.89. If the difference between the importance $I_t$ of the current frame time t selected as a scene and the importance $I_{t-1}$ of the previous frame time t-1 is 0.08, the electronic device (1000) can generate the scene up to the frame corresponding to the previous frame time t-1. If the difference between the importance $I_{t-1}$ of the previous frame time t-1 and the importance $I_{t-2}$ of the frame time t-2 before it is 0.12, the electronic device (1000) can generate the scene up to the frame corresponding to frame time t-2. In the same way, the electronic device (1000) can generate the scene up to the frame corresponding to frame time t-3 and not include the frame corresponding to frame time t-4.

[0189] Meanwhile, if k is set to 0, the electronic device (1000) can generate a single image as a scene even if it was a scene in which the user was immersed.

[0190] The electronic device (1000) can search back through time intervals where importance does not decrease rapidly and include previous frames in the scene. Accordingly, the electronic device (1000) can generate not only the most important scene but also a scene that includes the situation immediately preceding the most important scene.
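The backward search described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the list layout (oldest interval first), and the use of a signed difference are assumptions. The third threshold of 0.15, the importance 0.89, and the differences 0.08 and 0.12 follow the worked example, while the two oldest importance values are made up so that the search stops before frame time t-4.

```python
from typing import Sequence


def extend_scene_backward(importance: Sequence[float], t: int, q3: float) -> int:
    """Return t', the earliest interval index to include in the scene.

    Starting at the selected interval t, keep stepping to the previous
    interval while the drop in importance between neighbours is below q3.
    """
    t_prime = t
    while t_prime > 0 and importance[t_prime] - importance[t_prime - 1] < q3:
        t_prime -= 1
    return t_prime


# Importance per time interval, oldest first; index 4 is the selected time t.
# 0.40 and 0.60 are assumed values chosen only for illustration.
importance = [0.40, 0.60, 0.69, 0.81, 0.89]
t = 4
t_prime = extend_scene_backward(importance, t, q3=0.15)
scene_intervals = list(range(t_prime, t + 1))  # intervals t-3 through t
print(scene_intervals)
```

With these numbers the walk accepts t-1, t-2, and t-3 and rejects t-4, because the difference between t-3 and t-4 is about 0.20, which is not below 0.15.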

[0191] FIG. 12 is an example of a scene generated by an electronic device according to one embodiment of the present disclosure and a speech input corresponding to the scene.

[0192] Referring to FIG. 12, a scene (1215) within an image (1210) of the first video may include a set of consecutive images. For example, an electronic device (1000) may generate a scene including a post-goal frame (1215d), which is the most important scene, as in FIG. 11b, and the frames prior thereto (e.g., a pass frame (1215a), a pre-goal frame (1215b), and a shooting frame (1215c)). A scene (1215) within an image (1210) of the first video may further include an audio clip (1235) included in the audio (1230) of the first video.

[0193] The electronic device (1000) can generate a speech clip (1225) corresponding to a time interval (e.g., 1 minute 20 seconds) corresponding to a scene (1215) in the speech input (1220). The electronic device (1000) can identify a time interval of speech text embedding data corresponding to a frame time based on the frame time of the image embedding data. For example, the electronic device (1000) can generate a speech clip (1225) to be used for a prompt based on speech text embedding data located at the same time as the frame time of the image embedding data (e.g., 1 minute 20 seconds).
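One way to picture the speech clip selection is as a time-overlap filter over timestamped utterances. The sketch below assumes that each recognized utterance carries start and end times in seconds; the `Utterance` type, the function name, and the sample sentences are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start_s: float  # start time within the first video, in seconds
    end_s: float    # end time within the first video, in seconds
    text: str


def speech_clip_for_scene(utterances: List[Utterance],
                          scene_start_s: float,
                          scene_end_s: float) -> List[Utterance]:
    """Keep the utterances whose time span overlaps the scene interval."""
    return [u for u in utterances
            if u.start_s < scene_end_s and u.end_s > scene_start_s]


utterances = [
    Utterance(70.0, 72.5, "Pass it!"),
    Utterance(79.0, 82.0, "What a goal!"),
    Utterance(95.0, 97.0, "Time for a snack."),
]
# Hypothetical scene around 1 minute 20 seconds (77 s to 81 s).
clip = speech_clip_for_scene(utterances, 77.0, 81.0)
print([u.text for u in clip])
```

Only the utterance that overlaps the scene interval is kept, so the prompt receives the user's reaction to that scene rather than unrelated speech.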

[0194] Scene (1215) may correspond to at least one scene (211) in the first video of FIG. 2. Speech clip (1225) may correspond to speech clip (214) of FIG. 2.

[0195] FIG. 13 is an example showing additional comments mapped to importance values in an electronic device according to one embodiment of the present disclosure.

[0196] Referring to FIG. 13, an electronic device (1000) according to one embodiment of the present disclosure can identify additional comments mapped to importance values and use the identified additional comments in a prompt. Referring to the table (1300), as the importance value increases, the intensity of the emotion may change or the length of the additional comments may increase. For example, at an importance value of 0.6, a short and simple expression of interest such as "It was quite interesting" may be used. At an importance value of 0.8, a longer and more emotional response such as "It was a moment I was very immersed in!" may be used. The mapping data of the table (1300) may be stored in the local memory of the electronic device (1000) or in an external database.

[0197] Accordingly, the electronic device (1000) can generate prompts using a tone that reveals a longer or more emotional response and stronger immersion as the importance increases. This may correspond to additional comments (216) in FIG. 2.
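A lookup over such a mapping table might look like the sketch below. The two rows come from the example for table (1300); the lookup rule (take the row with the highest importance value that does not exceed the computed importance) and the names `COMMENT_TABLE` and `additional_comment` are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Optional

# (importance value, additional comment), sorted by importance value.
# Only the 0.6 and 0.8 rows are given in the example; a real table may
# contain more rows.
COMMENT_TABLE = [
    (0.6, "It was quite interesting"),
    (0.8, "It was a moment I was very immersed in!"),
]


def additional_comment(importance: float) -> Optional[str]:
    """Return the comment of the highest row whose value is <= importance."""
    thresholds = [value for value, _ in COMMENT_TABLE]
    idx = bisect_right(thresholds, importance)
    return COMMENT_TABLE[idx - 1][1] if idx > 0 else None


print(additional_comment(0.65))
print(additional_comment(0.89))
```

An importance below the lowest row returns `None`, which a caller could treat as "add no additional comment".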

[0198] FIG. 14 is a flowchart illustrating a method for an electronic device to generate a prompt according to one embodiment of the present disclosure.

[0199] Referring to FIG. 14, in operation 1410, an electronic device (1000) according to one embodiment of the present disclosure can obtain content-related information including image data of a first video and voice data of the first video.

[0200] For example, content-related information may include image data of the first video and voice data of the first video. Content-related information may further include user speech input occurring during the playback of the first video. Content-related information may be information constructed by synchronizing a specific scene of the first video and the user's reaction in that scene over time, but is not limited thereto. Content-related information may be referred to as time-series content information.

[0201] The electronic device (1000) can receive various video content generated by content providers through an external device. A content provider may refer to a terrestrial broadcasting station, a cable broadcasting station, a satellite broadcasting station, an IPTV (Internet Protocol Television) service provider, or an OTT (Over the Top) service provider that provides various content to consumers. The external device may be implemented as a source device of various forms, such as a PC, a set-top box (e.g., a terrestrial broadcasting set-top box, a cable broadcasting set-top box, a satellite broadcasting set-top box, an internet broadcasting set-top box), a Blu-ray disc player, a mobile phone, a game console, a home theater, an audio player, a USB, etc.

[0202] For example, an electronic device (1000) can acquire data regarding a first video provided to a user (viewer). For example, the data regarding the first video may include image data (or video data) and voice data (or audio data).

[0203] For example, the electronic device (1000) can receive speech input from a user while watching the first video through a voice input device (e.g., a microphone). The electronic device (1000) can obtain the user speech input by processing the user's speech input.

[0204] In one embodiment of the present disclosure, an electronic device (1000) can convert the image data of the first video, the voice data of the first video, and the user speech input corresponding to content-related information into vector representations through a preprocessing module (310 in FIG. 3). For example, the electronic device (1000) can obtain image embedding data by encoding the image data of the first video through an image encoder (410 in FIG. 4). For example, the electronic device (1000) can obtain voice text embedding data by inputting the voice data of the first video into a voice recognition module (420 in FIG. 4) to extract text, and encoding the extracted text through a text encoder (420 in FIG. 4). For example, the electronic device (1000) can obtain speech text embedding data by inputting the user speech input into a speech recognition module (420 in FIG. 4) to extract text, and encoding the extracted text through a text encoder (420 in FIG. 4).

[0205] Meanwhile, in a situation where there are multiple speakers, the electronic device (1000) can extract speech input from the main viewer user through a speaker separation module (440 in FIG. 4). The electronic device (1000) can extract text by inputting the speech input of the main viewer into a speech recognition module (420 in FIG. 4), and can obtain speech text embedding data by encoding the extracted text through a text encoder (420 in FIG. 4).

[0206] In operation 1420, an electronic device (1000) according to one embodiment of the present disclosure can identify user characteristic information related to content-related information.

[0207] For example, user characteristic information may be information stored in advance in the memory of the electronic device (1000) or in an external database. User characteristic information may include at least one of user voice information, user preference information, or user-generated content. User voice information, user preference information, and user-generated content may each be stored in separate databases, but are not limited thereto. In one embodiment of the present disclosure, the electronic device (1000) may convert user characteristic information into a vector representation through a preprocessing module (310 in FIG. 3).

[0208] An electronic device (1000) according to one embodiment of the present disclosure can identify, through a user-generated content identification module (320 in FIGS. 3 and 5), at least one content among a plurality of content related to the user that has a correlation degree greater than or equal to a preset value, based on the correlation degree between each of the plurality of content and the content-related information. The electronic device (1000) can use the identified at least one content as a component of a prompt. For example, the electronic device (1000) can calculate the correlation degree between each user-generated content and the content-related information through Equations 1 and 2 of FIG. 5. The correlation degree between each of the plurality of content related to the user and the content-related information can be obtained as a weighted sum of a first similarity degree between the content and image data corresponding to the first video, a second similarity degree between the content and voice data corresponding to the first video, and a third similarity degree between the content and the user speech input. The electronic device (1000) can identify at least one user-generated content among a plurality of user-generated content that has a correlation degree corresponding to a predetermined first condition. For example, the predetermined first condition may be that the value of the correlation degree is greater than or equal to a threshold value.
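The weighted-sum correlation and threshold test can be illustrated as follows. Equations 1 and 2 of FIG. 5 are not reproduced in this passage, so the sketch assumes cosine similarity between embedding vectors and arbitrary weights of 0.4, 0.3, and 0.3; the function names, vectors, and threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def correlation_degree(content_vec: Sequence[float],
                       image_vec: Sequence[float],
                       voice_vec: Sequence[float],
                       speech_vec: Sequence[float],
                       weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the first, second, and third similarity degrees."""
    w1, w2, w3 = weights  # assumed weights, not values from the disclosure
    return (w1 * cosine_similarity(content_vec, image_vec)
            + w2 * cosine_similarity(content_vec, voice_vec)
            + w3 * cosine_similarity(content_vec, speech_vec))


def select_related_content(user_contents: Dict[str, Sequence[float]],
                           image_vec: Sequence[float],
                           voice_vec: Sequence[float],
                           speech_vec: Sequence[float],
                           threshold: float) -> List[str]:
    """Names of user-generated content whose correlation degree >= threshold."""
    return [name for name, vec in user_contents.items()
            if correlation_degree(vec, image_vec, voice_vec, speech_vec) >= threshold]


user_contents = {"soccer_review": [1.0, 0.0], "cooking_vlog": [0.0, 1.0]}
selected = select_related_content(user_contents, [1.0, 0.0], [0.9, 0.1], [1.0, 0.2], 0.5)
print(selected)
```

In this toy setup the content whose embedding points in the same direction as the video and speech embeddings passes the threshold, and the unrelated content does not.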

[0209] An electronic device (1000) according to one embodiment of the present disclosure can identify, through a user preference information identification module (340 in FIGS. 3 and 7), at least one preference keyword related to content-related information among the user's preference keywords stored in a user preference information DB. The electronic device (1000) can use the identified at least one preference keyword as a component of a prompt. For example, to identify the at least one preference keyword, the electronic device (1000) can calculate, for each of the user's preference keywords, the similarity between the preference keyword and the content-related information through Equation 3 of FIG. 7. The electronic device (1000) can identify at least one preference keyword whose similarity corresponds to a predetermined second condition. For example, the predetermined second condition may be that the value of the similarity is greater than or equal to a threshold value (e.g., qf).

[0210] In operation 1430, an electronic device (1000) according to one embodiment of the present disclosure can obtain at least one scene included in the first video based on user characteristic information related to content-related information. In operation 1440, an electronic device (1000) according to one embodiment of the present disclosure can obtain a user's speech input corresponding to at least one scene.

[0211] An electronic device (1000) according to one embodiment of the present disclosure can generate at least one scene in a first video and a user's speech clip corresponding to the at least one scene through an importance calculation module (720 in FIG. 7), a scene generation module (730 in FIG. 7), and a speech clip generation module (740 in FIG. 7) included in a scene and speech generation module (330 in FIGS. 3 and 7).

[0212] An electronic device (1000) according to one embodiment of the present disclosure can calculate importance for each time interval based on user preference keywords included in content-related information and user characteristic information through an importance calculation module (720 in FIG. 7). Here, importance may be referred to as time-series importance or importance by time interval.

[0213] An electronic device (1000) according to one embodiment of the present disclosure can calculate the importance of each time interval based on the frame time of the image data. Additionally, to calculate the importance of each time interval, the electronic device (1000) may use voice text embedding data located at the same time as the frame time of the image embedding data and speech text embedding data located at the same time as the frame time. An example regarding this is described in FIG. 8. Meanwhile, the frame time of the image embedding data may be identified through a timestamp, but is not limited thereto.

[0214] An electronic device (1000) according to one embodiment of the present disclosure may use a different method for calculating importance for each time interval depending on whether there is a user speech input in that time interval. For example, if there is speech text embedding data corresponding to a user speech input in a specific time interval, the electronic device (1000) may obtain importance through Equation 4. If there is no speech text embedding data corresponding to a user speech input in a specific time interval, the electronic device (1000) may obtain importance through Equation 5.

[0215] An electronic device (1000) according to one embodiment of the present disclosure can generate, through a scene generation module (730 in FIG. 7), the at least one scene including at least one frame included in the image data of the first video based on the importance of each time interval. The electronic device (1000) can generate a scene from a frame corresponding to a time interval in which the importance value corresponds to a predetermined third condition. For example, the predetermined third condition may be that the importance is greater than or equal to a first threshold value (e.g., q1). The electronic device (1000) can determine, based on the importance, whether to generate the selected scene as a single image or as a set of consecutive images. Here, the single image may include one frame. For example, the electronic device (1000) can generate the scene as a set of consecutive images if the importance value is greater than or equal to a second threshold value (e.g., q2) that is greater than the first threshold value (e.g., q1). The electronic device (1000) can generate the scene as a single image if the importance value is greater than or equal to the first threshold value and less than the second threshold value.
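The two-threshold decision can be written as a small classifier. This is a sketch under assumptions: the enum, the function name, and the example values q1 = 0.6 and q2 = 0.8 are illustrative and do not come from the disclosure.

```python
from enum import Enum


class SceneType(Enum):
    NONE = "none"                                # interval is not selected as a scene
    SINGLE_IMAGE = "single_image"                # one frame
    CONSECUTIVE_IMAGES = "consecutive_images"    # a set of consecutive frames


def classify_interval(importance: float, q1: float, q2: float) -> SceneType:
    """Decide how a time interval is turned into a scene.

    q1 is the first threshold (scene selection) and q2 the second threshold
    (single image versus consecutive images); q2 must be greater than q1.
    """
    if q2 <= q1:
        raise ValueError("q2 must be greater than q1")
    if importance < q1:
        return SceneType.NONE
    if importance < q2:
        return SceneType.SINGLE_IMAGE
    return SceneType.CONSECUTIVE_IMAGES


for value in (0.45, 0.70, 0.89):
    print(value, classify_interval(value, q1=0.6, q2=0.8).value)
```

Intervals classified as `CONSECUTIVE_IMAGES` are the ones for which the preceding intervals are then examined against the third threshold.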

[0216] To generate a scene as a set of consecutive images, an electronic device (1000) according to one embodiment of the present disclosure can identify, for a time interval where the importance value is greater than or equal to the second threshold value, whether the difference in importance from the previous time interval is less than a third threshold (e.g., q3). For example, if the difference in importance from the previous time interval is less than the third threshold (e.g., q3), the electronic device (1000) may include the previous frame corresponding to the previous time interval in the scene. If the difference in importance from the previous time interval is greater than or equal to the third threshold (e.g., q3), the electronic device (1000) may generate, as the scene, only the frame corresponding to the time interval selected as the scene. That is, the electronic device (1000) may not include the previous frame corresponding to the previous time interval in the scene. Accordingly, the electronic device (1000) can repeatedly identify the difference in importance between neighboring time intervals in reverse order of time, starting from the time interval selected as the scene, and can use as the scene all frames corresponding to the time intervals for which the difference in importance is less than the third threshold value.

[0217] An electronic device (1000) according to one embodiment of the present disclosure can generate a speech clip including speech input of a time interval corresponding to at least one scene from a user speech input through a speech clip generation module (740 in FIG. 7). For example, the electronic device (1000) can generate a speech clip to be used for a prompt based on speech text embedding data located at the same time as the frame time of the image embedding data.

[0218] In operation 1450, an electronic device (1000) according to one embodiment of the present disclosure may obtain a prompt based on user characteristic information related to content-related information, at least one scene included in a first video, and a user's speech input corresponding to at least one scene. In one embodiment of the present disclosure, the prompt may include at least one user-generated content, at least one scene in the first video, at least one speech clip in the user speech input, or at least one preferred keyword. For example, the prompt may include all of the above-described components, or may include at least one of the components.
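Assembling a prompt from whichever components were identified could look like the sketch below. The disclosure does not fix a prompt format here, so the section titles, the instruction sentence, the `PromptComponents` type, and the representation of scenes as text descriptions are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptComponents:
    user_generated_content: List[str] = field(default_factory=list)
    scene_descriptions: List[str] = field(default_factory=list)
    speech_clips: List[str] = field(default_factory=list)
    preferred_keywords: List[str] = field(default_factory=list)


def build_prompt(components: PromptComponents) -> str:
    """Join the available components into one text prompt."""
    sections = [
        ("Content previously created by the user", components.user_generated_content),
        ("Scenes the user found important", components.scene_descriptions),
        ("What the user said during those scenes", components.speech_clips),
        ("Keywords the user prefers", components.preferred_keywords),
    ]
    lines = ["Write reaction content for the first video in the user's own style."]
    for title, items in sections:
        if items:  # a component that was not identified is left out
            lines.append(f"{title}: " + "; ".join(items))
    return "\n".join(lines)


prompt = build_prompt(PromptComponents(
    scene_descriptions=["goal scene, frames 1215a to 1215d"],
    speech_clips=["What a goal!"],  # hypothetical user speech
    preferred_keywords=["soccer"],
))
print(prompt)
```

Because empty components are skipped, the same function covers both a prompt that includes all components and a prompt that includes only some of them.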

[0219] FIG. 15 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to obtain response record content from a prompt.

[0220] Referring to FIG. 15, in operation 1510, the electronic device (1000) can acquire image data of the first video and voice data of the first video.

[0221] In operation 1520, the electronic device (1000) can obtain user speech input. The electronic device (1000) can extract the speech input of the user who is the main viewer from among a number of speakers based on the user's voice information stored in the user voice DB.

[0222] In operation 1530, the electronic device (1000) can convert content-related information, including user speech input, image data of the first video, and voice data of the first video, into a vector form. The electronic device (1000) can obtain speech text embedding data corresponding to the user speech input, image embedding data corresponding to the image data of the first video, and voice text embedding data corresponding to the voice data of the first video. This is described in FIG. 4.

[0223] In operation 1540, the electronic device (1000) can identify at least one user-generated content related to content-related information among a plurality of user-generated contents stored in a user-generated content DB. The electronic device (1000) can calculate the degree of correlation between content-related information and user-generated content and identify at least one user-generated content corresponding to a predetermined first condition. This has been explained in FIGS. 5 and 6.

[0224] In operation 1550, the electronic device (1000) can identify at least one user's preferred keyword related to content-related information among the user's preferred keywords stored in the user preference information DB. The electronic device (1000) can calculate the similarity between the content-related information and the user preference information and identify at least one user's preferred keyword corresponding to a predetermined second condition. This is explained in FIG. 7.

[0225] In operation 1560, the heart rate measuring device (2000) can measure the user's heart rate through a heart rate sensor. In operation 1570, the heart rate measuring device (2000) can transmit heart rate information to the electronic device (1000) through a communication module. The electronic device (1000) can receive heart rate information from the heart rate measuring device (2000) through a communication module. The electronic device (1000) can perform time-series information interpolation to match the cycle of the heart rate information with the cycle of the content-related information. This is described in FIGS. 7 and 8.
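The time-series interpolation step can be illustrated with linear interpolation, which is one common choice; the disclosure does not state which interpolation method is used. The function name and the sample values below are hypothetical, and heart-rate sample times are assumed to be strictly increasing.

```python
from bisect import bisect_left
from typing import List, Sequence


def interpolate_heart_rate(sample_times: Sequence[float],
                           sample_bpm: Sequence[float],
                           frame_times: Sequence[float]) -> List[float]:
    """Linearly interpolate heart-rate samples at each frame time.

    sample_times must be strictly increasing. Frame times outside the
    sampled range take the nearest sample value.
    """
    if not sample_times or len(sample_times) != len(sample_bpm):
        raise ValueError("need matching, non-empty sample lists")
    result = []
    for ft in frame_times:
        i = bisect_left(sample_times, ft)
        if i == 0:
            result.append(float(sample_bpm[0]))
        elif i == len(sample_times):
            result.append(float(sample_bpm[-1]))
        else:
            t0, t1 = sample_times[i - 1], sample_times[i]
            b0, b1 = sample_bpm[i - 1], sample_bpm[i]
            ratio = (ft - t0) / (t1 - t0)
            result.append(b0 + ratio * (b1 - b0))
    return result


# Heart rate sampled every 5 s, frame times every 2.5 s (assumed cycles).
aligned = interpolate_heart_rate([0.0, 5.0, 10.0], [70.0, 80.0, 100.0],
                                 [0.0, 2.5, 5.0, 7.5, 10.0])
print(aligned)
```

After this step there is one heart-rate value per frame time, so the heart-rate information can be combined with the embedding data of the same time interval when importance is calculated.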

[0226] In operation 1570, the electronic device (1000) can acquire at least one scene in the first video based on content-related information, at least one identified user's preferred keyword, and the user's heart rate information. The electronic device (1000) can calculate importance based on the interpolated user heart rate information. The electronic device (1000) can calculate a time-series importance indicating the importance of each time interval and generate, as a scene, a frame of the first video corresponding to a time interval whose importance corresponds to a predetermined third condition. This has been explained in FIGS. 7 to 11b.

[0227] In operation 1580, the electronic device (1000) can acquire a user's speech clip corresponding to at least one scene within the user speech input. This is described in FIG. 12.

[0228] In operation 1590, the electronic device (1000) may generate a prompt based on at least one scene in the first video, a user’s speech clip corresponding to at least one scene, at least one identified user-generated content, and at least one identified user’s preferred keyword. An example of a prompt is described in FIG. 2.

[0229] In operation 1592, the electronic device (1000) can transmit a prompt to the server (3000) through a communication module.

[0230] In operation 1594, the server (3000) can generate the user's response record content for the first video using a generative model that receives the prompt as input. An example of response record content is described in FIG. 2.

[0231] In operation 1596, the server (3000) can transmit the response record content to the electronic device (1000). The electronic device (1000) can receive the response record content from the server (3000) through a communication module.

[0232] FIG. 16 is a detailed block diagram of an electronic device according to one embodiment of the present disclosure.

[0233] Referring to FIG. 16, an electronic device (1000) according to one embodiment of the present disclosure may include a processor (1601), memory (1602), a tuner unit (1610), a communication unit (1620), a sensing unit (1630), an input / output unit (1640), a video processing unit (1650), a display (1660), an audio processing unit (1670), an audio output unit (1680), and a user input unit (1690). However, not all components shown in FIG. 16 are essential components. The electronic device (1000) may be implemented with more components than those shown in FIG. 16, or with fewer components.

[0234] The processor (1601) controls the overall operation of the electronic device (1000). For example, the processor (1601) can perform the functions of the electronic device (1000) described in the present disclosure by executing one or more instructions stored in memory (1602). In this case, memory (1602) may store one or more instructions that can be executed by the processor (1601). Additionally, the processor (1601) can store one or more instructions in internally provided memory and control the execution of the one or more instructions stored in the internally provided memory so that the aforementioned operations are performed. That is, the processor (1601) can perform a predetermined operation by executing at least one instruction or program stored in memory (1602) or in internal memory provided within the processor (1601).

[0235] The processor (1601) may be composed of at least one of a Central Processing Unit (CPU), a microprocessor, a Graphics Processing Unit (GPU), an Application Processor (AP), Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), a Neural Processing Unit (NPU), or an AI-dedicated processor designed with a hardware structure specialized for the learning and processing of an artificial intelligence (AI) model, but is not limited thereto.

[0236] The memory (1602) can store instructions, algorithms, data structures, program codes, and application programs for processing and control by the processor (1601), and can store data that is input to or output from the electronic device (1000). The memory (1602) may include at least one of a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), Mask ROM, Flash ROM, a hard disk drive (HDD), or a solid-state drive (SSD). A program (one or more instructions) or application stored in memory (1602) can be executed by the processor (1601).

[0237] The tuner unit (1610) can select only the frequency of the channel to be received by the electronic device (1000) from among many radio wave components by tuning through amplification, mixing, resonance, etc. of broadcast content received via wired or wireless connection. The broadcast signal received through the tuner unit (1610) is separated into audio, video, and additional information (e.g., EPG (Electronic Program Guide)). The separated audio, video, and additional information can be stored in memory (1602) under the control of the processor (1601).

[0238] The tuner unit (1610) can receive broadcast signals from various sources such as terrestrial broadcasting, cable broadcasting, satellite broadcasting, internet broadcasting, etc. The tuner unit (1610) can also receive broadcast signals from sources such as analog broadcasting or digital broadcasting.

[0239] The communication unit (1620) can connect the electronic device (1000) to peripheral devices, external devices, servers, display devices, remote control devices, mobile terminals, etc. under the control of the processor (1601). The communication unit (1620) may include at least one communication module capable of performing wireless communication. For example, the communication unit (1620) may separately provide a communication module for communicating with a server, a communication module for communicating with a display device, a communication module for communicating with a remote control device, and a communication module for communicating with a mobile terminal, or it may include a single integrated module.

[0240] The communication unit (1620) may include at least one of a wireless LAN module (1621), a Bluetooth module (1622), and a wired Ethernet (1623) in accordance with the performance and structure of the electronic device (1000). The Bluetooth module (1622) can receive Bluetooth signals transmitted from a peripheral device according to the Bluetooth communication standard. The Bluetooth module (1622) can be a BLE (Bluetooth Low Energy) communication module and can receive BLE signals. The Bluetooth module (1622) can scan for BLE signals continuously or temporarily to detect whether a BLE signal is being received. The wireless LAN module (1621) can transmit and receive Wi-Fi signals with a peripheral device according to the Wi-Fi communication standard.

[0241] In one embodiment of the present disclosure, the communication unit (1620) can receive heart rate information from a heart rate measuring device (2000 in FIG. 3). The communication unit (1620) can transmit a prompt to a server (3000 in FIG. 3) and receive response record content.

[0242] The sensing unit (1630) detects the user's voice, the user's image, or the user's interaction and may include a microphone (1631), a sensor (1632), and an optical receiver (1633).

[0243] The microphone (1631) can receive an audio signal including the user's uttered voice or noise and can convert the received audio signal into an electrical signal and output it to the processor (1601).

[0244] In one embodiment of the present disclosure, the microphone (1631) may receive user speech input including speech input from a user watching the first video and transmit it to the processor (1601).

[0245] The microphone (1631) may be provided in a remote control device such as a remote control, a mobile terminal, or an AI speaker. For example, the mobile terminal may run an application to remotely control the electronic device (1000). In this case, the microphone (1631) provided in the remote control device may receive an audio signal containing the user's uttered voice or noise. The remote control device may convert the audio signal into a control signal and transmit it to the electronic device (1000). The electronic device (1000) may receive the control signal from the remote control device through the communication unit (1620).

[0246] The sensor (1632) detects the user's image, or the user's interaction, gesture, and touch, and may include a distance sensor, an image sensor, a gesture sensor, an illuminance sensor, etc. The distance sensor may include various sensors that detect the distance between the electronic device (1000) and the user, such as an ultrasonic sensor, an IR (Infrared Radiation) sensor, or a TOF (Time Of Flight) sensor. The distance sensor detects the distance to the user and can transmit the sensing data to the processor (1601). The image sensor can capture the user's gesture by photographing it through a camera, etc., and transmit it to the processor (1601). The gesture sensor can detect the speed or direction of movement through an accelerometer or a gyroscope. The illuminance sensor can detect ambient illuminance.

[0247] The optical receiver (1633) can receive an optical signal (including a control signal). The optical receiver (1633) can receive an optical signal corresponding to user input (e.g., touch, press, touch gesture, voice, or motion) from a control device such as a remote control or a mobile phone.

[0248] The input / output unit (1640) can receive video (e.g., dynamic image signal or still image signal), audio (e.g., voice signal or music signal), and additional information from an external device, etc., under the control of the processor (1601). The input / output unit (1640) may include a port for outputting video and audio together, and may also include separate ports for outputting video and audio separately.

[0249] The input / output unit (1640) may include one of an HDMI port (High-Definition Multimedia Interface port, 1641), a component jack (1642), a PC port (1643), and a USB port (1644). The input / output unit (1640) may include a combination of an HDMI port (1641), a component jack (1642), a PC port (1643), and a USB port (1644). Additionally, the input / output unit (1640) may include one of a DP (Display Port), a Thunderbolt port, a VGA (Video Graphics Array) port, an RGB port, a D-SUB, and a DVI (Digital Visual Interface).

[0250] When the electronic device (1000) corresponds to a content providing device such as a set-top box, the input / output unit (1640) can output video, audio, and additional information to an external display device under the control of the processor (1601).

[0251] In one embodiment of the present disclosure, image data and audio data of the first video are transmitted through a separate port within the input / output unit (1640) and may be stored as a separate track in the electronic device (1000). For example, image data may be transmitted through a port such as VGA or DVI, and audio data may be transmitted through a separate port. Alternatively, image data and audio data of the first video may be transmitted as a single stream through HDMI, DP, Thunderbolt, etc., and may be stored as a separate track in the electronic device (1000).

[0252] The video processing unit (1650) processes image data to be displayed by the display (1660) and can perform various image processing operations such as decoding, rendering, scaling, noise filtering, frame rate conversion, and resolution conversion on the image data.

[0253] The display (1660) can output content received from a broadcasting station or from an external device such as an external server or an external storage medium. The content is a media signal and may include a video signal, an image, a text signal, etc.

[0254] In one embodiment of the present disclosure, the display (1660) can output a first video.

[0255] The audio processing unit (1670) performs processing on audio data. Various processing such as decoding, amplification, and noise filtering on the audio data can be performed in the audio processing unit (1670).

[0256] The audio output unit (1680) can output audio included in content received through the tuner unit (1610) under the control of the processor (1601), audio input through the communication unit (1620) or input / output unit (1640), and audio stored in memory (1602). The audio output unit (1680) may include at least one of a speaker (1681), headphones (1682), or S / PDIF (Sony / Philips Digital Interface: output terminal) (1683).

[0257] In one embodiment of the present disclosure, the audio output unit (1680) can output audio data of the first video. The processor (1601) can control the audio output unit (1680) to output the audio data of the first video as audio.

[0258] The user input unit (1690) can receive user input for controlling the electronic device (1000). The user input unit (1690) may include, but is not limited to, various forms of user input devices, such as a touch panel that detects a user's touch, a button that receives a user's push operation, a wheel that receives a user's rotation operation, a keyboard, a dome switch, a microphone for voice recognition, and a motion detection sensor that senses motion. When the electronic device (1000) is controlled by a remote control device, such as a remote controller or a mobile terminal, the user input unit (1690) can receive a control signal from the remote control device.

[0259] In one embodiment of the present disclosure, the memory (1602) may include the preprocessing module (310) of FIG. 3, the user-generated content identification module (320), the scene and speech generation module (330), and the user preference information identification module (340). A 'module' included in the memory (1602) refers to a unit that processes a function or operation performed by the processor (1601), and this may be implemented as software such as instructions, algorithms, data structures, or program code.

[0260] In one embodiment of the present disclosure, the memory (1602) may store image data of the first video, voice data of the first video, and user speech input corresponding to content-related information. The memory (1602) may also store user characteristic information, a prompt, and reaction content.

[0261] In one embodiment of the present disclosure, the processor (1601) can convert each of the image data of the first video, the voice data of the first video, and the user speech input corresponding to content-related information into a vector representation by executing one or more instructions included in the preprocessing module (310).

[0262] In one embodiment of the present disclosure, the processor (1601) can identify at least one user-generated content related to content-related information among a plurality of user-generated contents stored in a user-generated content DB by executing one or more instructions included in the user-generated content identification module (320).

[0263] In one embodiment of the present disclosure, the processor (1601) can obtain at least one scene in the first video and a user's speech clip corresponding to at least one scene in the user speech input based on content-related information by executing one or more instructions included in the scene and speech generation module (330).

[0264] In one embodiment of the present disclosure, the processor (1601) can identify at least one preference keyword related to content-related information among the user's preference keywords stored in the user preference information DB by executing one or more instructions included in the user preference information identification module (340).

[0265] According to one embodiment of the present disclosure, at least one processor obtains content-related information including image data corresponding to a first video and voice data corresponding to the first video.

[0266] According to one embodiment of the present disclosure, at least one processor obtains at least one scene included in the first video and a user speech input corresponding to the at least one scene, based on user characteristic information related to the content-related information.

[0267] According to one embodiment of the present disclosure, at least one processor obtains a prompt based on user characteristic information related to the content-related information, at least one scene included in the first video, and a user speech input corresponding to the at least one scene.
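As a rough illustration of how the three inputs named in [0267] could be combined into a single prompt, the following sketch concatenates them as text. The `Scene` class, the `build_prompt` function, the field names, and the prompt wording are all hypothetical; the disclosure does not specify a prompt format.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """Hypothetical container for one selected scene."""
    start_s: float          # start of the time interval, in seconds
    end_s: float            # end of the time interval, in seconds
    description: str        # assumed text summary of the frame(s)
    user_speech: str = ""   # transcribed user speech clip for this interval, if any


def build_prompt(related_works: list[str],
                 preferred_keywords: list[str],
                 scenes: list[Scene]) -> str:
    """Assemble a text prompt from user characteristic information and scenes."""
    lines = ["Write reaction content for the video described below."]
    if related_works:
        # User characteristic information: earlier content made by the user.
        lines.append("Match the style of the user's earlier content:")
        lines += [f"- {work}" for work in related_works]
    if preferred_keywords:
        # User characteristic information: preferred keywords.
        lines.append("User interests: " + ", ".join(preferred_keywords))
    lines.append("Scenes:")
    for scene in scenes:
        line = f"[{scene.start_s:.0f}s-{scene.end_s:.0f}s] {scene.description}"
        if scene.user_speech:
            line += f' | user said: "{scene.user_speech}"'
        lines.append(line)
    return "\n".join(lines)
```

The resulting string would then be passed to whatever generative model the device uses; that call is omitted here because the disclosure does not name a specific model or API.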

[0268] According to one embodiment of the present disclosure, at least one processor obtains reaction content corresponding to the first video through a generative model using the prompt.

[0269] The user characteristic information according to one embodiment of the present disclosure may include a plurality of pieces of generated content related to the user.

[0270] According to one embodiment of the present disclosure, the at least one processor may obtain the prompt further based on at least one piece of generated content that is selected, based on the degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information, as having a degree of correlation greater than or equal to a preset value.

[0271] According to one embodiment of the present disclosure, the degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information can be obtained as a weighted sum of a first similarity between the generated content and image data corresponding to the first video, a second similarity between the generated content and voice data corresponding to the first video, and a third similarity between the generated content and the user speech input.
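The correlation described in [0270] and [0271] can be sketched as a weighted combination of three similarities followed by a threshold test. The sketch below assumes that every item is already represented as an embedding vector and that similarity is cosine similarity; the weights, the default preset value, and all function names are illustrative assumptions, not values from the disclosure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def correlation(work_vec, image_vec, voice_vec, speech_vec,
                weights=(0.4, 0.3, 0.3)):
    """Weighted combination of the first, second, and third similarities."""
    w1, w2, w3 = weights  # assumed weights; the disclosure gives no values
    return (w1 * cosine(work_vec, image_vec)      # first similarity
            + w2 * cosine(work_vec, voice_vec)    # second similarity
            + w3 * cosine(work_vec, speech_vec))  # third similarity


def select_related_works(works, image_vec, voice_vec, speech_vec,
                         preset_value=0.5):
    """Return names of works whose correlation is >= the preset value."""
    return [name for name, vec in works.items()
            if correlation(vec, image_vec, voice_vec, speech_vec) >= preset_value]
```

Only the works returned by `select_related_works` would then contribute to the prompt.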

[0272] The user characteristic information according to one embodiment of the present disclosure may include preferred keywords.

[0273] According to one embodiment of the present disclosure, the at least one processor may obtain the prompt based further on at least one preferred keyword among the preferred keywords that is related to the content-related information.

[0274] According to one embodiment of the present disclosure, the at least one preferred keyword can be obtained if at least one of a fourth similarity between the user's preference information and image data corresponding to the first video, a fifth similarity between the user's preference information and voice data corresponding to the first video, or a sixth similarity between the user's preference information and the user speech input satisfies a second condition.
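A minimal sketch of the keyword selection in [0274] follows. The disclosure does not define the "second condition" in this passage, so the sketch assumes it means "similarity greater than or equal to a threshold"; the threshold value, the use of cosine similarity over embedding vectors, and the function names are likewise assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_preferred_keywords(keyword_vecs, image_vec, voice_vec, speech_vec,
                              threshold=0.6):
    """Keep a keyword if any of its three similarities meets the assumed condition."""
    selected = []
    for keyword, vec in keyword_vecs.items():
        sims = (
            cosine(vec, image_vec),   # fourth similarity
            cosine(vec, voice_vec),   # fifth similarity
            cosine(vec, speech_vec),  # sixth similarity
        )
        if any(s >= threshold for s in sims):
            selected.append(keyword)
    return selected
```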

[0275] According to one embodiment of the present disclosure, the at least one processor can obtain importance by time interval based on the content-related information and the preferred keywords included in the user characteristic information.

[0276] According to one embodiment of the present disclosure, the at least one processor can acquire the at least one scene including at least one frame included in image data corresponding to the first video, based on the importance of each time interval.

[0277] According to one embodiment of the present disclosure, the at least one processor can obtain a speech clip including a user speech input for each time interval corresponding to the at least one scene from the user speech input.

[0278] According to one embodiment of the present disclosure, if the user speech input exists in a predetermined time interval, the at least one processor may obtain the importance based on at least one of a first similarity between the user speech input and image data corresponding to the first video, a second similarity between the user speech input and voice data corresponding to the first video, a third similarity between the user's preference information and image data corresponding to the first video, or a fourth similarity between the user's preference information and voice data corresponding to the first video.

[0279] According to one embodiment of the present disclosure, the at least one processor may obtain the importance based on at least one of the third similarity or the fourth similarity if the user speech input does not exist in the predetermined time interval.
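The two cases in [0278] and [0279] can be sketched as one function that considers the speech-based similarities only when the user spoke during the interval. The disclosure says the importance is "based on at least one of" the similarities without giving a formula, so taking the maximum is an assumption made here for illustration, as are the vector representation and the function names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def interval_importance(image_vec, voice_vec, pref_vec, speech_vec=None):
    """Importance of one time interval; speech_vec is None if the user was silent."""
    sims = [
        cosine(pref_vec, image_vec),  # third similarity
        cosine(pref_vec, voice_vec),  # fourth similarity
    ]
    if speech_vec is not None:
        sims.append(cosine(speech_vec, image_vec))  # first similarity
        sims.append(cosine(speech_vec, voice_vec))  # second similarity
    # Assumed combiner: the strongest signal determines the importance.
    return max(sims)
```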

[0280]-[0281] According to one embodiment of the present disclosure, the at least one processor can adjust the time interval of the acquired user heart rate information to correspond to the frame time interval of the image data corresponding to the first video.

[0282] According to one embodiment of the present disclosure, the at least one processor can obtain a higher importance for each predetermined time interval as the heart rate value of the predetermined time interval increases, based on the adjusted user heart rate information.
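One way to picture the heart-rate handling is to bucket heart-rate samples into the video's time intervals and then map higher heart rates to higher importance. In the sketch below, the averaging per interval, the carry-forward of the last known value into empty intervals, and the min-max scaling are all assumptions; the disclosure only requires that the intervals be aligned and that importance increase with heart rate.

```python
def align_heart_rate(samples, interval_s, n_intervals):
    """Average (timestamp_s, bpm) samples per interval; carry the last value forward."""
    buckets = [[] for _ in range(n_intervals)]
    for t, bpm in samples:
        k = int(t // interval_s)
        if 0 <= k < n_intervals:
            buckets[k].append(bpm)
    aligned, last = [], None
    for bucket in buckets:
        if bucket:
            last = sum(bucket) / len(bucket)
        aligned.append(last)  # stays None until the first sample arrives
    return aligned


def heart_rate_importance(aligned):
    """Scale aligned heart rates to [0, 1]; a higher heart rate gives higher importance."""
    known = [v for v in aligned if v is not None]
    if not known:
        return [0.0] * len(aligned)
    lo, hi = min(known), max(known)
    span = hi - lo
    return [0.0 if v is None or span == 0 else (v - lo) / span for v in aligned]
```

For example, samples taken at 0.0 s, 0.5 s, 1.2 s, and 3.1 s with one-second intervals fall into intervals 0, 0, 1, and 3, and interval 2 reuses the value from interval 1.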

[0283] According to one embodiment of the present disclosure, the at least one processor can acquire a frame corresponding to the predetermined time interval as the at least one scene if the importance of each predetermined time interval is greater than or equal to a first threshold value.

[0284] According to one embodiment of the present disclosure, the at least one processor can acquire the at least one scene as either a single image or a set of continuous images if the importance of each predetermined time interval is greater than or equal to a second threshold value which is greater than the first threshold value.
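The two-threshold selection in [0283] and [0284] can be sketched as a simple classification of each interval's importance. The disclosure leaves open how the choice between a single image and a set of continuous images is made; the sketch assumes that intervals reaching only the first threshold yield a single image and intervals reaching the second threshold yield a set of continuous images. The labels and function name are hypothetical.

```python
def classify_intervals(importance, first_threshold, second_threshold):
    """Return (interval_index, scene_form) for every interval selected as a scene."""
    if second_threshold <= first_threshold:
        raise ValueError("second_threshold must be greater than first_threshold")
    selected = []
    for idx, score in enumerate(importance):
        if score >= second_threshold:
            # Assumed mapping: the most important intervals get a multi-frame scene.
            selected.append((idx, "continuous_images"))
        elif score >= first_threshold:
            selected.append((idx, "single_image"))
        # Intervals below the first threshold are not acquired as scenes.
    return selected
```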

[0285] According to one embodiment of the present disclosure, at least one processor can acquire image embedding data based on image data of the first video. The at least one processor can convert voice data of the first video into text and acquire voice-text embedding data for the converted text. The at least one processor can also identify the user's speech in the user speech input based on previously stored user voice information, convert the voice data including the identified user speech into text, and acquire speech-text embedding data for the converted text.

[0286] A method for generating reaction content according to one embodiment of the present disclosure comprises: a step of obtaining content-related information including image data corresponding to a first video and voice data corresponding to the first video; a step of obtaining at least one scene included in the first video and a user speech input corresponding to the at least one scene based on user characteristic information related to the content-related information; a step of obtaining a prompt based on the user characteristic information related to the content-related information, the at least one scene included in the first video, and the user speech input corresponding to the at least one scene; and a step of obtaining reaction content corresponding to the first video through a generative model using the prompt.

[0287] The step of obtaining the prompt according to one embodiment of the present disclosure may include obtaining the prompt further based on at least one piece of generated content that is selected, based on the degree of correlation between each of a plurality of pieces of generated content related to the user and the content-related information, as having a degree of correlation greater than or equal to a preset value.

[0288] According to one embodiment of the present disclosure, the degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information can be obtained as a weighted sum of a first similarity between the generated content and image data corresponding to the first video, a second similarity between the generated content and voice data corresponding to the first video, and a third similarity between the generated content and the user speech input.

[0289] The user characteristic information according to one embodiment of the present disclosure may include preferred keywords.

[0290] The step of obtaining the prompt according to one embodiment of the present disclosure may further include the step of obtaining the prompt based on at least one preferred keyword related to the content-related information among the preferred keywords.

[0291] According to one embodiment of the present disclosure, the at least one preferred keyword can be obtained if at least one of a fourth similarity between the user's preference information and image data corresponding to the first video, a fifth similarity between the user's preference information and voice data corresponding to the first video, or a sixth similarity between the user's preference information and the user speech input satisfies a second condition.

[0292] The step of acquiring reaction content corresponding to the first video according to one embodiment of the present disclosure may include: acquiring importance by time interval based on the content-related information and the preferred keywords included in the user characteristic information; acquiring at least one scene including at least one frame included in image data corresponding to the first video based on the importance by time interval; and acquiring, from the user speech input, a speech clip including the user speech input for each time interval corresponding to the at least one scene.

[0293] The step of acquiring reaction content corresponding to the first video according to one embodiment of the present disclosure may include: if the user speech input exists in a predetermined time interval, acquiring the importance based on at least one of a first similarity between the user speech input and image data corresponding to the first video, a second similarity between the user speech input and voice data corresponding to the first video, a third similarity between the user's preference information and image data corresponding to the first video, or a fourth similarity between the user's preference information and voice data corresponding to the first video; and if the user speech input does not exist in the predetermined time interval, acquiring the importance based on at least one of the third similarity or the fourth similarity.

[0294] The method according to one embodiment of the present disclosure may further include the step of adjusting the time interval of the acquired user heart rate information to correspond to the frame time interval of the image data corresponding to the first video.

[0295] The step of obtaining importance for each time interval according to one embodiment of the present disclosure may include, based on the adjusted user heart rate information, obtaining a higher importance for each time interval as the heart rate value of the predetermined time interval increases.

[0296] According to one embodiment of the present disclosure, the step of acquiring the at least one scene may include: acquiring a frame corresponding to the predetermined time interval as the at least one scene if the importance of each predetermined time interval is greater than or equal to a first threshold value; and acquiring the at least one scene as either a single image or a set of continuous images if the importance of each predetermined time interval is greater than or equal to a second threshold value which is greater than the first threshold value.

[0297] The step of acquiring at least one scene according to one embodiment of the present disclosure may further include, for a first frame acquired as the at least one scene, acquiring a scene including the first frame and a second frame corresponding to the previous time interval if the difference in importance between the time interval corresponding to the first frame and the previous time interval is less than a third threshold value. A method for generating reaction content according to one embodiment of the present disclosure may further include the step of converting the time series content information into a vector representation. This step may include: acquiring image embedding data based on the image data of the first video; converting the voice data of the first video into text and acquiring voice-text embedding data for the converted text; identifying the user's speech in the user speech data based on the previously stored user voice information; converting the voice data including the identified user speech into text; and acquiring speech-text embedding data for the converted text.
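The adjacent-frame rule at the start of [0297] can be sketched as follows: when a selected interval's importance differs from that of the preceding interval by less than the third threshold, the scene is extended back to include the preceding frame. Representing scenes as inclusive `(start, end)` interval indices, using the absolute difference, and merging overlapping scenes are assumptions made for this illustration.

```python
def group_scenes(importance, selected, third_threshold):
    """Group selected interval indices into scenes of inclusive (start, end) indices."""
    scenes = []
    for i in sorted(selected):
        start = i
        # Pull in the previous interval's frame when importance barely changed.
        if i > 0 and abs(importance[i] - importance[i - 1]) < third_threshold:
            start = i - 1
        if scenes and start <= scenes[-1][1]:
            # Overlaps the previous scene, so extend that scene instead.
            scenes[-1] = (scenes[-1][0], i)
        else:
            scenes.append((start, i))
    return scenes
```

With importance values `[0.1, 0.7, 0.75, 0.2, 0.9]`, selected intervals 1, 2, and 4, and a third threshold of 0.1, intervals 1 and 2 form one scene and interval 4 stands alone.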

[0298] A device-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' simply means that it is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily. For example, a 'non-transitory storage medium' may include a buffer in which data is stored temporarily.

[0299] According to one embodiment, the method according to the various embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., downloadable app) may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

Claims

1. A method for generating reaction content, the method comprising: obtaining content-related information including image data corresponding to a first video and voice data corresponding to the first video; obtaining, based on user characteristic information related to the content-related information, at least one scene included in the first video and a user voice input corresponding to the at least one scene; obtaining a prompt based on the user characteristic information related to the content-related information, the at least one scene included in the first video, and the user voice input corresponding to the at least one scene; and obtaining reaction content corresponding to the first video through a generative model by using the prompt.

2. The method of claim 1, wherein the user characteristic information includes a plurality of pieces of generated content related to the user, and wherein the obtaining of the prompt comprises obtaining the prompt further based on at least one piece of generated content that is selected, based on a degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information, as having a degree of correlation greater than or equal to a preset value.

3. The method of claim 2, wherein the degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information is obtained as a weighted sum of a first similarity between the generated content and the image data corresponding to the first video, a second similarity between the generated content and the voice data corresponding to the first video, and a third similarity between the generated content and the user voice input.

4. The method of any one of claims 1 to 3, wherein the user characteristic information includes preferred keywords, and wherein the obtaining of the prompt comprises obtaining the prompt further based on at least one preferred keyword, among the preferred keywords, that is related to the content-related information.

5. The method of claim 4, wherein the at least one preferred keyword is obtained when at least one of a fourth similarity between the user's preference information and the image data corresponding to the first video, a fifth similarity between the user's preference information and the voice data corresponding to the first video, or a sixth similarity between the user's preference information and the user voice input satisfies a second condition.

6. The method of any one of claims 1 to 5, wherein the obtaining of the reaction content corresponding to the first video comprises: obtaining importance for each time interval based on the content-related information and preferred keywords included in the user characteristic information; obtaining the at least one scene, which includes at least one frame included in the image data corresponding to the first video, based on the importance for each time interval; and obtaining, from the user voice input, a speech clip including the user voice input for each time interval corresponding to the at least one scene.

7. The method of claim 6, wherein the obtaining of the reaction content corresponding to the first video comprises: if the user voice input exists in a predetermined time interval, obtaining the importance based on at least one of a first similarity between the user voice input and the image data corresponding to the first video, a second similarity between the user voice input and the voice data corresponding to the first video, a third similarity between the user's preference information and the image data corresponding to the first video, or a fourth similarity between the user's preference information and the voice data corresponding to the first video; and if the user voice input does not exist in the predetermined time interval, obtaining the importance based on at least one of the third similarity or the fourth similarity.

8. The method of claim 6 or 7, further comprising adjusting a time interval of acquired user heart rate information to correspond to a frame time interval of the image data corresponding to the first video, wherein the obtaining of the importance for each time interval comprises obtaining, based on the adjusted user heart rate information, a higher importance for a predetermined time interval as the heart rate value of the predetermined time interval increases.

9. The method of any one of claims 6 to 8, wherein the acquiring of the at least one scene comprises: if the importance of a predetermined time interval is greater than or equal to a first threshold value, acquiring a frame corresponding to the predetermined time interval as the at least one scene; and if the importance of the predetermined time interval is greater than or equal to a second threshold value which is greater than the first threshold value, acquiring the at least one scene as either a single image or a set of continuous images.

10. The method of claim 9, wherein the acquiring of the at least one scene further comprises, for a first frame acquired as the at least one scene, acquiring a scene including the first frame and a second frame corresponding to the previous time interval if the difference in importance between the time interval corresponding to the first frame and the previous time interval is less than a third threshold value.

11. An electronic device comprising: at least one processor; and memory comprising one or more storage media storing one or more instructions, wherein the one or more instructions, when executed by the at least one processor individually or collectively, cause the electronic device to: obtain content-related information including image data corresponding to a first video and voice data corresponding to the first video; obtain, based on user characteristic information related to the content-related information, at least one scene included in the first video and a user voice input corresponding to the at least one scene; obtain a prompt based on the user characteristic information related to the content-related information, the at least one scene included in the first video, and the user voice input corresponding to the at least one scene; and obtain reaction content corresponding to the first video through a generative model by using the prompt.

12. The electronic device of claim 11, wherein the user characteristic information includes a plurality of pieces of generated content related to the user, and wherein the one or more instructions, when executed by the at least one processor individually or collectively, cause the electronic device to obtain the prompt further based on at least one piece of generated content that is selected, based on a degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information, as having a degree of correlation greater than or equal to a preset value.

13. The electronic device of claim 12, wherein the degree of correlation between each of the plurality of pieces of generated content related to the user and the content-related information is obtained as a weighted sum of a first similarity between the generated content and the image data corresponding to the first video, a second similarity between the generated content and the voice data corresponding to the first video, and a third similarity between the generated content and the user voice input.

14. The electronic device of any one of claims 11 to 13, wherein the user characteristic information includes preferred keywords, and wherein the one or more instructions, when executed by the at least one processor individually or collectively, cause the electronic device to obtain the prompt further based on at least one preferred keyword, among the preferred keywords, that is related to the content-related information.

15. A computer-readable recording medium having recorded thereon a program for performing, on a computer, the method of any one of claims 1 to 10.