system
The system addresses the lack of detailed recipe information in cooking videos by analyzing and captioning them, enhancing viewer experience and creator efficiency through automated editing suggestions.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTBANK GROUP CORP
- Filing Date
- 2024-12-13
- Publication Date
- 2026-06-25
AI Technical Summary
Cooking videos often lack detailed recipe information, making it inconvenient for viewers to follow along, and the process of video transcription and editing is time-consuming for creators.
A system that analyzes cooking videos frame by frame using image recognition to identify ingredients and procedures, generates captions, and provides editing suggestions to enhance video quality and efficiency.
Enables quick delivery of detailed recipe information to viewers and reduces the burden on creators by automating the transcription and editing process, resulting in high-quality video content.
Abstract
Description
Technical Field
[0001] The technology of the present disclosure relates to a system.
Background Art
[0002] Patent Document 1 discloses a persona chatbot control method performed by at least one processor, including steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] Cooking videos are visually attractive, but often lack detailed recipe information, causing inconvenience to viewers when they actually try to cook. Also, for video creators, while it is required to efficiently provide content while meeting the needs of viewers, video transcription and editing work takes a great deal of time and effort. The present invention aims to provide a new system that supports high-quality video creation while efficiently providing detailed recipe information to solve these problems.
Means for Solving the Problems
[0005] The present invention provides means for receiving and saving cooking videos and dividing the videos frame by frame. Next, it identifies the ingredients and procedures in the video using an image recognition algorithm and generates this as text data. Furthermore, it provides means for generating captions using the generated text data and creating visual and temporal editing suggestions. This makes it possible to transmit the generated text data, captions, and editing suggestions to a terminal, providing a system that can quickly and effectively deliver useful information to viewers and creators.
[0006] "Video" refers to a medium that records and plays back visual information in an electrical or digital format.
[0007] "Receiving" refers to the act of receiving specific data or signals via a network.
[0008] A "frame" refers to the individual still images that make up a video, and when played in sequence, they represent movement.
[0009] An "image recognition algorithm" refers to a computational method used to analyze digital images and extract or recognize specific information.
[0010] "Ingredients" refers to raw materials or elements used in cooking or production.
[0011] A "procedure" refers to a series of operations or actions performed sequentially to achieve a specific objective.
[0012] "Captions" are textual information added to video content, etc., that supplement visual or substantive information.
[0013] "Editing suggestions" refer to proposals or guidelines presented to improve the appearance and structure of video and audio content.
[0014] A "terminal" refers to an electronic device designed to process, display, or manipulate digital information.
Brief Description of the Drawings
[0015] [Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the essential functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map where multiple emotions are mapped.
[Figure 10] This shows an emotion map where multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which combines an emotion engine.
Embodiments for Carrying Out the Invention
[0016] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.
[0017] First, the terms used in the following description will be explained.
[0018] In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.
[0019] In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and is used as a work memory by the processor.
[0020] In the following embodiments, storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.
[0021] In the following embodiments, a communication interface (I/F) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).
[0022] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, or B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when expressing three or more things linked by "and/or."
[0023] [First Embodiment]
[0024] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.
[0025] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.
[0026] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0027] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.
[0028] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.
[0029] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.
[0030] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.
[0031] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.
[0032] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.
[0033] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0034] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with a specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0035] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".
[0036] This invention is a system that effectively analyzes cooking videos and provides convenient recipe information and editing support for both viewers and creators. Specific embodiments of the system are described below.
[0037] First, users upload cooking videos they have filmed themselves to the platform. This process requires the user's device to select and submit the video.
[0038] Next, the server receives the video and temporarily stores it. The video is then split frame by frame for analysis. This allows for detailed analysis of each frame as image data.
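The frame selection described above can be sketched as follows. This is a minimal illustration, assuming frames are sampled at a fixed time interval rather than analyzed exhaustively; the function names and the default one-second interval are hypothetical and do not appear in the disclosure. Decoding the selected frames into image data would be handled by a video library, which is not shown.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 1.0) -> List[int]:
    """Return the indices of the frames to analyze, roughly one every `interval_s` seconds."""
    if total_frames < 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval_s must be > 0")
    # Number of source frames between two analyzed frames (never less than 1).
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


def frame_timestamp(index: int, fps: float) -> float:
    """Convert a frame index back to its position on the video timeline, in seconds."""
    return index / fps
```

Keeping the timestamp alongside each sampled index lets later stages place their results back on the video's timeline.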
[0039] The server uses an AI image recognition algorithm to identify ingredients and cooking steps within video frames. This analysis generates text data containing ingredient names, cooking progress, and other relevant information.
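One possible shape for the generated text data is sketched below. The `FrameResult` record and both helper functions are assumptions made for illustration; the disclosure does not specify a data layout. The recognition model itself is outside the sketch, which only shows how per-frame results might be consolidated.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FrameResult:
    """Hypothetical per-frame output of the image recognition step."""
    time_s: float                                   # position of the frame on the timeline
    ingredients: List[str] = field(default_factory=list)
    step: str = ""                                  # e.g. "boiling the pasta"; empty if none recognized


def collect_ingredients(results: List[FrameResult]) -> List[str]:
    """List each recognized ingredient once (lowercased), in order of first appearance."""
    seen = set()
    ordered = []
    for result in results:
        for name in result.ingredients:
            key = name.strip().lower()
            if key and key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


def merge_steps(results: List[FrameResult]) -> List[Tuple[float, float, str]]:
    """Collapse consecutive frames showing the same step into (start_s, end_s, step) runs.

    Frames with no recognized step are skipped, so they do not break a run.
    """
    runs: List[Tuple[float, float, str]] = []
    for result in sorted(results, key=lambda r: r.time_s):
        if not result.step:
            continue
        if runs and runs[-1][2] == result.step:
            runs[-1] = (runs[-1][0], result.time_s, result.step)
        else:
            runs.append((result.time_s, result.time_s, result.step))
    return runs
```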
[0040] Based on the generated text data, the server automatically creates captions corresponding to the video content. The captions are assigned along the video's timeline, allowing viewers to instantly access information that matches the video.
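As an illustration of assigning captions along the timeline, the sketch below renders timed cues as SubRip (SRT) text. The choice of SRT is an assumption; the disclosure does not name a caption format, and the function names are hypothetical.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: List[Tuple[float, float, str]]) -> str:
    """Render (start_s, end_s, text) cues as SRT text, numbered in order of start time."""
    blocks = []
    for number, (start, end, text) in enumerate(sorted(cues), start=1):
        blocks.append(f"{number}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

A cue list such as `[(0.0, 4.0, "Boil the pasta")]` yields a single numbered block that most video players can load as a subtitle track.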
[0041] Furthermore, the server provides visual editing suggestions. These suggestions include things like scene transitions, shortening redundant parts, and highlighting important scenes, aiming to improve the overall quality of the video.
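The kinds of suggestions listed above can be illustrated with a simple rule-based stand-in. This is not the method of the disclosure, which does not state how suggestions are derived; the 20-second threshold, the `key_steps` parameter, and the message wording are all assumptions.

```python
from typing import FrozenSet, List, Tuple


def suggest_edits(
    segments: List[Tuple[float, float, str]],
    max_len_s: float = 20.0,
    key_steps: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Produce plain-text editing suggestions from (start_s, end_s, step) segments."""
    suggestions = []
    for start, end, step in segments:
        length = end - start
        # A long, unchanging segment is treated as a candidate for shortening.
        if length > max_len_s:
            suggestions.append(
                f"Shorten '{step}' ({start:.0f}s-{end:.0f}s): "
                f"runs {length:.0f}s, over the {max_len_s:.0f}s limit"
            )
        # Steps marked as important are proposed for emphasis.
        if step in key_steps:
            suggestions.append(f"Highlight '{step}' at {start:.0f}s")
    return suggestions
```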
[0042] Finally, the user, having received the provided text data and editing suggestions, performs the final editing of the video based on that information. After adjustments, the user can publish the video on the platform.
[0043] As a concrete example, consider a case where a user uploads a video of a pasta dish. The server recognizes ingredients such as tomatoes and pasta from the frames and creates captions for steps such as "boiling the pasta" and "mixing the sauce." Furthermore, it makes editing suggestions such as "shorten the boiling time" and "emphasize the scene of stirring the cream sauce." As a result, the user can complete a high-quality video with detailed recipe information in a short amount of time.
[0044] In this way, the system reduces the burden on users while enabling the provision of user-friendly content for viewers.
[0045] The following describes the processing flow.
[0046] Step 1:
[0047] Users upload cooking videos to the platform. Through the device interface, users select video files and press the upload button, sending the video data to the server.
[0048] Step 2:
[0049] The server checks the received video and saves it to temporary storage. It verifies the format and size of the video file to ensure it complies with the platform's standards.
[0050] Step 3:
[0051] The server divides the video into frames for analysis. Based on the frame rate, it extracts frames at intervals that allow for efficient and appropriate analysis.
[0052] Step 4:
[0053] The server uses an AI model to identify ingredients and cooking steps from image data within a frame. In this process, ingredients, tools, and cooking steps are recognized as text, and the data is structured.
[0054] Step 5:
[0055] The server generates captions from the obtained ingredient and procedure data. Based on the text data, it generates easy-to-understand explanatory text for viewers and adds captions to the video along the timeline.
[0056] Step 6:
[0057] The server provides video editing suggestions. Based on the analysis results, it generates scene cuts and improvement suggestions, and offers ideas to highlight visual effects and key points.
[0058] Step 7:
[0059] The device displays text information, captions, and editing suggestions received from the server to the user. The user can then use this information to make final edits to the video.
[0060] Step 8:
[0061] The user publishes their edited video on the platform. After a final review and any necessary adjustments, it is made available for other users to view.
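The format and size check in Step 2 could look like the following sketch. The allowed extensions and the 500 MiB limit are placeholder values, since the platform's standards are not specified; a production check would also inspect the file contents rather than trusting the extension.

```python
from pathlib import Path
from typing import List

# Assumed platform limits; the source does not specify formats or sizes.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm"}
MAX_SIZE_BYTES = 500 * 1024 * 1024  # 500 MiB


def validate_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the upload is accepted."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds size limit")
    return problems
```

Returning every problem at once, instead of stopping at the first, lets the terminal show the user a complete list of what to fix before re-uploading.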
[0062] (Example 1)
[0063] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0064] In recent years, with the increase in video content, it has become difficult for viewers to efficiently obtain information. Furthermore, creators are required to dedicate a significant amount of time and effort to editing. To address these issues, there is a need for the creation of video content that is convenient for both viewers and creators by automatically analyzing ingredients and cooking procedures and providing visual editing suggestions.
[0065] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0066] In this invention, the server includes means for receiving and storing video, means for dividing the video into individual images for analysis, and means for identifying materials and work procedures from the content within the video using an image recognition method and generating text information. This enables the rapid and accurate acquisition of material and procedure information for video content uploaded by users, and further enables efficient video creation through editing suggestions.
[0067] "Video" refers to visual information expressed in digital or analog format.
[0068] "Receiving" refers to the act of receiving data via a digital network or other means of communication.
[0069] "Storage" refers to the act of saving specific data in a way that allows for later access.
[0070] "Image" refers to individual still frames or their digital representations that constitute visual information.
[0071] "Partitioning" refers to the process of dividing continuous information into individual units.
[0072] "Image recognition techniques" refer to the process by which computers identify and interpret objects, features, and patterns within an image.
[0073] "Content" refers to the visual elements and their meanings contained within a video or image.
[0074] "Materials" refers to elements or items used that are specifically identified within the video.
[0075] A "work procedure" refers to a series of steps or processes performed to achieve a specific objective.
[0076] "Identification" refers to the act of identifying and evaluating specific properties or elements.
[0077] "Textual information" refers to visual or auditory data expressed in the form of characters or text.
[0078] A "title" refers to a short sentence or phrase used to describe or supplement the content of a video.
[0079] The term "timeline" refers to a reference axis that indicates the flow or sequence of events over time.
[0080] A "scene" refers to a unit of composition in visual media, such as a shot or sequence.
[0081] This invention is a system that extracts material information and procedures from video content and provides efficient editing support. Specifically, a server receives video uploaded by a user and stores it temporarily. A server computer capable of high-speed data processing is preferred as the hardware to be used. The server divides the video into frames as a preliminary step to analysis. This process utilizes video editing software and image conversion libraries to achieve rapid frame separation.
[0082] Next, the server uses a generative AI model to analyze the image data of each frame. This model identifies the materials and work procedures within the video through image recognition techniques and generates textual information. This allows the user to obtain any individual material or procedure in text form. The generative AI model can be given prompts such as "What are the materials in the video?" to extract this information.
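The prompt-based extraction can be sketched as below. Only the question "What are the materials in the video?" comes from the text; the request for a JSON reply, the key names, and both function names are assumptions. The call to the generative AI model is deliberately omitted, because the disclosure does not identify a model or API.

```python
import json
from typing import Dict, List

PROMPT_TEMPLATE = (
    "What are the materials in the video? Also list the work procedures in order.\n"
    'Answer as JSON with the keys "materials" and "procedures".\n'
    "Frame timestamp: {time_s:.1f} s"
)


def build_prompt(time_s: float) -> str:
    """Fill in the prompt that would accompany one frame image."""
    return PROMPT_TEMPLATE.format(time_s=time_s)


def _as_str_list(value) -> List[str]:
    """Accept only a JSON array; anything else becomes an empty list."""
    return [str(item) for item in value] if isinstance(value, list) else []


def parse_reply(reply: str) -> Dict[str, List[str]]:
    """Parse the model's JSON reply, falling back to empty lists if it is malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {
        "materials": _as_str_list(data.get("materials")),
        "procedures": _as_str_list(data.get("procedures")),
    }
```

Parsing defensively matters here because a generative model may return free text instead of the requested structure.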
[0083] Based on the generated text information, the server automatically creates titles and places them on the timeline to match specific scenes in the video. This process is performed by software equipped with a natural language processing engine. Furthermore, it can also provide visual editing suggestions, offering collaborative editing support to increase creator efficiency. These editing suggestions include guidance on visual effects, such as highlighting important video scenes.
[0084] Ultimately, the terminal receives text information and suggestions from the server, and the user uses this to edit the final video. Through this system, users can quickly create and publish high-quality video content. The aim is to improve the convenience of digital video production and provide a valuable experience for both viewers and creators.
[0085] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0086] Step 1:
[0087] Users upload cooking videos to the platform using their devices. Specifically, the user selects a video file, and the device sends that file to the server. The input is the video file selected by the user, and the output is the transfer of the video file to the server.
[0088] Step 2:
[0089] The server receives the uploaded video and stores it temporarily. Specifically, the received video data is saved to the server's storage. The input is the video data sent from the terminal, and the output is the video data stored in the server's storage.
[0090] Step 3:
[0091] The server divides the stored video into individual frames. Specifically, it divides the video into time intervals to generate still frames. This process uses an image processing library to convert continuous data into individual images. The input is the stored video data, and the output is individual frame image data.
[0092] Step 4:
[0093] The server analyzes each frame using a generative AI model. Specifically, it uses an image recognition algorithm to identify materials and work procedures within the frame and generates text information. This analysis outputs material names and operation procedures as text. The input is frame image data, and the output is text information including materials and procedures.
[0094] Step 5:
[0095] The server generates titles based on the text information and places them according to the scenes in the video. A natural language processing engine is used to create titles describing the video from the text data and synchronize them with the video timeline. The input is text data, and the output is titles aligned with the video timeline.
[0096] Step 6:
[0097] The server generates editing suggestions that include visual effects. Specifically, it uses a generative AI model to analyze the results and make suggestions such as highlighting important scenes in the video. The input is the analyzed video information, and the output is editing instruction data that includes suggestions for visual effects.
[0098] Step 7:
[0099] The terminal receives text information and editing suggestions from the server, and the user performs the final video editing. Specifically, the user makes necessary corrections and edits using the provided information. The input is the text information and editing suggestions sent from the server, and the output is the final edited video file.
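The data handed from the server to the terminal in Step 7 could be modeled as a single serializable message. The sketch below assumes JSON as the wire format; the class names, field names, and `video_id` are hypothetical and are not taken from the disclosure.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Title:
    """One title placed on the video timeline."""
    start_s: float
    end_s: float
    text: str


@dataclass
class AnalysisPayload:
    """Hypothetical message sent from the server to the terminal."""
    video_id: str
    materials: List[str] = field(default_factory=list)
    titles: List[Title] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts the nested Title objects to plain dicts as well.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, raw: str) -> "AnalysisPayload":
        data = json.loads(raw)
        data["titles"] = [Title(**t) for t in data.get("titles", [])]
        return cls(**data)
```

Setting `ensure_ascii=False` keeps non-ASCII title text, such as Japanese, readable in the serialized message.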
[0100] (Application Example 1)
[0101] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0102] While video content has become widespread, and cooking-related videos are increasing, it's not easy for viewers to efficiently understand and practice cooking methods while watching videos. In particular, repeated viewing is necessary to grasp specific ingredients and procedures, which presents a significant time and cognitive burden for users. Creators are also seeking support to efficiently edit engaging videos.
[0103] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0104] In this invention, the server includes means for receiving and storing video; means for dividing the video frame by frame for analysis; means for identifying ingredients and procedures from the video using an image recognition algorithm and generating text data; means for generating visual and temporal editing suggestions; means for transmitting the generated text data, captions, and editing suggestions to a mobile device; and means for visualizing the video analysis results in real time, enabling users to understand the cooking method in accordance with the video. This allows viewers to efficiently grasp cooking procedures as they watch the video, and enables creators to provide engaging content by supporting high-quality video editing.
[0105] "Means for receiving and storing video" refers to a device or system that receives video data transmitted via the Internet or other means of communication and stores it temporarily or permanently in a physical or virtual storage device.
[0106] "Means for dividing the video frame by frame for analysis" refers to the process of dividing a received video into multiple frames, which are individual still images, and converting the image data of each frame into a format that can be analyzed independently.
[0107] "Means for identifying ingredients and procedures from the video using an image recognition algorithm and generating text data" refers to a technology or method that applies image recognition technology such as AI to frames extracted from a video to automatically extract information such as ingredients and cooking procedures, and convert that information into text.
[0108] "Means for generating visual and temporal editing suggestions" refers to a function that analyzes the entire video visually and temporally and suggests editing options to maximize visual effects and scene flow.
[0109] "Means for transmitting generated text data, captions, and editing suggestions to a mobile device" refers to a method of transmitting information such as character data, captions, and editing suggestions generated on a server to a portable device such as a smartphone used by a user, enabling display or editing assistance.
[0110] "Means for visualizing the video analysis results in real time, enabling users to understand the cooking method in accordance with the video" refers to a function that displays information on cooking procedures and ingredients obtained through video analysis, overlaid visually while the video is being watched, allowing users to understand and practice the content in a way that matches the video.
[0111] This invention is a system that analyzes cooking videos to help viewers efficiently understand cooking methods and to support creators in providing high-quality content. The system is configured as follows:
[0112] The server first receives cooking videos sent by users and saves them to its storage device. These saved videos are then divided into frames. Each frame is analyzed using an AI-based image recognition algorithm to identify the ingredients and cooking steps within the video. The identified information is generated as text data and later used to create captions. Specifically, Python and TensorFlow® are used to perform image recognition, identify ingredients, and extract cooking steps.
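The step from model output to ingredient names can be sketched in plain Python. The sketch assumes a multi-label classifier that emits one raw score (logit) per known ingredient for each frame; that model design, the 0.5 threshold, and the function names are assumptions, and the TensorFlow model itself is not shown.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def detect_ingredients(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs at or above `threshold`, most confident first."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    scored = [(label, sigmoid(score)) for label, score in zip(labels, logits)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Independent per-label probabilities are used instead of a softmax because a single frame can show several ingredients at once.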
[0113] Furthermore, the server automatically generates suggestions for visual and temporal editing of the entire video based on the analysis results, including suggestions for visual effects, extraction of important scenes, and shortening of redundant parts. These editing suggestions are sent to mobile devices such as smartphones via React Native. This allows users to watch the video while visualizing the analysis results in real time, enabling them to efficiently grasp the overall picture of the cooking process.
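On the viewing side, showing analysis results in step with playback amounts to looking up the cue that is active at the current playback time. The sketch below is one way to do that with a binary search; it assumes cues do not overlap, and the `CaptionTrack` class is hypothetical. It is written in Python for consistency with the rest of the disclosure, although a mobile client would implement the same lookup in its own language.

```python
import bisect
from typing import List, Optional, Tuple


class CaptionTrack:
    """Timed cues that can be queried by playback position."""

    def __init__(self, cues: List[Tuple[float, float, str]]):
        # Cues are (start_s, end_s, text) and are assumed not to overlap.
        self._cues = sorted(cues)
        self._starts = [cue[0] for cue in self._cues]

    def active_at(self, time_s: float) -> Optional[str]:
        """Return the text of the cue covering `time_s`, or None if there is none."""
        # Index of the last cue whose start time is <= time_s.
        i = bisect.bisect_right(self._starts, time_s) - 1
        if i < 0:
            return None
        _start, end, text = self._cues[i]
        return text if time_s < end else None
```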
[0114] As a concrete example, suppose a user uploads a video titled "How to Make Omurice" to the system. The system analyzes the video, identifies steps such as "cracking the eggs" and "frying the rice," and adds these as captions. It also makes editing suggestions, such as "emphasizing the key points for making the eggs fluffy." Furthermore, as an example of a prompt, the AI can be asked questions such as, "What ingredients are used in this video? Also, please tell me the specific cooking steps."
[0115] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0116] Step 1:
[0117] The user selects a cooking video on their device and uploads it to the server. The input is the cooking video file provided from the device. The output is the video file transferred to the server and stored there. This process prepares the video for subsequent analysis.
[0118] Step 2:
[0119] The server divides the received video into frames. The input is a video file stored on the server. The output is the video decomposed into a collection of still image frames. This process allows for individual analysis of each frame.
[0120] Step 3:
[0121] The server applies an image recognition algorithm to identify ingredients and cooking steps within each frame and extract text data. Still image frames are provided as input. Text data containing ingredient names and cooking steps is obtained as output. An AI model using Python and TensorFlow extracts features from the images and identifies the information.
[0122] Step 4:
[0123] The server generates visual and temporal captions and editing suggestions based on extracted text data. Text data regarding ingredients and procedures is used as input. The output includes visual caption information and video editing suggestions. This allows the captions to be displayed in sync with the video's flow, and the editing suggestions aim to improve the content's quality.
[0124] Step 5:
[0125] The server sends the generated captions and editing suggestions to the terminal and provides them to the user. The generated captions and editing suggestions are used as input. This information is delivered to the user's terminal as output. The user can efficiently understand the video content through real-time visualization on their terminal.
[0126] Step 6:
[0127] The user uses their device to perform the final editing of the video based on the provided captions and editing suggestions. The captions and editing suggestions displayed on the device are used as input. The output is the final edited video, ready to be published on the platform. At this stage, the user determines the final form of the video according to their own needs and prepares it for distribution to viewers.
[0128] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.
[0129] This invention combines an emotion engine with a system that analyzes cooking videos to provide detailed recipe information, enabling the adjustment of editing and caption generation based on the user's emotional state. Specific embodiments of the system are described below.
[0130] First, users upload cooking videos to the platform. The uploaded video files are received by the server and temporarily stored. The server then divides the video frame by frame and converts it into a data format that allows for detailed analysis.
[0131] Next, the server uses an image recognition algorithm to identify the ingredients and cooking steps in each frame. As a result of this process, text data about the ingredients and steps is generated.
[0132] This system incorporates an emotion engine that analyzes the user's emotional state. The device analyzes the user's voice tone and facial expression data to determine their current emotional state. For example, it collects information such as whether the user is happy, confused, or depressed.
[0133] The server uses the emotion engine's judgment to provide feedback on video caption generation and editing suggestions. As a result, the content and tone of the captions reflect the user's emotions, and different approaches are offered in editing suggestions based on those emotions. For example, if the server determines that the user is enjoying the video, it can create captions that emphasize the humor in the video.
[0134] Furthermore, the emotion engine can predict viewers' emotional reactions and suggest improvements to the video after its release. This allows for adjustments to make the video more enjoyable for viewers.
[0135] As a concrete example, suppose a user uploads a video of themselves making desserts and explains the process in a cheerful mood. The emotion engine detects this emotion and reflects it in the caption and editing suggestions. For example, a positive-toned caption such as "How to make cookies that everyone can enjoy making together" is automatically generated. In this way, the system can provide content that appeals to viewers while taking the user's emotions into consideration.
[0136] This system allows users to edit videos with emotional depth and deliver recipe videos that are tailored to maximize viewer enjoyment.
[0137] The following describes the processing flow.
[0138] Step 1:
[0139] Users film cooking videos and upload them to the platform. They use their device's interface to select and submit the video files to be uploaded.
[0140] Step 2:
[0141] The server receives the uploaded video and saves it to temporary storage. Simultaneously with saving, it verifies the file format and size and prepares it for analysis.
[0142] Step 3:
[0143] The server divides the video into frames. In this step, the video is extracted as still images at a constant frame rate for efficient analysis.
[0144] Step 4:
[0145] The server uses an image recognition algorithm to identify the ingredients and cooking steps within the frame. This algorithm then extracts the objects and their actions in the video as text data.
[0146] Step 5:
[0147] The device analyzes the user's voice and facial expressions using an emotion engine. It evaluates the user's emotional state during shooting in real time or retrospectively and sends the information to the server.
[0148] Step 6:
[0149] The server generates captions based on the received text data and the results from the emotion engine. The generated captions reflect the user's emotions in both content and tone, and are organized chronologically.
[0150] Step 7:
[0151] The server provides visual and temporal editing suggestions for the video. This includes leveraging emotion engine data to tailor edits to the user's emotional state. For example, it may include editing instructions that highlight enjoyable parts.
[0152] Step 8:
[0153] The device provides the user with captions and editing suggestions received from the server. The user then makes final adjustments to the video based on this information, and reviews and corrects the captions and editing content.
[0154] Step 9:
[0155] Users review the edited videos and publish them on the platform. After final review and adjustments, the videos are distributed to viewers. This ensures that personalized and emotionally resonant cooking videos are delivered to audiences.
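As a non-limiting illustration of Step 7, the following minimal sketch shows one way editing suggestions that highlight enjoyable parts could be derived from emotion engine output. The `Segment` structure, the `enjoyment` score in the range 0 to 1, and the threshold of 0.7 are assumptions introduced for illustration only and are not specified in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float    # segment start time in seconds
    end_s: float      # segment end time in seconds
    enjoyment: float  # hypothetical emotion engine score in [0, 1]


def suggest_highlights(segments, threshold=0.7):
    """Return editing suggestions for segments whose enjoyment score meets the threshold."""
    suggestions = []
    for seg in segments:
        if seg.enjoyment >= threshold:
            suggestions.append(
                f"Highlight {seg.start_s:.1f}s-{seg.end_s:.1f}s "
                f"(enjoyment {seg.enjoyment:.2f})"
            )
    return suggestions


# Illustrative data only; real scores would come from the emotion engine.
segments = [
    Segment(0.0, 12.5, 0.35),
    Segment(12.5, 30.0, 0.82),
    Segment(30.0, 41.0, 0.91),
]
highlights = suggest_highlights(segments)
```

In this sketch the threshold is a tunable parameter, so that a platform could be stricter or more lenient about which scenes are proposed for emphasis.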
[0156] (Example 2)
[0157] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."
[0158] Existing video analysis systems lack the ability to generate captions or suggest edits based on the user's emotional state, making it difficult to adjust content to capture viewers' interest. This is particularly problematic for content such as cooking videos, where the inability to edit and caption in a way that considers the emotions of viewers and users makes it difficult to create engaging content.
[0159] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0160] In this invention, the server includes means for receiving video and storing it in an information recording device; means for dividing the video into electrical signals for analysis; means for identifying objects and operating procedures in the video and generating text information using image recognition processing; means for determining the emotional state of the user using a user reaction analysis device and generating text information and editing suggestions based on the determination result; and means for transmitting the generated text information, captions, and editing suggestions to an information processing device. This makes it possible to generate captions and suggest edits that correspond to the emotional state of the user, and to provide content that appeals to viewers.
[0161] "Video" refers to digital media that expresses movement through the continuous display of images.
[0162] An "information recording device" is a storage device for electronically storing received data.
[0163] An "electrical signal" is a signal that represents the flow of digital or analog data that makes up video or audio.
[0164] "Image recognition processing" is a technology that identifies specific visual elements from digital image data and analyzes them.
[0165] "Object" refers to an object, material, or element within an image that is the subject of identification and analysis.
[0166] An "operating procedure" is a series of actions or steps performed to achieve a specific objective.
[0167] "Text information" refers to text-based information generated through image recognition processing.
[0168] A "user reaction analysis device" is a device used to measure and analyze the emotions and reactions of users.
[0169] "Emotional state" refers to the emotional reactions and psychological state exhibited by the user.
[0170] An "editing suggestion" is a proposed editing option or method to make the video content more effective.
[0171] An "information processing device" is a computer used to process data and output analysis results or suggestions.
[0172] This invention is a system in which users upload cooking videos, and the system analyzes those videos to provide detailed recipe information. Furthermore, by incorporating an emotion engine, this system can adjust editing and caption generation based on the user's emotional state.
[0173] Users upload cooking videos to the platform using their own devices. These videos are received by the server and temporarily stored in an information recording device. The server then uses video analysis software such as OpenCV to convert the video into electrical signals frame by frame. This conversion process prepares each frame to be analyzed as individual image data.
[0174] The server uses image recognition technology to analyze each frame of the image and identify the object and the operating procedure. Machine learning frameworks such as TensorFlow and PyTorch are used for this analysis, and the identified information is output as text data. This generates specific text data about the materials and procedures.
[0175] The device also acquires the user's voice tone and facial expressions, and uses a user reaction analysis device to determine their emotional state. This analysis utilizes a voice recognition API and facial recognition software. This allows for accurate measurement of the user's emotional state, such as whether they are enjoying themselves or feeling confused.
[0176] The server utilizes an emotion engine to generate text information and editing suggestions based on the analysis results. For example, if the emotion analysis determines that the user is enjoying the content, it can generate a caption with a positive tone using the GPT AI model. This allows for the provision of more engaging content to viewers.
[0177] For example, if a user is smiling and explaining a "dessert making video," the caption will automatically generate text that emphasizes a fun atmosphere, such as "How to make cookies that everyone can enjoy making together." In this way, the system reflects the user's emotions in real time and ultimately provides videos that resonate with viewers.
[0178] An example of a prompt might be, "Generate a humorous caption that fits a video of making desserts. The user is enjoying themselves." Following this prompt, the generative AI model can create the necessary captions and editing suggestions.
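As a non-limiting illustration, a prompt of the kind described above could be assembled from the video topic and the estimated emotional state as in the following minimal sketch. The function name `build_caption_prompt`, the emotion labels, and the fallback wording are assumptions for illustration. The call to the generative AI model itself is omitted because no particular API is specified in this disclosure.

```python
def build_caption_prompt(video_topic, emotion, tone="humorous"):
    """Assemble a caption-generation prompt from the video topic and the estimated emotion."""
    # Hypothetical mapping from emotion labels to a sentence describing the user.
    emotion_sentences = {
        "enjoying": "The user is enjoying themselves.",
        "confused": "The user seems confused.",
        "depressed": "The user seems depressed.",
    }
    # Fall back to a neutral statement for labels outside the table.
    emotion_sentence = emotion_sentences.get(
        emotion, "The user's emotional state is unknown."
    )
    return (
        f"Generate a {tone} caption that fits a video of {video_topic}. "
        f"{emotion_sentence}"
    )


prompt = build_caption_prompt("making desserts", "enjoying")
```

With the arguments shown, the sketch reproduces the example prompt given above.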
[0179] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0180] Step 1:
[0181] Users upload cooking videos from their devices to the platform. The cooking video file is received by the server as input, and the server saves this video to an information recording device. Once saved, the video becomes available for subsequent analysis processes.
[0182] Step 2:
[0183] The server converts the stored video into electrical signals frame by frame for analysis. This process uses video analysis software, which divides the input video data into individual image data for output. This allows for analysis of each individual image.
[0184] Step 3:
[0185] The server analyzes each frame using image recognition processing to identify the object and the operating procedure. The input for this step is image data for each frame, and the output is textual information about the ingredients and cooking procedure. For this purpose, a machine learning framework is used to identify and digitize specific visual elements.
[0186] Step 4:
[0187] The terminal collects the user's voice tone and facial expressions as input data and uses a user reaction analysis device to determine their emotional state. The output here is data on the emotional state shown by the user. This analysis is performed using a voice recognition API and facial recognition software, and the user's emotions are expressed as a numerical value or a category.
[0188] Step 5:
[0189] The server uses an emotion engine to generate text information and editing suggestions that reflect the emotional state. The input to this process is the user's emotional state data and the text information generated in Step 3, and the output is a caption and editing suggestions corresponding to the emotion. A generative AI model such as GPT is used, for example, to create the emotion-based captions.
[0190] Step 6:
[0191] The server sends the generated captions and editing suggestions to the information processing device. The input is the generated caption data and editing suggestion data, which are transmitted to the information processing device as output. This prepares the content for the audience to enjoy.
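As a non-limiting illustration of Step 4, in which the user's emotion is expressed as a numerical value or a category, the following minimal sketch blends per-label scores from voice analysis and facial expression analysis and selects a category. The score dictionaries, the equal weighting, and the confidence floor of 0.5 are assumptions for illustration. The voice recognition API and facial recognition software that would produce such scores are not modeled here.

```python
def combine_scores(voice_scores, face_scores, voice_weight=0.5):
    """Blend per-label scores from voice and face analysis with a weighted average."""
    labels = set(voice_scores) | set(face_scores)
    return {
        label: voice_weight * voice_scores.get(label, 0.0)
        + (1.0 - voice_weight) * face_scores.get(label, 0.0)
        for label in labels
    }


def categorize_emotion(scores, min_confidence=0.5):
    """Pick the highest-scoring label, or 'neutral' if no score is confident enough."""
    if not scores:
        return "neutral", 0.0
    label = max(scores, key=scores.get)
    value = scores[label]
    if value < min_confidence:
        return "neutral", value
    return label, value


# Illustrative scores only; real values would come from the analysis software.
voice = {"happy": 0.8, "confused": 0.1}
face = {"happy": 0.6, "confused": 0.3}
label, value = categorize_emotion(combine_scores(voice, face))
```

Returning both the category and its numerical value lets the server use either form when generating captions in Step 5.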
[0192] (Application Example 2)
[0193] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".
[0194] In cooking videos, there is a growing demand for dynamic editing that responds to the user's emotional state, providing viewers with more engaging and personalized content. However, conventional systems have been insufficient in predicting viewers' emotional reactions and improving videos, or in generating captions that take user emotions into account. Therefore, an improvement in the user experience is desired.
[0195] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0196] In this invention, the server includes means for receiving and storing video, means for dividing the video into frames for analysis, and means for identifying substances and processes from the video using an image recognition algorithm and generating textual information. This makes it possible to analyze the user's emotional state and adjust captions and editing suggestions based on those emotions, to predict the viewer's emotional response and make improvement suggestions, and to send prompt sentences to a generative AI model to obtain improvement suggestions.
[0197] "Means for receiving and saving video" refers to a device or process that receives video data provided from an external source and stores it on a storage medium.
[0198] "Methods for dividing into frames" refer to techniques for dividing temporally continuous video footage into individual still images for the purpose of analyzing video data.
[0199] An "image recognition algorithm" is a program or method that automatically identifies and interprets specific objects or situations from image data.
[0200] "Means for identifying materials and processes and generating textual information" refers to a system that, based on image recognition results, transcribes the materials used and procedures performed within a video into text and records them in digital format.
[0201] An "emotion analysis engine that analyzes the emotional state of users" is a technology that infers the user's psychological state at a given time from the nuances of their facial expressions and voice.
[0202] "A method for sending prompt sentences to a generative AI model and obtaining improvement suggestions" refers to a method that uses natural language processing technology to provide input to an artificial intelligence system that generates responses to requests, and to obtain suggestions for improvement.
[0203] "Methods for predicting viewer emotional responses and proposing improvements" refers to a process of estimating emotional responses from viewer behavior and feedback, and then proposing specific changes to improve the quality of content based on those estimates.
[0204] This invention begins with a user uploading a cooking video to the platform. The server receives this video and stores it temporarily. The video is then divided into frames for analysis. Software such as OpenCV or FFmpeg is often used for this process.
[0205] Next, the server applies an image recognition algorithm to identify the substance and process from each frame. The data obtained through this process is stored in text format. At this stage, machine learning frameworks such as TensorFlow and PyTorch are used.
[0206] Furthermore, the emotion analysis engine determines the user's emotional state based on the voice and facial expression data provided by the user's device. This analysis utilizes Google® Cloud Speech-to-Text API and Microsoft® Azure® Emotion API.
[0207] Upon receiving this emotional state information, the server sends a prompt to the generative AI model to obtain feedback on caption generation and editing. An example of such a prompt is, "Generate captions that are relatable to viewers and create a fun atmosphere for a cooking video in which the user is enjoying explaining the subject."
[0208] Ultimately, suggestions for improving the video are made based on viewer emotional responses. This content is then visually and temporally edited before being sent to the device. In this way, viewers can enjoy personalized and engaging videos.
[0209] For example, when a user explains "how to make a raspberry tart" to viewers, the system captures the user's joyful emotions and generates captions that emphasize them. This makes the video even more appealing to viewers.
[0210] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0211] Step 1:
[0212] Users upload cooking videos to the platform. The server receives and stores these videos. The input is the cooking video uploaded by the user, and the output is a video file stored on the server. The server checks the video format and saves it in the appropriate format.
[0213] Step 2:
[0214] The server divides the video frame by frame for analysis. The input is a saved video file, and the output is individual frame images. The server converts the video to still images based on the frame rate and records the timestamp of each frame.
[0215] Step 3:
[0216] The server applies an image recognition algorithm to each frame of the image to identify substances and processes. The input is the frame image, and the output is textual information of the recognized substances and processes. The server uses a machine learning model to analyze the content of each frame and saves the results as text data.
[0217] Step 4:
[0218] The device acquires recorded audio data and facial expression data captured by the camera and sends them to the emotion analysis engine. The input is audio and video data, and the output is the user's emotional state. The device uses the emotion analysis engine to analyze the tone of voice and facial expressions to determine the current emotion.
[0219] Step 5:
[0220] The server sends a prompt to the generative AI model based on the emotional state and analyzed text data, and retrieves a caption and editing suggestions. The input is the emotional state and text data, and an example prompt is "Generate a caption for a cooking video where the user is enjoying explaining, that is relatable to viewers and creates a fun atmosphere." The output is the adjusted caption and editing suggestions.
[0221] Step 6:
[0222] The server predicts viewer emotional responses and provides suggestions for improving videos after publication. Inputs include generated captions, user emotional states, and past viewer data; output is improved video content. The server analyzes viewer history and reactions to generate suggestions for editing more engaging videos.
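As a non-limiting illustration of Step 2, in which the video is converted to still images based on the frame rate and the timestamp of each frame is recorded, the following minimal sketch computes which frame indices to extract and their timestamps. The function name `sample_frames` and the target sampling rate are assumptions for illustration. Decoding of the actual image data, for example with OpenCV or FFmpeg, is omitted.

```python
def sample_frames(total_frames, native_fps, sample_fps):
    """Return (frame_index, timestamp_seconds) pairs sampled at roughly sample_fps."""
    if native_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never sample more densely than the video's own frame rate.
    step = max(1, round(native_fps / sample_fps))
    return [(i, i / native_fps) for i in range(0, total_frames, step)]


# A 3-second clip at 30 frames per second, sampled once per second.
samples = sample_frames(total_frames=90, native_fps=30, sample_fps=1)
```

Sampling below the native frame rate is one way to keep the amount of image recognition work in Step 3 manageable, at the cost of coarser timestamps.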
[0223] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0224] Data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, and inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0225] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.
[0226] [Second Embodiment]
[0227] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.
[0228] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.
[0229] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0230] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.
[0231] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0232] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0233] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0234] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0235] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0236] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0237] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0238] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0239] This invention is a system that effectively analyzes cooking videos and provides convenient recipe information and editing support for both viewers and creators. Specific embodiments of the system are described below.
[0240] First, users upload cooking videos they have filmed themselves to the platform. This process requires the user's device to select and submit the video.
[0241] Next, the server receives the video and temporarily stores it. The video is then split frame by frame for analysis. This allows for detailed analysis of each frame as image data.
[0242] The server uses an AI image recognition algorithm to identify ingredients and cooking steps within video frames. This analysis generates text data containing ingredient names, cooking progress, and other relevant information.
[0243] Based on the generated text data, the server automatically creates captions corresponding to the video content. The captions are assigned along the video's timeline, allowing viewers to instantly access information that matches the video.
[0244] Furthermore, the server provides visual editing suggestions. These suggestions include things like scene transitions, shortening redundant parts, and highlighting important scenes, aiming to improve the overall quality of the video.
[0245] Finally, the user, having received the provided text data and editing suggestions, performs the final editing of the video based on that information. After adjustments, the user can publish the video on the platform.
[0246] As a concrete example, consider a case where a user uploads a video of a pasta dish. The server recognizes ingredients such as tomatoes and pasta from the frames and creates captions for steps such as "boiling the pasta" and "mixing the sauce." Furthermore, it makes editing suggestions such as "shorten the boiling time" and "emphasize the scene of stirring the cream sauce." As a result, the user can complete a high-quality video with detailed recipe information in a short amount of time.
[0247] In this way, the system reduces the burden on users while enabling the provision of user-friendly content for viewers.
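As a non-limiting illustration of how captions could be assigned along the video's timeline, the following minimal sketch collapses consecutive frames that were recognized as the same cooking step into a single caption segment. The input format, a list of timestamp and label pairs, is an assumption for illustration, and the labels are taken from the pasta example above.

```python
def merge_frame_labels(frame_labels):
    """Collapse consecutive frames with the same label into (start_s, end_s, label) segments.

    frame_labels: list of (timestamp_seconds, label) pairs sorted by timestamp.
    """
    segments = []
    for timestamp, label in frame_labels:
        if segments and segments[-1][2] == label:
            # Same step continues: extend the current segment's end time.
            start, _, _ = segments[-1]
            segments[-1] = (start, timestamp, label)
        else:
            # A new step begins: open a segment that starts and ends at this frame.
            segments.append((timestamp, timestamp, label))
    return segments


# Illustrative recognition results, one label per sampled frame.
frame_labels = [
    (0.0, "boiling the pasta"),
    (1.0, "boiling the pasta"),
    (2.0, "boiling the pasta"),
    (3.0, "mixing the sauce"),
    (4.0, "mixing the sauce"),
]
segments = merge_frame_labels(frame_labels)
```

Segment lengths obtained in this way could also serve as one input for editing suggestions such as shortening a long boiling scene.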
[0248] The following describes the processing flow.
[0249] Step 1:
[0250] Users upload cooking videos to the platform. Through the device interface, users select video files and press the upload button, sending the video data to the server.
[0251] Step 2:
[0252] The server checks the received video and saves it to temporary storage. It verifies the format and size of the video file to ensure it complies with the platform's standards.
[0253] Step 3:
[0254] The server divides the video into frames for analysis. Based on the frame rate, it extracts frames at intervals that allow for efficient and appropriate analysis.
[0255] Step 4:
[0256] The server uses an AI model to identify ingredients and cooking steps from image data within a frame. In this process, ingredients, tools, and cooking steps are recognized as text, and the data is structured.
[0257] Step 5:
[0258] The server generates captions from the obtained material and procedure data. Based on the text data, it generates easy-to-understand explanatory text for viewers and adds captions to the video along the timeline.
[0259] Step 6:
[0260] The server provides video editing suggestions. Based on the analysis results, it generates scene cuts and improvement suggestions, and offers ideas to highlight visual effects and key points.
[0261] Step 7:
[0262] The device displays text information, captions, and editing suggestions received from the server to the user. The user can then use this information to make final edits to the video.
[0263] Step 8:
[0264] The user publishes their edited video on the platform. After a final review and any necessary adjustments, it is made available for other users to view.
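As a non-limiting illustration of Step 2, the format and size verification could take a form such as the following minimal sketch. The permitted extensions and the 2 GiB size limit are hypothetical platform standards introduced for illustration and are not specified in this disclosure.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm"}  # hypothetical platform policy
MAX_SIZE_BYTES = 2 * 1024**3                     # hypothetical 2 GiB limit


def validate_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the upload passes the checks."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds the size limit")
    return problems


ok = validate_upload("pasta.MP4", 10_000_000)
rejected = validate_upload("pasta.txt", 0)
```

Returning every detected problem, instead of stopping at the first one, would allow the terminal to show the user all reasons for a rejected upload at once.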
[0265] (Example 1)
[0266] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0267] In recent years, with the increase in video content, it has become difficult for viewers to efficiently obtain information. Furthermore, creators are required to dedicate a significant amount of time and effort to editing. To address these issues, there is a need for the creation of video content that is convenient for both viewers and creators by automatically analyzing ingredients and cooking procedures and providing visual editing suggestions.
[0268] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0269] In this invention, the server includes means for receiving and storing video, means for dividing the video into individual images for analysis, and means for identifying materials and work procedures from the content within the video using an image recognition method and generating text information. This enables the rapid and accurate acquisition of material and procedure information for video content uploaded by users, and further enables efficient video creation through editing suggestions.
[0270] "Video" refers to moving visual information expressed in digital or analog format.
[0271] "Receiving" refers to the act of receiving data via a digital network or other means of communication.
[0272] "Storage" refers to the act of saving specific data in a way that allows for later access.
[0273] "Image" refers to individual still frames or their digital representations that constitute visual information.
[0274] "Partitioning" refers to the process of dividing continuous information into individual units.
[0275] "Image recognition techniques" refer to the process by which computers identify and interpret objects, features, and patterns within an image.
[0276] "Content" refers to the visual elements and their meanings contained within a video or image.
[0277] "Materials" refers to elements or items used that are specifically identified within the video.
[0278] A "work procedure" refers to a series of steps or processes performed to achieve a specific objective.
[0279] "Identification" refers to the act of identifying and evaluating specific properties or elements.
[0280] "Textual information" refers to visual or auditory data expressed in the form of characters or text.
[0281] A "title" refers to a short sentence or phrase used to describe or supplement the content of a video.
[0282] The term "time axis" refers to a reference frame that indicates the flow or sequence of events over time.
[0283] "Scene" refers to a unit composed of shots or sequences in visual media.
[0284] This invention is a system that extracts material information and procedures from video content and provides efficient editing support. Specifically, the server receives the video uploaded by the user and temporarily stores it. The hardware used is preferably a server computer capable of high-speed data processing. As a preliminary step before analysis, the server divides the video frame by frame. In this process, video editing software and image conversion libraries are utilized to achieve rapid frame separation.
[0285] Next, the server analyzes the image data of each frame using a generative AI model. This model identifies the materials and work procedures in the video and generates text information through image recognition techniques. In this way, the user can obtain individual materials and procedures in text form as needed. Accurate information extraction is realized by giving this generative AI model, for example, a prompt sentence such as "What are the materials in the video?"
[0286] Based on the generated text information, the server automatically generates titles and arranges them on the time axis according to the specific scenes of the video. This process is performed by software equipped with a natural language processing engine. The server can also provide visual editing proposals, thereby providing collaborative editing support and improving the creator's efficiency. These editing proposals include, for example, guidance on visual effects such as emphasizing important scenes in the video.
[0287] Finally, the terminal receives the text information and proposal content provided by the server, and the user performs the final editing of the video based on them. Through this system, the user can quickly create and publish high-quality video content. The aim is to improve the convenience of digital video production and provide a valuable experience for both viewers and creators.
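As a non-limiting illustration of arranging generated titles on the time axis, the following minimal sketch renders titles with start and end times as SubRip (SRT) subtitle text. The choice of the SRT format and the function names are assumptions for illustration, since this disclosure does not specify a caption file format.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(titles):
    """Render (start_s, end_s, text) tuples as SubRip (SRT) subtitle text."""
    blocks = []
    for number, (start_s, end_s, text) in enumerate(titles, start=1):
        blocks.append(
            f"{number}\n"
            f"{format_timestamp(start_s)} --> {format_timestamp(end_s)}\n"
            f"{text}\n"
        )
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)


# Illustrative titles only; real text would come from the generative AI model.
srt_text = to_srt([(0.0, 4.5, "Boil the pasta"), (4.5, 9.0, "Mix the sauce")])
```

A plain-text timeline format of this kind would let the terminal display the titles for review and correction before the user finalizes the video.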
[0288] The flow of the specific process in Example 1 will be described using Figure 11.
[0289] Step 1:
[0290] Users upload cooking videos to the platform using their devices. Specifically, the user selects a video file, and the device sends that file to the server. The input is the video file selected by the user, and the output is the transfer of the video file to the server.
[0291] Step 2:
[0292] The server receives the uploaded video and stores it temporarily. Specifically, the received video data is saved to the server's storage. The input is the video data sent from the terminal, and the output is the video data stored in the server's storage.
[0293] Step 3:
[0294] The server divides the stored video into individual frames. Specifically, it divides the video into time intervals to generate still frames. This process uses an image processing library to convert continuous data into individual images. The input is the stored video data, and the output is individual frame image data.
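The interval-based division in Step 3 can be illustrated with a minimal Python sketch that only decides which frames to extract and what timestamp each one carries. The function name `sample_frame_indices` and the parameters `total_frames`, `fps`, and `interval_sec` are assumptions made for illustration and do not appear in the disclosure; decoding the selected frames into images would be left to an image processing library.

```python
def sample_frame_indices(
    total_frames: int, fps: float, interval_sec: float
) -> list[tuple[int, float]]:
    """Return (frame_index, timestamp_sec) pairs, one per sampling interval."""
    if total_frames < 0 or fps <= 0 or interval_sec <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval_sec must be > 0")
    # Round the interval to a whole number of frames, never less than one frame.
    step = max(1, round(fps * interval_sec))
    return [(index, index / fps) for index in range(0, total_frames, step)]


# A 3-second clip at 30 fps, sampled once per second.
samples = sample_frame_indices(total_frames=90, fps=30.0, interval_sec=1.0)
print(samples)  # [(0, 0.0), (30, 1.0), (60, 2.0)]
```

Keeping the timestamp next to each index is a design choice that lets later steps place text on the video timeline without reopening the video.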
[0295] Step 4:
[0296] The server analyzes each frame using a generative AI model. Specifically, it uses an image recognition algorithm to identify materials and work procedures within the frame and generates text information. This analysis outputs material names and operation procedures as text. The input is frame image data, and the output is text information including materials and procedures.
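One possible way to organize the per-frame analysis of Step 4 is sketched below in Python. The disclosure states only that a prompt is given to a generative AI model and that text is returned, so the `Model` callable type, the function names, and the two-line reply format (`ingredients: ...` / `steps: ...`) are all assumptions. A stub stands in for the real model.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt and frame bytes, returns reply text.
Model = Callable[[str, bytes], str]

PROMPT = (
    "What are the materials in the video? "
    "Answer on two lines: 'ingredients: a, b' and 'steps: x; y'."  # reply format is an assumption
)


def parse_reply(reply: str) -> dict[str, list[str]]:
    """Turn the assumed two-line reply into lists of ingredients and steps."""
    result: dict[str, list[str]] = {"ingredients": [], "steps": []}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if not sep or key not in result:
            continue  # ignore lines that do not follow the assumed format
        delimiter = "," if key == "ingredients" else ";"
        result[key] = [item.strip() for item in value.split(delimiter) if item.strip()]
    return result


def analyze_frame(model: Model, frame: bytes) -> dict[str, list[str]]:
    """Send one frame with the prompt to the model and parse its text reply."""
    return parse_reply(model(PROMPT, frame))


def stub_model(prompt: str, frame: bytes) -> str:
    """Stand-in for a real generative AI model; always returns the same reply."""
    return "ingredients: egg, rice\nsteps: crack the eggs; fry the rice"


print(analyze_frame(stub_model, b""))
```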
[0297] Step 5:
[0298] The server generates captions based on the text information and places them according to the scenes in the video. A natural language processing engine is used to create video-related captions from the text data and synchronize them with the video timeline. The input is text data, and the output is captions aligned with the video timeline.
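The timeline placement in Step 5 could, for example, merge consecutive frames that carry the same text into one caption and emit the result in a subtitle format. The Python sketch below assumes per-frame text with timestamps as input and uses the SubRip (SRT) format as output; both the format and the names `Caption`, `build_captions`, and `to_srt` are illustrative choices, not something the disclosure specifies.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float  # seconds
    end: float    # seconds
    text: str


def build_captions(frame_labels: list[tuple[float, str]], frame_gap: float) -> list[Caption]:
    """Merge consecutive frames with identical text into one caption.

    frame_labels holds (timestamp_sec, text) pairs sorted by time;
    frame_gap is the time between sampled frames.
    """
    captions: list[Caption] = []
    for timestamp, text in frame_labels:
        if captions and captions[-1].text == text:
            captions[-1].end = timestamp + frame_gap  # extend the running caption
        else:
            captions.append(Caption(timestamp, timestamp + frame_gap, text))
    return captions


def to_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def to_srt(captions: list[Caption]) -> str:
    """Serialize captions as numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{n}\n{to_srt_time(c.start)} --> {to_srt_time(c.end)}\n{c.text}\n"
        for n, c in enumerate(captions, 1)
    ]
    return "\n".join(blocks)


labels = [(0.0, "Crack the eggs"), (1.0, "Crack the eggs"), (2.0, "Fry the rice")]
print(to_srt(build_captions(labels, frame_gap=1.0)))
```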
[0299] Step 6:
[0300] The server generates editing suggestions that include visual effects. Specifically, it uses a generative AI model to analyze the results and make suggestions such as highlighting important scenes in the video. The input is the analyzed video information, and the output is editing instruction data that includes suggestions for visual effects.
[0301] Step 7:
[0302] The terminal receives text information and editing suggestions from the server, and the user performs the final video editing. Specifically, the user makes necessary corrections and edits using the provided information. The input is the text information and editing suggestions sent from the server, and the output is the final edited video file.
[0303] (Application Example 1)
[0304] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."
[0305] Video content has become widespread and cooking-related videos are increasing, but it is not easy for viewers to efficiently understand and practice cooking methods while watching videos. In particular, repeated viewing is necessary to grasp specific ingredients and procedures, which imposes a significant time and cognitive burden on users. Creators are also seeking support for efficiently editing engaging videos.
[0306] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0307] In this invention, the server includes means for receiving and storing a video, means for splitting the video frame by frame for analysis, means for identifying materials and procedures in the video using an image recognition algorithm and generating text data, means for generating visual and temporal editing proposals, means for transmitting the generated text data, captions, and editing proposals to a mobile device, and means for visualizing the analysis result of the video in real time to enable the user to understand the cooking method in accordance with the video. As a result, the viewer can efficiently grasp the cooking procedure while watching the video, and the creator, supported in high-quality video editing, can provide attractive content.
[0308] The "means for receiving and storing a video" is a device or system that receives video data transmitted through the Internet or other communication means and stores it temporarily or permanently in a physical or virtual storage device.
[0309] The "means for splitting the video frame by frame for analysis" is a process of splitting the received video into a plurality of frames, each an individual still image, and putting the image data of each frame into a form that can be analyzed independently.
[0310] The "means for identifying materials and procedures in the video using an image recognition algorithm and generating text data" is a technology or method that applies image recognition technology such as AI to the frames extracted from the video to automatically extract information such as materials and cooking procedures and convert it into text information.
[0311] The "means for generating visual and temporal editing proposals" is a function that analyzes the entire video visually and temporally and proposes editing options for maximizing video effects and scene flow.
[0312] The "means for transmitting the generated text data, captions, and editing proposals to a mobile device" is a method of transmitting information such as text data, captions, and editing proposals generated on the server to a portable device such as a smartphone used by the user, enabling display or editing assistance.
[0313] The "means for visualizing the analysis result of the video in real time to enable the user to understand the cooking method in accordance with the video" is a function that displays the information on cooking procedures and ingredients obtained through video analysis as a visual overlay while the video is being watched, allowing the user to understand and practice the content in step with the video.
[0314] This invention is a system that analyzes cooking videos to help viewers efficiently understand cooking methods and to support creators in providing high-quality content. The system is configured as follows:
[0315] The server first receives cooking videos sent by users and saves them to its storage device. These saved videos are then divided into frames. Each frame is analyzed using an AI-based image recognition algorithm to identify the ingredients and cooking steps within the video. The identified information is generated as text data and later used to create captions. Specifically, Python and TensorFlow are used to perform image recognition, identify ingredients, and extract cooking steps.
[0316] Furthermore, the server automatically generates suggestions for visual and temporal editing of the entire video based on the analysis results, including suggestions for visual effects, extraction of important scenes, and shortening of redundant parts. These editing suggestions are sent to an application on a mobile device such as a smartphone, implemented with, for example, React Native. This allows users to watch the video while the analysis results are visualized in real time, enabling them to efficiently grasp the overall picture of the cooking process.
[0317] As a concrete example, suppose a user uploads a video titled "How to Make Omurice" to the system. The system analyzes the video, identifies steps such as "cracking the eggs" and "frying the rice," and adds these as captions. It also makes editing suggestions, such as "emphasizing the key points for making the eggs fluffy." Furthermore, as an example of a prompt, the AI can be asked questions such as, "What ingredients are used in this video? Also, please tell me the specific cooking steps."
[0318] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0319] Step 1:
[0320] The user selects a cooking video on their device and uploads it to the server. The input is the cooking video file provided from the device. The output is the video file transferred to the server and stored there. This process prepares the video for subsequent analysis.
[0321] Step 2:
[0322] The server divides the received video into frames. The input is a video file stored on the server. The output is the video decomposed into a collection of still image frames. This process allows for individual analysis of each frame.
[0323] Step 3:
[0324] The server applies an image recognition algorithm to identify ingredients and cooking steps within each frame and extract text data. Still image frames are provided as input. Text data containing ingredient names and cooking steps is obtained as output. An AI model using Python and TensorFlow extracts features from the images and identifies the information.
[0325] Step 4:
[0326] The server generates visual and temporal captions and editing suggestions based on extracted text data. Text data regarding materials and procedures is used as input. The output includes visual caption information and video editing suggestions. This allows the captions to be displayed in sync with the video's flow, and the editing suggestions aim to improve the content's quality.
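As a rough illustration of how editing suggestions might be derived from the extracted text data, the Python sketch below flags long segments as candidates for shortening and marks segments whose step is considered important as candidates for emphasis. The rule itself, the `max_len` threshold, the `key_steps` set, and the `Segment` and `Suggestion` types are assumptions introduced here; the disclosure does not state how suggestions are computed.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    step: str     # cooking step recognized for this part of the video


@dataclass
class Suggestion:
    kind: str     # "shorten" or "highlight"
    start: float
    end: float
    note: str


def suggest_edits(
    segments: list[Segment],
    max_len: float = 20.0,                 # hypothetical length limit in seconds
    key_steps: frozenset[str] = frozenset(),  # steps treated as important (assumed input)
) -> list[Suggestion]:
    """Apply two simple rules: highlight key steps, shorten overly long segments."""
    suggestions: list[Suggestion] = []
    for seg in segments:
        if seg.step in key_steps:
            suggestions.append(
                Suggestion("highlight", seg.start, seg.end, f"Emphasize: {seg.step}")
            )
        elif seg.end - seg.start > max_len:
            suggestions.append(
                Suggestion("shorten", seg.start, seg.end, f"Consider trimming: {seg.step}")
            )
    return suggestions


segments = [
    Segment(0.0, 45.0, "stirring"),
    Segment(45.0, 55.0, "making the eggs fluffy"),
    Segment(55.0, 60.0, "plating"),
]
for s in suggest_edits(segments, key_steps=frozenset({"making the eggs fluffy"})):
    print(s.kind, s.start, s.end, s.note)
```

In a system following the text, a generative AI model could replace or supplement these fixed rules; the sketch only shows the shape of the input and output data.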
[0327] Step 5:
[0328] The server sends the generated captions and editing suggestions to the terminal and provides them to the user. The generated captions and editing suggestions are used as input. This information is delivered to the user's terminal as output. The user can efficiently understand the video content through real-time visualization on their terminal.
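The transmission in Step 5 implies some serialized message passed from the server to the terminal. The following Python sketch shows one hypothetical JSON payload; the field names `video_id`, `captions`, and `suggestions` and the use of JSON are assumptions, since the disclosure does not define a wire format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisPayload:
    video_id: str
    captions: list[dict] = field(default_factory=list)     # e.g. {"start": 0.0, "end": 2.0, "text": "..."}
    suggestions: list[dict] = field(default_factory=list)  # e.g. {"kind": "highlight", "start": 45.0, "end": 55.0}


def encode_payload(payload: AnalysisPayload) -> str:
    """Serialize on the server side; ensure_ascii=False keeps non-ASCII captions readable."""
    return json.dumps(asdict(payload), ensure_ascii=False)


def decode_payload(raw: str) -> AnalysisPayload:
    """Rebuild the payload on the terminal side."""
    return AnalysisPayload(**json.loads(raw))


payload = AnalysisPayload(
    video_id="omurice-001",
    captions=[{"start": 0.0, "end": 2.0, "text": "Crack the eggs"}],
    suggestions=[{"kind": "highlight", "start": 45.0, "end": 55.0}],
)
restored = decode_payload(encode_payload(payload))
print(restored == payload)  # True
```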
[0329] Step 6:
[0330] The user uses their device to perform the final editing of the video based on the provided captions and editing suggestions. The captions and editing suggestions displayed on the device are used as input. The output is the final edited video, ready to be published on the platform. At this stage, the user determines the final form of the video according to their own needs and prepares it for distribution to viewers.
[0331] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0332] This invention combines an emotion engine with a system that analyzes cooking videos to provide detailed recipe information, enabling the adjustment of editing and caption generation based on the user's emotional state. Specific embodiments of the system are described below.
[0333] First, users upload cooking videos to the platform. The uploaded video files are received by the server and temporarily stored. The server then divides the video frame by frame and converts it into a data format that allows for detailed analysis.
[0334] Next, the server uses an image recognition algorithm to identify the ingredients and cooking steps in each frame. As a result of this process, text data about the ingredients and steps is generated.
[0335] This system incorporates an emotion engine that analyzes the user's emotional state. The device analyzes the user's voice tone and facial expression data to determine their current emotional state. For example, it collects information such as whether the user is happy, confused, or depressed.
[0336] The server uses the emotion engine's judgment to provide feedback on video caption generation and editing suggestions. As a result, the content and tone of the captions reflect the user's emotions, and different approaches are offered in editing suggestions based on those emotions. For example, if the server determines that the user is enjoying the video, it can create captions that emphasize the humor in the video.
[0337] Furthermore, the emotion engine can predict viewers' emotional reactions and suggest improvements to the video after its release. This allows for adjustments to make the video more enjoyable for viewers.
[0338] As a concrete example, suppose a user uploads a video of themselves making desserts and explains the process in a cheerful mood. The emotion engine detects this emotion and reflects it in the caption and editing suggestions. For example, a positive-toned caption such as "How to make cookies that everyone can enjoy making together" is automatically generated. In this way, the system can provide content that appeals to viewers while taking the user's emotions into consideration.
[0339] This system allows users to edit videos with emotional depth and deliver recipe videos that are tailored to maximize viewer enjoyment.
[0340] The following describes the processing flow.
[0341] Step 1:
[0342] Users film cooking videos and upload them to the platform. They use their device's interface to select and submit the video files to be uploaded.
[0343] Step 2:
[0344] The server receives the uploaded video and saves it to temporary storage. Simultaneously with saving, it verifies the file format and size and prepares it for analysis.
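The format and size verification in Step 2 might look like the Python sketch below. The set of allowed extensions and the 2 GiB limit are placeholder values chosen for illustration, and `validate_upload` is a hypothetical name; the disclosure does not state which formats or sizes are accepted.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm"}  # assumed list of accepted formats
MAX_SIZE_BYTES = 2 * 1024**3                    # assumed upper limit (2 GiB)


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file can be analyzed."""
    errors: list[str] = []
    extension = PurePath(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes <= 0:
        errors.append("empty file")
    elif size_bytes > MAX_SIZE_BYTES:
        errors.append("file too large")
    return errors


print(validate_upload("omurice.MP4", 150_000_000))  # []
print(validate_upload("notes.txt", 0))              # ['unsupported format: .txt', 'empty file']
```

Checking only the extension is a simplification; a production check would more likely inspect the container header as well.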
[0345] Step 3:
[0346] The server divides the video into frames. In this step, the video is extracted as still images at a constant frame rate for efficient analysis.
[0347] Step 4:
[0348] The server uses an image recognition algorithm to identify the ingredients and cooking steps within the frame. This algorithm then extracts the objects and their actions in the video as text data.
[0349] Step 5:
[0350] The device analyzes the user's voice and facial expressions using an emotion engine. It evaluates the user's emotional state during shooting in real time or retrospectively and sends the information to the server.
[0351] Step 6:
[0352] The server generates captions based on the text data received and the results from the emotion engine. The generated captions are set to reflect the user's emotions in content and tone, and are organized chronologically.
[0353] Step 7:
[0354] The server provides visual and temporal editing suggestions for the video. This includes leveraging emotion engine data to tailor edits to the user's emotional state. For example, it may include editing instructions that highlight enjoyable parts.
[0355] Step 8:
[0356] The device provides the user with captions and editing suggestions received from the server. The user then makes final adjustments to the video based on this information, and reviews and corrects the captions and editing content.
[0357] Step 9:
[0358] Users review the edited videos and publish them on the platform. After final review and adjustments, the videos are distributed to viewers. This ensures that personalized and emotionally resonant cooking videos are delivered to audiences.
[0359] (Example 2)
[0360] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".
[0361] Existing video analysis systems lack the ability to generate captions or suggest edits based on the user's emotional state, making it difficult to adjust content to capture viewers' interest. This is particularly problematic for content such as cooking videos, where the inability to edit and caption in a way that considers the emotions of viewers and users makes it difficult to create engaging content.
[0362] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0363] In this invention, the server includes means for receiving video and storing it in an information recording device; means for dividing the video into electrical signals for analysis; means for identifying objects and operating procedures in the video and generating text information using image recognition processing; means for determining the emotional state of the user using a user reaction analysis device and generating text information and editing suggestions based on the determination result; and means for transmitting the generated text information, captions, and editing suggestions to an information processing device. This makes it possible to generate captions and suggest edits that correspond to the emotional state of the user, and to provide content that appeals to viewers.
[0364] "Video" is digital media that expresses movement through the continuous display of images.
[0365] An "information recording device" is a storage device for electronically storing received data.
[0366] An "electrical signal" is a signal that represents the flow of digital or analog data that makes up video or audio.
[0367] "Image recognition processing" is a technology that identifies specific visual elements from digital image data and analyzes them.
[0368] "Object" refers to an object, material, or element within an image that is the subject of identification and analysis.
[0369] An "operating procedure" is a series of actions or steps performed to achieve a specific objective.
[0370] "Text information" refers to text-based information generated through image recognition processing.
[0371] A "user reaction analysis device" is a device used to measure and analyze the emotions and reactions of users.
[0372] "Emotional state" refers to the emotional reactions and psychological state exhibited by the user.
[0373] An "editing suggestion" is a proposed editing option or method to make the video content more effective.
[0374] An "information processing device" is a computer used to process data and output analysis results or suggestions.
[0375] This invention is a system in which users upload cooking videos, and the system analyzes those videos to provide detailed recipe information. Furthermore, by incorporating an emotion engine, this system can adjust editing and caption generation based on the user's emotional state.
[0376] Users upload cooking videos to the platform using their own devices. These videos are received by the server and temporarily stored in an information recording device. The server then uses video analysis software such as OpenCV to convert the video into electrical signals frame by frame. This conversion process prepares each frame to be analyzed as individual image data.
[0377] The server uses image recognition technology to analyze each frame of the image and identify the object and the operating procedure. Machine learning frameworks such as TensorFlow and PyTorch are used for this analysis, and the identified information is output as text data. This generates specific text data about the materials and procedures.
[0378] The device also acquires the user's voice tone and facial expressions, and uses a user reaction analysis device to determine their emotional state. This analysis utilizes a voice recognition API and facial recognition software. This allows for accurate measurement of the user's emotional state, such as whether they are enjoying themselves or feeling confused.
[0379] The server utilizes an emotion engine to generate text information and editing suggestions based on the analysis results. For example, if the emotion analysis determines that the user is enjoying the content, the server can generate a caption with a positive tone using a generative AI model such as GPT. This makes it possible to provide more engaging content to viewers.
[0380] For example, if a user is smiling while explaining in a dessert-making video, a caption that emphasizes a fun atmosphere, such as "How to make cookies that everyone can enjoy making together," is generated automatically. In this way, the system reflects the user's emotions in real time and ultimately provides videos that resonate with viewers.
[0381] An example of a prompt might be, "Generate a humorous caption that fits a video of making desserts. The user is enjoying themselves." Following this prompt, the generative AI model can create the necessary captions and editing suggestions.
[0382] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0383] Step 1:
[0384] Users upload cooking videos from their devices to the platform. The cooking video file is received by the server as input, and the server saves this video to an information recording device. Once saved, the video becomes available for subsequent analysis processes.
[0385] Step 2:
[0386] The server converts the stored video into electrical signals frame by frame for analysis. This process uses video analysis software, which divides the input video data into individual image data for output. This allows for analysis of each individual image.
[0387] Step 3:
[0388] The server analyzes each frame using image recognition processing to identify the object and the operating procedure. The input for this step is image data for each frame, and the output is textual information about the ingredients and cooking procedure. For this purpose, a machine learning framework is used to identify and digitize specific visual elements.
[0389] Step 4:
[0390] The terminal collects the user's voice tone and facial expressions as input data and uses a user reaction analysis device to determine their emotional state. The output here is data on the emotional state shown by the user. This analysis is performed using a voice recognition API and facial recognition software, and the user's emotions are expressed as a numerical value or a category.
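To make the "numerical value or category" output of Step 4 concrete, the Python sketch below combines two sets of emotion scores, one from voice and one from facial expression, into a single category with a score. The weighted average, the `voice_weight` parameter, and the label names are assumptions for illustration; the disclosure does not say how the two signals are combined.

```python
def fuse_emotions(
    voice: dict[str, float],
    face: dict[str, float],
    voice_weight: float = 0.5,  # hypothetical weighting between the two signals
) -> tuple[str, float]:
    """Return the emotion category with the highest weighted score, and that score."""
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be between 0 and 1")
    labels = set(voice) | set(face)
    if not labels:
        raise ValueError("at least one emotion score is required")
    combined = {
        label: voice_weight * voice.get(label, 0.0)
        + (1.0 - voice_weight) * face.get(label, 0.0)
        for label in labels
    }
    # Sort labels first so that ties are resolved deterministically.
    best = max(sorted(combined), key=lambda label: combined[label])
    return best, combined[best]


category, score = fuse_emotions(
    voice={"happy": 0.6, "confused": 0.4},
    face={"happy": 0.8, "confused": 0.2},
)
print(category, round(score, 2))  # happy 0.7
```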
[0391] Step 5:
[0392] The server uses an emotion engine to generate text information and editing suggestions that reflect the emotional state. The input to this process is the user's emotional state data and the text information generated in Step 3, and the output is a caption and editing suggestions corresponding to the emotion. A generative AI model, for example GPT, is used to create the emotion-based captions.
[0393] Step 6:
[0394] The server sends the generated captions and editing suggestions to the information processing device. The input is the generated caption data and editing suggestion data, which are sent directly to the information processing device as output. This final step prepares the content for the audience to enjoy.
[0395] (Application Example 2)
[0396] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the smart glasses 214 as the "terminal".
[0397] In cooking videos, there is a growing demand for dynamic editing that responds to the user's emotional state, providing viewers with more engaging and personalized content. However, conventional systems have been insufficient in predicting viewers' emotional reactions and improving videos, or in generating captions that take user emotions into account. Therefore, an improvement in the user experience is desired.
[0398] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0399] In this invention, the server includes means for receiving and storing video, means for dividing the video into frames for analysis, and means for identifying substances and processes from the video using an image recognition algorithm and generating textual information. This makes it possible to analyze the user's emotional state and adjust captions and editing suggestions based on those emotions, to predict viewers' emotional responses and propose improvements, and to send prompt sentences to a generative AI model to obtain improvement suggestions.
[0400] "Means for receiving and storing video" refers to a device or process that receives video data provided from an external source and stores it on a storage medium.
[0401] "Means for dividing the video into frames" refers to techniques for dividing temporally continuous video footage into individual still images for the purpose of analyzing video data.
[0402] An "image recognition algorithm" is a program or method that automatically identifies and interprets specific objects or situations from image data.
[0403] "Means for identifying substances and processes and generating textual information" refers to a system that, based on image recognition results, transcribes the substances used and procedures performed within a video into text and records them in digital format.
[0404] An "emotion analysis engine that analyzes the emotional state of users" is a technology that infers the user's psychological state at a given time from the nuances of their facial expressions and voice.
[0405] "A method for sending prompt sentences to a generative AI model and obtaining improvement suggestions" refers to a method that uses natural language processing technology to provide input to an artificial intelligence system that generates responses to requests, and to obtain suggestions for improvement.
[0406] "Methods for predicting viewer emotional responses and proposing improvements" refers to a process of estimating emotional responses from viewer behavior and feedback, and then proposing specific changes to improve the quality of content based on those estimates.
[0407] This invention begins with a user uploading a cooking video to the platform. The server receives this video and stores it temporarily. The video is then divided into frames for analysis. Software such as OpenCV or FFmpeg is often used for this process.
[0408] Next, the server applies an image recognition algorithm to identify the substance and process from each frame. The data obtained through this process is stored in text format. At this stage, machine learning frameworks such as TensorFlow and PyTorch are used.
[0409] Furthermore, the emotion analysis engine determines the user's emotional state based on the voice and facial expression data provided by the user's device. Google Cloud Speech-to-Text API and Microsoft Azure's Emotion API are utilized for this analysis.
[0410] Upon receiving this emotional state information, the server sends a prompt to the generative AI model to obtain feedback on caption generation and editing. An example of such a prompt is, "Generate captions that are relatable to viewers and create a fun atmosphere for a cooking video in which the user is enjoying explaining the subject."
[0411] Ultimately, suggestions for improving the video are made based on viewer emotional responses. This content is then visually and temporally edited before being sent to the device. In this way, viewers can enjoy personalized and engaging videos.
[0412] For example, when a user explains "how to make a raspberry tart" to viewers, the system captures the user's joyful emotions and generates captions that emphasize them. This makes the video even more appealing to viewers.
[0413] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0414] Step 1:
[0415] Users upload cooking videos to the platform. The server receives and stores these videos. The input is the cooking video uploaded by the user, and the output is a video file stored on the server. The server checks the video format and saves it in the appropriate format.
[0416] Step 2:
[0417] The server divides the video frame by frame for analysis. The input is a saved video file, and the output is individual frame images. The server converts the video to still images based on the frame rate and records the timestamp of each frame.
[0418] Step 3:
[0419] The server applies an image recognition algorithm to each frame of the image to identify substances and processes. The input is the frame image, and the output is textual information of the recognized substances and processes. The server uses a machine learning model to analyze the content of each frame and saves the results as text data.
[0420] Step 4:
[0421] The device acquires recorded audio data and facial expression data captured by the camera and sends them to the emotion analysis engine. The input is audio and video data, and the output is the user's emotional state. The device uses the emotion analysis engine to analyze the tone of voice and facial expressions to determine the current emotion.
[0422] Step 5:
[0423] The server sends a prompt to the generative AI model based on the emotional state and analyzed text data, and retrieves a caption and editing suggestions. The input is the emotional state and text data, and an example prompt is "Generate a caption for a cooking video where the user is enjoying explaining, that is relatable to viewers and creates a fun atmosphere." The output is the adjusted caption and editing suggestions.
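A prompt such as the one in Step 5 could be assembled from the emotional state and the analyzed text data as in the Python sketch below. The function name `build_caption_prompt` and the layout of the prompt are assumptions; only the general idea of combining the emotion with the recognized content comes from the text.

```python
def build_caption_prompt(emotion: str, ingredients: list[str], steps: list[str]) -> str:
    """Compose a caption-generation prompt from the emotion and the analyzed text data."""
    lines = [
        "Generate a caption for a cooking video that is relatable to viewers.",
        f"The user's estimated emotional state is: {emotion}.",
    ]
    if ingredients:
        lines.append("Ingredients: " + ", ".join(ingredients))
    if steps:
        lines.append("Steps:")
        lines.extend(f"{number}. {step}" for number, step in enumerate(steps, 1))
    return "\n".join(lines)


print(
    build_caption_prompt(
        emotion="enjoying",
        ingredients=["raspberries", "tart shell"],
        steps=["bake the shell", "arrange the raspberries"],
    )
)
```

The resulting string would then be sent to whichever generative AI model the server uses; that call is omitted here because the disclosure does not fix a particular API.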
[0424] Step 6:
[0425] The server predicts viewer emotional responses and provides suggestions for improving videos after publication. The inputs are the generated captions, the user's emotional state, and past viewer data; the output is suggestions for improving the video content. The server analyzes viewer history and reactions to generate suggestions for editing the video to be more engaging.
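The disclosure does not describe how viewer history is analyzed. Purely as an illustrative stand-in, the Python sketch below takes hypothetical per-segment retention figures (the fraction of viewers still watching at the start of each segment) and flags the segments during which retention falls sharply, which could then be passed on as candidates for improvement. The retention data, the `threshold` value, and the function name are all assumptions.

```python
def flag_dropoff_segments(retention: list[float], threshold: float = 0.15) -> list[int]:
    """Return indices of segments during which viewer retention fell by more than threshold.

    retention[i] is the fraction of viewers still watching at the start of segment i,
    so the drop during segment i is retention[i] - retention[i + 1].
    """
    flagged: list[int] = []
    for i in range(len(retention) - 1):
        if retention[i] - retention[i + 1] > threshold:
            flagged.append(i)
    return flagged


# Retention falls from 90% to 60% during segment 1, so that segment is flagged.
print(flag_dropoff_segments([1.0, 0.9, 0.6, 0.55]))  # [1]
```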
[0426] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0427] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. A prompt containing instructions is input to the data generation model 58, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
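The input and output relationship of the data generation model 58 can be summarized as a small Python interface sketch. The type names `InferenceData`, `InferenceResult`, and `DataGenerationModel` and the method name `infer` are hypothetical and are not an API of any real product; the stub model only echoes its input so that the example runs without an actual generative AI.

```python
from dataclasses import dataclass
from typing import Literal, Protocol


@dataclass
class InferenceData:
    modality: Literal["audio", "text", "image"]  # kinds of input named in the text
    payload: bytes


@dataclass
class InferenceResult:
    modality: Literal["audio", "text"]  # kinds of output named in the text
    payload: bytes


class DataGenerationModel(Protocol):
    """Hypothetical interface: a prompt plus inference data in, an inference result out."""

    def infer(self, prompt: str, data: list[InferenceData]) -> InferenceResult: ...


class EchoModel:
    """Stub that reports the prompt and input modalities instead of running a real model."""

    def infer(self, prompt: str, data: list[InferenceData]) -> InferenceResult:
        summary = f"{prompt} [{', '.join(d.modality for d in data)}]"
        return InferenceResult("text", summary.encode("utf-8"))


model: DataGenerationModel = EchoModel()
result = model.infer("Summarize", [InferenceData("image", b""), InferenceData("text", b"hi")])
print(result.payload.decode("utf-8"))  # Summarize [image, text]
```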
[0428] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.
[0429] [Third Embodiment]
[0430] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.
[0431] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.
[0432] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).
[0433] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.
[0434] The microphone 238 receives instructions and the like from the user 20 by picking up the user's voice. The microphone 238 captures the voice uttered by the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.
[0435] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0436] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0437] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0438] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0439] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0440] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0441] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".
[0442] This invention is a system that effectively analyzes cooking videos and provides convenient recipe information and editing support for both viewers and creators. Specific embodiments of the system are described below.
[0443] First, users upload cooking videos they have filmed themselves to the platform. In this step, the user selects the video on their device and submits it.
[0444] Next, the server receives the video and temporarily stores it. The video is then split frame by frame for analysis. This allows for detailed analysis of each frame as image data.
[0445] The server uses an AI image recognition algorithm to identify ingredients and cooking steps within video frames. This analysis generates text data containing ingredient names, cooking progress, and other relevant information.
[0446] Based on the generated text data, the server automatically creates captions corresponding to the video content. The captions are assigned along the video's timeline, allowing viewers to instantly access information that matches the video.
[0447] Furthermore, the server provides visual editing suggestions. These suggestions include things like scene transitions, shortening redundant parts, and highlighting important scenes, aiming to improve the overall quality of the video.
[0448] Finally, the user, having received the provided text data and editing suggestions, performs the final editing of the video based on that information. After adjustments, the user can publish the video on the platform.
[0449] As a concrete example, consider a case where a user uploads a video of a pasta dish. The server recognizes ingredients such as tomatoes and pasta from the frames and creates captions for steps such as "boiling the pasta" and "mixing the sauce." Furthermore, it makes editing suggestions such as "shorten the boiling time" and "emphasize the scene of stirring the cream sauce." As a result, the user can complete a high-quality video with detailed recipe information in a short amount of time.
[0450] In this way, the system reduces the burden on users while enabling the provision of user-friendly content for viewers.
[0451] The following describes the processing flow.
[0452] Step 1:
[0453] Users upload cooking videos to the platform. Through the device interface, users select video files and press the upload button, sending the video data to the server.
[0454] Step 2:
[0455] The server checks the received video and saves it to temporary storage. It verifies the format and size of the video file to ensure it complies with the platform's standards.
[0456] Step 3:
[0457] The server divides the video into frames for analysis. Based on the frame rate, it extracts frames at intervals that allow for efficient and appropriate analysis.
[0458] Step 4:
[0459] The server uses an AI model to identify ingredients and cooking steps from image data within a frame. In this process, ingredients, tools, and cooking steps are recognized as text, and the data is structured.
[0460] Step 5:
[0461] The server generates captions from the obtained ingredient and procedure data. Based on the text data, it generates explanatory text that is easy for viewers to understand and adds captions to the video along the timeline.
[0462] Step 6:
[0463] The server provides video editing suggestions. Based on the analysis results, it generates scene cuts and improvement suggestions, and offers ideas to highlight visual effects and key points.
[0464] Step 7:
[0465] The device displays text information, captions, and editing suggestions received from the server to the user. The user can then use this information to make final edits to the video.
[0466] Step 8:
[0467] The user publishes their edited video on the platform. After a final review and any necessary adjustments, it is made available for other users to view.
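Steps 2 and 3 of the flow above (checking the uploaded file, then choosing which frames to analyze based on the frame rate) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the allowed extensions, the size limit, the one-second sampling interval, and the function names `check_upload` and `sample_frame_indices` are all assumptions that do not appear in the source.

```python
from pathlib import Path

# Hypothetical platform limits; the source does not specify any values.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm"}
MAX_SIZE_BYTES = 2 * 1024**3  # 2 GiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload is accepted."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {filename}")
    if size_bytes <= 0 or size_bytes > MAX_SIZE_BYTES:
        problems.append(f"invalid size: {size_bytes} bytes")
    return problems


def sample_frame_indices(
    total_frames: int, fps: float, interval_sec: float = 1.0
) -> list[int]:
    """Pick one frame index per `interval_sec` of video, based on the frame rate."""
    if total_frames <= 0 or fps <= 0 or interval_sec <= 0:
        return []
    step = max(1, round(fps * interval_sec))  # frames between two samples
    return list(range(0, total_frames, step))
```

Sampling one frame per interval rather than every frame keeps the number of images passed to the recognition step proportional to the video length instead of to the frame rate.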
[0468] (Example 1)
[0469] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0470] In recent years, with the increase in video content, it has become difficult for viewers to efficiently obtain information. Furthermore, creators are required to dedicate a significant amount of time and effort to editing. To address these issues, there is a need for the creation of video content that is convenient for both viewers and creators by automatically analyzing ingredients and cooking procedures and providing visual editing suggestions.
[0471] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0472] In this invention, the server includes means for receiving and storing video, means for dividing the video into individual images for analysis, and means for identifying materials and work procedures from the content within the video using an image recognition method and generating text information. This enables the rapid and accurate acquisition of material and procedure information for video content uploaded by users, and further enables efficient video creation through editing suggestions.
[0473] "Video" refers to visual information expressed in digital or analog format.
[0474] "Receiving" refers to the act of receiving data via a digital network or other means of communication.
[0475] "Storage" refers to the act of saving specific data in a way that allows for later access.
[0476] "Image" refers to individual still frames or their digital representations that constitute visual information.
[0477] "Dividing" refers to the process of separating continuous information into individual units.
[0478] An "image recognition method" refers to a process by which a computer identifies and interprets objects, features, and patterns within an image.
[0479] "Content" refers to the visual elements and their meanings contained within a video or image.
[0480] "Materials" refers to elements or items used that are specifically identified within the video.
[0481] A "work procedure" refers to a series of steps or processes performed to achieve a specific objective.
[0482] "Identifying" refers to the act of determining and evaluating specific properties or elements.
[0483] "Text information" refers to visual or auditory data expressed in the form of characters or text.
[0484] A "caption" refers to a short sentence or phrase used to describe or supplement the content of a video.
[0485] The "timeline" refers to a reference axis that indicates the flow or sequence of events over time.
[0486] A "scene" refers to a unit of composition in visual media, such as a shot or sequence.
[0487] This invention is a system that extracts material information and procedures from video content and provides efficient editing support. Specifically, a server receives video uploaded by a user and stores it temporarily. A server computer capable of high-speed data processing is preferred as the hardware to be used. The server divides the video into frames as a preliminary step to analysis. This process utilizes video editing software and image conversion libraries to achieve rapid frame separation.
[0488] Next, the server uses a generative AI model to analyze the image data of each frame. This model identifies the materials and work procedures within the video through an image recognition method and generates text information. This allows the user to obtain any individual material or procedure in text form as needed. Prompts such as "What are the materials in the video?" can be given to the generative AI model to extract this information.
[0489] Based on the generated text information, the server automatically creates captions and places them on the timeline to match specific scenes in the video. This process is performed by software equipped with a natural language processing engine. The server can also provide visual editing suggestions, supporting the editing work and increasing creator efficiency. These editing suggestions include guidance on visual effects, such as highlighting important scenes in the video.
[0490] Finally, the terminal receives the text information and suggestions from the server, and the user uses them to perform the final edit of the video. Through this system, users can quickly create and publish high-quality video content. The aim is to improve the convenience of digital video production and provide a valuable experience for both viewers and creators.
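As a rough illustration of placing generated text on the timeline, the sketch below merges consecutive sampled frames that received the same label into a single timed entry. It assumes the recognition step has already produced one label per sampled frame; the `Caption` class, the `build_captions` function, and the fixed spacing between sampled frames are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start_sec: float
    end_sec: float
    text: str


def build_captions(
    frame_labels: list[tuple[float, str]], frame_gap_sec: float
) -> list[Caption]:
    """Merge consecutive sampled frames that share a label into one caption.

    `frame_labels` holds (timestamp_sec, label) pairs sorted by timestamp;
    `frame_gap_sec` is the spacing between sampled frames.
    """
    captions: list[Caption] = []
    for timestamp, label in frame_labels:
        if captions and captions[-1].text == label:
            # Same step as the previous frame: extend the current caption.
            captions[-1].end_sec = timestamp + frame_gap_sec
        else:
            # A new step starts here: open a new caption.
            captions.append(Caption(timestamp, timestamp + frame_gap_sec, label))
    return captions
```

For example, three frames sampled one second apart and labeled "boiling the pasta", "boiling the pasta", and "mixing the sauce" would yield two timed entries, the first covering seconds 0 to 2.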
[0491] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0492] Step 1:
[0493] Users upload cooking videos to the platform using their devices. Specifically, the user selects a video file, and the device sends that file to the server. The input is the video file selected by the user, and the output is the transfer of the video file to the server.
[0494] Step 2:
[0495] The server receives the uploaded video and stores it temporarily. Specifically, the received video data is saved to the server's storage. The input is the video data sent from the terminal, and the output is the video data stored in the server's storage.
[0496] Step 3:
[0497] The server divides the stored video into individual frames. Specifically, it splits the video at regular time intervals to generate still frames. This process uses an image processing library to convert continuous data into individual images. The input is the stored video data, and the output is individual frame image data.
[0498] Step 4:
[0499] The server analyzes each frame using a generative AI model. Specifically, it uses an image recognition algorithm to identify materials and work procedures within the frame and generates text information. This analysis outputs material names and work procedures as text. The input is frame image data, and the output is text information including materials and procedures.
[0500] Step 5:
[0501] The server generates captions based on the text information and places them according to the scenes in the video. A natural language processing engine is used to create captions from the text data and synchronize them with the video timeline. The input is text data, and the output is captions aligned with the video timeline.
[0502] Step 6:
[0503] The server generates editing suggestions that include visual effects. Specifically, it uses a generative AI model to analyze the results and make suggestions such as highlighting important scenes in the video. The input is the analyzed video information, and the output is editing instruction data that includes suggestions for visual effects.
[0504] Step 7:
[0505] The terminal receives text information and editing suggestions from the server, and the user performs the final video editing. Specifically, the user makes necessary corrections and edits using the provided information. The input is the text information and editing suggestions sent from the server, and the output is the final edited video file.
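The shape of the editing instruction data produced in Step 6 can be illustrated with a small rule-based stand-in. Note that the source has a generative AI model produce these suggestions; the heuristic below (flag long segments for shortening, flag segments containing key terms for highlighting) is swapped in only to show plausible input and output. The function name, the 30-second threshold, and the key terms are assumptions.

```python
def suggest_edits(
    segments: list[tuple[float, float, str]],
    max_len_sec: float = 30.0,
    key_terms: tuple[str, ...] = ("stir", "season"),
) -> list[dict]:
    """Return editing suggestions for (start_sec, end_sec, text) segments.

    A rule-based placeholder for the model-generated suggestions.
    """
    suggestions = []
    for start, end, text in segments:
        if end - start > max_len_sec:
            # Long segments are candidates for trimming.
            suggestions.append({
                "type": "shorten",
                "start_sec": start,
                "end_sec": end,
                "reason": f"'{text}' runs {end - start:.0f} s",
            })
        if any(term in text.lower() for term in key_terms):
            # Segments mentioning a key action are candidates for emphasis.
            suggestions.append({
                "type": "highlight",
                "start_sec": start,
                "end_sec": end,
                "reason": f"key step: '{text}'",
            })
    return suggestions
```

Keeping each suggestion as a small record with a type, a time range, and a reason makes it straightforward for the terminal to list the suggestions next to the corresponding part of the timeline.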
[0506] (Application Example 1)
[0507] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0508] While video content has become widespread, and cooking-related videos are increasing, it's not easy for viewers to efficiently understand and practice cooking methods while watching videos. In particular, repeated viewing is necessary to grasp specific ingredients and procedures, which presents a significant time and cognitive burden for users. Creators are also seeking support to efficiently edit engaging videos.
[0509] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0510] In this invention, the server includes means for receiving and storing video; means for dividing the video frame by frame for analysis; means for identifying ingredients and procedures from the video using an image recognition algorithm and generating text data; means for generating visual and temporal editing suggestions; means for transmitting the generated text data, captions, and editing suggestions to a mobile device; and means for visualizing the video analysis results in real time, enabling users to understand the cooking method in accordance with the video. This allows viewers to efficiently grasp cooking procedures as they watch the video, and enables creators to provide engaging content by supporting high-quality video editing.
[0511] "Means for receiving and storing video" refers to a device or system that receives video data transmitted via the Internet or other means of communication and stores it temporarily or permanently in a physical or virtual storage device.
[0512] "Means for dividing the video frame by frame for analysis" refers to the process of dividing a received video into multiple frames, which are individual still images, and converting the image data of each frame into a format that can be analyzed independently.
[0513] "Means for identifying ingredients and procedures from the video using an image recognition algorithm and generating text data" refers to a technology or method that applies image recognition technology such as AI to frames extracted from a video to automatically extract information such as ingredients and cooking procedures, and converts that information into text.
[0514] "Means for generating visual and temporal editing suggestions" refers to a function that analyzes the entire video visually and temporally and suggests editing options to maximize visual effects and scene flow.
[0515] "Means for transmitting generated text data, captions, and editing suggestions to a mobile device" refers to a method of transmitting information such as character data, captions, and editing suggestions generated on a server to a portable device such as a smartphone used by a user, enabling display or editing assistance.
[0516] "Means for visualizing the video analysis results in real time, enabling users to understand the cooking method in accordance with the video" refers to a function that overlays information on cooking procedures and ingredients obtained through video analysis on the video while it is being watched, allowing users to understand and practice the content in step with the video.
[0517] This invention is a system that analyzes cooking videos to help viewers efficiently understand cooking methods and to support creators in providing high-quality content. The system is configured as follows:
[0518] The server first receives cooking videos sent by users and saves them to its storage device. These saved videos are then divided into frames. Each frame is analyzed using an AI-based image recognition algorithm to identify the ingredients and cooking steps within the video. The identified information is generated as text data and later used to create captions. Specifically, Python and TensorFlow are used to perform image recognition, identify ingredients, and extract cooking steps.
[0519] Furthermore, the server automatically generates suggestions for visual and temporal editing of the entire video based on the analysis results, including suggestions for visual effects, extraction of important scenes, and shortening of redundant parts. These editing suggestions are sent to a mobile device such as a smartphone, for example to an application built with React Native. This allows users to watch the video while viewing the analysis results visualized in real time, enabling them to efficiently grasp the overall picture of the cooking process.
[0520] As a concrete example, suppose a user uploads a video titled "How to Make Omurice" to the system. The system analyzes the video, identifies steps such as "cracking the eggs" and "frying the rice," and adds these as captions. It also makes editing suggestions, such as "emphasizing the key points for making the eggs fluffy." Furthermore, as an example of a prompt, the AI can be asked questions such as, "What ingredients are used in this video? Also, please tell me the specific cooking steps."
[0521] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0522] Step 1:
[0523] The user selects a cooking video on their device and uploads it to the server. The input is the cooking video file provided from the device. The output is the video file transferred to the server and stored there. This process prepares the video for subsequent analysis.
[0524] Step 2:
[0525] The server divides the received video into frames. The input is a video file stored on the server. The output is the video decomposed into a collection of still image frames. This process allows for individual analysis of each frame.
[0526] Step 3:
[0527] The server applies an image recognition algorithm to identify ingredients and cooking steps within each frame and extract text data. Still image frames are provided as input. Text data containing ingredient names and cooking steps is obtained as output. An AI model using Python and TensorFlow extracts features from the images and identifies the information.
[0528] Step 4:
[0529] The server generates captions and visual and temporal editing suggestions based on the extracted text data. Text data regarding ingredients and procedures is used as input. The output includes caption information and video editing suggestions. This allows the captions to be displayed in sync with the flow of the video, and the editing suggestions aim to improve the quality of the content.
[0530] Step 5:
[0531] The server sends the generated captions and editing suggestions to the terminal and provides them to the user. The generated captions and editing suggestions are used as input. This information is delivered to the user's terminal as output. The user can efficiently understand the video content through real-time visualization on their terminal.
[0532] Step 6:
[0533] The user uses their device to perform the final editing of the video based on the provided captions and editing suggestions. The captions and editing suggestions displayed on the device are used as input. The output is the final edited video, ready to be published on the platform. At this stage, the user determines the final form of the video according to their own needs and prepares it for distribution to viewers.
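The real-time visualization on the terminal (Step 5) amounts to looking up which caption applies at the current playback position. A minimal sketch follows, assuming the captions arrive as non-overlapping (start, end, text) tuples sorted by start time; the function name `caption_at` and this data layout are illustrative assumptions, not part of the source.

```python
from bisect import bisect_right
from typing import Optional


def caption_at(
    captions: list[tuple[float, float, str]], playback_sec: float
) -> Optional[str]:
    """Return the caption text covering `playback_sec`, or None if there is none.

    `captions` must be sorted by start time and must not overlap.
    """
    starts = [start for start, _end, _text in captions]
    # Index of the last caption whose start time is <= playback_sec.
    i = bisect_right(starts, playback_sec) - 1
    if i >= 0 and playback_sec < captions[i][1]:
        return captions[i][2]
    return None
```

A binary search keeps the lookup cheap enough to call on every playback tick. In a real player the list of start times would be built once rather than on each call.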
[0534] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0535] This invention combines an emotion engine with a system that analyzes cooking videos to provide detailed recipe information, enabling the adjustment of editing and caption generation based on the user's emotional state. Specific embodiments of the system are described below.
[0536] First, users upload cooking videos to the platform. The uploaded video files are received by the server and temporarily stored. The server then divides the video frame by frame and converts it into a data format that allows for detailed analysis.
[0537] Next, the server uses an image recognition algorithm to identify the ingredients and cooking steps in each frame. As a result of this process, text data about the ingredients and steps is generated.
[0538] This system incorporates an emotion engine that analyzes the user's emotional state. The device analyzes the user's voice tone and facial expression data to determine their current emotional state. For example, it collects information such as whether the user is happy, confused, or depressed.
[0539] The server feeds the emotion engine's determination back into caption generation and editing suggestions. As a result, the content and tone of the captions reflect the user's emotions, and the editing suggestions take different approaches depending on those emotions. For example, if the server determines that the user is enjoying themselves, it can create captions that emphasize the humor in the video.
[0540] Furthermore, the emotion engine can predict viewers' emotional reactions and suggest improvements to the video after its release. This allows for adjustments to make the video more enjoyable for viewers.
[0541] As a concrete example, suppose a user uploads a video of themselves making desserts and explains the process in a cheerful mood. The emotion engine detects this emotion and reflects it in the caption and editing suggestions. For example, a positive-toned caption such as "How to make cookies that everyone can enjoy making together" is automatically generated. In this way, the system can provide content that appeals to viewers while taking the user's emotions into consideration.
[0542] This system allows users to edit videos with emotional depth and deliver recipe videos that are tailored to maximize viewer enjoyment.
[0543] The following describes the processing flow.
[0544] Step 1:
[0545] Users film cooking videos and upload them to the platform. They use their device's interface to select and submit the video files to be uploaded.
[0546] Step 2:
[0547] The server receives the uploaded video and saves it to temporary storage. Simultaneously with saving, it verifies the file format and size and prepares it for analysis.
[0548] Step 3:
[0549] The server divides the video into frames. In this step, the video is extracted as still images at a constant frame rate for efficient analysis.
[0550] Step 4:
[0551] The server uses an image recognition algorithm to identify the ingredients and cooking steps within the frame. This algorithm then extracts the objects and their actions in the video as text data.
[0552] Step 5:
[0553] The device analyzes the user's voice and facial expressions using an emotion engine. It evaluates the user's emotional state during shooting in real time or retrospectively and sends the information to the server.
[0554] Step 6:
[0555] The server generates captions based on the received text data and the results from the emotion engine. The generated captions reflect the user's emotions in both content and tone, and are arranged chronologically.
[0556] Step 7:
[0557] The server provides visual and temporal editing suggestions for the video. This includes leveraging emotion engine data to tailor edits to the user's emotional state. For example, it may include editing instructions that highlight enjoyable parts.
[0558] Step 8:
[0559] The device provides the user with captions and editing suggestions received from the server. The user then makes final adjustments to the video based on this information, and reviews and corrects the captions and editing content.
[0560] Step 9:
[0561] Users review the edited videos and publish them on the platform. After final review and adjustments, the videos are distributed to viewers. This ensures that personalized and emotionally resonant cooking videos are delivered to audiences.
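Step 7 mentions editing instructions that highlight enjoyable parts. One way to derive such parts is sketched below, assuming the emotion engine reports one emotion category per evenly spaced time sample. The sample format, the `"happy"` category label, and the function name `enjoyable_intervals` are assumptions made for illustration.

```python
def enjoyable_intervals(
    samples: list[tuple[float, str]],
    sample_gap_sec: float,
    target: str = "happy",
) -> list[tuple[float, float]]:
    """Merge consecutive samples with the target emotion into time intervals.

    `samples` holds (timestamp_sec, emotion) pairs sorted by timestamp and
    spaced `sample_gap_sec` apart.
    """
    intervals: list[tuple[float, float]] = []
    for timestamp, emotion in samples:
        if emotion != target:
            continue
        if intervals and abs(intervals[-1][1] - timestamp) < 1e-9:
            # This sample continues the previous interval: extend it.
            intervals[-1] = (intervals[-1][0], timestamp + sample_gap_sec)
        else:
            intervals.append((timestamp, timestamp + sample_gap_sec))
    return intervals
```

The resulting intervals could then be attached to the editing suggestions as candidate scenes to emphasize.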
[0562] (Example 2)
[0563] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset terminal 314 will be referred to as the "terminal."
[0564] Existing video analysis systems lack the ability to generate captions or suggest edits based on the user's emotional state, making it difficult to adjust content to capture viewers' interest. This is particularly problematic for content such as cooking videos, where the inability to edit and caption in a way that considers the emotions of viewers and users makes it difficult to create engaging content.
[0565] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0566] In this invention, the server includes means for receiving video and storing it in an information recording device; means for dividing the video into electrical signals for analysis; means for identifying objects and operating procedures in the video and generating text information using image recognition processing; means for determining the emotional state of the user using a user reaction analysis device and generating text information and editing suggestions based on the determination result; and means for transmitting the generated text information, captions, and editing suggestions to an information processing device. This makes it possible to generate captions and suggest edits that correspond to the emotional state of the user, and to provide content that appeals to viewers.
[0567] "Video" is digital media that expresses movement through the continuous display of images.
[0568] An "information recording device" is a storage device for electronically storing received data.
[0569] An "electrical signal" is a signal that represents the flow of digital or analog data that makes up video or audio.
[0570] "Image recognition processing" is a technology that identifies specific visual elements from digital image data and analyzes them.
[0571] "Object" refers to an object, material, or element within an image that is the subject of identification and analysis.
[0572] An "operating procedure" is a series of actions or steps performed to achieve a specific objective.
[0573] "Text information" refers to text-based information generated through image recognition processing.
[0574] A "user reaction analysis device" is a device used to measure and analyze the emotions and reactions of users.
[0575] "Emotional state" refers to the emotional reactions and psychological state exhibited by the user.
[0576] An "editing suggestion" is a proposed editing option or method to make the video content more effective.
[0577] An "information processing device" is a computer used to process data and output analysis results or suggestions.
[0578] This invention is a system in which users upload cooking videos, and the system analyzes those videos to provide detailed recipe information. Furthermore, by incorporating an emotion engine, this system can adjust editing and caption generation based on the user's emotional state.
[0579] Users upload cooking videos to the platform using their own devices. These videos are received by the server and temporarily stored in an information recording device. The server then uses video analysis software such as OpenCV to convert the video into electrical signals frame by frame. This conversion process prepares each frame to be analyzed as individual image data.
[0580] The server uses image recognition technology to analyze each frame of the image and identify the object and the operating procedure. Machine learning frameworks such as TensorFlow and PyTorch are used for this analysis, and the identified information is output as text data. This generates specific text data about the materials and procedures.
[0581] The device also acquires the user's voice tone and facial expressions, and uses a user reaction analysis device to determine their emotional state. This analysis utilizes a voice recognition API and facial recognition software. This allows for accurate measurement of the user's emotional state, such as whether they are enjoying themselves or feeling confused.
[0582] The server utilizes an emotion engine to generate text information and editing suggestions based on the analysis results. For example, if the emotion analysis determines that the user is enjoying themselves, the server can generate a caption with a positive tone using a generative AI model such as GPT. This makes it possible to provide more engaging content to viewers.
[0583] For example, if a user is smiling while explaining in a dessert-making video, a caption that emphasizes a fun atmosphere, such as "How to make cookies that everyone can enjoy making together," is automatically generated. In this way, the system reflects the user's emotions in real time and ultimately provides videos that resonate with viewers.
[0584] An example of a prompt might be, "Generate a humorous caption that fits a video of making desserts. The user is enjoying themselves." Following this prompt, the generative AI model can create the necessary captions and editing suggestions.
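A prompt of this form could be assembled from the video topic and the determined emotional state along the following lines. This is a sketch only: the emotion categories, the tone chosen for each, and the function name `build_caption_prompt` are assumptions, and the call to the generative AI model itself is deliberately left out.

```python
# Hypothetical mapping from an emotion category to (caption tone, description).
EMOTION_HINTS = {
    "happy": ("humorous", "enjoying themselves"),
    "confused": ("reassuring", "feeling confused"),
}
DEFAULT_HINT = ("neutral", "in a neutral mood")


def build_caption_prompt(video_topic: str, emotion: str) -> str:
    """Compose a caption-generation prompt that reflects the user's emotion."""
    tone, description = EMOTION_HINTS.get(emotion, DEFAULT_HINT)
    return (
        f"Generate a {tone} caption that fits a video of {video_topic}. "
        f"The user is {description}."
    )
```

With these assumed mappings, `build_caption_prompt("making desserts", "happy")` reproduces the example prompt quoted above.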
[0585] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0586] Step 1:
[0587] Users upload cooking videos from their devices to the platform. The cooking video file is received by the server as input, and the server saves this video to an information recording device. Once saved, the video becomes available for subsequent analysis processes.
[0588] Step 2:
[0589] The server converts the stored video into electrical signals frame by frame for analysis. This process uses video analysis software, which divides the input video data into individual image data for output. This allows for analysis of each individual image.
[0590] Step 3:
[0591] The server analyzes each frame using image recognition processing to identify the object and the operating procedure. The input for this step is image data for each frame, and the output is textual information about the ingredients and cooking procedure. For this purpose, a machine learning framework is used to identify and digitize specific visual elements.
[0592] Step 4:
[0593] The terminal collects the user's voice tone and facial expressions as input data and uses a user reaction analysis device to determine their emotional state. The output is data on the emotional state shown by the user. This analysis is performed using a voice recognition API and facial recognition software, and the user's emotion is expressed as a numerical value or a category.
[0594] Step 5:
[0595] The server uses an emotion engine to generate text information and editing suggestions that reflect the emotional state. The input to this process is the user's emotional state data from Step 4 and the text information generated in Step 3, and the output is a caption and editing suggestions corresponding to the emotion. A generative AI model, for example GPT, is used to create the emotion-based captions.
[0596] Step 6:
[0597] The server sends the generated captions and editing suggestions to the information processing unit. The input is the generated caption data and editing suggestion data, which are sent directly to the information processing unit as output. This prepares the content for the audience to enjoy.
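As a non-limiting illustration of Steps 4 and 5 above, the following Python sketch shows one way the emotional state and the recognized text could be combined into a prompt. The function name `build_caption_prompt`, the emotion categories, and the tone table are hypothetical and do not appear in the disclosure; the call to the generative AI model itself is omitted.

```python
# Hypothetical tone table: categories and wording are illustrative assumptions,
# not values taken from the disclosure.
TONE_BY_EMOTION = {
    "joy": "humorous and upbeat",
    "confusion": "calm and step-by-step",
    "sadness": "gentle and encouraging",
}


def build_caption_prompt(emotion: str, recipe_text: str) -> str:
    """Combine the emotional state (Step 4) with the recognized text (Step 3)."""
    tone = TONE_BY_EMOTION.get(emotion, "neutral")  # fall back for unknown categories
    return (
        f"Generate a {tone} caption and editing suggestions for a cooking video. "
        f"Detected user emotion: {emotion}. "
        f"Recognized ingredients and steps: {recipe_text}"
    )


print(build_caption_prompt("joy", "flour, butter, sugar; mix, shape, bake"))
```

Keeping the tone table separate from the prompt template means new emotion categories could be added without touching the prompt wording.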
[0598] (Application Example 2)
[0599] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."
[0600] In cooking videos, there is a growing demand for dynamic editing that responds to the user's emotional state, providing viewers with more engaging and personalized content. However, conventional systems have been insufficient in predicting viewers' emotional reactions and improving videos, or in generating captions that take user emotions into account. Therefore, an improvement in the user experience is desired.
[0601] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0601] In this invention, the server includes means for receiving and storing video, means for dividing the video into frames for analysis, and means for identifying substances and processes from the video using an image recognition algorithm and generating textual information. This makes it possible to analyze the user's emotional state and adjust captions and editing suggestions based on those emotions, predict the viewer's emotional response and make improvement suggestions, and send prompt sentences to a generative AI model to obtain improvement suggestions.
[0603] "Means for receiving and saving video" refers to a device or process that receives video data provided from an external source and stores it on a storage medium.
[0604] "Means for dividing into frames" refers to a technique for dividing temporally continuous video footage into individual still images for the purpose of analyzing video data.
[0605] An "image recognition algorithm" is a program or method that automatically identifies and interprets specific objects or situations from image data.
[0606] "Means for identifying materials and processes and generating textual information" refers to a system that, based on image recognition results, transcribes the materials used and procedures performed within a video into text and records them in digital format.
[0607] An "emotion analysis engine that analyzes the emotional state of users" is a technology that infers the user's psychological state at a given time from the nuances of their facial expressions and voice.
[0608] "A method for sending prompt sentences to a generative AI model and obtaining improvement suggestions" refers to a method that uses natural language processing technology to provide input to an artificial intelligence system that generates responses to requests, and to obtain suggestions for improvement.
[0609] "Means for predicting viewer emotional responses and proposing improvements" refers to a process of estimating emotional responses from viewer behavior and feedback, and then proposing specific changes to improve the quality of content based on those estimates.
[0610] This invention begins with a user uploading a cooking video to the platform. The server receives this video and stores it temporarily. The video is then divided into frames for analysis. Software such as OpenCV or FFmpeg is often used for this process.
[0611] Next, the server applies an image recognition algorithm to identify the substance and process from each frame. The data obtained through this process is stored in text format. At this stage, machine learning frameworks such as TensorFlow and PyTorch are used.
[0612] Furthermore, the emotion analysis engine determines the user's emotional state based on the voice and facial expression data provided by the user's device. Google Cloud Speech-to-Text API and Microsoft Azure's Emotion API are utilized for this analysis.
[0613] Upon receiving this emotional state information, the server sends a prompt to the generative AI model to obtain feedback on caption generation and editing. An example of such a prompt is, "Generate captions that are relatable to viewers and create a fun atmosphere for a cooking video in which the user is enjoying explaining the subject."
[0614] Ultimately, suggestions for improving the video are made based on viewer emotional responses. This content is then visually and temporally edited before being sent to the device. In this way, viewers can enjoy personalized and engaging videos.
[0615] For example, when a user explains "how to make a raspberry tart" to viewers, the system captures the user's joyful emotions and generates captions that emphasize them. This makes the video even more appealing to viewers.
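As a non-limiting illustration of how an emotion analysis engine might combine voice and facial expression data into a single emotional state, the following Python sketch blends per-emotion scores from the two sources and selects the highest-scoring category. The function name `fuse_emotions`, the score dictionaries, and the weighted-average rule are assumptions introduced for illustration; the disclosure does not specify how the two inputs are combined, and no external API is called here.

```python
def fuse_emotions(
    voice: dict[str, float],
    face: dict[str, float],
    voice_weight: float = 0.5,
) -> tuple[str, float]:
    """Blend per-emotion scores from two modalities and return the top category."""
    labels = set(voice) | set(face)
    if not labels:
        raise ValueError("at least one emotion score is required")
    fused = {
        label: voice_weight * voice.get(label, 0.0)
        + (1.0 - voice_weight) * face.get(label, 0.0)
        for label in labels
    }
    # Iterating in sorted order makes ties resolve deterministically.
    best = max(sorted(fused), key=fused.get)
    return best, fused[best]


# Hypothetical scores, e.g. as returned by separate voice and face analyzers.
label, score = fuse_emotions(
    {"joy": 0.8, "sadness": 0.2},
    {"joy": 0.6, "confusion": 0.4},
)
print(label, round(score, 2))
```

The result can be reported either as a category (`label`) or as a numerical value (`score`), matching the two output forms mentioned for the emotional state.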
[0616] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0617] Step 1:
[0618] Users upload cooking videos to the platform. The server receives and stores these videos. The input is the cooking video uploaded by the user, and the output is a video file stored on the server. The server checks the video format and saves it in the appropriate format.
[0619] Step 2:
[0620] The server divides the video frame by frame for analysis. The input is a saved video file, and the output is individual frame images. The server converts the video to still images based on the frame rate and records the timestamp of each frame.
[0621] Step 3:
[0622] The server applies an image recognition algorithm to each frame of the image to identify substances and processes. The input is the frame image, and the output is textual information of the recognized substances and processes. The server uses a machine learning model to analyze the content of each frame and saves the results as text data.
[0623] Step 4:
[0624] The device acquires recorded audio data and facial expression data captured by the camera and sends them to the emotion analysis engine. The input is audio and video data, and the output is the user's emotional state. The device uses the emotion analysis engine to analyze the tone of voice and facial expressions to determine the current emotion.
[0625] Step 5:
[0626] The server sends a prompt to the generative AI model based on the emotional state and analyzed text data, and retrieves a caption and editing suggestions. The input is the emotional state and text data, and an example prompt is "Generate a caption for a cooking video where the user is enjoying explaining, that is relatable to viewers and creates a fun atmosphere." The output is the adjusted caption and editing suggestions.
[0627] Step 6:
[0628] The server predicts viewer emotional responses and provides suggestions for improving videos after publication. Inputs include generated captions, user emotional states, and past viewer data; output is improved video content. The server analyzes viewer history and reactions to generate suggestions for editing more engaging videos.
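As a non-limiting illustration of Step 2 above, in which the timestamp of each frame is recorded based on the frame rate, the following Python sketch computes per-frame timestamps. It assumes a constant frame rate; the function name `frame_timestamps` is hypothetical, and decoding of the video itself is omitted.

```python
def frame_timestamps(frame_count: int, fps: float) -> list[float]:
    """Return the time in seconds of each frame, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [index / fps for index in range(frame_count)]


print(frame_timestamps(4, 2.0))  # [0.0, 0.5, 1.0, 1.5]
```

For variable-frame-rate video this assumption does not hold, and timestamps would instead need to be read from the container.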
[0629] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0630] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0631] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.
[0632] [Fourth Embodiment]
[0633] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.
[0634] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.
[0635] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).
[0636] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.
[0637] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.
[0638] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).
[0639] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.
[0640] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.
[0641] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.
[0642] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.
[0643] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.
[0644] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.
[0645] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0646] This invention is a system that effectively analyzes cooking videos and provides convenient recipe information and editing support for both viewers and creators. Specific embodiments of the system are described below.
[0647] First, users upload cooking videos they have filmed themselves to the platform. This process requires the user's device to select and submit the video.
[0648] Next, the server receives the video and temporarily stores it. The video is then split frame by frame for analysis. This allows for detailed analysis of each frame as image data.
[0649] The server uses an AI image recognition algorithm to identify ingredients and cooking steps within video frames. This analysis generates text data containing ingredient names, cooking progress, and other relevant information.
[0650] Based on the generated text data, the server automatically creates captions corresponding to the video content. The captions are assigned along the video's timeline, allowing viewers to instantly access information that matches the video.
[0651] Furthermore, the server provides visual editing suggestions. These suggestions include things like scene transitions, shortening redundant parts, and highlighting important scenes, aiming to improve the overall quality of the video.
[0652] Finally, the user, having received the provided text data and editing suggestions, performs the final editing of the video based on that information. After adjustments, the user can publish the video on the platform.
[0653] As a concrete example, consider a case where a user uploads a video of a pasta dish. The server recognizes ingredients such as tomatoes and pasta from the frames and creates captions for steps such as "boiling the pasta" and "mixing the sauce." Furthermore, it makes editing suggestions such as "shorten the boiling time" and "emphasize the scene of stirring the cream sauce." As a result, the user can complete a high-quality video with detailed recipe information in a short amount of time.
[0654] In this way, the system reduces the burden on users while enabling the provision of user-friendly content for viewers.
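As a non-limiting illustration of assigning captions along the video's timeline, the following Python sketch merges runs of identical per-frame recognition results into caption segments with start and end times. The function name `captions_from_frames` and the assumption of one label per analyzed frame at a constant frame rate are introduced for illustration only.

```python
from itertools import groupby


def captions_from_frames(
    labels: list[str], fps: float
) -> list[tuple[float, float, str]]:
    """Merge runs of identical per-frame labels into (start, end, text) segments."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    segments = []
    index = 0
    for text, run in groupby(labels):
        length = len(list(run))
        segments.append((index / fps, (index + length) / fps, text))
        index += length
    return segments


# Hypothetical per-frame labels for the pasta example, one analyzed frame per second.
labels = ["boiling the pasta"] * 3 + ["mixing the sauce"] * 2
print(captions_from_frames(labels, fps=1.0))
# [(0.0, 3.0, 'boiling the pasta'), (3.0, 5.0, 'mixing the sauce')]
```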
[0655] The following describes the processing flow.
[0656] Step 1:
[0657] Users upload cooking videos to the platform. Through the device interface, users select video files and press the upload button, sending the video data to the server.
[0658] Step 2:
[0659] The server checks the received video and saves it to temporary storage. It verifies the format and size of the video file to ensure it complies with the platform's standards.
[0660] Step 3:
[0661] The server divides the video into frames for analysis. Based on the frame rate, it extracts frames at intervals that allow for efficient and appropriate analysis.
[0662] Step 4:
[0663] The server uses an AI model to identify ingredients and cooking steps from image data within a frame. In this process, ingredients, tools, and cooking steps are recognized as text, and the data is structured.
[0664] Step 5:
[0665] The server generates captions from the obtained material and procedure data. Based on the text data, it generates easy-to-understand explanatory text for viewers and adds captions to the video along the timeline.
[0666] Step 6:
[0667] The server provides video editing suggestions. Based on the analysis results, it generates scene cuts and improvement suggestions, and offers ideas to highlight visual effects and key points.
[0668] Step 7:
[0669] The device displays text information, captions, and editing suggestions received from the server to the user. The user can then use this information to make final edits to the video.
[0670] Step 8:
[0671] The user publishes their edited video on the platform. After a final review and any necessary adjustments, it is made available for other users to view.
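As a non-limiting illustration of Step 2 above, in which the server verifies the format and size of the received video file, the following Python sketch performs such a check. The function name `check_upload`, the allowed extensions, and the size limit are hypothetical; the disclosure does not state the platform's actual standards.

```python
from pathlib import Path

# Illustrative platform limits; the disclosure does not specify actual values.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm"}
MAX_SIZE_BYTES = 2 * 1024**3  # 2 GiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload is accepted."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds size limit")
    return problems


print(check_upload("pasta.MP4", 150_000_000))  # []
print(check_upload("notes.txt", 10))           # ['unsupported format: .txt']
```

Returning every problem found, instead of stopping at the first, lets the terminal show the user everything that needs to be fixed before re-uploading.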
[0672] (Example 1)
[0673] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0674] In recent years, with the increase in video content, it has become difficult for viewers to efficiently obtain information. Furthermore, creators are required to dedicate a significant amount of time and effort to editing. To address these issues, there is a need for the creation of video content that is convenient for both viewers and creators by automatically analyzing ingredients and cooking procedures and providing visual editing suggestions.
[0675] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.
[0676] In this invention, the server includes means for receiving and storing video, means for dividing the video into individual images for analysis, and means for identifying materials and work procedures from the content within the video using an image recognition method and generating text information. This enables the rapid and accurate acquisition of material and procedure information for video content uploaded by users, and further enables efficient video creation through editing suggestions.
[0677] "Video" refers to visual information expressed in digital or analog format.
[0678] "Receiving" refers to the act of receiving data via a digital network or other means of communication.
[0679] "Storage" refers to the act of saving specific data in a way that allows for later access.
[0680] "Image" refers to individual still frames or their digital representations that constitute visual information.
[0681] "Partitioning" refers to the process of dividing continuous information into individual units.
[0682] "Image recognition techniques" refer to the process by which computers identify and interpret objects, features, and patterns within an image.
[0683] "Content" refers to the visual elements and their meanings contained within a video or image.
[0684] "Materials" refers to elements or items used that are specifically identified within the video.
[0685] A "work procedure" refers to a series of steps or processes performed to achieve a specific objective.
[0686] "Identification" refers to the act of identifying and evaluating specific properties or elements.
[0687] "Textual information" refers to visual or auditory data expressed in the form of characters or text.
[0688] A "title" refers to a short sentence or phrase used to describe or supplement the content of a video.
[0689] The term "time axis" refers to a framework of thought that indicates the flow or sequence of events over time.
[0690] A "scene" refers to a unit of composition in visual media, such as a shot or sequence.
[0691] This invention is a system that extracts material information and procedures from video content and provides efficient editing support. Specifically, a server receives video uploaded by a user and stores it temporarily. A server computer capable of high-speed data processing is preferred as the hardware to be used. The server divides the video into frames as a preliminary step to analysis. This process utilizes video editing software and image conversion libraries to achieve rapid frame separation.
[0692] Next, the server uses a generative AI model to analyze the image data of each frame. This model identifies the materials and work procedures within the video through image recognition techniques and generates textual information. This allows the user to obtain individual materials and procedures in text format as needed. The generative AI model can be given prompts such as "What are the materials in the video?" to extract the relevant information.
[0693] Based on the generated text information, the server automatically creates titles and places them on the timeline to match specific scenes in the video. This process is performed by software equipped with a natural language processing engine. Furthermore, it can also provide visual editing suggestions, offering collaborative editing support to increase creator efficiency. These editing suggestions include guidance on visual effects, such as highlighting important video scenes.
[0694] Ultimately, the terminal receives text information and suggestions from the server, and the user uses this to edit the final video. Through this system, users can quickly create and publish high-quality video content. The aim is to improve the convenience of digital video production and provide a valuable experience for both viewers and creators.
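As a non-limiting illustration of placing generated titles on the video timeline, the following Python sketch serializes (start, end, text) segments in the SubRip (.srt) subtitle layout. The choice of SubRip is an assumption made for illustration, since the disclosure does not name a subtitle format, and the function names `srt_time` and `to_srt` are hypothetical.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm, as used in SubRip files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as numbered SubRip blocks."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{number}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical titles with start and end times in seconds.
print(to_srt([(0.0, 3.0, "Boil the pasta"), (3.0, 5.5, "Mix the sauce")]))
```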
[0695] The flow of the specific processing in Example 1 will be explained using Figure 11.
[0696] Step 1:
[0697] Users upload cooking videos to the platform using their devices. Specifically, the user selects a video file, and the device sends that file to the server. The input is the video file selected by the user, and the output is the transfer of the video file to the server.
[0698] Step 2:
[0699] The server receives the uploaded video and stores it temporarily. Specifically, the received video data is saved to the server's storage. The input is the video data sent from the terminal, and the output is the video data stored in the server's storage.
[0700] Step 3:
[0701] The server divides the stored video into individual frames. Specifically, it divides the video into time intervals to generate still frames. This process uses an image processing library to convert continuous data into individual images. The input is the stored video data, and the output is individual frame image data.
[0702] Step 4:
[0703] The server analyzes each frame using a generative AI model. Specifically, it uses an image recognition algorithm to identify materials and work procedures within the frame and generates text information. This analysis outputs material names and operation procedures as text. The input is frame image data, and the output is text information including materials and procedures.
[0704] Step 5:
[0705] The server generates titles based on text information and places them according to the scenes in the video. A natural language processing engine is used to create titles related to the video from text data and synchronize them on the video timeline. The input is text data, and the output is titles aligned with the video timeline.
[0706] Step 6:
[0707] The server generates editing suggestions that include visual effects. Specifically, it uses a generative AI model to analyze the results and make suggestions such as highlighting important scenes in the video. The input is the analyzed video information, and the output is editing instruction data that includes suggestions for visual effects.
[0708] Step 7:
[0709] The terminal receives text information and editing suggestions from the server, and the user performs the final video editing. Specifically, the user makes necessary corrections and edits using the provided information. The input is the text information and editing suggestions sent from the server, and the output is the final edited video file.
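As a non-limiting illustration of Step 3 above, in which the video is divided at time intervals to generate still frames, the following Python sketch computes which frame indices would be extracted. It assumes a constant frame rate; the function name `sample_frame_indices` and the example interval are hypothetical, and the actual decoding by an image processing library is omitted.

```python
def sample_frame_indices(duration_s: float, fps: float, interval_s: float) -> list[int]:
    """Return the indices of frames taken every interval_s seconds."""
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    total_frames = int(duration_s * fps)
    # Never step by less than one frame, even for very short intervals.
    step = max(1, round(interval_s * fps))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 frames per second, sampled every 2 seconds.
print(sample_frame_indices(10.0, 30.0, 2.0))  # [0, 60, 120, 180, 240]
```

Sampling at an interval instead of analyzing every frame reduces the number of images passed to the recognition step, at the cost of possibly missing very short actions.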
[0710] (Application Example 1)
[0711] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0712] While video content has become widespread, and cooking-related videos are increasing, it's not easy for viewers to efficiently understand and practice cooking methods while watching videos. In particular, repeated viewing is necessary to grasp specific ingredients and procedures, which presents a significant time and cognitive burden for users. Creators are also seeking support to efficiently edit engaging videos.
[0713] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.
[0714] In this invention, the server includes means for receiving and storing video; means for dividing the video frame by frame for analysis; means for identifying ingredients and procedures from the video using an image recognition algorithm and generating text data; means for generating visual and temporal editing suggestions; means for transmitting the generated text data, captions, and editing suggestions to a mobile device; and means for visualizing the video analysis results in real time, enabling users to understand the cooking method in accordance with the video. This allows viewers to efficiently grasp cooking procedures as they watch the video, and enables creators to provide engaging content by supporting high-quality video editing.
[0715] "Means for receiving and storing video" refers to a device or system that receives video data transmitted via the Internet or other means of communication and stores it temporarily or permanently in a physical or virtual storage device.
[0716] "Means for dividing the video frame by frame for analysis" refers to the process of dividing a received video into multiple frames, which are individual still images, and converting the image data of each frame into a format that can be analyzed independently.
[0717] "A means of identifying ingredients and procedures from video footage using an image recognition algorithm and generating text data" refers to a technology or method that applies image recognition technology such as AI to frames extracted from a video to automatically extract information such as ingredients and cooking procedures, and convert that information into text.
[0718] "Means for generating visual and temporal editing suggestions" refers to a function that analyzes the entire video visually and temporally and suggests editing options to maximize visual effects and scene flow.
[0719] "Means for transmitting generated text data, captions, and editing suggestions to a mobile device" refers to a method of transmitting information such as character data, captions, and editing suggestions generated on a server to a portable device such as a smartphone used by a user, enabling display or editing assistance.
[0720] "A means of visualizing video analysis results in real time, enabling users to understand cooking methods in accordance with the video" refers to a function that displays information on cooking procedures and ingredients obtained through video analysis, overlaid visually while the video is being watched, allowing users to understand and practice the content in a way that matches the video.
[0721] This invention is a system that analyzes cooking videos to help viewers efficiently understand cooking methods and to support creators in providing high-quality content. The system is configured as follows:
[0722] The server first receives cooking videos sent by users and saves them to its storage device. These saved videos are then divided into frames. Each frame is analyzed using an AI-based image recognition algorithm to identify the ingredients and cooking steps within the video. The identified information is generated as text data and later used to create captions. Specifically, Python and TensorFlow are used to perform image recognition, identify ingredients, and extract cooking steps.
[0723] Furthermore, the server automatically generates suggestions for visual and temporal editing of the entire video based on the analysis results, including suggestions for visual effects, extraction of important scenes, and shortening of redundant parts. These editing suggestions are sent to mobile devices such as smartphones, for example to an application built with React Native. This allows users to watch the video while the analysis results are visualized in real time, enabling them to efficiently grasp the overall picture of the cooking process.
[0724] As a concrete example, suppose a user uploads a video titled "How to Make Omurice" to the system. The system analyzes the video, identifies steps such as "cracking the eggs" and "frying the rice," and adds these as captions. It also makes editing suggestions, such as "emphasizing the key points for making the eggs fluffy." Furthermore, as an example of a prompt, the AI can be asked questions such as, "What ingredients are used in this video? Also, please tell me the specific cooking steps."
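As a non-limiting illustration of generating a temporal editing suggestion such as shortening redundant parts, the following Python sketch flags any recognized step whose on-screen duration exceeds a threshold. The function name `suggest_trims`, the threshold value, and the rule itself are assumptions introduced for illustration; the disclosure does not specify how redundancy is determined.

```python
def suggest_trims(
    segments: list[tuple[float, float, str]], max_seconds: float = 20.0
) -> list[str]:
    """Suggest shortening any step whose on-screen duration exceeds max_seconds."""
    suggestions = []
    for start, end, text in segments:
        duration = end - start
        if duration > max_seconds:
            suggestions.append(
                f"Shorten '{text}' ({duration:.0f} s) to about {max_seconds:.0f} s"
            )
    return suggestions


# Hypothetical segments for the omurice example: (start, end, step), in seconds.
segments = [(0.0, 5.0, "cracking the eggs"), (5.0, 50.0, "frying the rice")]
print(suggest_trims(segments))
# ["Shorten 'frying the rice' (45 s) to about 20 s"]
```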
[0725] The flow of a specific process in Application Example 1 will be explained using Figure 12.
[0726] Step 1:
[0727] The user selects a cooking video on their device and uploads it to the server. The input is the cooking video file provided from the device. The output is the video file transferred to the server and stored there. This process prepares the video for subsequent analysis.
[0728] Step 2:
[0729] The server divides the received video into frames. The input is a video file stored on the server. The output is the video decomposed into a collection of still image frames. This process allows for individual analysis of each frame.
[0730] Step 3:
[0731] The server applies an image recognition algorithm to identify ingredients and cooking steps within each frame and extract text data. Still image frames are provided as input. Text data containing ingredient names and cooking steps is obtained as output. An AI model using Python and TensorFlow extracts features from the images and identifies the information.
[0732] Step 4:
[0733] The server generates visual and temporal captions and editing suggestions based on extracted text data. Text data regarding materials and procedures is used as input. The output includes visual caption information and video editing suggestions. This allows the captions to be displayed in sync with the video's flow, and the editing suggestions aim to improve the content's quality.
[0734] Step 5:
[0735] The server sends the generated captions and editing suggestions to the terminal and provides them to the user. The generated captions and editing suggestions are used as input. This information is delivered to the user's terminal as output. The user can efficiently understand the video content through real-time visualization on their terminal.
[0736] Step 6:
[0737] The user uses their device to perform the final editing of the video based on the provided captions and editing suggestions. The captions and editing suggestions displayed on the device are used as input. The output is the final edited video, ready to be published on the platform. At this stage, the user determines the final form of the video according to their own needs and prepares it for distribution to viewers.
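As a non-limiting illustration of Step 5 above, in which the server sends the generated captions and editing suggestions to the terminal, the following Python sketch packages them as a JSON message. The function name `build_terminal_payload` and the field names are hypothetical, and the transport between server and terminal is omitted.

```python
import json


def build_terminal_payload(
    video_id: str,
    captions: list[tuple[float, float, str]],
    suggestions: list[str],
) -> str:
    """Serialize captions and editing suggestions for delivery to the terminal."""
    payload = {
        "video_id": video_id,  # hypothetical identifier assigned at upload
        "captions": [
            {"start": start, "end": end, "text": text}
            for start, end, text in captions
        ],
        "editing_suggestions": list(suggestions),
    }
    # ensure_ascii=False keeps non-ASCII caption text (e.g. Japanese) readable.
    return json.dumps(payload, ensure_ascii=False)


message = build_terminal_payload(
    "omurice-001",
    [(0.0, 5.0, "cracking the eggs")],
    ["emphasize the key points for making the eggs fluffy"],
)
print(message)
```

Because each caption carries its own start and end time, the terminal can overlay the text in sync with playback without further requests to the server.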
[0738] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.
[0739] This invention combines an emotion engine with a system that analyzes cooking videos to provide detailed recipe information, enabling the adjustment of editing and caption generation based on the user's emotional state. Specific embodiments of the system are described below.
[0740] First, users upload cooking videos to the platform. The uploaded video files are received by the server and temporarily stored. The server then divides the video frame by frame and converts it into a data format that allows for detailed analysis.
[0741] Next, the server uses an image recognition algorithm to identify the ingredients and cooking steps in each frame. As a result of this process, text data about the ingredients and steps is generated.
[0742] This system incorporates an emotion engine that analyzes the user's emotional state. The device analyzes the user's voice tone and facial expression data to determine their current emotional state. For example, it collects information such as whether the user is happy, confused, or depressed.
[0743] The server uses the emotion engine's judgment to provide feedback on video caption generation and editing suggestions. As a result, the content and tone of the captions reflect the user's emotions, and different approaches are offered in editing suggestions based on those emotions. For example, if the server determines that the user is enjoying the video, it can create captions that emphasize the humor in the video.
[0744] Furthermore, the emotion engine can predict viewers' emotional reactions and suggest improvements to the video after its release. This allows for adjustments to make the video more enjoyable for viewers.
[0745] As a concrete example, suppose a user uploads a video of themselves making desserts and explains the process in a cheerful mood. The emotion engine detects this emotion and reflects it in the caption and editing suggestions. For example, a positive-toned caption such as "How to make cookies that everyone can enjoy making together" is automatically generated. In this way, the system can provide content that appeals to viewers while taking the user's emotions into consideration.
[0746] This system allows users to edit videos with emotional depth and deliver recipe videos that are tailored to maximize viewer enjoyment.
[0747] The following describes the processing flow.
[0748] Step 1:
[0749] Users film cooking videos and upload them to the platform. They use their device's interface to select and submit the video files to be uploaded.
[0750] Step 2:
[0751] The server receives the uploaded video and saves it to temporary storage. Simultaneously with saving, it verifies the file format and size and prepares it for analysis.
[0752] Step 3:
[0753] The server divides the video into frames. In this step, the video is extracted as still images at a constant frame rate for efficient analysis.
[0754] Step 4:
[0755] The server uses an image recognition algorithm to identify the ingredients and cooking steps within the frame. This algorithm then extracts the objects and their actions in the video as text data.
[0756] Step 5:
[0757] The device analyzes the user's voice and facial expressions using an emotion engine. It evaluates the user's emotional state during shooting in real time or retrospectively and sends the information to the server.
[0758] Step 6:
[0759] The server generates captions based on the received text data and the results from the emotion engine. The generated captions reflect the user's emotions in content and tone, and are organized chronologically.
[0760] Step 7:
[0761] The server provides visual and temporal editing suggestions for the video. This includes leveraging emotion engine data to tailor edits to the user's emotional state. For example, it may include editing instructions that highlight enjoyable parts.
[0762] Step 8:
[0763] The device provides the user with captions and editing suggestions received from the server. The user then makes final adjustments to the video based on this information, and reviews and corrects the captions and editing content.
[0764] Step 9:
[0765] Users review the edited videos and publish them on the platform. After final review and adjustments, the videos are distributed to viewers. This ensures that personalized and emotionally resonant cooking videos are delivered to audiences.
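The way the emotion engine's result could feed into caption tone (Step 6) and editing suggestions (Step 7) can be sketched as follows. This is only an illustrative sketch: the tone table, the emotion labels, and the function names are assumptions introduced here, since the disclosure states only that captions and edits reflect the user's emotional state.

```python
from typing import Dict, List, Tuple

# Hypothetical tone table; the emotion labels follow the examples
# "happy", "confused", and "depressed" given in the description.
TONE_BY_EMOTION: Dict[str, str] = {
    "happy": "upbeat",
    "confused": "reassuring",
    "depressed": "gentle",
}


def caption_tone(emotion: str) -> str:
    """Return the caption tone for an emotion label, defaulting to neutral."""
    return TONE_BY_EMOTION.get(emotion, "neutral")


def highlight_suggestions(
    segment_emotions: List[Tuple[float, float, str]],
    target: str = "happy",
) -> List[str]:
    """Suggest highlighting every (start, end, emotion) segment whose emotion matches target."""
    return [
        f"Highlight {start:.1f}s-{end:.1f}s ({emotion})"
        for start, end, emotion in segment_emotions
        if emotion == target
    ]
```

For example, `highlight_suggestions([(0.0, 4.0, "happy"), (4.0, 9.5, "confused")])` yields a single suggestion covering the first four seconds, corresponding to an editing instruction that highlights an enjoyable part.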
[0766] (Example 2)
[0767] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0768] Existing video analysis systems lack the ability to generate captions or suggest edits based on the user's emotional state, making it difficult to adjust content to capture viewers' interest. This is particularly problematic for content such as cooking videos, where the inability to edit and caption in a way that considers the emotions of viewers and users makes it difficult to create engaging content.
[0769] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.
[0770] In this invention, the server includes means for receiving video and storing it in an information recording device; means for dividing the video into electrical signals for analysis; means for identifying objects and operating procedures in the video and generating text information using image recognition processing; means for determining the emotional state of the user using a user reaction analysis device and generating text information and editing suggestions based on the determination result; and means for transmitting the generated text information, captions, and editing suggestions to an information processing device. This makes it possible to generate captions and suggest edits that correspond to the emotional state of the user, and to provide content that appeals to viewers.
[0771] "Moving images" are digital media that express movement through the continuous display of images.
[0772] An "information recording device" is a storage device for electronically storing received data.
[0773] An "electrical signal" is a signal that represents the flow of digital or analog data that makes up video or audio.
[0774] "Image recognition processing" is a technology that identifies specific visual elements from digital image data and analyzes them.
[0775] "Object" refers to an object, material, or element within an image that is the subject of identification and analysis.
[0776] An "operating procedure" is a series of actions or steps performed to achieve a specific objective.
[0777] "Text information" refers to text-based information generated through image recognition processing.
[0778] A "user reaction analysis device" is a device used to measure and analyze the emotions and reactions of users.
[0779] "Emotional state" refers to the emotional reactions and psychological state exhibited by the user.
[0780] An "editing suggestion" is a proposed editing option or method to make the video content more effective.
[0781] An "information processing device" is a computer used to process data and output analysis results or suggestions.
[0782] This invention is a system in which users upload cooking videos, and the system analyzes those videos to provide detailed recipe information. Furthermore, by incorporating an emotion engine, this system can adjust editing and caption generation based on the user's emotional state.
[0783] Users upload cooking videos to the platform using their own devices. These videos are received by the server and temporarily stored in an information recording device. The server then uses video analysis software such as OpenCV to convert the video into electrical signals frame by frame. This conversion process prepares each frame to be analyzed as individual image data.
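The frame-by-frame conversion described above can be illustrated by computing which frames to extract and their timestamps. The following is a minimal sketch under stated assumptions: the function name `sample_frames` and the sampling rate parameter are hypothetical, and the actual decoding of frames (for example with OpenCV's `cv2.VideoCapture`) is deliberately left out so that the sketch stays self-contained.

```python
from typing import List, Tuple


def sample_frames(total_frames: int, source_fps: float, sample_fps: float) -> List[Tuple[int, float]]:
    """Return (frame_index, timestamp_seconds) pairs sampled at roughly sample_fps."""
    if total_frames < 0 or source_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame count must be non-negative and rates positive")
    # Never sample faster than the source; keep every step-th frame.
    step = max(1, round(source_fps / sample_fps))
    return [(i, i / source_fps) for i in range(0, total_frames, step)]
```

Sampling at a lower rate than the source reduces the number of images passed to image recognition, while the timestamp kept with each index allows later results to be mapped back to a position in the video.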
[0784] The server uses image recognition technology to analyze each frame of the image and identify the object and the operating procedure. Machine learning frameworks such as TensorFlow and PyTorch are used for this analysis, and the identified information is output as text data. This generates specific text data about the materials and procedures.
[0785] The device also acquires the user's voice tone and facial expressions, and uses a user reaction analysis device to determine their emotional state. This analysis utilizes a voice recognition API and facial recognition software. This allows for accurate measurement of the user's emotional state, such as whether they are enjoying themselves or feeling confused.
[0786] The server utilizes an emotion engine to generate text information and editing suggestions based on the analysis results. For example, if the emotion analysis determines that the user is enjoying the content, the server can generate a caption with a positive tone using a generative AI model such as GPT. This allows more engaging content to be provided to viewers.
[0787] For example, if a user is smiling while explaining in a dessert-making video, the system automatically generates a caption that emphasizes a fun atmosphere, such as "How to make cookies that everyone can enjoy making together." In this way, the system reflects the user's emotions in real time and ultimately provides videos that resonate with viewers.
[0788] An example of a prompt might be, "Generate a humorous caption that fits a video of making desserts. The user is enjoying themselves." Following this prompt, the generative AI model can create the necessary captions and editing suggestions.
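A prompt of this form could be assembled from the recognized topic and the determined emotional state, as in the following minimal sketch. The function name `build_caption_prompt` and its parameters are hypothetical; sending the prompt to a generative AI model is not shown.

```python
def build_caption_prompt(topic: str, emotion: str, tone: str = "humorous") -> str:
    """Compose a caption-generation prompt from the video topic and the user's emotion."""
    topic = topic.strip()
    emotion = emotion.strip()
    if not topic or not emotion:
        raise ValueError("topic and emotion must be non-empty")
    return (
        f"Generate a {tone} caption that fits a video of {topic}. "
        f"The user is {emotion}."
    )
```

Calling `build_caption_prompt("making desserts", "enjoying themselves")` reproduces the example prompt given above.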
[0789] The flow of the specific processing in Example 2 will be explained using Figure 13.
[0790] Step 1:
[0791] Users upload cooking videos from their devices to the platform. The cooking video file is received by the server as input, and the server saves this video to an information recording device. Once saved, the video becomes available for subsequent analysis processes.
[0792] Step 2:
[0793] The server converts the stored video into electrical signals frame by frame for analysis. This process uses video analysis software, which divides the input video data into individual image data for output. This allows for analysis of each individual image.
[0794] Step 3:
[0795] The server analyzes each frame using image recognition processing to identify the object and the operating procedure. The input for this step is image data for each frame, and the output is textual information about the ingredients and cooking procedure. For this purpose, a machine learning framework is used to identify and digitize specific visual elements.
[0796] Step 4:
[0797] The terminal collects the user's voice tone and facial expressions as input data and uses a user reaction analysis device to determine their emotional state. The output is data on the emotional state shown by the user. This analysis is performed using a voice recognition API and facial recognition software, and the user's emotion is expressed as a numerical value or category.
[0798] Step 5:
[0799] The server uses an emotion engine to generate text information and editing suggestions that reflect the emotional state. The input to this process is the user's emotional state data and the text information generated in the previous step, and the output is a caption and editing suggestions corresponding to the emotion. A generative AI model, for example GPT, creates the emotion-based captions.
[0800] Step 6:
[0801] The server sends the generated captions and editing suggestions to the information processing device. The input is the generated caption data and editing suggestion data, which are transmitted unchanged to the information processing device as output. This prepares the content for the audience to enjoy.
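Step 4 states that the user's emotion is expressed as a numerical value or category. One way this could work is sketched below, assuming that the voice analysis and the facial expression analysis each produce a valence score between -1 and 1. The score range, the equal weighting, the threshold of 0.3, and the category names are all illustrative assumptions and are not specified in this disclosure.

```python
def fuse_emotion_scores(voice: float, face: float, voice_weight: float = 0.5) -> float:
    """Weighted average of two valence scores, each assumed to lie in [-1, 1]."""
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be in [0, 1]")
    for score in (voice, face):
        if not -1.0 <= score <= 1.0:
            raise ValueError("scores must be in [-1, 1]")
    return voice_weight * voice + (1.0 - voice_weight) * face


def emotion_category(valence: float, threshold: float = 0.3) -> str:
    """Map a fused valence score to a coarse category (threshold is illustrative)."""
    if valence >= threshold:
        return "positive"
    if valence <= -threshold:
        return "negative"
    return "neutral"
```

The numerical value can be passed to Step 5 as it is, while the category is convenient for selecting a caption tone or for inclusion in a prompt.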
[0802] (Application Example 2)
[0803] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".
[0804] In cooking videos, there is a growing demand for dynamic editing that responds to the user's emotional state, providing viewers with more engaging and personalized content. However, conventional systems have been insufficient in predicting viewers' emotional reactions and improving videos, or in generating captions that take user emotions into account. Therefore, an improvement in the user experience is desired.
[0805] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.
[0806] In this invention, the server includes means for receiving and storing video, means for dividing the video into frames for analysis, and means for identifying substances and processes from the video using an image recognition algorithm and generating textual information. This makes it possible to analyze the user's emotional state and adjust captions and editing suggestions based on those emotions, predict the viewer's emotional response and make improvement suggestions, and send prompt sentences to a generative AI model to obtain improvement suggestions.
[0807] "Means for receiving and saving video" refers to a device or process that receives video data provided from an external source and stores it on a storage medium.
[0808] "Means for dividing into frames" refers to a technique for dividing temporally continuous video footage into individual still images for the purpose of analyzing video data.
[0809] An "image recognition algorithm" is a program or method that automatically identifies and interprets specific objects or situations from image data.
[0810] "Means for identifying materials and processes and generating textual information" refers to a system that, based on image recognition results, transcribes the materials used and procedures performed within a video into text and records them in digital format.
[0811] An "emotion analysis engine that analyzes the emotional state of users" is a technology that infers the user's psychological state at a given time from the nuances of their facial expressions and voice.
[0812] "A method for sending prompt sentences to a generative AI model and obtaining improvement suggestions" refers to a method that uses natural language processing technology to provide input to an artificial intelligence system that generates responses to requests, and to obtain suggestions for improvement.
[0813] "Methods for predicting viewer emotional responses and proposing improvements" refers to a process of estimating emotional responses from viewer behavior and feedback, and then proposing specific changes to improve the quality of content based on those estimates.
[0814] This invention begins with a user uploading a cooking video to the platform. The server receives this video and stores it temporarily. The video is then divided into frames for analysis. Software such as OpenCV or FFmpeg is often used for this process.
[0815] Next, the server applies an image recognition algorithm to identify the substance and process from each frame. The data obtained through this process is stored in text format. At this stage, machine learning frameworks such as TensorFlow and PyTorch are used.
[0816] Furthermore, the emotion analysis engine determines the user's emotional state based on the voice and facial expression data provided by the user's device. Google Cloud Speech-to-Text API and Microsoft Azure's Emotion API are utilized for this analysis.
[0817] Upon receiving this emotional state information, the server sends a prompt to the generative AI model to obtain feedback on caption generation and editing. An example of such a prompt is, "Generate captions that are relatable to viewers and create a fun atmosphere for a cooking video in which the user is enjoying explaining the subject."
[0818] Ultimately, suggestions for improving the video are made based on viewer emotional responses. This content is then visually and temporally edited before being sent to the device. In this way, viewers can enjoy personalized and engaging videos.
[0819] For example, when a user explains "how to make a raspberry tart" to viewers, the system captures the user's joyful emotions and generates captions that emphasize them. This makes the video even more appealing to viewers.
[0820] The flow of a specific process in Application Example 2 will be explained using Figure 14.
[0821] Step 1:
[0822] Users upload cooking videos to the platform. The server receives and stores these videos. The input is the cooking video uploaded by the user, and the output is a video file stored on the server. The server checks the video format and saves it in the appropriate format.
[0823] Step 2:
[0824] The server divides the video frame by frame for analysis. The input is a saved video file, and the output is individual frame images. The server converts the video to still images based on the frame rate and records the timestamp of each frame.
[0825] Step 3:
[0826] The server applies an image recognition algorithm to each frame of the image to identify substances and processes. The input is the frame image, and the output is textual information of the recognized substances and processes. The server uses a machine learning model to analyze the content of each frame and saves the results as text data.
[0827] Step 4:
[0828] The device acquires recorded audio data and facial expression data captured by the camera and sends them to the emotion analysis engine. The input is audio and video data, and the output is the user's emotional state. The device uses the emotion analysis engine to analyze the tone of voice and facial expressions to determine the current emotion.
[0829] Step 5:
[0830] The server sends a prompt to the generative AI model based on the emotional state and analyzed text data, and retrieves a caption and editing suggestions. The input is the emotional state and text data, and an example prompt is "Generate a caption for a cooking video where the user is enjoying explaining, that is relatable to viewers and creates a fun atmosphere." The output is the adjusted caption and editing suggestions.
[0831] Step 6:
[0832] The server predicts viewer emotional responses and provides suggestions for improving videos after publication. Inputs include generated captions, user emotional states, and past viewer data; output is improved video content. The server analyzes viewer history and reactions to generate suggestions for editing more engaging videos.
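The timestamps recorded in Step 2 and the captions obtained in Step 5 could be combined into a timed caption file before being sent to the device. The following minimal sketch renders captions in the SubRip (SRT) subtitle format; the choice of SRT and the function names are assumptions made here for illustration, as the disclosure does not name a caption file format.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(captions: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples, sorted by start time, as SRT text."""
    blocks = []
    for n, (start, end, text) in enumerate(sorted(captions), start=1):
        blocks.append(f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

Sorting by start time keeps the captions in chronological order even if the generative AI model returns them out of sequence.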
[0833] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.
[0834] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions indicated by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
[0835] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.
[0836] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.
[0837] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
[0838] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.
[0839] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the closer an emotion is to the outside of the Emotion Map 400, the more visible it becomes (that is, the more it is expressed in actions).
[0840] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.
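The layout of the emotion map described above (pleasure at the top, displeasure at the bottom, the "response" region on the left, the "situation" region on the right, and inner thoughts toward the center) can be sketched as a simple coordinate lookup. The coordinate convention, the unit scale, the inner radius of 0.5, and the function name are assumptions introduced purely for illustration; the disclosure does not assign numerical positions to emotions.

```python
import math


def describe_position(x: float, y: float, inner_radius: float = 0.5) -> dict:
    """Describe a point on a unit-scale emotion map centered at the origin.

    Assumed convention: +x is the right (situation) side, +y is the upper (pleasure) side.
    """
    radius = math.hypot(x, y)
    return {
        "side": "situation" if x > 0 else "response" if x < 0 else "center line",
        "valence": "pleasure" if y > 0 else "displeasure" if y < 0 else "neutral",
        "layer": "inner (primitive)" if radius < inner_radius else "outer (expressed in actions)",
    }
```

Under this convention, a point near the 3 o'clock position, such as `describe_position(0.8, 0.1)`, falls on the situation side in the outer layer.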
[0841] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."
[0842] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
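Once the neural network has produced an emotion value for each emotion on the map, determining the user's emotion can be as simple as selecting the largest value. The following minimal sketch shows this selection together with a check for emotions whose values are close to one another; the function names, the tolerance, and any values passed in are hypothetical, and the neural network itself is not shown.

```python
from typing import Dict, List


def dominant_emotion(emotion_values: Dict[str, float]) -> str:
    """Return the emotion with the highest emotion value."""
    if not emotion_values:
        raise ValueError("emotion_values must not be empty")
    return max(emotion_values, key=emotion_values.get)


def similar_emotions(
    emotion_values: Dict[str, float], reference: str, tolerance: float = 0.1
) -> List[str]:
    """List other emotions whose value lies within tolerance of the reference emotion."""
    ref = emotion_values[reference]
    return sorted(
        name
        for name, value in emotion_values.items()
        if name != reference and abs(value - ref) <= tolerance
    )
```

With illustrative values such as `{"reassured": 0.82, "calm": 0.80, "confident": 0.78, "anxious": 0.10}`, the dominant emotion is "reassured", and "calm" and "confident" are reported as similar, which mirrors the behavior described for emotions located close together on the map.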
[0843] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.
[0844] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.
[0845] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.
[0846] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.
[0847] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.
[0848] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.
[0849] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.
[0850] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.
[0851] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.
[0852] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.
[0853] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.
[0854] The following is further disclosed regarding the embodiments described above.
[0855] (Claim 1)
[0856] A means of receiving and saving videos,
[0857] A means for dividing the aforementioned video into frames for analysis,
[0858] A means for identifying materials and procedures from the video footage using an image recognition algorithm and generating text data,
[0859] Means for generating a caption based on the aforementioned text data,
[0860] Means for generating visual and temporal editing suggestions,
[0861] Means for sending the generated text data, caption, and editing suggestions to a terminal,
[0862] A system comprising the above means.
[0863] (Claim 2)
[0864] The system according to claim 1, wherein the caption generation is organized based on a time axis and corresponds to the scene.
[0865] (Claim 3)
[0866] The system according to claim 1 that generates editing suggestions including visual effects and audio guides.
[0867] "Example 1"
[0868] (Claim 1)
[0869] A means of receiving and storing video,
[0870] A means for dividing the aforementioned video into individual images for analysis,
[0871] A means for identifying materials and work procedures from the content within the video using an image recognition method and generating text information,
[0872] Means for generating a title based on the aforementioned textual information,
[0873] Means for formulating visual and temporal editing proposals,
[0874] Means for transmitting the generated text information, title, and editorial proposal to an information device,
[0875] A system comprising the above means.
[0876] (Claim 2)
[0877] The system according to claim 1, wherein the title generation is organized based on a time axis and corresponds to the scene.
[0878] (Claim 3)
[0879] The system according to claim 1 for formulating editorial proposals including visual effects and audio guidance.
[0880] "Application Example 1"
[0881] (Claim 1)
[0882] A means of receiving and saving videos,
[0883] A means for dividing the aforementioned video into frames for analysis,
[0884] A means for identifying materials and procedures from the video footage using an image recognition algorithm and generating text data,
[0885] Means for generating a caption based on the aforementioned text data,
[0886] Means for generating visual and temporal editing suggestions,
[0887] Means for transmitting the generated text data, caption, and editing suggestions to a mobile device,
[0888] A means to visualize the analysis results of the aforementioned video in real time, enabling users to understand the cooking method in accordance with the video,
[0889] A system comprising the above means.
[0890] (Claim 2)
[0891] The system according to claim 1, wherein the caption generation is organized based on a time axis and corresponds to the scene.
[0892] (Claim 3)
[0893] The system according to claim 1, which generates editing suggestions including visual effects and audio guides, and provides information that enables users to quickly begin cooking.
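The real-time visualization recited in this application example could, for instance, look up the caption that matches the current playback position. The sketch below is an assumption for illustration only: the caption list, the `caption_at` function, and the timings are hypothetical and do not come from the disclosure.

```python
from bisect import bisect_right
from typing import Optional

# Assumed time-axis captions: (start_second, text), sorted by start time.
CAPTIONS = [
    (0.0, "Chop the onion"),
    (12.5, "Fry the onion until golden"),
    (40.0, "Add the beaten egg"),
]
STARTS = [start for start, _ in CAPTIONS]


def caption_at(position_s: float) -> Optional[str]:
    """Return the caption of the scene containing the playback position."""
    index = bisect_right(STARTS, position_s) - 1
    return CAPTIONS[index][1] if index >= 0 else None
```

A mobile device could call `caption_at` on each playback tick so that the displayed cooking step stays in step with the video.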
[0894] "Example 2 of combining an emotion engine"
[0895] (Claim 1)
[0896] A means for receiving moving images and storing them in an information recording device,
[0897] Means for dividing the aforementioned moving image into electrical signals for analysis,
[0898] A means for identifying objects and operating procedures in the video using image recognition processing and generating text information,
[0899] A means for determining the emotional state of a user using a reaction analysis device and generating text information and editing proposals based on the determination results,
[0900] Means for transmitting the generated text information, caption, and editing proposal to an information processing device,
[0901] A system comprising the above means.
[0902] (Claim 2)
[0903] The system according to claim 1, wherein the caption generation is organized based on the progression of time and corresponds to the scene.
[0904] (Claim 3)
[0905] The system according to claim 1 for generating an editing plan that includes visual effects and audio guidance.
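One way to picture generating editing proposals from a determined emotional state is a simple rule table. This is a hedged sketch under stated assumptions: the emotional-state labels, the `EditingProposal` fields, and the rule values are all hypothetical, and the disclosure does not specify how the reaction analysis maps to proposals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditingProposal:
    caption_detail: str   # "brief" or "detailed"
    playback_rate: float  # suggested speed for the scene
    audio_guidance: bool  # whether to add spoken guidance


# Assumed mapping from a determined emotional state to a proposal.
RULES = {
    "confused": EditingProposal("detailed", 0.75, True),
    "bored": EditingProposal("brief", 1.25, False),
}
DEFAULT = EditingProposal("brief", 1.0, False)


def propose_edit(emotional_state: str) -> EditingProposal:
    """Pick an editing proposal for the given emotional state."""
    return RULES.get(emotional_state, DEFAULT)
```

A fixed table keeps the example easy to follow; a learned model or a generative AI prompt could replace it without changing the calling code.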
[0906] "Application example 2 when combining with an emotional engine"
[0907] (Claim 1)
[0908] A means of receiving and saving videos,
[0909] A means for dividing the aforementioned video into frames for analysis,
[0910] A means for identifying substances and processes from video footage using an image recognition algorithm and generating textual information,
[0911] Means for generating a caption based on the aforementioned text information,
[0912] An emotion analysis engine that analyzes the emotional state of the user, and means for adjusting captions and editing suggestions based on that emotional state,
[0913] Means for generating visual and temporal editing suggestions,
[0914] A means of predicting viewers' emotional reactions and suggesting improvements to videos after they are released,
[0915] Means for transmitting the generated text information, captions, and editing suggestions to a terminal,
[0916] A means of sending prompt text to a generative AI model and obtaining improvement suggestions,
[0917] A system comprising the above means.
[0918] (Claim 2)
[0919] The system according to claim 1, wherein the caption generation is organized based on a time axis and corresponds to the scene.
[0920] (Claim 3)
[0921] The system according to claim 1, which generates visual effects, audio guides, and editing suggestions based on emotion analysis results.
[Explanation of Symbols]
[0922]
10, 210, 310, 410 Data Processing Systems
12 Data Processing Devices
14 Smart Devices
214 Smart Glasses
314 Headset-type terminal
414 Robots
Claims
1. A means of receiving and saving videos,
A means for dividing the aforementioned video into frames for analysis,
A means for identifying materials and procedures from the video footage using an image recognition algorithm and generating text data,
Means for generating a caption based on the aforementioned text data,
Means for generating visual and temporal editing suggestions,
Means for transmitting the generated text data, caption, and editing suggestions to a mobile device,
A means to visualize the analysis results of the aforementioned video in real time, enabling users to understand the cooking method in accordance with the video,
A system comprising the above means.
2. The system according to claim 1, wherein the caption generation is organized based on a time axis and corresponds to the scene.
3. The system according to claim 1, which generates editing suggestions including visual effects and audio guides, and provides information that enables users to quickly start cooking.