system

A system simplifies video editing by allowing users to input instructions through a mobile app. A server processes the instructions, automatically edits the video, and synchronizes generated music with it, addressing the challenge of creating emotionally rich videos without specialized knowledge.

JP2026105472A · Pending · Publication Date: 2026-06-26 · SOFTBANK GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SOFTBANK GROUP CORP
Filing Date
2024-12-16
Publication Date
2026-06-26


Abstract

[Problem] To provide a system. [Solution] A system including: user input means; means for transmitting data via a mobile terminal; means for analyzing the received video data; means for analyzing natural language to determine an editing task; means for designing an editing plan based on an advertisement concept; means for editing a video based on the editing plan; means for synchronizing the generated music with the video; and means for transmitting the edited video to the mobile terminal.

Description

Technical Field

[0001] The technology of the present disclosure relates to a system.

Background Art

[0002] Patent Document 1 discloses a method for controlling a persona chatbot, which is performed by at least one processor, and includes steps of receiving a user utterance, adding the user utterance to a prompt including an instruction sentence related to an explanation of a chatbot character, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] In modern society, many mobile terminal users routinely shoot videos. However, because they lack editing skills and specialized knowledge, they find it difficult to produce videos that reflect their intentions. In particular, advanced skills are required to create videos that heighten emotion or professional videos that integrate visual and auditory information. A mechanism is therefore needed that allows users to generate high-quality videos easily and automatically.

Means for Solving the Problems

[0005] This invention receives video editing instructions as user input and transmits them, together with the video data, to a server via a mobile terminal. The server analyzes the received video data and determines the editing task using natural language processing. It then designs an editing plan based on the analysis results and automatically edits the video. By synchronizing generated music with the video, it provides emotionally rich visuals. This allows users to easily create videos that meet their needs without requiring specialized technical skills. The system solves the problem by integrating analysis, editing, and music generation, and by automating the entire process up to transmitting the edited video to the mobile terminal.

[0006] "User input means" refers to a function that allows users to input instructions regarding video editing through an interface.

[0007] "Means for transmitting data via a mobile terminal" refers to the function of transmitting video data and editing instructions from a mobile terminal to a cloud server.

[0008] "Means for analyzing received video data" refers to the function of a server that analyzes received video data and extracts features of the video and audio.

[0009] "Means for analyzing natural language to determine editing tasks" refers to a function that analyzes natural language instructions entered by the user and identifies the necessary editing actions.

[0010] "Means for designing editing plans" refers to a function that plans specific editing operations based on analyzed video data and user instructions.

[0011] "Means for editing a video based on an editing plan" refers to a function that automatically performs editing operations, including cutting footage and applying effects, according to the designed editing plan.

[0012] "Means for synchronizing generated music with video" refers to a function that aligns AI-generated background music with the playback timing of the video.

[0013] "Means for sending edited video to mobile devices" refers to the function of sending the completed video from a cloud server to the user's mobile device.

Brief Description of the Drawings

[0014]
[Figure 1] This is a conceptual diagram showing an example of the configuration of a data processing system according to the first embodiment.
[Figure 2] This is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
[Figure 3] This is a conceptual diagram showing an example of the configuration of a data processing system according to the second embodiment.
[Figure 4] This is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
[Figure 5] This is a conceptual diagram showing an example of the configuration of a data processing system according to the third embodiment.
[Figure 6] This is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
[Figure 7] This is a conceptual diagram showing an example of the configuration of a data processing system according to the fourth embodiment.
[Figure 8] This is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
[Figure 9] This shows an emotion map on which multiple emotions are mapped.
[Figure 10] This shows an emotion map on which multiple emotions are mapped.
[Figure 11] This is a sequence diagram showing the processing flow of the data processing system in Example 1.
[Figure 12] This is a sequence diagram showing the processing flow of the data processing system in Application Example 1.
[Figure 13] This is a sequence diagram showing the processing flow of the data processing system in Example 2, which incorporates an emotion engine.
[Figure 14] This is a sequence diagram showing the processing flow of the data processing system in Application Example 2, which incorporates an emotion engine.

Embodiments for Carrying Out the Invention

[0015] Hereinafter, an example of an embodiment of a system according to the technology of the present disclosure will be described with reference to the accompanying drawings.

[0016] First, the terms used in the following description will be explained.

[0017] In the following embodiments, a numbered processor (hereinafter simply referred to as "processor") may be a single arithmetic unit or a combination of multiple arithmetic units. Also, the processor may be a single type of arithmetic unit or a combination of multiple types of arithmetic units. Examples of arithmetic units include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and the like.

[0018] In the following embodiments, a numbered RAM (Random Access Memory) is a memory in which information is temporarily stored and is used as a work memory by the processor.

[0019] In the following embodiments, a numbered storage is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (e.g., an SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

[0020] In the following embodiments, a numbered communication interface (I/F) is an interface that includes a communication processor, an antenna, and the like. The communication interface manages communication between multiple computers. Examples of communication standards applicable to the communication interface include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

[0021] In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." That is, "A and/or B" means that it may be A alone, B alone, or a combination of A and B. Furthermore, in this specification, the same concept as "A and/or B" applies when three or more things are linked by "and/or."

[0022] [First Embodiment]

[0023] Figure 1 shows an example of the configuration of the data processing system 10 according to the first embodiment.

[0024] As shown in Figure 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

[0025] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0026] The smart device 14 comprises a computer 36, a reception device 38, an output device 40, a camera 42, and a communication interface 44. The computer 36 comprises a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The reception device 38, output device 40, and camera 42 are also connected to the bus 52.

[0027] The reception device 38 is equipped with a touch panel 38A and a microphone 38B, etc., and receives user input. The touch panel 38A receives user input by detecting contact with an object (e.g., a pen or finger). The microphone 38B receives user input by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the data indicating the user input.

[0028] The output device 40 includes a display 40A and a speaker 40B, and presents data to the user 20 by outputting the data in a form perceptible to the user 20 (e.g., audio and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs audio according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system such as a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

[0029] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various types of information between processor 46 and processor 28 via network 54.

[0030] Figure 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

[0031] As shown in Figure 2, in the data processing device 12, a specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" related to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 according to the specific processing program 56 executed on the RAM 30.

[0032] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0033] In the smart device 14, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The reception output program 60 is used in conjunction with the specific processing program 56 by the data processing system 10. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0034] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0035] This invention provides a system that allows users to easily edit high-quality videos. Users input video editing instructions using a dedicated smartphone app. The app features an intuitive UI, allowing users to input instructions in natural language and select the necessary video files for editing.

[0036] When a user enters instructions and selects a video file, this information is sent to a cloud server via the internet. The server uses advanced AI and deep learning technologies to analyze the received video data. This analysis identifies emotional elements, important scenes, and dialogue sections within the audio track.

[0037] The server understands the natural language instructions it receives and determines what kind of editing should be done. This instruction analysis uses natural language processing techniques. For example, if the instruction is "edit emotionally," it plans music and slow-motion effects to evoke emotion. If the instruction emphasizes comedic elements, it inserts appropriate captions and visual effects.
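The mapping from instruction to planned effects described in paragraph [0037] can be illustrated with a minimal rule-based sketch. The keyword table, the effect names, and the function `plan_from_instruction` are hypothetical and chosen only for illustration; the disclosure relies on natural language processing, for which simple keyword matching is used here as a stand-in.

```python
# Hypothetical keyword stems mapped to planned effects. A real system would
# use an NLP model rather than substring matching.
RULES = {
    "emotion": ["background_music", "slow_motion"],   # e.g. "edit emotionally"
    "comed": ["captions", "visual_effects"],          # e.g. "comedic", "comedy"
}


def plan_from_instruction(instruction: str) -> list[str]:
    """Return the planned effects for a natural-language instruction."""
    text = instruction.lower()
    effects: list[str] = []
    for stem, planned in RULES.items():
        if stem in text:
            effects.extend(planned)
    return effects
```

With this table, an instruction such as "edit emotionally" yields music and slow motion, while an instruction that matches no rule yields an empty plan and would need a fallback.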

[0038] Based on the editing plan designed by the server, the actual video editing is performed automatically. This includes cutting and joining footage, applying transition effects, and adjusting color tones as needed. Furthermore, AI-generated background music is created to harmonize with the video's atmosphere. This music is automatically generated to match the video's theme.

[0039] The edited, completed video file is sent directly from the cloud server to the user's smartphone in a compressed format. Users can then review and easily share the saved video. In this way, users can easily create videos that suit their intentions, even without specialized knowledge.

[0040] The following describes the processing flow.

[0041] Step 1:

[0042] The user launches the smartphone app and selects the video they want to edit. They also input instructions regarding the editing direction and style in natural language.

[0043] Step 2:

[0044] The device sends the video file selected by the user and the entered editing instructions as data to the cloud server. This communication takes place over the internet.

[0045] Step 3:

[0046] The server analyzes the received video file and extracts various features of the video and audio. At this stage, it identifies points of change in the scene and important parts of the audio track.

[0047] Step 4:

[0048] The server analyzes the user's editing instructions using natural language processing technology and formulates an editing plan. It interprets the necessary editing effects and video themes from the instructions.

[0049] Step 5:

[0050] The server edits the video based on the established editing plan. This includes cutting video clips, adjusting sequences, and applying appropriate filters and transitions.

[0051] Step 6:

[0052] The server uses AI to automatically generate music suitable for the video and synchronizes it with the finished footage. In this way, it adds background music that is perfectly suited to the video.

[0053] Step 7:

[0054] The server compresses the edited video into the optimal format and sends it to the user's device. This process utilizes data compression techniques to reduce transfer time.

[0055] Step 8:

[0056] The device saves the received, edited video to local storage. Users can then review it and easily share it on social media or other platforms.
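The server-side portion of the flow above (Steps 3 to 7) can be sketched as an ordered chain of stages. This is only a skeleton under stated assumptions: the `Job` class, the stage names, and `run_server` are hypothetical, and each stage is a placeholder that records that it ran instead of performing real analysis, editing, or music generation.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    video: str                                      # name of the uploaded video file
    instruction: str                                # natural-language editing instruction
    log: list[str] = field(default_factory=list)    # names of completed stages


def make_stage(name: str):
    """Build a placeholder stage that only records that it ran."""
    def stage(job: Job) -> Job:
        job.log.append(name)
        return job
    return stage


# Server-side stages in the order of Steps 3 to 7.
SERVER_STAGES = [
    make_stage("analyze_video"),            # Step 3
    make_stage("plan_from_instruction"),    # Step 4
    make_stage("edit_video"),               # Step 5
    make_stage("generate_and_sync_music"),  # Step 6
    make_stage("compress_and_send"),        # Step 7
]


def run_server(job: Job) -> Job:
    """Run every stage in order and return the finished job."""
    for stage in SERVER_STAGES:
        job = stage(job)
    return job
```

Keeping the stages in a list makes the ordering explicit and lets an individual stage be swapped out without touching the others.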

[0057] (Example 1)

[0058] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0059] The present invention aims to provide a system that allows users to easily edit high-quality videos without requiring special technical knowledge. Specifically, it aims to reduce the complexity and time required for video editing by reflecting intuitive natural-language editing instructions in the video, and by automatically emphasizing emotional elements and inserting text information.

[0060] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0061] In this invention, the server includes input means for the user to input instructions, communication means for transmitting data via a terminal device, and analysis means for analyzing the received multimedia data. This allows the user to easily input video editing instructions in natural language, and those instructions to be immediately reflected in the video editing.

[0062] "Input means" refers to a device or interface that allows users to input video editing instructions in natural language.

[0063] "Communication means" refers to the technology and mechanisms for transmitting data from a user's terminal device to a server.

[0064] "Analysis means" refers to the operation of analyzing received multimedia content using a computer process and extracting important elements and metadata from the video.

[0065] "Processing means" refers to the process of analyzing the input natural language instructions and determining how to proceed with video editing.

[0066] "Design means" refers to the process of creating a detailed video editing plan based on the analyzed instructions.

[0067] "Editing means" refers to operations that execute the designed editing plan, such as cutting videos, applying effects, and inserting music.

[0068] "Synchronization means" refers to the process of appropriately matching and integrating generated music and sound effects with the video.

[0069] "Transmission means" refers to the technology used to send the completed, edited video back to the user's terminal device.

[0070] This invention provides a system that allows users to easily edit high-quality videos using a specific program. Users first utilize a video editing interface via a dedicated smartphone app. This app features an intuitive UI and provides natural language input functionality using speech recognition technology.

[0071] The user selects the video file they want to edit and specifies their desired editing style in natural language. For example, they can enter a simple prompt in the app such as, "Edit this to highlight my fun travel memories." The device then sends this data to a server in the cloud via the internet.

[0072] The server uses analysis algorithms equipped with deep learning technology to analyze the received data. Image processing frameworks are utilized to detect important scenes and emotional elements in the video data, and natural language processing technology is used for instruction analysis. In addition, the audio information of the video is analyzed to construct an appropriate editing plan. Accordingly, a generative AI model is used to generate and synchronize sound materials and effects suitable for the video.

[0073] As a concrete example, the server performs natural language processing using BERT and analyzes emotional elements using TensorFlow (registered trademark). Once the editing plan is decided, editing software such as Adobe Premiere Pro is used to automatically perform editing processes such as cutting, transitions, and color adjustments.

[0074] Once editing is complete, the server compresses the finished video and sends it directly to the user's device. The user can then review the received video and share it on social media as needed. This system allows users to easily create high-quality videos without requiring any technical expertise.

[0075] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0076] Step 1:

[0077] The user launches the smartphone app and selects the video file they want to edit. During this process, the user can input video editing instructions in natural language. The input includes the video file itself and the editing instructions in natural language. The output is a data package containing both the instructions and the video file.

[0078] Step 2:

[0079] The terminal sends the user's entered editing instructions and selected video files to a server in the cloud. In this communication process, the data package is structured in JSON format. The input is the data package generated in step 1, and the output is the completion of the transmission to the remote server.
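A JSON-structured data package of the kind described in this step might look like the following sketch. The field names, the base64 embedding of the video bytes, and the SHA-256 integrity check are assumptions made for illustration; the disclosure states only that the package is structured in JSON format.

```python
import base64
import hashlib
import json


def build_package(video_bytes: bytes, filename: str, instruction: str) -> str:
    """Bundle the video and the instruction into one JSON string (hypothetical layout)."""
    payload = {
        "instruction": instruction,
        "video": {
            "filename": filename,
            "sha256": hashlib.sha256(video_bytes).hexdigest(),
            # JSON cannot carry raw bytes, so the video is base64-encoded here.
            "data_base64": base64.b64encode(video_bytes).decode("ascii"),
        },
    }
    return json.dumps(payload)


def parse_package(package: str) -> tuple[bytes, str]:
    """Server side: recover the video bytes and the instruction, verifying integrity."""
    payload = json.loads(package)
    video = base64.b64decode(payload["video"]["data_base64"])
    if hashlib.sha256(video).hexdigest() != payload["video"]["sha256"]:
        raise ValueError("video data was corrupted in transit")
    return video, payload["instruction"]
```

Base64 inflates the payload by roughly one third, so for large videos a practical design might instead upload the file separately and keep only a reference to it in the JSON.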

[0080] Step 3:

[0081] The server analyzes the received video file. It uses deep learning techniques to perform image processing to identify specific scenes and emotional elements within the video. The input is the video file itself, and the output is the analysis results, including metadata for important scenes and emotion labels. Specific operations include frame analysis and edge detection.
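As a simplified stand-in for the deep-learning-based frame analysis described in this step, scene boundaries can be approximated by frame differencing. The sketch below assumes each frame has already been decoded into a flat list of grayscale pixel values (0 to 255); the threshold value and the function names are hypothetical, and this is not the analysis method of the disclosure.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_scene_changes(frames: list[list[int]], threshold: float = 30.0) -> list[int]:
    """Return the indices of frames that appear to start a new scene."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]
```

For example, a sequence of two dark frames, two bright frames, and one dark frame reports changes at the third and fifth frames (indices 2 and 4).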

[0082] Step 4:

[0083] The server uses natural language processing technology to analyze the user's editing instructions and determine what edits should be made. The input is editing instructions in natural language, and the output is an editing plan. Semantic analysis is performed using models such as BERT to determine an editing policy based on specific requests such as "emotional" or "emphasize comedic elements."

[0084] Step 5:

[0085] Based on the information obtained in steps 3 and 4, the server creates a specific editing plan. The input is metadata for key scenes and the editing plan, and the output is a detailed set of editing instructions. This process includes selecting appropriate transitions, visual effects, and music files.
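One way to picture how the scene metadata from Step 3 and the editing policy from Step 4 could be merged into a detailed set of editing instructions is sketched below. The `Scene` and `EditOp` structures, the emotion labels, and the transition and effect names are all hypothetical; the disclosure does not specify the data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scene:
    start: float       # seconds
    end: float         # seconds
    emotion: str       # emotion label from the video analysis
    important: bool    # whether the analysis flagged the scene as important


@dataclass
class EditOp:
    start: float
    end: float
    transition: str
    effect: Optional[str]


def build_edit_instructions(scenes: list[Scene], policy: str) -> list[EditOp]:
    """Keep important scenes and attach a transition and effect per the policy."""
    ops: list[EditOp] = []
    for scene in scenes:
        if not scene.important:
            continue  # drop scenes the analysis did not flag as important
        emotional = policy == "emotional"
        slow = emotional and scene.emotion in {"joy", "sadness"}
        ops.append(
            EditOp(
                start=scene.start,
                end=scene.end,
                transition="crossfade" if emotional else "cut",
                effect="slow_motion" if slow else None,
            )
        )
    return ops
```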

[0086] Step 6:

[0087] The server uses automated software tools like Adobe Premiere Pro to actually process the video according to the editing plan. This involves specific actions such as cutting footage, applying effects, and color correction. The input is a detailed set of editing instructions, and the output is the edited video file.

[0088] Step 7:

[0089] The server automatically generates suitable music and sound effects using a generative AI model and synchronizes them with the video. The input is a music theme based on the editing plan, and the output is a music file synchronized with the video. This operation includes music generation and timing adjustment.
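The timing adjustment mentioned in this step could, for instance, align video cut points with the beats of the generated music. The sketch below assumes the generated track has a constant, known tempo in beats per minute; that assumption, and the function names, are illustrative and not taken from the disclosure.

```python
def beat_times(bpm: float, duration: float) -> list[float]:
    """Beat timestamps in seconds for a constant-tempo track of the given duration."""
    interval = 60.0 / bpm
    count = int(duration / interval) + 1
    return [i * interval for i in range(count)]


def snap_cuts_to_beats(cuts: list[float], bpm: float, duration: float) -> list[float]:
    """Move each cut point to the nearest beat so cuts land on the music."""
    beats = beat_times(bpm, duration)
    return [min(beats, key=lambda beat: abs(beat - cut)) for cut in cuts]
```

At 120 beats per minute the beats fall every 0.5 seconds, so a cut at 1.1 seconds moves to 1.0 and a cut at 3.4 seconds moves to 3.5.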

[0090] Step 8:

[0091] The edited video is compressed and sent from the server to the user's device. The input is the edited video file, and the output is the completed delivery to the user's device. After the device receives the video, it notifies the user that editing is complete.

[0092] Step 9:

[0093] Users can view edited videos on their devices and easily share them on social media, etc. This process involves the user playing the video, selecting privacy settings, and then pressing the share button to complete the sharing process.

[0094] (Application Example 1)

[0095] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart device 14 will be referred to as the "terminal."

[0096] Currently, producing advertising videos requires a significant amount of time and specialized knowledge, making it difficult to create high-quality advertising videos quickly and effectively. Therefore, there is a need for a system that allows advertising creators and marketing personnel to easily produce videos that match their respective advertising concepts.

[0097] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0098] In this invention, the server includes means for designing an editing plan based on the advertising concept, means for synchronizing generated music with video, and means for automatically inserting advertising information into the generated video. This makes it possible for users to quickly create high-quality videos suitable for advertising purposes without requiring specialized knowledge.

[0099] "User input means" refers to a device or software that provides an interface for users to input video editing instructions using natural language.

[0100] A "mobile terminal" refers to a computer device such as a smartphone or tablet carried by a user, used for sending and receiving data.

[0101] "Means of transmitting data" refers to a device or software that has the function of transmitting instructions or video data from a user's mobile device to a cloud server via the internet.

[0102] "Means for analyzing received video data" refers to programs that use artificial intelligence and deep learning technologies to analyze video data received on a cloud server and understand its content.

[0103] "A means of analyzing natural language to determine editing tasks" refers to a program that utilizes natural language processing techniques to analyze natural language instructions input by the user and determine what kind of editing should be performed based on those instructions.

[0104] "A means of designing an editing plan based on an advertising concept" refers to a system equipped with algorithms for planning how to edit a video according to the purpose and target audience of the advertisement.

[0105] A "video editing tool" is a program that automatically performs editing tasks such as cutting, splicing, and adding effects to footage based on a pre-designed editing plan.

[0106] "Methods for synchronizing generated music with video" refers to a program that provides technology for incorporating automatically generated music into video in a way that complements the video's theme and emotions, ensuring it is appropriately matched to the video.

[0107] "Means for transmitting edited video to a mobile device" refers to a device or software with network communication capabilities that compresses the completed video on a cloud server and transmits it quickly to the user's mobile device.

[0108] A "means for automatically inserting advertising information" refers to a system that has the function of automatically adding information such as text and graphics to generated videos, according to marketing objectives.

[0109] To realize this application, it is necessary to build a system that primarily involves cloud-based servers and the user's mobile device (smartphone or tablet). The mobile device has an application installed that allows the user to input video editing instructions in natural language. This application provides a means of user input, allowing users to specify advertising concepts and desired editing styles in natural language.

[0110] When a user enters instructions and selects advertising materials, this information is sent to the cloud via the internet. On the cloud server, a program developed using the Python programming language analyzes the received video data. Using software libraries such as TensorFlow and OpenCV, the program analyzes the video data and identifies important scenes. Furthermore, it utilizes the Hugging Face Transformers library to analyze the user's natural language input and determine the appropriate editing task.

[0111] The server designs an editing plan based on the user's advertising concept. Following this plan, editing is performed automatically, including cutting, splicing, and applying visual effects to the video. Additionally, a generative AI model is used to automatically generate appropriate music and synchronize it with the edited video. A deep learning model is employed for this music generation.

[0112] The edited video is compressed on a cloud server and then sent again to the user's mobile device via the internet. The user can view this completed video file at any time and easily share it as needed. For example, by entering a prompt such as "Create a friendly, trendy cosmetics advertisement video for women in their 20s," a video based on a specific editing plan will be generated.

[0113] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0114] Step 1:

[0115] The user's device receives video editing instructions in natural language via the application. The input instructions and selected advertising materials are sent to a cloud server via the internet. At this stage, the input consists of the user's text instructions and video data, and sending this to the server initiates the editing process.

[0116] Step 2:

[0117] The server analyzes the received video data using the OpenCV library to identify important scenes and target objects. This analysis involves image processing and feature extraction for each frame to extract the target scenes. As a result, scene information suitable for advertising is obtained.

[0118] Step 3:

[0119] The server uses the Hugging Face Transformers library to analyze the user's natural language instructions. Through this analysis, it understands the user's request from the input prompt and determines what editing task should be performed. The analysis outputs a specific editing plan.

[0120] Step 4:

[0121] The server designs an editing plan based on the advertising concept. Here, it combines scene information obtained in the previous stage with user requests to plan specific editing techniques. This plan includes the sequence of scenes, transitions, visual effects, and so on.
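As an illustration of combining scene information with the advertising concept, the sketch below selects the highest-scoring scenes that fit a target advertisement length and returns them in chronological order. The relevance score per scene, the target duration, and the greedy selection strategy are assumptions made for this example; the disclosure does not specify how the plan is computed.

```python
# Each scene is (start_seconds, end_seconds, relevance_score); the score is a
# hypothetical measure of how well the scene matches the advertising concept.
SceneInfo = tuple[float, float, float]


def select_scenes(scenes: list[SceneInfo], target_seconds: float) -> list[SceneInfo]:
    """Greedily pick the best-scoring scenes that fit within the target length."""
    chosen: list[SceneInfo] = []
    remaining = target_seconds
    for start, end, score in sorted(scenes, key=lambda s: s[2], reverse=True):
        length = end - start
        if length <= remaining:
            chosen.append((start, end, score))
            remaining -= length
    return sorted(chosen)  # restore chronological order for the timeline
```

Greedy selection is simple but not optimal in general; it is used here only to make the idea of fitting scenes to an advertisement slot concrete.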

[0122] Step 5:

[0123] The server automatically generates appropriate music using a generative AI model and synchronizes it with the edited video. It utilizes deep learning technology to generate music that matches the video's theme. The generated music is positioned to align with the flow of the video.

[0124] Step 6:

[0125] The server automatically cuts, splices, and applies effects to the video based on the editing plan. Specific editing tasks are performed; color tones are adjusted and text information is inserted as needed. As a result of this process, the final advertisement video is output.

[0126] Step 7:

[0127] The completed video is compressed on a cloud server and sent to the user's mobile device. The compressed video is provided in a format suitable for the user to download and review. As a result, users can easily preview the advertising video and further share or publish it.

[0128] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0129] This invention provides a system that enables video editing that reflects the user's emotions. Using a smartphone app, users can explicitly input their emotional state when entering videos and editing instructions. Furthermore, the app utilizes the camera and microphone to automatically recognize the user's facial expressions and voice tone through an emotion engine.

[0130] The device sends the user's selected video along with instructions and emotional information to the cloud server. The server analyzes the received video and, based on the emotional information from the emotion engine, designs an editing task that matches the user's intentions. This design includes cutting the video, applying slow motion and effects, and adjusting color filters.

[0131] The server uses natural language processing to understand the user's editing instructions and emotional information, and determines how the video should be edited. The emotion engine provides music and effects that match the detected emotions. For example, if the user wants an "emotional" and "calming" video, slow-motion effects and melodic background music will be selected.
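
The selection described in this paragraph can be pictured as a lookup from emotion words to effects and a music style. The sketch below uses plain keyword matching as a stand-in for the natural language processing the disclosure refers to; the rule table, the effect names, and the `choose_style` function are assumptions made for illustration only.

```python
# Hypothetical rule table: emotion keyword -> effects and music style.
STYLE_RULES = {
    "emotional": {"effects": ["slow_motion"], "music": "melodic"},
    "calming": {"effects": ["slow_motion", "soft_color_filter"], "music": "melodic"},
    "fun": {"effects": ["quick_cuts"], "music": "upbeat"},
}


def choose_style(instruction: str) -> dict:
    """Collect effects and a music style for every keyword found in the instruction."""
    text = instruction.lower()
    effects: list[str] = []
    music_votes: list[str] = []
    for keyword, rule in STYLE_RULES.items():
        if keyword in text:
            for effect in rule["effects"]:
                if effect not in effects:  # keep each effect once, in rule order
                    effects.append(effect)
            music_votes.append(rule["music"])
    # Use the first matched music style; fall back to a neutral default.
    music = music_votes[0] if music_votes else "neutral"
    return {"effects": effects, "music": music}
```

For an instruction containing both "emotional" and "calming", this returns slow motion plus a soft color filter with melodic music, which matches the example in the text.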

[0132] The edited video automatically includes text information that matches the user's emotions. This ensures that the entire video expresses the user's feelings. The server compresses the finished video and sends it to the user's device. The user can then review the video and share it in their preferred way.

[0133] Thus, with this invention incorporating an emotion engine, users can easily create and share videos that reflect their emotions in real time, even without technical skills.

[0134] The following describes the processing flow.

[0135] Step 1:

[0136] The user launches the smartphone app and selects the video they want to edit. Furthermore, they can use the app's features to input their own emotions or have emotional information automatically collected using the camera and microphone.

[0137] Step 2:

[0138] The device sends the video file, user editing instructions, and recognized emotion data as a single package to the cloud server. The transmission is done via the internet and is encrypted to ensure security.

[0139] Step 3:

[0140] The server analyzes the received video data and emotional information. It identifies scene transitions in the video and unique characteristics in the audio, and analyzes the emotional state obtained by the emotion engine.

[0141] Step 4:

[0142] The server uses natural language processing technology to generate the optimal editing task based on the user's editing instructions and emotional information. The AI interprets emotional instructions such as "emotional" or "fun" and constructs an editing plan.

[0143] Step 5:

[0144] The server then develops an editing plan based on the emotion engine. It automatically selects necessary effects and filters, inserts cutscenes, and chooses and applies music. The editing plan includes restructuring the timeline and applying effects.

[0145] Step 6:

[0146] The server automatically adds AI-generated music suited to the video content. It adjusts the tempo and style of the music to match the emotional information so that the music harmonizes with the video as a whole.

[0147] Step 7:

[0148] The server automatically inserts emotionally relevant text into the edited video and compresses it into the specified format.

[0149] Step 8:

[0150] The edited video is sent from the server to the user's device. The device saves it to local storage, and the user can view the video through the app and easily share it on platforms like Facebook and Instagram as they like.
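
The tempo adjustment in Step 6 can be illustrated with a small calculation. The sketch below assumes that a base tempo has already been chosen from the emotional information and that the music should end exactly when the video ends; both assumptions, along with the `fit_tempo` name and the 4-beat bar, are illustrative and not taken from the disclosure.

```python
def fit_tempo(video_seconds: float, base_bpm: float, beats_per_bar: int = 4) -> tuple[int, float]:
    """Pick a whole number of bars and the BPM that makes them end with the video."""
    if video_seconds <= 0 or base_bpm <= 0:
        raise ValueError("duration and tempo must be positive")
    seconds_per_bar = beats_per_bar * 60.0 / base_bpm
    # Closest whole number of bars at the base tempo (at least one bar).
    bars = max(1, round(video_seconds / seconds_per_bar))
    # Nudge the tempo so that exactly `bars` bars fill the video duration.
    adjusted_bpm = bars * beats_per_bar * 60.0 / video_seconds
    return bars, adjusted_bpm
```

For a 20-second video and a calm base tempo of 70 BPM, this yields 6 bars at 72 BPM, a small change that keeps the intended mood while letting the music finish on a bar boundary.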

[0151] (Example 2)

[0152] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart device 14 as the "terminal".

[0153] Traditional video editing tools often struggle to reflect the user's intended emotions in the video and frequently require specialized editing knowledge. Furthermore, because there are few ways to reflect user emotions in real time, it is difficult to create content that meets user expectations.

[0154] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0155] In this invention, the server includes a user input means, a means for transmitting video and emotional information through a terminal, and a means for analyzing the received video. This enables automatic video editing that reflects the user's emotions, making it possible for even users without specialized knowledge to create videos that effectively express emotions.

[0156] "User input means" refers to technology that provides an interface for users to select videos they want to edit, give editing instructions, and input emotional information.

[0157] A "terminal" is a portable electronic device that has the function of receiving user input and transmitting and receiving data.

[0158] "Moving images" are visual data that convey motion by playing back a series of still images at a constant speed.

[0159] "Emotional information" refers to data that indicates the user's psychological state and is analyzed by the emotion engine.

[0160] A "server" is a remote computer system used for analyzing, editing, and processing data.

[0161] "Natural language" refers to the language that humans use on a daily basis and is in a form that can be analyzed by programs.

[0162] An "editing plan" is a detailed design of how to edit video footage, generated based on user instructions and emotional information.

[0163] "Emotional data" refers to data that expresses a user's emotional state using numerical values or categories.

[0164] A "generative AI model" is an artificial intelligence model trained to generate text or media based on specified prompts.

[0165] This invention is a system for reflecting user-intended emotions in moving images, and it achieves this through collaboration between the user, the terminal, and the server.

[0166] Users select videos and input editing instructions and emotional information using an application on a device such as a smartphone. Users can manually select their own emotions, and the device automatically captures the user's facial expressions and voice tone using its camera and microphone. This data is analyzed by the device's emotion engine. For example, if the camera detects the user's smile, the emotion "joy" is recorded.

[0167] The device sends collected video footage, editing instructions, and emotional information to a server in the cloud. A key aspect of this system is that communication is conducted using a secure protocol. Based on the received data, the server analyzes the video footage using a generative AI model and understands the user's intent through natural language processing. It automatically generates an editing plan and performs appropriate editing to produce video footage that matches the user's emotions.

[0168] The server applies music and visual effects based on emotional data to the video during editing. For example, if the user enters "calm mood," the server will select slow motion and gentle music. Text matching the user's entered emotion is automatically inserted into the edited video. The completed edited video is compressed and sent to the user's device.

[0169] In this system, a generative AI model plays a central role in generating natural and emotionally rich content. A concrete example of a prompt message is, "Add a cheerful atmosphere to a birthday party video. Emotion: Happiness, blessings." This invention makes it possible for users to easily create and share high-quality videos that express their emotions, even without advanced technical skills.
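
The prompt message quoted above combines an editing instruction with emotion labels. A minimal sketch of assembling such a prompt is shown below; the `build_prompt` function and the exact "Emotion:" layout are assumptions modeled on the quoted example, not a format defined by the disclosure.

```python
def build_prompt(instruction: str, emotions: list[str]) -> str:
    """Combine the editing instruction and emotion labels into one prompt string."""
    instruction = instruction.strip()
    # Make sure the instruction ends as a sentence before appending labels.
    if not instruction.endswith((".", "!", "?")):
        instruction += "."
    labels = [e.strip() for e in emotions if e.strip()]
    if not labels:
        return instruction
    return f"{instruction} Emotion: {', '.join(labels)}."
```

Called with the instruction "Add a cheerful atmosphere to a birthday party video" and the labels "Happiness" and "blessings", it reproduces the prompt message given in the text.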

[0170] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0171] Step 1:

[0172] The user launches the smartphone app and selects the video or image they want to edit. Next, the user either manually enters their emotions or allows the device to use its camera and microphone to automatically capture emotion information. This saves the user's input as video and emotion data on the device.

[0173] Step 2:

[0174] The device uses an emotion engine to analyze the acquired emotional information. Facial expressions captured by the camera and voice tones captured by the microphone are quantified as emotional data. This analysis identifies emotions such as "joy" and "sadness."

[0175] Step 3:

[0176] The device sends the video selected by the user and the analyzed emotion information to a cloud server. The data is encrypted and transmitted securely. The output of this step is the data delivered to the receiving server.

[0177] Step 4:

[0178] The server uses natural language processing to analyze the received video and emotional information. This analysis interprets the user's editing instructions and generates an editing plan on how to edit the video. For example, if the instruction is "calm mood," the application of slow motion and gentle music will be considered.

[0179] Step 5:

[0180] The server uses a generative AI model to apply appropriate visual effects and music to the video. Based on the input emotion data, the video effects and soundtrack are automatically selected. For example, if the emotion "joy" is input, bright music and a vibrant color filter will be applied.

[0181] Step 6:

[0182] The server automatically inserts text information that matches the user's emotions into the edited video. For example, text such as "I love this moment" might be selected. The output is a completed video that reflects the emotions.

[0183] Step 7:

[0184] The server compresses the edited video and efficiently transfers it to the user's device. This allows the user to instantly play the edited video on their device. As output, high-quality video is provided and saved on the user's device.

[0185] Step 8:

[0186] Users view the edited videos they receive and evaluate and revise them as needed. If users are satisfied with the result, they can share the video through social media or messaging apps. Ultimately, they can share emotionally expressive videos with others.
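
Step 2 of this flow quantifies facial expressions and voice tone as emotional data and identifies an emotion such as "joy" or "sadness". The sketch below shows one way two sets of per-emotion scores could be blended into numerical values and a single category. The weighting scheme, the `face_weight` default, and the `fuse_emotions` name are assumptions; the disclosure does not say how the emotion engine combines its inputs.

```python
def fuse_emotions(
    face: dict[str, float],
    voice: dict[str, float],
    face_weight: float = 0.6,  # hypothetical weighting between the two sources
) -> tuple[str, dict[str, float]]:
    """Blend per-emotion scores from two sources and return (top label, normalized scores)."""
    labels = sorted(set(face) | set(voice))
    blended = {
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    total = sum(blended.values())
    if total <= 0:
        raise ValueError("no emotion evidence in either source")
    # Normalize so the scores sum to 1 and can be compared across requests.
    scores = {label: value / total for label, value in blended.items()}
    top = max(scores, key=scores.get)
    return top, scores
```

The returned label corresponds to the categorical form of emotional data and the score dictionary to the numerical form, so either can be passed on to the server in Step 3.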

[0187] (Application Example 2)

[0188] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as a "server" and the smart device 14 as a "terminal".

[0189] There is a growing demand for automated video content that more richly expresses people's emotions during family activities and events. However, current systems make it difficult to edit content based on emotions, and this is especially challenging for users without technical knowledge.

[0190] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0191] In this invention, the server includes means for user input, means for transmitting data via a mobile terminal, means for analyzing received video data, means for analyzing natural language to determine an editing task, means for designing an editing plan, means for editing the video based on the editing plan, means for synchronizing generated music with the video, means for transmitting the edited video to the mobile terminal, means for identifying emotional states, means for adding relevant information to the automatically corrected video, and means for recording activities at home and performing editing that is appropriate to the situation. This enables users to perform emotion-based video editing without requiring technical skills.

[0192] "User input means" refers to an interface that allows users to provide instructions or information to the system.

[0193] "Means of transmitting data via mobile terminals" refers to the function of sending data to a server using mobile devices such as smartphones and tablets.

[0194] "Means for analyzing received video data" refers to the process of analyzing the transmitted video data using a computer program and understanding its content.

[0195] "A means of analyzing natural language to determine editing tasks" refers to a process that analyzes instructions provided by users in natural language and converts them into specific editing tasks.

[0196] "Means for designing editing plans" refers to the function of planning the editing policies and details of a video based on the analyzed data.

[0197] "Means for editing a video based on an editing plan" refers to the process of cutting the video and applying effects to it according to the planned policy.

[0198] "A means of synchronizing generated music with video" refers to a function that combines music with edited video at the appropriate timing.

[0199] "Means of transmitting edited video to a mobile device" refers to the process of transferring the edited video back to the user's mobile device.

[0200] "Means for identifying emotional states" refers to analytical functions using machine learning or algorithms that identify emotions from the user's speech and facial expressions.

[0201] "Means for adding relevant information to the automatically corrected video" refers to the process of adding emotionally and contextually relevant information or text to the edited video.

[0202] "A means of recording activities at home and editing them to suit the situation" refers to a function that photographs the daily life and events of a family and edits them into the most suitable format according to the situation.

[0203] To implement this invention, first, a consumer robot equipped with a user-friendly interface for home use is prepared. This robot is equipped with a camera and microphone, which can capture the user's facial expressions and voice in real time. Without the user explicitly inputting their emotional state, the robot uses an emotion engine that automatically analyzes the collected facial data and voice tone. This engine incorporates a machine learning model that can identify the user's emotions with high accuracy.

[0204] The received video data is analyzed by a processing unit within the robot or by a server in the cloud. During this process, a natural language processing engine is utilized to appropriately understand editing requests for the video. For example, if an emotionally moving scene needs to be emphasized, editing tasks such as inserting slow-motion effects or melodic music are automatically designed. This results in video content that reflects the user's emotions.

[0205] Once editing is complete, an AI model automatically inserts emotionally relevant text into the video. For example, text such as "This moment is priceless" might be added to emotionally moving scenes. The server compresses the final edited video and sends it to the user's device. Users can then review the finished video and share it with family and friends in an emotionally resonant way.

[0206] For example, a video might be automatically generated showing a family dinner scene with soft, calming music playing in the background, highlighting moments of smiles in slow motion. In this case, an example of a prompt message might be, "Please suggest an editing style that emphasizes the happy atmosphere of dinnertime."

[0207] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0208] Step 1:

[0209] The user simply inputs their desired editing style based on videos captured by the home robot. Even without explicit input from the user, the robot uses its camera and microphone to capture the user's facial expressions and voice tone, sending this as emotion data to its internal processor. The input data consists of the user's facial images and voice, which serve as the raw materials for emotion recognition.

[0210] Step 2:

[0211] The device's emotion engine analyzes collected facial expression data and voice tone to identify the user's emotional state. Here, machine learning algorithms are used to process the data and extract emotions such as "joy" and "surprise" as the output of emotion recognition.

[0212] Step 3:

[0213] The server designs an appropriate editing task based on the user's emotional state and the video content. In this process, the server uses a natural language processing engine to analyze the user's editing instructions and determine how the video should be edited. The input is the emotional state and the user's instruction text, and the output is a specific editing plan.

[0214] Step 4:

[0215] The server edits the video based on the designed editing plan. It performs edits such as cutting, applying slow-motion effects and filters, and uses a generative AI model to select music and effects that match the emotions. The input to this process is the editing plan, and the output is the edited video.

[0216] Step 5:

[0217] The server uses an emotion-based generative AI model to automatically insert appropriate text information into the video. Here, by embedding text that expresses emotions into the edited video, a visually rich and emotionally resonant output is obtained.

[0218] Step 6:

[0219] The server compresses the edited video data and sends it back to the original mobile device. The final output is a video file processed into a format that users can easily play at home. This allows users to review and share videos that express the intended emotions.
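
Because Steps 3 to 5 above each name their input and output, the data flow can be sketched as a chain of typed functions. The following is only a schematic: the rules inside each function are placeholders, the class and function names are hypothetical, and no real video is rendered. The caption reuses the example text given earlier in this application example.

```python
from dataclasses import dataclass


@dataclass
class EmotionState:
    label: str  # e.g. "joy" or "surprise"


@dataclass
class EditPlan:
    tasks: list[str]


@dataclass
class EditedVideo:
    source: str
    applied: list[str]
    caption: str = ""


def design_plan(emotion: EmotionState, instruction: str) -> EditPlan:
    # Step 3: emotional state + instruction text -> editing plan (placeholder rules).
    tasks = ["cut"]
    if emotion.label == "joy":
        tasks += ["slow_motion", "bright_music"]
    if "filter" in instruction.lower():
        tasks.append("filter")
    return EditPlan(tasks)


def edit_video(source: str, plan: EditPlan) -> EditedVideo:
    # Step 4: editing plan -> edited video (here only recorded, not rendered).
    return EditedVideo(source=source, applied=list(plan.tasks))


def insert_text(video: EditedVideo, emotion: EmotionState) -> EditedVideo:
    # Step 5: add emotion-matched text to the edited video.
    captions = {"joy": "This moment is priceless"}
    video.caption = captions.get(emotion.label, "")
    return video


def run_pipeline(source: str, emotion: EmotionState, instruction: str) -> EditedVideo:
    plan = design_plan(emotion, instruction)
    return insert_text(edit_video(source, plan), emotion)
```

Keeping each step as a separate function with explicit inputs and outputs mirrors the way the flow is described and would let any one stage be replaced, for example by a generative AI model, without changing the others.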

[0220] The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0221] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include generative AI such as ChatGPT (registered trademark) (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (registered trademark) (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions given in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0222] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart device 14.

[0223] [Second Embodiment]

[0224] Figure 3 shows an example of the configuration of the data processing system 210 according to the second embodiment.

[0225] As shown in Figure 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

[0226] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and / or a LAN (Local Area Network).

[0227] The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication interface 44. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, and camera 42 are also connected to the bus 52.

[0228] The microphone 238 receives instructions from the user 20 by picking up the user's voice. The microphone 238 captures the voice of the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0229] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0230] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0231] Figure 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in Figure 4, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0232] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0233] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0234] In the smart glasses 214, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0235] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0236] This invention provides a system that allows users to easily edit high-quality videos. Users input video editing instructions using a dedicated smartphone app. The app features an intuitive UI, allowing users to input instructions in natural language and select the necessary video files for editing.

[0237] When a user enters instructions and selects a video file, this information is sent to a cloud server via the internet. The server uses advanced AI and deep learning technologies to analyze the received video data. This analysis identifies emotional elements, important scenes, and dialogue sections within the audio track.

[0238] The server understands the natural language instructions it receives and determines what kind of editing should be done. This instruction analysis uses natural language processing techniques. For example, if the instruction is "edit emotionally," it plans music and slow-motion effects to evoke emotion. If the instruction emphasizes comedic elements, it inserts appropriate captions and visual effects.

[0239] Based on the editing plan designed by the server, the actual video editing is performed automatically. This includes cutting and joining footage, applying transition effects, and adjusting color tones as needed. Furthermore, AI-generated background music is created to harmonize with the video's atmosphere. This music is automatically generated to match the video's theme.
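
The cutting and joining of footage mentioned above can be illustrated with a simple cut list. The sketch below assumes that the analysis stage has produced (start, end) time ranges for important scenes and merges overlapping ranges into an ordered list of segments to keep. The `build_cut_list` name and the `padding` parameter are hypothetical.

```python
def build_cut_list(
    segments: list[tuple[float, float]],
    padding: float = 0.0,  # optional seconds kept before and after each segment
) -> list[tuple[float, float]]:
    """Merge overlapping or touching (start, end) segments into an ordered cut list."""
    padded = sorted(
        (max(0.0, start - padding), end + padding)
        for start, end in segments
        if end > start  # ignore empty or reversed ranges
    )
    merged: list[tuple[float, float]] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous segment: extend it instead of adding a new cut.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Joining the resulting segments in order gives the edited sequence; transitions and color adjustments would then be applied at the segment boundaries.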

[0240] The edited, completed video file is sent directly from the cloud server to the user's smartphone in a compressed format. Users can then review and easily share the saved video. In this way, users can easily create videos that suit their intentions, even without specialized knowledge.

[0241] The following describes the processing flow.

[0242] Step 1:

[0243] The user launches the smartphone app and selects the video they want to edit. They also input instructions regarding the editing direction and style in natural language.

[0244] Step 2:

[0245] The device sends the video file selected by the user and the entered editing instructions as data to the cloud server. This communication takes place over the internet.

[0246] Step 3:

[0247] The server analyzes the received video file and extracts various features of the video and audio. At this stage, it identifies points of change in the scene and important parts of the audio track.

[0248] Step 4:

[0249] The server analyzes the user's editing instructions using natural language processing technology and formulates an editing plan. It interprets the necessary editing effects and video themes from the instructions.

[0250] Step 5:

[0251] The server edits the video based on the established editing plan. This includes cutting video clips, adjusting sequences, and applying appropriate filters and transitions.

[0252] Step 6:

[0253] The server uses AI to automatically generate music suitable for the video and synchronizes it with the finished footage. In this way, it adds background music that is perfectly suited to the video.

[0254] Step 7:

[0255] The server compresses the edited video into the optimal format and sends it to the user's device. This process utilizes data compression techniques to reduce transfer time.

[0256] Step 8:

[0257] The device saves the received, edited video to local storage. Users can then review it and easily share it on social media or other platforms.
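
Step 3 of this flow identifies points of change in the scene. One common way to approximate this, shown here as a sketch and not as the method of the disclosure, is to compare consecutive frames and flag a change when the average pixel difference exceeds a threshold. Frames are represented as flat lists of grayscale values to keep the example dependency-free; the threshold value and the `detect_scene_changes` name are assumptions.

```python
def detect_scene_changes(frames: list[list[int]], threshold: float = 30.0) -> list[int]:
    """Return indices of frames that differ strongly from the frame before them."""
    changes = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        if not cur or len(prev) != len(cur):
            raise ValueError("all frames must be non-empty and the same size")
        # Mean absolute difference between corresponding pixels.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            changes.append(i)
    return changes


# Five tiny 4-pixel frames: the brightness jumps at frame 2 and again at frame 4.
frames = [[10] * 4, [12] * 4, [200] * 4, [198] * 4, [50] * 4]
```

On the sample frames the function reports frames 2 and 4 as scene boundaries, while the small brightness drifts between the other frames are ignored.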

[0258] (Example 1)

[0259] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0260] The present invention aims to provide a system that allows users to easily edit high-quality videos without requiring special technical knowledge. Specifically, it aims to reduce the complexity and time required for video editing by reflecting intuitive natural-language editing instructions in the video, and by automatically emphasizing emotional elements and inserting text information.

[0261] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0262] In this invention, the server includes input means for the user to input instructions, communication means for transmitting data via a terminal device, and analysis means for analyzing the received multimedia data. This allows the user to easily input video editing instructions in natural language, and those instructions to be immediately reflected in the video editing.

[0263] "Input means" refers to a device or interface that allows users to input video editing instructions in natural language.

[0264] "Communication means" refers to the technology and mechanisms for transmitting data from a user's terminal device to a server.

[0265] "Analysis means" refers to the operation of analyzing received multimedia content using a computer process and extracting important elements and metadata from the video.

[0266] "Processing means" refers to the process of analyzing the input natural language instructions and determining how to proceed with video editing.

[0267] "Design means" refers to the process of creating a detailed video editing plan based on the analyzed instructions.

[0268] "Editing means" refers to operations that execute a designed editing plan, such as cutting videos, applying effects, and inserting music.

[0269] "Synchronization means" refers to the process of appropriately matching and integrating generated music and sound effects with the video.

[0270] "Transmission means" refers to the technology used to send the completed, edited video back to the user's terminal device.

[0271] This invention provides a system that allows users to easily edit high-quality videos using a specific program. Users first utilize a video editing interface via a dedicated smartphone app. This app features an intuitive UI and provides natural language input functionality using speech recognition technology.

[0272] The user selects the video file they want to edit and specifies their desired editing style in natural language. For example, they can enter a simple prompt in the app such as, "Edit this to highlight my fun travel memories." The device then sends this data to a server in the cloud via the internet.

[0273] The server uses analysis algorithms equipped with deep learning technology to analyze the received data. Image processing frameworks are utilized to detect important scenes and emotional elements in the video data, and natural language processing technology is used for instruction analysis. In addition, the audio information of the video is analyzed to construct an appropriate editing plan. Accordingly, a generative AI model is used to generate and synchronize sound materials and effects suitable for the video.

[0274] As a concrete example, the server performs natural language processing using BERT and analyzes emotional elements using TensorFlow. Once the editing plan is decided, editing software such as Adobe Premiere Pro is used to automatically perform editing processes such as cutting, transitions, and color adjustments.

[0275] Once editing is complete, the server compresses the finished video and sends it directly to the user's device. The user can then review the received video and share it on social media as needed. This system allows users to easily create high-quality videos without requiring any technical expertise.

[0276] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0277] Step 1:

[0278] The user launches the smartphone app and selects the video file they want to edit. During this process, the user can input video editing instructions in natural language. The input includes the video file itself and the editing instructions in natural language. The output is a data package containing both the instructions and the video file.

[0279] Step 2:

[0280] The terminal sends the user's entered editing instructions and selected video files to a server in the cloud. In this communication process, the data package is structured in JSON format. The input is the data package generated in step 1, and the output is the completion of the transmission to the remote server.

[0281] Step 3:

[0282] The server analyzes the received video file. It uses deep learning techniques to perform image processing to identify specific scenes and emotional elements within the video. The input is the video file itself, and the output is the analysis results, including metadata for important scenes and emotion labels. Specific operations include frame analysis and edge detection.

[0283] Step 4:

[0284] The server analyzes the user's editing instructions using natural language processing technology and determines what kind of editing should be performed. The input is an editing instruction in natural language, and the output is an editing plan. Semantic analysis is performed using a model such as BERT to determine an editing policy based on specific requests such as "emotional" or "emphasis on comedic elements".

[0285] Step 5:

[0286] Based on the information obtained in Steps 3 and 4, the server creates a specific editing plan. The input is the metadata of important scenes and the editing plan, and the output is a detailed set of editing instructions. This process includes the selection of appropriate transitions, visual effects, and music files.

[0287] Step 6:

[0288] The server uses a software automation tool such as Adobe Premiere Pro to actually process the video according to the editing plan. Here, specific operations such as video cuts, application of effects, and color correction are performed. The input is a detailed set of editing instructions, and the output is an edited video file.

[0289] Step 7:

[0290] The server uses a generative AI model to automatically generate key music and sound effects and synchronize them with the video. The input is a music theme based on the editing plan, and the output is a music file synchronized with the video. This operation includes music generation and timing adjustment.
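The timing adjustment in Step 7 is not detailed in the text. As one hedged interpretation, the sketch below adjusts the tempo so that a whole number of beats fits the video length and then moves cut points onto the nearest beat. The function names and the beat-snapping approach are assumptions for illustration and are independent of how the music itself is generated.

```python
from typing import List


def fit_tempo(duration_sec: float, target_bpm: float) -> float:
    """Return the BPM closest to target_bpm that fits a whole number of beats."""
    if duration_sec <= 0 or target_bpm <= 0:
        raise ValueError("duration and tempo must be positive")
    beats = max(1, round(duration_sec * target_bpm / 60.0))
    return beats * 60.0 / duration_sec


def beat_times(duration_sec: float, bpm: float) -> List[float]:
    """List beat onset times in seconds, starting at 0 and ending before the end."""
    interval = 60.0 / bpm
    count = round(duration_sec / interval)
    return [i * interval for i in range(count)]


def snap_cut_to_beat(cut_sec: float, beats: List[float]) -> float:
    """Move a cut point to the nearest beat so that cuts land on the music."""
    return min(beats, key=lambda b: abs(b - cut_sec))
```

For a 2-second clip at 120 BPM, `beat_times` yields beats at 0.0, 0.5, 1.0, and 1.5 seconds, and a cut planned at 0.7 seconds would be moved to 0.5 seconds.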

[0291] Step 8:

[0292] The edited video is compressed and sent from the server to the user's terminal. The input is the edited video file, and the output is the completed delivery to the user's terminal. On receiving the video, the terminal notifies the user that editing is complete.

[0293] Step 9:

[0294] Users can view edited videos on their devices and easily share them on social media, etc. This process involves the user playing the video, selecting privacy settings, and then pressing the share button to complete the sharing process.

[0295] (Application Example 1)

[0296] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0297] Currently, producing advertising videos requires a significant amount of time and specialized knowledge, making it difficult to create high-quality advertising videos quickly and effectively. Therefore, there is a need for a system that allows advertising creators and marketing personnel to easily produce videos that match their respective advertising concepts.

[0298] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0299] In this invention, the server includes means for designing an editing plan based on the advertising concept, means for synchronizing generated music with video, and means for automatically inserting advertising information into the generated video. This makes it possible for users to quickly create high-quality videos suitable for advertising purposes without requiring specialized knowledge.

[0300] "User input means" refers to a device or software that provides an interface for users to input video editing instructions using natural language.

[0301] A "mobile terminal" refers to a computer device such as a smartphone or tablet carried by a user, used for sending and receiving data.

[0302] The "means for transmitting data" refers to a device or software that has the function of transmitting instructions and video data from a user's mobile terminal to a cloud server via the Internet.

[0303] The "means for analyzing the received video data" refers to a program that uses artificial intelligence or deep learning technology to analyze the video data received on the cloud server and understand its content.

[0304] The "means for analyzing natural language to determine an editing task" refers to a program that utilizes natural language processing technology to analyze the natural language instructions input by the user and determine what kind of editing should be performed based on them.

[0305] The "means for designing an editing plan based on the advertisement concept" refers to a system equipped with an algorithm for planning the video editing method according to the purpose and target of the advertisement.

[0306] The "means for editing a video" refers to a program that has the function of automatically performing video editing such as cutting, splicing, and adding effects based on the designed editing plan.

[0307] The "means for synchronizing the generated music with the video" refers to a program that provides a technology for appropriately incorporating the automatically generated music that complements the theme and emotion of the video into the video.

[0308] The "means for transmitting the edited video to the mobile terminal" refers to a device or software that has a network communication function for compressing the completed video on the cloud server and quickly transmitting it to the user's mobile terminal.

[0309] The "means for automatically inserting advertisement information" refers to a system that has the function of automatically adding information such as text and graphics according to the marketing purpose to the generated video.

[0310] To realize this application, it is necessary to build a system that primarily involves cloud-based servers and the user's mobile device (smartphone or tablet). The mobile device has an application installed that allows the user to input video editing instructions in natural language. This application provides a means of user input, allowing users to specify advertising concepts and desired editing styles in natural language.

[0311] When a user enters instructions and selects advertising materials, this information is sent to the cloud via the internet. On the cloud server, a program developed using the Python programming language analyzes the received video data. Using software libraries such as TensorFlow and OpenCV, the program analyzes the video data and identifies important scenes. Furthermore, it utilizes the Hugging Face Transformers library to analyze the user's natural language input and determine the appropriate editing task.

[0312] The server designs an editing plan based on the user's advertising concept. Following this plan, editing is performed automatically, including cutting, splicing, and applying visual effects to the video. Additionally, a generative AI model is used to automatically generate appropriate music and synchronize it with the edited video. A deep learning model is employed for this music generation.

[0313] The edited video is compressed on a cloud server and then sent again to the user's mobile device via the internet. The user can view this completed video file at any time and easily share it as needed. For example, by entering a prompt such as "Create a friendly, trendy cosmetics advertisement video for women in their 20s," a video based on a specific editing plan will be generated.

[0314] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0315] Step 1:

[0316] The user's device receives video editing instructions in natural language via the application. The input instructions and selected advertising materials are sent to a cloud server via the internet. At this stage, the input consists of the user's text instructions and video data, and sending this to the server initiates the editing process.

[0317] Step 2:

[0318] The server analyzes the received video data using the OpenCV library to identify important scenes and target objects. This analysis involves image processing and feature extraction for each frame to extract the target scenes. As a result, scene information suitable for advertising is obtained.

[0319] Step 3:

[0320] The server uses the Hugging Face Transformers library to analyze the user's natural language instructions. Through this analysis, it interprets the user's request from the input prompt and determines what editing task should be performed. The output of this analysis is a specific editing plan.

[0321] Step 4:

[0322] The server designs an editing plan based on the advertising concept. Here, it combines scene information obtained in the previous stage with user requests to plan specific editing techniques. This plan includes the sequence of scenes, transitions, visual effects, and so on.

[0323] Step 5:

[0324] The server automatically generates appropriate music using a generative AI model and synchronizes it with the edited video. It utilizes deep learning technology to generate music that matches the video's theme. The generated music is positioned to align with the flow of the video.

[0325] Step 6:

[0326] The server automatically cuts, splices, and applies effects to the video based on the editing plan. Specific editing tasks are performed, and color tones and text information are inserted as needed. As a result of this process, the final advertisement video is output.
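The insertion of text information in Step 6 could, for example, be driven by a small overlay schedule such as the one sketched below. The layout (a product caption at the opening and a tagline on an end card), the default end-card length, and all field names are assumptions made for this illustration; the disclosure does not prescribe where or how long advertising text is shown.

```python
from typing import Dict, List


def plan_ad_overlays(duration_sec: float, product: str, tagline: str,
                     end_card_sec: float = 3.0) -> List[Dict[str, object]]:
    """Schedule a product caption at the start and a tagline end card."""
    if duration_sec <= 0:
        raise ValueError("duration must be positive")
    # Never let either overlay occupy more than half of a short video.
    card = min(end_card_sec, duration_sec / 2)
    return [
        {"text": product, "start_sec": 0.0, "end_sec": card, "position": "bottom"},
        {"text": tagline, "start_sec": duration_sec - card, "end_sec": duration_sec,
         "position": "center"},
    ]
```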

[0327] Step 7:

[0328] The completed video is compressed on a cloud server and sent to the user's mobile device. The compressed video is provided in a format suitable for the user to download and review. As a result, users can easily preview the advertising video and further share or publish it.

[0329] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0330] This invention provides a system that enables video editing that reflects the user's emotions. Using a smartphone app, users can explicitly input their emotional state when entering videos and editing instructions. Furthermore, the app utilizes the camera and microphone to automatically recognize the user's facial expressions and voice tone through an emotion engine.

[0331] The device sends the user's selected video along with instructions and emotional information to the cloud server. The server analyzes the received video and, based on the emotional information from the emotion engine, designs an editing task that matches the user's intentions. This design includes cutting the video, applying slow motion and effects, and adjusting color filters.

[0332] The server uses natural language processing to understand the user's editing instructions and emotional information, and determines how the video should be edited. The emotion engine provides music and effects that match the detected emotions. For example, if the user wants an "emotional" and "calming" video, slow-motion effects and melodic background music will be selected.

[0333] The edited video automatically includes text information that matches the user's emotions. This ensures that the entire video expresses the user's feelings. The server compresses the finished video and sends it to the user's device. The user can then review the video and share it in their preferred way.

[0334] Thus, with this invention incorporating an emotion engine, users can easily create and share videos that reflect their emotions in real time, even without technical skills.

[0335] The following describes the processing flow.

[0336] Step 1:

[0337] The user launches the smartphone app and selects the video they want to edit. Furthermore, they can use the app's features to input their own emotions or have emotional information automatically collected using the camera and microphone.

[0338] Step 2:

[0339] The device sends the video file, user editing instructions, and recognized emotion data as a single package to the cloud server. The transmission is done via the internet and is encrypted to ensure security.

[0340] Step 3:

[0341] The server analyzes the received video data and emotional information. It identifies scene transitions in the video and unique characteristics in the audio, and analyzes the emotional state obtained by the emotion engine.

[0342] Step 4:

[0343] The server uses natural language processing technology to generate the optimal editing task based on the user's editing instructions and emotional information. The AI interprets emotional instructions such as "emotional" or "fun" and constructs an editing plan.

[0344] Step 5:

[0345] The server then develops an editing plan based on the emotion engine. It automatically selects necessary effects and filters, inserts cutscenes, and chooses and applies music. The editing plan includes restructuring the timeline and applying effects.

[0346] Step 6:

[0347] The server automatically adds AI-generated music to match the video content. It adjusts the tempo and style of the music to match the emotional information, harmonizing it with the overall video.

[0348] Step 7:

[0349] The server automatically inserts emotionally relevant text into the edited video and compresses it into the specified format.

[0350] Step 8:

[0351] The edited video is sent from the server to the user's device. The device saves it to local storage, and the user can view the video through the app and easily share it on platforms like Facebook and Instagram as they like.

[0352] (Example 2)

[0353] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the smart glasses 214 will be referred to as the "terminal".

[0354] Traditional video editing tools often struggle to effectively reflect the emotions intended by the user in the video, and frequently require specialized editing knowledge. Furthermore, the limited means of reflecting user emotions in real time make it challenging to create content that meets user expectations.

[0355] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0356] In this invention, the server includes a user input means, a means for transmitting video and emotional information through a terminal, and a means for analyzing the received video. This enables automatic video editing that reflects the user's emotions, making it possible for even users without specialized knowledge to create videos that effectively express emotions.

[0357] "User input means" refers to technology that provides an interface for users to select videos they want to edit, give editing instructions, and input emotional information.

[0358] A "terminal" is a portable electronic device that has the function of receiving user input and transmitting and receiving data.

[0359] "Moving images" are visual data that present motion by playing a series of still images at a constant speed.

[0360] "Emotional information" refers to data that indicates the user's psychological state and is analyzed by the emotion engine.

[0361] A "server" is a remote computer system used for analyzing, editing, and processing data.

[0362] "Natural language" refers to the language that humans use on a daily basis and is in a form that can be analyzed by programs.

[0363] An "editing plan" is a detailed design of how to edit video footage, generated based on user instructions and emotional information.

[0364] "Emotional data" refers to data that expresses a user's emotional state using numerical values or categories.

[0365] A "generative AI model" is an artificial intelligence model trained to generate text or media based on specified prompts.

[0366] This invention is a system for reflecting user-intended emotions in moving images, and it achieves this through collaboration between the user, the terminal, and the server.

[0367] Users select videos and input editing instructions and emotional information using an application on a device such as a smartphone. Users can manually select their own emotions, and the device automatically captures the user's facial expressions and voice tone using its camera and microphone. This data is analyzed by the device's emotion engine. For example, if the camera detects the user's smile, the emotion "joy" is recorded.

[0368] The device sends collected video footage, editing instructions, and emotional information to a server in the cloud. A key aspect of this system is that communication is conducted using a secure protocol. Based on the received data, the server analyzes the video footage using a generative AI model and understands the user's intent through natural language processing. It automatically generates an editing plan and performs appropriate editing to produce video footage that matches the user's emotions.

[0369] The server applies music and visual effects based on emotional data to the video during editing. For example, if the user enters "calm mood," the server will select slow motion and gentle music. Text matching the user's entered emotion is automatically inserted into the edited video. The completed edited video is compressed and sent to the user's device.

[0370] In this system, a generative AI model plays a central role in generating natural and emotionally rich content. A concrete example of a prompt message is, "Add a cheerful atmosphere to a birthday party video. Emotion: Happiness, blessings." This invention makes it possible for users to easily create and share high-quality videos that express their emotions, even without advanced technical skills.

[0371] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0372] Step 1:

[0373] The user launches the smartphone app and selects the video or image they want to edit. Next, the user either manually enters their emotions or allows the device to use its camera and microphone to automatically capture emotion information. This saves the user's input as video and emotion data on the device.

[0374] Step 2:

[0375] The device uses an emotion engine to analyze the acquired emotional information. Facial expressions captured by the camera and voice tones captured by the microphone are quantified as emotional data. This analysis identifies emotions such as "joy" and "sadness."
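As a rough sketch of how the quantified facial and vocal data in Step 2 might be combined, the code below blends per-emotion scores from the two sources with a fixed weight and picks the dominant label. The weighting scheme, the default weight of 0.6 for the face, and the function names are assumptions for illustration; the disclosure does not describe the internal workings of the emotion engine.

```python
from typing import Dict


def fuse_emotions(face: Dict[str, float], voice: Dict[str, float],
                  face_weight: float = 0.6) -> Dict[str, float]:
    """Blend per-emotion scores from face and voice into one normalized set."""
    labels = set(face) | set(voice)
    blended = {
        label: face_weight * face.get(label, 0.0)
        + (1 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    total = sum(blended.values())
    if total == 0:
        return {label: 0.0 for label in labels}
    # Normalize so that the scores sum to 1.
    return {label: score / total for label, score in blended.items()}


def dominant_emotion(scores: Dict[str, float]) -> str:
    """Return the label with the highest score, or 'neutral' if none stands out."""
    if not scores or max(scores.values()) == 0:
        return "neutral"
    return max(scores, key=scores.get)
```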

[0376] Step 3:

[0377] The device sends the video selected by the user and the analyzed emotion information to a cloud server. The input data is encrypted and transmitted securely. The output is the data transferred to the server, which is waiting to receive it.

[0378] Step 4:

[0379] The server uses natural language processing to analyze the received video and emotional information. This analysis interprets the user's editing instructions and generates an editing plan on how to edit the video. For example, if the instruction is "calm mood," the application of slow motion and gentle music will be considered.

[0380] Step 5:

[0381] The server uses a generative AI model to apply appropriate visual effects and music to the video. Based on the input emotion data, the video effects and soundtrack are automatically selected. For example, if the emotion "joy" is input, bright music and a vibrant color filter will be applied.
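The selection in Step 5 can be pictured as a lookup from an emotion label to a set of effects. In the sketch below, the "joy" entry follows the example given above (bright music and a vibrant color filter) and the "calm" entry follows the "calm mood" example from Step 4 (slow motion and gentle music); the numeric playback speeds, the default preset, and all identifiers are hypothetical. A generative AI model, as described in the text, would make this choice in a more flexible way than a fixed table.

```python
from typing import Dict, Union

Preset = Dict[str, Union[str, float]]

# Illustrative presets; the playback speeds are assumed values.
EMOTION_PRESETS: Dict[str, Preset] = {
    "joy": {"music": "bright", "color_filter": "vibrant", "playback_speed": 1.0},
    "calm": {"music": "gentle", "color_filter": "none", "playback_speed": 0.5},
}
DEFAULT_PRESET: Preset = {"music": "neutral", "color_filter": "none", "playback_speed": 1.0}


def select_effects(emotion: str) -> Preset:
    """Return a copy of the effect preset for an emotion label."""
    key = emotion.strip().lower()
    # Copy the preset so callers cannot modify the shared table.
    return dict(EMOTION_PRESETS.get(key, DEFAULT_PRESET))
```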

[0382] Step 6:

[0383] The server automatically inserts text information that matches the user's emotions into the edited video. For example, text such as "I love this moment" might be selected. The output is a completed video that reflects the emotions.

[0384] Step 7:

[0385] The server compresses the edited video and efficiently transfers it to the user's device. This allows the user to instantly play the edited video on their device. As output, high-quality video is provided and saved on the user's device.

[0386] Step 8:

[0387] Users view the edited videos they receive and evaluate and revise them as needed. If users are satisfied with the result, they can share the video through social media or messaging apps. Ultimately, they can share emotionally expressive videos with others.

[0388] (Application Example 2)

[0389] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the smart glasses 214 will be referred to as the "terminal."

[0390] There is a growing demand for automated video content that more richly expresses people's emotions during family activities and events. However, current systems make it difficult to edit content based on emotions, and this is especially challenging for users without technical knowledge.

[0391] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0392] In this invention, the server includes means for user input, means for transmitting data via a mobile terminal, means for analyzing received video data, means for analyzing natural language to determine an editing task, means for designing an editing plan, means for editing the video based on the editing plan, means for synchronizing generated music with the video, means for transmitting the edited video to the mobile terminal, means for identifying emotional states, means for adding relevant information to the automatically corrected video, and means for recording activities at home and performing editing that is appropriate to the situation. This enables users to perform emotion-based video editing without requiring technical skills.

[0393] "Means for user input" refers to an interface that allows users to provide instructions or information to the system.

[0394] "Means of transmitting data via mobile terminals" refers to the function of sending data to a server using mobile devices such as smartphones and tablets.

[0395] "Means for analyzing received video data" refers to the process of analyzing the transmitted video data using a computer program and understanding its content.

[0396] "A means of analyzing natural language to determine editing tasks" refers to a process that analyzes instructions provided by users in natural language and converts them into specific editing tasks.

[0397] "Means for designing editing plans" refers to the function of planning the editing policies and details of a video based on the analyzed data.

[0398] "Means for editing the video based on the editing plan" refers to the process of cutting the video and applying effects to it according to the planned policy.

[0399] "A means of synchronizing generated music with video" refers to a function that combines music with edited video at the appropriate timing.

[0400] "Means of transmitting edited video to a mobile device" refers to the process of transferring the edited video back to the user's mobile device.

[0401] "Means for identifying emotional states" refers to analytical functions using machine learning or algorithms that identify emotions from the user's speech and facial expressions.

[0402] "Means for adding relevant information to the automatically corrected video" refers to the process of adding emotionally and contextually relevant information or text to the edited video.

[0403] "A means of recording activities at home and editing them to suit the situation" refers to a function that photographs the daily life and events of a family and edits them into the most suitable format according to the situation.

[0404] To implement this invention, first, a consumer robot equipped with a user-friendly interface for home use is prepared. This robot is equipped with a camera and microphone, which can capture the user's facial expressions and voice in real time. Without the user explicitly inputting their emotional state, the robot uses an emotion engine that automatically analyzes the collected facial data and voice tone. This engine incorporates a machine learning model that can identify the user's emotions with high accuracy.

[0405] The received video data is analyzed by a processing unit within the robot or by a server in the cloud. During this process, a natural language processing engine is utilized to appropriately understand editing requests for the video. For example, if an emotionally moving scene needs to be emphasized, editing tasks such as inserting slow-motion effects or melodic music are automatically designed. This results in video content that reflects the user's emotions.

[0406] Once editing is complete, an AI model automatically inserts emotionally relevant text into the video. For example, text such as "This moment is priceless" might be added to emotionally moving scenes. The server compresses the final edited video and sends it to the user's device. Users can then review the finished video and share it with family and friends in an emotionally resonant way.

[0407] For example, a video might be automatically generated showing a family dinner scene with soft, calming music playing in the background, highlighting moments of smiles in slow motion. In this case, an example of a prompt message might be, "Please suggest an editing style that emphasizes the happy atmosphere of dinnertime."

[0408] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0409] Step 1:

[0410] The user simply inputs their desired editing style based on videos captured by the home robot. Even without explicit input from the user, the robot uses its camera and microphone to capture the user's facial expressions and voice tone, sending this as emotion data to its internal processor. The input data consists of the user's facial images and voice, which serve as the raw materials for emotion recognition.

[0411] Step 2:

[0412] The device's emotion engine analyzes collected facial expression data and voice tone to identify the user's emotional state. Here, machine learning algorithms are used to process the data and extract emotions such as "joy" and "surprise" as the output of emotion recognition.

[0413] Step 3:

[0414] The server designs an appropriate editing task based on the user's emotional state and the video content. In this process, the server uses a natural language processing engine to analyze the user's editing instructions and determine how the video should be edited. The input is the emotional state and the user's instruction text, and the output is a specific editing plan.

[0415] Step 4:

[0416] The server edits the video based on the designed editing plan. It performs edits such as cutting, applying slow-motion effects and filters, and uses a generative AI model to select music and effects that match the emotions. The input to this process is the editing plan, and the output is the edited video.

[0417] Step 5:

[0418] The server uses an emotion-based generative AI model to automatically insert appropriate text information into the video. Here, by embedding text that expresses emotions into the edited video, a visually rich and emotionally resonant output is obtained.

[0419] Step 6:

[0420] The server compresses the edited video data and sends it back to the original mobile device. The final output is a video file processed into a format that users can easily play at home. This allows users to review and share videos that express the intended emotions.

[0421] The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0422] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions given in the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.
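The pairing of a prompt containing instructions with inference data, as described for the data generation model 58, can be illustrated by a simple request structure. The sketch below is an assumption-laden example: the class names `InferenceItem` and `ModelRequest`, the text rendering, and the idea of passing audio or image data by reference are all hypothetical, and no particular model API is implied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InferenceItem:
    kind: str     # "text", "audio", or "image"
    content: str  # the text itself, or a reference to stored audio/image data


@dataclass
class ModelRequest:
    instruction: str
    items: List[InferenceItem] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the instruction and the inference data as a plain-text prompt."""
        lines = [f"Instruction: {self.instruction}"]
        for i, item in enumerate(self.items, start=1):
            lines.append(f"[{i}] {item.kind}: {item.content}")
        return "\n".join(lines)
```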

[0423] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the smart glasses 214.

[0424] [Third Embodiment]

[0425] Figure 5 shows an example of the configuration of the data processing system 310 according to the third embodiment.

[0426] As shown in Figure 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

[0427] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0428] The headset terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a display 343. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and display 343 are also connected to the bus 52.

[0429] The microphone 238 receives voice signals from the user 20 and receives instructions from the user 20. The microphone 238 captures the voice signals from the user 20, converts the captured voice into audio data, and outputs it to the processor 46. The speaker 240 outputs audio according to the instructions from the processor 46.

[0430] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0431] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0432] Figure 6 shows an example of the main functions of the data processing device 12 and the headset terminal 314. As shown in Figure 6, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0433] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0434] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0435] In the headset terminal 314, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0436] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the headset terminal 314 will be referred to as the "terminal".

[0437] This invention provides a system that allows users to easily edit high-quality videos. Users input video editing instructions using a dedicated smartphone app. The app features an intuitive UI, allowing users to input instructions in natural language and select the necessary video files for editing.

[0438] When a user enters instructions and selects a video file, this information is sent to a cloud server via the internet. The server uses advanced AI and deep learning technologies to analyze the received video data. This analysis identifies emotional elements, important scenes, and dialogue sections within the audio track.

[0439] The server understands the natural language instructions it receives and determines what kind of editing should be done. This instruction analysis uses natural language processing techniques. For example, if the instruction is "edit emotionally," it plans music and slow-motion effects to evoke emotion. If the instruction emphasizes comedic elements, it inserts appropriate captions and visual effects.

[0440] Based on the editing plan designed by the server, the actual video editing is performed automatically. This includes cutting and joining footage, applying transition effects, and adjusting color tones as needed. Furthermore, AI-generated background music is created to harmonize with the video's atmosphere. This music is automatically generated to match the video's theme.

[0441] The edited, completed video file is sent directly from the cloud server to the user's smartphone in a compressed format. Users can then review and easily share the saved video. In this way, users can easily create videos that suit their intentions, even without specialized knowledge.

[0442] The following describes the processing flow.

[0443] Step 1:

[0444] The user launches the smartphone app and selects the video they want to edit. They also input instructions regarding the editing direction and style in natural language.

[0445] Step 2:

[0446] The device sends the video file selected by the user and the entered editing instructions as data to the cloud server. This communication takes place over the internet.

[0447] Step 3:

[0448] The server analyzes the received video file and extracts various features of the video and audio. At this stage, it identifies points of change in the scene and important parts of the audio track.

[0449] Step 4:

[0450] The server analyzes the user's editing instructions using natural language processing technology and formulates an editing plan. It interprets the necessary editing effects and video themes from the instructions.

[0451] Step 5:

[0452] The server edits the video based on the established editing plan. This includes cutting video clips, adjusting sequences, and applying appropriate filters and transitions.

[0453] Step 6:

[0454] The server uses AI to automatically generate music suitable for the video and synchronizes it with the finished footage. In this way, it adds background music that is perfectly suited to the video.

[0455] Step 7:

[0456] The server compresses the edited video into the optimal format and sends it to the user's device. This process utilizes data compression techniques to reduce transfer time.

[0457] Step 8:

[0458] The device saves the received, edited video to local storage. Users can then review it and easily share it on social media or other platforms.
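The server-side part of the flow above (Steps 3 to 7) can be pictured as an ordered chain of stages. The following is only a minimal sketch under stated assumptions: every function name, the dictionary keys, and the placeholder bodies are hypothetical and stand in for the analysis, planning, editing, music, and compression processing, none of which is specified at this level of detail in the description.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Stage = Callable[[Context], Context]


def analyze_video(ctx: Context) -> Context:
    # Step 3: placeholder for scene and audio feature extraction.
    ctx["features"] = {"scene_changes": [], "key_audio": []}
    return ctx


def plan_edit(ctx: Context) -> Context:
    # Step 4: placeholder for natural language analysis of the instruction.
    ctx["plan"] = {"instruction": ctx["instruction"], "effects": []}
    return ctx


def edit_video(ctx: Context) -> Context:
    # Step 5: placeholder for cutting, sequencing, filters, and transitions.
    ctx["edited"] = True
    return ctx


def add_music(ctx: Context) -> Context:
    # Step 6: placeholder for music generation and synchronization.
    ctx["music_synced"] = True
    return ctx


def compress(ctx: Context) -> Context:
    # Step 7: placeholder for compression before transfer to the device.
    ctx["compressed"] = True
    return ctx


# Hypothetical stage order mirroring Steps 3 to 7.
SERVER_STAGES: List[Stage] = [analyze_video, plan_edit, edit_video, add_music, compress]


def run_pipeline(ctx: Context, stages: List[Stage]) -> Context:
    """Run each stage in order and record the names of completed stages."""
    completed: List[str] = []
    for stage in stages:
        ctx = stage(ctx)
        completed.append(stage.__name__)
    ctx["completed"] = completed
    return ctx


result = run_pipeline(
    {"video": "clip.mp4", "instruction": "edit emotionally"}, SERVER_STAGES
)
```

Keeping each step as a separate stage makes the order explicit and lets an individual stage be replaced without touching the others.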

[0459] (Example 1)

[0460] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0461] The present invention aims to provide a system that allows users to easily edit high-quality videos without requiring special technical knowledge. Specifically, it aims to reduce the complexity and time required for video editing by reflecting intuitive natural-language editing instructions in the video and by automatically emphasizing emotional elements and inserting text information.

[0462] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0463] In this invention, the server includes input means for the user to input instructions, communication means for transmitting data via a terminal device, and analysis means for analyzing the received multimedia data. This allows the user to easily input video editing instructions in natural language, and those instructions to be immediately reflected in the video editing.

[0464] "Input means" refers to a device or interface that allows users to input video editing instructions in natural language.

[0465] "Communication means" refers to the technology and mechanisms for transmitting data from a user's terminal device to a server.

[0466] "Analysis means" refers to the operation of analyzing received multimedia content using a computer process and extracting important elements and metadata from the video.

[0467] "Processing means" refers to the process of analyzing the input natural language instructions and determining how to proceed with video editing.

[0468] "Design means" refers to the process of creating a detailed video editing plan based on the analyzed instructions.

[0469] "Editing means" refers to operations that execute the designed editing plan, such as cutting videos, applying effects, and inserting music.

[0470] "Synchronization means" refers to the process of appropriately matching and integrating generated music and sound effects with the video.

[0471] "Transmission means" refers to the technology used to send the completed, edited video back to the user's terminal device.

[0472] This invention provides a system that allows users to easily edit high-quality videos using a specific program. Users first utilize a video editing interface via a dedicated smartphone app. This app features an intuitive UI and provides natural language input functionality using speech recognition technology.

[0473] The user selects the video file they want to edit and specifies their desired editing style in natural language. For example, they can enter a simple prompt in the app such as, "Edit this to highlight my fun travel memories." The device then sends this data to a server in the cloud via the internet.

[0474] The server uses analysis algorithms based on deep learning to analyze the received data. Image processing frameworks are used to detect important scenes and emotional elements in the video data, and natural language processing technology is used for instruction analysis. In addition, the audio information of the video is analyzed to construct an appropriate editing plan. Based on this plan, a generative AI model is used to generate and synchronize sound materials and effects suitable for the video.

[0475] As a concrete example, the server performs natural language processing using BERT and analyzes emotional elements using TensorFlow. Once the editing plan is decided, editing software such as Adobe Premiere Pro is used to automatically perform editing processes such as cutting, transitions, and color adjustments.

[0476] Once editing is complete, the server compresses the finished video and sends it directly to the user's device. The user can then review the received video and share it on social media as needed. This system allows users to easily create high-quality videos without requiring any technical expertise.

[0477] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0478] Step 1:

[0479] The user launches the smartphone app and selects the video file they want to edit. During this process, the user can input video editing instructions in natural language. The input includes the video file itself and the editing instructions in natural language. The output is a data package containing both the instructions and the video file.

[0480] Step 2:

[0481] The terminal sends the user's entered editing instructions and selected video files to a server in the cloud. In this communication process, the data package is structured in JSON format. The input is the data package generated in step 1, and the output is the completion of the transmission to the remote server.
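As an illustration of the JSON-structured data package mentioned in Step 2, the following minimal sketch bundles the editing instruction with a reference to the selected video. The field names ("instruction", "video", "filename", "format") are assumptions made for this example; the description states only that the package is in JSON format and does not specify how the video bytes themselves are attached.

```python
import json
from pathlib import Path


def build_package(video_path: str, instruction: str) -> str:
    """Serialize the instruction and a reference to the video as JSON.

    All field names are hypothetical; only the use of JSON is taken
    from the description.
    """
    path = Path(video_path)
    package = {
        "instruction": instruction,
        "video": {
            "filename": path.name,
            "format": path.suffix.lstrip(".").lower(),
        },
    }
    return json.dumps(package, ensure_ascii=False)


package_json = build_package(
    "trip/beach_day.MP4", "Edit this to highlight my fun travel memories."
)
```

In this sketch the video content would travel alongside the JSON metadata, for example as a separate upload, which is a design choice of the example rather than something stated in the text.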

[0482] Step 3:

[0483] The server analyzes the received video file. It uses deep learning techniques to perform image processing to identify specific scenes and emotional elements within the video. The input is the video file itself, and the output is the analysis results, including metadata for important scenes and emotion labels. Specific operations include frame analysis and edge detection.
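The frame analysis in Step 3 can be illustrated with a simple frame-differencing check for scene changes. This is a deliberately simplified stand-in, not the deep learning analysis the description refers to: frames are assumed to be small grayscale pixel grids, and the threshold value is an arbitrary assumption.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # rows of grayscale pixel values, 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count if count else 0.0


def detect_scene_changes(frames: Sequence[Frame], threshold: float = 40.0) -> List[int]:
    """Return the indices of frames that differ sharply from their predecessor.

    The threshold is a hypothetical tuning value, not taken from the text.
    """
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


dark = [[10, 12], [11, 10]]
bright = [[200, 210], [205, 198]]
cuts = detect_scene_changes([dark, dark, bright, bright, dark])
```

Here the third and fifth frames (indices 2 and 4) are reported as scene changes, and such indices could serve as part of the scene metadata passed to the later steps.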

[0484] Step 4:

[0485] The server uses natural language processing technology to analyze the user's editing instructions and determine what edits should be made. The input is editing instructions in natural language, and the output is an editing plan. Semantic analysis is performed using models such as BERT to determine an editing policy based on specific requests such as "emotional" or "emphasize comedic elements."
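The mapping from an instruction to an editing policy in Step 4 can be sketched as follows. Note that this example substitutes plain keyword matching for the model-based semantic analysis named in the text, purely so that it stays self-contained; the keyword table and effect names are hypothetical, with the style-to-effect pairs following the examples given earlier in the description.

```python
from typing import Dict, List

# Hypothetical keyword table; a deployed system would use model-based
# semantic analysis (the text names BERT) instead of substring matching.
STYLE_KEYWORDS: Dict[str, List[str]] = {
    "emotional": ["emotional"],
    "comedic": ["comedic", "comedy", "funny"],
}

# Effects per style, following the examples given in the description.
STYLE_EFFECTS: Dict[str, List[str]] = {
    "emotional": ["background_music", "slow_motion"],
    "comedic": ["captions", "visual_effects"],
}


def plan_from_instruction(instruction: str) -> Dict[str, List[str]]:
    """Derive a coarse editing policy from a natural language instruction."""
    text = instruction.lower()
    styles = [
        style
        for style, words in STYLE_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    effects: List[str] = []
    for style in styles:
        for effect in STYLE_EFFECTS[style]:
            if effect not in effects:
                effects.append(effect)
    return {"styles": styles, "effects": effects}


plan = plan_from_instruction("Edit emotionally and emphasize comedic elements")
```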

[0486] Step 5:

[0487] Based on the information obtained in steps 3 and 4, the server creates a specific editing plan. The input is metadata for key scenes and the editing plan, and the output is a detailed set of editing instructions. This process includes selecting appropriate transitions, visual effects, and music files.
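One way to picture the "detailed set of editing instructions" produced in Step 5 is as a list of per-scene entries built from the scene metadata and the chosen effects. The sketch below is an assumption about a possible shape of that data; the class names, fields, and the rule of dropping unimportant scenes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    start: float      # seconds
    end: float        # seconds
    emotion: str      # emotion label from the video analysis
    important: bool   # whether the analysis flagged the scene as important


@dataclass
class EditInstruction:
    start: float
    end: float
    effects: List[str]
    transition_in: str


def build_edit_list(
    scenes: List[Scene], effects: List[str], transition: str = "crossfade"
) -> List[EditInstruction]:
    """Keep important scenes and attach effects and an incoming transition."""
    edits: List[EditInstruction] = []
    for scene in scenes:
        if not scene.important:
            continue  # hypothetical rule: unimportant scenes are cut
        # The first kept scene has no incoming transition.
        transition_in = transition if edits else "none"
        edits.append(
            EditInstruction(scene.start, scene.end, list(effects), transition_in)
        )
    return edits


scenes = [
    Scene(0.0, 4.0, "joy", True),
    Scene(4.0, 9.0, "neutral", False),
    Scene(9.0, 15.0, "joy", True),
]
edit_list = build_edit_list(scenes, ["slow_motion"])
```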

[0488] Step 6:

[0489] The server uses automated software tools like Adobe Premiere Pro to actually process the video according to the editing plan. This involves specific actions such as cutting footage, applying effects, and color correction. The input is a detailed set of editing instructions, and the output is the edited video file.

[0490] Step 7:

[0491] The server automatically generates the main music and sound effects using a generative AI model and synchronizes them with the video. The input is a music theme based on the editing plan, and the output is a music file synchronized with the video. This operation includes music generation and timing adjustment.
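The timing adjustment in Step 7 is not specified further. One possible reading, shown here only as a sketch, is to choose a music tempo whose beats fall as close as possible to the cut points of the edited video. The candidate tempo range and the misalignment measure are assumptions of this example.

```python
from typing import Iterable, Sequence


def beat_misalignment(cut_times: Sequence[float], bpm: int) -> float:
    """Sum of distances (seconds) from each cut to its nearest beat."""
    beat = 60.0 / bpm  # seconds per beat
    return sum(abs(t - round(t / beat) * beat) for t in cut_times)


def choose_tempo(
    cut_times: Sequence[float], candidates: Iterable[int] = range(60, 181)
) -> int:
    """Pick the candidate tempo (beats per minute) that best matches the cuts.

    The 60-180 BPM range is a hypothetical default; ties go to the
    slowest tempo because min() keeps the first minimum it finds.
    """
    return min(candidates, key=lambda bpm: beat_misalignment(cut_times, bpm))


tempo = choose_tempo([1.5, 3.0, 4.5])
```

For cuts at 1.5, 3.0, and 4.5 seconds, a beat length of 0.75 seconds (80 BPM) places a beat exactly on every cut, so that tempo is selected.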

[0492] Step 8:

[0493] The edited video is compressed and sent from the server to the user's device. The input is the edited video file, and the output is the state after delivery to the user's device is complete. After the device receives this video, it notifies the user that editing is complete via a notification.
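The compression in Step 8 is described only in general terms. As a worked illustration of the arithmetic involved, the sketch below derives the video bitrate needed to hit a target file size. The target size, the audio bitrate default, and the function name are assumptions of this example, not values from the description.

```python
def target_video_bitrate_kbps(
    target_size_mb: float, duration_s: float, audio_kbps: float = 128.0
) -> float:
    """Video bitrate (kbit/s) that fits a clip into a target file size.

    Uses 1 MB = 8000 kilobits; container overhead is ignored.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    total_kbps = target_size_mb * 8000.0 / duration_s
    video_kbps = total_kbps - audio_kbps
    if video_kbps <= 0:
        raise ValueError("target size is too small for the audio track alone")
    return video_kbps


# A 60-second clip limited to 30 MB with a 128 kbit/s audio track.
bitrate = target_video_bitrate_kbps(target_size_mb=30.0, duration_s=60.0)
```

In this example the total budget is 4000 kbit/s, leaving 3872 kbit/s for the video stream after the audio track is accounted for.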

[0494] Step 9:

[0495] Users can view edited videos on their devices and easily share them on social media, etc. This process involves the user playing the video, selecting privacy settings, and then pressing the share button to complete the sharing process.

[0496] (Application Example 1)

[0497] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0498] Currently, producing advertising videos requires a significant amount of time and specialized knowledge, making it difficult to create high-quality advertising videos quickly and effectively. Therefore, there is a need for a system that allows advertising creators and marketing personnel to easily produce videos that match their respective advertising concepts.

[0499] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0500] In this invention, the server includes means for designing an editing plan based on the advertising concept, means for synchronizing generated music with video, and means for automatically inserting advertising information into the generated video. This makes it possible for users to quickly create high-quality videos suitable for advertising purposes without requiring specialized knowledge.

[0501] "User input means" refers to a device or software that provides an interface for users to input video editing instructions using natural language.

[0502] A "mobile terminal" refers to a computer device such as a smartphone or tablet carried by a user, used for sending and receiving data.

[0503] "Means of transmitting data" refers to a device or software that has the function of transmitting instructions or video data from a user's mobile device to a cloud server via the internet.

[0504] "Means for analyzing received video data" refers to programs that use artificial intelligence and deep learning technologies to analyze video data received on a cloud server and understand its content.

[0505] "A means of analyzing natural language to determine editing tasks" refers to a program that utilizes natural language processing techniques to analyze natural language instructions input by the user and determine what kind of editing should be performed based on those instructions.

[0506] "A means of designing an editing plan based on an advertising concept" refers to a system equipped with algorithms for planning how to edit a video according to the purpose and target audience of the advertisement.

[0507] "Means for editing a video based on the editing plan" refers to a program that automatically performs editing tasks such as cutting, splicing, and adding effects to footage based on a pre-designed editing plan.

[0508] "Means for synchronizing generated music with video" refers to a program that incorporates automatically generated music into the video so that it complements the video's theme and emotions and is appropriately matched to the video.

[0509] "Means for transmitting edited video to a mobile device" refers to a device or software with network communication capabilities that compresses the completed video on a cloud server and transmits it quickly to the user's mobile device.

[0510] A "means for automatically inserting advertising information" refers to a system that has the function of automatically adding information such as text and graphics to generated videos, according to marketing objectives.

[0511] To realize this application, it is necessary to build a system that primarily involves cloud-based servers and the user's mobile device (smartphone or tablet). The mobile device has an application installed that allows the user to input video editing instructions in natural language. This application provides a means of user input, allowing users to specify advertising concepts and desired editing styles in natural language.

[0512] When a user enters instructions and selects advertising materials, this information is sent to the cloud via the internet. On the cloud server, a program developed using the Python programming language analyzes the received video data. Using software libraries such as TensorFlow and OpenCV, the program analyzes the video data and identifies important scenes. Furthermore, it utilizes the Hugging Face Transformers library to analyze the user's natural language input and determine the appropriate editing task.

[0513] The server designs an editing plan based on the user's advertising concept. Following this plan, editing is performed automatically, including cutting, splicing, and applying visual effects to the video. Additionally, a generative AI model is used to automatically generate appropriate music and synchronize it with the edited video. A deep learning model is employed for this music generation.

[0514] The edited video is compressed on a cloud server and then sent again to the user's mobile device via the internet. The user can view this completed video file at any time and easily share it as needed. For example, by entering a prompt such as "Create a friendly, trendy cosmetics advertisement video for women in their 20s," a video based on a specific editing plan will be generated.

[0515] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0516] Step 1:

[0517] The user's device receives video editing instructions in natural language via the application. The input instructions and selected advertising materials are sent to a cloud server via the internet. At this stage, the input consists of the user's text instructions and video data, and sending this to the server initiates the editing process.

[0518] Step 2:

[0519] The server analyzes the received video data using the OpenCV library to identify important scenes and target objects. This analysis involves image processing and feature extraction for each frame to extract the target scenes. As a result, scene information suitable for advertising is obtained.

[0520] Step 3:

[0521] The server uses the Hugging Face Transformers library to analyze the user's natural language instructions. Through this analysis, it interprets the user's request from the input prompt and determines what editing task should be performed. The output of the analysis is a specific editing plan.

[0522] Step 4:

[0523] The server designs an editing plan based on the advertising concept. Here, it combines scene information obtained in the previous stage with user requests to plan specific editing techniques. This plan includes the sequence of scenes, transitions, visual effects, and so on.
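The planning in Step 4 combines scene information with the user's request. A minimal sketch of one such combination is given below, assuming that each scene carries a suitability score from the earlier analysis and that the advertisement has a maximum length. Both the score field and the length limit are assumptions of this example and do not appear in the description.

```python
from typing import Dict, List

Scene = Dict[str, object]


def select_scenes(scenes: List[Scene], max_duration: float) -> List[Scene]:
    """Greedily keep the highest-scoring scenes that fit the length limit,
    then return them in their original chronological order."""
    chosen: List[Scene] = []
    remaining = max_duration
    for scene in sorted(scenes, key=lambda s: s["score"], reverse=True):
        length = scene["end"] - scene["start"]
        if length <= remaining:
            chosen.append(scene)
            remaining -= length
    return sorted(chosen, key=lambda s: s["start"])


scenes = [
    {"name": "A", "start": 0.0, "end": 5.0, "score": 0.9},
    {"name": "B", "start": 5.0, "end": 12.0, "score": 0.4},
    {"name": "C", "start": 12.0, "end": 16.0, "score": 0.8},
    {"name": "D", "start": 16.0, "end": 20.0, "score": 0.6},
]
sequence = [s["name"] for s in select_scenes(scenes, max_duration=12.0)]
```

With a 12-second limit, scenes A and C are kept (9 seconds in total), since neither D nor B fits in the remaining 3 seconds.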

[0524] Step 5:

[0525] The server automatically generates appropriate music using a generative AI model and synchronizes it with the edited video. It utilizes deep learning technology to generate music that matches the video's theme. The generated music is positioned to align with the flow of the video.

[0526] Step 6:

[0527] The server automatically cuts, splices, and applies effects to the video based on the editing plan. Specific editing tasks are performed, color tones are adjusted, and text information is inserted as needed. As a result of this process, the final advertisement video is output.

[0528] Step 7:

[0529] The completed video is compressed on a cloud server and sent to the user's mobile device. The compressed video is provided in a format suitable for the user to download and review. As a result, users can easily preview the advertising video and further share or publish it.

[0530] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the user's emotions.

[0531] This invention provides a system that enables video editing that reflects the user's emotions. Using a smartphone app, users can explicitly input their emotional state when entering videos and editing instructions. Furthermore, the app utilizes the camera and microphone to automatically recognize the user's facial expressions and voice tone through an emotion engine.

[0532] The device sends the user's selected video along with instructions and emotional information to the cloud server. The server analyzes the received video and, based on the emotional information from the emotion engine, designs an editing task that matches the user's intentions. This design includes cutting the video, applying slow motion and effects, and adjusting color filters.

[0533] The server uses natural language processing to understand the user's editing instructions and emotional information, and determines how the video should be edited. The emotion engine provides music and effects that match the detected emotions. For example, if the user wants an "emotional" and "calming" video, slow-motion effects and melodic background music will be selected.

[0534] The edited video automatically includes text information that matches the user's emotions. This ensures that the entire video expresses the user's feelings. The server compresses the finished video and sends it to the user's device. The user can then review the video and share it in their preferred way.

[0535] Thus, with this invention incorporating an emotion engine, users can easily create and share videos that reflect their emotions in real time, even without technical skills.

[0536] The following describes the processing flow.

[0537] Step 1:

[0538] The user launches the smartphone app and selects the video they want to edit. Furthermore, they can use the app's features to input their own emotions or have emotional information automatically collected using the camera and microphone.

[0539] Step 2:

[0540] The device sends the video file, user editing instructions, and recognized emotion data as a single package to the cloud server. The transmission is done via the internet and is encrypted to ensure security.

[0541] Step 3:

[0542] The server analyzes the received video data and emotional information. It identifies scene transitions in the video and unique characteristics in the audio, and analyzes the emotional state obtained by the emotion engine.

[0543] Step 4:

[0544] The server uses natural language processing technology to generate the optimal editing task based on the user's editing instructions and emotional information. The AI interprets emotional instructions such as "emotional" or "fun" and constructs an editing plan.

[0545] Step 5:

[0546] The server then develops an editing plan based on the emotion engine. It automatically selects necessary effects and filters, inserts cutscenes, and chooses and applies music. The editing plan includes restructuring the timeline and applying effects.

[0547] Step 6:

[0548] The server automatically generates music with AI to suit the video content. It adjusts the tempo and style of the music to match the emotional information, harmonizing it with the overall video.

[0549] Step 7:

[0550] The server automatically inserts emotionally relevant text into the edited video and compresses it into the specified format.

[0551] Step 8:

[0552] The edited video is sent from the server to the user's device. The device saves it to local storage, and the user can view the video through the app and easily share it on platforms like Facebook and Instagram as they like.

[0553] (Example 2)

[0554] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0555] Traditional video editing tools often struggle to effectively reflect the emotions intended by the user in the video, and frequently require specialized editing knowledge. Furthermore, the limited means of reflecting user emotions in real time make it challenging to create content that meets user expectations.

[0556] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0557] In this invention, the server includes a user input means, a means for transmitting video and emotional information through a terminal, and a means for analyzing the received video. This enables automatic video editing that reflects the user's emotions, making it possible for even users without specialized knowledge to create videos that effectively express emotions.

[0558] "User input means" refers to technology that provides an interface for users to select videos they want to edit, give editing instructions, and input emotional information.

[0559] A "terminal" is a portable electronic device that has the function of receiving user input and transmitting and receiving data.

[0560] "Moving images" are visual data that present motion by playing a series of still images at a constant rate.

[0561] "Emotional information" refers to data that indicates the user's psychological state and is analyzed by the emotion engine.

[0562] A "server" is a remote computer system used for analyzing, editing, and processing data.

[0563] "Natural language" refers to the language that humans use on a daily basis and is in a form that can be analyzed by programs.

[0564] An "editing plan" is a detailed design of how to edit video footage, generated based on user instructions and emotional information.

[0565] "Emotional data" refers to data that expresses a user's emotional state using numerical values or categories.

[0566] A "generative AI model" is an artificial intelligence model trained to generate text or media based on specified prompts.

[0567] This invention is a system for reflecting user-intended emotions in moving images, and it achieves this through collaboration between the user, the terminal, and the server.

[0568] Users select videos and input editing instructions and emotional information using an application on a device such as a smartphone. Users can manually select their own emotions, or the device can automatically capture the user's facial expressions and voice tone using its camera and microphone. This data is analyzed by the device's emotion engine. For example, if the camera detects the user's smile, the emotion "joy" is recorded.

[0569] The device sends collected video footage, editing instructions, and emotional information to a server in the cloud. A key aspect of this system is that communication is conducted using a secure protocol. Based on the received data, the server analyzes the video footage using a generative AI model and understands the user's intent through natural language processing. It automatically generates an editing plan and performs appropriate editing to produce video footage that matches the user's emotions.

[0570] The server applies music and visual effects based on emotional data to the video during editing. For example, if the user enters "calm mood," the server will select slow motion and gentle music. Text matching the user's entered emotion is automatically inserted into the edited video. The completed edited video is compressed and sent to the user's device.

[0571] In this system, a generative AI model plays a central role in generating natural and emotionally rich content. A concrete example of a prompt message is, "Add a cheerful atmosphere to a birthday party video. Emotion: Happiness, blessings." This invention makes it possible for users to easily create and share high-quality videos that express their emotions, even without advanced technical skills.

[0572] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0573] Step 1:

[0574] The user launches the smartphone app and selects the video or image they want to edit. Next, the user either manually enters their emotions or allows the device to use its camera and microphone to automatically capture emotion information. This saves the user's input as video and emotion data on the device.

[0575] Step 2:

[0576] The device uses an emotion engine to analyze the acquired emotional information. Facial expressions captured by the camera and voice tones captured by the microphone are quantified as emotional data. This analysis identifies emotions such as "joy" and "sadness."
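The quantification in Step 2 can be illustrated by combining per-emotion scores from the facial expression and the voice tone into a single label. The sketch below assumes that both sources already yield scores between 0 and 1 and that they are combined by a weighted average; the weight of 0.6 for the face is an arbitrary assumption, and the description does not state how the emotion engine combines its inputs.

```python
from typing import Dict, Tuple


def fuse_emotions(
    face: Dict[str, float], voice: Dict[str, float], face_weight: float = 0.6
) -> Tuple[str, Dict[str, float]]:
    """Weighted average of face and voice scores; returns (label, scores).

    A label missing from one source counts as 0.0 for that source.
    """
    labels = set(face) | set(voice)
    fused = {
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    # Sorting first makes the choice deterministic when scores tie.
    best = max(sorted(fused), key=lambda label: fused[label])
    return best, fused


label, scores = fuse_emotions(
    face={"joy": 0.8, "sadness": 0.1},
    voice={"joy": 0.6, "sadness": 0.3},
)
```

In this example "joy" receives a fused score of about 0.72 against about 0.18 for "sadness", so "joy" is the recorded emotion.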

[0577] Step 3:

[0578] The device sends the video selected by the user and the analyzed emotion information to a cloud server. The data is encrypted and transmitted securely. As output, the data is delivered to the server, which is waiting to receive it.

[0579] Step 4:

[0580] The server uses natural language processing to analyze the received video and emotional information. This analysis interprets the user's editing instructions and generates an editing plan on how to edit the video. For example, if the instruction is "calm mood," the application of slow motion and gentle music will be considered.

[0581] Step 5:

[0582] The server uses a generative AI model to apply appropriate visual effects and music to the video. Based on the input emotion data, the video effects and soundtrack are automatically selected. For example, if the emotion "joy" is input, bright music and a vibrant color filter will be applied.

[0583] Step 6:

[0584] The server automatically inserts text information that matches the user's emotions into the edited video. For example, text such as "I love this moment" might be selected. The output is a completed video that reflects the emotions.
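The selection of effects, music, and text in Steps 5 and 6 can be sketched as a lookup from the emotion label to a preset. The table below reuses the examples from the description ("joy" with bright music and a vibrant color filter, a calm mood with slow motion and gentle music), while the key names, the default preset, and the pairing of the caption "I love this moment" with "joy" are assumptions of this example. A generative model, as described in the text, would replace this fixed table.

```python
from typing import Dict

# Hypothetical fallback used when an emotion has no preset.
DEFAULT_PRESET: Dict[str, str] = {
    "music": "neutral",
    "effect": "none",
    "color_filter": "none",
    "caption": "",
}

# Presets follow the examples in the description; key names are assumptions.
EMOTION_PRESETS: Dict[str, Dict[str, str]] = {
    "joy": {
        "music": "bright",
        "color_filter": "vibrant",
        "caption": "I love this moment",
    },
    "calm": {"music": "gentle", "effect": "slow_motion"},
}


def preset_for(emotion: str) -> Dict[str, str]:
    """Return the editing preset for an emotion, filling gaps with defaults."""
    preset = dict(DEFAULT_PRESET)
    preset.update(EMOTION_PRESETS.get(emotion.strip().lower(), {}))
    return preset


joy_preset = preset_for("Joy")
calm_preset = preset_for("calm")
```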

[0585] Step 7:

[0586] The server compresses the edited video and efficiently transfers it to the user's device. This allows the user to instantly play the edited video on their device. As output, high-quality video is provided and saved on the user's device.

[0587] Step 8:

[0588] Users view the edited videos they receive and evaluate and revise them as needed. If users are satisfied with the result, they can share the video through social media or messaging apps. Ultimately, they can share emotionally expressive videos with others.

[0589] (Application Example 2)

[0590] Next, we will explain Application Example 2. In the following explanation, the data processing device 12 will be referred to as the "server," and the headset-type terminal 314 will be referred to as the "terminal."

[0591] There is a growing demand for automated video content that more richly expresses people's emotions during family activities and events. However, current systems make it difficult to edit content based on emotions, and this is especially challenging for users without technical knowledge.

[0592] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0593] In this invention, the server includes means for user input, means for transmitting data via a mobile terminal, means for analyzing received video data, means for analyzing natural language to determine an editing task, means for designing an editing plan, means for editing the video based on the editing plan, means for synchronizing generated music with the video, means for transmitting the edited video to the mobile terminal, means for identifying emotional states, means for adding relevant information to the automatically corrected video, and means for recording activities at home and performing editing that is appropriate to the situation. This enables users to perform emotion-based video editing without requiring technical skills.

[0594] "User input means" refers to an interface that allows users to provide instructions or information to the system.

[0595] "Means of transmitting data via mobile terminals" refers to the function of sending data to a server using mobile devices such as smartphones and tablets.

[0596] "Means for analyzing received video data" refers to the process of analyzing the transmitted video data using a computer program and understanding its content.

[0597] "A means of analyzing natural language to determine editing tasks" refers to a process that analyzes instructions provided by users in natural language and converts them into specific editing tasks.

[0598] "Means for designing editing plans" refers to the function of planning the editing policies and details of a video based on the analyzed data.

[0599] "Means for editing a video based on an editing plan" refers to the process of cutting and applying effects to a video according to a planned policy.

[0600] "A means of synchronizing generated music with video" refers to a function that combines music with edited video at the appropriate timing.

[0601] "Means of transmitting edited video to a mobile device" refers to the process of transferring the edited video back to the user's mobile device.

[0602] "Means for identifying emotional states" refers to analytical functions using machine learning or algorithms that identify emotions from the user's speech and facial expressions.

[0603] "Means for adding relevant information to automatically corrected footage" refers to the process of adding emotionally and contextually relevant information or text to edited footage.

[0604] "A means of recording activities at home and editing them to suit the situation" refers to a function that photographs the daily life and events of a family and edits them into the most suitable format according to the situation.

[0605] To implement this invention, first, a consumer robot equipped with a user-friendly interface for home use is prepared. This robot is equipped with a camera and microphone, which can capture the user's facial expressions and voice in real time. Without the user explicitly inputting their emotional state, the robot uses an emotion engine that automatically analyzes the collected facial data and voice tone. This engine incorporates a machine learning model that can identify the user's emotions with high accuracy.
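As a rough illustration of how facial-expression and voice-tone analysis might be combined, the following Python sketch fuses two per-emotion score tables by weighted average and picks the highest-scoring label. The function name `fuse_emotions`, the weighting, and the score values are hypothetical; the disclosure does not specify how the emotion engine combines its inputs, and a real engine would obtain the scores from trained models.

```python
from typing import Dict


def fuse_emotions(face: Dict[str, float], voice: Dict[str, float],
                  face_weight: float = 0.6) -> str:
    """Return the emotion label with the highest weighted score.

    Assumption: both inputs map emotion labels to scores in [0, 1].
    A label missing from one source is treated as a score of 0.
    """
    labels = set(face) | set(voice)
    if not labels:
        raise ValueError("no emotion scores supplied")
    fused = {
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }
    # Iterate over sorted labels so ties are broken deterministically.
    return max(sorted(fused), key=fused.get)


# Hypothetical scores, standing in for the outputs of face and voice models.
face_scores = {"joy": 0.7, "surprise": 0.2}
voice_scores = {"joy": 0.5, "surprise": 0.4}
print(fuse_emotions(face_scores, voice_scores))
```

A weighted average is only one possible fusion rule; it is used here because it keeps the example small and makes the influence of each modality explicit.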

[0606] The received video data is analyzed by a processing unit within the robot or by a server in the cloud. During this process, a natural language processing engine is utilized to appropriately understand editing requests for the video. For example, if an emotionally moving scene needs to be emphasized, editing tasks such as inserting slow-motion effects or melodic music are automatically designed. This results in video content that reflects the user's emotions.

[0607] Once editing is complete, an AI model automatically inserts emotionally relevant text into the video. For example, text such as "This moment is priceless" might be added to emotionally moving scenes. The server compresses the final edited video and sends it to the user's device. Users can then review the finished video and share it with family and friends in an emotionally resonant way.

[0608] For example, a video might be automatically generated showing a family dinner scene with soft, calming music playing in the background, highlighting moments of smiles in slow motion. In this case, an example of a prompt message might be, "Please suggest an editing style that emphasizes the happy atmosphere of dinnertime."

[0609] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0610] Step 1:

[0611] The user simply inputs their desired editing style based on videos captured by the home robot. Even without explicit input from the user, the robot uses its camera and microphone to capture the user's facial expressions and voice tone, sending this as emotion data to its internal processor. The input data consists of the user's facial images and voice, which serve as the raw materials for emotion recognition.

[0612] Step 2:

[0613] The device's emotion engine analyzes collected facial expression data and voice tone to identify the user's emotional state. Here, machine learning algorithms are used to process the data and extract emotions such as "joy" and "surprise" as the output of emotion recognition.

[0614] Step 3:

[0615] The server designs an appropriate editing task based on the user's emotional state and the video content. In this process, the server uses a natural language processing engine to analyze the user's editing instructions and determine how the video should be edited. The input is the emotional state and the user's instruction text, and the output is a specific editing plan.

[0616] Step 4:

[0617] The server edits the video based on the designed editing plan. It performs edits such as cutting, applying slow-motion effects and filters, and uses a generative AI model to select music and effects that match the emotions. The input to this process is the editing plan, and the output is the edited video.

[0618] Step 5:

[0619] The server uses an emotion-based generative AI model to automatically insert appropriate text information into the video. Here, by embedding text that expresses emotions into the edited video, a visually rich and emotionally resonant output is obtained.

[0620] Step 6:

[0621] The server compresses the edited video data and sends it back to the original mobile device. The final output is a video file processed into a format that users can easily play at home. This allows users to review and share videos that express the intended emotions.

[0622] The specific processing unit 290 transmits the result of the specific processing to the headset terminal 314. In the headset terminal 314, the control unit 46A causes the speaker 240 and display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0623] The data generation model 58 is a so-called generative AI (Artificial Intelligence). Examples of the data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. The data generation model 58 receives a prompt containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images. The data generation model 58 performs inference on the input inference data according to the instructions given in the prompt, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0624] In the above embodiment, an example was given in which specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and specific processing may also be performed by the headset terminal 314.

[0625] [Fourth Embodiment]

[0626] Figure 7 shows an example of the configuration of the data processing system 410 according to the fourth embodiment.

[0627] As shown in Figure 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

[0628] The data processing device 12 comprises a computer 22, a database 24, and a communication interface 26. The computer 22 is an example of a "computer" related to the technology of this disclosure. The computer 22 comprises a processor 28, RAM 30, and storage 32. The processor 28, RAM 30, and storage 32 are connected to a bus 34. The database 24 and the communication interface 26 are also connected to the bus 34. The communication interface 26 is connected to a network 54. An example of the network 54 is a WAN (Wide Area Network) and/or a LAN (Local Area Network).

[0629] The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication interface 44, and a controlled object 443. The computer 36 includes a processor 46, RAM 48, and storage 50. The processor 46, RAM 48, and storage 50 are connected to a bus 52. The microphone 238, speaker 240, camera 42, and controlled object 443 are also connected to the bus 52.

[0630] The microphone 238 receives instructions from the user 20 by capturing the voice uttered by the user 20. The microphone 238 converts the captured voice into audio data and outputs the audio data to the processor 46. The speaker 240 outputs audio according to instructions from the processor 46.

[0631] Camera 42 is a small digital camera equipped with an optical system including a lens, aperture, and shutter, and an image sensor such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor, and captures images of the area around the user 20 (for example, an imaging range defined by a field of view equivalent to the width of a typical healthy person's field of vision).

[0632] Communication interface 44 is connected to network 54. Communication interfaces 44 and 26 are responsible for the exchange of various information between processor 46 and processor 28 via network 54. The exchange of various information between processor 46 and processor 28 using communication interfaces 44 and 26 is performed in a secure manner.

[0633] The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and feet. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and feet. Some of the robot 414's emotions can be expressed by controlling these motors. Furthermore, the robot 414's facial expressions can also be expressed by controlling the illumination state of the LEDs in its eyes.

[0634] Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, the data processing device 12 performs specific processing using the processor 28. The storage 32 stores the specific processing program 56.

[0635] The specific processing program 56 is an example of a "program" relating to the technology of this disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes the read specific processing program 56 on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

[0636] The storage 32 stores the data generation model 58 and the emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

[0637] In robot 414, the processor 46 performs the reception output processing. The storage 50 stores the reception output program 60. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A according to the reception output program 60 executed on the RAM 48.

[0638] Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 will be described. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0639] This invention provides a system that allows users to easily edit high-quality videos. Users input video editing instructions using a dedicated smartphone app. The app features an intuitive UI, allowing users to input instructions in natural language and select the necessary video files for editing.

[0640] When a user enters instructions and selects a video file, this information is sent to a cloud server via the internet. The server uses advanced AI and deep learning technologies to analyze the received video data. This analysis identifies emotional elements, important scenes, and dialogue sections within the audio track.
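To make the scene-analysis step more concrete, the sketch below flags scene changes by thresholding the mean absolute pixel difference between consecutive frames. This is a deliberately simple stand-in technique, not the deep-learning analysis the disclosure describes; the function name `detect_scene_changes`, the threshold value, and the toy frame data are all assumptions made for illustration.

```python
from typing import List, Sequence


def detect_scene_changes(frames: Sequence[Sequence[int]],
                         threshold: float = 50.0) -> List[int]:
    """Return indices of frames that differ strongly from the previous frame.

    Each frame is a flat sequence of grayscale values (0-255). A frame is
    reported when its mean absolute difference from the preceding frame
    exceeds `threshold`.
    """
    changes = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        if not cur or len(prev) != len(cur):
            raise ValueError("frames must be non-empty and equally sized")
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            changes.append(i)
    return changes


# Toy "video": two dark frames followed by two bright frames.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],
    [198, 209, 192, 204],
]
print(detect_scene_changes(frames))
```

In this toy input only the jump from the second to the third frame exceeds the threshold, so a single change point is reported.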

[0641] The server understands the natural language instructions it receives and determines what kind of editing should be done. This instruction analysis uses natural language processing techniques. For example, if the instruction is "edit emotionally," it plans music and slow-motion effects to evoke emotion. If the instruction emphasizes comedic elements, it inserts appropriate captions and visual effects.
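The mapping from an instruction to an editing plan can be sketched as follows. This minimal Python example uses hand-written keyword rules as a placeholder for the natural language processing described above; the `EditingPlan` fields, the keyword list, and the effect names are hypothetical and chosen only to mirror the two examples in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EditingPlan:
    effects: List[str] = field(default_factory=list)
    music_style: str = "neutral"
    captions: bool = False


def plan_from_instruction(instruction: str) -> EditingPlan:
    """Derive a plan from keywords (a stand-in for a real NLP model)."""
    text = instruction.lower()
    plan = EditingPlan()
    if "emotional" in text:  # also matches "emotionally"
        plan.effects.append("slow_motion")
        plan.music_style = "melodic"
    if "comed" in text or "funny" in text:  # "comedy", "comedic", ...
        plan.effects.append("visual_effects")
        plan.captions = True
    return plan


print(plan_from_instruction("Edit emotionally"))
print(plan_from_instruction("Emphasize the comedic elements"))
```

Keeping the plan as a plain data structure separates interpretation of the instruction from execution of the edits, so the keyword rules could later be replaced by a language model without changing the editing stage.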

[0642] Based on the editing plan designed by the server, the actual video editing is performed automatically. This includes cutting and joining footage, applying transition effects, and adjusting color tones as needed. Furthermore, AI-generated background music is created to harmonize with the video's atmosphere. This music is automatically generated to match the video's theme.

[0643] The edited, completed video file is sent directly from the cloud server to the user's smartphone in a compressed format. Users can then review and easily share the saved video. In this way, users can easily create videos that suit their intentions, even without specialized knowledge.

[0644] The following describes the processing flow.

[0645] Step 1:

[0646] The user launches the smartphone app and selects the video they want to edit. They also input instructions regarding the editing direction and style in natural language.

[0647] Step 2:

[0648] The device sends the video file selected by the user and the entered editing instructions as data to the cloud server. This communication takes place over the internet.

[0649] Step 3:

[0650] The server analyzes the received video file and extracts various features of the video and audio. At this stage, it identifies points of change in the scene and important parts of the audio track.

[0651] Step 4:

[0652] The server analyzes the user's editing instructions using natural language processing technology and formulates an editing plan. It interprets the necessary editing effects and video themes from the instructions.

[0653] Step 5:

[0654] The server edits the video based on the established editing plan. This includes cutting video clips, adjusting sequences, and applying appropriate filters and transitions.

[0655] Step 6:

[0656] The server uses AI to automatically generate music suitable for the video and synchronizes it with the finished footage. In this way, it adds background music that is perfectly suited to the video.

[0657] Step 7:

[0658] The server compresses the edited video into the optimal format and sends it to the user's device. This process utilizes data compression techniques to reduce transfer time.

[0659] Step 8:

[0660] The device saves the received, edited video to local storage. Users can then review it and easily share it on social media or other platforms.

[0661] (Example 1)

[0662] Next, we will describe Example 1. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0663] The present invention aims to provide a system that allows users to easily edit high-quality videos without requiring special technical knowledge. Specifically, it aims to reduce the complexity and time required for video editing by reflecting intuitive natural-language editing instructions in the video, and by automatically emphasizing emotional elements and inserting text information.

[0664] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 1 is realized by the following means.

[0665] In this invention, the server includes input means for the user to input instructions, communication means for transmitting data via a terminal device, and analysis means for analyzing the received multimedia data. This allows the user to easily input video editing instructions in natural language, and those instructions to be immediately reflected in the video editing.

[0666] "Input means" refers to a device or interface that allows users to input video editing instructions in natural language.

[0667] "Communication means" refers to the technology and mechanisms for transmitting data from a user's terminal device to a server.

[0668] "Analysis means" refers to the operation of analyzing received multimedia content using a computer process and extracting important elements and metadata from the video.

[0669] "Processing means" refers to the process of analyzing the input natural language instructions and determining how to proceed with video editing.

[0670] "Design means" refers to the process of creating a detailed video editing plan based on the analyzed instructions.

[0671] "Editing means" refers to operations that execute the designed editing plan, such as cutting videos, applying effects, and inserting music.

[0672] "Synchronization means" refers to the process of appropriately matching and integrating generated music and sound effects with the video.

[0673] "Transmission means" refers to the technology used to send the completed, edited video back to the user's terminal device.

[0674] This invention provides a system that allows users to easily edit high-quality videos using a specific program. Users first utilize a video editing interface via a dedicated smartphone app. This app features an intuitive UI and provides natural language input functionality using speech recognition technology.

[0675] The user selects the video file they want to edit and specifies their desired editing style in natural language. For example, they can enter a simple prompt in the app such as, "Edit this to highlight my fun travel memories." The device then sends this data to a server in the cloud via the internet.

[0676] The server uses analysis algorithms equipped with deep learning technology to analyze the received data. Image processing frameworks are utilized to detect important scenes and emotional elements in the video data, and natural language processing technology is used for instruction analysis. In addition, the audio information of the video is analyzed to construct an appropriate editing plan. Accordingly, a generative AI model is used to generate and synchronize sound materials and effects suitable for the video.

[0677] As a concrete example, the server performs natural language processing using BERT and analyzes emotional elements using TensorFlow. Once the editing plan is decided, editing software such as Adobe Premiere Pro is used to automatically perform editing processes such as cutting, transitions, and color adjustments.

[0678] Once editing is complete, the server compresses the finished video and sends it directly to the user's device. The user can then review the received video and share it on social media as needed. This system allows users to easily create high-quality videos without requiring any technical expertise.

[0679] The flow of the specific processing in Example 1 will be explained using Figure 11.

[0680] Step 1:

[0681] The user launches the smartphone app and selects the video file they want to edit. During this process, the user can input video editing instructions in natural language. The input includes the video file itself and the editing instructions in natural language. The output is a data package containing both the instructions and the video file.

[0682] Step 2:

[0683] The terminal sends the user's entered editing instructions and selected video files to a server in the cloud. In this communication process, the data package is structured in JSON format. The input is the data package generated in step 1, and the output is the completion of the transmission to the remote server.

[0684] Step 3:

[0685] The server analyzes the received video file. It uses deep learning techniques to perform image processing to identify specific scenes and emotional elements within the video. The input is the video file itself, and the output is the analysis results, including metadata for important scenes and emotion labels. Specific operations include frame analysis and edge detection.

[0686] Step 4:

[0687] The server uses natural language processing technology to analyze the user's editing instructions and determine what edits should be made. The input is editing instructions in natural language, and the output is an editing plan. Semantic analysis is performed using models such as BERT to determine an editing policy based on specific requests such as "emotional" or "emphasize comedic elements."

[0688] Step 5:

[0689] Based on the information obtained in steps 3 and 4, the server creates a specific editing plan. The input is metadata for key scenes and the editing plan, and the output is a detailed set of editing instructions. This process includes selecting appropriate transitions, visual effects, and music files.

[0690] Step 6:

[0691] The server uses automated software tools like Adobe Premiere Pro to actually process the video according to the editing plan. This involves specific actions such as cutting footage, applying effects, and color correction. The input is a detailed set of editing instructions, and the output is the edited video file.

[0692] Step 7:

[0693] The server automatically generates background music and sound effects using a generative AI model and synchronizes them with the video. The input is a music theme based on the editing plan, and the output is a music file synchronized with the video. This operation includes music generation and timing adjustment.

[0694] Step 8:

[0695] The edited video is compressed and sent from the server to the user's device. The input is the edited video file, and the output is the state after delivery to the user's device is complete. After the device receives this video, it notifies the user that editing is complete via a notification.

[0696] Step 9:

[0697] Users can view edited videos on their devices and easily share them on social media, etc. This process involves the user playing the video, selecting privacy settings, and then pressing the share button to complete the sharing process.
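The JSON data package mentioned in Step 2 could look like the following sketch, which wraps the instruction text and a base64-encoded video in one JSON document and unpacks it again on the receiving side. The field names (`instruction`, `video`, `filename`, `encoding`, `data`) and the function names are assumptions; the disclosure states only that the package is structured in JSON format. For large files, a real system might instead send the video as a binary upload alongside a small JSON header.

```python
import base64
import json
from typing import Tuple


def build_package(video_bytes: bytes, instruction: str, filename: str) -> str:
    """Serialize the instruction and video into one JSON string."""
    payload = {
        "instruction": instruction,
        "video": {
            "filename": filename,
            "encoding": "base64",
            # JSON cannot carry raw bytes, so the video is base64-encoded.
            "data": base64.b64encode(video_bytes).decode("ascii"),
        },
    }
    return json.dumps(payload, ensure_ascii=False)


def parse_package(package: str) -> Tuple[str, bytes]:
    """Recover the instruction text and raw video bytes on the server."""
    payload = json.loads(package)
    video = base64.b64decode(payload["video"]["data"])
    return payload["instruction"], video


# Placeholder bytes stand in for a real video file.
package = build_package(b"\x00\x01\x02", "Highlight my fun travel memories", "trip.mp4")
instruction, video = parse_package(package)
print(instruction, len(video))
```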

[0698] (Application Example 1)

[0699] Next, we will explain Application Example 1. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0700] Currently, producing advertising videos requires a significant amount of time and specialized knowledge, making it difficult to create high-quality advertising videos quickly and effectively. Therefore, there is a need for a system that allows advertising creators and marketing personnel to easily produce videos that match their respective advertising concepts.

[0701] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 1 is realized by the following means.

[0702] In this invention, the server includes means for designing an editing plan based on the advertising concept, means for synchronizing generated music with video, and means for automatically inserting advertising information into the generated video. This makes it possible for users to quickly create high-quality videos suitable for advertising purposes without requiring specialized knowledge.

[0703] "User input means" refers to a device or software that provides an interface for users to input video editing instructions using natural language.

[0704] A "mobile terminal" refers to a computer device such as a smartphone or tablet carried by a user, used for sending and receiving data.

[0705] "Means of transmitting data" refers to a device or software that has the function of transmitting instructions or video data from a user's mobile device to a cloud server via the internet.

[0706] "Means for analyzing received video data" refers to programs that use artificial intelligence and deep learning technologies to analyze video data received on a cloud server and understand its content.

[0707] "A means of analyzing natural language to determine editing tasks" refers to a program that utilizes natural language processing techniques to analyze natural language instructions input by the user and determine what kind of editing should be performed based on those instructions.

[0708] "A means of designing an editing plan based on an advertising concept" refers to a system equipped with algorithms for planning how to edit a video according to the purpose and target audience of the advertisement.

[0709] A "video editing tool" is a program that automatically performs editing tasks such as cutting, splicing, and adding effects to footage based on a pre-designed editing plan.

[0710] "Methods for synchronizing generated music with video" refers to a program that provides technology for incorporating automatically generated music into video in a way that complements the video's theme and emotions, ensuring it is appropriately matched to the video.

[0711] "Means for transmitting edited video to a mobile device" refers to a device or software with network communication capabilities that compresses the completed video on a cloud server and transmits it quickly to the user's mobile device.

[0712] A "means for automatically inserting advertising information" refers to a system that has the function of automatically adding information such as text and graphics to generated videos, according to marketing objectives.

[0713] To realize this application, it is necessary to build a system that primarily involves cloud-based servers and the user's mobile device (smartphone or tablet). The mobile device has an application installed that allows the user to input video editing instructions in natural language. This application provides a means of user input, allowing users to specify advertising concepts and desired editing styles in natural language.

[0714] When a user enters instructions and selects advertising materials, this information is sent to the cloud via the internet. On the cloud server, a program developed using the Python programming language analyzes the received video data. Using software libraries such as TensorFlow and OpenCV, the program analyzes the video data and identifies important scenes. Furthermore, it utilizes the Hugging Face Transformers library to analyze the user's natural language input and determine the appropriate editing task.

[0715] The server designs an editing plan based on the user's advertising concept. Following this plan, editing is performed automatically, including cutting, splicing, and applying visual effects to the video. Additionally, a generative AI model is used to automatically generate appropriate music and synchronize it with the edited video. A deep learning model is employed for this music generation.

[0716] The edited video is compressed on a cloud server and then sent again to the user's mobile device via the internet. The user can view this completed video file at any time and easily share it as needed. For example, by entering a prompt such as "Create a friendly, trendy cosmetics advertisement video for women in their 20s," a video based on a specific editing plan will be generated.

[0717] The flow of a specific process in Application Example 1 will be explained using Figure 12.

[0718] Step 1:

[0719] The user's device receives video editing instructions in natural language via the application. The input instructions and selected advertising materials are sent to a cloud server via the internet. At this stage, the input consists of the user's text instructions and video data, and sending this to the server initiates the editing process.

[0720] Step 2:

[0721] The server analyzes the received video data using the OpenCV library to identify important scenes and target objects. This analysis involves image processing and feature extraction for each frame to extract the target scenes. As a result, scene information suitable for advertising is obtained.

[0722] Step 3:

[0723] The server uses the Hugging Face Transformers library to analyze the user's natural language instructions. Through this analysis, it interprets the user's request from the input prompt and determines what editing task should be performed. The output of this analysis is the specific editing task.

[0724] Step 4:

[0725] The server designs an editing plan based on the advertising concept. Here, it combines scene information obtained in the previous stage with user requests to plan specific editing techniques. This plan includes the sequence of scenes, transitions, visual effects, and so on.

[0726] Step 5:

[0727] The server automatically generates appropriate music using a generative AI model and synchronizes it with the edited video. It utilizes deep learning technology to generate music that matches the video's theme. The generated music is positioned to align with the flow of the video.

[0728] Step 6:

[0729] The server automatically cuts, splices, and applies effects to the video based on the editing plan. Specific editing tasks are performed, and color tones and text information are inserted as needed. As a result of this process, the final advertisement video is output.

[0730] Step 7:

[0731] The completed video is compressed on a cloud server and sent to the user's mobile device. The compressed video is provided in a format suitable for the user to download and review. As a result, users can easily preview the advertising video and further share or publish it.
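One way to picture the alignment of music and video in Step 5 is beat-synchronized cutting: each cut point is nudged to the nearest beat of the generated track. The disclosure does not say how alignment is performed, so the sketch below is an assumed technique, and the function name `snap_cuts_to_beats`, the tempo, and the cut times are illustrative only.

```python
from typing import List


def snap_cuts_to_beats(cut_times: List[float], bpm: float,
                       offset: float = 0.0) -> List[float]:
    """Move each cut time (seconds) to the nearest beat of the music.

    `bpm` is the tempo of the generated track and `offset` is the time of
    its first beat. Results are rounded to milliseconds.
    """
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    beat = 60.0 / bpm  # seconds per beat
    snapped = []
    for t in cut_times:
        n = max(round((t - offset) / beat), 0)  # index of the nearest beat
        snapped.append(round(offset + n * beat, 3))
    return snapped


# Hypothetical cut points from the editing plan, with music at 120 BPM.
print(snap_cuts_to_beats([1.9, 4.2, 7.6], bpm=120))
```

At 120 BPM a beat falls every 0.5 seconds, so each cut moves by at most a quarter of a second, which is usually small enough not to disturb the planned scene order.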

[0732] Furthermore, an emotion engine that estimates the user's emotions may be incorporated. That is, the specific processing unit 290 may use the emotion identification model 59 to estimate the user's emotions and perform the specific processing using the estimated emotions.

[0733] This invention provides a system that enables video editing that reflects the user's emotions. Using a smartphone app, users can explicitly input their emotional state when entering videos and editing instructions. Furthermore, the app utilizes the camera and microphone to automatically recognize the user's facial expressions and voice tone through an emotion engine.

[0734] The device sends the user's selected video along with instructions and emotional information to the cloud server. The server analyzes the received video and, based on the emotional information from the emotion engine, designs an editing task that matches the user's intentions. This design includes cutting the video, applying slow motion and effects, and adjusting color filters.

[0735] The server uses natural language processing to understand the user's editing instructions and emotional information, and determines how the video should be edited. The emotion engine provides music and effects that match the detected emotions. For example, if the user wants an "emotional" and "calming" video, slow-motion effects and melodic background music will be selected.

[0736] The edited video automatically includes text information that matches the user's emotions. This ensures that the entire video expresses the user's feelings. The server compresses the finished video and sends it to the user's device. The user can then review the video and share it in their preferred way.

[0737] Thus, with this invention incorporating an emotion engine, users can easily create and share videos that reflect their emotions in real time, even without technical skills.

[0738] The following describes the processing flow.

[0739] Step 1:

[0740] The user launches the smartphone app and selects the video they want to edit. Furthermore, they can use the app's features to input their own emotions or have emotional information automatically collected using the camera and microphone.

[0741] Step 2:

[0742] The device sends the video file, user editing instructions, and recognized emotion data as a single package to the cloud server. The transmission is done via the internet and is encrypted to ensure security.

[0743] Step 3:

[0744] The server analyzes the received video data and emotional information. It identifies scene transitions in the video and unique characteristics in the audio, and analyzes the emotional state obtained by the emotion engine.

[0745] Step 4:

[0746] The server uses natural language processing technology to generate the optimal editing task based on the user's editing instructions and emotional information. The AI interprets emotional instructions such as "emotional" or "fun" and constructs an editing plan.

[0747] Step 5:

[0748] The server then develops an editing plan based on the emotion engine. It automatically selects necessary effects and filters, inserts cutscenes, and chooses and applies music. The editing plan includes restructuring the timeline and applying effects.

[0749] Step 6:

[0750] The server automatically generates music using AI according to the video content. It adjusts the tempo and style of the music to match the emotional information, harmonizing it with the overall video.

[0751] Step 7:

[0752] The server automatically inserts emotionally relevant text into the edited video and compresses it into the specified format.

[0753] Step 8:

[0754] The edited video is sent from the server to the user's device. The device saves it to local storage, and the user can view the video through the app and easily share it on platforms like Facebook and Instagram as they like.

[0755] (Example 2)

[0756] Next, we will describe Example 2. In the following description, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0757] Traditional video editing tools often struggle to effectively reflect the emotions intended by the user in the video, and frequently require specialized editing knowledge. Furthermore, because the means of reflecting user emotions in real time are limited, creating content that meets user expectations has been challenging.

[0758] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Example 2 is realized by the following means.

[0759] In this invention, the server includes a user input means, a means for transmitting video and emotional information through a terminal, and a means for analyzing the received video. This enables automatic video editing that reflects the user's emotions, making it possible for even users without specialized knowledge to create videos that effectively express emotions.

[0760] "User input means" refers to technology that provides an interface for users to select videos they want to edit, give editing instructions, and input emotional information.

[0761] A "terminal" is a portable electronic device that has the function of receiving user input and transmitting and receiving data.

[0762] "Moving images" are visual data that convey motion by playing a series of still images at a constant rate.

[0763] "Emotional information" refers to data that indicates the user's psychological state and is analyzed by the emotion engine.

[0764] A "server" is a remote computer system used for analyzing, editing, and processing data.

[0765] "Natural language" refers to the language that humans use on a daily basis and is in a form that can be analyzed by programs.

[0766] An "editing plan" is a detailed design of how to edit video footage, generated based on user instructions and emotional information.

[0767] "Emotional data" refers to data that expresses a user's emotional state using numerical values or categories.

[0768] A "generative AI model" is an artificial intelligence model trained to generate text or media based on specified prompts.
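The "emotional data" defined above, expressed as numerical values or categories, might be represented as in the following sketch. The class name `EmotionData`, the [0, 1] score range, and the rule of taking the highest score as the category are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EmotionData:
    """An emotion as a category label plus optional per-emotion scores in [0, 1]."""

    category: str
    scores: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for name, value in self.scores.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"score for {name!r} must be in [0, 1]")

    @classmethod
    def from_scores(cls, scores: Dict[str, float]) -> "EmotionData":
        """Derive the category from the highest-scoring emotion (assumed rule)."""
        if not scores:
            raise ValueError("scores must not be empty")
        return cls(category=max(scores, key=scores.get), scores=dict(scores))


sample = EmotionData.from_scores({"joy": 0.8, "sadness": 0.1, "surprise": 0.3})
```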

[0769] This invention is a system for reflecting user-intended emotions in moving images, and it achieves this through collaboration between the user, the terminal, and the server.

[0770] Users select videos and input editing instructions and emotional information using an application on a device such as a smartphone. Users can manually select their own emotions, and the device automatically captures the user's facial expressions and voice tone using its camera and microphone. This data is analyzed by the device's emotion engine. For example, if the camera detects the user's smile, the emotion "joy" is recorded.

[0771] The device sends collected video footage, editing instructions, and emotional information to a server in the cloud. A key aspect of this system is that communication is conducted using a secure protocol. Based on the received data, the server analyzes the video footage using a generative AI model and understands the user's intent through natural language processing. It automatically generates an editing plan and performs appropriate editing to produce video footage that matches the user's emotions.

[0772] The server applies music and visual effects based on emotional data to the video during editing. For example, if the user enters "calm mood," the server will select slow motion and gentle music. Text matching the user's entered emotion is automatically inserted into the edited video. The completed edited video is compressed and sent to the user's device.

[0773] In this system, a generative AI model plays a central role in generating natural and emotionally rich content. A concrete example of a prompt message is, "Add a cheerful atmosphere to a birthday party video. Emotion: Happiness, blessings." This invention makes it possible for users to easily create and share high-quality videos that express their emotions, even without advanced technical skills.
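The prompt message quoted above could be assembled from the user's instruction and the emotion labels roughly as follows. The function `build_prompt` and its formatting rules are assumptions. The disclosure gives only the example sentence, not how it is constructed.

```python
from typing import Sequence


def build_prompt(instruction: str, emotions: Sequence[str]) -> str:
    """Join the editing instruction and the emotion labels into one prompt string."""
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    if not instruction.endswith((".", "!", "?")):
        instruction += "."
    labels = ", ".join(e.strip() for e in emotions if e.strip())
    return f"{instruction} Emotion: {labels}." if labels else instruction


prompt = build_prompt(
    "Add a cheerful atmosphere to a birthday party video", ["Happiness", "blessings"]
)
```

With these inputs the function reproduces the example prompt in the text. When no emotion labels are available it returns the instruction alone.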

[0774] The flow of the specific processing in Example 2 will be explained using Figure 13.

[0775] Step 1:

[0776] The user launches the smartphone app and selects the video or image they want to edit. Next, the user either manually enters their emotions or allows the device to use its camera and microphone to automatically capture emotion information. This saves the user's input as video and emotion data on the device.

[0777] Step 2:

[0778] The device uses an emotion engine to analyze the acquired emotional information. Facial expressions captured by the camera and voice tones captured by the microphone are quantified as emotional data. This analysis identifies emotions such as "joy" and "sadness."

[0779] Step 3:

[0780] The device sends the video selected by the user and the analyzed emotion information to a cloud server. The input data is encrypted and transmitted securely. As output, the data is transferred to a server that is waiting to receive it.

[0781] Step 4:

[0782] The server uses natural language processing to analyze the received video and emotional information. This analysis interprets the user's editing instructions and generates an editing plan on how to edit the video. For example, if the instruction is "calm mood," the application of slow motion and gentle music will be considered.

[0783] Step 5:

[0784] The server uses a generative AI model to apply appropriate visual effects and music to the video. Based on the input emotion data, the video effects and soundtrack are automatically selected. For example, if the emotion "joy" is input, bright music and a vibrant color filter will be applied.

[0785] Step 6:

[0786] The server automatically inserts text information that matches the user's emotions into the edited video. For example, text such as "I love this moment" might be selected. The output is a completed video that reflects the emotions.
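The text insertion in Step 6 can be illustrated by generating a subtitle cue for the chosen caption. The sketch below uses the SubRip (SRT) cue layout, which is a choice made here and is not stated in the disclosure. The `CAPTIONS` table is hypothetical apart from the example caption given in the text.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


# Hypothetical emotion-to-caption table; only this one caption appears in the text.
CAPTIONS = {"joy": "I love this moment"}


def caption_cue(index: int, emotion: str, start_s: float, end_s: float) -> str:
    """Build one SRT cue carrying the caption chosen for the given emotion."""
    text = CAPTIONS.get(emotion, "")
    return f"{index}\n{format_timestamp(start_s)} --> {format_timestamp(end_s)}\n{text}\n"


cue = caption_cue(1, "joy", 2.5, 5.0)
```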

[0787] Step 7:

[0788] The server compresses the edited video and efficiently transfers it to the user's device. This allows the user to instantly play the edited video on their device. As output, high-quality video is provided and saved on the user's device.

[0789] Step 8:

[0790] Users view the edited videos they receive and evaluate and revise them as needed. If users are satisfied with the result, they can share the video through social media or messaging apps. Ultimately, they can share emotionally expressive videos with others.

[0791] (Application Example 2)

[0792] Next, we will explain application example 2. In the following explanation, the data processing device 12 will be referred to as the "server" and the robot 414 as the "terminal".

[0793] There is a growing demand for automated video content that more richly expresses people's emotions during family activities and events. However, current systems make it difficult to edit content based on emotions, and this is especially challenging for users without technical knowledge.

[0794] The specific processing performed by the specific processing unit 290 of the data processing device 12 in Application Example 2 is realized by the following means.

[0795] In this invention, the server includes means for user input, means for transmitting data via a mobile terminal, means for analyzing received video data, means for analyzing natural language to determine an editing task, means for designing an editing plan, means for editing the video based on the editing plan, means for synchronizing generated music with the video, means for transmitting the edited video to the mobile terminal, means for identifying emotional states, means for adding relevant information to the automatically corrected video, and means for recording activities at home and performing editing that is appropriate to the situation. This enables users to perform emotion-based video editing without requiring technical skills.

[0796] "Means for user input" refers to an interface that allows users to provide instructions or information to a system.

[0797] "Means of transmitting data via mobile terminals" refers to the function of sending data to a server using mobile devices such as smartphones and tablets.

[0798] "Means for analyzing received video data" refers to the process of analyzing the transmitted video data using a computer program and understanding its content.

[0799] "A means of analyzing natural language to determine editing tasks" refers to a process that analyzes instructions provided by users in natural language and converts them into specific editing tasks.

[0800] "Means for designing editing plans" refers to the function of planning the editing policies and details of a video based on the analyzed data.

[0801] "Means for editing the video based on the editing plan" refers to the process of cutting the video and applying effects to it according to the planned policy.

[0802] "A means of synchronizing generated music with video" refers to a function that combines music with edited video at the appropriate timing.

[0803] "Means of transmitting edited video to a mobile device" refers to the process of transferring the edited video back to the user's mobile device.

[0804] "Means for identifying emotional states" refers to analytical functions using machine learning or algorithms that identify emotions from the user's speech and facial expressions.

[0805] "Means for adding relevant information to the automatically corrected video" refers to the process of adding emotionally and contextually relevant information or text to the edited video.

[0806] "A means of recording activities at home and editing them to suit the situation" refers to a function that photographs the daily life and events of a family and edits them into the most suitable format according to the situation.

[0807] To implement this invention, first, a consumer robot equipped with a user-friendly interface for home use is prepared. This robot is equipped with a camera and microphone, which can capture the user's facial expressions and voice in real time. Without the user explicitly inputting their emotional state, the robot uses an emotion engine that automatically analyzes the collected facial data and voice tone. This engine incorporates a machine learning model that can identify the user's emotions with high accuracy.

[0808] The received video data is analyzed by a processing unit within the robot or by a server in the cloud. During this process, a natural language processing engine is utilized to appropriately understand editing requests for the video. For example, if an emotionally moving scene needs to be emphasized, editing tasks such as inserting slow-motion effects or melodic music are automatically designed. This results in video content that reflects the user's emotions.

[0809] Once editing is complete, an AI model automatically inserts emotionally relevant text into the video. For example, text such as "This moment is priceless" might be added to emotionally moving scenes. The server compresses the final edited video and sends it to the user's device. Users can then review the finished video and share it with family and friends in an emotionally resonant way.

[0810] For example, a video might be automatically generated showing a family dinner scene with soft, calming music playing in the background, highlighting moments of smiles in slow motion. In this case, an example of a prompt message might be, "Please suggest an editing style that emphasizes the happy atmosphere of dinnertime."

[0811] The flow of a specific process in Application Example 2 will be explained using Figure 14.

[0812] Step 1:

[0813] The user simply inputs their desired editing style based on videos captured by the home robot. Even without explicit input from the user, the robot uses its camera and microphone to capture the user's facial expressions and voice tone, sending this as emotion data to its internal processor. The input data consists of the user's facial images and voice, which serve as the raw materials for emotion recognition.

[0814] Step 2:

[0815] The device's emotion engine analyzes collected facial expression data and voice tone to identify the user's emotional state. Here, machine learning algorithms are used to process the data and extract emotions such as "joy" and "surprise" as the output of emotion recognition.
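As one hedged illustration of Step 2, the sketch below combines per-emotion scores from facial expression and voice tone with a weighted average and picks the strongest emotion. The disclosure says only that machine learning algorithms are used. The fusion rule, the weight, and the function names are assumptions, and the input scores are presumed to come from separate upstream classifiers.

```python
from typing import Dict


def fuse_emotions(
    face: Dict[str, float], voice: Dict[str, float], face_weight: float = 0.6
) -> Dict[str, float]:
    """Weighted average of two per-emotion score dictionaries."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be in [0, 1]")
    labels = set(face) | set(voice)
    return {
        label: face_weight * face.get(label, 0.0)
        + (1.0 - face_weight) * voice.get(label, 0.0)
        for label in labels
    }


def dominant_emotion(scores: Dict[str, float]) -> str:
    """Return the label with the highest fused score."""
    return max(scores, key=scores.get)


fused = fuse_emotions({"joy": 0.9, "surprise": 0.2}, {"joy": 0.5, "surprise": 0.6})
result = dominant_emotion(fused)
```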

[0816] Step 3:

[0817] The server designs an appropriate editing task based on the user's emotional state and the video content. In this process, the server uses a natural language processing engine to analyze the user's editing instructions and determine how the video should be edited. The input is the emotional state and the user's instruction text, and the output is a specific editing plan.

[0818] Step 4:

[0819] The server edits the video based on the designed editing plan. It performs edits such as cutting, applying slow-motion effects and filters, and uses a generative AI model to select music and effects that match the emotions. The input to this process is the editing plan, and the output is the edited video.
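The cutting and slow-motion edits in Step 4 can be modeled as a list of kept segments, each with a playback speed. The sketch below computes the length of the edited video from such a plan. The `Segment` structure is a hypothetical representation of an editing plan, not one defined in the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_s: float      # where the kept segment begins in the source video
    end_s: float        # where the kept segment ends in the source video
    speed: float = 1.0  # playback speed; 0.5 means half speed (slow motion)

    def output_duration(self) -> float:
        """Duration of this segment after the speed change is applied."""
        if self.end_s <= self.start_s or self.speed <= 0:
            raise ValueError("segment needs end_s > start_s and speed > 0")
        return (self.end_s - self.start_s) / self.speed


def total_duration(plan: List[Segment]) -> float:
    """Length of the edited video once all kept segments are joined."""
    return sum(seg.output_duration() for seg in plan)


# Keep three segments; the middle one plays in slow motion.
plan = [Segment(0.0, 4.0), Segment(10.0, 12.0, speed=0.5), Segment(20.0, 23.0)]
length = total_duration(plan)
```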

[0820] Step 5:

[0821] The server uses an emotion-based generative AI model to automatically insert appropriate text information into the video. Here, by embedding text that expresses emotions into the edited video, a visually rich and emotionally resonant output is obtained.

[0822] Step 6:

[0823] The server compresses the edited video data and sends it back to the original mobile device. The final output is a video file processed into a format that users can easily play at home. This allows users to review and share videos that express the intended emotions.

[0824] The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the controlled object 443 to output the result of the specific processing. The microphone 238 acquires audio indicating user input for the result of the specific processing. The control unit 46A transmits the audio data indicating user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

[0825] Data generation model 58 is a type of so-called generative AI (Artificial Intelligence). Examples of data generation model 58 include ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>) and Gemini (Internet search <URL: https://gemini.google.com/?hl=ja>). The data generation model 58 is obtained by performing deep learning on a neural network. Prompts containing instructions, together with inference data such as audio data representing speech, text data representing text, and image data representing images, are input to the data generation model 58. The data generation model 58 performs inference on the input inference data according to the instructions given by the prompts, and outputs the inference results in data formats such as audio data and text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization.

[0826] In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of this disclosure is not limited thereto, and the specific processing may also be performed by the robot 414.

[0827] Furthermore, the emotion identification model 59, acting as an emotion engine, may determine the user's emotion according to a specific mapping, namely an emotion map (see Figure 9). Similarly, the emotion identification model 59 may also determine the robot's emotion, and the specific processing unit 290 may perform the specific processing using the robot's emotion.

[0828] Figure 9 shows an emotion map 400 in which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer to the center of the concentric circles, the more primitive the emotions are located. Further out of the concentric circles, emotions representing states and actions arising from mental states are located. Emotion is a concept that includes feelings and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are located. On the right side of the concentric circles, emotions that are generally induced by situational judgment are located. Above and below the concentric circles, emotions that are generally generated from reactions occurring in the brain and induced by situational judgment are located. In addition, the emotion of "pleasure" is located on the upper side of the concentric circles, and the emotion of "displeasure" is located on the lower side. Thus, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions arise, and emotions that are likely to occur simultaneously are mapped close together.
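The layout of the emotion map described above can be sketched with polar coordinates. The encoding below is an assumption made for illustration: radius stands for distance from the center, the angle is measured counterclockwise from the 3 o'clock position, the upper half is read as pleasure, and the right half as situational judgment. No actual emotion positions from Figure 9 are reproduced.

```python
import math
from dataclasses import dataclass
from typing import Tuple

EPS = 1e-9  # tolerance so points on an axis are not misclassified by rounding


@dataclass(frozen=True)
class MapPoint:
    radius: float     # 0 is the center (most primitive); larger is further out
    angle_deg: float  # 0 is 3 o'clock, counterclockwise, so 90 is the top

    def xy(self) -> Tuple[float, float]:
        rad = math.radians(self.angle_deg)
        return self.radius * math.cos(rad), self.radius * math.sin(rad)

    def describe(self) -> str:
        """Classify the point by vertical half (valence) and horizontal half (origin)."""
        x, y = self.xy()
        valence = "pleasure" if y > EPS else "displeasure" if y < -EPS else "neutral"
        origin = "situation" if x > EPS else "reaction" if x < -EPS else "mixed"
        return f"{valence}/{origin}"


upper_right = MapPoint(radius=2.0, angle_deg=45.0).describe()
lower_left = MapPoint(radius=1.0, angle_deg=225.0).describe()
top = MapPoint(radius=1.0, angle_deg=90.0).describe()
```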

[0829] These emotions are distributed at the 3 o'clock position on the Emotion Map 400, and usually fluctuate between feelings of security and anxiety. In the right half of the Emotion Map 400, situational awareness takes precedence over internal feelings, resulting in a calm impression.

[0830] The inside of the Emotion Map 400 represents inner thoughts, while the outside represents actions. Therefore, the further you go toward the outside of the Emotion Map 400, the more visible (expressed in actions) the emotions become.

[0831] Here, human emotions are based on various balances, such as posture and blood sugar levels. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. Similarly, in robots, cars, motorcycles, etc., emotions can be created based on various balances, such as posture and battery level. When these balances deviate from the ideal, it results in discomfort, and when they approach the ideal, it results in pleasure. The emotion map can be generated, for example, based on Dr. Mitsuyoshi's emotion map (Research on a system for analyzing brain physiological signals of speech emotion recognition and emotion, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map contains emotions belonging to a region called "response," where sensation is dominant. The right half of the emotion map contains emotions belonging to a region called "situation," where situational awareness is dominant.

[0832] The emotion map defines two emotions that promote learning. One is the emotion around the middle of the negative "repentance" and "reflection" on the situation side. In other words, it is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the emotion around the positive "desire" on the reaction side. In other words, it is when the robot has positive feelings such as "I want more" or "I want to know more."

[0833] The emotion identification model 59 inputs user input into a pre-trained neural network, obtains emotion values representing each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is pre-trained based on multiple training data sets, which are combinations of user input and emotion values representing each emotion shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close together have similar values, as shown in the emotion map 900 in Figure 10. Figure 10 shows an example where multiple emotions such as "reassured," "calm," and "confident" have similar emotion values.
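One way to obtain training targets in which neighboring emotions have similar values is to let the target decay with distance on the map. The sketch below does this with a Gaussian kernel. The coordinates in `POSITIONS`, the kernel, and `sigma` are assumptions for illustration and are not taken from Figure 10 or from the training procedure of the disclosure.

```python
import math
from typing import Dict, Tuple

# Hypothetical 2-D positions on the emotion map (illustrative, not from the figures).
POSITIONS: Dict[str, Tuple[float, float]] = {
    "reassured": (1.0, 0.2),
    "calm": (1.1, 0.0),
    "confident": (1.0, -0.2),
    "surprise": (-1.5, 1.0),
}


def soft_targets(label: str, sigma: float = 0.5) -> Dict[str, float]:
    """Target emotion values that decay with map distance from the labeled emotion."""
    cx, cy = POSITIONS[label]
    return {
        name: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for name, (x, y) in POSITIONS.items()
    }


targets = soft_targets("calm")
```

With these assumed positions, "reassured" and "confident" receive targets close to that of "calm", while the distant "surprise" receives a target near zero.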

[0834] The above description primarily focuses on the functions of the data processing device 12 in relation to this disclosure. However, the system related to this disclosure is not necessarily implemented on a server. The system related to this disclosure may be implemented as a general information processing system. This disclosure may be implemented, for example, as a software program that runs on a personal computer or as an application that runs on a smartphone. The method related to this disclosure may be provided to users in SaaS (Software as a Service) format.

[0835] In the above embodiment, an example was given in which a specific process is performed by a single computer 22. However, the technology of this disclosure is not limited thereto, and a distributed processing of the specific process may be performed by multiple computers, including computer 22. For example, a data generation model 58 may be provided in an external device of the data processing device 12, and the external device may generate data according to the input data.

[0836] In the above embodiment, an example was given in which the specific processing program 56 is stored in the storage 32, but the technology of this disclosure is not limited thereto. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a USB (Universal Serial Bus) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes specific processing according to the specific processing program 56.

[0837] Alternatively, the specific processing program 56 may be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed on the computer 22 in response to a request from the data processing device 12.

[0838] Furthermore, it is not necessary to store the entirety of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store the entirety of the specific processing program 56 in the storage 32; it is acceptable to store only a portion of the specific processing program 56.

[0839] The following types of processors can be used as hardware resources to perform specific processing. Examples of processors include a CPU, a general-purpose processor that functions as a hardware resource to perform specific processing by executing software, i.e., a program. Other examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which have circuit configurations specifically designed to perform specific processing. All of these processors have built-in or connected memory, and all of them perform specific processing by using memory.

[0840] The hardware resource that performs a specific process may consist of one of these various processors, or it may consist of a combination of two or more processors of the same or different types (for example, a combination of multiple FPGAs, or a combination of a CPU and an FPGA). Alternatively, the hardware resource that performs a specific process may consist of a single processor.

[0841] Examples of configurations using a single processor include, firstly, a configuration in which one or more CPUs and software are combined to form a single processor, and this processor functions as a hardware resource that performs a specific process. Secondly, there is a configuration using a processor that realizes the functions of the entire system, including multiple hardware resources that perform a specific process, on a single IC chip, as exemplified by SoCs (System-on-a-chip). In this way, a specific process is realized using one or more of the above types of processors as hardware resources.

[0842] Furthermore, the hardware structure of these various processors can more specifically utilize electrical circuits that combine circuit elements such as semiconductor devices. Also, the specific processing described above is merely an example. Therefore, it goes without saying that unnecessary steps can be deleted, new steps added, or the processing order rearranged, as long as it does not deviate from the main purpose.

[0843] The descriptions and illustrations presented above are detailed explanations of the technical aspects of this disclosure and are merely examples of the technical aspects. For example, the above descriptions of the structure, function, operation, and effect are examples of the structure, function, operation, and effect of the technical aspects of this disclosure. Therefore, it goes without saying that you may delete unnecessary parts, add new elements, or replace elements in the descriptions and illustrations presented above, as long as you do not deviate from the essence of the technical aspects of this disclosure. Furthermore, in order to avoid confusion and facilitate understanding of the technical aspects of this disclosure, explanations of common technical knowledge and the like that do not require special explanation to enable the implementation of the technical aspects of this disclosure have been omitted from the descriptions and illustrations presented above.

[0844] All documents, patent applications, and technical standards described herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually noted to be incorporated by reference.

[0845] The following is further disclosed regarding the embodiments described above.

[0846] (Claim 1)

[0847] User input means,

[0848] Means for transmitting data via a mobile terminal,

[0849] Means for analyzing the received video data,

[0850] Means for analyzing natural language to determine an editing task,

[0851] Means for designing an editing plan,

[0852] Means for editing a video based on the editing plan,

[0853] Means for synchronizing the generated music with the video,

[0854] Means for transmitting the edited video to the mobile terminal,

[0855] A system including the above.

[0856] (Claim 2)

[0857] The system according to claim 1, further comprising means for identifying emotional elements by voice analysis.

[0858] (Claim 3)

[0859] The system according to claim 1, further comprising means for automatically inserting text information into the generated video.

[0860] "Example 1"

[0861] (Claim 1)

[0862] Input means for a user to enter instructions,

[0863] Communication means for transmitting data via a terminal device,

[0864] Analysis means for analyzing received multimedia data,

[0865] Processing means for analyzing natural language and determining editing instructions,

[0866] Design means for constructing an editing plan based on the editing instructions,

[0867] Editing means for automatically editing the video according to the editing plan,

[0868] Synchronization means for appropriately matching the generated sound material to the media,

[0869] Transmission means for sending the edited content to the terminal device,

[0870] A system including the above.

[0871] (Claim 2)

[0872] The system according to claim 1, further comprising specific means for extracting emotional aspects by analyzing voice data.

[0873] (Claim 3)

[0874] The system according to claim 1, further comprising arrangement means for automatically adding text information to the generated video.

[0875] "Application Example 1"

[0876] (Claim 1)

[0877] User input means,

[0878] Means for transmitting data via a mobile terminal,

[0879] Means for analyzing the received video data,

[0880] Means for analyzing natural language to determine an editing task,

[0881] Means for designing an editing plan based on an advertisement concept,

[0882] Means for editing a video based on the editing plan,

[0883] Means for synchronizing the generated music with the video,

[0884] Means for transmitting the edited video to the mobile terminal,

[0885] A system including the above.

[0886] (Claim 2)

[0887] The system according to claim 1, further comprising means for identifying emotional elements by voice analysis.

[0888] (Claim 3)

[0889] The system according to claim 1, further comprising means for automatically inserting advertising information into the generated video.

[0890] "Example 2 of combining an emotion engine"

[0891] (Claim 1)

[0892] User input means,

[0893] Means for transmitting video and emotional information through a terminal,

[0894] Means for analyzing the received video,

[0895] Means for analyzing editing instructions and emotional information using natural language processing,

[0896] Means for designing a video editing plan,

[0897] Means for editing the video based on the editing plan,

[0898] Means for applying music and visual effects suited to the emotional data to the video,

[0899] Means for automatically inserting text into the video,

[0900] Means for transmitting the edited video to the terminal,

[0901] A system including the above.

[0902] (Claim 2)

[0903] The system according to claim 1, further comprising means for identifying emotions from voice data.

[0904] (Claim 3)

[0905] The system according to claim 1, further comprising means for processing prompt sentences using a generative AI model.

[0906] "Application Example 2 of combining an emotion engine"

[0907] (Claim 1)

[0908] User input means,

[0909] Means for transmitting data via a mobile terminal,

[0910] Means for analyzing the received video data,

[0911] Means for analyzing natural language to determine an editing task,

[0912] Means for designing an editing plan,

[0913] Means for editing a video based on the editing plan,

[0914] Means for synchronizing the generated music with the video,

[0915] Means for transmitting the edited video to the mobile terminal,

[0916] Means for identifying emotional states,

[0917] Means for adding relevant information to the automatically corrected video,

[0918] Means for recording activities at home and performing editing appropriate to the situation,

[0919] A system including the above.

[0920] (Claim 2)

[0921] The system according to claim 1, further comprising means for identifying emotional elements by voice analysis.

[0922] (Claim 3)

[0923] The system according to claim 1, further comprising means for automatically inserting text information into the generated video.

[Explanation of symbols]

[0924]
10, 210, 310, 410 Data processing system
12 Data processing device
14 Smart device
214 Smart glasses
314 Headset-type terminal
414 Robot

Claims

1. User input means,
Means for transmitting data via a mobile terminal,
Means for analyzing the received video data,
Means for analyzing natural language to determine an editing task,
Means for designing an editing plan based on an advertisement concept,
Means for editing a video based on the editing plan,
Means for synchronizing the generated music with the video,
Means for transmitting the edited video to the mobile terminal,
A system including the above.

2. The system according to claim 1, further comprising means for identifying emotional elements by voice analysis.

3. The system according to claim 1, further comprising means for automatically inserting advertising information into the generated video.