Methods for editing media content, computing devices, computing systems, non-transitory computer-readable media, and computer programs.

A machine learning-based media content editing architecture with a large language model and prompt manager enhances user experience by predicting and performing edits, addressing software complexity issues and improving over time based on user interactions and engagement metrics.

JP2026521024A · Pending · Publication Date: 2026-06-25 · LEMON CO LTD


Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: LEMON CO LTD
Filing Date: 2024-06-28
Publication Date: 2026-06-25


IPC: H04N21/854; G06F8/35


AI Technical Summary

Technical Problem

Existing media content editing software is complex, so users underutilize it: they are often unaware of the available features and find the interface difficult to navigate.

Method used

A media content editing architecture using machine learning techniques that includes a large language model (LLM) to analyze user input, predict editing actions, and perform edits through a dialog-assisted interface, with a prompt manager and tool database to enhance usability and flexibility.

Benefits of technology

The architecture provides intuitive and flexible editing capabilities, allowing users to effectively utilize advanced features without extensive software knowledge, and continuously improves through user feedback and engagement metrics.


Abstract

An example of a media content editing architecture using machine learning techniques is provided. One embodiment includes a method for editing media content, the method comprising: receiving media content from a user; receiving an editing request for the media content from the user; obtaining a prompt selected based on the editing request from a prompt pool; analyzing the obtained prompt and the editing request using a large-scale language model to generate one or more editing actions to be performed on the media content; and editing the media content based on the editing request and generating the edited media content by performing the one or more editing actions on the media content.

Description

Technical Field

[0001] Cross-reference to Related Applications: This application claims priority to U.S. Application No. 18/346,727, filed on July 3, 2023, with the title "Technical Architecture for Media Content Editing Using Machine Learning", the disclosure of which is hereby incorporated by reference in its entirety.

[0002] The present invention relates to a technical architecture for media content editing using machine learning.

Background Art

[0003] Raw media content in its original recording format is usually edited before publication in order to enhance its appeal for better viewer engagement. Editing media content (e.g., images, audio, video, and other modalities) typically involves the use of software with editing capabilities provided in the form of editing tools. Editing of media content can include various operations and modifications. For example, in the context of video editing, editing can include trimming segments, changing the order of segments, adjusting the playback speed, embedding content such as special effects and caption text, adjusting audio, cropping, etc. Furthermore, powerful editing software makes possible a non-linear editing (NLE) system, in which multiple edits can be performed on raw media content in a non-destructive process so that the original data can be recovered, i.e., the edits can be reversed.

Summary of the Invention

[0004] An example of a media content editing architecture using machine learning techniques is provided. One embodiment includes a method for editing media content, the method comprising: receiving media content from a user; receiving an editing request for the media content from the user; obtaining a prompt selected based on the editing request from a prompt pool; analyzing the obtained prompt and the editing request using a large-scale language model to generate one or more editing actions to be performed on the media content; and editing the media content based on the editing request and generating the edited media content by performing the one or more editing actions on the media content.

[0005] This summary is provided to present a simplified excerpt of the concept, which will be further described in the embodiments for carrying out the invention described below. This summary is not intended to identify the main or basic features of the claimed subject matter, nor to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to an implementation that solves any or all of the defects described in any part of this disclosure.

Brief Description of the Drawings

[0006]
[Figure 1] This is a block diagram model representing a typical pipeline and various components of an exemplary technical architecture for realizing a media content editing application.
[Figure 2] This block diagram model shows an exemplary backend tool service for providing editing tools and capabilities, which can be achieved with the typical pipeline described in Figure 1.
[Figure 3] This block diagram model illustrates an exemplary use of context memory 302 in a media content editing architecture, achievable with the typical pipeline described in Figure 1.
[Figure 4] This block diagram model illustrates an exemplary system evolution and improvement application for a media content editing architecture, achievable with the typical pipeline described in Figure 1.
[Figure 5] This block diagram model shows an exemplary media content editing model architecture with system evolution and improvement processes, providing a detailed illustration of the typical pipeline described in Figure 1.
[Figure 6] This flowchart illustrates an exemplary method for a media content editing process using machine learning technology, which can be implemented using the technical architecture shown in Figure 1.
[Figure 7] This flowchart shows an exemplary method for improving a media content editing architecture, which can be achieved using the technical architecture shown in Figure 1.
[Figure 8] This figure schematically illustrates a non-limiting embodiment of a computing system capable of implementing one or more of the methods and processes described above.

Modes for Carrying Out the Invention

[0007] Media content editing software, capable of providing powerful editing tools, is widely available for both commercial and personal use. Typically, content editing software involves the use of a user interface (UI) with various sections, menus, buttons, etc., for navigating and selecting desired editing tools. These technologies have evolved over time, offering a vast array of tools for performing numerous editing tasks. However, software with more powerful editing capabilities and features naturally becomes more complex. As a result, many features remain unexplored by the typical user. Complex UI navigation, a lack of knowledge about the software's capabilities, and the difficulty in utilizing these capabilities can all contribute to underutilization of editing software. For example, a typical user of editing software may not be aware of or capable of using specific tools or features of the software to perform their desired edits.

[0008] In light of the above observations, a media content editing architecture using machine learning techniques is provided. The machine learning-based architecture can be configured in various ways to provide an intuitive media content editing application. Such an application may be configured to receive editing requests from a user and perform one or more desired edits on the media content provided by the user. In some implementations, the media content to be edited is generated by the application. Editing requests are provided in text form and can be converted into one or more edits to be performed by applying machine learning techniques and natural language processing. Edits can be performed, and the rendering results are provided to the user for evaluation. In some implementations, the editing process is performed as a non-linear editing (NLE) process. Such implementations allow for better utilization of the architecture's editing capabilities in a more flexible manner. The user may revert edits or provide another editing request. The process can be repeated and continued until the user is satisfied, at which point the user publishes the edited media content.

[0009] Various machine learning techniques, such as deep learning models, can be applied. In some implementations, the media content editing architecture includes a large language model (LLM) for analyzing and interpreting user input to predict one or more editing actions to be performed. Inference prediction can be performed based on conversational text interaction with the user by receiving user text input and responding with dialog replies. The media content editing architecture may further include a prompt manager that provides prompts in response to editing requests. Prompts may be obtained from a prompt database. The prompt manager then generates a list of instructions or actions corresponding to the edits to be performed by feeding the user's request into the provided prompts and performing inference predictions using the LLM. The LLM agent may further be configured to perform the above actions for editing media content. To perform editing, the LLM agent utilizes a registry database of available editing tools that the agent can access. The database may be linked to available editing tools and their associated application programming interfaces (APIs) that the LLM agent can use to perform editing actions.

[0010] In some implementations, the media content editing architecture is configured to have a system evolution process that trains and improves the architecture. For example, the LLM and/or prompt database may be improved based on the operation history and edited media content. The media content editing architecture may be configured to store and remember conversation history and/or contextual information (e.g., asset descriptions of edited media content). To prevent dilution of valid samples, the media content editing architecture may be configured to store information about successful submissions (e.g., edited content that is ultimately published by the user). The stored information may be used to improve the media content editing architecture based on a predetermined reward function. For example, the reward function can be determined using various metrics associated with already published edited media content. Such metrics may include those that represent the success of the edited media content in terms of viewer engagement. Examples of such metrics include views, comments, likes, shares, etc., associated with published edited media content.

[0011] Next, we will turn to the diagrams to further illustrate and describe a media content editing architecture utilizing machine learning technology. Figure 1 is a block diagram model showing a computing system 100. The block diagram model represents a general pipeline and various components of an exemplary technical architecture for realizing a media content editing application in a client-server environment. The computing system 100 includes a server system 101, which includes multiple server computing devices configured to realize a social media network platform by executing the illustrated modules and services. The server system 101 is configured to communicate with multiple client computing devices 103, each running a social network client 102, via a computer network N, for example, the Internet. For example, the computing system 100 can realize a short video social media network where users create, publish, share, and engage with short videos. In other realizations, the computing system 100 may be realized as an offline application on a computing device. It will be understood that certain modules shown on the server system may be realized on a client computing device, for example, a backend editing tool. Furthermore, the social network client may be a mobile client of the social media network, an effects editing software program running on a personal computer, or other software.

[0012] The editing process is performed via a dialog-assisted editing interface 104, which includes a dialog interface 106. User 108 provides media content 110 to be edited, along with an editing request 112. In some implementations, the media content 110 is generated by a media content editing application in response to the user's request. For example, the media content can be generated using generative machine learning techniques. The media content 110 may have various modalities. For example, the media content 110 may be an image, recording, video, etc. The media content 110 may be displayed via the dialog-assisted editing interface 104. Furthermore, edits performed on the media content 110 during the editing process can be displayed to user 108 via the dialog-assisted editing interface 104, allowing user 108 to evaluate their next steps.

[0013] An editing request 112 is provided to the prompt manager module 114. In some implementations, the editing request 112 is provided in the form of text input. Accordingly, the prompt manager module 114 queries the prompt pool 116 to obtain a prompt 118. In some implementations, the obtained prompt 118 is selected from the prompt pool 116 based on the editing request. The prompt pool 116 may include a database of predefined prompts, each of which may be associated with the editing capabilities of the computing system 100. For example, a prompt may include a basic description of a given tool, typical questions associated with the tool, a predefined input format to the tool, and/or possible intermediate steps when using the tool. The use of the prompt pool 116 offers several advantages. One advantage is the standardization of input. Another advantage is the flexibility to extend the editing toolset. For example, if a new tool is added to the editing capabilities of the computing system, a corresponding prompt may be added to the prompt pool 116.
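As a minimal sketch of the prompt lookup described in paragraph [0013], the following Python models a prompt pool keyed by tool. The disclosure does not say how a prompt is matched to an editing request, so the keyword-overlap scoring used here is only a stand-in, and the names `Prompt`, `PromptPool`, and `select` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Prompt:
    """One predefined prompt; the fields mirror the contents listed in [0013]."""
    tool_name: str
    description: str
    typical_questions: list = field(default_factory=list)
    input_format: str = ""


def _tokens(text):
    # Lowercase alphabetic words only, so punctuation does not block a match.
    return set(re.findall(r"[a-z]+", text.lower()))


class PromptPool:
    """Hypothetical in-memory prompt pool."""

    def __init__(self):
        self._prompts = {}

    def add(self, prompt):
        # Adding a new tool's prompt extends the toolset without other changes.
        self._prompts[prompt.tool_name] = prompt

    def select(self, editing_request):
        """Return the prompt sharing the most words with the request, or None."""
        wanted = _tokens(editing_request)

        def overlap(prompt):
            text = " ".join([prompt.description, *prompt.typical_questions])
            return len(wanted & _tokens(text))

        best = max(self._prompts.values(), key=overlap, default=None)
        if best is None or overlap(best) == 0:
            return None
        return best


pool = PromptPool()
pool.add(Prompt("music", "Recommend and embed background music in a video",
                ["Which genre?"], "genre: str"))
pool.add(Prompt("trim", "Trim a segment from a video",
                ["Which part?"], "start: float, end: float"))
```

A production system would more plausibly use embeddings or the LLM itself to pick a prompt; the point of the sketch is only that the pool standardizes input and can grow one entry at a time.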

[0014] The prompt manager module 114 fills the obtained prompt 118 with the editing request 112 and passes the result to the LLM agent 120. In some implementations, the computing system 100 includes a content asset analyzer 121 for processing media content 110 and generating metadata that can be provided as input to the LLM agent 120. For example, the content asset analyzer 121 may preprocess video content to extract individual frames, analyze the visual and audio content of the video content, and generate video metadata, which may include text descriptions of the analyzed visual and audio content, recognized entities, timestamps for key events, and video captions of the video content.

[0015] The LLM agent 120 includes an LLM prediction module 122 that utilizes the LLM 124 to perform inference predictions on received inputs. The LLM 124 can be implemented as a language model formed from a trained neural network with numerous parameters. The LLM 124 can be trained as a general-purpose model or for a limited range of tasks. For example, a media content editing architecture can be implemented with a single general-purpose trained LLM, or with multiple LLMs, each trained for different tasks. In some implementations, a set of LLMs, each trained for a specific range of tasks, is provided, and the LLM agent 120 selects the LLM to use based on the received prompt 118. By utilizing the prompt 118 along with the user's editing request 112, structure and context are provided to the input to the LLM 124. Therefore, since the input is somewhat predictable in terms of structure, the LLM 124 is able to provide more accurate inference predictions. The LLM agent 120 may be configured to provide an interactive text conversation with user 108, in which a dialog response 126 is generated using the LLM 124 and provided to user 108 in return via the dialog interface 106. User 108 can then provide new text input to advance the conversation. The conversation continues until the LLM agent 120 decides to end the conversation, which may be based on new text input and/or the current round number in the conversation. Once the conversation is ended, the LLM prediction module 122 generates an inference prediction using the LLM 124 based on the received text input.
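The conversation loop in paragraph [0015] can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the stop phrases, the `max_rounds` limit, and the function name `run_dialog` are all hypothetical, and `llm` is any callable standing in for the LLM 124.

```python
def run_dialog(user_inputs, llm, max_rounds=3, stop_phrases=("done", "go ahead")):
    """Collect user text until a stop phrase arrives or max_rounds is reached.

    Returns (transcript, replies): the user inputs gathered for the final
    inference prediction and the dialog responses sent back along the way.
    """
    transcript, replies = [], []
    for round_no, text in enumerate(user_inputs, start=1):
        transcript.append(text)
        # End on new text input (a stop phrase) and/or the current round number.
        if text.strip().lower() in stop_phrases or round_no >= max_rounds:
            break
        replies.append(llm(transcript))  # dialog response for this round
    return transcript, replies


def echo_llm(transcript):
    # Stub model used only for illustration; a real system would call the LLM.
    return f"Noted: {transcript[-1]}. Anything else?"
```

Once the loop ends, the collected `transcript` is what would be handed to the prediction step to generate editing actions.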

[0016] The LLM agent 120 includes an action planning and execution module 128 that analyzes inference predictions to create a list of editing actions. Possible editing actions may be selected from a tool database 130 that lists editing tools available for use by the computing system 100 when editing media content 110. The tool database 130 is provided in a backend service 132 that includes tools 134 and associated APIs 136. The tools for editing media content may include, but are not limited to, tools for adding, deleting, and/or modifying content of various modalities, such as text, images, video, audio, etc. For example, a tool may be implemented to embed a recording into video content. In some implementations, the added content is created using a generation process.

[0017] The action planning and execution module 128 executes a list of editing actions using appropriate API 136 calls to the tools 134 required to perform the editing actions. The editing actions are performed on the media content 110 provided by the user 108, and the edited media content 138 is provided back to the user 108 via a dialog-assisted editing interface 104, in which a rendering of the edited media content 138 is displayed to the user for viewing and deciding on the next action. For example, the user 108 may decide to revert the edits that were performed, provide a new editing request 112 for additional edits, or publish 140 the edited media content 138. Upon publication 140, a copy of the edited media content 138 may be stored on the content server 142. In the example shown in Figure 1, the published media content 144 is provided on the social network client 102 for viewing by other users 146 on the platform.
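The execute, render, and revert cycle in paragraph [0017] can be illustrated with a small non-destructive edit session. In this sketch the media content is reduced to a dictionary of clip properties, and the `TOOLS` table, `EditSession` class, and tool names are hypothetical stand-ins for the tools 134 and their APIs 136.

```python
# Hypothetical tool APIs: each takes a clip description and returns a new one.
TOOLS = {
    "trim": lambda clip, start, end: {**clip, "start": start, "end": end},
    "speed": lambda clip, factor: {**clip, "speed": clip["speed"] * factor},
}


class EditSession:
    """Keeps the original media untouched and replays an action list to render."""

    def __init__(self, media):
        self.original = media
        self.actions = []  # list of (tool_name, keyword_arguments)

    def apply(self, actions):
        self.actions.extend(actions)
        return self.render()

    def revert(self, count=1):
        # Drop the most recent actions; the original data is always recoverable.
        keep = max(len(self.actions) - count, 0)
        del self.actions[keep:]
        return self.render()

    def render(self):
        clip = dict(self.original)
        for name, kwargs in self.actions:
            clip = TOOLS[name](clip, **kwargs)
        return clip


session = EditSession({"start": 0, "end": 60, "speed": 1.0})
edited = session.apply([("trim", {"start": 5, "end": 20}),
                        ("speed", {"factor": 2.0})])
```

Storing actions rather than overwritten media is one simple way to get the reversible behavior the description attributes to non-linear editing.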

[0018] In some implementations, the media content editing architecture includes a system evolution process that improves its ability to propose and/or execute actions and edits more efficiently. Various types of feedback are available in the system evolution process. One example is direct user feedback (e.g., user 108 may rate how effective they believe the prompts and/or tools used when performing the edit were). Another example of feedback involves the use of conversation history and/or contextual information of successful submissions (e.g., published media content 144). Different reward functions may be used to determine the impact of a given improvement iteration. In the example shown in Figure 1, the computing system 100 includes a platform audience engagement aggregation module 148 for providing information about audience engagement metrics regarding published media content 144. Exemplary metrics include views/listens, comments, shares, likes, etc. Higher audience engagement metrics suggest more "successful" edited media content. Therefore, greater weight can be given to the information used in the improvement process related to published media content with higher audience engagement metrics. For example, when a predetermined audience engagement metric threshold (e.g., a predetermined number of video views within a predetermined time frame) is reached, an improvement process may be implemented for the published media content that has reached the threshold.
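One way to picture the thresholding and weighting in paragraph [0018] is shown below. The per-metric weights, the view threshold, and the names `Engagement`, `engagement_score`, and `training_weights` are assumptions made for illustration; the disclosure leaves the reward function open.

```python
from dataclasses import dataclass


@dataclass
class Engagement:
    views: int
    comments: int
    likes: int
    shares: int


def engagement_score(e, weights=(1.0, 5.0, 2.0, 10.0)):
    """Weighted sum of metrics; the weights here are arbitrary placeholders."""
    w_views, w_comments, w_likes, w_shares = weights
    return (w_views * e.views + w_comments * e.comments
            + w_likes * e.likes + w_shares * e.shares)


def training_weights(published, view_threshold):
    """Keep content that reached the view threshold and weight it by score.

    Returns {content_id: weight} with weights summing to 1, or {} if nothing
    qualifies.
    """
    eligible = {cid: engagement_score(e)
                for cid, e in published.items() if e.views >= view_threshold}
    total = sum(eligible.values())
    return {cid: score / total for cid, score in eligible.items()} if total else {}


published = {
    "a": Engagement(views=1000, comments=10, likes=100, shares=5),
    "b": Engagement(views=50, comments=0, likes=1, shares=0),
    "c": Engagement(views=3000, comments=0, likes=0, shares=0),
}
weights = training_weights(published, view_threshold=500)
```

Here item "b" falls below the threshold and is excluded, while "c" receives more weight than "a" because its score is higher.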

[0019] The improvement process can be performed on various modules within the architecture. In the example in Figure 1, the computing system 100 includes a prompt improvement module 150 for improving the prompt pool 116. The computing system 100 further includes an LLM fine-tuning module 152 for improving the LLM 124. Conversation history and/or contextual information of a given published media content can be used to improve the prompt pool 116 and/or LLM 124. For example, it is not practical to provide all editing options for a given prompt. Therefore, one set of available options is usually provided for a given prompt. Improvements to the prompt pool can affect the set of options provided to the user. By using a high audience engagement metric as a surrogate metric for the success of edited media content, the conversation history and/or contextual information related to the editing of the media content can be used to improve the options provided in a given prompt so that more "popular" options are provided. As a more specific example, in response to a user request to embed music in video content, the prompt may initially include options for different music genres. The prompt may later be improved to favor the more popular genre options, based on published content that showed a high audience engagement metric when edited to embed those genres of music. In this way, the computing system 100 can be continuously updated to better respond to user editing requests.

[0020] Figure 2 is a block diagram showing an aspect of a configuration example of the computing system 100 of Figure 1. Figure 2 shows an exemplary backend tool service 132 for providing editing tools and editing capabilities that can be used in the computing system 100. The exemplary backend tool service 132 provides backend support and functions that an LLM agent can use to perform edits on media content (for example, the action planning and execution module 128 can use the backend tool service 132 to perform edits on the media content 110).

[0021] The backend tool service 132 includes a repository of available editing tools and capabilities for a media content editing architecture. In some implementations, the tools are arranged and organized within groups. In another embodiment, the tools are organized within multiple levels of a hierarchy. Such an organization scheme enables a conversational interaction that presents the user with a practical number of options for a given selection. For example, instead of listing all available tools for the user to select from, by first providing groupings, the user's desired edits can be narrowed down and a more appropriate context can be provided.

[0022] In the example shown in Figure 2, the tools 134 are organized into groups 202. For example, music recommendation tool 134A and filter recommendation tool 134N are shown organized under the recommendation group 202. Other groups and classifications shown include understanding, describing, artificial intelligence (AI) generation, search and matching, localization, structural analysis, AI correction, and evaluation. Each grouping may include various editing capabilities across various modalities, such as images, videos, music, text, and audio. Exemplary capabilities for the understanding grouping may include content embedding and tagging. Exemplary capabilities for the localization grouping may include object detection, event detection, character recognition, object segmentation, event/scene detection, and beat/chorus/beginning detection. Exemplary capabilities for the describing grouping may include image captioning, video captioning, title generation, and text summarization. Exemplary capabilities for the structural analysis grouping may include slicing (shot boundary) and highlight detection. Exemplary capabilities for the AI generation grouping may include various generation processes for content generation, such as image generation, video generation, music generation, and video script generation. Exemplary capabilities for the AI correction grouping may include trimming, volume adjustment, voice modification, noise reduction, super-resolution, cropping, background removal, tone mapping, inpainting, video-audio synchronization, and speed curves. Exemplary capabilities for the search and matching grouping may include material search and material replacement. Exemplary capabilities for the recommendation grouping may include recommendation and application of various content, such as filters, music, titles, narrative speech, animations, special effects, stickers, and text (including different fonts, styles, animations, and positions). Exemplary capabilities for the evaluation grouping may include image quality, video quality, and music quality. It should be understood that the backend tool service 132 may include any number of groupings 202 using any classification scheme. Furthermore, each grouping 202 may include any number of tools 134, and these tools 134 may be further classified into subgroups in some examples.
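The grouping scheme described in paragraphs [0021] and [0022] lets a conversation offer a short list of choices at each step instead of every tool at once. A minimal sketch, assuming a nested dictionary as the hierarchy: the group and tool names are taken from the description, while the `audio` and `visual` subgroup names and the `options_at` helper are hypothetical.

```python
# Two of the groupings from the description, with made-up subgroups.
TOOL_TREE = {
    "recommendation": {
        "audio": ["music recommendation"],
        "visual": ["filter recommendation", "sticker recommendation"],
    },
    "AI correction": {
        "audio": ["noise reduction", "volume adjustment"],
        "visual": ["background removal", "super-resolution"],
    },
}


def options_at(tree, path=()):
    """Return the choices to present after the user has picked `path` so far."""
    node = tree
    for key in path:
        node = node[key]
    # Group levels are dicts (their keys are the choices); leaves are tool lists.
    return sorted(node)


top_level = options_at(TOOL_TREE)
audio_fixes = options_at(TOOL_TREE, ("AI correction", "audio"))
```

Each call returns only the options at the current level, which is what keeps the number of choices shown to the user practical.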

[0023] Each tool 134 includes information describing a callback API 136 that an LLM agent can call to have the editing tool execute an edit on media content. A collection of callback APIs 136 is aggregated within a tool API pool 130 that functions as a repository accessible by the LLM agent. For example, as shown in Figure 1, an action planning and execution module 128 utilizes the tool API pool 130 to execute the list of editing actions it generated, forming the edited media content 138.

[0024] For each tool 134, a corresponding prompt 204 is generated. A collection of prompts 204 is aggregated within a prompt pool 116, and in some examples, the prompt pool 116 can be accessed by a prompt manager during prompt acquisition. The prompt 204 can be formatted in various ways. In some implementations, the prompt 204 for a given tool includes a basic description of the tool, typical questions related to the tool, a defined input format to the tool, and/or intermediate steps when using the tool. The backend tool service 132 is dynamic in that tools 134 can be added and removed. When a new tool 134 is added, a corresponding prompt 204 is generated, added to the backend tool service 132, and consequently added to the prompt pool 116.
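To make the pairing in paragraphs [0023] and [0024] concrete, the sketch below keeps a tool API pool and a prompt pool in step as tools are added and removed. The class name, method names, and the text layout of the generated prompt are assumptions for illustration only.

```python
class BackendToolService:
    """Hypothetical registry pairing each tool's callback API with a prompt."""

    def __init__(self):
        self.api_pool = {}     # tool name -> callable invoked by the LLM agent
        self.prompt_pool = {}  # tool name -> prompt text used by the prompt manager

    def add_tool(self, name, description, input_format, api):
        self.api_pool[name] = api
        # A prompt is generated whenever a tool is registered.
        self.prompt_pool[name] = (
            f"Tool: {name}\n"
            f"Description: {description}\n"
            f"Input format: {input_format}"
        )

    def remove_tool(self, name):
        # Removing a tool also withdraws its prompt so the two pools stay aligned.
        self.api_pool.pop(name, None)
        self.prompt_pool.pop(name, None)


service = BackendToolService()
service.add_tool(
    "volume_adjustment",
    "Raise or lower the audio volume of a clip",
    "gain_db: float",
    api=lambda clip, gain_db: {**clip, "gain_db": clip.get("gain_db", 0.0) + gain_db},
)
louder = service.api_pool["volume_adjustment"]({"id": "clip-1"}, gain_db=3.0)
```

Because both pools are updated in one place, a newly added tool becomes visible to the prompt manager and callable by the agent at the same moment.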

[0025] Figure 3 is a block diagram showing an example configuration of the computing system 100 in Figure 1. Figure 3 illustrates an exemplary use of context memory 302 within the media content editing architecture available in the computing system 100. Directed connections are illustrated to show the relationships between components related to context memory 302. One direct method for editing media content 110 involves a direct command 303 given by the user via a dialog-assisted editing interface 104. The direct command 303 describes a specific editing action desired by the user and is formatted so that the media content editing architecture understands the given command without using the LLM 124 or the LLM agent 120. Thus, the direct command 303 allows the user to directly invoke editing tools from the backend tool service 132 to perform edits on the media content 110. For more advanced unstructured queries, context memory 302 can be used to store context information and guide the editing process.
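The split between direct commands and unstructured queries in paragraph [0025] amounts to a routing decision on the user's input. The disclosure does not define a command format, so the `/tool key=value` syntax, the `DIRECT_COMMAND` pattern, and the `route` function below are purely hypothetical.

```python
import re

# Assumed syntax: "/tool_name key=value key=value".
DIRECT_COMMAND = re.compile(r"^/(\w+)((?:\s+\w+=\S+)*)\s*$")


def route(user_input):
    """Return (kind, tool_name, arguments) for a piece of user input.

    Well-formed direct commands bypass the LLM; anything else is passed on
    as free text for the LLM agent to interpret.
    """
    match = DIRECT_COMMAND.match(user_input.strip())
    if match is None:
        return ("llm", None, {"text": user_input})
    args = dict(pair.split("=", 1) for pair in match.group(2).split())
    return ("direct", match.group(1), args)
```

For instance, `route("/trim start=5 end=20")` is classified as a direct call to a `trim` tool, while a request such as "make it feel more upbeat" is handed to the LLM path. Argument values are left as strings; converting them to the tool's expected types would be the caller's job.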

[0026] The context memory 302 can contain storage for various contextual information that the media content editing architecture can use for various purposes. For example, during media content editing by the action planning and execution module 128, an edit draft history 304 can be compiled based on a list of nonlinear edits and associated editing tools. The edit draft history 304 may include steps and edits for rendering the edited media content to the user via the dialog-assisted editing interface 104. The context memory 302 further comprises an editing context 306 that provides context to the tools and editing capabilities provided by the backend tool service 132.

[0027] Conversational interactions between the user and the LLM can also be memorized. In the example shown in Figure 3, the conversation history 308 is stored in the context memory 302. For example, a dialogue can be memorized when conversational input is provided to the LLM agent 120 and when the LLM prediction module 122 generates a dialogue response 126 using the LLM. The conversation history 308 can be used for various purposes. During the editing process, the conversation history 308 can provide information to the media content editing architecture to determine how many edits have been performed. In some implementations, the media content editing architecture is configured to suggest publishing the edited content 110 after a certain amount of conversational back-and-forth editing and/or multiple rounds of editing. Another use involves prompt suggestion based on previous interactions in the conversation. For example, a previous interaction in which the user rejected a suggested edit may be memorized in the conversation history 308, and the media content editing architecture may be configured to be less likely to provide a relevant prompt.

[0028] In some implementations, the conversation history 308 and/or edit draft history 304 may be used for training or improving the media content editing architecture, which may include improvements to the prompt pool and/or LLM. The conversation history 308 and/or edit draft history 304 may be stored for each user submission, and their contents may be used to improve the prompt pool and/or LLM. For example, published edited media content with a high audience engagement metric may be considered a training sample for the improvement process. The conversation history 308 and/or edit draft history 304 of the published edited media content may be used to improve the prompt pool and/or LLM so that the prompts and dialogue responses 126 associated with the published edited media content are more likely to appear in future interactions.

[0029] Figure 4 is a block diagram showing an example configuration of the computing system 100 in Figure 1. Figure 4 illustrates an exemplary system evolution and improvement application for a media content editing architecture usable in the computing system 100. In the example shown in Figure 4, an exemplary system evolution and improvement process is performed for the backend tool service 132 and LLM 124. Various processes can be used to improve the media content editing architecture. In some implementations, reinforcement learning algorithms, such as reinforcement learning from human feedback (RLHF) and proximal policy optimization (PPO), are implemented. For example, information recorded from a successful editing process 402 can be used to improve the prompts provided by LLM 124 and the backend tool service 132 to provide more relevant prompts and responses in future interactions with the user. Records of successful editing processes can be provided at various stages of the editing process. Various information, such as conversation history and contextual information (e.g., asset descriptions), can be recorded. Prompt/query and response pairs in such information can be used as training samples for the improvement process. The conversation history may include the results of both successful and less successful conversations. For example, as illustrated in Figure 3, the context memory 302 may be implemented to store various information about the editing process, such as the conversation history 308, the edit draft history 304, the editing context 306, and so on.

[0030] A "successful" editing process can be defined in various ways. In some implementations, the editing process is considered successful when the edited media content is published. At that point, information related to the editing process, such as conversation history 308 and edit draft history 304, is recorded. In other implementations, all editing process interactions with the user are recorded. However, this can generate a large amount of unnecessary data that has little bearing on whether prompts and tool suggestions were effective. In other implementations, the editing process of published edited media content is considered successful when a predetermined audience engagement threshold is reached.

[0031] In the example shown in Figure 4, information recorded about a given successful editing process is incorporated into user acceptance interaction 404 and user rejection interaction 406. Such interactions may include user responses to proposed editing and tool options provided by the prompt manager and / or LLM agent. Model 400 includes an editing experience pool 408 that aggregates the records of a successful editing process 402, including user accepted and rejected interactions 404, 406. The aggregated information in the editing experience pool 408 is available to the prompt improvement module 150 to improve the backend tool service 132. More specifically, the editing experience pool 408 can be used to improve prompts provided by the backend tool service 132. For example, prompts can be modified in accordance with information in the editing experience pool 408 that can be associated with user acceptance interaction 404 and user rejection interaction 406, respectively, describing efficient and inefficient prompts. User acceptance interaction 404 can provide context that suggests the accepted prompt is more likely to lead to a successful editing process. Therefore, similar prompts may be configured to be suggested more frequently for future interactions. Similarly, prompts associated with user rejection interaction 406 may be modified accordingly or configured to be suggested less frequently in other interactions.
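As a rough illustration of how accepted and rejected interactions might shift how often a prompt is suggested, the following Python sketch keeps one weight per prompt. The function name `update_prompt_weights`, the prompt identifiers, the step size, and the floor value are all hypothetical and are not taken from this disclosure; the prompt improvement module 150 could be realized quite differently.

```python
def update_prompt_weights(weights, interactions, step=0.1, floor=0.05):
    """Return new suggestion weights given (prompt_id, accepted) interactions.

    Accepted prompts are nudged up; rejected prompts are nudged down but
    never below `floor`, so they can still be suggested occasionally.
    All numeric values here are illustrative assumptions.
    """
    updated = dict(weights)  # leave the caller's mapping untouched
    for prompt_id, accepted in interactions:
        current = updated.get(prompt_id, 1.0)
        if accepted:
            updated[prompt_id] = current + step
        else:
            updated[prompt_id] = max(floor, current - step)
    return updated


# Hypothetical records aggregated from an editing experience pool.
weights = {"add_music": 1.0, "add_filter": 1.0}
interactions = [("add_music", True), ("add_music", True), ("add_filter", False)]
new_weights = update_prompt_weights(weights, interactions)
```

A prompt selector could then sample prompts in proportion to these weights, so that prompts tied to accepted interactions are suggested more often.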

[0032] Furthermore, aggregated information within the editing experience pool 408 may be used by the LLM fine-tuning module 152 to improve the LLM 124. Similar to the prompt improvement module 150, the LLM fine-tuning module 152 can improve the LLM 124 by utilizing information within the editing experience pool that describes efficient and inefficient interactions as positive and negative reinforcement data, respectively. In some implementations, the reward function is implemented to determine the extent to which the information within the editing experience pool influences the improvement process. Various reward models are possible. In the example shown in Figure 4, online performance data is used as the reward model 410 to improve the LLM 124. Online performance data for published edited media content can be quantified using various audience engagement metrics and indicators, such as views, likes, shares, and comments. A platform audience engagement aggregation module 148, for example illustrated and described in Figure 1, may be used to aggregate relevant audience engagement indicators for the published edited media content from the hosting service of the published edited media content. Such data can be fed into the reward model 410 to determine the weights of training samples (information within the editing experience pool 408) in the improvement process. Figure 4 shows how the evolution and improvement system uses online performance data as a reward model for improving LLM 124, but such a model can also be used for improving prompts in the backend tool service 132.
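One way to picture the reward model 410 turning engagement data into training-sample weights is sketched below in Python. The `Engagement` record, the logarithmic damping, and the coefficient values are assumptions made only for illustration; the disclosure does not specify a formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Engagement:
    views: int
    likes: int
    shares: int
    comments: int


def reward(e, coeffs=(1.0, 2.0, 3.0, 2.0)):
    """Scalar reward from engagement counts (coefficients are illustrative)."""
    counts = (e.views, e.likes, e.shares, e.comments)
    # log1p dampens very large counts so a viral item does not dominate.
    return sum(c * math.log1p(n) for c, n in zip(coeffs, counts))


def sample_weights(engagements):
    """Normalize rewards into per-sample training weights that sum to 1."""
    if not engagements:
        return []
    rewards = [reward(e) for e in engagements]
    total = sum(rewards)
    if total == 0:
        # No engagement anywhere: fall back to uniform weights.
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


low = Engagement(views=10, likes=1, shares=0, comments=0)
high = Engagement(views=10_000, likes=500, shares=50, comments=80)
training_weights = sample_weights([low, high])
```

Under these assumptions, the sample whose published content drew more engagement receives the larger weight in the improvement process.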

[0033] Figure 5 is a block diagram illustrating an exemplary media content editing model architecture with a system evolution and improvement process, usable with the computer system 100 of Figure 1. Figure 5 provides a detailed illustration of the pipeline flow of a conversational nonlinear editing process using the exemplary content editing model architecture. The process begins with a user 108 interacting with a dialogue-assisted editing interface 104 and providing media content 110 to be edited. The media content 110 may be any modality, including images, audio, video, etc. In some implementations, the media content 110 is generated by the exemplary media content editing model architecture via a generative AI process. The dialogue-assisted editing interface 104 can be implemented on any computing device. In some implementations, the dialogue-assisted editing interface 104 is provided within a social networking client, for example, the social networking client 102 shown in Figure 1. The social networking client may include various social networking platforms, such as short video social media platforms, as described above.

[0034] The dialog-assisted editing interface 104 provides an interface that allows user 108 to view media content 110 during the editing process, for example, while rendering the results of selected edits. Furthermore, the dialog-assisted editing interface 104 includes a dialog interface 106 on which text commands can be sent and received. The editing process includes user 108 providing an editing request 112 using the dialog interface 106. The editing request 112 is provided to the prompt manager module 114. Because the editing capabilities of the exemplary media content editing model architecture may include numerous editing tools, the prompt manager module 114 may be implemented to contribute to structuring and narrowing the editing request to a subset of the architecture's editing capabilities. Through prompt engineering, the prompt manager module 114 and the prompt acquisition module 502 operate to acquire prompts from the prompt pool 116. The acquired prompts are typically related to the editing request 112. For example, if the editing request 112 is related to music, the prompt acquisition module 502 can query the prompt pool 116 to acquire prompts related to music. In some implementations, the query provides a set of prompts with similar descriptions that match the edit request 112, and this set of prompts is combined with a fixed prompt about a tool related to the edit request 112 to form a new prompt.
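The prompt acquisition step can be sketched as a simple lookup. The Python example below scores pool entries by word overlap with the edit request and prepends a fixed tool prompt; the scoring rule, the pool contents, and the names `acquire_prompt`, `PROMPT_POOL`, and `FIXED_TOOL_PROMPTS` are hypothetical stand-ins, and a real system might use embedding similarity or another retrieval method instead.

```python
def acquire_prompt(edit_request, prompt_pool, fixed_tool_prompts, top_k=2):
    """Combine the best-matching pool prompts with fixed tool prompts."""
    request_words = set(edit_request.lower().split())
    scored = []
    for entry in prompt_pool:
        overlap = len(request_words & set(entry["description"].lower().split()))
        if overlap:
            scored.append((overlap, entry))
    # Highest overlap first; the sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected = [entry for _, entry in scored[:top_k]]

    # Collect each related tool once, preserving order of first appearance.
    tools = []
    for entry in selected:
        if entry["tool"] not in tools:
            tools.append(entry["tool"])

    parts = [fixed_tool_prompts[t] for t in tools if t in fixed_tool_prompts]
    parts += [entry["text"] for entry in selected]
    return "\n".join(parts)


# Hypothetical pool contents, for illustration only.
PROMPT_POOL = [
    {"tool": "music", "description": "add background music track",
     "text": "Ask which mood or genre the user wants for the track."},
    {"tool": "music", "description": "adjust music volume",
     "text": "Ask for a volume level between 0 and 100."},
    {"tool": "filter", "description": "apply color filter",
     "text": "Ask which color filter to apply."},
]
FIXED_TOOL_PROMPTS = {
    "music": "Tool: music. Input format: {track, volume}.",
    "filter": "Tool: filter. Input format: {name}.",
}

combined = acquire_prompt("add upbeat music to my video",
                          PROMPT_POOL, FIXED_TOOL_PROMPTS)
```

For the music-related request above, only the two music entries match, so the combined prompt consists of the fixed music tool prompt followed by those two entries.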

[0035] Generally, the prompt pool 116 contains at least one prompt corresponding to each editing capability. The prompt pool 116 can be implemented as a dynamic database where prompts can be added, deleted, and modified, thus providing flexibility in extending the architecture's editing toolset. For example, if a new tool is registered in the toolset, a corresponding prompt may also be added to the prompt pool 116. Prompts can be formatted in various ways. In some implementations, a prompt includes a basic description of the tool, typical questions related to the tool, a predefined input format for the tool, and / or possible intermediate steps when using the tool.
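A minimal Python sketch of such a dynamic prompt pool follows. The field names mirror the prompt contents listed above, but the class names `ToolPrompt` and `PromptPool`, the in-memory dictionary, and the example tool are assumptions for illustration; an actual implementation would likely sit on a database.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPrompt:
    tool_name: str
    description: str                       # basic description of the tool
    typical_questions: list = field(default_factory=list)
    input_format: str = ""                 # predefined input format
    intermediate_steps: list = field(default_factory=list)


class PromptPool:
    """In-memory stand-in for a prompt pool keyed by tool name."""

    def __init__(self):
        self._prompts = {}

    def add(self, prompt):
        self._prompts.setdefault(prompt.tool_name, []).append(prompt)

    def remove_tool(self, tool_name):
        # Removing an unregistered tool is a no-op.
        self._prompts.pop(tool_name, None)

    def prompts_for(self, tool_name):
        return list(self._prompts.get(tool_name, []))

    def tools(self):
        return sorted(self._prompts)


pool = PromptPool()
# Registering a new (hypothetical) tool adds its prompt to the pool.
pool.add(ToolPrompt(
    tool_name="add_music",
    description="Adds a background music track to a video.",
    typical_questions=["Which genre?", "How loud should it be?"],
    input_format='{"track": str, "volume": int}',
    intermediate_steps=["pick track", "set volume", "mix audio"],
))
```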

[0036] The edit request 112 and the acquired prompt can be supplied to the LLM prediction module 122 of the LLM agent 120, which uses the LLM 124 to perform inference prediction. The LLM agent 120 can be implemented as a text command transmitter / receiver providing conversational interaction with the user 108, and the LLM agent 120 uses the LLM 124 to predict the response to the text input it receives (edit request 112 and prompt). Since prompts are generally predefined, the LLM 124 can output structured results. In some implementations, the LLM 124 is a single general-purpose LLM. In other implementations, the LLM agent 120 can access a repository of LLMs, each trained for one or more specific tasks. In such cases, the choice of which LLM to use may be based on the edit request 112 and / or prompt.

[0037] The LLM agent 120 may be configured to convert the structured results from LLM 124 into a tool execution sequence and inputs for executing the tools. The LLM agent 120 includes an LLM output analysis unit 504 that obtains structured information within the predicted response by analyzing the predicted response from the LLM prediction module 122. The LLM agent 120 further includes an LLM action planning module 128A and an LLM tool execution module 128B. The LLM action planning module 128A and the LLM tool execution module 128B may be implemented similarly to the action planning and execution module 128 in Figure 1. The LLM action planning module 128A may be implemented to plan actions to be performed based on the structured information. Based on the planned actions, the LLM tool execution module 128B forms a toolchain and executes the toolchain using API calls from the tool API pool 130 for the tools in the toolchain. For open questions or complex requests, the LLM agent 120 can use the LLM 124 to perform self-search using the self-search module 506 and to generate multiple intermediate steps using the tool execution chain module 508. At each step, the LLM 124 can gradually approach the final answer using search or follow-up questions. Conversational back-and-forth text is possible. For example, the dialog response 126 and subsequent responses may be provided to the user 108 via the dialog interface 106 of the dialog-assisted editing interface 104. In some implementations, the dialog response is stored in the context memory 302, which records the conversation history 308.
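The analysis, planning, and execution stages can be sketched in Python as follows. The JSON response shape, the two toy tools, and the names `TOOL_API_POOL`, `parse_llm_output`, `plan_actions`, and `execute_toolchain` are all hypothetical; the disclosure does not define the structured output format or the tool APIs.

```python
import json


# Toy editing tools standing in for backend tool APIs (hypothetical).
def trim(media, start, end):
    return {**media, "start": start, "end": end}


def add_music(media, track):
    return {**media, "music": track}


TOOL_API_POOL = {"trim": trim, "add_music": add_music}


def parse_llm_output(raw):
    """Output analysis: extract the structured action list from the response."""
    return json.loads(raw)["actions"]


def plan_actions(actions):
    """Action planning: resolve each action to a callable plus arguments."""
    unknown = [a["tool"] for a in actions if a["tool"] not in TOOL_API_POOL]
    if unknown:
        raise ValueError(f"unknown tools: {unknown}")
    return [(TOOL_API_POOL[a["tool"]], a.get("args", {})) for a in actions]


def execute_toolchain(media, toolchain):
    """Tool execution: apply each tool in order to the media description."""
    for tool, args in toolchain:
        media = tool(media, **args)
    return media


# An assumed structured response from the language model.
raw = ('{"actions": ['
       '{"tool": "trim", "args": {"start": 0, "end": 15}}, '
       '{"tool": "add_music", "args": {"track": "upbeat"}}]}')
media = {"name": "clip.mp4"}
edited = execute_toolchain(media, plan_actions(parse_llm_output(raw)))
```

Validating tool names during planning, before anything runs, keeps a malformed model response from partially editing the media.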

[0038] When the toolchain is executed, it uses API calls to the tools within the toolchain to perform editing on the media content 110. The backend tool service 132 provides the editing capabilities and, as it executes the editing steps, stores those steps in the edit draft history 304 in context memory 302. The edited media content is provided to the user 108 via the dialog-assisted editing interface 104, and the user 108 can decide on the next set of actions. For example, the user 108 can decide to revert the edits, provide an additional editing request, or publish the edited media content 140.
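Recording each executed step is what makes a later revert possible. The Python sketch below models the edit draft history as a stack of drafts; the class name `EditDraftHistory`, its methods, and the string placeholders for media are illustrative assumptions rather than the structure used by the described architecture.

```python
class EditDraftHistory:
    """Stack of drafts: index 0 is the unedited media, the last is current."""

    def __init__(self, original):
        self._drafts = [original]
        self._steps = []

    @property
    def current(self):
        return self._drafts[-1]

    @property
    def steps(self):
        return list(self._steps)

    def record(self, step, result):
        """Store an executed editing step together with the draft it produced."""
        self._steps.append(step)
        self._drafts.append(result)

    def revert(self):
        """Undo the most recent step; the original draft is never removed."""
        if self._steps:
            self._steps.pop()
            self._drafts.pop()
        return self.current


history = EditDraftHistory("clip_v0")
history.record("trim 0-15s", "clip_v1")
history.record("add music", "clip_v2")
reverted = history.revert()
```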

[0039] Model 500 includes a system evolution architecture that enables an improvement process for the media content editing architecture. The improvement process can be implemented using similar components and methods as described in Figure 4. In the example shown in Figure 5, the prompt improvement module 150 and the LLM fine-tuning module 152 are implemented to improve prompts in the backend tool service 132 and LLM 124, respectively. Training data for the improvement process may include various contextual information stored during the editing process. For each submission (a set of interactions with the user 108 about a given media content 110), the conversation history 308 and contextual information, such as asset descriptions, may be stored in the context memory 302. This conversation history may include the results of both successful and unsuccessful conversations. In some implementations, only successful submissions are saved. A “successful” submission can be defined in various ways. For example, a submission may be considered successful when the edited media content is published 140.

[0040] When the edited media content is published, a record of the successful editing process 402 is obtained. Such a record may include context data stored during the editing process for the edited media content, for example, data stored in context memory 302. In some examples, the context data may be separated into user acceptance interactions 404 and user rejection interactions 406. Such interactions may include user responses to proposed editing and tool options. The editing experience pool 408 aggregates the context data, which is then used by the prompt improvement module 150 and the LLM fine-tuning module 152 to improve the backend tool service 132 and LLM 124, respectively.

[0041] The reward model 410 can be implemented to assign different weights to the training data. Rewards can be based on various criteria. In the example shown in Figure 5, online performance data in the form of an audience engagement metric is used as the reward function. Higher audience engagement indicates a higher reward for the training data (contextual data) that generated the published edited content. The audience engagement metric may include various metrics related to the online performance data of the published edited media content. Exemplary metrics include views, likes, shares, and comments. The platform audience engagement aggregation module 148, for example illustrated and described in Figure 1, can be used to aggregate relevant audience engagement metrics for the published edited media content from the hosting service of the published edited media content. In some implementations, the reward model 410 is also applied to the prompt improvement module 150.

[0042] Figure 6 is a flowchart illustrating an exemplary method 600 for a media content editing process utilizing machine learning techniques. Such a method may be performed on a media content editing architecture, for example, one illustrated and described in Figure 5. In step 602, method 600 includes receiving media content from a user. Various types of media content and modalities are available. For example, media content may be images, audio recordings, or videos. Media content may be provided by the user, for example, by an upload process. In some implementations, media content is provided by a generative AI process.

[0043] In step 604, method 600 includes receiving an edit request from a user regarding the media content. The edit request may be received from the user by using a dialog-assisted editing interface. Typically, the edit request is received in the form of text input. The edit request may be received using a prompt manager module. The edit request may include a request to revert previous edits made to the media content. In some implementations, the edit request may be a direct command in a structured format that allows direct access to editing tools by the media content editing architecture.

[0044] In step 606, method 600 includes editing media content based on an editing request to produce edited media content. Editing media content can be performed using various processes. Substeps 606A to 606C describe one such process. In substep 606A, method 600 includes obtaining a prompt from a prompt pool. The prompt can be obtained using a prompt manager module. The prompt pool may contain multiple prompts, each corresponding to at least one editing tool.

[0045] In substep 606B, method 600 includes parsing the acquired prompt and the edit request using a large language model to generate one or more edit actions to be performed on the media content. An LLM agent can be used to receive the input and feed the input to the large language model. The use of prompts can enable more structured input, which in turn allows the large language model to provide a more consistent response. The large language model may be configured to parse the input and generate the one or more edit actions in the form of an action tool list.

[0046] In substep 606C, method 600 includes performing one or more editing actions on media content to produce edited media content. Performing editing actions may include using API calls to the corresponding editing tool. The API may be obtained from the tool API pool.

[0047] In step 608, method 600 optionally includes publishing the edited media content. The edited media content may be published on various platforms. For example, the edited media content may be published on a short video social media network.

[0048] Figure 7 is a flowchart illustrating an exemplary method 700 for improving a media content editing architecture. Improving the media content editing architecture can be achieved using a system evolutionary architecture, for example, one illustrated and described in Figure 4. In step 702, method 700 includes editing media content using a media content editing architecture, for example, one illustrated and described in Figure 5. The method described in Figure 6 may be used to edit the media content. Various types of media content and modalities are available. For example, the media content may be images, audio recordings, or videos. The media content editing architecture may include a large language model and backend tool services. The backend tool services may include a prompt pool.

[0049] In step 704, method 700 includes publishing the edited media content. The edited media content may be published on various platforms. For example, the edited media content may be published on a short video social media network.

[0050] In step 706, method 700 includes storing contextual information related to the editing of media content. Examples of contextual information include conversation history, editing context, and edit draft history. In some implementations, contextual information includes an asset description of the edited media content. In some implementations, contextual information is stored in context memory. Contextual information can be used for various purposes. During the editing process, contextual information provides awareness of past actions in the editing process that may influence the dialog responses of the media content editing architecture. For example, if contextual information includes a conversation history in which a user rejected a given proposed edit, the media content editing architecture may be configured not to propose that edit again during the same editing process. Another use of contextual information includes improving the media content editing architecture.
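The rejected-edit example can be illustrated with a short Python sketch. The conversation-history record shape (`edit` and `user_response` keys) and the function name `filter_proposals` are assumptions introduced here for illustration only.

```python
def filter_proposals(candidates, conversation_history):
    """Drop candidate edits that the user already rejected in this session."""
    rejected = {
        turn["edit"]
        for turn in conversation_history
        if turn.get("user_response") == "rejected"
    }
    return [c for c in candidates if c not in rejected]


# Hypothetical conversation history for one editing process.
conversation_history = [
    {"edit": "add_sepia_filter", "user_response": "rejected"},
    {"edit": "trim_intro", "user_response": "accepted"},
]
candidates = ["add_sepia_filter", "add_music", "trim_intro"]
remaining = filter_proposals(candidates, conversation_history)
```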

[0051] In step 708, method 700 includes improving the media content editing architecture using stored contextual information. Improving the media content editing architecture may include improving the prompt pool and / or the large language model. In some implementations, the stored contextual information includes conversation history categorized into user acceptance interactions and user rejection interactions, and improving the media content editing architecture includes improving the prompt pool based on those interactions. For example, prompts in the prompt pool may be improved to favor editing actions corresponding to user acceptance interactions over those corresponding to user rejection interactions. The improvement process may include using an audience engagement metric associated with the published edited media content as a reward function. Exemplary audience engagement metrics include views, likes, shares, and comments. In some implementations, the improvement process is executed when a threshold of a given audience engagement metric (e.g., a given number of video views within a given time frame) is reached.
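The threshold condition in the last sentence can be sketched as a simple check on view timestamps. In the Python example below, the function name `threshold_reached`, the 24-hour window, and the view counts are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def threshold_reached(view_times, published_at, min_views, window):
    """True if at least `min_views` views fall within `window` of publication."""
    deadline = published_at + window
    in_window = sum(1 for t in view_times if published_at <= t <= deadline)
    return in_window >= min_views


published = datetime(2024, 6, 28, 12, 0)
# Views at 1, 2, 5, and 30 hours after publication.
views = [published + timedelta(hours=h) for h in (1, 2, 5, 30)]

trigger = threshold_reached(views, published, min_views=3,
                            window=timedelta(hours=24))
```

Here three of the four views fall inside the first 24 hours, so an improvement run would be triggered under a three-view threshold but not under a four-view one.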

[0052] Media content editing architectures can be designed to provide general users with intuitive editing tools and experiences. By using an LLM in combination with various editing tools, users can perform powerful editing on media content without requiring advanced software knowledge. Such an architecture can receive user input as unstructured text, predict desired editing requests using prompts and natural language processing techniques, and carry out the predicted edits using a pool of available editing tools. Other implementations may involve improvements to such a technical architecture. By utilizing online performance data of published edited content, the system can evolve and improve itself without requiring the costly and labor-intensive training processes of traditional LLMs.

[0053] In some embodiments, the methods and processes described herein may be linked to a computing system of one or more computing devices. Specifically, such methods and processes may be implemented as computer application programs or services, application programming interfaces, libraries, and / or other computer program products.

[0054] Figure 8 schematically illustrates a non-limiting embodiment of a computing system 800 capable of implementing one or more of the methods and processes described above. The computing system 800 is shown in a simplified form. The computing system 800 may take the form of one or more personal computers, server computers, tablet computers, home entertainment computers, network computing devices, game devices, mobile computing devices, mobile communication devices (e.g., smartphones) and / or other computing devices, and wearable computing devices, such as smart watches and head-mounted augmented reality devices.

[0055] The computing system 800 comprises a logical processor 802, a volatile memory 804, and a non-volatile storage device 806. The computing system 800 may optionally include a display subsystem 808, an input subsystem 810, a communication subsystem 812, and / or other components not shown in Figure 8.

[0056] The logical processor 802 includes one or more physical devices configured to execute instructions. For example, the logical processor may be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical structures. Such instructions may be implemented to perform tasks, realize data types, convert the state of one or more components, achieve technical effects, or otherwise achieve desired results.

[0057] A logic processor may include one or more physical processors (hardware) configured to execute software instructions. Additionally or alternatively, a logic processor may include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. The processors of logic processor 802 may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and / or distributed processing. Individual components of the logic processor may optionally be distributed across two or more separate devices located remotely and / or configured for coordinated processing. Embodiments of the logic processor may be virtualized and executed by remotely accessible, network-connected computing devices configured in a cloud computing setup. In such cases, it should be understood that these virtualized embodiments run on different physical logic processors on various different machines.

[0058] The non-volatile storage device 806 includes one or more physical devices configured to hold instructions executable by a logical processor in order to implement the methods and processes described herein. When such methods and processes are implemented, the state of the non-volatile storage device 806 may be transformed, for example, to hold different data.

[0059] The non-volatile storage device 806 may include removable and / or built-in physical devices. The non-volatile storage device 806 may include optical memory (e.g., CD, DVD, HD-DVD, etc.), semiconductor memory (e.g., ROM, EPROM, EEPROM, flash memory, etc.), and / or magnetic memory (e.g., hard disk drive, floppy disk drive, tape drive, MRAM, etc.), or other mass storage technology. The non-volatile storage device 806 may include non-volatile, dynamic, static, read / write, read-only, sequential access, position-addressable, file-addressable, and / or content-addressable devices. It will be understood that the non-volatile storage device 806 is configured to retain instructions even when power to the non-volatile storage device 806 is cut off.

[0060] The volatile memory 804 may include a physical device that has random access memory. The volatile memory 804 is typically used by the logical processor 802 to temporarily store information during the processing of software instructions. If power to the volatile memory 804 is cut off, it is expected that the volatile memory 804 will not continue to store instructions.

[0061] The logic processor 802, the volatile memory 804, and the non-volatile storage device 806 may be integrated together into one or more hardware logic components. Such hardware logic components may include, for example, field-programmable gate arrays (FPGAs), program-and-application-specific integrated circuits (PASICs / ASICs), program-and-application-specific standard products (PSSPs / ASSPs), system-on-a-chips (SOCs), and complex programmable logic devices (CPLDs).

[0062] The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 800 that is typically implemented in software, in which a processor uses a portion of volatile memory to perform a particular function. Performing the function involves transformative processing that configures the processor to perform the function. Thus, a module, program, or engine may be instantiated using a portion of volatile memory 804 via a logic processor 802 that executes instructions held by the non-volatile storage device 806. It will be understood that different modules, programs, and / or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Similarly, the same module, program, and / or engine may be instantiated from different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.

[0063] The display subsystem 808, if included, may be used to present a visual representation of the data held by the non-volatile storage device 806. This visual representation may take the form of a graphical user interface (GUI). Because the methods and processes described herein modify the data held by the non-volatile storage device and thereby transform its state, the state of the display subsystem 808 may also be transformed to visually represent the changes in the underlying data. The display subsystem 808 may include one or more display devices utilizing substantially any type of technology. Such display devices may be combined with the logical processor 802, volatile memory 804 and / or non-volatile storage device 806 within a shared enclosure, or such display devices may be peripheral display devices.

[0064] The input subsystem 810, if included, may comprise or interface with one or more user input devices, such as a keyboard, mouse, touchscreen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) components. These components may be integrated or peripheral, and the transmission and / or processing of input actions may be handled onboard or offboard. Exemplary NUI components may include a microphone for speech and / or voice recognition, an infrared camera, color camera, stereo camera, and / or depth camera for machine vision and / or gesture recognition, a head tracker, eye tracker, accelerometer and / or gyroscope, and / or any other suitable sensors for motion detection and / or intent recognition.

[0065] The communication subsystem 812, if included, may be configured to connect the various computing devices described herein to each other or to other devices in a communicative manner. The communication subsystem 812 may include wired and / or wireless communication devices compatible with one or more different communication protocols. In a non-limiting example, the communication subsystem may be configured to communicate over a wireless telephone network or a wired or wireless local or wide area network. In some embodiments, the communication subsystem may enable the computing system 800 to send and receive messages with other devices over a network such as the Internet.

[0066] The following paragraphs provide further explanation of the subject matter of this disclosure. One embodiment provides a method for editing media content, the method comprising: receiving media content from a user; receiving an edit request for the media content from the user; obtaining a prompt selected based on the edit request from a prompt pool; parsing the obtained prompt and the edit request using a large language model to generate one or more edit actions to be performed on the media content; and editing the media content based on the edit request by performing the one or more edit actions on the media content to generate edited media content. In this embodiment, further or alternatively, performing the one or more edit actions includes performing an application programming interface call provided by a backend tool service including a plurality of editing tools, each application programming interface call corresponding to a respective one of the plurality of editing tools. In this embodiment, further or alternatively, each editing tool in the plurality of editing tools corresponds to one or more prompts in the prompt pool. In this embodiment, further or alternatively, the plurality of editing tools are organized into a plurality of groupings, and the prompt pool is generated at least partially based on the plurality of groupings. In this embodiment, further or alternatively, the method further includes rendering and displaying the edited media content to the user and receiving a second edit request. In this embodiment, further or alternatively, the second edit request includes a request to revert one or more edit actions that have been performed. In this embodiment, further or alternatively, the method further includes storing contextual information related to the editing of the media content. In this embodiment, further or alternatively, the contextual information includes one or more of conversation history, editing context, or edit draft history. In this embodiment, further or alternatively, the method further includes improving the prompt pool based on the contextual information. In this embodiment, further or alternatively, editing the media content further includes providing the user with a dialogue response generated by the large language model in response to the acquired prompt and the edit request, and receiving a dialogue reply from the user in response to the dialogue response. Furthermore, in this embodiment, a non-transitory computer-readable medium is provided that stores instructions which, when executed by a computing device, cause the computing device to implement the method described herein.

[0067] Another embodiment provides a computing device for editing media content, the computing device comprising a processor and memory of the computing device, wherein the processor is configured to edit the media content and produce the edited media content by executing a program using a portion of the memory, receiving media content from a user, receiving an editing request for the media content from the user, obtaining a prompt selected based on the editing request from a prompt pool, parsing the obtained prompt and the editing request using a large language model to generate one or more editing actions to be performed on the media content, and performing the one or more editing actions on the media content to produce the edited media content. In this embodiment, further or alternatively, performing the one or more editing actions includes performing an application programming interface call provided by a backend tool service including multiple editing tools, each application programming interface call corresponding to each of the multiple editing tools. In this embodiment, further or alternatively, each editing tool within the plurality of editing tools corresponds to one or more prompts in the prompt pool, the plurality of editing tools are organized into a plurality of groupings, and the prompt pool is generated based on at least the plurality of groupings. In this embodiment, further or alternatively, the processor is further configured to store contextual information related to editing the media content, the contextual information includes one or more of conversation history, editing context, or editing draft history. In this embodiment, further or alternatively, editing the media content further includes providing the user with a dialog response generated by the large language model in response to the acquired prompts and the editing request, and receiving a dialog response from the user in response to the dialog response.

[0068] Another embodiment provides a computing system for editing media content, the computing system comprising a display, a prompt pool, a plurality of editing tools, a backend tool service having a plurality of application programming interfaces, each corresponding to one of the plurality of editing tools, and a computing device comprising a processor and memory, wherein the processor is configured to edit the media content based on an editing request and generate edited media content by: receiving the media content from a user; receiving the editing request for the media content from the user; obtaining, from the prompt pool, a prompt selected based on the editing request; parsing the obtained prompt and the editing request using one or more large language models to generate one or more editing actions to be performed on the media content; performing the one or more editing actions on the media content by calling at least one of the plurality of application programming interfaces to generate the edited media content; and rendering and displaying the edited media content via a dialog-assisted editing interface using the display. In this embodiment, further or alternatively, the one or more large language models comprise a plurality of large language models, each trained for at least one task, and the processor is configured to select a large language model from the plurality of large language models to parse the obtained prompt and the editing request. In this embodiment, further or alternatively, each editing tool in the plurality of editing tools corresponds to one or more prompts in the prompt pool, the plurality of editing tools are organized into a plurality of groupings, and the prompts in the prompt pool are generated at least partially based on the plurality of groupings. In this embodiment, further or alternatively, the processor is further configured to store contextual information relating to the editing of the media content, the contextual information including one or more of conversation history, editing context, or editing draft history.
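The relationship between tool groupings, the prompt pool, and model selection might be sketched as follows, purely as an illustration. The grouping names, tool names, prompt wording, model identifiers, and the functions `build_prompt_pool` and `select_model` are all assumptions introduced for this example; the disclosure does not specify how prompts are worded or how a model is chosen.

```python
# Hypothetical groupings of editing tools; the names are illustrative only.
TOOL_GROUPINGS = {
    "video": ["trim", "crop"],
    "audio": ["denoise"],
    "text": ["add_caption"],
}

# Hypothetical task-specific models, identified here by placeholder names.
TASK_MODELS = {"video": "video-edit-llm", "audio": "audio-edit-llm"}
DEFAULT_MODEL = "general-llm"


def build_prompt_pool(groupings):
    """Create a list of prompts per tool, derived from the tool's grouping."""
    pool = {}
    for group, tools in groupings.items():
        for tool in tools:
            pool.setdefault(tool, []).append(
                f"[{group}] Use the {tool} tool when the request concerns {group} edits."
            )
    return pool


def select_model(task, models=TASK_MODELS, default=DEFAULT_MODEL):
    """Choose the model trained for the task, falling back to a general model."""
    return models.get(task, default)


pool = build_prompt_pool(TOOL_GROUPINGS)
model_for_crop = select_model("video")    # a task-specific model exists
model_for_caption = select_model("text")  # no specific model, so the default is used
```

Because the pool maps each tool to a list, a tool can correspond to more than one prompt, and adding a tool to a grouping extends the pool without touching the selection logic.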

[0069] It will be understood that the configurations and/or approaches described herein are illustrative in nature and are subject to numerous modifications; therefore, these specific embodiments or examples should not be considered in a limiting sense. The specific routines or methods described herein may represent one or more of any number of processing strategies. For this reason, the illustrated and/or described operations may be performed in the order illustrated and/or described, in any other order, or in parallel, or may be omitted. Similarly, the order of the processes described above may be changed.

[0070] The subject matter of this disclosure includes novel and non-obvious combinations and subcombinations of the various processes, systems, configurations, and other features, functions, operations, and/or characteristics disclosed herein, as well as all equivalents thereof.

Claims

1. A method for editing media content, the method comprising: receiving media content from a user; receiving an editing request for the media content from the user; and editing the media content based on the editing request to generate edited media content by: obtaining, from a prompt pool, a prompt selected based on the editing request; parsing the obtained prompt and the editing request using a large language model to generate one or more editing actions to be performed on the media content; and performing the one or more editing actions on the media content to generate the edited media content.

2. The method according to claim 1, wherein performing the one or more editing actions includes making application programming interface calls provided by a backend tool service that includes a plurality of editing tools, each application programming interface call corresponding to one of the plurality of editing tools.

3. The method according to claim 2, wherein each editing tool in the plurality of editing tools corresponds to one or more prompts in the prompt pool.

4. The method according to claim 2, wherein the plurality of editing tools are organized into a plurality of groupings, and the prompt pool is generated at least partially based on the plurality of groupings.

5. The method according to claim 1, further comprising: rendering and displaying the edited media content to the user; and receiving a second editing request.

6. The method according to claim 5, wherein the second editing request includes a request to revert one or more editing actions that have been performed.

7. The method according to claim 1, further comprising storing contextual information related to the editing of the media content.

8. The method according to claim 7, wherein the contextual information includes one or more of conversation history, editing context, or editing draft history.

9. The method according to claim 8, further comprising improving the prompt pool based on the contextual information.

10. The method according to claim 1, wherein editing the media content further includes: providing the user with a dialog response generated by the large language model in response to the obtained prompt and the editing request; and receiving a dialog response from the user in reply to the generated dialog response.

11. A computing device for editing media content, the computing device comprising a processor and memory, wherein the processor is configured to execute a program using a portion of the memory to edit the media content based on an editing request and generate edited media content by: receiving the media content from a user; receiving the editing request for the media content from the user; obtaining, from a prompt pool, a prompt selected based on the editing request; parsing the obtained prompt and the editing request using a large language model to generate one or more editing actions to be performed on the media content; and performing the one or more editing actions on the media content to generate the edited media content.

12. The computing device according to claim 11, wherein performing the one or more editing actions includes making application programming interface calls provided by a backend tool service that includes a plurality of editing tools, each application programming interface call corresponding to one of the plurality of editing tools.

13. The computing device according to claim 12, wherein each editing tool in the plurality of editing tools corresponds to one or more prompts in the prompt pool, the plurality of editing tools are organized into a plurality of groupings, and the prompt pool is generated based at least on the plurality of groupings.

14. The computing device according to claim 11, wherein the processor is further configured to store contextual information related to the editing of the media content, the contextual information including one or more of conversation history, editing context, or editing draft history.

15. The computing device according to claim 11, wherein editing the media content further includes: providing the user with a dialog response generated by the large language model in response to the obtained prompt and the editing request; and receiving a dialog response from the user in reply to the generated dialog response.

16. A computing system for editing media content, the computing system comprising: a display; a backend tool service comprising a prompt pool, a plurality of editing tools, and a plurality of application programming interfaces, each corresponding to one of the plurality of editing tools; and a computing device comprising a processor and memory, wherein the processor is configured to execute a program using a portion of the memory to edit the media content based on an editing request and generate edited media content by: receiving the media content from a user; receiving the editing request for the media content from the user; obtaining, from the prompt pool, a prompt selected based on the editing request; parsing the obtained prompt and the editing request using one or more large language models to generate one or more editing actions to be performed on the media content; performing the one or more editing actions on the media content by calling at least one of the plurality of application programming interfaces to generate the edited media content; and rendering and displaying the edited media content via a dialog-assisted editing interface using the display.

17. The computing system according to claim 16, wherein the one or more large language models include a plurality of large language models, each trained for at least one task, and the processor is configured to select a large language model from the plurality of large language models to parse the obtained prompt and the editing request.

18. The computing system according to claim 16, wherein each editing tool in the plurality of editing tools corresponds to one or more prompts in the prompt pool, the plurality of editing tools are organized into a plurality of groupings, and the prompts in the prompt pool are generated at least partially based on the plurality of groupings.

19. The computing system according to claim 16, wherein the processor is further configured to store contextual information related to the editing of the media content, the contextual information including one or more of conversation history, editing context, or editing draft history.

20. A non-transitory computer-readable medium for editing media content, the medium comprising instructions that, when executed by a computing device, cause the computing device to implement the method according to claim 1.