Script visualization method and device based on multi-agent cooperation, computer device, storage medium and computer program product

By deconstructing the script into structured information through a multi-agent collaborative architecture, a professional audiovisual design scheme is generated, which solves the problem of poor quality when converting text scripts into videos and achieves high-quality video generation and script restoration.

CN122309632A · Pending · Publication Date: 2026-06-30 · GUANGZHOU XINGHUO SHENZHI ANIMATION CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU XINGHUO SHENZHI ANIMATION CO LTD
Filing Date
2026-04-02
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

When converting text scripts into video content using existing technologies, there are problems such as poor visualization quality, chaotic narrative logic, and inconsistent visual style. This is mainly due to the lack of narrative understanding ability and professional collaboration mechanism in the underlying generation model.

Method used

By constructing a multi-agent collaborative architecture, including a script analysis agent, a video creation agent, and a quality inspection agent, the script is deconstructed into structured information layer by layer, generating professional audiovisual design schemes. These schemes are then converted into executable prompts for the video generation model by the prompt engineering agent, ultimately generating high-quality videos.

Benefits of technology

It significantly improves the quality of video generation and the fidelity to the script, solves problems such as poor visualization quality, chaotic narrative logic, and inconsistent visual style, and achieves high-quality video generation.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a script visualization method, apparatus, computer device, storage medium, and computer program product based on multi-agent collaboration. The method involves acquiring an original script and parsing it with a script analysis agent to obtain script structured information. Multiple video creation agents collaboratively generate an audiovisual design scheme for the original script based on this structured information. A prompt word engineering agent generates video generation prompts based on the audiovisual design scheme. Finally, a video generation model is invoked to generate the video based on the video generation prompts. This method improves the visualization quality of text scripts.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a script visualization method, apparatus, computer device, storage medium and computer program product based on multi-agent collaboration.

Background Technology

[0002] With the rapid development of generative artificial intelligence technology, using AI models to automatically convert text scripts into visual video content has become an important area of exploration for lowering the threshold of visual content creation and improving production efficiency.

[0003] Currently, most methods for converting text-based scripts into video content directly utilize text-to-image or image-to-video models. This involves creating prompts for key passages in the script, generating visuals segment by segment, and then stitching them together to form video clips. The core logic of these methods relies on the single-point generation capabilities of the underlying generative model, completing the transformation from text to visuals through a linear process of "script → prompts → visuals → video." However, when applied to scripts with complete narrative structures, the underlying generative model can only react to the input prompts and lacks the ability to understand and plan the script as a whole. This can lead to incoherent video content, resulting in poor video quality from the script visualization.

[0004] Therefore, the existing technology suffers from the technical problem of poor visualization quality for text scripts.

Summary of the Invention

[0005] Based on this, the purpose of this application is to solve at least one of the above-mentioned technical defects, in particular the poor visualization quality of text scripts in the prior art. This application provides a script visualization method, apparatus, computer device, storage medium, and computer program product based on multi-agent collaboration that can improve the visualization quality of text scripts.

[0006] In a first aspect, this application provides a script visualization method based on multi-agent collaboration, including:

[0007] The original script is obtained and parsed by a script analysis agent to obtain the script's structured information.

[0008] Multiple video creation agents collaboratively generate audiovisual design schemes for the original script based on the script's structured information.

[0009] A prompt word engineering agent generates video generation prompts based on the audiovisual design scheme.

[0010] Based on the video generation prompts, a video generation model is invoked to generate the video.

[0011] In one exemplary embodiment, the script structured information includes global information and scene information; the script analysis agent parses the original script to obtain the script structured information, including:

[0012] The script analysis agent splits the original script into scenes to obtain scene units, and extracts the scene structured information of each scene unit as scene information.

[0013] The script analysis agent aggregates the structured information of each scene unit to generate global character profiles, global scene profiles, global narrative structure, and global emotion curves as global information.

[0014] In one exemplary embodiment, the video creation intelligent agent includes a director intelligent agent, a storyboard design intelligent agent, an art director intelligent agent, and a cinematographer intelligent agent; the video creation intelligent agents collaboratively generate an audiovisual design scheme for the original script based on the script's structured information, including:

[0015] The director agent retrieves matching related narrative structure knowledge from a pre-built narrative structure knowledge base based on global information, and generates a director's plan based on global information and related narrative structure knowledge.

[0016] The storyboard design agent generates storyboard schemes based on scene information and director's plans.

[0017] The art director intelligent agent generates art plans based on storyboard schemes and global information;

[0018] The photography director intelligent agent generates a photography plan based on the director's plan, storyboard plan, and art plan;

[0019] Based on the director's plan, storyboard plan, art plan, and cinematography plan, an audiovisual design plan is generated for the original script.

[0020] In one exemplary embodiment, the video creation agent further includes a quality inspection agent; based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme, it generates an audiovisual design scheme for the original script, including:

[0021] The quality inspection agent conducts quality reviews of the director's plan, storyboard plan, art plan, and cinematography plan respectively.

[0022] If a review fails, the quality inspection agent feeds the identified problems back to the corresponding video creation agent. That video creation agent makes corrections based on the feedback, and the quality inspection agent reviews the corrected result. This process repeats until the director's plan, storyboard plan, art plan, and cinematography plan all meet the preset quality standards or the preset maximum number of iterations is reached.

[0023] Based on the director's plan, storyboard plan, art plan, and cinematography plan that meet the preset quality standards, an audiovisual design plan is generated for the original script.
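
The review-and-correct loop of [0021] to [0023] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: `review`, `revise`, `MAX_ITERATIONS`, and the plain-string plans are hypothetical stand-ins for the quality inspection agent, the video creation agents, the preset maximum number of iterations, and the structured plan documents.

```python
from typing import Callable, Dict, List

MAX_ITERATIONS = 3  # hypothetical value for the preset maximum number of iterations


def review_until_pass(
    plans: Dict[str, str],
    review: Callable[[str, str], List[str]],
    revise: Callable[[str, str, List[str]], str],
    max_iterations: int = MAX_ITERATIONS,
) -> Dict[str, str]:
    """Review every plan; send failed plans back for correction until all
    plans pass or the iteration limit is reached."""
    plans = dict(plans)  # work on a copy so the caller's dict is untouched
    for _ in range(max_iterations):
        # An empty issue list means the plan passed the review.
        issues = {name: review(name, text) for name, text in plans.items()}
        failed = {name: found for name, found in issues.items() if found}
        if not failed:
            break
        for name, found in failed.items():
            plans[name] = revise(name, plans[name], found)
    return plans


# Toy stand-ins: a plan "passes" once it states a style.
def toy_review(name: str, text: str) -> List[str]:
    return [] if "style:" in text else ["missing style"]


def toy_revise(name: str, text: str, found: List[str]) -> str:
    return text + " style: watercolor"


result = review_until_pass(
    {"director": "three-act pacing", "art": "style: ink wash"},
    toy_review,
    toy_revise,
)
```

In this toy run only the director's plan fails the first review, so only that plan is sent back for correction; the art plan is left unchanged.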

[0024] In an exemplary embodiment, the storyboard scheme includes multiple storyboard units; each storyboard unit corresponds to a shot description; the video generation prompts include drawing prompts, video prompts, and camera movement parameters for each storyboard unit; based on the video generation prompts, a video generation model is invoked to generate a video, including:

[0025] Based on the drawing prompts for each storyboard unit, the image generation model is invoked to generate a keyframe image for each storyboard unit according to the shot description of that storyboard unit;

[0026] Based on the video prompts and camera movement parameters of each storyboard unit, the video generation model is invoked to generate a video clip for each storyboard unit, using the keyframe image of that storyboard unit as the first frame.

[0027] A video is generated based on the video clips from each storyboard unit.

[0028] In one exemplary embodiment, the method further includes:

[0029] Perform automated quality assessment on keyframe images and / or video clips for each storyboard unit; the assessment dimensions include one or more of the following: image quality, content matching, character consistency, and style consistency.

[0030] Adjust the drawing prompts, video prompts, and camera movement parameters of storyboard units that failed the automated quality assessment to regenerate video generation prompts.
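
A minimal sketch of the per-unit flow in [0024] to [0030]: generate a keyframe, generate a clip that starts from it, run an automated check, and adjust the prompts on failure. All names here (`StoryboardUnit`, `render_unit`, the callables, and the `max_attempts` limit) are hypothetical; the application does not specify a retry limit for this step, and the fake model functions only build strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StoryboardUnit:
    shot_description: str
    drawing_prompt: str
    video_prompt: str
    camera_params: dict


def render_unit(
    unit: StoryboardUnit,
    generate_image: Callable[[str], str],
    generate_clip: Callable[[str, dict, str], str],
    passes_check: Callable[[str], bool],
    adjust: Callable[[StoryboardUnit], StoryboardUnit],
    max_attempts: int = 2,  # assumed limit, not taken from the application
) -> str:
    """Generate a keyframe, then a clip whose first frame is that keyframe;
    retry with adjusted prompts when the automated check fails."""
    clip = ""
    for _ in range(max_attempts):
        keyframe = generate_image(unit.drawing_prompt)
        clip = generate_clip(unit.video_prompt, unit.camera_params, keyframe)
        if passes_check(clip):
            break
        unit = adjust(unit)
    return clip


# Fake models and checks so the sketch runs without any real generator.
def fake_image(prompt: str) -> str:
    return f"img[{prompt}]"


def fake_clip(prompt: str, params: dict, first_frame: str) -> str:
    return f"clip[{prompt}|{params['move']}|{first_frame}]"


def fake_check(clip: str) -> bool:
    return "detailed" in clip


def fake_adjust(unit: StoryboardUnit) -> StoryboardUnit:
    return StoryboardUnit(
        unit.shot_description,
        "detailed " + unit.drawing_prompt,
        unit.video_prompt,
        unit.camera_params,
    )


units = [StoryboardUnit("hero enters", "hero at door", "hero walks in", {"move": "push"})]
clips = [render_unit(u, fake_image, fake_clip, fake_check, fake_adjust) for u in units]
video = "+".join(clips)  # stand-in for assembling the clips into the final video
```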

[0031] In one exemplary embodiment, the prompt word engineering agent generates video generation prompts based on the audiovisual design scheme, including:

[0032] The prompt word engineering agent retrieves prompt word writing rules that are compatible with the video generation model to be called from a pre-built prompt word engineering knowledge base;

[0033] The prompt word engineering agent converts the audiovisual design scheme into video generation prompts according to the prompt word writing rules.
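
As a rough illustration of [0032] and [0033], the sketch below looks up writing rules for the target model and applies them to one shot. It assumes a plain dictionary in place of the prompt word engineering knowledge base; the model names `model_a` and `model_b`, the rule fields, and `build_prompt` are all hypothetical.

```python
from typing import Dict

# Hypothetical rule entries: each target model prefers a different field
# order and separator in its prompts.
PROMPT_RULES: Dict[str, dict] = {
    "model_a": {"order": ["subject", "action", "style"], "separator": ", "},
    "model_b": {"order": ["style", "subject", "action"], "separator": ". "},
}


def build_prompt(shot: Dict[str, str], target_model: str) -> str:
    """Look up the writing rules for the target model and apply them."""
    rules = PROMPT_RULES[target_model]
    return rules["separator"].join(shot[name] for name in rules["order"])


shot = {"subject": "a girl", "action": "running through rain", "style": "2D anime"}
prompt_a = build_prompt(shot, "model_a")
prompt_b = build_prompt(shot, "model_b")
```

The same design content yields a differently ordered prompt for each model, which is the point of retrieving model-specific rules before writing the prompt.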

[0034] In a second aspect, this application provides a script visualization device based on multi-agent collaboration, comprising:

[0035] The script parsing module is used to acquire the original script and parse it through a script analysis agent to obtain the script's structured information.

[0036] The audiovisual design module is used to collaboratively generate audiovisual design schemes for the original script based on the structured information of the script through multiple video creation intelligent agents.

[0037] The prompt word generation module is used to generate video generation prompts based on the audiovisual design scheme through the prompt word engineering agent;

[0038] The model invocation module is used to invoke the video generation model to generate the video based on the video generation prompts.

[0039] In a third aspect, this application provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method.

[0040] In a fourth aspect, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method.

[0041] In a fifth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described method.

[0042] As can be seen from the above technical solutions, the embodiments of this application have the following advantages:

[0043] In the script visualization method, apparatus, computer device, storage medium, and computer program product based on multi-agent collaboration provided in this application, the original script is obtained and parsed by a script analysis agent to obtain script structured information; multiple video creation agents collaboratively generate an audiovisual design scheme for the original script based on the script structured information; a prompt word engineering agent generates video generation prompts based on the audiovisual design scheme; and a video generation model is invoked to generate the video based on the video generation prompts. Thus, by introducing a multi-agent collaborative architecture, an intelligent intermediate layer is built between the underlying video generation model and the original script. This layer transforms the text script, from which it is difficult to generate high-quality video directly, into script structured information, then uses multiple specialized agents to collaboratively design a professional audiovisual scheme, and finally converts that scheme into prompts executable by the video generation model. This solves the technical problems of poor visualization quality, chaotic narrative logic, and inconsistent visual style caused by directly inputting text scripts into the video generation model, and significantly improves the quality of the generated video and its fidelity to the script.

Description of the Drawings

[0044] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0045] Figure 1 is a flowchart of a script visualization method based on multi-agent collaboration provided in an embodiment of this application;

[0046] Figure 2 is a schematic diagram of the iterative loop between the quality inspection agent and a reviewed agent provided in an embodiment of this application;

[0047] Figure 3 is a flowchart of a multi-agent collaborative script visualization implementation provided in an embodiment of this application;

[0048] Figure 4 is a schematic diagram of the complete automated process from original script input to final video synthesis provided in an embodiment of this application;

[0049] Figure 5 is a schematic diagram of the data read / write relationships and flow paths between system components provided in an embodiment of this application;

[0050] Figure 6 is a flowchart of another script visualization method based on multi-agent collaboration provided in an embodiment of this application;

[0051] Figure 7 is a schematic diagram of a script visualization device based on multi-agent collaboration provided in an embodiment of this application;

[0052] Figure 8 is a schematic diagram of the internal structure of a computer device provided in an embodiment of this application.

Detailed Description

[0053] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0054] In the creation of visual content such as animation, the traditional production process from written script to finished visual product requires deep collaboration among multiple professional roles, including screenwriters, directors, storyboard artists, art directors, cinematographers, and post-production compositors. This process is characterized by long production cycles, high labor costs, and heavy dependence on the skills of the team members. Current generative AI technologies, represented by large language models (LLMs), text-to-image models, and image-to-video models, possess single-point generation capabilities but suffer from a fundamental structural flaw: the underlying generative models (text-to-image and image-to-video models) are essentially execution units without narrative comprehension. They receive a set of prompts and start working; they cannot understand plot logic, perceive character relationships, apply professional cinematic language, or maintain visual consistency across shots. Therefore, although multi-shot footage and video clips generated directly with generative models may meet quality standards frame by frame, the result as a whole generally lacks narrative coherence, shows inconsistent character appearance, unprofessional cinematic language, and a drifting overall style, and falls well short of the production standards for animated series and short dramas.

[0055] The specific shortcomings of existing technical solutions include: (1) no systematic mechanism for deconstructing unstructured text scripts into multi-dimensional structured storyboard data: existing solutions are mostly simple "script → prompt words" direct translations, without deep deconstruction of the script at the scene level, shot level, and element level; (2) no intelligent decision-making layer that simulates the multi-role collaboration of a professional film and television team: existing solutions rely on a single model or a single process to make all decisions and cannot weigh multiple professional dimensions such as narrative, art, and cinematography together; (3) no engineering strategy for the limitations of large language models themselves: the original script is usually long (thousands to tens of thousands of words), large language models suffer from attention decay and omission of key information when handling long contexts, and existing solutions lack targeted task decomposition and scheduling mechanisms; (4) no effective end-to-end quality assurance loop: the many stages from script deconstruction to the final film lack automated quality inspection and iterative correction mechanisms, so output quality is uncontrollable.

[0056] The technical solution provided in this application, under the premise that the underlying generation model lacks narrative understanding capabilities, constructs an upper-level multi-agent collaborative decision-making system to deconstruct the original text script layer by layer into precise structured generation instructions, driving the underlying model to output a visualized finished film with story coherence, character consistency, professional cinematography, and stylistic unity. For details regarding the technical solution of this application, please refer to the specific descriptions of the following embodiments.

[0057] In one exemplary embodiment, Figure 1 is a flowchart of a script visualization method based on multi-agent collaboration provided in an embodiment of this application. As shown in Figure 1, a script visualization method based on multi-agent collaboration is provided. Taking the application of this method to a server as an example, the method includes the following steps S102 to S108:

[0058] Step S102: Obtain the original script and parse it through the script analysis agent to obtain the script's structured information.

[0059] The original script is a text script that has not undergone any structuring.

[0060] The script analysis agent is an agent deployed on the server. Its core task is to deeply deconstruct unstructured long text scripts into multi-granularity structured data.

[0061] The script structured information is information extracted from the original script and organized according to a preset data structure.

[0062] Optionally, the server receives the original text script file uploaded by the user, starts the script analysis agent to parse the original script, and obtains the script's structured information.

[0063] Step S104: Multiple video creation intelligent agents collaboratively generate an audiovisual design scheme for the original script based on the script's structured information.

[0064] The multiple video creation agents are agents with different responsibilities that create collaboratively based on the script structured information.

[0065] The audiovisual design scheme is a comprehensive design scheme for the visual and auditory presentation of the original script.

[0066] Optionally, the server can launch multiple video creation agents to work collaboratively based on the script's structured information. These agents can interact with each other through shared memory to achieve information closure and two-way verification.

[0067] Step S106: The prompt word engineering agent generates video generation prompts based on the audiovisual design scheme.

[0068] The prompt word engineering agent is an agent used to convert audiovisual design schemes into prompts that the video generation model can recognize.

[0069] The video generation prompts are structured instructions used to guide the video generation model in generating videos.

[0070] Optionally, after receiving the audiovisual design scheme, the server starts the prompt word engineering agent to convert it into video generation prompts.

[0071] Step S108: Based on the video generation prompts, invoke the video generation model to generate the video.

[0072] The video generation model is a generative model used to generate video content based on prompts and / or reference images, including but not limited to image-to-video models based on a diffusion model architecture.

[0073] Optionally, the server invokes a pre-trained video generation model based on the video generation prompts to perform the video generation task. The generated video can be returned to the user as a file or stored on the server for later use.

[0074] In the above script visualization method based on multi-agent collaboration, the original script is obtained and parsed by a script analysis agent to obtain script structured information. Multiple video creation agents then collaboratively generate an audiovisual design scheme based on this structured information. The prompt word engineering agent generates video generation prompts based on the audiovisual design scheme. Finally, the video generation model is invoked to generate the video from these prompts. By introducing a multi-agent collaborative architecture, an intelligent intermediate layer is built between the underlying video generation model and the original script. This layer transforms the text script, from which it is difficult to generate high-quality video directly, into script structured information. Multiple specialized agents then collaboratively design a professional audiovisual scheme, which is finally converted into prompts executable by the video generation model. This solves the technical problems of poor visualization quality, chaotic narrative logic, and inconsistent visual style caused by directly inputting text scripts into the video generation model, and significantly improves the quality of the generated video and its fidelity to the original script.
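
Steps S102 to S108 amount to a four-stage pipeline. The sketch below shows only that control flow, assuming each stage is passed in as a callable; `visualize_script` and the lambda stand-ins are hypothetical and do no real analysis or generation.

```python
from typing import Callable, Dict, List


def visualize_script(
    original_script: str,
    analyze: Callable[[str], Dict],
    design: Callable[[Dict], Dict],
    to_prompts: Callable[[Dict], List[str]],
    generate_video: Callable[[List[str]], str],
) -> str:
    structured_info = analyze(original_script)  # S102: script analysis agent
    design_scheme = design(structured_info)     # S104: video creation agents
    prompts = to_prompts(design_scheme)         # S106: prompt word engineering agent
    return generate_video(prompts)              # S108: video generation model


# Toy stand-ins so the sketch runs end to end.
video = visualize_script(
    "SCENE 1\nANNA waits.\nSCENE 2\nBEN arrives.",
    analyze=lambda s: {"scenes": [p.strip() for p in s.split("SCENE") if p.strip()]},
    design=lambda info: {
        "shots": [f"wide shot of scene {x.splitlines()[0]}" for x in info["scenes"]]
    },
    to_prompts=lambda scheme: [shot + ", 2D animation" for shot in scheme["shots"]],
    generate_video=lambda prompts: f"video({len(prompts)} clips)",
)
```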

[0075] In one exemplary embodiment, the script structured information includes global information and scene information. The script analysis agent parses the original script to obtain the script structured information, including: the script analysis agent splits the original script into scenes to obtain scene units, and extracts the scene structured information of each scene unit as scene information; the script analysis agent aggregates the scene structured information of each scene unit to generate a global character profile, a global scene profile, a global narrative structure, and a global emotion curve as global information.

[0076] The global information refers to information describing the overall macro-level characteristics of the entire script. After extraction, the global information is written into global shared memory. For example, global information may include global character profiles, global scene profiles, global narrative structure, and global emotional curves. Global character profiles are structured documents recording the characteristics of all characters in the entire script. Global scene profiles are structured documents recording the characteristics of all scenes in the entire script. Global narrative structure describes the overall story progression framework of the script. Global emotional curves describe the changes in the intensity of emotions in the script over time.

[0077] Scene information describes the micro-level features of a single scene. After extraction, the scene information is written into task-level local memory. For example, scene information may include the scene structured information of each scene unit. A scene unit is an independent segment obtained by dividing the original script at its scene transitions, and the scene structured information is structured data extracted from a single scene unit and organized according to a preset format.
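
One possible shape for the two kinds of information in [0076] and [0077] is sketched below. The application does not give a concrete schema, so the class names, field names, and field types are assumptions chosen to mirror the elements named in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SceneInfo:
    """Structured information for one scene unit (task-level local memory)."""
    time: str
    location: str
    characters: List[str]
    actions: List[str] = field(default_factory=list)
    dialogue: List[str] = field(default_factory=list)
    emotions: List[str] = field(default_factory=list)


@dataclass
class GlobalInfo:
    """Script-wide information (global shared memory)."""
    character_profiles: Dict[str, str] = field(default_factory=dict)
    scene_profiles: List[str] = field(default_factory=list)
    narrative_structure: List[str] = field(default_factory=list)
    emotion_curve: List[float] = field(default_factory=list)


global_shared_memory = GlobalInfo(
    character_profiles={"ANNA": "protagonist, late twenties"},
    scene_profiles=["kitchen, morning", "street, night"],
    narrative_structure=["setup", "confrontation"],
    emotion_curve=[0.2, 0.9],
)
# Local memory is keyed by scene number in this sketch.
task_local_memory = {
    1: SceneInfo(time="morning", location="kitchen", characters=["ANNA", "BEN"]),
}
```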

[0078] Optionally, the script analysis agent in the server first identifies scene boundary markers in the original script, divides the complete script into multiple scene units, and performs an independent element extraction task for each scene unit, extracting information such as the time, location, characters, actions, dialogue, and emotions of the scene to form structured scene information. After completing the extraction of all scene units, the script analysis agent starts an aggregation task to summarize the character information in all scenes by character name to generate a global character profile, and summarize the metadata of all scenes by scene order to generate a global scene profile, and analyze the emotional changes and plot functions of all scenes to generate a global narrative structure and a global emotion curve.
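
The split-then-aggregate flow in [0078] can be sketched as below. The application does not say what a scene boundary marker looks like, so the `SCENE <n>` heading format and the `NAME:` speaker convention are assumptions, and a regular expression stands in for the agent's extraction step.

```python
import re
from collections import defaultdict
from typing import Dict, List

SCENE_MARKER = re.compile(r"^SCENE \d+", re.MULTILINE)  # assumed marker format


def split_scenes(script: str) -> List[str]:
    """Cut the script into scene units at each boundary marker."""
    starts = [m.start() for m in SCENE_MARKER.finditer(script)]
    ends = starts[1:] + [len(script)]
    return [script[s:e].strip() for s, e in zip(starts, ends)]


def extract_characters(scene: str) -> List[str]:
    # Toy extractor: a speaker is an upper-case word followed by a colon.
    return sorted(set(re.findall(r"^([A-Z]+):", scene, re.MULTILINE)))


def aggregate_characters(scenes: List[str]) -> Dict[str, List[int]]:
    """Summarize by character name: which scenes each character appears in."""
    profile = defaultdict(list)
    for index, scene in enumerate(scenes, start=1):
        for name in extract_characters(scene):
            profile[name].append(index)
    return dict(profile)


script = """SCENE 1 - KITCHEN, MORNING
ANNA: Where is the letter?
BEN: I burned it.
SCENE 2 - STREET, NIGHT
ANNA: He lied to me.
"""
scenes = split_scenes(script)
profiles = aggregate_characters(scenes)
```

A real global character profile would carry far more than scene indices; the point here is only that extraction runs per scene unit and aggregation runs once over all units.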

[0079] As can be seen, the script analysis agent's sub-task workflow includes scene boundary segmentation (breaking the complete script down into scene units), scene-by-scene structured element extraction (time, location, characters, actions, dialogue, emotions, plot functions, etc.), global character information aggregation and profile generation, and global scene information aggregation and profile generation. For each scene unit, the script analysis agent performs atomic-level sub-task decomposition: it breaks the originally complex task of "extracting all elements" into multiple single-objective sub-tasks (such as "extracting only character actions," "extracting only dialogue," and "extracting only emotions"). Each sub-task processes only the finite-length text of the current scene unit, so the input length always stays within the model's effective attention window. In this way, by breaking long scripts into atomic-level sub-tasks, each handling only a finite-length segment and a single extraction objective, the script analysis agent specifically addresses the information loss caused by long-context attention decay in large language models (LLMs).

[0080] In this embodiment, by breaking down a long script into scene units for independent analysis and then aggregating the analysis results globally, the problem of information omission caused by the attenuation of attention in a large model when directly processing long texts is avoided. This ensures that the information on characters, scenes, emotions, etc., extracted from the script is both comprehensive and accurate, providing a high-quality data foundation for the collaborative work of the subsequent video creation agent. This solves the problem of poor visualization quality caused by script misunderstanding from the source.

[0081] In practical applications, when the script analysis agent splits the original script into scene units and extracts the scene structured information of each scene unit, it decomposes the extraction task for each scene unit into multiple atomic-level subtasks. Each atomic-level subtask processes a text fragment of limited length and focuses on a single extraction target. For example, for a scene unit containing 3,000 words, the script analysis agent decomposes its element extraction task into multiple independent subtasks such as "extracting characters and their actions," "extracting core dialogue," "extracting scene time, location, and environment," and "extracting emotion tags and plot functions." Each subtask only needs to process the limited text of that scene unit and focus on a single dimension of information, so the input length always stays within the effective attention window of the large language model. This avoids the omission of key information caused by attention decay when the large language model processes long contexts, and improves the accuracy and completeness of structured element extraction.
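
The atomic-level decomposition in [0081] can be sketched as one model call per extraction target, each seeing only the current scene unit. The four targets are the ones named above; `call_llm`, `MAX_INPUT_CHARS`, and the prompt wording are hypothetical, and a character count is only a crude proxy for the model's effective attention window.

```python
from typing import Callable, Dict

EXTRACTION_TARGETS = [
    "characters and their actions",
    "core dialogue",
    "scene time, location, and environment",
    "emotion tags and plot functions",
]
MAX_INPUT_CHARS = 12_000  # assumed input budget, not a value from the application


def extract_scene_elements(
    scene_text: str, call_llm: Callable[[str], str]
) -> Dict[str, str]:
    """Run one single-target subtask per extraction target on one scene unit."""
    if len(scene_text) > MAX_INPUT_CHARS:
        raise ValueError("scene unit exceeds the assumed input budget; split it further")
    results = {}
    for target in EXTRACTION_TARGETS:
        prompt = f"Extract only the {target} from this scene:\n{scene_text}"
        results[target] = call_llm(prompt)
    return results


def echo_llm(prompt: str) -> str:
    # Stand-in model: returns only the instruction line of the prompt.
    return prompt.splitlines()[0]


elements = extract_scene_elements("ANNA: Where is the letter?", echo_llm)
```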

[0082] In this embodiment, the element extraction task is further broken down into atomic-level subtasks, each focusing on a single extraction target with a controllable input length. Compared with requiring the model to extract all types of elements from a long text at once, this significantly reduces the risk of information omission and improves the accuracy and robustness of the extraction results.

[0083] In one exemplary embodiment, the video creation intelligent agent includes a director intelligent agent, a storyboard design intelligent agent, an art director intelligent agent, and a cinematographer intelligent agent. Multiple video creation intelligent agents collaboratively generate an audiovisual design scheme for the original script based on structured script information. This includes: the director intelligent agent retrieving matching related narrative structure knowledge from a pre-built narrative structure knowledge base based on global information, and generating a director's scheme based on the global information and related narrative structure knowledge; the storyboard design intelligent agent generating a storyboard scheme based on scene information and the director's scheme; the art director intelligent agent generating an art scheme based on the storyboard scheme and global information; the cinematographer intelligent agent generating a cinematography scheme based on the director's scheme, storyboard scheme, and art scheme; and finally, generating an audiovisual design scheme for the original script based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme.

[0084] The director agent is responsible for the overall narrative rhythm design, style setting, camera language strategy, and final shot annotation. It retrieves classic narrative models and rhythm arrangement principles from the narrative structure knowledge base through retrieval-augmented generation (RAG).

[0085] The storyboard design agent is responsible for breaking down scene-level data (i.e., the scene structured information of each scene unit) into shot-by-shot storyboard units, and for determining the shot size, composition, and content description of each shot. It retrieves professional knowledge of shot size selection and composition design from the film and television shot language knowledge base through RAG.

[0086] The art director agent is responsible for refining the art design scheme, covering the visual descriptions of characters and scenes. It loads the global character profiles and global scene profiles from global shared memory.

[0087] The cinematographer agent is responsible for designing the camera movement (such as push in / pull out / pan / track / crane up / crane down / orbit), motion parameters, subject movement, and environmental dynamics for each shot.

[0088] The narrative structure knowledge base is a pre-built knowledge base that stores classic narrative models and rhythm arrangement principles.

[0089] The related narrative structure knowledge is the narrative knowledge retrieved from the knowledge base that matches the current script.

[0090] The director's scheme can be a structured document that includes the overall style, narrative rhythm, camera strategy, and annotations of key shots.

[0091] The storyboard scheme can be a structured document that includes the shot size, composition, and content description of each shot.

[0092] The art scheme can be a structured document that includes the visual details of the characters and the art design of the scenes.

[0093] The cinematography scheme can be a structured document that includes the camera movement methods and motion parameters for each shot.

[0094] Optionally, the server first activates the director agent. This agent reads global information from the global shared memory and, using script type and global emotion curve as query criteria, retrieves relevant classic narrative models from the narrative structure knowledge base. It then combines this with the script's structured data to generate a director's plan and writes it back to the global shared memory. Subsequently, the server activates the storyboard design agent for each scene unit. This agent reads scene information from local memory, retrieves the director's plan from the global shared memory, and retrieves shot selection principles from the film and television shot language knowledge base (e.g., "medium shots are often used to establish relationships in dialogue scenes, close-ups capture emotional responses") to generate a storyboard plan for that scene. The art director agent reads the storyboard plan, global character files, and global scene files from the shared memory, refining the visual descriptions of each character and scene to generate an art plan. The cinematographer agent reads the director's plan, storyboard plan, and art plan, designing specific camera movement methods (e.g., push, pull, pan, tilt, track, crane, etc.) for each shot to generate a cinematography plan. Finally, all plans are aggregated to form a complete audiovisual design scheme.
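The dependency order described above can be sketched as a small pipeline. This is only an illustrative sketch, not the implementation of this application: the dictionary standing in for the global shared memory, the key names, and `stub_agent` are all assumptions, and the per-scene activation of the storyboard design agent is collapsed into a single call for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Each entry: (agent name, keys it reads from shared memory). Order follows the text.
PIPELINE: List[Tuple[str, List[str]]] = [
    ("director", ["global_info"]),
    ("storyboard", ["scene_info", "director"]),
    ("art", ["storyboard", "global_info"]),
    ("cinematography", ["director", "storyboard", "art"]),
]

Agent = Callable[[Dict[str, object]], object]


def run_pipeline(memory: Dict[str, object], agents: Dict[str, Agent]) -> Dict[str, object]:
    """Run the four agents in dependency order, writing each scheme back to memory."""
    for name, inputs in PIPELINE:
        missing = [key for key in inputs if key not in memory]
        if missing:
            raise KeyError(f"{name} is missing inputs: {missing}")
        memory[name] = agents[name]({key: memory[key] for key in inputs})
    # The audiovisual design scheme aggregates the four individual schemes.
    return {name: memory[name] for name, _ in PIPELINE}


def stub_agent(inputs: Dict[str, object]) -> Dict[str, List[str]]:
    # Hypothetical placeholder for a model call: records which inputs were consumed.
    return {"uses": sorted(inputs)}


memory = {"global_info": "genre=drama", "scene_info": "scene 1"}
scheme = run_pipeline(memory, {name: stub_agent for name, _ in PIPELINE})
```

Because each agent only sees the keys listed for it, a missing upstream scheme fails early instead of silently producing a scheme from incomplete context.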

[0095] In this embodiment, by setting up four dedicated intelligent agents—director, storyboard artist, art director, and cinematographer—the division of labor and collaboration mode of a real film and television production team is simulated. Each intelligent agent completes the design work within its respective scope of responsibility under the guidance of professional knowledge. This avoids the lack of professionalism and logical confusion when a single model handles complex audiovisual design tasks, ensuring that the generated audiovisual design scheme is both in line with narrative logic and has a professional standard. This solves the problem of poor visualization quality caused by the lack of professional audiovisual language guidance when directly generating videos from text scripts.

[0096] In practical applications, the director agent, storyboard design agent, art director agent, and cinematographer agent each employ a task decomposition workflow architecture rather than being driven by a single prompt. Taking the storyboard design agent as an example, its internal workflow includes: first, loading the director's scheme and current scene information from global shared memory; second, retrieving shot size selection principles and composition design knowledge matching the current scene type from the film and television shot language knowledge base through a retrieval-augmented generation (RAG) mechanism; third, determining the number of shots and cut points for the current scene based on the rhythm requirements in the director's scheme; fourth, determining the shot size and composition for each shot; fifth, generating a detailed description of the scene content for each shot; and sixth, performing format validation and integrity checks on the output. This multi-step workflow breaks down the complex storyboard design task into a focused sequence of sub-tasks, improving the accuracy and robustness of the output. The other intelligent agents adopt a similar task decomposition workflow architecture and, during execution, retrieve relevant knowledge from their respective domain-specific knowledge bases through the RAG mechanism and inject it into the reasoning context. This includes, but is not limited to: the director intelligent agent retrieving narrative models and rhythm arrangement principles from the narrative structure knowledge base; the art director intelligent agent retrieving screen style descriptions and color scheme knowledge from the style reference knowledge base; and the cinematographer intelligent agent retrieving camera movement techniques and dynamic expression knowledge from the film and television shot language knowledge base.

[0097] In this way, by adopting a multi-step task decomposition workflow within each agent instead of driving it with a single prompt, the complex creative task is broken down into a controllable sequence of sub-tasks, improving the accuracy and robustness of the output at each stage. At the same time, by injecting professional knowledge of the corresponding domain into each agent through a retrieval-augmented generation mechanism, the output of each agent reaches an industry-standard level of quality, thereby further ensuring the quality of the audiovisual design scheme.
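The six-step storyboard workflow can be sketched as a chain of small functions. Everything below is an assumption made for illustration: the `Shot` fields, the lookup table standing in for knowledge base retrieval, and the rhythm heuristic are hypothetical, and step one (loading inputs from memory) is represented by the function arguments.

```python
from dataclasses import dataclass
from typing import Dict, List

ALLOWED_SIZES = {"wide", "medium", "close-up"}


@dataclass
class Shot:
    index: int
    shot_size: str
    composition: str
    description: str


def retrieve_rules(scene_type: str) -> Dict[str, str]:
    # Step 2 stand-in: a hypothetical lookup table replaces knowledge base retrieval.
    kb = {"dialogue": {"default_size": "medium", "reaction_size": "close-up"}}
    return kb.get(scene_type, {"default_size": "wide", "reaction_size": "medium"})


def plan_shot_count(rhythm: str, beats: List[str]) -> int:
    # Step 3, assumed heuristic: fast rhythm cuts on every beat, otherwise on every second beat.
    return len(beats) if rhythm == "fast" else max(1, (len(beats) + 1) // 2)


def validate(shots: List[Shot]) -> None:
    # Step 6: format and integrity checks on the output.
    for shot in shots:
        if shot.shot_size not in ALLOWED_SIZES or not shot.description:
            raise ValueError(f"invalid shot {shot.index}")


def design_storyboard(director_scheme: Dict[str, str], scene: Dict[str, object]) -> List[Shot]:
    rules = retrieve_rules(scene["type"])                          # step 2
    count = plan_shot_count(director_scheme["rhythm"], scene["beats"])  # step 3
    shots = []
    for i in range(count):
        # Step 4: alternate between the default and the reaction shot size.
        size = rules["default_size"] if i % 2 == 0 else rules["reaction_size"]
        # Step 5: the beat text serves as the scene content description.
        shots.append(Shot(i + 1, size, "rule of thirds", scene["beats"][i]))
    validate(shots)
    return shots


scene = {"type": "dialogue", "beats": ["A greets B", "B reacts", "A leaves"]}
shots = design_storyboard({"rhythm": "fast"}, scene)
```

Splitting the task this way means a failure in one step (for example, an empty description caught by `validate`) can be traced to that step rather than to one opaque model call.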

[0098] In an exemplary embodiment, the video creation agent further includes a quality inspection agent; based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme, an audiovisual design scheme for the original script is generated, including: the quality inspection agent reviews the director's scheme, storyboard scheme, art scheme, and cinematography scheme respectively; if the review fails, the quality inspection agent provides feedback to the corresponding video creation agent, which then makes corrections based on the feedback, and the quality inspection agent reviews the corrections, iterating until the director's scheme, storyboard scheme, art scheme, and cinematography scheme all meet preset quality standards or reach a preset maximum number of iterations; based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme that meet the preset quality standards, an audiovisual design scheme for the original script is generated.

[0099] Among them, the quality inspection agent is responsible for conducting multi-dimensional quality reviews of the phased outputs of other agents. Its review mechanism can be described as a "mutualistic game," specifically, the quality inspection agent and the reviewed agent engage in multiple rounds of dialogical quality negotiation (questioning → response / correction → review → further questioning…), iteratively improving output quality until it meets the standards or reaches the maximum number of rounds. The quality inspection agent's review dimensions include narrative completeness, role consistency, professional standardization, and element completeness.

[0100] Quality auditing is a process of evaluating the solution from multiple dimensions.

[0101] The iterative correction process involves a multi-round dialogic quality game between the quality inspection agent and the audited agent when the solution output by any audited agent fails the quality inspection agent's review. The audited agent then modifies the solution based on the quality inspection opinions.

[0102] Among them, the preset quality standards are pre-set indicators used to measure whether a plan is qualified or not, including dimensions such as narrative completeness, role consistency, professional standardization, and element completeness.

[0103] Optionally, after the director agent, storyboard design agent, art director agent, and cinematographer agent have completed the initial generation of their respective schemes, the server activates the quality inspection agent. The quality inspection agent reads each scheme from the global shared memory and reviews it against the preset quality standards. If a quality issue is found in a scheme, the quality inspection agent generates specific feedback and sends it to the corresponding agent, which modifies the scheme based on the feedback and resubmits it for review. Multiple rounds of this "questioning-correction" dialogue take place between the quality inspection agent and the reviewed agent until the scheme meets the preset quality standards or the preset maximum number of iteration rounds is reached. After all schemes pass quality inspection, the server integrates the director's scheme, storyboard scheme, art scheme, and cinematography scheme into the final audiovisual design scheme.

[0104] Figure 2 shows an iterative cycle of "questioning → response → correction → review" between the quality inspection agent and the reviewed agent. Taking the storyboard design agent as an example, when the storyboard design agent outputs storyboard scheme V1, the quality inspection agent initiates the first round of review. Its review dimensions include narrative completeness, character consistency, professional standards, and element completeness, and it outputs a list of issues: "#1 Scene 5 is missing a shot of character B's entrance; #2 Shot 12 has an inappropriate shot size (should be a close-up); #3 Shot 15's character clothing description is inconsistent with the file." The storyboard design agent responds to the issue list with: "#1 Agreed → Add a shot of character B's entrance; #2 Agreed → Correct to a close-up shot; #3 Agreed → Correct the clothing description," and revises storyboard scheme V1 to output storyboard scheme V2. The quality inspection agent then initiates the second round of review, with the results: "#1 Corrected, passed; #2 Corrected, passed; #3 Correction incomplete, cuff details still deviate." Based on this feedback, the storyboard design agent revises storyboard scheme V2, supplements the complete cuff detail description, and outputs storyboard scheme V3. The quality inspection agent then conducts a final review, confirms that all issues have been resolved, determines that storyboard scheme V3 has passed the review, and passes storyboard scheme V3 to the prompt engineering agent.
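The V1 → V2 → V3 exchange above amounts to a bounded review loop. The following is a minimal sketch under stated assumptions: the draft is modelled as a plain dictionary, and `review` and `revise` are hypothetical stand-ins for the quality inspection agent and the storyboard design agent. The partial cuff fix on the first revision is scripted to mirror the example.

```python
from typing import Callable, Dict, List, Tuple

Draft = Dict[str, object]


def review_loop(
    draft: Draft,
    review: Callable[[Draft], List[str]],
    revise: Callable[[Draft, List[str]], Draft],
    max_rounds: int,
) -> Tuple[Draft, int, bool]:
    """Review up to max_rounds times; return (draft, rounds used, passed)."""
    for round_no in range(1, max_rounds + 1):
        issues = review(draft)
        if not issues:
            return draft, round_no, True
        if round_no < max_rounds:
            draft = revise(draft, issues)
    return draft, max_rounds, False


def review(draft: Draft) -> List[str]:
    issues = []
    if not draft["entrance_shot"]:
        issues.append("entrance_shot")
    if draft["shot12_size"] != "close-up":
        issues.append("shot12_size")
    if draft["cuff_detail"] != "full":
        issues.append("cuff_detail")
    return issues


def revise(draft: Draft, issues: List[str]) -> Draft:
    fixed = dict(draft)
    if "entrance_shot" in issues:
        fixed["entrance_shot"] = True
    if "shot12_size" in issues:
        fixed["shot12_size"] = "close-up"
    if "cuff_detail" in issues:
        # Scripted to need two passes, like issue #3 in the example.
        fixed["cuff_detail"] = "full" if draft["cuff_detail"] == "partial" else "partial"
    return fixed


v1 = {"entrance_shot": False, "shot12_size": "medium", "cuff_detail": "none"}
final, rounds, passed = review_loop(v1, review, revise, max_rounds=5)
```

With `max_rounds=5` the loop stops after the third review, matching the three rounds in the example; with `max_rounds=2` it returns the second draft flagged as not passed, which is the "maximum number of iterations" exit.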

[0105] In this embodiment, by introducing a quality inspection agent and its multi-round dialogic quality game mechanism of "mutual competition", the output of the upstream agent is strictly controlled and iteratively optimized. This effectively prevents design defects caused by the failure of a single agent from being transmitted to the downstream, ensuring that the audiovisual design scheme entering the prompt word generation stage has high quality and high consistency. Thus, the quality of the final generated video is guaranteed from the source of design quality, solving the problem of poor visualization quality caused by defects in the design scheme.

[0106] For the convenience of those skilled in the art, Figure 3 provides a workflow for script visualization through collaboration among the script analysis, director, storyboard design, art director, cinematographer, quality inspection, and prompt engineering agents. First, the original script is input into the script analysis agent for deconstruction, which outputs structured scene data and global character / scene profiles. Then, the director agent designs narrative pacing and shot strategies based on the global data, the storyboard design agent breaks down scenes into individual shot units, the art director agent refines the visual descriptions of characters and scenes, and the cinematographer agent designs camera movement for each shot. During the sequential execution of these five creative agents, the quality inspection agent reviews each stage of output in real time and iteratively optimizes output quality through a "self-reflection" mechanism (multi-round dialogue-based questioning and correction) until the output passes quality inspection. Finally, the structured data that passes quality inspection is translated into executable instructions for the generation model by the prompt engineering agent, driving the generation model to complete video generation.

[0107] In an exemplary embodiment, the storyboard scheme includes multiple storyboard units; each storyboard unit corresponds to the scene description of a shot; the video generation prompts include drawing prompts, video prompts, and camera movement parameters for each storyboard unit; based on the video generation prompts, a video generation model is invoked to generate a video, including: based on the drawing prompts of each storyboard unit, an image generation model is invoked to generate keyframe images for each storyboard unit based on the scene description of each storyboard unit; based on the video prompts and camera movement parameters of each storyboard unit, using the keyframe images of each storyboard unit as the first frame, a video generation model is invoked to generate video segments for each storyboard unit; and based on the video segments of each storyboard unit, a video is generated.

[0108] The storyboard unit is the smallest unit in the storyboard scheme, corresponding to an independent shot.

[0109] The scene description is a textual description of the visual content of the shot.

[0110] Among them, drawing prompts are prompts used to guide the image generation model in generating keyframe images.

[0111] Among them, video prompts are prompts used to guide the video generation model in generating video clips.

[0112] Among them, camera movement parameters are parameterized instructions that describe the manner, speed, and amplitude of camera movement.

[0113] Among them, the image generation model is a generative model used to generate images based on prompts, including but not limited to text-to-image models or image-to-image models based on a diffusion model architecture.

[0114] Among them, the keyframe image can be the starting frame image of each video segment.

[0115] Among them, video clips are short videos corresponding to a single shot.

[0116] Optionally, the server first extracts drawing prompts from the audiovisual design scheme and, for each storyboard unit, calls the image generation model to generate a keyframe image. When generating keyframe images, at least one of the following consistency control strategies is employed. First, by loading a pre-trained style model (such as fine-tuned LoRA weights), the inference process of the image generation model is constrained to converge to the target style space, ensuring that the images generated for different shots maintain a unified art style, color tone, and image quality. Second, through a reference image injection mechanism, character reference images and / or scene reference images stored in the global shared memory are injected into the inference process of the image generation model, guiding the model to generate images that remain visually consistent with the reference images in terms of character facial features, clothing details, and scene spatial structure. These consistency control strategies ensure stylistic consistency and character visual consistency across shots, solving the problem that the underlying generation model cannot autonomously maintain visual consistency across multiple independent calls. After all keyframes are generated, for each storyboard unit, the image-to-video model is called with the keyframe image of that unit as the first frame, combined with the corresponding video prompts and camera movement parameters, to generate the video clip for that shot. The server supports parametric control of camera movement type, rate, and amplitude. After all video clips are generated and pass automated quality assessment, the server stitches the clips together into a complete video according to the shot order.
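The two-stage flow (all keyframes first, then one clip per storyboard unit seeded by its keyframe) can be sketched as follows. The `UnitPrompts` fields and both `fake_*` model functions are hypothetical placeholders; real image and video generation models would take their place, and style models or reference images are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UnitPrompts:
    shot_id: int
    drawing_prompt: str
    video_prompt: str
    camera: Dict[str, object]  # assumed shape: type, speed, amplitude


def generate_clips(
    units: List[UnitPrompts],
    image_model: Callable[[str], str],
    video_model: Callable[[str, str, Dict[str, object]], str],
) -> List[str]:
    # Stage 1: generate a keyframe for every storyboard unit.
    keyframes = {u.shot_id: image_model(u.drawing_prompt) for u in units}
    # Stage 2: generate one clip per unit, using its keyframe as the first frame.
    ordered = sorted(units, key=lambda u: u.shot_id)
    # Returning clips in shot order makes the final stitching a plain concatenation.
    return [video_model(keyframes[u.shot_id], u.video_prompt, u.camera) for u in ordered]


def fake_image_model(prompt: str) -> str:
    return f"img<{prompt}>"


def fake_video_model(first_frame: str, prompt: str, camera: Dict[str, object]) -> str:
    return f"clip({first_frame}|{prompt}|{camera['type']})"


units = [
    UnitPrompts(2, "close-up of B", "B smiles", {"type": "push", "speed": 0.3, "amplitude": 0.2}),
    UnitPrompts(1, "wide street", "rain falls", {"type": "pan", "speed": 0.5, "amplitude": 0.4}),
]
clips = generate_clips(units, fake_image_model, fake_video_model)
```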

[0117] In this embodiment, the video generation process is decomposed into two stages: "keyframe generation + video segment generation". Fine-grained control is performed on a storyboard unit basis, so that the content, composition and camera movement of each shot can strictly follow the upstream audiovisual design scheme. This avoids problems such as image distortion, unreasonable movement and abrupt shot transitions that are easy to occur when directly generating long videos. This solves the problem of poor visualization quality of text scripts caused by the insufficient understanding of complex instructions by the video generation model.

[0118] In one exemplary embodiment, the method further includes: performing an automated quality assessment on keyframe images and / or video clips for each storyboard unit; the assessment dimensions include one or more of image quality, content matching, character consistency, and style consistency; and adjusting the drawing prompts, video prompts, and camera movement parameters of storyboard units that fail the automated quality assessment to regenerate video generation prompts.

[0119] Optionally, after the keyframe images and / or video clips of each storyboard unit are generated, an automated quality assessment is performed on the generated results. The assessment dimensions include at least: image quality assessment (detecting for image quality issues such as distortion, blurring, and artifacts), content matching assessment (assessing the degree of matching between the generated image and the content described by the drawing prompts), character consistency assessment (comparing the visual features of the characters in the generated image with the global character profile in the global shared memory), and style consistency assessment (assessing the degree of matching between the style of the generated image and the global style configuration). For storyboard units that fail the automated quality assessment, the corresponding video generation prompts and / or control parameters are adjusted based on the assessment feedback (such as adjusting the key descriptions in the prompts, modifying the guidance coefficients, changing the random seed, etc.), and then the generation model is called again for generation until the quality assessment is passed or the preset maximum number of retries is reached. In this way, through the automated quality inspection closed loop after generation, it is ensured that every shot material that finally enters the video compositing stage meets the preset quality standards.
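The post-generation closed loop can be sketched as a retry loop over per-dimension scores. The threshold values, the score range, and the adjustment policy (new seed plus a higher guidance coefficient) are assumptions chosen for illustration; `demo_generate` and `demo_assess` are hypothetical stand-ins for the generation model and the assessor.

```python
from typing import Callable, Dict, List, Tuple

# Assumed pass thresholds for the four assessment dimensions named in the text.
THRESHOLDS = {
    "image_quality": 0.7,
    "content_match": 0.7,
    "character_consistency": 0.8,
    "style_consistency": 0.8,
}


def failed_dimensions(scores: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    return [dim for dim, minimum in thresholds.items() if scores.get(dim, 0.0) < minimum]


def generate_with_retries(
    generate: Callable[[dict], dict],
    assess: Callable[[dict], Dict[str, float]],
    params: dict,
    max_retries: int,
) -> Tuple[dict, int, bool]:
    """Regenerate until every dimension passes or the retry budget is spent."""
    retries = 0
    while True:
        result = generate(params)
        failed = failed_dimensions(assess(result))
        if not failed or retries >= max_retries:
            return result, retries, not failed
        retries += 1
        # Assumed adjustment: change the random seed and raise the guidance coefficient.
        params = {**params, "seed": params["seed"] + 1, "guidance": params["guidance"] + 0.5}


def demo_generate(params: dict) -> dict:
    return dict(params)  # placeholder: the "result" simply echoes its parameters


def demo_assess(result: dict) -> Dict[str, float]:
    match = 0.9 if result["guidance"] >= 7.0 else 0.5
    return {"image_quality": 0.9, "content_match": match,
            "character_consistency": 0.9, "style_consistency": 0.9}


result, retries, passed = generate_with_retries(
    demo_generate, demo_assess, {"seed": 1, "guidance": 6.0}, max_retries=3
)
```

Returning the `passed` flag alongside the result lets the caller decide what to do with a unit that exhausted its retries, for example routing it to manual review.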

[0120] In this embodiment, by adding an automated quality assessment and iterative correction step after the image and video are generated, the generated results are checked from multiple dimensions, which effectively prevents substandard materials from entering the final synthesis stage and ensures the overall quality of the output video.

[0121] In one exemplary embodiment, the video generation prompts are generated by a prompt word engineering agent based on the audiovisual design scheme, including: retrieving prompt word writing rules that are compatible with the video generation model to be called from a pre-built prompt word engineering knowledge base by the prompt word engineering agent; and converting the audiovisual design scheme into video generation prompts by the prompt word engineering agent based on the prompt word writing rules.

[0122] Among them, the prompt word engineering knowledge base is a pre-built knowledge base that stores the writing specifications and practical cases of prompt words for various video generation models.

[0123] The rules for writing prompts are specifications for the format of prompts, keyword usage, parameter settings, etc., for specific video generation models.

[0124] The video generation prompts include drawing prompts, video prompts, and camera movement parameters. They may also include model control parameters and reference image information. The video generation prompts form a structured set of instructions that the model can execute precisely.

[0125] Optionally, after obtaining the complete audiovisual design scheme, the server activates the prompt engineering agent. The prompt engineering agent first identifies the type and version of the video generation model the server will call. Using this model identifier as a query condition, it retrieves the prompt writing rules for that model from the prompt engineering knowledge base, including the keywords supported by the model, weighted syntax, usage of negative prompts, and expression methods for camera movement parameters. Then, the prompt engineering agent traverses each storyboard unit in the audiovisual design scheme. Based on the retrieved rules, it converts the scene descriptions of the storyboard units into standardized drawing prompts, converts scene emotions and camera movement intentions into video prompts, and converts camera movement methods in the cinematography scheme into parametric camera movement instructions. Finally, the prompt engineering agent outputs a structured instruction set containing drawing prompts, video prompts, and camera movement parameters for all storyboard units.
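The conversion step can be sketched as a rule lookup followed by per-unit formatting. The model identifier `demo-model@1`, the rule fields, the weighting syntax, and the camera type names are all invented for this sketch and do not describe any real video generation model; in the described system the rules would be retrieved from the prompt engineering knowledge base.

```python
from typing import Dict

# Hypothetical rule store keyed by "model@version".
PROMPT_RULES: Dict[str, dict] = {
    "demo-model@1": {
        "weight_syntax": "({text}:{weight})",
        "negative_prompt": "blurry, distorted",
        "camera_types": {"push": "dolly_in", "pull": "dolly_out", "pan": "pan"},
    }
}


def build_unit_prompts(model_id: str, unit: dict, rules_kb: Dict[str, dict] = PROMPT_RULES) -> dict:
    """Convert one storyboard unit into a structured instruction set for the target model."""
    rules = rules_kb[model_id]  # raises KeyError if no rules are stored for the model
    camera = unit["camera"]
    return {
        "shot_id": unit["shot_id"],
        # Scene description becomes the drawing prompt, in the model's weighting syntax.
        "drawing_prompt": rules["weight_syntax"].format(text=unit["scene_description"], weight=1.2),
        "negative_prompt": rules["negative_prompt"],
        # Scene emotion and camera movement intention become the video prompt.
        "video_prompt": f"{unit['emotion']}, {unit['movement_intent']}",
        # Camera movement method becomes a parametric instruction.
        "camera_params": {
            "type": rules["camera_types"][camera["method"]],
            "speed": camera["speed"],
            "amplitude": camera["amplitude"],
        },
    }


unit = {
    "shot_id": 12,
    "scene_description": "two friends at a table",
    "emotion": "tense",
    "movement_intent": "slowly closing in",
    "camera": {"method": "push", "speed": 0.3, "amplitude": 0.5},
}
prompts = build_unit_prompts("demo-model@1", unit)
```

Keeping model-specific syntax in the rule store means that switching to a different generation model only requires a new rule entry, not a change to the conversion code.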

[0126] In this embodiment, through the prompt word engineering agent and its knowledge base, the abstract professional audiovisual design scheme is accurately translated into standardized prompt word instructions that the underlying video generation model can understand and execute. This solves the problem of "inaccurate wording" and uncontrollable generation effects caused by users or general models being unfamiliar with the prompt word specifications of specific models. It ensures that the carefully designed audiovisual schemes upstream can be accurately reproduced by the video generation model, thereby fundamentally opening up the conversion channel from script ideas to high-quality videos and significantly improving the visualization quality of text scripts.

[0127] In an exemplary embodiment, when the original script is a long script, the method further includes a batch scheduling and parallel processing method: the original script is divided into several batches according to narrative integrity and character overlap; within each batch, each agent executes serially according to the cooperative topology order; different batches are executed in parallel and share global data in the global shared memory; after adjacent batches are completed, a quality inspection agent performs a cross-batch consistency check on the connecting shots.

[0128] Among them, the batch is a collection of sub-scripts obtained by dividing the long script according to narrative coherence and character overlap.

[0129] Narrative integrity means ensuring that each batch contains a relatively complete story segment (such as a scene or a plot unit).

[0130] Among them, role overlap is an indicator that measures how many roles are shared between different batches.

[0131] The cooperative topology order is the order of dependencies between agents.

[0132] Among them, the cross-batch consistency verification involves the quality inspection intelligent agent checking the character's face, scene details, style continuity, and other aspects of the shots at the boundary of adjacent batches.

[0133] Optionally, upon receiving a lengthy script (such as a complete script with over 120 scenes), the server first performs batch segmentation: based on narrative structure analysis, the script is divided into several "acts" or "plot units," and the overlap of characters between candidate batches is calculated to ensure that characters within each batch are relatively concentrated and that there are few shared characters between batches, thereby reducing memory conflicts during parallel processing. Then, an independent collaborative process is initiated for each batch: within the same batch, agents such as director, storyboard artist, art director, and cinematographer execute sequentially according to the collaborative topology, ensuring the coherence of narrative logic and visual style within the batch; between different batches, due to shared global memory (global character files, global scene files, style settings, etc.), their respective agent collaborative processes can be initiated simultaneously, achieving parallel processing between batches. Once adjacent batches have completed their design, a quality control agent is activated to perform specific checks on shots at the batch transitions, such as checking whether the facial features of the same character are consistent at the end and beginning of a batch, whether scene details continue, and whether the color tone is consistent. If inconsistencies are found, the quality control agent and related agents perform iterative corrections in a "self-correcting" manner until the transitions meet the consistency requirements.
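The "serial within a batch, parallel between batches" schedule can be sketched with a thread pool. This is a simplified illustration: the agent steps are placeholders that only read a shared style setting, the lock is an assumed way of guarding the shared memory, and the boundary pairs are merely collected for a later consistency check.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

AGENT_ORDER = ["director", "storyboard", "art", "cinematography"]


def run_batch(batch_id: int, scenes: List[str], shared: Dict[str, str], lock: threading.Lock) -> dict:
    trace = []
    for agent in AGENT_ORDER:  # agents run serially inside one batch
        with lock:  # guard reads of the global shared memory
            style = shared["style"]
        trace.append(f"{agent}:{style}")
    return {"batch": batch_id, "trace": trace, "first": scenes[0], "last": scenes[-1]}


def run_all(batches: List[List[str]], shared: Dict[str, str]) -> Tuple[List[dict], List[Tuple[str, str]]]:
    lock = threading.Lock()
    with ThreadPoolExecutor() as pool:  # batches run in parallel
        futures = [pool.submit(run_batch, i, b, shared, lock) for i, b in enumerate(batches)]
        results = [f.result() for f in futures]  # collected in submission order
    # Pairs of (last scene of batch n, first scene of batch n+1) to be checked
    # by the quality inspection agent for cross-batch consistency.
    boundaries = [(a["last"], b["first"]) for a, b in zip(results, results[1:])]
    return results, boundaries


results, boundaries = run_all([["s1", "s2"], ["s3", "s4"], ["s5"]], {"style": "noir"})
```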

[0134] Specifically, in the batch segmentation stage, the following factors are considered: narrative integrity of the scene, i.e., avoiding segmentation in the middle of a narrative paragraph to ensure that the scenes within each batch constitute relatively complete narrative fragments; and the overlap of characters between scenes, i.e., prioritizing adjacent scenes sharing a main character and grouping them into the same batch for more precise consistency control within the batch. In the cross-batch consistency verification stage, the quality inspection agent conducts a special inspection of the shots at the junction of adjacent batches. The inspection includes at least: whether the facial features and clothing details of the same character are consistent in the end shot of the previous batch and the beginning shot of the next batch, whether the scene transition is natural and coherent, and whether the color tone and style are consistent. For shots that fail the verification, the system rolls back the state of the shot to the corresponding stage, where the relevant agent performs targeted corrections and regenerates it, without having to reprocess the entire batch.
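One possible way to combine the two segmentation factors is a greedy pass that cuts only at narrative boundaries and only when character overlap is low. The Jaccard measure, the `min_overlap` threshold, and the `starts_unit` flag are assumptions made for this sketch; the text does not specify how overlap is calculated.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def split_batches(scenes: List[Dict[str, object]], min_overlap: float = 0.3) -> List[List[int]]:
    """Greedy split: cut only where a narrative unit starts and the cast barely overlaps."""
    batches: List[List[int]] = []
    current: List[int] = []
    cast: Set[str] = set()
    for scene in scenes:
        characters = set(scene["characters"])
        # Narrative integrity: never cut in the middle of a narrative unit.
        can_cut = bool(scene["starts_unit"]) and bool(current)
        # Character overlap: keep scenes that share characters in the same batch.
        if can_cut and jaccard(cast, characters) < min_overlap:
            batches.append(current)
            current, cast = [], set()
        current.append(scene["id"])
        cast |= characters
    if current:
        batches.append(current)
    return batches


scenes = [
    {"id": 1, "characters": ["A", "B"], "starts_unit": True},
    {"id": 2, "characters": ["A"], "starts_unit": False},
    {"id": 3, "characters": ["A", "C"], "starts_unit": True},
    {"id": 4, "characters": ["D", "E"], "starts_unit": True},
    {"id": 5, "characters": ["D"], "starts_unit": False},
]
batches = split_batches(scenes)
```

Scene 3 starts a new narrative unit but stays in the first batch because it shares character A with the preceding scenes; scene 4 introduces an unrelated cast and opens a new batch.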

[0135] In this embodiment, a hybrid scheduling strategy of inter-batch parallelism and intra-batch serialism, along with a batch boundary consistency verification mechanism, maximizes parallel processing efficiency while ensuring the narrative coherence of long scripts, thus resolving the technical contradiction of balancing efficiency and quality in the processing of long scripts.

[0136] For the convenience of those skilled in the art, Figure 4 provides a complete automated workflow from original script input to final video compositing. First, after the original script is preprocessed, the script analysis agent performs deep deconstruction through atomic-level subtask decomposition and RAG knowledge enhancement, outputting structured scene data and character / scene profiles. Then, the batch scheduling engine divides the scenes into multiple batches according to resources, and the workflow enters the multi-agent collaborative storyboard design stage. In this stage, the director, storyboard design, art director, cinematographer, and quality inspection agents work collaboratively in a "serial within batch, parallel between batches" manner, and the quality inspection agent's "internal competition" mechanism ensures output quality. Based on the design scheme that has passed quality inspection, the prompt engineering agent generates structured prompts, calls the image generation model (combined with the style model and reference image injection) to generate keyframe images, and calls the video generation model (using the keyframe image as the first frame and injecting camera movement parameters) to generate dynamic video clips. The generated video clips then undergo post-generation quality inspection and iterative correction, cross-batch boundary consistency verification, and finally video compositing (including splicing, transitions, audio, subtitles, and style post-processing); they can optionally undergo manual review and incremental local regeneration. The process shown in Figure 4 supports flexible switching between fully automated and human-machine collaboration modes. In human-machine collaboration mode, an approval node is set after steps S2, S4, S5, S6, S7, and S10. After the user makes modifications, the system automatically assesses the scope of impact and incrementally regenerates only the affected shots, eliminating the need for a complete rework.

[0137] In practical applications, multiple video creation agents interact through a global shared memory. This global shared memory stores at least global character profiles, global scene profiles, and global style configurations. All agents can read the data and write their structured outputs according to their respective responsibilities. The system determines the storage location based on the data type: global data spanning the entire script (such as character appearance descriptions, scene space settings, and overall style parameters) is written to the global shared memory; intermediate data during the execution of a single batch or shot (such as intermediate reasoning results for the current batch, quality control feedback records, candidate outputs, and scores) is written to the task-level local memory. After the task is completed, key results requiring persistence in the task-level local memory (such as the final storyboard that passes quality control and confirmed prompts) are simultaneously written to the global shared memory, while temporary intermediate data is cleaned up and released according to a strategy. In this way, the data diversion strategy not only ensures that each agent can obtain the complete global context to maintain consistency across shots, but also avoids the pollution of the global memory space by task-level temporary data. Through the hierarchical design of global shared memory and task-level local memory, as well as the diversion strategy according to data type, the efficient sharing of global context information and the effective isolation of task-level temporary data are achieved, ensuring the information consistency of multiple agents when collaborating across shots and batches.
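The data diversion strategy can be sketched as a task-level store that flags which entries should outlive the task. The class name, the `persist` flag, and the key names are hypothetical; a plain dictionary stands in for the global shared memory.

```python
from typing import Dict, Set


class TaskMemory:
    """Task-level local memory that syncs flagged results to global memory on completion."""

    def __init__(self, global_memory: Dict[str, object]) -> None:
        self.global_memory = global_memory
        self.local: Dict[str, object] = {}
        self.persist_keys: Set[str] = set()

    def write(self, key: str, value: object, persist: bool = False) -> None:
        # Intermediate data stays local; only flagged keys are synced later.
        self.local[key] = value
        if persist:
            self.persist_keys.add(key)
        else:
            self.persist_keys.discard(key)

    def finish(self) -> None:
        # Sync key results to the global shared memory, then release temporary data.
        for key in self.persist_keys:
            self.global_memory[key] = self.local[key]
        self.local.clear()
        self.persist_keys.clear()


global_memory: Dict[str, object] = {"style": "noir"}
task = TaskMemory(global_memory)
task.write("qa_feedback_round_1", ["missing entrance shot"])
task.write("candidate_scores", [0.6, 0.8])
task.write("storyboard_final", "storyboard scheme V3", persist=True)
task.finish()
```

After `finish()`, only the confirmed storyboard reaches the global memory, while quality inspection feedback and candidate scores are discarded.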

[0138] This application also provides a script visualization intelligent generation system based on multi-agent collaboration, comprising four components: global shared memory, task-level local memory, state machine manager, and professional knowledge base. The global shared memory stores character profiles, scene profiles, global style configurations, and generated content indexes for the entire script. This component is designed to be readable and writable by all agents, ensuring consistency across shots and batches. The task-level local memory stores intermediate inference data for a single batch or shot. This component is designed to synchronize key results to the global shared memory after task completion and to clean up temporary data. The state machine manager manages the full sequence of state transitions for each shot task. This component supports state rollback, breakpoint resumption, and precise rework. The professional knowledge base includes vertical domain knowledge such as film and television language, prompt engineering, narrative structure, and style references. This component is designed to dynamically inject knowledge into the agents' inference context through a RAG mechanism.
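State rollback and breakpoint resumption for a single shot task can be sketched with a small linear state machine. The stage names and the snapshot format are illustrative assumptions, not a specification of the state machine manager.

```python
from typing import Dict, List

# Illustrative stage names for one shot task.
STAGES: List[str] = [
    "deconstruction", "storyboard_design", "quality_inspection", "prompt_generation",
    "image_generation", "video_generation", "post_generation_qa",
]


class ShotStateMachine:
    def __init__(self) -> None:
        self.position = 0
        self.outputs: Dict[str, object] = {}

    @property
    def state(self) -> str:
        return STAGES[self.position] if self.position < len(STAGES) else "done"

    def complete(self, output: object) -> None:
        """Record the current stage's output and move to the next stage."""
        if self.position >= len(STAGES):
            raise RuntimeError("task already finished")
        self.outputs[STAGES[self.position]] = output
        self.position += 1

    def rollback(self, stage: str) -> None:
        """Return to an earlier stage, discarding outputs from that stage onward."""
        target = STAGES.index(stage)
        if target > self.position:
            raise ValueError("cannot roll forward")
        for later in STAGES[target:]:
            self.outputs.pop(later, None)
        self.position = target

    def snapshot(self) -> dict:
        return {"position": self.position, "outputs": dict(self.outputs)}

    @classmethod
    def restore(cls, snap: dict) -> "ShotStateMachine":
        # Breakpoint resumption: rebuild the machine from a saved snapshot.
        machine = cls()
        machine.position = snap["position"]
        machine.outputs = dict(snap["outputs"])
        return machine


machine = ShotStateMachine()
for stage in STAGES[:5]:
    machine.complete(f"{stage} output")
machine.rollback("image_generation")
resumed = ShotStateMachine.restore(machine.snapshot())
```

Rolling back to `image_generation` keeps the upstream outputs, so only that stage and the ones after it are redone, which is the "precise rework" behaviour described above.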

[0139] For the convenience of those skilled in the art, Figure 5 shows the data read / write relationships and flow paths between the four components: global shared memory, task-level local memory, state machine manager, and professional knowledge base. The system uses global shared memory as its core, storing persistent data such as character profiles, scene profiles, and global style configurations, and maintaining a generated content index (including character IDs, shot IDs, and corresponding image references, video references, and records of used prompt words). Each agent accesses this memory through read / write interfaces according to its responsibilities: the script analysis agent writes character / scene profiles, the director agent writes global style configurations, the art agent reads the profiles and writes detailed descriptions, the prompt word agent reads all profiles to generate prompt words, and the quality control agent reads all content for consistency verification. Simultaneously, the system maintains task-level local memory for each batch / shot, which temporarily stores scene fragments, iterative reasoning results, and quality inspection feedback (such as issues from the first round and corrections from the second round) for the current batch. After the task is completed, key results are synchronized to the global shared memory and temporary data is cleared. The state machine manager is responsible for tracking the state of each stage (including deconstruction, storyboard design, quality inspection, prompt word generation, image generation, video generation, and post-generation quality inspection), supporting state rollback, breakpoint resumption, and precise rework. In addition, the system has a built-in professional knowledge base (covering film language, prompt word engineering, and narrative structure). Through the RAG retrieval process, the task context is transformed into vector queries, and relevant knowledge is injected into the reasoning context of the agent, providing professional support for each stage.
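The RAG retrieval process can be illustrated with a toy example. The sketch below assumes a bag-of-words vector and cosine similarity in place of a real embedding model and vector store, and the three knowledge entries are invented for the example; none of these details come from this application.

```python
import math
from collections import Counter

# Hypothetical knowledge entries; the actual knowledge base contents are not specified.
KNOWLEDGE = [
    "low angle shot makes a character look powerful",
    "three act narrative structure setup confrontation resolution",
    "prompt words should name subject style lighting and lens",
]


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(task_context, k=1):
    """Return the k knowledge entries most similar to the task context."""
    query = embed(task_context)
    ranked = sorted(KNOWLEDGE, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:k]


def build_agent_context(task_context):
    """Inject the retrieved knowledge ahead of the task in the reasoning context."""
    knowledge = "\n".join(retrieve(task_context))
    return f"Relevant knowledge:\n{knowledge}\n\nTask:\n{task_context}"


context = build_agent_context("choose a low angle shot for the villain")
```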

[0140] This application also provides a batch scheduling and parallel processing mechanism, designing a hybrid scheduling strategy of "inter-batch parallelism and intra-batch serial execution" for long scripts. This strategy specifically includes batch segmentation, intra-batch serial execution, inter-batch parallelism, and batch boundary verification. Batch segmentation refers to dividing the complete script into several batches based on narrative integrity and character overlap; intra-batch serial execution refers to executing each agent within the same batch according to the cooperative topology order to ensure narrative coherence within the batch; inter-batch parallelism refers to different batches simultaneously initiating their respective agent collaborative processes, sharing global data such as characters / scenes / styles in global memory; and batch boundary verification refers to a quality control agent performing a cross-batch consistency check on connecting shots after adjacent batches are completed.
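A minimal sketch of the hybrid scheduling strategy is given below, assuming a thread pool for inter-batch parallelism and a plain loop for intra-batch serial execution. The agent order, the placeholder `run_agent` function, and the boundary check that compares style tags of connecting outputs are all assumptions for illustration; a real agent call would invoke a language model and a real boundary check would compare the connecting shots themselves.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed cooperative topology order within a batch.
AGENT_ORDER = ["director", "storyboard", "art", "cinematography"]


def run_agent(agent, batch, shared):
    """Placeholder agent call that reads the shared global data."""
    return {"agent": agent, "batch": batch["id"], "style": shared["style"]}


def run_batch(batch, shared):
    """Intra-batch serial execution: agents run one after another, in order."""
    return [run_agent(agent, batch, shared) for agent in AGENT_ORDER]


def check_boundary(prev_outputs, next_outputs):
    """Placeholder consistency check between the connecting outputs of two batches."""
    return prev_outputs[-1]["style"] == next_outputs[0]["style"]


def schedule(batches, shared):
    # Inter-batch parallelism: each batch starts its own agent pipeline.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda b: run_batch(b, shared), batches))
    # Batch boundary verification runs once adjacent batches are complete.
    boundaries = [check_boundary(a, b) for a, b in zip(results, results[1:])]
    return results, boundaries


results, boundaries = schedule([{"id": 1}, {"id": 2}, {"id": 3}], {"style": "noir"})
```

`pool.map` returns results in the order of the input batches, so adjacent entries of `results` correspond to adjacent batches of the script.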

[0141] This application provides a script visualization intelligent generation system based on multi-agent collaboration, comprising a multi-agent collaborative decision-making layer, an instruction translation and scheduling layer, and a generation model execution layer. The multi-agent collaborative decision-making layer corresponds to steps S102-S104 and is used to understand the script, decompose decisions, and generate instructions. The instruction translation and scheduling layer corresponds to step S106 and is used to convert decisions into structured instructions executable by the model, and to manage and schedule those instructions. The generation model execution layer corresponds to step S108 and is a pure execution unit that receives the structured instructions output by the multi-agent collaborative decision-making layer and drives the underlying model to work. The image generation module of the generation model execution layer can call text-to-image / image-to-image models based on the drawing prompts output by the prompt word engineering agent. It constrains the inference process to converge to the target style space through its own style model (such as LoRA), achieves cross-shot character facial consistency through reference image injection, and injects composition / pose control conditions through technologies such as ControlNet. The video generation module of the generation model execution layer can call the image-to-video model with keyframe images as the first frame, combined with video prompts and camera movement parameters, supporting parameterized control of camera movement type, rate, amplitude, etc. After generation, the quality inspection module of the generation model execution layer automatically evaluates image quality, content matching degree, character consistency, and style consistency. Shots that fail the quality inspection are automatically rolled back and regenerated.
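One possible shape for the structured instruction passed to the video generation module is sketched below. The field names, the normalized [0, 1] ranges for rate and amplitude, and the movement vocabulary are assumptions for illustration; this application does not specify an instruction format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CameraMove:
    move_type: str    # e.g. "push_in" or "pan_left" (hypothetical vocabulary)
    rate: float       # assumed normalized speed in [0, 1]
    amplitude: float  # assumed normalized extent in [0, 1]

    def __post_init__(self):
        # Reject out-of-range parameters before they reach the model.
        for name in ("rate", "amplitude"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def build_video_instruction(shot_id, keyframe_path, video_prompt, camera):
    """Assemble the structured instruction handed to the execution layer."""
    return {
        "shot_id": shot_id,
        "first_frame": keyframe_path,   # keyframe image used as the first frame
        "prompt": video_prompt,
        "camera": asdict(camera),
    }


instruction = build_video_instruction(
    "shot-1", "shot-1.png", "rain falls on an empty street",
    CameraMove("push_in", rate=0.3, amplitude=0.5),
)
```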

[0142] Overall, the beneficial effects of this application include: effectively solving the attention decay problem of large language models when processing long scripts, and improving the professionalism of the output at each stage, through a decomposed workflow architecture and a retrieval-augmented generation mechanism within each agent; solving the problem of collaboration among multiple professional decision-makers by simulating collaboration among roles such as screenwriter / director / storyboard artist / art director / cinematographer; solving the problem of character and style consistency across shots through global shared memory and the data diversion strategy; ensuring controllable quality of the final video material, and thus solving the problem of uncontrollable quality across the entire chain, through an automated post-generation quality evaluation closed loop; significantly improving the processing efficiency of long scripts while preserving narrative coherence through the hybrid scheduling strategy of inter-batch parallelism and intra-batch serial execution; and maintaining high efficiency while preserving the creator's fine control over the generated results through configurable manual intervention nodes and incremental local regeneration strategies.

[0143] In one exemplary embodiment, Figure 6 is a flowchart of another script visualization method based on multi-agent collaboration provided in an embodiment of this application. As shown in Figure 6, a script visualization method based on multi-agent collaboration is provided. Taking the application of this method to a server as an example, the method includes the following steps S602 to S626, wherein:

[0144] Step S602: Obtain the original script, and use the script analysis agent to perform scene segmentation based on the original script to obtain each scene unit, and extract the scene structure information of each scene unit as scene information.

[0145] Step S604: The script analysis agent aggregates the scene structure information of each scene unit to generate a global character profile, a global scene profile, a global narrative structure, and a global emotion curve as global information.

[0146] Step S606: The director agent retrieves matching related narrative structure knowledge from the pre-built narrative structure knowledge base based on global information, and generates a director's plan based on global information and related narrative structure knowledge.

[0147] Step S608: The storyboard design agent generates a storyboard scheme based on scene information and director's scheme. The storyboard scheme includes multiple storyboard units. Each storyboard unit corresponds to a shot description.

[0148] Step S610: The art director intelligent agent generates an art scheme based on the storyboard scheme and global information.

[0149] Step S612: The photography director agent generates a photography plan based on the director's plan, storyboard plan, and art plan.

[0150] Step S614: The quality inspection agent performs quality audits on the director's plan, storyboard plan, art plan, and cinematography plan respectively.

[0151] Step S616: If the review fails, the quality inspection agent feeds the identified problems back to the corresponding video creation agent. The corresponding video creation agent then makes corrections based on the feedback, and the quality inspection agent reviews the corrections. This process is repeated until the director's plan, storyboard plan, art plan, and cinematography plan all meet the preset quality standards or the preset maximum number of iterations is reached.

[0152] Step S618: Based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme that meet the preset quality standards, generate an audiovisual design scheme for the original script.

[0153] Step S620: Based on the audiovisual design scheme, the prompt word engineering agent generates video generation prompt words; the video generation prompt words include drawing prompt words, video prompt words, and camera movement parameters for each storyboard unit.

[0154] Step S622: Based on the drawing prompts for each storyboard unit, call the image generation model to generate keyframe images for each storyboard unit.

[0155] Step S624: Based on the video prompts and camera movement parameters of each storyboard unit, and using the keyframe image of each storyboard unit as the first frame, call the video generation model to generate video clips for each storyboard unit.

[0156] Step S626: Generate video based on the video clips of each storyboard unit.

[0157] It should be noted that the specific limitations of the above steps can be found in the above description of the specific limitations of a script visualization method based on multi-agent collaboration.
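The input dependencies among the design agents in steps S606 to S612 form a directed acyclic graph, so a valid execution order can be derived mechanically. The following sketch uses the standard-library `graphlib` module (Python 3.9 or later); the short agent names are illustrative, and the inputs supplied by the script analysis agent are omitted for brevity.

```python
from graphlib import TopologicalSorter

# Each agent maps to the design agents whose output it consumes (steps S606 to S612).
DEPENDENCIES = {
    "director": set(),
    "storyboard": {"director"},
    "art": {"storyboard"},
    "cinematography": {"director", "storyboard", "art"},
}


def cooperative_order(dependencies):
    """Return an execution order in which every agent runs after its inputs exist."""
    return list(TopologicalSorter(dependencies).static_order())


order = cooperative_order(DEPENDENCIES)
print(order)
```

Because each agent here depends on the one before it, the order is unique: director, storyboard, art, cinematography.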

[0158] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0159] The following describes the script visualization device based on multi-agent collaboration provided in the embodiments of this application. The script visualization device based on multi-agent collaboration has the same inventive concept as the script visualization method based on multi-agent collaboration described above. The solution provided by this device is similar to the solution described in the above method. Therefore, the specific limitations of one or more embodiments of the script visualization device based on multi-agent collaboration provided below can be found in the limitations of the script visualization method based on multi-agent collaboration described above. The script visualization device based on multi-agent collaboration described below and the script visualization method based on multi-agent collaboration described above can be referred to each other, and will not be repeated here.

[0160] In one exemplary embodiment, Figure 7 is a schematic diagram of the structure of a script visualization device based on multi-agent collaboration provided in an embodiment of this application. As shown in Figure 7, the script visualization device based on multi-agent collaboration includes: a script parsing module 702, an audiovisual design module 704, a prompt word generation module 706, and a model invocation module 708, wherein:

[0161] The script parsing module 702 is used to acquire the original script and parse the original script through the script analysis agent to obtain the script's structured information;

[0162] The audiovisual design module 704 is used to collaboratively generate an audiovisual design scheme for the original script based on the structured information of the script through multiple video creation intelligent agents.

[0163] The prompt word generation module 706 is used to generate video generation prompt words based on the audiovisual design scheme through the prompt word engineering intelligent agent;

[0164] The model invocation module 708 is used to invoke the video generation model based on the video generation prompt words to generate the video.

[0165] In an exemplary embodiment, the script structured information includes global information and scene information; the script parsing module 702 is specifically used to split the original script into scenes by the script analysis agent to obtain each scene unit, and extract the scene structured information of each scene unit as scene information; the script analysis agent aggregates the scene structured information of each scene unit to generate a global character profile, a global scene profile, a global narrative structure, and a global emotion curve as global information.

[0166] In an exemplary embodiment, the audiovisual design scheme includes a director's scheme, a storyboard scheme, an art direction scheme, and a cinematography scheme; the video creation intelligent agent includes a director's intelligent agent, a storyboard design intelligent agent, an art direction intelligent agent, and a cinematography intelligent agent; the audiovisual design module 704 is specifically used to: retrieve matching related narrative structure knowledge from a pre-built narrative structure knowledge base based on global information using the director's intelligent agent, and generate a director's scheme based on global information and the related narrative structure knowledge using the director's intelligent agent; generate a storyboard scheme based on scene information and the director's scheme using the storyboard design intelligent agent; generate an art direction scheme based on the storyboard scheme and global information using the art direction intelligent agent; generate a cinematography scheme based on the director's scheme, storyboard scheme, and art direction scheme using the cinematography intelligent agent; and generate an audiovisual design scheme for the original script based on the director's scheme, storyboard scheme, art direction scheme, and cinematography scheme.

[0167] In an exemplary embodiment, the video creation agent further includes a quality inspection agent; the audiovisual design module 704 is specifically used to conduct quality reviews of the director's scheme, storyboard scheme, art scheme, and cinematography scheme through the quality inspection agent; if the review fails, the quality inspection agent provides feedback to the corresponding video creation agent, which then makes corrections based on the feedback, and the quality inspection agent reviews the corrections, iterating until the director's scheme, storyboard scheme, art scheme, and cinematography scheme all meet the preset quality standards or reach the preset maximum iteration round; based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme that meet the preset quality standards, an audiovisual design scheme for the original script is generated.
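The review and correction iteration can be sketched as a bounded loop. In the sketch below, the `review` and `revise` callables stand in for the quality inspection agent and the video creation agents; their signatures, the default of three rounds, and the toy rule that flags the text "TODO" are assumptions for illustration only.

```python
def review_loop(schemes, review, revise, max_rounds=3):
    """Iterate review and correction until every scheme passes or rounds run out.

    schemes: dict mapping scheme name -> content
    review(name, content) -> list of problems (an empty list means pass)
    revise(name, content, problems) -> corrected content
    Returns (schemes, passed, rounds_used).
    """
    schemes = dict(schemes)
    for round_no in range(1, max_rounds + 1):
        failures = {}
        for name, content in schemes.items():
            problems = review(name, content)
            if problems:
                failures[name] = problems
        if not failures:
            return schemes, True, round_no
        for name, problems in failures.items():
            # Feedback goes only to the agent that owns the failing scheme.
            schemes[name] = revise(name, schemes[name], problems)
    # Maximum rounds reached: report whether the last corrections pass.
    passed = all(not review(name, content) for name, content in schemes.items())
    return schemes, passed, max_rounds


# Toy stand-ins for the quality inspection agent and the creation agents.
def toy_review(name, content):
    return ["unfinished section"] if "TODO" in content else []


def toy_revise(name, content, problems):
    return content.replace("TODO", "done")


final, passed, rounds = review_loop(
    {"director": "ok", "storyboard": "TODO shot 3"}, toy_review, toy_revise
)
```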

[0168] In an exemplary embodiment, the storyboard scheme includes multiple storyboard units; each storyboard unit corresponds to a shot description; the video generation prompts include drawing prompts, video prompts, and camera movement parameters for each storyboard unit; the model invocation module 708 is specifically used to: invoke an image generation model, based on the drawing prompts and the shot description of each storyboard unit, to generate keyframe images for each storyboard unit; invoke a video generation model, based on the video prompts and camera movement parameters of each storyboard unit and using the keyframe image of each storyboard unit as the first frame, to generate video segments for each storyboard unit; and generate a video based on the video segments of each storyboard unit.
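The per-unit flow from drawing prompt to keyframe to video segment to final video can be sketched as follows. The two model functions are string-returning stand-ins rather than real model calls, and the `StoryboardUnit` fields and the string-joining "concatenation" are assumptions made so the example runs without any external model.

```python
from dataclasses import dataclass


@dataclass
class StoryboardUnit:
    shot_id: str
    drawing_prompt: str
    video_prompt: str
    camera: dict  # camera movement parameters, e.g. {"type": "pan"}


def generate_keyframe(drawing_prompt):
    """Stand-in for an image generation model call."""
    return f"image<{drawing_prompt}>"


def generate_clip(first_frame, video_prompt, camera):
    """Stand-in for a video generation model call that starts from first_frame."""
    return f"clip<{first_frame}|{video_prompt}|{camera['type']}>"


def render(units):
    """Generate one clip per storyboard unit and join them in storyboard order."""
    clips = []
    for unit in units:
        keyframe = generate_keyframe(unit.drawing_prompt)
        clips.append(generate_clip(keyframe, unit.video_prompt, unit.camera))
    # Stand-in for concatenating the video segments into the final video.
    return " + ".join(clips)


video = render([
    StoryboardUnit("s1", "street at night", "rain starts", {"type": "pan"}),
    StoryboardUnit("s2", "detective close-up", "he looks up", {"type": "push_in"}),
])
```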

[0169] In an exemplary embodiment, the model invocation module 708 is further configured to perform automated quality assessment on the keyframe images and / or video clips of each storyboard unit; the assessment dimensions include one or more of image quality, content matching, character consistency, and style consistency; and to adjust the drawing prompts, video prompts, and camera movement parameters of storyboard units that fail the automated quality assessment in order to regenerate video generation prompts.
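A minimal sketch of this assessment and regeneration loop is shown below. The score threshold of 0.7, the limit of three attempts, and the toy generate / evaluate / adjust functions are assumptions for illustration; this application does not specify how scores are computed or how prompts are adjusted.

```python
DIMENSIONS = (
    "image_quality", "content_match", "character_consistency", "style_consistency",
)


def failed_dimensions(scores, threshold=0.7):
    """Return the assessment dimensions whose score falls below the threshold."""
    return [d for d in DIMENSIONS if scores[d] < threshold]


def generate_with_qc(prompts, generate, evaluate, adjust, max_attempts=3):
    """Generate, assess, and regenerate with adjusted prompts on failure."""
    for attempt in range(1, max_attempts + 1):
        output = generate(prompts)
        failed = failed_dimensions(evaluate(output))
        if not failed:
            return output, attempt
        prompts = adjust(prompts, failed)
    return None, max_attempts


# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompts):
    return prompts["drawing"]


def toy_evaluate(output):
    scores = dict.fromkeys(DIMENSIONS, 0.9)
    if "style:" not in output:
        scores["style_consistency"] = 0.5
    return scores


def toy_adjust(prompts, failed):
    if "style_consistency" in failed:
        return {**prompts, "drawing": prompts["drawing"] + " style:noir"}
    return prompts


output, attempts = generate_with_qc(
    {"drawing": "detective in rain"}, toy_generate, toy_evaluate, toy_adjust
)
```

In this toy run the first attempt fails on style consistency, the drawing prompt is adjusted, and the second attempt passes.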

[0170] In an exemplary embodiment, the prompt word generation module 706 is specifically used to retrieve prompt word writing rules that are compatible with the video generation model to be called from a pre-built prompt word engineering knowledge base through a prompt word engineering intelligent agent; and to convert the audiovisual design scheme into video generation prompt words based on the prompt word writing rules through the prompt word engineering intelligent agent.
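The rule lookup and conversion can be illustrated with a small table keyed by target model. The model names, the rule fields (`separator`, `max_words`, `order`), and the design fields are all hypothetical; real prompt word writing rules would be retrieved from the knowledge base and would likely be far richer.

```python
# Hypothetical writing rules keyed by hypothetical model names.
PROMPT_RULES = {
    "model_a": {"separator": ", ", "max_words": 12, "order": ["subject", "action", "style"]},
    "model_b": {"separator": ". ", "max_words": 30, "order": ["style", "subject", "action"]},
}


def retrieve_rules(model_name):
    """Look up the writing rules that match the model to be called."""
    return PROMPT_RULES[model_name]


def to_prompt(design, model_name):
    """Convert one design entry into a prompt that follows the model's rules."""
    rules = retrieve_rules(model_name)
    # Order the fields as the target model prefers, skipping empty ones.
    parts = [design[name] for name in rules["order"] if design.get(name)]
    prompt = rules["separator"].join(parts)
    # Enforce the model's length limit, counted in whitespace-separated words.
    words = prompt.split()
    return " ".join(words[: rules["max_words"]])


design = {"subject": "a detective", "action": "walks in rain", "style": "film noir"}
prompt_a = to_prompt(design, "model_a")
prompt_b = to_prompt(design, "model_b")
```

The same design entry yields differently ordered and punctuated prompts for the two models, which is the point of keeping the rules outside the agent.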

[0171] In one exemplary embodiment, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the script visualization methods based on multi-agent collaboration described above.

[0172] In one exemplary embodiment, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of any of the script visualization methods based on multi-agent collaboration described in the above embodiments.

[0173] In one exemplary embodiment, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the multi-agent collaborative script visualization methods described in the above embodiments.

[0174] Illustratively, Figure 8 is a schematic diagram of the internal structure of a computer device 800 provided in an embodiment of this application. The computer device 800 can be provided as a server. Referring to Figure 8, the computer device 800 includes a processing component 802, which further includes one or more processors, and memory resources represented by memory 801 for storing instructions, such as application programs, that can be executed by the processing component 802. The application programs stored in memory 801 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 802 is configured to execute instructions to perform the script visualization method based on multi-agent collaboration of any of the above embodiments.

[0175] The computer device 800 may also include a power supply component 803 configured to perform power management of the computer device 800, a wired or wireless network interface 804 configured to connect the computer device 800 to a network, and an input / output (I / O) interface 805. The computer device 800 may operate on an operating system stored in memory 801, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.

[0176] Those skilled in the art will understand that the structure shown in Figure 8 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0177] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0178] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.

[0179] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A script visualization method based on multi-agent collaboration, characterized in that the method includes: obtaining an original script, and parsing the original script by a script analysis agent to obtain script structured information; collaboratively generating, by multiple video creation agents, an audiovisual design scheme for the original script based on the script structured information; generating, by a prompt word engineering agent, video generation prompts based on the audiovisual design scheme; and invoking a video generation model based on the video generation prompts to generate a video.

2. The method according to claim 1, characterized in that, The script structured information includes global information and scene information; the script analysis agent parses the original script to obtain the script structured information, including: The script analysis agent performs scene segmentation based on the original script to obtain each scene unit, and extracts the scene structure information of each scene unit as the scene information. The script analysis agent aggregates the structured information of each scene unit to generate a global character profile, a global scene profile, a global narrative structure, and a global emotion curve, which serve as the global information.

3. The method according to claim 2, characterized in that the video creation intelligent agent includes a director intelligent agent, a storyboard design intelligent agent, an art director intelligent agent, and a cinematographer intelligent agent; the process of generating an audiovisual design scheme for the original script collaboratively by multiple video creation intelligent agents based on the script's structured information includes: the director intelligent agent retrieves matching related narrative structure knowledge from a pre-built narrative structure knowledge base based on the global information, and generates a director's scheme based on the global information and the related narrative structure knowledge; the storyboard design intelligent agent generates a storyboard scheme based on the scene information and the director's scheme; the art director intelligent agent generates an art scheme based on the storyboard scheme and the global information; the cinematographer intelligent agent generates a cinematography scheme based on the director's scheme, the storyboard scheme, and the art scheme; and an audiovisual design scheme for the original script is generated based on the director's scheme, the storyboard scheme, the art scheme, and the cinematography scheme.

4. The method according to claim 3, characterized in that the video creation intelligent agent also includes a quality inspection intelligent agent; the generation of an audiovisual design scheme for the original script based on the director's scheme, the storyboard scheme, the art scheme, and the cinematography scheme includes: the quality inspection intelligent agent performs quality audits on the director's scheme, the storyboard scheme, the art scheme, and the cinematography scheme, respectively; if the review fails, the quality inspection intelligent agent feeds the identified problems back to the corresponding video creation intelligent agent, the corresponding video creation intelligent agent makes corrections based on the feedback, and the quality inspection intelligent agent reviews the correction results, iterating until the director's scheme, the storyboard scheme, the art scheme, and the cinematography scheme all meet the preset quality standards or the preset maximum number of iterations is reached; and an audiovisual design scheme for the original script is generated based on the director's scheme, storyboard scheme, art scheme, and cinematography scheme that meet the preset quality standards.

5. The method according to claim 3, characterized in that, The storyboard scheme includes multiple storyboard units; each storyboard unit corresponds to a shot description; the video generation prompts include drawing prompts, video prompts, and camera movement parameters for each storyboard unit; the step of generating video based on the video generation prompts and calling the video generation model includes: Based on the drawing prompts for each storyboard unit, the image generation model is invoked to generate keyframe images for each storyboard unit; Based on the video prompts and camera movement parameters of each storyboard unit, and using the keyframe image of each storyboard unit as the first frame, the video generation model is called to generate video segments for each storyboard unit. A video is generated based on the video clips from each of the aforementioned storyboard units.

6. The method according to claim 5, characterized in that, The method further includes: An automated quality assessment is performed on the keyframe images and / or video segments of each storyboard unit; the assessment dimensions include one or more of the following: image quality, content matching, character consistency, and style consistency. The drawing prompts, video prompts, and camera movement parameters of the storyboard units that failed the automated quality assessment are adjusted to regenerate the video generation prompts.

7. The method according to claim 1, characterized in that the process of generating video generation prompts by the prompt word engineering agent based on the audiovisual design scheme includes: the prompt word engineering agent retrieves prompt word writing rules that are compatible with the video generation model to be called from a pre-built prompt word engineering knowledge base; and the prompt word engineering agent converts the audiovisual design scheme into video generation prompts based on the prompt word writing rules.

8. A script visualization device based on multi-agent collaboration, characterized in that the device includes: a script parsing module, used to acquire the original script and parse the original script through a script analysis agent to obtain the script's structured information; an audiovisual design module, used to collaboratively generate an audiovisual design scheme for the original script based on the script's structured information through multiple video creation intelligent agents; a prompt word generation module, used to generate video generation prompts based on the audiovisual design scheme through the prompt word engineering intelligent agent; and a model invocation module, used to invoke the video generation model based on the video generation prompts to generate the video.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.

11. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.