Intelligent agent collaborative document structuring method and system using intermediate representation

By constructing a unified structured intermediate representation and multi-agent collaborative scheduling, the problems of structural mapping and multimodal semantic instability in presentation generation are solved, achieving high-quality consistency between text and graphics and visual interactivity, which is suitable for scenarios such as academic presentations and course teaching.

CN122242477A · Pending · Publication Date: 2026-06-19 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2026-01-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing presentation generation technologies suffer from instability in structural mapping and multimodal semantics, resulting in inconsistent generation effects and high error rates. They also lack interactive and interpretable intermediate processes, making it difficult to generate high-quality, consistent presentations.

Method used

By constructing a unified structured intermediate representation and adopting a multi-agent collaborative scheduling approach, including parsing, planning, reduction, rendering, and visual optimization agents, narrative planning, semantic reduction, template rendering, and graphic optimization are performed to generate presentations with clear structure and consistent graphics.

Benefits of technology

It significantly improves the interpretability and stability of the presentation generation process, reduces error accumulation, enhances the structural coherence and readability of the generated results, supports editable output and visual interaction, and is suitable for scenarios such as academic presentations and course teaching.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and system for generating structured documents through intelligent agents using intermediate representation, belonging to the interdisciplinary application technology of natural language processing, multimodal content generation, and human-computer interaction. The implementation method of this invention is as follows: 1. Perform structural and element analysis on reducible information sources to form a structured intermediate representation; 2. Utilize data contracts to perform rollback, repair, and update processing on the triggering conditions of the multi-agent set, and initialize the scheduler; 3. Allocate text and multimodal elements to generate the outline structure and page-level organization scheme of the presentation; 4. Generate the final draft of the specification; 5. Select the corresponding template and fill the template slots with fields to generate structured pages; 6. The visual optimization agent optimizes the text style and visual expression of the structured pages; 7. Form a presentation and extract the speech text. Compared with existing technologies, this invention improves the consistency of text and graphics and the visual interactivity of automatically generated presentations.

Description

Technical Field

[0001] This invention relates to a method and system for generating structured documents collaboratively by intelligent agents with intermediate representations. It belongs to the cross-application technology field of natural language processing, multimodal content generation and human-computer interaction, and is applied to the automatic generation of presentation documents.

Background Technology

[0002] Presentations are widely used in academic exchanges, teaching and training, and business presentations. Current presentation creation typically requires manual work such as document reading, structural analysis, key point extraction, layout design, and chart creation. This process is time-consuming, quality depends on experience, and it is difficult to scale and reuse presentations.

[0003] In recent years, automated presentation generation technologies based on generative models have gradually emerged, but existing technologies still have the following main problems. First, in terms of structural mapping, input documents such as academic papers and technical reports usually have chapter levels, paragraph logic, and argumentation chains, while presentations must organize the narrative page by page and control both the logic between pages and the information density within each page. Existing solutions are prone to problems such as "mechanically splitting by chapters, leading to too many pages" or "over-summarizing, leading to the loss of key arguments," and lack a stable mechanism for structural mapping and page count budgeting. Second, in terms of multimodal semantics, input documents often contain visual elements such as figures, tables, and formulas, and existing solutions align text and visual elements poorly. Common problems include textual key points failing to reference or explain the corresponding charts, charts being placed on irrelevant pages, and a lack of introductory illustrations to support the explanation, resulting in incomplete information delivery. Third, interactive, explainable, and visualized intermediate processes are lacking: many solutions directly generate the final presentation file or output only natural language drafts. The intermediate decisions made during generation are invisible and difficult to debug, so users cannot intervene in a controlled way in the chain of structure planning, content specification, layout rendering, and graphics and text optimization. At the same time, the lack of a unified data contract and consistency verification easily leads to instability in multi-stage collaboration, error accumulation, and unpredictable final results.

[0004] The lack of multimodal elements and structured intermediate representations leads to decreased stability and increased error in the generated presentations. Therefore, improving the consistency of text and graphics and the interactivity of automatically generated presentations has become an urgent problem to be solved.

Summary of the Invention

[0005] The purpose of this invention is to address the technical problem of decreased stability and increased error in presentation generation due to the lack of multimodal elements and structured intermediate representations, thereby improving the consistency of text and graphics and the interactivity of automatically generated presentations. This invention proposes an intelligent agent collaborative structured presentation generation method and system based on intermediate representations. By constructing a unified structured intermediate representation and employing a multi-agent collaborative scheduling approach, this invention completes narrative planning, semantic reduction, template rendering, text and graphics optimization, and presentation output, thereby generating a clearly structured, consistent, and editable presentation.

[0006] The objective of this invention is achieved through the following technical solution:

[0007] This invention discloses a method for generating intelligent agent collaborative structured documents with intermediate representations, comprising the following steps:

[0008] Step 1: Perform structural and element analysis on reducible information sources containing images, formulas, and tables. Extract summary information and perform multimodal image and text analysis using large models and multimodal large models respectively to form a structured intermediate representation for agent collaboration.

[0009] Step 1.1: Perform structural analysis on reducible information sources containing images, text, formulas, and tables; break down the reducible information sources into document structure trees according to chapter, sub-chapter, and paragraph hierarchical relationships, and extract summary information using a large model;

[0010] Step 1.2: Perform element parsing on the multimodal elements of a reducible information source containing images, text, formulas, and tables, and construct a mapping relationship between multimodal element nodes and the document structure tree;

[0011] Step 1.2.1: Use the multimodal large model to generate text parsing content for multimodal elements, forming multimodal element nodes;

[0012] Step 1.2.2: Construct a document structure tree and map it to multimodal element nodes to form a structured intermediate representation for agent collaboration;

[0013] Step 2: Use data contracts to define the triggering conditions for rollback, repair, and update processing of the multi-agent set, and initialize the scheduler;

[0014] Step 2.1: Construct a multi-agent set consisting of a parsing agent, a planning agent, a reducing agent, a rendering agent, a visual optimization agent, and a verification agent, and initialize the scheduler task state coordination;

[0015] Step 2.1.1: Configure the parsing agent to perform integrity checks and basic corrections on the structured intermediate representation for agent collaboration;

[0016] Step 2.1.2: Configure the planning agent to form the outline structure and page-level organization scheme of the presentation using the parsed structured intermediate representation;

[0017] Step 2.1.3: Configure the reduction agent (also rendered as the specification agent) to perform information compression, key point aggregation, and structure adjustment of the page-level organization scheme;

[0018] Step 2.1.4: Configure the rendering agent to map the final specification to the presentation template and generate structured pages;

[0019] Step 2.1.5: Configure the visual optimization agent to optimize the text style, visual elements, and graphic layout of the structured page;

[0020] Step 2.1.6: Use the verification agent to perform consistency verification on the output results of the parsing agent, planning agent, reduction agent, rendering agent, and visual optimization agent, and trigger exception handling;

[0021] Step 2.1.7: Use the scheduler to initialize the task state coordination of the multi-agent set in terms of execution order, calling relationship and state switching, and maintain the global execution state of the current presentation generation task;

[0022] Step 2.2: Utilize data contracts to construct triggering conditions for single or simultaneous occurrences of data missing, structural anomalies, and consistency failures, and perform rollback, repair, and update processing on the output of the multi-agent ensemble;

[0023] Step 2.2.1: Establish a cross-agent shared data contract using a structured intermediate representation oriented towards agent collaboration to constrain the input / output format and field integrity of the multi-agent set;

[0024] Step 2.2.1.1: Establish field constraint rules for the output of the multi-agent ensemble, including required fields, optional fields, and data types;

[0025] Step 2.2.1.2: Establish structural constraint rules for the output of the multi-agent set, which include the relationships between chapters, levels, page order, and page type;

[0026] Step 2.2.1.3: Establish semantic constraint rules to ensure the consistency of the reference of the same semantic unit output by a multi-agent set across different pages;

[0027] Step 2.2.1.4: Establish multimodal constraint rules for the reference and correspondence between the text content output by the multi-agent set and the corresponding multimodal elements;

[0028] Step 2.2.2: Establish triggering conditions for data missing, structural anomalies, and consistency failures using data contracts;

[0029] Step 2.2.2.1: When the output of the multi-agent ensemble does not meet the field constraints, trigger the data missing condition;

[0030] Step 2.2.2.2: When the output of the multi-agent set does not meet the structural constraints, trigger the structural anomaly condition;

[0031] Step 2.2.2.3: When the output of the multi-agent set does not satisfy the semantic constraints and multimodal constraints, the consistency failure condition is triggered;
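As an illustration only, the field, structural, and multimodal constraint rules and the conditions they trigger could be sketched as follows. The field names (`page`, `title`, `slide_type`, `bullets`, `element_ids`), the allowed page types, and the condition labels are assumptions made for this sketch and are not taken from the patent. Semantic constraints (consistent reference to the same semantic unit across pages) are omitted for brevity.

```python
# Hypothetical contract: field names and page types are illustrative only.
REQUIRED_FIELDS = {"page": int, "title": str, "slide_type": str,
                   "bullets": list, "element_ids": list}
ALLOWED_TYPES = {"intro", "method", "result"}


def check_contract(slides, known_element_ids):
    """Return the set of triggered conditions for a list of slide dicts."""
    triggered = set()
    for slide in slides:
        # Field constraints: required fields present with the expected type.
        for name, expected in REQUIRED_FIELDS.items():
            if not isinstance(slide.get(name), expected):
                triggered.add("data_missing")
        # Structural constraint: the page type must be a known one.
        slide_type = slide.get("slide_type")
        if isinstance(slide_type, str) and slide_type not in ALLOWED_TYPES:
            triggered.add("structural_anomaly")
        # Multimodal constraint: every referenced element must exist.
        refs = slide.get("element_ids")
        if isinstance(refs, list):
            if any(ref not in known_element_ids for ref in refs):
                triggered.add("consistency_failure")
    # Structural constraint: pages are numbered 1..n in order.
    pages = [slide.get("page") for slide in slides]
    if pages != list(range(1, len(slides) + 1)):
        triggered.add("structural_anomaly")
    return triggered


good = [{"page": 1, "title": "Motivation", "slide_type": "intro",
         "bullets": ["Labels form a hierarchy"], "element_ids": ["fig1"]}]
# No bullets, wrong page number, and a reference to an unknown figure.
bad = [{"page": 3, "title": "Model", "slide_type": "method",
        "element_ids": ["fig9"]}]
print(check_contract(good, {"fig1"}))
print(sorted(check_contract(bad, {"fig1"})))
```

Returning a set lets several conditions be reported at once, which matches the "single or simultaneous" wording of Step 2.2.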

[0032] Step 2.3: When one or more of the data missing, structural anomaly, and consistency failure conditions are triggered, the scheduler jumps to Step 2.1 and re-calls the corresponding agent to perform rollback processing of the multi-agent set output. If rollback processing is not possible, the multi-agent set output is repaired according to the data contract. If repair processing is not possible, the scheduler jumps to Step 1 to perform update processing of the multi-agent set output.

[0033] Step 3: The scheduler calls the planning agent to allocate text elements and multimodal elements to the structured intermediate representation for agent-oriented collaboration according to the presentation narrative reasoning, generates the outline structure and page-level organization scheme of the presentation, and calls the verification agent to verify the consistency of the outline structure and page-level organization scheme of the presentation.

[0034] Step 4: The scheduler calls the specification agent to modify the outline structure and page-level organization scheme of the presentation document to generate the final specification draft; and calls the verification agent to perform consistency verification on the final specification draft.

[0035] Step 4.1: Perform information redundancy removal, key point aggregation, and paragraph compression operations on the text elements of the page-level organization scheme;

[0036] Step 4.2: Retain multimodal elements in the page-level organization scheme;

[0037] Step 4.3: Perform cross-page merging and density control on the outline structure of the presentation, determine the page type, form the final specification draft, and call the verification agent to perform consistency verification on the final specification draft;

[0038] Step 5: The scheduler calls the rendering agent to select the corresponding template according to the page type in the final draft of the specification, fills the fields in the template slots to generate a structured page, and calls the verification agent to perform consistency verification on the structured page.

[0039] Step 6: Based on the final draft of the specifications, the scheduler calls the visual optimization agent to optimize the text style and visual expression of the structured page;

[0040] Step 6.1: Based on the final draft of the specification, the scheduler calls the visual optimization agent to update the structured page through text style mapping;

[0041] Step 6.2: Based on the final draft of the specification, the scheduler calls the visual optimization agent to automatically fill in the missing original illustrations in a visual representation format;

[0042] Step 6.3: Based on the final draft of the specification, the scheduler calls the visual optimization agent to detect the consistency of text and images and the rationality of the layout of text and images on the structured page and make adaptive corrections to form an optimized structured page;

[0043] Step 7: Combine the optimized structured pages according to the final specification draft to form an HTML presentation, and extract the speech text from the final specification draft; at the same time, visualize the structured intermediate representation for agent collaboration and the final specification draft.

[0044] To achieve the objectives of this invention, based on the above method, this invention further proposes an intelligent agent collaborative structured document generation system with intermediate representation, including a document parsing module, an intermediate representation construction module, a multi-agent collaborative scheduling module, a semantic reduction module, a template rendering module, a graphic optimization module, and an output and intermediate result management module;

[0045] The document parsing module is used to parse the input document structure and extract multimodal elements to construct PaperTree; its output serves as the input to the intermediate representation construction module.

[0046] The intermediate representation construction module is used to generate and maintain PaperTree and SlideSpec, providing a unified data interface and field system; its output serves as the input to the multi-agent collaborative scheduling module;

[0047] The multi-agent collaborative scheduling module is used to schedule the parsing agent, planning agent, reduction agent, rendering agent, visual optimization agent and verification agent, and to control the execution order and loop strategy; its output serves as the input of the semantic reduction module, template rendering module and graphic optimization module.

[0048] The semantic reduction module is used to perform page count budgeting, information redundancy removal, key point aggregation, slide_type decision and cross-page merging; its output serves as the input to the template rendering module;

[0049] The template rendering module is used to call the TemplatePack template according to the SlideSpec and complete the slot filling to generate structured page output, which serves as the input of the graphic optimization module.

[0050] The graphic optimization module is used to generate or select an introductory illustration, perform text-image consistency alignment, and optimize text style and layout parameters; its output serves as the input to the output and intermediate result management module.

[0051] The output and intermediate result management module is used to generate HTML Slides, export speech texts, and output visualized intermediate results and verification reports.

[0052] Compared with existing technologies, it has the following beneficial effects:

[0053] 1. This invention constructs unified structured intermediate representations, PaperTree and SlideSpec, which connect the document parsing, planning, reduction, rendering, and optimization stages with a computable and traceable data structure, significantly improving the interpretability and stability of the generation process and reducing error accumulation.

[0054] 2. This invention adopts a multi-agent collaborative scheduling mechanism to decompose complex presentation generation tasks into controllable sub-tasks, and achieves closed-loop repair through data contracts and consistency checks, thereby improving the structural coherence and engineering reliability of the generated results.

[0055] 3. In addition to semantic reduction and template rendering, this invention introduces a graphic optimization process, which can generate introductory illustrations and optimize text styles while retaining the chart evidence from the original text, thereby improving the readability, expression efficiency and visual consistency of the presentation.

[0056] 4. This invention supports outputting editable presentation files and simultaneously generating speech transcripts, and provides visualized intermediate results, facilitating user interaction, review, secondary editing, and iterative optimization. It is suitable for scenarios such as academic presentations, course teaching, and corporate technical exchanges.

Attached Figure Description

[0057] Figure 1 is a schematic diagram of the process of the present invention;

[0058] Figure 2 is a schematic diagram of the data structure of the structured intermediate representation PaperTree;

[0059] Figure 3 is a schematic diagram of the data structure of the structured intermediate representation SlideSpec;

Detailed Implementation

[0060] To better illustrate the purpose and advantages of this invention, the invention will be further described below with reference to the accompanying drawings and examples. It should be noted that the implementation of this invention is not limited to the following embodiments, and any modifications or alterations made to this invention will fall within the scope of protection of this invention.

[0061] This embodiment uses the paper "Hierarchy-Aware Global Model for Hierarchical Text Classification," published at ACL, as input. By performing text structure parsing and multimodal element extraction on the paper, a structured intermediate representation for agent collaboration is constructed. Based on this, a multi-agent set is configured, and corresponding calling and scheduling strategies are set. Each agent is driven to execute its corresponding processing task sequentially in the order "parsing, planning, reduction, rendering, visual optimization." The agents interact and synchronize their states through the structured intermediate representation, and a multi-stage semantic reduction mechanism controls the content quality and consistency during the generation process. Finally, an editable HTML presentation is generated based on the reduction results, achieving automated generation from an academic paper to a presentation.

[0062] Example

[0063] As shown in Figure 1, the specific implementation steps of the agent-collaborative structured document generation method with intermediate representation in this embodiment are as follows:

[0064] Step 1: Perform structural and element analysis on reducible information sources containing images, formulas, and tables. Extract summary information and perform multimodal image and text analysis using large models and multimodal large models respectively to form a structured intermediate representation for agent collaboration.

[0065] Step 1.1: Perform structural analysis on reducible information sources containing images, text, formulas, and tables; break down the reducible information sources into document structure trees according to chapter, sub-chapter, and paragraph hierarchical relationships, and extract summary information using a large model;

[0066] Step 1.2: Perform element parsing on the multimodal elements of a reducible information source containing images, text, formulas, and tables, and construct a mapping relationship between multimodal element nodes and the document structure tree;

[0067] Step 1.2.1: Use the multimodal large model to generate text parsing content for multimodal elements, forming multimodal element nodes;

[0068] Step 1.2.2: Construct a document structure tree and map it to multimodal element nodes to form a structured intermediate representation for agent collaboration;

[0069] In this embodiment, the system first automatically identifies and distinguishes different types of information elements from the input academic paper, including chapter titles, paragraph text, formula numbers, table titles, and image captions. For the text content, the system uses a large-scale model to extract summary information from each chapter; for example, it extracts the core research ideas from the "Method Introduction" chapter and key experimental conditions from the "Experimental Setup" chapter. For images, tables, and formulas, the system uses a multimodal large-scale model to analyze the correspondence between text and images, extracting the meaning of the experimental process or results expressed in the images. The above analysis results are uniformly organized into a structured intermediate representation for agent-based collaboration and saved in JSON format.
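As an illustration only, a structured intermediate representation of this kind could be sketched as a document structure tree whose nodes reference multimodal element nodes, serialized to JSON. The class and field names (`SectionNode`, `ElementNode`, `element_ids`, and so on) are hypothetical and are not the PaperTree layout of the patent, and the summaries and descriptions that the large models would produce are filled in by hand here.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ElementNode:
    """A multimodal element (figure, table, or formula) with parsed text."""
    element_id: str
    kind: str          # "figure" | "table" | "formula"
    description: str   # text parsing content (hand-written in this sketch)


@dataclass
class SectionNode:
    """One node of the document structure tree."""
    node_id: str
    level: str         # "chapter" | "subchapter" | "paragraph"
    title: str
    summary: str = ""
    element_ids: list = field(default_factory=list)   # links to element nodes
    children: list = field(default_factory=list)


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def build_representation(root, elements):
    """Bundle the tree and the element nodes, rejecting dangling references."""
    known = {e.element_id for e in elements}
    for node in walk(root):
        missing = [eid for eid in node.element_ids if eid not in known]
        if missing:
            raise ValueError(f"{node.node_id} references unknown elements {missing}")
    return {"tree": asdict(root), "elements": [asdict(e) for e in elements]}


fig = ElementNode("fig1", "figure", "Overall model architecture")
para = SectionNode("s1.1.p1", "paragraph", "", element_ids=["fig1"])
root = SectionNode("s1", "chapter", "Method", summary="Core idea", children=[para])
representation = build_representation(root, [fig])
print(json.dumps(representation, indent=2))
```

Keeping element nodes in a flat list and referencing them by identifier from the tree is one way to express the mapping between multimodal element nodes and the document structure tree; other layouts would serve equally well.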

[0070] Step 2: Use data contracts to define the triggering conditions for rollback, repair, and update processing of the multi-agent set, and initialize the scheduler;

[0071] Step 2.1: Construct a multi-agent set consisting of a parsing agent, a planning agent, a reducing agent, a rendering agent, a visual optimization agent, and a verification agent, and initialize the scheduler task state coordination;

[0072] Step 2.1.1: Configure the parsing agent to perform integrity checks and basic corrections on the structured intermediate representation for agent collaboration;

[0073] Step 2.1.2: Configure the planning agent to form the outline structure and page-level organization scheme of the presentation using the parsed structured intermediate representation;

[0074] Step 2.1.3: Configure the reduction agent (also rendered as the specification agent) to perform information compression, key point aggregation, and structure adjustment of the page-level organization scheme;

[0075] Step 2.1.4: Configure the rendering agent to map the final specification to the presentation template and generate structured pages;

[0076] Step 2.1.5: Configure the visual optimization agent to optimize the text style, visual elements, and graphic layout of the structured page;

[0077] Step 2.1.6: Use the verification agent to perform consistency verification on the output results of the parsing agent, planning agent, reduction agent, rendering agent, and visual optimization agent, and trigger exception handling;

[0078] Step 2.1.7: Use the scheduler to initialize the task state coordination of the multi-agent set in terms of execution order, calling relationship and state switching, and maintain the global execution state of the current presentation generation task;

[0079] Step 2.2: Utilize data contracts to construct triggering conditions for single or simultaneous occurrences of data missing, structural anomalies, and consistency failures, and perform rollback, repair, and update processing on the output of the multi-agent ensemble;

[0080] Step 2.2.1: Establish a cross-agent shared data contract using a structured intermediate representation oriented towards agent collaboration to constrain the input / output format and field integrity of the multi-agent set;

[0081] Step 2.2.1.1: Establish field constraint rules for the output of the multi-agent ensemble, including required fields, optional fields, and data types;

[0082] Step 2.2.1.2: Establish structural constraint rules for the output of the multi-agent set, which include the relationships between chapters, levels, page order, and page type;

[0083] Step 2.2.1.3: Establish semantic constraint rules to ensure the consistency of the reference of the same semantic unit output by a multi-agent set across different pages;

[0084] Step 2.2.1.4: Establish multimodal constraint rules for the reference and correspondence between the text content output by the multi-agent set and the corresponding multimodal elements;

[0085] Step 2.2.2: Establish triggering conditions for data missing, structural anomalies, and consistency failures using data contracts;

[0086] Step 2.2.2.1: When the output of the multi-agent ensemble does not meet the field constraints, trigger the data missing condition;

[0087] Step 2.2.2.2: When the output of the multi-agent set does not meet the structural constraints, trigger the structural anomaly condition;

[0088] Step 2.2.2.3: When the output of the multi-agent set does not satisfy the semantic constraints and multimodal constraints, the consistency failure condition is triggered;

[0089] Step 2.3: When one or more of the data missing, structural anomaly, and consistency failure conditions are triggered, the scheduler jumps to Step 2.1 and re-calls the corresponding agent to perform rollback processing of the multi-agent set output. If rollback processing is not possible, the multi-agent set output is repaired according to the data contract. If repair processing is not possible, the scheduler jumps to Step 1 to perform update processing of the multi-agent set output.
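The escalation order in Step 2.3 (rollback, then repair, then update) can be sketched as below. This is a minimal sketch under one assumption not stated in the patent: "not possible" is interpreted as "the attempt did not produce a valid output". The four callables are hypothetical stand-ins for re-calling the agent, patching the output against the contract, re-running Step 1, and checking the contract.

```python
def handle_trigger(rerun_agent, repair, reparse, is_valid):
    """Try rollback, then repair, then update; return (strategy, output)."""
    # Rollback: re-call the agent that produced the failing output (Step 2.1).
    output = rerun_agent()
    if is_valid(output):
        return "rollback", output
    # Repair: patch the output according to the data contract.
    output = repair(output)
    if is_valid(output):
        return "repair", output
    # Update: return to Step 1 and rebuild from the information source.
    return "update", reparse()


# A chapter still lacks its summary after the re-run, so repair fills it in.
strategy, output = handle_trigger(
    rerun_agent=lambda: {"title": "Method"},
    repair=lambda out: {**out, "summary": ""},
    reparse=lambda: {"title": "Method", "summary": "rebuilt"},
    is_valid=lambda out: "summary" in out,
)
print(strategy, output)
```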

[0090] In this embodiment, as shown in Figure 2, after the structured intermediate representation is constructed, this invention initializes and configures the multi-agent set based on a preset data contract. The data contract is used to constrain the integrity and consistency of the input and output fields of each agent. Trigger conditions are configured, such as triggering rollback or repair processing when a chapter lacks a summary field or the text and image references are incomplete. The scheduler is also initialized, and a scheduling strategy of "parsing-planning-reduction-rendering-visual optimization" is set.
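A scheduler that fixes the execution order and keeps a global execution state might look like the following minimal sketch. The `Stage` enum, the state dictionary layout, and the stub agents are assumptions for illustration. The verification agent and exception handling are deliberately left out to keep the sketch short.

```python
from enum import Enum


class Stage(Enum):
    PARSING = "parsing"
    PLANNING = "planning"
    REDUCTION = "reduction"
    RENDERING = "rendering"
    VISUAL_OPTIMIZATION = "visual optimization"


class Scheduler:
    """Runs agents in a fixed order and records the global execution state."""

    def __init__(self, agents):
        # agents: dict mapping Stage -> callable(outputs_so_far) -> output
        self.agents = agents
        self.order = list(Stage)   # Enum iteration follows definition order
        self.state = {"current": None, "completed": [], "outputs": {}}

    def run(self):
        for stage in self.order:
            self.state["current"] = stage
            output = self.agents[stage](self.state["outputs"])
            self.state["outputs"][stage.value] = output
            self.state["completed"].append(stage)
        self.state["current"] = None
        return self.state


# Stub agents that only report completion; real agents would call models.
stub_agents = {stage: (lambda outputs, s=stage: f"{s.value} done") for stage in Stage}
final_state = Scheduler(stub_agents).run()
print([stage.value for stage in final_state["completed"]])
```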

[0091] Step 3: The scheduler calls the planning agent to allocate text elements and multimodal elements to the structured intermediate representation for agent-oriented collaboration according to the presentation narrative reasoning, generates the outline structure and page-level organization scheme of the presentation, and calls the verification agent to verify the consistency of the outline structure and page-level organization scheme of the presentation.

[0092] In this embodiment, as shown in Figure 3, the scheduler invokes the planning agent to process the structured intermediate representation formed in step 1. Based on the overall structure of the paper and following the narrative logic of the presentation, the planning agent allocates textual and multimodal elements. For example, the planning agent plans the "Research Background" and "Research Motivation" sections of the paper as the introduction of the presentation, the "Methodological Framework" section as several method description pages, and key experimental results as result display pages. Simultaneously, the planning agent determines the number of key points and corresponding chart references for each slide. The verification agent then verifies the consistency of the number of pages, page types, and chapter order to ensure a reasonable presentation structure.
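The patent does not say how the number of pages per section is decided, and the planning agent presumably relies on a large model for this. Purely as an assumed stand-in, the sketch below splits a fixed page budget across sections in proportion to hypothetical section weights (for example, word counts), using the largest-remainder method and guaranteeing each section at least one page.

```python
def allocate_pages(section_weights, budget):
    """Split a page budget across sections in proportion to their weights.

    Every section gets one page first; the spare pages are distributed
    proportionally, with leftovers going to the largest fractional remainders.
    """
    if budget < len(section_weights):
        raise ValueError("budget must cover at least one page per section")
    spare = budget - len(section_weights)
    total = sum(section_weights.values())
    shares = {name: spare * w / total for name, w in section_weights.items()}
    pages = {name: 1 + int(share) for name, share in shares.items()}
    leftover = budget - sum(pages.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for name in by_remainder[:leftover]:
        pages[name] += 1
    return pages


# Hypothetical weights for three sections and a budget of eight pages.
plan = allocate_pages({"Background": 2, "Method": 6, "Experiments": 4}, budget=8)
print(plan)
```

The resulting page counts always sum to the budget, which gives the verification step a simple invariant to check.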

[0093] Step 4: The scheduler calls the specification agent to modify the outline structure and page-level organization scheme of the presentation document to generate the final specification draft; and calls the verification agent to perform consistency verification on the final specification draft.

[0094] Step 4.1: Perform information redundancy removal, key point aggregation, and paragraph compression operations on the text elements of the page-level organization scheme;

[0095] Step 4.2: Retain multimodal elements in the page-level organization scheme;

[0096] Step 4.3: Perform cross-page merging and density control on the outline structure of the presentation, determine the page type, form the final specification draft, and call the verification agent to perform consistency verification on the final specification draft;

[0097] In this example, the scheduler invokes the reduction agent to reduce the presentation outline structure and page-level organization scheme generated in step 3. During the reduction process, the system compresses information and aggregates key points for the text content of each page. For example, for long experimental description paragraphs in the paper, the reduction agent retains only the experimental objectives, core parameters, and main conclusions, merging or deleting redundant descriptions. After reduction, a final reduction draft is generated, which clearly records the title, list of key points, and required multimodal elements for each slide. After the final reduction draft is generated, a verification agent performs consistency verification to ensure that the reduced content remains semantically consistent with the original paper.
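Cross-page merging and density control could be approximated with a simple rule-based pass such as the one below. This is a sketch under stated assumptions, not the reduction agent itself: the thresholds `max_bullets` and `min_bullets`, the merge rule (only adjacent pages of the same type), and the slide field names are all hypothetical. Multimodal elements are carried over on merge, in line with Step 4.2.

```python
def control_density(slides, max_bullets=5, min_bullets=2):
    """Merge sparse neighbouring slides of the same type, then split dense ones."""
    merged = []
    for slide in slides:
        prev = merged[-1] if merged else None
        can_merge = (
            prev is not None
            and prev["slide_type"] == slide["slide_type"]
            and len(slide["bullets"]) < min_bullets
            and len(prev["bullets"]) + len(slide["bullets"]) <= max_bullets
        )
        if can_merge:
            # Retain both the key points and the multimodal elements.
            prev["bullets"] = prev["bullets"] + slide["bullets"]
            prev["element_ids"] = prev["element_ids"] + slide["element_ids"]
        else:
            merged.append({**slide, "bullets": list(slide["bullets"]),
                           "element_ids": list(slide["element_ids"])})
    result = []
    for slide in merged:
        bullets = slide["bullets"]
        for start in range(0, max(len(bullets), 1), max_bullets):
            part = {**slide, "bullets": bullets[start:start + max_bullets]}
            if start > 0:
                part["title"] = slide["title"] + " (cont.)"
                part["element_ids"] = []   # elements stay on the first page
            result.append(part)
    return result


draft = [
    {"title": "Setup", "slide_type": "result",
     "bullets": ["Datasets", "Metrics", "Baselines"], "element_ids": ["tab1"]},
    {"title": "Note", "slide_type": "result", "bullets": ["Seeds"], "element_ids": []},
    {"title": "Method", "slide_type": "method",
     "bullets": [f"Point {i}" for i in range(1, 8)], "element_ids": ["fig2"]},
]
final = control_density(draft)
print([(s["title"], len(s["bullets"])) for s in final])
```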

[0098] Step 5: The scheduler calls the rendering agent to select the corresponding template according to the page type in the final draft of the specification, fills the fields in the template slots to generate a structured page, and calls the verification agent to perform consistency verification on the structured page.

[0099] In this example, the scheduler invokes the rendering agent based on the final specification draft. The rendering agent selects the corresponding presentation template based on the page type (e.g., introduction page, method page, result page) indicated in the final specification draft. The rendering agent populates the template slots with fields, filling in the titles, key text, and multimodal references from the final specification draft into the template, generating a structured page representation in HTML format. The validation agent then performs consistency verification on the completeness of the page fields and the rationality of the layout.
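Template selection by page type and slot filling can be illustrated with Python's standard `string.Template`. The templates, slot names (`$title`, `$bullets`, `$figures`), and the `data-ref` attribute are assumptions for this sketch; the actual TemplatePack templates are not described in the patent.

```python
from html import escape
from string import Template

# Hypothetical templates keyed by page type.
TEMPLATES = {
    "intro": Template("<section class='intro'><h1>$title</h1><ul>$bullets</ul></section>"),
    "result": Template(
        "<section class='result'><h2>$title</h2><ul>$bullets</ul>$figures</section>"
    ),
}


def render_slide(slide):
    """Select a template by page type and fill its slots with escaped fields."""
    template = TEMPLATES[slide["slide_type"]]
    bullets = "".join(f"<li>{escape(b)}</li>" for b in slide["bullets"])
    figures = "".join(
        f"<figure data-ref='{escape(ref)}'></figure>"
        for ref in slide.get("element_ids", [])
    )
    # Slots that a template does not define are simply ignored by substitute().
    return template.substitute(title=escape(slide["title"]),
                               bullets=bullets, figures=figures)


page = render_slide({
    "slide_type": "result",
    "title": "F1 < 0.9 & rising",
    "bullets": ["Micro-F1 improves"],
    "element_ids": ["tab2"],
})
print(page)
```

Escaping each field before substitution keeps text from the paper (such as `<` or `&` in a title) from breaking the generated HTML.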

[0100] Step 6: Based on the final specification draft, the scheduler calls the visual optimization agent to optimize the text style and visual expression of the structured page;

[0101] Step 6.1: Based on the final specification draft, the scheduler calls the visual optimization agent to update the structured page through text style mapping;

[0102] Step 6.2: Based on the final specification draft, the scheduler calls the visual optimization agent to automatically fill in, in a visual representation format, illustrations that are missing from the original;

[0103] Step 6.3: Based on the final specification draft, the scheduler calls the visual optimization agent to check the consistency between text and images and the rationality of their layout on the structured page, and to make adaptive corrections, forming an optimized structured page;

[0104] In this example, the scheduler invokes the visual optimization agent based on the final specification draft to further optimize the structured page generated in Step 5. The visual optimization agent adjusts the text styles on the page, for example by differentiating the font levels of titles and body text and emphasizing key conclusions. It also optimizes the visual presentation of charts and diagrams, for example by adding image descriptions to pages that lack a visual representation and refining the layout of pages that contain images. Through this step, the structured page achieves a better visual presentation while maintaining content consistency.
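
Steps 6.1 to 6.3 might be sketched, in a deliberately simplified form, as follows. The page dictionary layout, the `STYLE_MAP` class names, and the `max_images` threshold are hypothetical. The layout check here only flags a problem; the adaptive correction described above is not implemented.

```python
# Hypothetical mapping from text role to CSS class (Step 6.1).
STYLE_MAP = {
    "title": "text-title",
    "body": "text-body",
    "conclusion": "text-emphasis",
}


def optimize_page(page, max_images=1):
    """Return a styled copy of a page dict; the dict layout is an assumption."""
    optimized = dict(page)
    # Step 6.1: map the role of each text block to a style class.
    optimized["blocks"] = [
        {**block, "css_class": STYLE_MAP.get(block["role"], "text-body")}
        for block in page["blocks"]
    ]
    images = list(page.get("images", []))
    # Step 6.2: add a placeholder visual when the page has no illustration.
    if not images:
        images.append({"ref": None, "placeholder": True, "alt": page["title"]})
    optimized["images"] = images
    # Step 6.3: a crude layout check that flags pages holding too many images.
    optimized["layout_ok"] = len(images) <= max_images
    return optimized


page = {
    "title": "Method",
    "blocks": [
        {"role": "title", "text": "Method"},
        {"role": "conclusion", "text": "A shared IR stabilizes generation"},
    ],
}
result = optimize_page(page)
print(result["blocks"][1]["css_class"], result["layout_ok"])
```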

[0105] Step 7: Combine the optimized structured pages according to the final specification draft to form an HTML presentation, and extract the speech text from the final specification draft; at the same time, visualize the structured intermediate representation for agent collaboration and the final specification draft.

[0106] In this embodiment, the optimized structured pages are concatenated in the order determined by the final specification draft to generate an HTML presentation file. Based on the page content in the final specification draft, corresponding speech text is automatically extracted and generated to assist in the presentation. This invention provides a visual output of the structured intermediate representation for agent-based collaboration and the final specification draft, allowing users to intuitively view the paper analysis results, presentation planning process, and final generated content, thereby improving the interpretability of the generation process.
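
The assembly step can be illustrated with the following minimal sketch, which concatenates page fragments in the order given by the final specification draft and derives a simple speech text from each page's key points. The `page_id`, `title`, and `key_points` field names are assumptions, and joining the key points is only a stand-in for the speech text generation of the embodiment.

```python
def assemble_presentation(spec_draft, rendered_pages):
    """Concatenate pages in specification order and derive simple speech text.

    spec_draft: list of dicts with 'page_id', 'title', 'key_points' (assumed).
    rendered_pages: dict mapping page_id to an HTML fragment.
    """
    body = "\n".join(rendered_pages[page["page_id"]] for page in spec_draft)
    html = f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>"
    notes = [
        f"{page['title']}: " + " ".join(page["key_points"])
        for page in spec_draft
    ]
    return html, notes


spec = [
    {"page_id": "p2", "title": "Intro", "key_points": ["Why slides are hard."]},
    {
        "page_id": "p1",
        "title": "Method",
        "key_points": ["Use a shared IR.", "Validate each stage."],
    },
]
pages = {"p1": "<section>Method</section>", "p2": "<section>Intro</section>"}
html, notes = assemble_presentation(spec, pages)
print(notes)
```

The page order comes from the specification draft rather than from the rendering order, which is why `p2` precedes `p1` in the output.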

[0107] To achieve the objectives of this invention, based on the above method, this embodiment further proposes an intelligent agent collaborative structured document generation system with intermediate representation, including a document parsing module, an intermediate representation construction module, a multi-agent collaborative scheduling module, a semantic reduction module, a template rendering module, an image and text optimization module, and an output and intermediate result management module:

[0108] The document parsing module is used to parse the input document structure and extract multimodal elements to construct PaperTree; its output serves as the input to the intermediate representation construction module.

[0109] The intermediate representation construction module is used to generate and maintain PaperTree and SlideSpec, providing a unified data interface and field system; its output serves as the input to the multi-agent collaborative scheduling module.

[0110] The multi-agent collaborative scheduling module is used to schedule the parsing agent, planning agent, reduction agent, rendering agent, visual optimization agent, and verification agent, and to control the execution order and loop strategy; its output serves as the input to the semantic reduction module, the template rendering module, and the image and text optimization module.

[0111] The semantic reduction module is used to perform page-count budgeting, information redundancy removal, key point aggregation, slide_type decision, and cross-page merging; its output serves as the input to the template rendering module.

[0112] The template rendering module is used to call the TemplatePack template according to the SlideSpec and complete the slot filling to generate a structured page output; its output serves as the input to the image and text optimization module.

[0113] The image and text optimization module is used to generate or select an introductory image, perform image and text consistency alignment, and optimize text style and layout parameters; its output serves as the input to the output and intermediate result management module.

[0114] The output and intermediate result management module is used to generate HTML Slides, export presentation texts, and output visualized intermediate results and verification reports.
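
The execution order and loop strategy controlled by the multi-agent collaborative scheduling module might be sketched as below. This is an assumption-laden illustration: the `run_pipeline` function, the `max_retries` limit, the state dictionary keys `paper_tree` and `slide_spec`, and the toy stages are all hypothetical, and `verify` merely stands in for the verification agent.

```python
def run_pipeline(stages, state, verify, max_retries=2):
    """Run stages in order and re-run a stage when verification fails.

    stages: list of (name, function) pairs; each function maps state -> state.
    verify: function (name, state) -> bool, standing in for the verification agent.
    Returns the final state and a log of (name, attempt, passed) tuples.
    """
    log = []
    for name, stage in stages:
        for attempt in range(1, max_retries + 2):
            candidate = stage(dict(state))  # work on a shallow copy
            passed = verify(name, candidate)
            log.append((name, attempt, passed))
            if passed:
                state = candidate
                break
        else:
            raise RuntimeError(f"stage {name!r} failed verification")
    return state, log


def parse_stage(state):
    state["paper_tree"] = {"sections": ["Intro", "Method"]}
    return state


attempts = {"reduce": 0}


def reduce_stage(state):
    attempts["reduce"] += 1
    pages = [{"title": t} for t in state["paper_tree"]["sections"]]
    if attempts["reduce"] > 1:  # the first attempt omits slide_type on purpose
        for page in pages:
            page["slide_type"] = "content"
    state["slide_spec"] = pages
    return state


def verify(name, state):
    if name == "reduce":
        return all("slide_type" in page for page in state["slide_spec"])
    return True


final_state, log = run_pipeline(
    [("parse", parse_stage), ("reduce", reduce_stage)], {}, verify
)
print(log)
```

The log of attempts is the kind of intermediate result that the output and intermediate result management module could expose in a verification report.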

[0115] To further illustrate the advantages of the present invention, an ablation experiment is conducted.

[0116] (1) Removal of IR: The structured IR is replaced with natural-language text as the link between stages; that is, parsing directly produces a text outline for subsequent use, and the generated results are then polished by GPT-4 according to a template. Under this setting, the stages lack a unified data interface and constraints, so information is passed poorly from one stage to the next. Experiments showed that without IR, content coverage decreased significantly and coherence scores also decreased. Some key points generated during planning were lost or distorted in the subsequent text expansion, and chart references were omitted. With IR, the information from each stage was transmitted completely, improving fidelity. This confirms the value of a unified IR in maintaining consistency across stages.

[0117] (2) Removal of reduction: The reduction agent is skipped, so no fine-grained compression or page-type adjustment is performed, and the template is filled directly with the outline text from the planning phase. As a result, presentation pages are often overloaded or loosely structured, and the average number of words per page exceeds the optimal range, leading to a heavy reading burden and loose logic. In the automatic evaluation, the page coherence score decreased by 0.15, and human feedback indicated that "some pages have too many words, making them difficult to read." The lack of reduction therefore impairs the readability of the slides and the highlighting of key points. Although coverage may increase slightly, such slides do not meet the requirements of actual reporting. By simplifying and streamlining the content, the reduction agent improves the efficiency of information presentation.

[0118] (3) Removal of collaborative constraints: Data contracts and the verification agent are disabled, so each agent executes according to its own locally optimal behavior without being forced to comply with global rules. For example, the reduction agent is allowed to delete charts or place multiple charts on a single page, and the number of pages is not limited. The results show a significant decrease in multimodal consistency and structural rationality: some charts disappear from the final result, and some pages have two charts crammed together, resulting in a chaotic layout. Content coverage also decreased by approximately 5 percentage points because some information considered secondary was removed. This demonstrates that collaborative constraints are crucial for ensuring information integrity and layout standardization. When each module optimizes locally without global constraints, the overall quality suffers.
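
The global rules removed in this ablation (no dropped charts, a limit on charts per page, a page budget) could be checked by a routine such as the following sketch. The function name, the page dictionary layout, and the thresholds `max_pages` and `max_charts_per_page` are assumptions chosen for illustration and are not values stated in this document.

```python
def check_global_constraints(source_charts, pages, max_pages, max_charts_per_page=1):
    """Return a list of violation messages; an empty list means the draft passes."""
    violations = []
    if len(pages) > max_pages:
        violations.append(f"page budget exceeded: {len(pages)} > {max_pages}")
    used = set()
    for index, page in enumerate(pages, start=1):
        charts = page.get("charts", [])
        used.update(charts)
        if len(charts) > max_charts_per_page:
            violations.append(f"page {index} holds {len(charts)} charts")
    # Every chart in the source must still be referenced somewhere.
    for chart in sorted(set(source_charts) - used):
        violations.append(f"chart {chart} was dropped")
    return violations


pages = [
    {"title": "Results", "charts": ["fig1", "fig2"]},
    {"title": "Summary"},
]
violations = check_global_constraints(["fig1", "fig2", "fig3"], pages, max_pages=10)
print(violations)
```

A non-empty result corresponds to a triggered condition in the data contract, at which point the scheduler could re-invoke the responsible agent.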

[0119] The above detailed description further illustrates the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for generating structured documents collaboratively by intelligent agents using intermediate representations, characterized in that it includes the following steps:
Step 1: Perform structural and element analysis on reducible information sources containing images, formulas, and tables; extract summary information using a large model and perform multimodal image and text analysis using a multimodal large model, to form a structured intermediate representation for agent collaboration;
Step 2: Use data contracts to construct triggering conditions for rollback, repair, and update processing of the multi-agent set, and initialize the scheduler;
Step 2.1: Construct a multi-agent set consisting of a parsing agent, a planning agent, a reduction agent, a rendering agent, a visual optimization agent, and a verification agent, and initialize the task state coordination of the scheduler;
Step 2.2: Use data contracts to construct triggering conditions for missing data, structural anomalies, and consistency failures, occurring singly or simultaneously, and perform rollback, repair, and update processing on the output of the multi-agent set;
Step 2.2.1: Establish a cross-agent shared data contract using the structured intermediate representation for agent collaboration, to constrain the input/output format and field integrity of the multi-agent set;
Step 2.2.1.1: Establish field constraint rules for the output of the multi-agent set, including required fields, optional fields, and data types;
Step 2.2.1.2: Establish structural constraint rules for the output of the multi-agent set, covering the relationships between chapters, levels, page order, and page type;
Step 2.2.1.3: Establish semantic constraint rules to ensure that the same semantic unit output by the multi-agent set is referenced consistently across different pages;
Step 2.2.1.4: Establish multimodal constraint rules for the reference and correspondence between the text content output by the multi-agent set and the corresponding multimodal elements;
Step 2.2.2: Establish triggering conditions for missing data, structural anomalies, and consistency failures using data contracts;
Step 2.2.2.1: When the output of the multi-agent set does not meet the field constraints, trigger the missing data condition;
Step 2.2.2.2: When the output of the multi-agent set does not meet the structural constraints, trigger the structural anomaly condition;
Step 2.2.2.3: When the output of the multi-agent set does not satisfy the semantic constraints and multimodal constraints, trigger the consistency failure condition;
Step 2.3: When a missing data, structural anomaly, or consistency failure condition is triggered, singly or simultaneously, the scheduler jumps to Step 2.1 to re-call the corresponding agent and perform rollback processing of the multi-agent set output; if rollback processing is not possible, the multi-agent set output is repaired according to the data contract; if repair processing is not possible, the scheduler jumps to Step 1 to perform update processing of the multi-agent set output;
Step 3: The scheduler calls the planning agent to allocate the text elements and multimodal elements of the structured intermediate representation for agent collaboration according to presentation narrative reasoning, generates the outline structure and page-level organization scheme of the presentation, and calls the verification agent to perform consistency verification on the outline structure and page-level organization scheme of the presentation;
Step 4: The scheduler calls the reduction agent to modify the outline structure and page-level organization scheme of the presentation to generate the final specification draft, and calls the verification agent to perform consistency verification on the final specification draft;
Step 5: The scheduler calls the rendering agent to select the corresponding template according to the page type in the final specification draft, fills the template slots with the corresponding fields to generate a structured page, and calls the verification agent to perform consistency verification on the structured page;
Step 6: Based on the final specification draft, the scheduler calls the visual optimization agent to optimize the text style and visual expression of the structured page;
Step 7: Combine the optimized structured pages according to the final specification draft to form an HTML presentation, and extract the speech text from the final specification draft; at the same time, visualize the structured intermediate representation for agent collaboration and the final specification draft.

2. The method for generating structured documents collaboratively by intelligent agents using intermediate representations as described in claim 1, characterized in that Step 1 is implemented as follows:
Step 1.1: Perform structural analysis on reducible information sources containing images, text, formulas, and tables; break down the reducible information sources into a document structure tree according to the hierarchical relationships of chapters, sub-chapters, and paragraphs, and extract summary information using a large model;
Step 1.2: Perform element parsing on the multimodal elements of the reducible information sources containing images, text, formulas, and tables, and construct a mapping relationship between multimodal element nodes and the document structure tree.

3. The method for generating structured documents collaboratively by intelligent agents using intermediate representations as described in claim 2, characterized in that Step 1.2 is implemented as follows:
Step 1.2.1: Use the multimodal large model to generate text parsing content for the multimodal elements, forming multimodal element nodes;
Step 1.2.2: Construct a document structure tree and map it to the multimodal element nodes to form the structured intermediate representation for agent collaboration.

4. The method for generating structured documents collaboratively by intelligent agents using intermediate representations as described in claim 1, characterized in that Step 2.1 is implemented as follows:
Step 2.1.1: Configure the parsing agent to perform integrity checks and basic corrections on the structured intermediate representation for agent collaboration;
Step 2.1.2: Configure the planning agent to form the outline structure and page-level organization scheme of the presentation using the parsed structured intermediate representation;
Step 2.1.3: Configure the reduction agent to perform information compression, key point aggregation, and structure adjustment on the page-level organization scheme;
Step 2.1.4: Configure the rendering agent to map the final specification draft to the presentation template and generate structured pages;
Step 2.1.5: Configure the visual optimization agent to optimize the text style, visual elements, and image and text layout of the structured page;
Step 2.1.6: Use the verification agent to perform consistency verification on the output of the parsing agent, planning agent, reduction agent, rendering agent, and visual optimization agent, and to trigger exception handling;
Step 2.1.7: Use the scheduler to initialize the task state coordination of the multi-agent set in terms of execution order, calling relationships, and state switching, and to maintain the global execution state of the current presentation generation task.

5. The method for generating structured documents collaboratively by intelligent agents using intermediate representations as described in claim 1, characterized in that Step 4 is implemented as follows:
Step 4.1: Perform information redundancy removal, key point aggregation, and paragraph compression on the text elements of the page-level organization scheme;
Step 4.2: Retain the multimodal elements in the page-level organization scheme;
Step 4.3: Perform cross-page merging and density control on the outline structure of the presentation, determine the page type, form the final specification draft, and call the verification agent to perform consistency verification on the final specification draft.

6. The method for generating structured documents collaboratively by intelligent agents using intermediate representations as described in claim 1, characterized in that Step 6 is implemented as follows:
Step 6.1: Based on the final specification draft, the scheduler calls the visual optimization agent to update the structured page through text style mapping;
Step 6.2: Based on the final specification draft, the scheduler calls the visual optimization agent to automatically fill in, in a visual representation format, illustrations that are missing from the original;
Step 6.3: Based on the final specification draft, the scheduler calls the visual optimization agent to check the consistency between text and images and the rationality of their layout on the structured page, and to make adaptive corrections, forming an optimized structured page.

7. An intelligent agent collaborative structured document generation system with intermediate representation, for implementing the method described in claim 1, characterized in that it includes a document parsing module, an intermediate representation construction module, a multi-agent collaborative scheduling module, a semantic reduction module, a template rendering module, an image and text optimization module, and an output and intermediate result management module;
The document parsing module is used to parse the input document structure and extract multimodal elements to construct PaperTree; its output serves as the input to the intermediate representation construction module;
The intermediate representation construction module is used to generate and maintain PaperTree and SlideSpec, providing a unified data interface and field system; its output serves as the input to the multi-agent collaborative scheduling module;
The multi-agent collaborative scheduling module is used to schedule the parsing agent, planning agent, reduction agent, rendering agent, visual optimization agent, and verification agent, and to control the execution order and loop strategy; its output serves as the input to the semantic reduction module, the template rendering module, and the image and text optimization module;
The semantic reduction module is used to perform page-count budgeting, information redundancy removal, key point aggregation, slide_type decision, and cross-page merging; its output serves as the input to the template rendering module;
The template rendering module is used to call the TemplatePack template according to the SlideSpec and complete the slot filling to generate a structured page output; its output serves as the input to the image and text optimization module;
The image and text optimization module is used to generate or select an introductory image, perform image and text consistency alignment, and optimize text style and layout parameters; its output serves as the input to the output and intermediate result management module;
The output and intermediate result management module is used to generate HTML Slides, export presentation texts, and output visualized intermediate results and verification reports.