A complex layout file keyword extraction method, device and electronic equipment

CN121960802B · Active · Publication Date: 2026-06-26 · SHANGYANG TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGYANG TECH CO LTD
Filing Date
2026-04-02
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies cannot effectively extract keywords from complex layout files, resulting in low accuracy, poor full-format compatibility, low error correction efficiency, and insufficient versatility. They are insufficient to meet the high precision, high reliability, and high adaptability requirements of industries such as government and finance.

Method used

The source file content is parsed using a combination of multiple tools. By integrating multimodal large models with spatial coordinates, a one-to-one mapping between text content and file layout is achieved. A three-dimensional cross-validation mechanism is also constructed to ensure the accurate extraction and error correction of key fields.

Benefits of technology

It improves the accuracy of keyword extraction in complex layout files, enhances full format compatibility and error correction efficiency, and meets the high precision, high reliability and high adaptability requirements of government, finance and other industries.



Abstract

The application discloses a method, a device, and electronic equipment for extracting keywords from complex layout files. The method comprises the following steps: acquiring a source file with a complex layout; parsing the source file with multiple tools in parallel to obtain redundantly verified text carrying spatial coordinates, converting the source file into a picture that retains the original layout and serves as a reference coordinate system, and generating a fused text set through spatial coordinate conflict resolution; after accurately aligning the picture and the text, inputting both into a multimodal large model, which, guided by system prompt words, completes layout classification, picture-text cross verification, and content reasoning and outputs key fields with layout anchor points; and, through deduplication, normalization, standardized formatting, and three-dimensional cross verification, directionally re-reasoning abnormal fields and finally generating a standardized structured text. Through multi-tool redundant parsing, deep picture-text fusion, and closed-loop directional error correction, the application improves keyword extraction accuracy, adapts to complex layout scenes with mixed formats, and improves generality and error correction efficiency.

Description

Technical Field

[0001] This invention relates to the field of document processing technology, and in particular to a method, apparatus and electronic device for extracting keywords from complex layout documents.

Background Technology

[0002] As enterprises deepen their digital transformation, industries such as government affairs, finance, law, and engineering generate massive amounts of electronic documents. These documents come in various formats, including Word, Excel, TXT, PDF, and images. They also commonly feature complex structures such as mixed text and images, tables, signatures, and watermarks, nested tables across pages, multi-column layouts, and non-standard heading levels. It is necessary to accurately extract key fields from these documents to meet core business needs such as automated document classification, content retrieval, information aggregation, and compliance verification.

[0003] Currently, existing technologies for extracting keywords from documents mainly fall into two categories:

[0004] The first category is keyword extraction schemes based on a single large language model. These schemes can only process plain text content and cannot handle complex layout files in which images and text coexist. They also cannot accurately identify the layout logic of the file and are prone to problems such as key fields being matched to the wrong content and contextual semantics being broken.

[0005] The second category is keyword extraction schemes based on a single multimodal model. Although these schemes can recognize both image and text content, they still have many unavoidable technical shortcomings:

[0006] First, the file parsing process has inherent shortcomings. Existing technologies mostly use a single parsing tool to process files, failing to take into account the technical advantages of different parsing tools. For example, for PDF files, using only the pdfplumber library to extract text is prone to losing layout information, and using only OCR recognition is easily affected by image clarity, resulting in recognition errors. Even when some solutions use multiple parsing tools, they merely concatenate the results without resolving the conflicts between the outputs of the different tools. This leads to incomplete and inaccurate text content extraction from the source, directly affecting the effectiveness of subsequent keyword extraction.

[0007] Secondly, the layout recognition and image-text fusion capabilities are insufficient. Existing technologies have not established a precise mapping relationship between text content and document layout visual information, and cannot achieve spatial alignment of text and images. Multimodal models struggle to accurately capture the inherent logic of complex layouts and cannot establish associations between layout elements such as logos and company names, titles and core content, table headers and corresponding data, and signatures and dates. In complex scenarios such as mixed text and images, cross-page tables, and nested layouts, issues such as missing keywords, incorrect field correspondences, and semantic deviations are highly likely to occur.

[0008] Third, there is a lack of a sound quality control and error correction mechanism. Existing technologies rely solely on the model's single output results, lacking a multi-dimensional keyword verification mechanism. Even if some solutions include basic verification steps, these are merely post-hoc verifications based on fixed rules. Once an error is discovered, the entire file needs to be re-analyzed and reasoned, resulting in low processing efficiency and an inability to achieve targeted error correction. Furthermore, most existing solutions require pre-training or fine-tuning of large models for specific file types and business scenarios to ensure extraction accuracy, failing to balance versatility and extraction precision, limiting adaptability to various scenarios, and exhibiting poor general applicability.

[0009] In summary, existing technologies cannot simultaneously address the core pain points of keyword extraction in complex layout files, such as low accuracy, poor full-format compatibility, low error correction efficiency, and insufficient versatility. They are insufficient to meet the high precision, high reliability, and high adaptability requirements for keyword extraction in actual business applications.

Summary of the Invention

[0010] The purpose of this invention is to provide a method, apparatus, and electronic device for extracting keywords from complex layout files, which has the advantages of improving keyword extraction accuracy, enhancing full-format compatibility, improving error correction efficiency, and increasing versatility.

[0011] In a first aspect, the present invention provides a method for extracting keywords from complex layout files, comprising:

[0012] Obtain the source file for a complex layout;

[0013] The source file content is parsed using a combination of multiple tools. For source files of different formats, corresponding dedicated parsing tools are selected. At least two parsing tools are used to parse the same source file in parallel to obtain multiple text contents that are redundantly verified. Each text content is extracted synchronously and carries the spatial coordinate information of the corresponding content in the source file, so as to realize a one-to-one mapping between the text content and the layout position of the source file.

[0014] Convert the source file page by page into an image format that retains the original layout information;

[0015] Using the image converted from the source file as the reference coordinate system, perform multi-source text conflict resolution processing based on spatial coordinates on multiple text contents carrying spatial coordinate information, and generate a fused text set after dual-dimensional verification of coordinate matching degree and recognition confidence.

[0016] The converted image is precisely aligned with the fused text set in spatial position. Then, the aligned image and the fused text set are synchronously input into the multimodal large model. System prompts containing built-in layout rules, text priority rules, and targeted key field extraction rules matched to the source file type guide the multimodal large model to complete intelligent hierarchical classification of layout areas, cross-validation of visual layout information and text content, layout logic recognition, and content fusion reasoning. The model outputs unified key field content that carries the corresponding layout position anchor information.

[0017] The key field content output by the multimodal large model is initially deduplicated and normalized. Then, the processed key field content is standardized and formatted using a large language model. A three-dimensional cross-validation is performed based on the layout position anchor information, contextual semantic association, and preset industry rules of the key fields. For key fields that fail the validation, a directional feedback instruction carrying the abnormal position coordinates, error type, and correction reference direction is generated and sent back to the layout analysis stage to guide the multimodal large model to perform directional re-inference for abnormal areas until all key fields pass the three-dimensional cross-validation. Finally, standardized structured text matching the preset field template is generated.

[0018] Furthermore, the source file format includes one or more of Word, Excel, TXT, PDF, and images, and the source file supports complex layout content that mixes text with images, tables, signatures, and watermarks.

[0019] The method of parsing the source file using a combination of tools includes:

[0020] For source files of different formats, corresponding dedicated parsing tools are selected. Word files are parsed using the docx library, Excel files are parsed using the openpyxl library, image files are parsed using the pytesseract tool for OCR recognition, and TXT files are parsed directly using a text reading tool.

[0021] For PDF source files, the pdfplumber library is first used to extract the text content, and each page of the PDF file is converted into an image. The text content in the images is then extracted using an OCR recognition tool, resulting in two redundant and mutually verifying text contents of the PDF source file. The pdfplumber library preserves the text layout, paragraph structure, and table row and column relationships in the PDF file during the text extraction process. Before extracting the text content, the OCR recognition tool performs tilt correction, contrast enhancement, and noise reduction preprocessing on the converted images. Finally, the pytesseract tool is used to recognize the text content carrying coordinate information.

[0022] Furthermore, the multi-source text conflict resolution process based on spatial coordinates includes:

[0023] Using the image converted from the source file as the reference coordinate system, the spatial coordinates of each text fragment in multiple text contents are mapped and matched with the image pixel coordinates, and a two-dimensional score of coordinate matching degree and text recognition confidence degree is generated for each text fragment.

[0024] For conflicting segments with overlapping coordinates and inconsistent text content, a hierarchical resolution rule is triggered: for conflicting segments whose coordinate matching degrees differ by more than a preset threshold, the text with the higher matching degree is used; for conflicting segments whose coordinate matching degrees differ by no more than the preset threshold, the result of the parsing tool best suited to the layout type of the area where the segment is located is used.

[0025] For conflict segments that cannot be resolved by rules, they are marked as high-risk areas. During the layout analysis stage, the multimodal large model is guided to perform key reasoning and verification on the high-risk areas.
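The hierarchical rule in [0023] to [0025] can be illustrated with a minimal Python sketch. The `Fragment` fields, the `PREFERRED_TOOL` table, the overlap test based on intersection-over-union, and both threshold values are assumptions made for illustration; the text above does not specify them. When no rule applies, the sketch provisionally keeps the higher-confidence fragment and flags it as high-risk so that a later stage can re-check it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    box: tuple    # (x0, y0, x1, y1) in pixels of the reference image
    tool: str     # hypothetical tool label, e.g. "pdfplumber" or "ocr"
    match: float  # coordinate matching degree in [0, 1]
    conf: float   # recognition confidence in [0, 1]


# Assumed mapping from layout type to the tool preferred in that kind of area.
PREFERRED_TOOL = {"table": "pdfplumber", "paragraph": "pdfplumber", "stamp": "ocr"}


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_conflict(a, b, min_iou=0.5):
    """Overlapping coordinates with inconsistent text content."""
    return a.text != b.text and iou(a.box, b.box) >= min_iou


def resolve(a, b, layout_type, threshold=0.1):
    """Return (chosen_fragment, high_risk) for two conflicting fragments."""
    # Level 1: a clear gap in coordinate matching degree decides.
    if abs(a.match - b.match) > threshold:
        return (a if a.match > b.match else b), False
    # Level 2: otherwise prefer the tool suited to this layout type.
    preferred = PREFERRED_TOOL.get(layout_type)
    for frag in (a, b):
        if frag.tool == preferred:
            return frag, False
    # No rule applies: keep a provisional choice and mark it high-risk.
    return (a if a.conf >= b.conf else b), True
```

In this sketch the high-risk flag is the only information passed forward; a fuller version would presumably also carry the rejected candidate so the multimodal model can compare both readings.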

[0026] Furthermore, the intelligent hierarchical classification of the layout area includes:

[0027] The multimodal large model is used to classify the layout regions of the converted image, dividing the source file into core high-value regions, complex layout regions, and low-value redundant regions.

[0028] Different processing rules are set for different regions: the highest precision parsing and dual-weighted reasoning are used for the core high-value regions to increase the weight coefficient of text priority; dual parsing tools are used for verification and multimodal model cross-reasoning for the complex layout regions, while retaining complete coordinate anchor point information; low-value redundant regions are only lightly filtered and do not enter the core keyword reasoning stage.

[0029] For complex layout areas spanning multiple pages, the system identifies one or more of the tables, paragraphs, and key fields that are split across pages by leveraging the continuity of table headers between pages, the continuity of coordinate positions, and the semantic coherence of the context, thereby merging cross-page content and completing keyword information.
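As a rough illustration of the cross-page merging in [0029], the following Python sketch joins table fragments from consecutive pages using two of the three cues named above: header continuity and coordinate continuity. The semantic-coherence check is omitted. The fragment dictionary layout, the tolerance `x_tol`, and the assumption of at most one table fragment per page are all hypothetical simplifications.

```python
def is_continuation(table, frag, x_tol=5.0):
    """Decide whether `frag` continues `table` on the following page."""
    if frag["page"] != table["pages"][-1] + 1:
        return False
    # Coordinate continuity: left and right table edges line up across pages.
    aligned = (abs(table["x0"] - frag["x0"]) <= x_tol
               and abs(table["x1"] - frag["x1"]) <= x_tol)
    # Header continuity: the header is either repeated or absent on the new page.
    header_ok = frag["header"] is None or frag["header"] == table["header"]
    # Every continued row must have as many cells as the header has columns.
    width_ok = table["header"] is None or all(
        len(row) == len(table["header"]) for row in frag["rows"])
    return aligned and header_ok and width_ok


def merge_cross_page(fragments):
    """Merge per-page table fragments into logical tables."""
    tables = []
    for frag in sorted(fragments, key=lambda f: f["page"]):
        if tables and is_continuation(tables[-1], frag):
            tables[-1]["rows"].extend(frag["rows"])
            tables[-1]["pages"].append(frag["page"])
        else:
            tables.append({
                "pages": [frag["page"]],
                "header": frag["header"],
                "x0": frag["x0"],
                "x1": frag["x1"],
                "rows": list(frag["rows"]),
            })
    return tables
```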

[0030] Furthermore, the layout rules built into the system prompts include layout logic that maps the company name to the area below the logo in the image, the core content to the area immediately below the title, the data fields to the area below the table header, and the date and signature information to the signature area, and that excludes non-core content in the header and footer areas. These rules are used to guide the multimodal large model to establish the corresponding relationship between layout elements and key fields in the source file.

[0031] The system's built-in text priority rules include prioritizing text content that conforms to layout rules, has high recognition clarity, and has no obvious semantic deviation as the basis for reasoning. For text fragments with inconsistent recognition among multiple text contents, cross-comparison and verification are performed by combining layout logic and contextual semantics to eliminate incorrectly recognized text and retain the correct content with the highest matching degree.

[0032] Furthermore, the preliminary deduplication and normalization process specifically includes:

[0033] A deduplication algorithm based on string matching and semantic similarity calculation is adopted to remove completely identical keywords, semantically identical synonyms, and redundant fields without actual business significance. At the same time, different expressions of the same key field are normalized, and the deduplicated and normalized key field content is then passed into the subsequent formatting process.
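A minimal Python sketch of the deduplication and normalization in [0033] follows. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for the semantic similarity calculation, which the text does not specify; the `ALIASES` table and the 0.9 threshold are likewise hypothetical.

```python
import difflib
import re

# Hypothetical alias table: different expressions of the same key field.
ALIASES = {
    "company name": "company_name",
    "name of company": "company_name",
    "issue date": "date",
    "date of issue": "date",
}


def normalize_name(name):
    """Map a raw field name to a canonical key."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    return ALIASES.get(key, key.replace(" ", "_"))


def similar(a, b, threshold=0.9):
    """Character-level similarity, used here in place of semantic similarity."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def dedupe(fields):
    """fields: list of (name, value) pairs; returns normalized, deduplicated pairs."""
    kept = []
    for name, value in fields:
        value = value.strip()
        if not value:
            continue  # drop fields that carry no content
        canon = normalize_name(name)
        if any(canon == k and similar(value, v) for k, v in kept):
            continue  # same field, identical or near-identical value
        kept.append((canon, value))
    return kept
```

A character-level ratio cannot detect true synonyms; an embedding-based comparison would be one way to cover that case, at the cost of an extra model dependency.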

[0034] Furthermore, the three-dimensional cross-validation includes integrity validation, accuracy validation, and reasonableness validation;

[0035] The integrity verification specifically involves: pre-setting a list of required key fields for the corresponding file type; comparing the key fields output by the multimodal large model with the pre-set list of required key fields one by one to determine whether there are any missing required key fields; if there are any omissions, generating a directional feedback instruction containing the name of the missing field and the corresponding layout area prompt, feeding it back to the layout analysis stage, and readjusting the system prompt words to guide the multimodal large model to perform layout analysis and content reasoning again until all required key fields are obtained;

[0036] The accuracy verification specifically involves: combining the full-text context of the source file and the anchor information of the layout position, determining the matching degree between the extracted key fields and the original text content of the corresponding area of the source file; if the matching degree is lower than a preset threshold, generating a directional feedback instruction containing the erroneous fields and the corresponding original text position, and feeding it back to the layout analysis stage for reprocessing;

[0037] The reasonableness verification specifically involves: based on preset industry standard rules, determining whether the format, semantics, and numerical range of key fields conform to the standard rules; if there are key fields that do not conform to the rules, generating a directional feedback instruction describing the reasonableness anomaly, and feeding it back to the layout analysis stage for re-verification and screening.
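The three checks in [0035] to [0037] might be sketched in Python as below. The required-field list, the regular-expression rules, the feedback dictionary shape, and the definition of matching degree (1.0 when the value occurs verbatim in the anchored region, otherwise a `difflib` ratio) are all assumptions for illustration and are not taken from the text.

```python
import difflib
import re

# Hypothetical required-field list and format rules for one file type.
REQUIRED = ["contract_no", "amount", "sign_date"]
RULES = {
    "sign_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\d+(\.\d{1,2})?"),
}


def matching_degree(value, original):
    """1.0 if the value appears verbatim in the region text, else a similarity ratio."""
    if value in original:
        return 1.0
    return difflib.SequenceMatcher(None, value, original).ratio()


def validate(fields, source_text_at, match_threshold=0.8):
    """Run integrity, accuracy, and reasonableness checks.

    fields: {name: {"value": str, "anchor": hashable position}}
    source_text_at: callable mapping an anchor to the original text of that area
    Returns a list of feedback instructions; an empty list means all checks passed.
    """
    feedback = []
    # Integrity: every required field must be present.
    for name in REQUIRED:
        if name not in fields:
            feedback.append({"field": name, "error": "missing", "anchor": None})
    for name, field in fields.items():
        # Accuracy: the value must match the text found at its layout anchor.
        original = source_text_at(field["anchor"])
        if matching_degree(field["value"], original) < match_threshold:
            feedback.append({"field": name, "error": "mismatch", "anchor": field["anchor"]})
        # Reasonableness: the value must satisfy the preset format rule, if any.
        rule = RULES.get(name)
        if rule is not None and not rule.fullmatch(field["value"]):
            feedback.append({"field": name, "error": "rule_violation", "anchor": field["anchor"]})
    return feedback
```

Each feedback entry names the field, the error type, and the anchor, which is the minimum a re-inference step would need to target only the abnormal area.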

[0038] Furthermore, after generating standardized structured text that matches the preset field template, the process also includes a self-learning iterative processing of validation rules, specifically:

[0039] Build a local sample iteration library and store the key field results, corresponding layout features, anomaly correction records, and file type tags of each successful validation in the sample library;

[0040] Based on sample database data, the layout rules and text priority rules, the threshold and judgment criteria of three-dimensional cross-validation, and the classification rules of layout area hierarchy in the system prompts are periodically and automatically iterated and optimized.

[0041] In a second aspect, the present invention provides a keyword extraction device for complex layout documents, comprising:

[0042] The source file acquisition module is used to acquire the source files of complex layouts;

[0043] The source file parsing module is used to parse the content of the source file using a combination of multiple tools. For source files of different formats, corresponding dedicated parsing tools are selected. At least two parsing tools are used to parse the same source file in parallel to obtain multiple text contents that are redundantly verified. Each text content is extracted synchronously and carries the spatial coordinate information of the corresponding content in the source file, so as to realize a one-to-one mapping between the text content and the layout position of the source file.

[0044] The image conversion module is used to convert the source file page by page into an image format that retains the original layout information;

[0045] The conflict resolution processing module is used to perform multi-source text conflict resolution processing based on spatial coordinates on multiple text contents carrying spatial coordinate information, using the image converted from the source file as the reference coordinate system, and generate a fused text set after dual-dimensional verification of coordinate matching degree and recognition confidence.

[0046] The key field content output module is used to accurately align the converted image with the fused text set in spatial position, and then synchronously input the aligned image and the fused text set into the multimodal large model. System prompts containing built-in layout rules, text priority rules, and targeted key field extraction rules matched to the source file type guide the multimodal large model to complete intelligent hierarchical classification of layout areas, cross-validation of visual layout information and text content, layout logic recognition, and content fusion reasoning, and to output unified key field content that carries the corresponding layout position anchor information.

[0047] The standardized structured text output module is used to perform preliminary deduplication and normalization processing on the key field content output by the multimodal large model. Then, the processed key field content is standardized and formatted using a large language model. A three-dimensional cross-validation is performed based on the layout position anchor information, contextual semantic association, and preset industry rules of the key fields. For key fields that fail the validation, a directional feedback instruction carrying the abnormal position coordinates, error type, and correction reference direction is generated and sent back to the layout analysis stage to guide the multimodal large model to perform directional re-inference for abnormal areas until all key fields pass the three-dimensional cross-validation. Finally, standardized structured text matching the preset field template is generated.

[0048] In a third aspect, the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the complex layout file keyword extraction method described in the first aspect.

[0049] Beneficial effects:

[0050] This invention provides a method, apparatus, and electronic device for extracting keywords from complex layout files. By employing parallel parsing with multiple tools and resolving spatial coordinate conflicts, it ensures the integrity and accuracy of text extraction from complex layout files from the source. Through precise image-text alignment and deep fusion reasoning with a multimodal large model, it effectively improves the extraction accuracy of key fields in complex layout scenarios, solving problems such as keyword omissions and correspondence errors. Simultaneously, it constructs a fully closed-loop directional error correction mechanism based on three-dimensional cross-validation, significantly improving processing efficiency and extraction accuracy, meeting the needs of government, finance, and other industries for high-precision, high-reliability, and highly adaptable keyword extraction.

Attached Figure Description

[0051] Figure 1 is a flowchart illustrating a method for extracting keywords from complex layout files, as provided by the present invention.

[0052] Figure 2 is a schematic diagram of a complex layout file keyword extraction device provided by the present invention.

[0053] Figure 3 is a schematic diagram of the structure of an electronic device provided by the present invention.

Detailed Implementation

[0054] The technical solutions of this invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are merely some, not all, of the embodiments of this invention. The components of this invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without inventive effort are within the scope of protection of this invention.

[0055] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0056] It should be noted that the method of this embodiment can be executed by a single device, such as a computer or server. The method of this embodiment can also be applied to a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method of this embodiment, and the multiple devices will interact with each other to complete the method described.

[0057] It should be noted that the above description describes some embodiments of the present invention. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than that shown in the above embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0058] Traditional methods for extracting keywords from documents, whether based on a single large language model or a single multimodal model, all have limitations. Specifically, these limitations manifest in: incomplete and inaccurate text extraction during document parsing; insufficient layout recognition and image-text fusion capabilities, leading to keyword omissions, incorrect correspondences, and semantic deviations in complex layout scenarios; and a lack of robust quality control and targeted error correction mechanisms, resulting in low processing efficiency and poor versatility, making it difficult to meet the business requirements of high precision, high reliability, and high adaptability.

[0059] In this regard, as shown in Figure 1, this invention proposes a method for extracting keywords from complex layout files, including:

[0060] S100. Obtain the source file for the complex layout;

[0061] S200. The source file content is parsed using a combination of multiple tools. For source files of different formats, corresponding dedicated parsing tools are selected. At least two parsing tools are used to parse the same source file in parallel to obtain multiple text contents that are redundantly checked. Each text content is extracted synchronously and carries the spatial coordinate information of the corresponding content in the source file to achieve a one-to-one mapping between the text content and the layout position of the source file.

[0062] S300. Convert the source file page by page into an image format that retains the original layout information;

[0063] S400. Using the image converted from the source file as the reference coordinate system, perform multi-source text conflict resolution processing based on spatial coordinates on multiple text contents carrying spatial coordinate information, and generate a fused text set after dual-dimensional verification of coordinate matching degree and recognition confidence.

[0064] S500. The converted image and the fused text set are precisely aligned in spatial position. Then, the aligned image and the fused text set are synchronously input into the multimodal large model. System prompts containing built-in layout rules, text priority rules, and targeted key field extraction rules matched to the source file type guide the multimodal large model to complete the intelligent hierarchical classification of layout areas, cross-validation of visual layout information and text content, layout logic recognition, and content fusion reasoning, and to output unified key field content that carries the corresponding layout position anchor point information.

[0065] S600. Perform preliminary deduplication and normalization on the key field content output by the multimodal large model, and then use the large language model to standardize and format the processed key field content. Perform three-dimensional cross-validation based on the layout position anchor information, context semantic association, and preset industry rules of the key fields. For key fields that fail the validation, generate directional feedback instructions carrying abnormal position coordinates, error type, and correction reference direction and send them back to the layout analysis stage to guide the multimodal large model to perform directional re-inference for abnormal areas until all key fields pass the three-dimensional cross-validation. Finally, generate standardized structured text that matches the preset field template.
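The closed loop between S500 and S600 can be pictured as a short control-flow sketch in Python. The callables `extract` and `validate` are hypothetical placeholders for the multimodal extraction step and the three-dimensional cross-validation. The `max_rounds` cap is an added safeguard against non-convergence and is an assumption; the description itself iterates until all key fields pass.

```python
def extract_with_correction(extract, validate, max_rounds=3):
    """Closed-loop extraction with directional re-inference.

    extract(feedback): returns {field_name: value}; called with None on the first
        pass, and afterwards with the feedback list so that only the abnormal
        areas are re-inferred.
    validate(fields): returns a list of feedback instructions, empty when all pass.
    Returns (fields, passed).
    """
    fields = dict(extract(None))
    for _ in range(max_rounds):
        feedback = validate(fields)
        if not feedback:
            return fields, True
        # Only the re-inferred fields are overwritten; fields that passed are kept.
        fields.update(extract(feedback))
    return fields, not validate(fields)
```

Updating only the fields named in the feedback, rather than re-running the whole document, is what distinguishes directional correction from the full re-analysis criticized in the background section.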

[0066] The source file formats processed by this invention are diverse, including one or more of Word, Excel, TXT, PDF, and image formats. These source files are not limited to plain text content; they also support complex layouts that mix text with images, tables, signatures, watermarks, and other elements, while adapting to batch processing of both single-page and multi-page files. This diversity and complexity are common document characteristics in real-world business scenarios, placing high demands on the robustness and accuracy of the parsing tools.

[0067] In step S100, the source file for the complex layout can be obtained through various means. For example, it can be obtained by the user manually uploading the file, downloading the file from a specified network path, or reading the file from the local file system.

[0068] In step S200, the present invention employs a series of dedicated parsing tools to parse source files of different formats.

[0069] The method of parsing the source file using a combination of tools includes:

[0070] For source files of different formats, corresponding dedicated parsing tools are selected. Word files are parsed using the docx library, Excel files are parsed using the openpyxl library, image files are parsed using the pytesseract tool for OCR recognition, and TXT files are parsed directly using a text reading tool.

[0071] For PDF source files, the pdfplumber library is first used to extract the text content, and each page of the PDF file is converted into an image. The text content in the images is then extracted using an OCR recognition tool, resulting in two redundant and mutually verifying text contents of the PDF source file. The pdfplumber library preserves the text layout, paragraph structure, and table row and column relationships in the PDF file during the text extraction process. Before extracting the text content, the OCR recognition tool performs tilt correction, contrast enhancement, and noise reduction preprocessing on the converted images. Finally, the pytesseract tool is used to recognize the text content carrying coordinate information.

[0072] Specifically, for common formats such as Word, Excel, TXT, and images, dedicated parsing tools are selected to ensure accurate acquisition of the initial text content. For Word files, the docx library is used for parsing, which can deeply parse the XML structure of Word documents and accurately extract text content and its formatting information. For Excel files, the openpyxl library is used, which is specifically designed for handling Excel's .xlsx format and can accurately obtain cell data, formulas, and table structure. For pure image files, the pytesseract tool is used for optical character recognition (OCR) to convert the text content in the image into editable text. For TXT files, due to their simple structure, text reading tools can be used directly to efficiently acquire their content.
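The format-specific tool selection described above could be organized as a dispatch table, as in the following Python sketch. It assumes the python-docx, openpyxl, pytesseract, and Pillow packages are installed for the respective formats, and imports them lazily so that plain-text parsing works without them. For brevity it returns plain text only; the spatial coordinates required by the method are not extracted here, and the function names and the supported-suffix list are hypothetical.

```python
from pathlib import Path


def parse_txt(path):
    """TXT: read directly with a text reader."""
    return Path(path).read_text(encoding="utf-8")


def parse_docx(path):
    """Word: paragraph text via python-docx."""
    import docx
    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def parse_xlsx(path):
    """Excel: cell values via openpyxl, one tab-separated line per row."""
    import openpyxl
    workbook = openpyxl.load_workbook(str(path))
    lines = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            lines.append("\t".join("" if v is None else str(v) for v in row))
    return "\n".join(lines)


def parse_image(path):
    """Image: OCR via pytesseract."""
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(str(path)))


# File suffix -> dedicated parser.
PARSERS = {
    ".txt": parse_txt,
    ".docx": parse_docx,
    ".xlsx": parse_xlsx,
    ".png": parse_image,
    ".jpg": parse_image,
    ".jpeg": parse_image,
}


def parse_source(path):
    """Select the dedicated parser by file suffix and run it."""
    suffix = Path(path).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix!r}") from None
    return parser(path)
```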

[0073] Given the complexity of the PDF format, a dual-parse strategy is adopted to ensure the comprehensiveness and accuracy of text extraction.

[0074] First, the pdfplumber library is used to extract the text content of the PDF file. While extracting the text, this library can simultaneously preserve the text layout, paragraph structure, and table row and column relationships in the PDF file, which is crucial for understanding the original layout of the document.

[0075] Secondly, to address the potential presence of image-based text or complex graphic elements in PDFs, each page of the PDF file is converted into an image, and OCR tools are used to extract text from these images. Before OCR recognition, a series of preprocessing operations are performed on the converted images to improve recognition accuracy. These include tilt correction to correct angular deviations during document scanning or photography, contrast enhancement to make the text and background more distinct, and noise reduction to eliminate noise and interference in the image. After preprocessing, the text content carrying precise coordinate information is then recognized using the pytesseract tool. By using the pdfplumber library and the OCR recognition tool in parallel, two redundant and mutually verifying sets of text content are obtained from the PDF source file, greatly improving the reliability of the initial text extraction.
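For the two redundant PDF text sets to verify each other, they have to share one coordinate system. The sketch below assumes the word dictionaries come from pdfplumber's `page.extract_words()` (positions in PDF points, measured from the top-left of the page) and the OCR dictionary from `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` run on the page rendered at `dpi`. The `Fragment` type and the confidence of 1.0 assigned to embedded text are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Fragment:
    text: str
    page: int
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels
    source: str
    conf: float  # 0.0 to 1.0


def from_pdf_words(words: List[dict], page: int, dpi: int = 300) -> List[Fragment]:
    # One PDF point is 1/72 inch, so a page rendered at `dpi` scales by dpi / 72.
    scale = dpi / 72.0
    return [
        Fragment(
            w["text"],
            page,
            (w["x0"] * scale, w["top"] * scale, w["x1"] * scale, w["bottom"] * scale),
            "pdfplumber",
            1.0,  # assumption: the embedded text layer carries no confidence value
        )
        for w in words
    ]


def from_ocr_data(data: Dict[str, list], page: int) -> List[Fragment]:
    fragments = []
    columns = zip(data["text"], data["left"], data["top"],
                  data["width"], data["height"], data["conf"])
    for text, left, top, width, height, conf in columns:
        confidence = float(conf)
        if confidence < 0 or not text.strip():
            continue  # Tesseract reports conf -1 for rows that are not words
        fragments.append(
            Fragment(text, page, (left, top, left + width, top + height),
                     "ocr", confidence / 100.0)
        )
    return fragments
```

Both lists can then be passed to the conflict resolution of step S400 without further unit conversion.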

[0076] Therefore, this invention can effectively address the challenges of keyword extraction from various complex layout files. This strategy of combining multiple tools, targeted optimization, and redundant verification significantly improves the accuracy, completeness, and robustness of the initial text extraction, ensuring high quality of the acquired text content and its spatial coordinate information. This lays a solid foundation for subsequent multi-source text conflict resolution, multimodal large model processing, and the final standardized output of key fields.

[0077] In step S300, the source file is converted into an image format to preserve its original visual layout. For example, a multi-page PDF file can be rendered page by page into a high-resolution PNG or JPEG image. A Word document can also be converted into an image using a virtual printer or a specific library. During the conversion process, it is necessary to ensure that the image clearly presents all the original layout visual elements of the original document, including text, images, tables, lines, backgrounds, etc., and maintains their relative positions on the page.

[0078] In step S400, after acquiring multiple text contents, these contents need to be integrated and verified. For example, text fragments extracted by different parsing tools can be compared with a reference image based on their respective spatial coordinate information. If multiple text contents claim the presence of text in the same area of the image, the degree of coordinate overlap and the text similarity of these fragments need to be evaluated. An initial confidence score can be assigned to each text fragment, and its coordinate matching degree can be adjusted according to how well it fits the image layout. When texts from different sources conflict in the same area, a selection can be made according to preset conflict resolution rules, such as selecting longer, more complete, or more visually consistent text fragments. Conflicting fragments that cannot be resolved by the rules are marked as high-risk areas, guiding the multimodal large model to perform focused reasoning and verification on them in subsequent layout analysis. Finally, the text fragments that have undergone preliminary screening and integration are gathered into a fused text set containing text content and the corresponding spatial coordinates.

[0079] In step S500, the images generated page by page from the source files, preserving the original layout and visual elements, are first used as a unified spatial reference coordinate system. The fused text set, which has undergone multi-source text conflict resolution and carries calibrated, standardized spatial coordinates, is then mapped point by point according to the page number, bounding box coordinates, and image pixel coordinates of the text fragments. Offset calibration and anchor point binding yield a precise one-to-one spatial correspondence between the visual layout of the image and the text content, completing the spatial alignment of the two. Subsequently, the aligned image data and fused text data are synchronously input into the multimodal large model, whose built-in system prompts provide layout rule guidance, text priority rules, and targeted key field extraction rules that match the current source file type. Guided by these prompts, the multimodal large model first classifies the document into core high-value areas, complex layout areas, and low-value redundant areas. Then, it cross-validates the visual layout information of the images against the text content of the fused text set, identifies the overall layout logic of the document, and completes fusion reasoning over the multimodal information. Finally, it outputs key field content in a unified format, with each key field accompanied by anchor information for its corresponding layout position in the source file.

[0080] Specifically, the multimodal large model is a large model with the ability to understand images and text and recognize layouts; the large language model is a large model with the ability to standardize and verify text.

[0081] The spatial coordinates of each text fragment in the fused text set are precisely mapped and aligned with the pixel coordinates of the image. This matches the corresponding text content to each visual region in the image, ensuring a one-to-one spatial correspondence between visual layout information and text content. This mechanism fundamentally prevents image-text misalignment and positional mismatch issues. The specific implementation method is as follows:

[0082] The images obtained by converting each page of the source file serve as a unified reference coordinate system. This coordinate system uses the top-left corner of each page image as the origin, with image pixels as the smallest unit, the horizontal axis as the X-axis, and the vertical axis as the Y-axis. Each page image corresponds to its own coordinate system, identified by page number. Each text segment in the fused text set carries standardized spatial coordinate information including page number, top-left corner coordinates, and bottom-right corner coordinates. These coordinates were calibrated during the multi-source text conflict resolution process and are fully compatible with the reference coordinate system. The standardized spatial coordinates of each text segment in the fused text set are mapped and matched point by point with the pixel coordinate system of the image with the corresponding page number. Position verification is completed by calculating the intersection over union (IoU) of the text segment bounding box and the visual region of the image. For tiny offsets of less than 1 pixel, coordinate deviation calibration is automatically performed so that the coordinate range of the text segment coincides with the visual region on the image. Finally, each calibrated text segment is spatially anchored to the corresponding visual region of the image, forming a precise one-to-one mapping relationship of "image visual region - text segment - spatial coordinates," thereby achieving precise spatial alignment between the image and the fused text set.
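The position verification and sub-pixel calibration described above can be sketched as follows. The boxes are `(x0, y0, x1, y1)` tuples in image pixels with the origin at the top-left corner. The function names, the anchor dictionary layout, and the `min_iou` acceptance threshold of 0.5 are assumptions made for illustration; the source only specifies the IoU check and the 1-pixel calibration limit.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def iou(a: Box, b: Box) -> float:
    # Intersection over union of two axis-aligned boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def calibrate(text_box: Box, region: Box, max_offset: float = 1.0) -> Box:
    # Snap to the visual region only when every edge deviates by less than 1 px.
    if all(abs(t - r) < max_offset for t, r in zip(text_box, region)):
        return region
    return text_box


def bind_anchor(page: int, text: str, text_box: Box, region: Box,
                min_iou: float = 0.5) -> Optional[dict]:
    box = calibrate(text_box, region)
    score = iou(box, region)
    if score < min_iou:
        return None  # position verification failed; no anchor is created
    # One record per "image visual region - text segment - spatial coordinates".
    return {"page": page, "text": text, "box": box, "region": region, "iou": score}
```

A segment whose box differs from the region by a fraction of a pixel is snapped onto the region and anchored with an IoU of 1.0, while a segment that does not overlap the region at all yields no anchor.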

[0083] The multimodal large model has built-in exclusive system prompts, which contain three core rules: layout rule guidance, text priority rules, and targeted key field extraction rules. These three rules work together to guide the multimodal large model to accurately extract key fields from complex layout files.

[0084] Layout rule guidance is a set of predefined instructions embedded in system prompts to provide multimodal large models with prior knowledge about the relationship between the document's visual layout and content semantics, thereby more accurately identifying key information.

[0085] Text priority rules are a series of predefined instructions embedded in system prompts. They guide multimodal large models on how to select and judge text content or recognition results provided by multiple parsing tools when there are inconsistencies, so as to ensure that the model can prioritize higher quality and more reliable text information as the basis for inference.

[0086] The targeted key field extraction rule precisely limits the multimodal large model to extract only the core key fields relevant to the current business scenario, rather than extracting all text content indiscriminately. This achieves business-oriented, precise, and standardized keyword extraction. The rule is independently configured for different document types, such as bidding documents, contracts, financial statements, and government documents. It includes a clearly defined list of required key fields and works in conjunction with layout rules and text priority rules to ensure that the key fields output by the multimodal large model accurately match the preset field templates. This provides a basis for subsequent three-dimensional cross-validation and standardized structured output.
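A per-document-type field template of the kind described above might be configured as shown below. The document types, field names, and prompt wording are hypothetical placeholders; the source does not list the actual fields.

```python
from typing import Dict, List

# Hypothetical templates: one required-field list per document type.
FIELD_TEMPLATES: Dict[str, List[str]] = {
    "contract": ["party_a", "party_b", "amount", "signing_date"],
    "bidding_document": ["project_name", "bidder", "bid_price", "deadline"],
}


def extraction_rule(doc_type: str) -> str:
    # Text of the targeted extraction rule to embed in the system prompt.
    fields = ", ".join(FIELD_TEMPLATES[doc_type])
    return (
        "Extract only the following key fields and nothing else: "
        f"{fields}. Return each field with its page number and bounding box."
    )


def check_against_template(doc_type: str, output: Dict[str, str]) -> Dict[str, List[str]]:
    # Compare model output with the template before cross-validation.
    required = FIELD_TEMPLATES[doc_type]
    return {
        "missing": [name for name in required if name not in output],
        "unexpected": [name for name in output if name not in required],
    }
```

Keeping the template in data rather than in the prompt text lets the same list drive both the prompt and the later check that the output matches the preset field template.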

[0087] During the process, the multimodal large model, guided by the layout rules in the system prompts, classifies the layout regions of the aligned images, dividing the document into core high-value regions, complex layout regions, and low-value redundant regions, and processes each class differently: more important regions receive more thorough processing, while less important regions receive only lightweight processing.

[0088] Next, based on the classification results, the multimodal large model cross-validates the visual layout information in the image with the text content of the fused text set, checking whether the spatial coordinates of the text fragments match the visual regions of the image, and whether the functional attributes of the text content are consistent with the layout regions, thus verifying the accuracy and layout matching degree of the multi-source text.

[0089] Then, combining the cross-validation results, the multimodal large model identifies the overall layout logic of the document, sorts out the inherent relationship between layout elements such as titles and body text, table headers and data, cross-page content, and mixed text and image areas, and clarifies the function and field affiliation of each area.

[0090] Based on the targeted key field extraction rules and the recognized layout logic, the multimodal large model performs content fusion reasoning over visual, textual, and layout information to integrate and filter out the key fields that meet the requirements, eliminating redundant and erroneous information. After reasoning, the model binds each extracted key field to the anchor information of its layout position in the source file, and finally outputs key field content with a uniform format, accurate fields, and position anchors.

[0091] In step S600, the original key field content output by the multimodal large model is first subjected to preliminary deduplication and normalization processing. Redundant key fields that are completely duplicated or semantically identical are removed by string matching and semantic similarity calculation. At the same time, the expression form of the fields is unified and the basic format is standardized to eliminate the inconsistency caused by abbreviations, special symbols, capitalization, and format differences.
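A minimal sketch of the preliminary deduplication and normalization follows. It assumes character-level similarity from the standard library's `difflib.SequenceMatcher` as a stand-in for the semantic similarity calculation, which the source does not specify; the 0.9 threshold is likewise an assumption.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import List


def normalize(value: str) -> str:
    # NFKC folds full-width characters and similar compatibility variants.
    value = unicodedata.normalize("NFKC", value)
    value = re.sub(r"\s+", " ", value).strip()
    return value.casefold()  # removes capitalization differences


def deduplicate(values: List[str], threshold: float = 0.9) -> List[str]:
    kept: List[str] = []
    for value in values:
        candidate = normalize(value)
        is_duplicate = any(
            SequenceMatcher(None, candidate, normalize(k)).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(value)  # the first occurrence is retained as written
    return kept
```

Exact duplicates have a ratio of 1.0 after normalization, so the same loop covers both string matching and near-duplicate removal.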

[0092] After initial processing, the standardized key field content is input into the large language model, which then performs standardized formatting according to preset field standards, uniformly adjusting field names, data formats, and arrangement order, removing meaningless characters, and forming a set of key fields to be verified that are formatted and clearly structured.

[0093] Next, based on the layout anchor information carried by the key fields themselves, the contextual semantic association of the source file, and the preset industry rules of the corresponding business domain, the set of key fields to be verified is cross-validated in three dimensions: completeness, accuracy, and rationality. Each key field is checked one by one to see if there are any issues such as missing fields, inconsistencies with the original content, or non-compliance with industry standards. After the verification is completed, the results are judged. For key fields that fail the verification, their abnormal position coordinates in the source file are accurately located, the specific error type is identified, and the correction reference direction is determined. Based on this, a directional feedback instruction carrying the abnormal position coordinates, error type, and correction reference direction is generated.
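The three-dimensional check and the resulting directional feedback instructions could look like the sketch below. The concrete tests are simplifying assumptions: completeness is a required-field check, accuracy is a substring check against the source text of the anchored page, and rationality is a regular expression standing in for an industry rule. The `Field` and `Feedback` types are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]


@dataclass
class Field:
    name: str
    value: str
    page: int
    box: Box


@dataclass
class Feedback:
    field: str
    page: Optional[int]  # None when the field is missing and has no anchor
    box: Optional[Box]
    error_type: str      # "completeness", "accuracy" or "rationality"
    hint: str            # correction reference direction


def validate(fields: List[Field], required: List[str],
             page_text: Dict[int, str], rules: Dict[str, str]) -> List[Feedback]:
    found = {f.name: f for f in fields}
    feedback: List[Feedback] = []
    for name in required:
        f = found.get(name)
        if f is None or not f.value.strip():
            feedback.append(Feedback(name, None, None, "completeness",
                                     "field missing; search the whole document"))
        elif f.value not in page_text.get(f.page, ""):
            feedback.append(Feedback(name, f.page, f.box, "accuracy",
                                     "value not found in source text of anchored page"))
        elif name in rules and not re.fullmatch(rules[name], f.value):
            feedback.append(Feedback(name, f.page, f.box, "rationality",
                                     f"value must match {rules[name]}"))
    return feedback
```

Each `Feedback` record carries the abnormal position, the error type, and a correction hint, which is the information the directional feedback instruction is described as containing.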

[0094] The directional feedback instruction is then sent back to the earlier layout analysis stage. Based on the instruction information, the multimodal large model is guided to perform localized directional re-inference only for the marked abnormal areas. The layout identification, image-text cross-validation and key field extraction of the area are then performed again, without having to perform full-process repeated inference on the entire source file.

[0095] After the multimodal large model completes the re-inference of the abnormal region and outputs the updated key field content, the updated content is again subjected to preliminary deduplication and normalization, standardized formatting, and three-dimensional cross-validation to verify whether the corrected key fields meet the requirements.

[0096] The above process of verification, instruction generation, feedback transmission, and targeted re-inference is repeated until all key fields in the source file pass the three-dimensional cross-validation without any abnormalities.
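The closed loop can be sketched as below, with the validator and the localized re-inference passed in as callables. Both callables are hypothetical stand-ins for the three-dimensional cross-validation and the multimodal large model. The `max_rounds` cap is an added assumption that guards against a loop that never converges; the source describes repeating until every field passes.

```python
from typing import Callable, Dict, List, Tuple


def closed_loop(
    fields: Dict[str, str],
    validate: Callable[[Dict[str, str]], List[str]],
    reinfer: Callable[[str], str],
    max_rounds: int = 3,
) -> Tuple[Dict[str, str], List[str], int]:
    fields = dict(fields)  # leave the caller's dictionary untouched
    rounds = 0
    failed = validate(fields)
    while failed and rounds < max_rounds:
        for name in failed:
            # Only the abnormal fields are re-inferred, not the whole file.
            fields[name] = reinfer(name)
        rounds += 1
        failed = validate(fields)
    # `failed` is empty when every field passed within the round limit.
    return fields, failed, rounds
```

Fields that already passed are never sent back to the model, which is where the efficiency gain over full re-inference comes from.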

[0097] Finally, all qualified key fields are integrated, encapsulated, and sorted according to the preset field template to generate standardized structured text with uniform format, complete fields, accurate data, and fully matching the requirements of the preset field template.

[0098] This invention ensures the integrity and accuracy of text extraction from complex layout files from the source by employing parallel parsing with multiple tools and resolving spatial coordinate conflicts. Through precise image-text alignment and deep fusion reasoning with a multimodal large model, it effectively improves the extraction accuracy of key fields in complex layout scenarios, resolving issues such as keyword omissions and misalignment. Simultaneously, it constructs a fully closed-loop directional error correction mechanism based on three-dimensional cross-validation, significantly improving processing efficiency and extraction accuracy, meeting the needs of government, finance, and other industries for high-precision, high-reliability, and highly adaptable keyword extraction.

[0099] In one embodiment, step S400, the multi-source text conflict resolution process based on spatial coordinates, includes:

[0100] Using the image converted from the source file as the reference coordinate system, the spatial coordinates of each text fragment in multiple text contents are mapped and matched with the image pixel coordinates, and a two-dimensional score of coordinate matching degree and text recognition confidence degree is generated for each text fragment.

[0101] For conflicting segments with overlapping coordinates and inconsistent text content, a hierarchical resolution rule is triggered: for conflicting segments with a coordinate matching difference greater than a preset threshold, the text with the higher matching degree is used; for conflicting segments with a coordinate matching difference less than or equal to the preset threshold, the advantageous result of the corresponding parsing tool is matched based on the layout type of the area where the segment is located.

[0102] For conflict segments that cannot be resolved by rules, they are marked as high-risk areas. During the layout analysis stage, the multimodal large model is guided to perform key reasoning and verification on the high-risk areas.

[0103] Specifically, when performing multi-source text conflict resolution based on spatial coordinates, the image converted from the source file is first used as a unified reference coordinate system. This provides a common spatial reference standard for all text fragments from different parsing tools, ensuring consistency in subsequent coordinate comparisons and conflict judgments. Subsequently, the spatial coordinates of each text fragment in multiple text contents, such as its bounding box information, are precisely mapped and matched with the image pixel coordinates. By calculating the overlap (such as the Intersection over Union (IoU)) or center point distance between the bounding box of the text fragment and the corresponding region on the image, its visual alignment can be quantified, thus generating a coordinate matching score for each text fragment. Simultaneously, combining the text recognition confidence scores provided by each parsing tool during the recognition process, a two-dimensional score including coordinate matching score and text recognition confidence score is constructed for each text fragment to comprehensively evaluate its quality.
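As one possible instance of the two-dimensional score, the sketch below uses the center point distance variant mentioned above, normalized by the diagonal of the image region so that the result lies between 0 and 1. The normalization and the `Score` type are assumptions; the source does not fix a formula.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels


@dataclass(frozen=True)
class Score:
    coord_match: float  # how well the fragment sits on the image region
    recog_conf: float   # confidence reported by the parsing tool


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def coord_match(text_box: Box, region: Box) -> float:
    (cx, cy), (rx, ry) = center(text_box), center(region)
    diagonal = math.hypot(region[2] - region[0], region[3] - region[1])
    if diagonal == 0:
        return 0.0
    # 1.0 when the centers coincide, falling to 0.0 one diagonal away.
    return max(0.0, 1.0 - math.hypot(cx - rx, cy - ry) / diagonal)


def score_fragment(text_box: Box, region: Box, tool_conf: float) -> Score:
    # Clamp the tool confidence so both dimensions share the 0..1 range.
    return Score(coord_match(text_box, region), max(0.0, min(1.0, tool_conf)))
```

For a 30 by 40 pixel region (diagonal 50), a fragment whose center is displaced by 5 pixels receives a coordinate matching score of 0.9.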

[0104] When a conflicting segment with overlapping coordinates and inconsistent text content is detected, the system will trigger a pre-defined hierarchical resolution rule. The first-level rule is based on the difference in coordinate matching degree:

[0105] If the difference in coordinate matching between two conflicting segments exceeds a preset threshold, it indicates that one segment is significantly better aligned with the image content in terms of spatial location than the other. In this case, the text segment with the higher coordinate matching degree will be taken as the more reliable result. This preset threshold can be adjusted based on actual application scenarios and experience to balance alignment accuracy and error tolerance.

[0106] If the difference in coordinate matching is less than or equal to a preset threshold, meaning the two conflicting fragments are similarly aligned in space, then the second level of resolution rules is invoked. At this point, the system considers the layout type of the area where the fragment is located (e.g., whether the fragment is in a table, a heading area, or a regular paragraph) to match the optimal result of the corresponding parsing tool. For example, some parsing tools may be more accurate in processing tabular data, while others may perform better in recognizing specific fonts or plain text. By pre-configuring or learning the advantages of each parsing tool under different layout types, the system can intelligently select the most suitable parsing result for the current layout type, thereby improving the accuracy of resolution.

[0107] For conflicting fragments that cannot be clearly resolved even after applying the aforementioned hierarchical rules, the system will mark them as high-risk areas. The coordinates of these high-risk areas, along with all conflicting text fragments, will be passed to the subsequent layout analysis stage. During layout analysis, the system will guide the multimodal large model to perform focused reasoning and verification on these high-risk areas. The multimodal large model will combine its visual understanding capabilities, contextual semantic analysis capabilities, and built-in layout logic to conduct deeper and more detailed cross-validation and reasoning on areas that are difficult to determine, ultimately aiming to identify the most accurate text content and its precise layout position.
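The two-level resolution and the high-risk fallback can be sketched as a single function. The threshold value, the layout type names, and the `PREFERRED_TOOL` table are hypothetical; the source says only that tool advantages per layout type are pre-configured or learned.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Candidate:
    text: str
    source: str         # parsing tool that produced the fragment
    coord_match: float  # coordinate matching degree, 0..1


# Hypothetical table of which tool is trusted for which layout type.
PREFERRED_TOOL: Dict[str, str] = {
    "table": "pdfplumber",
    "scanned_paragraph": "ocr",
}


def resolve(a: Candidate, b: Candidate, layout_type: str,
            threshold: float = 0.15) -> Tuple[Optional[Candidate], bool]:
    """Return (winner, high_risk) for two conflicting fragments."""
    diff = a.coord_match - b.coord_match
    # Level 1: a clear difference in coordinate matching decides the conflict.
    if abs(diff) > threshold:
        return (a if diff > 0 else b), False
    # Level 2: otherwise prefer the tool that is strong for this layout type.
    preferred = PREFERRED_TOOL.get(layout_type)
    for candidate in (a, b):
        if candidate.source == preferred:
            return candidate, False
    # Unresolved: flag as high-risk for focused reasoning by the model.
    return None, True
```

When `high_risk` is true, both candidates and the region coordinates would be forwarded to the layout analysis stage rather than one being discarded.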

[0108] Therefore, this invention effectively solves the problem of text content and spatial coordinate conflicts caused by recognition differences when multiple tools parse source files in parallel. This method provides a quantitative basis for text fragment quality assessment by establishing a unified reference coordinate system and a two-dimensional scoring mechanism. The introduction of hierarchical resolution rules enables the system to intelligently select the optimal parsing result based on the nature of the conflict, avoiding errors that may be introduced by simple merging. In particular, for high-risk areas that are difficult to resolve automatically through rules, they are fed back to a multimodal large model for focused inference and verification, greatly improving the accuracy and robustness of key field extraction in complex layout files, ensuring the high quality of the fused text set, and laying a solid foundation for subsequent key field extraction and standardization processing.

[0109] In some of the above embodiments, although key fields in complex layout files can be initially extracted by combining multi-tool parsing, multi-source text conflict resolution, and multimodal large model processing, in actual processing the information value and layout complexity of different regions vary greatly. If all regions are processed uniformly without distinction, computing resources may be wasted, key information may be extracted inefficiently, and interference from redundant information may even reduce the accuracy of the final extraction.

[0110] In response, in step S500, the present invention further proposes intelligent hierarchical classification of layout areas, specifically including:

[0111] The multimodal large model is used to classify the layout regions of the converted image, dividing the source file into core high-value regions, complex layout regions, and low-value redundant regions.

[0112] Different processing rules are set for different regions: the highest precision parsing and dual-weighted reasoning are used for the core high-value regions to increase the weight coefficient of text priority; dual parsing tools are used for verification and multimodal model cross-reasoning for the complex layout regions, while retaining complete coordinate anchor point information; low-value redundant regions are only lightly filtered and do not enter the core keyword reasoning stage.

[0113] For complex layout areas spanning multiple pages, the system identifies one or more of the tables, paragraphs, and key fields that are split across pages by leveraging the continuity of table headers between pages, the continuity of coordinate positions, and the semantic coherence of the context, thereby merging cross-page content and completing keyword information.

[0114] Among them, layout region classification refers to using a multimodal large model to analyze the visual layout information of a document, identify and distinguish the semantics and importance of different regions in the document.

[0115] Core high-value areas typically refer to areas containing core business information such as contract subject, key terms, amount, and date; complex layout areas may include tables, charts, signatures, multi-column text, and other areas that require special handling; low-value redundant areas may include headers and footers, decorative images, blank areas, and other parts that contribute little to the extraction of key fields.

[0116] This classification can be achieved by training a multimodal large model to recognize the visual features of a document, such as font size, color, position, borders, and background, as well as text content features. For example, the model can learn to recognize common layout elements such as headings, paragraphs, lists, and tables, and categorize them according to preset rules or learned patterns.

[0117] Differentiated processing rules are used to allocate different processing strategies and resources based on the classification results of a region.

[0118] For core high-value areas, due to the critical nature of their information, the highest precision parsing will be employed. This may involve calling more sophisticated OCR models or text parsing algorithms, and performing dual-weighted inference. This means that when performing inference using the multimodal large model, the text content and visual layout information of this area will be given higher weight, ensuring that it is processed preferentially and accurately. Increasing the weight coefficient of text priority means that the text information in this area has the highest decision priority during conflict resolution or information fusion.

[0119] For complex layout areas, parsing errors are prone to occur due to their intricate structures. Therefore, a dual-parsing tool verification method is employed. For example, two different parsing tools are used simultaneously to parse the area, and multimodal model cross-reasoning is performed. This involves combining visual and textual information for multiple verifications to improve the accuracy of extracting complex structural content. Meanwhile, preserving complete coordinate anchor point information is crucial for subsequent accurate alignment and verification.

[0120] For low-value redundant areas, since they contribute little to the extraction of key fields, only lightweight filtering is performed, such as ignoring them directly or performing preliminary screening, and they are not included in the core keyword reasoning process, thereby saving computing resources and improving overall processing efficiency.
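The differentiated processing rules amount to a policy lookup keyed by region class. The sketch below is illustrative only: the numeric weights and the `Policy` fields are assumptions chosen to mirror the three descriptions above, not values given in the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Region(Enum):
    CORE = "core_high_value"
    COMPLEX = "complex_layout"
    REDUNDANT = "low_value_redundant"


@dataclass(frozen=True)
class Policy:
    dual_tool_check: bool   # verify with two parsing tools
    text_weight: float      # weight coefficient of text priority
    keep_anchors: bool      # retain complete coordinate anchor information
    enters_reasoning: bool  # passed on to core keyword reasoning


# Hypothetical values for illustration.
POLICIES: Dict[Region, Policy] = {
    Region.CORE: Policy(True, 2.0, True, True),
    Region.COMPLEX: Policy(True, 1.0, True, True),
    Region.REDUNDANT: Policy(False, 0.0, False, False),
}


def regions_for_reasoning(regions: List[Tuple[str, Region]]) -> List[Tuple[str, Region]]:
    # Lightweight filtering: redundant regions are dropped before reasoning.
    selected = [(name, kind) for name, kind in regions
                if POLICIES[kind].enters_reasoning]
    # Higher text weight first, so core regions are handled preferentially.
    return sorted(selected, key=lambda item: POLICIES[item[1]].text_weight,
                  reverse=True)
```

With this table, a footer classified as redundant never reaches the reasoning stage, while a contract amount classified as core is processed before a complex price table.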

[0121] Handling complex layouts spanning multiple pages addresses common pagination issues in documents, such as a table or paragraph split across two or more pages. Table header continuity involves identifying repeating or continuing patterns in table headers on adjacent pages to determine if they belong to the same table. Coordinate position continuity involves comparing the relative positions and sizes of text or layout elements on adjacent pages to determine if they represent the same content across different pages. Contextual semantic coherence involves analyzing the semantic relationships between text content on adjacent pages to determine if they constitute a complete logical unit. The multimodal large model can utilize these clues to identify split tables, paragraphs, or key fields and logically merge them, thereby completing the extraction of cross-page content and supplementing key fields, avoiding information loss or incompleteness caused by pagination.
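Two of the three cross-page clues, table header continuity and coordinate position continuity, can be checked with simple rules, as sketched below. Contextual semantic coherence is left to the model and is not implemented here. The table dictionaries (with `page`, `x0`, `x1`, and a non-empty `rows` list whose first row on the first page is the header) and the 5-pixel tolerance are assumptions.

```python
from typing import Dict, List


def continues(prev: Dict, table: Dict, tol: float) -> bool:
    next_page = table["page"] == prev["last_page"] + 1
    # Header continuity: same number of columns as the table being continued.
    same_columns = len(table["rows"][0]) == len(prev["header"])
    # Position continuity: left and right edges line up across pages.
    aligned = (abs(table["x0"] - prev["x0"]) <= tol
               and abs(table["x1"] - prev["x1"]) <= tol)
    return next_page and same_columns and aligned


def merge_cross_page(tables: List[Dict], tol: float = 5.0) -> List[Dict]:
    merged: List[Dict] = []
    for table in sorted(tables, key=lambda t: t["page"]):
        if merged and continues(merged[-1], table, tol):
            prev = merged[-1]
            rows = table["rows"]
            if rows[0] == prev["header"]:
                rows = rows[1:]  # drop a header repeated on the new page
            prev["rows"].extend(rows)
            prev["last_page"] = table["page"]
        else:
            merged.append({
                "header": table["rows"][0],
                "rows": list(table["rows"][1:]),
                "x0": table["x0"], "x1": table["x1"],
                "first_page": table["page"], "last_page": table["page"],
            })
    return merged
```

Because the semantic check is omitted, two unrelated tables with the same column count and alignment on consecutive pages would be merged by this sketch; a fuller version would add that third test.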

[0122] Therefore, this invention effectively solves the problem of significant differences in information value and processing difficulty among different regions in complex layout files. By using a multimodal large model to intelligently classify the layout regions of the source file, the document content is divided into core high-value regions, complex layout regions, and low-value redundant regions. Differentiated processing rules are then set accordingly, allowing the system to concentrate limited computing resources and high-precision parsing capabilities on core high-value regions, ensuring accurate extraction of key information. Simultaneously, multi-tool verification and multimodal cross-reasoning are employed for complex layout regions, effectively improving the robustness of parsing complex structural content. Lightweight filtering of low-value redundant regions significantly improves overall processing efficiency. Furthermore, the identification and merging mechanism for cross-page complex layout regions ensures the integrity of cross-page content, avoiding the omission of key information due to pagination. This refined processing strategy not only significantly improves the accuracy and efficiency of keyword extraction from complex layout files but also reduces unnecessary computational overhead, making the entire extraction process more intelligent, efficient, and reliable.

[0123] In some embodiments of the present invention described above, when the converted image and the fused text set are simultaneously input into a multimodal large model for key field extraction, the multimodal large model needs to perform cross-validation, layout logic recognition, and content fusion reasoning on complex visual layout information and text content. However, without a clear and detailed guidance mechanism, the multimodal large model may have difficulty accurately understanding the semantic functions of different layout regions, and may also have difficulty effectively selecting and judging when the recognition results of multi-source text are inconsistent, thereby affecting the accuracy and efficiency of key field extraction.

[0124] In response, in step S500, the present invention further proposes to provide refined guidance for multimodal large models through system prompts, so as to improve the accuracy and robustness of key field extraction.

[0125] Specifically, the layout rules built into the system prompts include layout logic that maps the area below the logo in the image to the company name, the area immediately below the title to the core content, the area below a table header to the data fields, and the signature area to the date and signature information, and that excludes non-core content in the header and footer areas. These rules are used to guide the multimodal large model to establish the corresponding relationship between layout elements and key fields in the source file.

[0126] The layout rules are as follows:

[0127] The area below the logo in the image corresponds to the company name. This rule clearly states that in document images, the text area usually located directly below the company logo is very likely to contain the company name information. The system prompts provide this empirical rule to the multimodal large model, guiding the model to prioritize recognizing the text content in the area below the logo as the company name when it recognizes the logo.

[0128] The area immediately below the title is the core content. This rule tells the multimodal large model that, once the title has been identified in the document, the area immediately following it usually carries the core theme or main content of the document. The system prompts pass this rule to the model, so that when processing such areas, it gives them higher semantic importance and prioritizes the extraction of key fields related to the main idea of the document.

[0129] The table header corresponds to the data fields below. This rule is used to guide multimodal large models to understand the table structure. It explicitly states that the header row or column of a table usually defines the semantic category of the data fields below or to the right of it. By embedding this rule into the system prompt words, the model can establish a logical relationship between the header and the corresponding data, thereby accurately identifying and extracting the structured key data in the table.

[0130] The signature area corresponds to the date and signature information. This rule applies to the end of the document, especially the signature area. It tells the multimodal model that this area usually contains the document's signing date, the signer's name, or signature information. When processing the signature area, the model will prioritize identifying and extracting these specific types of key fields according to this rule.

[0131] The layout logic that excludes non-core content in the header and footer areas is used to optimize the efficiency and accuracy of key field extraction. It guides multimodal large models to identify header and footer areas and understand that these areas usually contain auxiliary information such as page numbers, document names, and company websites, rather than the core key fields of the document. Therefore, when processing these areas, the model will reduce the priority of their content as key fields, or even exclude them from the scope of core key field extraction, in order to reduce interference from redundant information.

[0132] These layout rules work together to guide multimodal large models in establishing the corresponding relationship between layout elements and key fields in the source file, thereby achieving intelligent and accurate extraction of key fields.

[0133] In addition, the built-in text priority rules of the system prompts include prioritizing text content that conforms to layout rules, has high recognition clarity, and has no obvious semantic deviation as the basis for reasoning. For text fragments with inconsistent recognition among multiple text contents, cross-comparison and verification are performed by combining layout logic and contextual semantics to eliminate incorrectly recognized text and retain the correct content with the highest matching degree.

[0134] This text prioritization rule provides three main criteria for text selection in multimodal large models: First, the text content should conform to known layout rules; second, the text's recognition clarity, such as OCR confidence, should reach a high level; and finally, the text content should not contain obvious semantic errors or be inconsistent with the context. The model comprehensively evaluates these factors and prioritizes text segments that meet these conditions for subsequent inference.

[0135] For text fragments that are recognized inconsistently across the multiple text contents, the model performs cross-checking based on layout logic and contextual semantics, eliminating incorrectly recognized text and retaining the correct content with the highest matching degree. When multiple parsing tools or OCR results disagree on the same text fragment, this rule guides the multimodal large model to perform in-depth verification. The model uses its understanding of the document's layout logic and the contextual semantics of the surrounding text to cross-check the inconsistent fragments. Through this multi-dimensional verification, the model can identify and eliminate erroneous recognition results, ultimately selecting the text content that best matches the layout and semantics and has the highest confidence level.
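The three selection criteria of the text priority rule can be sketched as a simple ranking over candidate readings of the same fragment. In the embodiment this judgment is made by the multimodal large model; the `Candidate` fields, the `pick_candidate` function, the ordering of the criteria, and the 0.8 confidence floor are assumptions used only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    source: str                    # which parsing tool produced the reading
    conforms_to_layout: bool       # criterion 1: matches the layout rules
    confidence: float              # criterion 2: recognition clarity, 0..1
    semantically_consistent: bool  # criterion 3: no obvious semantic deviation


def pick_candidate(candidates, min_confidence=0.8):
    """Return the reading that best satisfies the three criteria."""
    def rank(c):
        return (
            c.conforms_to_layout,
            c.semantically_consistent,
            c.confidence >= min_confidence,
            c.confidence,  # tie-breaker among otherwise equal readings
        )
    return max(candidates, key=rank)


candidates = [
    Candidate("Total bid price: l,2OO,OOO", "ocr", True, 0.61, False),
    Candidate("Total bid price: 1,200,000", "pdf_parser", True, 0.97, True),
]
best = pick_candidate(candidates)
```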

[0136] In one embodiment, the targeted key field extraction rule includes 7 mandatory key fields, which explicitly require the model to extract the project name, tenderer, bidder, total bid price, bid validity period, project duration, and qualification level, while also marking the layout anchor point information corresponding to each key field.

[0137] Therefore, when the converted images and the fused text set are input into the multimodal large model for key field extraction, the model no longer infers blindly. Instead, it is guided by the preset layout rules to understand the semantic functions of different regions, such as company names, core content, table data, dates, and signatures, thereby establishing a clear association between visual layout elements and key fields. At the same time, the built-in text priority rules let the model filter and verify multi-source text content, prioritizing high-quality, high-confidence text that conforms to the layout logic as the basis for inference, which addresses the problem of inconsistent multi-source text recognition. Furthermore, the targeted key field extraction rules restrict the multimodal large model to extracting only the core key fields required by the current business scenario, rather than extracting all text content indiscriminately. This guidance and verification mechanism significantly improves the accuracy and robustness of the multimodal large model in identifying key fields in complex layout files, reduces false positives and omissions caused by layout misunderstandings or text recognition errors, makes the final output key field content more accurate and reliable, and adapts better to varied and changing document layouts.

[0138] In one embodiment, in step S600, the key field content output by the multimodal large model undergoes preliminary deduplication and normalization processing, specifically as follows:

[0139] A deduplication algorithm based on string matching and semantic similarity calculation is adopted to remove completely identical keywords, semantically identical synonyms, and redundant fields without actual business significance. At the same time, different expressions of the same key field are normalized, and the deduplicated and normalized key field content is then passed into the subsequent formatting process.

[0140] This embodiment employs a deduplication mechanism based on string matching and semantic similarity calculation to identify and eliminate duplicate or semantically equivalent information in the key fields output by the multimodal large model. String matching directly compares text content and is suitable for identifying identical keywords, for example through hash value comparison or exact string lookup. Semantic similarity calculation analyzes the meaning of the text to determine whether different expressions refer to the same concept. For example, pre-trained embedding models (such as Word2Vec or BERT) can be used to convert keywords into high-dimensional vectors, and the cosine similarity between these vectors can then be calculated. If the similarity exceeds a preset threshold, the keywords are considered semantically equivalent. Furthermore, domain knowledge graphs or ontologies can be combined to perform more precise semantic matching of terms in specific domains.
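The two-stage deduplication can be sketched as follows. To keep the example self-contained, character-bigram counts stand in for the pre-trained embedding vectors mentioned above; this substitution, the function names, and the 0.8 threshold are assumptions of the sketch and not part of the described embodiment. The cosine similarity step itself is the one described in the text.

```python
import math
from collections import Counter


def bigram_vector(text):
    """Stand-in for an embedding: counts of character bigrams."""
    s = text.lower().replace(" ", "")
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(values, threshold=0.8):
    kept, seen = [], set()
    for value in values:
        key = value.strip().lower()
        if key in seen:                      # stage 1: exact string match
            continue
        vec = bigram_vector(value)
        if any(cosine(vec, bigram_vector(k)) >= threshold for k in kept):
            continue                         # stage 2: similarity above threshold
        seen.add(key)
        kept.append(value)
    return kept


unique = deduplicate([
    "Acme Construction Co., Ltd.",
    "acme construction co., ltd.",
    "Acme Construction Co. Ltd",
    "Northwind Engineering Group",
])
```

In a deployment, `bigram_vector` would be replaced by a call to whichever embedding model is chosen; the rest of the flow would stay the same.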

[0141] Deduplication algorithms using string matching and semantic similarity calculations remove identical keywords, ensuring that only one copy of each unique piece of information is retained, thus avoiding data redundancy. Semantically identical synonyms are also removed, such as "Limited Liability Company" and "Limited Company," identified through semantic similarity calculations or a pre-defined thesaurus, retaining only the standard expression. Redundant fields without practical business significance are removed; this refers to removing content that, although extracted, has no value to the final structured data, such as page numbers, date separators, or auxiliary text defined as non-critical information according to business rules. This can be achieved by maintaining a stop word list or using regular expression-based filtering rules.

[0142] Different representations of the same key field are normalized to unify various expressions of the same key field into a standard format. For example, a date field may appear in multiple forms such as "January 1, 2023", "2023 / 01 / 01", and "Jan. 1, 2023". Normalization will unify them into the standard format "YYYY-MM-DD". For fields such as amount, phone number, and address, corresponding normalization rules can also be preset. In one embodiment, this can be achieved through a rule engine, regular expression replacement, or by utilizing the text conversion capabilities of a large language model, ensuring the consistency of data format in subsequent processing and storage.
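A minimal sketch of rule-based date normalization is given below, covering the three example forms from the paragraph above. The list of accepted input formats and the function name `normalize_date` are assumptions; month names are parsed with the default (English) locale.

```python
from datetime import datetime

# Input formats tried in order; extend the list for other observed variants.
DATE_FORMATS = [
    "%B %d, %Y",     # January 1, 2023
    "%b. %d, %Y",    # Jan. 1, 2023
    "%Y / %m / %d",  # 2023 / 01 / 01
    "%Y/%m/%d",      # 2023/01/01
    "%Y-%m-%d",      # already normalized
]


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

Returning `None` for unrecognized input leaves the decision to a later step, for example the text conversion capability of a large language model mentioned above.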

[0143] The deduplicated and normalized key field content is then fed into the subsequent formatting process. This step is a crucial link in the data processing flow. After the initial deduplication and normalization, the data quality and consistency of the key field content are significantly improved, providing high-quality input for the subsequent standardized formatting of the large language model. This ensures that subsequent processing steps can run efficiently on a clean and standardized dataset, avoiding additional processing burdens or error propagation caused by data quality issues.

[0144] Therefore, after the multimodal large model outputs the initial key field content, this embodiment of the invention can effectively perform preliminary cleaning and standardization of these fields. Specifically, by employing a deduplication algorithm based on string matching and semantic similarity calculation, it can accurately identify and eliminate completely identical keywords and semantically equivalent synonyms, greatly reducing data redundancy. Simultaneously, by filtering redundant fields without practical business significance, the purity of the data is further improved. Furthermore, different expressions of the same key field are normalized to ensure data format consistency, providing a unified input standard for subsequent standardized formatting. This preprocessing mechanism significantly improves the data quality and consistency of key fields, avoiding the complexity and error rate of subsequent processing caused by redundant or non-standard data. This allows the large language model to complete the task more efficiently and accurately during standardized formatting, providing a more reliable foundation for subsequent three-dimensional cross-validation, ultimately ensuring the accuracy and usability of the generated standardized structured text.

[0145] When extracting keywords from complex layout files, although key field content can be obtained through multimodal large models and preliminary processing, the lack of a comprehensive and refined verification mechanism may still lead to omissions, inaccuracies, or non-compliance with business logic in the extraction results. This not only affects the efficiency and quality of subsequent data processing but may also require significant manual costs for verification and correction, making it difficult to ensure the reliability of the final output standardized structured text.

[0146] In one embodiment, the three-dimensional cross-validation in step S600 includes integrity validation, accuracy validation, and reasonableness validation.

[0147] The three-dimensional cross-validation constitutes a comprehensive, multi-layered verification system that checks the quality of the extracted key fields from different perspectives. Integrity verification focuses on whether fields are missing, accuracy verification focuses on whether field content is consistent with the original text, and rationality verification focuses on whether fields conform to business logic and specifications. This multi-dimensional design compensates for the shortcomings of a single verification method and improves the overall reliability of the extraction results.

[0148] The integrity verification specifically involves: pre-setting a list of required key fields for the corresponding file type; comparing the key fields output by the multimodal large model with the pre-set list of required key fields one by one to determine whether there are any missing required key fields; if there are any omissions, generating a directional feedback instruction containing the name of the missing field and the corresponding layout area prompt, feeding it back to the layout analysis stage, and readjusting the system prompt words to guide the multimodal large model to perform layout analysis and content reasoning again until all required key fields are obtained;

[0149] The accuracy verification specifically involves: combining the full-text context of the source file and the anchor information of the layout position, determining the matching degree between the extracted key fields and the original text content of the corresponding area of the source file, including verifying whether the extracted company name is consistent with the full name that appears multiple times in the source file, whether the extracted date field matches the date of the signature in the source file, and whether the extracted numerical field corresponds to the actual data in the table. If the matching degree is lower than a preset threshold, a targeted feedback instruction containing the error field and the corresponding original text position is generated and fed back to the layout analysis stage for reprocessing.

[0150] The rationality verification specifically involves: based on preset industry rules, determining whether the format, semantics, and numerical range of key fields conform to the rules, including verifying whether the date format is standardized, whether the number of digits in the phone number meets the requirements, whether the company name conforms to the industrial and commercial naming standards, and whether the numerical fields conform to the logical range of the corresponding business. If there are key fields that do not conform to the rules, a targeted feedback instruction for rationality verification anomalies is generated and fed back to the layout analysis stage for re-verification and screening.

[0151] Specifically, integrity verification ensures that all pre-defined, essential key fields for a specific document type have been successfully extracted. This is typically achieved by maintaining a "list of required key fields" for different document types (e.g., contracts, invoices, reports). During verification, the system compares the key fields output by the multimodal large model against this list. If any required field is missing, a targeted feedback instruction is generated containing the name of the missing field and its possible layout area in the source file. This instruction is then fed back to the layout analysis stage to readjust the system prompts, guiding the multimodal large model to focus its analysis and content reasoning on the area containing the missing field until all required fields have been successfully extracted. For example, for a contract document, required fields might include "Contract Number," "Party A's Name," "Party B's Name," and "Signing Date."
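The integrity verification can be sketched as a comparison against a required-field list, using the contract example above. The layout area hints, the dictionary layout of the feedback instruction, and the function name `check_integrity` are hypothetical; the embodiment only states that the instruction contains the missing field name and a layout area prompt.

```python
# Required key fields per file type, each with a hypothetical layout hint.
REQUIRED_FIELDS = {
    "contract": {
        "Contract Number": "title block on the first page",
        "Party A's Name": "opening clause below the title",
        "Party B's Name": "opening clause below the title",
        "Signing Date": "signature area at the end of the document",
    }
}


def check_integrity(file_type, extracted):
    """Return one feedback instruction per missing required field."""
    feedback = []
    for name, area_hint in REQUIRED_FIELDS[file_type].items():
        value = extracted.get(name) or ""
        if not value.strip():
            feedback.append({"missing_field": name,
                             "layout_area_hint": area_hint})
    return feedback


extracted = {
    "Contract Number": "HT-2023-001",
    "Party A's Name": "Acme Construction Co., Ltd.",
    "Party B's Name": "Northwind Engineering Group",
}
missing = check_integrity("contract", extracted)
```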

[0152] The core of the accuracy verification lies in verifying whether the content of each extracted key field matches its original text in the source file. This is usually achieved by combining the layout position anchor information of the key field with the full-text context of the source file. For example, for an extracted company name, the system checks whether it is consistent with the full company name that appears multiple times in the source file; for a date field, it checks whether it matches the date at the signing place of the source file; for a numerical field, it verifies whether it is consistent with the corresponding actual data in the table. If the matching degree between the extracted content and the original text is lower than the preset threshold (for example, as calculated by a string similarity algorithm), the system generates a targeted feedback instruction containing the erroneous field and its corresponding position in the original text and feeds it back to the layout analysis stage, prompting the multimodal large model to reprocess this area and correct the recognition error.
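One possible form of the matching-degree check is sketched below with the standard-library `difflib.SequenceMatcher` as the string similarity algorithm. The choice of algorithm, the 0.9 threshold, the anchor layout `(page, x0, y0, x1, y1)`, and the function name `check_accuracy` are assumptions; the embodiment does not prescribe them.

```python
from difflib import SequenceMatcher


def check_accuracy(field, extracted, source_text, anchor, threshold=0.9):
    """Compare an extracted value with the text found at its layout anchor.

    Returns None when the match is good enough, otherwise a feedback
    instruction naming the field and the position to reprocess.
    """
    score = SequenceMatcher(None, extracted, source_text).ratio()
    if score >= threshold:
        return None
    return {
        "field": field,
        "error_type": "accuracy",
        "anchor": anchor,
        "matching_degree": round(score, 3),
    }


anchor = (3, 120, 340, 480, 362)  # hypothetical page number and bounding box
ok = check_accuracy("Party A's Name", "Acme Construction Co., Ltd.",
                    "Acme Construction Co., Ltd.", anchor)
bad = check_accuracy("Total bid price", "7,200,000", "1,200,000", anchor)
```

A single misread digit in a short numerical field still yields a fairly high ratio, so in practice numerical fields may call for an exact comparison instead of a similarity threshold.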

[0153] The rationality verification is used to evaluate whether the extracted key fields comply with preset general industry rules and business logic, judging the format, semantics, and numerical range of each field. For example, it verifies whether the format of a date field complies with specifications such as "YYYY-MM-DD" or "YYYY year MM month DD day"; whether the number of digits of a phone number meets the national or regional standard; whether a company name complies with the naming norms for industrial and commercial registration; and whether a numerical field (such as an amount or quantity) falls within the reasonable logical range of the corresponding business. If any key field does not comply with these preset rules, the system generates a targeted feedback instruction indicating a rationality verification anomaly and feeds it back to the layout analysis stage, guiding the multimodal large model to re-verify and screen the abnormal area so that the final output data is not only accurate but also meets actual business requirements.
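The rationality rules can be sketched as a table of per-field predicates. The field names, the 11-digit phone length, and the amount range are assumptions chosen for illustration; actual values would come from the preset industry rules.

```python
import re


def valid_date(value):
    """Accept only the normalized YYYY-MM-DD form."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None


def valid_phone(value, digits=11):
    """Count digits only, ignoring separators such as '-' or spaces."""
    return len(re.sub(r"\D", "", value)) == digits


def valid_amount(value, low=0.0, high=1e10):
    try:
        amount = float(value.replace(",", ""))
    except ValueError:
        return False
    return low < amount <= high


RULES = {
    "Signing Date": valid_date,
    "Contact Phone": valid_phone,
    "Total Bid Price": valid_amount,
}


def check_rationality(fields):
    """Return a feedback instruction for every field that breaks its rule."""
    return [
        {"field": name, "error_type": "rationality", "value": value}
        for name, value in fields.items()
        if name in RULES and not RULES[name](value)
    ]


anomalies = check_rationality({
    "Signing Date": "2023/01/01",
    "Contact Phone": "138-0013-8000",
    "Total Bid Price": "1,200,000.00",
})
```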

[0154] By introducing a three-dimensional cross-validation mechanism encompassing integrity, accuracy, and rationality checks, this invention comprehensively and systematically evaluates the key field content output by the multimodal large model. Integrity check ensures that all business-essential key fields are successfully extracted, preventing data omissions. Accuracy check significantly improves the authenticity and reliability of the extracted content through precise comparison with the original source text, effectively avoiding identification errors. Rationality check, based on industry rules and business logic, further filters fields with non-standard formats, inappropriate semantics, or abnormal values, ensuring data availability and compliance. More importantly, when a check fails, the system generates a directional feedback instruction carrying the coordinates of the abnormal location, error type, and correction reference direction, and sends it back to the layout analysis stage, guiding the multimodal large model to perform targeted re-inference. This closed-loop feedback correction mechanism enables the system to self-correct and optimize until all key fields meet high standards, significantly improving the automation, accuracy, and robustness of keyword extraction from complex layout files. Ultimately, it generates high-quality, standardized structured text that meets business needs, greatly reducing manual intervention and subsequent correction costs.
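The closed-loop correction can be sketched as a loop that validates, re-infers only the flagged fields, and validates again. The callables `validate` and `reinfer` are placeholders for the three-dimensional cross-validation and for the targeted re-inference by the multimodal large model; the `max_rounds` cap is an assumption added so that the sketch always terminates.

```python
def correct_until_valid(fields, validate, reinfer, max_rounds=3):
    """Re-infer only the fields named in the feedback until validation passes.

    validate(fields) -> list of feedback instructions, each with a "field" key
    reinfer(instruction) -> corrected value for that field
    Returns the (possibly corrected) fields and whether validation passed.
    """
    fields = dict(fields)  # leave the caller's data untouched
    for _ in range(max_rounds):
        feedback = validate(fields)
        if not feedback:
            return fields, True
        for instruction in feedback:
            fields[instruction["field"]] = reinfer(instruction)
    return fields, not validate(fields)


# Toy stand-ins: an empty value fails validation, re-inference fills it in.
def toy_validate(fields):
    return [{"field": k, "error_type": "integrity"}
            for k, v in fields.items() if not v]


def toy_reinfer(instruction):
    return "2023-01-01"


result, passed = correct_until_valid(
    {"Signing Date": "", "Contract Number": "HT-2023-001"},
    toy_validate, toy_reinfer,
)
```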

[0155] In some of the embodiments of the present invention described above, although multimodal large models and three-dimensional cross-validation can effectively extract key fields from complex layout files and perform standardization processing, in actual applications, in the face of constantly changing document types, layout styles and business rules, the preset validation rules, prompt word parameters and classification standards may be difficult to maintain optimal performance. This may cause the system to lose accuracy and robustness when processing new types or complex variant files, requiring a lot of manpower for maintenance and adjustment.

[0156] In one embodiment, after generating standardized structured text that matches the preset field template, i.e., after step S600, a self-learning iterative process for validation rules is also included, specifically:

[0157] Build a local sample iteration library and store the key field results, corresponding layout features, anomaly correction records, and file type tags of each successful validation in the sample library;

[0158] Based on the data in the sample library, the layout rules and text priority rules, the thresholds and judgment criteria of the three-dimensional cross-validation, and the classification rules of the layout area hierarchy in the system prompt words are periodically and automatically iterated and optimized.

[0159] The self-learning iterative processing of validation rules refers to the system's ability to automatically adjust and optimize its internal validation logic, parameters, and strategies based on data generated during actual operation, thereby improving the accuracy and adaptability of processing complex layout files. Specifically, this processing mechanism identifies the shortcomings of the current rules by collecting various data generated by the system during file processing, such as successfully extracted key fields, validation failure exceptions, and corresponding correction records. Based on this, it generates new rules or adjusts the parameters of existing rules, thus achieving system self-improvement.

[0160] Building a local sample iteration library refers to establishing a database or dataset to store historical data and feedback information from the system's processing, providing a data foundation for subsequent self-learning iterations. This sample library can be a structured database that records the key field results of each successful validation, its corresponding layout features (e.g., text position on the page, font size, color, and other visual attributes), anomaly correction records identified by the system (e.g., error type, correction suggestions), and source file type tags (e.g., contracts, invoices, reports, etc.). This data represents the accumulated "experience" for the system's self-learning and optimization.

[0161] The key field results, corresponding layout features, anomaly correction records, and file type tags for each successful validation are stored in a sample library to ensure that the sample library contains sufficiently rich and diverse data so that the system can learn from both successful and failed cases. When a key field is confirmed to be correct after three-dimensional cross-validation, its extraction result, spatial location in the source file, surrounding visual layout information, and the file type, among other metadata, are packaged and stored in the sample library. If a field initially fails validation but is successfully corrected after targeted re-inference, both the anomaly record before correction and the correct result after correction are recorded, forming a complete learning loop.
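One way to picture an entry of the sample library is the record sketched below, serialized as a JSON line. The `SampleRecord` type, its attribute names, and the JSON-lines storage are assumptions; the embodiment only lists which kinds of information are stored.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SampleRecord:
    file_type: str                       # e.g. "contract", "invoice"
    field_name: str
    value: str                           # result that passed validation
    anchor: tuple                        # (page, x0, y0, x1, y1)
    layout_features: dict = field(default_factory=dict)
    correction: Optional[dict] = None    # filled only if a fix was needed


def to_json_line(record):
    """Serialize one record; tuples become JSON arrays."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = SampleRecord(
    file_type="contract",
    field_name="Signing Date",
    value="2023-01-01",
    anchor=(5, 300, 700, 420, 720),
    layout_features={"region": "signature area", "font_size": 12},
    correction={"error_type": "rationality", "before": "2023/01/01"},
)
line = to_json_line(record)
```

Keeping the value before correction next to the corrected value preserves both halves of the learning loop described above.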

[0162] Based on sample database data, the core mechanism of self-learning is to periodically and automatically iterate and optimize the layout and text priority rules, the thresholds and judgment criteria of the three-dimensional cross-validation, and the classification rules of the layout region hierarchy in the system prompt words. By analyzing sample database data, the system can identify and correct its own shortcomings. The system can initiate an iterative optimization process periodically (e.g., daily, weekly, or after reaching a certain amount of data). This process analyzes the data accumulated in the sample database, for example, identifying which layout rules perform poorly under specific file types, which text priority rules cause misjudgments, or which thresholds of the three-dimensional cross-validation are too strict or too lenient. Through statistical analysis, pattern recognition, or reinforcement learning algorithms, the system can automatically generate new system prompt word content (including more refined layout and text priority rules), adjust various thresholds (such as matching degree threshold and confidence degree threshold) and judgment criteria of the three-dimensional cross-validation, and even optimize the classification logic of the layout region hierarchy to better adapt to the actual data distribution.
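A very simple instance of threshold iteration is sketched below. The heuristic is an assumption and not taken from the embodiment: if most fields flagged by the accuracy verification came back unchanged after re-inference, the threshold is treated as too strict and lowered; if nearly all of them changed, it is raised. The step size, the rate limits, and the bounds are likewise hypothetical.

```python
def adjust_threshold(threshold, flagged_outcomes,
                     step=0.01, low=0.2, high=0.8, bounds=(0.5, 0.99)):
    """Nudge a validation threshold using accumulated correction records.

    flagged_outcomes: list of bools, one per flagged field, True when
    re-inference actually changed the value (the flag was justified).
    """
    if not flagged_outcomes:
        return threshold
    changed_rate = sum(flagged_outcomes) / len(flagged_outcomes)
    if changed_rate < low:       # mostly false alarms: threshold too strict
        threshold -= step
    elif changed_rate > high:    # flags almost always justified: tighten
        threshold += step
    return round(min(max(threshold, bounds[0]), bounds[1]), 4)


looser = adjust_threshold(0.90, [False] * 9 + [True])
stricter = adjust_threshold(0.90, [True] * 10)
unchanged = adjust_threshold(0.90, [True, False])
```

Statistical analysis or reinforcement learning, as mentioned above, could replace this rule while keeping the same inputs.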

[0163] The entire iterative process requires no manual intervention and eliminates the need for pre-training or fine-tuning of large multimodal or language models. High-precision extraction is achieved solely through rule-guided iterative optimization, emphasizing the automation and efficiency of this self-learning mechanism and reducing maintenance costs and technical barriers. The optimization process is entirely automated, requiring no manual review or parameter adjustments. Furthermore, this optimization targets system prompts and validation rules, rather than directly modifying or retraining the underlying large multimodal or language models. This avoids the significant computational resources and time required for model training, enabling the system to respond quickly to changes and maintain its versatility. Moreover, this self-learning mechanism can handle various common document formats and cope with complex layout elements (such as mixed text, images, tables, signatures, watermarks, etc.) within these formats. Whether it's a single-page or multi-page document, it can process and learn from it in batches. This ensures the broad applicability and practical value of the self-learning iterative processing.

[0164] In summary, compared with the prior art, the present invention has the following outstanding advantages:

[0165] This invention ensures the integrity and accuracy of text extraction from the source, resolving the pain points of conflict in multi-tool parsing: It employs a parallel parsing approach combining multiple tools, matching dedicated parsing tools to different file formats, fully leveraging the technical advantages of various tools and compensating for the inherent limitations of single tools. Simultaneously, it achieves a one-to-one mapping between text content and the layout of the source file through spatial coordinate information. Combined with multi-source text conflict resolution based on spatial coordinates, and through dual-dimensional scoring and hierarchical resolution rules, it addresses the industry pain point of inconsistent parsing results from multiple tools, generating a high-quality fused text set. This fundamentally avoids keyword extraction failures caused by errors or omissions in source text extraction.

[0166] This invention achieves deep fusion reasoning of images and text, significantly improving the extraction accuracy of complex layout scenarios: By precisely aligning the spatial positions of source file images and fused text sets, visual layout information is deeply bound to text content. Combined with built-in layout rule guidance and system prompts based on text priority rules, it guides a multimodal large model to accurately identify the internal logic of complex layouts and establish corresponding relationships between layout elements and key fields. Simultaneously, through intelligent hierarchical classification of layout regions, differentiated processing rules are applied to layout regions of different values and complexities, specifically addressing issues such as keyword extraction omissions, correspondence errors, and semantic deviations in extremely complex scenarios such as cross-page nested tables, mixed image and text layouts, and overlapping watermark text.

[0167] This invention constructs a fully closed-loop directional error correction mechanism that balances extraction accuracy and processing efficiency: It achieves comprehensive quality control of key fields from position matching, semantic matching to compliance matching through three-dimensional cross-validation based on layout anchor point information, contextual semantic association, and preset industry rules. Simultaneously, for fields that fail validation, it generates directional feedback instructions carrying abnormal position coordinates, error type, and correction reference direction. This guides the multimodal large model to perform directional re-inference only on the abnormal areas, eliminating the need to reprocess the entire file. While ensuring extraction accuracy, this effectively shortens the time required for single-file error correction processing, significantly improving processing efficiency.

[0168] Universally compatible with all formats, achieving high-precision extraction without model fine-tuning: This invention supports all types of file formats, including Word, Excel, TXT, PDF, and images, and is compatible with various complex layouts of text mixed with images, tables, signatures, and watermarks. It also adapts to batch processing of single-page and multi-page files. Throughout the process, no pre-training or fine-tuning of multimodal large models or large language models is required for specific file types or business scenarios. High-precision keyword extraction can be achieved across all scenarios simply through rule guidance and iterative optimization. This breaks the technical prejudice in the field that "model fine-tuning is necessary to improve extraction accuracy," making it highly versatile and seamlessly integrated with various file processing business scenarios.

[0169] Full-process automation significantly reduces business costs: This invention achieves full-process automation from file parsing, layout analysis, keyword extraction to verification output, eliminating the need for manual intervention in template configuration, rule setting, model fine-tuning, etc., effectively reducing the omission rate of key fields, greatly reducing the labor costs of enterprise document processing, while avoiding negligence and errors caused by manual operation, and ensuring the stability and reliability of keyword extraction results.

[0170] As shown in Figure 2, based on the same inventive concept, and corresponding to the methods of any of the above embodiments, this embodiment of the invention also discloses a complex layout file keyword extraction device, comprising:

[0171] The source file acquisition module is used to acquire the source files of complex layouts;

[0172] The source file parsing module is used to parse the content of the source file using a combination of multiple tools. For source files of different formats, corresponding dedicated parsing tools are selected. At least two parsing tools are used to parse the same source file in parallel to obtain multiple text contents that are redundantly verified. Each text content is extracted synchronously and carries the spatial coordinate information of the corresponding content in the source file, so as to realize a one-to-one mapping between the text content and the layout position of the source file.

[0173] The image conversion module is used to convert the source file page by page into an image format that retains the original layout information;

[0174] The conflict resolution processing module is used to perform multi-source text conflict resolution processing based on spatial coordinates on multiple text contents carrying spatial coordinate information, using the image converted from the source file as the reference coordinate system, and generate a fused text set after dual-dimensional verification of coordinate matching degree and recognition confidence.

[0175] The key field content output module is used to accurately align the converted image with the fused text set in spatial position, and then synchronously input the aligned image and the fused text set into the multimodal large model. Through built-in layout rules, text priority rules, and system prompts for targeted key field extraction rules that match the source file type, the multimodal large model is guided to complete intelligent hierarchical classification of layout areas, cross-validation of visual layout information and text content, layout logic recognition, and content fusion reasoning, and output unified key field content that carries the corresponding layout position anchor information.

[0176] The standardized structured text output module is used to perform preliminary deduplication and normalization processing on the key field content output by the multimodal large model. Then, the processed key field content is standardized and formatted using a large language model. A three-dimensional cross-validation is performed based on the layout position anchor information, contextual semantic association, and preset industry rules of the key fields. For key fields that fail the validation, a directional feedback instruction carrying the abnormal position coordinates, error type, and correction reference direction is generated and sent back to the layout analysis stage to guide the multimodal large model to perform directional re-inference for abnormal areas until all key fields pass the three-dimensional cross-validation. Finally, standardized structured text matching the preset field template is generated.

[0177] The apparatus described above is used to implement the corresponding complex layout file keyword extraction method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0178] As shown in Figure 3, based on the same inventive concept and corresponding to any of the above embodiments, this embodiment of the invention also discloses an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the above-described method for extracting keywords from complex layout files.

[0179] Specifically, the device includes: a processor 1010, a memory 1020, an input / output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input / output interface 1030, and communication interface 1040 are interconnected within the device via the bus 1050.

[0180] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.

[0181] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

[0182] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device or connected externally to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.

[0183] The communication interface 1040 is used to connect the communication module to enable communication and interaction between this device and other devices. The communication module can communicate via wired means (such as USB (Universal Serial Bus), network cable, etc.) or wireless means (such as mobile network, WIFI (Wireless Fidelity), Bluetooth, etc.).

[0184] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.

[0185] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.

[0186] The electronic device described above is used to implement the corresponding complex layout file keyword extraction method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0187] Based on the same inventive concept, corresponding to any of the above embodiments, this invention also discloses a non-transitory computer-readable storage medium that stores computer instructions for causing a computer to execute the above-described method for extracting keywords from complex layout files.

[0188] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0189] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to execute the complex layout file keyword extraction method as described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0190] The above description is merely a preferred embodiment of the present invention and the technical principles employed. The present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions that can be made by those skilled in the art will not depart from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the claims.

Claims

1. A method for extracting keywords from complex layout files, characterized in that it comprises: obtaining a source file with a complex layout; parsing the content of the source file using a combination of multiple tools, wherein a corresponding dedicated parsing tool is selected for each source file format, at least two parsing tools parse the same source file in parallel to obtain multiple redundantly verified text contents, and each text content is extracted together with the spatial coordinate information of the corresponding content in the source file, so as to realize a one-to-one mapping between the text content and the layout position in the source file; converting the source file page by page into an image format that retains the original layout information; using the image converted from the source file as the reference coordinate system, performing multi-source text conflict resolution based on spatial coordinates on the multiple text contents carrying spatial coordinate information, and generating a fused text set after dual-dimensional verification of coordinate matching degree and recognition confidence; precisely aligning the converted image with the fused text set in spatial position, and then synchronously inputting the aligned image and the fused text set into a multimodal large model, wherein system prompts containing built-in layout rules, text priority rules, and directional key field extraction rules matching the source file type guide the multimodal large model to complete intelligent hierarchical classification of layout areas, cross-validation of visual layout information and text content, layout logic recognition, and content fusion reasoning, and to output unified key field content carrying the corresponding layout position anchor information; performing preliminary deduplication and normalization on the key field content output by the multimodal large model; standardizing and formatting the processed key field content using a large language model; performing three-dimensional cross-validation based on the layout position anchor information, contextual semantic association, and preset industry rules of the key fields; for key fields that fail the validation, generating a directional feedback instruction carrying the abnormal position coordinates, error type, and correction reference direction, and sending it back to the layout analysis stage to guide the multimodal large model to perform directional re-inference on the abnormal areas until all key fields pass the three-dimensional cross-validation; and finally generating standardized structured text matching the preset field template.

2. The method for extracting keywords from complex layout files according to claim 1, characterized in that, The source file format includes one or more of Word, Excel, TXT, PDF, and images, and the source file supports complex layout content that mixes text with images, tables, signatures, and watermarks. The method of parsing the source file using a combination of tools includes: For source files of different formats, corresponding dedicated parsing tools are selected. Word files are parsed using the docx library, Excel files are parsed using the openpyxl library, image files are parsed using the pytesseract tool for OCR recognition, and TXT files are parsed directly using a text reading tool. For PDF source files, the pdfplumber library is first used to extract the text content, and each page of the PDF file is converted into an image. The text content in the images is then extracted using an OCR recognition tool, resulting in two redundant and mutually verifying text contents of the PDF source file. The pdfplumber library preserves the text layout, paragraph structure, and table row and column relationships in the PDF file during the text extraction process. Before extracting the text content, the OCR recognition tool performs tilt correction, contrast enhancement, and noise reduction preprocessing on the converted images. Finally, the pytesseract tool is used to recognize the text content carrying coordinate information.
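The format-to-tool selection recited in claim 2 can be pictured as a dispatch table. The sketch below is illustrative only: `select_parsers`, the string labels, and the particular file extensions are assumptions; only the library names (docx, openpyxl, pdfplumber, pytesseract) come from the claim. The labels are plain strings so the example runs without those libraries installed.

```python
from pathlib import Path

# Tool labels per format, following claim 2. A real pipeline would map each
# label to a callable wrapping the corresponding library.
PARSERS_BY_EXTENSION = {
    ".docx": ["docx"],
    ".xlsx": ["openpyxl"],
    ".txt": ["text_reader"],
    ".png": ["pytesseract"],
    ".jpg": ["pytesseract"],
    # PDF gets two redundant paths: direct text extraction with pdfplumber,
    # plus OCR on the page images, so the results can verify each other.
    ".pdf": ["pdfplumber", "pytesseract"],
}

def select_parsers(filename: str) -> list[str]:
    """Return the parsing tool labels for a file, chosen by its extension."""
    ext = Path(filename).suffix.lower()
    if ext not in PARSERS_BY_EXTENSION:
        raise ValueError(f"unsupported source file format: {ext!r}")
    return list(PARSERS_BY_EXTENSION[ext])
```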

3. The method for extracting keywords from complex layout files according to claim 1, characterized in that, The multi-source text conflict resolution process based on spatial coordinates includes: Using the image converted from the source file as the reference coordinate system, the spatial coordinates of each text fragment in multiple text contents are mapped and matched with the image pixel coordinates, and a two-dimensional score of coordinate matching degree and text recognition confidence degree is generated for each text fragment. For conflicting segments with overlapping coordinates and inconsistent text content, a hierarchical resolution rule is triggered: for conflicting segments with a coordinate matching difference greater than a preset threshold, the text with the higher matching degree is used; for conflicting segments with a coordinate matching difference less than or equal to the preset threshold, the advantageous result of the corresponding parsing tool is matched based on the layout type of the area where the segment is located. For conflict segments that cannot be resolved by rules, they are marked as high-risk areas. During the layout analysis stage, the multimodal large model is guided to perform key reasoning and verification on the high-risk areas.
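The hierarchical resolution rule of claim 3 can be sketched for a single pair of conflicting fragments. This is a hedged illustration: the `Fragment` type, the `PREFERRED_TOOL` mapping, the layout type names, and the threshold value 0.2 are all assumptions, since the claim leaves the preset threshold and the per-layout tool preferences unspecified.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels
    tool: str          # which parsing tool produced this fragment
    match: float       # coordinate matching degree, 0..1
    confidence: float  # recognition confidence; carried along, unused by the rule below

# Hypothetical mapping from layout type to the tool assumed to do best there.
PREFERRED_TOOL = {"table": "pdfplumber", "scanned": "pytesseract"}

def is_conflict(a: Fragment, b: Fragment) -> bool:
    """Fragments conflict when their boxes overlap and their texts differ."""
    ax0, ay0, ax1, ay1 = a.box
    bx0, by0, bx1, by1 = b.box
    overlap = ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1
    return overlap and a.text != b.text

def resolve(a: Fragment, b: Fragment, layout_type: str, threshold: float = 0.2):
    """Resolve one conflicting pair; returns (chosen fragment or None, high_risk)."""
    # Tier 1: a clear gap in matching degree decides the conflict.
    if abs(a.match - b.match) > threshold:
        return (a if a.match > b.match else b), False
    # Tier 2: otherwise trust the tool preferred for this layout type.
    preferred = PREFERRED_TOOL.get(layout_type)
    for frag in (a, b):
        if frag.tool == preferred:
            return frag, False
    # Tier 3: no rule applies, so flag the area as high risk for the model.
    return None, True
```

A pair for which `resolve` returns `(None, True)` corresponds to the high-risk areas that the claim hands to the multimodal large model for focused reasoning.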

4. The method for extracting keywords from complex layout files according to claim 1, characterized in that, The intelligent hierarchical classification of the layout area includes: The multimodal large model is used to classify the layout regions of the converted image, dividing the source file into core high-value regions, complex layout regions, and low-value redundant regions. Different processing rules are set for different regions: the highest precision parsing and dual-weighted reasoning are used for the core high-value regions to increase the weight coefficient of text priority; dual parsing tools are used for verification and multimodal model cross-reasoning for the complex layout regions, while retaining complete coordinate anchor point information; low-value redundant regions are only lightly filtered and do not enter the core keyword reasoning stage. For complex layout areas spanning multiple pages, the system identifies one or more of the tables, paragraphs, and key fields that are split across pages by leveraging the continuity of table headers between pages, the continuity of coordinate positions, and the semantic coherence of the context, thereby merging cross-page content and completing keyword information.
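The cross-page merging in claim 4 relies on header continuity, coordinate continuity, and semantic coherence. The sketch below covers only the first two signals for tables; semantic coherence is deliberately left out. `TablePart`, `MergedTable`, `merge_cross_page`, and the tolerance `x_tol` are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TablePart:
    page: int
    header: tuple[str, ...]
    x0: float  # left edge of the table on the page image
    x1: float  # right edge of the table on the page image
    rows: list[list[str]] = field(default_factory=list)

@dataclass
class MergedTable:
    header: tuple[str, ...]
    first_page: int
    last_page: int
    x0: float
    x1: float
    rows: list[list[str]]

def merge_cross_page(parts: list[TablePart], x_tol: float = 5.0) -> list[MergedTable]:
    """Join table parts on consecutive pages with equal headers and aligned edges."""
    merged: list[MergedTable] = []
    for part in sorted(parts, key=lambda p: p.page):
        prev = merged[-1] if merged else None
        continues = (
            prev is not None
            and part.page == prev.last_page + 1      # directly following page
            and part.header == prev.header           # header continuity
            and abs(part.x0 - prev.x0) <= x_tol      # coordinate continuity
            and abs(part.x1 - prev.x1) <= x_tol
        )
        if continues:
            prev.rows.extend(part.rows)
            prev.last_page = part.page
        else:
            merged.append(MergedTable(part.header, part.page, part.page,
                                      part.x0, part.x1, list(part.rows)))
    return merged
```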

5. The method for extracting keywords from complex layout files according to claim 1, characterized in that, The layout rules built into the system prompts include the following: the area below the logo in the image corresponds to the company name; the area immediately below the title is the core content; the table header corresponds to the data fields below; the signature area corresponds to the date and signature information; and the header and footer areas exclude non-core content. These layout rules are used to guide the multimodal large model to establish the corresponding relationship between layout elements and key fields in the source file. The system's built-in text priority rules include prioritizing text content that conforms to layout rules, has high recognition clarity, and has no obvious semantic deviation as the basis for reasoning. For text fragments with inconsistent recognition among multiple text contents, cross-comparison and verification are performed by combining layout logic and contextual semantics to eliminate incorrectly recognized text and retain the correct content with the highest matching degree.

6. The method for extracting keywords from complex layout files according to claim 1, characterized in that the preliminary deduplication and normalization process comprises: using a deduplication algorithm based on string matching and semantic similarity calculation to remove completely identical keywords, semantically identical synonyms, and redundant fields without actual business significance; normalizing different expressions of the same key field; and passing the deduplicated and normalized key field content into the subsequent formatting process.
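A minimal sketch of the deduplication and normalization in claim 6 follows. Two substitutions are made and should be read as assumptions: character-level similarity from the standard library's `difflib.SequenceMatcher` stands in for the semantic similarity calculation, and an empty value is used as a crude proxy for a field "without actual business significance". The alias table, the field names, and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical alias table mapping different expressions of a field to one name.
FIELD_ALIASES = {"company": "company_name", "firm name": "company_name", "amt": "amount"}

def normalize_name(name: str) -> str:
    """Lowercase, collapse whitespace, then apply the alias table."""
    key = " ".join(name.lower().split())
    return FIELD_ALIASES.get(key, key.replace(" ", "_"))

def dedupe(fields: list[tuple[str, str]], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Normalize (name, value) pairs and drop empty or near-duplicate entries."""
    kept: list[tuple[str, str]] = []
    for raw_name, raw_value in fields:
        name = normalize_name(raw_name)
        value = " ".join(raw_value.split())
        if not value:
            continue  # proxy for a field with no business significance
        duplicate = any(
            name == kept_name
            and SequenceMatcher(None, value.lower(), kept_value.lower()).ratio() >= threshold
            for kept_name, kept_value in kept
        )
        if not duplicate:
            kept.append((name, value))
    return kept
```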

7. The method for extracting keywords from complex layout files according to claim 1, characterized in that the three-dimensional cross-validation includes integrity verification, accuracy verification, and rationality verification; the integrity verification specifically involves: presetting a list of required key fields for the corresponding file type; comparing the key fields output by the multimodal large model with the preset list of required key fields one by one to determine whether any required key field is missing; and, if any is missing, generating a directional feedback instruction containing the name of the missing field and a prompt for the corresponding layout area, feeding it back to the layout analysis stage, and readjusting the system prompts to guide the multimodal large model to perform layout analysis and content reasoning again until all required key fields are obtained; the accuracy verification specifically involves: combining the full-text context of the source file and the layout position anchor information to determine the matching degree between each extracted key field and the original text content of the corresponding area of the source file; and, if the matching degree is lower than a preset threshold, generating a directional feedback instruction containing the erroneous field and the corresponding original text position, and feeding it back to the layout analysis stage for reprocessing; the rationality verification specifically involves: determining, based on preset industry standard rules, whether the format, semantics, and numerical range of the key fields conform to the standard rules; and, if any key field does not conform to the rules, generating a directional feedback instruction for the rationality verification anomaly, and feeding it back to the layout analysis stage for re-verification and screening.
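The three checks of claim 7 can be sketched as one function that returns a list of feedback items. Everything concrete here is an assumption made for illustration: the `Feedback` type, the "invoice" file type and its required fields, the regular expressions standing in for preset industry rules, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the matching-degree measure.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Feedback:
    field: str
    error_type: str  # "missing", "mismatch", or "rule_violation"
    hint: str        # correction reference for the re-inference step

# Hypothetical presets: required fields per file type and format rules per field.
REQUIRED = {"invoice": ["company_name", "amount", "date"]}
RULES = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\d+(\.\d{1,2})?"),
}

def validate(file_type: str, fields: dict, region_text: dict, threshold: float = 0.8):
    """Run integrity, accuracy, and rationality checks; return Feedback items."""
    feedback = []
    # Integrity: every required field for this file type must be present.
    for name in REQUIRED.get(file_type, []):
        if name not in fields:
            feedback.append(Feedback(name, "missing", "re-scan the expected layout area"))
    for name, value in fields.items():
        # Accuracy: compare the value with the original text of its anchored area.
        original = region_text.get(name, "")
        if value and value in original:
            score = 1.0
        else:
            score = SequenceMatcher(None, value, original).ratio()
        if score < threshold:
            feedback.append(Feedback(name, "mismatch", f"original text: {original!r}"))
        # Rationality: the value must satisfy the preset format rule, if any.
        rule = RULES.get(name)
        if rule is not None and not rule.fullmatch(value):
            feedback.append(Feedback(name, "rule_violation", f"expected pattern {rule.pattern}"))
    return feedback
```

An empty return value means all three checks passed; otherwise each item carries the field, error type, and hint that a directional feedback instruction would need. Position coordinates are omitted from this sketch for brevity.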

8. The method for extracting keywords from complex layout files according to claim 1, characterized in that, after generating the standardized structured text that matches the preset field template, the method further includes self-learning iterative processing of the validation rules, specifically: building a local sample iteration library and storing in it the key field results, corresponding layout features, anomaly correction records, and file type tags of each successful validation; and, based on the data in the sample library, periodically and automatically iterating and optimizing the layout rules and text priority rules in the system prompts, the thresholds and judgment criteria of the three-dimensional cross-validation, and the classification rules of the layout area hierarchy.

9. A device for extracting keywords from complex layout files, characterized in that it comprises: a source file acquisition module, used to acquire a source file with a complex layout; a source file parsing module, used to parse the content of the source file using a combination of multiple tools, wherein a corresponding dedicated parsing tool is selected for each source file format, at least two parsing tools parse the same source file in parallel to obtain multiple redundantly verified text contents, and each text content is extracted together with the spatial coordinate information of the corresponding content in the source file, so as to realize a one-to-one mapping between the text content and the layout position in the source file; an image conversion module, used to convert the source file page by page into an image format that retains the original layout information; a conflict resolution processing module, used to perform multi-source text conflict resolution based on spatial coordinates on the multiple text contents carrying spatial coordinate information, using the image converted from the source file as the reference coordinate system, and to generate a fused text set after dual-dimensional verification of coordinate matching degree and recognition confidence; a key field content output module, used to accurately align the converted image with the fused text set in spatial position, and then synchronously input the aligned image and the fused text set into a multimodal large model, wherein system prompts containing built-in layout rules, text priority rules, and directional key field extraction rules matching the source file type guide the multimodal large model to complete intelligent hierarchical classification of layout areas, cross-validation of visual layout information and text content, layout logic recognition, and content fusion reasoning, and to output unified key field content carrying the corresponding layout position anchor information; and a standardized structured text output module, used to perform preliminary deduplication and normalization on the key field content output by the multimodal large model, standardize and format the processed key field content using a large language model, perform three-dimensional cross-validation based on the layout position anchor information, contextual semantic association, and preset industry rules of the key fields, generate, for key fields that fail the validation, a directional feedback instruction carrying the abnormal position coordinates, error type, and correction reference direction and send it back to the layout analysis stage to guide the multimodal large model to perform directional re-inference on the abnormal areas until all key fields pass the three-dimensional cross-validation, and finally generate standardized structured text matching the preset field template.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that, When the processor executes the computer program, it implements the steps of the method for extracting keywords from a complex layout file as described in any one of claims 1 to 8.