A free layout recognition method, system, intelligent terminal and storage medium

The free layout recognition method, which integrates visual and textual features and employs multi-round confidence assessment, solves the problem of low accuracy in complex layouts and multimodal documents, achieving efficient automated processing and continuous evolution.

CN122244886A · Pending · Publication Date: 2026-06-19 · BEIJING KEZHI NETWORK TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING KEZHI NETWORK TECHNOLOGY CO LTD
Filing Date
2026-03-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing rule-based methods and simple text recognition technologies suffer from low accuracy and poor generalization ability when processing documents with complex layouts and multimodal information, making it difficult to meet the demand for efficient and automated processing of unstructured data.

Method used

A free layout recognition method is adopted, which extracts image and text features through visual encoder and text encoder, performs dynamic weight fusion to generate multimodal joint representation, and combines multi-round confidence assessment and adaptive repair mechanism to generate reliable structured data.

Benefits of technology

It improves the accuracy of document information extraction and the efficiency of automated processing, enhances the robustness and self-correction capabilities of the system, reduces the reliance on manual review, and enables the system to continuously evolve and generalize.



Abstract

This application relates to the technical field of artificial intelligence, and in particular to a method, system, smart terminal, and storage medium for free layout recognition. The method includes acquiring an image; extracting visual feature sequences and text feature sequences from the image; performing dynamic weight fusion to generate a multimodal joint representation of the document; outputting basic recognition results and associated confidence scores; generating initial structured output data; calculating overall confidence scores and individual item confidence scores; if the overall confidence score or the individual item confidence score of any key data item is lower than a first preset threshold, an adaptive repair process is triggered; if the evaluation result is not lower than a second preset threshold, the repaired structured data is output as the final structured data. This application improves the accuracy and generalization ability of documents with complex layouts and multimodal information, thereby increasing the accuracy of document information extraction and the efficiency of automated processing.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a free layout recognition method, system, smart terminal and storage medium.

Background Technology

[0002] In the field of unstructured document processing, rule-based methods and simple text recognition techniques are no longer sufficient to handle complex documents, especially those containing multiple formats such as images, tables, and charts. The development of deep learning, computer vision, and natural language processing technologies has offered new possibilities for solving this problem.

[0003] In related technologies, document information extraction mainly relies on optical character recognition (OCR) technology and template matching methods. However, these technologies are lacking in accuracy and efficiency when faced with unstructured documents with variable layouts and diverse information types.

[0004] The aforementioned technologies suffer from low accuracy, poor generalization ability, and strong dependence on specific document formats when processing documents with complex layouts and multimodal information, making it difficult to meet the needs of various industries for efficient and automated processing of unstructured data.

Summary of the Invention

[0005] To improve the accuracy and generalization ability of document information extraction and to enhance the accuracy and efficiency of automated processing when dealing with complex layouts and multimodal information documents, this application provides a free layout recognition method, system, smart terminal and storage medium.

[0006] Firstly, this application provides a free layout recognition method, which adopts the following technical solution: A free layout recognition method includes: obtaining an image of the document to be processed; extracting a visual feature sequence of the image using a visual encoder, and extracting a text feature sequence of the document using a text encoder; performing dynamic weight fusion on the visual feature sequence and the text feature sequence to generate a multimodal joint representation of the document; performing document layout understanding and element relationship parsing based on the multimodal joint representation, and outputting basic recognition results, which contain element category and location information, together with associated confidence scores; generating initial structured output data based on the basic recognition results, the element relationship parsing results, and the multimodal joint representation; inputting the initial structured output data, the multimodal joint representation, and the element relationship parsing results into a multimodal quality assessment engine to calculate the overall confidence of the initial structured output data and the sub-item confidence of key data items; if the overall confidence or the sub-item confidence of any key data item is lower than a first preset threshold, triggering an adaptive repair process to generate repaired structured data; performing confidence assessment on the repaired structured data; and if the assessment result is not lower than a second preset threshold, outputting the repaired structured data as the final structured data.

[0007] By adopting the above technical solution, this application constructs a complete closed loop of document understanding process from multimodal perception to reliable output. First, it integrates visual and textual information to form a unified semantic understanding of the document content, overcoming the limitations of single-modal analysis. Based on this, it not only performs layout and element parsing but also introduces multi-round confidence assessment and adaptive repair mechanisms to ensure that the system can self-perceive the uncertainty of the output results and proactively attempt corrections for low-confidence situations, thereby improving the accuracy and robustness in processing complex layouts and multimodal documents. Finally, it outputs structured data that has undergone confidence verification, improving the reliability of automated processing and the trustworthiness of the results, and reducing reliance on manual review.

[0008] Optionally, the step of triggering the adaptive repair process and generating repaired structured data includes: locating potential error sources or areas of missing information based on the confidence distribution, the element relationships, and the multimodal joint representation; and repairing the initial structured output data by invoking the repair strategy corresponding to the error type, thereby generating the repaired structured data.

[0009] By adopting the above technical solution, an intelligent fault location and repair mechanism is established. This solution is not a global, indiscriminate reprocessing, but rather locates potential error sources or missing information points based on confidence anomalies, element correlations, and contextual semantics. Then, it selectively invokes corresponding repair strategies to intervene, improving the efficiency and effectiveness of the repair process, avoiding unnecessary consumption of computational resources, and enabling the system to self-diagnose and repair when faced with localized identification errors or information interference, thereby enhancing the intelligence and practicality of the overall solution.

[0010] Optionally, the adaptive repair process further includes: if the confidence assessment result of the repaired structured data is still lower than the second preset threshold, or if no repair strategy is available, generating an interactive correction request, where the correction request includes at least the low-confidence data item, its corresponding region in the original document image, the system's recommended value, and alternative interpretations; receiving and recording user feedback on the correction request; and storing the feedback result, the corresponding multimodal joint representation, and the document image in the incremental learning sample library as a new training sample pair.
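The correction request described above can be pictured as a small record. The following is a minimal sketch only: the class name `CorrectionRequest`, the helper `build_correction_request`, the field names, and the convention that the highest-scoring candidate becomes the recommended value are all illustrative assumptions, not details taken from the application.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CorrectionRequest:
    """Hypothetical container for the four items the request must carry."""
    data_item: str                            # name of the low-confidence data item
    image_region: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in the source image
    recommended_value: str                    # the system's best guess
    alternatives: Tuple[str, ...]             # other plausible interpretations
    confidence: float                         # score of the recommended value


def build_correction_request(
    data_item: str,
    candidates: List[Tuple[str, float]],
    image_region: Tuple[int, int, int, int],
) -> CorrectionRequest:
    """Rank candidate readings by score (assumed convention): the top one is
    recommended, the rest are offered as alternatives."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_value, best_score = ranked[0]
    others = tuple(value for value, _ in ranked[1:])
    return CorrectionRequest(data_item, image_region, best_value, others, best_score)


request = build_correction_request(
    "signing_date",
    [("2024-03-01", 0.55), ("2024-08-01", 0.62), ("2024-03-07", 0.31)],
    (120, 840, 410, 880),
)
```

Keeping the image region in the request lets a review interface show the user the exact crop the uncertain value came from.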

[0011] By adopting the above technical solution, this application integrates human-machine collaboration into the automated process. When the system's self-repair capability reaches its limit, it proactively generates a structured correction request, presenting the problem along with contextual evidence and recommended solutions to the user. This significantly reduces the difficulty for users to troubleshoot and correct problems. The system can transform every user feedback into a valuable learning sample, constructing a closed loop of learning from practice. This allows the system's knowledge base to continuously evolve with usage, providing a data foundation for solving rare cases and adapting to new document formats, thus enabling the system to continuously evolve.

[0012] Optionally, the method further includes an online model evolution step: when the number of samples in the incremental learning sample library reaches a scale threshold, using the data in the sample library to perform incremental fine-tuning on at least one of the visual encoder, the text encoder, the dynamic weight fusion module, or the layout understanding network, so as to optimize the model parameters online.
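The scale-threshold trigger can be sketched as a buffer that hands over a batch once enough feedback samples have accumulated. This is an illustration under assumptions: the class name `IncrementalSampleLibrary`, the `add` method, and the choice to empty the library after handing over a batch are hypothetical, and the fine-tuning itself is left out.

```python
from typing import Any, Dict, List, Optional


class IncrementalSampleLibrary:
    """Hypothetical sample library that releases a batch at a scale threshold."""

    def __init__(self, scale_threshold: int) -> None:
        if scale_threshold < 1:
            raise ValueError("scale_threshold must be at least 1")
        self.scale_threshold = scale_threshold
        self._samples: List[Dict[str, Any]] = []

    def __len__(self) -> int:
        return len(self._samples)

    def add(self, sample: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
        """Store one feedback sample.

        Returns the accumulated batch when the threshold is reached (the
        caller would pass it to a fine-tuning routine), otherwise None.
        Clearing the buffer afterwards is an assumption of this sketch.
        """
        self._samples.append(sample)
        if len(self._samples) >= self.scale_threshold:
            batch, self._samples = self._samples, []
            return batch
        return None


library = IncrementalSampleLibrary(scale_threshold=3)
first = library.add({"field": "amount", "feedback": "1200.00"})
second = library.add({"field": "date", "feedback": "2024-03-01"})
third = library.add({"field": "payee", "feedback": "ACME Ltd"})
```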

[0013] By adopting the above technical solution, the application achieves online continuous optimization and evolution of system performance. This solution utilizes feedback samples accumulated during daily use to periodically perform incremental fine-tuning of the core processing model, enabling the system to gradually adapt to specific user scenarios, newly emerging document formats, or professional terminology, dynamically improving its generalization ability and domain adaptability. The self-evolution mechanism can alleviate the performance bottleneck problem caused by insufficient coverage of initial training data, allowing the system to become smarter with use and maintain its advanced processing performance and competitiveness in the long term.

[0014] Optionally, the step in which the multimodal quality assessment engine calculates confidence includes: constructing a document semantic consistency graph based on the element relationship parsing results to check for logical contradictions between different data items; reverse-encoding the initial structured output data into reconstructed features, and calculating the semantic reconstruction error between the reconstructed features and the original multimodal joint representation; and combining the logical contradiction check results, the semantic reconstruction error, and the associated confidence scores of the basic recognition results, and using a lightweight evaluation network to generate the overall confidence and the sub-item confidence.

[0015] By adopting the above technical solution, this application provides a multi-dimensional and in-depth quality assessment method. This method does not rely solely on the initial identification confidence level, but also comprehensively judges from two high-level dimensions: the inherent semantic logical consistency of the document and the completeness of information reconstruction. By examining logical contradictions between data items, hidden errors that violate common sense or business rules can be discovered. By calculating semantic reconstruction error, it can measure whether the structured output retains all the key information of the original document. The assessment system, which integrates low-level feature confidence and high-level semantic consistency, makes the judgment of output quality more comprehensive, accurate, and reliable, providing multi-dimensional basis for subsequent decisions such as outputting or triggering repairs.

[0016] Optionally, invoking the repair strategy corresponding to the error type includes: when a conflict in data format, data type, or preset business logic is detected, invoking a predefined rule base for automatic correction; when missing information or semantic ambiguity is detected, converting the multimodal context information of the problem area into a prompt and inputting it into an inference model for content completion or disambiguation; and when textual and visual information conflict, calibrating the parsing result of the low-confidence modality based on the information from the high-confidence modality.
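The three repair strategies lend themselves to a dispatch table keyed by error type. The sketch below is illustrative: the `ErrorType` names, the `issue` dictionary layout, the single normalization rule, and the `infer` callable standing in for the inference model are all assumptions made for the example.

```python
from enum import Enum
from typing import Any, Callable, Dict


class ErrorType(Enum):
    RULE_CONFLICT = "rule_conflict"            # format, type, or business-logic conflict
    MISSING_OR_AMBIGUOUS = "missing_or_ambiguous"
    MODALITY_CONFLICT = "modality_conflict"    # text and visual readings disagree


def repair_by_rule(issue: Dict[str, Any], infer: Callable[[str], str]) -> str:
    # Stand-in for a rule base: one assumed rule that normalizes an amount string.
    return issue["value"].replace(",", "").strip()


def repair_by_inference(issue: Dict[str, Any], infer: Callable[[str], str]) -> str:
    # Turn the context of the problem area into a prompt for the inference model.
    prompt = (
        f"Field: {issue['field']}\n"
        f"Context: {issue['context']}\n"
        "Provide the most likely value for this field."
    )
    return infer(prompt)


def repair_by_cross_modal_check(issue: Dict[str, Any], infer: Callable[[str], str]) -> str:
    # Trust the reading from the modality with the higher confidence.
    text_value, text_conf = issue["text"]
    visual_value, visual_conf = issue["visual"]
    return text_value if text_conf >= visual_conf else visual_value


STRATEGIES = {
    ErrorType.RULE_CONFLICT: repair_by_rule,
    ErrorType.MISSING_OR_AMBIGUOUS: repair_by_inference,
    ErrorType.MODALITY_CONFLICT: repair_by_cross_modal_check,
}


def repair(issue: Dict[str, Any], infer: Callable[[str], str]) -> str:
    """Invoke the strategy registered for the issue's error type."""
    return STRATEGIES[issue["type"]](issue, infer)


def stub_infer(prompt: str) -> str:
    # Placeholder for a real inference model call.
    return "ACME Ltd"


amount = repair({"type": ErrorType.RULE_CONFLICT, "value": " 1,234.50 "}, stub_infer)
payee = repair(
    {"type": ErrorType.MISSING_OR_AMBIGUOUS, "field": "payee", "context": "Pay to: AC.. Ltd"},
    stub_infer,
)
invoice_no = repair(
    {"type": ErrorType.MODALITY_CONFLICT, "text": ("INV-001", 0.6), "visual": ("INV-007", 0.9)},
    stub_infer,
)
```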

[0017] By adopting the above technical solutions, this application defines a hierarchical and complementary repair strategy library to address errors of different types and natures. For explicit formatting or logical conflicts, efficient rule-based repair is employed to ensure accuracy and timeliness. For missing information or semantic ambiguity, context-based reasoning completion is initiated, utilizing the overall semantics of the document for intelligent inference. For multimodal information contradictions, cross-modal verification is performed, arbitrating and calibrating based on information sources with higher credibility. Through this divide-and-conquer strategy set, the repair mechanism possesses good flexibility and problem coverage, selecting the most suitable solution path for specific error types, thereby improving the comprehensive handling capability for complex anomalies.

[0018] Optionally, after features are extracted by the visual encoder and the text encoder and before dynamic weight fusion is performed, the method further includes: analyzing the visual feature sequence with a noise perception module to generate a noise mask image that characterizes image quality; and, based on the noise mask image, adaptively enhancing the visual feature sequence and reducing the confidence of the parts of the text feature sequence that correspond to low-quality image regions.
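How a noise mask might act on both modalities can be shown with a toy example. The sketch assumes the mask is already available as one quality score in [0, 1] per image region, that each text token records the region it came from, and that "enhancement" is approximated by simply scaling features by quality. The function name `apply_noise_mask` and the 0.5 cutoff are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def apply_noise_mask(
    visual_features: List[List[float]],
    text_tokens: List[Dict[str, Any]],
    quality: List[float],
    low_quality_cutoff: float = 0.5,
) -> Tuple[List[List[float]], List[Dict[str, Any]]]:
    """Weight visual features by region quality and lower the confidence of
    text tokens that sit in low-quality regions."""
    # Clear regions keep (almost) full strength; degraded regions are suppressed.
    weighted = [
        [q * value for value in feature]
        for feature, q in zip(visual_features, quality)
    ]
    adjusted = []
    for token in text_tokens:
        q = quality[token["region"]]
        confidence = token["confidence"]
        if q < low_quality_cutoff:
            confidence *= q  # assumed rule: scale confidence by region quality
        adjusted.append({**token, "confidence": confidence})
    return weighted, adjusted


features, tokens = apply_noise_mask(
    visual_features=[[2.0, 4.0], [2.0, 4.0]],
    text_tokens=[
        {"text": "Total", "region": 0, "confidence": 0.9},
        {"text": "1O0.00", "region": 1, "confidence": 0.8},
    ],
    quality=[1.0, 0.25],  # region 1 is blurred or shadowed
)
```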

[0019] By adopting the above technical solution, this application adds an adaptive preprocessing stage to the feature fusion front-end. The noise perception module actively diagnoses the quality of the input image, identifying degraded areas such as blurriness, occlusion, and shadows, and generating corresponding quality masks. Based on the masks, the system can enhance the feature representation of clear areas and suppress the influence of unreliable text or visual features that may be caused by low-quality areas. The front-end quality perception and feature enhancement processing is equivalent to equipping the system with a preprocessing filter, improving the robustness of subsequent fusion and recognition modules to non-ideal original inputs such as low-quality scanned documents and photographed documents, thus reducing the interference of noise on the overall processing flow from the source.

[0020] Secondly, this application provides a free layout recognition system, which adopts the following technical solution: A free layout recognition system includes: an acquisition module for acquiring an image of the document to be processed; a memory for storing a program of any of the free layout recognition methods described above; and a processor capable of loading and executing the program in the memory to implement any of the free layout recognition methods described above.

[0021] Thirdly, this application provides a smart terminal, which adopts the following technical solution: A smart terminal includes a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute any of the methods described above.

[0022] Fourthly, this application provides a computer-readable storage medium, which adopts the following technical solution: A computer-readable storage medium stores a computer program that can be loaded by a processor to execute any of the methods described above.

[0023] In summary, this application includes at least one of the following beneficial technical effects:

By deeply integrating visual and textual multimodal information, a unified and in-depth understanding of document layout and content is achieved, overcoming the limitations of single-modal analysis. Through the introduction of a closed-loop mechanism of multi-round confidence assessment and adaptive repair, the system can proactively identify uncertainties and potential errors in the output and intelligently locate and correct them. This shift from unidirectional identification to assessment and repair ensures that the final output structured data is a validated, high-quality result, thereby improving the reliability and usability of automated processing.

Faced with unstructured documents with varied formats and types, this solution is not a fixed processing flow. Instead, it transforms difficult cases that cannot be automatically processed into learning opportunities by embedding a human-computer collaboration interface and an incremental learning mechanism. It continuously optimizes its own model by utilizing user feedback, enabling the system to adapt to new document formats, professional terminology, and specific user scenarios during use. This dynamically improves its generalization ability and solves the pain point of traditional methods that rely heavily on preset templates and are difficult to adapt to new scenarios, thus achieving a leap from a static model to continuous evolution.

This solution forms a resilient processing chain from input preprocessing and core recognition to post-processing optimization. The front end enhances robustness to low-quality input through noise perception, the middle end achieves accurate decision-making through multi-dimensional quality assessment, and the back end is equipped with a hierarchical repair strategy library and learning and evolution capabilities. Through closed-loop design, it systematically ensures stable performance under various non-ideal conditions such as document corruption, complex layout, and information contradictions. At the same time, it avoids unnecessary global reprocessing by intelligently scheduling repair resources, thereby improving both effectiveness and processing efficiency.

Attached Figure Description

[0024] Figure 1 is a flowchart of a free layout recognition method according to an embodiment of this application.

[0025] Figure 2 is a flowchart of the steps of triggering the adaptive repair process and generating repaired structured data in an embodiment of this application.

[0026] Figure 3 is a flowchart of the steps of the adaptive repair process in an embodiment of this application.

[0027] Figure 4 is a flowchart of the online model evolution steps in an embodiment of this application.

[0028] Figure 5 is a flowchart of the steps by which the multimodal quality assessment engine calculates confidence in an embodiment of this application.

[0029] Figure 6 is a flowchart of invoking the repair strategy corresponding to the error type in an embodiment of this application.

[0030] Figure 7 is a flowchart of the steps performed after features are extracted by the visual encoder and the text encoder and before dynamic weight fusion is performed in an embodiment of this application.

[0031] Figure 8 is a module diagram of a free layout recognition method according to an embodiment of this application.

Detailed Implementation

[0032] The present application will be further described in detail below with reference to the accompanying drawings.

[0033] This specific embodiment is merely an explanation of this application and is not intended to limit it. After reading this specification, those skilled in the art can make modifications to this embodiment without contributing any inventive step, but such modifications are protected by patent law as long as they fall within the scope of the claims of this application.

[0034] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to Figures 1-8. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0035] This application discloses a method for recognizing free layouts.

[0036] Referring to Figure 1, the free layout recognition method includes: Step S100: Obtain the image of the document to be processed.

[0037] The image of the document to be processed refers to the digital image of the document that needs to be content-recognized and structured. The source can be an electronic file generated by a scanner, digital camera, screenshot or other image acquisition device. Common formats include but are not limited to JPEG, PNG, PDF, etc.

[0038] The general process is as follows: The system calls the image input interface to receive image files uploaded or specified from external devices (such as document scanners or folder monitoring services). For example, if a user uploads a scanned copy (PDF format) of a paper contract to the system, the system will obtain the image data of the document and load it into memory or a specified storage location in preparation for subsequent processing.

[0039] Step S101: Extract the visual feature sequence of the image using a visual encoder, and extract the text feature sequence of the document using a text encoder.

[0040] A visual encoder is a model based on a deep convolutional neural network (CNN) or a vision transformer (ViT) that automatically extracts multi-level visual features from document images, such as edges, textures, shapes, and layout structures, and encodes them into a sequence of feature vectors. A text encoder is a model based on a recurrent neural network (RNN) or a transformer (such as BERT) that encodes the text recognized from a document (usually provided in advance by an OCR engine), extracting lexical, syntactic, and contextual semantic features and representing them as a sequence of text features. The visual feature sequence and the text feature sequence are continuous numerical representations of local image regions and text fragments, respectively, and form the basis for deeper understanding.

[0041] The general process is as follows: The image is input into a pre-trained visual encoder. The encoder segments the image and abstracts its features through its stacked convolutional or attention layers, ultimately outputting a sequence of visual features, where each feature vector corresponds to a spatial region or visual semantic unit in the image.

[0042] Optical character recognition (OCR) technology is used to detect and recognize text in the same document image. The recognized text (including text content and location) is input into a pre-trained text encoder to generate the corresponding text feature sequence.

[0043] For example, for a report document, a visual encoder may extract special font features of the title area and grid line features of the table, while a text encoder may extract semantic vectors of technical terms in the report.

[0044] Step S102: Dynamically weight-fuse the visual feature sequence and the text feature sequence to generate a multimodal joint representation of the document.

[0045] Dynamic weighted fusion is an adaptive feature fusion technique that dynamically calculates and assigns fusion weights based on the relevance between visual and textual content through attention mechanisms or other learnable network modules, rather than performing fixed concatenation or addition. Multimodal joint representation is a unified and compact semantic representation that integrates visual and textual information, capturing the correlation between images and text, and providing more accurate context for subsequent layout understanding and content analysis.

[0046] The general process is as follows: Visual feature sequences and text feature sequences are input into a dynamic weight fusion module (e.g., a fusion network based on a cross-attention mechanism). The module calculates the interaction attention score between visual and text features and dynamically assigns fusion weights to each pair of visual-text feature combinations based on the score. Then, the features are weighted, combined, and aggregated according to the weights to generate a unified feature representation, i.e., a multimodal joint representation. For example, when processing a paragraph with an illustration, the fusion module may assign higher fusion weights to the descriptive text near the illustration, so that the joint representation can better associate the image content with the text description.
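The fusion step above can be illustrated with a stripped-down cross-attention computation in plain Python. This is a sketch under assumptions rather than the application's module: it omits the learned query, key, and value projections a real fusion network would have, uses scaled dot-product scores as the "interaction attention score", and combines each visual vector with its attended text context by simple addition. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    visual: List[List[float]], text: List[List[float]]
) -> Tuple[List[List[float]], List[List[float]]]:
    """For every visual vector, compute attention weights over all text
    vectors, aggregate the text vectors with those weights, and add the
    result to the visual vector. Returns (joint representation, weights)."""
    dim = len(visual[0])
    fused, all_weights = [], []
    for v in visual:
        # Interaction score between this visual feature and each text feature.
        scores = [
            sum(a * b for a, b in zip(v, t)) / math.sqrt(dim) for t in text
        ]
        weights = softmax(scores)  # dynamic fusion weights, sum to 1
        context = [
            sum(w * t[i] for w, t in zip(weights, text)) for i in range(dim)
        ]
        fused.append([a + c for a, c in zip(v, context)])
        all_weights.append(weights)
    return fused, all_weights


visual_sequence = [[1.0, 0.0], [0.0, 1.0]]
text_sequence = [[2.0, 0.0], [0.0, 2.0]]
joint, weights = fuse(visual_sequence, text_sequence)
```

In this toy input the first visual vector points in the same direction as the first text vector, so it receives the larger weight, which mirrors the illustration example where nearby descriptive text is weighted more heavily.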

[0047] Step S103: Based on multimodal joint representation, perform document layout understanding and element relationship parsing, and output basic recognition results and association confidence scores containing element category and location information.

[0048] Document layout understanding refers to identifying and classifying different elements in a document, such as headings, paragraphs, tables, images, headers, and footers. Element relationship parsing refers to analyzing and determining the logical and spatial relationships between these elements, such as containment relationships (a table contains multiple cells), sequential relationships (a heading is followed by the main text), and hierarchical relationships (chapter and sub-chapter). Basic recognition results and associated confidence scores refer to, for each identified document element, the category label assigned by the system (e.g., "table"), its position coordinates (e.g., a bounding box), and a quantitative confidence score that reflects the model's certainty in the recognition result.

[0049] The general process is as follows: A multimodal joint representation is input into a layout understanding and relationship parsing network (e.g., a graph neural network or sequence-to-sequence model). The network first predicts the category and precise bounding box of each element in the document. By analyzing the spatial layout and semantic relationships between elements (utilizing fused information from the joint representation), a relationship graph between elements is constructed. The network outputs basic information (category, location) and its confidence score for each identified element, as well as a description of the relationships between elements. For example, the system might identify an element as "invoice form" with a confidence score of 0.98 and parse out a "remarks" paragraph immediately following the form.

[0050] Step S104: Based on the basic identification results, element relationship analysis results, and multimodal joint representation, generate initial structured output data.

[0051] Initial structured output data refers to data with a defined format and schema that is preliminarily organized from the document content parsing results. It may take the form of JSON, XML, or database records, and may still contain errors or be incomplete.

[0052] The general process is as follows: Based on element categories, positions, relationships, and multimodal joint representations containing rich semantics, the system converts unstructured document content into structured data format according to a preset data template or through a sequence generation model. For example, for a resume document, the system will generate a preliminary JSON object based on the identified areas such as "Name," "Education," and "Work Experience" and their content. The keys are the field names, and the values are the identified text content.
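The template-based variant of this step can be sketched as follows. The element dictionaries, the template list, and the function name `build_structured_output` are illustrative assumptions; keeping a per-field confidence and bounding box alongside the value is also an assumption of the sketch, made so that later assessment steps have something to work with.

```python
import json
from typing import Any, Dict, List, Optional


def build_structured_output(
    elements: List[Dict[str, Any]], template: List[str]
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Fill each template field from the recognized element of the same
    category; when several elements match, keep the most confident one."""
    record: Dict[str, Optional[Dict[str, Any]]] = {}
    for field_name in template:
        matches = [e for e in elements if e["category"] == field_name]
        if not matches:
            record[field_name] = None  # left empty for a later repair step
            continue
        best = max(matches, key=lambda e: e["confidence"])
        record[field_name] = {
            "value": best["text"],
            "confidence": best["confidence"],
            "bbox": best["bbox"],
        }
    return record


recognized = [
    {"category": "Name", "text": "Li Wei", "confidence": 0.97, "bbox": [40, 30, 220, 60]},
    {"category": "Education", "text": "BSc Computer Science", "confidence": 0.91, "bbox": [40, 120, 400, 150]},
    {"category": "Education", "text": "B5c Computer Science", "confidence": 0.64, "bbox": [40, 120, 400, 150]},
]
resume = build_structured_output(recognized, ["Name", "Education", "Work Experience"])
resume_json = json.dumps(resume, ensure_ascii=False, indent=2)
```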

[0053] Step S105: Input the initial structured output data, the multimodal joint representation, and the element relationship parsing results into the multimodal quality assessment engine to calculate the overall confidence of the initial structured output data and the sub-item confidence of key data items.

[0054] The multimodal quality assessment engine is a module specifically designed to evaluate the reliability of structured output. It comprehensively examines the internal consistency of the data, its conformity with the original document, and its logical soundness. Overall confidence is a comprehensive score for the quality of the entire structured output data. Sub-item confidence is an independent confidence score for each key field or data item in the output data (such as "amount" or "date" in a contract).

[0055] The general process is as follows: The quality assessment engine receives initial structured data, the original multimodal joint representation of the document, and an element relationship graph. It performs multi-dimensional checks: 1. Semantic consistency check: using the joint representation to verify whether there are logical contradictions between extracted data items (e.g., a negative contract amount); 2. Reconstruction verification: attempting to reconstruct key layouts or semantic fragments of the document from the structured data and comparing the differences with the original joint representation; 3. Rule and context verification: based on predefined business rules and element relationships, checking whether the data conforms to conventions (e.g., the format of invoice numbers). Finally, the engine synthesizes all the check results and outputs an overall confidence score (e.g., a score between 0 and 1) and the sub-confidence score for each key data item through an evaluation model. For example, the engine might find that the extracted "signing date" is later than the "effective date," thus significantly reducing the sub-confidence score and overall confidence score of these two data items.
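A toy version of the assessment can make the interplay of the checks concrete. Everything numeric here is an assumption: the application describes an evaluation model that combines the signals, whereas this sketch substitutes a fixed hand-written formula (a penalty factor for contradicted items, and an overall score equal to the mean item confidence scaled by one minus the reconstruction error). The field names and the two contradiction rules are taken from the examples in the text.

```python
from datetime import date
from typing import Any, Dict, Set, Tuple

CONTRADICTION_PENALTY = 0.5  # assumed factor applied to contradicted items


def find_contradictions(record: Dict[str, Any]) -> Set[str]:
    """Return the names of data items involved in a logical contradiction."""
    flagged: Set[str] = set()
    if record["amount"] < 0:
        flagged.add("amount")  # a contract amount should not be negative
    if record["signing_date"] > record["effective_date"]:
        flagged.update({"signing_date", "effective_date"})
    return flagged


def assess(
    record: Dict[str, Any],
    recognition_confidence: Dict[str, float],
    reconstruction_error: float,
) -> Tuple[float, Dict[str, float]]:
    """Combine recognition confidence, contradiction checks, and a
    reconstruction error in [0, 1] into overall and sub-item confidence."""
    flagged = find_contradictions(record)
    item_confidence = {
        name: conf * (CONTRADICTION_PENALTY if name in flagged else 1.0)
        for name, conf in recognition_confidence.items()
    }
    mean_item = sum(item_confidence.values()) / len(item_confidence)
    overall = mean_item * (1.0 - reconstruction_error)
    return overall, item_confidence


contract = {
    "amount": 1000.0,
    "signing_date": date(2024, 5, 1),
    "effective_date": date(2024, 4, 1),  # earlier than signing: contradiction
}
overall, items = assess(
    contract,
    {"amount": 0.96, "signing_date": 0.9, "effective_date": 0.9},
    reconstruction_error=0.1,
)
```

Both date fields lose confidence even though each was recognized with a high score on its own, which is the behavior the signing-date example describes.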

[0056] Step S106: If the overall confidence level or the confidence level of any key data item is lower than the first preset threshold, the adaptive repair process is triggered to generate repaired structured data.

[0057] The first preset threshold is a pre-defined critical value used to determine whether the output quality is acceptable; if a confidence score falls below this value, intervention and repair are considered necessary. The adaptive repair process refers to a series of intelligent data correction and completion steps. Repaired structured data refers to the version of the structured data whose quality has been improved by the repair process.

[0058] The general process is described as follows: The system compares the confidence level calculated in step S105 with a first preset threshold (e.g., the overall confidence threshold is set to 0.85, and the sub-item confidence threshold is set to 0.9). Once any confidence index falls below its corresponding threshold, the system determines that the current initial output quality is insufficient and automatically triggers an adaptive repair process. The process locates low-confidence data items and uses various strategies (such as contextual reasoning, rule base matching, and cross-modal information re-verification) to correct or complete them, ultimately generating an optimized "repaired structured data". For example, if the sub-item confidence level of "total invoice price" is only 0.7 (below the threshold of 0.9), the repair process may recalculate the product of unit price and quantity, or extract and verify it again from obvious locations in the original image.
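The trigger condition reduces to a pair of threshold comparisons. The sketch below uses the example values from the text (0.85 for overall confidence, 0.9 for sub-item confidence) as defaults; the function name `check_repair_trigger` and the returned tuple shape are hypothetical.

```python
from typing import Dict, List, Tuple


def check_repair_trigger(
    overall_confidence: float,
    item_confidence: Dict[str, float],
    overall_threshold: float = 0.85,
    item_threshold: float = 0.9,
) -> Tuple[bool, List[str]]:
    """Return whether repair should be triggered and which key data items
    fall below the sub-item threshold."""
    low_items = sorted(
        name for name, conf in item_confidence.items() if conf < item_threshold
    )
    triggered = overall_confidence < overall_threshold or bool(low_items)
    return triggered, low_items


triggered, low_items = check_repair_trigger(
    overall_confidence=0.92,
    item_confidence={"total_invoice_price": 0.7, "invoice_number": 0.97},
)
```

Here the overall score passes, but the single low item is enough to trigger repair, and the returned list tells the repair process where to look.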

[0059] Step S107: Confidence assessment of the repaired structured data.

[0060] The general process is as follows: The system invokes the multimodal quality assessment engine again, this time on the repaired structured data, to perform a new confidence assessment. The assessment process is similar to step S105, but the input data is the repaired version. This step quantifies the effect of the repair operation and confirms whether the quality of the repaired data has reached an acceptable level.

[0061] Step S108: If the evaluation result is not lower than the second preset threshold, the repaired structured data is output as the final structured data.

[0062] The second preset threshold is an acceptance threshold used to determine whether the repaired data meets the standards. It can be the same as or different from the first preset threshold, and usually represents the system's minimum requirement for the quality of the final output. The final structured data refers to the output result that has been identified, evaluated, repaired, and re-evaluated by the system and is confirmed to be reliable and meets the quality standards.

[0063] The general process is as follows: The system compares the assessment result of the repaired data (overall or key component confidence level) with a second preset threshold. If the assessment result is not lower than the threshold, it indicates that the repair is successful and the data quality meets the final output requirements. The system then identifies this repaired structured data as the "final structured data" for this processing and outputs it for direct use by downstream systems (such as databases and business analysis platforms). If the assessment result is still lower than the threshold, it is processed according to preset strategies (such as marking it as requiring manual review or initiating a more complex repair round). For example, if the confidence assessment of the "total invoice price" increases to 0.95 after repair, which is higher than the second preset threshold of 0.9, then this data item is included in the final output.
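The acceptance decision of step S108 can be sketched as below. The names `Outcome` and `accept_repaired` are hypothetical, and applying a single second threshold to both the overall score and every item score is an assumption of this sketch; the text allows the check to be made on the overall or key item confidence.

```python
from enum import Enum
from typing import Dict


class Outcome(Enum):
    FINAL_OUTPUT = "final_output"    # repaired data becomes the final structured data
    MANUAL_REVIEW = "manual_review"  # one possible preset strategy for failed repairs


def accept_repaired(overall: float, item_conf: Dict[str, float],
                    second_threshold: float = 0.9) -> Outcome:
    """Accept the repaired data only if no score is lower than the threshold."""
    lowest = min([overall, *item_conf.values()])
    if lowest >= second_threshold:  # "not lower than" includes equality
        return Outcome.FINAL_OUTPUT
    return Outcome.MANUAL_REVIEW


decision = accept_repaired(overall=0.93, item_conf={"total_invoice_price": 0.95})
print(decision)  # Outcome.FINAL_OUTPUT
```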

[0064] Referring to Figure 2, the steps of triggering the adaptive repair process and generating repaired structured data include: Step S200: Based on the confidence distribution, element relationships, and the multimodal joint representation, locate potential error sources or information-missing regions.

[0065] The confidence distribution refers to the set formed by the overall confidence level and the individual confidence levels of each key data item calculated by the multimodal quality assessment engine. Its value directly reflects the system's judgment on the credibility of the corresponding data. Element relationships refer to the spatial location, logical hierarchy, and semantic associations between document elements parsed in step S103. Potential error sources refer to data items with significantly low confidence levels (e.g., below a preset threshold) or their dependent original identification areas, which may be due to identification errors, information interference, or noise. Information-missing regions refer to document areas that should exist based on document layout, element relationships, and conventional logic but have not been successfully identified or from which valid information has not been extracted.

[0066] The general process is as follows: The system first analyzes the confidence distribution, marking data items whose overall confidence level or individual item confidence level is below the first preset threshold as "suspicious items." Combining the element relationship analysis results with the multimodal joint representation, the system traces the root causes of suspicious items. For example, through the element relationship graph, it traces a low-confidence "amount" data item to a visually blurry table cell; or, using the semantic information of the joint representation, it discovers that a continuous text passage lacks the expected "date" field and therefore marks it as an information-missing region. The system integrates these analyses to accurately mark the corresponding problem areas in the document image and their mapping positions in the structured data, providing input for targeted repair.

[0067] Step S201: Invoke the repair strategy corresponding to the error type to repair the initial structured output data and generate repaired structured data.

[0068] Error types are categories defined by problem characteristics, primarily including data format / logical conflicts, missing or ambiguous semantic information, and inconsistencies in multimodal information. Repair strategies refer to automated correction and completion methods pre-designed or trained for different error types.

[0069] The general process is described as follows: Based on the problem features located in step S200, the system classifies them into specific error types and automatically invokes the corresponding repair strategy module. The repair strategies include one or a combination of the following: 1. Rule repair strategy. When the error is determined to stem from an abnormal data format or a violation of preset business logic (such as an incorrect number of digits in an ID number, or an end date earlier than the start date), the system invokes a predefined rule base. The rule base contains format verification rules, logical consistency rules, and so on, and is used to automatically correct the problem data items or fill in default values. For example, a recognized date string that does not form a valid date is corrected according to the rules to a valid value in the standard format, such as "2023-12-01"; 2. Context reasoning repair strategy. When the error is determined to be due to missing information or ambiguous semantics (such as missing content in some cells of a table, or incomplete sentence components), the system starts context reasoning repair. This strategy takes the multimodal context information around the problem area (including text content, visual layout, and element relationships), constructs it into a targeted prompt, and inputs it into a trained large language model or a dedicated reasoning model. The model reasons over the context to generate the most likely completion or to resolve the ambiguity. For example, the patterns in other rows of the same table are used to infer and complete a missing "product unit price"; 3. Cross-modal verification repair strategy. When the error is determined to be due to a conflict between the same piece of information obtained from different modalities (vision and text), such as OCR-recognized text that does not match the number that is visually evident, the system performs cross-modal verification repair. This strategy compares the confidence levels of the information from the different modalities, preferentially adopts or fuses the information of the high-confidence modality, and calibrates or replaces the parsing result of the low-confidence modality. For example, when the amount "1000" recognized by text has a low confidence level, but the number area is visually clear with distinctive features and visual feature analysis assigns "1000" a higher confidence level, the visual verification result is used to confirm or correct the text recognition result.

[0070] Referring to Figure 3, the adaptive repair process further includes: Step S300: When the confidence evaluation result of the repaired structured data is still lower than the second preset threshold, or the repair strategy is unavailable, generate an interactive correction request.

[0071] The repair strategy being unavailable means that, for the currently located error type or problem feature, no corresponding rule repair strategy is pre-configured in the system, or the context reasoning model and the cross-modal verification model cannot give effective correction suggestions. An interactive correction request is a structured data object for human-machine interaction. It clearly presents problems that the system cannot solve automatically to the user (such as an auditor) and guides the user to provide correct information or make decisions.

[0072] The general process is described as follows: After assessing the confidence level of the repaired structured data, if the assessment result (overall or key component confidence level) is still lower than the second preset threshold, or if the system determines in step S201 that there is no suitable repair strategy available for the current error type, the system determines that manual intervention is required. At this time, the system automatically generates an interactive correction request. This request is a data packet containing a problem description, evidence materials, and optional operations, which is sent to the human-computer interaction interface (such as a web management backend or audit client) for user processing.

[0073] Step S301: The correction request includes at least the low-confidence data item, its corresponding original document image region, the system recommended value, and alternative interpretations.

[0074] Low-confidence data items refer to specific fields or data content in the repaired data whose confidence assessment is still below the second preset threshold. The corresponding original document image region refers to the precise location (usually represented by bounding box coordinates) in the original document image from which the low-confidence data item originates. The system recommended value refers to one or more of the most likely corrected values that the system attempts to generate through the repair strategy. Alternative interpretations refer to supplementary explanations of other possibilities for the current recognition result or the system recommended value, such as listing other OCR recognition candidates or pointing out the specific points of contradiction in a data conflict.

[0075] The general process is described as follows: When constructing a correction request, the system extracts and encapsulates the following core information: 1. It clearly identifies which data item (e.g., "invoice date") has insufficient confidence; 2. It highlights the data item on the user interface or provides a screenshot or the coordinates of its specific location in the original document image; 3. It displays one or more suggested values given by the system itself through reasoning or rules (e.g., recommending that "2023-02-30" be corrected to "2023-02-28"); 4. It provides additional contextual information, such as the original OCR-recognized text and comparisons of conflicting information from different modalities, to help users understand the root cause of the problem. For example, for a blurry invoice, the correction request will display the "Total Amount" field, locate the blurry numerical area in the image, recommend a calculated amount, and note "Original OCR recognition is '1,500', visual feature analysis suggests '1,800'".
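A correction request of this kind could be packaged as in the sketch below. The class `CorrectionRequest`, its field names, the request ID format, and the bounding box coordinates are all hypothetical and chosen only for illustration; the four groups of information follow the text above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class CorrectionRequest:
    request_id: str
    data_item: str                   # which field has insufficient confidence
    current_value: str
    confidence: float
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in image pixels
    recommended_values: List[str] = field(default_factory=list)
    alternative_interpretations: List[str] = field(default_factory=list)


request = CorrectionRequest(
    request_id="req-0001",           # hypothetical ID format
    data_item="Total Amount",
    current_value="1,500",
    confidence=0.7,
    bbox=(412, 880, 590, 915),       # made-up coordinates
    recommended_values=["1,800"],
    alternative_interpretations=[
        "Original OCR recognition is '1,500'",
        "Visual feature analysis suggests '1,800'",
    ],
)

# Serialize to JSON so the packet can be sent to a review interface.
payload = json.dumps(asdict(request), ensure_ascii=False)
print(payload)
```

JSON is used here only as an example of a transport format; the text does not prescribe one.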

[0076] Step S302: Receive and record the user's feedback on the correction request.

[0077] User feedback results refer to the actions and results performed by users (auditors) through the interactive interface after reviewing the correction request. These typically include confirming the system's recommended value, manually entering a correction value, or marking the data as correct and requiring no modification.

[0078] The general process is as follows: The system's human-computer interface receives user actions regarding correction requests. Users may perform one of the following actions: directly adopt a recommended value from the system, manually enter a completely new correct value, or mark the current data item as correct (i.e., a system misjudgment). After the user submits their action, the system accurately records the feedback result. This record not only includes the final determined data value but also is linked to the original correction request ID, user identifier, timestamp, and any notes the user may add. For example, after seeing a correction request regarding "supplier name," the user manually enters the correct full company name and submits; the system records this final value confirmed by the user.

[0079] Step S303: The feedback result, the corresponding multimodal joint representation, and the document image are used as a new training sample pair and stored in the incremental learning sample library.

[0080] The incremental learning sample library is a dedicated database for storing high-quality, human-verified difficult samples and their contextual features. These samples are used for incremental training of the model to improve the system's ability to handle similar problems. Training sample pairs refer to paired data consisting of "input features" and a "target output." In this scenario, the "input features" include the original multimodal joint representation and the document image, while the "target output" is the final correct feedback result provided by the user.

[0081] The general process is described as follows: After receiving valid feedback from the user, the system transforms this interaction instance into a learning sample. The system uses the correct data value finally confirmed by the user, recorded in step S302, as the standard for this data item. The system extracts the multimodal joint representation (or its key parts) closely related to the problem area generated during document processing, as well as the original document image (or image patch of the problem area). This information (the correct value of the user feedback as the label, the multimodal joint representation and the image as the input features) is structured and packaged to form training sample pairs. The system stores this sample pair in a dedicated incremental learning sample library. For example, for the case where the user corrected the blurred "amount", a sample will be stored in the sample library, with the input being the visual and text fusion features containing the blurred area, and the label being the correct amount number finally confirmed by the user.

[0082] Referring to Figure 4, the free layout recognition method also includes an online model evolution step: Step S400: When the number of samples in the incremental learning sample library reaches the scale threshold, the data in the sample library is used as the training dataset.

[0083] The scale threshold is a pre-defined critical value used to determine whether the accumulated, manually validated high-value samples in the incremental learning sample library are sufficient to initiate an effective round of model retraining. The threshold can be dynamically configured based on computational resources, model complexity, and sensitivity to performance improvements. Using data from this sample library means extracting all or a portion of the samples from the incremental learning sample library as the training dataset for a new round of model training.

[0084] The general process is described as follows: The system continuously monitors the total number of samples in the incremental learning sample library. When the cumulative number of samples reaches or exceeds the preset scale threshold (e.g., 1000 valid sample pairs), the system automatically triggers the online model evolution process. The system locks the current state of the sample library to ensure the consistency of the training data. Either all samples are extracted from the library, or a representative subset is extracted according to a sampling strategy (e.g., by time or problem type). The samples contain "input features" (e.g., the multimodal joint representation of the problem region and document image fragments) and "target labels" (the correct results confirmed by users), forming a high-quality supervised learning dataset for model optimization.
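The sample storage of step S303 and the trigger of step S400 can be sketched together as follows. This is an in-memory illustration under stated assumptions: the class and field names are hypothetical, a real system would use a persistent database, and "locking" is approximated here by returning a copy of the sample list.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class TrainingSample:
    joint_representation: Sequence[float]  # fused features of the problem area
    image_patch_path: str                  # hypothetical path to the cropped image
    label: str                             # value finally confirmed by the user


class IncrementalSampleLibrary:
    def __init__(self, scale_threshold: int = 1000) -> None:
        self.scale_threshold = scale_threshold
        self._samples: List[TrainingSample] = []

    def add(self, sample: TrainingSample) -> None:
        self._samples.append(sample)

    def snapshot_if_ready(self) -> Optional[List[TrainingSample]]:
        """Return a copy of all samples once the scale threshold is reached."""
        if len(self._samples) < self.scale_threshold:
            return None
        return list(self._samples)  # copy, so later additions do not change it


library = IncrementalSampleLibrary(scale_threshold=2)  # small value for the demo
library.add(TrainingSample([0.1, 0.4], "patches/amount_001.png", "1,800"))
first_check = library.snapshot_if_ready()    # None: only one sample so far
library.add(TrainingSample([0.3, 0.2], "patches/date_007.png", "2023-02-28"))
training_set = library.snapshot_if_ready()   # two samples: training can start
```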

[0085] Step S401: Perform incremental fine-tuning training on at least one of the visual encoder, text encoder, dynamic weight fusion module, or layout understanding network to achieve online optimization of model parameters.

[0086] Incremental fine-tuning training is a training method that uses new, domain-specific data (incremental learning samples) to perform small-scale, targeted parameter updates on an existing pre-trained model. The aim is to adapt the model to emerging patterns or correct systematic biases while retaining most of its original general knowledge. Online optimization refers to the training process being performed in the system's deployed runtime environment, without requiring complete offline retraining, thus achieving dynamic updates and improvements to model performance.

[0087] The general process is described as follows: The system initiates a controlled training task, inputting the dataset prepared in step S400 into the training process. During training, based on system configuration and sample characteristics, one or more core modules that have the greatest impact on overall performance bottlenecks or error-prone areas are selected for updating. Common training objectives include: 1. Fine-tuning the encoder: If the samples reflect recognition problems under a large number of low-quality images, samples containing problem image regions and their corresponding correct text may be used to fine-tune the last few layers or specific adaptation layers of the visual encoder to enhance its robustness to degradation conditions such as blurring and noise; 2. Fine-tuning the fusion module: If the samples mostly reflect contradictory or incorrectly related image and text information, the parameters of the dynamic weight fusion module may be optimized using the multimodal joint representation in the samples (as input) and the correct semantics that should be achieved after user correction (as the target) to make it more accurately learn the relevance weights of cross-modal features; 3. Fine-tuning the layout understanding network: If the samples are concentrated on errors in element recognition or relationship parsing of a specific layout (such as a new version of a financial statement), the layout understanding network may be fine-tuned using the document image corresponding to the samples, multimodal features, and the correct layout annotation confirmed by the user to improve its ability to parse specific complex layouts.

[0088] During training, the system typically employs a small learning rate and makes only mild parameter adjustments, so that the original model weights are largely preserved and catastrophic forgetting is avoided. After training, a new version of the model parameters is generated. Once the system has verified the performance improvement, mechanisms such as hot updates, shadow deployments, or version switching can be used to roll out the optimized model parameters to the production environment safely and smoothly, thereby achieving online optimization of the core components of the entire processing pipeline. For example, after incremental fine-tuning on hundreds of contract samples with handwritten annotations, the system's ability to extract visual features from the "handwritten signature" area and to fuse these features with the printed "signatory" text is significantly enhanced.

[0089] Referring to Figure 5, the steps for calculating confidence levels in the multimodal quality assessment engine include: Step S500: Based on the element relationship parsing results, construct a document semantic consistency graph and check for logical contradictions between different data items.

[0090] A document semantic consistency graph is a model represented by a graph data structure. Its nodes represent identified key data items or entities, and its edges represent semantic, logical, or mathematical relationships between data items (such as "equal to", "greater than", "belongs to", "derived from"). Logical contradictions refer to situations where, when reasoning based on the relationships and rules defined in the graph, the values of different data items cannot be simultaneously true, violating common sense or specific business rules.

[0091] The general process is described as follows: Based on the element relationship parsing results output in step S103, the system extracts key data items (such as "total amount," "unit price," "quantity," and "date") and the identified explicit or implicit relationships between them (such as "total amount should equal unit price multiplied by quantity" and "signing date should be earlier than effective date"). This information is used to construct an initial semantic consistency graph. The system traverses this graph, applying predefined logical rules and constraints to verify the consistency of data item values. For example, it checks whether mathematical formulas are valid, compares data items with temporal relationships (such as dates in a contract) to ensure they are in the correct order, or verifies whether classification information is mutually exclusive. Once a rule violation is found (such as the sum of "subtotals" in an invoice not equaling the "total"), it is recorded as a specific logical contradiction instance and associated with the data items involved. For example, in a purchase order, if the "delivery address" is identified as "Shanghai", but the associated "delivery area rule" is restricted to "Jiangsu, Zhejiang and Shanghai only", and "Shanghai" is marked as compliant in the rule base, then there is no contradiction; if it is identified as "Beijing", the system will detect a logical contradiction.
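The graph traversal described above can be sketched as a list of constraints over data items, where each constraint plays the role of an edge (or hyperedge) of the consistency graph. The names `Constraint` and `find_contradictions`, the field names, and the rounding tolerance of 0.005 are assumptions of this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Constraint:
    name: str
    items: Tuple[str, ...]                   # data items (nodes) the relation connects
    holds: Callable[[Dict[str, Any]], bool]  # True when the values are consistent


def find_contradictions(data: Dict[str, Any],
                        constraints: List[Constraint]) -> List[Constraint]:
    """Return every constraint violated by the extracted values."""
    violated = []
    for constraint in constraints:
        if any(key not in data for key in constraint.items):
            continue  # a relation with missing values cannot be checked
        if not constraint.holds(data):
            violated.append(constraint)
    return violated


constraints = [
    Constraint("total = unit price x quantity",
               ("total_amount", "unit_price", "quantity"),
               lambda d: abs(d["total_amount"]
                             - d["unit_price"] * d["quantity"]) < 0.005),
    Constraint("signing date not later than effective date",
               ("signing_date", "effective_date"),
               lambda d: d["signing_date"] <= d["effective_date"]),
]

data = {"total_amount": 1800.0, "unit_price": 150.0, "quantity": 10,
        "signing_date": date(2026, 3, 1), "effective_date": date(2026, 3, 15)}
contradictions = find_contradictions(data, constraints)
print([c.name for c in contradictions])  # ['total = unit price x quantity']
```

Because each violated constraint carries the names of the data items involved, the contradiction can be associated with those items when the sub-item confidences are computed.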

[0092] Step S501: The initial structured output data is reverse encoded into reconstructed features, and the semantic reconstruction error between the features and the original multimodal joint representation is calculated.

[0093] Reverse encoding refers to the process of mapping structured text / numerical data back into a dense vector representation aligned with the original multimodal joint representation space, using an encoder or transformation model. Reconstructed features are the new feature vectors obtained through reverse encoding, representing the document semantics that can be "reconstructed" from the current structured output data. Semantic reconstruction error is a metric that quantifies the difference between the reconstructed features and the original multimodal joint representation, reflecting how completely and accurately the structured output data captures the core semantic information of the original document.

[0094] The general process is as follows: The system processes the initial structured output data (e.g., a JSON object) generated in step S104 through a specially trained "reverse encoder" network. The network learns to map the structured key-value pair sequence or fields into a fixed-dimensional feature vector, i.e., reconstructed features. The system obtains the original multimodal joint representation of the same document generated in step S102. Within a shared semantic space, the distance or dissimilarity between the reconstructed feature vector and the original multimodal joint representation vector is calculated as the semantic reconstruction error. Commonly used metrics include cosine distance, Euclidean distance, or discriminant scores based on neural networks. The larger the error value, the more semantic information of the original document is lost or distorted in the current structured output. For example, if the structured output omits an important note in the document, the features reconstructed from the simplified output data cannot contain the semantics of that note, resulting in a large reconstruction error between it and the context-rich original joint representation.
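Of the metrics named above, cosine distance is the simplest to illustrate. The sketch below assumes that both the reconstructed features and the original joint representation are already available as plain vectors of equal length; the reverse encoder itself is a trained network and is not shown.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, 1 - cos(u, v): 0 for the same direction, 1 for orthogonal."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


original = [0.8, 0.1, 0.6]        # toy stand-in for the joint representation
reconstructed = [0.8, 0.1, 0.0]   # toy vector missing one semantic component
error = cosine_distance(reconstructed, original)
```

In this toy case the third component stands for semantics that the structured output dropped, so the error is clearly greater than zero.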

[0095] Step S502: The logical contradiction check results, the semantic reconstruction error, and the associated confidence of the basic recognition results are fused through a lightweight evaluation network to generate the overall confidence and the sub-item confidences.

[0096] Lightweight evaluation networks are neural networks with relatively small parameter sizes (such as multilayer perceptrons, MLPs) used to synthesize multi-dimensional evaluation signals and output a final confidence score. Fusion refers to standardizing, weighting, and integrating evaluation metrics of different types and scales into a unified computational framework. Overall confidence and item confidence refer to the final comprehensive quality score; overall confidence assesses the credibility of the entire document structuring result, while item confidence refines the score to each key data item.

[0097] The general process is described as follows: The system first converts three signals into feature vectors: the logical contradiction check results from step S500 (which may be converted into a contradiction count or severity score), the semantic reconstruction error from step S501 (after normalization), and the associated confidence of the basic recognition results output in step S103 (the original model's confidence in each recognized element). These vectors are then concatenated or fused and input into a pre-trained lightweight evaluation network. The network learns the complex mapping between these intermediate signals and the final data quality (which can be obtained from historical manual annotation) and outputs scores on two levels: 1. Overall confidence, a scalar between 0 and 1, representing the overall evaluation of the quality of the entire structured output of the document; 2. Item confidence, a vector in which each element is an independent confidence score for one key data item (such as "invoice number," "amount," or "date"). During training, the evaluation network learns how much weight to give to different types of problems. For example, a serious logical contradiction may have a greater negative impact on the overall confidence than a moderate reconstruction error. These confidence scores provide a quantitative, multi-dimensional, and comprehensive basis for the downstream repair decision (step S106).
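The forward pass of such an evaluation network can be sketched as a tiny multilayer perceptron. Everything numeric here is an assumption: the weights are hand-set placeholders rather than trained parameters, the input is reduced to three scalars (contradiction score, normalized reconstruction error, recognition confidence), and the two outputs stand for the overall confidence and the confidence of a single item.

```python
import math
from typing import List, Sequence


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """One fully connected layer; `weights` has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(features: Sequence[float], w1, b1, w2, b2) -> List[float]:
    """ReLU hidden layer followed by sigmoid outputs in (0, 1)."""
    hidden = [max(0.0, h) for h in dense(features, w1, b1)]
    return [sigmoid(z) for z in dense(hidden, w2, b2)]


# Placeholder weights: contradictions and reconstruction error push the
# hidden activation down, recognition confidence pushes it up.
W1, B1 = [[-3.0, -1.5, 2.0]], [1.0]
W2, B2 = [[1.5], [1.2]], [-1.0, -1.0]

# features = [contradiction score, normalized reconstruction error, recognition confidence]
clean = evaluate([0.0, 0.1, 0.95], W1, B1, W2, B2)
conflicting = evaluate([1.0, 0.1, 0.95], W1, B1, W2, B2)
print(clean, conflicting)  # the conflicting document scores lower
```

With these placeholder weights, a document with a detected contradiction receives a much lower overall score than an otherwise identical clean document, which is the qualitative behavior the text describes.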

[0098] Referring to Figure 6, the repair strategies corresponding to the error type include: Step S600: When a conflict in data format, type, or preset business logic is detected, a predefined rule base is invoked for automatic correction.

[0099] Data format and type conflicts refer to extracted data content that does not conform to the expected format (e.g., a date should be "YYYY-MM-DD" but is identified as "2026-13-01") or data type (e.g., an "age" field contains non-numeric characters). Preset business logic conflicts refer to relationships between data items that violate common sense or hard rules in a specific business domain (e.g., "total invoice price" is not equal to the sum of all "individual prices," or "end date" is earlier than "start date"). A predefined rule base refers to a structured collection of rules, including format regular expressions, data type validation functions, business logic constraints (e.g., calculation formulas, logical assertions), and corresponding correction or completion actions.

[0100] The general process is described as follows: First, the system determines the specific type of conflict based on the error location results. Then, it retrieves matching correction rules from a predefined rule base indexed by conflict type. Each rule typically contains a "conditional pattern" and a "correction action." The system applies the problem data and its context to the conditional pattern; if a match is found, the corresponding correction action is triggered. Correction actions may include: format conversion (e.g., converting "2026/02/30" to "2026-02-28"), type casting (e.g., converting the string "25" to the integer 25), formula-based recalculation (e.g., recalculating the total price based on unit price and quantity), or validated replacement based on enumerated values (e.g., replacing gender inputs other than "male" / "female" with the default value). For example, when identifying a contract, if the "Start Date" of the "Contract Validity Period" field is "2026-12-01" and the "End Date" is identified as "2026-11-30", the "Date Order" rule in the rule base will be triggered, and the system may automatically correct the "End Date" to "2026-12-31" based on the context of the terms or the default rule.
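The "conditional pattern" plus "correction action" structure can be sketched with a regular expression and a callback. The names `Rule`, `RULES`, and `apply_rules` are hypothetical, and clamping an out-of-range month or day to the nearest valid value is an assumed correction policy that happens to reproduce the date example in the text.

```python
import calendar
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    conflict_type: str                  # index key in the rule base
    pattern: re.Pattern                 # "conditional pattern"
    action: Callable[[re.Match], str]   # "correction action"


def clamp_date(match: re.Match) -> str:
    """Normalize to YYYY-MM-DD, clamping month and day into their valid ranges."""
    year, month, day = (int(group) for group in match.groups())
    month = min(max(month, 1), 12)
    last_day = calendar.monthrange(year, month)[1]
    day = min(max(day, 1), last_day)
    return f"{year:04d}-{month:02d}-{day:02d}"


RULES: List[Rule] = [
    Rule("date_format",
         re.compile(r"^\s*(\d{4})\s*[/-]\s*(\d{1,2})\s*[/-]\s*(\d{1,2})\s*$"),
         clamp_date),
]


def apply_rules(conflict_type: str, value: str) -> Optional[str]:
    """Return the corrected value, or None if no rule of this type matches."""
    for rule in RULES:
        if rule.conflict_type != conflict_type:
            continue
        match = rule.pattern.match(value)
        if match:
            return rule.action(match)
    return None


print(apply_rules("date_format", "2026/02/30"))  # 2026-02-28
```

A `None` result corresponds to the "repair strategy unavailable" case, which the method handles by generating an interactive correction request.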

[0101] Step S601: When missing information or semantic ambiguity is detected, the multimodal context information of the problem area is converted into a prompt and input into a reasoning model for content completion or disambiguation.

[0102] Information gaps refer to the failure to extract relevant content from a document area where information is expected to be present (such as a cell in a table or a field in a form). Semantic ambiguity refers to the ambiguity of the extracted text content, making it impossible to directly determine its specific meaning (e.g., "Beijing" could refer to a city name or company name, while "2026" could refer to a year or product model). Multimodal contextual information refers to the comprehensive information surrounding the visual features, text content, layout structure, and element relationships of the problem area. A reasoning model refers to a trained Large Language Model (LLM) or a specialized sequence generation model capable of reasoning, completion, or interpretation based on contextual cues.

[0103] The general process is described as follows: The system first extracts the visual and textual features of the problem area and its surroundings from the multimodal joint representation and the element relationship analysis results, and organizes them into a structured natural language description or a specific prompt template. The prompt clearly describes the problem (e.g., "In the table below, the content of the cell in the third row and second column is missing. The column title is 'Unit Price', and the content of the first column in the same row is 'Product A'"), providing rich contextual clues. This prompt is then input into the reasoning model. Based on prior knowledge and contextual understanding, the model generates one or more possible completions or resolves the ambiguity. The system may use the confidence level of the model output or a majority voting strategy to select the most reasonable result as the correction value. For example, for a missing "unit of measurement" in a technical report, the system generates a prompt from the context "Length: 10.5": "What might be the missing unit after 'Length: 10.5'?". The reasoning model may output "centimeter" or "meter", and the system selects one to complete the text based on the unit usage in other parts of the document.
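Prompt construction and majority voting can be sketched as below. The template wording follows the example in the text; the function names and the way context rows are passed are assumptions, and the call to the reasoning model is deliberately left out because the text does not specify a model or API.

```python
from collections import Counter
from typing import List

PROMPT_TEMPLATE = (
    "In the table below, the content of the cell in row {row}, column {column} "
    "is missing. The column title is '{column_title}', and the content of the "
    "first column in the same row is '{row_label}'.\n"
    "Other rows of the same table:\n{context}\n"
    "What is the most likely content of the missing cell?"
)


def build_prompt(row: int, column: int, column_title: str,
                 row_label: str, context_rows: List[str]) -> str:
    """Fill the template with the context surrounding the problem area."""
    return PROMPT_TEMPLATE.format(row=row, column=column,
                                  column_title=column_title,
                                  row_label=row_label,
                                  context="\n".join(context_rows))


def majority_vote(candidates: List[str]) -> str:
    """Pick the most frequent candidate among several model outputs."""
    if not candidates:
        raise ValueError("no candidates to vote on")
    return Counter(candidates).most_common(1)[0][0]


prompt = build_prompt(3, 2, "Unit Price", "Product A",
                      ["Product B | 12.50", "Product C | 12.50"])
# Stand-in for several sampled answers from a reasoning model:
chosen = majority_vote(["centimeter", "meter", "centimeter"])
print(chosen)  # centimeter
```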

[0104] Step S602: When there is a conflict between text and visual information, the parsing results of the low-confidence modality are calibrated based on the information of the high-confidence modality.

[0105] Text-visual information conflict refers to inconsistencies between the text content recognized by the OCR engine and the information obtained from visual feature analysis of the same region (e.g., the OCR recognizes the number "100", but the visual shape of the number is closer to "700"). High-confidence modality and low-confidence modality refer to the results obtained from different data sources (visual and text) for the same information point, together with their respective confidence assessments. Confidence may originate from the score output by the recognition model itself or be calculated by a dedicated quality assessment module. Calibration refers to the process of modifying, replacing, or weighting and fusing the parsing results of the low-confidence modality based on the information from the high-confidence modality.

[0106] The general process is as follows: The system extracts the specific conflicting information points and corresponding text recognition and visual feature analysis results from the error localization results. It then evaluates (or acquires) the confidence scores of the text modality and the visual modality for that information point. The evaluation can be based on the original confidence score of the recognition model or performed through a lightweight cross-modal consistency verification module. The two confidence scores are compared to determine the high-confidence modality. Finally, the information from the high-confidence modality is used as an "anchor" to calibrate the results of the low-confidence modality. The calibration method can be direct replacement or a weighted average (if both have some confidence but a compromise is needed). For example, in invoice recognition, the OCR identifies the "total amount" as "1,500 yuan" with a confidence score of 0.7; while visual analysis of the shape features of the digital image in the same area infers the result as "1,800 yuan" with a confidence score of 0.9. The system determines that the visual modality has a higher confidence score, so it calibrates the text recognition result to "1,800 yuan" and updates the corresponding field in the structured data.
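The direct-replacement variant of this calibration can be sketched as follows. The names `ModalReading` and `calibrate` are hypothetical, and keeping the higher of the two confidences when both modalities agree is an assumption of this sketch; the weighted-average variant mentioned in the text is not shown.

```python
from dataclasses import dataclass


@dataclass
class ModalReading:
    modality: str      # "text" or "visual"
    value: str
    confidence: float  # score between 0 and 1


def calibrate(text: ModalReading, visual: ModalReading) -> ModalReading:
    """Resolve a text-visual conflict by adopting the high-confidence modality."""
    if text.value == visual.value:
        # No conflict: keep the value with the stronger of the two confidences.
        return ModalReading("fused", text.value,
                            max(text.confidence, visual.confidence))
    # Conflict: the high-confidence modality is the anchor (ties keep the text).
    return max((text, visual), key=lambda reading: reading.confidence)


ocr = ModalReading("text", "1,500 yuan", 0.7)
vision = ModalReading("visual", "1,800 yuan", 0.9)
result = calibrate(ocr, vision)
print(result.value)  # 1,800 yuan
```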

[0107] Referring to Figure 7, after features are extracted by the visual encoder and the text encoder and before dynamic weight fusion is performed, the method further includes: Step S700: Analyzing the visual feature sequence through a noise perception module to generate a noise mask image that characterizes the image quality.

[0108] The noise perception module is a deep learning-based image quality assessment or defect detection model. It can analyze visual feature sequences and identify various degraded areas in an image introduced during the shooting, scanning, or transmission process, such as blurring, uneven lighting, occlusion, dirt, and wrinkles. The noise mask image is a matrix or feature map corresponding to the spatial dimensions of the original image. The value at each position represents the quality score or noise intensity of the corresponding image region; a lower value (or a higher value, depending on the definition) indicates a poorer quality region and stronger noise interference.

[0109] The general process is described as follows: After extracting the visual feature sequence of the image through the visual encoder in step S101, the system inputs this sequence into a specially trained noise perception module. This module typically consists of a lightweight convolutional neural network or Transformer layer, and by learning the feature differences between a large number of clean and degraded images, it has the ability to evaluate the "cleanliness" of local features. The module analyzes the input visual feature sequence region by region and outputs a noise mask. In the mask, each pixel or feature region is assigned a continuous value (or discrete level) between 0 and 1, intuitively indicating the reliability of each part of the document image. For example, for a photographed document with a finger obstructing the view, the noise perception module will generate a low score (e.g., 0.2) in the area covered by the finger, indicating that the information in that area is highly unreliable; while in areas where the text is clear and the background is clean, it will generate a high score (e.g., 0.95).
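The output format of the noise mask can be illustrated with a toy Python sketch. This is not the trained noise perception module described above: it substitutes a hand-written local-contrast heuristic, operates on raw grayscale pixels rather than on the visual feature sequence, and uses a hypothetical function name (noise_mask). It only shows how each region can be assigned a score between 0 and 1, with low scores marking unreliable areas.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in 0..255


def noise_mask(image: Image, patch: int) -> List[List[float]]:
    """Return one quality score in [0, 1] per patch (1.0 = most reliable).

    Stand-in heuristic: local contrast, (max - min) / 255. The described
    module is a trained network; this only mimics the shape of its output.
    """
    rows, cols = len(image), len(image[0])
    mask = []
    for top in range(0, rows, patch):
        mask_row = []
        for left in range(0, cols, patch):
            pixels = [image[r][c]
                      for r in range(top, min(top + patch, rows))
                      for c in range(left, min(left + patch, cols))]
            mask_row.append((max(pixels) - min(pixels)) / 255)
        mask.append(mask_row)
    return mask


# 4x4 toy image: the left half is crisp black-on-white, the right half is a
# flat grey area (for example, a finger covering the page).
toy = [
    [0, 255, 120, 125],
    [255, 0, 122, 121],
    [0, 255, 124, 120],
    [255, 0, 121, 123],
]
mask = noise_mask(toy, patch=2)  # 2x2 grid of scores
```

In this toy case the crisp patches score 1.0 and the flat grey patches score close to 0, mirroring the finger-occlusion example. A real module would also have to distinguish clean blank background from degraded regions, which this heuristic cannot do.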

[0110] Step S701: Based on the noise mask image, the visual feature sequence is adaptively enhanced, and the confidence of the text features corresponding to low-quality image regions in the text feature sequence is down-weighted.

[0111] Adaptive enhancement refers to the targeted strengthening of visual features in high-quality regions based on quality guidance provided by the noise mask image, while suppressing or cleaning noise components in low-quality regions. Weight reduction refers to lowering the initial confidence of text recognition results originating from low-quality image regions in the text feature sequence to reflect the decreased confidence due to image issues.

[0112] The general process is described as follows: Based on the noise mask generated in step S700, the system performs two key adjustments: 1. Adaptive enhancement of the visual feature sequence: The noise mask is used as a spatial attention weight and multiplied or weighted with the original visual feature sequence. For visual feature vectors corresponding to high-score (high-quality) regions in the mask, the values are preserved or enhanced (e.g., multiplied by a coefficient greater than 1); for feature vectors corresponding to low-score (low-quality) regions, the values are suppressed (e.g., multiplied by a coefficient less than 1) or smoothly fused with adjacent high-quality features to reduce the negative impact of noise features. This process generates an "enhanced visual feature sequence" with more prominent effective information. 2. Down-weighting of text feature sequence confidence: The system aligns the spatial information of the noise mask with the positional information (bounding box) of the OCR text recognition result. For each recognized text segment, the average quality score of the corresponding image region is looked up in the noise mask. Then, based on that score, the initial confidence (or a simple mapping of it) that the text encoder may have attached to each text token in step S101 is down-weighted. For example, the initial confidence score of a "name" text identified in a blurry area may be lowered from 0.9 to 0.6 to warn subsequent processes that the information source itself may be unreliable.
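The two adjustments above can be sketched in Python as follows. All function names, the coefficients (1.2 and 0.5), the quality threshold (0.5), and the multiplicative down-weighting rule are illustrative assumptions; the description only requires a coefficient greater than 1 for high-quality regions, a coefficient less than 1 for low-quality regions, and some reduction of text confidence based on the average mask score. The sketch also assumes one mask score per visual feature vector and bounding boxes expressed in mask grid cells.

```python
from typing import List, Tuple

Mask = List[List[float]]         # quality score per grid cell, in 0..1
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def enhance_features(features: List[List[float]], scores: List[float],
                     boost: float = 1.2, damp: float = 0.5,
                     threshold: float = 0.5) -> List[List[float]]:
    """Scale each feature vector by a coefficient chosen from its quality score."""
    return [[x * (boost if s >= threshold else damp) for x in vec]
            for vec, s in zip(features, scores)]


def region_quality(mask: Mask, box: Box) -> float:
    """Average mask score over the grid cells covered by a text bounding box."""
    r0, c0, r1, c1 = box
    cells = [mask[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    if not cells:
        raise ValueError("bounding box covers no mask cells")
    return sum(cells) / len(cells)


def downweight(confidence: float, quality: float) -> float:
    """Assumed mapping: multiply the initial confidence by the region quality."""
    return confidence * quality


# Two feature vectors: one from a clean region (0.95), one from an occluded region (0.2).
enhanced = enhance_features([[1.0, 2.0], [1.0, 2.0]], [0.95, 0.2])

# A "name" field whose bounding box lies in the blurry second row of the mask.
mask = [[0.95, 0.95, 0.95],
        [0.60, 0.70, 0.70]]
name_quality = region_quality(mask, (1, 0, 2, 3))
name_conf = downweight(0.9, name_quality)  # roughly 0.6, as in the example above
```

The multiplicative rule is only one possible mapping; a clipped or nonlinear mapping would satisfy the description equally well.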

[0113] Through these preprocessing steps, the system proactively perceives and adapts to the uneven quality of input images before feature fusion, providing cleaner and more reliable visual feature input for the subsequent "dynamic weight fusion" and the entire recognition process. It also attaches an objective credibility warning to text information originating from low-quality image regions, thereby improving, at the source, the system's robustness to non-ideal real-world documents.

[0114] Based on the same inventive concept, embodiments of this application provide a free layout recognition system, including: an acquisition module for acquiring an image of the document to be processed; a memory for storing a program of the free layout recognition method described above; and a processor for loading and executing the program in the memory to implement any of the free layout recognition methods described above.

[0115] Those skilled in the art will clearly understand that, for convenience and brevity, the above division of functional modules is used only as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the system, device, and units described above, refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.

[0116] This application provides a computer-readable storage medium storing a computer program that can be loaded by a processor to execute the free layout recognition method described above.

[0117] Computer storage media include, for example, USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

[0118] Based on the same inventive concept, embodiments of this application provide a smart terminal, including a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute the free layout recognition method described above.

[0119] Those skilled in the art will clearly understand that, for convenience and brevity, the above division of functional modules is used only as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the system, device, and units described above, refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.

[0120] The above are all preferred embodiments of this application and are not intended to limit the scope of protection of this application. Any feature disclosed in this specification (including the abstract and drawings) may be replaced by other equivalent or similar features unless specifically stated otherwise. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.

Claims

1. A free layout recognition method, characterized in that it includes: acquiring an image of the document to be processed; extracting a visual feature sequence of the image using a visual encoder, and extracting a text feature sequence of the document using a text encoder; performing dynamic weight fusion on the visual feature sequence and the text feature sequence to generate a multimodal joint representation of the document; performing document layout understanding and element relationship parsing based on the multimodal joint representation, and outputting basic recognition results, which contain element category and location information, together with associated confidence scores; generating initial structured output data based on the basic recognition results, the element relationship parsing results, and the multimodal joint representation; inputting the initial structured output data, the multimodal joint representation, and the element relationship parsing results into a multimodal quality assessment engine to calculate the overall confidence of the initial structured output data and the sub-item confidence of key data items; if the overall confidence or the sub-item confidence of any key data item is lower than a first preset threshold, triggering an adaptive repair process to generate repaired structured data; performing confidence assessment on the repaired structured data; and if the assessment result is not lower than a second preset threshold, outputting the repaired structured data as the final structured data.

2. The free layout recognition method according to claim 1, characterized in that the step of triggering the adaptive repair process to generate repaired structured data includes: locating potential error sources or areas of missing information based on the confidence distribution, the element relationships, and the multimodal joint representation; and repairing the initial structured output data by invoking the repair strategy corresponding to the error type, thereby generating the repaired structured data.

3. The free layout recognition method according to claim 1, characterized in that the adaptive repair process further includes: if the confidence assessment result of the repaired structured data is still lower than the second preset threshold, or if no repair strategy is available, generating an interactive correction request, the correction request including at least the low-confidence data item, its corresponding original document image region, the system-recommended value, and alternative interpretations; receiving and recording user feedback on the correction request; and storing the feedback result, the corresponding multimodal joint representation, and the document image as a new training sample pair in an incremental learning sample library.

4. The free layout recognition method according to claim 3, characterized in that it further includes an online model evolution step: when the number of samples in the incremental learning sample library reaches a scale threshold, using the data in the sample library to perform incremental fine-tuning training on at least one of the visual encoder, the text encoder, the dynamic weight fusion module, or the layout understanding network, so as to optimize the model parameters online.

5. The free layout recognition method according to claim 1, characterized in that the steps by which the multimodal quality assessment engine calculates confidence include: constructing a document semantic consistency graph based on the element relationship parsing results to check for logical contradictions between different data items; reverse-encoding the initial structured output data into reconstructed features, and calculating the semantic reconstruction error between the reconstructed features and the original multimodal joint representation; and integrating the logical contradiction check results, the semantic reconstruction error, and the associated confidence scores of the basic recognition results, and using a lightweight evaluation network to generate the overall confidence and the sub-item confidences.

6. The free layout recognition method according to claim 1, characterized in that invoking the repair strategy corresponding to the error type includes: when a conflict in data format, data type, or preset business logic is detected, invoking a predefined rule base for automatic correction; when missing information or semantic ambiguity is detected, converting the multimodal context information of the problem area into a prompt and inputting it into an inference model for content completion or disambiguation; and when a conflict between textual and visual information is detected, calibrating the parsing result of the low-confidence modality based on the information of the high-confidence modality.

7. The free layout recognition method according to claim 1, characterized in that, after features are extracted by the visual encoder and the text encoder and before dynamic weight fusion is performed, the method further includes: analyzing the visual feature sequence through a noise perception module to generate a noise mask image that characterizes the image quality; and, based on the noise mask image, adaptively enhancing the visual feature sequence and down-weighting the confidence of the text features corresponding to low-quality image regions in the text feature sequence.

8. A free layout recognition system, characterized in that it includes: an acquisition module for acquiring an image of the document to be processed; a memory for storing a program of the free layout recognition method according to any one of claims 1 to 7; and a processor for loading and executing the program in the memory to implement the free layout recognition method according to any one of claims 1 to 7.

9. A smart terminal, characterized in that it includes a memory and a processor, wherein the memory stores a computer program that can be loaded by the processor to execute the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it stores a computer program that can be loaded by a processor to execute the method according to any one of claims 1 to 7.