Long-text-oriented analysis checking method and device, equipment and storage medium

By employing intelligent segmentation and hybrid scheduling methods, the problems of low efficiency and poor accuracy in long text verification are solved, achieving efficient and reliable automated verification, which is suitable for professional texts such as contracts and bidding documents.

CN121981104B · Active · Publication Date: 2026-06-12 · GUANGDONG UNITOLL COLLECTION INC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GUANGDONG UNITOLL COLLECTION INC
Filing Date: 2026-04-08
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies suffer from low efficiency and poor accuracy when processing professional texts of tens of thousands of words, especially when dealing with complex tables and multi-layered nested clauses, which are prone to misjudgment. Furthermore, large language models cannot effectively complete cross-chapter clause comparison and citation verification.

Method used

By identifying and segmenting paragraph boundaries of long texts, text fragments are generated and assigned to serial or parallel review nodes based on semantic relevance. A predefined set of structured instructions is loaded to constrain the output behavior of the long text review model, and system optimization is performed based on user feedback.

Benefits of technology

It significantly improves the efficiency of reviewing long texts, reducing the time from hours to minutes, while also improving the accuracy and reliability of the verification results. It is suitable for automated and intelligent verification of various professional long texts such as contracts and bidding documents.

Abstract

The application provides a long-text-oriented analysis and checking method, device, equipment and storage medium, which significantly improve auditing efficiency and accuracy through intelligent fragmentation and hybrid scheduling. At its core, the method uses the semantic correlation between text fragments to allocate fragments that have reference or dependency relationships to serial auditing nodes for deep, coherent context verification, and to allocate independent fragments to parallel auditing nodes for concurrent processing. This solves the problem of semantic fragmentation in long texts and shortens the auditing time from hours to minutes. In addition, loading a predefined structured instruction set in the auditing node constrains the output behavior of the large language model, which effectively suppresses the "hallucinations" and misjudgments the model produces when processing complex tables and nested clauses and significantly improves the reliability of the checking result. The method is flexible and scalable, and is suitable for automated, intelligent checking of various professional long texts such as contracts and bidding documents.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to a method, apparatus, device, and storage medium for analyzing and verifying long texts.

Background Technology

[0002] In specialized fields such as bidding documents, contracts, and policies and regulations, auditing texts of tens of thousands of words or more for compliance and consistency is a crucial but extremely tedious task. Traditionally, this work has relied heavily on manual labor, requiring reviewers to check every word and sentence. This is not only extremely time-consuming (for example, reviewing a 50,000-word document takes about two hours on average), but also prone to missed issues caused by fatigue or oversight, making it difficult to balance efficiency and accuracy. In recent years, with the development of large language model technology, solutions have emerged that use models directly for full-text auditing. However, because the model's context window is limited, extremely long texts often have to be split by force, which loses the semantic connections between chapters and makes it impossible to compare and verify cross-chapter clauses and citations effectively. At the same time, large models are prone to "hallucinations" when faced with complex tables and multi-layered nested clauses: they make incorrect judgments or generate false content, resulting in a high false positive rate. Furthermore, while automated tools based on predefined rules can improve efficiency to some extent, they lack deep semantic understanding and cannot recognize clauses that are worded differently but mean the same thing (such as "three years of experience" and "3 years of experience"), so their flexibility and adaptability are insufficient.

[0003] In summary, the shortcomings of the existing technology urgently need to be addressed.

Summary of the Invention

[0004] This invention provides a method, apparatus, device, and storage medium for analyzing and verifying long texts, in order to overcome the deficiencies in the prior art and realize the automated and intelligent verification of ultra-long professional texts.

[0005] This invention provides an analysis and verification method for long texts, comprising:

[0006] Receive long text to be verified;

[0007] Identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments;

[0008] Based on the semantic relationships between the text fragments, the text fragments are assigned to the target review nodes;

[0009] In the target review node, a predefined structured instruction set is loaded and applied to call the long text review model to perform text review tasks in order to obtain text verification results; the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating review results;

[0010] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0011] According to the long text analysis and verification method provided by the present invention, before the step of loading and applying a predefined structured instruction set and calling a long text verification model to perform a text verification task in the target verification node to obtain the text verification result, the method further includes:

[0012] When the text segment contains tables or content with nested hierarchical structures, the content of the tables or nested hierarchical structures is flattened and converted into a plain text entry format with hierarchical numbering.

[0013] According to the long text analysis and verification method provided by the present invention, after the step of loading and applying a predefined structured instruction set and calling a long text verification model to perform a text verification task in the target verification node to obtain the text verification result, the method further includes:

[0014] Obtain feedback information submitted by the user regarding the text verification results;

[0015] The collected feedback information is structured and associated with the corresponding text segments, review results, and the applied structured instruction set, and stored in the system knowledge base.

[0016] The long text review model is periodically invoked to analyze the feedback information accumulated in the system knowledge base, and a structured feedback report containing a specific problem description, preliminary analysis of the cause, and optimization suggestions is generated.

[0017] Based on the structured feedback report, the structured instruction set or text segmentation logic is updated to optimize the subsequent text verification effect.

[0018] According to the analysis and verification method for long texts provided by the present invention, the step of identifying and segmenting the paragraph boundaries of the long text to be verified and generating several text segments specifically includes:

[0019] The long text to be checked is coarsely segmented. Based on predefined regular expression matching rules, the format features in the document are identified to determine the initial segmentation.

[0020] The initial segments obtained from the coarse segmentation process are subjected to semantic verification and fine-grained segmentation to merge the incorrectly segmented related content.

[0021] Based on the requirements of the review task, filter out auxiliary text content that does not require semantic review;

[0022] Based on the results of the fine-grained segmentation and filtering, the aforementioned text segments are generated.

[0023] According to the analysis and verification method for long texts provided by the present invention, the step of assigning the text segments to the target review node based on the semantic correlation between the text segments specifically includes:

[0024] Analyze the semantic relationships between the text segments to identify a first group of fragments, which have direct references, indirect references, or contextual logical dependencies, and a second group of fragments, whose content is independent of each other;

[0025] The first group of fragments is assigned to the serial audit node to leverage its ability to maintain a continuous context for deep cross-validation;

[0026] The second group of fragments is allocated to the parallel audit nodes to leverage their parallel processing capabilities and improve audit throughput.

[0027] According to the analysis and verification method for long texts provided by the present invention, the step of allocating the second group of fragments to the parallel review nodes to improve the review throughput by utilizing their parallel processing capabilities specifically includes:

[0028] Real-time acquisition of status indicators of each parallel audit node in the system, including at least the current computing power utilization, the length of the task queue to be processed, and the available memory;

[0029] Based on the status indicators, the real-time load score of each parallel review node is calculated, and each independent text fragment in the second group of fragments is preferentially assigned to the parallel review node with the lowest current load score for processing.

[0030] According to the analysis and verification method for long texts provided by the present invention, the structured instruction set comprises:

[0031] The role definition instruction is used to limit the role of the long text review model to the text review executor;

[0032] Task boundary instructions are used to define the input, output, and processing scope of the text review task;

[0033] Output format instructions are used to force the long text review model to output the verification results in a structured manner according to a preset template;

[0034] Constraint instructions, used to prohibit the long text review model from rewriting, optimizing, or interpreting the long text to be reviewed and the template when the text review task is a review against a reference template.

[0035] The present invention also provides an analysis and verification device for long texts, comprising:

[0036] The text receiving module is used to receive long texts to be verified.

[0037] The text segmentation module is used to identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments;

[0038] The text allocation module is used to allocate the text fragments to the target review nodes based on the semantic relationships between the text fragments;

[0039] The text verification module is used in the target review node to load and apply a predefined structured instruction set to call the long text review model to perform text review tasks and obtain text verification results; the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating review results;

[0040] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0041] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the analysis and verification method for long text as described above.

[0042] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the analysis and verification method for long text as described above.

[0043] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the analysis and verification method for long text as described above.

[0044] This invention provides a method, apparatus, device, and storage medium for analyzing and verifying long texts, which significantly improve review efficiency and accuracy through intelligent fragmentation and hybrid scheduling. At its core, the method uses the semantic relationships between text fragments to allocate fragments with reference or dependency relationships to serial review nodes for deep, coherent contextual verification, while allocating independent fragments to parallel review nodes for concurrent processing. This solves the problem of semantic fragmentation in long texts and reduces review time from hours to minutes. Furthermore, loading predefined structured instruction sets into the review nodes constrains the output behavior of large language models, which effectively suppresses the "hallucinations" and misjudgments generated by the model when processing complex tables and nested clauses and significantly improves the reliability of the verification results. The method is flexible and scalable, and is suitable for automated, intelligent verification of various professional long texts such as contracts and tender documents.

Attached Figure Description

[0045] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0046] Figure 1 is a flowchart of the analysis and verification method for long texts provided by the present invention.

[0047] Figure 2 is a schematic diagram of the structure of the analysis and verification device for long texts provided by the present invention.

[0048] Figure 3 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0049] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0050] To address the problems in existing technologies, this invention proposes an analysis and verification method for long texts, enabling automated and intelligent verification of extremely long professional texts. As shown in Figure 1, the method includes, but is not limited to, the following steps:

[0051] Step 110: Receive the long text to be verified.

[0052] In step 110, the system obtains the long text file to be verified via a file upload interface, an API call, or by reading from a specified storage location. This long text is typically a professional document exceeding 50,000 words, such as a tender document, a contract, or a policy or regulation. Supported formats include common document formats such as PDF, DOCX, and TXT. The system first parses the document and converts it into a unified plain text format for subsequent processing.

[0053] Step 120: Identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments.

[0054] Step 120 is fundamental to efficient and accurate review. Its purpose is to intelligently and meaningfully segment long documents, rather than simply dividing them into equal-length parts. Specifically, the following sub-steps can be used:

[0055] First, coarse segmentation is performed: the system scans the entire text based on predefined regular expression matching rules, identifies specific format features such as chapters and clause numbers such as “Chapter 1”, “1.1”, “Article 1”, and “(I)”, as well as page breaks and specific title font marks, in order to determine the initial paragraph boundaries of the document structure and divide the long text into multiple coarse-grained paragraphs.

[0056] Next, semantic verification and fine-grained segmentation are performed: the above coarse segmentation results are input into a long text review model (such as a fine-tuned large language model). The model is instructed to perform semantic coherence analysis on the beginning and end of each coarse segment to determine whether the segmentation point is reasonable. The model merges semantically closely related sentences or paragraphs that have been incorrectly segmented (e.g., a complete clause incorrectly segmented by a form break), and further subdivides overly lengthy coarse segments according to the completeness of the clauses.

[0057] Next, content filtering is performed: based on the preset review task requirements, the system filters out auxiliary content that does not require in-depth semantic review from the finely divided paragraphs, such as purely procedural instructions, fixed glossary sections, table of contents, headers and footers, etc. This content usually has a fixed format and does not contain risk clauses that need to be verified.

[0058] Finally, the independent semantic units obtained after verification, segmentation, and filtering are determined as the final text segments, and each segment is assigned a unique identifier and location information.
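The coarse segmentation and identifier assignment described above can be illustrated with a minimal Python sketch. The pattern list, the `Fragment` fields, and the `frag-0000` identifier scheme are assumptions chosen for illustration; the source does not specify them, and the model-based semantic verification and filtering sub-steps are not reproduced here.

```python
import re
from dataclasses import dataclass

# Hypothetical heading patterns modelled on the examples in the text
# ("Chapter 1", "1.1", "Article 1", "(I)"); a real rule set would be larger.
HEADING_RE = re.compile(
    r"^(?:Chapter\s+\d+|Article\s+\d+|\d+(?:\.\d+)+|\([IVXLC]+\))(?=\s|$)",
    re.MULTILINE,
)


@dataclass
class Fragment:
    fragment_id: str  # unique identifier (assumed naming scheme)
    start: int        # character offset where the fragment begins
    end: int          # character offset where the fragment ends
    text: str


def coarse_segment(text: str) -> list[Fragment]:
    """Split text at every line that starts with a heading marker."""
    starts = [m.start() for m in HEADING_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    fragments = []
    for i, (s, e) in enumerate(zip(bounds, bounds[1:])):
        body = text[s:e].strip()
        if body:  # skip empty spans
            fragments.append(Fragment(f"frag-{i:04d}", s, e, body))
    return fragments
```

Each returned fragment carries its offsets, so a later review finding can be traced back to its position in the source document.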

[0059] Step 130: Based on the semantic relationships between the text fragments, assign the text fragments to the target review nodes.

[0060] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0061] The core of step 130 is intelligent scheduling that balances the depth and the efficiency of the review process. The specific implementation is as follows:

[0062] The system analyzes the content of all text segments. It identifies semantic relationships between segments by calculating keyword overlap, detecting the presence of explicit citation markers (such as "see item X"), or using an embedding model to calculate semantic vector similarity. As a result, all segments are divided into two categories: a first group of segments with direct citations, indirect references, or strong logical dependencies; and a second group of segments that are completely self-contained and independent of each other.
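As a rough illustration of this grouping, the sketch below combines two of the signals named above: explicit citation markers and keyword overlap. The citation pattern, the Jaccard measure, the 0.5 threshold, and the function names are all assumptions; the embedding-based similarity mentioned in the text is omitted to keep the example self-contained.

```python
import re

# Hypothetical citation pattern, modelled on the "see item X" example.
CITATION_RE = re.compile(
    r"\bsee\s+(?:item|clause|article|section)\s+([\w.]+)", re.IGNORECASE
)
WORD_RE = re.compile(r"[a-z]{4,}")  # crude keyword extractor (assumption)


def keywords(text: str) -> set[str]:
    return set(WORD_RE.findall(text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_by_dependency(fragments: dict[str, str], threshold: float = 0.5):
    """fragments maps a clause label such as "2.3" to its text.

    Returns (first_group, second_group) as sorted lists of labels:
    fragments linked by a citation or by high keyword overlap go to the
    first group, everything else to the second.
    """
    related: set[str] = set()
    labels = list(fragments)
    kw = {label: keywords(text) for label, text in fragments.items()}
    for i, a in enumerate(labels):
        # Explicit citations link the citing fragment and the cited one.
        for target in CITATION_RE.findall(fragments[a]):
            target = target.rstrip(".")
            if target in fragments and target != a:
                related.update((a, target))
        # Keyword overlap links pairs of fragments with similar vocabulary.
        for b in labels[i + 1:]:
            if jaccard(kw[a], kw[b]) >= threshold:
                related.update((a, b))
    return sorted(related), sorted(set(labels) - related)
```

A production system would likely track which fragments depend on which, so that each chain of related fragments can be sent to its own serial node; this sketch only separates the two groups.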

[0063] Node matching and allocation are performed based on the identification results:

[0064] The first set of fragments is assigned to a serial review node. This node is designed to maintain a persistent dialogue context or state. During review, these related fragments are sequentially input as a continuous task, allowing the model to "remember" and refer to the content of previous fragments when reviewing subsequent fragments. This enables deep cross-validation of complex issues such as cross-chapter clause references and consistency of commitments between contexts.

[0065] The second group of fragments is distributed to a pool of parallel review nodes. The pool consists of multiple review node instances that can work in parallel. These independent fragments are dispatched to different node instances at the same time, which makes full use of computing resources, greatly improves the throughput of the overall review task, and shortens the total processing time.
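The difference between the two node types can be sketched as follows. Here `review_fn` is a hypothetical stand-in for a call to the long text review model, and a thread pool stands in for the pool of node instances; neither is specified by the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# review_fn(fragment, context) is a hypothetical placeholder for the model call.
ReviewFn = Callable[[str, list[str]], str]


def run_serial(fragments: list[str], review_fn: ReviewFn) -> list[str]:
    """Review related fragments in order, passing earlier ones as context."""
    context: list[str] = []
    results = []
    for frag in fragments:
        results.append(review_fn(frag, list(context)))
        context.append(frag)  # later fragments can refer back to this one
    return results


def run_parallel(
    fragments: list[str], review_fn: ReviewFn, workers: int = 4
) -> list[str]:
    """Review independent fragments concurrently, each with an empty context."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: review_fn(f, []), fragments))


def fake_review(fragment: str, context: list[str]) -> str:
    """Toy reviewer that only reports how much context it was given."""
    return f"{fragment}|ctx={len(context)}"
```

With the toy reviewer, `run_serial(["a", "b", "c"], fake_review)` shows the context growing from fragment to fragment, while `run_parallel` always reviews with an empty context.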

[0066] In a preferred embodiment, the allocation process can also incorporate dynamic load balancing. The system monitors the real-time status of each parallel node (such as CPU / GPU utilization, memory usage, and task queue length). When allocating the second group of fragments, it gives priority to the node with the lightest current load, which optimizes system resource utilization and avoids congestion at a single node.
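The source does not give a formula for the load score. The sketch below assumes a weighted sum of the three monitored indicators, with weights (0.5, 0.3, 0.2) and a queue cap of 10 chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    compute_utilization: float  # 0.0 (idle) to 1.0 (saturated)
    queue_length: int           # tasks waiting on this node
    free_memory_ratio: float    # 0.0 (none free) to 1.0 (all free)


def load_score(node: NodeStatus, max_queue: int = 10) -> float:
    """Assumed weighted sum; a lower score means a lighter load."""
    queue_pressure = min(node.queue_length / max_queue, 1.0)
    memory_pressure = 1.0 - node.free_memory_ratio
    return (
        0.5 * node.compute_utilization
        + 0.3 * queue_pressure
        + 0.2 * memory_pressure
    )


def assign(fragments: list[str], nodes: list[NodeStatus]) -> dict[str, str]:
    """Send each fragment to the node with the lowest current score."""
    plan = {}
    for frag in fragments:
        target = min(nodes, key=load_score)
        plan[frag] = target.name
        target.queue_length += 1  # account for the new task before the next pick
    return plan
```

Incrementing the queue length after each assignment keeps one lightly loaded node from receiving every fragment in a batch.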

[0067] Step 140: In the target review node, by loading and applying a predefined structured instruction set, the long text review model is invoked to perform the text review task to obtain the text verification result; the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating the review result.

[0068] Step 140 is crucial for performing the core audit operations and ensuring the reliability of the results. When each target audit node (whether serial or parallel) initiates an audit task, the following operations are performed:

[0069] First, load the structured instruction set. This instruction set is a predefined collection of text instructions used to strictly constrain the behavior of large language models. Taking a contract clause comparison review task as an example, the instruction set might include:

[0070] Role definition instruction: "You are now a professional contract compliance review expert."

[0071] Task boundary instructions: "Your task is to compare the consistency of the 'Contract to be Reviewed' with the corresponding clauses in the 'Standard Template'. Only review the given text and do not consult external information on your own."

[0072] Example of a task boundary instruction:

[0073] Your task is to compare the textual consistency between the [Contract Under Review] and the corresponding clauses in the [Standard Template]. Specifically, this means identifying any differences in wording between the [Contract Under Review] and the [Standard Template], including but not limited to wording, punctuation, numbers, and clause order.

[0074] Scope limitation: Only the two given documents are compared, and no external information, laws and regulations or past cases may be introduced or referenced.

[0075] Binding directive: "No rewriting, optimization, or supplementary interpretation of any clauses in either document is permitted. In the event of any contradiction between the clauses in the template, the original template shall prevail. Your output must be strictly based on the text and no subjective inferences are allowed."

[0076] Example of behavioral constraint instructions:

[0077] When performing the comparison and output, you must strictly adhere to the following rules:

[0078] Modification is prohibited: No clauses in the [Contract Under Review] or [Standard Template] may be rewritten, polished, optimized, or supplemented with additional interpretations.

[0079] Subjective judgment is prohibited: Your output must be strictly based on the text content and must not contain any form of subjective inference (e.g., you must not judge which expression is better or speculate on the reasons for the modification).

[0080] Handling internal contradictions: If there are contradictions within the standard template itself (e.g., clause A conflicts with clause B), the original text of the standard template must prevail. Simply describe the inconsistencies between the contract under review and the standard template in the discrepancy list; there is no need to resolve the contradictions within the template itself.

[0081] Output format instructions: "Please output according to the following template: [Issue Number] Issue Type: (e.g., missing clause, misrepresentation). Issue Location: Chapter X, Article X of the contract under review. Original text: '……'. Template Corresponding Content: '……'."

[0082] Then, the review task is executed: the "structured instruction set" is combined with the "text fragment content" (or the temporary focused document generated from it) assigned to this node to form a complete prompt, which is then submitted to the long text review model for processing. Under these strict constraints, the model generates its analysis results in the required structured format, which greatly suppresses "hallucinations," irrelevant comments, and the formatting chaos caused by free interpretation, and ensures that the output is accurate, consistent, and directly usable.
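One way to picture the prompt assembly is the sketch below. The four instruction fields mirror the instruction types listed above, but the `InstructionSet` class, the `version` field, and the `## SECTION` layout are assumptions made for illustration, not the format used by the invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionSet:
    version: str        # identifier kept so that feedback can reference it
    role: str           # role definition instruction
    task_boundary: str  # task boundary instruction
    constraints: str    # binding / behavioral constraint instruction
    output_format: str  # output format instruction


def build_prompt(instr: InstructionSet, fragment_text: str) -> str:
    """Concatenate the instructions and the fragment into one prompt."""
    sections = [
        ("ROLE", instr.role),
        ("TASK", instr.task_boundary),
        ("CONSTRAINTS", instr.constraints),
        ("OUTPUT FORMAT", instr.output_format),
        ("TEXT TO REVIEW", fragment_text),
    ]
    return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)
```

Placing the fragment last keeps every instruction ahead of the text under review; this ordering is a design choice of the sketch.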

[0083] Finally, the verification results generated at each review node are summarized, deduplicated, and formatted to generate a complete review report that includes problem location, type, and original text citations, thus completing this long text analysis and verification process.
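The summarizing and deduplication step could look like the following sketch. The `Finding` fields follow the output template quoted earlier; the deduplication key (issue type, location, whitespace-normalized original text) and the plain string sort by location are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    issue_type: str
    location: str
    original_text: str
    template_text: str


def merge_findings(per_node: list[list[Finding]]) -> list[Finding]:
    """Merge the findings of all nodes, dropping duplicates."""
    seen = set()
    merged = []
    for findings in per_node:
        for f in findings:
            # Normalize whitespace so trivially different quotes still match.
            key = (f.issue_type, f.location, " ".join(f.original_text.split()))
            if key not in seen:
                seen.add(key)
                merged.append(f)
    return sorted(merged, key=lambda f: f.location)  # plain string order


def format_report(findings: list[Finding]) -> str:
    """Render findings in the template used by the output format instruction."""
    lines = []
    for n, f in enumerate(findings, start=1):
        lines.append(
            f"[{n}] Issue Type: {f.issue_type}. Issue Location: {f.location}. "
            f"Original text: '{f.original_text}'. "
            f"Template Corresponding Content: '{f.template_text}'."
        )
    return "\n".join(lines)
```

Sorting by the location string is only a placeholder; a real report would more likely order findings by the position of each fragment in the source document.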

[0084] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

[0085] As a further optional embodiment, before the step of loading and applying a predefined set of structured instructions to call a long text review model to perform a text review task in the target review node to obtain the text verification result, the method further includes:

[0086] When the text segment contains tables or content with nested hierarchical structures, the content of the tables or nested hierarchical structures is flattened and converted into a plain text entry format with hierarchical numbering.

[0087] In this embodiment, when the system detects that a text fragment assigned to the target review node contains tables, charts, or clauses with a multi-level nested hierarchy (such as "a, (a), 1, (1)"), it first starts an automated preprocessing process, in order to eliminate the comprehension interference and hallucination risk that these non-flat, non-continuous text structures pose to the long text review model.

[0088] The core of this preprocessing workflow is "flattening" and "structured transformation." The system parses the cell structure of the table, identifies the correspondence between the table header and data rows, and associates the content of each cell with its row and column coordinates in a left-to-right, top-to-bottom order. For nested hierarchical structures, the system clarifies the hierarchical and parallel relationships between clauses by recognizing indentation, specific numbering, or conjunctions. Subsequently, the system converts the parsed content into plain text entries with clear hierarchical numbering. For example, a complex bidder qualification requirements table might be converted into: "1.1 Financial Requirements: 1.1.1 Audit Report: Reports for the past three years are required; 1.1.2 Debt-to-Equity Ratio: Not higher than 70%...". For nested clauses, it is converted into a linear list similar to "2. Delivery Standards: 2.1 Quality Requirements: 2.1.1 Should comply with national standard GB / T...; 2.1.2 No defects...".

[0089] This process essentially translates visual or structured information losslessly into a linear text sequence that is easier for the model to process and can clearly perceive hierarchical relationships. After this preprocessing, the resulting plain text entries with hierarchical numbers, along with other plain text content from the original segments, are provided as "cleaned" and "standardized" input to subsequent step 140 for review. This design significantly reduces parsing errors and the probability of content fabrication caused by the model directly processing complex formats, improving the accuracy of key review data.
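A minimal sketch of the flattening step is shown below, assuming the table or nested clause has already been parsed into nested `(title, content)` pairs. That intermediate representation and both function names are hypothetical; parsing real PDF or DOCX tables is outside the scope of the example.

```python
def flatten(nodes: list, prefix: str = "") -> list[str]:
    """Turn nested (title, content) pairs into hierarchically numbered lines.

    content is either a string (a leaf) or a list of child pairs.
    """
    lines = []
    for index, (title, content) in enumerate(nodes, start=1):
        number = f"{prefix}.{index}" if prefix else str(index)
        if isinstance(content, list):
            lines.append(f"{number} {title}:")
            lines.extend(flatten(content, number))
        else:
            lines.append(f"{number} {title}: {content}")
    return lines


def table_to_nodes(header: list[str], rows: list[list[str]]) -> list:
    """Pair each cell with its column header, reading left to right and
    top to bottom; the first column is treated as the row title."""
    return [
        (row[0], [(h, c) for h, c in zip(header[1:], row[1:])])
        for row in rows
    ]
```

Calling `flatten(tree, "1")` on a tree that holds the "Financial Requirements" example produces lines numbered 1.1, 1.1.1, and 1.1.2, matching the numbering style shown above.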

[0090] As a further optional embodiment, after the step of loading and applying a predefined set of structured instructions to call the long text review model to perform the text review task in the target review node to obtain the text verification result, the method further includes:

[0091] Obtain feedback information submitted by the user regarding the text verification results;

[0092] The collected feedback information is structured and associated with the corresponding text segments, review results, and the applied structured instruction set, and stored in the system knowledge base.

[0093] The long text review model is periodically invoked to analyze the feedback information accumulated in the system knowledge base, and a structured feedback report containing a specific problem description, preliminary analysis of the cause, and optimization suggestions is generated.

[0094] Based on the structured feedback report, the structured instruction set or text segmentation logic is updated to optimize the subsequent text verification effect.

[0095] In this embodiment, the method also includes a closed-loop optimization mechanism driven by user feedback. This mechanism is activated after step 140 (once the text verification result is obtained) and allows the system to learn from its own results and improve continuously.

[0096] Specifically, the mechanism includes the following:

[0097] First, when presenting the text verification results (such as the audit report) to the user, the system simultaneously provides an interactive feedback interface. Through this interface, users can evaluate the specific conclusions in the report, such as marking a "problem" as a false positive or a missed report (false negative), or providing suggestions for correction and supplementary explanations regarding the audit logic.

[0098] Next, the system automatically captures the user's feedback action. Then, it associates and structurally encapsulates the feedback content (e.g., "Term X should not be judged as inconsistent"), the feedback type (false positive / false negative / suggestion), and the context triple that triggered the feedback. The "context triple" includes: 1) the original text fragment corresponding to the audit conclusion; 2) the original audit result at the time; and 3) the version or identifier of the structured instruction set loaded during the audit. The encapsulated complete feedback entry is stored in a dedicated system knowledge base.
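The feedback entry and its context triple can be modelled as a small record, as in the sketch below. The field names, the three type labels, and the in-memory JSON store are assumptions; the source only states what information is associated and that it is stored in a system knowledge base.

```python
import json
from dataclasses import asdict, dataclass

FEEDBACK_TYPES = {"false_positive", "false_negative", "suggestion"}


@dataclass
class FeedbackEntry:
    feedback_type: str            # one of FEEDBACK_TYPES
    content: str                  # what the user said
    fragment_text: str            # context triple 1: original text fragment
    review_result: str            # context triple 2: review result at the time
    instruction_set_version: str  # context triple 3: instruction set identifier

    def __post_init__(self):
        if self.feedback_type not in FEEDBACK_TYPES:
            raise ValueError(f"unknown feedback type: {self.feedback_type}")


class KnowledgeBase:
    """Toy in-memory store that keeps each entry as a JSON string."""

    def __init__(self):
        self._rows: list[str] = []

    def add(self, entry: FeedbackEntry) -> None:
        self._rows.append(json.dumps(asdict(entry), ensure_ascii=False))

    def entries(self) -> list[FeedbackEntry]:
        return [FeedbackEntry(**json.loads(row)) for row in self._rows]
```

Storing the instruction set version with each entry is what later allows a recurring error to be traced to a specific wording of the instructions.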

[0099] Subsequently, the system automatically executes analysis tasks periodically (e.g., weekly or monthly). In this task, the system invokes a long text review model, instructing it to analyze feedback entries accumulated in its knowledge base. The model's tasks are: to identify and statistically analyze frequently occurring feedback types; to summarize potential patterns leading to false positives or false negatives (e.g., "when the word 'in principle' appears in a clause, the model is prone to misinterpreting it as an absolute commitment"); and to preliminarily analyze whether the causes of these patterns are related to vague wording of specific instructions, improper segmentation boundaries, or deficiencies in model knowledge. Based on this, the model automatically generates a structured feedback report, which systematically lists the main problem patterns, examples, root cause inferences, and specific optimization suggestions (e.g., "it is recommended to add exemption rules for vague terms such as 'in principle' and 'depending on the circumstances' to the instruction set").
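Before the accumulated entries are handed to the model, the statistical part of this periodic task could be prepared with ordinary code. The sketch below is a simplified assumption: it only counts feedback types and instruction set versions, and leaves pattern discovery and root cause analysis to the model as described above. The dictionary keys are hypothetical.

```python
from collections import Counter


def summarize_feedback(entries: list[dict], top_n: int = 3) -> dict:
    """Count feedback by type and by instruction set version.

    Each entry is a dict with the (assumed) keys "feedback_type" and
    "instruction_set_version".
    """
    by_type = Counter(e["feedback_type"] for e in entries)
    by_version = Counter(e["instruction_set_version"] for e in entries)
    return {
        "total": len(entries),
        "by_type": dict(by_type),
        "most_flagged_versions": by_version.most_common(top_n),
    }
```

The resulting summary can be placed in the analysis prompt so that the model starts from counts rather than from raw entries.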

[0100] Finally, system maintenance personnel or the automatic update program make targeted adjustments and optimizations to the key configurations in the system based on the generated structured feedback report. This mainly includes two types of updates: 1) Optimizing the structured instruction set: revising, supplementing, or refining instructions according to the report's suggestions to make constraints clearer and task guidance more precise; 2) Optimizing the text fragmentation logic: if feedback indicates that problems often stem from improper segmentation leading to semantic breaks, the regular expression rules or validation logic of the fragmentation module can be adjusted. Through this closed loop, the system's audit accuracy and adaptability are continuously improved in practice, forming a virtuous cycle of "execution-feedback-learning-optimization".

[0101] As a further optional embodiment, the step of identifying and segmenting the paragraph boundaries of the long text to be checked and generating several text segments specifically includes:

[0102] Coarsely segmenting the long text to be checked by identifying format features in the document based on predefined regular expression matching rules, so as to determine an initial segmentation;

[0103] Performing semantic verification and fine-grained segmentation on the initial segments obtained from the coarse segmentation, so as to merge related content that was incorrectly split;

[0104] Filtering out, based on the requirements of the review task, auxiliary text content that does not require semantic review;

[0105] Generating the aforementioned text segments based on the results of the fine-grained segmentation and filtering.

[0106] In this embodiment, the system first scans the received long document in plain text format, using a set of predefined regular expression matching rules to identify explicit formatting features in the document. These rules are designed to capture chapter and clause numbering patterns such as "Chapter X", "X.Y", "(X)", "Article 1", and specific markers that may indicate paragraph breaks (such as consecutive line breaks, page breaks, and specific heading style transition points). By matching these features, the system initially segments the long text into multiple initial paragraphs. This step is fast and reliable, providing a basic structural framework for subsequent processing.
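The coarse segmentation described above can be sketched as follows. The heading pattern and the function name `coarse_segment` are hypothetical; a real rule set would be tuned to the document type, and this sketch treats only blank lines and form feeds as paragraph-break markers.

```python
import re

# Hypothetical numbering patterns: "Chapter 3", "Article 1", "1.2", "(1)".
HEADING = re.compile(
    r"^\s*(?:Chapter\s+\d+|Article\s+\d+|\d+(?:\.\d+)+|\(\d+\))(?=\s|$)"
)


def coarse_segment(text: str) -> list[str]:
    """Split plain text into initial paragraphs using format features only."""
    segments: list[str] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            segments.append("\n".join(current).strip())
            current.clear()

    # Treat a form feed (page break) like a blank line.
    for line in text.replace("\f", "\n\n").splitlines():
        if not line.strip():        # blank line: paragraph break
            flush()
        elif HEADING.match(line):   # numbered heading: start a new segment
            flush()
            current.append(line)
        else:
            current.append(line)
    flush()
    return segments
```

A line such as "1.5 million units shall be delivered" would also match the numbering pattern and be split off incorrectly. Purely format-based rules cannot avoid this kind of error, so it is left to a later semantic check.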

[0107] Because relying solely on formatting rules can lead to incorrect segmentation (such as cutting a complete clause due to a page break) or overly coarse segmentation (such as merging multiple independent clauses into a lengthy chapter), the system inputs the aforementioned initial paragraphs into a long text review model for semantic analysis. The system prompts the model to determine whether the start and end points of each initial paragraph constitute a complete semantic unit and performs two core operations: merging incorrectly segmented, semantically closely connected content; and simultaneously subdividing overly long paragraphs that contain multiple independent arguments or sub-clauses. For example, the model might merge "Party A's Rights... (page break)... and Obligations" which were incorrectly separated due to formatting rules, while breaking down a large chapter, "Technical Specifications," into more refined segments such as "1.1 Performance Indicators" and "1.2 Test Methods" according to its internal logic.

[0108] Not all content in a document requires equal review resources. Based on the specific needs of the review task (e.g., focusing on compliance review), the system applies filtering rules to identify and remove auxiliary text content that does not require in-depth semantic review. This content typically has a fixed format and low information content, such as standard bidding process descriptions, general glossary sections, simple reference lists, and acknowledgments. Filtering this content effectively reduces distractions in subsequent review stages, allowing the system to focus more on substantive content such as risk clauses and core commitments.

[0109] After semantic verification, fine-grained segmentation, and content filtering, the system obtains a series of text units that are semantically relatively complete, independent, and strongly related to the review task. These units are formally identified as several text fragments. The system generates a unique identifier for each fragment and records its logical position in the original document (e.g., "from near Chapter 3, Article 2.1"), thus completing the transformation from the original long text to a structured, schedulable set of fragments, laying the foundation for subsequent intelligent allocation based on semantic relevance.
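A minimal sketch of the filtering and identification steps is given below. The list of auxiliary titles, the identifier format, and the `(position, title, text)` input shape are assumptions chosen for illustration; the embodiment does not prescribe them.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical titles of sections that need no in-depth semantic review.
AUXILIARY_TITLES = ("glossary", "references", "acknowledgments")


@dataclass(frozen=True)
class Fragment:
    fragment_id: str  # unique identifier
    position: str     # logical position in the original document
    text: str


def is_auxiliary(title: str) -> bool:
    return title.strip().lower().startswith(AUXILIARY_TITLES)


def build_fragments(units: list[tuple[str, str, str]]) -> list[Fragment]:
    """units: (position, title, text) tuples in document order."""
    fragments: list[Fragment] = []
    for position, title, text in units:
        if is_auxiliary(title):
            continue  # filtered out before review
        index = len(fragments)
        digest = hashlib.sha256(f"{position}\n{text}".encode("utf-8")).hexdigest()[:8]
        fragments.append(Fragment(f"frag-{index:04d}-{digest}", position, text))
    return fragments
```

Combining a running index with a content digest keeps identifiers unique even when two fragments contain identical text.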

[0110] As a further optional embodiment, the step of assigning the text segments to the target review node based on the semantic correlation between the text segments specifically includes:

[0111] Analyzing the semantic relationships between the text segments to identify a first group of fragments, which have direct references, indirect references, or contextual logical dependencies, and a second group of fragments, whose content is independent of each other;

[0112] The first group of fragments is assigned to the serial audit node to leverage its ability to maintain a continuous context for deep cross-validation;

[0113] The second group of fragments is allocated to the parallel audit nodes to leverage their parallel processing capabilities and improve audit throughput.

[0114] In this embodiment, the system first evaluates the semantic relationships between pairs or groups of the generated text fragments. By calculating the text embedding vectors of the fragments and measuring cosine similarity, or by detecting explicit cross-reference expressions (such as "according to the above X" or "see the table below"), the system can automatically identify one or more groups of fragments with direct citations, indirect references, or strong contextual logical dependencies, and classify them as the first text fragments (i.e., associated fragment groups). The remaining fragments, which have no obvious dependencies or repetitions in terms of topic, entity, or argument, are classified as the second text fragments (i.e., independent fragment groups). This analysis ensures that subsequent allocation has a semantic basis.
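The grouping described above can be sketched as follows, assuming the embedding vectors have already been computed by some embedding model (not shown). The similarity threshold of 0.8, the reference pattern, and the union-find grouping are illustrative assumptions, not values or techniques stated in the embodiment.

```python
import math
import re
from itertools import combinations

# Hypothetical pattern for explicit cross-references such as "see Article 1".
REF = re.compile(r"(?:see|according to)\s+(Article\s+\d+)", re.IGNORECASE)


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_fragments(texts, vectors, threshold=0.8):
    """Return (associated groups, independent fragment indices)."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    # Link fragments whose embeddings are sufficiently similar.
    for i, j in combinations(range(len(texts)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            union(i, j)

    # Link a fragment to the fragment its explicit reference points at.
    for i, text in enumerate(texts):
        for label in REF.findall(text):
            target = re.compile(re.escape(" ".join(label.split())) + r"\b", re.IGNORECASE)
            for j, other in enumerate(texts):
                if j != i and target.match(other):
                    union(i, j)

    groups: dict[int, list[int]] = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    associated = [g for g in groups.values() if len(g) > 1]
    independent = [g[0] for g in groups.values() if len(g) == 1]
    return associated, independent
```

Indices inside each associated group stay in document order, which matches the requirement that associated fragments be reviewed in their order of appearance.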

[0115] Secondly, node matching and allocation are performed.

[0116] The system routes each fragment to the corresponding target review node based on the fragment type:

[0117] Assigning associated fragment groups to serial review nodes: All fragments belonging to the same associated group are assigned to the same serial review node according to their order of appearance in the original document. This node has the ability to maintain a continuous context session state, and can continuously retain key information of previous fragments during processing, thereby achieving deep cross-validation of cross-fragment references, clause conflicts, and commitment consistency, effectively avoiding semantic loss caused by text fragmentation.

[0118] Assigning independent fragments to parallel review nodes: Since there are no context dependencies between independent fragments, they can be concurrently assigned to multiple parallel review nodes. These nodes are typically deployed on distributed or asynchronous computing resources and can process multiple fragments simultaneously, thereby significantly improving the overall review throughput and processing efficiency of the system. This is especially suitable for reviewing independent attachments, appendices, or parallel chapters in large documents.
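The two routing paths can be contrasted in a short sketch. The `review` function is a placeholder standing in for a call to the long text review model, and the thread pool is only one possible way to realize parallel review nodes; both are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def review(fragment: str, context: list[str]) -> str:
    """Placeholder for a call to the long text review model (hypothetical)."""
    return f"reviewed {fragment!r} with {len(context)} prior fragment(s)"


def review_serial(group: list[str]) -> list[str]:
    """Serial node: fragments are processed in order and context accumulates."""
    context: list[str] = []
    results: list[str] = []
    for fragment in group:
        results.append(review(fragment, context))
        context.append(fragment)  # retained for later fragments in the group
    return results


def review_parallel(fragments: list[str], workers: int = 4) -> list[str]:
    """Parallel nodes: independent fragments are reviewed concurrently, no shared context."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: review(f, []), fragments))
```

The serial path trades throughput for continuity of context, while the parallel path does the opposite, which is why the split between associated and independent fragments matters.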

[0119] Furthermore, the system can also introduce a dynamic load balancing mechanism. When allocating independent fragments to parallel audit nodes, the system can monitor the computing resource utilization and task queue length of each node in real time, and dynamically select target nodes based on load scores to avoid resource idleness or overload, ensuring stable and efficient system operation.

[0120] Through the aforementioned semantic-based intelligent allocation mechanism, this invention technically achieves an organic combination of review depth and processing efficiency, providing a reliable task scheduling foundation for the automated and accurate verification of long texts.

[0121] As a further optional embodiment, the step of allocating the second group of fragments to the parallel review nodes to utilize their parallel processing capabilities to improve review throughput specifically includes:

[0122] Real-time acquisition of status indicators of each parallel audit node in the system, including at least the current computing power utilization, the length of the task queue to be processed, and the available memory;

[0123] Based on the status indicators, the real-time load score of each parallel review node is calculated, and each independent text fragment in the second group of fragments is preferentially assigned to the parallel review node with the lowest current load score for processing.

[0124] In this embodiment, the first step is real-time monitoring and indicator collection.

[0125] The system has a built-in monitoring module that continuously polls or receives heartbeats and status reports from all available parallel audit nodes. Key status indicators collected include at least:

[0126] Current computing power utilization: Reflects the percentage of computing load on the node's processors (CPU / GPU);

[0127] Pending task queue length: The number of tasks that the node has received but has not yet started processing;

[0128] Available memory: The amount of memory that the node is not currently using.

[0129] These metrics collectively characterize the real-time busyness and resource availability of each node.

[0130] Secondly, calculate the real-time load score.

[0131] The system calculates a single real-time load score by combining the above multiple status indicators according to a preset weighting algorithm. A simple implementation can be a weighted summation, for example: Load Score = a * Computational Utilization + b * Queue Length - c * Available Memory (where a, b, and c are positive coefficients set based on experience). A higher score indicates a heavier node load and a weaker ability to process new tasks; a lower score indicates a relatively idle node and a stronger processing capacity.
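The weighted summation above can be written out as a small function. The coefficient values and units below are placeholders, not values from the embodiment. Because the three indicators have different units, the coefficients (or a prior normalization step) would need to account for scale in practice.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    utilization: float          # computing power utilization, as a fraction 0..1
    queue_length: int           # tasks received but not yet started
    available_memory_gb: float  # memory not currently in use


def load_score(status: NodeStatus, a: float = 1.0, b: float = 0.1, c: float = 0.05) -> float:
    """Load Score = a * utilization + b * queue length - c * available memory."""
    return (
        a * status.utilization
        + b * status.queue_length
        - c * status.available_memory_gb
    )
```

With these placeholder coefficients, a node at 50% utilization with 3 queued tasks and 8 GB of free memory scores about 0.4, while an otherwise identical node with an empty queue scores lower and is therefore preferred.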

[0132] Finally, intelligent allocation based on scores is performed.

[0133] When distributing the independent text fragments in the second group of fragments, the scheduler does not use random or simple round-robin allocation. Instead, it decides based on real-time load scores: the system queries the current scores of all parallel review nodes and preferentially sends each fragment to the node with the lowest current load score. If multiple nodes have the same score or are all in a low-load state, other strategies (such as proximity allocation or node affinity) can be combined for allocation.
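A minimal sketch of score-based selection with a simple tie-breaking rule follows. The `preferred` parameter stands in for an affinity or proximity strategy, and the alphabetical fallback is an arbitrary assumption; neither is prescribed by the embodiment.

```python
from typing import Optional


def pick_node(
    scores: dict[str, float],
    preferred: Optional[str] = None,
    tolerance: float = 1e-9,
) -> str:
    """Return the node with the lowest load score; break ties by affinity, then by name."""
    if not scores:
        raise ValueError("no parallel review nodes available")
    lowest = min(scores.values())
    candidates = sorted(n for n, s in scores.items() if s - lowest <= tolerance)
    if preferred in candidates:
        return preferred  # affinity applies only among equally idle nodes
    return candidates[0]
```

For example, with scores `{"n1": 0.4, "n2": 0.2, "n3": 0.2}` the function returns `"n2"` by default and `"n3"` when `preferred="n3"`, but never `"n1"`, since affinity does not override a higher load.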

[0134] Through this dynamic load balancing mechanism, the system can automatically allocate computing tasks to the most idle resources, effectively preventing individual nodes from becoming overloaded due to task accumulation, while ensuring that other nodes' resources are fully utilized. This not only significantly improves the overall throughput of the system but also enhances the system's stability and response speed when facing fluctuating task volumes, serving as a key guarantee for achieving efficient parallel processing.

[0135] As a further optional embodiment, the structured instruction set includes:

[0136] The role definition instruction is used to limit the role of the long text review model to the text review executor;

[0137] Task boundary instructions are used to define the input, output, and processing scope of the text review task;

[0138] Output format instructions are used to force the long text review model to output the verification results in a structured manner according to a preset template;

[0139] A binding instruction is provided to restrict the long text review model from rewriting, optimizing, or interpreting the long text to be reviewed and the template when the text review task is the review of the reference template.

[0140] In this embodiment, the four types of instructions function as follows. 1. Role Definition Instructions: The primary function of these instructions is to define the model's role in this task, strictly limiting it to a specific domain expert or executor framework. For example, the instructions might explicitly state: "You are an expert focused on the compliance review of tender documents" or "You are now acting as an automated assistant for verifying the consistency of contract terms." By assigning the model a clear professional role, it is guided to invoke relevant knowledge patterns and judgment criteria, avoiding the arbitrariness of general dialogue.

[0141] 2. Task Boundary Instructions: These instructions clearly define the scope of the review task, explicitly telling the model what it needs to do and what it should not do. They specify the composition of the input (e.g., "You will receive contract paragraph A to be reviewed and reference template paragraph B"), the nature of the output (e.g., "Only determine if there are substantial differences between the two"), and the boundaries of processing (e.g., "Only analyze what is explicitly stated in the text; do not evaluate the commercial reasonableness of the clauses"). This prevents task sprawl and keeps the model from handling irrelevant issues.

[0142] 3. Output Format Instructions: These instructions enforce a predefined structured template that the model's response must follow. For example, requiring the model to output each item strictly according to the following format: "[Number] Problem Type: [Inconsistency Conflict / Missing Clause / ...] | Location: [Section Number of Document to be Reviewed] | Original Excerpt: '...' | References: '...'". This mandatory formatted output not only standardizes the presentation of results, facilitating subsequent automatic parsing and summarization, but more importantly, it limits the model's room for flexibility, formally curbing the generation of lengthy, loosely structured, or commentary-containing text.

[0143] 4. Binding Instructions (Taking Template Review as an Example): In specific review scenarios, such as comparing the document to be reviewed with a standard template, stricter behavioral prohibitions need to be imposed. Typical instructions include: "It is forbidden to rewrite, polish, optimize, or supplement the input long text to be reviewed and the reference template in any form." "It is forbidden to question or evaluate the reasonableness or completeness of the template clauses themselves." "When there is ambiguity or contradiction within the template, the content presented in the original template shall prevail, and no inferences or reconciliations shall be made on one's own." These binding instructions directly target the common tendency of large models toward "creativity" or "over-serving" and are the core means of suppressing hallucinations and ensuring the objectivity of the review.
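As a hedged illustration, the four instruction types could be held in one configuration object, assembled into a prompt, and paired with a parser for the mandated output format. The dictionary keys, the version identifier, and the concrete line layout assumed by the regular expression (for example `[1] Problem Type: Missing Clause | Location: 3.2 | ...`) are assumptions. The quoted instruction texts are taken from the examples above.

```python
import re

INSTRUCTION_SET = {
    "version": "example-v1",  # hypothetical identifier
    "role": "You are now acting as an automated assistant for verifying "
            "the consistency of contract terms.",
    "task_boundary": "You will receive contract paragraph A to be reviewed and "
                     "reference template paragraph B. Only determine if there are "
                     "substantial differences between the two.",
    "output_format": "Output each item as: [Number] Problem Type: ... | "
                     "Location: ... | Original Excerpt: '...' | References: '...'",
    "binding": [
        "It is forbidden to rewrite, polish, optimize, or supplement the input "
        "long text to be reviewed and the reference template in any form.",
    ],
}


def build_system_prompt(spec: dict) -> str:
    """Concatenate role, boundary, format, and binding instructions in order."""
    parts = [spec["role"], spec["task_boundary"], spec["output_format"], *spec["binding"]]
    return "\n".join(parts)


FINDING = re.compile(
    r"^\[(?P<number>\d+)\]\s*Problem Type:\s*(?P<type>[^|]+?)\s*\|\s*"
    r"Location:\s*(?P<location>[^|]+?)\s*\|\s*"
    r"Original Excerpt:\s*'(?P<excerpt>.*?)'\s*\|\s*"
    r"References:\s*'(?P<reference>.*?)'\s*$"
)


def parse_findings(output: str) -> list[dict]:
    """Keep only lines that follow the template; free-form commentary is dropped."""
    findings = []
    for line in output.splitlines():
        match = FINDING.match(line.strip())
        if match:
            findings.append(match.groupdict())
    return findings
```

Because every accepted line must match the template, downstream summarization can rely on fixed fields, and any line the model emits outside the template is simply not parsed.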

[0144] In summary, this structured instruction set, through the synergistic effect of role anchoring, task focus, format locking, and strict constraints, constructs a clear "operational guardrail," enabling large language models to output highly controllable, stable, and professionally expected verification results within the scope of their powerful semantic understanding capabilities.

[0145] The analysis and verification device for long texts provided by this invention is described below with reference to Figure 2. The device described below and the analysis and verification method for long texts described above may be referred to in correspondence with each other.

[0146] An analysis and verification device for long texts, comprising:

[0147] The text receiving module 210 is used to receive the long text to be checked;

[0148] The text segmentation module 220 is used to identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments;

[0149] The text allocation module 230 is used to allocate the text fragments to the target review node based on the semantic relationship between the text fragments;

[0150] The text verification module 240 is used in the target verification node to load and apply a predefined structured instruction set to call the long text verification model to perform text verification tasks and obtain text verification results; the structured instruction set is used to constrain the output behavior of the long text verification model in the process of generating verification results;

[0151] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0152] Figure 3 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 3, the electronic device may include: a processor 310, a communications interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communications interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 can call logical instructions in the memory 330 to execute a long text analysis and verification method, which includes:

[0153] Receive long text to be verified;

[0154] Identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments;

[0155] Based on the semantic relationships between the text fragments, the text fragments are assigned to the target review nodes;

[0156] In the target review node, a predefined structured instruction set is loaded and applied to call the long text review model to perform text review tasks in order to obtain text verification results; the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating review results;

[0157] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0158] Furthermore, the logical instructions in the aforementioned memory 330 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0159] In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer is able to execute the analysis and verification method for long texts provided above, the method comprising:

[0160] Receive long text to be verified;

[0161] Identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments;

[0162] Based on the semantic relationships between the text fragments, the text fragments are assigned to the target review nodes;

[0163] In the target review node, a predefined structured instruction set is loaded and applied to call the long text review model to perform text review tasks in order to obtain text verification results; the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating review results;

[0164] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0165] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the analysis and verification method for long texts provided above, the method comprising:

[0166] Receive long text to be verified;

[0167] Identify and segment the paragraph boundaries of the long text to be checked, and generate several text segments;

[0168] Based on the semantic relationships between the text fragments, the text fragments are assigned to the target review nodes;

[0169] In the target review node, a predefined structured instruction set is loaded and applied to call the long text review model to perform text review tasks in order to obtain text verification results; the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating review results;

[0170] The target review nodes include serial review nodes or parallel review nodes. The serial review nodes are used to process text fragments that have reference relationships or contextual relationships, while the parallel review nodes are used to process text fragments whose content is independent of each other.

[0171] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0172] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0173] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for analyzing and verifying long texts, characterized in that the method comprises: receiving a long text to be checked; identifying and segmenting the paragraph boundaries of the long text to be checked to generate several text fragments; assigning the text fragments to a target review node based on the semantic relationships between the text fragments; and, in the target review node, loading and applying a predefined structured instruction set to call a long text review model to perform a text review task and obtain a text verification result, wherein the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating the verification result; wherein the target review node includes a serial review node or a parallel review node, the serial review node being used to process text fragments that have reference relationships or contextual relationships, and the parallel review node being used to process text fragments whose content is independent of each other; and wherein the step of assigning the text fragments to the target review node based on the semantic relationships between the text fragments specifically includes: analyzing the semantic relationships between the text fragments to identify first text fragments that have direct references, indirect references, or contextual logical dependencies, and second text fragments whose content is independent of each other; assigning the first text fragments to the serial review node to leverage its ability to maintain a continuous context for deep cross-validation; and assigning the second text fragments to the parallel review node to leverage its parallel processing capabilities and improve review throughput.

2. The analysis and verification method for long texts according to claim 1, characterized in that, before the step of loading and applying a predefined structured instruction set in the target review node to call the long text review model to perform the text review task and obtain the text verification result, the method further includes: when a text fragment contains tables or content with nested hierarchical structures, flattening the content of the tables or nested hierarchical structures and converting it into a plain text entry format with hierarchical numbering.

3. The analysis and verification method for long texts according to claim 1, characterized in that, after the step of loading and applying a predefined structured instruction set in the target review node to call the long text review model to perform the text review task and obtain the text verification result, the method further includes: obtaining feedback information submitted by the user regarding the text verification result; structuring the collected feedback information, associating it with the corresponding text fragments, review results, and the applied structured instruction set, and storing it in a system knowledge base; periodically invoking the long text review model to analyze the feedback information accumulated in the system knowledge base, and generating a structured feedback report containing specific problem descriptions, a preliminary analysis of causes, and optimization suggestions; and updating the structured instruction set or the text segmentation logic based on the structured feedback report to optimize subsequent text verification.

4. The analysis and verification method for long texts according to claim 1, characterized in that the step of identifying and segmenting the paragraph boundaries of the long text to be checked to generate several text fragments specifically includes: coarsely segmenting the long text to be checked by identifying format features in the document based on predefined regular expression matching rules, so as to determine an initial segmentation; performing semantic verification and fine-grained segmentation on the initial segments obtained from the coarse segmentation, so as to merge related content that was incorrectly split; filtering out, based on the requirements of the review task, auxiliary text content that does not require semantic review; and generating the text fragments based on the results of the fine-grained segmentation and filtering.

5. The analysis and verification method for long texts according to claim 1, characterized in that the step of assigning the second text fragments to the parallel review node to leverage its parallel processing capabilities and improve review throughput specifically includes: acquiring, in real time, status indicators of each parallel review node in the system, including at least the current computing power utilization, the length of the task queue to be processed, and the available memory; and calculating a real-time load score for each parallel review node based on the status indicators, and preferentially assigning each independent text fragment among the second text fragments to the parallel review node with the lowest current load score for processing.

6. The analysis and verification method for long texts according to claim 1, characterized in that the structured instruction set includes: a role definition instruction, used to limit the role of the long text review model to that of a text review executor; a task boundary instruction, used to define the input, output, and processing scope of the text review task; an output format instruction, used to force the long text review model to output the verification result in a structured manner according to a preset template; and a binding instruction, used to restrict the long text review model from rewriting, optimizing, or interpreting the long text to be reviewed and the template when the text review task is a review against a reference template.

7. An analysis and verification device for long texts, characterized in that the device comprises: a text receiving module, used to receive a long text to be checked; a text segmentation module, used to identify and segment the paragraph boundaries of the long text to be checked and generate several text fragments; a text allocation module, used to assign the text fragments to a target review node based on the semantic relationships between the text fragments; and a text verification module, used to load and apply, in the target review node, a predefined structured instruction set to call a long text review model to perform a text review task and obtain a text verification result, wherein the structured instruction set is used to constrain the output behavior of the long text review model in the process of generating the verification result; wherein the target review node includes a serial review node or a parallel review node, the serial review node being used to process text fragments that have reference relationships or contextual relationships, and the parallel review node being used to process text fragments whose content is independent of each other; and wherein assigning the text fragments to the target review node based on the semantic relationships between the text fragments specifically includes: analyzing the semantic relationships between the text fragments to identify first text fragments that have direct references, indirect references, or contextual logical dependencies, and second text fragments whose content is independent of each other; assigning the first text fragments to the serial review node to leverage its ability to maintain a continuous context for deep cross-validation; and assigning the second text fragments to the parallel review node to leverage its parallel processing capabilities and improve review throughput.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the analysis and verification method for long texts according to any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the analysis and verification method for long texts according to any one of claims 1 to 6.