A method and system for extracting long document content based on dynamic segmentation and multi-model optimization
By employing dynamic segmentation and multi-model optimization methods, combined with automatic recognition and manual annotation, dynamically adjusting the segmentation window length, selecting the optimal set of segmented documents, and introducing a dialogue isolation mechanism, the problems of contextual forgetting and frequent hallucinations in long text processing by large AI models are solved, thereby improving the accuracy of content extraction and reducing the hallucination rate.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: XINGYUNSHUJU (BEIJING) TECH CO LTD
- Filing Date: 2025-07-29
- Publication Date: 2026-06-30
Figure CN121168413B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of natural language processing for long-text content extraction, and in particular to a method and system for extracting long document content based on dynamic segmentation and multi-model optimization.
Background Technology
[0002] The importance of using large AI models for long text content extraction lies in its significant improvement in long text processing efficiency, knowledge mining depth, and application value across multiple scenarios. With the acceleration of digitalization, the amount of text data generated by humans is growing exponentially. Traditional manual processing methods are struggling to meet the demands of parsing massive amounts of long text. Large AI models, with their powerful semantic understanding and generation capabilities, have become a key technology for breaking through the bottlenecks in long text processing.
[0003] Current large AI models (e.g., GPT-4, Claude) have two significant problems when processing long texts: (1) Context forgetting: When the input length exceeds the model window (e.g., 4k tokens), the accuracy of key content extraction drops by more than 30%; (2) Frequent hallucinations: In continuous dialogue, the probability of large AI models fabricating content increases. Experiments show that when the input exceeds 5 interactions, the hallucination rate increases to 18%.
[0004] Currently, methods available for long text processing include segmenting long texts with a sliding window (such as Transformer-XL) and combining long texts with external databases through retrieval enhancement (such as REALM). Their problems are: (1) the fixed segmentation used by the sliding window method may disrupt the coherence of the document structure (for example, by cutting across chapter relationships), leading to semantic breaks; (2) retrieval enhancement methods depend on additional storage and have poor real-time performance.
[0005] In summary, existing large AI models for natural language processing have the following problems when processing long texts: (1) the segmentation method may affect the results of extracting long document content; (2) the cumulative error of multi-turn dialogue can cause hallucinations.
Summary of the Invention
[0006] In view of this, embodiments of the present invention provide a method for extracting long document content based on dynamic segmentation and multi-model optimization, so as to eliminate or improve one or more defects existing in the prior art.
[0007] One aspect of the present invention provides a method for extracting content from long documents based on dynamic segmentation and multi-model optimization. The method includes the following steps: identifying chapter boundaries and paragraph boundaries of the long document from which content is to be extracted; calculating the semantic relevance of adjacent paragraphs separated by paragraph boundaries; using chapter boundaries, and paragraph boundaries with semantic relevance below a preset threshold, as undetermined segmentation nodes; selecting combinations of undetermined segmentation nodes according to preset filtering rules; segmenting the long document into different sets of segmented documents based on different combinations of undetermined segmentation nodes, wherein each set of segmented documents contains multiple segments of the long document; obtaining manually annotated chapter and paragraph boundary information; calculating the boundary recognition accuracy of the different sets of segmented documents based on the manually annotated information; evaluating the semantic unit integrity of the different sets of segmented documents; taking the set that achieves the best combination of boundary recognition accuracy and semantic unit integrity as the optimal segmented document set; calling multiple large AI model interfaces for natural language processing, wherein each call takes as input a combination of one segmented document from the optimal segmented document set and a prompt word template, and each large AI model outputs the content extraction result of that segmented document; for each segmented document, evaluating the output of each call to a large AI model in multiple dimensions and taking the output with the highest evaluation score as the final content extraction result of the current segmented document; and integrating the final content extraction results of all segmented documents contained in the optimal segmented document set to obtain the long document content extraction result.
[0008] In some embodiments of the present invention, the chapter boundaries and paragraph boundaries of the long document from which content is to be extracted are identified by regular expression matching; the preset filtering rules include the length range of the segmentation window, and the candidate window lengths range from 500 to 8000 tokens.
[0009] In some embodiments of the present invention, the data structure of the segmented document set is stored through a structured metadata table, each item of which includes a segment ID, starting position, length, chapter level, and parent node of a segmented document.
[0010] In some embodiments of the present invention, the step of calculating the boundary recognition accuracy of different segmented document sets based on manual annotation includes: determining the total number of paragraphs based on the manually annotated chapter boundaries and paragraph boundaries; comparing the manually annotated paragraphs with the identified chapter boundaries and paragraph boundaries to determine the number of correctly segmented paragraphs; and using the result of dividing the number of correctly segmented paragraphs corresponding to the current segmented document set by the total number of paragraphs as the boundary recognition accuracy.
[0011] In some embodiments of the present invention, the step of calculating the semantic unit integrity of different segmented document sets includes: calculating the total number of semantic units in the long text, $N_{\mathrm{total}}$; calculating the number of semantic units truncated when the long text is segmented at different undetermined segmentation nodes, $N_{\mathrm{trunc}}(L)$; and calculating the semantic unit integrity score based on the formula $C(L) = 1 - N_{\mathrm{trunc}}(L) / N_{\mathrm{total}}$, where L represents the length range of the segmentation window included in the preset filtering rules; wherein the set of segmented documents that achieves the optimal combination of boundary recognition accuracy and semantic unit integrity is determined by calculating the harmonic mean of the boundary recognition accuracy and the semantic unit integrity score.
[0012] In some embodiments of the present invention, the prompt word template includes the role played by the AI large model, the interface of the AI large model called, the segmented document as input, the elements that the output must strictly include, and the output format requirements.
[0013] In some embodiments of the present invention, the step of multi-dimensional quantitative evaluation of the output of each call to the large AI model includes: quantitatively evaluating the output of each call from multiple dimensions, including accuracy, completeness, hallucination rate, and robustness, based on preset weights.
[0014] In some embodiments of the present invention, the method further includes: for each segmented document, obtaining a preset number of expert-annotated manual annotation results conforming to a three-level labeling system, wherein the Kappa coefficient of the manual annotation results is greater than 0.8, and evaluating the output of each call to the AI large model based on the manual annotation results; wherein the three-level labeling system includes core arguments, supporting arguments, and abnormal content.
[0015] Corresponding to the above methods, the present invention also provides a long document content extraction system based on dynamic segmentation and multi-model optimization, including a processor, a memory, and a computer program / instructions stored in the memory. The processor is used to execute the computer program / instructions, and when the computer program / instructions are executed, the system implements the steps of any of the methods described in the above embodiments.
[0016] In accordance with the above methods, the present invention also provides a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implements the steps of the method as described in any of the above embodiments.
[0017] The long document content extraction method proposed in this invention combines automatic identification with manual annotation to determine the segmentation nodes. It traverses combinations of segmentation nodes to obtain multiple sets of segmented documents, evaluates each set on boundary recognition accuracy and semantic unit integrity, and passes the optimal set to the large AI models. For each segmented document, the output with the highest multi-dimensional quantitative evaluation score is taken as the final content extraction result, which improves the accuracy of long document content extraction. In addition, calling the large AI model interface separately for each segmented document in the optimal set introduces a dialogue isolation mechanism, which helps suppress the hallucination rate in multi-turn dialogues.
[0018] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the description, or may be learned by practice of the invention. The objects and other advantages of the invention can be realized and obtained by means of the structures specifically pointed out in the description and drawings.
[0019] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Attached Figure Description
[0020] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, are not intended to limit the scope of the invention. In the drawings:
[0021] Figure 1 is a flowchart of a long document content extraction method based on dynamic segmentation and multi-model optimization in one embodiment of the present invention.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0023] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0024] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0025] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.
[0026] In the following description, embodiments of the invention will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0027] To overcome the problems existing in current long text processing, this invention proposes a long document content extraction method based on dynamic segmentation and multi-model optimization, which includes the following steps:
[0028] Step S110: Identify the chapter boundaries and paragraph boundaries of the long document from which content is to be extracted, and calculate the semantic relevance of adjacent paragraphs separated by paragraph boundaries.
[0029] Step S120: Take the chapter boundaries, and the paragraph boundaries with semantic relevance below a preset threshold, as undetermined segmentation nodes; select combinations of undetermined segmentation nodes according to preset filtering rules; and segment the long document into different sets of segmented documents based on different combinations of undetermined segmentation nodes, wherein each set of segmented documents contains the multiple segmented documents into which the long document is divided.
[0030] Step S130: Obtain manually annotated chapter and paragraph boundary information, calculate the boundary recognition accuracy of different segmented document sets based on the manual annotation, and calculate the semantic unit integrity of different segmented document sets. The segmented document set that achieves the best combination of boundary recognition accuracy and semantic unit integrity is taken as the optimal segmented document set.
[0031] Step S140: Call multiple AI large model interfaces for natural language processing. Each call takes a combination of a segmented document contained in the optimal segmented document set and a prompt word template as input. Each AI large model outputs the content extraction result of a segmented document.
[0032] Step S150: For each segmented document, evaluate the output of each call to the AI large model in multiple dimensions, and take the one with the highest evaluation score as the final result of content extraction for the current segmented document. Integrate the final results of content extraction of all segmented documents contained in the optimal segmented document set to obtain the long document content extraction result.
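The selection and integration in step S150 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the tuple layout of the candidates, and the choice to integrate by joining results in segment order are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (model_name, evaluation_score, extracted_text)
Candidate = Tuple[str, float, str]


def select_best_outputs(segment_order: List[str],
                        candidates: Dict[str, List[Candidate]]) -> List[Tuple[str, str, str]]:
    """For each segment, keep the candidate with the highest evaluation score."""
    final = []
    for seg_id in segment_order:
        model_name, _score, text = max(candidates[seg_id], key=lambda c: c[1])
        final.append((seg_id, model_name, text))
    return final


def integrate(final: List[Tuple[str, str, str]]) -> str:
    """Assumed integration rule: concatenate the winning texts in document order."""
    return "\n".join(text for _seg_id, _model, text in final)


# Illustrative scores only; they are not taken from the source.
candidates = {
    "S1": [("model_a", 0.81, "A1"), ("model_b", 0.77, "B1")],
    "S2": [("model_a", 0.64, "A2"), ("model_b", 0.90, "B2")],
}
final = select_best_outputs(["S1", "S2"], candidates)
result = integrate(final)
```

Keeping the segment order explicit means the integrated result follows the original document order even when different models win on different segments.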
[0033] The long document content extraction method proposed in this invention combines automatic identification with manual annotation to determine the segmentation nodes. It traverses combinations of segmentation nodes to obtain multiple sets of segmented documents, evaluates each set on boundary recognition accuracy and semantic unit integrity, and passes the optimal set to the large AI models. For each segmented document, the output with the highest multi-dimensional quantitative evaluation score is taken as the final content extraction result, which improves the accuracy of long document content extraction. In addition, calling the large AI model interface separately for each segmented document in the optimal set introduces a dialogue isolation mechanism, which helps suppress the hallucination rate in multi-turn dialogues.
[0034] In one embodiment of the present invention, the chapter boundaries and paragraph boundaries of the long document from which content is to be extracted are identified by regular expression matching.
[0035] Furthermore, in one embodiment of the present invention, the preset filtering rules include the length range of the segmentation window, wherein the candidate length value ranges from 500 to 8000 tokens.
[0036] By combining regular expression matching with preset filtering rules, this invention facilitates dynamic and structured long text segmentation, ensuring semantic accuracy and semantic integrity as much as possible during the text segmentation process.
[0037] In one embodiment of the present invention, the data structure of the segmented document set is stored through a structured metadata table. Each item in the structured metadata table includes a segment ID, starting position, length, chapter level, and parent node of a segmented document.
[0038] Using this embodiment of the invention, the starting position and length of each segmented document in the dynamically segmented document set can be fully recorded. Different segmented documents are identified by segment IDs, and the position of the segmented document in the long document is located by chapter level and parent node. Thus, the contextual logic of the original long document is taken into account during the process of extracting and integrating content from the segmented documents, and the semantic information in the long document is preserved to the greatest extent.
[0039] In one embodiment of the present invention, the step of calculating the boundary recognition accuracy of different segmented document sets based on manual annotation includes: determining the total number of paragraphs based on the manually annotated chapter boundaries and paragraph boundaries; comparing the manually annotated sections and paragraph boundaries with the identified sections and paragraph boundaries to determine the number of correctly segmented paragraphs; and dividing the number of correctly segmented paragraphs corresponding to the current segmented document set by the total number of paragraphs as the boundary recognition accuracy.
[0040] Using this embodiment of the invention, the quality of automatically identified chapter and paragraph boundaries can be analyzed by manual annotation, thereby determining the set of segmented documents that achieves the optimal combination of boundary recognition accuracy and semantic unit integrity.
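The accuracy ratio described above (correctly segmented paragraphs divided by total annotated paragraphs) can be illustrated with a short sketch. The representation is an assumption: boundaries are given as character offsets, and a paragraph counts as correctly segmented only when the identified boundaries reproduce its exact span.

```python
from typing import Iterable


def boundary_accuracy(annotated: Iterable[int],
                      identified: Iterable[int],
                      doc_length: int) -> float:
    """Share of manually annotated paragraphs whose span is reproduced exactly."""
    gold = sorted(set(annotated) | {0, doc_length})
    found = sorted(set(identified) | {0, doc_length})
    gold_spans = list(zip(gold, gold[1:]))      # annotated paragraphs
    found_spans = set(zip(found, found[1:]))    # automatically identified paragraphs
    if not gold_spans:
        return 0.0
    correct = sum(1 for span in gold_spans if span in found_spans)
    return correct / len(gold_spans)


# Illustrative offsets: 4 annotated paragraphs, 2 of them recovered exactly.
accuracy = boundary_accuracy(annotated=[100, 250, 400],
                             identified=[100, 250, 300],
                             doc_length=500)
```

A stricter or looser matching rule (for example, tolerating a few characters of offset) would change the count of correct paragraphs; the source does not specify one.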
[0041] In one embodiment of the present invention, the step of calculating the semantic unit integrity of different segmented document sets includes: calculating the total number of semantic units in the long text, $N_{\mathrm{total}}$; calculating the number of semantic units truncated when the long text is segmented at different undetermined segmentation nodes, $N_{\mathrm{trunc}}(L)$; and calculating the semantic unit integrity score based on the formula $C(L) = 1 - N_{\mathrm{trunc}}(L) / N_{\mathrm{total}}$, where L represents the length range of the segmentation window included in the preset filtering rules; wherein the set of segmented documents that achieves the optimal combination of boundary recognition accuracy and semantic unit integrity is determined by calculating the harmonic mean of the boundary recognition accuracy and the semantic unit integrity score.
[0042] Using this embodiment of the invention, it is possible to calculate the semantic unit integrity score and determine the set of segmented documents that achieves the best combination of boundary recognition accuracy and semantic unit integrity, thereby determining the optimal solution for dynamic segmentation of long texts.
[0043] In one embodiment of the present invention, the prompt word template includes the role played by the AI large model, the interface of the AI large model called, the segmented document as input, the elements that the output must strictly include, and the output format requirements.
[0044] Using this embodiment of the invention, it is possible to use prompt word templates for batch input and output to obtain the content extraction results of different AI large models for different segmented documents.
[0045] In one embodiment of the present invention, the step of multi-dimensional quantitative evaluation of the output of each call to the large AI model includes: quantitatively evaluating the output of each call from multiple dimensions, including accuracy, completeness, hallucination rate, and robustness, based on preset weights.
[0046] By employing this embodiment of the invention, the model output quality is comprehensively considered from multiple dimensions, thereby obtaining the content extraction result with the best quality within the selectable range.
[0047] In one embodiment of the present invention, the method further includes: for each segmented document, obtaining a preset number of expert-annotated manual annotation results conforming to a three-level labeling system, wherein the Kappa coefficient of the manual annotation results is greater than 0.8, and evaluating the output of each call to the AI large model based on the manual annotation results; wherein the three-level labeling system includes core arguments, supporting arguments, and abnormal content.
[0048] Using this embodiment of the invention, it is possible to combine expert annotations to assist in evaluating the output of the large AI model.
[0049] In the specific implementation process, step S110 first performs structured segmentation on the long document, tries different dynamic segmentation methods, and calculates the document segmentation evaluation results. The length L of each segmented document si is recorded, and the segmented documents together form a set of segmented documents Si([s1,s2,s3]).
[0050] The process of structurally segmenting long documents includes: (1) Chapter boundary identification, which can be achieved by using a dual-modal segmentation of regular expression matching and manual verification. The manual verification (intervention rule) is: if the semantic relevance of adjacent paragraphs is greater than the threshold (e.g., cosine similarity > 0.7), segmentation is prohibited. (2) The length range of the segmentation window used for long document segmentation is dynamically adjusted according to the dynamic window adjustment formula. By traversing different segmentation lengths L, different combinations of segmentation nodes are tried to determine the window size that can achieve the best combination of accuracy and integrity, avoiding excessive length (content redundancy) or excessive shortness (semantic breakage). A set of segmented documents Si([s1,s2,s3]) is formed.
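The intervention rule above (no cut where adjacent paragraphs are more similar than the threshold) can be sketched as follows. The source does not say how paragraphs are vectorised, so this example assumes simple bag-of-words counts as a stand-in for whatever embedding is actually used; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for a real embedding)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def candidate_nodes(paragraphs: List[str], threshold: float = 0.7) -> List[int]:
    """Indices i where a cut between paragraphs[i] and paragraphs[i + 1] is allowed.

    A cut is prohibited when the similarity exceeds the threshold.
    """
    return [i for i in range(len(paragraphs) - 1)
            if cosine_similarity(paragraphs[i], paragraphs[i + 1]) <= threshold]


paragraphs = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices fell sharply today",
]
nodes = candidate_nodes(paragraphs)
```

Here the first two paragraphs are too similar to separate, so only the boundary before the third paragraph remains a candidate node.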
[0051] The regular expression is: r'(?m)^#{1,3}\s*(?:\d+\.)*\d+\s+[\u4e00-\u9fa5a-zA-Z]+'. This regular expression means matching Markdown formatted heading lines (headings of levels 1 to 3, with numbered headings and both Chinese and English heading text). Furthermore, a structured metadata table is used to record the delimiter parameters. An example of the constructed structured metadata table is shown below.
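As a quick check of what the heading pattern accepts, the following sketch applies it with Python's `re` module. The pattern is copied from the text; the helper name and the sample document are made up for illustration.

```python
import re

# Pattern from the description: numbered Markdown headings of levels 1 to 3.
HEADING_RE = re.compile(r'(?m)^#{1,3}\s*(?:\d+\.)*\d+\s+[\u4e00-\u9fa5a-zA-Z]+')


def find_chapter_boundaries(text):
    """Return (char_offset, heading_level, matched_text) for each matching heading."""
    boundaries = []
    for m in HEADING_RE.finditer(text):
        matched = m.group(0)
        level = len(matched) - len(matched.lstrip('#'))  # number of leading '#'
        boundaries.append((m.start(), level, matched))
    return boundaries


sample = (
    "# 1 Introduction\n"
    "Some text.\n"
    "## 1.1 Background\n"
    "More text.\n"
    "#### 1.1.1.1 Too deep\n"
    "### 2.1.3 实验设计\n"
    "# Unnumbered heading\n"
)
boundaries = find_chapter_boundaries(sample)
```

Note that the pattern requires a section number, so unnumbered headings and level-4 headings are skipped, and the character class stops at the first space, so only the first word of a multi-word title is captured.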
[0052] Table 1 shows an example of a structured metadata table.
[0053]
Segment ID | Starting position | Length | Chapter level | Parent node
S1 | 0 | 1024 | 1 | NULL
S2 | 1025 | 768 | 2 | S1
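One way to hold such a metadata table in code is a list of small records. The sketch below is an assumption about representation only: the class name, field names, and the `ancestry` helper are hypothetical, and the two rows are the ones from Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SegmentMeta:
    segment_id: str
    start: int              # starting position in the long document
    length: int
    chapter_level: int
    parent: Optional[str]   # None corresponds to NULL in the table


TABLE = [
    SegmentMeta("S1", 0, 1024, 1, None),
    SegmentMeta("S2", 1025, 768, 2, "S1"),
]


def ancestry(table: List[SegmentMeta], segment_id: str) -> List[str]:
    """Walk parent links from a segment up to its top-level ancestor."""
    by_id = {row.segment_id: row for row in table}
    chain = []
    current = by_id[segment_id]
    while current is not None:
        chain.append(current.segment_id)
        current = by_id.get(current.parent) if current.parent else None
    return chain


path = ancestry(TABLE, "S2")
```

Following the parent links is what lets a segment's extraction result be placed back under the right chapter when the results are integrated.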
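[0054] The dynamic window adjustment formula is as follows:
[0055] $$L^{*} = \arg\max_{L} \frac{2\,A(L)\,C(L)}{A(L) + C(L)}$$
[0056] In the above formula, L is the length of the segmentation window (unit: number of tokens), and the candidate value range is usually 500-8000 tokens. A(L) is the average accuracy at the current segmentation length and measures whether the segmentation boundaries are consistent with the manual annotation; it is calculated as $A(L) = N_{\mathrm{correct}}(L) / N_{\mathrm{para}}$, the number of correctly segmented paragraphs divided by the total number of manually annotated paragraphs. C(L) is the completeness score and measures whether complete semantic units are preserved after segmentation; it is calculated as $C(L) = 1 - N_{\mathrm{trunc}}(L) / N_{\mathrm{total}}$, where $N_{\mathrm{trunc}}(L)$ is the number of truncated semantic units and $N_{\mathrm{total}}$ is the total number of semantic units. The harmonic mean is $F(L) = 2\,A(L)\,C(L) / (A(L) + C(L))$; it balances accuracy and completeness and avoids the imbalance caused by an excessively high score on either single metric. The physical meaning of the above formula is to find the length L of the segmentation window that maximizes the harmonic mean.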
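The window selection can be sketched as a search over candidate lengths. This is an illustrative sketch under two assumptions: the completeness score is taken to be the share of semantic units that are not truncated, and the A(L) and C(L) values in the example are invented, not measured.

```python
from typing import Dict, Tuple


def integrity_score(total_units: int, truncated_units: int) -> float:
    """Assumed form of C(L): share of semantic units left intact."""
    return 1 - truncated_units / total_units


def harmonic_mean(a: float, c: float) -> float:
    return 2 * a * c / (a + c) if (a + c) > 0 else 0.0


def best_window(candidates: Dict[int, Tuple[float, float]]) -> int:
    """candidates maps a window length L to the pair (A(L), C(L))."""
    return max(candidates, key=lambda L: harmonic_mean(*candidates[L]))


# Made-up scores for three candidate window lengths (in tokens).
candidates = {
    500: (0.90, 0.60),   # short windows: accurate boundaries, many truncated units
    2000: (0.85, 0.80),
    8000: (0.60, 0.95),  # long windows: few truncations, poorer boundary match
}
chosen = best_window(candidates)
```

Because the harmonic mean is dominated by the smaller of the two values, a window that scores very well on one metric and poorly on the other loses to a balanced one.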
[0057] In the specific implementation process, step S140 involves trying different large-scale AI models, writing and optimizing prompt words, and inputting one segmented document si at a time to obtain the output results. Furthermore, dynamic variable injection technology is used to enable the invocation and input of different large-scale AI models. Different prompt words Pro = {p1, p2, p3} are tried, where Pro is the set of prompt words, and p1, p2, and p3 are individual prompt words. The optimal prompt word is obtained based on the quantitative evaluation model results.
[0058] In actual coding, the prompt template can be in the following format:
prompt_template = {"role": "system", "content": "As an expert in {domain}, please extract the core content of the {section_type} section from the following text, which must strictly include: {elements}. The output format is JSON key-value pairs: {format}."}
# Example variable substitution:
{"domain": "Computer Science", "section_type": "Methodology", "elements": ["Assumptions", "Experimental Parameters"]}
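Filling the template can be sketched with Python's built-in `str.format`. The template text and the example values for `domain`, `section_type`, and `elements` come from the description; the value passed for `format`, the `build_messages` helper, and the decision to send the segment as a separate user message are assumptions made for the example.

```python
import json

PROMPT_TEMPLATE = {
    "role": "system",
    "content": ("As an expert in {domain}, please extract the core content of the "
                "{section_type} section from the following text, which must strictly "
                "include: {elements}. The output format is JSON key-value pairs: {format}."),
}


def build_messages(segment_text, **variables):
    """Inject variables into a copy of the template and attach one segment."""
    system = dict(PROMPT_TEMPLATE,
                  content=PROMPT_TEMPLATE["content"].format(**variables))
    return [system, {"role": "user", "content": segment_text}]


messages = build_messages(
    "Segment text goes here.",
    domain="Computer Science",
    section_type="Methodology",
    elements=", ".join(["Assumptions", "Experimental Parameters"]),
    # Hypothetical output schema; the source gives no example value for {format}.
    format=json.dumps({"Assumptions": "...", "Experimental Parameters": "..."}),
)
```

Building a fresh copy on each call leaves the shared template untouched, so the same template can be reused across segments and models.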
[0059] Furthermore, adversarial tests can be built into the prompt word template: distracting sentences (such as irrelevant mathematical formulas) are inserted to test the model's robustness, and obfuscated chapter titles (such as a deliberately altered variant of "3.1 Experimental Design") are used for the same purpose.
[0060] In the specific implementation process, the quantitative evaluation process in step S150 can adopt an expert annotation system or a multi-dimensional scoring algorithm. The quantitative evaluation model is used to compare the output results of different large models and documents segmented with different lengths to obtain the large model with the optimal content extraction result and the most suitable set of segmented documents.
[0061] The expert annotation system is built based on the Gold Standard. The system includes a defined three-level tagging system: {"Core Argument":{"required":true,"weight":0.6},"Supporting Evidence":{"required":true,"subtypes":["Data","Citations"]},"Abnormal Content":{"type":"Hallucination Detection","penalty":-0.2}}
[0062] During the annotation process, double-blind annotation can be used, that is, two experts annotate independently, and the Kappa coefficient must be greater than 0.8.
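The agreement check can be sketched as below. The text only says "Kappa coefficient"; this example assumes Cohen's kappa, the usual choice for two annotators, and uses invented labels drawn from the three-level tagging system.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented annotations: the two experts disagree on the last item only.
expert_1 = ["core", "core", "support", "abnormal", "support",
            "core", "support", "core", "abnormal", "support"]
expert_2 = ["core", "core", "support", "abnormal", "support",
            "core", "support", "core", "abnormal", "core"]
kappa = cohen_kappa(expert_1, expert_2)
accepted = kappa > 0.8
```

With 9 of 10 labels matching, raw agreement is 0.9, but kappa discounts the agreement expected by chance, which is why the threshold is stated on kappa rather than on raw agreement.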
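[0063] The multi-dimensional scoring algorithm jointly considers multiple dimensions such as accuracy, completeness, hallucination rate, and robustness.
[0064] The hallucination rate is quantified by the following formula:
[0065] $$H = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big(f_{\mathrm{model}}(x_i) \not\subseteq G(x_i)\big) \times 100\%$$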
[0066] Parameter description:
[0067] H: Hallucination Rate, the proportion of erroneous or fictitious content in the model output.
[0068] n: Total number of test samples (i.e., the number of labeled gold standard paragraphs).
[0069] $f_{\mathrm{model}}(x_i)$: The set of output results of the model on the input paragraph $x_i$ (such as extracted arguments and evidence).
[0070] $G(x_i)$: The gold standard result set (i.e., the correct answer) annotated by experts.
[0071] $\mathbb{I}(\cdot)$: Indicator function, takes the value 1 if the model output is not in the gold standard, otherwise takes the value 0.
[0072] Calculation example: If 10 paragraphs are tested, and 3 of them contain unlabeled content, then H = 3 / 10 × 100% = 30%.
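The calculation example can be reproduced with a few lines of code. The sketch assumes that model outputs and gold-standard results are represented as sets of extracted items, and that a paragraph is flagged when its output contains anything outside the gold set; the item names are placeholders.

```python
from typing import Sequence, Set


def hallucination_rate(outputs: Sequence[Set[str]], gold: Sequence[Set[str]]) -> float:
    """Percentage of paragraphs whose output contains items absent from the gold set."""
    if len(outputs) != len(gold) or not outputs:
        raise ValueError("outputs and gold must be non-empty and the same length")
    flagged = sum(1 for out, ref in zip(outputs, gold) if not set(out) <= set(ref))
    return 100 * flagged / len(outputs)


# 10 test paragraphs; the first 3 outputs include an item the experts never labelled.
gold = [{"claim", "evidence"} for _ in range(10)]
outputs = [{"claim", "invented"}] * 3 + [{"claim"}] * 7
rate = hallucination_rate(outputs, gold)
```

An output that merely omits gold items is not flagged here; omissions are covered by the completeness dimension rather than the hallucination rate.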
[0073] The evaluation formula for the model, which integrates various dimensions, is as follows:
[0074] Parameter description:
[0075] Q: The overall quality score of the model is used for horizontal comparison of different models.
[0076] A: Accuracy is used to evaluate the matching rate between the model output and the gold standard.
[0077] C: Completeness, used to analyze the extent to which the model covers all the necessary content of the gold standard.
[0078] H: Hallucination Rate.
[0079] R: Robustness, used to evaluate stable performance under adversarial tests (such as distractor sentences and misspellings).
[0080] α, β, γ, δ: Weighting coefficients used to adjust priority based on document type.
[0081] In the specific implementation process, the weight coefficients for different documents are as follows: (1) Scientific papers: α=0.6, β=0.2, γ=0.15, δ=0.05 (emphasizing accuracy). (2) Legal documents: α=0.4, β=0.4, γ=0.1, δ=0.1 (emphasizing completeness).
[0082] Its practical significance lies in the fact that by adjusting the weights, the scoring system can be adapted to the needs of different scenarios. For example, legal documents require the complete retention of clauses, while academic papers must strictly avoid fabricating data.
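The effect of the two weight sets can be illustrated with a sketch. The exact form of the evaluation formula is not reproduced in the text, so the combination below is an assumption: a weighted sum in which the hallucination rate enters as (1 - H), with all four inputs expressed as fractions between 0 and 1. The weights themselves are the ones listed above; the scores are invented.

```python
WEIGHTS = {
    # (alpha, beta, gamma, delta) as listed in the description
    "scientific_paper": (0.6, 0.2, 0.15, 0.05),
    "legal_document": (0.4, 0.4, 0.1, 0.1),
}


def quality_score(accuracy, completeness, hallucination, robustness, doc_type):
    """Assumed form: Q = alpha*A + beta*C + gamma*(1 - H) + delta*R."""
    alpha, beta, gamma, delta = WEIGHTS[doc_type]
    return (alpha * accuracy + beta * completeness
            + gamma * (1 - hallucination) + delta * robustness)


# Invented scores for a single model output.
paper_q = quality_score(0.9, 0.8, 0.1, 0.7, "scientific_paper")
legal_q = quality_score(0.9, 0.8, 0.1, 0.7, "legal_document")
```

The same output scores differently under the two profiles: the scientific-paper weights reward its high accuracy, while the legal-document weights shift emphasis toward completeness.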
[0083] Furthermore, in some embodiments of the present invention, a dialogue isolation execution mechanism is introduced. Each time the model interface is called, a new session is created to prevent hallucinations and reduce the hallucination rate.
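A dialogue isolation loop can be sketched as follows. The `call_model` parameter is a hypothetical callable standing in for whichever model interface is used; the only point illustrated is that each call starts from a fresh message list, so no earlier segment or reply is carried into the next request.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def extract_isolated(segments: List[str],
                     system_prompt: str,
                     call_model: Callable[[List[Message]], str]) -> List[str]:
    """Send each segment in its own session; no history is shared between calls."""
    results = []
    for segment in segments:
        # A new message list per call is what isolates the sessions.
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": segment},
        ]
        results.append(call_model(messages))
    return results


# Stand-in for a real model interface, used only to show the call pattern.
seen = []


def fake_model(messages: List[Message]) -> str:
    seen.append([m["content"] for m in messages])
    return "summary of " + messages[-1]["content"]


out = extract_isolated(["seg A", "seg B"], "You extract key points.", fake_model)
```

The contrast is with a multi-turn loop that keeps appending to one message list, where errors in an early reply can influence every later one.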
[0084] The long document content extraction method proposed in this invention combines automatic identification with manual annotation to determine the segmentation nodes. It traverses combinations of segmentation nodes to obtain multiple sets of segmented documents, evaluates each set on boundary recognition accuracy and semantic unit integrity, and passes the optimal set to the large AI models. For each segmented document, the output with the highest multi-dimensional quantitative evaluation score is taken as the final content extraction result, which improves the accuracy of long document content extraction. In addition, calling the large AI model interface separately for each segmented document in the optimal set introduces a dialogue isolation mechanism, which helps suppress the hallucination rate in multi-turn dialogues.
[0085] The long document content extraction method proposed in this invention, based on dynamic segmentation and multi-model optimization, adopts a long document dynamic segmentation strategy based on semantic structure. Furthermore, it establishes a quantitative mapping relationship between segmentation length and model performance, flexibly adjusts the segmentation strategy, dynamically selects the optimal combination of model and segmentation strategy, and suppresses hallucinations through a dialogue isolation mechanism.
[0086] The method proposed in this invention has been tested through experimental simulations, and its beneficial effects include, but are not limited to: (1) improved accuracy: on the arXiv paper test set, the accuracy rate increased from 72% to 89%; (2) hallucination suppression: the hallucination rate of continuous processing was reduced to below 3% through the isolation mechanism; (3) adaptive optimization: dynamic selection of model-segmentation combination reduced resource consumption by 20%.
[0087] Corresponding to the above method, the present invention also provides a long document content extraction system based on dynamic segmentation and multi-model optimization. The system includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor is used to execute the computer instructions stored in the memory. When the computer instructions are executed by the processor, the system implements the steps of the method described above.
[0088] Corresponding to the methods described above, the present invention also provides a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implements the steps of the method as described in any of the above embodiments. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.
[0089] Corresponding to the above methods, the present invention also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the method as described in any of the above embodiments.
[0090] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0091] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0092] In this invention, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0093] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for extracting content from long documents based on dynamic segmentation and multi-model optimization, characterized in that the method comprises:
identifying the chapter boundaries and paragraph boundaries of a long document from which content is to be extracted, and calculating the semantic relevance of the adjacent paragraphs separated by each paragraph boundary;
taking the chapter boundaries, and the paragraph boundaries whose semantic relevance is below a preset threshold, as undetermined segmentation nodes;
selecting combinations of undetermined segmentation nodes according to preset filtering rules, and segmenting the long document into different segmented document sets based on the different combinations, wherein each segmented document set contains multiple segmented documents of the long document and records the length of each segmented document;
obtaining manually annotated chapter boundary and paragraph boundary information, calculating the boundary recognition accuracy of each segmented document set based on the manually annotated information, and calculating the semantic unit integrity of each segmented document set;
determining, by calculating the harmonic mean of the boundary recognition accuracy and the semantic unit integrity score, the segmented document set that is optimal in terms of both boundary recognition accuracy and semantic unit integrity, and taking it as the optimal segmented document set;
calling the interfaces of multiple large AI models for natural language processing, wherein each call takes as input a combination of a segmented document from the optimal segmented document set and a prompt word template, and each large AI model outputs a content extraction result for that segmented document;
for each segmented document, evaluating the output of each call to a large AI model in multiple dimensions, and taking the output with the highest evaluation score as the final content extraction result for the current segmented document; and
integrating the final content extraction results of all segmented documents in the optimal segmented document set to obtain the long document content extraction result.
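The two selection steps in claim 1 (choosing the segmented document set by harmonic mean, then keeping the highest-scoring model output per segment) can be illustrated with a minimal Python sketch. The function names, the dictionary layout, and the sample scores are hypothetical and chosen only for illustration; they are not taken from the specification.

```python
from typing import Dict, List, Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores; returns 0.0 if either score is not positive."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


def select_optimal_set(candidates: Dict[str, Tuple[float, float]]) -> str:
    """candidates maps a set name to (boundary_accuracy, semantic_integrity).

    Returns the name of the set with the largest harmonic mean of the two scores.
    """
    return max(candidates, key=lambda name: harmonic_mean(*candidates[name]))


def pick_best_output(scored_outputs: List[Tuple[str, float]]) -> str:
    """scored_outputs holds (extraction_text, evaluation_score), one pair per model call."""
    return max(scored_outputs, key=lambda pair: pair[1])[0]


# Hypothetical scores for three candidate segmented document sets.
candidate_sets = {
    "window_500": (0.95, 0.60),
    "window_2000": (0.85, 0.85),
    "window_8000": (0.60, 0.98),
}
best_set = select_optimal_set(candidate_sets)
```

The harmonic mean penalizes a set that scores well on only one of the two criteria, which is why the balanced candidate is preferred in this example.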
2. The method according to claim 1, characterized in that the chapter boundaries and paragraph boundaries of the long document from which content is to be extracted are identified using regular expression matching; and the preset filtering rules include the length range of the segmentation window, with candidate length values ranging from 500 to 8000 tokens.
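As a rough illustration of regular-expression boundary detection and the 500 to 8000 token window range, the following Python sketch may help. The two patterns are assumptions (the claim does not specify any pattern): chapter headings are assumed to look like "Chapter 3" or "2.1 Title" at the start of a line, and paragraphs are assumed to be separated by blank lines.

```python
import re

# Assumed heading styles: "Chapter 3 ..." or a numbered title such as "2.1 Methods".
CHAPTER_RE = re.compile(
    r"^(?:Chapter[ \t]+\d+\b.*|\d+(?:\.\d+)*\.?[ \t]+\S.*)$", re.MULTILINE
)
# Assumed paragraph separator: one or more blank lines.
PARAGRAPH_RE = re.compile(r"\n\s*\n")


def find_boundaries(text):
    """Return (chapter_starts, paragraph_starts) as character offsets."""
    chapters = [m.start() for m in CHAPTER_RE.finditer(text)]
    paragraphs = [m.end() for m in PARAGRAPH_RE.finditer(text)]
    return chapters, paragraphs


def within_window(token_count, low=500, high=8000):
    """Check a candidate segment length against the window range in claim 2."""
    return low <= token_count <= high


sample = "1 Introduction\n\nFirst paragraph.\n\n2 Methods\n\nSecond paragraph."
chapter_starts, paragraph_starts = find_boundaries(sample)
```

Real documents would likely need patterns tuned to their own heading conventions; the offsets returned here are only candidate boundaries.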
3. The method according to claim 1, characterized in that each segmented document set is stored as a structured metadata table, and each entry in the structured metadata table includes the segment ID, starting position, length, chapter level, and parent node of a segmented document.
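One possible in-memory shape for such a metadata table is sketched below in Python. The five fields follow claim 3; the class name, field names, types, and the helper function are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class SegmentRecord:
    """One entry of the structured metadata table (field names are illustrative)."""
    segment_id: str
    start: int                 # starting position in the long document
    length: int                # segment length, e.g. in tokens
    chapter_level: int         # 1 = top-level chapter, 2 = section, ...
    parent_id: Optional[str]   # segment_id of the parent node, None for a root


def children_of(table: List[SegmentRecord], parent_id: Optional[str]) -> List[SegmentRecord]:
    """Return the entries whose parent node is parent_id."""
    return [rec for rec in table if rec.parent_id == parent_id]


table = [
    SegmentRecord("s1", 0, 1200, 1, None),
    SegmentRecord("s1.1", 0, 700, 2, "s1"),
    SegmentRecord("s1.2", 700, 500, 2, "s1"),
]
rows = [asdict(rec) for rec in table]  # plain dicts, ready for serialization
```

Keeping the parent node in each entry lets the chapter hierarchy be rebuilt when the per-segment results are later integrated.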
4. The method according to claim 1, characterized in that the step of calculating the boundary recognition accuracy of different segmented document sets based on the manually annotated information includes: determining the total number of paragraphs based on the manually annotated chapter and paragraph boundaries; comparing the manually annotated boundaries with the identified chapter and paragraph boundaries to determine the number of correctly segmented paragraphs; and taking the number of correctly segmented paragraphs corresponding to the current segmented document set divided by the total number of paragraphs as the boundary recognition accuracy.
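The ratio in claim 4 can be sketched in Python as follows. The claim does not define when a paragraph counts as "correctly segmented"; this sketch assumes a paragraph is correct only when both of its annotated boundaries also appear as adjacent identified boundaries. Boundaries are assumed to be character offsets that include the document start and end.

```python
def boundary_accuracy(annotated_boundaries, identified_boundaries):
    """Correctly segmented paragraphs divided by the total annotated paragraphs."""
    annotated = sorted(set(annotated_boundaries))
    identified = sorted(set(identified_boundaries))
    # Each pair of adjacent boundaries delimits one paragraph span.
    annotated_spans = list(zip(annotated, annotated[1:]))
    identified_spans = set(zip(identified, identified[1:]))
    if not annotated_spans:
        return 0.0
    correct = sum(1 for span in annotated_spans if span in identified_spans)
    return correct / len(annotated_spans)


# Three annotated paragraphs; the identified boundaries misplace the second cut.
accuracy = boundary_accuracy([0, 100, 250, 400], [0, 100, 300, 400])
```

A more tolerant matching rule (for example, allowing a small offset error) could be substituted without changing the overall ratio structure.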
5. The method according to claim 4, characterized in that the step of calculating the semantic unit integrity of different segmented document sets includes: calculating the total number of semantic units in the long document, and calculating the number of semantic units that are truncated when the long document is segmented at different undetermined segmentation nodes; and calculating the semantic unit integrity score by formula from the total number of semantic units and the number of truncated semantic units; wherein the preset filtering rules include the length range of the segmentation window.
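The formula itself is not legible in this claim text, so the following Python sketch rests on an assumption: the integrity score is taken to be the fraction of semantic units that are not truncated, that is, 1 minus the ratio of truncated units to total units. Semantic units are modeled as hypothetical (start, end) character spans, and a unit counts as truncated when a cut point falls strictly inside it.

```python
def semantic_unit_integrity(units, cut_points):
    """Assumed score: 1 - truncated_units / total_units.

    units: list of (start, end) spans for semantic units.
    cut_points: offsets at which the document is segmented.
    """
    if not units:
        return 1.0  # nothing to truncate
    truncated = sum(
        1 for start, end in units if any(start < cut < end for cut in cut_points)
    )
    return 1 - truncated / len(units)


units = [(0, 50), (50, 120), (120, 200), (200, 260)]
# The cut at 50 falls on a unit boundary; the cut at 150 splits the third unit.
score = semantic_unit_integrity(units, [50, 150])
```

If the original formula weights units differently (for example by length), only the body of this function would need to change.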
6. The method according to claim 1, characterized in that the prompt word template includes the role played by the large AI model, the interface of the large AI model to be called, the segmented document as input, the elements that the output must strictly include, and the output format requirements.
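A prompt word template with the five parts listed in claim 6 might be assembled as in the Python sketch below. The wording of the template, the default role, the default required elements, and the model name "model-a" are all placeholders invented for illustration.

```python
from string import Template

# The five slots mirror claim 6; the surrounding wording is an assumption.
PROMPT_TEMPLATE = Template(
    "Role: $role\n"
    "Model interface: $model_interface\n"
    "Required elements: $required_elements\n"
    "Output format: $output_format\n"
    "Document segment:\n$segment"
)


def build_prompt(
    segment,
    model_interface,
    role="You are a careful content extraction assistant.",
    required_elements=("core arguments", "supporting arguments"),
    output_format="JSON",
):
    """Fill the template for one segmented document and one model interface."""
    return PROMPT_TEMPLATE.substitute(
        role=role,
        model_interface=model_interface,
        required_elements=", ".join(required_elements),
        output_format=output_format,
        segment=segment,
    )


prompt = build_prompt("Some text costing $5.", "model-a")
```

Building a fresh prompt per segment, rather than appending to a running conversation, is one way to keep each call self-contained.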
7. The method according to claim 1, characterized in that the step of evaluating the output of each call to the large AI model in multiple dimensions includes: quantitatively evaluating the output of each call to the large AI model, based on preset weights, across multiple dimensions including accuracy, completeness, hallucination rate, and robustness.
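A weighted evaluation over the four dimensions in claim 7 could look like the Python sketch below. The weight values are hypothetical (the claim only says they are preset), and the sketch assumes all metrics lie in [0, 1] and that the hallucination rate is inverted so that a lower rate raises the score.

```python
# Hypothetical preset weights; the claim does not give concrete values.
DEFAULT_WEIGHTS = {
    "accuracy": 0.4,
    "completeness": 0.3,
    "hallucination_rate": 0.2,
    "robustness": 0.1,
}


def weighted_score(metrics, weights=DEFAULT_WEIGHTS):
    """Weighted average of the four dimensions, each assumed to lie in [0, 1]."""
    total = 0.0
    for name, weight in weights.items():
        value = metrics[name]
        if name == "hallucination_rate":
            value = 1.0 - value  # lower hallucination rate is better
        total += weight * value
    return total / sum(weights.values())


example = weighted_score(
    {"accuracy": 0.9, "completeness": 0.8, "hallucination_rate": 0.05, "robustness": 0.7}
)
```

Scores computed this way for each model call on the same segment can then be compared directly to pick the final extraction result.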
8. The method according to claim 1, characterized in that the method further includes: for each segmented document, obtaining a preset number of expert manual annotation results conforming to a three-level labeling system, wherein the Kappa coefficient of the manual annotation results is greater than 0.8, and evaluating the output of each call to the large AI model based on the manual annotation results; wherein the three-level labeling system includes core arguments, supporting arguments, and abnormal content.
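The agreement check in claim 8 can be sketched in Python under one assumption: the claim does not say which Kappa variant is used, so Cohen's kappa for two annotators is shown here. The label strings and sample annotations are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["core", "core", "support", "abnormal", "support",
               "core", "core", "support", "abnormal", "core"]
annotator_b = annotator_a[:-1] + ["support"]  # disagree on the last item only
kappa = cohen_kappa(annotator_a, annotator_b)
accepted = kappa > 0.8  # threshold stated in claim 8
```

With more than two experts, a multi-rater statistic such as Fleiss' kappa would be the natural substitute.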
9. A long document content extraction system based on dynamic segmentation and multi-model optimization, comprising a processor, a memory, and a computer program / instructions stored in the memory, characterized in that the processor is configured to execute the computer program / instructions, and when the computer program / instructions are executed, the system implements the steps of the method as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method as described in any one of claims 1 to 8.