Document chunking method, apparatus, computer device and computer readable storage medium
By preprocessing documents and identifying paragraphs, generating chunks using a chunk accumulator, and filtering based on scores, the problems of semantic integrity destruction and lack of quantitative evaluation in existing technologies are solved, thereby improving the semantic integrity of chunk content and retrieval accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHIYING MATRIX (XIONGAN) MEDICAL TECHNOLOGY CO LTD
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-12
AI Technical Summary
Existing document segmentation methods result in the destruction of semantic integrity and lack quantitative evaluation of segmentation quality, which affects retrieval accuracy.
The document content is preprocessed, paragraphs are identified, and sentences are segmented; a chunk accumulator then combines the sentences sequentially into chunks, and the chunks are filtered by their semantic integrity, information density, and search friendliness scores so that only high-quality chunks are kept.
This improves the semantic integrity of the segmented content, increases the accuracy of subsequent searches, and enables quantitative evaluation of segmentation quality.
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a document segmentation method, apparatus, computer device, and computer-readable storage medium.
Background Technology
[0002] Current document segmentation technologies typically use fixed-length blocks, for example dividing a document into blocks of 512 or 1024 tokens. However, fixed-length blocks often cut sentences or paragraphs in two, separating content such as the definition of an important concept and destroying semantic integrity, which reduces the accuracy of subsequent searches. Furthermore, current document segmentation technologies lack a quantitative evaluation of block quality.
Summary of the Invention
[0003] This application provides a document segmentation method, apparatus, computer device, computer-readable storage medium, and computer program product, which can solve at least one of the technical problems mentioned in the background art.
[0004] In view of this, in a first aspect, embodiments of this application provide a document segmentation method, including:
preprocessing the original document content to obtain the document content to be segmented;
performing sentence segmentation on the document content to be segmented to obtain a sentence sequence; and
combining all sentences in the sentence sequence sequentially to generate several blocks.
[0005] Optionally, the step of performing sentence segmentation on the document content to be segmented to obtain a sentence sequence includes:
identifying paragraphs of the document content to be segmented; and
for each paragraph, performing sentence segmentation to obtain the sentence sequence corresponding to that paragraph.
The step of using a block accumulator to sequentially divide the sentences in the sentence sequence into blocks includes:
dividing the sentence sequence corresponding to each paragraph into blocks separately.
[0006] Optionally, combining all sentences in the sentence sequence in order includes:
accumulating the sentences in the sentence sequence sequentially using a block accumulator; and
at the end of each accumulation, combining all sentences in the block accumulator into a block text;
wherein the block accumulator includes a sentence buffer and a token counter.
The step of using a block accumulator to sequentially accumulate sentences in the sentence sequence includes:
S31, initialize the block accumulator, clear the sentence buffer, and reset the token counter to zero;
S32, the block accumulator determines whether the sum of the number of tokens of all accumulated sentences and the number of tokens of the next sentence to be accumulated is greater than the preset upper limit of tokens; if not, proceed to step S33; if so, proceed to step S34;
S33, accumulate the next sentence, and then return to step S32;
S34, trigger the generation of the block text.
[0007] Optionally, after step S34, the method includes:
S35, generate the block text;
S36, extract the last N sentences of the block text;
S37, reset the block accumulator and return to step S32, wherein resetting the block accumulator includes: initializing the block accumulator and using the last N sentences as the starting content of the next block text;
steps S32 to S37 are executed repeatedly until all sentences in the sentence sequence have been processed;
S38, return all qualified chunks, each chunk including at least: chunk content, a chunk ID, and metadata corresponding to the chunk content;
S39, output a list containing all qualified chunks.
[0008] Optionally, the metadata of the chunk includes: information on the document to which the chunk content belongs, and the position of the chunk content within that document.
[0009] Optionally, after step S35 and before step S36, the method further includes performing a quality assessment on the block content. The quality assessment includes:
calculating the semantic integrity score of the block content;
calculating the information density score of the block content;
calculating the search friendliness score of the block content; and
calculating the overall quality score of the block content based on the semantic integrity score, the information density score, and the search friendliness score.
If the overall quality score is greater than or equal to the quality threshold, the corresponding block is retained; if the overall quality score is less than the quality threshold, the corresponding block is discarded.
[0010] Optionally, the metadata of the segment includes: semantic integrity score, information density score, search friendliness score, and overall quality score.
[0011] In a second aspect, embodiments of this application also provide a document segmentation apparatus, including: a module for performing the document segmentation method described above.
[0012] In a third aspect, embodiments of this application also provide a computer device, including a memory and a processor. The memory is connected to the processor and is used to store computer programs, and the processor is used to invoke the computer programs so that the computer device executes the document segmentation method of any implementation of the first aspect.
[0013] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing a computer program adapted to be loaded by a processor to execute the document segmentation method of any implementation of the first aspect.
[0014] In a fifth aspect, embodiments of this application also provide a computer program product, including a computer program that, when executed by a processor, implements the steps of the document segmentation method of any implementation of the first aspect.
[0015] The aforementioned document segmentation method, apparatus, computer device, computer-readable storage medium, and computer program product segment the document content to be segmented into a sentence sequence, and then combine all sentences in the sentence sequence sequentially to generate several blocks. Compared with existing technologies that segment according to a fixed length, this can effectively improve the semantic integrity of the block content, thereby improving the accuracy of subsequent retrieval.
Description of the Drawings
[0016] Figure 1 is a flowchart illustrating the document segmentation method of an embodiment of this application.
[0017] Figure 2 is a schematic diagram of a document segmentation apparatus according to an embodiment of this application.
[0018] Figure 3 is a schematic diagram illustrating the process of generating several blocks based on a sentence sequence in an embodiment of this application.
[0019] Figure 4 is a schematic diagram of a computer device according to an embodiment of this application.
Detailed Implementation
[0020] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0021] Referring to Figure 1 and Figure 2, this application discloses a document segmentation method.
[0022] The document segmentation method includes: S1, preprocess the original document content to obtain the document content to be segmented.
[0023] In some implementations, preprocessing the original document content includes removing headers and footers and cleaning up page numbers. After this preprocessing, the resulting document content is easier to segment into sentences.
[0024] It should be noted that page numbers differ between documents: they may be absent, or they may appear only in the header or footer. Therefore, in addition to header and footer removal, a page number cleanup operation should also be performed, which can be done directly with an existing model.
[0025] In addition, preprocessing can also include removing duplicate content. Highly repetitive content may occur within a single document and also across different documents. To avoid duplicate content in the database, it can be removed during preprocessing, again directly with an existing model.
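As an illustrative sketch only, the preprocessing of step S1 could look like the following. The application leaves the concrete technique to existing models, so everything here is an assumption: the input is taken to be a list of page strings with a fixed number of header and footer lines, and the function name `preprocess_pages`, the page-number pattern, and the exact-match duplicate rule are hypothetical.

```python
import re

# Hypothetical pattern: a line holding only a page number, e.g. "12", "- 12 -", "Page 12".
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:-\s*)?(?:page\s+)?\d+(?:\s*-)?\s*$", re.IGNORECASE
)


def preprocess_pages(pages, header_lines=1, footer_lines=1):
    """Strip headers, footers, page-number lines and exact duplicate lines."""
    seen = set()
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        # Assumption: each page starts/ends with a fixed number of header/footer lines.
        body = lines[header_lines:len(lines) - footer_lines]
        for line in body:
            text = line.strip()
            if PAGE_NUMBER_LINE.match(text):
                continue  # page numbers can also sit outside the header/footer
            if text:
                if text in seen:
                    continue  # exact duplicates only; near-duplicates would need a model
                seen.add(text)
            cleaned.append(text)  # blank lines are kept as paragraph breaks
    return "\n".join(cleaned)


pages = [
    "Journal Header\nDiabetes is chronic.\n12\nFooter",
    "Journal Header\nDiabetes is chronic.\nIt has complications.\n- 13 -",
]
content_to_segment = preprocess_pages(pages)
```

In this toy input the repeated sentence on the second page is dropped and the bare page number inside the body of the first page is cleaned up separately from the header and footer.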
[0026] S2, perform sentence segmentation on the content of the document to be segmented to obtain a sentence sequence.
[0027] In some implementations, performing sentence segmentation on the content of the document to be segmented to obtain a sentence sequence includes using Natural Language Processing (NLP) techniques to segment the content of the document into sentences. Of course, this is not a limitation.
[0028] For segmentation to be correct, a sentence in the embodiments of this application refers to a complete sentence, that is, one that ends with a boundary marker (。, ?, !, etc.).
[0029] In some implementations, performing sentence segmentation on the content of the document to be segmented to obtain a sentence sequence includes:
identifying paragraphs in the document content to be segmented; and
performing sentence segmentation on each paragraph to obtain the sentence sequence corresponding to that paragraph.
Dividing the sentence sequence into blocks sequentially using a block accumulator includes:
dividing the sentence sequence corresponding to each paragraph into blocks separately.
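A minimal sketch of boundary-marker sentence segmentation is shown below. It is a rule-based stand-in for the NLP techniques mentioned above, not the method of the application; the function name `split_sentences` and the decision to treat a Western full stop as a boundary only when it is followed by whitespace or the end of the text (so that decimals such as 0.92 stay intact) are assumptions.

```python
import re

# A sentence runs up to a run of 。？！?! marks, or to a "." followed by
# whitespace or end of text, or to the end of the text if no mark is found.
_SENTENCE = re.compile(r".+?(?:[。？！?!]+|\.(?=\s|$)|$)", re.DOTALL)


def split_sentences(text):
    """Return the sentences of `text`, each keeping its boundary marker."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


sentences = split_sentences("Score is 0.92. Is it good? Yes!")
```

A trailing fragment without any boundary marker is still returned as the last item, so a caller can decide whether to treat it as an incomplete sentence.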
[0030] Paragraphs are identified first, the sentences within each paragraph are then segmented into a sentence sequence, and blocks are generated from each paragraph's own sentence sequence. Because the paragraph structure is not ignored, the content of a block does not cross paragraph boundaries, which further improves semantic integrity.
[0031] S3, combine all sentences in the sentence sequence sequentially to generate several blocks.
[0032] In some implementations, combining all sentences in the sentence sequence sequentially includes:
accumulating the sentences in the sentence sequence sequentially using a block accumulator; and
at the end of each accumulation, combining all sentences in the block accumulator into a block text;
wherein the block accumulator includes a sentence buffer and a token counter.
Using the block accumulator to accumulate the sentences in the sentence sequence sequentially includes:
S31, initialize the block accumulator, clear the sentence buffer, and reset the token counter to zero.
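The paragraph-first ordering can be sketched as follows. The application does not say how paragraphs are recognized, so the blank-line rule in `identify_paragraphs`, the placeholder splitter `naive_split`, and the 1-based paragraph numbering are assumptions made for illustration.

```python
import re


def identify_paragraphs(text):
    """Assumption: paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def naive_split(paragraph):
    """Placeholder sentence splitter; any real splitter can be passed instead."""
    return [s.strip() for s in re.findall(r"[^。？！?!.]+[。？！?!.]+", paragraph)]


def sentences_by_paragraph(text, split_sentences=naive_split):
    """Map each paragraph number (from 1) to that paragraph's sentence sequence."""
    return {
        number: split_sentences(paragraph)
        for number, paragraph in enumerate(identify_paragraphs(text), start=1)
    }


sequences = sentences_by_paragraph("A one. B two.\n\nC three?")
```

Because each paragraph yields its own sentence sequence, chunking each sequence separately guarantees that no chunk spans two paragraphs.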
[0033] S32, the block accumulator determines whether the sum of the number of tokens of all accumulated sentences (starting from 0) and the number of tokens of the next sentence to be accumulated is greater than the preset upper limit of tokens (for example, it can be set to 1000 tokens). If the result is no, proceed to step S33; if the result is yes, proceed to step S34.
[0034] S33, accumulate the next sentence, and then return to step S32.
[0035] S34, trigger the generation of chunked text.
[0036] It should be noted that the generation of chunked text can be triggered either after first accumulating the next sentence, or without including the next sentence in the chunk being generated.
[0037] Specifically, after step S34, the method includes: S35, generate chunked text. Each chunked text can be named based on its unique ID, for example, chunk 1_text, chunk 2_text.
[0038] S36, extract the last N sentences of the segmented text.
[0039] N is a small number; specifically, N is 2.
[0040] S37, reset the chunk accumulator and return to step S32, wherein resetting the chunk accumulator includes: initializing the chunk accumulator and using the last N sentences as the starting content of the next chunk text. As a result of this step, the end of one chunk and the beginning of the next chunk contain the same sentences.
[0041] Repeat steps S32 to S37 until all sentences in the sentence sequence have been processed.
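Steps S31 to S37 can be sketched as the loop below. This is an illustration under stated assumptions, not the definitive implementation: `count_tokens` is a whitespace stand-in for a real tokenizer, the chunk is generated without the sentence that would exceed the limit (one of the two variants the description allows), and dropping the overlap when it leaves no room for the next sentence is a choice made here, not something the application specifies.

```python
from dataclasses import dataclass, field


def count_tokens(sentence):
    """Stand-in tokenizer (assumption): whitespace-separated words."""
    return len(sentence.split())


@dataclass
class ChunkAccumulator:
    buffer: list = field(default_factory=list)  # sentence buffer
    tokens: int = 0                             # token counter

    def reset(self, carry=()):
        # S31 / S37: clear the state, optionally seeding it with overlap sentences.
        self.buffer = list(carry)
        self.tokens = sum(count_tokens(s) for s in self.buffer)

    def add(self, sentence):
        # S33: accumulate one sentence.
        self.buffer.append(sentence)
        self.tokens += count_tokens(sentence)


def chunk_sentences(sentences, max_tokens=1000, overlap=2):
    acc = ChunkAccumulator()
    acc.reset()                                       # S31
    chunks = []
    for sentence in sentences:
        # S32: would the next sentence push the total past the limit?
        if acc.buffer and acc.tokens + count_tokens(sentence) > max_tokens:
            chunks.append(" ".join(acc.buffer))       # S34 / S35
            carry = acc.buffer[-overlap:] if overlap else []  # S36
            acc.reset(carry)                          # S37
            if acc.tokens + count_tokens(sentence) > max_tokens:
                acc.reset()  # assumption: drop the overlap if it leaves no room
        acc.add(sentence)                             # S33
    if acc.buffer:
        chunks.append(" ".join(acc.buffer))           # flush the final chunk
    return chunks


chunks = chunk_sentences(["a b", "c d", "e f", "g h", "i j"], max_tokens=6, overlap=1)
```

With a limit of 6 tokens and an overlap of one sentence, the sentence "e f" ends the first chunk and starts the second, which is the repetition described for step S37.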
[0042] S38, return all qualified chunks, each chunk including at least: chunk content, a chunk ID, and metadata corresponding to the chunk content.
[0043] It should be noted that the definition of "qualified" can be determined based on the actual situation. If no other screening is performed, all generated blocks can be considered as qualified blocks.
[0044] Specifically, the metadata includes: document information to which the chunk of content belongs and its location information within the corresponding document.
[0045] More specifically, location information in a document can include chapter information and starting page number information.
[0046] Specifically, when paragraphs are identified, a mapping can be established directly between each paragraph and its chapter information and starting page number. The chapter information and starting page number in the metadata of each block within a paragraph can then be taken directly from those of that paragraph, but this is not a limitation.
[0047] Furthermore, the location information of the segmented content within the document can also include the paragraph information to which it belongs. Each identified paragraph can be numbered beforehand.
[0048] More specifically, metadata may also include: the starting byte position of the chunk content and / or the number of tokens.
[0049] S39, output a list containing all qualified blocks.
[0050] Specifically, the content, ID, and metadata of each chunk can be located in different columns of a single row in the list. Of course, this is not a limitation.
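One possible shape for a chunk record and the output list is sketched below. The class names, the `to_rows` helper, and the ID format `"10001-1"` are hypothetical; the metadata fields mirror those named in the description (document, chapter, starting page, paragraph, starting byte position, token count).

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ChunkMetadata:
    document_id: str                     # document the chunk content belongs to
    chapter: str                         # chapter information
    page_start: int                      # starting page number
    paragraph_no: Optional[int] = None   # optional paragraph number
    byte_offset: Optional[int] = None    # optional starting byte position
    token_count: Optional[int] = None    # optional number of tokens


@dataclass
class Chunk:
    chunk_id: str
    content: str
    metadata: ChunkMetadata


def to_rows(chunks):
    """One row per chunk: ID, content and metadata in separate columns."""
    return [(c.chunk_id, c.content, asdict(c.metadata)) for c in chunks]


example = Chunk(
    chunk_id="10001-1",  # hypothetical ID format
    content="Diabetes mellitus is a metabolic disease.",
    metadata=ChunkMetadata("10001", "Introduction", 1, token_count=7),
)
rows = to_rows([example])
```

Keeping the metadata in its own column lets a retrieval layer filter on chapter or page without parsing the chunk content.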
[0051] Specifically, after step S35 and before step S36, the method further includes performing a quality assessment on the chunk content. The quality assessment includes:
calculating the semantic integrity score of the chunk content;
calculating the information density score of the chunk content;
calculating the search friendliness score of the chunk content; and
calculating the overall quality score of the chunk content based on the semantic integrity score, the information density score, and the search friendliness score.
If the overall quality score is greater than or equal to the quality threshold (for example, 0.2), the corresponding chunk is retained; if the overall quality score is less than the quality threshold, the corresponding chunk is discarded.
[0052] Based on the above technical means, both the quality of each chunk's content in each individual dimension and its overall quality can be quantified.
[0053] In a specific example, the metadata for a chunk can include at least some of the following: overall quality score, semantic integrity score, information density score, and search friendliness score. This allows users to easily track the quality of the chunked content during subsequent searches.
[0054] In specific examples, semantic integrity can be assessed in terms of sentence integrity and paragraph integrity, using existing models (which can be appropriately trained). Sentence integrity refers to the proportion of complete sentences in a document chunk. Punctuation may be lost during text extraction, and is sometimes already missing in photocopied documents, which harms the semantic integrity of a chunk; the proportion of complete sentences in a chunk can therefore serve as one factor in evaluating semantic integrity. In addition, paragraphs that were poorly separated and run together also harm semantic integrity, so paragraph integrity can serve as another factor.
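The sentence-integrity factor can be illustrated with a simple ratio. The description relies on existing (possibly trained) models, so this heuristic is only a stand-in: the function name `sentence_integrity` and the rule that a sentence counts as complete when it ends with one of the listed boundary marks are assumptions.

```python
BOUNDARY_MARKS = tuple("。？！?!.")


def sentence_integrity(sentences):
    """Share of sentences that end with a boundary marker, in [0, 1]."""
    if not sentences:
        return 0.0
    complete = sum(1 for s in sentences if s.rstrip().endswith(BOUNDARY_MARKS))
    return complete / len(sentences)


ratio = sentence_integrity(["A.", "B", "C?", "D!"])
```

Here one of the four items has lost its punctuation, so three quarters of the chunk counts as complete sentences.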
[0055] When scoring semantic integrity, a score between 0 and 1 can be assigned based on the level of integrity.
[0056] The assessment of information density is known to those skilled in the art and can be performed directly using existing models.
[0057] In specific examples, information density can be assessed from the proportion of effective words and redundancy, and can be scored between 0 and 1 according to the level of information density.
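As a rough illustration of the two factors named above, the sketch below multiplies the share of effective (non-stopword) tokens by the share of distinct tokens among them. The description does not say how the factors are measured or combined, so the function name `information_density`, the stopword set, and the multiplication are all assumptions.

```python
def information_density(tokens, stopwords):
    """Heuristic score in [0, 1]: effective-word share times distinct-word share."""
    if not tokens:
        return 0.0
    effective = [t.lower() for t in tokens if t.lower() not in stopwords]
    if not effective:
        return 0.0
    effective_ratio = len(effective) / len(tokens)         # proportion of effective words
    distinct_ratio = len(set(effective)) / len(effective)  # 1.0 means no redundancy
    return effective_ratio * distinct_ratio


density = information_density(
    ["the", "kidney", "the", "kidney", "fails"], stopwords={"the"}
)
```

In this toy input three of five tokens are effective and two of those three are distinct, so the score is about 0.4.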
[0058] Search friendliness can be evaluated in terms of keyword density and comprehensibility, and scored between 0 and 1 according to the level of search friendliness. Comprehensibility considers whether a document chunk has an appropriate length, such as 300-800 tokens: chunks within this range receive a higher score, and shorter chunks receive a lower score.
[0059] For example, the overall quality score of the chunk content can be obtained from the semantic integrity score, the information density score, and the search friendliness score using the following formula:
overall = 0.4*semantic + 0.3*density + 0.3*retrieval
where semantic, density, and retrieval are the semantic integrity score, the information density score, and the search friendliness score, respectively.
[0060] It should be noted that, in the specific example, when the overall quality score is less than the quality threshold and the corresponding block is discarded, the last N sentences of that block's text are still extracted as the overlap and used as the starting sentences of the next block's content.
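The length-based comprehensibility factor might be scored as in the sketch below. Only the 300-800 token range and the lower score for shorter chunks come from the description; the function name `length_score`, the linear falloff, and the treatment of chunks above the range are assumptions.

```python
def length_score(token_count, low=300, high=800):
    """Score in [0, 1]: 1.0 inside [low, high], decreasing outside it."""
    if token_count <= 0:
        return 0.0
    if token_count < low:
        return token_count / low       # shorter chunks score lower
    if token_count > high:
        return high / token_count      # assumption: overly long chunks also score lower
    return 1.0


scores = [length_score(n) for n in (150, 550, 1600)]
```

A search friendliness score would combine a factor like this with keyword density, which is not sketched here.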
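The weighted sum and the threshold check can be worked through in a few lines. The weights and the example threshold of 0.2 come from the description, and the three input scores are those listed for Chunk #1 in the example blocks; the function name `overall_quality` is hypothetical.

```python
QUALITY_THRESHOLD = 0.2  # example threshold given in the description


def overall_quality(semantic, density, retrieval):
    """overall = 0.4*semantic + 0.3*density + 0.3*retrieval"""
    return 0.4 * semantic + 0.3 * density + 0.3 * retrieval


score = overall_quality(semantic=0.92, density=0.85, retrieval=0.88)
retained = score >= QUALITY_THRESHOLD
print(round(score, 2), retained)  # 0.89 True
```

The result, 0.368 + 0.255 + 0.264 = 0.887, rounds to the quality_score of 0.89 reported for that chunk.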
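The interaction between the quality gate and the overlap can be sketched as follows: a chunk that fails the threshold is dropped from the output, yet its last N sentences still seed the next chunk. The function name `chunk_with_quality_gate` and the injected `score_chunk` and `count_tokens` callables are hypothetical placeholders for whatever scoring model and tokenizer are actually used.

```python
def chunk_with_quality_gate(sentences, score_chunk, count_tokens,
                            max_tokens=1000, overlap=2, threshold=0.2):
    kept = []
    buffer, tokens = [], 0

    def finish(sentences_in_chunk):
        # Quality assessment between S35 and S36: keep only chunks at or above the threshold.
        text = " ".join(sentences_in_chunk)
        score = score_chunk(text)
        if score >= threshold:
            kept.append({"content": text, "quality_score": score})

    for sentence in sentences:
        n = count_tokens(sentence)
        if buffer and tokens + n > max_tokens:
            finish(buffer)
            # The overlap is extracted whether or not the chunk was kept.
            buffer = buffer[-overlap:] if overlap else []
            tokens = sum(count_tokens(s) for s in buffer)
        buffer.append(sentence)
        tokens += n
    if buffer:
        finish(buffer)
    return kept


kept = chunk_with_quality_gate(
    ["a b", "c d", "e f", "g h"],
    score_chunk=lambda text: 0.1 if text.startswith("a") else 0.9,  # toy scorer
    count_tokens=lambda s: len(s.split()),
    max_tokens=4,
    overlap=1,
)
```

In the toy run the first chunk ("a b c d") is discarded, but its last sentence "c d" still opens the next chunk, so no sentence context is lost at the boundary.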
[0061] To better understand the embodiments of this application, the following are examples of generated blocks:
Chunk #1 (document_id: 10001, chunk_seq: 1):
Content: Diabetes mellitus is a metabolic disease characterized by chronic hyperglycemia. Long-term hyperglycemia can lead to various chronic complications, severely impacting patients' quality of life and prognosis. Diabetic complications can be broadly classified into microvascular complications and macrovascular complications. Microvascular complications mainly include diabetic nephropathy, diabetic retinopathy, and diabetic neuropathy...
Chapter: Introduction
Page numbers: page_start=1, page_end=1
Number of Tokens: 856
Quality assessment:
- semantic_completeness: 0.92 (semantically complete, contains complete paragraphs)
- information_density: 0.85 (high information density)
- retrieval_friendliness: 0.88 (rich in keywords)
- quality_score: 0.89 (overall quality is excellent)
Chunk #2 (chunk_seq: 2):
Content: Diabetic nephropathy (DN) is one of the most serious microvascular complications of diabetes and a leading cause of end-stage renal disease. Pathological features of DN include thickening of the glomerular basement membrane, mesangial dilation, and glomerular sclerosis. Its pathogenesis involves multiple aspects: 1) Metabolic abnormalities: Hyperglycemia leads to activation of the polyol pathway...
Chapter: Diabetic Nephropathy
Page numbers: page_start=5, page_end=6
Number of Tokens: 982
Quality assessment:
- quality_score: 0.91
In this embodiment, the document content to be segmented is divided into sentences to obtain a sentence sequence. Then, all sentences in the sentence sequence are combined sequentially to generate several blocks. Compared with the prior art of segmenting according to a fixed length, this method can effectively improve the semantic integrity of the segmented content, thereby improving the accuracy of subsequent retrieval.
[0062] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0063] Based on the same inventive concept, this application also provides a document segmentation apparatus for implementing the document segmentation method described above. The solution provided by this apparatus is similar to the implementation described for the method; therefore, for the specific limitations of the document segmentation apparatus embodiments provided below, refer to the limitations of the document segmentation method above, which are not repeated here.
[0064] Referring to Figure 3, the document segmentation apparatus of this application embodiment includes:
a preprocessing module 201, used to preprocess the original document content to obtain the document content to be segmented;
a segmentation module 202, used to segment the content of the document to be segmented into sentences to obtain a sentence sequence; and
a combination and generation module 203, used to combine all the sentences in the sentence sequence sequentially to generate several blocks.
[0065] In this embodiment, the document content to be segmented is divided into sentences to obtain a sentence sequence. Then, all sentences in the sentence sequence are combined sequentially to generate several blocks. Compared with the prior art of segmenting according to a fixed length, this method can effectively improve the semantic integrity of the segmented content, thereby improving the accuracy of subsequent retrieval.
[0066] In this application embodiment, the term "module" refers to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
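When the modules are realized in software, the apparatus can be pictured as three callables wired in sequence. This is a hypothetical arrangement for illustration: the class name `DocumentSegmentationApparatus`, the method `segment`, and the toy callables passed in are not taken from the application.

```python
class DocumentSegmentationApparatus:
    """Wires together stand-ins for modules 201, 202 and 203."""

    def __init__(self, preprocess, split_sentences, combine):
        self.preprocess = preprocess            # preprocessing module 201
        self.split_sentences = split_sentences  # segmentation module 202
        self.combine = combine                  # combination and generation module 203

    def segment(self, raw_document):
        content = self.preprocess(raw_document)
        sentences = self.split_sentences(content)
        return self.combine(sentences)


apparatus = DocumentSegmentationApparatus(
    preprocess=str.strip,
    split_sentences=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
    combine=lambda sentences: [" ".join(sentences)],
)
blocks = apparatus.segment("  A. B.  ")
```

Injecting the three stages as callables keeps each module replaceable, for example swapping the toy splitter for an NLP-based one.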
[0067] Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of this application. As shown in Figure 4, the computer device may include a processor 601 and a memory 602. The memory 602 is connected to the processor 601 and is used to store computer programs. The processor 601 is used to invoke the computer programs to cause the computer device to execute the document segmentation method described in the above embodiments. Furthermore, the computer device may also include at least one communication bus 603, which is used to implement communication between components. The memory 602 may be a high-speed RAM or a non-volatile memory, such as at least one disk storage device.
[0068] This application also provides a computer-readable storage medium storing a computer program adapted to be loaded by a processor to execute the document segmentation method described in the above embodiments.
[0069] This application also provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the document segmentation method described in the above embodiments.
[0070] It should be understood that, in the embodiments of this application, the processor may be a central processing unit (CPU), but it may also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
[0071] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by hardware related to computer program instructions. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0072] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0073] The above-disclosed examples are merely preferred embodiments of this application and should not be construed as limiting the scope of this application. Therefore, any equivalent variations made in accordance with the claims of this application shall fall within the scope of this application.
Claims
1. A document segmentation method, characterized in that it includes:
preprocessing the original document content to obtain the document content to be segmented;
performing sentence segmentation on the document content to be segmented to obtain a sentence sequence; and
combining all sentences in the sentence sequence sequentially to generate several blocks.
2. The document segmentation method according to claim 1, characterized in that the step of segmenting the document content to be segmented into sentences to obtain a sentence sequence includes:
identifying paragraphs of the document content to be segmented; and
for each paragraph, performing sentence segmentation to obtain the sentence sequence corresponding to that paragraph;
and the step of using a block accumulator to sequentially divide the sentences in the sentence sequence into blocks includes:
dividing the sentence sequence corresponding to each paragraph into blocks separately.
3. The document segmentation method according to claim 1 or 2, characterized in that the step of combining all sentences in the sentence sequence in order includes:
accumulating the sentences in the sentence sequence sequentially using a block accumulator; and
at the end of each accumulation, combining all sentences in the block accumulator into a block text;
wherein the block accumulator includes a sentence buffer and a token counter;
and the step of using a block accumulator to sequentially accumulate sentences in the sentence sequence includes:
S31, initialize the block accumulator, clear the sentence buffer, and reset the token counter to zero;
S32, the block accumulator determines whether the sum of the number of tokens of all accumulated sentences and the number of tokens of the next sentence to be accumulated is greater than the preset upper limit of tokens; if not, proceed to step S33; if so, proceed to step S34;
S33, accumulate the next sentence, and then return to step S32;
S34, trigger the generation of the block text.
4. The document segmentation method according to claim 3, characterized in that, after step S34, the method includes:
S35, generate the block text;
S36, extract the last N sentences of the block text;
S37, reset the block accumulator and return to step S32, wherein resetting the block accumulator includes: initializing the block accumulator and using the last N sentences as the starting content of the next block text;
steps S32 to S37 are executed repeatedly until all sentences in the sentence sequence have been processed;
S38, return all qualified chunks, each of which includes at least: chunk content, a chunk ID, and metadata corresponding to the chunk content;
S39, output a list containing all qualified chunks.
5. The document segmentation method according to claim 4, characterized in that the metadata of the chunk includes: information on the document to which the chunk content belongs, and the position of the chunk content within that document.
6. The document segmentation method according to claim 4, characterized in that, after step S35 and before step S36, the method further includes performing a quality assessment on the block content, the quality assessment including:
calculating the semantic integrity score of the block content;
calculating the information density score of the block content;
calculating the search friendliness score of the block content; and
calculating the overall quality score of the block content based on the semantic integrity score, the information density score, and the search friendliness score;
wherein, if the overall quality score is greater than or equal to the quality threshold, the corresponding block is retained, and if the overall quality score is less than the quality threshold, the corresponding block is discarded.
7. The document segmentation method according to claim 6, characterized in that the metadata of the block includes: the semantic integrity score, the information density score, the search friendliness score, and the overall quality score.
8. A document segmentation apparatus, characterized in that it includes: a module for performing the document segmentation method according to any one of claims 1 to 7.
9. A computer device, characterized in that it includes a memory and a processor, wherein the memory is connected to the processor, the memory is used to store computer programs, and the processor is used to invoke the computer programs so that the computer device executes the document segmentation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded by a processor to execute the document segmentation method according to any one of claims 1 to 7.