A text segmentation method, device and medium based on document structure embedding

By constructing a document structure tree and embedding title paths to process text segments, the problem of lost document hierarchical structure information is solved, achieving high-precision semantic matching and fast location of text segments, thus improving the accuracy of retrieval and answer generation.

CN122197878A · Pending · Publication Date: 2026-06-12 · INSPUR INTELLIGENT NUMBER (TIANJIN) DIGITAL TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR INTELLIGENT NUMBER (TIANJIN) DIGITAL TECHNOLOGY CO LTD
Filing Date
2026-05-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies fail to effectively preserve the hierarchical structure information of documents during the document segmentation process, resulting in vectorized text segments that cannot fully reflect the main semantics of the text, affecting retrieval accuracy and the accuracy of generated answers.

Method used

By performing layout analysis on the document, constructing a document structure tree, extracting and embedding title paths as context prefixes for text segments, and vectorizing them, the data is stored in the database to ensure that the text segments carry clear subject-specific identifiers.

Benefits of technology

It improves the semantic matching accuracy of text segments and the topic matching accuracy during retrieval, and can quickly locate and trace the position of text segments in the original document, thereby improving the accuracy and completeness of the answers generated by the large language model.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application belongs to the technical field of natural language processing and information retrieval, and particularly relates to a text segmentation method and device based on document structure embedding and a medium. A document structure tree is constructed through layout analysis, format text extraction and structure recognition. The text is segmented based on the tree to obtain initial text segments. The position of each text segment in the structure tree is traced back to a title path, and the title path is embedded as an indivisible context prefix at the beginning of the text segment. The text segment with the embedded structure is vectorized and stored. The method improves the consistency of the retrieved segments and the query intent in the RAG system by explicitly embedding the title path, provides an input rich in logical context for the LLM, and thus enhances the accuracy and coherence of the generated answers.

Description

Technical Field

[0001] This invention belongs to the field of natural language processing and information retrieval technology, specifically relating to a text segmentation method, device, and medium based on document structure embedding.

Background Technology

[0002] Retrieval-Augmented Generation (RAG) systems retrieve information relevant to a user query from an external knowledge base and supply it as context to a Large Language Model (LLM), so that the model generates more accurate and reliable answers. One of the main steps is to segment and vectorize the text of a document repository to build a searchable vector database.

[0003] Text segmentation in the related art commonly uses a fixed-length sliding window. When applied to the experimental sections of an academic paper, this approach can cut a complete description of an experimental step in two, leaving the first half in one segment and the second half in the next, so that both segments lose their complete semantic context. During retrieval, even if one of the segments is matched, the complete experimental information cannot be obtained, which impairs the understanding of and response to user queries.

[0004] Most vectorization techniques in related technologies process the original text segments directly without incorporating the document's hierarchical structure. However, document titles, chapters, and other structural information are crucial representations of the text's theme. For example, the text under the introduction in Chapter 1 and the text under the experimental results in Chapter 3, even with similar vocabulary, have completely different themes. The lack of this structural information prevents vectors from fully reflecting the text's main semantics, thus reducing semantic matching accuracy during retrieval and impacting the overall system performance.

[0005] When text segments are used to generate answers, it is often necessary to verify the reliability of a segment's source or to trace the complete topic background to which the segment belongs. For example, if a user questions the accuracy of a conclusion in a text segment and needs to view the original context for verification, the only option is to search through the complete document piece by piece, which increases the difficulty of data verification and background lookup.

Summary of the Invention

[0006] This invention provides a text segmentation method based on document structure embedding, which improves the retrieval quality of vectorized text segments in a retrieval-augmented generation system, provides large language models with information fragments rich in structural context, and improves the quality of generated answers.

[0007] The method includes:
S1: Perform layout analysis on the original document to obtain the physical layout information of the original document.
S2: Based on the physical layout information, extract the plain text content from the main text area and retain the format information corresponding to the plain text content to obtain a formatted text sequence.
S3: Based on the formatted text sequence, identify the level and type of each line of text, and construct a document structure tree that includes the parent-child relationships between titles at each level and the relationships between titles and body paragraphs.
S4: Based on the document structure tree, extract the main text and perform semantic segmentation on it to obtain preliminary text segments, each centered on a sub-topic.
S5: Based on the document structure tree obtained in step S3 and the preliminary text segments obtained in step S4, locate the position of each preliminary text segment in the document structure tree, trace back the titles at each level to which the segment belongs and concatenate them into a title path, and add the title path as a context prefix to the beginning of the corresponding preliminary text segment to obtain a text segment with an embedded title path.
S6: Vectorize each text segment with an embedded title path to obtain the corresponding text segment vector.
S7: Store each text segment vector and the corresponding text segment with the embedded title path as a record in the database, and also store the title path as metadata associated with the record.

[0008] According to another embodiment of this application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of a text segmentation method based on document structure embedding.

[0009] According to yet another embodiment of this application, a storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of a text segmentation method based on document structure embedding.

[0010] As can be seen from the above technical solutions, the present invention has the following advantages: The method extracts the main text stream in original document order by traversing the document structure tree. Sentence units are split at natural punctuation marks such as periods and question marks, and are then aggregated into basic text blocks using the line breaks of the original document. The overlap ratio of keywords in adjacent basic text blocks is analyzed to determine thematic relevance and to identify segmentation points, which preserves the semantic context and avoids topic fragmentation. A text segment is located in the structure tree by matching its first sentence unit against the text content of the main text nodes. All title nodes are then collected by following the parent links upward, reordered, and concatenated into a title path, which is embedded in the text segment as a context prefix. The text segment thus carries a clear topic affiliation identifier that distinguishes similar content from different chapters and allows segments to be matched quickly at the corresponding topic level during retrieval, improving topic matching accuracy.

[0011] This invention uses text segments embedded with title paths as vectorized input. This allows the pre-trained embedding model to capture the structural context information contained in the title paths while extracting semantic features from the text content. The generated vectors can therefore represent the main semantics of the text, effectively distinguishing similar content at different topic levels during retrieval and improving semantic matching accuracy. By using structure tree localization and title path association, a mapping relationship is established between text segments and the hierarchical structure of the original document. Each text segment's title path corresponds to a real chapter level in the original document, allowing its specific position within the document to be traced quickly. This improves the traceability of text data and eliminates the need to search the entire document piece by piece.

Description of the Drawings

[0012] To illustrate the technical solution of the present invention more clearly, the accompanying drawings used in the description are briefly introduced below. Obviously, the accompanying drawings described below show only some embodiments of the present invention. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0013] Figure 1 is a flowchart of a text segmentation method based on document structure embedding; Figure 2 is a sequence diagram of the text segmentation method based on document structure embedding; Figure 3 is a schematic diagram of an electronic device.

Detailed Implementation

[0014] The document structure embedding-based text segmentation method provided by this invention deeply couples the logical structure recognition of a document with the semantic segmentation process and introduces title path embedding. First, the method parses the document through layout analysis combined with rules and/or machine learning models to construct a document structure tree reflecting the relationships between chapters, sections, subsections, and so on. Second, after semantic segmentation is performed using measures such as semantic similarity, the segmented content is not used directly. Instead, each preliminary text segment is located in the structure tree, and all its ancestor titles are concatenated into a complete title path, which is embedded at the beginning of the text segment as an indivisible prefix. Finally, the text segments carrying this context prefix are vectorized and stored in a vector database.

[0015] This invention, by explicitly embedding title paths, endows each text segment with accurate and rich macro-theme and contextual information. This results in a final vector representation that not only includes the micro-semantics of the text segment itself but also strengthens its position and attribution signals within the entire document. When searching in a RAG system, this vector can more accurately match the user's true query intent, effectively avoiding the retrieval of fragments with incomplete information or off-topic content, thereby improving the accuracy, completeness, and coherence of the answers generated by the Large Language Model (LLM). This invention is applicable to knowledge base construction and intelligent information retrieval for complex documents such as technical manuals, academic papers, and legal documents.

[0016] The following describes in detail the text segmentation method based on document structure embedding involved in this application. Specific details such as particular system architectures and techniques are presented for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application can also be implemented in other embodiments without these specific details.

[0017] It should be understood that, when used in this specification, the term "comprising" indicates the presence of the described feature, integer, step, operation, element, and/or component, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof. The terms "comprising," "including," "having," and variations thereof all mean "including but not limited to," unless otherwise specifically emphasized.

[0018] The terms "one embodiment" or "some embodiments" used in this application mean that one or more embodiments of this application include the specific features, structures, or characteristics described in that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this application do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized.

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] Figure 1 and Figure 2 show the flow of a text segmentation method based on document structure embedding in a specific embodiment. The method includes: S1: Perform layout analysis on the original document, dividing the original document into areas corresponding to the main text area, header, footer, tables, charts, and table of contents, to obtain the physical layout information of the original document.

[0021] In some embodiments, based on the underlying data storage characteristics of different types of documents, the original document is transformed into analyzable page element information. By utilizing the positional patterns, characteristics, and content attribute differences of elements in each region, the regions are divided, and the spatial distribution of each region within the page is clarified.

[0022] S1 specifically includes the following steps: S11: Determine the type of the original document. If it is a scanned PDF image document, convert the content of each page into a two-dimensional pixel array through optical scanning, with each pixel corresponding to a grayscale value or RGB value, and record the page size parameters to completely preserve the image shape and positional relationship of all elements on the page. If it is a DOCX editable document, parse the XML data inside the file, read the format description tags that mark text, paragraphs, and tables, extract the position attributes corresponding to each tag, and obtain the horizontal and vertical coordinate values of the element on the page.

[0023] As can be seen, this embodiment selects an appropriate parsing method based on the storage characteristics of different documents. Image documents use pixel matrices to carry the shape and position of elements, while editable documents directly record element information using labels and coordinates. The basic data required for element positioning is obtained through targeted parsing.

[0024] S12: Process the basic data obtained in S11. If it is the pixel matrix of a scanned PDF, find the edges of each independent element in the page through image contour detection, calculate the bounding box of each element from the coordinate range of the edge pixels, and use the coordinates of the upper left and lower right corners of the bounding box as the position information of the element. If it is the format description tags and coordinate data of a DOCX document, directly organize the coordinate values corresponding to each element into a list of element positions. Then compare the element positions and content across multiple pages of the document. Within a fixed height range at the top of the page, find the set of elements that repeats on 2 to 4 or more consecutive pages, merge the bounding boxes of the repeating elements, and mark the result as the header area. Within a fixed height range at the bottom of the page, identify the set of elements using the same repetition criterion, merge their bounding boxes, and mark the result as the footer area. Record the complete coordinate range of both areas.
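The multi-page repetition test in S12 can be sketched as follows. This is only an illustration under stated assumptions: the `Element` type, the function names, the exact-text matching rule, and the default of two consecutive pages are hypothetical choices, not details given in the application. Coordinates are assumed to grow downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    # Hypothetical unified element record produced by S11/S12.
    page: int
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def find_repeated_band_elements(elements, band_top, band_bottom, min_run=2):
    """Return texts found inside the vertical band on >= min_run consecutive pages."""
    pages_by_text = {}
    for el in elements:
        if el.y0 >= band_top and el.y1 <= band_bottom:
            pages_by_text.setdefault(el.text, set()).add(el.page)

    repeated = set()
    for text, pages in pages_by_text.items():
        best = run = 0
        prev = None
        for page in sorted(pages):
            # Extend the run only when the page directly follows the previous one.
            run = run + 1 if prev is not None and page == prev + 1 else 1
            best = max(best, run)
            prev = page
        if best >= min_run:
            repeated.add(text)
    return repeated


def merge_bboxes(elements):
    """Merge bounding boxes into one enclosing (x0, y0, x1, y1) box."""
    return (
        min(e.x0 for e in elements),
        min(e.y0 for e in elements),
        max(e.x1 for e in elements),
        max(e.y1 for e in elements),
    )
```

Exact text matching would miss headers that embed a changing page number; a real implementation would likely compare positions and normalized content instead.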

[0025] As can be seen, the basic data of different types of documents is transformed into unified element position information to eliminate data format differences; then, by utilizing the inherent characteristics of fixed header and footer positions and repetition across multiple pages, the set of elements that meet the characteristics is filtered out through multi-page comparison to complete the area marking.

[0026] S13: Select the remaining elements that are not marked as headers or footers, and classify them one by one according to their form and content characteristics. A set of elements that is continuously distributed, has no obvious dividing lines, and contains continuous text has its bounding boxes merged and is marked as the main text area. Elements with clear horizontal and vertical dividing lines, or elements that exhibit row-column alignment, are marked as table areas. Elements whose internal pixel grayscale values conform to image features and that present an independent and complete image form are marked as chart areas. A set of elements that contains hierarchical numbers such as "1.", "1.1", or "(1)" followed by short text, and whose text corresponds to the subsequent chapter titles of the document, is marked as the table of contents area.

[0027] The coordinate ranges of headers, footers, body text, tables, charts, and tables of contents are organized one by one to form the original document physical layout information containing all area types and corresponding location information.

[0028] This embodiment classifies remaining elements based on differences in whether they contain separators, alignment, continuous text, and hierarchical numbering, thus clarifying the functional attributes and spatial scope of each area. This clearly distinguishes areas with different functions within the document, and the summarized physical layout information comprehensively records the location and type of each area, ensuring that no major areas are missed during the extraction process.

[0029] S2: Based on the physical layout information, extract the plain text content from the body text area and the title area, and retain the font, size, bold, and italic formatting information corresponding to the plain text content to obtain a formatted text sequence.

[0030] In some embodiments, the main text extraction objects are defined based on the regional coordinates of physical layout information; an appropriate text extraction method is adopted for different document types, and format information is obtained by associating element inherent attributes or matching attribute tables. The text and format are integrated in the order of reading to form a text sequence.

[0031] S2 specifically includes the following steps: S21: Retrieve the coordinate ranges of the body text area and the title area from the physical layout information obtained in step S1, check each page element in the original document, mark the elements whose coordinates fall within the body text area or the title area as elements to be processed, and exclude elements in other areas.

[0032] In some embodiments, the coordinate ranges of the body text area and the title area are retrieved from the physical layout information obtained in step S1. Each page element in the document is checked individually. The position coordinates of each element are checked to see if they fall within the coordinate range of the body text or title area. Elements falling within these areas are marked as elements to be processed. Elements belonging to headers, footers, etc., are directly excluded.

[0033] S22: For the marked elements to be processed, if they are elements of an editable document, read the text content of the elements; if they are elements of an image document, perform character recognition on the image area corresponding to each element and convert the recognition results into text. Then arrange all the extracted text in reading order, from top to bottom and, within the same row, from left to right, to form an initial text sequence.

[0034] In some embodiments, for editable documents such as DOCX, the text content stored in each element to be processed is read. For image-based documents such as scanned PDFs, the OCR module is invoked, characters are recognized in the image block corresponding to each element to be processed, and the recognized content is converted into editable text. After the text of all elements has been extracted, the text fragments are sorted according to the reading order of the page (content nearer the top first, and within the same line the leftmost content first) to form an initial text sequence.
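The reading-order sort described above might look like the following sketch. The fragment layout `(text, x0, y0)`, the function name, and the row tolerance of 5 units are assumptions made for illustration; the y coordinate is assumed to increase downward.

```python
def reading_order(fragments, row_tolerance=5.0):
    """Order (text, x0, y0) fragments top to bottom, then left to right within a row."""
    rows = []
    for frag in sorted(fragments, key=lambda f: f[2]):
        # A fragment joins the current row when its top edge is close to the row's first fragment.
        if rows and abs(frag[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [f[0] for row in rows for f in sorted(row, key=lambda f: f[1])]
```

Grouping by a tolerance rather than by exact y values keeps fragments on the same visual line together even when OCR bounding boxes differ by a few pixels.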

[0035] S23: For elements to be processed in editable documents, read the font name, font size, bold status mark, and italic status mark from the element attribute fields; for text in image documents, match the element attributes corresponding to each text fragment through the element attribute association table stored in the layout analysis tool, extract the font, size, bold, and italic information, bind each text fragment to its corresponding format information one by one, and arrange it into a formatted text sequence according to the order of the initial text sequence.

[0036] In some embodiments, for the elements to be processed in an editable document, information such as font name, specific font size, whether it is bold, and whether it is italic is read from the element's built-in attributes.

[0037] For text recognized by OCR, the layout analysis tool consults the pre-stored element attribute association table to find the element attributes corresponding to each text fragment and extracts information such as font, size, bold, and italics. Each text fragment is then bound to its corresponding formatting information, and the fragments are arranged in the order of the initial text sequence to obtain a formatted text sequence. This method fully preserves the original formatting features of the text. These features are the basis for recognizing the document structure, so the information needed to distinguish title levels from body text is not lost.

[0038] S3: Based on the formatted text sequence, identify the level and type of each line of text, and construct a document structure tree that includes the parent-child relationship of each level of headings and the relationship between headings and body paragraphs.

[0039] In some embodiments, by extracting and standardizing the format and content features of the text, and utilizing the inherent features of the title to construct multi-dimensional judgment rules, accurate classification of text levels and types is achieved. Based on the text sequence order, the hierarchical relationship is used to sort out the subordinate logic of nodes and construct a tree structure that conforms to the original logical structure of the document.

[0040] S4: Based on the document structure tree obtained in step S3, extract the main text and perform semantic segmentation on the main text to obtain preliminary text segments around the sub-topics.

[0041] In some embodiments, the main text is extracted and its original order restored by combining the node labels and starting position attributes of the tree structure. Basic text blocks are formed by aggregating sentences according to natural language punctuation and line break boundaries. The semantic association between text blocks is quantified through a sliding window by the overlap ratio of their keywords. The text is segmented at topic transitions, and highly associated text blocks are merged to form topic-focused fragments.
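As a rough sketch of keyword-overlap segmentation, the code below merges adjacent basic text blocks whose keyword overlap reaches a threshold and starts a new segment otherwise. The stopword list, the overlap formula (shared keywords divided by the smaller keyword set), the threshold of 0.2, and the comparison of only directly adjacent blocks are all assumptions; the application does not specify them.

```python
import re

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for", "on", "with"}
)


def keywords(block):
    """Lowercased word set of a block with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", block.lower()) if w not in STOPWORDS}


def overlap_ratio(a, b):
    """Shared keywords divided by the size of the smaller keyword set."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def segment(blocks, threshold=0.2):
    """Merge adjacent blocks that stay on topic; cut where the overlap drops."""
    segments = []
    for block in blocks:
        if segments and overlap_ratio(segments[-1][-1], block) >= threshold:
            segments[-1].append(block)
        else:
            segments.append([block])
    return ["\n".join(s) for s in segments]
```

For example, two blocks about encoder vectors followed by one about shipping costs would yield two segments, with the cut placed before the shipping block.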

[0042] S5: Based on the document structure tree obtained in step S3 and the preliminary text segments obtained in step S4, locate the position of each preliminary text segment in the document structure tree, trace back the various levels of headings to which it belongs and concatenate them into a heading path, add the heading path as a context prefix to the beginning of the corresponding preliminary text segment, and obtain a text segment with an embedded heading path.

[0043] In some embodiments, based on the text association information from previous steps, the starting node of the text segment in the structure tree is located by matching; a backtracking link is constructed using the node's parent attribute, the title nodes are filtered, and their order is adjusted to concatenate the path; the path is integrated with the text segment as an indivisible context prefix, ensuring format distinctiveness. The embedded text segment retains its structural context, does not destroy the semantic integrity of the original text, and has a unified format for easy processing.
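The backtracking and prefixing in S5 can be illustrated with a short sketch. The `Node` type, the " > " separator, and the bracketed prefix format are hypothetical; the application only requires that the title path be placed at the beginning of the segment as a distinguishable, indivisible prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    # Hypothetical structure-tree node; only the parent link is needed here.
    node_type: str  # "root", "title", or "paragraph"
    text: str
    parent: Optional["Node"] = None


def title_path(node, sep=" > "):
    """Walk parent links upward, keep title nodes, and order them top-down."""
    titles = []
    cur = node.parent
    while cur is not None:
        if cur.node_type == "title":
            titles.append(cur.text)
        cur = cur.parent
    return sep.join(reversed(titles))


def embed_title_path(segment, start_node):
    """Prefix the segment with the title path of the node where it starts."""
    path = title_path(start_node)
    return f"[{path}]\n{segment}" if path else segment
```

A paragraph under "3.2 Results" inside "Chapter 3 Experiments" would be stored as "[Chapter 3 Experiments > 3.2 Results]" followed by the segment text on the next line.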

[0044] S6: Vectorize each text segment with an embedded title path obtained in step S5 to obtain the corresponding text segment vector.

[0045] In some embodiments, preprocessing is used to eliminate irrelevant characters and formatting interference, unify encoding, and adapt to model input constraints. The semantic extraction capabilities of the pre-trained model are utilized to transform text semantics into quantifiable floating-point vectors; vector quality is ensured by normalizing and unifying the vector scale.
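The preprocessing and normalization around the embedding model could be sketched as below. The embedding model itself is not shown because the application does not name one. The choice of NFKC normalization, the removal of control and format characters, the 2000-character limit, and L2 normalization are assumptions used to make the idea concrete.

```python
import math
import unicodedata


def preprocess(text, max_chars=2000):
    """Unify encoding, drop invisible control/format characters, and cap the length."""
    text = unicodedata.normalize("NFKC", text)
    # Unicode categories starting with "C" cover control and format characters
    # (for example zero-width spaces); newlines are kept on purpose.
    text = "".join(
        ch for ch in text if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )
    # Truncation cuts from the end, so a title path prefix at the start survives.
    return text[:max_chars]


def l2_normalize(vec):
    """Scale a vector to unit length so all stored vectors share one scale."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)
```

With unit-length vectors, cosine similarity reduces to a dot product, which many vector databases compute efficiently.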

[0046] S7: The text segment vector obtained in step S6 and the corresponding text segment with the embedded title path are stored as records in the database. The title path is also stored as metadata associated with the record.

[0047] In some embodiments, a data storage table is created in the database. The table structure includes a unique record ID, a text segment vector field, an embedded title path text segment field, a title path metadata field, and a text segment unique ID field. A unique record ID is assigned to each text segment vector and its corresponding embedded text segment, and the two are stored in the corresponding fields of the data table. The title path string corresponding to each record is extracted and stored in the metadata field, establishing the association between the title path and the corresponding record.

[0048] Optionally, during storage, data is written in batches to reduce the number of database interactions. After storage, the integrity and correctness of the fields in each record are verified to ensure that vectors, text segments, and title path metadata correspond one-to-one without any omissions. This guarantees data integrity and traceability; batch writing improves storage efficiency.
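A minimal storage sketch using Python's built-in `sqlite3` module is shown below. The table and column names, the JSON encoding of the vector, and the simple count-based verification are illustrative assumptions; a production system would more likely use a dedicated vector database.

```python
import json
import sqlite3


def store_segments(conn, records):
    """Batch-insert records and return how many complete rows the table holds."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segments (
               record_id    INTEGER PRIMARY KEY AUTOINCREMENT,
               segment_id   TEXT UNIQUE NOT NULL,
               vector       TEXT NOT NULL,
               segment_text TEXT NOT NULL,
               title_path   TEXT NOT NULL
           )"""
    )
    rows = [
        (r["segment_id"], json.dumps(r["vector"]), r["text"], r["title_path"])
        for r in records
    ]
    # One executemany call inside one transaction keeps database round trips low.
    with conn:
        conn.executemany(
            "INSERT INTO segments (segment_id, vector, segment_text, title_path) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    # Post-write check: count rows whose vector and text fields are non-empty.
    return conn.execute(
        "SELECT COUNT(*) FROM segments WHERE vector != '' AND segment_text != ''"
    ).fetchone()[0]
```

Storing the title path in its own column, in addition to the prefixed text, lets retrieval code filter or group results by chapter without parsing the segment text.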

[0049] In one embodiment of the present invention, a possible implementation of step S3 is described below in a non-limiting manner. S3 specifically includes the following steps: S31: Traverse the formatted text sequence obtained in step S2, and extract the format features and text content features corresponding to each line of text. The format features include font name, font size, bold status, italic status, and indentation. The text content features include the text start character and numbering pattern. Standardize the extracted features to form a feature set for each line of text.

[0050] In some embodiments, each text data in the formatted text sequence is read one by one. For each text, the specific font type, font size, whether it is bold, whether it is italic, and the text indentation are extracted from the bound format information. The character fragments at the beginning position and whether it contains numbering patterns such as "Chapter X", "1.", "1.1", and "(1)" are extracted from the text content. Then, the extracted features are standardized.

[0051] For example, different units of font size are uniformly converted into point values, indentation is quantized into specific pixel values, and bold and italic states are converted into Boolean values, ultimately compiling a set containing all standardized features for each line of text.
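The standardization step can be sketched as follows. The input field names, the accepted units, and the conversion factor of 0.75 pt per pixel (which assumes 96 pixels per inch) are assumptions for the example only.

```python
import re

# Conversion factors to points; the px factor assumes 96 pixels per inch.
_UNIT_TO_PT = {"pt": 1.0, "px": 0.75, "in": 72.0, "mm": 72.0 / 25.4}


def font_size_to_points(size):
    """Convert a size string such as '16px' or '12 pt' to a point value."""
    m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(pt|px|in|mm)\s*", size)
    if m is None:
        raise ValueError(f"unrecognized font size: {size!r}")
    return float(m.group(1)) * _UNIT_TO_PT[m.group(2)]


def standardize_line(raw):
    """Build a standardized feature set from a hypothetical raw attribute dict."""
    return {
        "font_name": raw.get("font_name", "").strip().lower(),
        "font_size_pt": font_size_to_points(raw["font_size"]),
        "bold": str(raw.get("bold", "")).lower() in {"true", "1", "yes"},
        "italic": str(raw.get("italic", "")).lower() in {"true", "1", "yes"},
        "indent_px": float(raw.get("indent_px", 0)),
    }
```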

[0052] S32: Construct a set of rules for determining hierarchy and type. First, perform regular expression matching based on the numbering pattern in the text content features. If the match is successful, the text type is determined to be a title and the corresponding hierarchy is determined. If the match fails, the text type is determined based on the format feature threshold. If the text meets the title format threshold, it is determined to be a title and the hierarchy is divided. If the text does not meet the threshold, it is determined to be body text. Output the hierarchy label and type label of each line of text.

[0053] In some embodiments, a set of judgment rules is established, which includes numbering pattern matching rules and preset regular expressions for various common numbering patterns in headings; it also includes format feature threshold rules, which set corresponding thresholds such as font size range and bolding requirements for first-level, second-level, and third-level headings respectively.

[0054] In this embodiment, the numbering pattern of each line of text is first compared with a preset regular expression. For example, if "Chapter X" is matched, it is determined to be a first-level heading; if "1.1" is matched, it is determined to be a second-level heading. If no numbering pattern is matched, the format characteristics of the text are compared with the heading format threshold. For example, text with a font size ≥ 20 and in bold is determined to be a first-level heading, and text with a font size between 16 and 20 and in bold is determined to be a second-level heading.

[0055] If none of the conditions are met, it is considered part of the main text. Finally, each line of text is given a clear hierarchical tag, such as "h1", "h2", "paragraph", and a type tag "title" or "paragraph".
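The two-stage rule set can be sketched as below. The regular expressions are simplified illustrations covering only the "Chapter X" and "1.1" examples from the text; the font thresholds follow the example values given above, and the function name and return shape are hypothetical.

```python
import re

# Stage 1: numbering patterns, checked before any format feature.
NUMBERING_RULES = [
    (re.compile(r"^Chapter\s+\S+"), "h1"),     # e.g. "Chapter 3 Experiments"
    (re.compile(r"^\d+\.\d+(\s|$)"), "h2"),    # e.g. "1.1 Background"
]


def classify_line(text, font_size_pt, bold):
    """Return (level_tag, type_tag) for one line of formatted text."""
    stripped = text.strip()
    for pattern, level in NUMBERING_RULES:
        if pattern.match(stripped):
            return level, "title"
    # Stage 2: format thresholds, used only when no numbering pattern matched.
    if bold and font_size_pt >= 20:
        return "h1", "title"
    if bold and 16 <= font_size_pt < 20:
        return "h2", "title"
    return "paragraph", "paragraph"
```

A bare pattern such as "1." is left out on purpose because it also matches numbered list items in body text; handling it would need extra context such as line length or formatting.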

[0056] This approach leverages the inherent properties of title text, which often exhibits fixed numbering patterns or special formats, to construct multi-dimensional judgment rules. By prioritizing content over format, it achieves precise classification of text levels and types.

[0057] S33: Initialize the root node of the document structure tree, traverse each line of text in the order of the formatted text sequence, create corresponding nodes according to the level label and type label, adjust the parent-child relationship between nodes by comparing levels, attach the title node to the corresponding parent title node, attach the body text node to the nearest parent title node, and gradually build the document structure tree.

[0058] In some embodiments, a root node, identified as ROOT, is created as the root of the document structure tree. The root node carries no text content; it serves only as the top-level carrier of the structure tree. Each line of tagged text is processed sequentially in the original order of the formatted text sequence. When a heading tag is encountered, a corresponding heading node is created, and its level is compared with the levels of the existing heading nodes. If the new heading is deeper in the hierarchy than the previous heading node (for example, a second-level heading following a first-level heading), it is attached as a child node of the previous heading node.

[0059] If the new heading is not deeper than the previous heading node, the method searches back up through the existing heading nodes to find the nearest heading that is shallower than the new one, and that heading becomes the parent node. When a body text tag is encountered, a body text node is created and attached directly to the nearest parent title node.

[0060] Processing and attaching nodes line by line according to this logic, a document structure tree containing all parent-child relationships of headings and the relationship between headings and body text is ultimately formed.

[0061] This embodiment can restore the hierarchical structure of a document, clarify the hierarchical relationship between headings at each level and the relationship between the body text and headings. The resulting document structure tree can directly provide structural support for body text extraction and text segment location.
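One common way to realize the level comparison in S33 is a stack that holds the path from ROOT to the most recent heading. The sketch below assumes level labels of the form "h1", "h2", and so on; the `Node` class and the `build_tree` function are illustrative names, not part of the patent text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    text: str
    node_type: str                      # "root", "title" or "paragraph"
    level: int = 0                      # 0 for ROOT, 1 for h1, 2 for h2, ...
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list = field(default_factory=list)


def build_tree(lines):
    """lines: iterable of (text, level_label, type_label) in document order."""
    root = Node("ROOT", "root")
    stack = [root]                      # path from ROOT to the latest heading
    for text, level_label, type_label in lines:
        if type_label == "title":
            level = int(level_label[1:])            # "h2" -> 2
            # Search back until the top of the stack is a shallower heading.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(text, "title", level, parent=stack[-1])
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text hangs under the nearest heading (or ROOT if none yet).
            node = Node(text, "paragraph", parent=stack[-1])
            stack[-1].children.append(node)
    return root
```

Because ROOT has level 0 and every heading has level 1 or greater, the loop never pops the root node.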

[0062] In one embodiment of the present invention, a possible implementation of step S4 is described below in a non-limiting manner. S4 specifically includes the following steps: S41: Starting from the root node of the document structure tree, traverse all child nodes layer by layer, read the nodetype tag of each node, and select the text nodes tagged "paragraph"; then concatenate the text content of these nodes in original document order, as given by the startpos attribute of each text node, to form a continuous text stream.

[0063] In some embodiments, starting from the root node of the document structure tree, all child nodes are traversed layer by layer downwards. Each node has a nodetype tag. Optionally, this embodiment selects the nodes tagged "paragraph". These are the main text nodes, each of which records its starting position in the original document. Following these starting positions from beginning to end, the text content of all main text nodes is joined in sequence to form a continuous, unbroken stream of main text. This embodiment can accurately extract the main text content from the structure tree, ensuring that the order of the concatenated text is consistent with the original document.
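A minimal sketch of S41 follows. The attribute names `nodetype` and `startpos` are taken from the text; the `Node` class itself and the choice to join paragraphs with a newline (so that line-break boundaries remain available to S42) are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    text: str
    nodetype: str                 # "root", "title" or "paragraph"
    startpos: int = -1            # offset in the original document
    children: list = field(default_factory=list)


def body_text_stream(root: Node) -> str:
    """Collect all "paragraph" nodes and join them in document order."""
    paragraphs = []
    queue = deque([root])
    while queue:                                  # breadth-first = layer by layer
        node = queue.popleft()
        if node.nodetype == "paragraph":
            paragraphs.append(node)
        queue.extend(node.children)
    # Layer-by-layer traversal does not follow reading order, so sort by startpos.
    paragraphs.sort(key=lambda n: n.startpos)
    return "\n".join(n.text for n in paragraphs)
```

The sort is what makes the result independent of traversal order: a paragraph nested under a deep subheading is visited late but still lands at its original position.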

[0064] S42: Using periods, question marks, and exclamation marks as sentence delimiters, the main text stream is split into independent sentence units; then, using line breaks in the original document as boundaries, adjacent sentence units are aggregated into several basic text blocks, and the number of sentences contained in each basic text block is recorded.

[0065] In some embodiments, periods, question marks, and exclamation marks are treated as end-of-sentence markers, and the continuous text stream is split into individual sentence units. After splitting, the sentence units that fall between two consecutive line breaks in the original document are grouped together, each group forming a basic text block. The number of sentences contained in each basic text block is recorded; for example, one text block contains 3 sentences, another contains 2 sentences, and so on.

[0066] As can be seen, the method of splitting and then aggregating in this embodiment ensures the independence of sentences, preserves the paragraph structure of the original document, and records the number of sentences to provide a reference for subsequent semantic association analysis.
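The split-then-aggregate step of S42 might look like the sketch below. It assumes the main text stream still contains the original line breaks as "\n", and it processes one line-break-delimited paragraph at a time, which yields the same blocks as splitting first and grouping afterwards. Full-width punctuation is included as an assumption for Chinese source documents.

```python
import re

# Split after a sentence-ending mark, swallowing any following whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!。？！])\s*")


def to_basic_blocks(stream: str) -> list[dict]:
    """Return one basic text block per paragraph with its sentence count."""
    blocks = []
    for paragraph in stream.split("\n"):
        sentences = [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]
        if sentences:
            blocks.append({"sentences": sentences, "count": len(sentences)})
    return blocks
```

This simple delimiter rule also splits at decimal points and abbreviations; a real implementation would need extra handling for those cases.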

[0067] S43: Set up a continuous text block group containing 3 consecutive basic text blocks, and advance the text block group on all basic text blocks in sequence; for two adjacent basic text blocks in the text block group, extract the nouns and verbs, calculate the overlap ratio of the vocabulary set, obtain the semantic correlation between the two, and record the correlation value of each pair of adjacent text blocks.

[0068] In some embodiments, a fixed text block group consisting of three consecutive basic text blocks is defined. Starting from the first basic text block, the group is advanced step by step until all basic text blocks have been covered. At each step, every pair of adjacent basic text blocks within the group is analyzed. First, key words are extracted from the two text blocks, primarily the nouns and verbs that reflect the main content. Then, the number of key words that appear in both groups is counted. Dividing this number by the total number of key words in the two groups gives the semantic correlation between the two text blocks. The correlation value for each pair of adjacent text blocks is recorded.
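The following sketch shows one literal reading of the overlap ratio in S43: shared key words divided by the combined size of the two key word sets. The `extract_keywords` function is only a stand-in (lowercasing plus a tiny stop-word list); the noun and verb extraction described in the text would require a part-of-speech tagger, which is not shown here.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "on"}


def extract_keywords(block: str) -> set[str]:
    # Stand-in for noun/verb extraction; a real system would use a POS tagger.
    return {w for w in re.findall(r"[a-z]+", block.lower()) if w not in STOPWORDS}


def overlap_ratio(a: set[str], b: set[str]) -> float:
    """Shared key words divided by the total key words in both sets."""
    total = len(a) + len(b)
    return len(a & b) / total if total else 0.0


def adjacent_correlations(blocks: list[str], window: int = 3) -> dict:
    """Slide a window of `window` blocks and score each adjacent pair once."""
    keywords = [extract_keywords(b) for b in blocks]
    scores = {}
    for start in range(max(1, len(blocks) - window + 1)):
        end = min(start + window, len(blocks))
        for i in range(start, end - 1):
            if (i, i + 1) not in scores:        # windows overlap; skip repeats
                scores[(i, i + 1)] = overlap_ratio(keywords[i], keywords[i + 1])
    return scores
```

Under this reading two identical key word sets score 0.5, not 1.0. If the intended measure is the Jaccard ratio (shared words over the union), the denominator would be `len(a | b)` instead, and any threshold would need to be chosen accordingly.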

[0069] S44: Preset a semantic relevance threshold, traverse the relevance values of all adjacent text block pairs, set split points between text block pairs whose relevance values are lower than the threshold, merge the consecutive basic text blocks between split points into a segment, organize all segments to form the preliminary text segments, and mark the range of basic text blocks corresponding to each preliminary text segment.

[0070] In some embodiments, a fixed semantic relevance threshold is set, such as 0.3. The relevance values of the previously recorded pairs of adjacent text blocks are examined one by one. If the relevance value of a pair is lower than this threshold, the topical relevance between the two text blocks is considered weak, and a split point is placed between them.

[0071] This embodiment merges the basic text blocks lying between two adjacent split points into a larger segment. Once all such segments have been organized, the preliminary text segments are formed. The basic text blocks corresponding to each preliminary text segment are recorded for later tracking and verification. In this way, segmentation occurs at points of topic transition, ensuring that each preliminary text segment revolves around a relatively focused sub-topic.
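The threshold-based splitting of S44 can be sketched as follows. The function name and the representation of a segment as an inclusive range of block indices are assumptions; the default threshold of 0.3 is the example value from [0070], and at least one basic text block is assumed to exist.

```python
def split_segments(correlations: list[float], threshold: float = 0.3) -> list[tuple[int, int]]:
    """correlations[i] is the relevance between basic blocks i and i + 1.

    Returns one inclusive (first_block, last_block) range per preliminary
    text segment, which doubles as the block-range record the text asks for.
    """
    n_blocks = len(correlations) + 1
    segments = []
    start = 0
    for i, score in enumerate(correlations):
        if score < threshold:           # weak topical link: cut between i and i + 1
            segments.append((start, i))
            start = i + 1
    segments.append((start, n_blocks - 1))
    return segments
```

For example, the relevance values `[0.5, 0.1, 0.4, 0.2]` over five blocks produce the ranges `(0, 1)`, `(2, 3)` and `(4, 4)`.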

[0072] In one embodiment of the present invention, based on step S5, the following is a possible embodiment and its specific implementation is described in a non-limiting manner. The step S5, based on the document structure tree obtained in step S3 and the preliminary text segments obtained in step S4, specifically includes the following steps for locating the position of each preliminary text segment in the document structure tree: S51: Retrieve the correspondence table between the preliminary text segments and the basic text blocks in the records, extract all the basic text blocks associated with each preliminary text segment; then match the association information between the basic text blocks and the sentence units, and summarize the text content and quantity of all sentence units contained in each preliminary text segment.

[0073] In some embodiments, the correspondence table between the preliminary text segments and the basic text blocks recorded in step S44 is retrieved to determine which basic text blocks each preliminary text segment is associated with.

[0074] The association information between basic text blocks and sentence units recorded in step S42 is then retrieved, the text content of all sentence units contained in each preliminary text segment is collected, the number of sentence units in each preliminary text segment is counted, and this information is recorded segment by segment.

[0075] S52: Using the text nodes selected in step S41 as the retrieval objects, match the text content of the first sentence unit contained in the preliminary text segment with the text attribute content of each text node to locate the target text node containing the sentence unit.

[0076] In some embodiments, all text nodes filtered in step S41 are used as the objects to be searched. From the sentence units contained in each preliminary text segment, the text content of the first sentence unit is selected, and this text content is used to perform a precise comparison with the text content stored in each text node.

[0077] In this embodiment, once a text node whose text content matches completely has been found, this node is defined as the target text node containing the beginning of the preliminary text segment. Here, the uniqueness of the first sentence unit of the preliminary text segment is used to lock onto the starting node of the segment among the text nodes of the structure tree through text content matching, thus establishing the initial association between the text segment and the structure tree node.

[0078] S53: Obtain the parent attribute of the target text node and determine the parent node; trace back up the parent node layer by layer until the root node of the document structure tree is reached, and record all title nodes with the tag "title" on the backtracking path to form a preliminary title path.

[0079] In some embodiments, after locating the target text node, the `parent` attribute is examined, which records the parent node. The search continues upwards along the `parent` attribute of this parent node, tracing back node by node until the root node of the document structure tree is found.

[0080] During this tracing process, only the nodes tagged "title" are kept, and these title nodes are recorded in tracing order from bottom to top to form a preliminary title path.
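Steps S52 and S53 can be sketched together. The `parent` attribute and the "title" tag come from the text; the `Node` class, the function names, and the use of substring containment to decide that a node "contains" the first sentence unit are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    text: str
    nodetype: str                                   # "root", "title" or "paragraph"
    parent: Optional["Node"] = field(default=None, repr=False)


def locate_target(first_sentence: str, paragraph_nodes: list) -> Optional[Node]:
    """S52: find the first body text node that contains the first sentence unit."""
    for node in paragraph_nodes:
        if first_sentence in node.text:
            return node
    return None


def trace_title_path(target: Node) -> list:
    """S53: follow parent links up to the root, keeping only title nodes."""
    path = []
    node = target.parent
    while node is not None:
        if node.nodetype == "title":
            path.append(node)
        node = node.parent
    return path                                     # bottom-up order
```

The returned list runs from the nearest heading to the outermost heading; ROOT is visited but not recorded because it is not tagged "title".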

[0081] S54: Verify the validity of the preliminary title path. Compare the text of the last sentence unit contained in the preliminary text segment with the text of the adjacent text nodes at the same level as the target text node, to confirm whether both fall under the title node corresponding to the preliminary title path. If so, the preliminary title path is the structural path corresponding to the preliminary text segment, and the location is complete.

[0082] In some embodiments, it is necessary to verify whether the preliminary title path truly corresponds to the preliminary text segment. In this embodiment, the text content of the last sentence unit of the preliminary text segment is retrieved, and the other adjacent text nodes at the same level as the target text node are found. The text content of these adjacent nodes is compared with the text content of the last sentence of the preliminary text segment.

[0083] Through this comparison, the embodiment confirms whether these adjacent nodes and the target text node all fall under the last title node in the preliminary title path. If they do, the preliminary title path is correct, and the specific position of the preliminary text segment in the document structure tree is determined. This avoids the positioning deviation that would result from matching only the starting sentence, ensures that the entire preliminary text segment belongs to the located structure path, and makes the positioning result more reliable.
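One possible reading of the S54 check is sketched below: the path is accepted when the nearest heading on the path is the direct parent of the target node and the last sentence unit is found in a body text node under that same heading. The `Node` class, the `attach` helper, and the choice to include the target node itself among the candidates are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    text: str
    nodetype: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list = field(default_factory=list, repr=False)


def attach(parent: Node, child: Node) -> Node:
    """Hypothetical helper that links a child node under a parent node."""
    child.parent = parent
    parent.children.append(child)
    return child


def verify_title_path(target: Node, last_sentence: str, title_path: list) -> bool:
    """title_path is bottom-up, so title_path[0] is the nearest heading."""
    if not title_path or target.parent is not title_path[0]:
        return False
    # Same-level body text nodes under the heading, including the target itself.
    siblings = [n for n in target.parent.children if n.nodetype == "paragraph"]
    return any(last_sentence in n.text for n in siblings)
```

A segment that starts under one heading and ends under another fails this check, which is the deviation the verification step is meant to catch.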

[0084] In one embodiment of the present invention, step S5 involves tracing back the headings at each level to which the preliminary text segment belongs and concatenating them into a heading path, then adding the heading path as a context prefix to the beginning of the corresponding preliminary text segment to obtain a text segment with an embedded heading path, including the following steps: S511: Retrieve all title nodes obtained during the backtracking process, extract the title text stored in the text attribute of each node, clean each title text by removing redundant trailing punctuation marks, spaces, and meaningless special characters while retaining the title content, and form a standardized title text list.

[0085] For example, if the title text is "1.1 Research Background:" followed by two spaces, the trailing spaces and the colon are removed; if it is "(3) Experimental Design Scheme" followed by a space, the trailing space and the redundant parentheses at the beginning are removed, leaving only the main content "3 Experimental Design Scheme". All processed title texts are then collected into a standardized title text list.
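The cleaning in S511 can be approximated with a few regular expressions. The exact character set to strip is not specified in the text, so the set used below is an assumption chosen to reproduce the two examples in [0085].

```python
import re


def clean_title(title: str) -> str:
    """Strip trailing punctuation/spaces and leading numbering parentheses."""
    title = title.strip()
    # Remove redundant punctuation and whitespace at the end of the title.
    title = re.sub(r"[\s:：;；,，.。、\-]+$", "", title)
    # Turn a parenthesized leading number such as "(3)" into a bare "3".
    title = re.sub(r"^[(（]\s*(\w+)\s*[)）]", r"\1", title)
    # Collapse any internal runs of whitespace.
    return re.sub(r"\s+", " ", title)
```

Applied to the examples above, `clean_title("1.1 Research Background:  ")` gives `"1.1 Research Background"` and `clean_title("(3) Experimental Design Scheme ")` gives `"3 Experimental Design Scheme"`.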

[0086] S512: Adjust the order of the standardized heading text list, reversing the original bottom-up order of direct superior headings to root headings obtained by backtracking, and correcting it to a root-to-leaf order of root heading, first-level heading, second-level heading, ..., directly managed headings, to ensure that it matches the inherent hierarchical structure of the document.

[0087] For example, if the list order is "1.1 Research Background" and "Chapter 1 Introduction", then reversing the list elements to "Chapter 1 Introduction" and "1.1 Research Background" ensures that the heading arrangement aligns with the document's overall hierarchical structure from macro to micro. This restores the document's inherent hierarchical logic, correcting the reverse-ordered heading list obtained during the backtracking process into a forward-order list that conforms to reading comprehension and document structure understanding, ensuring that the heading paths accurately reflect the hierarchical affiliation of text segments.

[0088] S513: Preset a unified path separator, traverse the adjusted standardized title text list, and concatenate the title texts in order using the separator to generate a coherent title path string; if the list contains only a single title text, then directly use that text as the title path string.

[0089] For example, "#" is used as the fixed path separator and is not arbitrarily replaced by other symbols. Following the adjusted standardized title text list, the title texts are joined one after another with "#". For example, if the list is "Chapter 1 Introduction" and "1.1 Research Background", it is concatenated into the string "Chapter 1 Introduction #1.1 Research Background". If the list contains only the title "Chapter 2 Experimental Methods", this text is used directly as the title path string.
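Steps S512 and S513 reduce to a reversal followed by a join. In the sketch below the function name and the `separator` parameter are illustrative. The example string in [0089] shows a space before "#", so passing `separator=" #"` reproduces that spelling, while the default uses the bare "#" named as the separator.

```python
def build_title_path(bottom_up_titles: list[str], separator: str = "#") -> str:
    """Reverse the backtracked titles into root-to-leaf order and join them."""
    ordered = list(reversed(bottom_up_titles))    # S512: macro to micro
    # S513: str.join returns a single title unchanged, covering that case too.
    return separator.join(ordered)
```

Using one fixed separator for every segment keeps the path easy to split back into its levels when it is stored as metadata.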

[0090] S514: Extract the original content of the corresponding preliminary text segment, remove blank lines at the beginning and redundant formatting such as consecutive spaces; add the title path string as a context prefix to the beginning of the regularized preliminary text segment, with the prefix and text segment separated by two newline characters; after adding, check the connection format to confirm that there are no overlapping or missing characters, thus forming a text segment with the embedded title path.

[0091] In some embodiments, the original content of the corresponding initial text segment is retrieved, and any redundant formatting such as blank lines at the beginning or multiple consecutive spaces is checked. These redundant formats are then removed to make the text segment content more organized.

[0092] In this embodiment, the previously constructed title path string is placed at the beginning of the standardized text segment, with two newline characters inserted between the path and the text segment to clearly separate the prefix from the body text. For example, if the title path is "Chapter 1 Introduction #1.1 Research Background" and the standardized text segment is "This paper studies document analysis technology based on deep learning. In recent years, artificial intelligence technology has developed rapidly.", the combined text becomes "Chapter 1 Introduction #1.1 Research Background\n\nThis paper studies document analysis technology based on deep learning. In recent years, artificial intelligence technology has developed rapidly." After combining, the text is checked again to ensure that the path and the text segment do not run together and that no extra characters have been introduced. This yields a text segment with the embedded title path. The title path thus becomes an inherent context of the text segment without compromising the semantic integrity of the original text segment, and all text segments share a consistent presentation style.
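A minimal sketch of S514 is given below. The function name is hypothetical, and the junction check is one simple way to confirm that the prefix and the body can be recovered unchanged from the combined text; it assumes the title path itself contains no newline characters.

```python
import re


def embed_title_path(title_path: str, segment: str) -> str:
    """Prefix a regularized segment with its title path and a blank line."""
    body = segment.lstrip("\n\r \t")              # drop leading blank lines
    body = re.sub(r"[ \t]{2,}", " ", body)        # collapse runs of spaces
    combined = f"{title_path}\n\n{body}"
    # Junction check: splitting at the first blank line must give back
    # exactly the prefix and the body, with nothing lost or duplicated.
    prefix, _, rest = combined.partition("\n\n")
    if not title_path or not body or prefix != title_path or rest != body:
        raise ValueError("title path and text segment were not joined cleanly")
    return combined
```

Raising an error instead of silently repairing the text makes malformed segments visible before they reach the vectorization step.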

[0093] In one embodiment of the present invention, based on step S6, the following is a possible embodiment and its specific implementation will be described in a non-limiting manner. S6 specifically includes the following steps: S61: Perform preprocessing on each text segment with embedded title path obtained in step S5, remove consecutive newline characters, invisible control characters and repeated spaces in the text, and convert the text encoding to a preset format; for text segments whose length exceeds the upper limit of the pre-trained embedding model input, truncate the tail content according to the maximum input length of the model, and retain the preceding text containing the title path.

[0094] In some embodiments, the text segments with embedded title paths are taken one by one, and formatting issues in their text content are checked. Consecutive line breaks are replaced with single spaces; tabs and non-printable control characters (invisible symbols that can interfere with processing) are removed; and repeated spaces are merged into one.

[0095] The encoding of all text is converted to UTF-8. If a text segment is too long, exceeding the maximum input length that the selected pre-trained model can handle, it is truncated from the end so that the leading part containing the title path is preserved and key structural context information is not lost.
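The preprocessing of S61 can be sketched with the standard library. The function name is hypothetical, and the length limit is expressed in characters as a stand-in for the model's token limit, which depends on the tokenizer of the chosen embedding model.

```python
import re
import unicodedata


def preprocess(segment: str, max_chars: int = 512) -> str:
    """Clean one title-path-prefixed segment and truncate it from the tail."""
    text = re.sub(r"[\r\n]+", " ", segment)       # line breaks -> single space
    # Drop control (Cc, includes tabs) and format (Cf, e.g. zero-width) characters.
    text = "".join(ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf"))
    text = re.sub(r" {2,}", " ", text).strip()    # merge repeated spaces
    # Round-trip through UTF-8, discarding anything that cannot be encoded.
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    # Cutting at the end keeps the title path at the front intact.
    return text[:max_chars]
```

Note that this step turns the two newline characters after the title path into a single space, as described in [0094].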

[0096] S62: Select a pre-trained text embedding model suitable for semantic extraction of long texts, load the model's pre-trained weights and initialize the inference environment, and configure the input batch size and output vector dimension parameters of the text embedding model; according to the input requirements of the text embedding model, convert the pre-processed text segments into tensor formats that the model can recognize, and generate text input batches.

[0097] In some embodiments, a suitable pre-trained text embedding model is selected based on the text segment type. The text embedding model can be the all-MiniLM-L6-v2 version of Sentence-BERT. The weight file of the trained text embedding model is loaded into the inference environment. The number of text segments processed in each batch is set, for example, 8 text segments at a time. The output vector dimension of the model is then confirmed to be either 384-dimensional or 768-dimensional to match subsequent storage and retrieval requirements. Finally, each pre-processed text segment is converted into a batch format that the model can understand, and then divided into multiple input batches according to the set batch size.

[0098] S63: Input text batches into the pre-trained text embedding model after initialization, call the text embedding model inference interface to perform semantic feature extraction, generate the original floating-point vector corresponding to each text segment; record the dimension information of each original vector, and establish a unique ID mapping relationship between the vector and the corresponding embedded title path text segment.

[0099] In some embodiments, the grouped text inputs are fed batch by batch to the initialized pre-trained text embedding model. The text embedding model extracts semantic features from each text segment, converting the semantic information in the text into a raw vector composed of a sequence of floating-point numbers. After each raw vector is generated, its dimension is recorded, for example, 384 dimensions. Each vector is also assigned a unique ID identical to that of the corresponding text segment, and the correspondence between vectors and text segments is recorded in a mapping table.
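The batching and ID mapping of S62 and S63 can be sketched independently of any particular model. Here `encode` is a hypothetical callable supplied by the caller that maps a list of texts to a list of vectors; the function name and the record layout are likewise assumptions.

```python
from typing import Callable, Sequence


def embed_in_batches(
    segments: dict[str, str],
    encode: Callable[[Sequence[str]], Sequence[Sequence[float]]],
    batch_size: int = 8,
) -> dict[str, dict]:
    """Map each segment ID to its raw vector and recorded dimension."""
    ids = list(segments)
    mapping = {}
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        vectors = encode([segments[i] for i in batch_ids])
        for seg_id, vec in zip(batch_ids, vectors):
            values = [float(x) for x in vec]
            # The vector shares the ID of the segment it was computed from.
            mapping[seg_id] = {"vector": values, "dim": len(values)}
    return mapping
```

With the sentence-transformers package installed, the `encode` method of `SentenceTransformer("all-MiniLM-L6-v2")` could be passed as `encode`; that model produces 384-dimensional vectors, matching the example dimension in the text.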

[0100] S64: Perform standardization on the generated original floating-point vectors to obtain the modulus of each vector. Divide each element in the vector by the modulus to obtain a standardized vector with a modulus of 1. Verify the numerical range of the standardized vectors, remove abnormal numerical vectors, and regenerate the vectors of the corresponding text segments. Finally, obtain the standardized text segment vectors corresponding to each text segment.

[0101] In some embodiments, normalization is performed on each raw floating-point vector. The magnitude of the vector is calculated by summing the squares of each element in the vector and then taking the square root.

[0102] Divide each element of the vector by this magnitude; the resulting vector will then have a magnitude of 1. After processing, check if the value of each element in the normalized vector is between -1 and 1, and whether there are any abnormally large or small values.

[0103] If abnormal vectors are detected, the corresponding text segment is re-vectorized until a standardized vector that meets the requirements is obtained, ultimately forming a standardized text segment vector for each text segment. Because the standardized vectors share a uniform scale, the semantic similarity between different vectors can be measured consistently during retrieval and matching.
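The standardization in S64 is an L2 normalization, which can be sketched as follows. The function name is an assumption, and raising an error on an abnormal vector is only one possible way to signal that the segment should be re-vectorized.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length, rejecting abnormal inputs."""
    # Magnitude: square root of the sum of squared elements.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0 or not math.isfinite(norm):
        raise ValueError("abnormal vector: zero, infinite or NaN magnitude")
    unit = [x / norm for x in vector]
    # Every element of a unit vector must lie in [-1, 1].
    if any(not -1.0 <= x <= 1.0 for x in unit):
        raise ValueError("abnormal vector: element outside [-1, 1]")
    return unit
```

Once all stored vectors have unit length, the dot product of two vectors equals their cosine similarity, which is why the uniform scale simplifies similarity comparison at retrieval time.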

[0104] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0105] As shown in Figure 3, this application also provides an electronic device, including a display module 103, a memory 102, a processor 101, a communication module 104, and a computer program stored in the memory 102 and executable on the processor 101. When the processor 101 executes the program, it implements the steps of the text segmentation method based on document structure embedding.

[0106] In embodiments of the present invention, electronic devices include, but are not limited to, laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the embodiments described and / or claimed herein.

[0107] In this embodiment, processor 101 may be implemented using at least one of an application-specific integrated circuit, a programmable logic device, a field-programmable gate array, a processor, a controller, a microcontroller, a microprocessor, or an electronic unit designed to perform the functions described herein. In some cases, such an implementation may be implemented within a controller. For software implementation, implementations such as processes or functions may be implemented with separate software modules that allow the performance of at least one function or operation. Software code may be implemented by a software application (or program) written in any suitable programming language, and the software code may be stored in memory and executed by the controller.

[0108] The display module 103 is used to display information input by the user or information provided to the user. The display module 103 may include a display panel, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like.

[0109] The memory 102 can be used to store software programs and various data. The memory 102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device.

[0110] The communication module 104 transmits radio signals to and / or receives radio signals from at least one of a base station, an external terminal, and a server. Such radio signals may include voice call signals, video call signals, or various types of data sent and / or received according to text and / or multimedia messages.

[0111] The present invention also provides a storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the text segmentation method based on document structure embedding.

[0112] The storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0113] The storage medium stores a program product capable of implementing the methods described above in this specification. In some possible implementations, various aspects of this disclosure may also be implemented as a program product comprising program code that, when run on a terminal device, causes the terminal device to perform the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of this disclosure.

[0114] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A text segmentation method based on document structure embedding, characterized in that the method... include: S1: Perform layout analysis on the original document to obtain the physical layout information of the original document; S2: Based on the physical layout information, extract the plain text content from the main text area and retain the format information corresponding to the plain text content to obtain a formatted text sequence; S3: Based on the formatted text sequence, identify the level and type of each line of text, and construct a document structure tree that includes the parent-child relationship of each level of headings and the relationship between headings and body paragraphs; S4: Based on the document structure tree, extract the main text, perform semantic segmentation on the main text, and obtain preliminary text segments around the sub-topics; S5: Based on the document structure tree obtained in step S3 and the preliminary text segments obtained in step S4, locate the position of each preliminary text segment in the document structure tree, trace back the various levels of headings to which the preliminary text segments belong and concatenate them into a heading path, add the heading path as a context prefix to the beginning of the corresponding preliminary text segment, and obtain a text segment with an embedded heading path; S6: Vectorize each text segment with an embedded title path to obtain the text segment vector corresponding to each such text segment; S7: Store the obtained text segment vector and the corresponding text segment with the embedded title path as records in the database, and also store the title path as metadata associated with the records.

2. The text segmentation method based on document structure embedding according to claim 1, characterized in that, S1 specifically includes the following steps: Parse the original document and extract pixel matrix and coordinate data according to the document type to obtain the basic information of all elements in the document; The basic information is uniformly converted into element position information, and the header and footer areas are identified and marked based on the recurrence pattern of elements in the top and bottom fixed areas within multiple pages. Based on the element location information, elements not marked as headers and footers are classified according to their visual shape, content arrangement, and numbering characteristics, marking the text area, table area, chart area, and table of contents area, and integrating them to obtain physical layout information.

3. The text segmentation method based on document structure embedding according to claim 1, characterized in that, S2 specifically includes the following steps: Retrieve the coordinate range of the text area and title area from the physical layout information obtained in step S1, check each page element in the original document, mark the elements whose coordinates fall within the text area and title area as elements to be processed, and exclude elements in other areas. For the marked elements to be processed, if they are elements of an editable document, the text content of the element is read; if they are elements of an image document, the corresponding image area of the element is used for character recognition, the recognition result is converted into text, and then all the extracted text is arranged in the reading order from top to bottom and from left to right in the same row and column to form an initial text sequence. For editable documents, the font name, font size, bold status mark, and italic status mark in the element attribute fields are read. For text in image documents, the element attribute association table stored in the layout analysis tool is used to match the element attributes corresponding to each text fragment, extract the font, size, bold, and italic information, bind each text fragment with the corresponding format information, and arrange them into a formatted text sequence according to the order of the initial text sequence.

4. The text segmentation method based on document structure embedding according to claim 1, characterized in that, S3 specifically includes the following steps: Traverse the formatted text sequence obtained in step S2, extract the format features and text content features corresponding to each line of text, and standardize the extracted features to form a feature set for each line of text. Construct a set of rules for determining hierarchy and type, perform regular expression matching based on the numbering pattern in the text content features, and determine the text type as a title and the corresponding hierarchy if the match is successful. If a match fails, the format feature threshold is used to determine the content. If the content meets the title format threshold, it is determined as a title and classified into levels. If the content does not meet the threshold, it is determined as body text. The level label and type label of each line of text are output. Initialize the root node of the document structure tree, traverse each line of text in the order of the formatted text sequence, create corresponding nodes according to the hierarchy and type labels, adjust the parent-child relationship between nodes by comparing the hierarchy, attach the title node to the corresponding parent title node, attach the body text node to the nearest parent title node, and gradually build the document structure tree.

5. The text segmentation method based on document structure embedding according to claim 1, characterized in that S4 specifically includes the following steps: Starting from the root node of the document structure tree, traverse all child nodes layer by layer, identify the type label of each node, and select the text nodes labeled "paragraph". Concatenate the text content of these nodes in original document order, as given by the startpos attribute of each text node, to form a continuous body text stream. Using periods, question marks, and exclamation marks as sentence delimiters, split the body text stream into independent sentence units; then, using the line breaks in the original document as boundaries, aggregate adjacent sentence units into basic text blocks, and record the number of sentences contained in each basic text block. Define a text block group of 3 consecutive basic text blocks and advance it across all basic text blocks in sequence; for each pair of adjacent basic text blocks in the group, extract the nouns and verbs, calculate the overlap ratio of the two vocabulary sets as the semantic relevance of the pair, and record the relevance value of each pair of adjacent text blocks. Preset a semantic relevance threshold, traverse the relevance values of all adjacent text block pairs, and set a split point between every pair whose relevance is lower than the threshold. Merge the consecutive basic text blocks between split points into a segment, arrange all segments in order to form the preliminary text segments, and mark the range of basic text blocks corresponding to each preliminary text segment.
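As a non-limiting illustration, the sentence splitting, block aggregation, and threshold-based splitting of claim 5 might be sketched as follows. Two simplifications are assumed here and are not taken from the claims: instead of extracting nouns and verbs with a part-of-speech tagger, the sketch uses all lowercase words outside a small hypothetical stopword list, and the "overlap ratio" is computed as the Jaccard ratio (intersection over union). Because advancing a group of 3 blocks visits every adjacent pair, the sketch iterates over adjacent pairs directly.

```python
import re

# Hypothetical stopword list standing in for noun/verb extraction.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "when"}

def split_sentences(text):
    """Split on periods, question marks and exclamation marks."""
    parts = re.findall(r"[^.?!]+[.?!]*", text)
    return [p.strip() for p in parts if p.strip()]

def basic_blocks(text):
    """One basic text block per non-empty line; each block is a sentence list."""
    blocks = []
    for line in text.splitlines():
        sentences = split_sentences(line)
        if sentences:
            blocks.append(sentences)
    return blocks

def keywords(block):
    """Vocabulary set of a block (simplified stand-in for nouns and verbs)."""
    words = re.findall(r"[a-z]+", " ".join(block).lower())
    return {w for w in words if w not in STOPWORDS}

def relevance(block_a, block_b):
    """Overlap ratio of the two vocabulary sets, assumed to be Jaccard."""
    a, b = keywords(block_a), keywords(block_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def segment(blocks, threshold=0.2):
    """Return (first block index, last block index) for each segment."""
    if not blocks:
        return []
    ranges = [[0, 0]]
    for i in range(1, len(blocks)):
        if relevance(blocks[i - 1], blocks[i]) < threshold:
            ranges.append([i, i])      # split point between blocks i-1 and i
        else:
            ranges[-1][1] = i          # extend the current segment
    return [tuple(r) for r in ranges]

text = (
    "Cats purr loudly. Cats sleep often.\n"
    "Cats purr often.\n"
    "Stock markets fell today. Traders sold shares."
)
blocks = basic_blocks(text)
print([len(b) for b in blocks])   # sentence count recorded per block
print(segment(blocks))
```

The returned index ranges correspond to the "range of basic text blocks" that the claim marks for each preliminary text segment.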

6. The text segmentation method based on document structure embedding according to claim 5, characterized in that, in S5, locating the position of each preliminary text segment in the document structure tree, based on the document structure tree obtained in step S3 and the preliminary text segments obtained in step S4, specifically includes the following steps: Retrieve the recorded correspondence table between preliminary text segments and basic text blocks, and extract all the basic text blocks associated with each preliminary text segment; then match the association information between basic text blocks and sentence units, and compile the text content and number of all sentence units contained in each preliminary text segment; using the text nodes selected in step S41 as the search objects, match the text of the first sentence unit in the preliminary text segment against the text attribute of each text node to locate the target text node containing that sentence unit; read the parent attribute of the target text node to determine its parent node; trace back layer by layer along the parent attributes until the root node of the document structure tree is reached, and record all nodes tagged "title" on the backtracking path to form a preliminary title path; to verify the preliminary title path, compare the text of the last sentence unit in the preliminary text segment with the text of the adjacent text nodes at the same level as the target text node, to confirm whether both belong to the title node corresponding to the preliminary title path. If so, the preliminary title path is the structural path of the preliminary text segment, and locating is complete.
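As a non-limiting illustration, the locating and backtracking of claim 6 might be sketched as follows. The minimal `Node` class, the substring match used to find the target text node, and the choice to return `None` when verification fails are assumptions made for this sketch; the claim does not state what happens when the last sentence belongs to a different title.

```python
class Node:
    """Hypothetical tree node with tag, text, parent and children attributes."""
    def __init__(self, tag, text, parent=None):
        self.tag = tag            # "root", "title" or "paragraph"
        self.text = text
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def find_text_node(text_nodes, sentence):
    """Return the first text node whose text contains the sentence unit."""
    for node in text_nodes:
        if sentence in node.text:
            return node
    return None

def titles_upward(node):
    """Collect title nodes from the direct parent up to the root."""
    path = []
    current = node.parent
    while current is not None:
        if current.tag == "title":
            path.append(current)
        current = current.parent
    return path

def locate(text_nodes, first_sentence, last_sentence):
    """Return the preliminary title path, or None if it cannot be verified."""
    target = find_text_node(text_nodes, first_sentence)
    if target is None or target.parent is None:
        return None
    path = titles_upward(target)
    # Verification: the last sentence must sit in a paragraph node that shares
    # the target's parent (the target itself or one of its siblings).
    siblings = [n for n in target.parent.children if n.tag == "paragraph"]
    if find_text_node(siblings, last_sentence) is None:
        return None
    return path

root = Node("root", "")
intro = Node("title", "1 Intro", root)
scope = Node("title", "1.1 Scope", intro)
p1 = Node("paragraph", "Alpha one. Alpha two.", scope)
p2 = Node("paragraph", "Beta one.", scope)
method = Node("title", "2 Method", root)
p3 = Node("paragraph", "Gamma.", method)

path = locate([p1, p2, p3], "Alpha one.", "Beta one.")
print([n.text for n in path])
```

The path is returned in bottom-up order (direct parent title first), which is the order in which backtracking encounters the titles.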

7. The text segmentation method based on document structure embedding according to claim 1, characterized in that, in S5, tracing back through the titles at each level, concatenating them into a title path, and adding the title path as a context prefix to the beginning of the corresponding initial text segment to obtain a text segment with the title path embedded comprises the following steps: Retrieve all title nodes obtained during the locating process, extract the title text from the text attribute of each node, clean each title text by removing redundant punctuation marks, spaces, and meaningless special characters at the end while retaining the title content, and form a standardized title text list; reorder the standardized title text list by reversing the original bottom-up order (direct parent title to root title) into root-to-leaf order (root title, first-level title, second-level title, ..., direct parent title), so that it matches the inherent hierarchical structure of the document; preset a unified path separator, traverse the reordered standardized title text list, and concatenate the title texts in sequence using the separator to generate a coherent title path string, where, if the list contains only a single title text, that title text is used as the title path string; extract the original content of the corresponding preliminary text segment and remove redundant formatting, namely leading blank lines and consecutive spaces; add the title path string as a context prefix to the beginning of the cleaned preliminary text segment, with the prefix and the text segment separated by two newline characters; and, after adding the prefix, check the join to confirm that no characters overlap or are missing, thus forming a text segment with the title path embedded.
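As a non-limiting illustration, the title cleaning, reordering, concatenation, and prefixing of claim 7 might be sketched as follows. The separator `" > "` and the exact set of trailing characters treated as redundant are assumptions made for this sketch; the claim only requires that a unified separator be preset.

```python
import re

def clean_title(title):
    """Collapse internal whitespace and strip trailing punctuation and symbols."""
    title = re.sub(r"\s+", " ", title).strip()
    return re.sub(r"[\s.,:;!?\-_*#]+$", "", title)

def title_path(titles_bottom_up, separator=" > "):
    """Reverse direct-parent-to-root order into root-to-leaf order and join."""
    ordered = [clean_title(t) for t in reversed(titles_bottom_up)]
    ordered = [t for t in ordered if t]
    return separator.join(ordered)   # a single title is returned unchanged

def embed_title_path(titles_bottom_up, segment, separator=" > "):
    """Prefix the segment with its title path, separated by two newlines."""
    body = re.sub(r" {2,}", " ", segment.lstrip())
    path = title_path(titles_bottom_up, separator)
    result = f"{path}\n\n{body}" if path else body
    # Join check: the prefix and the body must both survive intact.
    assert result.startswith(path) and result.endswith(body)
    return result

embedded = embed_title_path(
    ["1.1 Scope:", "1 Introduction  "],
    "\n\n  This  claim covers X.",
)
print(embedded)
```

Because the prefix is written into the segment text itself, it travels with the segment through vectorization and storage without a separate metadata lookup.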

8. The text segmentation method based on document structure embedding according to claim 1, characterized in that S6 specifically includes the following steps: Preprocess each text segment with an embedded title path obtained in step S5 to remove consecutive newline characters, invisible control characters, and repeated spaces, and convert the text encoding to a preset format. For text segments whose length exceeds the input limit of the pre-trained embedding model, truncate the tail content according to the maximum input length of the model and retain the leading text containing the title path. Select a pre-trained text embedding model suitable for semantic extraction from long texts, load its pre-trained weights, initialize the inference environment, and configure the input batch size and output vector dimension of the text embedding model; according to the input requirements of the text embedding model, convert the preprocessed text segments into a tensor format that the model accepts, and generate text input batches; pass the text input batches to the pre-trained text embedding model and call its inference interface to perform semantic feature extraction, generating a raw floating-point vector for each text segment; record the dimension of each raw vector, and establish a unique ID mapping between each vector and the corresponding text segment with the embedded title path. Normalize the generated raw floating-point vectors: compute the modulus of each vector and divide each element of the vector by the modulus to obtain a standardized vector with a modulus of 1. Verify the numerical range of the standardized vectors, remove vectors with abnormal values, and regenerate the vectors for the corresponding text segments. Finally, the standardized text segment vector corresponding to each text segment is obtained.
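As a non-limiting illustration, the preprocessing, tail truncation, normalization, and ID mapping of claim 8 might be sketched as follows. No real embedding model is called: `embed_fn` is a hypothetical stand-in supplied by the caller, the length limit is counted in characters rather than model tokens, and abnormal vectors are simply skipped instead of being regenerated. All of these are assumptions made for this sketch.

```python
import math
import re

def preprocess(text, max_chars=512):
    """Clean the text and truncate the tail so the title path prefix survives."""
    # Drop control characters other than tab and newline, and zero-width spaces.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f\u200b]", "", text)
    # Collapse consecutive newlines; this also turns the two-newline separator
    # after the title path into a single newline.
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r" {2,}", " ", text)
    return text[:max_chars]          # keep the head, cut the tail

def l2_normalize(vector):
    """Divide each element by the modulus; return None for abnormal vectors."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0 or not math.isfinite(norm):
        return None
    return [x / norm for x in vector]

def index_segments(segments, embed_fn, max_chars=512):
    """Map a unique integer ID to (cleaned text, unit-length vector)."""
    store = {}
    for seg_id, text in enumerate(segments):
        cleaned = preprocess(text, max_chars)
        vector = l2_normalize(embed_fn(cleaned))
        if vector is not None:       # abnormal vectors are skipped in this sketch
            store[seg_id] = (cleaned, vector)
    return store

# Hypothetical toy embedding used only to exercise the pipeline.
def toy_embed(text):
    return [float(len(text)), 1.0]

store = index_segments(["A > B\n\nBody  text\x00 here", "C\n\nMore text"], toy_embed)
for seg_id, (cleaned, vector) in store.items():
    print(seg_id, repr(cleaned), round(math.sqrt(sum(x * x for x in vector)), 6))
```

Truncating from the tail rather than the head is what guarantees that the embedded title path is always part of the model input, even for over-length segments.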

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the text segmentation method based on document structure embedding as described in any one of claims 1 to 8.

10. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the text segmentation method based on document structure embedding as described in any one of claims 1 to 8.