A method and system for text structure recognition and recovery based on structural features and spatial distribution analysis
By using a shared slice pool and structural feature vector analysis, the logical structure of non-standard text is identified and restored, solving the problem of inaccurate directory identification in existing technologies and achieving efficient text structure reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 邵珠清
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing e-reading and text processing software cannot accurately identify the table of contents when processing non-standard text, resulting in missing or non-standard table of contents or the inclusion of junk text. Furthermore, the lack of a unified data sharing layer leads to low processing efficiency and limited accuracy.
By establishing a shared slice pool, extracting the structural feature vectors of row slices, performing pattern clustering and confidence decisions, identifying structurally abnormal regions, and restoring the logical structure through a self-healing completion mechanism, a complete directory index tree is generated.
It enables the recovery of the logical structure of non-standard text, improves text processing efficiency, supports global languages and various non-standard and ancient documents, and meets the requirements for low-power operation.
Figure CN122309518A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of text processing technology, specifically relating to a method and system for text structure recognition and recovery based on structural features and spatial distribution analysis.
Background Technology
[0002] For novels and other texts intended for electronic reading, an accurate table of contents is a core element that directly affects the reading experience. Existing electronic reading and text processing software primarily relies on semantic keywords or rule templates to recognize the table of contents. Because of this dependence, when semantic features are missing (e.g., identifiers deleted by accident, layout variations, or OCR errors), the logical structure of the text cannot be recovered, which leaves structurally abnormal regions and broken logical chains. Furthermore, current technologies often scan and recognize text formatting in isolated passes without a unified data sharing layer, resulting in low efficiency and limited accuracy when processing large-scale non-standard text.
Summary of the Invention
[0003] The present invention aims to provide a method and system for text structure recognition and recovery based on structural features and spatial distribution analysis, so as to solve the technical problem that the existing technology cannot perform text typesetting based on the physical structure of the text.
[0004] In practical text processing, it is sometimes necessary to automatically generate a table of contents for plain text without formatting rules (such as TXT ebooks or novels). For example, a user might have a downloaded novel of millions of words, "Document A.txt," of extremely poor quality. It lacks a standard "Chapter X" prefix, some chapter titles are plain text such as "So-and-so's Anger," and the text is interspersed with numerous misformatted advertisements, author asides (such as "Requesting monthly votes!"), and long dialogues. Traditional readers (such as Apple Books or ordinary reading software) scan with fixed regular expressions (such as ^Chapter.*$). On such non-standard text, this leads to many missing chapters or to a large amount of junk text in the table of contents (a fake table of contents).
[0005] To solve the above-mentioned technical problems, the present invention adopts the following technical solution: a text structure recognition and recovery method based on structural features and spatial distribution analysis, including the following steps.
Text data processing: Establish a shared slice pool, perform a full linear scan of the text, divide the text into line slices according to line breaks, record the generated line slices in the shared slice pool, and synchronously record the physical environment characteristics of each line slice in the shared slice pool.
Structural feature vector extraction: Extract the character skeleton information of each line slice and quantize the line slice according to preset attribute mapping rules, generating a structural feature vector that describes the physical contour of the text in the line slice.
Pattern clustering of structural feature vectors: Statistically analyze the structural feature vectors extracted from the line slices of the entire text, and cluster structural feature vectors with the same pattern into a structural family based on the repetition frequency of each structural feature vector.
Decision on the confidence of structural families: Jointly arbitrate the confidence of the structural feature vectors in each clustered structural family as directory nodes, to determine whether the structural family can serve as the logical backbone nodes of the text.
Text structure pulse stability audit: Using statistical indicators, obtain the baseline value of the structural pulse based on a statistical distribution model of the physical distance between nodes.
Text diagnosis: Scan the node spacing of the entire text to find structurally abnormal regions. The identification threshold for structurally abnormal regions is adaptively calculated from the statistical distribution of the spacing between adjacent chapters. When the spacing between adjacent chapters is significantly abnormal compared with the structural pulse reference value, the text content between the adjacent chapters is determined to be a structurally abnormal region.
Text structure repair: Re-search the line slices corresponding to the structurally abnormal region in the shared slice pool, find candidate line slices that could serve as directory entries, determine whether the candidate line slices can be used as directory nodes in the structurally abnormal region, and perform node self-healing completion.
Structural index output: After the above cleaning and repair, assign global logical indices to the line slices corresponding to the structural feature vectors in all structural families, and output a directory index tree with a complete inheritance chain.
[0006] Preferably, in the text data processing step, the physical environment features include the length of the line slice, the whitespace ratio, and the line indentation.
[0007] Preferably, the text data processing step further includes: filtering out long line slices whose length exceeds a preset number of characters, retaining short line slices, and recording the absolute character offset of each line slice in the entire text. Preferably, in the step of extracting the structural feature vector, the method of mapping the line slices to generate the structural feature vector includes mapping characters into a number class, letter class, unit class, punctuation class, and ordinary character class, or performing word-level longest common prefix analysis on the line slices to identify and extract static structural identifiers.
[0008] Preferably, in the step of determining the confidence level of the structure family, the weight of the confidence level is calculated by combining the structural length consistency, node density stability, and sequence continuity of the structural feature vectors in the structure family.
[0009] Preferably, in the text diagnosis step, the spacing between adjacent chapters is determined to be significantly abnormal compared to the structural pulse reference value when Gap_i > k × Pulse, where Gap_i is the spacing between adjacent chapters, Pulse is the structural pulse reference value, and k is the dynamic deviation coefficient.
[0010] Preferably, in the text structure repair step, determining whether the candidate line slice can be used as a directory node in the structurally abnormal region includes: when the features of the candidate line slice satisfy at least one of the structural conditions of sequence number inheritance, homology of structural feature vectors, or resonance of spatial sites, the candidate line slice is used as a directory node in the structurally abnormal region.
[0011] Preferably, in the text structure pulse stability audit step, the statistical indicator is the coefficient of variation (CV) of the word count between adjacent chapters. In the step of pattern clustering of the structural feature vectors, a structural fingerprint database is constructed, and the information of the structural families is stored in the structural fingerprint database.
[0012] This invention also provides a text structure recognition and recovery system based on structural features and spatial distribution analysis, comprising the following modules.
Text data processing module: used to establish a shared slice pool, perform a full linear scan of the text, divide the text into line slices according to newline characters, record the generated line slices in the shared slice pool, and synchronously record the physical environment characteristics of each line slice in the shared slice pool. The physical environment characteristics include the length of the line slice, the whitespace ratio, and the line indentation. Long line slices whose length exceeds a preset number of characters are filtered out, short line slices are retained, and the absolute character offset of each line slice in the entire text is recorded.
Structural feature vector extraction module: used to extract the character skeleton information of each line slice and map it to a quantized structural feature vector that describes the physical contour of the text in the line slice. The mapping method includes classifying characters into a number class, letter class, unit class, punctuation class, and ordinary character class.
Structural feature vector pattern clustering module: used to count the structural feature vectors extracted from the line slices of the entire text and to construct a structural fingerprint database based on the repetition frequency of each structural feature vector. In the structural fingerprint database, structural feature vectors with the same pattern are clustered into a structural family.
Structural family confidence decision module: used to collaboratively arbitrate the confidence of the structural feature vectors in each clustered structural family as directory nodes. The confidence weight is calculated from the structural length consistency, node density stability, and sequence continuity of the structural feature vectors in the structural family, and is used to determine whether the structural family can serve as the logical backbone nodes of the text.
Text structure pulse stability audit module: used to evaluate the distribution stability of the entire text backbone using statistical indicators, calculate the structural pulse baseline value of the entire text, and calculate the average number of words between two adjacent chapters. The statistical indicator is the coefficient of variation.
Text diagnosis module: used to scan the node spacing of the entire text to find structurally abnormal regions. The identification threshold for these regions is adaptively calculated from the statistical distribution of the spacing between adjacent chapters. When the spacing between adjacent chapters is significantly abnormal compared to the structural pulse reference value, the text content between those adjacent chapters is determined to be a structurally abnormal region. The spacing is judged significantly abnormal when Gap_i > k × Pulse, where Gap_i is the spacing between adjacent chapters, Pulse is the structural pulse reference value, and k is the dynamic deviation coefficient.
Text structure repair module: used to re-retrieve the line slices corresponding to the structurally abnormal region in the shared slice pool and find candidate line slices that could serve as directory entries. When the features of a candidate line slice satisfy at least one of the structural conditions of sequence number inheritance, structural feature vector homology, or spatial site resonance, the module performs node self-healing completion and uses the candidate line slice as a directory node in the structurally abnormal region.
Structure index output module: after the above cleaning and repair, assigns global logical numbers to the line slices corresponding to the structural feature vectors of all structural families in the structural fingerprint database, and outputs a directory index tree with a complete inheritance chain.
[0013] The present invention also provides a computer-readable storage medium storing at least one instruction, which is loaded and executed by a processor to implement the text structure recognition and recovery method based on structural features and spatial distribution analysis described in any of the preceding claims.
[0014] Compared with existing technologies, the beneficial effects of this invention are as follows. This text structure recognition and recovery method based on structural features and spatial distribution analysis establishes a shared slice pool recording line length and indentation features; maps line slices to structural feature vectors containing their skeleton information and constructs a structural fingerprint database; establishes structural families through similarity clustering; identifies structural nodes through collaborative arbitration based on confidence weights calculated from structural length consistency and node density stability; calculates the text structural pulse and identifies structurally abnormal regions; and then completes the structural nodes within the abnormal regions through sequence number inheritance or feature homology verification to restore the logical sequence. This realizes text typesetting based on the physical structure of the text. The method has the following advantages: through the structural pulse detection and repair mechanism, it solves the problem of text structure discontinuity caused by typesetting variations or missing identifiers; it does not rely on language-specific corpora but on physical feature modeling, so it supports global languages and various non-standard and ancient documents; and the shared slice pool mechanism avoids repeated scanning of the text, meets the millisecond-level parsing and low-power operation requirements of edge devices during text processing, and greatly improves the efficiency of text structure reconstruction.
Attached Figure Description
[0015] The present invention will be explained in detail below with reference to the accompanying drawings. It should be noted that the drawings are used to provide a further understanding of the present invention and form part of the specification. They are used together with the embodiments of the present invention to explain the present invention, but should not impose any limitation on the implementability of the present invention.
[0016] In the drawings: Figure 1 is a flowchart of an embodiment of the text structure recognition and recovery method based on structural features and spatial distribution analysis of the present invention.
[0017] Figure 2 is an architecture diagram of an embodiment of the text structure recognition and recovery system based on structural features and spatial distribution analysis of the present invention.
Detailed Implementation
[0018] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.
[0019] In one embodiment, a text structure recognition and recovery method based on structural features and spatial distribution analysis is provided. As shown in Figure 1, the method includes the following steps: Step S100, Text Data Processing: Establish a shared slice pool, perform a full linear scan of the text, divide the text into line slices according to newline characters, record the generated line slices in the shared slice pool, and synchronously record the physical environment characteristics of each line slice in the shared slice pool. The physical environment characteristics include the length of the line slice, the proportion of whitespace, and the line indentation. Filter out long line slices with a length exceeding 45 characters, retain short line slices, and record the absolute character offset of each line slice in the entire text.
[0020] In this step, dividing the text into line slices achieves a "spatial archiving" of the text and clarifies its physical structure. Millions of words of text can be divided into hundreds of thousands of line slices based on line breaks. After long lines exceeding 45 characters (which are treated as not belonging to the table of contents) are filtered out and short lines are retained, the text's table of contents is essentially located within the remaining line slices. Recording the absolute character offset of each line slice makes it possible to locate the line slices belonging to table-of-contents nodes within the entire text. The length threshold for filtering can be set according to the text format, within a preset range of 40 to 50 characters.
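As a minimal sketch of step S100, the following Python code builds a shared slice pool in a single linear scan. The names `LineSlice` and `build_slice_pool` are hypothetical, the 45-character threshold is taken from this embodiment, and measuring length after stripping surrounding whitespace is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LineSlice:
    text: str                # line content with surrounding whitespace removed
    offset: int              # absolute character offset of the line start
    length: int              # character count of the stripped line
    whitespace_ratio: float  # share of whitespace characters in the raw line
    indent: int              # number of leading whitespace characters


def build_slice_pool(text: str, max_len: int = 45) -> list[LineSlice]:
    """Scan the text once and keep only short, non-empty lines."""
    pool = []
    offset = 0
    for raw in text.split("\n"):
        stripped = raw.strip()
        if stripped and len(stripped) <= max_len:
            pool.append(LineSlice(
                text=stripped,
                offset=offset,
                length=len(stripped),
                whitespace_ratio=sum(ch.isspace() for ch in raw) / len(raw),
                indent=len(raw) - len(raw.lstrip()),
            ))
        offset += len(raw) + 1  # +1 for the newline removed by split
    return pool
```

Because every later step reads from this pool rather than from the raw text, the text itself is scanned only once.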
[0021] Step S200: Extracting structural feature vectors: Extracting the character skeleton information of the line slices, and performing quantization processing on the line slices according to the preset attribute mapping rules to generate structural feature vectors that characterize the physical contours of the line slices. The text physical contours of the line slices are described by the structural feature vectors. The methods for mapping the line slices to generate structural feature vectors include classifying characters into number classes, letter classes, unit classes, punctuation classes and ordinary character classes for mapping, or performing word-level longest common prefix analysis on the text line slices to identify and extract static structural identifiers.
[0022] The word-level longest common prefix analysis performed on the line slices here yields a structural feature vector of a "non-numerical structure". When mapping elements such as numbers and units are absent, it automatically extracts static family markers (such as a repeated book title or author name) by taking the word-level longest common prefix.
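A word-level longest common prefix can be sketched as follows. The function name `word_level_common_prefix` is hypothetical, and splitting on whitespace is an assumption that suits space-delimited languages only; text without word spacing would need a different tokenizer.

```python
def word_level_common_prefix(lines: list[str]) -> str:
    """Return the longest run of leading words shared by every line."""
    if not lines:
        return ""
    token_lists = [line.split() for line in lines]
    prefix = []
    # zip stops at the shortest line, so the prefix never overruns any line
    for words in zip(*token_lists):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return " ".join(prefix)
```

For the lines "The Long Road - Dawn", "The Long Road - Dusk", and "The Long Road - Night", the shared prefix is "The Long Road -", which would serve as the static structural identifier.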
[0023] In this step, since the content of real text varies greatly, in order to find patterns, it is necessary to perform structural dimensionality reduction on each line slice. By mapping, the specific semantics in the line slice are stripped away (regardless of what words are written), and a "structural fingerprint" that only reflects the physical layout outline is obtained.
[0024] Here, the preset attribute mapping rules include, but are not limited to, classification coding, hash mapping, or character attribute fingerprinting. Classification coding categorizes words in the text according to semantics, part-of-speech, usage, etc., and assigns them corresponding numerical codes. For example, "Chapter 125: The Decisive Battle" is mapped to a structural feature vector DDU C (D represents a number, U represents a unit, and C represents a Chinese character). In the structural feature vector DDU C, the first D represents "100," the second D represents "25," U represents "chapter," and C represents "decisive battle." By performing this structural dimensionality reduction on line slices, structural feature vectors are extracted, enabling the extraction of the "skeleton and DNA" of the text content corresponding to the line slices. Different structural feature vectors are constructed for different sections, such as chapters, sections, and subsections, to distinguish between them. For example, the table of contents for the large chapter "[Volume Two: The Imperial Capital]" is mapped to other types of structural feature vectors to differentiate it from the table of contents for subsections.
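The classification coding described above can be illustrated with a per-character sketch. The unit list `UNIT_CHARS`, the extra class P for punctuation, and the collapsing of repeated classes into one symbol are all assumptions, so the output does not reproduce the exact tokenization behind the "DDU C" example; it only shows how lines with the same layout map to the same vector.

```python
import unicodedata

# Hypothetical unit characters (chapter, section, volume, ...); not from the source.
UNIT_CHARS = set("章节卷回篇部集")


def char_class(ch: str) -> str:
    """Map one character to a class symbol: D, U, A, P, C, or a space."""
    if ch.isspace():
        return " "
    if ch.isdigit():
        return "D"
    if ch in UNIT_CHARS:
        return "U"
    if ch.isascii() and ch.isalpha():
        return "A"
    if unicodedata.category(ch).startswith("P"):
        return "P"
    return "C"


def structural_vector(line: str) -> str:
    """Collapse runs of the same class into a single symbol."""
    out = []
    for ch in line.strip():
        code = char_class(ch)
        if not out or out[-1] != code:
            out.append(code)
    return "".join(out)
```

Under these assumptions, "第125章 决战" and "第126章 风起" both map to "CDU C", while a dialogue line such as `Xiao Yan said: "No!"` maps to a different vector.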
[0025] Hash mapping converts words in a text line slice into fixed-length numerical or string identifiers (hash values) using a hash function, thereby enabling fast lookup, comparison, and storage of words at corresponding positions in the line slice.
[0026] Character attribute fingerprinting is a method of generating fingerprints based on the inherent attributes of characters, including ASCII values, initial letters of pinyin, character structure, and stroke count. For example, special symbols, punctuation marks, and stroke counts in line slice text can be extracted as features to generate character attribute fingerprints for line slices.
[0027] Step S300: Pattern clustering of structural feature vectors: Statistically analyze the structural feature vectors of the entire text extracted from line slices. Based on the repetition frequency of each structural feature vector, construct a structural fingerprint database. In the structural fingerprint database, structural feature vectors with the same pattern are clustered into a structural family.
[0028] In this step, different structural families obtained by pattern clustering are stored in the fingerprint database to achieve structural family aggregation. For example, by counting the structural feature vectors in the full text obtained earlier, it is found that the feature "DDU C" appears 1200 times in the book, and it can be clustered into a "structural family".
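The frequency-based clustering can be sketched with a dictionary keyed by the structural feature vector. The function name `cluster_families` and the `min_count` cutoff are hypothetical; the source only states that families are formed from the repetition frequency of identical patterns.

```python
from collections import defaultdict


def cluster_families(vectors: list[str], min_count: int = 3) -> dict[str, list[int]]:
    """Group line indices by identical vector; keep patterns that repeat often enough."""
    families = defaultdict(list)
    for idx, vec in enumerate(vectors):
        families[vec].append(idx)
    return {vec: idxs for vec, idxs in families.items() if len(idxs) >= min_count}
```

The returned mapping plays the role of a small structural fingerprint database: each key is a pattern and each value lists the line slices that belong to that structural family.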
[0029] Step S400, Decision on the confidence of structural families: The confidence of the structural feature vectors in each structural family obtained by clustering as directory nodes is jointly arbitrated. The confidence weight is calculated by the structural length consistency, node density stability and sequence continuity of the structural feature vectors in the structural family, and is used to determine whether the structural family can be used as the logical backbone node of the text.
[0030] In this step, collaboratively arbitrating the credibility of each structural family as a table of contents eliminates non-directory line slices: candidate line slices that do not conform to the characteristics of a structural node family are identified and either down-weighted or excluded from the structural node set. Here, structural length consistency refers to the relative balance in length among the chapters in the text, so that no part is far too long or too short. Node density stability refers to the uniformity of the chapter density in the text. Sequence continuity refers to whether the chapter sequence is coherent, with a progressive relationship between chapters and no jumps or breaks. The decision on structural family confidence is generally made by arbitrating ("sovereignty arbitration") over the amount of text between candidate nodes (the "physical territory thickness") and merging neighboring nodes whose spacing falls below a preset threshold ("sequence gravity merging"). For example, suppose a fake family is mixed into the fingerprint database because the text contains 50 short dialogues (such as "Xiao Yan said: 'I am not convinced!'"). Examining the punctuation of this structural family shows that it contains narrative punctuation (colon, quotation mark, exclamation mark). Since real directories rarely contain exclamation marks, the system judges that this family exceeds the allowed level of narrative punctuation ("toxic beyond the limit") and disqualifies it from the candidate list, thus eliminating non-directory text.
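One possible reading of this arbitration is sketched below. Everything in it is an assumption made for illustration: the function names, the equal weighting of the three factors, the 1 / (1 + CV) evenness score, the fixed window used for node density, the 50% veto limit, and the reduced punctuation set (quotation and exclamation marks only, since chapter titles may legitimately contain a colon).

```python
from statistics import mean, pstdev

# Reduced, hypothetical "narrative punctuation" set.
NARRATIVE_PUNCT = set("\"“”!！")


def evenness(values):
    """Return 1 / (1 + CV): 1.0 for perfectly even values, lower for uneven ones."""
    if not values or mean(values) == 0:
        return 0.0
    return 1.0 / (1.0 + pstdev(values) / mean(values))


def family_confidence(lines, offsets, numbers, window=10_000):
    """Score one non-empty structural family in [0, 1]."""
    toxic = sum(any(ch in NARRATIVE_PUNCT for ch in line) for line in lines)
    if toxic / len(lines) > 0.5:  # mostly dialogue-like lines: disqualify
        return 0.0
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    length_consistency = evenness(gaps)  # balance of chapter lengths
    first, last = offsets[0] // window, offsets[-1] // window
    per_window = [sum(off // window == w for off in offsets)
                  for w in range(first, last + 1)]
    density_stability = evenness(per_window)  # nodes per fixed-size window
    steps = [b - a for a, b in zip(numbers, numbers[1:])]
    continuity = sum(s == 1 for s in steps) / len(steps) if steps else 0.0
    return (length_consistency + density_stability + continuity) / 3
```

Here `offsets` are the sorted absolute character offsets of the family's lines and `numbers` are the sequence numbers extracted from them. A family whose score falls below some chosen cutoff would be excluded from the backbone.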
[0031] Step S500, Text Structure Pulse Stability Audit: Using statistical indicators, obtain the baseline value of the structure pulse based on a statistical distribution model of node physical spacing. The parameters of this statistical distribution model based on node physical spacing can include various statistical quantities such as mean, median, quantiles, or distribution function fitting, thereby covering a wider range of text variations.
[0032] For example, statistical indicators can be used to assess how stably the backbone nodes are distributed across the entire text. The average number of words between two adjacent chapters in the main body of the text is calculated (e.g., an average of 3,000 words per chapter) and taken as the baseline value of the structural pulse; here, 3,000 words is the "structural pulse" of the text, that is, the word-count rhythm of the author's writing. The statistical indicator is the coefficient of variation (CV) of the chapter spacing (number of words), obtained from the variance of the spacing as the standard deviation divided by the mean.
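The pulse baseline and the coefficient of variation can be computed from the node offsets as in the sketch below. The function name `structural_pulse` is hypothetical, and using the mean spacing as the baseline follows this example only; paragraph [0031] also allows the median or quantiles.

```python
from statistics import mean, pstdev


def structural_pulse(offsets: list[int]) -> tuple[float, float]:
    """Return (pulse, cv) for sorted absolute offsets of the backbone nodes."""
    if len(offsets) < 2:
        raise ValueError("need at least two nodes to measure spacing")
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    pulse = mean(gaps)          # average spacing between adjacent chapters
    cv = pstdev(gaps) / pulse   # standard deviation relative to the mean
    return pulse, cv
```

A CV close to zero indicates a steady chapter rhythm; a large CV suggests that the backbone is unevenly spaced and that the baseline should be treated with caution.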
[0033] Step S600, Text Diagnosis: Scan the node spacing of the entire text. When the spacing between adjacent chapters is significantly abnormal compared to the structural pulse reference value, the text content between the adjacent chapters is determined to be a structurally abnormal region. The spacing is judged significantly abnormal when Gap_i > k × Pulse, where Gap_i is the spacing between adjacent chapters, Pulse is the structural pulse reference value, and k is the dynamic deviation coefficient.
[0034] In this step, Gap_i is the distance between adjacent chapters, specifically the difference in absolute character offset between the i-th identified node and the (i+1)-th identified node. Diagnosing structurally abnormal regions in the text detects "black areas" or faults. For example, if a downward scan finds a gap of 15,000 words between Chapters 15 and 16 (i.e., Gap > 2 × Pulse), the diagnosis is that the author did not suddenly write a long chapter; rather, the gap contains unidentified chapters (missing chapters).
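The threshold test Gap > k × Pulse can be sketched as a single pass over adjacent node offsets. The function name `find_abnormal_regions` is hypothetical, and k is passed as a fixed value here, whereas the source describes it as adaptively calculated.

```python
def find_abnormal_regions(offsets: list[int], pulse: float, k: float = 2.0) -> list[tuple[int, int]]:
    """Return (start, end) offset pairs whose spacing exceeds k * pulse."""
    return [(a, b) for a, b in zip(offsets, offsets[1:]) if (b - a) > k * pulse]
```

With nodes at offsets 0, 3000, 6000, 21000, and 24000 and a pulse of 3,000, only the 15,000-character stretch between 6000 and 21000 is reported as a structurally abnormal region.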
[0035] Step S700, Text Structure Repair: Re-search the line slices corresponding to the structurally abnormal regions in the shared slice pool and find candidate line slices that could serve as directory entries. When the features of a candidate line slice satisfy at least one of the structural conditions of sequence number inheritance, homology of structural feature vectors, or resonance of spatial sites, perform node self-healing completion and use the candidate line slice as a directory node in the structurally abnormal region.
[0036] Sequence number inheritance refers to the hierarchical continuity of chapter numbers. For example, if the previous chapter is "Chapter 125" and the next chapter is "Chapter 126," with no candidate line slices in between, then the inheritance relationship is satisfied. This inheritance keeps the document logic clear and its levels distinct, making it easy for readers to follow the content structure. Homology of structural feature vectors means that different chapters have similar structural vector representations. For example, if each chapter follows the structure "Introduction, Methods, Results, Discussion," its structural features can be encoded as fixed-dimensional vectors. If the structural vectors of candidate line slices are highly similar to those of other chapters, then the candidate line slices can be considered independent chapters, just like the others. Spatial site resonance derives the expected spatial location of candidate nodes from the structural pulse reference value and matches candidate line slices in the shared slice pool within a preset tolerance range, determining them to be structural nodes.
[0037] In this step, text repair is used to determine the accurate directory of the text. For example, the previously identified 15,000-character "abnormal black area" is isolated, and the recognition threshold is lowered for directory retrieval. In this black area (structurally abnormal area), an isolated line of text "(Yunlan Sect Change)" is found. Although it does not contain the words "Chapter X", it is exactly at the physical midpoint of the 15,000 characters and conforms to the sequential inheritance relationship. Therefore, the directory can be self-healed and modified based on the content of this text, and it can be exceptionally promoted and admitted as a real directory node.
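The spatial site resonance condition can be sketched as follows; sequence number inheritance and vector homology are not covered. The function name `heal_region`, the even division of the gap into expected sites, and the tolerance of 20% of the pulse are assumptions made for illustration.

```python
def heal_region(start: int, end: int, pulse: float,
                candidates: list[tuple[int, str]],
                tolerance: float = 0.2) -> list[tuple[int, str]]:
    """Pick candidate (offset, text) lines lying near the expected node sites."""
    missing = round((end - start) / pulse) - 1  # estimated number of lost nodes
    recovered = []
    for i in range(1, missing + 1):
        # expected sites divide the abnormal region into equal parts
        expected = start + i * (end - start) / (missing + 1)
        near = [c for c in candidates if abs(c[0] - expected) <= tolerance * pulse]
        if near:
            recovered.append(min(near, key=lambda c: abs(c[0] - expected)))
    return recovered
```

For a region from 6000 to 12000 with a pulse of 3,000, one node is expected near offset 9000, so a short line at offset 9100 would be promoted while an advertisement line at offset 7000 would be ignored.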
[0038] Step S800, Structure Index Output: After the above cleaning and repair, global logical sequence numbers are assigned to the line slices corresponding to the structural feature vectors in all structural families in the structural fingerprint database, and a directory index tree with a complete inheritance chain is output.
[0039] In this step, through cleaning and repair, the content of the row slices corresponding to the remaining structural feature vectors can largely serve as real directory nodes. All real nodes are assigned logical numbers (1, 2, 3...). Based on the hierarchical inclusion relationship of volumes and chapters, a perfect directory tree with nested folding effects, zero redundancy, and zero missing elements is output.
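Numbering the nodes and nesting chapters under volumes can be sketched with a stack. The function name `build_index_tree`, the dictionary layout, and the numeric level values (1 for a volume, 2 for a chapter) are assumptions; the source does not specify how levels are encoded.

```python
def build_index_tree(nodes: list[tuple[int, int, str]]) -> dict:
    """Build a nested tree from (offset, level, title) tuples with level >= 1."""
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]
    for seq, (offset, level, title) in enumerate(sorted(nodes), start=1):
        node = {"seq": seq, "offset": offset, "level": level,
                "title": title, "children": []}
        # climb back up until the top of the stack is a strict ancestor
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

Each node keeps its global sequence number and absolute offset, so a reader application could fold volumes and jump directly to a chapter.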
[0040] In one embodiment, a text structure recognition and recovery system based on structural features and spatial distribution analysis is provided. As shown in Figure 2, the system includes a text data processing module 100, a structural feature vector extraction module 200, a structural feature vector pattern clustering module 300, a structural family confidence decision module 400, a text structure pulse stability audit module 500, a text diagnosis module 600, a text structure repair module 700, and a structure index output module 800. The specific functional modules of the system are as follows: The text data processing module 100 is used to establish a shared slice pool, perform a full linear scan of the text, divide the text into line slices according to newline characters, record the generated line slices in the shared slice pool, and synchronously record the physical environment characteristics of each line slice in the shared slice pool. The physical environment characteristics include the length of the line slice, the whitespace ratio, and the line indentation. Long line slices with a length exceeding 45 characters are filtered out, and short line slices are retained. At the same time, the absolute character offset of each line slice in the entire text is recorded.
[0041] The structural feature vector extraction module 200 is used to extract the character skeleton information of the line slice and map it to generate a quantized structural feature vector. The structural feature vector describes the text physical contour of the line slice. The method of mapping the line slice to generate the structural feature vector includes classifying characters into number class, letter class, unit class, punctuation class and ordinary character class for mapping.
[0042] The structural feature vector here refers to the sequence generated by quantizing and mapping the physical contour of the text line, which includes the skeleton information of the line slice (punctuation sequence) and the character attribute fingerprint (number → D, unit → U, ordinary Chinese character → C, letter → A, etc.).
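As a minimal sketch of the mapping described above, the following code assigns each character to a class code (D, U, C, A) and keeps punctuation verbatim as the skeleton. The unit character list, the dropping of whitespace, the collapsing of repeated codes, and the restriction of the letter class to ASCII letters are all assumptions made for illustration and are not specified in the description.

```python
import unicodedata

# Hypothetical list of unit characters; the description does not enumerate them.
UNIT_CHARS = set("章节卷回篇部集")


def char_class(ch: str) -> str:
    """Map one character to its attribute code."""
    if ch.isnumeric():          # covers "1" as well as numerals such as "十"
        return "D"
    if ch in UNIT_CHARS:
        return "U"
    if unicodedata.category(ch).startswith("P"):
        return ch               # punctuation is kept as-is (skeleton information)
    if ch.isascii() and ch.isalpha():
        return "A"
    return "C"                  # ordinary characters


def feature_vector(line: str) -> str:
    """Quantize a line slice into a structural feature vector string."""
    codes = []
    for ch in line:
        if ch.isspace():
            continue            # assumption: whitespace is not part of the vector
        code = char_class(ch)
        if not codes or codes[-1] != code:
            codes.append(code)  # assumption: runs of the same code are collapsed
    return "".join(codes)
```

Under these assumptions, "第1章 开始" and "第十二章 结局" both map to `CDUC`, so headings that differ only in their numbers and titles share one vector.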
[0043] The structural feature vector pattern clustering module 300 is used to statistically analyze the structural feature vectors extracted from the line slices of the entire text and, based on the repetition frequency of each structural feature vector, to construct a structural fingerprint database. In the structural fingerprint database, structural feature vectors with the same pattern are clustered into a structural family. Here, a structural family is a group of candidate nodes whose feature vectors are highly similar, whose physical steps are consistent, and which satisfy sequence inheritance relationships. The structural fingerprint database is the set of patterns obtained by statistically summarizing the repetition frequency of specific structural feature vectors in the document.
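The frequency-based grouping can be illustrated with the following sketch, which treats "the same pattern" as exact equality of vector strings. The function name `build_fingerprint_db` and the `min_repeats` threshold are hypothetical, and the additional family criteria named above (consistent physical steps, sequence inheritance) are not modeled here.

```python
from collections import defaultdict


def build_fingerprint_db(vectors, min_repeats=3):
    """Group line indices by feature vector and keep recurring patterns.

    vectors: iterable of (line_index, feature_vector) pairs.
    Returns {feature_vector: [line_index, ...]} for each structural family.
    """
    families = defaultdict(list)
    for index, vector in vectors:
        families[vector].append(index)
    # Assumption: a pattern must repeat at least min_repeats times to form a family.
    return {v: members for v, members in families.items()
            if len(members) >= min_repeats}
```

For example, five line slices with vectors `CDUC`, `C`, `CDUC`, `CDUC`, and `D.A` yield a single family keyed by `CDUC`.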
[0044] The structural family confidence decision module 400 is used to perform collaborative arbitration on the confidence that the structural feature vectors in each clustered structural family are directory nodes. The confidence weight is calculated from the structural length consistency, node density stability, and sequence continuity of the structural feature vectors in the structural family, and is used to determine whether the structural family can serve as the logical backbone nodes of the text.
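The description names three inputs to the confidence weight but not how they are combined. The sketch below is one possible reading, in which each input is scored between 0 and 1 and the scores are combined as a weighted sum; the scoring formulas, the equal default weights, and the function name `family_confidence` are assumptions, not the method disclosed.

```python
from statistics import mean, pstdev


def _cv(values):
    """Coefficient of variation (population standard deviation / mean)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def family_confidence(lengths, offsets, numbers, weights=(1/3, 1/3, 1/3)):
    """Score a structural family; inputs are non-empty and ordered by position.

    lengths: character length of each member line slice
    offsets: absolute character offset of each member
    numbers: sequence number parsed from each member
    """
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    steps = list(zip(numbers, numbers[1:]))
    # Structural length consistency: similar line lengths score close to 1.
    length_score = 1 / (1 + _cv(lengths))
    # Node density stability: evenly spaced members score close to 1.
    density_score = 1 / (1 + _cv(gaps)) if gaps else 0.0
    # Sequence continuity: share of consecutive members that increase by one.
    continuity = sum(b == a + 1 for a, b in steps) / len(steps) if steps else 0.0
    w_len, w_den, w_seq = weights
    return w_len * length_score + w_den * density_score + w_seq * continuity
```

A family with uniform lengths, even spacing, and numbers 1, 2, 3, 4 scores 1.0 under this sketch, while an irregular family with unordered numbers scores lower.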
[0045] The text structure impulse stability audit module 500 is used to evaluate the distribution stability of the entire text backbone using a statistical indicator, calculate the structural impulse baseline value of the entire text (referred to below as the structural pulse reference value), and calculate the average number of words between two adjacent chapters. The statistical indicator is the coefficient of variation. Here, the text structure impulse refers to the macroscopic rhythmic characteristics of structural nodes in physical space, and describes the "heartbeat" frequency baseline of the document layout.
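As an illustrative sketch, the audit can be computed from the character offsets of the backbone nodes. It is assumed here that the baseline value is the mean spacing between adjacent chapters and that spacing is measured in characters; the function name `pulse_audit` is hypothetical.

```python
from statistics import mean, pstdev


def pulse_audit(node_offsets):
    """Return (pulse, cv) for backbone nodes given in document order.

    pulse: mean spacing between adjacent nodes (assumed baseline value)
    cv:    coefficient of variation of the spacing; lower means more stable
    """
    gaps = [b - a for a, b in zip(node_offsets, node_offsets[1:])]
    pulse = mean(gaps)
    cv = pstdev(gaps) / pulse
    return pulse, cv
```

Evenly spaced chapters give a coefficient of variation of 0, whereas a single oversized gap, such as the one left by a missing chapter heading, raises it noticeably.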
[0046] The text diagnostic module 600 is used to scan the node spacing of the entire text to find structurally abnormal regions. The identification threshold of the structurally abnormal regions is adaptively calculated based on the statistical distribution of the spacing between adjacent chapters. When the spacing between adjacent chapters is significantly abnormal compared with the structural pulse reference value, the text content between the adjacent chapters is determined to be a structurally abnormal region. The spacing between adjacent chapters is determined to be significantly abnormal when Gap_i > k × Pulse, where Gap_i is the spacing between the i-th pair of adjacent chapters, Pulse is the structural pulse reference value, and k is the dynamic deviation coefficient.
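The threshold test Gap_i > k × Pulse can be sketched as follows. How k is derived adaptively is not detailed in the description, so it is simply passed in as a parameter here; the default value of 2.0 and the function name `find_abnormal_regions` are assumptions.

```python
def find_abnormal_regions(node_offsets, pulse, k=2.0):
    """Return (start, end) offset pairs whose spacing exceeds k * pulse.

    node_offsets: character offsets of backbone nodes in document order
    pulse:        structural pulse reference value
    k:            deviation coefficient (assumed to be supplied by the caller)
    """
    regions = []
    for start, end in zip(node_offsets, node_offsets[1:]):
        gap = end - start
        if gap > k * pulse:  # Gap_i > k × Pulse
            regions.append((start, end))
    return regions
```

With nodes at offsets 0, 1000, 2000, 6000, and 7000 and a pulse of 1000, only the span from 2000 to 6000 is flagged, which is where a missing heading would be searched for.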
[0047] The text structure repair module 700 is used to re-retrieve, from the shared slice pool, the line slices corresponding to the structurally abnormal region and to find candidate line slices that could serve as directory entries. When the features of a candidate line slice satisfy at least one of the structural conditions of sequence number inheritance, structural feature vector homology, or spatial site resonance, node self-healing completion is performed, and the candidate line slice is used as a directory node in the structurally abnormal region.
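One possible reading of the three structural conditions is sketched below. The concrete tests are assumptions made for illustration: sequence number inheritance is taken to mean "previous number plus one", homology to mean an identical feature vector, and spatial site resonance to mean that the candidate lies about one pulse after the previous node, within a hypothetical tolerance.

```python
def is_repair_candidate(candidate, prev_node, family_vector, pulse, tolerance=0.25):
    """Decide whether a candidate line slice may fill a structural gap.

    candidate, prev_node: dicts with keys "number" (int or None),
                          "vector" (str), and "offset" (int)
    family_vector:        feature vector of the backbone structural family
    pulse:                structural pulse reference value
    """
    # Sequence number inheritance: the candidate continues the numbering.
    inherits = (candidate["number"] is not None
                and prev_node["number"] is not None
                and candidate["number"] == prev_node["number"] + 1)
    # Structural feature vector homology: same pattern as the backbone family.
    homologous = candidate["vector"] == family_vector
    # Spatial site resonance: the candidate sits near the expected position.
    distance = candidate["offset"] - prev_node["offset"]
    resonant = distance > 0 and abs(distance - pulse) <= tolerance * pulse
    # The description requires at least one of the conditions to hold.
    return inherits or homologous or resonant
```

Accepting a candidate on any single condition follows the "at least one of" wording; a stricter variant could require two conditions to reduce false positives.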
[0048] The structure index output module 800 is used, after the above cleaning and repair, to assign global logical numbers to the line slices corresponding to the structural feature vectors of all the structural families in the structural fingerprint database, and to output a directory index tree with a complete inheritance chain.
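The numbering and nesting performed by the output stage can be sketched as follows, assuming a two-level hierarchy in which level 1 denotes a volume and level 2 a chapter. The tuple layout, the dictionary keys, and the function name `build_index_tree` are hypothetical.

```python
def build_index_tree(nodes):
    """Assign global logical numbers and nest chapters under volumes.

    nodes: iterable of (offset, level, title), where level 1 is a volume
           and level 2 is a chapter (assumed two-level hierarchy).
    Returns a list of top-level entries, each with a "children" list.
    """
    tree = []
    current_volume = None
    # Sorting by offset restores document order before numbering.
    for number, (offset, level, title) in enumerate(sorted(nodes), start=1):
        entry = {"number": number, "title": title,
                 "offset": offset, "children": []}
        if level == 1:
            tree.append(entry)
            current_volume = entry
        elif current_volume is None:
            tree.append(entry)  # chapter that appears before any volume
        else:
            current_volume["children"].append(entry)
    return tree
```

The nested "children" lists correspond to the folding behavior described for the directory tree, and the numbers run continuously across volumes.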
[0049] This text structure recognition and recovery system based on structural features and spatial distribution analysis has the following advantages:
1. Closed-loop text structure repair capability: Through structural impulse detection and repair mechanisms, it solves the problem of text structure discontinuity caused by typesetting variations or missing identifiers.
2. Non-semantic universality: It does not rely on specific language corpora, but is based on physical feature modeling, achieving universal support for global languages and various non-standard and ancient documents.
3. Optimal energy-efficient architecture: The shared slice pool mechanism avoids repeated scanning, meeting the requirements of millisecond-level parsing and low-power operation of edge devices during text processing.
[0050] In one embodiment, a computer-readable storage medium is provided, which stores at least one instruction that is loaded and executed by a processor to implement the preceding text structure recognition and recovery method based on structural features and spatial distribution analysis. The computer-readable storage medium may be an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system or propagation medium, specifically including semiconductor or solid-state memory, magnetic tape, removable computer floppy disk, random access memory (RAM), read-only memory (ROM), hard disk, and optical disk.
[0051] It should be noted that, unless otherwise defined, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains, and terms such as those defined in a common dictionary should be interpreted as having a meaning consistent with their meaning in the context of the relevant art. It should also be understood that the above is a description of the disclosure and should not be considered as a limitation thereof. Although several exemplary embodiments of the disclosure have been described, those skilled in the art will readily understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention. Therefore, all such modifications are intended to be included within the scope of the disclosure as defined in the claims, and will not be detailed here.
Claims
1. A text structure recognition and recovery method based on structural features and spatial distribution analysis, characterized in that the method includes:
Text data processing: establish a shared slice pool, perform a full linear scan of the text, divide the text into line slices according to line breaks, record the generated line slices in the shared slice pool, and synchronously record the physical environment characteristics of each line slice in the shared slice pool;
Structural feature vector extraction: extract the character skeleton information of the line slice, and perform quantization processing on the line slice according to preset attribute mapping rules, so as to generate a structural feature vector that represents and describes the physical contour of the text of the line slice;
Pattern clustering of structural feature vectors: statistically analyze the structural feature vectors of the entire text extracted from the line slices, and cluster structural feature vectors with the same pattern into a structural family based on the repetition frequency of each structural feature vector;
Decision on the confidence of structural families: jointly arbitrate the confidence that the structural feature vectors in each clustered structural family are directory nodes, so as to determine whether the structural family can be used as the logical backbone nodes of the text;
Text structure impulse stability audit: using statistical indicators, obtain the baseline value of the structure impulse based on the statistical distribution model of the physical distance between nodes;
Text diagnostics: scan the node spacing of the entire text to find structurally abnormal regions, wherein the identification threshold of the structurally abnormal regions is adaptively calculated based on the statistical distribution of the spacing between adjacent chapters, and when the spacing between adjacent chapters is significantly abnormal compared with the structural pulse reference value, the text content between the adjacent chapters is determined to be a structurally abnormal region;
Text structure repair: re-search the line slices corresponding to the structurally abnormal region in the shared slice pool, find candidate line slices that can be used as candidate directories, determine whether the candidate line slices can be used as directory nodes in the structurally abnormal region, and perform node self-healing completion;
Structural index output: after the above cleaning and repair, assign global logical indices to the line slices corresponding to the structural feature vectors in all the structural families, and output a directory index tree with a complete inheritance chain.
2. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 1, characterized in that: In the text data processing steps, the physical environment features include the length of the line slice, the whitespace ratio, and the line indentation.
3. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 2, characterized in that: The text data processing steps also include: filtering out long line slices that exceed a certain number of characters in length, retaining short line slices, and recording the absolute character offset of each line slice in the entire text.
4. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 1, characterized in that: In the step of extracting the structural feature vector, the method of mapping the line slice to generate the structural feature vector includes classifying characters into number class, letter class, unit class, punctuation class and ordinary character class for mapping, or performing word-level longest common prefix analysis on the text line slice to identify and extract static structural identifiers.
5. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 3, characterized in that: In the step of determining the confidence level of the structure family, the weight of the confidence level is calculated by combining the structural length consistency, node density stability, and sequence continuity of the structural feature vectors in the structure family.
6. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 1, characterized in that: in the text diagnosis step, the spacing between adjacent chapters is determined to be significantly abnormal compared to the structural pulse reference value when Gap_i > k × Pulse, where Gap_i is the spacing between the i-th pair of adjacent chapters, Pulse is the structural pulse reference value, and k is the dynamic deviation coefficient.
7. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 1, characterized in that: In the text structure repair step, determining whether the candidate line slice can be used as a directory node in the structurally abnormal region includes: when the features of the candidate line slice satisfy at least one of the structural conditions of sequence number inheritance, homology of structural feature vectors, or resonance of spatial sites, the candidate line slice is used as a directory node in the structurally abnormal region.
8. The text structure recognition and recovery method based on structural features and spatial distribution analysis according to claim 1, characterized in that: in the text structure impulse stability audit step, the statistical indicator is the coefficient of variation (CV) of the word count of the text structure pulse; and in the step of pattern clustering of the structural feature vectors, a structural fingerprint database is constructed, and the information of the structural families is stored in the structural fingerprint database.
9. A text structure recognition and recovery system based on structural features and spatial distribution analysis, characterized in that the system includes:
Text data processing module: used to establish a shared slice pool, perform a full linear scan of the text, divide the text into line slices according to newline characters, record the generated line slices in the shared slice pool, and synchronously record the physical environment characteristics of each line slice in the shared slice pool, wherein the physical environment characteristics include the length of the line slice, the whitespace ratio, and the line indentation; and to filter out long line slices that exceed a certain number of characters in length, retain short line slices, and record the absolute character offset of each line slice in the entire text;
Structural feature vector extraction module: used to extract the character skeleton information of the line slice and map it to generate a quantized structural feature vector, wherein the structural feature vector describes the physical contour of the text of the line slice, and the method of mapping the line slice to generate the structural feature vector includes classifying characters into number class, letter class, unit class, punctuation class and ordinary character class for mapping;
Structural feature vector pattern clustering module: used to count the structural feature vectors of the entire text extracted from the line slices, and to construct a structural fingerprint database based on the repetition frequency of each structural feature vector, wherein, in the structural fingerprint database, structural feature vectors with the same pattern are clustered into a structural family;
Structural family confidence decision module: used to collaboratively arbitrate the confidence that the structural feature vectors in each clustered structural family are directory nodes, wherein the confidence weight is calculated from the structural length consistency, node density stability and sequence continuity of the structural feature vectors in the structural family, and is used to determine whether the structural family can be used as the logical backbone nodes of the text;
Text structure impulse stability audit module: used to evaluate the distribution stability of the entire text backbone using statistical indicators, calculate the structural impulse baseline value of the entire text, and calculate the average number of words between two adjacent chapters, wherein the statistical indicator is the coefficient of variation;
Text diagnostic module: used to scan the node spacing of the entire text to find structurally abnormal regions, wherein the identification threshold of the structurally abnormal regions is adaptively calculated based on the statistical distribution of the spacing between adjacent chapters; when the spacing between adjacent chapters is significantly abnormal compared with the structural pulse reference value, the text content between the adjacent chapters is determined to be a structurally abnormal region; and the spacing between adjacent chapters is determined to be significantly abnormal when Gap_i > k × Pulse, where Gap_i is the spacing between the i-th pair of adjacent chapters, Pulse is the structural pulse reference value, and k is the dynamic deviation coefficient;
Text structure repair module: used to re-retrieve the line slices corresponding to the structurally abnormal region in the shared slice pool, find candidate line slices that can be used as candidate directories, and, when the features of the candidate line slices satisfy at least one of the structural conditions of sequence number inheritance, structural feature vector homology or spatial site resonance, perform node self-healing completion and use the candidate line slices as directory nodes in the structurally abnormal region;
Structure index output module: used, after the above cleaning and repair, to assign global logical numbers to the line slices corresponding to the structural feature vectors of all the structural families in the structural fingerprint database, and to output a directory index tree with a complete inheritance chain.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, which is loaded and executed by a processor to implement the text structure recognition and recovery method based on structural features and spatial distribution analysis as described in any one of claims 1-8.