Methods and systems for splitting and reconstructing complex merged cells across pages
By parsing the table structure, constructing a syntactic dependency tree, identifying semantically independent fragments, and optimizing the cross-page processing strategy, the method addresses content fragmentation and poor layout when complex tables span pages, yielding high-quality document reconstruction.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: 冠骋信息技术(苏州)有限公司
- Filing Date: 2026-04-23
- Publication Date: 2026-06-30
AI Technical Summary
When complex tables span pages, conventional methods in the existing technology can cause content fragmentation, semantic corruption, or low page space utilization, which degrades the reading experience.
The method parses the table structure, identifies the row-span and column-span attributes of merged cells, constructs a syntactic dependency tree, and divides the text into semantically independent segments. It then optimizes allocation based on semantic integrity boundaries and space requirements, dynamically detects cross-page conditions, and applies either an inline splitting strategy or a row migration strategy to keep the content coherent and the layout visually balanced.
It enables intelligent splitting and reconstruction of complex tables in cross-page scenarios, maintaining semantic coherence and page aesthetics, and improving document readability and efficiency.
Figure CN122090477B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of document processing technology, and in particular to a method and system for splitting and reconstructing complex merged cells across pages.
Background Technology
[0002] In the field of document processing technology, handling heterogeneous documents containing complex tables is a common requirement. These documents often contain merged cells spanning multiple rows or columns to hold structured data or descriptive text. When such documents must be rendered or printed in a fixed-page environment (e.g., A4 paper), maintaining the integrity of the table structure and the readability of the content, especially where a table spans several pages, is a key technical challenge.
[0003] Current conventional practices typically focus on simply splitting tables geometrically. One common approach is to divide the table equally based on the physical cell grid covered by merged cells, distributing the text content within the merged cells evenly according to the number of split cells. Another approach is to forcibly move the entire row triggering the page break to the next page when a page break is detected, leaving whitespace at the bottom of the table on the current page. These methods rely on basic layout information such as the table's row and column coordinates and page height thresholds.
[0004] However, these conventional practices have significant drawbacks. Equal splitting completely ignores the semantic structure of the text itself. Abruptly dividing a complete sentence or semantic unit across different cells or even different pages severely disrupts the coherence and comprehensibility of the content, making it difficult for readers to obtain accurate information. Moving entire rows to the next page avoids the semantic breaks caused by inline splitting, but it easily creates large blank areas at the bottom of the current page, disrupting the visual balance of the layout and reducing the overall quality of the document. Especially for densely packed tables with large row heights, this simple migration leads to low page space utilization and negatively impacts the reading experience.
Summary of the Invention
[0005] This invention provides a method and system for splitting and reconstructing complex merged cells across pages, which can solve the problems in the prior art.
[0006] A first aspect of this invention provides a method for splitting and reconstructing complex merged cells across pages, comprising:
[0007] Obtain a heterogeneous document and parse its table structure, identify the row-span and column-span attributes of each merged cell and its coordinate range in the table, extract the internal text content of the identified merged cells and construct a grammatical dependency tree of the text, and identify sentence boundaries and semantic integrity boundaries through the grammatical dependency tree;
[0008] Based on the semantic integrity boundaries, divide the text within the merged cell into multiple semantically independent fragments, and calculate the pixel width and line count requirements of each fragment in the target rendering environment. According to the number of original cells covered by the merged cell and the space requirements of each fragment, establish an allocation matrix between the fragments and the split cells so that the fragments are evenly distributed among the split cells;
[0009] Detect the vertical position of the table in the target document page layout, and compare the table's bottom coordinate with the page height threshold to determine whether the cross-page condition is triggered. When it is triggered, locate the index of the table row crossed by the page dividing line, extract the height parameters of that row and its adjacent rows, and calculate the row height elasticity coefficient and content density of each cell in the row to determine whether the row supports inline splitting;
[0010] If inline splitting is supported, insert a soft pagination mark inside the row, and generate a copy of the table header and a row continuation mark at the beginning of the next page. If inline splitting is not supported, move the entire row to the next page and fill a blank area at the end of the current page to maintain visual balance. Output the target document after content splitting and cross-page adaptation.
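The allocation matrix in the second step can be illustrated with a minimal Python sketch. It assumes that each fragment's space requirement is reduced to its line count and that fragments must stay in reading order. The proportional-midpoint rule and the names `allocation_matrix` and `cell_loads` are illustrative assumptions, not the allocation procedure of the invention itself.

```python
def allocation_matrix(line_counts, cell_count):
    """Return an n x k 0/1 matrix; entry [i][j] is 1 when fragment i goes to split cell j.

    Assumption: fragments keep their reading order, and each fragment is placed
    in the cell whose share of the total line count contains the fragment's midpoint.
    """
    if cell_count < 1 or not line_counts or min(line_counts) < 1:
        raise ValueError("need at least one cell and positive line counts")
    total = sum(line_counts)
    matrix, consumed = [], 0
    for lines in line_counts:
        midpoint = consumed + lines / 2
        cell = min(cell_count - 1, int(midpoint * cell_count / total))
        matrix.append([1 if j == cell else 0 for j in range(cell_count)])
        consumed += lines
    return matrix


def cell_loads(matrix, line_counts):
    """Total number of text lines assigned to each split cell."""
    return [sum(row[j] * n for row, n in zip(matrix, line_counts))
            for j in range(len(matrix[0]))]
```

Because the cell index never decreases as the cumulative line count grows, fragments are never reordered. A cell may stay empty when one fragment is much longer than the others.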
[0011] Extracting the internal text content of the identified merged cells, constructing the grammatical dependency tree, and identifying sentence boundaries and semantic integrity boundaries through that tree includes:
[0012] Lexical analysis is performed on the text content within the merged cells to segment the text into a sequence of tokens and label each token with a part-of-speech tag;
[0013] Syntactic analysis is performed on the token sequence to identify the dominance and subordination relationships between tokens and to construct a grammatical dependency tree that reflects the hierarchical dependency relationships between them;
[0014] Locate the leaf nodes holding sentence-ending punctuation marks in the grammatical dependency tree, and mark their corresponding text positions as sentence boundaries;
[0015] Traverse each subtree of the grammatical dependency tree, identify subtrees with complete subject-verb-object structures or independent semantic units, and mark the text range covered by the subtree as semantic integrity boundaries.
[0016] Extract the dependency arcs representing parallel relationships from the grammatical dependency tree, identify the boundary positions between parallel components, and use these boundary positions as candidate semantic integrity boundaries;
[0017] By combining the marking results of sentence boundaries and semantic integrity boundaries, a set of segmentation points for the text content is generated. Each segmentation point in the set corresponds to the start or end position of a semantic unit in the text content that can stand alone as a segment.
[0018] Traversing each subtree of the grammatical dependency tree, identifying subtrees with complete subject-verb-object structures or independent semantic units, and marking the text range covered by each such subtree as a semantic integrity boundary includes:
[0019] Starting from the root node of the grammatical dependency tree, perform a depth-first traversal and extract each non-leaf node as a candidate subtree root node;
[0020] For each candidate subtree root node, collect all its direct and indirect child nodes to form the node set of the candidate subtree;
[0021] Detect whether the node set of the candidate subtree contains nodes labeled as subject, predicate, and object simultaneously;
[0022] When subject, predicate, and object nodes are all present, extract the start and end positions in the original text of the tokens corresponding to all nodes in the candidate subtree's node set, and mark the text interval between those positions as a semantic integrity boundary with a complete subject-verb-object structure;
[0023] When the subject, predicate, and object nodes are not all present, further detect whether the candidate subtree contains a specific dependency relationship type with an independent semantic function. The specific dependency relationship types include combinations of the adverbial-head relation, the verb-object relation, and the modifier-head relation;
[0024] If the candidate subtree contains the specific dependency relationship type, then the text range covered by the candidate subtree is marked as a semantic integrity boundary with independent semantic units.
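The traversal in these steps can be sketched as follows. This is a minimal illustration that assumes the parser has already produced a tree whose nodes carry character offsets and a single role or relation label. The `Node` class and the label strings (`"subject"`, `"predicate"`, `"object"`, `"adverbial"`, `"verb_object"`, `"modifier"`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    start: int      # character offset where the token begins
    end: int        # exclusive end offset of the token
    label: str      # assumed role/relation label assigned by the parser
    children: list = field(default_factory=list)


SVO = {"subject", "predicate", "object"}
INDEPENDENT = {"adverbial", "verb_object", "modifier"}  # assumed label names


def collect(node):
    """Return the node followed by all of its direct and indirect children."""
    nodes = [node]
    for child in node.children:
        nodes.extend(collect(child))
    return nodes


def semantic_spans(root):
    """Depth-first traversal returning (start, end, kind) for qualifying subtrees."""
    spans = []

    def visit(node):
        if node.children:  # only non-leaf nodes are candidate subtree roots
            members = collect(node)
            span = (min(n.start for n in members), max(n.end for n in members))
            if SVO <= {n.label for n in members}:
                spans.append((*span, "svo"))
            elif INDEPENDENT & {n.label for n in members[1:]}:
                spans.append((*span, "unit"))
        for child in node.children:
            visit(child)

    visit(root)
    return spans
```

For the sentence "The parser splits tables quickly", the subtree rooted at "splits" would be reported as a complete subject-verb-object span, while the smaller subtree "The parser" would only qualify as an independent unit through its modifier relation.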
[0025] For the table row at the page break line, calculating the row height elasticity coefficient and content density of each cell within that row to determine whether the row supports inline splitting includes:
[0026] Extract the rendering height and minimum allowed height of each cell in the table row at the page break line, and calculate the height compression margin for each cell;
[0027] Divide the height compression margin of each cell by the rendering height of that cell to obtain the row height elasticity coefficient of that cell;
[0028] Count the total number of text characters and the equivalent number of non-text elements in each cell within the table row at the page break line, and sum the two to get the total content of that cell;
[0029] Divide the total content of each cell by the rendered area of that cell to obtain the content density of that cell;
[0030] Iterate through all cells in the table row at the page break line and determine if there are any cells with a row height elasticity coefficient lower than the elasticity threshold or a content density higher than the density threshold.
[0031] If there are cells with a row height elasticity coefficient lower than the elasticity threshold or a content density higher than the density threshold, then the table row is determined to not support inline splitting.
[0032] If the row height elasticity coefficient of all cells is not lower than the elasticity threshold and the content density is not higher than the density threshold, then the table row is determined to support inline splitting.
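A minimal sketch of this decision rule is given below. The `Cell` fields and the two threshold defaults are assumptions made for illustration, since the text does not fix concrete threshold values.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    render_height: float   # height actually occupied on the page
    min_height: float      # minimum allowed height
    width: float           # rendered width, used for the rendered area
    char_count: int        # number of text characters
    non_text_equiv: float  # equivalent amount of non-text elements


def elasticity(cell):
    """Height compression margin divided by the rendering height."""
    return (cell.render_height - cell.min_height) / cell.render_height


def density(cell):
    """Total content divided by the rendered area."""
    return (cell.char_count + cell.non_text_equiv) / (cell.width * cell.render_height)


def supports_inline_split(row, elasticity_threshold=0.2, density_threshold=0.05):
    """A row qualifies only if every cell is elastic enough and not too dense.

    Both threshold defaults are hypothetical placeholders.
    """
    return all(
        elasticity(c) >= elasticity_threshold and density(c) <= density_threshold
        for c in row
    )
```

A single rigid or overly dense cell is enough to rule out inline splitting for the whole row, which matches the "any cell" wording of the rule.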
[0033] Extracting the rendering height and minimum allowed height of each cell within the table row at the page break line, and calculating the height compression margin for each cell, includes:
[0034] In the target rendering environment, layout calculation is performed on the table row at the page break line, and the actual height occupied by each cell of the row in the current page coordinate system is obtained as the rendering height;
[0035] Extract the font parameters and line spacing parameters of the text content in each cell, and calculate the standard line height of a single line of text in that cell;
[0036] Count the number of lines of text content that must be displayed in each cell, multiply the required number of lines by the standard line height, and add the top and bottom padding of the cell to get the minimum allowed height of the cell;
[0037] For cells containing non-text elements, extract the inherent height and vertical margin of the non-text elements, and use the sum of the inherent height and vertical margin of the non-text elements as an additional component of the minimum allowed height of the cell.
[0038] Subtract the minimum allowed height of each cell from the rendering height of that cell to obtain the height compression margin of that cell;
[0039] The calculated height compression margin of each cell is stored in a data structure corresponding to the cell coordinates for use in subsequent calculations of row elasticity coefficients.
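The arithmetic in these steps can be sketched as below. The sketch assumes that the standard line height is the font size multiplied by a line-spacing factor, which is one common model and not something the text specifies. The function names and the dictionary layout are likewise illustrative.

```python
def standard_line_height(font_size, line_spacing):
    """Assumed model: line height = font size x line-spacing multiplier."""
    return font_size * line_spacing


def min_allowed_height(line_count, line_height, pad_top, pad_bottom, non_text=()):
    """Required text lines plus padding, plus any non-text elements.

    non_text is a sequence of (inherent_height, vertical_margin) pairs.
    """
    height = line_count * line_height + pad_top + pad_bottom
    height += sum(h + margin for h, margin in non_text)
    return height


def compression_margins(cells):
    """Map cell coordinates to margin = rendering height - minimum allowed height.

    cells: {(row, col): {"render": float, "min": float}}
    """
    return {coord: c["render"] - c["min"] for coord, c in cells.items()}
```

With a 12-unit font and a 1.5 spacing factor, three lines and 4 units of padding on each side give a minimum height of 62, so a cell rendered at 80 has a compression margin of 18.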
[0040] Dividing the text within the merged cell into multiple semantically independent segments based on the semantic integrity boundaries, and calculating the pixel width and line count requirements for each semantically independent segment in the target rendering environment, includes:
[0041] At the split points marked by the semantic integrity boundaries, the text content in the merged cell is divided into multiple sub-text segments, each sub-text segment corresponding to one semantically independent segment;
[0042] Extract the font rendering engine interface of the target rendering environment, pass the text content of each semantically independent segment into the font rendering engine, and obtain the character-by-character rendering width of the segment under the specified font and font size;
[0043] The total pixel width of a semantically independent segment is obtained by summing the rendering widths of all characters within that segment.
[0044] Get the available content width of the split cell, divide the total pixel width of each semantically independent segment by the available content width, and round up to get the line count requirement of that segment in the cell;
[0045] The calculated pixel width and line count requirements are mapped to the corresponding semantically independent segments and stored in the segment attribute set for use in subsequent allocation matrix construction.
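The width and line count calculation can be sketched as follows. No real font rendering engine is called here. The `char_width` function is a crude stand-in that treats characters above U+2E7F as one em wide and all others as half an em, and both it and the other names are assumptions for illustration.

```python
import math


def char_width(ch, font_size):
    """Hypothetical stand-in for a font engine's per-character advance width."""
    return font_size if ord(ch) > 0x2E7F else font_size * 0.5


def fragment_requirements(fragments, available_width, font_size=12.0, measure=char_width):
    """Return pixel width and line count for each semantically independent segment."""
    result = []
    for text in fragments:
        pixel_width = sum(measure(ch, font_size) for ch in text)
        lines = math.ceil(pixel_width / available_width)  # round up to whole lines
        result.append({"text": text, "pixel_width": pixel_width, "lines": lines})
    return result
```

In practice `measure` would be replaced by a call into the target environment's font engine, so that kerning and font-specific glyph widths are respected.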
[0046] A second aspect of this invention provides a system for splitting and reconstructing complex merged cells across pages, comprising:
[0047] The table parsing unit is used to acquire heterogeneous documents and parse their table structure, identify the row and column spanning attributes of merged cells and their coordinate range in the table, extract the internal text content of the identified merged cells and construct the text's grammatical dependency tree, and identify sentence boundaries and semantic integrity boundaries through the grammatical dependency tree.
[0048] The semantic segmentation unit is used to divide the text within the merged cell into multiple semantically independent segments based on the semantic integrity boundary, and to calculate the pixel width and line number requirements of each semantically independent segment in the target rendering environment. Based on the number of original cells covered by the merged cell and the space requirements of each semantically independent segment, an allocation matrix is established between the segments and the split cells so that each segment is evenly distributed in the split cells.
[0049] The cross-page detection unit is used to detect the vertical position of the table in the target document page layout. It determines whether the cross-page condition is triggered by comparing the bottom coordinates of the table with the page height threshold. When the cross-page condition is triggered, it locates the table row index where the cross-page dividing line is located, extracts the height parameters of the row and its adjacent rows, calculates the row height elasticity coefficient and content density of each cell in the row, and determines whether the row supports inline splitting.
[0050] The pagination processing unit is used to insert a soft pagination mark inside the row if inline splitting is supported, and to generate a copy of the table header and a row continuation mark at the beginning of the next page. If inline splitting is not supported, it moves the entire row to the next page and fills a blank area at the end of the current page to maintain the visual balance of the page. It outputs the target document after content splitting and cross-page adaptation processing.
[0051] A third aspect of the present invention provides an electronic device, comprising:
[0052] A processor;
[0053] A memory for storing processor-executable instructions;
[0054] Wherein the processor is configured to invoke the instructions stored in the memory to execute the aforementioned method.
[0055] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.
[0056] This technical solution enables intelligent splitting and reconstruction of complex tables in multi-page scenarios, addressing the content fragmentation and formatting errors that occur in traditional document processing when merged cells span multiple pages. By parsing the table structure and identifying the row-span and column-span attributes of merged cells, it can accurately extract their internal text content and construct a syntactic dependency tree, and then divide the text into multiple independent semantic fragments based on semantic integrity boundaries. This ensures that the content of each split cell remains semantically self-contained and coherent, avoiding the semantic fragmentation or ambiguity caused by forced line breaks.
[0057] Based on the spatial requirements of semantic fragments in the target rendering environment, the system establishes an optimized allocation matrix between fragments and their split cells, ensuring that each fragment is evenly and reasonably distributed into its corresponding cell. This not only maintains the original data organization logic of the table but also guarantees the neatness and aesthetics of the split layout. By calculating pixel width and row count requirements, precise matching between content and layout space is achieved, improving the readability and professionalism of the document.
[0058] The system can dynamically detect the vertical position of a table on the page and determine whether the cross-page condition is triggered by comparing the table's bottom coordinate with a page height threshold. When the condition is triggered, it locates the row crossed by the dividing line and analyzes the row height elasticity coefficient and content density of each cell within that row to assess whether the row supports inline splitting. This mechanism automates the cross-page decision at the level of individual rows, avoiding the misjudgments that can occur when relying only on a fixed page threshold or simple rules.
[0059] Based on the evaluation results, the system adopts a differentiated cross-page processing strategy. For rows that support inline splitting, soft pagination marks are inserted, and a copy of the table header and row continuation marker are generated on the next page, ensuring seamless content transitions and readability after crossing pages. For rows that do not support splitting, they are moved to the next page as a whole, and visual balance padding is applied at the end of the current page, thus maintaining the integrity of the content while preserving the overall aesthetics and formatting stability of the page layout. The final output is an optimized target document, significantly improving the quality and efficiency of complex tables when formatting across pages.
Attached Figure Description
[0060] Figure 1 is a flowchart of the method for splitting and reconstructing complex merged cells across pages;
[0061] Figure 2 is a flowchart for calculating the pixel width and line count requirements of each semantically independent fragment in the target rendering environment.
Detailed Implementation
[0062] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0063] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0064] Figure 1 is a flowchart of the method for splitting and reconstructing complex merged cells across pages according to an embodiment of the present invention. As shown in Figure 1, the method includes:
[0065] Obtain a heterogeneous document and parse its table structure, identify the row-span and column-span attributes of each merged cell and its coordinate range in the table, extract the internal text content of the identified merged cells and construct a grammatical dependency tree of the text, and identify sentence boundaries and semantic integrity boundaries through the grammatical dependency tree;
[0066] Based on the semantic integrity boundaries, divide the text within the merged cell into multiple semantically independent fragments, and calculate the pixel width and line count requirements of each fragment in the target rendering environment. According to the number of original cells covered by the merged cell and the space requirements of each fragment, establish an allocation matrix between the fragments and the split cells so that the fragments are evenly distributed among the split cells;
[0067] Detect the vertical position of the table in the target document page layout, and compare the table's bottom coordinate with the page height threshold to determine whether the cross-page condition is triggered. When it is triggered, locate the index of the table row crossed by the page dividing line, extract the height parameters of that row and its adjacent rows, and calculate the row height elasticity coefficient and content density of each cell in the row to determine whether the row supports inline splitting;
[0068] If inline splitting is supported, insert a soft pagination mark inside the row, and generate a copy of the table header and a row continuation mark at the beginning of the next page. If inline splitting is not supported, move the entire row to the next page and fill a blank area at the end of the current page to maintain visual balance. Output the target document after content splitting and cross-page adaptation.
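The cross-page detection and strategy selection in the last two steps can be sketched as below. The sketch assumes a single page, a known top coordinate for the table, and a list of row heights. The function names and the action strings are hypothetical labels, and the flag `inline_split_ok` stands for the outcome of the elasticity and density check.

```python
def find_break_row(table_top, row_heights, page_height_threshold):
    """Return None if the table fits on the page, else the index of the row
    crossed by the page dividing line."""
    bottom = table_top + sum(row_heights)
    if bottom <= page_height_threshold:
        return None  # cross-page condition not triggered
    y = table_top
    for index, height in enumerate(row_heights):
        if y + height > page_height_threshold:
            return index
        y += height
    return None  # not reached: some row must cross once bottom exceeds the threshold


def plan_page_break(break_row, inline_split_ok):
    """Choose the processing strategy for the row at the dividing line."""
    if break_row is None:
        return []
    if inline_split_ok:
        return ["insert_soft_break", "repeat_header", "add_continuation_mark"]
    return ["move_row_to_next_page", "pad_current_page"]
```

For a table starting at y = 700 with four rows of height 40 and a threshold of 800, the third row (index 2) spans 780 to 820 and is therefore the one the dividing line crosses.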
[0069] Extracting the internal text content of the identified merged cells, constructing the grammatical dependency tree, and identifying sentence boundaries and semantic integrity boundaries through that tree includes:
[0070] Lexical analysis is performed on the text content within the merged cells to segment the text into a sequence of tokens and label each token with a part-of-speech tag;
[0071] Syntactic analysis is performed on the token sequence to identify the dominance and subordination relationships between tokens and to construct a grammatical dependency tree that reflects the hierarchical dependency relationships between them;
[0072] Locate the leaf nodes holding sentence-ending punctuation marks in the grammatical dependency tree, and mark their corresponding text positions as sentence boundaries;
[0073] Traverse each subtree of the grammatical dependency tree, identify subtrees with complete subject-verb-object structures or independent semantic units, and mark the text range covered by the subtree as semantic integrity boundaries.
[0074] Extract the dependency arcs representing parallel relationships from the grammatical dependency tree, identify the boundary positions between parallel components, and use these boundary positions as candidate semantic integrity boundaries;
[0075] By combining the marking results of sentence boundaries and semantic integrity boundaries, a set of segmentation points for the text content is generated. Each segmentation point in the set corresponds to the start or end position of a semantic unit in the text content that can stand alone as a segment.
[0076] After extracting the text content within the merged cells, the text is first subjected to lexical analysis. This process scans the text character stream and segments the continuous character sequence into independent tokens according to a pre-defined lexical rule base. Specifically, the lexical analyzer identifies natural language words, punctuation marks, numbers, and special characters in the text, converting them into standardized token sequences. During word segmentation, a dynamic programming algorithm based on the maximum probability path is used to handle ambiguous segmentation issues, ensuring that the generated token sequences conform to the grammatical conventions of the current language environment. After word segmentation, a part-of-speech tag is attached to each token. Tag types include grammatical categories such as nouns, verbs, adjectives, adverbs, prepositions, conjunctions, auxiliary words, and punctuation marks. Part-of-speech tagging employs a sequence tagging method based on Hidden Markov Models. By calculating the conditional probability distribution of tokens in a specific context, the part-of-speech tag with the highest probability is selected as the final tagging result for that token.
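The maximum-probability-path segmentation mentioned above can be illustrated with a small dynamic program. The sketch assumes a unigram model in which a segmentation's probability is the product of its word probabilities. The toy dictionary, the `unknown` fallback for single characters, and the function name are assumptions, and the part-of-speech tagging step is not shown.

```python
import math


def segment(text, probs, max_len=4, unknown=1e-8):
    """Split text into the token sequence with the highest total log-probability.

    probs maps known words to probabilities; any single character missing
    from probs falls back to the small `unknown` probability.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last token
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            p = probs.get(word, unknown if len(word) == 1 else 0.0)
            if p == 0.0:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


# Toy probabilities: 研究 (research), 研究生 (graduate student), 生命 (life), 起源 (origin)
TOY_PROBS = {"研究": 0.02, "研究生": 0.005, "生命": 0.02, "命": 0.001, "起源": 0.01}
```

On the ambiguous string 研究生命起源, the path 研究 / 生命 / 起源 has probability 4e-6 under these toy values, far above the 5e-8 of 研究生 / 命 / 起源, so the dynamic program resolves the ambiguity in favor of the former.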
[0077] After obtaining the token sequence with part-of-speech tags, the syntactic analysis module is initiated to construct a dependency tree. The syntactic analysis process employs a transition-based dependency parsing algorithm. This algorithm maintains a configuration state, which includes a buffer containing the lexical units still to be processed and a stack of lexical units already processed. The algorithm gradually establishes dependency relationships between lexical units by executing a series of transition actions, including shifting, left arc connection, and right arc connection. A left arc connection establishes a right-to-left dependency arc between the two lexical units at the top of the stack, indicating that the right-side lexical unit dominates the left-side lexical unit; a right arc connection establishes a left-to-right dependency arc, indicating that the left-side lexical unit dominates the right-side lexical unit. The choice of each transition action is made by a classifier, which predicts the optimal transition action based on the feature vector of the current configuration state. The feature vector includes the part-of-speech tag of the lexical unit at the top of the stack, the lexical form of the first lexical unit in the buffer, the types of the dependency arcs already established, and combination features of adjacent lexical units. After multiple rounds of transition operations, a directed tree structure is generated. Each node in the tree corresponds to a lexical unit, and the directed edges between nodes represent dominance relationships. The types of dependency relationships are marked on the edges, such as subject-predicate, verb-object, attributive-head, and adverbial-head relationships. The root node of this tree structure is usually the core predicate of the sentence, and all other lexical units are connected to the root node through dependency paths.
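The three transition actions can be sketched in a few lines of Python. In this illustration the action sequence is supplied directly, standing in for the classifier's predictions, which are outside the scope of the sketch. The function name, the action strings, and the arc labels are assumptions.

```python
def parse(tokens, actions):
    """Apply shift / left_arc / right_arc actions and return (stack, arcs).

    Tokens are referred to by index; arcs are (head, dependent, label) triples.
    """
    stack, buffer, arcs = [], list(range(len(tokens))), []
    for action, label in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "left_arc":
            dependent = stack.pop(-2)            # left unit becomes the dependent
            arcs.append((stack[-1], dependent, label))
        elif action == "right_arc":
            dependent = stack.pop()              # right unit becomes the dependent
            arcs.append((stack[-1], dependent, label))
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, arcs


TOKENS = ["system", "splits", "tables"]
ACTIONS = [
    ("shift", None), ("shift", None), ("left_arc", "subject"),
    ("shift", None), ("right_arc", "object"),
]
```

Running `parse(TOKENS, ACTIONS)` leaves only the index of "splits" on the stack, which makes it the root, with "system" attached as subject and "tables" as object.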
[0078] After constructing the complete dependency tree, sentence boundaries are identified. All leaf nodes in the dependency tree are traversed, and each leaf node's corresponding lexical element is checked to see if it is a sentence-ending punctuation mark, including periods, question marks, exclamation marks, and other terminal punctuation. When a sentence-ending punctuation mark is detected, its character position in the original text is recorded, and this position is marked as a sentence boundary. For ellipses or dashes that may exist in some texts, their dependency relationships are analyzed to determine whether they constitute a true sentence termination point. If the lexical element before the ellipsis forms a complete subject-verb-object structure with its parent node, and there is no immediately following subordinate clause after the ellipsis, then the ellipsis position is marked as a sentence boundary; otherwise, it is considered an intra-sentence pause and not used as a sentence segmentation point. Furthermore, for semicolons in the text, the dependency relationship between the two clauses they connect is used to determine the boundary. If the two clauses each have independent predicate centers and there is no subordinate dependency arc between them, then the semicolon position is marked as a sentence boundary.
[0079] Building upon sentence boundary recognition, semantic integrity boundaries are further extracted. All subtrees in the dependency tree are traversed, and semantic integrity is evaluated for each subtree. Evaluation criteria include whether the subtree contains a core predicate node, whether it contains necessary argument components, and whether the text segment covered by the subtree can independently express a complete meaning. Specifically, the root node of the subtree is first located, and its part of speech is checked to see if it is a verb or adjective. If it is a verb, it is checked whether it has subject and object arguments; if it is an adjective, it is checked whether it has a modified noun. A semantic integrity score is calculated by statistically analyzing the distribution of each dependency relation type in the subtree. When a subtree contains subject-predicate and verb-object relations and has no dangling additional components, it is considered a semantically complete unit, and the start and end positions of the text interval it covers are marked as semantic integrity boundaries. For multiple semantic units in a complex sentence, the types of dependency arcs connecting different subtrees in the dependency tree are detected to identify semantic relations such as parallel, adversative, and causal relationships, and semantic integrity boundaries are marked at the corresponding connection points.
[0080] Identifying parallel relationships is crucial for determining semantic boundaries. In the dependency tree, dependency arcs marked as parallel relationships are searched. These arcs typically connect two lexical units with the same syntactic status, such as two subject nouns, two parallel predicates, or two object components. The positions of the two lexical units connected by the parallel dependency arcs are extracted in the text, and the connecting words, such as "and," "with," "or," and "as well as," are located. The positions of these connecting words are marked as candidate semantic integrity boundaries. Further analysis of the semantic similarity of the parallel components is conducted. If the semantic topics described by the two parallel components differ significantly, the priority of the candidate boundary is strengthened; if the two components differ only in details but are semantically closely related, the priority of the candidate boundary is reduced. For complex cases with multi-level parallel relationships, the parallel relationships at each level of the dependency tree are recursively analyzed, and candidate boundaries are marked from bottom to top to ensure that each independent parallel component can be correctly segmented.
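The basic lookup of connecting words along parallel arcs can be sketched as below. The sketch assumes that parallel arcs carry the label `"conj"` and that tokens are given with their character offsets. The label, the coordinator set, and the function name are illustrative choices, and the similarity-based priority adjustment is not modeled.

```python
COORDINATORS = {"and", "or", "with"}  # assumed single-word connecting words


def parallel_boundaries(tokens, arcs):
    """Return character offsets of connecting words between parallel components.

    tokens: list of (text, start_offset); arcs: (head, dependent, label) triples.
    """
    boundaries = set()
    for head, dependent, label in arcs:
        if label != "conj":
            continue
        low, high = sorted((head, dependent))
        for i in range(low + 1, high):        # tokens strictly between the two components
            text, start = tokens[i]
            if text.lower() in COORDINATORS:
                boundaries.add(start)
    return sorted(boundaries)
```

For "rows and columns", with a parallel arc between "rows" and "columns", the offset of "and" (5) is returned as the candidate boundary.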
[0081] After marking sentence boundaries and semantic integrity boundaries, all marked points are comprehensively evaluated and ranked. Sentence boundary points are assigned the highest priority because they correspond to the natural sentence breaks in the text, resulting in the clearest segmentation. Semantic unit boundaries with complete subject-verb-object structures are assigned the second highest priority because they ensure the semantic independence of the segmented fragments. Parallel relationship boundaries are assigned the third priority level as an auxiliary segmentation criterion. When multiple candidate boundaries exist and spatial constraints require selection, higher-priority boundaries are prioritized as the actual segmentation points. In the generated set of segmentation points, each segmentation point records its character offset in the text, its boundary type, its priority value, and its associated dependency tree node reference. This set provides a precise segmentation basis for subsequent text fragment allocation, ensuring that each text fragment assigned to a different cell is semantically self-contained, avoiding semantic fragmentation or incomplete information.
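The three-level priority scheme can be illustrated with a short sketch. The numeric priority values, the `SplitPoint` fields, and the rule of keeping the `needed` highest-ranked candidates are assumptions, since the text gives the ordering of the levels but no concrete selection procedure.

```python
from dataclasses import dataclass

PRIORITY = {"sentence": 3, "svo": 2, "parallel": 1}  # assumed numeric values


@dataclass(frozen=True)
class SplitPoint:
    offset: int        # character offset in the text
    kind: str          # boundary type: "sentence", "svo", or "parallel"
    node_id: int = -1  # reference to the associated dependency tree node

    @property
    def priority(self):
        return PRIORITY[self.kind]


def choose_split_points(candidates, needed):
    """Keep the `needed` highest-priority candidates, returned in text order."""
    ranked = sorted(candidates, key=lambda p: (-p.priority, p.offset))
    return sorted(ranked[:needed], key=lambda p: p.offset)
```

Ties within one priority level are broken by text position here, which is one possible convention among several.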
[0082] To improve the accuracy of boundary recognition, optimizations are made based on the linguistic characteristics of texts in specific domains. For texts containing technical terms or compound nouns, terminology recognition is performed before constructing the dependency tree, treating multi-word terms as single lexical units to avoid semantic damage caused by incorrect internal segmentation of terms. For texts containing enumeration structures, enumeration markers such as numerical sequences, letter markers, or bullet points are identified, treating each enumeration item as an independent semantic unit, and marking semantic integrity boundaries between enumeration items. For texts containing quotations or parenthetical explanations, the dependency relationship between the content within the parentheses and the main clause is analyzed. If the content within the parentheses is an independent supplementary explanation and has no direct dependency relationship with the main clause, boundaries are marked before and after the parentheses; if the content within the parentheses is a necessary component of the main clause, it is treated as a whole with the main clause, and no boundaries are marked between them. Through these targeted optimization strategies, the generated set of segmentation points better conforms to the semantic structure of the actual text, laying a solid foundation for subsequent cell content allocation and cross-page processing.
[0083] Traverse each subtree of the grammatical dependency tree, identify subtrees with complete subject-verb-object structures or independent semantic units, and mark the text range covered by the subtree as semantic integrity boundaries, including:
[0084] Starting from the root node of the grammatical dependency tree, perform a depth-first traversal and extract each non-leaf node as a candidate subtree root node;
[0085] For each candidate subtree root node, collect all its direct and indirect child nodes to form the node set of the candidate subtree;
[0086] Detect whether the node set of the candidate subtree contains nodes labeled as subject, predicate, and object simultaneously;
[0087] When the simultaneous presence of subject, predicate, and object nodes is detected, the start and end positions in the original text of the lexical units corresponding to all nodes in the candidate subtree node set are extracted, and the text interval between the start and end positions is marked as a semantic integrity boundary with a complete subject-verb-object structure.
[0088] When subject, predicate, and object nodes are not all present, it is further detected whether the candidate subtree contains specific dependency relationship types with independent semantic function. The specific dependency relationship types include combinations of the adverbial-head relation, the verb-object relation, and the attributive-head relation.
[0089] If the candidate subtree contains the specific dependency relationship type, then the text range covered by the candidate subtree is marked as a semantic integrity boundary with independent semantic units.
[0090] In identifying semantic integrity boundaries, deep analysis based on dependency trees is necessary. A dependency tree is a tree-like structure where each node represents a word in the text, and the connections between nodes represent the grammatical dependencies between words. This structure clearly expresses the grammatical hierarchy and semantic organization of a sentence, providing a foundation for subsequent semantic boundary identification.
[0091] When performing a depth-first traversal starting from the root node of the dependency tree, the root node is first pushed onto the traversal stack. During the traversal, a node is popped from the top of the stack for processing, and all its child nodes are pushed onto the stack in right-to-left order, ensuring that the traversal adheres to the depth-first principle. During the traversal, for each visited node, it is determined whether it is a leaf node. A leaf node is a node without any child nodes, typically corresponding to independent words in the text. For non-leaf nodes, they are extracted as candidate subtree root nodes and stored in a candidate root node list. This traversal process continues until the stack is empty, at which point the depth-first traversal of the entire dependency tree is complete, and a set of all candidate subtree root nodes is obtained.
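The stack-based traversal can be sketched as follows. The `Node` class and the sample tree are hypothetical and stand in for whatever dependency parser output is actually used.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    children: list = field(default_factory=list)

def candidate_roots(root):
    """Iterative depth-first traversal; returns non-leaf nodes in visit order."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        if node.children:            # non-leaf: candidate subtree root
            found.append(node)
        # Push children right-to-left so the leftmost child is popped first.
        stack.extend(reversed(node.children))
    return found

tree = Node("supports", [
    Node("system", [Node("The")]),
    Node("splitting", [Node("inline")]),
])
roots = [n.word for n in candidate_roots(tree)]
```

Pushing children in right-to-left order is what makes the leftmost branch the next one explored, so the visit order matches a recursive depth-first traversal.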
[0092] For each candidate subtree root node, all its direct and indirect child nodes need to be collected to form the complete node set of that candidate subtree. Direct child nodes are nodes with a parent-child dependency relationship to the candidate root node, while indirect child nodes are nodes indirectly connected to the candidate root node through multiple layers of dependency relationships. To collect these nodes, a depth-first traversal is performed again starting from the candidate root node, adding all nodes visited during the traversal to the node set. During the collection process, the part-of-speech tagging information and dependency relationship type information of each node also need to be recorded. Part-of-speech tagging information is used to identify the grammatical function of the node in the sentence, such as noun, verb, adjective, etc. Dependency relationship type information describes the grammatical relationship between the node and its parent node, such as subject-verb, verb-object, attributive-head, etc. This information is crucial for subsequent semantic integrity judgment.
[0093] After obtaining the node set of the candidate subtree, the grammatical role of each node in the set is analyzed to check whether nodes labeled as subject, predicate, and object are all present. A subject node is typically a nominal component whose dependency relationship type is labeled subject-predicate, indicating that the node acts as the performer of the action in the sentence. A predicate node is typically a verbal component; it is the core of the sentence and represents the action or state. An object node is also typically a nominal component, and its dependency relationship type is labeled verb-object, indicating that the node acts as the receiver of the action. By traversing the candidate subtree node set, the number of nodes in each role is counted, and their position indices in the set are recorded.
[0094] When a candidate subtree node set is detected to contain nodes of all three types—subject, predicate, and object—it is considered to have a complete subject-verb-object structure and can express a complete semantic unit. At this point, it is necessary to extract the position information of the corresponding words in the original text for all nodes in the candidate subtree node set. Each node has already been associated with the start and end character indices of its corresponding word in the original text during parsing. By traversing the node set, the minimum start position and maximum end position of the corresponding words for all nodes are found. The text interval between these two positions is the complete text range covered by the candidate subtree. The start and end positions of this text interval are recorded, and a record is added to the semantic boundary list, marking the interval as a semantic integrity boundary with a complete subject-verb-object structure. This marking indicates that the text interval contains a semantically complete sentence fragment and can be processed as an independent semantic unit.
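A minimal sketch of this check, assuming each node already carries its character indices and a role label. The `DepNode` class and the labels "subj", "pred", "obj", and "det" are illustrative assumptions, not labels prescribed by the text.

```python
from dataclasses import dataclass, field

@dataclass
class DepNode:
    start: int                      # start character index in the original text
    end: int                        # end character index (exclusive)
    role: str                       # hypothetical labels: "subj", "pred", "obj", "det"
    children: list = field(default_factory=list)

def collect(node):
    """Return the node together with all its direct and indirect children."""
    nodes = [node]
    for child in node.children:
        nodes.extend(collect(child))
    return nodes

def svo_span(candidate):
    """Return (start, end) if the subtree holds subject, predicate, and object."""
    nodes = collect(candidate)
    roles = {n.role for n in nodes}
    if not {"subj", "pred", "obj"} <= roles:
        return None
    return min(n.start for n in nodes), max(n.end for n in nodes)

text = "The engine splits the row"
subject = DepNode(4, 10, "subj", [DepNode(0, 3, "det")])
obj = DepNode(22, 25, "obj", [DepNode(18, 21, "det")])
predicate = DepNode(11, 17, "pred", [subject, obj])
span = svo_span(predicate)
```

Here the subtree rooted at the predicate covers the whole sentence, while the subtree rooted at the subject alone yields no boundary.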
[0095] When it is detected that the candidate subtree node set does not simultaneously contain all three types of nodes (subject, verb, and object), it cannot be directly concluded that the candidate subtree lacks semantic completeness. In practical applications, many text fragments, although lacking a complete subject-verb-object structure, can still express independent semantic functions. For example, attributive phrases can express modification relationships, adverbial phrases can express conditional or manner relationships, and verb-object phrases can express the relationship between an action and its object. Although these structures are not complete sentences, they still have independent semantic value in specific contexts. Therefore, it is necessary to further detect whether the candidate subtree contains specific dependency relation types with independent semantic functions.
[0096] Specific dependency relation types include the adverbial-head relation, the verb-object relation, and the attributive-head relation. An adverbial-head relation refers to the modifying relationship between an adverbial and the head it modifies (typically a verb), usually expressing additional information such as time, place, manner, or condition. A verb-object relation refers to the governing relationship between a verb and its object, expressing the connection between the action and the receiver. An attributive-head relation refers to the modifying relationship between an attributive and its head noun. During the detection process, all dependency relations in the candidate subtree node set are traversed, and the type label for each dependency relation is extracted. It is then determined whether these type labels contain adverbial-head, verb-object, or attributive-head relations. If these specific dependency relation types exist in a candidate subtree, and these relations form a valid combination, then the candidate subtree is considered to possess the characteristics of an independent semantic unit.
[0097] Determining whether a combination is valid requires considering the hierarchical structure and semantic coherence between dependency relations. For example, a candidate subtree containing verb-object and attributive-head relations may express a modified action object, possessing independent semantic function. A candidate subtree containing adverbial-head and verb-object relations may express an action constrained by a condition or manner, also possessing independent semantic value. During the determination process, it is necessary to check whether these dependency relations form a coherent semantic chain within the candidate subtree, that is, whether the nodes corresponding to each dependency relation form a tight semantic whole through dependency connections. If there are breaks in the dependency relations, that is, some nodes are not connected by dependencies to other nodes in the candidate subtree, then the combination is not considered to have independent semantic function.
[0098] If a candidate subtree contains the specific dependency relationship types mentioned above, and these relationships form a valid combination, then the text interval covered by that candidate subtree is extracted. Similar to the method used when processing complete subject-verb-object structures, the candidate subtree node set is traversed, and the minimum start position and maximum end position of the corresponding lexical units in the original text are found. This text interval is then marked as a semantic integrity boundary with independent semantic units. A record is added to the semantic boundary list, indicating that the boundary type is an independent semantic unit, and recording its dependency relationship combination form for differentiation during subsequent processing.
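One possible reading of the combination check is sketched below. It assumes that a combination is valid when the arcs carrying the specific relation types are linked to one another through shared words; the text does not define validity more precisely, so this rule, the function name `independent_unit`, and the arc representation are assumptions.

```python
INDEPENDENT_TYPES = {"adverbial-head", "verb-object", "attributive-head"}

def independent_unit(arcs):
    """arcs: list of (head, dependent, label) inside one candidate subtree.

    Returns the sorted combination of specific relation labels if the arcs
    carrying them form one connected chain, otherwise None.
    """
    special = [a for a in arcs if a[2] in INDEPENDENT_TYPES]
    if not special:
        return None
    # Grow a group of linked words starting from the first special arc.
    linked = {special[0][0], special[0][1]}
    pending = special[1:]
    changed = True
    while changed:
        changed = False
        for arc in list(pending):
            if arc[0] in linked or arc[1] in linked:
                linked.update(arc[:2])
                pending.remove(arc)
                changed = True
    if pending:                      # a special arc is cut off from the others
        return None
    return tuple(sorted({a[2] for a in special}))

phrase = [
    ("compress", "quickly", "adverbial-head"),
    ("compress", "rows", "verb-object"),
    ("rows", "tall", "attributive-head"),
]
combo = independent_unit(phrase)
```

The returned tuple plays the role of the recorded dependency relationship combination, so later stages can tell one kind of independent semantic unit from another.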
[0099] Through the aforementioned traversal and detection process, all semantically complete subtree structures in the dependency tree can be comprehensively identified. The text intervals corresponding to these subtrees constitute a set of semantic integrity boundaries, providing a precise basis for subsequent text segmentation and cell content allocation. In practical processing, the identification results of semantic integrity boundaries directly affect the quality of the decomposition of merged cell content. Accurate boundaries ensure that each fragment remains semantically independent and complete and avoid semantic breaks or logical inconsistencies between fragments, thereby improving the accuracy and readability of document reconstruction.
[0100] For the table row at the page break line, calculate the row height elasticity coefficient and content density of each cell within that row to determine whether the row supports inline splitting, including:
[0101] Extract the rendering height and minimum allowed height of each cell in the table row at the page break line, and calculate the height compression margin for each cell;
[0102] Divide the height compression margin of each cell by the rendering height of that cell to obtain the row height elasticity coefficient of that cell;
[0103] Count the total number of text characters and the equivalent number of non-text elements in each cell within the table row at the page break line, and sum the two to get the total content of that cell;
[0104] Divide the total content of each cell by the rendered area of that cell to obtain the content density of that cell;
[0105] Iterate through all cells in the table row at the page break line and determine if there are any cells with a row height elasticity coefficient lower than the elasticity threshold or a content density higher than the density threshold.
[0106] If there are cells with a row height elasticity coefficient lower than the elasticity threshold or a content density higher than the density threshold, then the table row is determined to not support inline splitting.
[0107] If the row height elasticity coefficient of all cells is not lower than the elasticity threshold and the content density is not higher than the density threshold, then the table row is determined to support inline splitting.
[0108] When handling cross-page layout in complex documents, if the vertical span of a table exceeds the capacity of a single page, it is necessary to accurately determine whether the row containing the page break line can be split inline. This determination relies on a quantitative analysis of cell spatial characteristics and content distribution features.
[0109] Once a table triggers a page break condition, the precise row index of the page break line within the table is located first. This line is typically located at the bottom of the available content area of the current page, and its vertical coordinate can be calculated by subtracting parameters such as page margins and footer space from the page height. After obtaining the complete structural information of the row, all cells within that row are traversed, including regular cells and cells that have undergone the aforementioned decomposition process.
[0110] For each cell within a row, two key height parameters need to be extracted. The first parameter is the actual rendered height of the cell in the current rendering environment. This height is determined by factors such as the number of text lines, font size, line spacing, and padding within the cell, and can be obtained directly through the layout engine's measurement interface. The second parameter is the minimum allowed height of the cell. This parameter defines the minimum vertical size to which the cell can be compressed without compromising readability. The minimum allowed height is typically set to at least accommodate one line of text plus necessary top and bottom padding. For cells containing images or other non-text elements, the minimum allowed height must also consider the minimum display size requirements of these elements.
[0111] After obtaining the two height parameters mentioned above, calculate the height compression margin for each cell. The height compression margin is defined as the difference between the rendered height and the minimum allowed height; physically, it represents the number of pixels that can be compressed in the vertical dimension of the cell. The larger this value, the greater the flexibility in adjusting the cell's height.
[0112] Based on the height compression margin, the row height elasticity coefficient for each cell is further calculated. This coefficient is obtained by dividing the height compression margin by the rendering height, representing the proportion of the cell's compressible space to its total height. For example, if a cell's rendering height is 120 pixels and the minimum allowed height is 80 pixels, then its height compression margin is 40 pixels, and its row height elasticity coefficient is approximately 0.333. The value of this coefficient ranges from 0 to 1; the closer the coefficient is to 1, the greater the cell's height flexibility, and the easier it is to adapt to the space compression requirements when splitting across pages.
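The worked example above can be reproduced with a short sketch. The function names are illustrative assumptions; the arithmetic follows the definitions given in the text.

```python
def compression_margin(rendered_height, min_height):
    """Pixels by which the cell can shrink vertically."""
    return rendered_height - min_height

def elasticity_coefficient(rendered_height, min_height):
    """Share of the rendered height that is compressible."""
    return compression_margin(rendered_height, min_height) / rendered_height

margin = compression_margin(120, 80)
coefficient = elasticity_coefficient(120, 80)
```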
[0113] In addition to spatial flexibility analysis, it is also necessary to quantitatively evaluate the content density of cells. Content density reflects the compactness of information within a cell, directly affecting the visual effect of inline segmentation and the efficiency of information delivery.
[0114] When calculating content density, the total number of text characters within a cell is first counted. This count includes all visible characters, covering letters, numbers, punctuation marks, and Chinese characters. Full-width Chinese characters and half-width English characters are typically counted in the same way, each as one character.
[0115] For non-text elements within a cell, they need to be converted to an equivalent character count to be included in the total content calculation. Non-text elements include images, icons, embedded objects, etc. The calculation of the equivalent character count is based on the element's rendering size and can be done using the following conversion rule: divide the element's pixel area by the average area occupied by a single standard character. For example, if the average area occupied by a standard character under the current font settings is 144 square pixels, and the rendering size of an image is 300 pixels by 200 pixels, then the equivalent character count of the image is approximately 417 characters.
[0116] The total number of text characters is added to the equivalent number of non-text characters to obtain the total content of the cell. This indicator comprehensively reflects the amount of information carried by the cell.
[0117] At the same time, the rendered area of the cell is calculated, which is the product of the cell width and the rendered height. The rendered area represents the size of the two-dimensional space occupied by the cell on the page.
[0118] Dividing the total content by the rendered area gives the content density of the cell, which can be expressed in characters per square pixel. Higher content density means a denser distribution of information within the cell, with more content contained in a unit of space. When a high-density cell is split inline, its already compact content leaves no room for further compression: text becomes crowded, line spacing becomes too small, and readability suffers severely.
[0119] After calculating the row height elasticity coefficient and content density of all cells in the row, these quantitative indicators need to be compared with preset thresholds to determine whether the row supports inline splitting.
[0120] The system maintains two key threshold parameters: the elasticity threshold and the density threshold. The elasticity threshold defines the minimum acceptable value for the cell's row height elasticity coefficient, typically ranging from 0.2 to 0.3, indicating that the cell should have at least 20% to 30% height compression space. The density threshold defines the maximum acceptable value for the cell's content density. Its specific value needs to be determined based on the target document's font size and expected reading experience, and is usually set as a critical density value that ensures the text line spacing remains within a comfortable reading range after inline splitting.
[0121] Iterate through all cells within the table row at the page break line, checking the row height elasticity and content density of each cell. If any cell's row height elasticity is below the elasticity threshold, it indicates insufficient height compression margin. Splitting the cell within the row at the page break line will cause excessive compression of the cell's content, failing to meet minimum display requirements. Similarly, if any cell's content density exceeds the density threshold, it means the information within the cell is already too dense. Continuing to split the cell at that row position will make the cell content crowded, reducing the document's readability and professionalism.
[0122] If any cell does not meet the above conditions, the table row is determined to not support in-row splitting. In this case, the entire row is moved to the next page to ensure that it is presented as a complete unit, avoiding content display defects caused by forced splitting.
[0123] If the traversal results show that the row height elasticity coefficient of all cells in the row is not lower than the elasticity threshold, and the content density is not higher than the density threshold, then the table row is determined to support in-row splitting. Under this premise, the system can insert a soft pagination mark inside the row, intelligently splitting the content of the row at the page break line. The upper part is retained on the current page, and the lower part continues to the next page. A copy of the table header and a row continuation mark are generated at the beginning of the next page to ensure the continuity of the table structure and the complete transmission of information after crossing pages.
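The decision rule can be sketched as a single pass over the cells of the row. The default elasticity threshold of 0.25 lies within the 0.2 to 0.3 range mentioned above; the density threshold of 0.02 characters per square pixel is a purely hypothetical value, since the text leaves it to be tuned per document.

```python
def row_strategy(cells, elasticity_threshold=0.25, density_threshold=0.02):
    """cells: iterable of (elasticity, density) pairs for one table row."""
    for elasticity, density in cells:
        if elasticity < elasticity_threshold or density > density_threshold:
            return "migrate"         # move the whole row to the next page
    return "split"                   # insert a soft page break inside the row

flexible_row = [(0.333, 0.010), (0.400, 0.015)]
rigid_row = [(0.333, 0.010), (0.100, 0.010)]
```

A single cell that fails either test is enough to send the whole row to the next page, which matches the conservative behaviour described in the text.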
[0124] Through the dual evaluation mechanism based on row elasticity coefficient and content density, the suitability of table row segmentation can be accurately identified, enabling flexible cross-page processing while ensuring document visual quality, and effectively avoiding content truncation and decreased reading experience caused by mechanical segmentation.
[0125] Extract the rendering height and minimum allowed height of each cell within the table row at the page break line, and calculate the height compression margin for each cell, including:
[0126] In the target rendering environment, perform the layout calculation on the table row at the page break line, and obtain the actual height occupied by each cell in the row in the current page coordinate system as the rendering height;
[0127] Extract the font parameters and line spacing parameters of the text content in each cell, and calculate the standard line height of a single line of text in that cell;
[0128] Count the number of lines of text content that must be displayed in each cell, multiply the required number of lines by the standard line height, and add the top and bottom padding of the cell to get the minimum allowed height of the cell;
[0129] For cells containing non-text elements, extract the inherent height and vertical margin of the non-text elements, and use the sum of the inherent height and vertical margin of the non-text elements as an additional component of the minimum allowed height of the cell.
[0130] Subtract the minimum allowed height of each cell from the rendered height of that cell to obtain the height compression margin of that cell;
[0131] The calculated height compression margin of each cell is stored in a data structure corresponding to the cell coordinates for use in subsequent calculations of row elasticity coefficients.
[0132] When performing cross-page processing, a fine-grained height analysis of the table rows located at the cross-page divider is first required. In the target rendering environment, the table rows to be analyzed are positioned according to their position in the document coordinate system, and the rendering engine calculates the actual space occupied by each cell within that row in the page layout. This actual occupied height is the rendering height, which reflects the natural display state of the cells without any compression strategies. Obtaining the rendering height requires considering the drawing width of the cell borders, padding settings, and the vertical alignment of the internal content to ensure that the measured value accurately reflects the true vertical occupancy of the cells on the page.
[0133] For each cell in the row, extract the font attribute parameters of its internal text content, including basic parameters such as font name, font size, and font weight, and also obtain the line spacing configuration parameters for that cell. Line spacing parameters may be stored as fixed pixel values, multiples of the font size, or percentages; these need to be uniformly converted to absolute pixel values in the target rendering environment. Based on the extracted font size and the converted absolute line spacing values, calculate the standard line height of a single line of text within that cell. The calculation of the standard line height needs to take into account the font's ascent and descent relative to the baseline to ensure complete display space for the glyphs. For cells containing special typographic styles, such as superscripts, subscripts, or inline icons, an additional compensation coefficient needs to be introduced into the standard line height calculation to prevent content from being truncated.
[0134] After obtaining the standard line height, the number of lines required to display the text content within the cell is calculated. By analyzing the positions of forced line breaks and of line breaks caused by automatic wrapping, the minimum number of lines required to fully display the text content under the current cell width constraint is determined. For text containing indivisible long words or consecutive non-whitespace character sequences, it is necessary to check for overflow risk; if overflow exists, that line is included in the required line count. The required line count is multiplied by the standard line height to obtain the vertical space requirement of the text content. The top and bottom padding values of the cell are then extracted. These padding values are usually defined by the table style or inherited from the global document style. The sum of the top and bottom padding is added to the vertical space requirement of the text content to obtain the minimum allowed height of the cell. The minimum allowed height represents the lower limit to which the cell can be compressed while ensuring the content is complete, readable, and conforms to basic typesetting standards.
[0135] When a cell contains non-text elements, such as embedded images, charts, formulas, or custom graphics, a specialized height analysis is required. Extract the inherent height attribute of the non-text element, which reflects its original vertical size without any scaling transformations. For vector graphics elements, their inherent height is determined by their defined viewport boundaries; for bitmap elements, the inherent height is determined by the number of rows in the pixel matrix and the set display resolution. Simultaneously, obtain the vertical margin settings of the non-text element within the cell, including the distance between the element and the top and bottom edges of the cell. Add the inherent height of the non-text element to its top and bottom vertical margins to obtain the total vertical space occupied by the element. Use this value as an additional component of the minimum allowable height of the cell containing the non-text element. When a cell contains both text content and non-text elements, compare the vertical space requirement of the text content with the sum of the space occupied by the non-text elements, and take the larger value as the baseline for the minimum allowable height of the cell. If a side-by-side or overlay layout is used, the calculation logic for the minimum allowable height needs to be adjusted according to the specific layout rules.
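A minimal sketch of the minimum allowed height for a stacked layout. It assumes the cell padding is added on top of the larger of the two requirements, which the text does not state explicitly; the function name and the tuple layout for non-text elements are also assumptions.

```python
def min_allowed_height(lines, line_height, pad_top, pad_bottom, elements=()):
    """elements: iterable of (inherent_height, margin_top, margin_bottom)."""
    text_need = lines * line_height
    element_need = sum(h + top + bottom for h, top, bottom in elements)
    # Assumption: the larger requirement is the baseline, then padding is added.
    return max(text_need, element_need) + pad_top + pad_bottom

text_only = min_allowed_height(3, 20, 4, 4)
with_image = min_allowed_height(3, 20, 4, 4, elements=[(100, 5, 5)])
```

Three 20 pixel lines with 4 pixels of padding on each side need 68 pixels; adding a 100 pixel image with 5 pixel margins raises the minimum to 118 pixels because the image requirement dominates.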
[0136] After completing the above analysis, calculate the height compression margin for each cell in the table row as the cell's rendered height minus its minimum allowed height. A positive height compression margin indicates that the cell has compressible redundancy in the current rendering state and can be moderately compressed during cross-page adjustments to save vertical space. A zero or negative height compression margin indicates that the cell is already at its minimum allowed height or has been over-compressed and cannot be compressed further; it should be prioritized for protection in subsequent row height adjustments. For merged cells, the calculation of the height compression margin needs to consider the cumulative effect of the multiple original rows they span. By summing the theoretical compression space of each spanned row and subtracting the minimum allowed height of the merged cell itself, the comprehensive compression margin of the merged cell is obtained.
[0137] The calculated height compression margin for each cell is stored in a specially designed data structure based on the cell's row and column coordinates in the table. This data structure can be a two-dimensional array, using row and column indices as access keys. Each array element stores a triplet of the corresponding cell's compression margin value, rendered height, and minimum allowed height, facilitating subsequent lookup and comparison operations. For cells with row or column merging, the data structure associates the coordinates of all original cells it covers with a merge identifier, ensuring that accessing any covered location correctly maps to the compression margin information of the merged cell. To improve data access efficiency, a fast lookup table for row indexes can be built into the data structure, using hash mapping or a balanced tree structure to accelerate location operations.
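The storage scheme can be sketched with two hash maps instead of a literal two-dimensional array. This is a simplification chosen for brevity; the class name `MarginTable` and its methods are hypothetical.

```python
class MarginTable:
    """Stores (margin, rendered, minimum) per cell, keyed by row and column."""

    def __init__(self):
        self._records = {}   # anchor (row, col) -> (margin, rendered, minimum)
        self._anchor = {}    # covered (row, col) -> anchor (row, col)

    def put(self, row, col, rendered, minimum, row_span=1, col_span=1):
        self._records[(row, col)] = (rendered - minimum, rendered, minimum)
        # Map every covered position back to the merged cell's anchor.
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                self._anchor[(r, c)] = (row, col)

    def get(self, row, col):
        return self._records[self._anchor[(row, col)]]

table = MarginTable()
table.put(0, 0, rendered=120, minimum=80, row_span=2)   # merged over two rows
table.put(0, 1, rendered=60, minimum=60)
```

Looking up position (1, 0), which the merged cell covers, returns the same triplet as looking up its anchor at (0, 0).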
[0138] The stored compression margin data will be used by the subsequent row height elasticity coefficient calculation module. The calculation of the row height elasticity coefficient requires comprehensive consideration of the compression margin distribution of all cells within the row. By analyzing the variance and extreme value difference of the compression margin of each cell, the plasticity and risk of the row during overall height adjustment are assessed. The compression margin data will also be used in the cross-page decision logic. When the system determines whether a row supports in-line splitting, it retrieves the compression margin of each cell in that row. If the compression margin of all cells is positive and evenly distributed, the row has good in-line splitting conditions. If multiple cells have zero or negative compression margins, splitting the row may result in incomplete content display, and the system will tend to adopt a whole-row migration strategy to ensure readability and layout quality. By accurately calculating and structurally storing the height compression margin, a quantitative reference is provided for subsequent cross-page splitting decisions, enabling cross-page processing to achieve an optimal balance between ensuring content integrity and visual coherence.
[0139] Figure 2 is a flowchart of the calculation of pixel width and line count requirements for each semantically independent fragment in the target rendering environment. Based on the semantic integrity boundaries, the text within a merged cell is divided into multiple semantically independent fragments, and the pixel width and line count requirements for each semantically independent fragment in the target rendering environment are calculated, including:
[0140] Based on the split-point positions marked by the semantic integrity boundaries, divide the text content in the merged cell into multiple sub-text segments, each sub-text segment corresponding to one semantically independent fragment;
[0141] Obtain the font rendering engine interface of the target rendering environment, pass the text content of each semantically independent fragment to the font rendering engine, and obtain the character-by-character rendering width of the fragment under the specified font and font size;
[0142] Sum the rendering widths of all characters within each semantically independent fragment to obtain the total pixel width of that fragment;
[0143] Obtain the available content width of the split cell, divide the total pixel width of each semantically independent fragment by the available content width, and round up to obtain the line count requirement of that fragment in the cell;
[0144] Map the calculated pixel width and line count requirements to the corresponding semantically independent fragments and store them in the fragment attribute set for use in subsequent allocation matrix construction.
[0145] After extracting the semantic integrity boundaries of the text within the merged cells, the text content needs to be precisely segmented according to these boundaries, and the space requirements of each segment need to be calculated. In actual processing, the character position indices corresponding to the semantic integrity boundary markers are first read. These indices identify the start and end positions of each semantic unit in the text. Taking a merged cell containing multiple product specifications as an example, its content might be: "Dimensions: Length 200mm, Width 150mm, Height 100mm. Weight: Net weight 2.5kg, Gross weight 3.0kg. Material: Outer shell is made of aluminum alloy frame, inner lining is made of high-density polyethylene foam." After semantic analysis, three independent descriptive units can be identified, corresponding to the size information, weight information, and material information, respectively.
[0146] Based on the positional information of the semantic boundary markers, the character index values of each segmentation point are extracted sequentially, and the original text is decomposed into multiple sub-text segments through string truncation operations. During the segmentation process, the original formatting information of each segment must be preserved, including paragraph marks, line breaks, tabs, and other whitespace characters, as these formatting elements affect the final rendering width calculation. For text segments containing special symbols or mixed multilingual text, their character encoding types need to be individually marked to ensure that the subsequent rendering engine can correctly handle the width measurement requirements of different character sets. Each sub-text segment is encapsulated into an independent text object after segmentation. This object contains not only the text content itself but also metadata information such as the original cell coordinates, semantic type labels, and formatting attributes.
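The string truncation step can be sketched as follows, assuming the split points are given as character indices at which a new fragment begins. The `Fragment` class and `split_at_boundaries` function are illustrative names only. Whitespace is deliberately left untouched so that later width measurement sees the original formatting.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str   # sub-text segment, whitespace preserved
    start: int  # index of the first character in the original text
    end: int    # index one past the last character


def split_at_boundaries(text, split_points):
    """Cut text at the given character indices without trimming anything."""
    # Ignore duplicates and indices that would create empty edge fragments.
    cuts = sorted({p for p in split_points if 0 < p < len(text)})
    bounds = [0, *cuts, len(text)]
    return [Fragment(text[a:b], a, b) for a, b in zip(bounds, bounds[1:])]
```

Because nothing is stripped, concatenating the fragment texts in order reproduces the original cell content exactly.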
[0147] After obtaining all semantically independent fragments, the font rendering engine of the target rendering environment needs to be called to measure the actual display width of each fragment. Different rendering environments use different font rendering technologies. For example, Windows platforms commonly use GDI+ or DirectWrite interfaces, Linux platforms use the FreeType library, and web environments use the Canvas API's measureText method. Regardless of the rendering engine used, the core process is to pass the character sequence of the text fragment character by character to the rendering engine and obtain the rendering width value of each character under specified font, font size, font weight, and other parameters. For monospaced fonts, all characters have the same width, making the calculation relatively simple; however, for proportional fonts, the width of each character is different. The width of the letter "i" may only be one-third of the width of the letter "m," so precise measurement must be performed character by character.
[0148] In the actual measurement process, the font family name, font size, bolding, italics, and other typographic attributes applied to the cell are first read from the target document's style configuration. These attribute parameters are then passed to the font rendering engine's initialization interface to create a font object instance. Subsequently, each character in the semantically independent fragment is traversed, and the font object's character width measurement method is called to obtain the character's pixel width under the current font settings. For ASCII characters, the width is typically between 5 and 15 pixels; for Chinese characters, the width is usually around 12 pixels in 12-point font; for full-width symbols, the width is comparable to that of Chinese characters. It is important to note that some character combinations have kerning adjustments, meaning the spacing between adjacent characters is fine-tuned based on glyph characteristics. For example, the letters "AV" may be closer together in some fonts. In such cases, the rendering engine's kerning adjustment interface needs to be called to obtain additional spacing offsets.
[0149] The rendering widths of all characters within a semantically independent fragment are summed sequentially to obtain the total pixel width of the fragment without line breaks. During this summation, the width of spaces between words, the width of punctuation marks, and the placeholder width of any inline images or special symbols must also be included. For text fragments containing superscripts or subscripts, their widths need to be recalculated based on the font size scaling applied to the superscripts and subscripts. For text with strikethroughs or underlines, these decorations do not affect the width of the characters themselves, but they may add extra margins in some rendering engines, and any such margins must also be included in the total width. After calculation, the total pixel width is recorded in the attribute field of the semantically independent fragment.
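A minimal sketch of the character-by-character summation follows. A real implementation would obtain widths from the font rendering engine. Here `toy_char_width` is a placeholder with made-up widths, and the `kerning` table of pair offsets is likewise a hypothetical input.

```python
def toy_char_width(ch):
    """Placeholder metric standing in for a font engine query (values invented)."""
    if ch in "il.,:;' ":
        return 4.0   # narrow glyphs and the space
    if ord(ch) > 0x2E7F:
        return 12.0  # CJK and full-width characters
    return 7.0       # everything else


def total_pixel_width(text, char_width, kerning=None):
    """Sum per-character widths, then apply pairwise kerning offsets if given."""
    width = sum(char_width(ch) for ch in text)
    if kerning:
        # Kerning offsets are keyed by adjacent character pairs, e.g. "AV".
        width += sum(kerning.get(a + b, 0.0) for a, b in zip(text, text[1:]))
    return width
```

Passing the width function as a parameter keeps the summation independent of any particular rendering environment.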
[0150] After obtaining the total pixel width of each semantically independent fragment, the number of lines it requires in the split cell is calculated. First, obtain the available content width of the split cell, which equals the total cell width minus the left and right inner margins and the left and right border widths. For example, if the total cell width is 200 pixels, the left and right inner margins are 5 pixels each, and the left and right borders are 1 pixel each, then the available content width is 200 - 5 - 5 - 1 - 1 = 188 pixels. Dividing the total pixel width of the fragment by the available content width gives the number of lines the fragment occupies in the cell. Since the number of lines must be an integer, the quotient is rounded up. For example, for a fragment with a total pixel width of 450 pixels and an available content width of 188 pixels, 450 / 188 is approximately 2.39, which rounds up to a line count requirement of 3.
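The worked example above can be reproduced with a short sketch. The function names are illustrative, and treating an empty fragment as still needing one line is an assumption made here rather than something the text states.

```python
import math


def available_content_width(cell_width, pad_left, pad_right,
                            border_left, border_right):
    """Cell width minus horizontal padding and borders, in pixels."""
    return cell_width - pad_left - pad_right - border_left - border_right


def lines_required(total_pixel_width, content_width):
    """Round the width ratio up to a whole number of lines."""
    if content_width <= 0:
        raise ValueError("content_width must be positive")
    # Assumption: even an empty fragment reserves one line.
    return max(1, math.ceil(total_pixel_width / content_width))


content = available_content_width(200, 5, 5, 1, 1)  # 188
print(lines_required(450, content))                  # 3
```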
[0151] When calculating line count requirements, it's also necessary to consider the text's line wrapping rules and word segmentation strategies. For Chinese text, line breaks are typically allowed at any character, with no word integrity constraints. However, for English text, line breaks should ideally occur at word boundaries to avoid splitting a single word across two lines. Therefore, when calculating line counts, the rendering engine's automatic line wrapping algorithm needs to be invoked to simulate the text layout within the actual cell width and obtain the true line count requirements. Some rendering engines provide a text measurement interface that allows direct input of text content and container width, returning the actual number of lines occupied by the text under that width constraint and the start and end character indices of each line. This method is more accurate than simple width division. For text fragments containing non-breakable elements, such as long URLs or continuous numeric strings, if their width exceeds the available cell width, they must be forcibly broken or a horizontal scrolling strategy must be used. In this case, line count calculations require additional handling of overflow situations.
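To show why simulated wrapping can differ from simple width division, the sketch below uses a greedy first-fit wrap at spaces. It is a stand-in for the rendering engine's own line wrapping algorithm, not a reproduction of it. The function name and the forced per-character break for oversized tokens are assumptions.

```python
def wrap_line_count(text, content_width, char_width):
    """Count lines when wrapping at word boundaries (greedy first-fit)."""
    if content_width <= 0:
        raise ValueError("content_width must be positive")
    space = char_width(" ")
    lines, used = 1, 0.0
    for word in text.split():
        word_w = sum(char_width(c) for c in word)
        gap = space if used > 0 else 0.0
        if used + gap + word_w <= content_width:
            used += gap + word_w          # word fits on the current line
            continue
        if word_w <= content_width:
            lines += 1                    # wrap at the word boundary
            used = word_w
            continue
        # Unbreakable token wider than the cell: force-break by character.
        if used > 0:
            lines += 1
            used = 0.0
        for c in word:
            w = char_width(c)
            if used > 0 and used + w > content_width:
                lines += 1
                used = 0.0
            used += w
    return lines
```

With a fixed width of 10 pixels per character and a 100-pixel cell, the text "aaaaaa bbbbbb cccccc" has a total width of 200 pixels, so width division predicts 2 lines, but no two words fit on one line together and the wrap simulation yields 3.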
[0152] In addition to the basic pixel width and line count requirements, the height requirement of each semantically independent fragment also needs to be calculated. Text height is determined primarily by font size and line spacing. The font size, usually expressed in points or pixels, is read from the style configuration. Multiplying the font size by the line spacing factor yields the height of a single line of text. The line spacing factor is generally set between 1.2 and 1.5 to ensure sufficient vertical spacing between adjacent lines. Multiplying the single-line height by the previously calculated line count requirement gives the total height the fragment occupies within the cell. For fragments containing multi-level lists or indented paragraphs, the spacing before and after each paragraph must be added to this height.
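The height calculation reduces to a single formula, sketched below with hypothetical parameter names. The font size is assumed to be already converted to pixels.

```python
def fragment_height(font_size_px, line_spacing, line_count,
                    margin_before=0.0, margin_after=0.0):
    """Height of a fragment: line height times line count plus paragraph spacing."""
    line_height = font_size_px * line_spacing
    return line_height * line_count + margin_before + margin_after


# A 16 px font with a 1.5 line spacing factor over 3 lines.
print(fragment_height(16, 1.5, 3))  # 72.0
```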
[0153] The calculated space occupancy parameters, such as pixel width, line count requirement, and height requirement, are uniformly encapsulated into attribute objects for the semantically independent fragments. These attribute objects are stored as key-value pairs and contain multiple fields, including text content, starting character index, ending character index, semantic type label, total pixel width, line count requirement, height requirement, font parameters, and color parameters. The attribute objects of all semantically independent fragments are organized into a collection data structure, which can be an array, list, or hash table, to facilitate subsequent sequential access or querying by specific conditions. When constructing the attribute collection, a unique identifier is assigned to each fragment to ensure that each fragment can be accurately referenced and located during subsequent allocation matrix construction and cell content filling.
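One possible shape for the attribute objects and their collection is sketched below. The class names, field names, and the choice of a dictionary keyed by a sequential identifier are assumptions made for illustration. The text allows an array, list, or hash table.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FragmentAttributes:
    fragment_id: int
    text: str
    start_index: int
    end_index: int
    semantic_type: str
    total_pixel_width: float
    line_count: int
    height: float
    font: dict = field(default_factory=dict)
    color: str = "#000000"


class FragmentCollection:
    """Hash-table-backed collection that assigns unique fragment identifiers."""

    def __init__(self):
        self._ids = count(1)
        self._by_id = {}

    def add(self, **attrs):
        frag = FragmentAttributes(fragment_id=next(self._ids), **attrs)
        self._by_id[frag.fragment_id] = frag
        return frag

    def get(self, fragment_id):
        return self._by_id[fragment_id]

    def where(self, predicate):
        """Return fragments matching a condition, in insertion order."""
        return [f for f in self._by_id.values() if predicate(f)]
```

A dictionary gives direct lookup by identifier while still preserving insertion order for sequential access.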
[0154] This attribute set serves as the input data source for the subsequent allocation matrix construction. The allocation matrix determines which cell(s) each fragment should be filled into, based on the space requirements of each semantically independent fragment and the available space in the split cells. By pre-calculating the precise space occupancy of each fragment, problems such as content overflow, wasted cell space, or text truncation during the actual filling process can be avoided, ensuring that the split table fully presents all the information of the original merged cells while maintaining a good visual layout and readability.
[0155] A second aspect of this invention provides a system for splitting and reconstructing complex merged cells across pages, comprising:
[0156] The table parsing unit is used to acquire heterogeneous documents and parse their table structure, identify the row and column spanning attributes of merged cells and their coordinate range in the table, extract the internal text content of the identified merged cells and construct the text's grammatical dependency tree, and identify sentence boundaries and semantic integrity boundaries through the grammatical dependency tree.
[0157] The semantic segmentation unit is used to divide the text within the merged cell into multiple semantically independent fragments based on the semantic integrity boundary, and to calculate the pixel width and line count requirements of each semantically independent fragment in the target rendering environment. Based on the number of original cells covered by the merged cell and the space requirements of each semantically independent fragment, it establishes an allocation matrix between the fragments and the split cells so that the fragments are evenly distributed among the split cells.
[0158] The cross-page detection unit is used to detect the vertical position of the table in the target document page layout. It determines whether the cross-page condition is triggered by comparing the bottom coordinates of the table with the page height threshold. When the cross-page condition is triggered, it locates the table row index where the cross-page dividing line is located, extracts the height parameters of the row and its adjacent rows, calculates the row height elasticity coefficient and content density of each cell in the row, and determines whether the row supports inline splitting.
[0159] The pagination processing unit is used to insert a soft pagination mark inside the row if inline splitting is supported, and to generate a copy of the table header and a row continuation mark at the beginning of the next page; if inline splitting is not supported, the entire row is moved to the next page, and a blank area is filled at the end of the current page to maintain the visual balance of the page. The unit outputs the target document after content splitting and cross-page adaptation processing.
[0160] A third aspect of the present invention provides an electronic device, comprising:
[0161] A processor;
[0162] A memory for storing processor-executable instructions;
[0163] Wherein the processor is configured to invoke the instructions stored in the memory to execute the aforementioned method.
[0164] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.
[0165] This invention can be a method, apparatus, system, and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of the invention.
[0166] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for splitting and reconstructing complex merged cells across pages, characterized in that it comprises: obtaining heterogeneous documents and parsing their table structures, identifying the row and column spanning attributes of merged cells and their coordinate ranges in the table, extracting the internal text content of the identified merged cells and constructing a grammatical dependency tree of the text, and identifying sentence boundaries and semantic integrity boundaries through the grammatical dependency tree; dividing the text within the merged cell into multiple semantically independent fragments based on the semantic integrity boundaries, calculating the pixel width and line count requirements of each semantically independent fragment in the target rendering environment, and establishing, according to the number of original cells covered by the merged cell and the space requirements of each semantically independent fragment, an allocation matrix between the fragments and the split cells so that the fragments are evenly distributed among the split cells; detecting the vertical position of the table in the target document page layout, and comparing the table bottom coordinate with the page height threshold to determine whether the cross-page condition is triggered; when the cross-page condition is triggered, locating the index of the table row on which the cross-page dividing line falls, extracting the height parameters of that row and its adjacent rows, and calculating the row height elasticity coefficient and content density of each cell in the row to determine whether the row supports inline splitting; if inline splitting is supported, inserting a soft pagination mark inside the row and generating a copy of the table header and a row continuation mark at the beginning of the next page; if inline splitting is not supported, moving the entire row to the next page and filling a blank area at the end of the current page to maintain visual balance; and outputting the target document after content splitting and cross-page adaptation.
2. The method according to claim 1, characterized in that extracting the internal text content of the identified merged cells, constructing the grammatical dependency tree, and identifying sentence boundaries and semantic integrity boundaries through the grammatical dependency tree comprises: performing lexical analysis on the text content within the merged cells to segment the text into a sequence of tokens and label each token with a part-of-speech tag; performing syntactic analysis on the token sequence to identify the dominance and subordination relationships between tokens and to construct a grammatical dependency tree that reflects the hierarchical dependency relationships between tokens; locating the leaf nodes corresponding to sentence-ending punctuation marks in the grammatical dependency tree and marking their corresponding text positions as sentence boundaries; traversing each subtree of the grammatical dependency tree, identifying subtrees with complete subject-verb-object structures or independent semantic units, and marking the text ranges covered by those subtrees as semantic integrity boundaries; extracting the dependency arcs representing parallel relationships from the grammatical dependency tree, identifying the boundary positions between parallel components, and using these boundary positions as candidate semantic integrity boundaries; and combining the marking results of the sentence boundaries and the semantic integrity boundaries to generate a set of segmentation points for the text content, each segmentation point in the set corresponding to the start or end position of a semantic unit in the text content that can stand alone as a fragment.
3. The method according to claim 2, characterized in that traversing each subtree of the grammatical dependency tree, identifying subtrees with complete subject-verb-object structures or independent semantic units, and marking the text ranges covered by those subtrees as semantic integrity boundaries comprises: starting from the root node of the grammatical dependency tree, performing a depth-first traversal and extracting each non-leaf node as a candidate subtree root node; for each candidate subtree root node, collecting all its direct and indirect child nodes to form the node set of the candidate subtree; detecting whether the node set of the candidate subtree simultaneously contains nodes labeled as subject, predicate, and object; when the subject, predicate, and object nodes are detected to be simultaneously present, extracting the starting and ending positions in the original text of the tokens corresponding to all nodes in the candidate subtree node set, and marking the text interval between the starting and ending positions as a semantic integrity boundary with a complete subject-verb-object structure; when the subject, predicate, and object nodes are detected not to be simultaneously present, further detecting whether the candidate subtree contains a specific dependency relationship type with an independent semantic function, the specific dependency relationship type including combinations of the adverbial-head relation, the verb-object relation, and the modifier-head relation; and if the candidate subtree contains the specific dependency relationship type, marking the text range covered by the candidate subtree as a semantic integrity boundary with an independent semantic unit.
4. The method according to claim 1, characterized in that calculating, for the table row on which the cross-page dividing line falls, the row height elasticity coefficient and content density of each cell in the row to determine whether the row supports inline splitting comprises: extracting the rendering height and minimum allowed height of each cell in that table row and calculating the height compression margin of each cell; dividing the height compression margin of each cell by the rendering height of that cell to obtain the row height elasticity coefficient of that cell; counting the total number of text characters and the equivalent number of non-text elements in each cell of that table row and summing the two to obtain the total content of that cell; dividing the total content of each cell by the rendered area of that cell to obtain the content density of that cell; traversing all cells in that table row and determining whether any cell has a row height elasticity coefficient lower than the elasticity threshold or a content density higher than the density threshold; if such a cell exists, determining that the table row does not support inline splitting; and if the row height elasticity coefficient of every cell is not lower than the elasticity threshold and the content density of every cell is not higher than the density threshold, determining that the table row supports inline splitting.
5. The method according to claim 4, characterized in that extracting the rendering height and minimum allowed height of each cell in the table row on which the cross-page dividing line falls and calculating the height compression margin of each cell comprises: performing layout calculation on that table row in the target rendering environment and obtaining the actual height occupied by each cell of the row in the current page coordinate system as the rendering height; extracting the font parameters and line spacing parameters of the text content in each cell and calculating the standard line height of a single line of text in that cell; counting the number of lines of text content that must be displayed in each cell, multiplying that number of lines by the standard line height, and adding the top and bottom padding of the cell to obtain the minimum allowed height of the cell; for cells containing non-text elements, extracting the inherent height and vertical margin of the non-text elements and using the sum of the inherent height and vertical margin as an additional component of the minimum allowed height of the cell; subtracting the minimum allowed height of each cell from the rendering height of that cell to obtain the height compression margin of that cell; and storing the calculated height compression margin of each cell in a data structure indexed by the cell coordinates for use in subsequent calculation of the row height elasticity coefficient.
6. The method according to claim 1, characterized in that dividing the text within the merged cell into multiple semantically independent fragments based on the semantic integrity boundaries and calculating the pixel width and line count requirements of each semantically independent fragment in the target rendering environment comprises: dividing, based on the split-point positions marked by the semantic integrity boundaries, the text content in the merged cell into multiple sub-text segments, each sub-text segment corresponding to one semantically independent fragment; obtaining the font rendering engine interface of the target rendering environment, passing the text content of each semantically independent fragment to the font rendering engine, and obtaining the character-by-character rendering width of the fragment under the specified font and font size; summing the rendering widths of all characters within each semantically independent fragment to obtain the total pixel width of that fragment; obtaining the available content width of the split cell, dividing the total pixel width of each semantically independent fragment by the available content width, and rounding up to obtain the line count requirement of that fragment in the cell; and mapping the calculated pixel width and line count requirements to the corresponding semantically independent fragments and storing them in the fragment attribute set for use in subsequent allocation matrix construction.
7. A system for splitting and reconstructing complex merged cells across pages, used to implement the method according to any one of claims 1 to 6, characterized in that it comprises: a table parsing unit, used to acquire heterogeneous documents and parse their table structures, identify the row and column spanning attributes of merged cells and their coordinate ranges in the table, extract the internal text content of the identified merged cells and construct a grammatical dependency tree of the text, and identify sentence boundaries and semantic integrity boundaries through the grammatical dependency tree; a semantic segmentation unit, used to divide the text within the merged cell into multiple semantically independent fragments based on the semantic integrity boundaries, calculate the pixel width and line count requirements of each semantically independent fragment in the target rendering environment, and establish, based on the number of original cells covered by the merged cell and the space requirements of each semantically independent fragment, an allocation matrix between the fragments and the split cells so that the fragments are evenly distributed among the split cells; a cross-page detection unit, used to detect the vertical position of the table in the target document page layout, determine whether the cross-page condition is triggered by comparing the table bottom coordinate with the page height threshold, and, when the cross-page condition is triggered, locate the index of the table row on which the cross-page dividing line falls, extract the height parameters of that row and its adjacent rows, calculate the row height elasticity coefficient and content density of each cell in the row, and determine whether the row supports inline splitting; and a pagination processing unit, used to insert a soft pagination mark inside the row and generate a copy of the table header and a row continuation mark at the beginning of the next page if inline splitting is supported, to move the entire row to the next page and fill a blank area at the end of the current page to maintain the visual balance of the page if inline splitting is not supported, and to output the target document after content splitting and cross-page adaptation processing.
8. An electronic device, characterized in that it comprises: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 6.
9. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 6.