A table parsing method, apparatus, device, and storage medium
By obtaining the header and data cells in the table and parsing the table based on the hierarchical relationship, the problem of poor applicability of table parsing in existing technologies is solved, and effective parsing of various types of tables is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- IFLYTEK (SUZHOU) TECH CO LTD
- Filing Date
- 2022-09-08
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies are not effectively applicable to parsing various types of tables, resulting in poor parsing applicability.
By obtaining the header cells and data cells in the target table, and establishing the correspondence between the header cells and data cells based on the hierarchical relationship of the header cells, the table is parsed using semantic and spatial features.
It enables parsing of various types of tables, including tables with simple and complex hierarchies, thus improving the generalization ability of the parsing.
Figure CN116306566B_ABST
Description
Technical Field
[0001] This application relates to the field of table parsing technology, and in particular to a table parsing method, apparatus, device, and storage medium.
Background Technology
[0002] Tables are a very common way to display data, and their intuitiveness is conducive to the expression of structured information. However, in scenarios where tables need to be parsed, such as table-based Q&A, there are many types of tables to be parsed, and existing methods can only parse a certain type of table. For example, a method based on predetermined table parsing rules requires different parsing rules to be formulated for different types of tables. A single rule cannot be applied to parse various types of tables, so the applicability of table parsing is poor.
[0003] Therefore, how to improve the applicability of table parsing to various types of tables is of great significance.
Summary of the Invention
[0004] The main technical problem addressed by this application is to provide a table parsing method, apparatus, device, and storage medium that can parse various types of tables, with strong applicability and generalization ability.
[0005] To solve the above-mentioned technical problems, one technical solution adopted in this application is to provide a table parsing method, which includes: obtaining a target table to be parsed; determining a number of header cells and a number of data cells contained in the target table; obtaining the hierarchical relationship between the header cells based on the text information in the header cells; and obtaining the corresponding result between the data cells and at least one header cell based on the hierarchical relationship between the header cells.
[0006] The process of determining the header cells and data cells contained in the target table includes: obtaining the semantic features of each cell based on the text information in each cell of the target table, and obtaining the spatial features of each cell based on the spatial information of each cell in the target table; and using the semantic features and spatial features of each cell, determining whether each cell is a header cell or a data cell.
[0007] The process of obtaining semantic features of each cell based on the text information in each cell of the target table includes: for each cell, encoding the text information of the cell to obtain the text features of the cell, and determining the auxiliary features of the cell, which include at least one of the following: attribute features that characterize the attributes of the text information of the cell, and layout features that characterize the layout information of the cell in the target table; and using the text features and auxiliary features of the cell to obtain the semantic features of the cell.
[0008] The attributes of the text information include at least one of the following: the length of the text information, whether the text information is a date, whether the text information is purely numeric, the proportion of numeric values in the text information, and whether the text information begins with a numeric value; and / or, the layout information includes at least one of the following: the row and column of the cell, the number of neighboring cells of the cell, and the number of child cells contained in the cell; and / or, the semantic features of the cell are obtained by using the text features and auxiliary features of the cell, including: fusing the text features and auxiliary features of the cell to obtain a first fused feature; and performing semantic parsing on the first fused feature to obtain the semantic features of the cell.
[0009] The process of obtaining spatial features of each cell based on its spatial information in the target table includes: taking each cell as a target cell and constructing a graph representation of the target cell. The graph representation of the target cell includes a target node representing the target cell and at least one neighboring node representing at least one neighboring cell of the target cell. Each neighboring node is connected to the target node by a connecting edge, and the type of the connecting edge between the neighboring node and the target node matches the positional relationship between the corresponding neighboring cell and the target cell. The graph representation of the target cell is then encoded to obtain the spatial features of the target cell.
[0010] The process of obtaining the hierarchical relationship between header cells based on text information in several header cells includes: obtaining the header category of each header cell based on text information in each header cell; determining the hierarchical relationship between header cells using the header category of each header cell; and / or obtaining the text representation of at least two header cells based on text information in at least two header cells, and determining the hierarchical relationship of each target header cell pair based on the text representation of each target header cell pair, wherein the target header cell pair contains two header cells located in the same row or column of at least two header cells.
[0011] The header category of the header cell includes at least two of the following: table item name, table item, attribute name, total, and title; and / or, based on the text information in each header cell, the header category of each header cell is obtained, including: for each header cell, fusing the semantic features, spatial features, and category features of the header cell to obtain the second fused feature of the header cell, wherein the semantic features of the header cell are determined based on the text information in the header cell, and the category features of the header cell are determined based on the cell category corresponding to the header cell; classifying the second fused feature of the header cell to obtain the header category of the header cell.
[0012] Specifically, based on the hierarchical relationship between the header cells, the corresponding results between several data cells and at least one header cell are obtained, including: determining the data parsing direction of each header cell in the target table according to the hierarchical relationship between the header cells; and for each header cell, determining the correspondence between at least one data cell in the target table located in the data parsing direction of the header cell and the header cell according to the data parsing direction.
[0013] Specifically, based on the hierarchical relationship between the header cells, the data parsing direction of each header cell in the target table is determined, including: treating each header cell as a cell to be parsed; determining the data parsing direction of the cell to be parsed as downward or upward in response to the parallel hierarchical relationship between the cell to be parsed and its neighboring header cells in the same row; and determining the data parsing direction of the cell to be parsed as right or left in response to the parallel hierarchical relationship between the cell to be parsed and its neighboring header cells in the same column.
[0014] The process of obtaining the target table to be parsed includes: obtaining at least one original table; finding a regular expression corresponding to the table to be parsed from a preset database; and determining the original table as the target table in response to the similarity between the table name of the original table and the regular expression reaching a preset threshold.
[0015] To solve the above-mentioned technical problems, another technical solution adopted in this application is: to provide a table parsing device, which includes: an acquisition module for acquiring a target table to be parsed; a determination module for determining a plurality of header cells and a plurality of data cells contained in the target table; a hierarchy relationship determination module for obtaining the hierarchy relationship between the header cells based on the text information in the header cells; and a correspondence relationship determination module for obtaining the correspondence between the data cells and at least one header cell based on the hierarchy relationship between the header cells.
[0016] To solve the above-mentioned technical problems, another technical solution adopted in this application is: to provide an electronic device, including a memory and a processor coupled to each other, wherein the memory stores program instructions; and the processor is used to execute the program instructions stored in the memory to implement the above-mentioned method.
[0017] To solve the above-mentioned technical problems, another technical solution adopted in this application is to provide a computer-readable storage medium for storing program instructions that can be executed to implement the above-mentioned method.
[0018] The beneficial effects of this application are as follows: After obtaining the target table to be parsed, this application first determines several header cells and several data cells contained in the target table. Then, based on the text information in the header cells, it obtains the hierarchical relationship between the header cells. Furthermore, based on the hierarchical relationship between the header cells, it obtains the corresponding results between several data cells and at least one header cell, thus establishing a connection between the header cells and data cells, thereby realizing table parsing. Compared with rule-based table parsing methods, this application does not need to formulate corresponding parsing rules for various types of tables with different levels. It can directly parse the table based on the obtained hierarchical relationship of each header cell. Therefore, the table parsing method of this application is applicable to the parsing of various types of tables with different levels (e.g., simple hierarchical tables, as well as complex combined tables and nested tables), and has strong applicability and generalization ability.
Attached Figure Description
[0019] Figure 1 is a flowchart of an embodiment of the table parsing method provided in this application;
[0020] Figure 2 is a partial flowchart of one embodiment of step S14 shown in Figure 1;
[0021] Figure 3 is a flowchart of one embodiment of step S11 shown in Figure 1;
[0022] Figure 4 is a partial flowchart of one embodiment of step S12 shown in Figure 1;
[0023] Figure 5 is a partial flowchart of one embodiment of step S12 shown in Figure 1;
[0024] Figure 6 is a graph representation of a target cell constructed in the table parsing method provided in this application;
[0025] Figure 7 is a schematic diagram of the framework of an embodiment of the table parsing method provided in this application;
[0026] Figure 8 is a schematic diagram of the framework of an embodiment of the table parsing device provided in this application;
[0027] Figure 9 is a schematic diagram of the structure of an embodiment of the electronic device provided in this application;
[0028] Figure 10 is a schematic diagram of the structure of the computer-readable storage medium provided in this application.
Detailed Implementation
[0029] To make the purpose, technical solution and effects of this application clearer and more explicit, the following describes this application in further detail with reference to the accompanying drawings and embodiments.
[0030] It should be noted that if the embodiments of this application involve descriptions such as "first" or "second," these descriptions are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, features defined with "first" or "second" may explicitly or implicitly include at least one of those features. Furthermore, the technical solutions of the various embodiments can be combined with each other, but this must be based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or impossible to implement, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed in this application.
[0031] Please see Figure 1, which is a flowchart of an embodiment of the table parsing method provided in this application. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the process sequence shown in Figure 1. As shown in Figure 1, this embodiment includes:
[0032] S11: Obtain the target table to be parsed.
[0033] The method in this embodiment is used to obtain the hierarchical relationship between several header cells through the text information of each header cell, and then obtain the corresponding result between the data cell and the header cell based on the hierarchical relationship between the header cells.
[0034] The target table to be parsed can be a table or an image of a table existing in a document such as a PDF or Word document. There can be one or more target tables to be parsed, depending on the actual needs.
[0035] In one embodiment, the target table to be parsed is one or more tables existing in a PDF or Word document, which can be extracted from the corresponding PDF or Word document using extraction logic. In another embodiment, the target table to be parsed can be obtained by taking a picture or optical character recognition. Of course, in other methods, when the target table needs to be obtained, it can be obtained from local storage or cloud storage. The specific method for obtaining the target table to be parsed can be determined according to the actual scenario, and is not specifically limited here.
[0036] S12: Determine several header cells and several data cells contained in the target table.
[0037] In one embodiment, the target table includes header cells and data cells. In other embodiments, in addition to header cells and data cells, the target table also includes at least one of the following: a table name corresponding to the target table and context information associated with the table. Here, "several" indicates that the corresponding quantity is at least one; it can be one or more. The specific number of header cells and data cells in the target table and the information contained in the target table can be determined according to actual needs and are not specifically limited here.
[0038] In one embodiment, the semantic features and spatial features of each cell in the target table can be used to determine the several header cells and several data cells contained in the target table. Specifically, the semantic features of each cell are obtained based on the text information in that cell and represent the semantic information of the cell; the spatial features of each cell are obtained based on the spatial information of the cell in the target table and represent its positional relationship with each adjacent cell in the target table.
[0039] It should be noted that in some scenarios, for simple tables, such as those with a single header and a single data element, the corresponding header and data cells can be directly determined based on text and spatial information. However, for other scenarios with complex table hierarchies, relying solely on the text and spatial information of cells is insufficient to fully express the semantic information of each cell, hindering semantic parsing of the table. Therefore, auxiliary information representing the semantics of cells can be pre-constructed to extract auxiliary features of each cell. By combining text and auxiliary information, the semantic information of each cell is enriched, thereby facilitating subsequent table parsing. This auxiliary information can include, for example, information related to the cell (row and column, number of neighboring cells, etc.) or information related to the text within the cell (text length, whether it is purely numeric, etc.).
[0040] Of course, in other embodiments, the semantic or spatial information of each cell in the target table can also be used to determine the header cells and data cells contained in the target table. The specific method for determining the header cells and data cells in the target table can be determined according to the complexity of the target table's structural hierarchy, and is not specifically limited here.
[0041] S13: Based on the text information in several header cells, obtain the hierarchical relationship between the header cells.
[0042] Please refer to Table 1. The text information in a cell is the text contained in that cell. For example, the text information in the cell containing "Serial Number" is "Serial Number". The hierarchical relationship includes parallel and subordinate relationships. Specifically, any two of the header cells "Serial Number", "Customer Name", "Sales Region", "Sales Content" and "Sales Amount" are in a parallel relationship, while each of these header cells belongs to "2019" and is therefore in a subordinate relationship with it.
[0043] Table 1. Sales Revenue for 2019
[0044]
[0045] In one implementation, for target tables with relatively simple hierarchical relationships, the hierarchical relationship of each header cell can be directly determined based on the header category corresponding to each header cell. For example, the header category of each header cell can be obtained based on the text information in each header cell; then, the hierarchical relationship between header cells can be directly determined using the header categories of each header cell. Specifically, for each header cell, the semantic features, spatial features, and category features of the header cell are fused (e.g., by concatenating the features) to obtain the second fused feature of the header cell. Then, a classification model is used to classify the second fused feature of the header cell to obtain the header category of the header cell. Similar to step S12, the semantic features of the header cell are determined based on the text information in the header cell, and the category features of the header cell are determined based on the cell category corresponding to the header cell, for example, by vector transformation of the cell category label corresponding to the header cell.
[0046] The header categories of the header cells include at least two of the following: table item name, table item, attribute name, total, and title. As shown in Table 1, "Serial Number" and "Customer Name" are table item names; 1, 2, and 3 under "Serial Number" and A, B, and C under "Customer Name" are table items; "Sales Region", "Sales Content", and "Sales Amount" are attribute names; and "2019" is the title.
[0047] It should be noted that there is a hierarchical relationship between the header categories. Therefore, after determining the header category of each header cell, the hierarchical relationship between the header cells can be directly determined. Specifically, adjacent header cells of the same category are parallel, table items belong to table item names, and table item names and attribute names belong to titles. As shown in Table 1, the table item names ("Serial Number" and "Customer Name") and attribute names ("Sales Region", "Sales Content", and "Sales Amount") in the second row of the target table all belong to the title "2019" in the first row; the table item names "Serial Number" and "Customer Name" are parallel; the three table items A, B, and C all belong to the table item name "Customer Name" and are parallel to each other; and so on.
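The category rules stated above can be illustrated with a minimal Python sketch. The label strings, the `PARENT_CATEGORIES` mapping, and the `relation` function are hypothetical names introduced only for illustration; the sketch encodes just the three rules given in the text and returns `None` for any pair those rules do not cover (for example, a table item name next to an attribute name).

```python
from typing import Optional

# Header category labels named in the text; the string values are illustrative.
TITLE = "title"
ITEM_NAME = "table_item_name"
ITEM = "table_item"
ATTR_NAME = "attribute_name"

# child category -> categories it belongs to, per the rules stated in the text
PARENT_CATEGORIES = {
    ITEM: {ITEM_NAME},
    ITEM_NAME: {TITLE},
    ATTR_NAME: {TITLE},
}


def relation(cat_a: str, cat_b: str) -> Optional[str]:
    """Relation between two adjacent header cells a and b, from categories only."""
    if cat_a == cat_b:
        return "parallel"
    if cat_b in PARENT_CATEGORIES.get(cat_a, set()):
        return "a_belongs_to_b"
    if cat_a in PARENT_CATEGORIES.get(cat_b, set()):
        return "b_belongs_to_a"
    return None  # the stated rules do not cover this pair
```

For Table 1, `relation(ITEM_NAME, ITEM_NAME)` models "Serial Number" and "Customer Name" being parallel, and `relation(ITEM, ITEM_NAME)` models table item A belonging to "Customer Name".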
[0048] In another embodiment, for tables with complex hierarchical relationships, after determining each header cell in the target table, the text representation of at least two header cells can be obtained based on the text information in at least two header cells, and the hierarchical relationship of each target header cell pair can be determined based on the text representation of each target header cell pair, wherein the target header cell pair includes two header cells located in the same row or column of at least two header cells.
[0049] For example, text information in header cells located in the same row or column is concatenated according to a preset rule to obtain the corresponding concatenated text. The preset rule may include, but is not limited to, separating the text in each cell with "#" to distinguish each cell. As shown in Table 2, "Assets#Ending Balance#Last Year's Ending Balance" in the same row and "Assets#Current Assets:#Cash and Cash Equivalents#Settlement Reserves*#Lending Funds*" in the same column are concatenated to obtain each concatenated text. Then, the corresponding concatenated text is encoded to obtain the corresponding concatenated text features. It can be understood that the obtained concatenated text features include the text features of each header cell located in the same row or column. The text features of every two header cells in each concatenated text are combined to obtain the text representation of each target header cell pair. Then, through a fully connected layer, each target header cell pair is classified to obtain the hierarchical relationship of each target header cell pair, that is, the hierarchical relationship of every two header cells in the same row or column.
[0050] Table 2. Asset Statement
[0051]
Assets | Ending Balance | Last Year's Ending Balance
Current Assets: | |
Cash and Cash Equivalents | |
Settlement Reserves* | |
Lending Funds* | |
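As a rough illustration of the concatenation and pairing described for Table 2, the following Python sketch joins the header texts of one row with "#" and enumerates every pair of header cells in that row. The function names are hypothetical, and the encoding of the concatenated text and the fully connected classification layer are deliberately omitted because the text does not specify them.

```python
from itertools import combinations
from typing import List, Tuple

SEPARATOR = "#"  # the separator given in the text as one possible preset rule


def concatenate_headers(header_texts: List[str]) -> str:
    """Join the texts of header cells that share a row or column."""
    return SEPARATOR.join(header_texts)


def target_header_pairs(header_texts: List[str]) -> List[Tuple[int, int]]:
    """Index pairs for every two header cells in the same row or column."""
    return list(combinations(range(len(header_texts)), 2))


row = ["Assets", "Ending Balance", "Last Year's Ending Balance"]
print(concatenate_headers(row))
print(target_header_pairs(row))
```

Each index pair identifies one target header cell pair whose combined text features would then be classified as parallel or subordinate.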
[0052] S14: Based on the hierarchical relationship between the header cells, obtain the corresponding results between several data cells and at least one header cell.
[0053] In one embodiment, the data parsing direction of each header cell in the target table can be determined based on the hierarchical relationship between the header cells. Then, based on the data parsing direction, the correspondence between at least one data cell in the target table located in the data parsing direction of the header cell and the header cell can be determined, thereby obtaining the correspondence between several data cells and at least one header cell.
[0054] Specifically, please refer to Figure 2, which is a partial flowchart of one embodiment of step S14 shown in Figure 1. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the process sequence shown in Figure 2. As shown in Figure 2, in this embodiment, determining the data parsing direction of each header cell in the target table based on the hierarchical relationship between the header cells specifically includes:
[0055] S21: Treat each header cell as a cell to be parsed.
[0056] Since the header and data are in a corresponding relationship, when determining the data parsing direction of each header cell in the target table, it is necessary to first determine the cell to be parsed. In this embodiment, each header cell is taken as the cell to be parsed.
[0057] S22: In response to the fact that the hierarchical relationship between the cell to be parsed and the neighboring header cell in the same row is parallel, the data parsing direction of the cell to be parsed is determined to be downward or upward.
[0058] When the cell to be parsed and its neighboring header cell in the same row are in a parallel hierarchical relationship, the direction of the data cell corresponding to the cell to be parsed can be determined as either above or below the row containing the cell to be parsed. In other words, the data parsing direction of the cell to be parsed is determined to be downwards or upwards. The upward or downward parsing direction can be determined based on the position of the cell to be parsed in the target table. For example, if the cell to be parsed is in the first row of the target table, the parsing direction is downwards; if the cell to be parsed is in the last row of the target table, the parsing direction is upwards.
[0059] S23: In response to the fact that the hierarchical relationship between the cell to be parsed and the neighboring header cell in the same column is parallel, the data parsing direction of the cell to be parsed is determined to be either right or left.
[0060] When the cell to be parsed and its neighboring header cell in the same column are in a parallel hierarchical relationship, the data cells corresponding to the cell to be parsed can be determined to lie either to the left or to the right of the column containing the cell to be parsed. In other words, the data parsing direction of the cell to be parsed is determined to be right or left. The parsing direction (left or right) is determined by the cell's position in the target table. For example, if the cell to be parsed is in the leftmost column of the target table, the parsing direction is to the right; if it is in the rightmost column, the parsing direction is to the left.
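Steps S21 to S23 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: cells are addressed by zero-based (row, column) indices, the two boolean flags are assumed to come from the hierarchical relationships obtained in step S13, and header cells that are in neither the first nor the last row (or column) are treated like first-row (or leftmost-column) cells, which the text does not specify. All names are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row index, column index), zero-based

STEPS = {"down": (1, 0), "up": (-1, 0), "right": (0, 1), "left": (0, -1)}


def parsing_direction(cell: Cell, n_rows: int, n_cols: int,
                      parallel_in_row: bool, parallel_in_col: bool) -> Optional[str]:
    """Data parsing direction of one header cell (the cell to be parsed)."""
    row, col = cell
    if parallel_in_row:
        # Last row parses upward, first row downward (from the text).
        # Handling middle rows like the first row is an assumption.
        return "up" if n_rows > 1 and row == n_rows - 1 else "down"
    if parallel_in_col:
        # Rightmost column parses left, leftmost column right (from the text).
        return "left" if n_cols > 1 and col == n_cols - 1 else "right"
    return None  # no parallel neighbor: direction not determined here


def data_cells_for_header(cell: Cell, direction: str, n_rows: int, n_cols: int,
                          data_cells: Set[Cell]) -> List[Cell]:
    """Data cells lying in the parsing direction of the given header cell."""
    d_row, d_col = STEPS[direction]
    row, col = cell[0] + d_row, cell[1] + d_col
    matched = []
    while 0 <= row < n_rows and 0 <= col < n_cols:
        if (row, col) in data_cells:
            matched.append((row, col))
        row, col = row + d_row, col + d_col
    return matched
```

Walking cell by cell in the chosen direction and keeping only cells already classified as data cells is one simple way to build the header-to-data correspondence of step S14.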
[0061] In this embodiment, after obtaining the target table to be parsed, several header cells and several data cells contained in the target table are first determined. Then, based on the text information in the header cells, the hierarchical relationship between the header cells is obtained. Furthermore, based on the hierarchical relationship between the header cells, the corresponding results between the data cells and at least one header cell are obtained, thus establishing a connection between the header cells and the data cells, thereby realizing the table parsing. Compared with rule-based table parsing methods, this application does not require the formulation of corresponding parsing rules for various types of tables with different levels. It can directly parse the table based on the obtained hierarchical relationship of each header cell. Therefore, the table parsing method of this application is applicable to the parsing of various types of tables with different levels (e.g., simple hierarchical tables, as well as complex combined tables and nested tables), and has strong applicability and generalization ability.
[0062] Please see Figure 3, which is a flowchart of one embodiment of step S11 shown in Figure 1. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the process sequence shown in Figure 3. As shown in Figure 3, this embodiment includes:
[0063] S31: Obtain at least one original table.
[0064] It should be noted that, for example, in a table-based question-and-answer scenario, where it is necessary to find the target table to be parsed from a large number of tables, the method in this embodiment can be used to extract the target table to be parsed from at least one original table.
[0065] As described in step S11, the original table can be a table or an image corresponding to a table existing in a document such as a PDF or Word document. In one embodiment, when the original table needs to be obtained, it can be retrieved from local storage or cloud storage.
[0066] S32: Find the regular expression corresponding to the table to be parsed from the preset database.
[0067] In this embodiment, a preset database stores regular expressions for various tables, such as tables related to sales or rankings. In practical applications, key information related to the table to be parsed can be determined according to actual needs. For example, to extract an asset table, the database stores all regular expressions related to asset tables. This key information (e.g., asset table) can be used to find the regular expression corresponding to the table to be parsed from the preset database. After the regular expression corresponding to the table to be parsed is found, the similarity between the table name of each original table and that regular expression is calculated, and the target table to be parsed is found based on the calculated similarity. The table name can be the text above the table structure.
[0068] In one embodiment, the table name can be encoded to obtain a corresponding encoding vector v1, and the regular expression in the preset database can be encoded to obtain an encoding vector v2. The similarity between the table name of the original table and the corresponding regular expression is calculated according to the formula: Sim = cos(v1, v2), where Sim represents the similarity and cos represents the cosine function.
[0069] S33: In response to the fact that the similarity between the table name of the original table and the regular expression reaches a preset threshold, the original table is determined to be the target table.
[0070] In this embodiment, if the similarity between the table name of the original table and the regular expression reaches a preset threshold, the corresponding original table is determined to be the target table. The preset similarity threshold can be determined according to actual circumstances and is not specifically limited here.
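The similarity test of steps S32 and S33 can be sketched in Python as below. The text does not specify how table names or regular expressions are encoded, so the vectors here are toy values; the function names and the threshold of 0.8 are likewise assumptions made only for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Sim = cos(v1, v2); returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def select_target_tables(name_vectors: Dict[str, Sequence[float]],
                         pattern_vector: Sequence[float],
                         threshold: float) -> List[str]:
    """Original tables whose table-name similarity reaches the preset threshold."""
    return [name for name, vec in name_vectors.items()
            if cosine_similarity(vec, pattern_vector) >= threshold]


# Toy encoding vectors standing in for the outputs of an unspecified encoder.
names = {"Asset Statement": [1.0, 0.0], "Sales Ranking": [0.0, 1.0]}
print(select_target_tables(names, [1.0, 0.1], threshold=0.8))
```

The same two functions would apply unchanged if v2 were instead the encoding of a standard table name taken from extraction logic.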
[0071] Of course, in other methods, the table names in the extraction logic can be used to filter the table names corresponding to at least one original table to obtain the target table to be parsed. Specifically, each extraction logic corresponding to each table to be extracted can be stored in the database in advance. The extraction logic defines standard table names. The table names of at least one original table are encoded to obtain an encoding vector v1, and the standard table name in the extraction logic corresponding to the table to be extracted is encoded to obtain an encoding vector v2. The similarity between the table name of the original table and the standard table name in the extraction logic is calculated using the formula Sim = cos(v1, v2). When the similarity reaches a preset threshold, the corresponding original table is determined as the target table to be extracted.
[0072] Please see Figure 4, which is a partial flowchart of one embodiment of step S12 shown in Figure 1. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the process sequence shown in Figure 4. As shown in Figure 4, in this embodiment, obtaining the semantic features of each cell based on the text information in each cell of the target table specifically includes:
[0073] S41: For each cell, encode the text information of the cell to obtain the text features of the cell, and determine the auxiliary features of the cell.
[0074] In this embodiment, the auxiliary features are obtained based on pre-constructed auxiliary information representing the semantics of the cells. Specifically, auxiliary features for representing the semantics of cells are constructed based on the pre-constructed auxiliary information. These auxiliary features include at least one of the following: attribute features characterizing the attributes of the text information of the cell, and layout features characterizing the layout information of the cell in the target table. The attributes of the text information include at least one of the following: the length of the text information, whether the text information is a date, whether the text information is purely numeric, the proportion of numbers in the text information, and whether the text information begins with a number. The layout information includes at least one of the following: the row and column of the cell, the number of neighboring cells, and the number of sub-cells contained in the cell. The attributes and layout information listed above are only examples and do not limit what the attributes of the text information and the layout information may contain. Their specific content can be determined according to the actual application scenario and is not specifically limited here.
[0075] It should be noted that the features contained in the auxiliary features correspond to the pre-built auxiliary information. Therefore, in real-world scenarios, auxiliary information can be constructed to represent the semantics of cells according to actual needs, so that the features contained in the auxiliary features can more fully represent the semantic information contained in the corresponding cells.
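The attribute and layout features listed above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the date pattern (YYYY-MM-DD or YYYY/MM/DD), and the use of `str.isdigit` as the "purely numeric" test are hypothetical choices, not taken from the specification.

```python
import re

# Assumed date format for illustration only: YYYY-MM-DD or YYYY/MM/DD.
DATE_PATTERN = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}")


def attribute_features(text):
    """Attribute features describing the text information of one cell."""
    digit_count = sum(ch.isdigit() for ch in text)
    return {
        "length": len(text),
        "is_date": DATE_PATTERN.fullmatch(text) is not None,
        # Assumption: "purely numeric" means every character is a digit.
        "is_numeric": text.isdigit(),
        "digit_ratio": digit_count / len(text) if text else 0.0,
        "starts_with_digit": text[:1].isdigit(),
    }


def layout_features(row, col, neighbor_count, sub_cell_count):
    """Layout features describing where the cell sits in the target table."""
    return {
        "row": row,
        "col": col,
        "neighbor_count": neighbor_count,
        "sub_cell_count": sub_cell_count,
    }


def auxiliary_vector(text, row, col, neighbor_count, sub_cell_count):
    """Flatten attribute and layout features into one numeric vector."""
    features = {
        **attribute_features(text),
        **layout_features(row, col, neighbor_count, sub_cell_count),
    }
    return [float(value) for value in features.values()]


date_attrs = attribute_features("2022-09-08")
numeric_vector = auxiliary_vector("123", row=2, col=1, neighbor_count=8, sub_cell_count=0)
```

Which features are worth including depends on the tables at hand; the dictionary keys here are easy to extend when further auxiliary information is constructed.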
[0076] S42: Utilize the text features and auxiliary features of the cell to obtain the semantic features of the cell.
[0077] In one embodiment, after obtaining the text features and auxiliary features of a cell, the text features and auxiliary features of the cell are fused to obtain a first fused feature. Then, the first fused feature is semantically parsed to obtain the semantic features of the cell. The semantic parsing can be performed by a Bi-GRU network or another parsing model.
[0078] Fusing the text features and auxiliary features of a cell can be done, for example, by concatenating them. For example, if the text features are 50-dimensional and the auxiliary features are 60-dimensional, the first fused feature is the 110-dimensional feature obtained by concatenating the text features and the auxiliary features.
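The dimension arithmetic of the concatenation example can be checked with a short sketch. The vectors below are placeholder values, and the helper name `fuse` is hypothetical.

```python
def fuse(text_features, aux_features):
    """First fused feature: the text features followed by the auxiliary features."""
    return list(text_features) + list(aux_features)


text_features = [0.0] * 50  # placeholder 50-dimensional text features
aux_features = [1.0] * 60   # placeholder 60-dimensional auxiliary features
first_fused = fuse(text_features, aux_features)
print(len(first_fused))  # 110
```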
[0079] In some implementations, a neural network or feature extraction algorithm may be used to extract features from the first fused features in order to perform semantic parsing on the first fused features and obtain the semantic features of the cell.
[0080] Please refer to Figure 5, which is a partial flowchart of one embodiment of step S12 in Figure 1. It should be noted that, provided substantially the same result is achieved, this embodiment is not limited to the process sequence shown in Figure 5. As shown in Figure 5, in this embodiment, obtaining the spatial features of each cell based on the spatial information of each cell in the target table specifically includes:
[0081] S51: Use each cell as the target cell to construct a graph representation of the target cell.
[0082] The graph representation of a target cell includes a target node representing the target cell, and at least one neighboring node representing at least one neighboring cell of the target cell. "Neighboring" indicates a certain proximity relationship. As shown in Figure 6, the hollow circle represents the target node of the target cell, and the solid circles represent the neighboring nodes of the target node. The neighboring nodes correspond to the cells located above, below, to the left of, to the right of, and to the upper left, upper right, lower left, and lower right of the target cell.
[0083] Each neighboring node is connected to the target node by a connecting edge, and the type of the connecting edge between the neighboring node and the target node matches the positional relationship between the corresponding neighboring cell and the target cell. In other words, the type of the connecting edge between the neighboring node and the target node is the positional relationship between the corresponding neighboring cell and the target cell.
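The graph construction of step S51 can be sketched as below. The sketch assumes a plain rectangular grid without merged cells, and the edge-type names and the dictionary layout of the graph are hypothetical; the specification only requires that each edge type match the positional relationship between the neighboring cell and the target cell.

```python
# (row offset, column offset) for each of the eight positional relationships.
DIRECTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
    "upper_left": (-1, -1),
    "upper_right": (-1, 1),
    "lower_left": (1, -1),
    "lower_right": (1, 1),
}


def build_cell_graph(table, row, col):
    """Build the graph of the target cell at (row, col).

    The graph holds the target node and one typed connecting edge per
    neighboring cell that actually exists inside the table.
    """
    edges = []
    for edge_type, (d_row, d_col) in DIRECTIONS.items():
        r, c = row + d_row, col + d_col
        if 0 <= r < len(table) and 0 <= c < len(table[r]):
            edges.append({"neighbor": (r, c), "type": edge_type})
    return {"target": (row, col), "edges": edges}


table = [
    ["Item", "2021", "2022"],
    ["Revenue", "10", "12"],
    ["Cost", "7", "8"],
]
center_graph = build_cell_graph(table, 1, 1)
corner_graph = build_cell_graph(table, 0, 0)
```

A cell in the middle of the grid gets eight neighboring nodes, while a corner cell gets three, which is how the graph itself carries information about the cell's position.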
[0084] S52: Encode the graph representation of the target cell to obtain the spatial features of the target cell.
[0085] In one embodiment, a graph convolutional neural network can be used to encode the graph representation of the target cell to obtain the spatial features of the target cell.
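The specification does not fix the architecture of the graph convolutional network. As a rough illustration only, the sketch below performs a single simplified neighbor-aggregation step in which each edge type scales its neighbor's features by its own weight. The scalar weights, the averaging, and the ReLU are assumptions; a trained network would learn weight matrices rather than use hand-set scalars.

```python
def encode_target(target_vec, neighbors, type_weights, self_weight=1.0):
    """One simplified message-passing step for a target node.

    neighbors: list of (edge_type, feature_vector) pairs.
    type_weights: hypothetical scalar weight per edge type.
    """
    dim = len(target_vec)
    aggregated = [0.0] * dim
    for edge_type, vec in neighbors:
        weight = type_weights[edge_type]
        for i in range(dim):
            aggregated[i] += weight * vec[i]
    if neighbors:
        # Average the messages so the result does not grow with neighbor count.
        aggregated = [value / len(neighbors) for value in aggregated]
    # Combine the target's own features with the neighbor messages, then ReLU.
    return [
        max(0.0, self_weight * own + message)
        for own, message in zip(target_vec, aggregated)
    ]


spatial = encode_target(
    [1.0, -2.0],
    [("left", [2.0, 0.0]), ("up", [0.0, 4.0])],
    {"left": 0.5, "up": -1.0},
)
```

Because the weight depends on the edge type, a neighbor above the target contributes differently from a neighbor to its left, which is the property the typed connecting edges are meant to provide.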
[0086] In one specific implementation, as shown in Figure 7, the text in each cell is encoded to obtain text features, and auxiliary features of each cell are obtained through pre-constructed auxiliary information. The text features and auxiliary features of the cells are fused (concatenated) to obtain the first fused feature. Then, the first fused feature is semantically parsed, for example by a Bi-GRU network, to obtain the semantic features of the cells. At the same time, a graph representation of each cell in the target table is constructed, and the graph representation of each target cell is encoded to obtain the spatial features of the target cell. Further, the semantic features and spatial features of each cell are concatenated and fused to obtain the features of the cell. Then, the features of each cell are classified using a pre-trained classification model to obtain the cell category of each cell, that is, to determine which cells are header cells and which are data cells.
[0087] Please refer to Figure 8, which is a schematic diagram of the framework of an embodiment of the table parsing device provided in this application. In this embodiment, the table parsing device includes an acquisition module 81, a determination module 82, a hierarchy relationship determination module 83, and a correspondence relationship determination module 84. The acquisition module 81 is used to acquire the target table to be parsed; the determination module 82 is used to determine the plurality of header cells and the plurality of data cells contained in the target table; the hierarchy relationship determination module 83 is used to obtain the hierarchy relationship between the header cells based on the text information in the header cells; and the correspondence relationship determination module 84 is used to obtain the correspondence result between the plurality of data cells and at least one header cell based on the hierarchy relationship between the header cells.
[0088] In some embodiments, the determination module 82 determines a number of header cells and a number of data cells contained in the target table, including: obtaining the semantic features of each cell based on the text information in each cell of the target table, and obtaining the spatial features of each cell based on the spatial information of each cell in the target table; and using the semantic features and spatial features of each cell, determining whether each cell is a header cell or a data cell.
[0089] In some embodiments, the acquisition module 81 acquires the semantic features of each cell based on the text information in each cell of the target table, including: for each cell, encoding the text information of the cell to obtain the text features of the cell, and determining the auxiliary features of the cell, the auxiliary features including at least one of the following: attribute features that characterize the attributes of the text information of the cell, layout features that characterize the layout information of the cell in the target table; and using the text features and auxiliary features of the cell to obtain the semantic features of the cell.
[0090] In some embodiments, the attributes of the text information include at least one of the following: the length of the text information, whether the text information is a date, whether the text information is purely numeric, the proportion of numeric values in the text information, and whether the text information begins with a numeric value; and / or, the layout information includes at least one of the following: the row and column of the cell, the number of neighboring cells of the cell, and the number of child cells contained in the cell; and / or, the semantic features of the cell are obtained by utilizing the text features and auxiliary features of the cell, including: fusing the text features and auxiliary features of the cell to obtain a first fused feature; and performing semantic parsing on the first fused feature to obtain the semantic features of the cell.
[0091] In some embodiments, the acquisition module 81 acquires the spatial features of each cell based on the spatial information of each cell in the target table, including: taking each cell as a target cell, constructing a graph representation of the target cell, the graph representation of the target cell including a target node representing the target cell, and at least one neighboring node representing at least one neighboring cell of the target cell, each neighboring node being connected to the target node by a connecting edge, and the type of the connecting edge between the neighboring node and the target node matching the positional relationship between the corresponding neighboring cell and the target cell; encoding the graph representation of the target cell to obtain the spatial features of the target cell.
[0092] In some embodiments, the hierarchy determination module 83 obtains the hierarchy relationship between header cells based on the text information in several header cells, including: obtaining the header category of each header cell based on the text information in each header cell; determining the hierarchy relationship between header cells using the header category of each header cell; and / or, obtaining the text representation of at least two header cells based on the text information in at least two header cells, and determining the hierarchy relationship of each target header cell pair based on the text representation of each target header cell pair, wherein the target header cell pair includes two header cells located in the same row or column of at least two header cells.
[0093] In some embodiments, the header category of a header cell includes at least two of the following: table item name, table item, attribute name, total, and title; and / or, based on the text information in each header cell, the header category of each header cell is obtained, including: for each header cell, fusing the semantic features, spatial features, and category features of the header cell to obtain a second fused feature of the header cell, wherein the semantic features of the header cell are determined based on the text information in the header cell, and the category features of the header cell are determined based on the cell category corresponding to the header cell; and classifying the second fused feature of the header cell to obtain the header category of the header cell.
[0094] In some embodiments, the correspondence determination module 84 obtains the correspondence results between several data cells and at least one header cell based on the hierarchical relationship between each header cell, including: determining the data parsing direction of each header cell in the target table according to the hierarchical relationship between each header cell; and for each header cell, determining the correspondence result between at least one data cell in the target table located in the data parsing direction of the header cell and the header cell according to the data parsing direction.
[0095] In some embodiments, determining the data parsing direction of each header cell in the target table based on the hierarchical relationship between the header cells includes: treating each header cell as a cell to be parsed; determining the data parsing direction of the cell to be parsed as downward or upward in response to the hierarchical relationship between the cell to be parsed and its neighboring header cells in the same row being parallel; and determining the data parsing direction of the cell to be parsed as right or left in response to the hierarchical relationship between the cell to be parsed and its neighboring header cells in the same column being parallel.
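The direction rule and the subsequent matching of data cells can be sketched as follows. The relationship label "parallel", the grid of cell categories, and the function names are hypothetical. The specification does not say how to choose between downward and upward (or between right and left), so this sketch simply returns both candidates.

```python
STEPS = {"down": (1, 0), "up": (-1, 0), "right": (0, 1), "left": (0, -1)}


def parsing_directions(row_relation, column_relation):
    """Candidate data parsing directions for one header cell.

    row_relation: hierarchical relationship with the neighboring header
    cell in the same row (for example "parallel"), or None if there is none.
    column_relation: the same, for the neighboring header cell in the same column.
    """
    directions = []
    if row_relation == "parallel":
        directions += ["down", "up"]
    if column_relation == "parallel":
        directions += ["right", "left"]
    return directions


def data_cells_in_direction(categories, row, col, direction):
    """Collect the data cells reached by walking from a header cell along one direction."""
    d_row, d_col = STEPS[direction]
    found = []
    r, c = row + d_row, col + d_col
    while 0 <= r < len(categories) and 0 <= c < len(categories[r]):
        if categories[r][c] == "data":
            found.append((r, c))
        r, c = r + d_row, c + d_col
    return found


# Hypothetical cell categories for a table with one header row.
categories = [
    ["header", "header"],
    ["data", "data"],
    ["data", "data"],
]
candidates = parsing_directions("parallel", None)
matched = data_cells_in_direction(categories, 0, 1, "down")
```

For the header cell in the first row, the downward walk reaches both data cells of its column, while the upward walk leaves the table immediately and finds none.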
[0096] In some embodiments, the acquisition module 81 acquires the target table to be parsed, including: acquiring at least one original table; finding a regular expression corresponding to the table to be parsed from a preset database; and determining the original table as the target table in response to the similarity between the table name of the original table and the regular expression reaching a preset threshold.
[0097] Please refer to Figure 9, which is a schematic diagram of an embodiment of the electronic device provided in this application. In this embodiment, the electronic device 90 includes a processor 91 and a memory 92.
[0098] The processor 91 can also be referred to as a CPU (Central Processing Unit). The processor 91 may be an integrated circuit chip with signal processing capabilities. The processor 91 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor can be a microprocessor or any conventional processor.
[0099] The memory 92 in the electronic device 90 is used to store the program instructions required for the processor 91 to run.
[0100] The processor 91 is used to execute program instructions to implement the methods provided in any of the above embodiments and any non-conflicting combinations thereof.
[0101] Please refer to Figure 10, which is a schematic diagram of the structure of the computer-readable storage medium provided in this application. The computer-readable storage medium 100 of this application embodiment stores program instructions 101, which, when executed, implement the methods provided in any of the above embodiments and any non-conflicting combinations thereof. The program instructions 101 can form a program file and be stored in the computer-readable storage medium 100 in the form of a software product, so that a computer device (which may be a personal computer, server, or network device, etc.) can execute all or part of the steps of the methods of various embodiments of this application. The aforementioned computer-readable storage medium 100 includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks, or terminal devices such as computers, servers, mobile phones, and tablets.
[0102] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A table parsing method, characterized in that the method comprises: obtaining the target table to be parsed; determining, using the semantic features and spatial features of each cell in the target table, whether the cell category of each cell is a header cell or a data cell, wherein the semantic features are determined based on the text information in the corresponding cell, and the spatial features are obtained by encoding a graph representation constructed using the corresponding cell and at least one of its neighboring cells; the graph representation includes a target node representing the corresponding cell and at least one neighboring node representing at least one neighboring cell of the cell; each neighboring node is connected to the target node by a connecting edge, and the type of the connecting edge matches the positional relationship between the neighboring cell and the corresponding cell; obtaining the hierarchical relationship between the header cells based on the text information in several header cells; and obtaining, based on the hierarchical relationship between the header cells, the correspondence results between several data cells and at least one header cell; wherein obtaining the hierarchical relationship between the header cells based on the text information in several header cells includes: for each header cell, fusing the semantic features, spatial features, and category features of the header cell to obtain a second fused feature of the header cell, the category features being determined based on the cell category corresponding to the header cell; classifying the second fused feature of the header cell to obtain the header category of the header cell, the header category including at least two of the following: table item name, table item, attribute name, total, and title; and determining the hierarchical relationship between the header cells using the header category of each header cell.
2. The method of claim 1, wherein, The semantic features of each cell in the target table are obtained based on the text information in each cell, including: For each cell, the text information of the cell is encoded to obtain the text feature of the cell, and the auxiliary features of the cell are determined. The auxiliary features include at least one of the following: attribute features that characterize the attributes of the text information of the cell, and layout features that characterize the layout information of the cell in the target table. The semantic features of the cell are obtained by using the text features and auxiliary features of the cell.
3. The method of claim 2, wherein, The attributes of the text information include at least one of the following: the length of the text information, whether the text information is a date, whether the text information is purely numeric, the proportion of numeric values in the text information, and whether the text information begins with a numeric value; and / or, The layout information includes at least one of the following: the row and column of the cell, the number of neighboring cells of the cell, and the number of child cells contained in the cell; And / or, The step of obtaining the semantic features of the cell by utilizing the text features and auxiliary features of the cell includes: The text features and auxiliary features of the cell are fused to obtain the first fused feature; Semantic parsing is performed on the first fused feature to obtain the semantic features of the cell.
4. The method of claim 1, wherein, The step of obtaining the hierarchical relationship between the header cells based on the text information in several header cells also includes: Based on the text information in at least two of the header cells, the text representations of at least two of the header cells are obtained, and the hierarchical relationship of each target header cell pair is determined based on the text representations of each target header cell pair. The target header cell pair includes two header cells located in the same row or column of the at least two header cells.
5. The method of claim 1, wherein, The process of obtaining the correspondence between several data cells and at least one header cell based on the hierarchical relationship between the header cells includes: Based on the hierarchical relationship between the header cells, determine the data parsing direction of each header cell in the target table; For each of the header cells, based on the data parsing direction, determine the correspondence between at least one data cell in the target table located in the data parsing direction of the header cell and the header cell.
6. The method of claim 5, wherein, The step of determining the data parsing direction of each header cell in the target table based on the hierarchical relationship between the header cells includes: Each of the aforementioned header cells is treated as a cell to be parsed; In response to the fact that the hierarchical relationship between the cell to be parsed and the neighboring header cell in the same row is parallel, the data parsing direction of the cell to be parsed is determined to be downward or upward; In response to the fact that the hierarchical relationship between the cell to be parsed and the neighboring header cell in the same column is parallel, the data parsing direction of the cell to be parsed is determined to be either right or left.
7. The method of claim 1, wherein, The process of obtaining the target table to be parsed includes: Obtain at least one original table; Find the regular expression corresponding to the table to be parsed from the preset database; In response to the fact that the similarity between the table name of the original table and the regular expression reaches a preset threshold, the original table is determined to be the target table.
8. A table parsing apparatus, characterized in that the apparatus comprises: an acquisition module, used to acquire the target table to be parsed; a determination module, used to determine, using the semantic features and spatial features of each cell in the target table, whether the cell category of each cell is a header cell or a data cell, wherein the semantic features are determined based on the text information in the corresponding cell, and the spatial features are obtained by encoding a graph representation constructed using the corresponding cell and at least one of its neighboring cells; the graph representation includes a target node representing the corresponding cell and at least one neighboring node representing at least one neighboring cell of the cell; each neighboring node is connected to the target node by a connecting edge, and the type of the connecting edge matches the positional relationship between the neighboring cell and the corresponding cell; a hierarchy determination module, used to obtain the hierarchical relationship between the header cells based on the text information in several header cells; and a correspondence determination module, used to obtain the correspondence results between several data cells and at least one of the header cells based on the hierarchical relationship between the header cells; wherein obtaining the hierarchical relationship between the header cells based on the text information in several header cells includes: for each header cell, fusing the semantic features, spatial features, and category features of the header cell to obtain a second fused feature of the header cell, the category features being determined based on the cell category corresponding to the header cell; classifying the second fused feature of the header cell to obtain the header category of the header cell, the header category including at least two of the following: table item name, table item, attribute name, total, and title; and determining the hierarchical relationship between the header cells using the header category of each header cell.
9. An electronic device, characterized by comprising a memory and a processor connected to each other, wherein the memory stores program instructions, and the processor is used to execute the program instructions stored in the memory to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium is used to store program instructions that can be executed to implement the method of any one of claims 1-7.