Structured method of image table and training method of image table structuring model
By setting virtual anchor lines in the image and performing feature extraction and regression processing, the problem of inaccurate table line recognition in existing technologies is solved, achieving efficient and accurate table structuring.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG E COMMERCE BANK CO LTD
- Filing Date
- 2022-08-12
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, the recognition of table lines in images is not accurate enough and has a large error. In particular, methods based on semantic segmentation and object detection have a large computational load and unsatisfactory recognition results when recognizing large images.
The concept of virtual anchor lines is adopted, and multiple virtual anchor lines are set in the image. The coordinates of the table lines are generated through feature extraction and regression processing. The position of the table lines is determined by regression processing using feature vectors. An end-to-end network model is used for feature extraction and regression.
It improves the accuracy and efficiency of table line recognition, reduces the amount of computation, and can handle problems such as image rotation and tilt, thus achieving efficient table structuring.
Figure CN115457582B_ABST
Description
Technical Field
[0001] This specification relates to the field of image recognition technology, and in particular to a method for structuring image tables and a training method for image table structuring models.
Background Technology
[0002] Tables are an important form of information presentation, organizing data into a standardized structure that facilitates information retrieval and comparison. However, when tables exist as images, information retrieval and comparison of the table content within the image cannot be performed directly. Therefore, image recognition is required to extract the data and structural information from the table, revealing the distribution of rows and columns and the logical structure between cells. Ultimately, this allows for the reconstruction of the table document, enabling users to perform efficient and accurate data analysis based on the electronic table.
[0003] In related technologies, the recognition of table lines in images is not accurate enough and has a large error.
Summary of the Invention
[0004] To overcome the problems existing in related technologies, this specification provides a method for structuring image tables and a method for training image table structuring models.
[0005] According to a first aspect of the embodiments of this specification, a method for structuring an image table is provided, the image containing a table to be identified, the method comprising:
[0006] Multiple virtual anchor lines are set in the image, and the virtual anchor lines are defined by the original coordinates of a set of virtual anchor points arranged along the virtual anchor lines on the image.
[0007] Feature extraction is performed on the image to obtain the feature map corresponding to the image;
[0008] For each virtual anchor line, a corresponding feature vector is generated; wherein, the feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map;
[0009] Regression processing is performed on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and the corresponding regression coordinates are generated based on the feature vectors corresponding to the target virtual anchor lines. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0010] Based on the generated regression coordinates, a structured table corresponding to the table to be identified is generated.
[0011] According to a second aspect of the embodiments of this specification, a method for training an image table structured model is provided, the method comprising:
[0012] Obtain a training sample image set. Each sample image in the training sample image set includes a table line with known real coordinates, and multiple virtual anchor lines are set in the sample image. The virtual anchor lines are defined by a set of virtual anchor points arranged along the virtual anchor lines and their corresponding original coordinates on the sample image.
[0013] The training sample image set is input into the structured model to be trained, the structured model comprising: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein:
[0014] The feature extraction layer is used to extract features from the image to obtain the feature map corresponding to the image;
[0015] The feature vector determination layer is used to generate a corresponding feature vector for each virtual anchor line. The feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map.
[0016] The regression layer is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0017] The structured model is optimized based on the difference between the generated regression coordinates and the true coordinates corresponding to the sample image.
[0018] According to a third aspect of the embodiments of this specification, an image table structuring apparatus is provided, wherein the image contains a table to be identified, the apparatus comprising:
[0019] A setting unit is used to set multiple virtual anchor lines in the image, wherein the virtual anchor lines are defined by the original coordinates of a set of virtual anchor points arranged along the virtual anchor lines on the image;
[0020] A feature extraction unit is used to extract features from the image to obtain a feature map corresponding to the image;
[0021] The feature vector generation unit is used to generate a corresponding feature vector for each virtual anchor line; wherein, the feature vector corresponding to any virtual anchor line is generated in the following manner: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map.
[0022] The regression unit is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0023] The structured unit is used to generate a structured table corresponding to the table to be identified based on the generated regression coordinates.
[0024] According to a fourth aspect of the embodiments of this specification, a training apparatus for an image table structured model is provided, the apparatus comprising:
[0025] An acquisition unit is used to acquire a training sample image set. Each sample image in the training sample image set includes a table line with known real coordinates, and multiple virtual anchor lines are set in the sample image. The virtual anchor lines are defined by a set of virtual anchor points arranged along the virtual anchor lines and their corresponding original coordinates on the sample image.
[0026] The input unit is used to input the training sample image set into the structured model to be trained, the structured model including: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein:
[0027] The feature extraction layer is used to extract features from the image to obtain the feature map corresponding to the image;
[0028] The feature vector determination layer is used to generate a corresponding feature vector for each virtual anchor line. The feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map.
[0029] The regression layer is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0030] An optimization unit is used to optimize the structured model based on the difference between the generated regression coordinates and the true coordinates corresponding to the sample image.
[0031] According to a fifth aspect of the embodiments of this specification, an electronic device is provided, comprising:
[0032] A processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the steps of the method described in the first or second aspect above.
[0033] According to a sixth aspect of the embodiments of this specification, a computer-readable storage medium is provided, on which executable instructions are stored; wherein, when executed by a processor, the instructions implement the steps of the method described in the first or second aspect above.
[0034] The technical solutions provided in this specification can achieve the following beneficial effects:
[0035] This specification proposes the concept of virtual anchor lines for objects with a large aspect ratio, such as table lines. By setting multiple virtual anchor lines in the image, a feature vector corresponding to each virtual anchor line can be generated based on the mapping relationship between the image and the corresponding feature map. Finally, regression processing is performed on the feature vector to obtain the coordinates of the table line and realize the structuring of the table. Since the virtual anchor lines are specifically designed for objects with a large aspect ratio, the final recognition result can be guaranteed to have high accuracy and is not affected by image rotation, tilt, etc. At the same time, since only feature vectors need to be classified and regressed, there is no need to classify and label each pixel, which effectively reduces the amount of computation and improves the efficiency of table line recognition.
[0036] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this specification.
Attached Figure Description
[0037] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this specification and, together with the description, serve to explain the principles of this specification.
[0038] Figure 1 is a flowchart illustrating a structuring method for an image table provided in an exemplary embodiment of this specification;
[0039] Figure 2 is a schematic diagram of a set of virtual anchor lines provided in an exemplary embodiment of this specification;
[0040] Figure 3 is a schematic diagram of an input image provided in an exemplary embodiment of this specification;
[0041] Figure 4 is a schematic diagram of a virtual anchor point position provided in an exemplary embodiment of this specification;
[0042] Figure 5 is a schematic diagram illustrating a feature extraction method provided in an exemplary embodiment of this specification;
[0043] Figure 6 is a schematic diagram illustrating the mapping relationship between an input image and a feature map provided in an exemplary embodiment of this specification;
[0044] Figure 7 is a schematic diagram illustrating the calculation of mapped coordinates on a feature map provided in an exemplary embodiment of this specification;
[0045] Figure 8 is a schematic diagram illustrating the calculation of mapped coordinates on another feature map provided in an exemplary embodiment of this specification;
[0046] Figure 9 is a schematic diagram of a structured model provided in an exemplary embodiment of this specification;
[0047] Figure 10 is a schematic diagram of the structure of a feature extraction layer provided in an exemplary embodiment of this specification;
[0048] Figure 11 is a schematic diagram of the structure of a feature extraction layer provided in an exemplary embodiment of this specification;
[0049] Figure 12 is a schematic diagram of an input image and feature map provided in an exemplary embodiment of this specification;
[0050] Figure 13 is a flowchart illustrating an image table structuring method based on a structured model provided in an exemplary embodiment of this specification;
[0051] Figure 14 is a flowchart illustrating a post-processing method provided in an exemplary embodiment of this specification;
[0052] Figure 15 is a flowchart illustrating a training method for an image table structured model provided in an exemplary embodiment of this specification;
[0053] Figure 16 is a schematic structural diagram of an electronic device provided in an exemplary embodiment of this specification;
[0054] Figure 17 is a block diagram of a structured apparatus for an image table provided in an exemplary embodiment of this specification;
[0055] Figure 18 is a block diagram of a training apparatus for an image table structured model provided in an exemplary embodiment of this specification.
Detailed Implementation
[0056] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification.
[0057] It should be noted that the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification in other embodiments. In some other embodiments, the methods may include more or fewer steps than those described in this specification. Furthermore, a single step described in this specification may be broken down into multiple steps in other embodiments; conversely, multiple steps described in this specification may be combined into a single step in other embodiments. It should be understood that although the terms first, second, third, etc., may be used in this specification to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of this specification, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein can be interpreted as "when," "in response to a determination," or "in the event of a determination."
[0058] In related technologies, image table structuring methods based on deep learning mainly include semantic segmentation-based and object detection-based methods. Semantic segmentation-based methods assign labels to each pixel of the input image. However, when the input image is large, this approach inevitably suffers from high computational complexity. Furthermore, if a pixel is misidentified, it can cause breaks in the table lines, affecting the final table line recognition result and leading to significant errors. Object detection-based methods generate a series of quadrilateral candidate boxes from the input image using specific algorithms. Then, a deep neural network extracts features and classifies the content of each quadrilateral candidate box. Finally, candidate box position regression and redundancy removal are performed to obtain the object detection result. However, because table lines are a special shape with a large aspect ratio, the aforementioned quadrilateral candidate boxes often cannot accurately and completely cover the table lines, resulting in unsatisfactory recognition results.
[0059] In view of this, this specification proposes a structuring scheme for image tables that improves upon the object detection-based structuring methods in related technologies, thereby solving the aforementioned technical problems. The image table structuring method and the training method for the image table structuring model are described in detail below with reference to Figures 1-9.
[0060] Figure 1 is a flowchart of a method for structuring an image table provided in an exemplary embodiment of this specification. The method may include the following steps:
[0061] Step 102: Set multiple virtual anchor lines in the image. The virtual anchor lines are defined by the original coordinates of a set of virtual anchor points arranged along the virtual anchor lines on the image.
[0062] As mentioned earlier, unlike the quadrilateral candidate boxes used in object detection, this specification proposes the concept of virtual anchor lines to accurately match table lines, which have a large aspect ratio. In one embodiment, multiple virtual anchor lines can be set as follows: starting virtual anchor points are selected at the image boundary, and a set of virtual anchor lines is generated for each starting virtual anchor point. The set of virtual anchor lines corresponding to any starting virtual anchor point is generated by: determining, in the plane where the image is located, a set of rays with that starting virtual anchor point as the endpoint, wherein adjacent rays have a preset angular interval; determining a set of remaining virtual anchor points on each ray; and combining the starting virtual anchor point with each set of remaining virtual anchor points to form multiple sets of virtual anchor points, which define the set of virtual anchor lines corresponding to that starting virtual anchor point. Figure 2 is a schematic diagram of a set of virtual anchor lines provided in an exemplary embodiment of this specification. As shown in Figure 2, assume the input image size is 400×400×3 (as shown in Figure 3); the input image can be considered to have a height (H) of 400, a width (W) of 400, and 3 channels. The number of channels refers to how many values each pixel stores. For example, in an RGB color image, each pixel stores 3 values, corresponding to 3 channels. Therefore, the 400×400×3 input image can also be regarded as a 400×400 RGB color image. This input image includes the table to be identified 201. A starting virtual anchor point A can then be selected at an image boundary (for example, the lower boundary of the image shown in Figure 2).
Of course, to match the table lines as closely as possible, the starting virtual anchor points can be arranged densely and evenly along the image boundary, depending on the image size. Taking the image shown in Figure 2 as an example, since its size is 400×400, a starting virtual anchor point can be set every two pixels. This distributes the virtual anchor lines densely within the image plane, increasing the probability of hitting the table lines. Furthermore, since the starting point of a table line may lie on any side of the image boundary, starting virtual anchor points can be set on each boundary of the image. For example, in the image shown in Figure 2, a starting virtual anchor point is selected every two pixels along each of its four edges, resulting in a total of 800 starting virtual anchor points (200 on each boundary). Of course, the starting virtual anchor points can also be set inside the image rather than at its boundaries, and can be adjusted appropriately based on the input image. From the starting virtual anchor point A, a set of rays can be obtained, with a preset angular interval between adjacent rays. For example, within an angle range of 0 to 180 degrees, if the preset angular interval is 10 degrees, starting virtual anchor point A corresponds to 17 rays, giving a total of 13,600 rays in the image (200×4×17). A set of remaining virtual anchor points is then determined on each ray and combined with the corresponding starting virtual anchor point to form the virtual anchor lines.
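The counts in this example can be checked with a minimal sketch. The function names (`starting_points`, `ray_angles`, `enumerate_rays`) and the tuple layout are illustrative assumptions, not part of the specification. The angles are taken strictly between 0 and 180 degrees, which is one reading that yields the 17 rays per starting point stated in the text.

```python
def starting_points(width, height, step):
    """Starting virtual anchor points on the four image boundaries,
    one every `step` pixels (hypothetical layout: (boundary, x, y))."""
    points = []
    for x in range(0, width, step):
        points.append(("bottom", x, 0))
        points.append(("top", x, height))
    for y in range(0, height, step):
        points.append(("left", 0, y))
        points.append(("right", width, y))
    return points


def ray_angles(interval_deg):
    """Ray angles in degrees, strictly between 0 and 180 (assumed reading)."""
    return list(range(interval_deg, 180, interval_deg))


def enumerate_rays(width, height, step, interval_deg):
    """One (boundary, x, y, angle) tuple per ray; the angle is measured
    relative to the boundary that holds the starting point."""
    return [
        (side, x, y, angle)
        for (side, x, y) in starting_points(width, height, step)
        for angle in ray_angles(interval_deg)
    ]


points = starting_points(400, 400, 2)
angles = ray_angles(10)
rays = enumerate_rays(400, 400, 2, 10)
print(len(points), len(angles), len(rays))  # 800 17 13600
```

Changing `step` or `interval_deg` shows how quickly the number of candidate lines grows, which is the trade-off between coverage and computation that the paragraph describes.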
[0063] In one embodiment, when an original coordinate system is established with the boundary where the starting virtual anchor point is located as the horizontal axis and the straight line perpendicular to that boundary as the vertical axis, a set of positioning points can be determined on each ray according to a preset set of fixed vertical axis coordinate values, and this set of positioning points is used as the set of remaining virtual anchor points for that ray. As shown in Figure 4, the input image size is 400×400, and a set of vertical axis coordinate values can be fixed, for example, 40, 80, 120, 160, 200, 240, 280, 320, 360, and 400. Once this set of values is fixed, since each ray is determined, the positioning point corresponding to each fixed vertical axis coordinate value can be determined on any ray. Connecting these positioning points with the starting virtual anchor point yields a virtual anchor line; that is, the starting virtual anchor point A and the remaining virtual anchor points (for example, points B, C, D, E, F, G, H, I, J, and K shown in Figure 4) together form a virtual anchor line. To label the remaining virtual anchor points more clearly, the table to be identified 201 described above is not shown in Figure 4. It should be noted that although the set of vertical axis coordinate values is evenly spaced in the embodiment illustrated in Figure 4, in some embodiments it can also be unevenly spaced, for example, 40, 80, 130, 190, and so on. Of course, the number of vertical axis coordinate values can also be adjusted according to actual needs and is not necessarily the 10 fixed values shown in Figure 4; this specification does not limit this.
[0064] In practice, once the set of vertical axis coordinate values is fixed, a virtual anchor line can be represented by a corresponding set of horizontal axis coordinates and an included angle. In other words, when an original coordinate system is established with the boundary where the starting virtual anchor point is located as the horizontal axis and the line perpendicular to that boundary as the vertical axis, and the remaining virtual anchor points are selected on each ray according to the preset vertical axis coordinate values, the horizontal axis coordinates of a set of virtual anchor points can be used to represent the virtual anchor line defined by that set. For example, suppose the fixed vertical axis coordinates are 40, 80, 120, 160, 200, 240, 280, 320, 360, and 400; that is, 10 fixed values are used as the vertical axis coordinates of the virtual anchor points. If the starting virtual anchor point A has coordinates (100, 0) and the included angle is 90 degrees, i.e., the line is perpendicular to the lower boundary of the image, then the anchor line can be uniquely represented by the horizontal axis coordinates (100, 100, 100, 100, 100, 100, 100, 100, 100, 100) together with the included angle of 90 degrees, and the vertical axis coordinates no longer need to be stored. As another example, if the tangent of the angle is 3 / 4 (i.e., tanθ = 3 / 4), the virtual anchor line starting from the starting virtual anchor point A can be uniquely represented by the set of horizontal axis coordinates (130, 160, 190, 220, 250, 280, 310, 340, 370, 400).
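The compact representation can be sketched as follows. This is an illustration under stated assumptions, not the specification's implementation: the function name `anchor_line_xs` is hypothetical, and the angle is assumed to be measured from the horizontal axis, so that a point at height y lies at x = x_start + y / tanθ. Under that convention the 90 degree example reproduces the constant coordinates (100, ..., 100). The second coordinate set in the text advances 30 horizontally for every 40 vertically, which under this convention corresponds to a slope of 4/3; the sketch uses that slope and does not attempt to reconcile it with the tangent value stated in the text.

```python
import math

# Fixed vertical axis coordinates from the example in the text.
FIXED_Y = [40, 80, 120, 160, 200, 240, 280, 320, 360, 400]


def anchor_line_xs(x_start, angle_rad, ys=FIXED_Y):
    """Horizontal coordinates of the anchor points on a line that starts at
    (x_start, 0) and makes `angle_rad` with the horizontal axis (assumed
    convention). Uses cos/sin instead of 1/tan so 90 degrees is well behaved."""
    cot = math.cos(angle_rad) / math.sin(angle_rad)
    return [round(x_start + y * cot, 6) for y in ys]


# Perpendicular anchor line from A = (100, 0).
vertical = anchor_line_xs(100, math.radians(90))

# Slanted line with a rise of 4 per run of 3 (assumed reading of the example).
slanted = anchor_line_xs(100, math.atan2(4, 3))

print(vertical)
print(slanted)
```

Because the vertical coordinates are shared by every anchor line, only the ten horizontal values and the angle need to be stored per line.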
[0065] Step 104: Extract features from the image to obtain the feature map corresponding to the image.
[0066] After the virtual anchor lines are set, feature extraction can be performed on the input image. Feature extraction maps the input digital image onto a matrix that the computer can process, where each matrix value corresponds to a feature point. In one embodiment, feature extraction can be achieved by applying convolution operations to the input image, thereby obtaining the feature map corresponding to the image. Figure 5 is a schematic diagram illustrating a feature extraction method provided in an exemplary embodiment of this specification. As shown in Figure 5, the convolution operation essentially computes a weighted sum over a pixel and its neighboring pixels. The convolution kernel is a fixed-size array of weights; Figure 5 shows a 3×3 convolution kernel. The value "-8" on the feature map is computed as: 4×0+0×0+0×0+0×0+0×1+0×1+0×0+0×1+(-4)×2=-8. The size of the feature map obtained after the weighted summation depends on the size of the convolution kernel, the stride, and the amount of padding. The downsampling factor refers to the factor by which the feature map is reduced relative to the corresponding input image. The number of channels in the feature map is determined by the number of convolution kernels: N convolution kernels yield a feature map with N channels, which allows the channel dimension to be increased or decreased. For example, if the input image is a three-channel RGB color image, convolution with 32 convolution kernels yields a 32-channel feature map, thus increasing the channel dimension.
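The weighted sum in this example can be reproduced with a short sketch. The kernel and patch values are read off the arithmetic in the text, under the assumption that the first factor of each product is the kernel weight and the second is the pixel value; Figure 5 itself is not available here. The helper `conv_output_size` uses the standard output-size relation for a convolution, and the stride and padding values passed to it are illustrative only.

```python
def weighted_sum(kernel, patch):
    """Element-wise product of a kernel and an image patch, summed."""
    return sum(
        k * p
        for k_row, p_row in zip(kernel, patch)
        for k, p in zip(k_row, p_row)
    )


def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Standard spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Values taken from the products listed in the text (assumed factor order).
kernel = [[4, 0, 0],
          [0, 0, 0],
          [0, 0, -4]]
patch = [[0, 0, 0],
         [0, 1, 1],
         [0, 1, 2]]

value = weighted_sum(kernel, patch)
same_size = conv_output_size(400, 3, stride=1, padding=1)
halved = conv_output_size(400, 3, stride=2, padding=1)
print(value, same_size, halved)  # -8 400 200
```

With stride 2 the 400-pixel side becomes 200, i.e. a downsampling factor of 2 for that layer; stacking such layers multiplies the factors.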
[0067] Step 106: Generate a corresponding feature vector for each virtual anchor line; wherein, the feature vector corresponding to any virtual anchor line is generated in the following manner: based on the mapping relationship between the image and the feature map, determine the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map, and generate a feature vector based on the pixel features located at the mapping coordinates in the feature map.
[0068] After the feature map of the input image is obtained, a corresponding feature vector can be generated for each virtual anchor line on the input image. Still taking the 10 virtual anchor points (B, C, D, E, F, G, H, I, J, and K) shown in Figure 4 as an example, the corresponding mapped virtual anchor points can be determined on the feature map based on the mapping relationship between the input image and the feature map, and the feature vector can then be obtained from the pixel features of the mapped virtual anchor points on the feature map. Assuming the feature map has N channels and the number of virtual anchor points is 10, a 10×N feature vector is obtained. Of course, in the embodiment shown in Figure 2, since 13,600 virtual anchor lines can be set in the input image, 13,600 feature vectors of size 10×N are obtained accordingly.
[0069] In one embodiment, an original coordinate system can be established with the image boundary where the starting virtual anchor point is located as the horizontal axis, the straight line perpendicular to that boundary as the vertical axis, and a vertex of that boundary as the origin. This determines the set of original coordinates of the virtual anchor points in the original coordinate system. Based on the downsampling factor between the image and the feature map, the vertical axis coordinate value in the original coordinate system is reduced by that factor, and the reduced value is used as the vertical axis coordinate value of the virtual anchor point in the mapped coordinate system. Based on the vertical axis coordinate value in the mapped coordinate system and the angle between the virtual anchor line and the horizontal axis of the coordinate system, the horizontal axis mapping value of the virtual anchor point on the feature map is determined. The horizontal axis coordinate value in the original coordinate system is also reduced by the downsampling factor, and the sum of this reduced value and the horizontal axis mapping value is used as the horizontal axis coordinate value of the virtual anchor point in the mapped coordinate system.
[0070] Because there is a mapping relationship between the obtained feature map and the input image (as shown in Figure 6), the mapped coordinates on the feature map of the virtual anchor points corresponding to any virtual anchor line can be obtained based on this mapping relationship. Assume the original coordinates of the starting virtual anchor point A of the virtual anchor line on the input image are (X_orig, Y_orig), and the mapped coordinates of the mapped anchor point B' on the feature map, which corresponds to the virtual anchor point B on the input image, are (X_j, Y_j). The corresponding conversion formula is as follows:
[0071]
[0072] To better understand the above conversion formula, refer to Figure 7, where δ_back is the downsampling factor mentioned earlier. Since the vertical axis coordinates of the virtual anchor points in the input image can be a set of fixed values, the vertical axis coordinates of the mapped coordinates can be obtained directly by scaling down according to the downsampling factor. In other words, Y_j is known, so the horizontal axis mapping value A'P can be obtained using trigonometric functions, thereby determining the mapped coordinates (X_j, Y_j) of the mapped virtual anchor point B'.
[0073] Since the starting virtual anchor point of the virtual anchor line can be located on the horizontal axis of the coordinate system, the corresponding value of Y_orig can be zero, so the above conversion formula can be simplified to:
[0074]
[0075] In the embodiment shown in Figure 7, the origin of the coordinate system is located at a boundary vertex of the input image, i.e., the zero point shown in Figure 7. In some embodiments, however, the starting virtual anchor point can also be used as the origin of the coordinate system, as shown in Figure 8. Figure 8 is a schematic diagram illustrating the calculation of mapped coordinates on another feature map provided in an exemplary embodiment of this specification, with the starting virtual anchor point as the origin of the coordinate system. The corresponding conversion formula can be expressed as:
[0076]
[0077] Therefore, the conversion formula described in this specification calculates the horizontal axis coordinate value in the mapped coordinate system based on the included angle of the virtual anchor line and a fixed vertical axis coordinate. By adjusting the conversion formula at different coordinate origins or different starting virtual anchor point positions, the formula can be simplified, thereby reducing the computational load. Of course, the specific setting of the coordinate origin or the position of the starting virtual anchor point can be determined according to actual needs; this specification does not impose any restrictions. It is even possible to set the origin of the coordinate system in the input image as the boundary vertex of the input image, and set the origin of the coordinate system on the feature map as the mapped starting virtual anchor point of the corresponding starting virtual anchor point on the feature map.
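As a rough illustration of the mapping described in paragraph [0069], the following Python sketch assumes that the origin lies at the boundary vertex, that the starting virtual anchor point lies on the horizontal axis at (x_orig, 0), and that the virtual anchor line makes an angle theta with the horizontal axis. The function name, the parameter names, and the cotangent form of the horizontal offset are assumptions inferred from the prose; this is not a reproduction of the conversion formula in the figures.

```python
import math

def map_anchor_point(x_orig, y_fixed, theta_deg, downsample):
    """Map one virtual anchor point onto the feature map (hypothetical helper).

    x_orig     : horizontal coordinate of the starting anchor point on the image boundary
    y_fixed    : pre-set vertical coordinate of the anchor point in the input image
    theta_deg  : angle between the anchor line and the horizontal axis, in degrees
    downsample : downsampling factor between the image and the feature map
    """
    # The vertical coordinate is simply scaled down by the downsampling factor.
    y_j = y_fixed / downsample
    # Horizontal offset of the point on the ray at height y_j (assumed trigonometric form).
    offset = y_j / math.tan(math.radians(theta_deg))
    # Add the scaled horizontal coordinate of the starting anchor point.
    x_j = x_orig / downsample + offset
    return x_j, y_j

x_j, y_j = map_anchor_point(x_orig=80, y_fixed=160, theta_deg=45, downsample=8)
```

With these illustrative inputs the anchor point maps to roughly (30, 20) on the feature map. If the starting virtual anchor point is used as the origin instead, the `x_orig / downsample` term drops out, which matches the simplification discussed above.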
[0078] After obtaining the mapped coordinates corresponding to the virtual anchor points, the corresponding pixel features can be determined based on these coordinates. Taking a single-channel feature map as an example, the mapped coordinates are (2, 3), (3, 4), and (3, 5), and the corresponding pixel feature values are 5, 3, and 4, respectively. The resulting feature vector can then be represented as [5, 3, 4] in matrix form. Since feature maps often have multiple channels, the corresponding feature vector can also be a multi-dimensional vector corresponding to the number of channels.
[0079] Step 108: Perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor lines. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0080] Regression processing can be performed on the feature vector obtained for each virtual anchor line to determine whether a table line exists for the virtual anchor line corresponding to that feature vector. If a table line exists, the virtual anchor line is called a target virtual anchor line, and the corresponding regression coordinates are obtained. In statistics, regression refers to using a set of known sample data to predict the output value corresponding to input data. It can be modeled with a prediction function, whose parameters are determined from the known sample data. The parameters can be determined using analytical methods (i.e., closed-form solutions) or gradient descent. However, when the number of samples is large, the computational load of analytical methods grows sharply, resulting in low efficiency. Iteratively determining the parameters by gradient descent is therefore more computationally efficient and more widely applicable. The regression coordinates obtained through the prediction function can be used as the coordinates, in the image, of the table line of the table to be identified. Because gradient descent determines the parameters faster and with less computation than analytical methods, it also improves the efficiency of table line recognition.
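To make the contrast between the two ways of determining parameters concrete, the sketch below fits a small linear prediction function both in closed form and by gradient descent. The toy data, learning rate, and step count are assumptions chosen for illustration and have no connection to the table line data of this specification.

```python
import numpy as np

# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column, then x
y = np.array([1.0, 3.0, 5.0, 7.0])

# Analytical (closed-form) least-squares solution.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

def fit_by_gradient_descent(X, y, lr=0.1, steps=2000):
    """Iteratively minimise the mean squared error of a linear model."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the mean squared error
        w -= lr * grad
    return w

w_gd = fit_by_gradient_descent(X, y)
```

Both approaches recover an intercept close to 1 and a slope close to 2 on this small example. The iterative version only needs matrix-vector products per step, which is why it is commonly preferred for large models.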
[0081] In one embodiment, the above feature vectors can also be averaged before regression processing. Taking feature vectors [5, 3, 4] as an example, after averaging them, the average feature vector [4] is obtained.
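The single-channel example from paragraphs [0078] and [0081] can be sketched as follows. The 6×6 map size, the (x, y) reading of the mapped coordinates, the `feature_map[y, x]` indexing, and the helper name are assumptions; mapped coordinates are taken to be integers here.

```python
import numpy as np

# Hypothetical 6x6 single-channel feature map, indexed as feature_map[y, x].
feature_map = np.zeros((6, 6))
mapped_coords = [(2, 3), (3, 4), (3, 5)]  # (x, y) pairs from the example
for (x, y), value in zip(mapped_coords, [5, 3, 4]):
    feature_map[y, x] = value

def anchor_line_features(fmap, coords):
    """Collect the pixel feature at each mapped virtual anchor point."""
    return np.array([fmap[y, x] for x, y in coords])

features = anchor_line_features(feature_map, mapped_coords)  # [5, 3, 4]
averaged = features.mean()                                   # (5 + 3 + 4) / 3
```

Here `features` holds the values 5, 3, and 4, and `averaged` is 4, matching the averaged feature vector [4] in the example.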
[0082] As mentioned earlier, coordinate regression can be achieved using a mathematical model established by a prediction function. Corresponding models could therefore be set up separately for steps 102, 104, and 106 to obtain the regression coordinates. However, using multiple models inevitably leads to coordination issues, as each model has a different task. This can result in significant errors in the final regression coordinates, because the errors of the earlier models accumulate in the last model and further affect the accuracy of the results. Therefore, a single model divided into multiple network layers can be used instead. The image is input directly into the model, and the output is the corresponding regression coordinates; that is, an end-to-end network model is used that does not rely on separate sub-models. This specification refers to this end-to-end network model as a structured model, which may include a feature extraction layer, a feature vector determination layer, and a regression layer. For an introduction to this model, please refer to the embodiment shown in Figure 9, which is not described in detail here.
[0083] Step 110: Generate a structured table corresponding to the table to be identified based on the generated regression coordinates.
[0084] Based on the relative positions of the regression coordinates, the corresponding order of the identified table lines can be determined, and corresponding cells can be constructed. If the table to be identified includes text content, the text content obtained from the image based on optical character recognition (OCR) is filled into the corresponding cells that match the text boxes, thus generating a structured table corresponding to the table to be identified. After obtaining the regression coordinates, the rows and columns can be sorted according to their relative positions. The intersection of rows and columns yields each cell in the table to be identified, and the cells are sorted according to their row and column (i.e., the cell's border coordinates). Finally, based on optical character recognition (OCR) technology and the IOU (Intersection over Union) matching method, the text boxes are matched one-to-one with the corresponding cells, and the obtained text content is filled into the corresponding cells, completing the table structuring process.
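The IoU-based matching of OCR text boxes to cells can be sketched as below. Boxes are assumed to be axis-aligned tuples (x1, y1, x2, y2), and the function names and sample data are hypothetical. This is a simple greedy assignment that gives each text box the cell with the highest IoU; it does not enforce a strict one-to-one constraint.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_text_to_cells(text_boxes, cells):
    """Assign each (box, text) pair to the index of the best-overlapping cell."""
    filled = {}
    for box, text in text_boxes:
        best = max(range(len(cells)), key=lambda i: iou(box, cells[i]))
        if iou(box, cells[best]) > 0:  # skip text that overlaps no cell
            filled[best] = text
    return filled

cells = [(0, 0, 100, 50), (100, 0, 200, 50)]
text_boxes = [((10, 10, 60, 40), "Name"), ((120, 10, 180, 40), "Amount")]
filled = match_text_to_cells(text_boxes, cells)
```

In this example `filled` maps cell 0 to "Name" and cell 1 to "Amount".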
[0085] In one embodiment, the image can be input into a pre-trained structured model as described above, as shown in Figure 9. Figure 9 is a schematic diagram of a structured model provided in an exemplary embodiment of this specification. The structured model may include a feature extraction layer 901, a feature vector determination layer 902, and a regression layer 903, wherein:
[0086] The feature extraction layer 901 is used to extract features from the image to obtain feature maps of N channels corresponding to the image.
[0087] The feature vector determination layer 902 is used to generate a corresponding M×N feature vector for the pixel features of the M virtual anchor points of each virtual anchor line on the feature map of the N channels, and to perform global average pooling on the generated M×N feature vector to obtain a 1×N feature vector composed of a single average pixel feature.
[0088] The regression layer 903 is used to perform regression processing on the 1×N feature vectors composed of single average pixel features corresponding to the multiple virtual anchor lines, so as to determine the target virtual anchor line that matches the table line of the table to be identified, and obtain the 1×M regression coordinate vector corresponding to the target virtual anchor line, wherein each element in the regression coordinate vector is used to represent the horizontal axis coordinate value corresponding to the table line regressed by the virtual anchor line.
[0089] The structured model described in this specification can be based on a CNN model, with adjustments made to it. A CNN model may include an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer, thereby extracting features from the input image and obtaining multi-channel feature maps through convolutions and pooling operations. There are many types of CNN models, such as ResNet, VGG, and DenseNet. This specification does not limit itself to a specific CNN model; for example, ResNet or VGG can be used as the baseline model. The following section uses ResNet-50 as the baseline model to provide a detailed description of each network layer of the structured model described in this specification.
[0090] The feature extraction layer 901 can extract features from the image to obtain the corresponding feature map. As shown in Figure 10, ResNet-50 can be divided into five stages: stage 0, stage 1, stage 2, stage 3, and stage 4. Stage 0 has a relatively simple structure and can be regarded as preprocessing of the input. Stages 1, 2, 3, and 4 are each composed of bottleneck layers A and B, as shown in Figure 11; the only difference between the stages is the number of bottleneck layers. Stage 1 contains 3 bottleneck layers, stage 2 contains 4, stage 3 contains 6, and stage 4 contains 3. The most distinctive feature of ResNet-50 is the bottleneck layer shown in Figure 11, which significantly reduces computation through its 1×1 convolution kernels. Furthermore, thanks to the residual structure within the bottleneck layer (i.e., a shortcut connection carries the input directly to the ReLU layer, so that the ReLU layer receives two inputs), even a network as deep as the 50-layer ResNet-50 can effectively avoid degradation issues. In essence, ResNet-50 can be viewed as repeatedly convolving and pooling the input image to obtain the final feature map. The input (224×224×3) marked in Figure 10 can be regarded as an input image with a height of 224, a width of 224, and 3 channels. In stage 0 of Figure 10, the input image, after convolution, is processed by Batch Normalization (BN) and ReLU layers. The role of the BN layer is to pull the input distribution back to a standardized distribution with a mean of 0 and a variance of 1, so that the values fed to the activation function produce more pronounced gradients during backpropagation, making convergence easier and mitigating the vanishing gradient problem. ReLU is the activation function used by ResNet-50.
The activation function ultimately determines whether a signal is transmitted to the next neuron and what is transmitted. Activation functions are introduced for their non-linearity, which allows the neural network to approximate arbitrary non-linear functions. Furthermore, because the ReLU activation function is piecewise linear, it is faster to compute than activation functions such as sigmoid and tanh. Taking the input image (224×224×3) in Figure 10 as an example, after processing by the convolutional layer shown in stage 0, a 64-channel feature map of 112×112×64 is obtained. For details on the changes between the input and the output, please refer to Figure 12. As shown in Figure 12, the number of output feature maps equals the number of convolution kernels in the convolutional layer, which is also the number of channels of the output. Therefore, after the input image (224×224×3) has been processed through stages 0, 1, 2, 3, and 4, the output is a three-dimensional feature map of 7×7×2048; that is, the feature map size is 7×7 and the number of feature maps, or channels, is 2048, corresponding to a downsampling factor of 32. Of course, as mentioned earlier, feature maps of different sizes can be obtained by adjusting parameters such as the stride of the convolution kernels in the ResNet-50 model, and different numbers of channels can be obtained by adjusting the number of convolution kernels. In other words, the parameters of the output feature map shown in Figure 10 (height H, width W, and number of channels C) are merely illustrative and can be adjusted by changing the model parameters according to actual needs.
[0091] The feature vector determination layer 902 is used to generate an M×N feature vector from the pixel features of the M virtual anchor points of each virtual anchor line on the feature maps of the N channels, and then performs global average pooling on the generated M×N feature vector to obtain a 1×N feature vector composed of single average pixel features. Taking as an example the 10 virtual anchor points (B, C, D, E, F, G, H, I, J, and K) shown in Figure 4, an input image of 400×400×3, and a feature map of 50×50×1024 (i.e., a downsampling factor of 8), the pixel features of the 10 virtual anchor points on the feature map of each channel can be determined, yielding a 10×1024 feature vector. After global average pooling to reduce the number of parameters, a 1×1024 feature vector is obtained.
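The relationship between per-stage strides, the feature map size, and the overall downsampling factor can be traced with a few lines of arithmetic. The stride and channel values below are those of a standard ResNet-50 and are assumptions here, as is the split of stage 0 into a convolution and a max pooling step; the stages in Figure 10 may be grouped differently.

```python
# (name, stride, output channels), assumed values for a standard ResNet-50.
STAGES = [
    ("stage 0 conv", 2, 64),
    ("stage 0 max pool", 2, 64),
    ("stage 1", 1, 256),
    ("stage 2", 2, 512),
    ("stage 3", 2, 1024),
    ("stage 4", 2, 2048),
]

def trace_feature_map(input_size, stages=STAGES):
    """Return the (name, H, W, C) shape after each step and the total downsampling factor."""
    size, factor, shapes = input_size, 1, []
    for name, stride, channels in stages:
        size //= stride      # each stride-2 step halves the spatial size
        factor *= stride     # the strides multiply into the downsampling factor
        shapes.append((name, size, size, channels))
    return shapes, factor

shapes, factor = trace_feature_map(224)
```

For a 224×224 input this gives 112×112×64 after the first convolution and 7×7×2048 after stage 4, with a total downsampling factor of 32.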
[0092] Regression layer 903 performs regression processing on the 1×N feature vectors, which are composed of single average pixel features, corresponding to multiple virtual anchor lines, to determine the target virtual anchor line matching the table line to be identified, and obtains a 1×M regression coordinate vector corresponding to the target virtual anchor line. Each element in the regression coordinate vector represents the horizontal axis coordinate value corresponding to the table line regressed from the virtual anchor line. Taking the 1×1024 feature vector as an example, the 1×1024 feature vector can be input into a 1024×10 fully connected regression layer. The output of this fully connected regression layer is the coordinate point of the corresponding table line, that is, the output is a 1×10 regression coordinate vector, where each element of the regression coordinate vector represents the horizontal axis coordinate value of the regressed table line. Because a set of y-axis coordinate values is fixed, only a set of horizontal axis coordinate values is needed to represent a corresponding table line.
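The tensor shapes in the 400×400 example can be checked with the NumPy sketch below. It is only a shape walk-through under stated assumptions: the feature map is random data standing in for the feature extraction layer, the mapped coordinates are made up, the fully connected weights are untrained, and the decision of whether an anchor line matches a table line is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

image_size, downsample = 400, 8
fmap_size = image_size // downsample          # 50
num_channels, num_points = 1024, 10           # N channels, M anchor points

# Stand-in for the output of the feature extraction layer (H x W x N).
feature_map = rng.standard_normal((fmap_size, fmap_size, num_channels))

# Hypothetical integer mapped coordinates of the M anchor points of one anchor line.
xs = rng.integers(0, fmap_size, size=num_points)
ys = np.linspace(0, fmap_size - 1, num_points).astype(int)

point_features = feature_map[ys, xs]                      # M x N = 10 x 1024
pooled = point_features.mean(axis=0, keepdims=True)       # average over points: 1 x N

# Fully connected regression layer (N x M) with untrained random weights.
weights = rng.standard_normal((num_channels, num_points)) * 0.01
bias = np.zeros(num_points)
regressed_x = pooled @ weights + bias                     # 1 x M = 1 x 10
```

Each of the 10 outputs in `regressed_x` plays the role of one horizontal axis coordinate, paired with one of the fixed vertical axis coordinates.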
[0093] Based on the structure of the above-mentioned structured model, an image can be directly input into the structured model, and its output is the coordinates of the corresponding table lines. Through this end-to-end structured model, the accuracy of the table line coordinates is guaranteed, while also accelerating the efficiency of table line recognition.
[0094] Figure 13 is a flowchart of a method for structuring image tables based on a structured model, provided in an exemplary embodiment of this specification. As shown in Figure 13, the method may include the following steps:
[0095] Step 1302: Input the image into the feature extraction layer.
[0096] As mentioned earlier, the feature extraction layer can be based on ResNet-50 to obtain the corresponding feature map.
[0097] Step 1304: Anchor-based Feature Pooling.
[0098] Feature pooling here means, for example, selecting the pixel features located at the mapped coordinates on the feature map, concatenating them into a corresponding 10×1024 feature vector when there are 10 virtual anchor points, and then performing global average pooling on the concatenated feature vector to obtain a 1×1024 averaged feature vector.
[0099] Step 1306: Input the feature vector into the fully connected regression layer.
[0100] When a 1×1024 feature vector is input into a 1024×10 fully connected regression layer, a corresponding 1×10 regression coordinate vector can be obtained. Each element in this regression coordinate vector can represent the horizontal axis coordinate of the corresponding table line.
[0101] After obtaining the horizontal axis coordinates of the corresponding table lines, further post-processing can be performed to complete the table structuring, as shown in Figure 14. Figure 14 is a flowchart of a post-processing procedure provided in an exemplary embodiment of this specification, which may include the following steps:
[0102] Step 1402: Obtain the coordinates of the table lines.
[0103] The table line coordinates are the regression coordinates as described above. For example, after obtaining a 1×10 regression coordinate vector, 10 corresponding horizontal axis coordinate values can be obtained. These horizontal axis coordinate values are the horizontal axis coordinate values corresponding to the regression table lines. Based on these horizontal axis coordinate values, and according to a pre-fixed set of vertical axis coordinate values, an identified table line can be represented.
[0104] Step 1404: Sort the rows and columns.
[0105] Based on the relative positions of the regression coordinates, the corresponding table rows and columns can be obtained, and the row and column lines can be further sorted. In fact, during the sorting process, overlapping rows and columns can also be filtered at the same time. Since the starting virtual anchor point can be set at the upper and lower boundaries of the image, there may be cases where the table line matched by the starting virtual anchor point located at the upper boundary and the table line matched by the starting virtual anchor point located at the lower boundary are actually the same table line. In this case, during the sorting process, overlapping table lines can be filtered at the same time, thereby filtering two overlapping table lines into one, improving the accuracy of table line recognition.
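One possible way to sort table lines and filter overlapping ones is sketched below. Each line is assumed to be a list of horizontal axis coordinates sampled at the same fixed vertical axis coordinates. The function name, the mean-distance criterion, and the tolerance value are assumptions rather than the method of this specification.

```python
def sort_and_merge_lines(lines, tol=3.0):
    """Sort lines by mean position and merge neighbours closer than tol on average."""
    ordered = sorted(lines, key=lambda xs: sum(xs) / len(xs))
    merged = []
    for xs in ordered:
        if merged:
            prev = merged[-1]
            gap = sum(abs(a - b) for a, b in zip(prev, xs)) / len(xs)
            if gap < tol:
                # Treat the two lines as the same table line and average them.
                merged[-1] = [(a + b) / 2 for a, b in zip(prev, xs)]
                continue
        merged.append(list(xs))
    return merged

# Two of these four lines describe the same table line, regressed from different anchors.
lines = [[200, 201, 202], [50, 50, 51], [51, 50, 50], [120, 121, 122]]
unique_lines = sort_and_merge_lines(lines)
```

In this example the two lines near x = 50 are merged into one, leaving three lines ordered from left to right.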
[0106] Step 1406: Cell generation and sorting.
[0107] The intersection of row and column lines can be used to obtain the corresponding cells and determine the border coordinates of the corresponding cells. Based on the relative position of the border coordinates, the cells are sorted to obtain a neat electronic spreadsheet.
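Cell generation from sorted row and column lines can be sketched as follows. For simplicity the lines are assumed to be axis-aligned, so that each row line is a single vertical axis coordinate and each column line a single horizontal axis coordinate; the function name and the dictionary layout are hypothetical.

```python
def build_cells(row_ys, col_xs):
    """Build cells from adjacent pairs of row lines and column lines."""
    rows, cols = sorted(row_ys), sorted(col_xs)
    cells = []
    for r in range(len(rows) - 1):
        for c in range(len(cols) - 1):
            cells.append({
                "row": r,
                "col": c,
                # Border coordinates as (x1, y1, x2, y2).
                "box": (cols[c], rows[r], cols[c + 1], rows[r + 1]),
            })
    return cells

cells = build_cells(row_ys=[0, 50, 100], col_xs=[0, 100, 200])
```

Three row lines and three column lines yield a 2×2 grid of four cells, listed in row-major order.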
[0108] Step 1408: OCR text box matching.
[0109] Based on OCR technology, corresponding text boxes containing text content can be obtained. Then, using the IOU matching method, the text boxes can be matched one-to-one with their corresponding cells, thus filling the text content into the cells. Of course, if the input image only contains a table with blank content, then text filling is unnecessary, and the digitized table can be used directly as the final result.
[0110] In one embodiment, OCR technology can also be used to fine-tune the table lines, thereby improving the recognition accuracy of the final table. For example, if the final table lines are continuous line segments, but the corresponding table lines in the input image are discontinuous, then the regressed table lines can be fine-tuned based on the text boxes recognized by OCR technology to make them the corresponding discontinuous line segments, thus ensuring recognition accuracy. As another example, suppose multiple cells in the table contain multiple Chinese characters, and these Chinese characters have long strokes, visually forming a shape similar to table lines. In this case, OCR technology can also be used for fine-tuning to ensure the final recognition accuracy and reduce errors.
[0111] Step 1410: Obtain the structured table.
[0112] Based on the post-processing of the regression coordinates described above, a complete structured process can be achieved, enabling users to analyze data using the resulting structured tables.
[0113] Because a structured model that has been trained in advance can deliver the desired output and ensure accurate recognition results, this specification also provides a training method for an image table structured model, as shown in Figure 15. Figure 15 is a flowchart illustrating a training method for an image table structured model provided in an exemplary embodiment of this specification. The method may include the following steps:
[0114] Step 1502: Obtain a training sample image set. Each sample image in the training sample image set includes a table line with known real coordinates, and multiple virtual anchor lines are set in the sample image. The virtual anchor lines are defined by a set of virtual anchor points arranged along the virtual anchor lines and their corresponding original coordinates on the sample image.
[0115] Step 1504: Input the training sample image set into the structured model to be trained. The structured model includes: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein: the feature extraction layer is used to extract features from the image to obtain a feature map corresponding to the image; the feature vector determination layer is used to generate a corresponding feature vector for each virtual anchor line, wherein the feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, determine the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map, and generate a feature vector based on the pixel features located at the mapping coordinates in the feature map; the regression layer is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line matching the table line of the table to be identified, and generate corresponding regression coordinates based on the feature vector corresponding to the target virtual anchor line, wherein the regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0116] Step 1506: Optimize the structured model based on the difference between the generated regression coordinates and the true coordinates corresponding to the sample image.
[0117] Training a structured model is essentially a process of optimizing various parameters within the model to regress accurate coordinates. As mentioned earlier, model parameters can be determined using the gradient descent algorithm. In neural network models, the gradient descent algorithm can take the form of backpropagation. Backpropagation involves first calculating the error between the regressed coordinates and the true values, and then propagating this error from the output layer to the hidden layers, and finally to the input layer. During backpropagation, the values of various parameters are adjusted based on the error. This process is iterated until the error reaches a preset threshold or is minimized. The error can be calculated using a loss function. In training this structured model, the Smooth L1 loss function can be used. The specific Smooth L1 loss function can be expressed as follows:
[0118]
[0119] As shown in the formula above, the Smooth L1 loss is a piecewise function that combines the advantages of the L1 and L2 loss functions. It behaves like the smooth L2 loss when x is small and like the stable L1 loss when x is large, thereby avoiding the main drawbacks of each. The value of x is the difference between the regression coordinate and the corresponding horizontal axis coordinate in the true coordinates. Training is considered complete and stops when the error reaches or falls below a preset threshold. It should be noted that the parameters of the structured model can also be optimized using the L1 or L2 loss function; this specification does not restrict the specific loss function.
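For reference, the Smooth L1 loss in its commonly used form can be written as below. The switching threshold `beta = 1.0` and the sample coordinates are assumptions; the exact form used in this specification is given by its own formula and may differ.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Common Smooth L1 form: quadratic for |x| < beta, linear otherwise."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    return np.where(abs_x < beta, 0.5 * abs_x ** 2 / beta, abs_x - 0.5 * beta)

# Hypothetical regressed and true horizontal axis coordinates of one table line.
predicted = np.array([10.2, 48.0, 101.0])
target = np.array([10.0, 50.0, 100.0])
loss = smooth_l1(predicted - target).mean()
```

The small error of 0.2 is penalised quadratically (0.02), while the larger errors of 2 and 1 are penalised linearly (1.5 and 0.5), so a single outlier does not dominate the gradient.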
[0120] The above-described embodiments of the specification detail a method for structuring image tables. For objects with a large aspect ratio, such as table lines, the concept of virtual anchor lines is proposed. By setting multiple virtual anchor lines in the image, a feature vector corresponding to each virtual anchor line can be generated based on the mapping relationship between the image and the corresponding feature map. Finally, regression processing is performed on the feature vector to obtain the coordinates of the table lines and achieve table structuring. Since the virtual anchor lines are specifically designed for objects with a large aspect ratio, such as table lines, the final recognition result can be ensured to have high accuracy and is unaffected by image rotation, tilt, or other issues. Furthermore, because only feature vectors need to be classified and regressed, pixel-by-pixel classification and labeling are no longer required, effectively reducing the computational load and improving the efficiency of table line recognition.
[0121] Corresponding to the embodiments of the foregoing methods, this specification also provides embodiments of apparatus, electronic devices, and storage media.
[0122] Figure 16 is a schematic structural diagram of an electronic device provided in an exemplary embodiment. Referring to Figure 16, at the hardware level, the device includes a processor 1601, a network interface 1602, memory 1603, non-volatile memory 1604, and an internal bus 1605, and may also include other hardware required for business operations. One or more embodiments of this specification can be implemented in software, such as the processor 1601 reading the corresponding computer program from the non-volatile memory 1604 into memory 1603 and then running it. Of course, in addition to software implementation, one or more embodiments of this specification do not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the following processing flow is not limited to each logic unit, but can also be hardware or logic devices.
[0123] Figure 17 is a block diagram of an apparatus for structuring image tables, provided in an exemplary embodiment. Referring to Figure 17, the device includes:
[0124] Setting unit 1702 is used to set multiple virtual anchor lines in the image, wherein the virtual anchor lines are defined by the original coordinates of a set of virtual anchor points arranged along the virtual anchor lines on the image.
[0125] The feature extraction unit 1704 is used to extract features from the image to obtain a feature map corresponding to the image;
[0126] The feature vector generation unit 1706 is used to generate a corresponding feature vector for each virtual anchor line; wherein, the feature vector corresponding to any virtual anchor line is generated in the following manner: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map.
[0127] The regression unit 1708 is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0128] The generation unit 1710 is used to generate a structured table corresponding to the table to be identified based on the generated regression coordinates.
[0129] Optionally, the setting unit 1702 is specifically used for:
[0130] Select a starting virtual anchor point at the image boundary;
[0131] A set of virtual anchor lines is generated for each starting virtual anchor point, wherein the set of virtual anchor lines corresponding to any starting virtual anchor point is generated in the following manner: taking that starting virtual anchor point as the endpoint, a set of rays is determined in the plane of the image, with a preset angle interval between adjacent rays; a set of remaining virtual anchor points is determined on each ray, and the starting virtual anchor point is combined with each set of remaining virtual anchor points to form multiple sets of virtual anchor points, which define the set of virtual anchor lines corresponding to that starting virtual anchor point.
[0132] Optionally, the setting unit 1702 is specifically used for:
[0133] On each ray, a set of corresponding positioning points are determined according to a pre-set set of fixed vertical axis coordinate values, so as to use the set of corresponding positioning points as the set of remaining virtual anchor points of the ray.
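The construction of virtual anchor lines from one starting virtual anchor point can be sketched as follows. The starting point is assumed to lie on the horizontal axis at (start_x, 0); the function name, the angle range, and the default angle interval are assumptions, and points falling outside the image are not clipped in this sketch.

```python
import math

def generate_anchor_lines(start_x, y_values, angle_step_deg=15.0,
                          min_angle_deg=15.0, max_angle_deg=165.0):
    """Return (angle, points) for each ray cast from the starting anchor point."""
    count = int((max_angle_deg - min_angle_deg) // angle_step_deg) + 1
    lines = []
    for k in range(count):
        angle = min_angle_deg + k * angle_step_deg
        tan_t = math.tan(math.radians(angle))
        # The starting anchor point, followed by one point per fixed vertical coordinate.
        points = [(float(start_x), 0.0)]
        points += [(start_x + y / tan_t, float(y)) for y in y_values]
        lines.append((angle, points))
    return lines

anchor_lines = generate_anchor_lines(100, [40, 80], angle_step_deg=45.0,
                                     min_angle_deg=45.0, max_angle_deg=135.0)
```

With a 45 degree interval this yields three anchor lines at 45, 90, and 135 degrees, each defined by the starting point plus the points at vertical coordinates 40 and 80.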
[0134] Optionally, the feature vector generation unit 1706 is specifically used for:
[0135] An original coordinate system is established by taking the image boundary on which the starting virtual anchor point is located as the horizontal axis, the straight line perpendicular to that boundary as the vertical axis, and the vertex of that boundary as the origin, so as to determine a set of original coordinates, in the original coordinate system, of the virtual anchor points of the virtual anchor line.
[0136] Based on the downsampling factor between the image and the feature map, the ordinate value in the original coordinates is reduced by an equal factor, so that the reduced value is used as the ordinate value in the mapped coordinates corresponding to the virtual anchor point;
[0137] Based on the vertical axis coordinate value in the mapped coordinates and the angle between the virtual anchor line and the horizontal axis of the coordinate system, the horizontal axis mapping value of the virtual anchor point on the feature map is determined, and the sum of the horizontal axis mapping value and the value after downsampling the horizontal axis coordinate value in the original coordinates is calculated as the horizontal axis coordinate value in the mapped coordinates corresponding to the virtual anchor point.
[0138] Optionally, the feature vector generation unit 1706 is specifically used for:
[0139] An original coordinate system is established by taking the image boundary on which the selected starting virtual anchor point is located as the horizontal axis, the straight line perpendicular to that boundary as the vertical axis, and the starting virtual anchor point as the origin, so as to determine a set of original coordinates, in the original coordinate system, of the virtual anchor points of the virtual anchor line.
[0140] Based on the downsampling factor between the image and the feature map, the ordinate value in the original coordinates is reduced by an equal factor, so that the reduced value is used as the ordinate value in the mapped coordinates corresponding to the virtual anchor point;
[0141] Based on the vertical axis coordinate value in the mapped coordinates and the angle between the virtual anchor line and the horizontal axis of the coordinate system, the horizontal axis mapping value of the virtual anchor point on the feature map is determined, and used as the horizontal axis coordinate value in the mapped coordinates corresponding to the virtual anchor point.
[0142] Optionally, the device further includes:
[0143] Input unit 1712 is used to input the image into a pre-trained structured model; the structured model includes: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein:
[0144] The feature extraction layer is used to extract features from the image to obtain feature maps of N channels corresponding to the image;
[0145] The feature vector determination layer is used to generate a corresponding M×N feature vector for the pixel features of the M virtual anchor points of each virtual anchor line on the feature map of the N channels, and to perform global average pooling on the generated M×N feature vector to obtain a 1×N feature vector composed of a single average pixel feature.
[0146] The regression layer is used to perform regression processing on the 1×N feature vectors composed of single average pixel features corresponding to the multiple virtual anchor lines, so as to determine the target virtual anchor line that matches the table line of the table to be identified, and obtain the 1×M regression coordinate vector corresponding to the target virtual anchor line, wherein each element in the regression coordinate vector is used to represent the horizontal axis coordinate value corresponding to the table line regressed by the virtual anchor line.
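The data flow through the feature vector determination layer and the regression layer can be sketched as follows. This is an illustration under stated assumptions, not the model of this specification: the feature map is represented as nested lists indexed `[channel][row][col]`, features are sampled at the nearest pixel with clamping, and the regression layer is stood in for by a single hypothetical linear head (`weights`, `bias`). The step of selecting the target virtual anchor line is not shown.

```python
def gather_anchor_features(feature_map, mapped_points):
    """Collect an M x N list of pixel features for M anchor points.

    feature_map is indexed as [channel][row][col]. Nearest-pixel
    sampling with clamping to the map bounds is an assumption.
    """
    n_channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    features = []
    for x, y in mapped_points:
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        features.append([feature_map[c][row][col] for c in range(n_channels)])
    return features


def global_average_pool(features):
    """Average the M rows of an M x N feature list into one 1 x N vector."""
    m = len(features)
    n = len(features[0])
    return [sum(features[i][c] for i in range(m)) / m for c in range(n)]


def linear_regression_head(pooled, weights, bias):
    """Hypothetical regression head: maps a 1 x N vector to 1 x M coordinates.

    weights is an M x N list and bias has length M.
    """
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
```

With M anchor points and N channels, `gather_anchor_features` yields the M×N features, `global_average_pool` reduces them to the 1×N vector, and the head produces a 1×M vector with one horizontal coordinate per anchor point.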
[0147] Optionally, the generation unit 1710 is specifically used for:
[0148] Based on the relative positional relationship of the regression coordinates, determine the corresponding order of the identified table lines and construct the corresponding cells;
[0149] If the table to be recognized includes text content, the text content obtained from the image based on optical character recognition is filled into the corresponding cell that matches the text box to generate a structured table corresponding to the table to be recognized.
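The two steps above can be sketched in simplified form. The sketch assumes axis-aligned table lines that each reduce to a single coordinate, and it matches a text box to the cell containing the box's center point; both are simplifications chosen for illustration, and the function names and the dictionary layout are hypothetical. The OCR step itself is not shown, and its output is assumed to be a list of `((x0, y0, x1, y1), text)` pairs.

```python
def build_cells(row_lines, col_lines):
    """Order the recognized lines and build one cell per adjacent line pair.

    row_lines: y coordinates of horizontal table lines (any order)
    col_lines: x coordinates of vertical table lines (any order)
    """
    ys = sorted(row_lines)
    xs = sorted(col_lines)
    cells = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells.append({
                "row": r,
                "col": c,
                "box": (xs[c], ys[r], xs[c + 1], ys[r + 1]),
                "text": "",
            })
    return cells


def fill_cells(cells, text_boxes):
    """Place each recognized text into the cell containing its box center."""
    for (x0, y0, x1, y1), text in text_boxes:
        cx = (x0 + x1) / 2
        cy = (y0 + y1) / 2
        for cell in cells:
            bx0, by0, bx1, by1 = cell["box"]
            if bx0 <= cx < bx1 and by0 <= cy < by1:
                cell["text"] = (cell["text"] + " " + text).strip()
                break
    return cells
```

Tilted or rotated tables would need the full regression coordinates of each line rather than a single value per line, which this sketch does not attempt.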
[0150] Figure 18 is a block diagram of a training apparatus for an image table structured model, provided in an exemplary embodiment. Referring to Figure 18, the device includes:
[0151] The acquisition unit 1802 is used to acquire a training sample image set. Any sample image in the training sample image set includes a table line with known real coordinates, and multiple virtual anchor lines are set in the sample image. The virtual anchor lines are defined by a set of virtual anchor points arranged along the virtual anchor lines and their corresponding original coordinates on the sample image.
[0152] The sample image input unit 1804 is used to input the training sample image set into the structured model to be trained, the structured model including: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein:
[0153] The feature extraction layer is used to extract features from the image to obtain the feature map corresponding to the image;
[0154] The feature vector determination layer is used to generate a corresponding feature vector for each virtual anchor line. The feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map.
[0155] The regression layer is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image.
[0156] The optimization unit 1806 is used to optimize the structured model based on the difference between the generated regression coordinates and the true coordinates corresponding to the sample image.
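The optimization step can be illustrated with a minimal sketch. The source states only that the model is optimized based on the difference between the regression coordinates and the true coordinates; the mean squared error, the plain gradient-descent update, the single linear head, and all names below are assumptions chosen for illustration.

```python
def predict(pooled, weights, bias):
    """Hypothetical linear head: 1 x N pooled features to 1 x M coordinates."""
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]


def mse_loss(predicted, target):
    """Mean squared difference between regressed and true coordinates."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def sgd_step(pooled, weights, bias, target, learning_rate):
    """One gradient-descent update of the linear head under the MSE loss."""
    predicted = predict(pooled, weights, bias)
    m = len(target)
    new_weights = []
    new_bias = []
    for i in range(m):
        # d(loss)/d(prediction_i) for the mean squared error.
        grad_out = 2.0 * (predicted[i] - target[i]) / m
        new_weights.append(
            [w - learning_rate * grad_out * f for w, f in zip(weights[i], pooled)]
        )
        new_bias.append(bias[i] - learning_rate * grad_out)
    return new_weights, new_bias
```

In a full model the same difference would be propagated back through the feature vector determination layer and the feature extraction layer as well; the sketch updates only the hypothetical head.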
[0157] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of the solution in this specification according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0158] In a typical configuration, a computer device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0159] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0160] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0161] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0162] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.
[0163] The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of this specification. The singular forms “a,” “said,” and “the” as used in this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
[0164] It should be understood that although the terms first, second, third, etc., may be used in this specification to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to determining."
[0165] The above description is merely a preferred embodiment of this specification and is not intended to limit this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of protection of this specification.
Claims
1. A method for structuring an image table, the image containing a table to be identified, the method comprising: Multiple virtual anchor lines are set in the image, and the virtual anchor lines are defined by the original coordinates of a set of virtual anchor points arranged along the virtual anchor lines on the image. Feature extraction is performed on the image to obtain the feature map corresponding to the image; For each virtual anchor line, a corresponding feature vector is generated; wherein, the feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map; Regression processing is performed on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and the corresponding regression coordinates are generated based on the feature vectors corresponding to the target virtual anchor lines. The regression coordinates are the coordinates of the table line of the table to be identified in the image. 
Based on the generated regression coordinates, a structured table corresponding to the table to be identified is generated; wherein setting multiple virtual anchor lines in the image includes: Select a starting virtual anchor point at the image boundary; A set of virtual anchor lines is generated for each starting virtual anchor point; wherein, the set of virtual anchor lines corresponding to any starting virtual anchor point is generated in the following manner: taking that starting virtual anchor point as the endpoint, a set of rays is determined in the plane where the image is located, wherein there is a preset angle interval between adjacent rays; a set of remaining virtual anchor points is determined on each ray, and that starting virtual anchor point and each set of remaining virtual anchor points are combined into multiple sets of virtual anchor points to define the set of virtual anchor lines corresponding to that starting virtual anchor point.
2. The method according to claim 1, wherein an original coordinate system is established with the boundary where the starting virtual anchor point is located as the horizontal axis and the straight line perpendicular to the boundary as the vertical axis, and determining a set of remaining virtual anchor points on each ray includes: On each ray, a set of corresponding positioning points is determined according to a preset set of fixed vertical axis coordinate values, so as to use the set of corresponding positioning points as the set of remaining virtual anchor points of that ray.
3. The method according to claim 1, wherein determining the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map based on the mapping relationship between the image and the feature map includes: An original coordinate system is established by taking the boundary where the starting virtual anchor point is located at the boundary of the image as the horizontal axis, the straight line perpendicular to the boundary as the vertical axis, and the vertex of the boundary where the starting virtual anchor point is located as the origin, so as to determine a set of original coordinates corresponding to the virtual anchor point of the virtual anchor line in the original coordinate system. Based on the downsampling factor between the image and the feature map, the ordinate value in the original coordinates is reduced by an equal factor, so that the reduced value is used as the ordinate value in the mapped coordinates corresponding to the virtual anchor point; Based on the vertical axis coordinate value in the mapped coordinates and the angle between the virtual anchor line and the horizontal axis of the original coordinate system, the horizontal axis mapping value of the virtual anchor point on the feature map is determined, and the sum of the horizontal axis mapping value and the value after downsampling the horizontal axis coordinate value in the original coordinates is calculated as the horizontal axis coordinate value in the mapped coordinates corresponding to the virtual anchor point.
4. The method according to claim 1, wherein determining the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map based on the mapping relationship between the image and the feature map comprises: An original coordinate system is established with the image boundary on which the selected starting virtual anchor point lies as the horizontal axis, the straight line perpendicular to that boundary as the vertical axis, and the starting virtual anchor point as the origin, so as to determine a set of original coordinates, in the original coordinate system, of the virtual anchor points corresponding to the virtual anchor line. Based on the downsampling factor between the image and the feature map, the ordinate value in the original coordinates is reduced by an equal factor, so that the reduced value is used as the ordinate value in the mapped coordinates corresponding to the virtual anchor point; Based on the vertical axis coordinate value in the mapped coordinates and the angle between the virtual anchor line and the horizontal axis of the original coordinate system, the horizontal axis mapping value of the virtual anchor point on the feature map is determined, and used as the horizontal axis coordinate value in the mapped coordinates corresponding to the virtual anchor point.
5. The method according to claim 1, further comprising: The image is input into a pre-trained structured model; The structured model includes: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein: The feature extraction layer is used to extract features from the image to obtain feature maps of N channels corresponding to the image; The feature vector determination layer is used to generate a corresponding M×N feature vector for the pixel features of the M virtual anchor points of each virtual anchor line on the feature map of the N channels, and to perform global average pooling on the generated M×N feature vector to obtain a 1×N feature vector composed of a single average pixel feature. The regression layer is used to perform regression processing on the 1×N feature vectors composed of single average pixel features corresponding to the multiple virtual anchor lines, so as to determine the target virtual anchor line that matches the table line of the table to be identified, and obtain the 1×M regression coordinate vector corresponding to the target virtual anchor line, wherein each element in the regression coordinate vector is used to represent the horizontal axis coordinate value corresponding to the table line regressed by the virtual anchor line.
6. The method according to claim 1, wherein generating the structured table corresponding to the table to be identified based on the generated regression coordinates comprises: Based on the relative positional relationship of the regression coordinates, determine the corresponding order of the identified table lines and construct the corresponding cells; If the table to be recognized includes text content, the text content obtained from the image based on optical character recognition is filled into the corresponding cell that matches the text box to generate a structured table corresponding to the table to be recognized.
7. A training method for an image table structured model, the method comprising: Obtain a training sample image set. Each sample image in the training sample image set includes a table line with known real coordinates of the table to be identified. The sample image contains multiple virtual anchor lines. The virtual anchor lines are defined by a set of virtual anchor points arranged along the virtual anchor lines and their corresponding original coordinates on the sample image. The training sample image set is input into the structured model to be trained, the structured model comprising: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein: The feature extraction layer is used to extract features from the image to obtain the feature map corresponding to the image; The feature vector determination layer is used to generate a corresponding feature vector for each virtual anchor line. The feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map. The regression layer is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image. The structured model is optimized based on the difference between the generated regression coordinates and the true coordinates corresponding to the sample image. 
wherein the multiple virtual anchor lines are set in the sample image in the following manner: Select a starting virtual anchor point at the boundary of the sample image; A set of virtual anchor lines is generated for each starting virtual anchor point; wherein, the set of virtual anchor lines corresponding to any starting virtual anchor point is generated in the following manner: taking that starting virtual anchor point as the endpoint, a set of rays is determined in the plane where the sample image is located, wherein there is a preset angle interval between adjacent rays; a set of remaining virtual anchor points is determined on each ray, and that starting virtual anchor point and each set of remaining virtual anchor points are combined into multiple sets of virtual anchor points to define the set of virtual anchor lines corresponding to that starting virtual anchor point.
8. A structuring apparatus for an image table, the image containing a table to be recognized, the apparatus comprising: Setting unit, used to set multiple virtual anchor lines in the image, wherein the virtual anchor lines are defined by the original coordinates of a set of virtual anchor points arranged along the virtual anchor lines on the image; A feature extraction unit is used to extract features from the image to obtain a feature map corresponding to the image; The feature vector generation unit is used to generate a corresponding feature vector for each virtual anchor line; wherein, the feature vector corresponding to any virtual anchor line is generated in the following manner: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map. The regression unit is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image. The structured unit is used to generate a structured table corresponding to the table to be identified based on the generated regression coordinates. 
When the setting unit sets multiple virtual anchor lines in the image, it is used for: Select a starting virtual anchor point at the image boundary; A set of virtual anchor lines is generated for each starting virtual anchor point; wherein, the set of virtual anchor lines corresponding to any starting virtual anchor point is generated in the following manner: taking that starting virtual anchor point as the endpoint, a set of rays is determined in the plane where the image is located, wherein there is a preset angle interval between adjacent rays; a set of remaining virtual anchor points is determined on each ray, and that starting virtual anchor point and each set of remaining virtual anchor points are combined into multiple sets of virtual anchor points to define the set of virtual anchor lines corresponding to that starting virtual anchor point.
9. A training apparatus for an image table structured model, the apparatus comprising: The acquisition unit is used to acquire a training sample image set. Each sample image in the training sample image set includes a table line with known real coordinates of the table to be identified, and multiple virtual anchor lines are set in the sample image. The virtual anchor lines are defined by a set of virtual anchor points arranged along the virtual anchor lines and their corresponding original coordinates on the sample image. The input unit is used to input the training sample image set into the structured model to be trained, the structured model including: a feature extraction layer, a feature vector determination layer, and a regression layer; wherein: The feature extraction layer is used to extract features from the image to obtain the feature map corresponding to the image; The feature vector determination layer is used to generate a corresponding feature vector for each virtual anchor line. The feature vector corresponding to any virtual anchor line is generated in the following way: based on the mapping relationship between the image and the feature map, the mapping coordinates of the virtual anchor point corresponding to any virtual anchor line on the feature map are determined, and a feature vector is generated based on the pixel features located at the mapping coordinates in the feature map. The regression layer is used to perform regression processing on the feature vectors corresponding to the multiple virtual anchor lines to determine the target virtual anchor line that matches the table line of the table to be identified, and to generate corresponding regression coordinates based on the feature vectors corresponding to the target virtual anchor line. The regression coordinates are the coordinates of the table line of the table to be identified in the image. 
An optimization unit is used to optimize the structured model based on the difference between the generated regression coordinates and the true coordinates corresponding to the sample image. The acquisition unit sets up multiple virtual anchor lines in the following manner: Select a starting virtual anchor point at the boundary of the sample image; A set of virtual anchor lines is generated for each starting virtual anchor point; wherein, the set of virtual anchor lines corresponding to any starting virtual anchor point is generated in the following manner: taking that starting virtual anchor point as the endpoint, a set of rays is determined in the plane where the sample image is located, wherein there is a preset angle interval between adjacent rays; a set of remaining virtual anchor points is determined on each ray, and that starting virtual anchor point and each set of remaining virtual anchor points are combined into multiple sets of virtual anchor points to define the set of virtual anchor lines corresponding to that starting virtual anchor point.
10. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 6 or in claim 7.
11. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method as claimed in any one of claims 1 to 6 or in claim 7.