Nested table extraction method and device, and storage medium

A nested table extraction technology applied in the field of office and network data acquisition. It addresses the poor recognition of table structure in non-editable text, which degrades the user experience and lowers processing efficiency. Its effect is to improve table recognition accuracy and reduce editing work.
CN112668289A · Pending · Publication Date: 2021-04-16 · SUZHOU AUNBOX SOFTWARE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Current Assignee / Owner
SUZHOU AUNBOX SOFTWARE CO LTD
Publication Date
2021-04-16


Abstract

The invention discloses a nested table extraction method and device, and a storage medium. The method comprises the following steps: reading and parsing the data contents of a first type of file, and determining the coordinate information of the line segments contained in the parsed contents; grouping the line segments by display unit; based on the line segment coordinates, merging adjacent or connected line segments whose transverse or longitudinal distance is smaller than a corresponding threshold and whose adjacent endpoints are separated by less than a corresponding fixed threshold; traversing all merged line segments, determining whether they intersect, and generating a set of intersecting line segments; traversing the line segments in the set, determining the intersection points between the merged intersecting line segments, then traversing all intersection points and determining which of them form a rectangular frame; generating a table from the intersection points whose rectangular frame has an area greater than a set threshold; and calculating the inclusion relation between the generated tables and forming a nested table based on that relation. The invention improves table extraction accuracy and reduces editing work.
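The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes axis-aligned segments given as `(pos, start, end)` tuples, all tolerance values and function names (`merge_collinear`, `group_tables`, `nest`) are hypothetical, and it simplifies the rectangular-frame step by taking the bounding box of each group of intersecting segments instead of analysing individual intersection points.

```python
def merge_collinear(segments, offset_tol=2.0, gap_tol=3.0):
    """Merge axis-aligned segments given as (pos, start, end).

    pos is the fixed coordinate (y for horizontal lines, x for vertical ones).
    Two segments merge when their pos values differ by less than offset_tol
    and the gap between their facing endpoints is less than gap_tol.
    """
    merged = []
    for pos, start, end in sorted(segments):
        for m in merged:
            if (abs(m[0] - pos) < offset_tol
                    and start - m[2] < gap_tol and m[1] - end < gap_tol):
                m[1], m[2] = min(m[1], start), max(m[2], end)
                break
        else:
            merged.append([pos, start, end])
    return [tuple(m) for m in merged]


def crosses(h, v, tol=1.0):
    """True if horizontal segment h and vertical segment v intersect."""
    y, x1, x2 = h
    x, y1, y2 = v
    return x1 - tol <= x <= x2 + tol and y1 - tol <= y <= y2 + tol


def group_tables(horizontals, verticals, min_area=100.0):
    """Group intersecting segments; return one bounding box per group.

    Boxes are (x_min, y_min, x_max, y_max); groups whose box area does not
    exceed min_area are discarded.
    """
    segs = [("h", s) for s in horizontals] + [("v", s) for s in verticals]
    parent = list(range(len(segs)))

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    nh = len(horizontals)
    for i, h in enumerate(horizontals):
        for j, v in enumerate(verticals):
            if crosses(h, v):
                parent[find(i)] = find(nh + j)

    groups = {}
    for idx, (kind, (pos, start, end)) in enumerate(segs):
        xs, ys = ((start, end), (pos, pos)) if kind == "h" else ((pos, pos), (start, end))
        box = groups.setdefault(find(idx), [min(xs), min(ys), max(xs), max(ys)])
        box[0], box[1] = min(box[0], *xs), min(box[1], *ys)
        box[2], box[3] = max(box[2], *xs), max(box[3], *ys)
    boxes = [tuple(b) for b in groups.values()]
    return sorted(b for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) > min_area)


def nest(boxes):
    """Map each box to its smallest enclosing box (None for top-level tables)."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    def contains(a, b):
        return a != b and a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

    parents = {}
    for b in boxes:
        enclosing = [a for a in boxes if contains(a, b)]
        parents[b] = min(enclosing, key=area) if enclosing else None
    return parents


# Made-up input: an outer frame whose top edge is broken in two, plus an
# inner frame that does not touch the outer lines.
horizontals = merge_collinear(
    [(0, 0, 99), (0.5, 101, 200), (100, 0, 200), (20, 50, 150), (60, 50, 150)])
verticals = merge_collinear(
    [(0, 0, 100), (200, 0, 100), (50, 20, 60), (150, 20, 60)])
boxes = group_tables(horizontals, verticals)
tree = nest(boxes)
```

With this made-up input, the broken top edge is merged into one segment, two frames are found, and `tree` records the inner frame as nested inside the outer one. The single-pass merge is greedy, so a real extractor would probably need to repeat it until no further merges occur.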

Description

Technical field

[0001] The embodiments of the present application relate to office and network data acquisition technologies, and in particular to a method and device for extracting nested tables, and a storage medium.

Background art

[0002] At present, when text recognition is performed on non-editable text such as PDF text, recognition and extraction of the text part is relatively mature and fairly accurate. However, when the non-editable text contains tables or other table-like content, recognition of the table structure itself is quite poor: for example, the lines of the recognized table come out broken and uneven. This seriously degrades the recognition experience for non-editable text and forces users to spend a lot of time repairing the recognized table structure, which leads to low processing efficiency and a poor user experience.

Summary of the invention

[0003] In view of this, embodiments of the pr...
