Nested table extraction method and device, and storage medium

A nested table extraction technology, applied in the field of office and network data acquisition, that addresses the poor recognition experience for non-editable text, the poor user experience, and the low processing efficiency of existing methods, so as to improve recognition accuracy and reduce editing work.

Pending Publication Date: 2021-04-16
SUZHOU AUNBOX SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

[0002] At present, when text recognition is performed on non-editable text such as PDF text, the methods for recognizing and extracting the text part are relatively mature and their accuracy is relatively high. However, when the non-editable text contains tables and other table-like content, the recognition of th...



Examples


Detailed Description of the Embodiments

[0056] The essence of the technical solutions of the embodiments of the present application is explained in detail below with reference to examples.

[0057] Figure 1 is a schematic flow chart of the method for extracting nested tables provided in an embodiment of the present application. As shown in Figure 1, the method includes the following processing steps:

[0058] Step 101: read and parse the data content of a first type of file, and determine the line segment coordinate information contained in the parsed data content.
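As an illustration of what the "line segment coordinate information" from Step 101 might look like once it has been read, the following minimal sketch sorts raw segments into horizontal and vertical rulings. It assumes the parser has already produced endpoint tuples `(x0, y0, x1, y1)`; the function name `classify_segments`, the tolerance `AXIS_TOL`, and the `(fixed coordinate, start, end)` output convention are hypothetical and do not come from the source.

```python
from typing import List, Tuple

# Hypothetical tolerance: how far a segment may deviate from a perfectly
# horizontal or vertical line and still be treated as a table ruling.
AXIS_TOL = 1.0

RawSeg = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Seg = Tuple[float, float, float]            # (fixed coordinate, start, end)


def classify_segments(
    raw: List[RawSeg], axis_tol: float = AXIS_TOL
) -> Tuple[List[Seg], List[Seg]]:
    """Split raw segments into horizontal and vertical rulings.

    Horizontal segments become (y, x_start, x_end); vertical segments
    become (x, y_start, y_end). Slanted segments are ignored.
    """
    horizontal: List[Seg] = []
    vertical: List[Seg] = []
    for x0, y0, x1, y1 in raw:
        if abs(y1 - y0) <= axis_tol:
            horizontal.append(((y0 + y1) / 2, min(x0, x1), max(x0, x1)))
        elif abs(x1 - x0) <= axis_tol:
            vertical.append(((x0 + x1) / 2, min(y0, y1), max(y0, y1)))
    return horizontal, vertical


horizontal, vertical = classify_segments(
    [(100, 10, 0, 10), (5, 0, 5, 40), (0, 0, 30, 30)]
)
```

Normalizing each segment so that its start is never greater than its end keeps later comparisons between segments simple. The diagonal segment in the usage example is discarded because it cannot be a table ruling under these assumptions.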

[0059] In the embodiment of the present application, the first type of file includes non-editable text such as PDF text, screenshot text, and picture text.

[0060] The first type of file can contain text, tables (in particular nested tables), pictures, and other content. In this embodiment of the present application, a table includes not only an ordinary table, b...



Abstract

The invention discloses a nested table extraction method and device, and a storage medium. The method comprises the following steps: reading and parsing the data content of a first type of file, and determining the line segment coordinate information contained in the parsed data content; grouping the line segments with the display unit as a reference; based on the coordinates of the line segments, merging adjacent or connected line segments whose transverse or longitudinal distance is smaller than the corresponding threshold and whose adjacent endpoints are separated by less than the corresponding fixed threshold; traversing all the merged line segments, determining whether they intersect, and generating a set of intersecting line segments; traversing the line segments in the set, determining the intersection points between the merged intersecting line segments, traversing all the intersection points, and determining the intersection points that form a rectangular frame; generating a table with reference to the intersection points whose rectangular frame has an area greater than a set threshold; and calculating the inclusion relation between the generated tables and forming a nested table based on that relation. The invention improves table extraction accuracy and reduces editing work.
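The steps in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: segments are assumed to be already axis-aligned and stored as `(fixed coordinate, start, end)` tuples, the threshold values `OFFSET_TOL`, `GAP_TOL`, and `MIN_AREA` are made up, a rectangular frame is accepted as soon as its four corners are intersection points (edge coverage is not checked), and every function name is hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical thresholds; the source only speaks of "corresponding thresholds".
OFFSET_TOL = 2.0   # max perpendicular offset between segments on one ruling
GAP_TOL = 3.0      # max gap between adjacent endpoints that is still bridged
MIN_AREA = 100.0   # rectangular frames must be larger than this area

Seg = Tuple[float, float, float]           # (fixed coordinate, start, end)
Rect = Tuple[float, float, float, float]   # (x0, y0, x1, y1)


def merge_segments(segs: List[Seg]) -> List[Seg]:
    """Merge nearly collinear segments whose endpoints almost touch."""
    groups: List[List[Seg]] = []
    for seg in sorted(segs):
        # Segments whose fixed coordinate is close belong to the same ruling.
        if groups and seg[0] - groups[-1][0][0] < OFFSET_TOL:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    merged: List[Seg] = []
    for group in groups:
        pos = group[0][0]
        spans = sorted((lo, hi) for _, lo, hi in group)
        cur_lo, cur_hi = spans[0]
        for lo, hi in spans[1:]:
            if lo - cur_hi < GAP_TOL:      # overlapping or small gap: extend
                cur_hi = max(cur_hi, hi)
            else:                          # large gap: start a new segment
                merged.append((pos, cur_lo, cur_hi))
                cur_lo, cur_hi = lo, hi
        merged.append((pos, cur_lo, cur_hi))
    return merged


def intersections(hsegs: List[Seg], vsegs: List[Seg]) -> Set[Tuple[float, float]]:
    """Return (x, y) points where a horizontal and a vertical segment meet."""
    points = set()
    for y, x0, x1 in hsegs:
        for x, y0, y1 in vsegs:
            if x0 - GAP_TOL <= x <= x1 + GAP_TOL and y0 - GAP_TOL <= y <= y1 + GAP_TOL:
                points.add((x, y))
    return points


def rectangles(points: Set[Tuple[float, float]]) -> List[Rect]:
    """Find frames whose four corners are intersection points (simplified)."""
    rects = []
    for x0, y0 in points:
        for x1, y1 in points:
            if x1 > x0 and y1 > y0 and (x0, y1) in points and (x1, y0) in points:
                if (x1 - x0) * (y1 - y0) > MIN_AREA:
                    rects.append((x0, y0, x1, y1))
    return sorted(rects)


def nested_pairs(rects: List[Rect]) -> List[Tuple[Rect, Rect]]:
    """Return (outer, inner) pairs where inner lies inside outer."""
    return [
        (a, b)
        for a in rects
        for b in rects
        if a != b and a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]
    ]


# An outer frame with a broken top line, plus a separate inner frame.
h = merge_segments([(0, 0, 48), (0.5, 50, 100), (100, 0, 100), (20, 20, 60), (60, 20, 60)])
v = merge_segments([(0, 0, 100), (100, 0, 100), (20, 20, 60), (60, 20, 60)])
pairs = nested_pairs(rectangles(intersections(h, v)))
```

In the usage example, the two pieces of the broken top line are merged into one segment before intersections are computed, which is the repair of "intermittent and uneven lines" that the background section describes. How the method distinguishes an ordinary cell from a nested table is not visible in this excerpt, so the containment test here treats any frame that lies inside another as nested.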

Description

Technical field

[0001] The embodiments of the present application relate to office and network data acquisition technologies, and in particular to a method and device for extracting nested tables, and a storage medium.

Background

[0002] At present, when text recognition is performed on non-editable text such as PDF text, the methods for recognizing and extracting the text part are relatively mature and their accuracy is relatively high. However, when the non-editable text contains tables and other table-like content, recognition of the table structure itself is quite poor: for example, the lines in the recognized table are broken and uneven. This seriously degrades the recognition experience for non-editable text and forces users to spend a lot of time repairing the recognized table structure, which leads to low processing efficiency and a poor user experience.

Summary of the invention

[0003] In view of this, embodiments of the pr...


Application Information

IPC(8): G06F40/18, G06F16/22
Inventor 王春浩程言超周炬马成龙
Owner SUZHOU AUNBOX SOFTWARE CO LTD