PDF incomplete box line table extraction method, device and equipment and storage medium

A box line table technology, applied in the field of PDF document recognition, that addresses problems such as a low degree of automation, incomplete table structure recognition, and the inability to fully automate table extraction.

Active Publication Date: 2021-02-19
北京中科凡语科技有限公司

AI Technical Summary

Problems solved by technology

Recognizing complete box line tables is relatively simple, and current open-source PDF table extraction tools achieve high accuracy on them. Recognizing incomplete box line tables raises many problems: for example, table detection accuracy is low (current open-source tools such as camelot and pdfplumber will mistakenly detect the text content out...

Embodiment Construction

[0051] The present disclosure will be further described in detail below with reference to the drawings and embodiments. It can be understood that the specific implementation manners described here are only used to explain relevant content, rather than to limit the present disclosure. It should also be noted that, for ease of description, only parts related to the present disclosure are shown in the drawings.

[0052] It should be noted that, in the case of no conflict, the implementation modes and the features in the implementation modes in the present disclosure can be combined with each other. The technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings and in combination with implementation manners.

[0053] Unless otherwise specified, the illustrated exemplary embodiments are to be understood as providing exemplary features with various details of some ways in which the technical idea of the pre...


Abstract

The invention provides a method for processing incomplete box line tables in PDF documents, comprising the following steps: S1, parsing a PDF page to obtain the elements of the page; S2, judging whether the parsed elements include at least horizontal line segment elements and/or vertical line segment elements, and judging whether the PDF page contains a table based at least on the characteristics of the horizontal line segment elements; S3, if the PDF page contains a table, judging whether the table is a complete box line table or an incomplete box line table based at least on the characteristics of the vertical line segment elements; S4, if the table is an incomplete box line table, obtaining all text blocks in the PDF page together with the position information of each text block, and obtaining a preliminary table area in the PDF page based at least on that position information; and S5, correcting the preliminary table area based on the horizontal line segment elements and/or the vertical line segment elements to obtain a corrected table area. The invention further provides a device for processing incomplete box line tables in PDF documents, electronic equipment, and a storage medium.
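Steps S2 to S5 can be illustrated with a minimal Python sketch. The abstract does not state which line or text-block characteristics are used, so every heuristic here is an assumption made for illustration only: the line-count thresholds, the row-grouping tolerance, the snapping distance, and all names (`Segment`, `page_has_table`, `is_complete_frame`, `preliminary_region`, `correct_region`, `extract_table_region`) are hypothetical. Step S1 (parsing the page) is assumed to have already produced line segments and text-block bounding boxes in a coordinate system where y grows downward.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


@dataclass
class Segment:
    x0: float
    y0: float
    x1: float
    y1: float

    def is_horizontal(self, tol: float = 1.0) -> bool:
        return abs(self.y0 - self.y1) <= tol

    def is_vertical(self, tol: float = 1.0) -> bool:
        return abs(self.x0 - self.x1) <= tol


def page_has_table(h_lines: List[Segment], min_rules: int = 2) -> bool:
    # S2 (assumed heuristic): a table needs at least `min_rules` horizontal rules.
    return len(h_lines) >= min_rules


def is_complete_frame(v_lines: List[Segment], min_rules: int = 2) -> bool:
    # S3 (assumed heuristic): enough vertical rules means a complete frame.
    return len(v_lines) >= min_rules


def preliminary_region(text_blocks: List[Box], row_tol: float = 3.0) -> Optional[Box]:
    # S4 (assumed heuristic): group blocks into rows by their top coordinate and
    # keep rows holding two or more blocks, which look like table rows.
    rows: List[List[Box]] = []
    for box in sorted(text_blocks, key=lambda b: b[1]):
        if rows and abs(rows[-1][0][1] - box[1]) <= row_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    cells = [b for row in rows if len(row) >= 2 for b in row]
    if not cells:
        return None
    return (min(b[0] for b in cells), min(b[1] for b in cells),
            max(b[2] for b in cells), max(b[3] for b in cells))


def correct_region(region: Box, h_lines: List[Segment], snap: float = 15.0) -> Box:
    # S5 (assumed heuristic): grow the region to cover horizontal rules that
    # overlap it horizontally and lie within `snap` units of its vertical extent.
    ox0, oy0, ox1, oy1 = region
    x0, y0, x1, y1 = region
    for s in h_lines:
        lx0, lx1 = min(s.x0, s.x1), max(s.x0, s.x1)
        if lx1 < ox0 or lx0 > ox1:
            continue  # no horizontal overlap with the preliminary region
        if oy0 - snap <= s.y0 <= oy1 + snap:
            x0, x1 = min(x0, lx0), max(x1, lx1)
            y0, y1 = min(y0, s.y0), max(y1, s.y0)
    return (x0, y0, x1, y1)


def extract_table_region(segments: List[Segment], text_blocks: List[Box]) -> Optional[Box]:
    h_lines = [s for s in segments if s.is_horizontal()]
    v_lines = [s for s in segments if s.is_vertical() and not s.is_horizontal()]
    if not page_has_table(h_lines):
        return None  # S2: no table on this page
    if is_complete_frame(v_lines):
        return None  # S3: complete box line table, outside this sketch
    region = preliminary_region(text_blocks)  # S4
    if region is None:
        return None
    return correct_region(region, h_lines)  # S5
```

For a page with three horizontal rules and no vertical rules, the text blocks alone give a region that is too narrow, and the correction step widens it to the extent of the rules. A production implementation would need the actual criteria from the full description, which is not reproduced here.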

Description

Technical field

[0001] The present disclosure relates to a method, device, equipment and storage medium for extracting incomplete box line tables from PDF documents, and belongs to the technical field of PDF document recognition.

Background

[0002] PDF (Portable Document Format) is currently one of the most widely used document formats. It is mainly used for file exchange and printing, and cannot interact with other computer programs.

[0003] With the wide application of PDF in finance, scientific research, education and other fields, automatically recognizing PDF documents and extracting useful data from them has become a concern.

[0004] PDF documents are mainly composed of text, images, tables, formulas, etc. Tables are an extremely efficient way to organize and present data, so table recognition has become an urgent problem to be solved. Table recognition includes table detection and table structure recognition. Table detection refers...


Application Information

IPC(8): G06F40/163, G06F40/174, G06F40/18, G06K9/00
CPC: G06F40/163, G06F40/174, G06F40/18, G06V30/412
Inventor: 周玉, 李小青
Owner: 北京中科凡语科技有限公司