A method for table recognition of PDF documents

The invention applies table and document technology to the field of PDF document table recognition. It addresses problems that reduce the accuracy and efficiency of table recognition, and improves both.

Active Publication Date: 2018-03-30
TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)

AI Technical Summary

Problems solved by technology

In a complex layout where multiple tables coexist on one page, especially multiple three-line tables (tables with only horizontal lines), relying only on intersecting table lines and the layout features of the table body text reduces the accuracy and efficiency of table recognition.




Embodiment Construction

[0024] To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and accompanying drawings.

[0025] As shown in Figure 1, which is a flowchart of the PDF document table recognition method, the method comprises:

[0026] Obtain the set of characters on the page, and merge the characters into rows to build a row set;

[0027] Extract the horizontal and vertical lines from the page paths, and build a line set;

[0028] Detect suspected table titles in the row set and suspected table lines in the line set;

[0029] If suspected table titles and suspected table lines are both present, identify the table with the region growing method based on the table titles and the line set;

[0030] If only suspected table lines are present, use the line set and the row set to detect full-line tables first and then three-line tables;

[0031]...
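
The first two steps above (building the row set and the line set) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Char` and `Segment` types, the function names, the tolerance values, and the top-left coordinate origin (y growing downward) are all assumptions made for illustration, and a real PDF parser would supply the characters and path segments.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical character record: glyph text plus its position on the page."""
    text: str
    x: float  # left edge
    y: float  # baseline, assumed to grow downward from the top of the page


@dataclass
class Segment:
    """Hypothetical straight path segment taken from the page's drawing paths."""
    x0: float
    y0: float
    x1: float
    y1: float


def merge_chars_into_rows(chars, y_tol=2.0):
    """Group characters whose baselines lie within y_tol of a row's first character."""
    rows = []  # each entry: {"y": anchor baseline, "chars": [Char, ...]}
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if rows and abs(rows[-1]["y"] - ch.y) <= y_tol:
            rows[-1]["chars"].append(ch)
        else:
            rows.append({"y": ch.y, "chars": [ch]})
    # Within each row, read the characters from left to right.
    return [
        "".join(c.text for c in sorted(row["chars"], key=lambda c: c.x))
        for row in rows
    ]


def split_lines(segments, tol=1.0, min_len=5.0):
    """Keep near-horizontal and near-vertical segments; drop slanted or tiny ones."""
    horizontal, vertical = [], []
    for s in segments:
        dx, dy = abs(s.x1 - s.x0), abs(s.y1 - s.y0)
        if dy <= tol and dx >= min_len:
            horizontal.append(s)
        elif dx <= tol and dy >= min_len:
            vertical.append(s)
    return horizontal, vertical


if __name__ == "__main__":
    page_chars = [Char("b", 20, 100.5), Char("a", 10, 100), Char("c", 10, 120)]
    print(merge_chars_into_rows(page_chars))
    page_segments = [Segment(0, 0, 100, 0), Segment(0, 0, 0, 50), Segment(0, 0, 30, 30)]
    print(split_lines(page_segments))
```

The tolerance parameters are placeholders; in practice they would likely be tied to font size and line width rather than fixed constants.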


Abstract

The invention discloses a table recognition method for PDF documents, comprising: acquiring the character set of a page, merging the characters into rows, and building a row set; extracting the horizontal and vertical lines from the page paths and building a line set; detecting suspected table titles in the row set and suspected table lines in the line set; if suspected table titles and suspected table lines are both present, identifying the table with the region growing method based on the table titles and the line set; if only suspected table lines are present, using the line set and the row set to detect full-line tables first and then three-line tables; if only suspected table titles are present, identifying the table with the region growing method based on the table titles and the row set; if neither suspected table lines nor suspected table titles are present, determining that there is no table on the page; and detecting the table header and table attachment elements, and outputting the table recognition result for the page. The invention treats table titles, table lines, and the arrangement of table characters as the three major features of a table, and uses parallel region growing to accurately locate tables in complex layouts where multiple tables coexist on one page.
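
The four-way branching described in the abstract can be written out as a small decision function. The sketch below is only an illustration in Python: the `Strategy` enum and `choose_strategy` are hypothetical names, and the function only selects which branch applies; it does not implement region growing or the full-line and three-line detectors.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical labels for the four branches named in the abstract."""
    GROW_FROM_TITLES_AND_LINES = "region growing based on table titles and the line set"
    FULL_LINE_THEN_THREE_LINE = "detect full-line tables, then three-line tables"
    GROW_FROM_TITLES_AND_ROWS = "region growing based on table titles and the row set"
    NO_TABLE = "no table on the page"


def choose_strategy(has_suspected_titles: bool, has_suspected_lines: bool) -> Strategy:
    """Pick the recognition branch from the two detection results."""
    if has_suspected_titles and has_suspected_lines:
        return Strategy.GROW_FROM_TITLES_AND_LINES
    if has_suspected_lines:
        return Strategy.FULL_LINE_THEN_THREE_LINE
    if has_suspected_titles:
        return Strategy.GROW_FROM_TITLES_AND_ROWS
    return Strategy.NO_TABLE


if __name__ == "__main__":
    for titles in (True, False):
        for lines in (True, False):
            print(titles, lines, choose_strategy(titles, lines).value)
```

Keeping the branch choice separate from the detectors makes each of the four cases easy to check on its own.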

Description

Technical field

[0001] The invention relates to a PDF document table recognition method and belongs to the field of layout analysis and layout understanding of formatted electronic documents.

Background

[0002] With the rapid development of the domestic digital publishing industry, the industry needs to resolve how to make deep use of publishing resources, how to quickly perform deep processing of document resources, how to fragment and reorganize resources, and how to meet the needs of multi-form, multi-channel, and multi-media digital publishing. Fragmentation of document resources includes indexing of basic metadata such as article titles, authors, keywords, and references, as well as fragmentation of text content such as paragraphs, pictures, tables, and formulas. Layout analysis and understanding technology is the key technology for automatic fragmentation of documents. The PDF table recognition metho...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/24
CPC: G06F40/177, G06V30/414
Inventor: 邹季英, 袁仁慧, 梁洵
Owner: TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)