Portable document format (PDF) document form identification method

A table recognition technology for PDF documents that addresses problems affecting the accuracy and efficiency of table recognition, thereby improving both.

Active Publication Date: 2016-05-18
TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)
5 Cites, 48 Cited by

AI Technical Summary

Problems solved by technology

In a complex layout where multiple tables coexist on one page, especially multiple three-line tables (tables with only horizontal lines), relying only on intersecting table lines and the layout features of the table body text reduces the accuracy and efficiency of table recognition.

Embodiment Construction

[0024] To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and accompanying drawings.

[0025] Figure 1 shows the flow chart of the PDF document table recognition method. The method comprises:

[0026] Obtain the set of characters in the page, and merge the characters into rows to create a row set;

[0027] Extract the horizontal and vertical lines from the page paths, and create a line set;

[0028] Detect suspected table titles in the row set and suspected table lines in the line set;

[0029] If there are suspected table titles and suspected table lines at the same time, use the region growing method based on the table title and line set to identify the table;

[0030] If there are only suspected table lines, use the line set and row set to first detect the full-line table and then the three-line table;

[00...
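
Steps [0026] and [0027] can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the `Char` and `Segment` types, the function names, and the tolerance values are hypothetical, and the sketch assumes that characters and path segments have already been extracted from the page by some PDF library.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float  # left edge of the glyph (hypothetical field)
    y: float  # baseline position (hypothetical field)


@dataclass
class Segment:
    x0: float
    y0: float
    x1: float
    y1: float


def merge_chars_into_rows(chars, y_tol=2.0):
    """Group characters with nearly equal baselines into rows (the row set)."""
    rows = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        # Extend the current row while the baseline stays within y_tol
        # of the row's first character; otherwise start a new row.
        if rows and abs(ch.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    # Within each row, order characters left to right.
    return [sorted(row, key=lambda c: c.x) for row in rows]


def row_text(row):
    """Concatenate the characters of one row into a string."""
    return "".join(c.text for c in row)


def build_line_set(segments, tol=0.5, min_len=5.0):
    """Keep near-horizontal and near-vertical segments (the line set)."""
    horizontal, vertical = [], []
    for s in segments:
        dx, dy = abs(s.x1 - s.x0), abs(s.y1 - s.y0)
        if dy <= tol and dx >= min_len:
            horizontal.append(s)
        elif dx <= tol and dy >= min_len:
            vertical.append(s)
        # Diagonal or very short segments are ignored.
    return horizontal, vertical


chars = [Char("b", 20, 100.5), Char("a", 10, 100.0), Char("c", 10, 120.0)]
rows = merge_chars_into_rows(chars)
segments = [Segment(0, 50, 100, 50), Segment(0, 0, 0, 80), Segment(0, 0, 30, 30)]
horizontal, vertical = build_line_set(segments)
```

Anchoring each row to its first character keeps the grouping simple; a real page with superscripts or skewed text would likely need a more tolerant rule.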

Abstract

The invention discloses a method for recognizing tables in Portable Document Format (PDF) documents. The method comprises: acquiring the characters in a page and merging them into rows to establish a row set; extracting the horizontal and vertical lines in the page paths to establish a line set; detecting suspected table titles in the row set and suspected table lines in the line set; if both suspected table titles and suspected table lines exist, recognizing the table with a region growing method based on the table titles and the line set; if only suspected table lines exist, using the line set and the row set to detect full-line tables first and then three-line tables; if only suspected table titles exist, recognizing the table with a region growing method based on the table titles and the row set; if neither suspected table lines nor suspected table titles exist, determining that the page has no table; and detecting elements attached to the table, such as the table header and table notes, and outputting the table recognition result for the page. The method treats the table title, the table lines, and the arrangement of the table text as the three characteristic features of a table, and by growing regions in parallel it can locate tables accurately in a complex page layout with multiple tables on one page.
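
The four-way case analysis in the abstract can be sketched as a small dispatch function. This is an illustrative sketch only: the title pattern `TITLE_RE`, the function names, and the returned strategy labels are assumptions introduced here, and the source does not state how suspected titles are actually detected.

```python
import re

# Hypothetical pattern: a row that starts with "Table", "Tab." or "表"
# followed by a number is treated as a suspected table title.
TITLE_RE = re.compile(r"^\s*(Table|Tab\.|表)\s*\d+", re.IGNORECASE)


def find_suspected_titles(row_texts):
    """Return the indices of rows that look like table titles."""
    return [i for i, text in enumerate(row_texts) if TITLE_RE.match(text)]


def choose_strategy(has_titles, has_lines):
    """Pick the recognition branch described in the abstract."""
    if has_titles and has_lines:
        return "region growing from titles over the line set"
    if has_lines:
        return "full-line tables first, then three-line tables"
    if has_titles:
        return "region growing from titles over the row set"
    return "no table on page"


row_texts = ["Introduction", "Table 1 Results", "表2 数据", "Tables are useful"]
titles = find_suspected_titles(row_texts)
strategy = choose_strategy(has_titles=bool(titles), has_lines=False)
```

Keeping the branch selection separate from the recognizers makes each of the four cases easy to test on its own.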

Description

Technical field

[0001] The invention relates to a PDF document table recognition method, which belongs to the field of layout analysis and layout understanding of formatted electronic documents.

Background

[0002] With the rapid development of the domestic digital publishing industry, how to make deep use of publishing resources, how to quickly perform deep processing of document resources, how to fragment and reorganize resources, and how to meet the needs of multi-form, multi-channel, and multi-media digital publishing are issues the digital publishing industry currently needs to resolve. Fragmentation of document resources includes indexing basic metadata such as article titles, authors, keywords, and references, as well as fragmenting text content such as paragraphs, pictures, tables, and formulas. Layout analysis and understanding is the key technology for automatic document fragmentation. The PDF table recognition metho...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/24
CPC: G06F40/177, G06V30/414
Inventor: 邹季英, 袁仁慧, 梁洵
Owner: TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)