
Form extraction method and device based on PDF (Portable Document Format) file

A table extraction technology, applied in the information field, that addresses the low accuracy of extracted tables and improves extraction accuracy.

Status: Inactive. Publication Date: 2016-10-05
BEIJING UNIV OF POSTS & TELECOMM +1

AI Technical Summary

Problems solved by technology

In practical applications, the accuracy of tables extracted from PDF files is often low.




Embodiment Construction

[0026] Figure 1 is a schematic flow chart of a PDF file-based table extraction method provided by an embodiment of the present invention. As shown in Figure 1, the method includes:

[0027] Step 101. Parse the PDF file to obtain the text information of each character and the line information of each line in the PDF file.

[0028] Here, the text information includes text character information and text position information; the line information includes line position information, line width, and line length; the line position information includes the horizontal-axis position and the vertical-axis position of the line.
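The two kinds of records described in [0028] can be pictured as simple data structures. The following Python sketch is only an illustration: the class names, field names, and the `horizontal` flag are assumptions made here and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextInfo:
    """One character: its content plus its position on the page."""
    char: str  # text character information
    x: float   # text position, horizontal axis
    y: float   # text position, vertical axis


@dataclass(frozen=True)
class LineInfo:
    """One drawn line: its position, width and length."""
    x: float          # line position, horizontal axis
    y: float          # line position, vertical axis
    width: float      # line width (stroke thickness)
    length: float     # line length
    horizontal: bool  # hypothetical flag: True for a horizontal line


# Illustrative values only; units are assumed to be PDF user-space units.
example_char = TextInfo(char="A", x=72.0, y=700.0)
example_line = LineInfo(x=72.0, y=690.0, width=0.5, length=200.0, horizontal=True)
```

Keeping the records immutable (`frozen=True`) is a design choice of this sketch, so that later steps can sort and group lines without altering the parsed data.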

[0029] Specifically, PDFBox is used to parse the PDF file and obtain the text information in the PDF file; the line information in the PDF file is extracted according to the operator used to mark the end of a line in the PDF file.
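To illustrate what extracting lines by operator can look like, the sketch below scans a simplified, already-decoded PDF content stream. In PDF content streams, `m` starts a subpath, `l` appends a straight segment, `w` sets the line width, `S` strokes the path, and `n` ends the path without painting it. Treating `S` as the operator that marks the end of a line is an assumption of this sketch, and the function name `extract_segments` is hypothetical. It does not use PDFBox, and it ignores everything a real parser must handle (compressed streams, strings, rectangles, transformations).

```python
def extract_segments(content: str):
    """Collect stroked straight segments from a simplified content stream.

    Returns a list of (start_point, end_point, line_width) tuples.
    Only the operators m, l, w, S and n are interpreted.
    """
    segments = []
    operands = []   # numeric operands seen since the last operator
    path = []       # segments of the current path, not yet painted
    current = None  # current point of the path
    width = 1.0     # default line width in the PDF graphics state
    for token in content.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "w" and len(operands) >= 1:
            width = operands[-1]
        elif token == "m" and len(operands) >= 2:
            current = (operands[-2], operands[-1])
        elif token == "l" and len(operands) >= 2 and current is not None:
            end = (operands[-2], operands[-1])
            path.append((current, end))
            current = end
        elif token == "S":
            # The stroke operator paints the path: record its segments.
            segments.extend((start, end, width) for start, end in path)
            path = []
        elif token == "n":
            path = []  # path ended without being painted
        operands = []
    return segments


stream = "0.5 w 72 700 m 272 700 l S 72 700 m 72 600 l S"
lines = extract_segments(stream)
```

Here `lines` holds one horizontal and one vertical segment, each with the 0.5 line width that was in effect when it was stroked.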

[0030] For example, PDFBox re-processes and encapsulates the characters and lines in the PDF file. Both tex...


Abstract

The invention provides a table extraction method and device based on a PDF (Portable Document Format) file. The method comprises: parsing the PDF file to obtain the text information of each character and the line information of each line; sorting the horizontal lines extracted from the same page according to their line position information; judging whether two adjacent horizontal lines belong to the same table on the page; drawing a table from the horizontal lines that belong to the same table according to the line information; filling the vertical lines extracted from the page into the drawn table according to the line information; and, according to the text information of each character, filling the text character information into the table cell at the location corresponding to its text position information, where the table cells are formed by the horizontal and vertical lines. Because the information of both the horizontal and the vertical lines of the table is taken into account, the accuracy of extracting tables from PDF files is improved.
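The steps in the abstract can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the patented implementation: horizontal (transverse) lines are reduced to their y positions, vertical (longitudinal) lines are `(x, bottom, top)` tuples, y grows upward as in PDF coordinates, tables are regular grids without merged cells, and characters arrive in reading order. The rule in `same_table` (two adjacent horizontal lines share a table if some vertical line spans the gap between them) is a hypothetical criterion, since this excerpt does not state the one the patent uses. All function names and the `TOL` tolerance are likewise invented for the example.

```python
from bisect import bisect_right

TOL = 1.0  # hypothetical tolerance, in PDF user-space units


def same_table(upper_y, lower_y, verticals):
    """Assumed rule: some vertical line spans the gap between the two lines."""
    return any(bottom <= lower_y + TOL and top >= upper_y - TOL
               for _x, bottom, top in verticals)


def group_horizontals(horizontal_ys, verticals):
    """Sort horizontal lines top to bottom and split them into tables."""
    tables = []
    for y in sorted(set(horizontal_ys), reverse=True):
        if tables and same_table(tables[-1][-1], y, verticals):
            tables[-1].append(y)
        else:
            tables.append([y])
    return [rows for rows in tables if len(rows) >= 2]


def locate(value, bounds):
    """Index i with bounds[i] <= value < bounds[i + 1], or None if outside."""
    i = bisect_right(bounds, value) - 1
    return i if 0 <= i < len(bounds) - 1 else None


def build_table(row_ys, verticals, chars):
    """Place vertical lines into the table, then fill characters into cells."""
    top, bottom = row_ys[0], row_ys[-1]
    # Column boundaries: vertical lines that overlap this table's y range.
    col_xs = sorted({x for x, b, t in verticals if b < top and t > bottom})
    asc_ys = row_ys[::-1]  # ascending order, as bisect requires
    n_rows, n_cols = len(row_ys) - 1, len(col_xs) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for ch, x, y in chars:
        r, c = locate(y, asc_ys), locate(x, col_xs)
        if r is not None and c is not None:
            grid[n_rows - 1 - r][c] += ch  # row 0 is the top row
    return grid


def extract_tables(horizontal_ys, verticals, chars):
    """Run grouping and filling for every table found on one page."""
    return [build_table(rows, verticals, chars)
            for rows in group_horizontals(horizontal_ys, verticals)]


horizontal_ys = [700, 680, 660, 500, 480]
verticals = [(72, 660, 700), (172, 660, 700), (272, 660, 700),
             (72, 480, 500), (272, 480, 500)]
chars = [("I", 80, 685), ("D", 88, 685), ("N", 180, 685),
         ("1", 80, 665), ("x", 180, 665), ("T", 100, 490)]
tables = extract_tables(horizontal_ys, verticals, chars)
```

With this sample page, the five horizontal lines split into two groups because no vertical line bridges the gap between y = 660 and y = 500, so `tables` contains a 2×2 grid followed by a single-cell table.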

Description

Technical field

[0001] The invention relates to information technology, in particular to a PDF file-based table extraction method and device.

Background

[0002] Because they are cross-platform, Portable Document Format (PDF) files are widely used on current mainstream operating systems. More and more e-books, product manuals, company announcements, financial reports, network materials, scientific literature, e-mail, and so on are distributed as PDF files, and PDF has become an ideal format for electronic document distribution and digital information dissemination.

[0003] Since the PDF format itself does not represent tables structurally, detecting table lines and restoring tables poses considerable challenges. Currently, a table recognition algorithm based on text flow can be used to extract tables in PDF files. However, in practical applications, it is often found that the accuracy of extracting tables is n...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/22
Inventors: 闫丹凤, 钱直儒, 唐皓瑾, 侯宾, 王家鑫
Owner: BEIJING UNIV OF POSTS & TELECOMM