PDF form extraction method

A PDF form (table) extraction method, applied in the field of PDF document data mining and extraction, that addresses the insufficient accuracy and efficiency of existing approaches.

Active Publication Date: 2017-06-27
南京述酷信息技术有限公司
Cites: 3; Cited by: 42

AI Technical Summary

Problems solved by technology

[0003] The present invention addresses the insufficient accuracy and efficiency of prior-art extraction of form data from PDF documents, and aims to provide a PDF form extraction method that extracts form data from PDF documents with high accuracy and high efficiency.


Examples


Embodiment Construction

[0071] Preferred embodiments are described below with reference to Figure 1 to Figure 10, to illustrate the present invention more clearly and completely.

[0072] The PDF form extraction method of this embodiment of the present invention comprises:

[0073] Figure 1 is a flow chart of the PDF form extraction method of the present invention. The method is carried out as follows. First, the PDF document is parsed to obtain image data, first line data and character data. The image data obtained from PDF parsing is then processed by an image recognition algorithm, which obtains second line data corresponding to the form data from those images that contain form data. A graphical algorithm then processes the first line data obtained through PDF parsing together with the second line data obtained through image recognition to obtain table frame data of all table row data an...
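The source does not specify the graphical algorithm, so the following is only a minimal sketch of one way line data from the two sources could be merged into a table frame. It assumes axis-aligned ruling lines, a regular grid with no merged cells, and a pixel tolerance `tol`; the names `cluster_coords` and `table_frame` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Box = Tuple[float, float, float, float]      # (left, top, right, bottom)


def cluster_coords(values: List[float], tol: float) -> List[float]:
    """Collapse coordinates lying within `tol` of their neighbour into one mean value."""
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def table_frame(first_lines: List[Segment],
                second_lines: List[Segment],
                tol: float = 2.0) -> List[List[Box]]:
    """Merge vector lines (from PDF parsing) and lines recovered from images
    into a grid of cell boxes, indexed as grid[row][column]."""
    xs: List[float] = []
    ys: List[float] = []
    for x0, y0, x1, y1 in list(first_lines) + list(second_lines):
        if abs(y1 - y0) <= tol:        # horizontal ruling: a row boundary
            ys.append((y0 + y1) / 2)
        elif abs(x1 - x0) <= tol:      # vertical ruling: a column boundary
            xs.append((x0 + x1) / 2)
        # slanted segments are ignored in this sketch
    col_edges = cluster_coords(xs, tol)
    row_edges = cluster_coords(ys, tol)
    return [
        [(col_edges[c], row_edges[r], col_edges[c + 1], row_edges[r + 1])
         for c in range(len(col_edges) - 1)]
        for r in range(len(row_edges) - 1)
    ]


# Illustrative input: an outer border from the PDF, inner rulings from an image.
first = [(0, 0, 100, 0), (0, 50, 100, 50), (0, 0, 0, 50), (100, 0, 100, 50)]
second = [(0, 0.5, 100, 0.5), (50, 0, 50, 50), (0, 25, 100, 25)]
grid = table_frame(first, second)
```

Clustering coordinates with a tolerance lets a line found both in the vector data and in the image (here the top border at y = 0 and y = 0.5) count as a single boundary instead of producing a sliver row.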


Abstract

The invention discloses a PDF form extraction method. The method comprises: parsing a PDF document page by page to obtain all image data, first line data and character data; processing the image data in page order with an image recognition algorithm to obtain, from the image data that contains form data, second line data corresponding to that form data; processing the first line data and the second line data in page order with a graphic algorithm to obtain form frame data comprising form row data and column data; clustering the character data with a clustering algorithm to obtain text data comprising a set of character strings; and obtaining all the form data in the PDF document from the resulting form frames and text data. The method improves the accuracy and efficiency of extracting forms from PDF documents and yields more accurate form data, and it is suited to fields that place high demands on the accuracy and efficiency of form data extraction.
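The abstract names a clustering algorithm without describing it. As a hedged illustration only, the sketch below groups characters into strings by baseline and horizontal gap, then places each string into the cell box that contains its centre. The character tuple layout, the thresholds `y_tol` and `max_gap`, and the function names `cluster_chars` and `fill_cells` are assumptions made for this example, not details from the patent.

```python
from typing import List, Tuple

Char = Tuple[float, float, float, str]   # (x0, x1, baseline_y, character)
Text = Tuple[float, float, float, str]   # (x0, x1, baseline_y, string)
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def _join(chars: List[Char]) -> Text:
    """Build one string record from characters already sorted left to right."""
    return (chars[0][0], chars[-1][1], chars[0][2], "".join(c[3] for c in chars))


def cluster_chars(chars: List[Char], y_tol: float = 2.0,
                  max_gap: float = 3.0) -> List[Text]:
    """Group characters into text lines by baseline, then split each line
    into strings wherever the horizontal gap exceeds `max_gap`."""
    lines: List[List[Char]] = []
    for c in sorted(chars, key=lambda ch: ch[2]):
        if lines and c[2] - lines[-1][-1][2] <= y_tol:
            lines[-1].append(c)
        else:
            lines.append([c])
    strings: List[Text] = []
    for line in lines:
        line.sort(key=lambda ch: ch[0])
        current = [line[0]]
        for c in line[1:]:
            if c[0] - current[-1][1] <= max_gap:
                current.append(c)
            else:
                strings.append(_join(current))
                current = [c]
        strings.append(_join(current))
    return strings


def fill_cells(grid: List[List[Box]], strings: List[Text]) -> List[List[str]]:
    """Assign each string to the cell whose box contains the string's centre."""
    table = [["" for _ in row] for row in grid]
    for x0, x1, y, text in strings:
        cx = (x0 + x1) / 2
        for r, row in enumerate(grid):
            for c, (left, top, right, bottom) in enumerate(row):
                if left <= cx < right and top <= y < bottom:
                    table[r][c] = (table[r][c] + " " + text).strip()
    return table


# Illustrative input: one row with two cells and three characters.
grid = [[(0, 0, 50, 25), (50, 0, 100, 25)]]
chars = [(5, 10, 10.0, "A"), (60, 65, 10.0, "1"), (11, 16, 10.5, "B")]
strings = cluster_chars(chars)
table = fill_cells(grid, strings)
```

Grouping by baseline before sorting by x keeps characters with slightly different vertical positions on the same line, which a single sort on (y, x) would interleave incorrectly.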

Description

Technical field

[0001] The invention relates to the technical field of PDF document data mining and extraction, and in particular to a method for extracting PDF tables.

Background technique

[0002] PDF (Portable Document Format) is a file format developed by Adobe Systems for document exchange that is independent of applications, operating systems and hardware. PDF documents are based on the PostScript language imaging model, which ensures accurate colors and accurate printed output on any printer; that is, PDF faithfully reproduces every character, color and image in the document. With the rapid development of computer and Internet technology, PDF documents are used more and more widely in fields such as economics, finance, education, scientific research and academia. Since the design purpose of PDF is only to display documents or print documents, it does not have the function of communicating and...


Application Information

IPC(8): G06K9/00
CPC: G06V30/413, G06V30/414, G06V30/10
Inventor: 郑龙夏磊
Owner: 南京述酷信息技术有限公司