PDF form extraction method

A table extraction method, applied in the field of PDF document data mining and extraction, that addresses the insufficient accuracy and efficiency of extracting tables from PDF documents
CN106897690A · Active · Publication Date: 2017-06-27 · 南京述酷信息技术有限公司

Patent Information

Authority / Receiving Office
CN · China
Current Assignee / Owner
南京述酷信息技术有限公司
Publication Date
2017-06-27


Abstract

The invention discloses a PDF form extraction method. According to the technical scheme, the method comprises: parsing a PDF document page by page to obtain all image data, first line data and character data; processing the image data in page order with an image recognition algorithm, and obtaining, from the image data that contains form data, second line data corresponding to that form data; processing the first line data and the second line data in page order with a graphic algorithm to obtain form frame data containing form row data and column data; clustering the character data with a clustering algorithm to obtain text data consisting of a set of character strings; and obtaining all the form data in the PDF document from the resulting form frames and text data. The method improves the accuracy and efficiency of extracting forms from PDF documents and yields more accurate form data; it is suitable for fields with high requirements on the accuracy and efficiency of form data extraction.
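The last three steps of the abstract (line data to form frame, character clustering, and combining frames with text) can be illustrated with a minimal Python sketch. The patent text available here does not name the specific graphic or clustering algorithms, so this sketch substitutes simple stand-ins: coordinate merging to find row and column boundaries, and gap-based grouping of characters on a shared baseline. All names (`Char`, `merge_coords`, `grid_from_lines`, `cluster_chars`, `fill_table`), the tolerance values, and the top-left coordinate origin are assumptions made for illustration. The image recognition step is not covered; the line segments are assumed to be already extracted.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float      # left edge
    y: float      # vertical position (assumed to grow downward)
    width: float


def merge_coords(values, tol=2.0):
    """Collapse coordinates closer than `tol` into one boundary each."""
    merged = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def grid_from_lines(h_lines, v_lines, tol=2.0):
    """h_lines are (x0, y, x1), v_lines are (x, y0, y1).
    Returns sorted row boundaries and column boundaries."""
    rows = merge_coords([y for _, y, _ in h_lines], tol)
    cols = merge_coords([x for x, _, _ in v_lines], tol)
    return rows, cols


def cluster_chars(chars, y_tol=2.0, gap=3.0):
    """Group characters into strings: first by similar y, then by small x gaps."""
    if not chars:
        return []
    lines = []
    for c in sorted(chars, key=lambda c: c.y):
        if lines and c.y - lines[-1][-1].y <= y_tol:
            lines[-1].append(c)
        else:
            lines.append([c])
    runs = []
    for line in lines:
        line.sort(key=lambda c: c.x)
        run = [line[0]]
        for c in line[1:]:
            prev = run[-1]
            if c.x - (prev.x + prev.width) <= gap:
                run.append(c)
            else:
                runs.append(run)
                run = [c]
        runs.append(run)
    # Each string keeps the position of its first character.
    return [("".join(c.text for c in run), run[0].x, run[0].y) for run in runs]


def fill_table(rows, cols, strings):
    """Place each string into the cell whose boundaries contain its position."""
    n_rows, n_cols = len(rows) - 1, len(cols) - 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for text, x, y in strings:
        r = bisect_right(rows, y) - 1
        c = bisect_right(cols, x) - 1
        if 0 <= r < n_rows and 0 <= c < n_cols:
            table[r][c] = (table[r][c] + " " + text).strip()
    return table


# Hypothetical input: a 2x2 frame with one near-duplicate horizontal line.
h_lines = [(0, 0, 100), (0, 0.5, 100), (0, 20, 100), (0, 40, 100)]
v_lines = [(0, 0, 40), (50, 0, 40), (100, 0, 40)]
chars = [Char("A", 5, 10, 6), Char("b", 11, 10, 6), Char("1", 60, 10, 6),
         Char("C", 5, 30, 6), Char("2", 60, 30.5, 6)]

rows, cols = grid_from_lines(h_lines, v_lines)
table = fill_table(rows, cols, cluster_chars(chars))
print(table)  # [['Ab', '1'], ['C', '2']]
```

This sketch assumes a fully ruled grid with no merged cells; a frame with spanning cells would need the line segments' extents (the unused `x0`, `x1`, `y0`, `y1` values) to decide which boundaries apply to which cells.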

Description

technical field

[0001] The invention relates to the technical field of PDF document data mining and extraction, in particular to a method for extracting PDF tables.

Background technique

[0002] PDF (Portable Document Format) is a file format developed by Adobe Systems for document exchange that is independent of applications, operating systems, and hardware. PDF is based on the PostScript language imaging model, which ensures accurate colors and accurate printing on any printer; that is, PDF faithfully reproduces every character, color and image in the document. With the rapid development of computer and Internet technology, PDF documents are more and more widely used in fields such as economics, finance, education, scientific research and academia. Since the design purpose of PDF is only to display documents or print documents, it does not have the function of communicating and...

Claims
