PDF document structured message extraction method and device

A technology for information extraction and document structuring, applied in fields such as instruments, character and pattern recognition, and computer components. It addresses the problem that the structured information of PDF documents cannot easily be obtained, and it avoids manual processing.

Active Publication Date: 2017-11-17
ZHONGKE DINGFU BEIJING TECH DEV
13 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

[0005] This application provides a method and a device for extracting structured information from a PDF document, so as to solve the problem that the structured information of a PDF document cannot be obtained conveniently with the prior art.

Examples

Embodiment Construction

[0050] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the embodiments and accompanying drawings.

[0051] Referring to Figure 1, in a specific embodiment, the method for extracting structured information from a PDF document includes:

[0052] S100: Acquire the original pages of the PDF document.

[0053] S200: Extract, from the original pages, at least one actual page that includes text content or a title.

[0054] S300: Extract the titles of each level, and the text content belonging to each title, from the actual pages.

[0055] S400: Store each title and the text content belonging to it in a structured form.
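
As a rough illustration of steps S200 and S300, the following Python sketch filters pages and splits their text into titled sections. It is not the method claimed in the application. It assumes that S100 has already produced each page as a plain-text string, that an "actual page" is simply any page with non-blank text, and that titles are numbered lines such as "1 Introduction" or "1.2 Scope". The names `HEADING_RE`, `actual_pages`, and `extract_sections` are hypothetical.

```python
import re

# Assumed title pattern: "1 Title", "1.2 Title", "1.2.3 Title".
# Real documents would need a more robust detector (fonts, layout, etc.).
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def actual_pages(original_pages):
    """S200 (simplified): keep only pages that contain non-blank text."""
    return [page for page in original_pages if page.strip()]


def extract_sections(pages):
    """S300 (simplified): return a flat list of titled sections.

    Each section is a dict with the title level (1 for "1", 2 for "1.1"),
    the title text, and the body text that follows the title.
    Text that appears before the first title is ignored in this sketch.
    """
    sections = []
    current = None
    for page in pages:
        for raw_line in page.splitlines():
            line = raw_line.strip()
            if not line:
                continue
            match = HEADING_RE.match(line)
            if match:
                current = {
                    "level": match.group(1).count(".") + 1,
                    "title": match.group(2),
                    "text": [],
                }
                sections.append(current)
            elif current is not None:
                current["text"].append(line)
    # Join the collected body lines of each section into one string.
    for section in sections:
        section["text"] = " ".join(section["text"])
    return sections
```

A caller would pass the page strings through `actual_pages` and then `extract_sections`. Because sections are tracked across page boundaries, body text that continues onto the next page stays attached to its title.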

[0056] Structured information means that the information can be decomposed into multiple interrelated components after analysis, and there is a clear hierarchical structure among the components. In this application, the structured information of a PDF docume...
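
One way to picture such a hierarchy is as a tree in which each title owns its text and its lower-level titles. The Python sketch below builds that tree from a flat list of sections and is only an illustration: the record layout (`level`, `title`, `text`) and the function name `build_tree` are assumptions, not the storage format of the application.

```python
import json


def build_tree(sections):
    """Nest flat sections (level >= 1) under their parent titles."""
    root = {"title": None, "level": 0, "text": "", "children": []}
    stack = [root]  # path from the root to the most recent section
    for sec in sections:
        node = {
            "title": sec["title"],
            "level": sec["level"],
            "text": sec["text"],
            "children": [],
        }
        # Pop back to the nearest ancestor with a smaller level.
        # The root has level 0, so the stack never becomes empty.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


if __name__ == "__main__":
    flat = [
        {"level": 1, "title": "Introduction", "text": "PDF is a format."},
        {"level": 2, "title": "Scope", "text": "Only text pages."},
        {"level": 1, "title": "Method", "text": "Extract titles."},
    ]
    # Serializing the tree to JSON is one possible structured storage.
    print(json.dumps(build_tree(flat), indent=2))
```

The stack keeps the chain of open titles, so a level-2 title is attached to the most recent level-1 title, and a new level-1 title closes everything beneath the previous one.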

Abstract

The invention discloses a method for extracting structured information from a PDF document. The method comprises the following steps: the original pages of a PDF document are acquired; at least one actual page including text content or a title is extracted from the original pages; the titles of all levels, and the text content belonging to each title, are extracted from the actual pages; and each title and the text content belonging to it are stored in a structured form. With this method, the titles of all levels and their corresponding text content can be extracted from the PDF document and stored in a structured form, so the structured information of the PDF document is obtained automatically, manual re-processing is avoided, and the method is convenient and efficient.

Description

Technical field

[0001] The present application relates to the field of information extraction from PDF documents, and in particular to a method for extracting structured information from a PDF document. In addition, the present application also relates to a device for extracting structured information from a PDF document.

Background technique

[0002] PDF (Portable Document Format) is a file format developed by Adobe Systems. It is used to exchange files independently of applications, operating systems, and hardware, and it is a fixed-layout document format. The pages of a PDF are relatively independent and faithfully reproduce every character, color, and image of the original manuscript. However, PDF is an unstructured storage format: it does not record the logical structure of the document and has no logical elements such as paragraphs and tables.

[0003] The information in the PDF document is extracted, usually using OCR (Op...

Application Information

IPC(8): G06K9/00
CPC: G06V30/416, G06V30/43, G06V30/40
Inventor: 徐龙, 李德彦, 杨宇
Owner: ZHONGKE DINGFU BEIJING TECH DEV