Information extraction method and apparatus for PDF file

An information extraction technology for documents, applied in the field of information processing. It addresses the prior art's inability to finely analyze PDF content and to effectively extract chart information, and shortens the time needed to analyze the content of research reports.

Pending Publication Date: 2017-07-14
北京顺通行网络科技有限公司
12 Cites, 33 Cited by

AI Technical Summary

Problems solved by technology

[0004] The embodiment of the present invention provides an information extraction method and device for PDF files, to solve the problems in the prior art that the information content of PDF files cannot be finely analyzed and that chart information cannot be effectively extracted.

Embodiment Construction

[0073] The following will clearly and completely describe the technical solutions in the embodiments of the present invention in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0074] To solve the prior-art problem that the information content of a PDF file cannot be finely analyzed and chart information cannot be effectively extracted, the embodiment of the present invention identifies and extracts the titles, text, pictures, and tables of the PDF file (in this embodiment, pictures and tables are collectively referred to as "charts" below), and they are organ...
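The identification step can be illustrated with a minimal sketch. The source does not say which node statistics are used, so everything here is an assumption: the `Node` type, the use of font size as the statistic, the `CHART_TITLE` pattern, and the rule that the most common font size on a page marks body text. Detection of chart ends is omitted.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    """One leaf of a hypothetical per-page tree: a line of text and its font size."""
    text: str
    font_size: float


# Hypothetical pattern for chart titles such as "Figure 3: ..." or "Table 2 ...".
CHART_TITLE = re.compile(r"^(Figure|Table|Chart)\s*\d+", re.IGNORECASE)


def classify_nodes(nodes):
    """Label each node as 'chart_title', 'title', or 'text'.

    Assumption: the most common font size on the page is body text,
    and any node with a larger font size is a title.
    """
    if not nodes:
        return []
    # Page-level statistic: the most frequent font size.
    body_size = Counter(n.font_size for n in nodes).most_common(1)[0][0]
    labels = []
    for n in nodes:
        if CHART_TITLE.match(n.text.strip()):
            labels.append("chart_title")
        elif n.font_size > body_size:
            labels.append("title")
        else:
            labels.append("text")
    return labels
```

A real implementation would likely combine several signals (position, boldness, line length); font size alone is used here only to keep the idea visible.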

Abstract

The invention relates to the field of information processing, in particular to an information extraction method and apparatus for a PDF file. The method comprises: for the PDF file, generating a corresponding tree structure from the information in each page; computing statistics on the information of each node in the tree structure of each page, and identifying and extracting the titles, text, chart titles and chart ends in each page; and then summarizing the results, grading the titles into levels, extracting charts according to their chart titles and chart ends, mapping text and charts to the corresponding titles and chart titles, and finally forming structured data for the PDF file. With the method and the apparatus, the titles, text, charts and the like in a PDF file can be extracted in structured form, the content can be finely analyzed and chart information effectively extracted, and data support is provided for search, accurate information locating and content mining in the vertical field of industry research reports, so that the time a user spends analyzing research reports is greatly shortened.
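The grading and mapping steps described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the `Section` type, the `title_level` rule (level taken from the depth of a numeric prefix such as "2.1.3"), and the input format of labeled items in reading order are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical structured-data node: a title with its text, charts, and subsections."""
    title: str
    level: int
    text: list = field(default_factory=list)
    charts: list = field(default_factory=list)
    children: list = field(default_factory=list)


def title_level(title):
    """Assumed grading rule: '2.1.3 Foo' is level 3; unnumbered titles are level 1."""
    head = title.split(" ", 1)[0].rstrip(".")
    parts = head.split(".")
    return len(parts) if all(p.isdigit() for p in parts) else 1


def build_structure(items):
    """Nest (label, content) pairs, given in reading order, under their titles.

    Labels are 'title', 'text', or 'chart_title'.
    """
    root = Section(title="", level=0)
    stack = [root]  # path from the root to the currently open section
    for label, content in items:
        if label == "title":
            level = title_level(content)
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(title=content, level=level)
            stack[-1].children.append(section)
            stack.append(section)
        elif label == "chart_title":
            stack[-1].charts.append(content)
        else:
            stack[-1].text.append(content)
    return root
```

For example, feeding it the titles "1 Overview", "1.1 Market size", and "2 Outlook" in that order yields two top-level sections, with "1.1 Market size" nested under the first and any text or chart titles attached to whichever section was open when they appeared.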

Description

Technical field

[0001] The invention relates to the field of information processing, in particular to an information extraction method and device for a PDF file.

Background art

[0002] To help industry analysts retrieve the report content they need from numerous industry research reports, and to dig out from a large number of research reports the highest-quality content fragments that best represent the current state of industry analysis, it is necessary to finely structure the content of industry research reports: identifying the parent and child titles to which each title belongs, content fragments, chart content and other information, and organizing them organically.

[0003] In the prior art, information extraction for PDF files of industry research reports mainly processes the text data in them, and there is no good method for parsing pictures and tables in PDF files, especially for PDF files of industry researc...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22
CPC: G06F40/14, G06F40/154
Inventor: 兰任马超张道泉赵继广
Owner: 北京顺通行网络科技有限公司