Method for an XML-based PDF document information extraction system

An information conversion technique for extracting document information from PDF files. It addresses the absence of published reports on PDF document information extraction systems and improves the efficiency of downstream document processing.

Inactive Publication Date: 2005-10-26
FUZHOU UNIV

AI Technical Summary

Problems solved by technology

[0004] After searching: there is no literature report on the




Embodiment Construction

[0013] Design workflow of the PDF document information extraction system:

[0014] Design of DTD (Document Type Definition)

[0015] To better expose the semantic information in a PDF document, the first step is to write the DTD, which defines the rules for, and the relationships between, the elements and symbols of the XML document. We take Simplified DocBook, a subset of the most commonly used DocBook elements, as a reference. Based on the chapter structure and terminology typical of scientific papers, we analyze and select the following two types of basic information:

[0016] (1) External information metadata (Articleinfo): metadata describing the external characteristics of a scientific or technical paper, including author, address (author address), publishing (publication information), bibliography (references), etc. External information metadata is an important basis for users' information retrieval.
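
As a rough illustration of the external information metadata described in (1), the following Python sketch assembles an `articleinfo` element with the standard-library `xml.etree.ElementTree` module. The element names follow the terms listed above, but their exact spelling and nesting in the patent's DTD are not shown here, so the layout (including the `reference` child element) and the `build_articleinfo` helper are assumptions, and all values are placeholders.

```python
import xml.etree.ElementTree as ET


def build_articleinfo(authors, address, publishing, references):
    """Build a hypothetical <articleinfo> element from extracted metadata.

    The element names mirror the terms in the text; the nesting is an
    assumption, not the DTD defined by the patent.
    """
    info = ET.Element("articleinfo")
    for name in authors:
        ET.SubElement(info, "author").text = name
    ET.SubElement(info, "address").text = address
    ET.SubElement(info, "publishing").text = publishing
    # Group the individual references under one bibliography element.
    bibliography = ET.SubElement(info, "bibliography")
    for ref in references:
        ET.SubElement(bibliography, "reference").text = ref
    return info


# Placeholder values only; they do not come from any real paper.
info = build_articleinfo(
    authors=["First Author", "Second Author"],
    address="Example University",
    publishing="Example Journal",
    references=["Reference one", "Reference two"],
)
xml_text = ET.tostring(info, encoding="unicode")
print(xml_text)
```

Keeping each metadata field in its own element is what lets a retrieval system query by author or by publication without parsing free text.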

...


Abstract

The invention discloses a method for an XML-based PDF document information extraction system. It is an information conversion method in the field of information technology and comprises the following steps: (1) designing the DTD, i.e. first analyzing and selecting the external information metadata and the internal information metadata; (2) extracting the semantic information of the PDF document, i.e. first extracting and decoding the content stream of each page stored in the PDF document, then converting the physical structure of the PDF document into a logical structure, and finally extracting the external information metadata and the internal information metadata; (3) generating an XML document. The resulting XML document can be processed further, which improves the efficiency of automatic document classification and of users' information retrieval.
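
Steps (2) and (3) can be pictured with a minimal Python sketch. It is a toy under stated assumptions, not the patented implementation: it assumes the page content stream is compressed with the FlateDecode (zlib) filter, reads only simple literal strings shown with the `Tj` operator, and maps physical to logical structure with a placeholder rule (first string is the title). All function names and the output element names are hypothetical.

```python
import re
import zlib
import xml.etree.ElementTree as ET


def decode_content_stream(raw):
    """Decode a page content stream, assuming the FlateDecode (zlib) filter."""
    return zlib.decompress(raw)


def extract_text_lines(content):
    """Collect literal strings shown by Tj operators, in stream order.

    Simplification: ignores TJ arrays, escape sequences, fonts and encodings.
    """
    matches = re.findall(rb"\((.*?)\)\s*Tj", content)
    return [m.decode("latin-1") for m in matches]


def to_logical_structure(lines):
    """Placeholder physical-to-logical rule: first line is the title."""
    return {"title": lines[0] if lines else "", "paragraphs": lines[1:]}


def generate_xml(doc):
    """Serialize the logical structure as an XML string."""
    article = ET.Element("article")
    ET.SubElement(article, "title").text = doc["title"]
    for paragraph in doc["paragraphs"]:
        ET.SubElement(article, "para").text = paragraph
    return ET.tostring(article, encoding="unicode")


# Synthetic content stream standing in for one page of a PDF document.
sample = zlib.compress(b"BT (A Sample Title) Tj (First paragraph.) Tj ET")
lines = extract_text_lines(decode_content_stream(sample))
xml_out = generate_xml(to_logical_structure(lines))
print(xml_out)
```

A real extractor would need layout cues such as font size and position to recover the logical structure; the first-line rule here only marks where that step sits in the pipeline.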

Description

Technical field:

[0001] The present invention is an information conversion method in the field of information technology. Specifically, it is a method for an XML-based PDF document information extraction system.

Background technique:

[0002] The structured document format PDF was proposed by Adobe Corporation of the United States. Thanks to its excellent characteristics, PDF has become an ideal format for distributing electronic documents and disseminating formatted information on the Internet. Submitting scientific and technical papers online in PDF format is becoming increasingly common, for example in the Wanfang database. However, PDF focuses on describing the printed layout of a document and does not describe the data structure of the original document content. This has become a bottleneck for information retrieval. Therefore, the research on information extraction from PDF is very im...


Application Information

IPC(8): G06F17/30
Inventors: 张文德, 宋艳娟, 杨传耀, 朱丹红, 陈俊林
Owner: FUZHOU UNIV