Method of PDF file information extraction system based on XML

A technology for extracting system and document information, applied in the field of information conversion. It addresses the absence of any published report on a PDF document information extraction system and improves efficiency.

Inactive Publication Date: 2005-10-26

FUZHOU UNIV

Cites: 0 | Cited by: 55


AI Technical Summary

Problems solved by technology

[0004] A literature search found no prior report of a method for an XML-based PDF document information extraction system.


Embodiment Construction

[0013] The workflow of PDF document information extraction system design:

[0014] Design of DTD (Document Type Definition)

[0015] To better expose the semantic information in a PDF document, the first step is to write the DTD that governs the elements and symbols of the XML document and the relationships among them. Taking Simplified DocBook (a subset of the commonly used DocBook elements) as a reference, and considering the chapter structure and terminology typical of scientific papers, we select the following two types of basic information:

[0016] (1) External information metadata (Articleinfo): metadata describing the external characteristics of a scientific or technical paper, including author (author), address (author address), edition (publication information), bibliography (references), etc. External information metadata is an important basis for information retrieval by users.
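As an illustration of the external information metadata described in [0016], the following minimal sketch builds an XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`articleinfo`, `author`, `address`, `edition`, `bibliography`) follow the terms in the paragraph above; the nesting, the `reference` child element, and the sample values are assumptions made for illustration and are not taken from the patent's DTD.

```python
import xml.etree.ElementTree as ET


def build_articleinfo(authors, address, edition, references):
    """Return an <articleinfo> element holding external metadata.

    The layout is a hypothetical one, not the patent's actual DTD.
    """
    info = ET.Element("articleinfo")
    for name in authors:
        ET.SubElement(info, "author").text = name
    ET.SubElement(info, "address").text = address
    ET.SubElement(info, "edition").text = edition
    # "reference" is an assumed child element name for each bibliography entry.
    bibliography = ET.SubElement(info, "bibliography")
    for ref in references:
        ET.SubElement(bibliography, "reference").text = ref
    return info


# Placeholder values, used only to show the shape of the output.
info = build_articleinfo(
    authors=["A. Author", "B. Author"],
    address="Example University",
    edition="Example Journal, 2005",
    references=["Reference one", "Reference two"],
)
xml_text = ET.tostring(info, encoding="unicode")
print(xml_text)
```

Keeping each metadata item in its own element is what lets a retrieval system query fields such as author or address directly.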

...

Abstract

The invention discloses a method for an XML-based PDF document information extraction system. It is an information conversion method in the field of information technology and comprises the following steps: (1) designing the DTD, i.e. first analyzing and selecting the external information metadata and the internal information metadata; (2) extracting the semantic information of the PDF document, i.e. first extracting and decoding the content stream of each page stored in the PDF document, then converting the physical structure of the PDF document into a logical structure, and finally extracting the external and internal information metadata; (3) generating an XML document. The resulting XML document can be processed further, which raises the efficiency of automatic document classification and of user information retrieval.
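The steps in the abstract can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the patented method: it assumes the page content stream is Flate-compressed (decodable with `zlib`), that text is shown only by simple `(text) Tj` operators, and that the first text line is the title. The function names and the `article`/`title`/`para` element names are hypothetical.

```python
import re
import zlib
import xml.etree.ElementTree as ET


def decode_content_stream(raw):
    """Decode a Flate-compressed page content stream into text."""
    return zlib.decompress(raw).decode("latin-1")


def extract_text_lines(content):
    """Collect the strings shown by simple `(text) Tj` operators."""
    return re.findall(r"\(([^()\\]*)\)\s*Tj", content)


def to_logical_structure(lines):
    """Toy physical-to-logical mapping: first line is the title, the rest are paragraphs."""
    return {"title": lines[0] if lines else "", "paragraphs": lines[1:]}


def generate_xml(doc):
    """Serialize the logical structure as a small XML document."""
    article = ET.Element("article")
    ET.SubElement(article, "title").text = doc["title"]
    for paragraph in doc["paragraphs"]:
        ET.SubElement(article, "para").text = paragraph
    return ET.tostring(article, encoding="unicode")


# A hand-made content stream stands in for a real PDF page.
sample = zlib.compress(
    b"BT /F1 18 Tf (A Sample Title) Tj ET\n"
    b"BT /F1 10 Tf (First paragraph.) Tj ET"
)
lines = extract_text_lines(decode_content_stream(sample))
xml_out = generate_xml(to_logical_structure(lines))
print(xml_out)
```

A real PDF requires parsing the object and cross-reference structure, handling other stream filters and text operators, and using layout cues such as font size and position to infer the logical structure; the sketch only shows how the stages connect.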

Description

Technical field:

[0001] The present invention is an information conversion method in the field of information technology. Specifically, it is a method for an XML-based PDF document information extraction system.

Background technique:

[0002] The structured document format PDF was proposed by Adobe of the United States. Owing to its excellent characteristics, PDF has become an ideal format for distributing electronic documents and disseminating formatted information on the Internet. It is increasingly common to submit scientific and technical papers on the Internet in PDF format, for example in the Wanfang database. However, PDF focuses on describing the printed layout of a document and does not describe the data structure of the original document content. This has become a bottleneck for information retrieval. Therefore, the research on information extraction from PDF is very im...


Application Information


IPC(8): G06F17/30

Inventor: 张文德, 宋艳娟, 杨传耀, 朱丹红, 陈俊林

Owner: FUZHOU UNIV
