Method for converting PDF file to XML file

A document conversion technology, applied in the field of information conversion, that addresses the absence of a known method for converting PDF documents to XML and improves the efficiency of sorting and searching documents.

Inactive Publication Date: 2007-11-07
FUZHOU UNIV

AI Technical Summary

Problems solved by technology

[0007] A search of the prior art found no information extraction system comprising an intermediate document generation module, a rule generation module, and an automatic extraction module, and no corresponding method for converting a PDF document to an XML document.



Examples


Embodiment Construction

[0025] 1. Specific design and implementation of the modules

[0026] 1. Intermediate document generation module:

[0027] The intermediate document generation module 7 converts the PDF source document 1 into an easy-to-handle intermediate format. Rule-based automatic conversion to an XML document is then performed on that intermediate format.

[0028] The implementation of this module has two key points:

[0029] (1) Definition of the structure of the intermediate document.

[0030] The structure of the intermediate document must meet two requirements. First, it must be able to describe the format characteristics and layout structure of the source document, since these are the basis for rule matching in the automatic extraction module 9. Second, the conversion from the PDF document to the intermediate document should preferably be relatively easy to carry out.
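The two requirements above can be illustrated with a minimal sketch. The source does not disclose the actual intermediate schema, so the element and attribute names used here (document, page, block, x, y, font, size) are assumptions chosen for illustration, as is the shape of the input dictionaries. The sketch assumes a PDF parser has already produced a flat list of text blocks, and it simply wraps each block with its format and layout attributes.

```python
import xml.etree.ElementTree as ET


def build_intermediate(blocks):
    """Wrap parsed text blocks in a hypothetical intermediate XML layout.

    Each block keeps its layout position (page, x, y) and its format
    features (font, size), so that later rule matching can rely on them.
    """
    root = ET.Element("document")
    pages = {}
    for b in blocks:
        page = pages.get(b["page"])
        if page is None:
            # Create one <page> element per page number, in order of appearance.
            page = ET.SubElement(root, "page", number=str(b["page"]))
            pages[b["page"]] = page
        block = ET.SubElement(
            page,
            "block",
            x=str(b["x"]),
            y=str(b["y"]),
            font=b["font"],
            size=str(b["size"]),
        )
        block.text = b["text"]
    return root


# Hypothetical parser output for the first page of a sample document.
sample_blocks = [
    {"page": 1, "x": 72, "y": 90, "font": "Times-Bold", "size": 18,
     "text": "A Sample Title"},
    {"page": 1, "x": 72, "y": 130, "font": "Times-Italic", "size": 11,
     "text": "A. Author"},
]

intermediate = build_intermediate(sample_blocks)
print(ET.tostring(intermediate, encoding="unicode"))
```

A flat list of attributed blocks keeps the PDF-to-intermediate step simple, which matches the second requirement, while still exposing the format and layout features that the first requirement calls for.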

[0031] (2) Design a parser for PDF documents to generate intermediate documents that meet the above req...



Abstract

The method comprises the following modules and steps: (1) intermediate document generation module: based on the mapping between semantic items and text blocks, the system automatically generates an intermediate XML file that marks up the semantic items and the features of the information blocks in the content of a PDF sample file; (2) rule generation module: analyzes and processes the PDF sample file, calls the intermediate XML file produced by the intermediate document generation module, reads the content of the PDF source file through a file parser, and converts it into a rule XSLT file; (3) automatic extraction module: using the received rule XSLT file, obtains a target XML file that conforms to the target DTD and carries semantic information. The converted XML file can then be processed further, which raises the efficiency of automatic document sorting and of users' information searches.
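The flow through the three modules can be sketched roughly as follows. This is not the patented implementation: the source expresses its rules as an XSLT file and validates the result against a target DTD, whereas this sketch swaps in plain Python dictionaries as rules and performs no DTD validation, so that it runs with the standard library alone. The intermediate format, the function names generate_rules and extract, and the sample mapping are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML for a sample document (step 1 output).
INTERMEDIATE = """<document>
  <page number="1">
    <block font="Times-Bold" size="18">A Sample Title</block>
    <block font="Times-Italic" size="11">A. Author</block>
    <block font="Times-Roman" size="10">Body text of the sample.</block>
  </page>
</document>"""


def generate_rules(sample_root, mapping):
    """Step 2 stand-in: derive one format-based rule per semantic item.

    `mapping` ties a semantic item name to the index of the text block
    that carries it in the sample. The source emits XSLT here instead.
    """
    blocks = sample_root.findall(".//block")
    rules = []
    for item, block_index in mapping.items():
        b = blocks[block_index]
        rules.append({"item": item, "font": b.get("font"), "size": b.get("size")})
    return rules


def extract(source_root, rules, root_tag="record"):
    """Step 3 stand-in: apply the rules and build the target XML."""
    target = ET.Element(root_tag)
    for rule in rules:
        for b in source_root.iter("block"):
            # The first block whose format features match the rule wins.
            if b.get("font") == rule["font"] and b.get("size") == rule["size"]:
                ET.SubElement(target, rule["item"]).text = b.text
                break
    return target


sample = ET.fromstring(INTERMEDIATE)
rules = generate_rules(sample, {"title": 0, "author": 1})
result = extract(sample, rules)
print(ET.tostring(result, encoding="unicode"))
```

Here the rules are learned from and applied to the same sample for brevity. In the workflow the abstract describes, rules derived from a sample file would be applied to other source documents that share its layout.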

Description

Technical field:

[0001] The present invention is an information conversion method in the field of information technology. Specifically, it is a method for an information extraction system comprising an intermediate document generation module, a rule generation module, and an automatic extraction module.

Background art:

[0002] With the development of Web technology, ever more information is presented to users. How to handle massive information resources is an important topic in digital library research. To develop and use network information resources effectively, operations such as information classification and retrieval are required. All such information processing operations involve extracting document information. Document information extraction is the process of extracting a specified type of information from a piece of text and forming it into structured data in a database for use...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 张文德, 宋艳娟, 陈振标, 杨传耀, 陈俊林, 朱丹红
Owner: FUZHOU UNIV