Method and device for extracting document structure

A document structure extraction technology, applied in the fields of special data processing applications, instruments, and electronic digital data processing. It addresses the problem of cumbersome operations and improves the efficiency of document structure extraction.

Status: Inactive
Publication Date: 2013-01-02
PEKING UNIV FOUNDER GRP CO LTD +1

AI Technical Summary

Problems solved by technology

[0005] The present invention aims to provide a method and device for extracting document structure, so as to solve the problem of cumbersome operations in the related art.

Detailed Description of the Embodiments

[0013] The present invention will be described in detail below with reference to the accompanying drawings and in combination with embodiments.

[0014] Figure 1 is a flowchart of a method for extracting a document structure according to an embodiment of the present invention. The method includes:

[0015] Step S10, obtaining the object of the document;

[0016] Step S20, converting the object into a predefined standard format;

[0017] Step S30, identifying and labeling each item in the object in the standard format;

[0018] Step S40, extracting the content of each matched item to organize structured data about the document.
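
The four steps S10 to S40 can be sketched as a small pipeline. This is only a minimal illustration, not the patented implementation: the dictionary layout of the input document, the `StandardObject` class standing in for the "predefined standard format", and the regular-expression rules in `RULES` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandardObject:
    """Hypothetical 'standard format': normalized text plus an optional label."""
    text: str
    label: Optional[str] = None


def obtain_objects(document: dict) -> list:
    """S10: obtain the objects of the document (assumed to sit under 'objects')."""
    return document["objects"]


def to_standard(raw: dict) -> StandardObject:
    """S20: convert a format-specific object into the standard format."""
    if raw["format"] == "pdf":       # assumed: PDF-like objects carry 'content'
        text = raw["content"]
    elif raw["format"] == "word":    # assumed: Word-like objects carry 'runs'
        text = "".join(raw["runs"])
    else:
        raise ValueError(f"unsupported format: {raw['format']}")
    return StandardObject(text=text.strip())


# Hypothetical identification rules: (label, pattern) pairs, first match wins.
RULES = [
    ("title", re.compile(r"^Chapter \d+")),
    ("author", re.compile(r"^By ")),
]


def label_items(objects: list) -> list:
    """S30: identify and label each item that matches a rule."""
    for obj in objects:
        for name, pattern in RULES:
            if pattern.search(obj.text):
                obj.label = name
                break
    return objects


def extract(objects: list) -> dict:
    """S40: collect the content of labeled items into structured data."""
    structured = {}
    for obj in objects:
        if obj.label is not None:
            structured.setdefault(obj.label, []).append(obj.text)
    return structured


def extract_structure(document: dict) -> dict:
    """Run S10 through S40 in order."""
    standard = [to_standard(raw) for raw in obtain_objects(document)]
    return extract(label_items(standard))


doc = {
    "objects": [
        {"format": "pdf", "content": " Chapter 1 Introduction "},
        {"format": "word", "runs": ["By ", "Jane Doe"]},
        {"format": "pdf", "content": "Some body text."},
    ]
}
result = extract_structure(doc)
```

Because every object passes through `to_standard` before labeling, the rules in S30 are written once and apply regardless of the source format.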

[0019] Commonly used electronic documents come in various formats, such as PDF and Word. Existing document structure recognition technology cannot handle objects in documents of different formats in a uniform way, so a separate processing method and system must be used for each document format. This is cumbersome, imposes a heavy workload, ...

Abstract

The invention provides a method for extracting a document structure. The method comprises the following steps: acquiring an object of a document; converting the object into a predefined standard format; identifying and marking items in the object in the standard format; and extracting contents of the matched items to form structural data relevant to the document. The invention also provides a device for extracting the document structure. The device comprises an acquisition module for acquiring the object of the document, a conversion module for converting the object into the predefined standard format, a marking module for identifying and marking the items in the object in the standard format, and an extraction module for extracting contents of the matched items to form structural data relevant to the document. By the method and the device, an effect of improving the efficiency for document structure extraction is achieved.
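
One way to picture the device is as four pluggable modules wired together in order. The sketch below is an assumption-laden illustration rather than the patented design: the class name `StructureExtractionDevice`, the callable signatures, and the toy modules used in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StructureExtractionDevice:
    """Hypothetical device composed of the four modules named in the abstract."""
    acquire: Callable[[Any], list]          # acquisition module
    convert: Callable[[Any], Any]           # conversion module
    mark: Callable[[Any], Optional[str]]    # marking module: label, or None if no match
    extract: Callable[[list], dict]         # extraction module

    def run(self, document: Any) -> dict:
        objects = self.acquire(document)
        standard = [self.convert(obj) for obj in objects]
        marked = [(self.mark(item), item) for item in standard]
        # Only items that received a label are passed to extraction.
        return self.extract([(label, item) for label, item in marked if label is not None])


def group_by_label(pairs: list) -> dict:
    """Toy extraction module: group item contents under their labels."""
    grouped = {}
    for label, item in pairs:
        grouped.setdefault(label, []).append(item)
    return grouped


# Toy modules: lines are the objects, upper-case lines are marked as headings.
device = StructureExtractionDevice(
    acquire=lambda doc: doc.splitlines(),
    convert=lambda line: line.strip(),
    mark=lambda text: "heading" if text.isupper() else None,
    extract=group_by_label,
)
structured = device.run("  INTRODUCTION \nbody text\n CONCLUSION")
```

Keeping each module behind a plain callable means a format-specific conversion module can be swapped in without touching the marking or extraction modules.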

Description

Technical field

[0001] The present invention relates to the field of digital publishing, and in particular to a method and device for extracting document structure.

Background

[0002] In traditional publishing, the document formats of books and newspapers serve only the needs of printing. They describe content solely through visual elements such as text, graphics, image outlines, colors, and positions, and carry no information about the logical content or internal relationships of the document. Digital publishing, by contrast, pays more attention to the logical content, association relationships, and content granularity of documents. Structured processing of documents is a prerequisite for reusing digital content.

[0003] At present, structured processing of document content is mainly done manually. According to the predefined rules, the processing personnel visually identify the document content in the document that conforms to...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 曲刚
Owner: PEKING UNIV FOUNDER GRP CO LTD