Method and system for intelligently extracting document structure

A document structure extraction technology, applied in special data processing applications, instruments, and electronic digital data processing. It addresses the problem that style-based structuring cannot correctly extract document fragments that lack paragraph styles, and it achieves strong flexibility.

Inactive Publication Date: 2011-06-22
PEKING UNIV FOUNDER GRP CO LTD +1

AI Technical Summary

Problems solved by technology

The existing method structures a document entirely according to paragraph styles, so it can extract content only from documents whose paragraph styles are set and cannot correctly extract document fragments that lack paragraph styles.
In other words, it can structure documents only in a specific format and cannot be applied to the structured processing of arbitrary document formats.



Examples


Example 1

[0025] Figure 1 is a flow chart of the method for intelligently extracting document structure according to the first embodiment of the present invention. Referring to Figure 1, the method includes the following steps:

[0026] Step S1, small sample analysis step

[0027] In this step, the extraction rules for each part, the corresponding structured keywords, and the hierarchical relationship between the structured keywords are established according to the content of each part contained in a sample of the document to be extracted and its key attributes. That is, the extraction rules and structured keywords established for each part should reflect the content and/or key attributes of that part.

[0028] The key attributes may include, but are not limited to, font style, paragraph style, text attributes, and heading level. The extraction rules can be set according to the text content of each part of the sample, and can also be, but are not limited ...
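
The rule-setting idea in step S1 can be illustrated with a minimal sketch. It is not the patent's implementation: the names Fragment, Rule, RULES and classify, the attribute fields, and the three sample rules are all hypothetical and chosen only to show how a rule might tie a structured keyword to text content or to a key attribute such as heading level or paragraph style.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Fragment:
    """One piece of document content plus its key attributes (hypothetical model)."""
    text: str
    font_style: str = ""
    paragraph_style: str = ""
    heading_level: Optional[int] = None


@dataclass
class Rule:
    keyword: str                         # structured keyword assigned on a match
    parent: Optional[str]                # parent keyword in the hierarchy, None for the root
    matches: Callable[[Fragment], bool]  # test on text content or key attributes


# Illustrative rules only; a real rule set would come from analysing a sample.
RULES = [
    Rule("title", None, lambda f: f.heading_level == 1),
    Rule("chapter", "title", lambda f: f.text.startswith("Chapter ")),
    Rule("body", "chapter", lambda f: f.paragraph_style == "Normal"),
]


def classify(fragment: Fragment, rules=RULES) -> Optional[str]:
    """Return the keyword of the first rule that matches, or None if none does."""
    for rule in rules:
        if rule.matches(fragment):
            return rule.keyword
    return None
```

Note that the "chapter" rule relies on text content alone, so under these assumptions it would still match a fragment that carries no paragraph style at all.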

Example 2

[0045] This embodiment differs from the first embodiment in that the sample or document is first converted into a logical tree as an intermediate result, and a unified method is then applied to this consistently specified logical tree to structure it. In this way, documents of any format can be structured in a uniform way.
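
As a rough sketch of the intermediate-tree idea, the following assumes two hypothetical input shapes (styled paragraphs and plain text lines) and converts both into the same Node tree before a single extract function is applied. Node, from_styled_paragraphs, from_plain_lines and extract are illustrative names, not taken from the patent, and the flat one-level tree is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of the hypothetical logical tree shared by all input formats."""
    kind: str
    text: str = ""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def from_styled_paragraphs(paragraphs):
    """Convert (style, text) pairs, as a styled format might expose them."""
    root = Node("document")
    for style, text in paragraphs:
        root.children.append(Node("paragraph", text, {"style": style}))
    return root


def from_plain_lines(lines):
    """Convert plain text lines, which carry no style information."""
    root = Node("document")
    for line in lines:
        if line.strip():
            root.children.append(Node("paragraph", line.strip()))
    return root


def extract(tree, rules):
    """Apply the same (keyword, predicate) rules to any logical tree."""
    results = []
    for node in tree.children:
        for keyword, predicate in rules:
            if predicate(node):
                results.append((keyword, node.text))
                break
    return results
```

Because extract sees only Node objects, supporting a further input format under this sketch would mean adding one more converter and leaving the rules untouched.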

[0046] Figure 4 is a flow chart of the method for intelligently extracting document structure according to the second embodiment of the present invention. Referring to Figure 4, the method includes the following steps:

[0047] Step S41, small sample analysis step


Abstract

The invention provides a method for intelligently extracting a document structure. The method comprises the following steps: analyzing a document sample and establishing extraction rules and the corresponding structured keywords; and extracting the contents of the document whose structure is to be extracted by using the established extraction rules, so as to form structured contents expressed according to the structured keywords. Correspondingly, the invention provides a system for intelligently extracting the document structure. The system comprises a document input unit, an analysis unit, a structuring unit, a user setting interface, and a document output unit. Simple extraction rules are set according to attributes in a document such as styles (including character styles and paragraph styles), character attributes, character contents, and title levels, and the structured information in the document is intelligently extracted according to these rules, so that automatic structured processing of any document format is realized. In addition, a user can set an extraction rule through simple operations, which provides high flexibility.
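
One way to picture "structured contents expressed according to the structured keywords" is to nest extracted fragments by the keyword hierarchy. The sketch below is an assumption throughout: the abstract does not name an output format, so XML is used only for illustration, and the PARENT mapping, depth and build_xml are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword hierarchy: child keyword -> parent keyword (None = top level).
PARENT = {"book": None, "chapter": "book", "paragraph": "chapter"}


def depth(keyword):
    """Number of ancestors a keyword has in the hierarchy."""
    d = 0
    while PARENT[keyword] is not None:
        keyword = PARENT[keyword]
        d += 1
    return d


def build_xml(labelled):
    """Nest (keyword, text) pairs, given in document order, by keyword depth."""
    root = ET.Element("document")
    stack = [(-1, root)]  # (depth, element) pairs of the currently open ancestors
    for keyword, text in labelled:
        d = depth(keyword)
        # Close every open element that is at the same depth or deeper.
        while stack[-1][0] >= d:
            stack.pop()
        elem = ET.SubElement(stack[-1][1], keyword)
        elem.text = text
        stack.append((d, elem))
    return root
```

Calling ET.tostring(build_xml(pairs), encoding="unicode") on a list of labelled pairs would then serialize the nested result.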

Description

Technical field

[0001] The invention relates to the field of electronic document data processing, in particular to a method and system for intelligently extracting document structure.

Background technique

[0002] With the in-depth popularization of IT applications, all walks of life have accumulated a large number of information resources, and these information resources are stored in the form of electronic document data. Scientific management and rational development of these internal and external information resources have become the key to making correct decisions and enhancing competitiveness. How to effectively obtain structured content from the electronic document data content of these information resources is also a key problem to be solved in the development of many computer applications. For example, various publishing houses now have a large number of historical book resources, and the formats of the books are various. Publishing houses need to store historical r...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30, G06F17/27
Inventor: 余忠华, 闫国龙, 曹学军, 缪萍, 曾建英
Owner: PEKING UNIV FOUNDER GRP CO LTD