Method for extracting structured information from continuous-page format documents

A layout-document structuring technology, applied to structured information extraction from continuous-page layout documents. It addresses the problems of ineffective processing, failure to consider the relationships between pages, and low accuracy, and aims to improve both the accuracy and the efficiency of structuring.
CN110704570A · Pending · Publication Date: 2020-01-17 · 北京众信博雅科技有限公司

Patent Information

Authority / Receiving Office
CN · China
Current Assignee / Owner
北京众信博雅科技有限公司
Publication Date
2020-01-17

Abstract

The invention relates to a method for structuring continuous-page format documents. Characters, fonts, font sizes, positions and other information are extracted from the document page by page; headers and footers are recognized and removed in a preprocessing step; footnotes are identified and separated from the body text; and the remaining body text and the footnote content of all pages are each merged into a virtual page. Layout analysis, text block merging, column division and table processing are then carried out on the virtual page to generate a text block table, and an outline is extracted by rule from features of that table such as numbering, font size and alignment, so as to restore the logical structure of the whole file. The method effectively removes interfering text such as headers, footers and footnotes, preserves the reading order across columns, greatly improves the structural correctness of the extracted text, reduces the workload of manual correction, and improves efficiency.
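
The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Line` record, the function names, the margin ratio, the repetition threshold and the heading pattern are all assumptions made for illustration, and footnote segmentation, column division and table processing are left out. It shows only three of the ideas: dropping margin text that repeats across pages, stacking the remaining lines into one virtual page, and picking numbered, larger-font lines as outline entries.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted text line (hypothetical record, not from the patent)."""
    page: int         # 0-based page index
    y: float          # distance from the top of the page
    text: str
    font_size: float


def strip_headers_footers(lines, page_height, margin=0.1, min_share=0.6):
    """Drop margin lines whose digit-normalized text repeats on most pages."""
    n_pages = len({ln.page for ln in lines})

    def in_margin(ln):
        return ln.y < page_height * margin or ln.y > page_height * (1 - margin)

    def key(ln):
        # "Page 1" and "Page 2" both normalize to "Page #".
        return re.sub(r"\d+", "#", ln.text.strip())

    # Count the number of distinct pages on which each margin text appears.
    pages_with_key = Counter(
        k for _, k in {(ln.page, key(ln)) for ln in lines if in_margin(ln)}
    )
    repeated = {k for k, c in pages_with_key.items()
                if c >= max(2, min_share * n_pages)}
    return [ln for ln in lines if not (in_margin(ln) and key(ln) in repeated)]


def merge_to_virtual_page(lines, page_height):
    """Stack pages vertically so the body text forms one continuous page."""
    ordered = sorted(lines, key=lambda ln: (ln.page, ln.y))
    return [Line(0, ln.page * page_height + ln.y, ln.text, ln.font_size)
            for ln in ordered]


def extract_outline(lines):
    """Treat numbered lines set larger than the dominant font size as headings."""
    body_size = Counter(ln.font_size for ln in lines).most_common(1)[0][0]
    numbered = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")
    result = []
    for ln in lines:
        m = numbered.match(ln.text)
        if m and ln.font_size > body_size:
            level = m.group(1).count(".") + 1   # "1" -> 1, "1.1" -> 2
            result.append((level, ln.text))
    return result


# Two made-up pages, each 800 units tall.
sample = [
    Line(0, 20, "Annual Report 2019", 9), Line(0, 100, "1 Overview", 16),
    Line(0, 130, "Body text on page one", 10.5), Line(0, 780, "Page 1", 9),
    Line(1, 20, "Annual Report 2019", 9), Line(1, 100, "continues on page two.", 10.5),
    Line(1, 140, "1.1 Scope", 14), Line(1, 170, "More body text.", 10.5),
    Line(1, 780, "Page 2", 9),
]
body = strip_headers_footers(sample, page_height=800)
virtual = merge_to_virtual_page(body, page_height=800)
headings = extract_outline(virtual)
```

Merging before analysis is what lets a paragraph that breaks across a page boundary ("Body text on page one" followed by "continues on page two.") end up as adjacent lines on the virtual page, where a later block-merging step could join them.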

Description

Technical field

[0001] The invention relates to the field of information extraction from format documents, and in particular to a method for extracting structured information from continuous-page format documents.

Background

[0002] A layout document format is an electronic document format with a fixed layout rendering. The presentation of a layout document is device-independent: whether it is read or printed on different devices, the rendered layout is the same. Format documents are mainly used for the release, dissemination and archiving of written documents. Common layout document formats include PDF, CEBX and OFD. A layout document format defines the presentation data of multiple pages, including the position, color, font size and other attributes of the objects on each page (text, images, graphics, etc.), and presents the document content in a form suited to human reading. The layout document stores unstructured data, without rec...

Claims
