Method for extracting structured information of continuous page format document

A document-structuring technique for layout documents, applied to structured information extraction from continuous page layout documents. It addresses ineffective processing, failure to consider the correlation between pages, and low accuracy, and it improves both structuring accuracy and efficiency.

Pending Publication Date: 2020-01-17
北京众信博雅科技有限公司
Cites: 8 | Cited by: 6

AI Technical Summary

Problems solved by technology

[0004] The patent document "A Method and Device for Extracting Structured Information from PDF Documents" (Application No. CN201710576556) describes a structured extraction method suitable for multi-page documents. However, it processes the document page by page and does not take into account the correlation between pages, such as alignment across different pages and footnotes on a page. As a result, it cannot effectively handle paragraphs, columns, tables and similar content that span pages, and its accuracy is low.



Embodiment Construction

[0013] Exemplary implementations of the present invention are described below in conjunction with the accompanying drawings. The description includes various details of the implementations to facilitate understanding, and these should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Likewise, for the sake of clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description. A method for structuring documents in continuous page format includes the following steps:

[0014] 1. Parse the layout document and obtain its page information and text block information page by page, where:

[0015] a) Page information includes page size information

[0016] b) Text block information includes character internal code,...
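The per-page records of step 1 can be sketched as two small data types. This is a minimal sketch, not the patent's actual data model: the class names `PageInfo` and `TextBlock`, the helper `char_codes`, and the coordinate convention are hypothetical. The page size and character code fields follow items a) and b); the font, font size and position fields are taken from the abstract, because the list in b) is truncated in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    """Page information: item a) names the page size."""
    index: int      # hypothetical: 0-based page number
    width: float    # page width in layout units
    height: float   # page height in layout units


@dataclass
class TextBlock:
    """Text block information; fields beyond the character code are
    assumptions based on the abstract (font, font size, position)."""
    page: int         # index of the page the block was read from
    text: str         # decoded characters of the block
    font: str         # font name
    font_size: float  # font size
    x: float          # left edge; assumed origin at top-left of the page
    y: float          # top edge; assumed to grow downward


def char_codes(block: TextBlock) -> List[int]:
    """Return the code point of each character in the block."""
    return [ord(ch) for ch in block.text]
```

Keeping the page index on every block lets later steps relate blocks from different pages to each other, which is the point the method makes against purely page-by-page processing.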


Abstract

The invention relates to a method for structuring a continuous page layout document. Characters, fonts, font sizes, positions and other information are extracted from the document page by page. Headers and footers are recognized and removed through preprocessing, and footnotes are identified and separated. The remaining body text and the footnote content of the multiple pages are each merged into a virtual page. Layout analysis, text block merging, column division and table processing are carried out on the virtual page to generate a text block table. Outline extraction is then carried out according to rules, using features of the text block table such as numbering, font size and alignment, so as to restore the logical structure of the whole file. With this method, interfering text such as headers, footers and footnotes can be effectively removed, the reading order of columns is preserved, the structural correctness of the text is greatly improved, the workload of manual correction is reduced, and efficiency is improved.
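Two of the steps in the abstract, header and footer removal and merging pages into a virtual page, can be illustrated with a short sketch. The abstract does not say how headers and footers are recognized or how pages are merged, so everything below is an assumption: the names `Block`, `strip_headers_footers` and `merge_to_virtual_page`, the repetition heuristic (text in the top or bottom margin that recurs on most pages once digits are normalized), and stacking pages by adding a multiple of the page height to each `y`. Footnote separation, column division and table processing are not covered.

```python
import re
from collections import Counter
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Block:
    page: int   # 0-based page index
    text: str   # block text
    y: float    # top edge within the page; y grows downward (assumption)


def strip_headers_footers(blocks: List[Block], page_height: float,
                          margin: float = 0.1,
                          min_share: float = 0.6) -> List[Block]:
    """Drop margin blocks whose digit-normalized text repeats across pages."""
    def in_margin(b: Block) -> bool:
        return b.y < page_height * margin or b.y > page_height * (1 - margin)

    def key(b: Block) -> str:
        # "3" and "4" both become "#", so running page numbers match.
        return re.sub(r"\d+", "#", b.text.strip())

    pages = {b.page for b in blocks}
    # Count each normalized text at most once per page.
    seen = {(key(b), b.page) for b in blocks if in_margin(b)}
    counts = Counter(k for k, _ in seen)
    threshold = max(2, min_share * len(pages))
    return [b for b in blocks
            if not (in_margin(b) and counts[key(b)] >= threshold)]


def merge_to_virtual_page(blocks: List[Block],
                          page_height: float) -> List[Block]:
    """Stack pages vertically into one coordinate space, sorted top to bottom."""
    shifted = (replace(b, y=b.y + b.page * page_height) for b in blocks)
    return sorted(shifted, key=lambda b: b.y)
```

Once the blocks share one coordinate space, a paragraph that ends at the bottom of one page and continues at the top of the next becomes a pair of vertically adjacent blocks, so the same merging rules used within a page can be applied across the page break.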

Description

Technical field

[0001] The invention relates to the field of information extraction from layout documents, in particular to a method for extracting structured information from continuous page layout documents.

Background

[0002] The layout document format is an electronic document format with a fixed layout rendering effect. The presentation of a layout document is independent of the device: when it is read or printed on various devices, the layout rendering results are consistent. Layout documents are mainly used in the release, dissemination and archiving of written documents. Common layout document formats include PDF, CEBX, OFD, etc. The layout document format defines the layout presentation data of multiple pages, including the presentation position, color, font size and other information of the internal objects (text, images, graphics, etc.), so as to present document content in a human-readable form. The layout document stores unstructured data, without rec...


Application Information

IPC(8): G06F16/31, G06F40/279, G06F40/258, G06F40/205
CPC: G06F16/313
Inventor: 徐剑波, 张诗玉, 王磊, 赵东岩
Owner: 北京众信博雅科技有限公司