Method for analyzing reading order of electronic layout file

A method for analyzing the reading order of layout documents, applied in the field of information technology, which addresses problems such as ambiguous block division.

Active Publication Date: 2015-01-07
TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)
Cites: 4 | Cited by: 31

AI Technical Summary

Problems solved by technology

The recursive division method used in the existing approach has defects in the vertical direction and is prone to ambiguous block division.




Embodiment Construction

[0030] In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the embodiments and accompanying drawings.

[0031] As shown in Figure 1, the method for analyzing the reading order of an electronic layout file includes the following steps:

[0032] Extract original information from PDF files;

[0033] Identify headers and footers, and merge adjacent text content to obtain text line content;

[0034] Merge the text lines into blocks to obtain text block content;

[0035] Merge adjacent pictures to get the content of the picture block;

[0036] Analyze the path information to obtain the dividing line in the horizontal direction;

[0037] Project the text block content and the image block content in the X direction to obtain the horizontally separated block content;

[0038] Using text block content, image block conte...
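The X-direction projection in step [0037] can be illustrated with a minimal sketch. The source does not give the projection details, so everything here is an assumption made for illustration: the `Box` type, the `split_by_x_projection` helper, the `min_gap` threshold, and the sample coordinates are all hypothetical. The idea shown is only that blocks whose extents overlap on the X axis fall into one group, and a gap wider than the threshold starts a new group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a text or image block."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str = ""


def split_by_x_projection(boxes, min_gap=0.0):
    """Group boxes whose projections onto the X axis overlap or nearly touch.

    A new group starts when the next box begins more than `min_gap`
    to the right of everything seen so far (an assumed threshold).
    """
    groups = []
    right_edge = None
    for box in sorted(boxes, key=lambda b: b.x0):
        if right_edge is None or box.x0 - right_edge > min_gap:
            groups.append([box])
            right_edge = box.x1
        else:
            groups[-1].append(box)
            right_edge = max(right_edge, box.x1)
    return groups


# Illustrative page: two blocks in a left column, one in a right column.
boxes = [
    Box(0, 0, 100, 50, "left-text"),
    Box(10, 60, 90, 120, "left-image"),
    Box(150, 0, 250, 120, "right-text"),
]
columns = split_by_x_projection(boxes, min_gap=5.0)
column_labels = [[b.label for b in group] for group in columns]
```

With these sample boxes, the two left-hand blocks end up in one group and the right-hand block in another. A real layout would also need to handle elements that span several columns, which this sketch does not attempt.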



Abstract

The invention discloses a method for analyzing the reading order of an electronic layout file. The method comprises the following steps of: extracting original information in a PDF file; identifying page headers and page footers, combining adjacent text content, and thereby obtaining line content; performing block combination on the text line, and thereby obtaining text block content; combining adjacent pictures, and thereby obtaining picture block content; analyzing path information, and thereby obtaining a parting line in the horizontal direction; projecting the text block content and the picture block content in an X direction, and thereby obtaining horizontal parting block content; topologically sorting elements consisting of the text block content, the picture block content, the horizontal parting line, forms and physical information of the horizontal parting block content, and thereby obtaining the reading order of the PDF file; identifying the text block content by segments based on the reading order; outputting XML format text.
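The topological sorting step named in the abstract can be sketched as follows. The abstract does not state which precedence rules are used, so the two rules in `precedes` are assumptions chosen for illustration (a block comes first if it lies above another block it overlaps horizontally, or if it lies entirely to the left of a block it overlaps vertically). The function names, the block names, and the coordinates are hypothetical, and the coordinate system is assumed to have its origin at the top-left with y growing downward. The sort itself uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def overlaps(a0, a1, b0, b1):
    """True if the open intervals (a0, a1) and (b0, b1) intersect."""
    return a0 < b1 and b0 < a1


def precedes(a, b):
    """Assumed precedence rules; boxes are (x0, y0, x1, y1), y grows downward."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Rule 1 (assumption): a is above b and they overlap horizontally.
    if overlaps(ax0, ax1, bx0, bx1) and ay1 <= by0:
        return True
    # Rule 2 (assumption): a is entirely left of b and they overlap vertically.
    if ax1 <= bx0 and overlaps(ay0, ay1, by0, by1):
        return True
    return False


def reading_order(blocks):
    """Return block names in an order consistent with `precedes`."""
    sorter = TopologicalSorter()
    for name in blocks:
        sorter.add(name)  # register every block, even if unconstrained
    for a, box_a in blocks.items():
        for b, box_b in blocks.items():
            if a != b and precedes(box_a, box_b):
                sorter.add(b, a)  # a must come before b
    return list(sorter.static_order())


# Illustrative two-column page with a full-width title and caption.
blocks = {
    "caption": (0, 210, 200, 230),
    "right": (105, 30, 200, 200),
    "title": (0, 0, 200, 20),
    "left": (0, 30, 95, 200),
}
order = reading_order(blocks)
```

For this sample page the constraints admit only one order: title, left column, right column, caption. On layouts where the assumed rules contradict each other, `static_order()` raises `graphlib.CycleError`, so a practical implementation would need rules that avoid or resolve such conflicts.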

Description

Technical field

[0001] The invention relates to the field of information technology, in particular to a method for analyzing the reading order of an electronic layout file.

Background art

[0002] PDF (Portable Document Format) is a file format developed by Adobe. Its advantages are that it is cross-platform and preserves the original layout of a file, reproducing the original content and formatting with high fidelity. However, PDF is an unstructured data storage format. When text in PDF files is retrieved, or when PDF is converted to other flow-format files, the extracted text is not output in the reading order of the file: content that comes later in the reading order may appear earlier in the output text.

[0003] The patent application document with the patent application number of 2010105591353 discloses a method for recognizing the layout reading sequence, including: reading the layout to be rec...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/21, G06F17/30
Inventors: 张斌, 张晓博, 张宝亮
Owner: TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)