A method for analyzing the reading order of formatted documents in electronic files

A technique for analyzing the reading order of formatted (fixed-layout) documents, applied in the field of information technology, which addresses problems such as ambiguous block division.

Active Publication Date: 2018-02-09
TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)

AI Technical Summary

Problems solved by technology

The recursive division method used in existing approaches has defects in the vertical direction and is prone to ambiguous block division.

Embodiment Construction

[0030] In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with embodiments and drawings.

[0031] As shown in Figure 1, the method for analyzing the reading order of an electronic formatted file includes the following steps:

[0032] Extract the original information in the PDF file;

[0033] Identify the header and footer, and merge the adjacent text content to get the line content;

[0034] Perform block merging on the text line content to obtain the text block content;

[0035] Combine adjacent pictures to obtain the content of the picture block;

[0036] Analyze the path information to obtain the dividing line in the horizontal direction;

[0037] Project the text block content and the picture block content in the X direction to obtain the horizontally separated block content;

[0038] Take text block content, image block content, horizontal dividi...
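
Steps [0033] and [0037] can be illustrated with a minimal sketch. The source does not specify data structures or thresholds, so everything named here is an assumption made for illustration: the `Box` type, the functions `merge_into_lines` and `project_x`, the tolerances `y_tol` and `x_gap`, and a coordinate system with the origin at the top left. The sketch groups text fragments that share roughly the same vertical position into lines, then projects the resulting boxes onto the X axis and merges overlapping intervals, which is one plausible reading of "projection in the X direction".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; origin assumed at the top-left corner."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def merge_into_lines(fragments: List[Box], y_tol: float = 2.0,
                     x_gap: float = 5.0) -> List[Box]:
    """Group fragments with similar y0 into rows, then join close neighbours."""
    rows: List[List[Box]] = []
    for frag in sorted(fragments, key=lambda b: b.y0):
        # A fragment joins the current row if its top edge is near the row's first item.
        if rows and abs(frag.y0 - rows[-1][0].y0) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    lines: List[Box] = []
    for row in rows:
        row.sort(key=lambda b: b.x0)
        current = row[0]
        for frag in row[1:]:
            if frag.x0 - current.x1 <= x_gap:
                # Close enough horizontally: extend the current line.
                current = Box(current.x0, min(current.y0, frag.y0),
                              max(current.x1, frag.x1), max(current.y1, frag.y1),
                              current.text + frag.text)
            else:
                lines.append(current)
                current = frag
        lines.append(current)
    return lines


def project_x(blocks: List[Box]) -> List[Tuple[float, float]]:
    """Project boxes onto the X axis and merge overlapping intervals."""
    spans: List[Tuple[float, float]] = []
    for b in sorted(blocks, key=lambda b: b.x0):
        if spans and b.x0 <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], b.x1))
        else:
            spans.append((b.x0, b.x1))
    return spans


fragments = [
    Box(0, 10, 20, 20, "Hel"),
    Box(21, 10.5, 40, 20, "lo"),
    Box(100, 10, 130, 20, "World"),
    Box(0, 30, 50, 40, "Next"),
]
lines = merge_into_lines(fragments)
spans = project_x(lines)
```

In this example the first two fragments merge into one line, while the third stays separate because the horizontal gap exceeds `x_gap`. The gap between the two X-axis intervals in `spans` is the kind of whitespace that could separate side-by-side blocks; real documents would need thresholds tuned to font size and page width.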

Abstract

The invention discloses a method for analyzing the reading order of an electronic formatted file. The method includes the following steps: extracting the original information in a PDF file; identifying page headers and footers, and merging adjacent text content to obtain text line content; merging the text line content into blocks to obtain text block content; merging adjacent pictures to obtain picture block content; analyzing the path information to obtain horizontal dividing lines; projecting the text block content and the picture block content in the X direction to obtain horizontally separated block content; taking the text block content, the picture block content, the horizontal dividing lines, the tables and the physical information of the horizontally separated block content as elements and performing topological sorting to obtain the reading order of the PDF file; performing segmentation recognition on the text block content based on the reading order; and outputting text in XML format.
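
The topological sorting step can be sketched as follows. The source states only that the layout elements are topologically sorted; the precedence rule used here (an element comes first if it lies above another that it overlaps horizontally, or to the left of another that it overlaps vertically) is an assumption borrowed from common layout-analysis practice, not the patent's stated rule. The names `Block`, `precedes` and `reading_order` are hypothetical, block names are assumed unique, and the Y axis is assumed to grow downward.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Block:
    """Hypothetical layout element; y grows downward (top-left origin)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a0: float, a1: float, b0: float, b1: float) -> bool:
    """True if the open intervals (a0, a1) and (b0, b1) intersect."""
    return a0 < b1 and b0 < a1


def precedes(a: Block, b: Block) -> bool:
    """Assumed rule: above wins when columns overlap, else left wins when rows overlap."""
    if overlaps(a.x0, a.x1, b.x0, b.x1):
        return a.y1 <= b.y0
    if overlaps(a.y0, a.y1, b.y0, b.y1):
        return a.x1 <= b.x0
    return False


def reading_order(blocks: List[Block]) -> List[str]:
    """Kahn's algorithm; ties are broken by (y0, x0) for a deterministic result."""
    by_name: Dict[str, Block] = {b.name: b for b in blocks}
    successors: Dict[str, List[str]] = {b.name: [] for b in blocks}
    indegree: Dict[str, int] = {b.name: 0 for b in blocks}
    for a in blocks:
        for b in blocks:
            if a is not b and precedes(a, b):
                successors[a.name].append(b.name)
                indegree[b.name] += 1

    ready = [(by_name[n].y0, by_name[n].x0, n) for n, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        _, _, name = heapq.heappop(ready)
        order.append(name)
        for nxt in successors[name]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (by_name[nxt].y0, by_name[nxt].x0, nxt))
    if len(order) != len(blocks):
        raise ValueError("precedence rules produced a cycle")
    return order


# A full-width title above a two-column body, given in scrambled order.
page = [
    Block("right", 110, 30, 200, 100),
    Block("left_2", 0, 70, 90, 100),
    Block("title", 0, 0, 200, 20),
    Block("left_1", 0, 30, 90, 60),
]
order = reading_order(page)
```

For this layout the result is title, left column top to bottom, then the right column. Pairwise rules like these can produce cycles on irregular layouts, which is why the sketch raises an error instead of guessing; a production system would need a tie-breaking or cycle-resolution strategy that the visible part of the source does not describe.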

Description

Technical field

[0001] The invention relates to the field of information technology, and in particular to a method for analyzing the reading order of electronic formatted files.

Background

[0002] PDF (Portable Document Format) is a file format developed by Adobe. Its advantages are that it is cross-platform and preserves the original layout of a file, reproducing the source document and its formatting with high fidelity. However, PDF is an unstructured data storage format. When text in a PDF file is retrieved, or when a PDF is converted to a streaming (reflowable) format, the extracted text is not output in the reading order of the file, and content that comes later in the reading order may appear earlier in the output text.

[0003] The patent application with application number 2010105591353 discloses a method for recognizing the reading order of a layout, including: reading the layout to be recognized, and analyzing...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/21, G06F17/30
Inventor: 张斌, 张晓博, 张宝亮
Owner: TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)