Method and device for identifying reading sequence of layout

A technique for recognizing the reading order of a page layout, applicable to fields such as instruments, computing, and electronic digital data processing. It addresses three problems: existing methods work poorly on complex e-book pages, reading-order recognition and content rearrangement have low accuracy, and image elements do not take part in the ordering.

Active Publication Date: 2012-05-30
PEKING UNIV +2

AI Technical Summary

Problems solved by technology

The disadvantage of this method is likewise that it only suits simply typeset pages that contain nothing but text. When a page contains many images, recognition accuracy falls because semantic information is lacking.
[0008] As the above methods show, none of the existing techniques works well for reading-order recognition on complex e-book pages that contain images, and the accuracy of reading-order recognition and content rearrangement for such e-books is low.
Moreover, in the prior art image elements do not take part in the ordering: only text objects are extracted, and the reading-order recognition algorithm is designed for them alone.
As digital document layouts become more varied and more decorative, image elements play an increasingly important role in documents and carry rich information. Leaving them out can make the rearranged content incoherent or wrong, which many users are likely to find unacceptable.



Examples


Example 1

[0104] This embodiment uses the e-book "21st Century Computer Basic Tutorial" (Beijing University of Posts and Telecommunications Press), which has 317 pages. Page 135, chosen arbitrarily and shown in Figure 8, serves as the layout to be recognized.

[0105] According to the present invention, identifying the reading order of the page includes the following steps:

[0106] (1) Read the page shown in Figure 8 and analyze its layout to obtain the layout information and the object properties of the character text objects and image objects.

[0107] (2) Perform logical paragraph recognition on the character text objects and image objects obtained in step (1), as follows:

[0108] Step S21: calculate the type area (page body) as C(65, 36, 387, 529), where coordinates are in points and the origin is the lower-left vertex. The type area is shown as the dotted rectangle in Figure 8.

[0109] Step S22: judge whether the text chara...

Example 2

[0122] In this embodiment, the layout to be identified is shown in Figure 10a and Figure 10b. This layout contains a ring-shaped arrangement.

[0123] After the logical paragraph recognition steps above, the paragraph recognition result for this layout is shown by the rectangular frames in Figure 10a and Figure 10b. Inspection shows that the paragraphs on this page have no cutting position in either the X direction or the Y direction and essentially form a single ring. If only the global sorting method were used, the cutting algorithm would exit immediately without making a cut, so the resulting reading order would be the natural output order of the page, that is, the typesetting order. The typesetting order of this page, shown in Figure 10b, is "text near the page number - text in the left column - title - text in the right column - legend in the left column - legend in the r...
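The "no cutting position" situation can be illustrated with a small sketch. It assumes that a cutting position is an empty gap in the projection of the paragraph boxes onto an axis; the box coordinates, the `Box` layout `(x0, y0, x1, y1)`, and the function names `projection_gaps` and `cut_positions` are all hypothetical and are not taken from the patent or from Figure 10.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1), origin at the lower-left corner.
Box = Tuple[float, float, float, float]


def projection_gaps(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the empty gaps left when the intervals are projected onto one axis."""
    ivs = sorted(intervals)
    if not ivs:
        return []
    gaps = []
    reach = ivs[0][1]  # furthest coordinate covered so far
    for lo, hi in ivs[1:]:
        if lo > reach:  # nothing covers (reach, lo): a possible cut
            gaps.append((reach, lo))
        reach = max(reach, hi)
    return gaps


def cut_positions(boxes: List[Box]):
    """Gaps in the X projection and in the Y projection of the boxes."""
    x_gaps = projection_gaps([(b[0], b[2]) for b in boxes])
    y_gaps = projection_gaps([(b[1], b[3]) for b in boxes])
    return x_gaps, y_gaps


# Four non-overlapping boxes arranged as a ring (pinwheel): illustrative data only.
ring = [
    (0, 80, 70, 100),    # top
    (80, 30, 100, 100),  # right
    (30, 0, 100, 20),    # bottom
    (0, 0, 20, 70),      # left
]
# A plain two-column page for comparison.
two_columns = [(0, 0, 40, 100), (60, 0, 100, 100)]

print(cut_positions(ring))         # no gap on either axis
print(cut_positions(two_columns))  # one gap in X between the columns
```

In the ring, every candidate cut line is crossed by at least one box, so a purely projection-based cut has nothing to split on, which is the case where a local ordering rule is needed.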


Abstract

The invention provides a method for identifying the reading order of a layout, comprising the following steps: reading the layout to be identified and analyzing it to obtain layout information and the object properties of character text objects and image objects; merging, according to the layout information and object properties, the character text objects into text paragraphs and identifying the image objects as image paragraphs; and determining the reading order of the text paragraphs and image paragraphs by combining global recursive cutting with local order judgment, wherein the global cutting is performed by projection, and for any group that still contains several paragraphs after global cutting, the order of its paragraphs is judged by a local judgment method. Correspondingly, the invention provides a device for identifying the reading order of a layout. Because both characters and images are identified as paragraphs and their reading order is determined by combining global recursive cutting with local order judgment, text and images in complex layouts are ordered correctly, with high efficiency and accuracy.
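The combination of global recursive cutting and a local fallback can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the abstract does not say which axis is cut first or how the local judgment works, so the sketch tries a horizontal cut before a vertical one and uses a placeholder local rule (top to bottom, then left to right). The `Para` class and the names `widest_gap`, `local_order`, and `reading_order` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Para:
    """A text or image paragraph; coordinates in points, origin at lower left."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def widest_gap(intervals: List[Tuple[float, float]]) -> Optional[float]:
    """Midpoint of the widest empty gap in the projection, or None if there is none."""
    ivs = sorted(intervals)
    best, best_width = None, 0.0
    reach = ivs[0][1]
    for lo, hi in ivs[1:]:
        if lo - reach > best_width:
            best, best_width = (reach + lo) / 2, lo - reach
        reach = max(reach, hi)
    return best


def local_order(paras: List[Para]) -> List[Para]:
    # Placeholder local rule (assumption): top to bottom, then left to right.
    return sorted(paras, key=lambda p: (-p.y1, p.x0))


def reading_order(paras: List[Para]) -> List[Para]:
    """Recursively cut along projection gaps; fall back to local_order when no cut exists."""
    if len(paras) <= 1:
        return list(paras)
    # Assumption: try a horizontal cut (gap in the Y projection) first.
    cut = widest_gap([(p.y0, p.y1) for p in paras])
    if cut is not None:
        upper = [p for p in paras if p.y0 > cut]
        lower = [p for p in paras if p.y0 <= cut]
        return reading_order(upper) + reading_order(lower)
    # Otherwise try a vertical cut (gap in the X projection).
    cut = widest_gap([(p.x0, p.x1) for p in paras])
    if cut is not None:
        left = [p for p in paras if p.x0 <= cut]
        right = [p for p in paras if p.x0 > cut]
        return reading_order(left) + reading_order(right)
    # No cut in either direction: the group is ordered locally.
    return local_order(paras)


# Illustrative page: a title above two columns (made-up coordinates).
page = [
    Para("D", 240, 100, 400, 300),
    Para("A", 50, 300, 210, 500),
    Para("title", 50, 520, 400, 560),
    Para("C", 240, 320, 400, 500),
    Para("B", 50, 100, 210, 280),
]
print([p.name for p in reading_order(page)])
```

Because a cut is placed only inside an empty gap, every paragraph falls wholly on one side and both sides are non-empty, so the recursion terminates. Text and image paragraphs are treated identically here, which mirrors the abstract's point that images take part in the ordering.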

Description

Technical field

[0001] The invention relates to the technical field of digital document processing, in particular to a method and device for identifying the layout reading order of digital documents.

Background

[0002] With the rapid development of computer and network technology, digital documents are used ever more widely. Fixed-layout formats such as PDF and CEBX have a fixed rendering effect, which makes them ideal for electronic document publishing, digital information dissemination, and archiving, and they are widely used for e-books, journal papers, electronic documents, product descriptions, company announcements, network materials, e-mails, and so on. However, precisely because the rendering of these fixed-layout documents is fixed and they carry no logical structure information, they cannot be directly re-edited or re-typeset, nor can their logical structure or information be extracted in ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/21
Inventor: 房婧, 高良才, 汤帜, 陶欣
Owner: PEKING UNIV