Method for conducting words reading sequence recovery for newspaper pages

A text reading-order recovery technology, applied in fields such as electronic digital data processing, instrumentation, and computing. It addresses shortcomings of existing approaches: output that is poorly suited to information reuse and further processing (retrieval, utilization, transactions, rewriting, supplementing, sorting), growing time complexity, and a lack of chapter independence in the recovered reading order and structure.

Active Publication Date: 2009-12-09
PEKING UNIV FOUNDER R & D CENT +1

AI Technical Summary

Problems solved by technology

[0003] At present, mainstream OCR digitization software analyzes the layout of styled documents but ignores the recovery of reading order and semantic structure. It converts the documents into styled electronic formats such as PDF and HTML for re-release, which is not conducive to the reuse and further processing of information (retrieval, utilization, transactions, rewriting, supplementing, sorting, and so on). For multi-chapter newspaper layouts in particular, the lack of an independent reading order and structure for each chapter makes reuse even more difficult.

There are two main types of methods for reading order recovery. The first uses style and spatial-relationship information. Examples are "Layout Analysis, Understanding and Reconstruction of Complex Chinese Newspapers" (Chen Ming, Ding Xiaoqing, Liang Jian; Journal of Tsinghua University, Natural Science Edition, 2001, Vol. 41, No. 1, pp. 29-32, 59) and "Integrated Algorithms for Newspaper Page Decomposition and Article Tracking" (B. Gatos, S.L. Mantzaris, K.V. Chandrinos, A. Tsigris, S.J. Perantonis; Proceedings of the Fifth International Conference on Document Analysis and Recognition, 1999, pp. 559-562). These methods treat the newspaper layout as a collection of independent text blocks and apply rules, based on the principle that blocks of the same article share a homogeneous style, to merge text blocks and determine their reading order. Rule-based methods can only handle layouts with simple styles and spatial relationships, such as books and journal papers. Because newspaper layouts are diverse and their objects are interrelated, recovering the reading order between text blocks in complex layouts from styles and rules alone yields too low an accuracy.

The second type uses semantic and spatial-relationship information. In 2002, Aiello M, Monz C, Todoran L et al., in "Document understanding for a broad class of documents" (International Journal on Document Analysis and Recognition, 2002, 5(1): 1-16), disclosed for the first time a method that uses semantic information to determine the reading order: it enumerates all possible reading orders and then selects the best one according to a part-of-speech weight formula. However, its time complexity grows exponentially with the number of text blocks, it cannot extract independent reading orders, and it uses too little semantic information, which limits its accuracy.

None of the above techniques makes full use of the various kinds of latent information in newspaper layout documents to obtain a more accurate reading order, and none has formed a unified mathematical model.
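To give a sense of why enumerating every reading order does not scale, the following sketch counts candidate orderings. It assumes, as a simplification not stated in the cited paper, that every permutation of the text blocks is a candidate and that no pruning is applied. The function name is hypothetical.

```python
import math


def candidate_orderings(n_blocks: int) -> int:
    """Upper bound on reading orders to score: every permutation of the blocks."""
    return math.factorial(n_blocks)


# Print how quickly the search space grows with the number of text blocks.
for n in (5, 10, 15, 20):
    print(f"{n} blocks -> {candidate_orderings(n)} candidate orders")
```

A newspaper page can easily contain dozens of text blocks, so scoring each ordering individually becomes impractical well before that size.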

Examples

Embodiment Construction

[0049] The present invention will be further described below in conjunction with the accompanying drawings and examples.

[0050] In this embodiment, a newspaper document that has been scanned and recognized by OCR is used as the example data. As shown in Figure 1, a method for restoring the text reading order of a newspaper layout includes the following steps:

[0051] 1. Read in a document with style and layout information. Supported inputs include paper newspapers that have been scanned and recognized by OCR, PDFs, and documents generated by professional typesetting software such as Founder Feiteng. Style information mainly refers to the position and size of each word. Layout analysis merges text of the same style into text blocks from the bottom up, according to the principle of local style homogeneity. Text blocks are then classified as body-text blocks or non-body-text blocks according to their style and number of lines. As shown in Figure 2, the solid line re...

Abstract

The invention belongs to document layout understanding technology within intelligent text and graphic information processing, and specifically relates to a content-based method for recovering the text reading order of newspaper layouts. Prior-art methods lose the reading order and lack chapter independence when handling complex newspaper layouts. To address this, the invention models the problem mathematically with graph theory for the first time: the adjacency relationships between text blocks are expressed as a directed graph, and the directed graph is split into a weighted bipartite graph. Natural language processing is used to compute the edge weights of the bipartite graph, and an optimal matching yields multiple continuous sequences. Each sequence is then divided into subsequences according to the style information of the text blocks. Concatenating the content corresponding to each subsequence gives the text flow of an independent chapter in reading order. By combining semantic, spatial-relationship, and style information, the method greatly improves the accuracy of reading order recovery and produces independent chapters. It can be applied to layout understanding and structural reconstruction of styled documents.
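The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the optimal matching is computed with a standard Kuhn-Munkres (Hungarian) routine, which the source does not name; the edge weights are supplied by hand in place of the natural-language-processing scores; the adjacency graph is assumed to be acyclic with positive weights; and the splitting rule (a title block starts a new chapter) is a guess at how style information might be used. The names `hungarian` and `recover_chapters` are hypothetical.

```python
from typing import Dict, List, Tuple


def hungarian(cost: List[List[float]]) -> List[int]:
    """Minimum-cost assignment on a square matrix; returns the column for each row."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (n + 1)          # column potentials
    p = [0] * (n + 1)            # p[j]: row currently assigned to column j (1-based)
    way = [0] * (n + 1)          # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def recover_chapters(n_blocks: int,
                     edges: Dict[Tuple[int, int], float],
                     is_title: List[bool]) -> List[List[int]]:
    # Bipartite split: row a is block a as a predecessor, column b is block b
    # as a successor. Non-edges cost 0 and are treated as "no link".
    cost = [[0.0] * n_blocks for _ in range(n_blocks)]
    for (a, b), w in edges.items():
        cost[a][b] = -w          # maximizing weight == minimizing negative weight
    assignment = hungarian(cost)
    succ = {a: b for a, b in enumerate(assignment) if (a, b) in edges}
    has_pred = set(succ.values())
    # Follow successor links from every block that has no predecessor.
    sequences = []
    for start in range(n_blocks):
        if start not in has_pred:
            seq = [start]
            while seq[-1] in succ:
                seq.append(succ[seq[-1]])
            sequences.append(seq)
    # Assumed style rule: a title block opens a new chapter.
    chapters: List[List[int]] = []
    for seq in sequences:
        current: List[int] = []
        for block in seq:
            if is_title[block] and current:
                chapters.append(current)
                current = []
            current.append(block)
        chapters.append(current)
    return chapters


# Toy page: blocks 0 and 3 are titles, the rest are body text.
edges = {(0, 1): 0.9, (0, 4): 0.2, (1, 2): 0.8, (1, 4): 0.3, (2, 3): 0.3, (3, 4): 0.9}
print(recover_chapters(5, edges, [True, False, False, True, False]))
```

Because each block receives at most one successor and at most one predecessor in the matching, the selected edges form disjoint chains, which is what makes the result a set of continuous sequences. The matching step runs in polynomial time, in contrast to scoring every permutation.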

Description

Technical field

[0001] The invention belongs to document layout understanding technology within intelligent text and graphic information processing, and specifically relates to a method for recovering the text reading order of newspaper layouts.

Background technique

[0002] With the development of information technology and the emergence of new media forms, cross-media publishing is growing rapidly thanks to its convenient information sharing, efficient information dissemination, rich forms of expression, and the complementary strengths of different media. The XML-based digital asset management system is the core of cross-media publishing. In traditional information dissemination, however, the form of the information depends directly on the form of the terminal medium, which is inconvenient for cross-media publishing. Newspapers in particular are huge in number, long in history, complex in style, poor in content independence, and vague in reading order, making their XML structu...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/21
Inventor: 贾娟, 陈晓鸥, 陈堃銶
Owner: PEKING UNIV FOUNDER R & D CENT