Method and system for identifying tables in a layout file

A table recognition technology for layout (fixed-format) documents, applied in the fields of instruments, computing, and electric digital data processing. It addresses the problems that tables in a layout cannot be processed automatically, that manual processing is cumbersome, and that table data is lost as a result, and it enables efficient, automated indexing.

Active Publication Date: 2010-07-07
NEW FOUNDER HLDG DEV LLC +1

AI Technical Summary

Problems solved by technology

However, the existing method is aimed mainly at recognizing the text in a layout and cannot effectively identify the tables in it.
[0004] At present, when digital newspapers and periodicals are indexed (that is, when their content information is organized, for example by labeling layout information such as publication date, edition number, and edition name), the layouts often contain a large number of tables. Under normal circumstances this table data cannot be processed automatically, and manual processing is very cumbersome, so the data is often discarded or stored as a picture.
As a result, a large amount of table data is lost.


Examples


Detailed Description of the Embodiments

[0048] The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0049] A system for recognizing tables in a layout file comprises the following modules:

[0050] (1) an extraction module for extracting the original text blocks from a layout in the layout file;

[0051] (2) an initial text block merging module for merging the original text blocks a first time;

[0052] (3) a text block re-merging module for further merging the text blocks after the initial merge;

[0053] (4) a selection module for screening the re-merged text blocks and selecting those that are table text blocks;

[0054] (5) a combination module for recombining the text content of the table text blocks to obtain the content of the table.
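
The two merging modules, (2) and (3), can be pictured with a minimal Python sketch. The excerpt does not state the merging criteria, so everything here is an assumption made for illustration: the `Block` type, the function names `initial_merge` and `re_merge`, and the thresholds `gap_ratio`, `tol`, and `line_gap_ratio` are hypothetical. The first pass joins blocks on the same text line that are separated by a small horizontal gap; the second pass stacks blocks that share a left edge and font size and sit close together vertically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Text plus bounding box (y grows downward) and font size."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def union(a: Block, b: Block, sep: str = "") -> Block:
    """Join two blocks into one whose box covers both."""
    return Block(a.text + sep + b.text,
                 min(a.x0, b.x0), min(a.y0, b.y0),
                 max(a.x1, b.x1), max(a.y1, b.y1),
                 max(a.font_size, b.font_size))


def initial_merge(blocks: List[Block], gap_ratio: float = 0.3) -> List[Block]:
    """Assumed rule: join blocks on one text line separated by a small gap."""
    # Group blocks into lines by their top coordinate.
    lines: List[List[Block]] = []
    for blk in sorted(blocks, key=lambda b: b.y0):
        if lines and abs(blk.y0 - lines[-1][0].y0) <= 0.5 * blk.font_size:
            lines[-1].append(blk)
        else:
            lines.append([blk])
    # Within each line, walk left to right and join close neighbours.
    merged: List[Block] = []
    for line in lines:
        run: List[Block] = []
        for blk in sorted(line, key=lambda b: b.x0):
            if run and blk.x0 - run[-1].x1 <= gap_ratio * blk.font_size:
                run[-1] = union(run[-1], blk)
            else:
                run.append(blk)
        merged.extend(run)
    return merged


def re_merge(blocks: List[Block], tol: float = 1.0,
             line_gap_ratio: float = 0.5) -> List[Block]:
    """Assumed rule: stack a block under an earlier block that has the same
    font size and left edge when the vertical gap between them is small."""
    merged: List[Block] = []
    for blk in sorted(blocks, key=lambda b: b.y0):
        for i, prev in enumerate(merged):
            if (prev.font_size == blk.font_size
                    and abs(prev.x0 - blk.x0) <= tol
                    and 0 <= blk.y0 - prev.y1 <= line_gap_ratio * blk.font_size):
                merged[i] = union(prev, blk, sep="\n")
                break
        else:
            merged.append(blk)
    return merged


if __name__ == "__main__":
    # Four single-character blocks: "A" and "B" touch, "C" is far to the
    # right on the same line, and "D" sits just below "A".
    chars = [Block("A", 0, 0, 10, 10, 10), Block("B", 10, 0, 20, 10, 10),
             Block("C", 50, 0, 60, 10, 10), Block("D", 0, 11, 10, 21, 10)]
    for blk in re_merge(initial_merge(chars)):
        print(repr(blk.text), (blk.x0, blk.y0, blk.x1, blk.y1))
```

Keeping the two passes separate mirrors the module split in the list above: the first pass only looks along a line, so a wide gap such as the one before "C" keeps neighbouring columns apart, while the second pass only looks down a column.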

[0055] After the extraction module extracts the original text blocks of the layout in the layout file, the initial text block merging module connected to it is...


Abstract

The invention relates to a method and a system for identifying tables in a layout file and belongs to the technical field of pattern recognition within computer information processing. Conventional pattern recognition technology cannot effectively identify and automatically extract the tables in a layout. In the method and the system, the individual characters of the layout are first combined and organized into content blocks using automatic merging; table identification and content combination are then performed according to the spatial positions, character information, and typesetting information of the content blocks. By analyzing the positions and typesetting information of the layout content, the method and the system identify tables quickly and efficiently and organize the table content accurately.
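
As an illustration of how spatial positions alone can drive table identification and content combination, the following Python sketch clusters block coordinates into rows and columns and then applies a simple screening test. It is a sketch under stated assumptions, not the patented procedure: the simplified `(text, x0, y0)` block tuple, the helper names `cluster`, `nearest`, `to_grid`, and `looks_like_table`, and the tolerance and minimum-size parameters are all hypothetical. The sketch also builds a provisional grid before screening it, which is an ordering chosen for brevity rather than one taken from the source, and it ignores the character and typesetting information that the abstract also mentions.

```python
from typing import List, Tuple

# Hypothetical simplified block: (text, left edge x0, top edge y0).
Block = Tuple[str, float, float]


def cluster(values: List[float], tol: float) -> List[float]:
    """Collapse nearby coordinates; keep the first value of each cluster."""
    centers: List[float] = []
    for v in sorted(values):
        if not centers or v - centers[-1] > tol:
            centers.append(v)
    return centers


def nearest(centers: List[float], v: float) -> int:
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def to_grid(blocks: List[Block], tol: float = 3.0) -> List[List[str]]:
    """Place each block in a row/column cell derived from its position."""
    rows = cluster([y for _, _, y in blocks], tol)
    cols = cluster([x for _, x, _ in blocks], tol)
    grid = [["" for _ in cols] for _ in rows]
    for text, x0, y0 in blocks:
        grid[nearest(rows, y0)][nearest(cols, x0)] = text
    return grid


def looks_like_table(grid: List[List[str]], min_rows: int = 2,
                     min_cols: int = 2) -> bool:
    """Assumed screening rule: enough rows that each fill enough columns."""
    full_rows = [r for r in grid if sum(1 for cell in r if cell) >= min_cols]
    return len(full_rows) >= min_rows


if __name__ == "__main__":
    blocks = [("Name", 0.0, 0.0), ("Score", 50.0, 0.5),
              ("Ann", 0.4, 12.0), ("90", 50.0, 12.0)]
    grid = to_grid(blocks)
    print(grid, looks_like_table(grid))
```

A plain paragraph, whose lines all start at the same left edge, collapses into a single column and fails the screening test, whereas blocks that line up in at least two columns across at least two rows pass it.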

Description

Technical field

[0001] The invention belongs to the technical field of pattern recognition within computer information processing, and in particular relates to a method and system for recognizing tables in layout files.

Background

[0002] In industries such as newspapers and publishing houses, after typesetting software has been used for typesetting, articles and related metadata need to be extracted from the resulting layouts for further use; this is the reconstruction and indexing of article information. To restore the content of the layout more faithfully, indexing extracts not only the content information of the article itself (such as title, citation, subtitle, author, and body) but also information such as the position of each text block and its font size.

[0003] The Chinese patent application No. 200710179938.4, "An indexing method for complex layouts based on PDF", discloses ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/21
Inventor: 徐剑波, 董宁
Owner: NEW FOUNDER HLDG DEV LLC