
Method and system for extracting information from a document

Document and information technology, applied to methods and systems for document processing, addresses the high resource consumption of manual solutions and achieves the effect of reducing manual processing costs.

Inactive
Publication Date: 2007-03-08
SCI APPL INT CORP
14 Cites, 26 Cited by

AI Technical Summary

Benefits of technology

The invention is a computer-implemented method for extracting information from a population of subject documents. It models the structure of a document by identifying data of a specific type and associating it with the corresponding modeled data element type. Additionally, the invention allows for the horizontal alignment of different regions of a document by determining the type of data within each region and calculating the edit distance between them. This results in a more organized and efficient document structure.
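The idea of comparing regions by the types of data they contain can be sketched as follows. This is a minimal illustration, not the patented procedure: the three token classes (`N`, `A`, `X`), the function names, and the use of the standard Levenshtein distance are all assumptions made for the example, since this summary does not say which type classes or which edit-distance variant the invention uses.

```python
def token_type(token: str) -> str:
    """Map a token to a coarse type code (assumed classes, not from the patent)."""
    stripped = token.replace(",", "").replace(".", "").replace("$", "")
    if stripped.isdigit():
        return "N"  # numeric, allowing currency and separator characters
    if token.isalpha():
        return "A"  # purely alphabetic
    return "X"      # anything else (mixed text, punctuation)


def type_signature(region: str) -> str:
    """Describe a region of text as the sequence of its token types."""
    return "".join(token_type(t) for t in region.split())


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def region_distance(region_a: str, region_b: str) -> int:
    """Edit distance between the type signatures of two regions."""
    return edit_distance(type_signature(region_a), type_signature(region_b))
```

Under these assumptions, two invoice lines with the same layout (a word, a quantity, a price) have distance 0, while a line with a different layout has a positive distance, which is the kind of signal that could be used to decide which regions line up with one another.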

Problems solved by technology

For organizations such as businesses, the federal government, research organizations, and the like, processing data obtained in printed form from various sources and in various formats consumes substantial resources.
Manual solutions are highly resource-intensive and well known to be susceptible to error.
Besides the regular and expected complexity of document and table structures, documents may pose additional challenges for automating the data extraction process.
In addition to irregularities related to record structure, such as the previous ones, common problems related to scanning (e.g., skewed and rotated images), as well as OCR errors, should be anticipated.
The method works well on some isolated tables; however, it may also erroneously extract “table structures” from non-table regions.
Despite many years of research toward automated information extraction from tables (and the initial step of recognizing a table in the first place), the problems have still not been solved.
The automatic extraction of information is difficult for several reasons.
It seems that the complexity of possible table forms multiplied by the complexity of image analysis methods has worked against the production of satisfactory and practical results.
Even though image analysis methods identify table structures and perform their segmentation, they typically do not rely on an understanding of the logic of the table.
Hurst notes that table extraction “has not received much attention from either the information extraction or the information retrieval communities, despite a considerable body of work in the image analysis field, psychological and educational research, and document markup and formatting research.” As possible reasons, viewed from an information extraction perspective, Hurst identifies lack of current art and model, no training corpora, and confusing markup standards.
Moreover, “through the various niches of table-related research there is a lack of evolved or complex representations which are capable of relating high- and low-level aspects of tables.”
However, it would not likely be applicable for database upload applications.



Examples


Embodiment Construction

[0036] As required, detailed embodiments of the present invention are disclosed herein. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale, and some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention.

[0037] A wide variety of printed documents exhibit complex data structures including textual or numerical data organized in rows and columns, i.e., tables, along with more general structures, one of which may be broadly characterized as consisting of one or more contextually-related elements including possibly tables, i.e., records. Contextual relationship typi...


Abstract

A computer-implemented method for extracting information from a population of subject documents. The method includes modeling a document structure. The modeled document structure includes at least a document component hierarchy with at least one record type. Each record type includes at least one record part type, and at least one record part type comprises at least one data element type. For a subject document exhibiting at least a portion of the modeled document structure, preferred embodiments of the invention identify data of a type corresponding to at least one modeled data element type. Identified subject document data is then associated with the corresponding modeled data element type.
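The component hierarchy described in the abstract (record type, record part type, data element type) can be pictured with a small data-structure sketch. Everything concrete here is hypothetical and chosen only for illustration: the class names, the `associate` function, the use of regular expressions to recognize a data element type, and the invoice example are assumptions, not details taken from the patent.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class DataElementType:
    name: str
    pattern: str  # regex used to recognize the type (an assumption of this sketch)


@dataclass
class RecordPartType:
    name: str
    element_types: list[DataElementType] = field(default_factory=list)


@dataclass
class RecordType:
    name: str
    part_types: list[RecordPartType] = field(default_factory=list)


@dataclass
class DocumentModel:
    record_types: list[RecordType] = field(default_factory=list)


def associate(model: DocumentModel, text: str) -> list[tuple[str, str]]:
    """Find data matching each modeled element type and pair it with that type."""
    found = []
    for record in model.record_types:
        for part in record.part_types:
            for element in part.element_types:
                path = f"{record.name}/{part.name}/{element.name}"
                for match in re.finditer(element.pattern, text):
                    found.append((path, match.group()))
    return found


def build_invoice_model() -> DocumentModel:
    """Hypothetical model: one record type with one part and two element types."""
    line_item = RecordPartType("line_item", [
        DataElementType("date", r"\d{2}/\d{2}/\d{4}"),
        DataElementType("amount", r"\$\d+\.\d{2}"),
    ])
    return DocumentModel([RecordType("invoice", [line_item])])
```

Calling `associate(build_invoice_model(), "03/08/2007 Oil change $29.95")` pairs the date and the amount with their element types. A real system working from scanned pages would need recognition that tolerates OCR errors and layout irregularities, which plain regular expressions do not.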

Description

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a continuation of, claims priority to and incorporates by reference U.S. patent application Ser. No. 10/146,959, entitled “METHOD AND SYSTEM FOR EXTRACTING INFORMATION FROM A DOCUMENT,” filed May 17, 2002.

[0002] The present application hereby incorporates by reference in its entirety U.S. patent application Ser. No. 09/518,176, entitled “Machine Learning of Document Templates for Data Extraction,” filed Mar. 2, 2000.

FIELD OF THE INVENTION

[0003] The present invention relates to methods and systems for extracting information (e.g., data with context) from a document. More specifically, preferred embodiments of the present invention relate to extraction of information from printed and imaged documents including a regular table structure.

BACKGROUND

[0004] From credit card statements, to hospital bills, to auto repair invoices, most of us encounter printed documents containing complex, but mostly regular, data s...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06K9/32, G06K9/20
CPC: G06K9/00469, G06V30/416
Inventor: WNEK, JANUSZ
Owner: SCI APPL INT CORP