Method and system for extracting information from a document

A document information extraction technology, applied in the field of methods and systems, that addresses the high resource consumption of manual document processing and reduces manual processing costs.

Inactive; Publication Date: 2007-03-08

SCI APPL INT CORP

14 Cites, 26 Cited by

AI Technical Summary

Benefits of technology

The invention is a computer-implemented method for extracting information from a population of subject documents. It models the structure of a document, identifies data of a specific type in a subject document, and associates that data with the corresponding modeled data element type. Additionally, the invention supports horizontal alignment of different regions of a document by determining the type of data within each region and calculating the edit distance between them. The result is extracted data organized according to the modeled document structure.
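The alignment idea mentioned above can be illustrated with a minimal sketch. It assumes that each region is a line of text, that tokens are mapped to coarse type codes (`D` for date, `N` for number, `A` for other text), and that regions are compared by the Levenshtein edit distance between their type strings. The type scheme and the names `token_type`, `type_signature`, and `edit_distance` are illustrative assumptions, not taken from the patent.

```python
import re


def token_type(token: str) -> str:
    """Map a token to a coarse type code (hypothetical scheme)."""
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", token):
        return "D"  # date such as 03/14/2002
    if re.fullmatch(r"[-+]?\$?\d[\d,]*(\.\d+)?", token):
        return "N"  # number or currency amount
    return "A"      # anything else is treated as text


def type_signature(line: str) -> str:
    """Describe a region (here, one line) as a string of type codes."""
    return "".join(token_type(t) for t in line.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


row_a = "03/14/2002 Oil change 29.95"
row_b = "03/15/2002 Brake pads 2 89.00"
header = "Date Description Amount"

sig_a, sig_b, sig_h = map(type_signature, (row_a, row_b, header))
print(sig_a, sig_b, sig_h)
print(edit_distance(sig_a, sig_b), edit_distance(sig_a, sig_h))
```

In this sketch, a small distance between two type signatures suggests that the regions share a layout and are candidates for alignment, while a larger distance suggests a different kind of region, such as a header line.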

Problems solved by technology

For organizations such as businesses, the federal government, and research organizations, processing data obtained in printed form from various sources and in various formats consumes substantial resources.

Manual solutions are highly resource-intensive and well known to be susceptible to error.

Besides the regular and expected complexity of document and table structures, documents may pose additional challenges for automating the data extraction process.

In addition to irregularities related to record structure, common problems related to scanning (e.g., skewed and rotated images), as well as OCR errors, should be anticipated.

The method works well on some isolated tables; however, it may also erroneously extract “table structures” from non-table regions.

Despite many years of research toward automated information extraction from tables (and the initial step of recognizing a table in the first place), the problems have still not been solved.

The automatic extraction of information is difficult for several reasons.

It seems that the complexity of possible table forms multiplied by the complexity of image analysis methods has worked against the production of satisfactory and practical results.

Even though image analysis methods identify table structures and perform their segmentation, they typically do not rely on an understanding of the logic of the table.

Hurst notes that table extraction “has not received much attention from either the information extraction or the information retrieval communities, despite a considerable body of work in the image analysis field, psychological and educational research, and document markup and formatting research.” As possible reasons, viewed from an information extraction perspective, Hurst identifies lack of current art and model, no training corpora, and confusing markup standards.

Moreover, “through the various niches of table-related research there is a lack of evolved or complex representations which are capable of relating high- and low-level aspects of tables.”

However, it would not likely be applicable for database upload applications.

Embodiment Construction

[0036] As required, detailed embodiments of the present invention are disclosed herein. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale, and some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention.

[0037] A wide variety of printed documents exhibit complex data structures including textual or numerical data organized in rows and columns, i.e., tables, along with more general structures, one of which may be broadly characterized as consisting of one or more contextually-related elements including possibly tables, i.e., records. Contextual relationship typi...

Abstract

A computer-implemented method for extracting information from a population of subject documents is disclosed. The method includes modeling a document structure. The modeled document structure includes at least a document component hierarchy with at least one record type. Each record type includes at least one record part type, and at least one record part type comprises at least one data element type. For a subject document exhibiting at least a portion of the modeled document structure, preferred embodiments of the invention identify data of a type corresponding to at least one modeled data element type. Identified subject document data is then associated with the corresponding modeled data element type.
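The hierarchy described in the abstract (record type, record part type, data element type) can be pictured with a minimal sketch. The class names, the use of regular expressions as type recognizers, and the invoice example are all illustrative assumptions; the abstract does not specify how data of a given type is recognized.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataElementType:
    name: str
    pattern: str  # regex used here as a stand-in type recognizer (assumption)


@dataclass
class RecordPartType:
    name: str
    elements: List[DataElementType] = field(default_factory=list)


@dataclass
class RecordType:
    name: str
    parts: List[RecordPartType] = field(default_factory=list)


def associate(record_type: RecordType, line: str) -> Dict[str, str]:
    """Find data matching each modeled element type and key it by that type."""
    found = {}
    for part in record_type.parts:
        for element in part.elements:
            match = re.search(element.pattern, line)
            if match:
                found[f"{part.name}.{element.name}"] = match.group(0)
    return found


# Hypothetical model of one invoice line item.
invoice = RecordType(
    name="invoice",
    parts=[
        RecordPartType(
            name="line_item",
            elements=[
                DataElementType("date", r"\d{2}/\d{2}/\d{4}"),
                DataElementType("amount", r"\d+\.\d{2}$"),
            ],
        )
    ],
)

result = associate(invoice, "03/14/2002 Oil change 29.95")
print(result)
```

Keying each extracted value by its part and element type mirrors the final step of the abstract, in which identified data is associated with the corresponding modeled data element type.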

Description

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a continuation of, claims priority to, and incorporates by reference U.S. patent application Ser. No. 10/146,959, entitled “METHOD AND SYSTEM FOR EXTRACTING INFORMATION FROM A DOCUMENT,” filed May 17, 2002.

[0002] The present application hereby incorporates by reference in its entirety U.S. patent application Ser. No. 09/518,176, entitled “Machine Learning of Document Templates for Data Extraction,” filed Mar. 2, 2000.

FIELD OF THE INVENTION

[0003] The present invention relates to methods and systems for extracting information (e.g., data with context) from a document. More specifically, preferred embodiments of the present invention relate to extraction of information from printed and imaged documents including a regular table structure.

BACKGROUND

[0004] From credit card statements, to hospital bills, to auto repair invoices, most of us encounter printed documents containing complex, but mostly regular, data s...

Application Information

Patent Type & Authority: Applications (United States)

IPC(8): G06K9/32, G06K9/20

CPC: G06K9/00469, G06V30/416

Inventor: WNEK, JANUSZ

Owner: SCI APPL INT CORP
