Systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents

A technology for automatic feature extraction and document recognition, applied in systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents. It addresses two problems: finding a particular page within an electronic document is difficult and time-consuming, and the individual organizing the paper documents or performing the scanning may not have the skill, knowledge or time needed to organize them correctly.

Inactive Publication Date: 2009-05-07
GRUNTWORX

AI Technical Summary

Benefits of technology

[0013]Systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents are provided. In some embodiments, a method of enabling a human trainer to assist a document analysis system in classifying unrecognized electronic documents and of causing the unrecognized electronic documents to be used to train the automatic recognition and classification of subsequently received documents is provided. The method automatically extracts image and text features from each received electronic document and compares the extracted features with feature sets associated with each category of document to determine whether the document is recognizable as belonging to a document category, in which each feature set includes a subset of image and text features and corresponding weights for each image and text feature in the subset so that the feature set distinguishes the respective category of document from the other categories of documents. If an electronic document is recognized as belonging to one of the document categories, the method classifies the electronic document as belonging to that document category. 
If an electronic document is unrecognized, however, the method submits the unrecognized document to a learning phase, in which the unrecognized document is presented to a human trainer for manual classification into a document category. The method then automatically modifies at least one of the features and the weights of the feature set of the document category corresponding to the manually-classified electronic document, using the automatically extracted features of the manually-classified document, so that subsequent automatic recognition and classification of documents by the document analysis system improves as more and more unrecognized documents train the feature sets.
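
The recognize-or-learn loop described in paragraph [0013] can be illustrated with a minimal Python sketch. This is not the patented implementation: the scoring rule (matched weight divided by total weight), the 0.6 recognition threshold, the additive weight update, the feature strings, and all names (`FeatureSet`, `classify`, `process`, `ask_trainer`) are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class FeatureSet:
    """Weighted image/text features for one document category (hypothetical)."""
    weights: Dict[str, float] = field(default_factory=dict)

    def score(self, features: Set[str]) -> float:
        # Assumed scoring rule: share of the category's total weight
        # that is carried by features present in the document.
        total = sum(self.weights.values())
        if total == 0:
            return 0.0
        matched = sum(w for f, w in self.weights.items() if f in features)
        return matched / total

    def learn(self, features: Set[str], rate: float = 0.5) -> None:
        # Assumed update rule: add new features and raise the weight of
        # features seen in a manually classified document.
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + rate


def classify(features: Set[str],
             feature_sets: Dict[str, FeatureSet],
             threshold: float = 0.6) -> Optional[str]:
    """Return the best-matching category, or None if unrecognized."""
    best_category, best_score = None, 0.0
    for category, fs in feature_sets.items():
        s = fs.score(features)
        if s > best_score:
            best_category, best_score = category, s
    return best_category if best_score >= threshold else None


def process(features: Set[str],
            feature_sets: Dict[str, FeatureSet],
            ask_trainer: Callable[[Set[str]], str],
            threshold: float = 0.6) -> str:
    """Classify automatically; fall back to the human trainer and learn."""
    category = classify(features, feature_sets, threshold)
    if category is None:
        # Learning phase: the trainer supplies the category, and the
        # extracted features update that category's feature set.
        category = ask_trainer(features)
        feature_sets.setdefault(category, FeatureSet()).learn(features)
    return category


# Usage with made-up categories and features.
feature_sets = {
    "invoice": FeatureSet({"text:invoice number": 2.0,
                           "text:amount due": 1.0,
                           "image:logo": 1.0}),
}
known = {"text:invoice number", "text:amount due"}
unknown = {"text:receipt", "text:total paid"}

first = process(known, feature_sets, ask_trainer=lambda f: "unused")
second = process(unknown, feature_sets, ask_trainer=lambda f: "receipt")
third = classify(unknown, feature_sets)  # recognized without the trainer now
```

In this sketch the second document is unrecognized, so the trainer's label creates a new feature set, and a later document with the same features is classified automatically. A production system would presumably also lower weights for features that fail to distinguish one category from the others, which the sketch omits.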

Problems solved by technology

In many instances, however, the paper documents are scanned in a random, unorganized sequence, which makes it difficult and time-consuming to find a particular page within the electronic document.
One solution can be to manually organize the paper documents prior to scanning; however, the individual organizing the paper documents or performing the scanning may not have the skill, knowledge or time needed to correctly organize the paper documents.
Additionally, organizing the paper documents prior to scanning can be very time-consuming and expensive.
Further, organizing the pages prior to scanning might properly order the pages, but it does not generate a table of contents, metadata, bookmarks or a hierarchical index that would facilitate finding a particular page within the complete set of pages.
Manually organizing an electronic document, including typing a table of contents, metadata, bookmarks or a hierarchical index, is time-consuming and expensive.
Manual organization tends to be ad-hoc, failing to deliver a standardized table of contents, metadata, bookmarks or a hierarchical index for the electronic document.
This approach requires the recipient to manually categorize each page, a time-consuming and expensive process.

Embodiment Construction

[0028]While the prior art attempts to reduce the cost of electronic document organization through the use of software, none of the above methods of document organization (1) eliminates the human labor and accompanying requirements of education, domain expertise, training, and/or software knowledge, (2) minimizes time spent entering and quality checking page categorization, (3) minimizes errors and (4) protects the privacy of the owners of the data on the electronic documents being organized. What is needed, therefore, is a method of performing electronic document organization that overcomes the above-mentioned limitations and that includes the features enumerated above.

[0029]Preferred embodiments of the present invention provide a method and system for converting paper and digital documents into well-organized electronic documents that are indexed, searchable and editable. The resulting organized electronic documents support more rapid and accurate data entry, retrieval and review th...

Abstract

A method in a document analysis system automatically extracts image and text features from each received electronic document and compares the extracted features with feature sets associated with each category of document to determine whether the document is recognizable as belonging to a document category. If an electronic document is recognized as belonging to one of the document categories, the method classifies the electronic document as belonging to that document category. If, however, an electronic document is unrecognized, the method submits the unrecognized document to a learning phase, in which the unrecognized document is presented to a human trainer for manual classification of the unrecognized electronic document into a document category, and automatically modifies at least one of the features and the weights of the feature set of the document category corresponding to the manually-classified electronic document using the automatically extracted features of the manually-classified document.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

[0001]This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application Ser. No. 60/985,851, filed on Nov. 6, 2007, which is hereby incorporated by reference herein in its entirety.

[0002]This application is related to the following applications filed concurrently herewith, the entire contents of which are incorporated by reference:

[0003]U.S. patent application Ser. No. ______, entitled “Systems and Methods for Classifying Electronic Documents by Extracting and Recognizing Text and Image Features Indicative of Document Categories;”

[0004]U.S. patent application Ser. No. ______, entitled “Systems and Methods for Training a Document Classification System Using Documents from a Plurality of Users;”

[0005]U.S. patent application Ser. No. ______, entitled “Systems and Methods for Parallel Processing of Document Recognition and Classification Using Extracted Image and Text Features;”

[0006]U.S. patent application Ser. No. _...

Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06K9/62, G06V30/40
CPC: G06K9/6885, G06K9/00442, G06V30/40, G06V30/1985
Inventor: NEOGI, DEPANKAR; LADD, STEVEN K.; KUMAR, ARJUN; AHMED, DILNAWAJ
Owner: GRUNTWORX