Proprietary suite of underlying document
image analysis capabilities, including a novel forms enhancement, segmentation and modeling component, forms recognition and
optical character recognition. A future version of the
system will include form reasoning to detect and classify fields on forms with varying
layouts. The product provides acquisition, modeling, recognition and
processing components, and can verify recognized data on the image with a line-by-line comparison. The key enabling technologies center on the recognition and
processing of the scanned forms. The
system learns the positions of lines and the location of text on the preprinted form, and associates regions of the form with specific required fields in the electronic version. Once the form is recognized, the preprinted material is removed and individual regions are passed to an
optical character recognition component. The current proprietary OCR engine is trained with a variety of Roman text fonts and has a back-end dictionary that can be customized per field, since the
system knows which field it is recognizing. The engine performs segmentation to obtain isolated characters and computes a
structure-based feature vector. The characters are normalized and classified using a cluster-centric classifier, which responds well to variations in the symbol's contour. An efficient dictionary lookup scheme provides exact and
edit-distance lookup using a
trie structure. An
edit distance is computed, and a collection of near misses can be output in a lattice to enhance the final recognition result. The current
classification rate can exceed 99% with context. The ultimate goal of this system is to enable the
processing of all tax forms, including forms with handwritten material.
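
The dictionary lookup described above (exact match plus edit-distance match over a trie) can be illustrated with a minimal Python sketch. This is not the product's implementation, which is proprietary and not described in detail here. The class names, the `near_misses` method, the sample field vocabulary, and the use of plain Levenshtein distance with unit costs are all assumptions made for illustration. The sketch returns near misses as a ranked list rather than the lattice mentioned in the text.

```python
class TrieNode:
    """One node of the dictionary trie (hypothetical structure)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    """Dictionary supporting exact lookup and bounded edit-distance lookup."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Exact lookup: walk the trie one character at a time."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def near_misses(self, query, max_dist):
        """Return (word, distance) pairs with Levenshtein distance <= max_dist.

        One row of the edit-distance table is computed per trie node, so
        words that share a prefix share the work for that prefix.
        """
        first_row = list(range(len(query) + 1))
        results = []
        if self.root.is_word and len(query) <= max_dist:
            results.append(("", len(query)))
        for ch, child in self.root.children.items():
            self._search(child, ch, ch, query, first_row, max_dist, results)
        # Closest candidates first, ties broken alphabetically.
        return sorted(results, key=lambda item: (item[1], item[0]))

    def _search(self, node, ch, prefix, query, prev_row, max_dist, results):
        row = [prev_row[0] + 1]
        for col in range(1, len(query) + 1):
            insert_cost = row[col - 1] + 1
            delete_cost = prev_row[col] + 1
            replace_cost = prev_row[col - 1] + (query[col - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if node.is_word and row[-1] <= max_dist:
            results.append((prefix, row[-1]))
        # Prune: if every entry exceeds the bound, no extension can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                self._search(child, next_ch, prefix + next_ch, query,
                             row, max_dist, results)


# Hypothetical vocabulary for a single form field.
field_dictionary = Trie(["income", "incomes", "interest", "wages"])
candidates = field_dictionary.near_misses("incame", max_dist=1)
```

Here `candidates` holds the dictionary words within one edit of the OCR output `"incame"`, ranked by distance. In a field-aware setup of the kind the text describes, one such trie could be built per field so that each region is checked only against the vocabulary expected there.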