Method for extracting information from erroneous OCR results

An information extraction technology for error-prone OCR output, applied in the fields of instruments, character and pattern recognition, and computer components. It addresses the problems that labeled data is difficult to acquire, that hand-written matching rules are error-prone, and that possible errors cannot be enumerated exhaustively. It improves the matching quality and the final extraction result by reducing the penalty for misrecognized characters.

Pending Publication Date: 2021-08-06
上海兑观信息科技技术有限公司

AI Technical Summary

Problems solved by technology

[0004] In information extraction, one solution is to use manually written regular expressions to match the anticipated errors one by one. However, a regular expression is still an exact-matching scheme: it cannot cover every possible error, writing the expressions takes a large amount of work, and the person writing them is prone to mistakes.
[0005] Another solution is to use neural networks, but these often require a large amount of labeled data, which is difficult to obtain in real business settings.
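
The limitation described in [0004] can be illustrated with a minimal sketch. The pattern, the sample line, and the misrecognized character below are hypothetical examples chosen for illustration (性别 = gender, 民族 = ethnicity); they are not taken from the patent text.

```python
import re

# Hypothetical hand-written rule for an ID-card line such as "性别男民族汉".
# The pattern is an illustrative assumption, not the patent's rule.
PATTERN = re.compile(r"性别(?P<gender>男|女)民族(?P<ethnicity>.+)")


def extract(line: str):
    """Return (gender, ethnicity) if the rule matches exactly, else None."""
    m = PATTERN.search(line)
    return (m.group("gender"), m.group("ethnicity")) if m else None


clean = "性别男民族汉"
noisy = "性刖男民族汉"  # assumed OCR error: 别 misread as the similar-looking 刖

print(extract(clean))  # the rule matches the clean line
print(extract(noisy))  # one wrong character is enough to defeat the rule
```

Covering the noisy line would need another alternative in the pattern for each anticipated misreading, which is the enumeration burden the paragraph describes.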

Examples

Embodiment 1

[0075] This embodiment builds on the above method for extracting information from OCR results that contain errors, and uses the extraction of gender and ethnicity from an ID card as an example of a specific implementation.

[0076] As shown in Figure 1, the main steps of the method provided by the present invention include: obtaining the OCR recognition result of an image; post-processing the OCR result; obtaining several lines of text strings from the post-processing; inputting a prepared sequence character template; inputting a pre-generated table of similar-shaped characters; performing sequence alignment and matching on each line of text; and selecting the line with the highest matching score to extract the information. The specific process is as follows:

[0077] 1. Recognize the text in the target image through OCR technology and obtain the OCR result.

[0078] 2. Post-process the OCR recognition results and merge the text of each line. The specific method i...
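
The alignment, line selection, and extraction steps listed in [0076] can be sketched roughly as follows. This is a plain Needleman-Wunsch global alignment, not the optimized variant the patent refers to, whose details are not given here. The score values, the `*` slot marker, the template, and the similar-character table are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# All scores, the slot marker, and the table are illustrative assumptions.
MATCH, SIMILAR, MISMATCH, GAP = 2, 1, -1, -1
SLOT = "*"  # template position whose aligned OCR character is extracted
SIMILAR_CHARS: Dict[str, Set[str]] = {"别": {"刖"}, "民": {"氏"}}


def score_pair(t: str, c: str) -> int:
    """Score a template character against an OCR character."""
    if t == SLOT or t == c:
        return MATCH
    if c in SIMILAR_CHARS.get(t, set()):
        return SIMILAR  # reduced penalty for a visually similar character
    return MISMATCH


def align(template: str, line: str) -> Tuple[int, List[Optional[str]]]:
    """Globally align template and line.

    Returns the alignment score and, for each template position,
    the OCR character aligned to it (None where the template hits a gap).
    """
    n, m = len(template), len(line)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score_pair(template[i - 1], line[j - 1]),
                dp[i - 1][j] + GAP,  # template character has no partner
                dp[i][j - 1] + GAP,  # extra OCR character
            )
    # Trace back from the bottom-right cell to recover the alignment.
    aligned: List[Optional[str]] = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score_pair(template[i - 1], line[j - 1]):
            aligned[i - 1] = line[j - 1]
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    return dp[n][m], aligned


def extract(template: str, lines: List[str]) -> List[Optional[str]]:
    """Pick the best-scoring line and return the characters in the slots."""
    _, aligned = max((align(template, ln) for ln in lines), key=lambda r: r[0])
    return [c for t, c in zip(template, aligned) if t == SLOT]


# Hypothetical post-processed OCR lines; the second contains two misreadings.
ocr_lines = ["姓名张三", "性刖男氏族汉", "出生1990年1月1日"]
print(extract("性别*民族*", ocr_lines))
```

With these assumed scores, each misread character that appears in the table contributes +1 rather than -1, so the damaged line still scores well above the unrelated lines. In this sketch each slot captures a single character; a multi-character field would need a richer template representation.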

Abstract

The invention is applicable to the technical field of image text processing and provides a method for extracting information from erroneous OCR results. The method comprises the following steps: obtaining the result of extracting image text through OCR; post-processing the OCR result and merging it into lines; defining an extraction template according to the information extraction target; performing fuzzy matching between the template and every OCR line using an optimized global sequence alignment algorithm; refining the matching alignment result using a library of similar-shaped characters; and extracting the target information according to the matching alignment result. The invention further provides a method for generating the similar-character library from a neural network recognition model. With this library, the information carried by wrongly recognized characters can be used more effectively, which improves extraction precision. Compared with the prior art, the method handles OCR errors effectively and greatly improves information extraction when characters are missing, extra, or wrong.
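
The abstract does not say how the similar-character library is derived from the recognition model. One plausible reading, shown here only as a sketch, is to collect the alternative characters to which the recognizer assigns noticeable probability. The function name, the threshold, and the sample outputs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set, Tuple


def build_similar_table(
    predictions: Iterable[Tuple[str, Mapping[str, float]]],
    threshold: float = 0.05,  # assumed cut-off, not from the patent
) -> Dict[str, Set[str]]:
    """Map each true character to the other characters the model confuses it with.

    Each prediction is (true_char, {candidate_char: probability}).
    """
    table: Dict[str, Set[str]] = defaultdict(set)
    for true_char, probs in predictions:
        for cand, p in probs.items():
            if cand != true_char and p >= threshold:
                table[true_char].add(cand)
    return dict(table)


# Made-up recognizer outputs, for illustration only.
samples = [
    ("别", {"别": 0.80, "刖": 0.15, "到": 0.03}),
    ("民", {"民": 0.90, "氏": 0.08}),
    ("别", {"别": 0.95, "刖": 0.04}),
]
table = build_similar_table(samples)
print(table)
```

A table built this way reflects the confusions of the specific recognizer in use, which may be why the abstract ties the library to the recognition model instead of a fixed dictionary.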

Description

Technical field

[0001] The invention belongs to the technical field of image text processing, and in particular relates to a method for extracting information from OCR results that contain errors.

Background

[0002] OCR (Optical Character Recognition) technology is used to recognize text contained in images. At present, OCR based on deep neural networks is relatively mature and can achieve high accuracy on many public data sets. As OCR has matured, many OCR-based applications have become practical, such as document digitization, ID card information extraction, and invoice information extraction. Information extraction based on OCR generally follows one of two approaches: using image position or using text matching. The former can locate the extraction target accurately, but it is not suitable for business scenarios where the content location is not fixed; the latter relies on the accuracy of ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06K9/34
CPC: G06V30/153, G06F18/22, G06F18/2415
Inventor: 陈恒生
Owner: 上海兑观信息科技技术有限公司