Unsupervised learning method for Chinese OCR post-processing

An unsupervised learning technique, applied in the fields of instruments, biological neural network models, and character and pattern recognition. It addresses problems such as poor picture quality, poor recognition results, and the resulting impact on dossier information extraction.

Pending Publication Date: 2020-02-11
NANJING UNIV

AI Technical Summary

Problems solved by technology

However, owing to factors such as poor picture quality or complex page structure, the recognition results are sometimes poor, which in turn affects the extraction of dossier information.



Embodiment Construction

[0055] In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0056] The purpose of the present invention is to solve the problem of OCR post-processing for scanned legal files. It is an unsupervised learning method for Chinese OCR post-processing. Building on prior work, it proposes an OCR recognition model and an OCR error correction model. The recognition model combines the results of current classic models and mature OCR systems (Tesseract, Baidu OCR). The error correction model takes the results of the OCR recognition system as input and applies an unsupervised multi-input OCR error correction method, which avoids the need for a large amount of manual annotation. The entire model uses classic network models from the industry and does not adopt a particularly complicated network hierarchy. The ...
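The multi-input idea in [0056] can be illustrated with a minimal sketch. It assumes that each OCR system (for example Tesseract and Baidu OCR) has already produced a text string for the same line, and it replaces the neural correction model with a plain character-level majority vote, which is not the method of the invention. The function name `vote_characters` and the sample strings are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote_characters(witnesses):
    """Majority-vote each character of the first OCR output against the others.

    Simplification (an assumption of this sketch): only aligned spans of equal
    length cast votes, so insertions and deletions keep the first output's text.
    """
    base = witnesses[0]
    # One ballot box per character position of the base output.
    votes = [Counter({ch: 1}) for ch in base]
    for other in witnesses[1:]:
        matcher = SequenceMatcher(None, base, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Equal-length spans can be compared position by position.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset][other[j1 + offset]] += 1
    # On a tie the base character wins, because it was counted first.
    return "".join(box.most_common(1)[0][0] for box in votes)


# Hypothetical outputs of three OCR systems for the same scanned line.
outputs = ["被告人张三于2O18年", "被告人张三于2018年", "被吿人张三于2018年"]
print(vote_characters(outputs))
```

With these three sample strings the vote restores "被告人张三于2018年": each system makes a different single-character error, and the other two outvote it. No labelled ground truth is needed, which is the property the paragraph attributes to the unsupervised approach.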


Abstract

Scanned documents from different regions of the legal field, dated 2014 to 2018, are collected, covering dozens of resolutions. On the basis of this large body of legal document data, domain knowledge from legal documents is incorporated, and research on Chinese OCR post-processing is carried out. An OCR recognition model is constructed from a classic model and mature OCR systems (Tesseract, Baidu OCR). A large number of diverse witnesses are obtained, so manual annotation is not required. Based on the results of the OCR recognition system, the invention provides an unsupervised multi-input OCR correction method and constructs an OCR correction model, which avoids a large amount of manual annotation. Experimental results show that the proposed unsupervised learning model improves accuracy on the corpus to a certain extent. This further indicates that OCR recognition results can be corrected well with the multi-input unsupervised learning method.
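The abstract does not state how accuracy is measured. One common choice for OCR output, assumed here purely for illustration, is character-level accuracy derived from the edit distance between a reference transcription and the OCR or corrected text. The function names below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Return 1 minus the character error rate, clipped at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    error_rate = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - error_rate)
```

Comparing `character_accuracy(reference, raw_ocr)` with `character_accuracy(reference, corrected)` on a small held-out sample is one way to check whether a correction step helps. The reference text is needed only for evaluation, not for training.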

Description

Technical field

[0001] The invention relates to a method for recommending legal articles, in particular to an unsupervised learning method for Chinese OCR post-processing, and belongs to the technical fields of natural language processing and image processing.

Background

[0002] In recent years, the Supreme People's Court has vigorously promoted the informatization of the people's courts as part of the strategic deployment of comprehensive law-based governance. Legal dossiers are the paper records of the entire trial process and generally have to be scanned electronically for archiving. They cover a wide range of content, including procedural documents such as those for filing, detention, arrest, release on bail and other compulsory measures, and prosecution opinions. They also include case evidence, such as photographs of physical evidence, witnesses, interview records of the victim, appraisal opinions, and on-site inspection reports. The digitization of legal files...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/34, G06K9/38, G06K9/46, G06K9/62, G06N3/04
CPC: G06N3/049, G06V10/273, G06V30/153, G06V10/28, G06V10/507, G06V30/287, G06V30/10, G06N3/045, G06F18/217
Inventor: 葛季栋, 李传艺, 姚林霞, 乔洪波, 杨关, 熊凯奇, 周筱羽, 骆斌
Owner: NANJING UNIV