Low-resource language OCR (Optical Character Recognition) method fusing language information

A technology for low-resource language OCR that fuses language information. It is applied in the fields of instruments, computing, and character and pattern recognition. It addresses the scarcity of training data resources, fits the characteristics of the data set well, provides comprehensive coverage, and improves recognition performance.

Active Publication Date: 2021-09-21
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

[0007] The purpose of the present invention is to address the problem that existing low-resource language OCR is limited by the scarcity of training data resources, which leaves a large gap in recognition ability between low-resource and high-resource languages. To this end, it proposes a low-resource language OCR method that integrates language information. The method first augments the OCR training data set of the low-resource language. It then transfers the OCR model of a high-resource language to an OCR model of the low-resource language through transfer learning with a hybrid fine-tuning migration strategy. Next, it builds a vocabulary for the low-resource language, which is used to find errors in the OCR recognition results and serves as the basis for generating correction options. Finally, OCR recognition based on the hybrid fine-tuning strategy and text correction are performed on the pictures in the test set, so that fusing language information improves the accuracy of the OCR recognition results for low-resource languages.
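The vocabulary step can be illustrated with a minimal sketch. It assumes whitespace-tokenized text, treats any out-of-vocabulary word as a suspected OCR error, and ranks in-vocabulary words by Levenshtein edit distance as correction options. The function names, the `min_count` and `max_distance` parameters, and the toy English corpus are all hypothetical and are not taken from the patent.

```python
from collections import Counter


def build_vocabulary(corpus_lines, min_count=1):
    """Collect every whitespace-separated word seen at least min_count times."""
    counts = Counter(word for line in corpus_lines for word in line.split())
    return {w for w, c in counts.items() if c >= min_count}


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_suspect_words(sentence, vocabulary):
    """Return (position, word) pairs for words missing from the vocabulary."""
    return [(i, w) for i, w in enumerate(sentence.split()) if w not in vocabulary]


def correction_candidates(word, vocabulary, max_distance=2):
    """Vocabulary words within max_distance edits, closest first."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_distance]


# Toy data standing in for open-source text of the low-resource language.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = build_vocabulary(corpus)

suspects = find_suspect_words("the cot sat on the mat", vocab)
candidates = correction_candidates("cot", vocab, max_distance=1)
print(suspects, candidates)
```

In practice the vocabulary would be built from a much larger corpus, and a frequency threshold above 1 may help keep noisy tokens out of it.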

Examples

Embodiment 1

[0025] A low-resource language OCR method fusing language information according to the present invention includes: obtaining open-source texts of the low-resource language to generate pictures, and augmenting the OCR training data of the low-resource language based on image and text characteristics; selecting, based on the similarity between languages, a high-resource language that is highly similar to the low-resource language, and applying the hybrid fine-tuning migration strategy to migrate the OCR model of the high-resource language to the OCR model of the low-resource language; then performing recognition with the OCR model and using the score of each recognition result as the basis for judging whether the result contains errors. Vocabulary detection is carried out on sentences with low scores to locate the misrecognized words, and multi-strategy fusion is used to generate possible correction schemes based on the vocabulary and edit distance; finally, each correction s...
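The excerpt does not say how the similarity between languages is measured. As one possible illustration only, the sketch below compares the sets of character bigrams found in text samples using Jaccard similarity and picks the highest-scoring candidate as the source language for transfer. The metric, the function names, and the toy strings are assumptions, not the patent's procedure.

```python
def char_ngrams(text, n=2):
    """Set of character n-grams in the text, ignoring spaces."""
    text = text.replace(" ", "")
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_language(target_sample, candidate_samples, n=2):
    """Return the best-matching candidate name and the full score table."""
    target = char_ngrams(target_sample, n)
    scores = {
        lang: jaccard(target, char_ngrams(sample, n))
        for lang, sample in candidate_samples.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder strings; real samples would be corpora in each language.
low_resource_sample = "abab cdcd"
high_resource_samples = {"lang_a": "abab cd", "lang_b": "xyz xyz"}

best, scores = most_similar_language(low_resource_sample, high_resource_samples)
print(best, scores)
```

A character-level measure like this mainly captures shared script and spelling patterns, which matter for OCR. Other notions of similarity, such as language family, could be substituted without changing the selection logic.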

Abstract

The invention discloses a low-resource language OCR (Optical Character Recognition) method fusing language information, and belongs to the technical field of OCR. The method comprises the following steps: acquiring open-source text of a low-resource language to generate pictures, and augmenting the OCR training data of the low-resource language based on image and character characteristics; selecting, based on the similarity between languages, a high-resource language that is highly similar to the low-resource language, and migrating the OCR model of the high-resource language to an OCR model of the low-resource language by applying a hybrid fine-tuning migration strategy; performing recognition with the OCR model and using the score of each recognition result as the basis for judging whether the result contains an error; carrying out vocabulary detection on sentences with low scores to locate the misrecognized words, and adopting multi-strategy fusion to generate possible correction schemes on the basis of the vocabulary and the edit distance; and finally, scoring each correction scheme of the OCR sequence and selecting the optimal correction scheme. The method improves OCR recognition accuracy for low-resource languages, which otherwise suffers from the scarcity of data resources.
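The final step, scoring each correction scheme and selecting the optimal one, might look like the following sketch. The abstract does not specify the scoring model, so an add-one smoothed bigram language model stands in here as an assumption. `BigramScorer`, `best_correction`, and the toy corpus are hypothetical names and data.

```python
import math
from collections import Counter
from itertools import product


class BigramScorer:
    """Add-one smoothed bigram language model over whitespace tokens."""

    def __init__(self, corpus_lines):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for line in corpus_lines:
            tokens = ["<s>"] + line.split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def score(self, tokens):
        """Sum of smoothed log-probabilities; higher means more plausible."""
        tokens = ["<s>"] + list(tokens)
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, word)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def best_correction(tokens, candidates_by_position, scorer):
    """Enumerate every combination of candidates and keep the top-scoring one."""
    options = [candidates_by_position.get(i, [tok]) for i, tok in enumerate(tokens)]
    return max(product(*options), key=scorer.score)


corpus = ["the cat sat on the mat", "the dog sat on the log"]
scorer = BigramScorer(corpus)

ocr_tokens = "the cot sat on the mat".split()
# Position 1 was flagged; the original token stays in the list as one option.
schemes = {1: ["cot", "cat", "mat"]}

best = best_correction(ocr_tokens, schemes, scorer)
print(" ".join(best))
```

Exhaustive enumeration grows exponentially with the number of flagged positions, so a beam search or a cap on candidates per position would likely be needed for longer sequences.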

Description

Technical field

[0001] The present invention relates to a low-resource language OCR method that integrates language information, and in particular to a training method based on a hybrid fine-tuning strategy and a text correction method that integrates language information. Modeling is performed by fusing language information into low-resource language OCR, which improves the recognition ability of the model on the test set. The invention belongs to the technical field of OCR.

Background art

[0002] Optical character recognition (OCR) technology simulates human visual intelligence and recognizes the text in an image by processing and analyzing the image. It sits at the intersection of computer vision and natural language processing. The technology builds a bridge between two information carriers, image and text, and can quickly extract the text in an image, replacing manual re-entry.

[0003] With the increasing research result...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/34, G06K9/62
CPC: G06F18/217, G06F18/25
Inventor: 冯冲, 滕嘉皓
Owner: BEIJING INSTITUTE OF TECHNOLOGY