Visual rich document information extraction method for actual OCR scene

A visually rich document information extraction technology, applied in the field of visual information extraction, which addresses the difficulty of error-free OCR prediction, unclear named entity boundaries, and the mismatch between manually annotated positioning boxes and flawed OCR results.

Active Publication Date: 2021-05-14
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0003] Visual features such as glyphs, text position, layout, and font size are important cues for extracting information from document images, so many methods incorporate document images into sequence labeling models and obtain better results than with plain text alone. However, most existing research assumes that the OCR (Optical Character Recognition) results are accurate and cannot handle flawed OCR results.
Error-free OCR prediction on document images is very difficult to achieve, and manually annotated positioning boxes cannot be used directly for information extraction on flawed OCR results, because such results usually contain a large amount of repeated or missed content, which directly degrades the performance of the VIE model.
In addition, VIE systems that fuse the positions of text segments face unclear named entity boundaries, so substantial post-processing is needed to obtain the final correct result.
Although VIE, as a downstream task of OCR, should account for the fact that human annotations cannot fully match OCR results, this problem has often been ignored in previous studies.
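
The mismatch between a human annotation and a flawed OCR result can be made concrete with a small sketch. The function name `diff_ocr_against_annotation` and the sample strings are hypothetical and are not taken from the patent; the sketch only uses Python's standard `difflib` to list where an OCR string repeats, misses, or replaces content relative to the annotated text.

```python
from difflib import SequenceMatcher


def diff_ocr_against_annotation(annotated: str, ocr: str):
    """List every place where the OCR text departs from the annotation.

    Each item is (operation, annotated_piece, ocr_piece), where operation is
    "insert" (extra OCR content), "delete" (missed content) or "replace".
    """
    # autojunk=False keeps the matcher exact on long, repetitive strings.
    matcher = SequenceMatcher(None, annotated, ocr, autojunk=False)
    mismatches = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((tag, annotated[a0:a1], ocr[b0:b1]))
    return mismatches


if __name__ == "__main__":
    # Hypothetical ticket field: the OCR output repeats one character
    # and misses another.
    for item in diff_ocr_against_annotation("Total: 128.50", "Totall: 12850"):
        print(item)
```

Any non-empty result means the annotated entity span no longer lines up character for character with the OCR text, which is the situation the paragraph above describes.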


Examples

Embodiment

[0052] As shown in Figure 1 and Figure 2, the method of the present invention for extracting visually rich document information in an actual OCR scene comprises the following steps:

[0053] S1. Collect visually rich text images that contain key information in the actual scene, and annotate the text lines of the collected images, specifically:

[0054] In this embodiment, the visually rich text image data set includes both simple-layout and complex-layout data, consisting of bills, train tickets, and passports, with 4306, 1500, and 2331 images respectively, for a total of 8137 images.

[0055] S11. Annotate the collected images with text line positions, text content, and named entity attributes, specifically:

[0056] The named entity attributes are labeled against the actual OCR results: each named entity label is assigned to the words of a sentence using the BIO tagging scheme for sequence labeling; ...
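
As an illustration of the BIO scheme mentioned above, the following minimal sketch tags a string at character level. The function name `bio_tags`, the `(start, end, label)` span format, and the `DATE` label are assumptions made for this example; the patent text shown here does not specify its annotation format.

```python
def bio_tags(text: str, entities):
    """Return one BIO tag per character of `text`.

    `entities` is a list of (start, end, label) spans with `end` exclusive.
    Spans are assumed not to overlap (an assumption of this sketch).
    """
    tags = ["O"] * len(text)  # everything outside an entity is "O"
    for start, end, label in entities:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        tags[start] = f"B-{label}"        # first character begins the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining characters are inside it
    return tags


if __name__ == "__main__":
    # Hypothetical OCR text line with one annotated entity.
    print(bio_tags("Date 2021", [(5, 9, "DATE")]))
    # ['O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE']
```

Tagging the OCR output itself, rather than a clean transcription, keeps the labels aligned with whatever characters the recognizer actually produced.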

Abstract

The invention discloses a visually rich document information extraction method for an actual OCR scene. The method comprises the following steps: collecting visually rich text images in the actual scene; extracting text word embedding features and position embedding features at the character level and the word level using a pre-trained word embedding model; training a named entity classification module; constructing a global document graph structure based on the graph convolution GAT and introducing a self-attention mechanism; training a named entity boundary positioning module; constructing a multi-feature aggregation structure; and training an erroneous-semantics correction module that adopts a GRU decoding structure, extracts the encoder hidden states of the corresponding dimension features according to the optimal path of a CRF, and uses the category information of the named entity as prior guidance for each decoder output, so as to obtain named entity information in a standard format. The method effectively improves the precision of visually rich document information extraction in actual OCR detection and recognition applications, and is of great significance for the structured storage of visually rich document information.
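
The abstract refers to the optimal path of a CRF. For a linear-chain CRF this path is conventionally found with Viterbi decoding; the sketch below shows that standard algorithm in plain Python. It is not the patent's implementation: the function name `viterbi`, the score layout, and the example scores are assumptions for illustration only.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][k]   : score of tag k at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    scores = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []             # best previous tag for each position and tag
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + step[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    path = [max(range(num_tags), key=lambda k: scores[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical tag set: 0 = O, 1 = B, 2 = I. The transition O -> I is
    # penalized because an entity cannot start with an inside tag.
    emissions = [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
    transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(viterbi(emissions, transitions))
```

In the method described by the abstract, the positions selected by such a path would determine which encoder hidden states are passed to the GRU decoder; that selection step is not shown here.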

Description

Technical field

[0001] The invention belongs to the technical field of visual information extraction, and in particular relates to a method for extracting visually rich document information in an actual OCR scene.

Background technique

[0002] Visual Information Extraction (VIE), as an important part of Natural Language Processing (NLP), aims to directly extract structured information from unstructured document images, which is a key step in understanding document images. The extracted structured information is widely used in many occasions, such as fast indexing, efficient archiving, and document analysis. The typical method is to formulate the information extraction problem as a sequential labeling problem. In recent years, information extraction from document images (such as invoices, ID cards, and purchase receipts, etc.) has become a research hotspot.

[0003] Since visual features such as glyphs, text position, layout, and font size are important cues for extracting i...

Application Information

IPC(8): G06K9/00, G06F40/295, G06F16/35, G06F40/30
CPC: G06F40/295, G06F16/353, G06F40/30, G06V30/414, G06V30/40, G06V30/10
Inventors: 唐国志, 金连文, 林上港, 汪嘉鹏, 薛洋
Owner: SOUTH CHINA UNIV OF TECH