
Literature table content recognition and information extraction method based on image processing

An image processing and content recognition technology, applied in character and pattern recognition, special data processing applications, instruments, etc. It addresses the problems that recognized content is not restored into table form, that existing methods cannot handle the diverse forms of tables, and that recognition results are unsatisfactory. Its effects are thorough removal of frame lines, an effective and feasible method, and the promotion of research and development.

Active Publication Date: 2021-05-28
SHANGHAI UNIV
Cites: 12; Cited by: 4

AI Technical Summary

Problems solved by technology

[0003] In early computer vision, Hough line detection was used to detect table frames. Edge extraction first obtains the edges of the characters and table frame lines in the picture, and the Hough line detection method is then applied to those edges; an edge that meets a certain threshold is treated as a straight line. The recognition results of this method are not ideal, however, and it cannot handle tables of varied forms or frame lines of variable thickness.
Common table content recognition uses optical character recognition to identify the character content but does not restore the recognized content into the shape of the table. The result loses the clarity with which a table presents data, so a new table restoration method is needed to solve this problem.



Examples


Embodiment 1

[0050] In this embodiment, a method based on image processing for recognizing the content of tables in literature and extracting information from them includes the following steps:

[0051] (1) Read a document, extract the table portions of the document, convert each into picture format and save it, and store the picture's access path in a path list;

[0052] (2) Read a table picture and remove its frame lines through binarization, opening operations that extract straight lines, and a bitwise AND. For line extraction, opening operations with different kernels extract the horizontal and vertical straight lines, which are then superimposed on a single picture; a bitwise AND of this picture with the binary picture completes the removal of the table frame;

[0053] (3) Acquire and cut out the text areas: expand the table picture after its frame lines have been removed and bina...

Embodiment 2

[0059] This embodiment is basically the same as Embodiment 1, with the following particular features:

[0060] In this embodiment, step (2) processes the input table image to obtain a binary image without frame lines. The specific steps are as follows:

[0061] (2-1) The original image is first converted into a grayscale image, and inverse binarization with a fixed threshold is then applied to obtain the binary image of the original image;

[0062] (2-2) First perform an opening operation that preserves vertical features on the binary image of the original image, obtaining a vertical-line binary image that retains only vertical lines; then perform an opening operation that preserves horizontal features on the same binary image, obtaining a horizontal-line binary image that retains only horizontal lines;

[0063] (2-3) Superimpose the vertical-line binary image and the horizontal-line binary image and invert the result to obtain the frame-line binary image, in wh...
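
Steps (2-1) to (2-3), together with the bitwise AND from step (2) of Embodiment 1, can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the threshold of 128 and the kernel length `k` are assumptions, images are plain lists of rows, and the opening with a 1 x k (or k x 1) kernel is written directly as "keep only runs of at least k foreground pixels" instead of calling an image library.

```python
def binarize_inv(gray, thresh=128):
    """Inverse fixed-threshold binarization: dark pixels (ink) -> 1, light -> 0."""
    return [[1 if v < thresh else 0 for v in row] for row in gray]


def open_horizontal(img, k):
    """Opening with a 1 x k kernel: keep only horizontal runs of >= k foreground pixels."""
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= k:  # run is long enough to contain the kernel
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def open_vertical(img, k):
    """Opening with a k x 1 kernel, done as a horizontal opening on the transpose."""
    return transpose(open_horizontal(transpose(img), k))


def remove_frame_lines(gray, thresh=128, k=5):
    binary = binarize_inv(gray, thresh)            # (2-1)
    horiz = open_horizontal(binary, k)             # (2-2) horizontal lines only
    vert = open_vertical(binary, k)                # (2-2) vertical lines only
    # (2-3) superimpose and invert: mask is 0 on frame lines, 1 elsewhere
    mask = [[0 if (h or v) else 1 for h, v in zip(hr, vr)]
            for hr, vr in zip(horiz, vert)]
    # bitwise AND with the binary image removes the frame lines
    return [[b & m for b, m in zip(br, mr)] for br, mr in zip(binary, mask)]


# A 7 x 7 picture: a dark rectangular frame with one dark "character" pixel inside.
gray = [[255] * 7 for _ in range(7)]
for i in range(7):
    gray[0][i] = gray[6][i] = gray[i][0] = gray[i][6] = 0
gray[3][3] = 0
cleaned = remove_frame_lines(gray, k=5)
```

In this toy picture only the pixel at row 3, column 3 survives. The kernel length has to be longer than any stroke of a character and shorter than the shortest frame line, which is why it is left as a parameter here.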

Embodiment 3

[0067] This embodiment is basically the same as the previous embodiments, with the following particular features:

[0068] In this embodiment, step (3) identifies the areas of the table picture that contain characters and cuts them out. The specific steps are as follows:

[0069] (3-1) Perform an erosion operation on the binarized table image from which the frame lines have been removed, with stronger erosion in the horizontal direction, so that adjacent characters are joined into a single block;

[0070] (3-2) Use contour finding on the binary image to find all candidate target areas in the eroded picture, and number each target area in turn;

[0071] (3-3) Screen the target areas: those whose area is smaller than a threshold number of pixels are filtered out, and the remainder are the qualifying character block areas to be recognized;

[0072] (3-4) According to the coordinate range...
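
Steps (3-1) to (3-3) can be sketched in plain Python as below. Several things here are assumptions rather than the patent's method: the code treats character pixels as 1, so the erosion of a dark-on-light picture becomes a horizontal dilation of the foreground; 4-connected component labeling stands in for contour finding; and the names `dilate_horizontal`, `find_blocks`, `text_blocks`, `crop`, the radius `r` and the threshold `min_area` are hypothetical. Because step (3-4) is cut off in the text, `crop` only illustrates the "cut out" mentioned in [0068] by slicing a bounding box.

```python
from collections import deque


def dilate_horizontal(img, r):
    """Spread every foreground pixel r pixels to the left and right."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for i in range(max(0, x - r), min(w, x + r + 1)):
                    out[y][i] = 1
    return out


def find_blocks(img):
    """Label 4-connected regions; return numbered bounding boxes with pixel areas."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blocks = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            ys, xs = [], []
            while queue:
                y, x = queue.popleft()
                ys.append(y)
                xs.append(x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blocks.append({"id": len(blocks), "x0": min(xs), "y0": min(ys),
                           "x1": max(xs), "y1": max(ys), "area": len(ys)})
    return blocks


def text_blocks(img, r=1, min_area=4):
    merged = dilate_horizontal(img, r)                    # (3-1) join neighbours
    candidates = find_blocks(merged)                      # (3-2) numbered candidates
    return [b for b in candidates if b["area"] >= min_area]  # (3-3) drop small areas


def crop(img, b):
    """Slice the bounding box of block b out of img."""
    return [row[b["x0"]:b["x1"] + 1] for row in img[b["y0"]:b["y1"] + 1]]


# Three character pixels close together on the left, one isolated speck on the right.
img = [
    [0] * 12,
    [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0] * 12,
]
blocks = text_blocks(img, r=1, min_area=4)
```

With these inputs the three neighbouring pixels merge into one block spanning columns 0 to 5, while the isolated speck forms a two-pixel region and is filtered out.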


Abstract

The invention discloses a literature table content recognition and information extraction method based on image processing, which applies computer image detection methods to achieve content recognition, information extraction and structure restoration for literature table pictures. The method comprises the following steps: first, a table picture is read and its frame lines are removed using a morphological method; character regions are then found using contour detection, and each region is cut out and stored; several character block pictures are selected and spliced into one large picture, a character recognition model is called to recognize the characters on the large picture, and the recognition result is parsed and stored; finally, the character block information is read, the table is restored using a row discovery and self-adaptive column alignment restoration algorithm based on character block coordinates, and the restored table is stored in a database. The method achieves frame line removal, content recognition and structure restoration for table pictures in the literature, improves the speed of literature information extraction, provides a way to build a database for the corresponding subject, and promotes research and development in that subject.
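
The abstract names a row discovery and self-adaptive column alignment algorithm based on character block coordinates but does not describe it here. The sketch below is therefore only one simple, hypothetical way to restore a grid from block coordinates, not the patent's algorithm: blocks are assumed to be dicts with a left edge `x`, a vertical centre `cy` and recognized `text`, and the tolerances `y_tol` and `x_tol` are invented for illustration.

```python
def group_rows(blocks, y_tol=5):
    """Put blocks whose vertical centre is within y_tol of a row's first block into that row."""
    rows = []
    for b in sorted(blocks, key=lambda b: b["cy"]):
        if rows and abs(b["cy"] - rows[-1][0]["cy"]) <= y_tol:
            rows[-1].append(b)
        else:
            rows.append([b])
    return [sorted(r, key=lambda b: b["x"]) for r in rows]


def column_anchors(blocks, x_tol=10):
    """Cluster left edges; the smallest x of each cluster becomes a column anchor."""
    anchors = []
    for x in sorted(b["x"] for b in blocks):
        if not anchors or x - anchors[-1] > x_tol:
            anchors.append(x)
    return anchors


def restore_table(blocks, y_tol=5, x_tol=10):
    """Return a list of rows, each a list of cell strings ('' for an empty cell)."""
    anchors = column_anchors(blocks, x_tol)
    table = []
    for row in group_rows(blocks, y_tol):
        cells = [""] * len(anchors)
        for b in row:
            # assign the block to the column whose anchor is nearest its left edge
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - b["x"]))
            cells[col] = b["text"]
        table.append(cells)
    return table


# Hypothetical recognized blocks from a two-column table with one empty cell.
blocks = [
    {"x": 10, "cy": 12, "text": "Alloy"},
    {"x": 100, "cy": 10, "text": "Hardness"},
    {"x": 12, "cy": 40, "text": "A1"},
    {"x": 103, "cy": 41, "text": "210"},
    {"x": 11, "cy": 70, "text": "A2"},
]
table = restore_table(blocks)
```

Deriving the column anchors from the blocks themselves, rather than from fixed positions, is what lets a sketch like this cope with a missing cell such as the one in the last row; the restored rows could then be written to a database as the abstract describes.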

Description

Technical field

[0001] The invention relates to an image processing-based method for content recognition and information extraction from literature tables. It covers detecting character areas in a table picture, recognizing the character content, and restoring the content, according to the shape of the table, into a database and into files that are easy to read and write. It can be applied to fields such as table data extraction from the literature of different subjects and construction of the corresponding databases. To a certain extent it improves the speed and scope of data extraction from subject literature and provides basic scientific data and empirical data, improving the development progress and research efficiency of the research direction and promoting research and development in the relevant disciplines.

Background technique

[0002] Tables are a highly refined form of content presentation. In scientific literature, all important information, data that needs to be compared or experimental ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/38, G06F16/90
CPC: G06F16/90, G06V30/40, G06V10/28
Inventor: 韩越兴, 张家旺, 张瑞, 陈侨川, 钱权, 夏锦桦, 王迎港
Owner: SHANGHAI UNIV