PDF table structure identification method based on image identification

A table structure recognition technology, applied in the field of image recognition, addresses problems such as information loss in PDF files, inaccurate parsing, and inaccurate recognition of complex tables. Its effects are to eliminate character omission, speed up convergence, and improve recognition accuracy.

Active Publication Date: 2020-05-12
杭州费尔斯通科技有限公司

AI Technical Summary

Problems solved by technology

For tables in PDF files, existing parsing methods generally fall into three types: parsing tables by reading the XML information of the PDF (for example, with the xpdf tool); converting the PDF to XML, HTML, Word, or other formats and then parsing (for example, with the pdf-docx tool); and converting the PDF to an image and then performing structure recognition. The first two methods cannot parse tables accurately because information in the PDF file itself is lost. The third method relies mainly on image recognition algorithms, and existing methods of this kind cannot accurately recognize complex tables.

Detailed Description of the Embodiments

[0029] In order to make the above objects, features and advantages of the present invention more comprehensible, specific implementations of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0030] In the following description, many specific details are set forth to provide a full understanding of the present invention. However, the present invention can also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present invention. The present invention is therefore not limited to the specific examples disclosed below.

[0031] As shown in Figure 1, the present invention provides a PDF table structure recognition method based on image recognition. The method converts documents in other formats into images. For each input image, the position of the table is recognized and the table area is cropped. The table area identi...
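
The cropping step can be illustrated with a minimal sketch. It assumes the page image is a plain list of pixel rows and that the table bounding box has already been produced by some upstream table detector; the function name `crop_table`, the `margin` parameter, and the box layout are hypothetical and do not come from the patent text.

```python
def crop_table(page, box, margin=0):
    """Return the part of `page` inside `box` = (left, top, right, bottom).

    `page` is a list of pixel rows; `right` and `bottom` are exclusive.
    The box is grown by `margin` pixels and clamped to the page bounds,
    so a detector box that touches the page edge cannot index out of range.
    All names here are illustrative assumptions, not the patented method.
    """
    height = len(page)
    width = len(page[0]) if height else 0
    left, top, right, bottom = box
    left = max(0, left - margin)
    top = max(0, top - margin)
    right = min(width, right + margin)
    bottom = min(height, bottom + margin)
    return [row[left:right] for row in page[top:bottom]]


# Usage: a tiny 4x5 "image" whose pixel value encodes its row and column.
page = [[r * 10 + c for c in range(5)] for r in range(4)]
table_area = crop_table(page, (1, 1, 3, 3))
```

A small margin is often useful in practice so that table borders at the edge of the detected box are not cut off, but whether the patented method pads the box is not stated in the visible text.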

Abstract

The invention discloses a PDF table structure identification method based on image identification. The method includes converting a PDF document into images and, for each input image, identifying the position of the table, cropping the table area, identifying the text blobs in the table area, finding the adjacent blobs of each blob, predicting the relationship between the blob and each adjacent blob, and finally obtaining the structure of the table from these relationships. The method removes image features, adds features of the edges between adjacent blocks, and uses the blob field to narrow the search range for text blocks in the table, which greatly increases the convergence rate and the recognition accuracy. Detection and post-processing of the text blobs eliminate the problem of character omission.
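
The steps from text blobs to table structure can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the source predicts the relationship between adjacent blobs with a trained model, whereas this sketch swaps in a simple geometric overlap rule. The names `Blob`, `adjacent_pairs`, `relation`, `groups`, `rank`, and `table_structure` are hypothetical, and the blobs are assumed to be axis-aligned bounding boxes already detected in the cropped table area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    """Axis-aligned bounding box of one text blob (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def adjacent_pairs(blobs):
    """For each blob, pick its nearest neighbor to the right and nearest below."""
    pairs = set()
    for i, a in enumerate(blobs):
        right = [(b.x0 - a.x1, j) for j, b in enumerate(blobs)
                 if j != i and b.x0 >= a.x1 and overlap(a.y0, a.y1, b.y0, b.y1) > 0]
        below = [(b.y0 - a.y1, j) for j, b in enumerate(blobs)
                 if j != i and b.y0 >= a.y1 and overlap(a.x0, a.x1, b.x0, b.x1) > 0]
        for candidates in (right, below):
            if candidates:
                pairs.add((i, min(candidates)[1]))
    return pairs


def relation(a, b):
    """Assumed stand-in for the learned predictor: 'row', 'col', or None."""
    if overlap(a.y0, a.y1, b.y0, b.y1) > 0.5 * min(a.y1 - a.y0, b.y1 - b.y0):
        return "row"
    if overlap(a.x0, a.x1, b.x0, b.x1) > 0.5 * min(a.x1 - a.x0, b.x1 - b.x0):
        return "col"
    return None


def groups(n, edges):
    """Union-find: map each index 0..n-1 to its group representative."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def rank(group_of, keys):
    """Order groups by the smallest key among their members."""
    best = {}
    for g, k in zip(group_of, keys):
        best[g] = min(k, best.get(g, k))
    return {g: r for r, g in enumerate(sorted(best, key=best.get))}


def table_structure(blobs):
    """Return {(row_index, col_index): text} derived from pairwise relations."""
    rel = {p: relation(blobs[p[0]], blobs[p[1]]) for p in adjacent_pairs(blobs)}
    row_of = groups(len(blobs), [p for p, r in rel.items() if r == "row"])
    col_of = groups(len(blobs), [p for p, r in rel.items() if r == "col"])
    row_rank = rank(row_of, [b.y0 for b in blobs])
    col_rank = rank(col_of, [b.x0 for b in blobs])
    return {(row_rank[row_of[i]], col_rank[col_of[i]]): b.text
            for i, b in enumerate(blobs)}


# Usage: four blobs laid out as a 2x2 table.
blobs = [Blob("Name", 0, 0, 10, 5), Blob("Qty", 20, 0, 30, 5),
         Blob("Apple", 0, 10, 10, 15), Blob("3", 20, 10, 30, 15)]
grid = table_structure(blobs)
```

Restricting relation prediction to adjacent pairs, rather than all pairs, mirrors the narrowing of the search range described above: the number of pairs to classify grows roughly linearly with the number of blobs instead of quadratically. The sketch does not handle merged cells, which a learned predictor would need to cover for complex tables.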

Description

Technical Field

[0001] The invention relates to image recognition technology, in particular to a PDF table structure recognition method based on image recognition.

Background

[0002] In application scenarios of big data and artificial intelligence, it is necessary to collect, process, and analyze a large amount of information, structure the data, and discover the patterns in the data to guide production. Information exists in various unstructured forms. A large amount of information exists in tables, and tables may appear in PDF files, web pages, and images. For tables in PDF files, existing parsing methods generally fall into three types: parsing tables by reading the XML information of the PDF (for example, with the xpdf tool); converting the PDF to XML, HTML, Word, or other formats and then parsing (for example, with the pdf-docx tool); and converting the PDF into an image and then performing structure recognition. The first two methods cannot parse tables accurately due to the information loss of the pdf file i...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/62
CPC: G06V30/414, G06F18/241
Inventor: 杨红飞, 金霞, 韩瑞峰
Owner: 杭州费尔斯通科技有限公司