PDF document table extraction method, device and equipment and computer readable storage medium

A table extraction method and device, applied in the field of image processing, that address the narrow range of application and low accuracy of existing PDF table extraction methods.

Active Publication Date: 2019-10-29
PING AN TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0003] The main purpose of this application is to provide a PDF document table extraction method, device, equipment, and computer-readable storage medium, aiming to solve the technical problem that existing PDF document table extraction methods have a narrow range of application and low accuracy.

Examples

Embodiment Construction

[0053] It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

[0054] Figure 1 is a schematic structural diagram of the PDF document table extraction device in the hardware operating environment involved in the embodiments of the present application.

[0055] The PDF document table extraction device in the embodiments of the present application may be a terminal device with data processing capabilities, such as a portable computer or a server.

[0056] As shown in Figure 1, the PDF document table extraction device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 provides connection and communication between these components. The user interface 1003 may include a display...

Abstract

The invention relates to the technical field of artificial intelligence and discloses a PDF document table extraction method, device, equipment, and computer-readable storage medium. The method comprises: obtaining a PDF document to be recognized and processing it; preprocessing the processed PDF document, inputting it into a convolutional neural network to output a feature map, and inputting the feature map into a region proposal network (RPN) to determine a table region; preprocessing the table region and extracting features based on OCR character recognition technology to obtain a feature picture, performing text detection on the feature picture to determine text regions, and performing character recognition on the text regions to determine text information, where the text information comprises text position (coordinate) information and text content information; and determining the structure information of the table according to the text position information, dividing the table into cells based on the structure information, and filling each cell with the text corresponding to the text content information. The method and device improve the accuracy of PDF document table extraction.
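The final step of the abstract, deriving the table structure from text coordinates and filling the cells, can be illustrated with a minimal Python sketch. The excerpt does not say how the structure is computed, so the gap-based clustering below is an assumption rather than the patented method. The names TextBox, cluster_1d, nearest, and build_table, and the tolerance values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    """One recognized text region: content plus bounding-box coordinates (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def cx(self) -> float:
        return (self.x0 + self.x1) / 2

    @property
    def cy(self) -> float:
        return (self.y0 + self.y1) / 2


def cluster_1d(values: List[float], tol: float) -> List[float]:
    """Group sorted coordinates whose gap to the previous value is <= tol.

    Returns the mean of each group. This is an assumed heuristic, not taken from the patent.
    """
    centers: List[float] = []
    group: List[float] = []
    for v in sorted(values):
        if group and v - group[-1] > tol:
            centers.append(sum(group) / len(group))
            group = []
        group.append(v)
    if group:
        centers.append(sum(group) / len(group))
    return centers


def nearest(centers: List[float], v: float) -> int:
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def build_table(boxes: List[TextBox], row_tol: float = 10.0, col_tol: float = 30.0) -> List[List[str]]:
    """Infer rows and columns from box centers, then fill each cell with its text."""
    rows = cluster_1d([b.cy for b in boxes], row_tol)
    cols = cluster_1d([b.cx for b in boxes], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    # Visit boxes in reading order so that text sharing a cell is joined left to right.
    for b in sorted(boxes, key=lambda b: (b.cy, b.cx)):
        r, c = nearest(rows, b.cy), nearest(cols, b.cx)
        grid[r][c] = (grid[r][c] + " " + b.text).strip()
    return grid


sample = [
    TextBox("Name", 10, 10, 60, 30),
    TextBox("Amount", 110, 12, 180, 32),
    TextBox("Alice", 10, 50, 60, 70),
    TextBox("120", 120, 51, 160, 71),
]
print(build_table(sample))  # [['Name', 'Amount'], ['Alice', '120']]
```

Because rows and columns are inferred from text positions alone, a sketch like this does not depend on ruling lines being present, which is the limitation of the traditional approach that the application sets out to address. A cell with no detected text is left as an empty string.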

Description

Technical field

[0001] The present application relates to the technical field of image processing, and in particular to a PDF document table extraction method, device, equipment, and computer-readable storage medium.

Background

[0002] Existing methods for extracting tables from PDF files are mostly aimed at PDFs from which text can be extracted, and locate table regions by obtaining the structural information of the PDF. For image-type PDF files, table extraction can only be performed with traditional image processing methods: first extract the table frame, then extract the frame area according to the table frame, and finally perform OCR recognition on the image of the frame area to extract the table content. However, this approach is only effective for tables with ruling lines. If the table lines are incomplete, the table region may be located incompletely or cell content may be missed, resulting in low table extraction accuracy.

Contents of the invention

[0003] The ma...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/32, G06K9/62
CPC: G06V30/412, G06V10/25, G06V30/10, G06F18/241
Inventor: 刘克亮, 卢波
Owner: PING AN TECH (SHENZHEN) CO LTD