Method for parsing PDF table data and storage medium

A method for parsing PDF table data and a storage medium, applied in the field of data analysis. It addresses the difficulty of judging the correlation between data rows, the impracticality of purely character-based division, and the difficulty of matching data to titles. It improves accuracy and convenience, with a significant effect and a high degree of automation.

Active Publication Date: 2018-06-08
XIAMEN MEIYA PICO INFORMATION
9 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

Purely character-based division is impractical. For each table format, the distinguishing characteristics must be analyzed first, and then a corresponding script must be written to import the data into the database. The workload is enormous, so it is difficult to realize the automatic conversion of ...


Examples

Embodiment 1

[0077] This embodiment mainly provides a method for parsing PDF table data. It is suitable for parsing tables in PDF data and obtaining the corresponding table data, which facilitates subsequent editing operations. For example, in front-end data cleaning, a large share of the bills provided by customers are PDFs in table format. With this embodiment, a table-format PDF can be extracted into the corresponding CSV format and automatically imported into the database for analysis.
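The CSV export and database import mentioned above could look roughly like the following. This is a minimal sketch, not the patent's implementation: it assumes the table has already been parsed into rows of strings, uses Python's standard `csv` and `sqlite3` modules, and the function names `rows_to_csv` and `load_csv`, the table name `bills`, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def rows_to_csv(rows):
    """Serialize parsed table rows (lists of strings) to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def load_csv(conn, table, csv_text):
    """Create a table from the CSV header row and insert the data rows.

    Assumption: table and column names are trusted. They are only quoted,
    not validated, which is acceptable for a sketch but not for untrusted input.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


# Hypothetical parsed rows; the empty string stands for a blank cell.
rows = [
    ["Date", "Item", "Amount"],
    ["2018-01-05", "Service fee", "12.50"],
    ["2018-01-06", "", "3.00"],
]
csv_text = rows_to_csv(rows)
conn = sqlite3.connect(":memory:")
load_csv(conn, "bills", csv_text)
```

Keeping the blank cell as an empty field means every data row still has one value per title column, so the import does not shift values into the wrong column.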

[0078] As shown in Figures 1-4, there are several common existing PDF table forms. Specifically, Figure 1 corresponds to a single table; Figure 2 corresponds to random blank cells; Figure 3 corresponds to cells spread across pages; Figure 4 corresponds to multi-layer watermarks and other forms. Because existing PDF table parsers are mostly closed source and process this type of table data purely as characters, it is difficult to achieve the correspondence between d...

Embodiment 2

[0114] This embodiment corresponds to Embodiment 1 and provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, all the steps included in Embodiment 1 can be realized.

[0115] In summary, the method and storage medium for parsing PDF table data provided by the present invention achieve accurate, convenient and automatic parsing of PDF tables. They can accurately parse not only the data of a single table or multiple tables but also random blank cells, cells spread across pages, and cells with multi-layer watermarks, so the invention is highly practical and widely applicable. Furthermore, the present invention parses on the basis of character coordinates and line segment coordinates, unlike existing purely character-based processing. This not only makes parsing more accurate and convenient but also ensures the correspondence between data and titles; at...
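To illustrate why coordinates help preserve the correspondence between data and titles, here is a small sketch under stated assumptions. It assumes each cell is already known as a rectangle `(left, top, right, bottom)` mapped to its text, with y growing downward, and that the first row holds the titles. The helper names `rows_from_blocks` and `records` and the sample data are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def rows_from_blocks(blocks):
    """Group cell texts into rows by top edge, ordered left to right."""
    by_top = defaultdict(list)
    for (left, top, _right, _bottom), text in blocks.items():
        by_top[top].append((left, text))
    return [
        [text for _left, text in sorted(cells)]
        for _top, cells in sorted(by_top.items())
    ]


def records(rows):
    """Pair every data row with the titles in the first row."""
    titles, *data = rows
    return [dict(zip(titles, row)) for row in data]


# Hypothetical 2 x 2 table; the second data cell is blank.
blocks = {
    (0, 0, 10, 10): "Name",
    (10, 0, 20, 10): "Amount",
    (0, 10, 10, 20): "Li",
    (10, 10, 20, 20): "",
}
table_rows = rows_from_blocks(blocks)
table_records = records(table_rows)
```

Because the blank cell still exists as a rectangle, it keeps its column position, whereas a purely character-based pass would see no characters there and could not tell which title the gap belongs to.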


Abstract

The invention provides a method for parsing PDF table data and a storage medium. The method includes the following steps: the coordinates of each line segment and each character on each PDF page are obtained; cells are divided based on the intersection points of the line segments, and the rectangular coordinates corresponding to each cell are obtained; and the field block corresponding to each cell is obtained based on the inclusion relation between the coordinates of each character and the rectangular coordinates of each cell. Using the relation between the coordinates of the line segments and the characters, the cells and the characters within them are accurately divided, the PDF table and its data are accurately extracted, and accurate, convenient and automatic parsing of the PDF table is achieved.
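The three steps in the abstract can be sketched as follows. This is a simplified illustration rather than the patented implementation. It assumes that line segments and characters have already been extracted with coordinates, that segments are axis-aligned and already snapped to common values, that y grows downward, and that the table is a regular grid without merged cells. All names (`Segment`, `Char`, `intersections`, `cells_from_points`, `field_blocks`) and the sample data are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


class Char(NamedTuple):
    text: str
    x: float
    y: float


def intersections(segments, tol=0.5):
    """Collect points where a horizontal and a vertical segment cross."""
    horizontals = [s for s in segments if abs(s.y0 - s.y1) <= tol]
    verticals = [s for s in segments if abs(s.x0 - s.x1) <= tol]
    points = set()
    for h in horizontals:
        hx0, hx1 = sorted((h.x0, h.x1))
        for v in verticals:
            vy0, vy1 = sorted((v.y0, v.y1))
            if hx0 - tol <= v.x0 <= hx1 + tol and vy0 - tol <= h.y0 <= vy1 + tol:
                points.add((v.x0, h.y0))
    return points


def cells_from_points(points):
    """Build (left, top, right, bottom) rectangles between adjacent grid
    coordinates, keeping those whose four corners are all intersections."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            corners = {(left, top), (right, top), (left, bottom), (right, bottom)}
            if corners <= points:
                cells.append((left, top, right, bottom))
    return cells


def field_blocks(cells, chars):
    """Assign each character to the cell containing it, then join the
    characters of each cell in reading order (top to bottom, left to right)."""
    members = {cell: [] for cell in cells}
    for ch in chars:
        for cell in cells:
            left, top, right, bottom = cell
            if left <= ch.x < right and top <= ch.y < bottom:
                members[cell].append(ch)
                break
    return {
        cell: "".join(c.text for c in sorted(chs, key=lambda c: (c.y, c.x)))
        for cell, chs in members.items()
    }


# Hypothetical 2 x 2 grid: three horizontal and three vertical rules.
segments = [
    Segment(0, 0, 20, 0), Segment(0, 10, 20, 10), Segment(0, 20, 20, 20),
    Segment(0, 0, 0, 20), Segment(10, 0, 10, 20), Segment(20, 0, 20, 20),
]
chars = [Char("N", 2, 5), Char("o", 5, 5), Char("A", 12, 5), Char("7", 13, 15)]

cells = cells_from_points(intersections(segments))
blocks = field_blocks(cells, chars)
```

Merged cells, tables continuing across pages, and watermark characters would need extra handling that this sketch leaves out.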

Description

Technical field

[0001] The present invention relates to the field of data analysis, specifically to a method and a storage medium for parsing PDF table data.

Background technique

[0002] PDF parsing in the prior art generally targets text. The tables inside a PDF are only visual; there are no real table objects. Each cell is merely delimited by line segments, and the PDF format records only the location information of the text, line segments, pictures and similar elements.

[0003] Existing parsing only obtains the text, but table data must correspond strictly to the column of its title. PDF has particularities, such as tables that continue across consecutive pages, unpredictable line breaks within a single cell, and watermarks, so purely character-based division is impractical. For each table format, it is necessary to analyze the distinguishing characteristics first, and then write the corresponding...


Application Information

IPC(8): G06F17/21, G06F17/24
CPC: G06F40/117, G06F40/18, G06F40/103
Inventor: 蓝树和, 段涵瑞, 薛艳英, 江汉祥
Owner: XIAMEN MEIYA PICO INFORMATION