Method for parsing PDF table data and storage medium

A technology for parsing PDF table data, with a storage medium, applied in the field of data analysis. It addresses the difficulty of judging the correlation between data rows, the impracticality of purely character-based division, and the difficulty of matching data to titles. It improves accuracy and convenience and offers a high degree of automation.

Active Publication Date: 2018-06-08

XIAMEN MEIYA PICO INFORMATION

9 Cites, 14 Cited by

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Problems solved by technology

Dividing the content purely by characters is impractical. For each table format, the distinguishing characteristics must first be analyzed and a corresponding import script written. The workload is enormous, so it is difficult to automatically convert PDF table data and store the extracted results in a database.

[0004] Therefore, the PDF parsing tools currently on the market are mostly closed source, and they process this kind of table data purely as characters, which makes it difficult to match data with titles and to judge the correlation between data rows.

Examples

Embodiment 1

[0077] This embodiment provides a method for parsing PDF table data. It is suitable for parsing tables in PDF data, obtaining the corresponding table data, and facilitating subsequent editing operations. For example, during front-end data cleaning, a large share of the bills provided by customers are PDFs in table format. With this embodiment, a table-format PDF can be extracted into the corresponding CSV format and automatically imported into a database for analysis.
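The CSV and database step mentioned above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper names `rows_to_csv` and `import_csv` are hypothetical, the first row is assumed to hold the column titles, and an in-memory SQLite database stands in for whatever database is actually used.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def rows_to_csv(rows):
    """Serialize extracted table rows (lists of cell strings) to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def import_csv(conn, table, csv_text):
    """Create a table from the CSV header row and insert the data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # assumption: first row holds the column titles
    cols = ", ".join(quote_ident(name) for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({cols})")
    conn.executemany(f"INSERT INTO {quote_ident(table)} VALUES ({marks})", reader)
    conn.commit()


# Illustrative data only: a two-column bill with one blank cell.
rows = [["date", "amount"], ["2018-01-02", "35.00"], ["2018-01-03", ""]]
conn = sqlite3.connect(":memory:")
import_csv(conn, "bill", rows_to_csv(rows))
```

Blank cells are written as empty CSV fields, so every data row keeps the same number of columns as the title row.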

[0078] Figures 1-4 show several common PDF table forms. Specifically, Figure 1 shows a single table; Figure 2 shows random blank cells; Figure 3 shows cells spanning two pages; Figure 4 shows multi-layer watermarks and other forms. Because existing PDF table parsers are mostly closed source and process this kind of table data purely as characters, it is difficult to achieve the correspondence between d...

Embodiment 2

[0114] This embodiment corresponds to Embodiment 1 and provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, all the steps included in Embodiment 1 are carried out.

[0115] In summary, the method and storage medium for parsing PDF table data provided by the present invention achieve accurate, convenient, and automatic parsing of PDF tables. The method accurately parses the data of a single table or of multiple tables, and it also handles random blank cells, double-page cells, and multi-layer watermark cells, so it is highly practical and widely applicable. Furthermore, the present invention parses based on character coordinates and line segment coordinates, unlike existing purely character-based processing. This not only makes parsing more accurate and convenient, but also ensures the correspondence between data and titles; at...
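One way to picture how coordinate-based parsing keeps data aligned with titles, even when a cell is blank, is the sketch below. It is an assumption-laden illustration rather than the patent's procedure: the names `blocks_to_rows` and `rows_to_records` are hypothetical, cell rectangles are given as `(x0, y0, x1, y1)` with y increasing downward, and the first row is assumed to contain the titles.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def blocks_to_rows(blocks: Dict[Rect, str]) -> List[List[str]]:
    """Group cell texts into rows by top edge, ordered left to right."""
    by_top: Dict[float, List[Tuple[float, str]]] = {}
    for (x0, y0, _x1, _y1), text in blocks.items():
        by_top.setdefault(y0, []).append((x0, text))
    return [[text for _, text in sorted(by_top[y])] for y in sorted(by_top)]


def rows_to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Pair each data cell with the title in the same column position."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Illustrative data only: the "Amount" cell in the second row is blank.
blocks = {
    (0, 0, 10, 10): "Name",
    (10, 0, 20, 10): "Amount",
    (0, 10, 10, 20): "Rent",
    (10, 10, 20, 20): "",
}
records = rows_to_records(blocks_to_rows(blocks))
```

Because every cell rectangle is kept, a blank cell still occupies its column, so later values are not shifted under the wrong title.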

Abstract

The invention provides a method for parsing PDF table data and a storage medium. The method includes the following steps: the coordinates of each line segment and each character on each PDF page are obtained; cells are divided based on the intersection points of the line segments, and the rectangular coordinates corresponding to each cell are obtained; and the field block corresponding to each cell is obtained based on the containment relation between each character's coordinates and each cell's rectangular coordinates. Using the relation between the coordinates of the line segments and of the characters, the cells and the characters inside them are accurately divided, and the PDF table and its data are accurately extracted, achieving accurate, convenient, and automatic parsing of PDF tables.
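The three steps in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: segments and characters are assumed to be already extracted from the page, segments are axis-aligned with exactly matching coordinates, the table is a regular grid without merged cells, and y increases downward. The names `Char`, `intersections`, `cells_from_points`, and `field_blocks` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Segment = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned
Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]     # (left, top, right, bottom)


@dataclass
class Char:
    text: str
    x: float
    y: float


def intersections(segments: List[Segment], tol: float = 0.5) -> Set[Point]:
    """Return the points where horizontal and vertical segments meet."""
    horiz = [s for s in segments if abs(s[1] - s[3]) <= tol]
    vert = [s for s in segments if abs(s[0] - s[2]) <= tol]
    points = set()
    for hx0, hy, hx1, _ in horiz:
        lo_x, hi_x = sorted((hx0, hx1))
        for vx, vy0, _, vy1 in vert:
            lo_y, hi_y = sorted((vy0, vy1))
            if lo_x - tol <= vx <= hi_x + tol and lo_y - tol <= hy <= hi_y + tol:
                points.add((vx, hy))
    return points


def cells_from_points(points: Set[Point]) -> List[Rect]:
    """Treat each point as a top-left corner and find the nearest closing corner."""
    cells = []
    for x, y in sorted(points, key=lambda p: (p[1], p[0])):
        rights = sorted(px for px, py in points if py == y and px > x)
        belows = sorted(py for px, py in points if px == x and py > y)
        for x1 in rights:
            y1 = next((b for b in belows if (x1, b) in points), None)
            if y1 is not None:
                cells.append((x, y, x1, y1))
                break
    return cells


def field_blocks(cells: List[Rect], chars: List[Char]) -> Dict[Rect, str]:
    """Concatenate the characters whose coordinates fall inside each cell."""
    blocks = {cell: "" for cell in cells}
    for ch in chars:  # assumption: characters arrive in reading order
        for cell in cells:
            x0, y0, x1, y1 = cell
            if x0 <= ch.x <= x1 and y0 <= ch.y <= y1:
                blocks[cell] += ch.text
                break
    return blocks
```

A cell that contains no characters keeps an empty string as its field block, which is how a blank cell would be represented in this sketch.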

Description

Technical field

[0001] The present invention relates to the field of data analysis, specifically to a method and a storage medium for parsing PDF table data.

Background

[0002] PDF parsing in the prior art generally targets text. The tables in a PDF are only visual; there are no real table objects. Each cell is delimited only by line segments, and the PDF format records only the position information of text, line segments, pictures, and similar elements.

[0003] Existing parsing obtains only the text, but table data must correspond strictly to the column of its title. PDF has particular characteristics, such as tables that continue across consecutive pages, unpredictable line breaks within a single cell, and watermarks, so dividing the content purely by characters is impractical. For each table format, the distinguishing characteristics must first be analyzed, and then the corresponding...

Application Information

IPC(8): G06F17/21, G06F17/24

CPC: G06F40/117, G06F40/18, G06F40/103

Inventor: 蓝树和, 段涵瑞, 薛艳英, 江汉祥

Owner: XIAMEN MEIYA PICO INFORMATION
