Method and system for extracting table information from PDF documents

A table information and text information technology, applied in the fields of instruments, character and pattern recognition, computer components, etc., that addresses problems such as poor accuracy, lack of robustness, and the inability to intervene and repair mid-process, achieving accurate and efficient extraction.

Pending Publication Date: 2021-11-19

SOUTHWEST UNIV OF SCI & TECH

Problems solved by technology

[0006] To overcome the deficiencies of the prior art, the present invention provides a method and system for extracting table information from a PDF document. It addresses the poor accuracy and lack of robustness of prior-art extraction, and the fact that when an extraction error occurs midway, it cannot be quickly corrected by intervening at an intermediate step.

Examples

Embodiment 1

[0064] As shown in Figures 1 to 3, a method for extracting table information from a PDF document includes the following steps:

[0065] S1: capture an image of the table region in the PDF document, generate a new PDF document from it, and add a directly modifiable text layer to the new PDF document;

[0066] S2: analyze the table image in the new PDF document, identify the hidden internal frame lines in the table image, and draw lines to supplement them, obtaining a table image with complete frame lines;

[0067] S3: recognize the table image with complete frame lines, obtain the text information of the table while retaining the complete frame lines of the table image, and convert the text information and frame line information in the table image into a spreadsheet file.
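As a rough illustration of step S2, the sketch below finds hidden vertical frame lines with a column projection profile and draws them in. The text shown here does not say how the invention identifies hidden lines, so the projection heuristic, the binary list-of-lists image, and the names `blank_runs` and `supplement_column_lines` are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed encoding: 1 = ink (text or line), 0 = background


def blank_runs(profile: List[int], min_width: int) -> List[Tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, where the profile stays at 0."""
    runs, start = [], None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            if i - start >= min_width:
                runs.append((start, i))
            start = None
    return runs


def supplement_column_lines(img: Image, min_gap: int = 2) -> Tuple[Image, List[int]]:
    """Draw a vertical line through the middle of every interior blank column gap."""
    height, width = len(img), len(img[0])
    # Column projection: amount of ink in each pixel column.
    col_profile = [sum(img[y][x] for y in range(height)) for x in range(width)]
    out = [row[:] for row in img]  # copy so the input image is left untouched
    drawn = []
    for start, end in blank_runs(col_profile, min_gap):
        if start == 0 or end == width:
            continue  # a page margin, not a gap between two columns
        x = (start + end) // 2
        for y in range(height):
            out[y][x] = 1
        drawn.append(x)
    return out, drawn


# Toy table image: two text columns separated by three blank pixel columns.
table = [[1, 1, 1, 0, 0, 0, 1, 1, 1] for _ in range(3)]
completed, line_positions = supplement_column_lines(table)
print(line_positions)  # [4]
```

The same idea applied to a row projection would supplement hidden horizontal lines. A real table image would first need binarization and noise handling, which this sketch leaves out.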

[0068] Due to operations such as generating a new PDF document, adding a text layer that can be directly modified, identifying and supplementing hidden internal frame lin...

Embodiment 2

[0108] As shown in Figures 1 to 3, as a further optimization of Embodiment 1, this embodiment provides a system for extracting table information from a PDF document that is suited to the method.

[0109] A system for extracting table information from PDF documents, including the following modules:

[0110] New PDF document generation module: used to capture an image of the table region in the PDF document, generate a new PDF document, and add a directly modifiable text layer to the new PDF document;

[0111] Complete-frame table image acquisition module: used to analyze the table image in the new PDF document, identify the hidden internal frame lines in the table image, and draw lines to supplement them, obtaining a table image with complete frame lines;

[0112] Table information recognition module: used to recognize table images with complete frame lines, obtain the table text information while retaining the complete...
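One way to picture the three modules is as named stages that each keep their output, so that a mistake found midway can be repaired and processing resumed from that point rather than restarted. The sketch below is only an illustration of that idea: the `TablePipeline` class, the stage functions, and the toy data are hypothetical placeholders, not the system's actual interfaces.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


class TablePipeline:
    """Run named stages in order and keep every intermediate result."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages
        self.results: Dict[str, Any] = {}

    def run(self, data: Any, start_at: Optional[str] = None) -> Any:
        names = [name for name, _ in self.stages]
        first = names.index(start_at) if start_at is not None else 0
        for name, stage in self.stages[first:]:
            data = stage(data)
            self.results[name] = data  # saved so a later run can resume here
        return data


def generate_new_pdf(source: dict) -> dict:
    # Placeholder for the first module: keep the table region plus an editable text layer.
    return {"table": source["table_region"], "text_layer": list(source["words"])}


def complete_frame_lines(doc: dict) -> dict:
    # Placeholder for the second module: pretend one hidden column line was found.
    return {**doc, "column_breaks": [1]}


def recognize_table(doc: dict) -> List[str]:
    # Placeholder for the third module: split the text layer into cells at the breaks.
    words, cells, prev = doc["text_layer"], [], 0
    for stop in doc["column_breaks"] + [len(words)]:
        cells.append(" ".join(words[prev:stop]))
        prev = stop
    return cells


pipeline = TablePipeline([
    ("generate_new_pdf", generate_new_pdf),
    ("complete_frame_lines", complete_frame_lines),
    ("recognize_table", recognize_table),
])
source = {"table_region": "crop-of-page", "words": ["Major", "Min", "score"]}
first_pass = pipeline.run(source)

# Manual repair: fix the second module's saved output, then rerun only the last stage.
repaired = dict(pipeline.results["complete_frame_lines"], column_breaks=[1, 2])
second_pass = pipeline.run(repaired, start_at="recognize_table")
```

Keeping each stage's output addressable by name is what makes the repair step cheap here: only the stages after the corrected one are executed again.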

Embodiment 3

[0115] As shown in Figures 1 to 3, as a further optimization of Embodiment 1 and Embodiment 2, this embodiment includes all the technical features of Embodiment 1 and Embodiment 2. In addition, this embodiment includes the following detailed technical features:

[0116] Take as an example extracting the table of each college's admission scores over the years from a college entrance examination application guide:

[0117] The tables of most colleges' admission information over the years have the following characteristics: each college's majors and admission information (scores / number of students enrolled / actual number of students enrolled) constitute a closed information area. Within this closed information area, there are m*n columns of admission data, where m and n are both ≥ 2; each row has 2 sets of admission data, and each set of data contains a major and its corresponding admission information (score / number of students to be e...
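To make the "2 sets of admission data per row" layout concrete, the sketch below splits one recognized table row into one record per set. The field names in `FIELDS`, their order, and the sample values are assumptions invented for illustration; the paragraph above is truncated and does not fix the exact column layout.

```python
from typing import Dict, List

# Assumed column order within one set of admission data (hypothetical names).
FIELDS = ["major", "score", "planned", "actual"]


def unpack_row(cells: List[str], sets_per_row: int = 2) -> List[Dict[str, str]]:
    """Split a flat row of cells into one record per set of admission data."""
    size = len(FIELDS)
    if len(cells) != size * sets_per_row:
        raise ValueError(f"expected {size * sets_per_row} cells, got {len(cells)}")
    records = []
    for index in range(sets_per_row):
        group = cells[index * size:(index + 1) * size]
        if any(cell.strip() for cell in group):  # skip an empty trailing set
            records.append(dict(zip(FIELDS, group)))
    return records


# Made-up row for illustration only: two sets side by side in one table row.
row = ["Major A", "600", "30", "28", "Major B", "590", "20", "20"]
records = unpack_row(row)
```

Flattening each row this way yields one record per major, which is usually easier to load into a spreadsheet or database than the side-by-side layout.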

Abstract

The invention discloses a method and a system for extracting table information from PDF documents. The method comprises the following steps: S1, capturing an image of the table region in a PDF document, generating a new PDF document, and adding a directly modifiable text layer to the new PDF document; S2, analyzing the table image in the new PDF document, identifying hidden internal frame lines in the table image, and drawing lines to supplement the internal frame lines, obtaining a table image with complete frame lines; S3, recognizing the table image with complete frame lines, obtaining the table text information while retaining the complete frame lines of the table image, and converting the text information and frame line information in the table image into a spreadsheet file. The method and the system address two problems of the prior art: poor accuracy and lack of robustness when extracting from PDF documents or table images without complete frame lines, and the inability to intervene quickly and repair from an intermediate step when an extraction error occurs midway.
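Step S3 combines two kinds of information: where the frame lines are and where the text is. A minimal sketch of that combination follows, assuming the frame lines are already available as sorted coordinates and each recognized word comes with a center point. CSV stands in for the "spreadsheet file" here, and `words_to_grid` and `grid_to_csv` are hypothetical helper names, not part of the disclosed method.

```python
import csv
import io
from bisect import bisect_right
from typing import List, Tuple

Word = Tuple[float, float, str]  # assumed: (x, y) center of a recognized word, and its text


def words_to_grid(words: List[Word], x_lines: List[float], y_lines: List[float]) -> List[List[str]]:
    """Place each word in the cell bounded by the surrounding frame lines.

    x_lines and y_lines are the sorted positions of all vertical and horizontal
    frame lines, including the outer border.
    """
    n_rows, n_cols = len(y_lines) - 1, len(x_lines) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    # Reading order: top to bottom, then left to right.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        col = bisect_right(x_lines, x) - 1
        row = bisect_right(y_lines, y) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


def grid_to_csv(grid: List[List[str]]) -> str:
    """Serialize the cell grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


# Toy 2 x 2 table: frame lines at x = 0, 50, 100 and y = 0, 20, 40.
words = [(10, 10, "Major"), (60, 10, "Score"), (10, 30, "Physics"), (60, 30, "600"), (80, 30, "pts")]
grid = words_to_grid(words, x_lines=[0, 50, 100], y_lines=[0, 20, 40])
csv_text = grid_to_csv(grid)
```

Because cell membership is decided purely by the frame lines, any line that was missing in the original image has to be supplemented first, which is the role of step S2.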

Description

Technical field

[0001] The invention relates to the technical field of office document information processing, in particular to a method and system for extracting table information from PDF documents.

Background

[0002] People use many files, tables, and documents in their daily office work, and the importance of tables is beyond doubt. In desktop office scenarios across industries, Excel and WPS are the de facto standards for spreadsheets. A common need is to import the content of a table image into Excel. In the past, the content could only be entered manually into the Excel file by reading the image, which is inefficient and error-prone.

[0003] In recent years, with the help of deep learning, the usability of OCR (Optical Character Recognition) technology has continuously improved. Most people directly use OCR software to automatically extract text information from images. However, for the s...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06K9/00, G06K9/34

Inventor: 杨春明, 谢明旭, 张晖

Owner: SOUTHWEST UNIV OF SCI & TECH
