
Method and system for extracting table information from PDF documents

A table information and text information technology, applied in the fields of instruments, character and pattern recognition, and computer components. It addresses the lack of robustness, poor accuracy, and difficulty of intervention and repair in existing approaches, achieving high accuracy and high efficiency of extraction.

Status: Pending
Publication Date: 2021-11-19
SOUTHWEST UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0006] To overcome the deficiencies of the prior art, the present invention provides a method and system for extracting table information from a PDF document. It addresses the poor accuracy and lack of robustness of prior-art extraction, and the fact that when an extraction error occurs midway, the process cannot be quickly intervened in and repaired from an intermediate step.



Examples


Embodiment 1

[0064] As shown in Figures 1 to 3, a method for extracting table information from a PDF document includes the following steps:

[0065] S1: crop the image of the table region from the PDF document, generate a new PDF document from it, and add a directly editable text layer to the new PDF document;

[0066] S2: analyze the table picture in the new PDF document, identify the hidden internal frame lines in the table picture, and draw lines to supplement them, obtaining a table picture with complete frame lines;

[0067] S3: recognize the table picture with complete frame lines, obtain the table's text information while retaining the complete frame lines, and convert the text information and frame line information in the table picture into a spreadsheet file.
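
Step S2 can be illustrated with a minimal sketch. The text available here does not say how the hidden internal frame lines are identified, so the projection-profile approach below is an assumption swapped in for illustration, not the method claimed by the invention. It treats the table picture as a binary grid (1 = ink, 0 = background), looks for fully blank bands between blocks of text, and draws a line through the middle of each band. The names `blank_runs`, `infer_hidden_lines`, `draw_lines`, and the `min_gap` parameter are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def blank_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of maximal runs where profile is 0."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def infer_hidden_lines(image: Image, min_gap: int = 2) -> Tuple[List[int], List[int]]:
    """Guess the row and column positions of missing interior frame lines.

    Assumption: a blank band at least min_gap pixels wide that lies strictly
    inside the picture separates two cells, and the line belongs at its middle.
    """
    height, width = len(image), len(image[0])
    row_profile = [sum(row) for row in image]
    col_profile = [sum(image[y][x] for y in range(height)) for x in range(width)]

    def interior_midpoints(profile: List[int]) -> List[int]:
        last = len(profile) - 1
        return [
            (a + b) // 2
            for a, b in blank_runs(profile)
            if a > 0 and b < last and (b - a + 1) >= min_gap
        ]

    return interior_midpoints(row_profile), interior_midpoints(col_profile)


def draw_lines(image: Image, rows: List[int], cols: List[int]) -> Image:
    """Return a copy of the image with full-width and full-height lines drawn."""
    out = [row[:] for row in image]
    for y in rows:
        out[y] = [1] * len(out[y])
    for x in cols:
        for row in out:
            row[x] = 1
    return out
```

A real table picture would first need binarization and handling of frame lines that already exist; this sketch only shows the idea of inferring and drawing the lines that are missing.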

[0068] Due to operations such as generating a new PDF document, adding a text layer that can be directly modified, identifying and supplementing hidden internal frame lin...

Embodiment 2

[0108] As shown in Figures 1 to 3, as a further optimization of Embodiment 1, this embodiment provides a system for extracting table information from a PDF document that implements the method.

[0109] A system for extracting table information from PDF documents, including the following modules:

[0110] New PDF document generation module: crops the image of the table region from the PDF document, generates a new PDF document, and adds a directly editable text layer to the new PDF document;

[0111] Complete-frame table picture module: analyzes the table picture in the new PDF document, identifies the hidden internal frame lines in the table picture, and draws lines to supplement them, obtaining a table picture with complete frame lines;

[0112] Table information recognition module: recognizes table pictures with complete frame lines, obtaining the table's text information while retaining the complete...
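
The three modules above form a linear pipeline, and the stated goal of intervening at an intermediate step suggests keeping each module's output around. The following is a minimal sketch of that idea under stated assumptions: the class `TablePipeline`, its `artifacts` store, and the `start_at` parameter are hypothetical names, and the stage functions are stand-ins rather than the actual modules.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]  # (stage name, stage function)


@dataclass
class TablePipeline:
    """Runs named stages in order and keeps every intermediate result."""

    stages: List[Stage]
    artifacts: Dict[str, Any] = field(default_factory=dict)

    def run(self, source: Any = None, start_at: int = 0) -> Any:
        """Run stages from index start_at onward.

        When start_at > 0, the input is the stored (possibly hand-corrected)
        artifact of the preceding stage instead of the original source.
        """
        if start_at == 0:
            current = source
        else:
            previous_name = self.stages[start_at - 1][0]
            current = self.artifacts[previous_name]
        for name, stage in self.stages[start_at:]:
            current = stage(current)
            self.artifacts[name] = current
        return current


# Stand-in stages; each just records that it ran.
pipeline = TablePipeline(
    stages=[
        ("new_pdf", lambda doc: doc + ["cropped table, text layer added"]),
        ("framed_picture", lambda doc: doc + ["internal frame lines drawn"]),
        ("spreadsheet", lambda doc: doc + ["cells recognized"]),
    ]
)
result = pipeline.run(["page"])

# A user corrects the output of the first stage, then resumes from stage 2.
pipeline.artifacts["new_pdf"] = ["page", "text layer corrected by hand"]
repaired = pipeline.run(start_at=1)
```

Storing artifacts by stage name is one simple design choice; it lets a correction to the text layer or the drawn frame lines flow through the remaining stages without redoing the earlier ones.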

Embodiment 3

[0115] As shown in Figures 1 to 3, as a further optimization of Embodiment 1 and Embodiment 2, this embodiment includes all the technical features of Embodiment 1 and Embodiment 2. In addition, this embodiment includes the following detailed technical features:

[0116] Take as an example extracting each college's admission score table over the years from a college entrance examination application guide:

[0117] The tables of most colleges' and universities' admission information over the years share the following characteristics: each college's majors and admission information (score / number of students to be enrolled / actual number of students enrolled) constitute a closed information area. Within this closed information area there are m*n columns of admission data, where m and n are both ≥ 2; each row has 2 sets of admission data, and each set of data contains a major and its corresponding admission information (score / number of students to be e...
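
The row layout described above, with 2 sets of admission data per row, can be flattened into one record per major. The sketch below assumes a cell layout the source does not spell out: each set occupies two adjacent cells, the major name followed by a single "score / planned / actual" string. The function name `split_row`, the field names, and the sample values are hypothetical.

```python
from typing import Dict, List


def split_row(cells: List[str], sets_per_row: int = 2) -> List[Dict[str, str]]:
    """Split one table row into one record per (major, admission info) set.

    Assumed layout: [major, "score / planned / actual"] repeated sets_per_row
    times. A set whose major cell is empty (a short last row) is skipped.
    """
    if len(cells) != 2 * sets_per_row:
        raise ValueError(f"expected {2 * sets_per_row} cells, got {len(cells)}")
    records: List[Dict[str, str]] = []
    for i in range(sets_per_row):
        major = cells[2 * i].strip()
        info = cells[2 * i + 1]
        if not major:
            continue
        parts = [part.strip() for part in info.split("/")]
        if len(parts) != 3:
            raise ValueError(f"unexpected admission info: {info!r}")
        score, planned, actual = parts
        records.append(
            {"major": major, "score": score, "planned": planned, "actual": actual}
        )
    return records


# Made-up sample row, for illustration only.
records = split_row(["Physics", "612 / 30 / 28", "Chemistry", "598/25/25"])
```

Values are kept as strings so that a misrecognized digit from OCR surfaces in the output for manual repair instead of raising during conversion.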



Abstract

The invention discloses a method and a system for extracting table information from PDF documents. The method comprises the following steps: S1, cropping an image of the table region in a PDF document, generating a new PDF document, and adding a directly editable text layer to the new PDF document; S2, analyzing the table picture in the new PDF document, identifying hidden internal frame lines in the table picture, and drawing lines to supplement the internal frame lines to obtain a table picture with complete frame lines; S3, recognizing the table picture with complete frame lines, obtaining the table's text information while retaining the complete frame lines of the table picture, and converting the text information and frame line information in the table picture into a spreadsheet file. The method and the system solve the prior-art problems of poor accuracy and lack of robustness when extracting from PDF documents or table pictures without complete frame lines, and the defect that when an extraction error occurs midway, rapid intervention and repair cannot be carried out from an intermediate step.

Description

Technical field

[0001] The invention relates to the technical field of office document information processing, in particular to a method and system for extracting table information from PDF documents.

Background art

[0002] Most people work with many files, tables, and documents in their daily office work, and the importance of tables is beyond doubt. In desktop office scenarios across industries, Excel and WPS are the de facto standards for spreadsheets. A common need is to import the content of a table picture into Excel. In the past, the content could only be entered into the Excel file by hand from the picture, which is inefficient and error-prone.

[0003] In recent years, with the help of deep learning, the usability of OCR (Optical Character Recognition) technology has continuously improved. Most people directly use OCR software to automatically extract text information from pictures. However, for the s...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/34
Inventors: 杨春明, 谢明旭, 张晖
Owner: SOUTHWEST UNIV OF SCI & TECH