Method and system for extracting table information from PDF documents
A table-information and text-information technology, applied in the fields of instruments, character and pattern recognition, and computer components. It addresses problems such as lack of robustness, poor accuracy, and the need for intervention and repair, and achieves high-accuracy, high-efficiency extraction.
Examples
Embodiment 1
[0064] As shown in Figures 1 to 3, a method for extracting table information from a PDF document includes the following steps:
[0065] S1: capture an image of the table portion of the PDF document, generate a new PDF document from it, and add a directly editable text layer to the new PDF document;
[0066] S2: analyze the table image in the new PDF document, identify the hidden internal frame lines in the table image, draw lines to supply those internal frame lines, and obtain a table image with complete frame lines;
[0067] S3: recognize the table image with complete frame lines, obtain the table's text information while retaining the complete frame lines of the table image, and convert the text information and frame-line information in the table image into a spreadsheet file.
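The text available here does not state how the hidden internal frame lines in S2 are located. The following is only a minimal sketch of one possible approach, assuming the table image has already been binarized and cropped to the region inside the outer frame: blank bands between blocks of ink are treated as hidden separators, and a line is drawn through the middle of each band. The names `blank_bands` and `complete_frame_lines` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = dark pixel (text or line), 0 = background


def blank_bands(profile: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) ranges of interior zero runs in a profile.

    Runs touching the first or last index are treated as margins and skipped.
    """
    bands = []
    start = None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            bands.append((start, i - 1))
            start = None
    # A run still open at the end is a trailing margin and is never appended;
    # a run starting at index 0 is a leading margin and is filtered out here.
    return [band for band in bands if band[0] > 0]


def complete_frame_lines(img: Image) -> Image:
    """Draw one line through the middle of every blank row/column band.

    Assumption (not from the source): cells are separated by fully blank
    bands, so a projection profile is enough to find the hidden lines.
    """
    row_profile = [sum(row) for row in img]
    col_profile = [sum(col) for col in zip(*img)]
    out = [row[:] for row in img]  # leave the input image untouched

    for start, end in blank_bands(row_profile):
        y = (start + end) // 2
        out[y] = [1] * len(out[y])  # horizontal frame line
    for start, end in blank_bands(col_profile):
        x = (start + end) // 2
        for row in out:
            row[x] = 1  # vertical frame line
    return out
```

A real scanned page would need noise tolerance (for example a small threshold instead of an exact zero) and handling for cells that span several rows or columns; the sketch ignores both.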
[0068] Due to operations such as generating a new PDF document, adding a text layer that can be directly modified, identifying and supplementing hidden internal frame lin...
Embodiment 2
[0108] As shown in Figures 1 to 3, as a further optimization of Embodiment 1, this embodiment provides a system for extracting table information from a PDF document that is suited to the method.
[0109] A system for extracting table information from PDF documents includes the following modules:
[0110] New PDF document generation module: captures an image of the table portion of the PDF document, generates a new PDF document, and adds a directly editable text layer to the new PDF document;
[0111] Complete-frame table image acquisition module: analyzes the table image in the new PDF document, identifies the hidden internal frame lines in the table image, draws lines to supply those internal frame lines, and obtains a table image with complete frame lines;
[0112] Table information recognition module: recognizes table images with complete frame lines and obtains the table's text information while retaining the complete...
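As a rough illustration of how the three modules could be chained, the sketch below wires them together as interchangeable callables. The class name `TablePipeline`, its field names, and the choice to pass file paths between stages are all assumptions made for this example; the source does not describe the module interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

Table = List[List[str]]  # recognized cell text, row by row


@dataclass
class TablePipeline:
    """Hypothetical wiring of the three modules; each stage is pluggable."""

    generate_new_pdf: Callable[[str], str]   # source PDF path -> new PDF path
    complete_frame: Callable[[str], str]     # new PDF path -> framed image path
    recognize_table: Callable[[str], Table]  # framed image path -> cell text

    def run(self, source_pdf: str) -> Table:
        # Stage 1: capture the table region and build the new PDF.
        new_pdf = self.generate_new_pdf(source_pdf)
        # Stage 2: supply the hidden internal frame lines.
        framed = self.complete_frame(new_pdf)
        # Stage 3: recognize text while keeping the frame structure.
        return self.recognize_table(framed)
```

Keeping each stage behind a plain callable makes it possible to replace one module (for example the recognition step) without touching the other two.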
Embodiment 3
[0115] As shown in Figures 1 to 3, as a further optimization of Embodiment 1 and Embodiment 2, this embodiment includes all the technical features of Embodiment 1 and Embodiment 2. In addition, it includes the following detailed technical features:
[0116] Take as an example the extraction of each college's historical admission-score table information from a college entrance examination application guide:
[0117] The historical admission tables of most colleges and universities share the following characteristics: each college's majors and admission information (score / planned enrollment / actual enrollment) form a closed information area. Within this closed information area there are m*n columns of admission data, where m and n are both ≥ 2; each row has 2 sets of admission data, and each set contains a major and its corresponding admission information (score / number of students to be e...
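To make the "2 sets of admission data per row" layout concrete, here is a small sketch that splits one recognized row into two records. It assumes, purely for illustration, that each set occupies two cells: the major name, followed by a single cell holding "score / planned / actual" separated by slashes. The function name `split_row`, the record keys, and the sample values in the usage line are hypothetical and are not taken from the patent.

```python
from typing import Dict, List


def split_row(cells: List[str]) -> List[Dict[str, str]]:
    """Split a row of [major, info, major, info] into two records.

    Assumption (not from the source): info is 'score / planned / actual'.
    """
    if len(cells) != 4:
        raise ValueError("expected 4 cells: two (major, info) pairs")

    records = []
    for i in (0, 2):
        major = cells[i].strip()
        parts = [part.strip() for part in cells[i + 1].split("/")]
        if len(parts) != 3:
            raise ValueError(f"expected 'score / planned / actual', got {cells[i + 1]!r}")
        score, planned, actual = parts
        records.append(
            {"major": major, "score": score, "planned": planned, "actual": actual}
        )
    return records


# Usage with made-up sample values:
rows = split_row(["Physics", "612 / 30 / 29", "History", "598/25/25"])
```

Values are kept as strings so that cells the recognition step could not read cleanly are not silently coerced into numbers.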



