
Method for directly obtaining table content in PDF through browser

A browser-based table technology applied to PDF table content extraction. It addresses missed tables, cumbersome operation, and high resource consumption, and achieves well-targeted extraction, good parsing results, and low server resource usage.

Pending Publication Date: 2019-08-16
鼎复数据科技(北京)有限公司

AI Technical Summary

Problems solved by technology

[0003] At present, most documents circulated on the Internet are in PDF format. Extracting content from a PDF document, especially table data, requires copying and pasting the table content and then rebuilding the table format by hand. When the table is large, this is very cumbersome.
[0004] PDF parsing services in the prior art consume a large amount of server resources to extract and parse PDF tables. Even so, parsing sometimes misses tables and sometimes produces garbled characters, so the table parsing results are poor.
[0005] Existing PDF parsing services do not meet users' needs well: it is difficult to quickly extract the required table content from PDF files. Data collectors in particular, who must extract large amounts of PDF table content, cannot do so quickly and accurately.

Examples

Embodiment 1

[0094] The PDF file contains the mid-term exam results of Class 1 of a certain grade at a certain school. The results are laid out as a borderless table, as shown in Figure 2. The goal is to extract the content of this table from the PDF.

[0095] The PDF file is uploaded to the website, and the browser's rendering engine renders it as two layers: an html view layer and a canvas view layer. The html view layer holds the text and number content together with its coordinate information; the canvas view layer holds the background color and frame line information.

[0096] The browser monitors the table area selected with the mouse. Canvas technology in the browser scans the borderless table (shown in Figure 2), collects the pixel values and position information of the table, and determines the positions of the frame line intersections.

[0097] The Y-axis coordinates of the table's horizontal lines obtained from the canvas view layer are: 107,131,151,170,190,209,228,248,267,287,306,325,345,364,38...
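
Once the horizontal line positions are known, each text item from the html view layer can be assigned to a table row by comparing its Y coordinate with those boundaries. The following is a minimal sketch of that idea in TypeScript, not the patent's implementation. The `TextItem` shape and the function names `rowIndexFor` and `groupIntoRows` are hypothetical, and the sketch assumes text coordinates and line positions share one coordinate space.

```typescript
// Hypothetical shape for a text fragment taken from the html view layer.
interface TextItem {
  text: string;
  x: number; // left edge, in the same coordinate space as the line positions
  y: number; // vertical centre of the text
}

// Return the index of the row whose top and bottom lines enclose y,
// or -1 when y lies outside the table. `lines` must be sorted ascending.
function rowIndexFor(y: number, lines: number[]): number {
  if (lines.length < 2 || y < lines[0] || y >= lines[lines.length - 1]) {
    return -1;
  }
  let lo = 0;
  let hi = lines.length - 1;
  // Binary search for the last line position that is <= y.
  while (hi - lo > 1) {
    const mid = (lo + hi) >> 1;
    if (lines[mid] <= y) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// Group items into rows and order each row from left to right.
function groupIntoRows(items: TextItem[], lines: number[]): string[][] {
  const rows: TextItem[][] = [];
  for (let i = 0; i < lines.length - 1; i++) {
    rows.push([]);
  }
  for (const item of items) {
    const r = rowIndexFor(item.y, lines);
    if (r >= 0) {
      rows[r].push(item);
    }
  }
  return rows.map((row) => row.sort((a, b) => a.x - b.x).map((it) => it.text));
}
```

Using only the first four boundaries listed above (107, 131, 151, 170), an item at y = 120 falls in row 0 and an item at y = 140 falls in row 1. Columns could be handled the same way with the vertical line positions.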

Embodiment 2

[0122] The table in the PDF file holds one person's personal information and has incomplete borders, as shown in Figure 4. The content of the table is extracted as follows.

[0123] The PDF file is uploaded to the website, and the browser's rendering engine renders it as an html view layer and a canvas view layer.

[0124] The browser monitors the table area selected with the mouse, and the user moves the mouse to complete the table's border. The browser tracks the mouse movement, and Canvas technology draws frame lines along the mouse path until the missing borders are filled in, as shown in Figure 5.

[0125] Canvas technology in the browser scans the bordered table to obtain the position information of the table in the selected area. The positions of the vertical lines on the X axis are: [0,0], [171,0], [576,0]; the position information of the horizontal line on the ...
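
The excerpt does not describe the scanning algorithm itself. As one plausible illustration, the sketch below finds vertical frame lines in an RGBA pixel buffer by looking for columns that are dark along almost their full height. The function names, the luminance threshold of 128, and the 90% coverage rule are all assumptions made for this example, not details from the patent.

```typescript
// A pixel counts as "ink" when it is not transparent and its luminance
// is below `darkThreshold` (an assumed cut-off, not from the source).
function isDark(
  data: Uint8ClampedArray,
  offset: number,
  darkThreshold: number
): boolean {
  const r = data[offset];
  const g = data[offset + 1];
  const b = data[offset + 2];
  const a = data[offset + 3];
  if (a === 0) {
    return false; // fully transparent pixels are treated as background
  }
  const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
  return luminance < darkThreshold;
}

// Return the X positions of columns that look like vertical frame lines.
// `data` is row-major RGBA, 4 bytes per pixel, `width` x `height` pixels.
function findVerticalLines(
  data: Uint8ClampedArray,
  width: number,
  height: number,
  darkThreshold: number = 128,
  minCoverage: number = 0.9
): number[] {
  const positions: number[] = [];
  let previousWasLine = false;
  for (let x = 0; x < width; x++) {
    let dark = 0;
    for (let y = 0; y < height; y++) {
      if (isDark(data, (y * width + x) * 4, darkThreshold)) {
        dark++;
      }
    }
    const isLine = height > 0 && dark / height >= minCoverage;
    // Adjacent dark columns form one thick line; keep only its first column.
    if (isLine && !previousWasLine) {
      positions.push(x);
    }
    previousWasLine = isLine;
  }
  return positions;
}
```

In a browser, the buffer for the selected area can be read with `CanvasRenderingContext2D.getImageData(sx, sy, sw, sh)`, whose result exposes `data`, `width`, and `height`. The returned positions are relative to the left edge of that area. Swapping the roles of rows and columns gives the horizontal lines.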

Abstract

The invention discloses a method for directly obtaining table content in a PDF through a browser. The method comprises the following steps: 1) uploading the PDF file to be parsed to a website, which displays the original PDF on the left and the corresponding parsing result on the right, and selecting the table to be parsed in the PDF area with the mouse; 2) scanning and collecting all pixel values and position information of the selected area using Canvas technology in the browser, and determining the position information of the table's frame lines from the positions of the table's intersection points; 3) converting the obtained frame line positions into coordinate information in the browser, thereby determining the coordinates of the frame lines; 4) using the frame line coordinates to accurately extract the text and number content, together with its coordinate information, in the selected area; and 5) extracting the table content. The method does not depend on a back-end service: it completes the table parsing with the computing power of the browser, lets the user select the target area to parse as required, obtains the required table quickly, and is well targeted.
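
The abstract does not spell out the conversion in step 3. One common reason such a conversion is needed is that a canvas's backing-store size in pixels can differ from the size at which it is displayed, so positions measured in canvas pixels must be scaled and offset before they can be compared with the coordinates of text in the html layer. The sketch below shows that mapping under this assumption; the `CanvasGeometry` and `Point` types and the `canvasToBrowser` function are hypothetical names introduced for illustration.

```typescript
interface Point {
  x: number;
  y: number;
}

// Hypothetical description of how a canvas is laid out on the page.
interface CanvasGeometry {
  pixelWidth: number; // backing-store width (canvas.width)
  pixelHeight: number; // backing-store height (canvas.height)
  cssWidth: number; // displayed width, e.g. from getBoundingClientRect()
  cssHeight: number; // displayed height
  cssLeft: number; // left edge of the canvas in browser coordinates
  cssTop: number; // top edge of the canvas in browser coordinates
}

// Map a point measured in canvas pixels, relative to the top-left corner
// of the selected region, to browser (CSS pixel) coordinates.
function canvasToBrowser(
  p: Point,
  regionOrigin: Point,
  g: CanvasGeometry
): Point {
  const scaleX = g.cssWidth / g.pixelWidth;
  const scaleY = g.cssHeight / g.pixelHeight;
  return {
    x: g.cssLeft + (regionOrigin.x + p.x) * scaleX,
    y: g.cssTop + (regionOrigin.y + p.y) * scaleY,
  };
}
```

For example, with a 1200 x 1600 pixel canvas displayed at 600 x 800 CSS pixels, every canvas-pixel distance is halved before the canvas's own offset on the page is added.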

Description

Technical field

[0001] The invention relates to a method for extracting the content of a PDF table, in particular to a method for directly obtaining the content of a PDF table through a browser.

Background

[0002] PDF stands for Portable Document Format. PDF files reproduce the original appearance of a document faithfully and do not display differently depending on the software or system used; what appears on screen and in print is exactly what the author intended. For this reason, more and more enterprises use PDF files to exchange documents.

[0003] At present, most documents circulated on the Internet are in PDF format. Extracting content from a PDF document, especially table data, requires copying and pasting the table content and then rebuilding the table format by hand. When the table is large, it is very cumbersome...

Application Information

IPC(8): G06F17/22
CPC: G06F40/151
Inventor: 淡强强, 徐福海, 闫鹏哲, 吴雪军
Owner: 鼎复数据科技(北京)有限公司