Method and device for extracting structured information of PDF (Portable Document Format) file

A technique for extracting structured information from PDF files, applied to unstructured text data retrieval, text database querying, and text database indexing. It addresses the heavy manual effort of existing approaches, their inability to process files in batches, and their lack of further processing of structured information, with the aim of saving manpower and material resources, reducing human resource costs, and expanding product market share.

Pending Publication Date: 2022-05-17
CHENGDU SEFON SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a method and device for extracting structured information from PDF files, in order to solve the problems that existing methods consume large amounts of manpower and time, cannot process files in batches, and do not further process the structured information.



Examples


Embodiment 1

[0051] As shown in Figure 1, a method for extracting structured information from PDF files reduces the labor cost of extracting and storing structured information from an enterprise's unstructured PDF documents, improves the enterprise's ability to govern unstructured data, and taps the value contained in PDF documents. It also improves the market competitiveness of data governance products. The method includes the following steps:

[0052] S1. Screen for editable PDF documents and read each editable PDF document;

[0053] S2. Extract the text content of the editable PDF document from step S1, then split the text content to obtain a string group;

[0054] S3. Form a discriminant index from the string group obtained in step S2;

[0055] S4. Extract the structured information according to the discriminant index from step S3;

[0056] S5. Convert the format of the structured information from step S4 and write it into the database...
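
Step S5 above is truncated in this excerpt, so the target format and database are not known. As a minimal sketch of one way the "convert and write" step could look, the Java class below turns a map of extracted fields into a parameterized SQL INSERT statement plus its ordered parameter values. The class name InsertStatementBuilder, the table name pdf_info, and the field names title and authors are all hypothetical and do not come from the patent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: illustrates S5 only, not the patent's actual conversion.
class InsertStatementBuilder {

    // Builds "INSERT INTO <table> (c1, c2) VALUES (?, ?)" from the field names.
    // Column names must come from trusted code, never from the PDF content.
    static String buildSql(String table, Map<String, String> info) {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String col : info.keySet()) {
            cols.add(col);
            marks.add("?");
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Returns the field values in the same order as the columns in buildSql.
    static List<String> parameters(Map<String, String> info) {
        return new ArrayList<>(info.values());
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so columns and values line up.
        Map<String, String> info = new LinkedHashMap<>();
        info.put("title", "Example title");
        info.put("authors", "A. Author, B. Author");
        System.out.println(buildSql("pdf_info", info));
        System.out.println(parameters(info));
    }
}
```

The statement and values could then be handed to a java.sql.PreparedStatement, binding each value with setString, so that text taken from a PDF is never concatenated into the SQL itself.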

Embodiment 2

[0059] As shown in Figure 1, the PDF document structured information extraction method based on the PDFBox tool proceeds as follows:

[0060] Step 1, PDF document screening: select the editable PDF documents, according to whether each PDF document is editable.

[0061] Step 2, read an editable PDF document: call PDFBox's PDDocument.load() function to generate the PDDocument object document.

[0062] Step 3, extract the text content of the home page: create the PDFTextStripper object stripper with PDFBox and obtain the text content result through stripper.getText(document).

[0063] Step 4, split the text content to form a string group: the text content result obtained in step 3 is a single string; splitting it on the newline character "\n" yields the string array splitRes.

[0064] Step 5, traverse the strings and add prefixes to form a discriminant index for judging position: traverse each element of the string data split...
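
Steps 4 and 5 above can be sketched in plain Java. The input string stands in for the result returned by stripper.getText(document) in step 3, so the sketch runs without PDFBox. Step 5 is truncated in this excerpt and the actual prefix format is not given; the position-plus-"#" prefix used here, along with the class name StringGroupIndexer and its method names, is an assumption made purely for illustration. Two side notes: PDDocument.load() is the PDFBox 2.x entry point (PDFBox 3.x moved loading to Loader.loadPDF), and limiting the stripper to the first page with setStartPage(1) and setEndPage(1) is one way to get only the home page, although the source does not say how it does so.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the prefix format is assumed, not taken from the patent.
class StringGroupIndexer {

    // Step 4: split the page text on the newline character "\n".
    // String.split drops trailing empty strings, so a final "\n" adds no element.
    static String[] split(String result) {
        return result.split("\n");
    }

    // Step 5 (assumed form): prefix each element with its zero-based position,
    // so later rules can decide what a line is from where it sits on the page.
    static List<String> addPrefix(String[] splitRes) {
        List<String> indexed = new ArrayList<>();
        for (int i = 0; i < splitRes.length; i++) {
            // trim() also removes a stray "\r" left by Windows line endings.
            indexed.add(i + "#" + splitRes[i].trim());
        }
        return indexed;
    }

    public static void main(String[] args) {
        String result = "Title line\nAuthor line\nAbstract: text\n";
        for (String entry : addPrefix(split(result))) {
            System.out.println(entry);
        }
    }
}
```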

Embodiment 3

[0076] a. Editable PDF document home page

[0077] Experimental analysis on the protective performance of basalt fiber cloth under high-speed impact

[0078] Ha Yue, Pang Baojun, Chi Runqiang, He Maojian, Guan Gongshun, Zhang Wei

[0079] (School of Astronautics, Harbin Institute of Technology, Harbin, Heilongjiang, 150080)

[0080] Abstract: In the field of space debris protection, the use of high-tech fibers as protective materials is one of the trends in the development of protective structures. Basalt fiber is a new high-tech fiber in recent years, with high strength and elastic modulus. In this paper, the protective performance of basalt fiber fabric against spherical projectile high-speed impact is experimentally studied by high-speed impact test. The experimental analysis shows that the basalt fiber cloth has the protective function of breaking the projectile and consuming the impact energy of the projectile that the protective screen should have. When the basalt fib...
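
To make the sample home page above concrete, the following Java sketch pulls a title, author line, affiliation, and abstract out of lines laid out like that sample. The rules (first non-empty line is the title, next is the author line, a fully parenthesized line is the affiliation, a line starting with "Abstract:" is the abstract) are simple stand-ins chosen for this example; they are not the discriminant index rules of the patent, which this excerpt does not show. The class name HomePageExtractor and the field names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: heuristic rules for a page laid out like the sample.
class HomePageExtractor {

    static Map<String, String> extract(List<String> lines) {
        Map<String, String> info = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("Abstract:")) {
                // Keep only the text after the marker.
                info.put("abstract", line.substring("Abstract:".length()).trim());
            } else if (line.startsWith("(") && line.endsWith(")")) {
                // A fully parenthesized line is treated as the affiliation.
                info.put("affiliation", line.substring(1, line.length() - 1));
            } else if (!info.containsKey("title")) {
                info.put("title", line);
            } else if (!info.containsKey("authors")) {
                info.put("authors", line);
            }
            // Any other line (for example a wrapped abstract line) is ignored here.
        }
        return info;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Experimental analysis on the protective performance of basalt fiber cloth under high-speed impact",
            "Ha Yue, Pang Baojun, Chi Runqiang, He Maojian, Guan Gongshun, Zhang Wei",
            "(School of Astronautics, Harbin Institute of Technology, Harbin, Heilongjiang, 150080)",
            "Abstract: In the field of space debris protection, the use of high-tech fibers as protective materials is one of the trends in the development of protective structures.");
        extract(lines).forEach((field, value) -> System.out.println(field + " = " + value));
    }
}
```

A real page would need more robust rules, for instance handling titles or abstracts that wrap across several lines, which this sketch deliberately leaves out.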

Abstract

The invention discloses a method and a device for extracting structured information from a PDF (Portable Document Format) file. It mainly addresses three problems of existing methods: they consume large amounts of manpower and time, they cannot process files in batches, and they do not further process the structured information. In the disclosed method, the text content of the PDF file is extracted and split to obtain a string group; the string group is then traversed and a prefix is added to each element to form a discriminant index; the structured information is then extracted according to the discriminant index, which automates the extraction; finally, the format of the structured information is converted and the result is written into a database. This scheme saves manpower and material resources, allows PDF files to be processed in batches, and forms a complete method for converting PDF files from unstructured data to structured data.

Description

Technical field

[0001] The present invention relates to the technical field of unstructured data governance, in particular to a method and device for extracting structured information from PDF files.

Background

[0002] Algorithms and tools for structured data are relatively mature, but algorithms and tools for mining unstructured data are still at an early stage of development. As storage costs fall and new technologies emerge, the industry is paying more attention to unstructured data, which accounts for more than 80% of enterprise data and is growing at a rate of 55% to 65% per year. Without tools and algorithms to analyze this mass of data, the value of enterprise data cannot be realized.

[0003] Existing PDF parsing tools, such as PDFBox, Tika, and iText, extract the metadata of a PDF and convert its text content to plain text, but cannot automatically iden...


Application Information

IPC(8): G06F16/31, G06F16/33, G06F40/166, G06F40/258
CPC: G06F16/316, G06F16/33, G06F40/166, G06F40/258
Inventors: 韩威宏, 刘俊良, 王怡君, 周刚
Owner: CHENGDU SEFON SOFTWARE CO LTD