Method and device for extracting structured information of PDF (Portable Document Format) file

The technology concerns structured information and file structure, and applies to unstructured text data retrieval, text database querying, and text database indexing. It addresses the lack of structured-information processing, the lack of batch processing, and the heavy manual effort of existing approaches, with the stated effects of expanding product market share, reducing human-resource costs, and saving manpower and material resources.

Pending Publication Date: 2022-05-17

CHENGDU SEFON SOFTWARE CO LTD


Problems solved by technology

[0005] The purpose of the present invention is to provide a method and device for extracting structured information from PDF files, so as to solve the problems of existing methods: they consume a large amount of manpower and time, they cannot process files in batches, and they do not further process the structured information.

Examples

Embodiment 1

[0051] As shown in Figure 1, a method for extracting structured information from PDF files reduces the labor cost of extracting and storing structured information from an enterprise's unstructured PDF documents, improves the enterprise's ability to govern unstructured data, and unlocks the value contained in PDF documents. It also improves the market competitiveness of data governance products. The method includes the following steps:

[0052] S1. Select the editable PDF documents and read them;

[0053] S2. Extract the text content of the editable PDF document from step S1, then split the text content to obtain a string group;

[0054] S3. Form a discriminant index from the string group in step S2;

[0055] S4. Extract the structured information according to the discriminant index from step S3;

[0056] S5. Convert the structured information from step S4 into the target format and write it into the database...
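
Step S5 can be illustrated with a minimal sketch. The source does not name the target format or the database, so JSON and an in-memory SQLite database are assumptions made here, as are the table name pdf_info, the function write_record, and the sample record.

```python
import json
import sqlite3


def write_record(conn, source_file, record):
    """Serialize one extracted record and store it (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_info ("
        "source_file TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    # Format conversion: the dict becomes a JSON string before it is written.
    payload = json.dumps(record, ensure_ascii=False)
    conn.execute(
        "INSERT OR REPLACE INTO pdf_info (source_file, payload) VALUES (?, ?)",
        (source_file, payload),
    )
    conn.commit()


# Hypothetical record, standing in for the output of step S4.
conn = sqlite3.connect(":memory:")
write_record(conn, "example.pdf", {"title": "Example title", "authors": ["A", "B"]})

row = conn.execute(
    "SELECT payload FROM pdf_info WHERE source_file = ?", ("example.pdf",)
).fetchone()
stored = json.loads(row[0])
print(stored["title"])
```

Keying the table on the source file name means that reprocessing the same PDF in a batch run overwrites the earlier row instead of duplicating it. That is a design choice of this sketch, not something the patent text specifies.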

Embodiment 2

[0059] As shown in Figure 1, the PDFBox-based method for extracting structured information from PDF documents proceeds as follows:

[0060] Step 1, PDF document screening: select the editable PDF documents, based on whether each document is editable.

[0061] Step 2, read an editable PDF document: call PDFBox's PDDocument.load() function to generate the PDDocument object document.

[0062] Step 3, extract the text content of the first page: use PDFBox to create the PDFTextStripper object stripper, and obtain the text content result through stripper.getText(document).

[0063] Step 4, split the text content to form a string group: the text content result obtained in step 3 is a single string; splitting it on the newline character "\n" yields the string array splitRes.

[0064] Step 5, traverse the strings and add a prefix to form a discriminant index for judging position: traverse each element of the string data split...
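
Steps 4 and 5 can be sketched in a few lines. The PDFBox calls in steps 2 and 3 are a Java API and are not reproduced here, so the sketch starts from an already extracted string. The source text is cut off before it describes the prefix, so the zero-padded position prefix below is purely an assumption, as are the function names and the sample text.

```python
from __future__ import annotations


def split_lines(text: str) -> list[str]:
    """Step 4: split the extracted text on the newline character."""
    return text.split("\n")


def build_index(lines: list[str]) -> list[str]:
    """Step 5 (assumed form): prefix each element with its position."""
    return [f"{i:03d}|{line.strip()}" for i, line in enumerate(lines)]


# Hypothetical stand-in for the string returned by stripper.getText(document).
result = "Title line\nAuthor A, Author B\n(Some Institute)\nAbstract: text"

split_res = split_lines(result)
index = build_index(split_res)
for entry in index:
    print(entry)
```

With a positional prefix, a later rule can decide what a line is from where it sits on the page as well as from its content.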

Embodiment 3

[0076] a. First page of an editable PDF document

[0077] Experimental analysis on the protective performance of basalt fiber cloth under high-speed impact

[0078] Ha Yue, Pang Baojun, Chi Runqiang, He Maojian, Guan Gongshun, Zhang Wei

[0079] (School of Astronautics, Harbin Institute of Technology, Harbin, Heilongjiang, 150080)

[0080] Abstract: In the field of space debris protection, the use of high-tech fibers as protective materials is one of the trends in the development of protective structures. Basalt fiber is a new high-tech fiber in recent years, with high strength and elastic modulus. In this paper, the protective performance of basalt fiber fabric against spherical projectile high-speed impact is experimentally studied by high-speed impact test. The experimental analysis shows that the basalt fiber cloth has the protective function of breaking the projectile and consuming the impact energy of the projectile that the protective screen should have. When the basalt fib...
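
To show what extraction from such a first page could look like, here is a minimal sketch over the sample lines above. The rules are assumptions made for illustration (title on the first line, authors on the second, affiliation in parentheses, abstract after the "Abstract:" marker), not the patent's actual discrimination rules. Only the first sentence of the abstract is included, because the source text is truncated.

```python
FIRST_PAGE = "\n".join([
    "Experimental analysis on the protective performance of basalt fiber "
    "cloth under high-speed impact",
    "Ha Yue, Pang Baojun, Chi Runqiang, He Maojian, Guan Gongshun, Zhang Wei",
    "(School of Astronautics, Harbin Institute of Technology, Harbin, "
    "Heilongjiang, 150080)",
    "Abstract: In the field of space debris protection, the use of high-tech "
    "fibers as protective materials is one of the trends in the development "
    "of protective structures.",
])


def extract_fields(text):
    """Classify each line by content markers first, then by position."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    record = {"title": None, "authors": [], "affiliation": None, "abstract": None}
    for pos, line in enumerate(lines):
        if line.startswith("Abstract:"):
            record["abstract"] = line[len("Abstract:"):].strip()
        elif line.startswith("(") and line.endswith(")"):
            record["affiliation"] = line[1:-1]
        elif pos == 0:
            record["title"] = line
        elif pos == 1:
            record["authors"] = [name.strip() for name in line.split(",")]
    return record


record = extract_fields(FIRST_PAGE)
print(record["title"])
print(record["authors"])
```

Real first pages vary in layout, so fixed positions like these would likely need per-template rules or fallbacks.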

Abstract

The invention discloses a method and a device for extracting structured information from PDF (Portable Document Format) files. It mainly addresses three problems of existing methods: they consume large amounts of manpower and time, they cannot process files in batches, and they do not further process the structured information. In the disclosed method, the text content of the PDF file is extracted and split into a string group; the string group is then traversed and a prefix is added to form a discriminant index; structured information is extracted according to that index, which automates the extraction; finally, the structured information is converted to the target format and written into a database. The scheme thereby saves manpower and material resources, allows PDF files to be processed in batches, and forms a complete method for converting PDF files from unstructured to structured data.

Description

Technical field

[0001] The present invention relates to the technical field of unstructured data governance, and in particular to a method and device for extracting structured information from PDF files.

Background

[0002] Algorithms and tools for structured data are relatively mature, but those for mining unstructured data are still at an early stage of development. As storage costs fall and new technologies emerge, the industry is paying more attention to unstructured data, which accounts for more than 80% of enterprise data and is growing at 55% to 65% per year. Without tools and algorithms to analyze this mass of data, its value to the enterprise cannot be realized.

[0003] Existing PDF parsing tools, such as PDFBox, Tika, and iText, extract the metadata of a PDF and convert its text content to plain text, but cannot automatically iden...

Application Information

IPC(8): G06F16/31, G06F16/33, G06F40/166, G06F40/258

CPC: G06F16/316, G06F16/33, G06F40/166, G06F40/258

Inventors: 韩威宏, 刘俊良, 王怡君, 周刚

Owner: CHENGDU SEFON SOFTWARE CO LTD
