Method and device for extracting structured information from PDF (Portable Document Format) files
A technique for extracting structured information from PDF files, applicable to unstructured text data retrieval, text database querying, and text database indexing. It addresses the lack of batch processing for structured information and the heavy manual effort that extraction otherwise requires. The intended effects are lower human resources cost, savings in manpower and material resources, and a larger product market share.
Examples
Embodiment 1
[0051] As shown in Figure 1, a method for extracting structured information from PDF files reduces the labor cost of extracting and storing structured information from an enterprise's unstructured PDF documents, improves the enterprise's ability to govern unstructured data, and unlocks the value contained in PDF documents. It also improves the market competitiveness of data governance products. The method includes the following steps:
[0052] S1. Select the editable PDF documents and read each editable PDF document;
[0053] S2. Extract the text content of the editable PDF document read in step S1, then segment the text content to obtain a string group;
[0054] S3. Form a discriminant index from the string group obtained in step S2;
[0055] S4. Extract the structured information according to the discriminant index from step S3;
[0056] S5. Convert the format of the structured information from step S4 and write it into the database...
Embodiment 2
[0059] As shown in Figure 1, the steps of the PDFBox-based method for extracting structured information from PDF documents are as follows:
[0060] Step 1, PDF document screening: check whether each PDF document is editable and select the editable ones.
[0061] Step 2, read an editable PDF document: call the PDDocument.load() function of PDFBox to generate the PDDocument object document.
[0062] Step 3, extract the text content of the home page: use PDFBox to create the PDFTextStripper object stripper, and obtain the text content result through stripper.getText(document).
[0063] Step 4, split the text content to form a string group: the text content result obtained in step 3 is a single string; split result on the newline character "\n" to obtain the string array splitRes.
[0064] Step 5, traverse the strings and add a prefix to form a discriminant index, so that position can be judged: traverse each element of the string data split...
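Steps 4 and 5 can be sketched in plain Java as shown here. This is a minimal sketch under stated assumptions, not the patent's implementation: the source does not show what the prefix looks like, so the `position|` prefix, the class name `LineIndexSketch`, and the method names `splitLines` and `buildIndex` are hypothetical. The input string stands in for the `result` value that `stripper.getText(document)` returns in step 3.

```java
import java.util.ArrayList;
import java.util.List;

class LineIndexSketch {

    // Step 4: split the extracted text on "\n" to obtain the string array.
    static String[] splitLines(String result) {
        return result.split("\n");
    }

    // Step 5 (assumed form): prefix every non-blank line with its original
    // position so that later rules can judge where the line sat on the page.
    static List<String> buildIndex(String[] splitRes) {
        List<String> indexed = new ArrayList<>();
        for (int i = 0; i < splitRes.length; i++) {
            String line = splitRes[i].trim(); // also drops a stray "\r"
            if (line.isEmpty()) {
                continue; // blank lines carry no information
            }
            indexed.add(i + "|" + line); // hypothetical prefix format
        }
        return indexed;
    }

    public static void main(String[] args) {
        String result = "Title line\nAuthor A, Author B\n\nAbstract: text";
        for (String entry : buildIndex(splitLines(result))) {
            System.out.println(entry);
        }
    }
}
```

Keeping the original position in the prefix, rather than renumbering after blank lines are dropped, preserves the gaps between blocks of text, which may itself be a useful signal when judging position.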
Embodiment 3
[0076] a. Editable PDF document home page
[0077] Experimental analysis on the protective performance of basalt fiber cloth under high-speed impact
[0078] Ha Yue, Pang Baojun, Chi Runqiang, He Maojian, Guan Gongshun, Zhang Wei
[0079] (School of Astronautics, Harbin Institute of Technology, Harbin, Heilongjiang, 150080)
[0080] Abstract: In the field of space debris protection, the use of high-tech fibers as protective materials is one of the trends in the development of protective structures. Basalt fiber is a new high-tech fiber in recent years, with high strength and elastic modulus. In this paper, the protective performance of basalt fiber fabric against spherical projectile high-speed impact is experimentally studied by high-speed impact test. The experimental analysis shows that the basalt fiber cloth has the protective function of breaking the projectile and consuming the impact energy of the projectile that the protective screen should have. When the basalt fib...
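For a home page laid out like the sample above, the extraction of structured information might look like the following sketch. The rules are assumptions chosen to fit this one sample and are not taken from the patent: the first non-blank line is treated as the title, the second as the author list, a fully parenthesized line as the affiliation, and a line starting with `Abstract:` as the abstract. The class name `HomePageFieldSketch` and the field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HomePageFieldSketch {

    // Maps home-page lines to named fields using position and simple markers.
    static Map<String, String> extract(String[] lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        int seen = 0; // number of non-blank lines processed so far
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("Abstract:")) {
                // Assumed marker: text after "Abstract:" is the abstract.
                fields.put("abstract", line.substring("Abstract:".length()).trim());
            } else if (line.startsWith("(") && line.endsWith(")")) {
                // Assumed rule: a fully parenthesized line is the affiliation.
                fields.put("affiliation", line.substring(1, line.length() - 1));
            } else if (seen == 0) {
                fields.put("title", line);   // assumed: first non-blank line
            } else if (seen == 1) {
                fields.put("authors", line); // assumed: second non-blank line
            }
            seen++;
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Experimental analysis on the protective performance of basalt fiber cloth under high-speed impact",
            "Ha Yue, Pang Baojun, Chi Runqiang, He Maojian, Guan Gongshun, Zhang Wei",
            "(School of Astronautics, Harbin Institute of Technology, Harbin, Heilongjiang, 150080)",
            "Abstract: In the field of space debris protection, the use of high-tech fibers as protective materials is one of the trends in the development of protective structures."
        };
        extract(lines).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The resulting map is one possible intermediate form before the format conversion and database write; a real document set would need rules that tolerate titles wrapped over several lines and other layout variations.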


