File content extraction method, device and apparatus, and computer readable storage medium

A file content extraction technology, applied in the field of testing, which can solve the problem of low efficiency in extracting PDF file content

Pending Publication Date: 2019-05-03

PINGAN PUHUI ENTERPRISE MANAGEMENT CO LTD


Problems solved by technology

[0004] The main purpose of the present invention is to provide a file content extraction method that solves the above-mentioned technical problem of low efficiency in extracting PDF file content and improves the processing efficiency of PDF files.

Embodiment Construction

[0056] It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0057] The main solution of the embodiment of the present invention is to convert the PDF file into HTML data and then extract different types of content data from the HTML data to generate corresponding content files, thereby extracting the content of the PDF file.

[0058] Because the extraction of PDF file content in the prior art relies mainly on manual screening and comparison, the processing efficiency is very low, especially when a large number of PDF files need to be processed in batches.

[0059] The present invention provides a solution: converting PDF files into HTML data, extracting different types of content data from the HTML data, and generating corresponding content files, so as to extract PDF file content automatically and improve the efficiency of PDF file processing...
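The extraction step described in this embodiment (pulling different types of content data out of HTML data) can be illustrated with a minimal sketch. The patent text available here does not disclose the preset analysis rule, so the rule below is an assumption: plain text, `<img>` sources, and `<table>` cells are treated as three data types. The names `ContentExtractor` and `extract_content` are hypothetical, and the PDF-to-HTML conversion itself is out of scope for this sketch.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text, image sources, and table rows from HTML data.

    Assumed analysis rule (not from the patent): text outside tables,
    <img src> values, and <td>/<th> cell contents are separate data types.
    """

    def __init__(self):
        super().__init__()
        self.text = []      # text fragments found outside tables
        self.images = []    # values of <img src="...">
        self.tables = []    # each table is a list of rows of cell strings
        self._in_table = False
        self._in_cell = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag == "table":
            self.tables.append([])
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self._in_table:
            if not self.tables[-1]:
                # Tolerate a cell that appears without an enclosing <tr>.
                self.tables[-1].append([])
            self._in_cell = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self.tables[-1][-1].append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "table":
            self._in_table = False

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)
        elif not self._in_table and data.strip():
            self.text.append(data.strip())


def extract_content(html_data):
    """Return the content data of the HTML input, grouped by data type."""
    parser = ContentExtractor()
    parser.feed(html_data)
    parser.close()
    return {"text": parser.text, "images": parser.images, "tables": parser.tables}


sample = '<p>Hello</p><img src="a.png"><table><tr><td>1</td><td>2</td></tr></table>'
result = extract_content(sample)
```

For the sample string, `result["text"]` holds the paragraph text, `result["images"]` the image source, and `result["tables"]` a single one-row table. Nested tables are not handled; a real analysis rule would need to decide how to treat them.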


Abstract

The invention discloses a file content extraction method. The file content extraction method comprises the following steps: obtaining a portable document format file; converting the portable document format file into hypertext markup language data according to a preset conversion rule; analyzing the hypertext markup language data according to a preset analysis rule so as to extract content data of different data types; and generating a corresponding content file according to the data type of the content data. The invention further discloses file content extraction equipment and device and a computer readable storage medium. According to the method, the file content extraction efficiency can be improved.
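The final step of the abstract, generating a content file according to the data type of the content data, might look like the sketch below. The mapping from data type to file format is an assumption made for illustration (text to a `.txt` file, image sources to a list file, each table to a `.csv` file); the abstract does not say which formats are used. The function name `write_content_files` and the dictionary keys are hypothetical.

```python
import csv
from pathlib import Path


def write_content_files(content, out_dir):
    """Write one content file per data type and return the created paths.

    `content` is assumed to be a dict with optional keys "text" (list of
    strings), "images" (list of source strings), and "tables" (list of
    tables, each a list of rows of cell strings).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    created = []

    if content.get("text"):
        path = out / "text.txt"
        path.write_text("\n".join(content["text"]), encoding="utf-8")
        created.append(path)

    if content.get("images"):
        # Only the image references are recorded, not the image bytes.
        path = out / "images.txt"
        path.write_text("\n".join(content["images"]), encoding="utf-8")
        created.append(path)

    for i, rows in enumerate(content.get("tables", []), start=1):
        path = out / f"table_{i}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(rows)
        created.append(path)

    return created
```

Keeping the file generation separate from the HTML analysis means a different output format for one data type can be swapped in without touching the extraction logic.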

Description

Technical field

[0001] The present invention relates to the technical field of testing, in particular to a file content extraction method, equipment, device and computer-readable storage medium.

Background

[0002] Portable Document Format (PDF) is an electronic file format developed for exchanging files independently of application programs, operating systems, and hardware. PDF files are based on the PostScript language imaging model, which can faithfully reproduce every character, color, and image of the original manuscript. The PDF file format is independent of the operating system platform, that is, it has good versatility, which makes it an ideal document format for electronic document distribution and digital information dissemination on the Internet. The PDF file format can encapsulate text, fonts, formats, colors, and device- and resolution-independent graphic images in one file. Files in this format can also contai...


Application Information


IPC(8): G06F17/22; G06F16/16

Inventor: 朱峰

Owner: PINGAN PUHUI ENTERPRISE MANAGEMENT CO LTD
