File content extraction method, device and apparatus, and computer readable storage medium

A file content extraction technology, applied in the field of testing, that addresses the low efficiency of extracting PDF file content

Pending Publication Date: 2019-05-03
PINGAN PUHUI ENTERPRISE MANAGEMENT CO LTD
Cites: 3; Cited by: 1

AI Technical Summary

Problems solved by technology

[0004] The main purpose of the present invention is to provide a file content extraction method that solves the above-mentioned technical problem of low efficiency in extracting PDF file content and improves the processing efficiency of PDF files.


Examples

Embodiment Construction

[0056] It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0057] The main solution of the embodiments of the present invention is to convert the PDF file into HTML data, then extract content data of different types from the HTML data and generate corresponding content files, thereby extracting the content of the PDF file.

[0058] In the prior art, extracting PDF file content relies mainly on manual screening and comparison, so processing efficiency is very low, especially when a large number of PDF files must be processed in batches.

[0059] The present invention provides a solution: by converting PDF files into HTML data, extracting different types of content data from the HTML data, and generating corresponding content files, it realizes the automatic extraction of PDF file content and improves the efficiency of PDF file processing...
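The extraction step described in the paragraphs above can be illustrated with a minimal sketch. It assumes the PDF has already been converted to HTML by some converter (the source does not name one), and it assumes three data types (paragraph text, image references, and tables) mapped from `<p>`, `<img>`, and `<table>` tags. The class and function names are hypothetical, and the "preset analysis rule" of the patent is not specified in this excerpt, so the tag mapping here is only a stand-in.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text, image references, and table rows from HTML.

    The three data types and the tag-to-type mapping are assumptions;
    the source only speaks of "different types of content data".
    """

    def __init__(self):
        super().__init__()
        self.text = []     # one string per <p> element
        self.images = []   # values of <img src="...">
        self.tables = []   # each table is a list of rows (lists of cell strings)
        self._in_p = False
        self._in_cell = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._in_cell = True
            self._buf = []
        elif tag == "p" and not self._in_cell:
            self._in_p = True
            self._buf = []

    def handle_data(self, data):
        # Only keep character data that belongs to a paragraph or a table cell.
        if self._in_cell or self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self.tables[-1][-1].append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "p" and self._in_p:
            paragraph = "".join(self._buf).strip()
            if paragraph:
                self.text.append(paragraph)
            self._in_p = False


def extract_content(html):
    """Parse HTML and return the content grouped by (assumed) data type."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {"text": parser.text, "images": parser.images, "tables": parser.tables}
```

Calling `extract_content(html)` on converted HTML returns a dictionary keyed by data type, which a later step could turn into one content file per type. Real converter output is often positioned `<div>` and `<span>` markup rather than clean `<p>` and `<table>` tags, so the mapping would need adjusting to the converter actually used.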

Abstract

The invention discloses a file content extraction method. The file content extraction method comprises the following steps: obtaining a portable document format file; converting the portable document format file into hypertext markup language data according to a preset conversion rule; analyzing the hypertext markup language data according to a preset analysis rule so as to extract content data of different data types; and generating a corresponding content file according to the data type of the content data. The invention further discloses file content extraction equipment and device and a computer readable storage medium. According to the method, the file content extraction efficiency can be improved.
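The final step of the abstract, generating a content file according to the data type of the content data, might look like the following sketch. The input dictionary layout, the file names, and the choice of `.txt` for text and image references and `.csv` for tables are all assumptions made for illustration; the source does not say which file formats the method produces.

```python
import csv
from pathlib import Path


def write_content_files(content, out_dir):
    """Write one file per data type and return the paths created.

    `content` is a hypothetical dict with optional keys "text" (list of
    strings), "tables" (list of tables, each a list of rows) and "images"
    (list of image references).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    # Text content: one paragraph per line in a plain-text file.
    if content.get("text"):
        path = out / "text.txt"
        path.write_text("\n".join(content["text"]) + "\n", encoding="utf-8")
        written.append(path)

    # Table content: one CSV file per table.
    for index, rows in enumerate(content.get("tables", []), start=1):
        path = out / f"table_{index}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)

    # Image content: only the references are recorded here, not the image bytes.
    if content.get("images"):
        path = out / "images.txt"
        path.write_text("\n".join(content["images"]) + "\n", encoding="utf-8")
        written.append(path)

    return written
```

For example, `write_content_files({"text": ["Hello"], "tables": [[["k", "v"]]]}, "out")` would create `out/text.txt` and `out/table_1.csv`. Dispatching on the data type in one place keeps it easy to add another type later without touching the parsing step.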

Description

Technical field

[0001] The present invention relates to the technical field of testing, in particular to a file content extraction method, equipment, device and computer-readable storage medium.

Background

[0002] Portable Document Format (PDF) is an electronic file format developed for exchanging files in a manner independent of application programs, operating systems, and hardware. PDF files are based on the PostScript language imaging model, which can faithfully reproduce every character, color, and image of the original manuscript. The PDF file format is independent of the operating system platform, which gives it good versatility and makes it an ideal document format for electronic document distribution and digital information dissemination on the Internet. The PDF file format can encapsulate text, fonts, formats, colors, and device- and resolution-independent graphic images in one file. Files in this format can also contai...

Application Information

IPC(8): G06F17/22, G06F16/16
Inventor: 朱峰
Owner: PINGAN PUHUI ENTERPRISE MANAGEMENT CO LTD