PDF (Portable Document Format) document text extracting method and device

A text extraction and document technology, applied in the fields of special data processing applications, instruments, and electronic digital data processing. It addresses the problems of time-consuming and laborious manual editing, poor text extraction accuracy and reliability, and hindered electronic document retrieval, with the effect of improving accuracy, efficiency, and reliability.

Inactive Publication Date: 2016-02-17
HANVON CORP +1
Cites: 5, Cited by: 3

AI Technical Summary

Problems solved by technology

As shown in Figures 1 to 3, the encoding system of the typesetting software is inconsistent with the internationally used encoding system. As a result, when characters such as English letters, numbers, and symbols are full-width characters and text is copied from the PDF document into an electronic document, there is no free space between the typesetting spaces of English words, and the typesetting spaces may even overlap (as shown in Figure 2). The e-reading application, however, decides where spaces belong according to the interval between character typesetting spaces, so the English content runs together (as shown in Figure 3). The accuracy and reliability of the text extracted from the PDF document are therefore poor. On the one hand, this hinders retrieval of the electronic document during electronic reading. On the other hand, the user has to edit the extracted text manually, which is time-consuming and laborious.
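The failure described above can be illustrated with a minimal sketch. The box coordinates, the `gaps` helper, and the "gap greater than zero" rule are assumptions chosen for illustration only; they are not taken from the patent or from any particular e-reading application.

```python
def gaps(boxes):
    """Return the horizontal gap between each box and the box before it."""
    return [cur[0] - prev[1] for prev, cur in zip(boxes, boxes[1:])]

# Hypothetical typesetting spaces (left, right) for four full-width letters
# that belong to two adjacent English words. The boxes touch or overlap.
full_width_boxes = [(0, 10), (10, 20), (19, 29), (29, 39)]

gap_values = gaps(full_width_boxes)

# A reader that looks for a positive interval between typesetting spaces
# finds no word boundary here, so the two words are joined.
has_word_break = any(g > 0 for g in gap_values)
print(gap_values, has_word_break)
```

Because every gap is zero or negative, a purely interval-based rule has nothing to trigger on, which matches the joined English text described for Figure 3.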

Examples

Embodiment Construction

[0044] The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the full content.

[0045] The text extraction method and device for PDF documents provided by the embodiments of the present invention can be used to extract text from PDF documents exported by existing typesetting software, which includes but is not limited to Founder Book Edition, Founder Weisi, and Founder Feiteng. The method can solve the problem that the characters extracted from the PDF document exported by the existing typesetting software (such as: Arabic numerals, English letters, punctuation marks, spec...

Abstract

The invention discloses a PDF (Portable Document Format) document text extracting method and device. The method comprises: obtaining the composition space of each character according to the display space of that character in a PDF document; and, if the distance between the composition space of the current character and the composition space of the preceding character is larger than a first preset threshold value, inserting a space before the composition space of the current character. The method and device prevent English characters in the extracted text from running together when the PDF document has been exported by existing publishing software, and improve the accuracy and reliability of text extraction from the PDF document.
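The threshold rule in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Glyph` type, the `extract_text` function, the coordinates, and the threshold value are all hypothetical, and the composition spaces are taken as given because the visible text does not say how they are derived from the display spaces.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    left: float   # left edge of the composition space (hypothetical units)
    right: float  # right edge of the composition space

def extract_text(glyphs, first_threshold):
    """Join the glyphs of one line, inserting a space before any glyph whose
    composition space lies more than first_threshold after the previous one."""
    out = []
    prev = None
    for g in glyphs:
        if prev is not None and g.left - prev.right > first_threshold:
            out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)

# Hypothetical composition spaces for the characters of "PDF text".
line = [
    Glyph("P", 0.0, 5.0), Glyph("D", 5.5, 10.5), Glyph("F", 11.0, 15.5),
    Glyph("t", 19.0, 22.0), Glyph("e", 22.4, 26.4),
    Glyph("x", 26.8, 30.8), Glyph("t", 31.2, 34.2),
]

result = extract_text(line, first_threshold=2.0)
print(result)
```

With these assumed coordinates only the gap between "F" and "t" exceeds the threshold, so a single space is inserted there. Choosing the threshold is a tuning decision: a value that is too large joins words, and a value that is too small splits them.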

Description

Technical field

[0001] The invention belongs to the technical field of reading and data processing, and in particular relates to a text extraction method and device for a PDF document.

Background

[0002] With the rapid development of digital publishing technology, more and more publishing organizations have begun to issue books in digital form, that is, as electronic documents. At present, during editing, processing, and printing, the electronic files of these books are created with typesetting software (such as Founder Book Edition or Founder Feiteng), and after typesetting, large sample files are exported for printing. Since a large sample file can only be used for printing and not for electronic reading, the tools provided by the typesetting software are generally used to convert the large sample file into a Portable Document Format (PDF) file for electronic reading.

[0003] Figure 1 is a schematic diagram of a PDF document expor...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/25, G06F40/189
CPC: G06F40/189
Inventor: 楼永植
Owner: HANVON CORP