PDF file conversion method based on OCR pre-judgment

A file conversion and pre-judgment technology, applied in the field of PDF file conversion, that addresses low judgment accuracy and text deviation, and achieves strong applicability, improved accuracy, and good conversion results.

Active Publication Date: 2019-03-19
四川译讯信息科技有限公司

AI Technical Summary

Problems solved by technology

[0005] Patent CN108038093A discloses a PDF text extraction method and device, specifically by obtaining the first code, font bitmap, embedded information and font information of each text object in the PDF page to determine whether the PDF pa


Embodiment Construction

[0030] Embodiments of the present invention are described in detail below with reference to Figures 1-4.

[0031] As shown in Figure 1, a PDF file conversion method based on OCR pre-judgment includes the following steps:

[0032] Step 1: Parse the PDF file to determine whether OCR is required for each page in the PDF file;

[0033] Step 2: Perform OCR on the pages that require it to obtain their text information; for pages that do not require OCR, extract the text information directly from the text encoding information of the text objects in the PDF page;

[0034] Step 3: Convert the obtained text information into a corresponding editable document through the PDF parsing algorithm and the Office file reconstruction algorithm.
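
The per-page dispatch in Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `Page`, `convert_pages`, `needs_ocr`, and `run_ocr` are hypothetical names, the pre-judgment rule and the OCR engine are passed in as callables because the text does not specify them here, and Step 3 (rebuilding the editable document) is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for one parsed PDF page."""
    number: int
    embedded_text: str  # text decoded from the page's text objects ("" if none)
    image_bytes: bytes  # rendered page image, used only when OCR is needed


def convert_pages(
    pages: List[Page],
    needs_ocr: Callable[[Page], bool],
    run_ocr: Callable[[bytes], str],
) -> List[str]:
    """Return one text string per page, calling OCR only where pre-judgment asks for it."""
    texts: List[str] = []
    for page in pages:
        if needs_ocr(page):
            # Step 2, OCR branch: recognize text from the page image.
            texts.append(run_ocr(page.image_bytes))
        else:
            # Step 2, direct branch: reuse the text already encoded in the page.
            texts.append(page.embedded_text)
    return texts
```

Because the judgment rule is injected, the same loop works with any pre-judgment criterion; a caller might, for example, pass a placeholder rule such as `lambda p: not p.embedded_text.strip()` while experimenting.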

[0035] We refer to scanned PDFs, PDFs converted from images, and other image-based PDF files as image-type PDFs. Text information cannot be extracted directly from image-type PDFs, and their text cannot be obtained by parsing such PDF files ...


Abstract

The invention discloses a PDF file conversion method based on OCR pre-judgment, comprising the following steps: parsing the PDF file and judging whether each page in the PDF file needs OCR; performing OCR on the pages that need it to obtain their text information; for pages that do not need OCR, extracting the text information directly from the text encoding information of the text objects in the PDF page; and converting the PDF file into the corresponding editable document through a PDF parsing algorithm and an Office file reconstruction algorithm. By pre-analyzing PDF files, the invention improves the accuracy of PDF text extraction, reduces unnecessary OCR recognition, and improves the conversion efficiency of PDF files, and it has strong applicability and a good conversion effect.

Description

Technical field

[0001] The invention belongs to the technical field of PDF file conversion, and in particular relates to a PDF file conversion method based on OCR pre-judgment.

Background

[0002] PDF (Portable Document Format) is an open electronic file format developed by Adobe. PDF was developed from the PostScript programming language, and PostScript is still widely used in professional publishing as the mainstream printer programming language.

[0003] The advantage of the PDF file format is that it is independent of software, hardware, and operating system platform, and achieves the same display effect on Windows, Unix, or Mac OS. As a result, PDF has become the mainstream electronic document format on the Internet and plays an important role in communication. However, because the text information in a PDF file is not easy to extract, edit, or query, it is usually necessary to convert the PDF ...


Application Information

IPC(8): G06F17/22, G06K9/20
CPC: G06F40/151, G06V10/22
Inventor: 马万炯
Owner: 四川译讯信息科技有限公司