Optical character recognition method and apparatus for PDF documents

An optical character recognition technology for PDF files, applied in the field of optical character recognition, that addresses the problems of complicated operation, long processing time, and poor user experience, and achieves shorter operation time, simpler user operation, and a better user experience.

Active Publication Date: 2009-05-27
北京汉王影研科技有限公司

AI Technical Summary

Problems solved by technology

[0004] Clearly, performing OCR on PDF files with the above method requires switching back and forth between different software programs; the operation is complicated and time-consuming, and the user experience is poor.



Examples


Embodiment Construction

[0065] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0066] In terms of how they are generated, PDF files fall into two kinds. The first uses optical scanning to convert existing paper documents, books, and the like into images and then generates the PDF file from those images; characters, graphics, and other data exist in the form of images. The second uses an application program together with a PDF printer (virtual printing software) to convert the computer internal codes of the character and graphics data into the internal representation of PDF; characters, graphics, and other data exist in the form of PDF code.
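
The distinction in [0066] can be illustrated with a minimal sketch. The `PageObject` class, the `kind` labels, and the `generation_route` function are hypothetical names introduced only for illustration; the all-images heuristic is an assumption and is not described in the patent text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageObject:
    """Hypothetical stand-in for one object on a PDF page."""
    kind: str  # assumed labels: "image", "text", or "graphics"


def generation_route(objects: List[PageObject]) -> str:
    """Guess how a page was produced from the kinds of objects it holds.

    Assumption: a page holding only images came from scanning, while a page
    holding any text or graphics objects came from a PDF printer.
    """
    if not objects:
        return "empty"
    if all(obj.kind == "image" for obj in objects):
        return "scanned"  # first kind: data exists in the form of images
    return "coded"        # second kind: data exists in the form of PDF code


scanned_page = [PageObject("image")]
printed_page = [PageObject("text"), PageObject("graphics"), PageObject("image")]
print(generation_route(scanned_page), generation_route(printed_page))
```

A real file may mix both kinds of content on one page, so a heuristic like this would at best be a first approximation.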

[0067] From the perspective of the data structure of the PDF file, the data in the PDF file is organized in the form of PDF objects. S...


Abstract

The invention discloses an optical character recognition method for PDF files. The method comprises: determining a target page in a PDF file; acquiring the page-size information of the target page; generating an image region of corresponding size in memory according to the page-size information and preset resolution information; acquiring the page-description instructions of the target page; extracting the page-content data and position information from the page-description instructions; drawing the page-content data at the corresponding position in the image region according to the position information; and performing optical character recognition on the drawn page-content data to obtain a recognition result. The method performs OCR directly on the PDF file without repeated switching among different software programs, thereby simplifying user operation, reducing operation time, and giving users a good experience.
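
The steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: `ContentItem`, `image_region_size`, `render_page`, and `recognize_page` are hypothetical names, page sizes are assumed to be in PDF points (1/72 inch, the default PDF user-space unit), positions are assumed to be measured from the top-left corner for simplicity, content items are assumed to be pre-rasterized bitmaps, and the OCR engine is passed in as an arbitrary callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch

Raster = List[List[int]]  # rows of 0/1 pixel values


@dataclass
class ContentItem:
    """Hypothetical page-content item extracted from page-description instructions."""
    x_pt: float     # left edge in points, from the page's left edge
    y_pt: float     # top edge in points, from the page's top edge (assumption)
    pixels: Raster  # bitmap already rasterized at the target resolution


def image_region_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Convert a page size in points to a pixel size at the preset resolution."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


def render_page(width_pt: float, height_pt: float, dpi: int,
                items: List[ContentItem]) -> Raster:
    """Allocate the in-memory image region and draw each item at its position."""
    width_px, height_px = image_region_size(width_pt, height_pt, dpi)
    region = [[0] * width_px for _ in range(height_px)]
    for item in items:
        left = round(item.x_pt * dpi / POINTS_PER_INCH)
        top = round(item.y_pt * dpi / POINTS_PER_INCH)
        for dy, row in enumerate(item.pixels):
            for dx, value in enumerate(row):
                y, x = top + dy, left + dx
                if 0 <= y < height_px and 0 <= x < width_px:  # clip to the page
                    region[y][x] = value
    return region


def recognize_page(width_pt: float, height_pt: float, dpi: int,
                   items: List[ContentItem],
                   ocr_engine: Callable[[Raster], str]) -> str:
    """Render the target page, then hand the raster to the supplied OCR engine."""
    return ocr_engine(render_page(width_pt, height_pt, dpi, items))


# A US Letter page (612 x 792 points) at 300 dpi.
print(image_region_size(612, 792, 300))
```

Because rendering and recognition happen in the same process, no intermediate image file has to be exported and reopened in separate OCR software, which is the usability gain the abstract describes.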

Description

Technical field

[0001] The invention relates to the field of optical character recognition, and in particular to an optical character recognition method for PDF files and an optical character recognition apparatus for PDF files.

Background

[0002] Optical character recognition (OCR) is a technology that converts character images into computer internal character codes. Currently, the file formats that OCR technology can recognize are limited to image file formats, that is, files in formats such as tif, bmp, or jpg.

[0003] PDF (Portable Document Format) is an electronic document format used to describe page content. PDF files are independent of the operating system platform (that is, they can be used on Windows, Unix, or Mac OS alike) and have become an ideal document format for electronic docum...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/20, G06K9/34, G06K9/00
Inventors: 刘迎建, 刘昌平, 江世盛, 丁迎, 刘强
Owner: 北京汉王影研科技有限公司