
Method for detecting and correcting garbled characters in PDF (Portable Document Format) documents

A document processing technology, applied in the field of garbled character detection and correction, that addresses problems such as excessive time consumption and improves the efficiency of garbled character detection.

Active Publication Date: 2015-06-24
同方知网数字出版技术股份有限公司 +1
Cites: 5 | Cited by: 18

AI Technical Summary

Problems solved by technology

For example, in a PDF document where only a small fraction of the characters are garbled, applying OCR character recognition to every character inevitably spends most of the time recognizing normal characters.



Examples


Detailed Description of the Embodiments

[0020] To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and accompanying drawings.

[0021] As shown in Figure 1, the method for detecting and correcting garbled characters in a PDF document includes:

[0022] Extracting all font features in the PDF document;

[0023] Classifying the fonts, according to their features, into normal fonts, garbled fonts, and undetermined fonts;

[0024] Extracting the dot matrix image of each character in the undetermined fonts, calculating the similarity between the dot matrix image and the character's corresponding code using a garbled character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is normal or garbled;

[0025] Performing vertical and horizontal editing and correction on the garbled characters in th...
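The similarity test in step [0024] can be illustrated with a minimal Python sketch. The text does not say which image statistical features or which similarity measure are used, so the grid ink-density feature, the cosine similarity, the 0.8 threshold, and all function names below are assumptions chosen for illustration. The reference bitmap stands for a rendering of the character that the code claims to be.

```python
from math import sqrt


def parse(rows):
    """Turn rows of '#'/'.' text into a binary dot matrix (1 = ink)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def grid_density(bitmap, cells=4):
    """Assumed statistical feature: ink fraction in each block of a cells x cells grid."""
    step = len(bitmap) // cells
    features = []
    for by in range(cells):
        for bx in range(cells):
            ink = sum(
                bitmap[y][x]
                for y in range(by * step, (by + 1) * step)
                for x in range(bx * step, (bx + 1) * step)
            )
            features.append(ink / (step * step))
    return features


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_garbled(glyph_bitmap, reference_bitmap, threshold=0.8):
    """Flag a character whose drawn glyph is unlike the glyph expected for its code.

    The threshold is a hypothetical value, not one taken from the patent.
    """
    similarity = cosine_similarity(
        grid_density(glyph_bitmap), grid_density(reference_bitmap)
    )
    return similarity < threshold


# Two toy 8x8 dot matrices: a vertical stroke and a horizontal stroke.
VERTICAL_BAR = parse(["...##..."] * 8)
HORIZONTAL_BAR = parse(["........"] * 3 + ["########"] * 2 + ["........"] * 3)

# The page draws a vertical stroke, but the code maps to a horizontal one.
print(is_garbled(VERTICAL_BAR, HORIZONTAL_BAR))
# The drawn glyph matches the glyph expected for the code.
print(is_garbled(VERTICAL_BAR, VERTICAL_BAR))
```

In this sketch a character counts as garbled when the glyph shown on the page and the glyph implied by its code disagree, which is the mismatch the step is meant to catch. A real implementation would need features robust to font style and size.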


Abstract

The invention discloses a method for detecting and correcting garbled characters in a PDF (Portable Document Format) document. The method includes: extracting all font features in the PDF document; classifying the fonts into normal fonts, garbled fonts, and undetermined fonts according to the font features; extracting dot matrix images of the characters in the undetermined fonts, calculating the similarity between each dot matrix image and its corresponding code using a garbled character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is normal or garbled; performing vertical and horizontal editing and correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts; and correcting the PDF document according to the correction result to remove the garbled characters. Combining font features with character image features achieves automatic detection of garbled characters, and combining vertical editing with horizontal editing reduces the labor and time needed for correction. Garbled characters are removed effectively, their interference with subsequent fragmentation processing is avoided, processing efficiency and quality are improved, and processing cost is reduced.
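The three-way font classification described in the abstract can be sketched as a routing step that decides which characters need the more expensive image comparison. The abstract does not list the font features used, so the two boolean features, the class and function names, and the character counts below are hypothetical and serve only to show the control flow.

```python
from dataclasses import dataclass
from enum import Enum


class FontClass(Enum):
    NORMAL = "normal"
    GARBLED = "garbled"
    UNDETERMINED = "undetermined"


@dataclass
class FontInfo:
    name: str
    trusted_encoding: bool   # hypothetical feature: codes are known to map to the right text
    known_bad: bool = False  # hypothetical feature: font is known to extract as garbled text


def classify_font(font):
    """Assign a font to one of the three classes from its (assumed) features."""
    if font.known_bad:
        return FontClass.GARBLED
    if font.trusted_encoding:
        return FontClass.NORMAL
    return FontClass.UNDETERMINED


def route_characters(fonts, char_counts):
    """Count how many characters fall into each font class."""
    routed = {cls: 0 for cls in FontClass}
    for font in fonts:
        routed[classify_font(font)] += char_counts.get(font.name, 0)
    return routed


# Made-up document: three fonts and the number of characters set in each.
fonts = [
    FontInfo("FontA", trusted_encoding=True),
    FontInfo("FontB", trusted_encoding=False),
    FontInfo("FontC", trusted_encoding=False, known_bad=True),
]
char_counts = {"FontA": 9000, "FontB": 800, "FontC": 200}

routed = route_characters(fonts, char_counts)
for cls, count in routed.items():
    print(cls.value, count)
```

With these invented counts, only the characters in the undetermined font would go through the image comparison, the characters in the garbled font would go straight to correction, and the characters in the normal font would be skipped. That is how the classification avoids spending time on characters that are already correct.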

Description

Technical field

[0001] The invention relates to a method for detecting and correcting garbled characters during the fragmentation processing of PDF documents, in particular garbled characters in Chinese and English PDF documents.

Background

[0002] PDF (Portable Document Format) is an electronic document format that is independent of the operating system platform and has become widely used for electronic document distribution and digital information dissemination.

[0003] During fragmentation processing (metadata indexing) of PDF documents, text must be extracted from the documents. Text extraction here means copying the document's characters and pasting them to a specified location. Usually the displayed content of the document is correct and matches the extraction result. When the displaye...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06K9/32
Inventor: 邹季英, 梁洵, 袁仁慧
Owner: 同方知网数字出版技术股份有限公司