A method for detecting and correcting garbled characters in PDF documents

A technology for detecting and correcting garbled characters in documents. Applied in the field of garbled character detection and correction, it addresses problems such as the time cost of detection and improves the efficiency of garbled character detection.

Active Publication Date: 2018-03-30
同方知网数字出版技术股份有限公司 +1
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

For example, in a PDF document where only a small fraction of the characters are garbled, applying OCR character recognition to every character inevitably spends most of the time recognizing normal characters.




Embodiment Construction

[0020] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the embodiments and accompanying drawings.

[0021] As shown in Figure 1, the method flow for detecting and correcting garbled characters in a PDF document includes:

[0022] Extracting the features of all fonts in the PDF document;

[0023] Dividing the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features;

[0024] Extracting the dot matrix image of each character in the undetermined fonts, calculating the similarity between the dot matrix image and the corresponding character code using a garbled character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is normal or garbled;

[0025] Performing vertical and horizontal editing and correction on the garbled characters in the u...
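The triage and similarity steps in [0023] and [0024] can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say which font features or which image statistics it uses, so the `FontInfo` fields, the row and column ink-density features, the cosine similarity measure and the 0.9 threshold are all hypothetical choices made for the example, not the patented algorithm.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence

Bitmap = Sequence[Sequence[int]]  # rows of 0/1 pixels (a dot matrix image)


@dataclass
class FontInfo:
    # Hypothetical font features; the source does not list the ones it uses.
    name: str
    has_unicode_map: bool       # font carries a usable code-to-Unicode mapping
    uses_custom_encoding: bool  # font remaps character codes in a nonstandard way


def classify_font(font: FontInfo) -> str:
    """Triage a font as 'normal', 'garbled' or 'undetermined' (illustrative rules)."""
    if font.has_unicode_map and not font.uses_custom_encoding:
        return "normal"
    if not font.has_unicode_map and font.uses_custom_encoding:
        return "garbled"
    return "undetermined"  # only these fonts need the image-based check


def stat_features(bitmap: Bitmap) -> List[float]:
    """Row and column ink densities of a glyph bitmap, joined into one vector."""
    height, width = len(bitmap), len(bitmap[0])
    rows = [sum(row) / width for row in bitmap]
    cols = [sum(row[x] for row in bitmap) / height for x in range(width)]
    return rows + cols


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_garbled(glyph: Bitmap, reference: Bitmap, threshold: float = 0.9) -> bool:
    """Flag a character whose rendered glyph differs from the reference glyph
    for its character code. Both bitmaps are assumed to have the same size."""
    return cosine(stat_features(glyph), stat_features(reference)) < threshold
```

In this sketch, `reference` stands for a rendering of the character that the extracted code claims to be, for example from a trusted font. Only characters in undetermined fonts reach `is_garbled`, which reflects the source's point that running recognition on every character wastes time on normal ones.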


Abstract

The invention discloses a method for detecting and correcting garbled characters in a PDF document. The method comprises: extracting all font features in the PDF document; dividing the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features; extracting the dot matrix image of each character in the undetermined fonts, calculating the similarity between the dot matrix image and the corresponding character code using a garbled character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is normal or garbled; editing and correcting the garbled characters vertically and horizontally; and correcting the PDF document based on the correction results so that the garbled characters are removed. By combining font features with the image features of characters, the invention detects garbled characters automatically. Combining vertical and horizontal editing reduces the manual time spent on correction, effectively removes garbled characters, and eliminates their interference with subsequent fragmentation processing, which improves processing efficiency and quality and reduces processing costs.

Description

Technical field

[0001] The invention relates to a method for detecting and correcting garbled characters during the fragmentation processing of PDF documents, and in particular to a method for detecting and correcting garbled characters in Chinese and English PDF documents.

Background

[0002] PDF (Portable Document Format) is an electronic document format that is independent of the operating system platform and is widely used for electronic document distribution and digital information dissemination.

[0003] During the fragmentation processing (metadata indexing) of PDF documents, text must be extracted from the documents. Text extraction here means copying the document's characters and pasting them to a specified location. Usually the document displays correctly and the displayed content matches the extraction result. When the displaye...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06K9/32
Inventor: 邹季英, 梁洵, 袁仁慧
Owner: 同方知网数字出版技术股份有限公司