Document text extraction method and device

A document text extraction technology, applied in character and pattern recognition, electrical digital data processing, and special data processing applications. It addresses the problems of low OCR recognition accuracy, incorrect position correspondence, and poor noise resistance, with the effects of reducing the processing workload, reducing manual intervention, and preserving format and logical information.

Inactive Publication Date: 2011-11-30
HANVON CORP

AI Technical Summary

Problems solved by technology

In this process, some OCR recognition engines have poor noise resistance: when the document layout is chaotic or contains background text, OCR recognition accuracy is low, and in the layout proofreading, especially the tex...



Embodiment Construction

[0035] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0036] The invention discloses a text extraction method for a document which, as shown in Figure 1, includes the following steps:

[0037] Step 1: Parse the document, obtain the font information in the document, and obtain the character mapping table from that information;

[0038] The font information includes the baseline, original code, font name, Ascent (ascending part), Descent (descending part), and EM Square. As shown in Figure 2, Ascent is the vertical distance above the baseline; in this embodiment, Ascent is 4/5 of the character height. Descent is the vertical distance below the baseline; in this embodiment, Descent is 1/5 of the character height. EM S...
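The Ascent and Descent proportions given for this embodiment can be illustrated with a short sketch. It is not taken from the patent: the function names split_height and baseline_row are hypothetical, and the only values assumed are the 4/5 and 1/5 proportions stated above.

```python
from typing import Tuple


def split_height(char_height: float) -> Tuple[float, float]:
    """Split a character height into (ascent, descent).

    Uses the proportions stated for this embodiment:
    ascent = 4/5 of the height, descent = 1/5 of the height.
    """
    ascent = char_height * 4 / 5
    descent = char_height * 1 / 5
    return ascent, descent


def baseline_row(image_height: int) -> int:
    """Row index of the baseline, counted from the top of a glyph image.

    Hypothetical helper: assumes the image spans exactly ascent + descent,
    so the baseline sits `ascent` pixels below the top edge.
    """
    ascent, _ = split_height(image_height)
    return round(ascent)
```

For a font image 50 pixels tall, baseline_row(50) places the baseline 40 rows from the top, leaving 10 rows for the descending part.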


Abstract

The invention discloses a document text extraction method and device, belonging to the field of data processing. The method includes: Step 1: parsing the document, obtaining the font information in the document, and obtaining a character mapping table from that information; Step 2: obtaining the font image corresponding to each character according to the font information; Step 3: cropping the font image to obtain the inked area of the font image; Step 4: performing character recognition on the inked area to obtain the recognition result for each character; Step 5: updating the character mapping table according to the recognition results, and extracting text information from the document according to the updated character mapping table. The invention improves the data processing flow and reduces the data processing workload, so that embedded fonts with randomized character codes are no longer an obstacle to data processing. For such layout documents, the correct text information can be obtained without recognizing the page image, which minimizes manual intervention and preserves the format and logical information of the document.
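Steps 3 to 5 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: font images are modeled as lists of 0/1 rows, and the recognizer is a hypothetical exact template matcher standing in for a real character recognition engine. All names (crop_inked_area, make_template_recognizer, update_mapping, extract_text) are illustrative.

```python
from typing import Callable, Dict, List, Optional

Bitmap = List[List[int]]  # rows of 0 (blank) / 1 (ink) pixels


def crop_inked_area(bitmap: Bitmap) -> Bitmap:
    """Step 3: crop a font image to the bounding box of its inked pixels."""
    if not bitmap:
        return []
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    if not rows or not cols:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in bitmap[rows[0]:rows[-1] + 1]]


def make_template_recognizer(
    templates: Dict[str, Bitmap]
) -> Callable[[Bitmap], Optional[str]]:
    """Step 4 stand-in: exact match against known templates (assumption)."""
    cropped = {ch: crop_inked_area(t) for ch, t in templates.items()}

    def recognize(inked: Bitmap) -> Optional[str]:
        for ch, template in cropped.items():
            if template == inked:
                return ch
        return None  # unrecognized glyph

    return recognize


def update_mapping(
    mapping: Dict[int, str],
    font_images: Dict[int, Bitmap],
    recognize: Callable[[Bitmap], Optional[str]],
) -> Dict[int, str]:
    """Step 5a: replace each code's entry with its recognition result."""
    updated = dict(mapping)
    for code, image in font_images.items():
        result = recognize(crop_inked_area(image))
        if result is not None:
            updated[code] = result
    return updated


def extract_text(codes: List[int], mapping: Dict[int, str]) -> str:
    """Step 5b: decode the document's character codes with the mapping."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)
```

Because recognition runs once per distinct character code in the font rather than once per page image, the cost in this sketch scales with the number of font images, not with the length of the document.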

Description

Technical field

[0001] The invention belongs to the field of data processing, and relates to a text extraction method and device for documents.

Background art

[0002] When fonts are embedded during the creation of a layout document, some manufacturers assign random character codes in order to prevent the text in the document from being copied, so the text obtained when this type of document is exported is garbled. At present, this type of layout document is processed as follows: the entire document is rendered page by page into layout images, an OCR recognition engine is used to recognize the images, the layout is proofread, the text is then proofread, and the resulting text is exported. In this process, some OCR recognition engines have poor noise resistance: when the document layout is chaotic or contains background text, OCR recognition accuracy is low, and in the layout proofreading, especially...


Application Information

IPC(8): G06F17/22, G06K9/20
Inventor: 楼永植, 陈峻峰
Owner: HANVON CORP