Content-based retrieval method for Chinese documents in image format

An image format document retrieval technology applied in the field of information processing. It addresses the ineffective handling of image format documents with degraded characters, and offers a retrieval method that is simple, low in cost and fast.

Inactive Publication Date: 2010-10-20
HARBIN INST OF TECH
3 Cites, 26 Cited by

AI Technical Summary

Problems solved by technology

[0004] To solve the problem that existing OCR-based retrieval methods cannot effectively handle image format documents with serious character degradation, the present invention provides a content-based retrieval method for Chinese documents in image format.



Examples


Specific Embodiment 1

[0024] Specific Embodiment 1: This embodiment is described with reference to Figures 1 and 2 of the specification. The content-based retrieval method for Chinese documents in image format of this embodiment comprises the following steps:

[0025] Step 1: Obtain the Chinese documents in image format to be retrieved, and perform character segmentation on each document to obtain the single character images it contains;

[0026] Step 2: Extract a character image feature vector from each acquired single character image;

[0027] Step 3: Based on the principle of locality-sensitive hashing (LSH), construct a hash function h, transform the feature vector of each character image into a corresponding pseudocode, and build a character index database from the pseudocodes. The pseudocode consists of L 16-bit intege...
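Steps 2 and 3 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented construction: the hash family here is random-hyperplane sign hashing, swapped in because it naturally yields 16 bits per integer, and the names make_tables, pseudocode and build_index, as well as the (doc_id, position, feature_vector) record layout, are hypothetical.

```python
import random
from collections import defaultdict

BITS = 16  # each pseudocode component is one 16-bit integer


def make_tables(dim, L, seed=0):
    """Draw L groups of 16 random hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(BITS)]
            for _ in range(L)]


def pseudocode(feature, tables):
    """Map a feature vector to a tuple of L integers in [0, 65535]."""
    code = []
    for planes in tables:
        value = 0
        for plane in planes:
            dot = sum(p * x for p, x in zip(plane, feature))
            # One bit per hyperplane: which side the vector falls on.
            value = (value << 1) | (1 if dot >= 0 else 0)
        code.append(value)
    return tuple(code)


def build_index(char_features, tables):
    """Index characters by pseudocode component.

    char_features: list of (doc_id, position, feature_vector) records
    (a hypothetical layout). Returns one dict per table mapping a 16-bit
    value to the list of (doc_id, position) entries that share it.
    """
    index = [defaultdict(list) for _ in tables]
    for doc_id, pos, feat in char_features:
        for table_no, value in enumerate(pseudocode(feat, tables)):
            index[table_no][value].append((doc_id, pos))
    return index
```

In this sketch, similar feature vectors tend to agree on many of the L integers, so a lookup in each of the L tables returns candidate characters without any OCR step.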

Specific Embodiment 2

[0040] Specific Embodiment 2: This embodiment further describes Embodiment 1. In step 3 of Embodiment 1, the hash function h is constructed as follows: first define the vertex set of a regular polytope in m-dimensional space and a rotation matrix A; then establish the hash function, whose argument is a unit vector, together with the set of results to which the hash function maps.
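A hash of this kind can be sketched as follows. The record does not reproduce the formulas, so this example assumes one particular regular polytope, the orthoplex (cross-polytope) with vertices ±e_i, and hashes a vector to the index of the vertex nearest to its rotated, normalized image. The function names random_rotation and orthoplex_hash are hypothetical, and the Gram-Schmidt construction yields a random orthogonal matrix, which may include a reflection as well as a rotation.

```python
import math
import random


def random_rotation(m, seed=0):
    """Random m x m orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for r in rows:
            # Remove the component of v along each row already accepted.
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def orthoplex_hash(x, A):
    """Index in 0 .. 2m-1 of the orthoplex vertex nearest to A x / |x|.

    Vertex 2i is +e_i and vertex 2i+1 is -e_i (an assumed numbering).
    """
    norm = math.sqrt(sum(a * a for a in x))
    y = [sum(a * b for a, b in zip(row, x)) / norm for row in A]
    # The nearest vertex is the axis with the largest absolute coordinate,
    # taken with the sign of that coordinate.
    i = max(range(len(y)), key=lambda k: abs(y[k]))
    return 2 * i + (0 if y[i] >= 0 else 1)
```

With this numbering the result set is {0, 1, ..., 2m-1}; using several independent rotation matrices gives several hash values per feature vector.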

Specific Embodiment 3

[0041] Specific Embodiment 3: This embodiment further describes Embodiment 1 or 2. In step 3 of Embodiment 1 or 2, the number L of 16-bit integers in the pseudocode ranges from 1 to 50.


Abstract

The invention relates to the technical field of information processing, in particular to a content-based retrieval method for Chinese documents in image format, which solves the problem that existing retrieval methods based on OCR technology cannot effectively process image format documents with serious character degradation. The method comprises the following steps: first, performing character segmentation on the image format document to obtain single character images; second, extracting a feature vector from each character image; third, constructing a hash function based on the principle of locality-sensitive hashing (LSH), transforming the feature vector of each character image into a pseudocode, and building a character index database; and fourth, inputting a query keyword to obtain its pseudocode representation, comparing the pseudocode of the query keyword with the pseudocodes in the character index database for character similarity to obtain all words similar to the query keyword, and outputting the similar words in the order in which they appear in the document, which completes the retrieval. The invention is applicable to the retrieval of Chinese documents in image format.
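The fourth step can be illustrated with a short Python sketch. It assumes the query keyword has already been converted to one pseudocode per character, and it assumes a similarity measure (the fraction of the L integers on which two pseudocodes agree) and a threshold that the abstract does not specify; char_similarity and search are hypothetical names.

```python
def char_similarity(code_a, code_b):
    """Fraction of the L pseudocode integers on which two characters agree."""
    same = sum(1 for a, b in zip(code_a, code_b) if a == b)
    return same / len(code_a)


def search(query_codes, doc_codes, threshold=0.5):
    """Return start positions, in document order, where every query
    character is similar enough to the document character it aligns with."""
    hits = []
    n = len(query_codes)
    for start in range(len(doc_codes) - n + 1):
        window = doc_codes[start:start + n]
        if all(char_similarity(q, d) >= threshold
               for q, d in zip(query_codes, window)):
            hits.append(start)
    return hits
```

For example, with L = 2, searching for the two-character query [(3, 4), (5, 6)] in the document [(1, 2), (3, 4), (5, 6), (3, 4), (5, 9)] returns positions 1 and 3 at the default threshold, and only position 1 when the threshold is raised to 1.0.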

Description

Technical field

[0001] The invention relates to the technical field of information processing, in particular to a content-based retrieval method for Chinese documents in image format.

Background art

[0002] The digital storage and retrieval of paper documents is of far-reaching significance for information acquisition and office automation. For storage, a scanner or digital camera is generally used to convert paper documents into images, that is, into image format documents. Two examples of image format documents are shown in Figure 1. How to retrieve from large-scale datasets of image format documents is a very challenging problem, and it has been an active research topic in recent years.

[0003] To retrieve documents in image format, the document is generally first converted into ASCII text using OCR technology, which is relatively mature at present. Because OCR will have recognition errors and some original informa...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/46
Inventor: 夏勇, 王宽全, 左旺孟, 黎捷
Owner: HARBIN INST OF TECH