Method and system for distinguishing language of document image

A language discrimination technology for document images, applied in character and pattern recognition, instruments, computing, etc. It addresses two problems: existing techniques distinguish poorly between Simplified Chinese and Traditional Chinese, and language discrimination that relies only on dictionary methods is unacceptably slow.

Status: Inactive
Publication Date: 2009-12-02
CANON KK

AI Technical Summary

Problems solved by technology

However, these prior art techniques use only the word with the best recognition confidence to decide the language set.
Because the two language sets, Simplified Chinese and Traditional Chinese, share many characters of the same shape, these existing techniques cannot distinguish well between them.
Also, if only dictionary methods are used to discriminate languages, the speed of language discrimination is unacceptable.


Embodiment Construction

[0072] First, some basic concepts used in this specification are explained below.

[0073] - Language family / language set

[0074] In this specification, a language family / language set refers to a Chinese-based or Latin-based language. The Chinese-based language family includes three East Asian languages, which are Chinese (both Simplified and Traditional), Japanese, and Korean. The Latin-based language family mainly includes European languages.

[0075] - Connected domain

[0076] In an undirected graph, a connected domain is a maximal connected subgraph. Two vertices are in the same connected domain if and only if there is a path between them. When the graph is drawn, each connected domain can be drawn separately, with empty space between them. A non-empty connected graph has at least one connected domain.
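
The definition in [0076] can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `connected_domains` and the breadth-first search are assumptions chosen only to show how vertices joined by a path end up in the same connected domain.

```python
from collections import deque


def connected_domains(vertices, edges):
    """Return the connected domains of an undirected graph as a list of sets.

    Hypothetical helper for illustration; the name is not from the source.
    """
    # Build an adjacency map; each undirected edge is stored in both directions.
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    domains = []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects every vertex reachable from `start`.
        domain = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            for neighbour in adjacency[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    domain.add(neighbour)
                    queue.append(neighbour)
        domains.append(domain)
    return domains
```

For a graph with vertices 1 to 5 and edges (1, 2), (2, 3) and (4, 5), this groups vertices 1, 2 and 3 into one connected domain and vertices 4 and 5 into another, because no path links the two groups.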

[0077] - Circle / circular connected domain

[0078] The shape of a circle / circular connected domain resembles a circle or an ellipse, not a rectangle...


Abstract

The invention provides a method and a system for distinguishing the language of a document image. The method comprises the following steps: detecting circular white-pixel connected domains in a document block of the document file, and determining whether the document block is Korean on the basis of the detected circular white-pixel connected domains.
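
The two steps in the abstract can be sketched roughly in Python. This is only an illustration under stated assumptions, not the patented method: the binary image layout (1 = white, 0 = black), the 4-connectivity, the roundness test based on bounding-box aspect ratio and fill ratio, every threshold value, and all function names are hypothetical, since the excerpt does not give the actual detection criteria.

```python
from collections import deque


def white_regions(image):
    """Yield 4-connected regions of white (1) pixels as lists of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            region, queue = [], deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield region


def looks_circular(region, min_size=5, fill_range=(0.6, 0.9), max_aspect=1.5):
    """Assumed roundness test: near-square bounding box, partly filled.

    A disc fills about pi/4 of its bounding box, a rectangle fills all of it.
    All thresholds are illustrative assumptions, not values from the source.
    """
    ys = [y for y, _ in region]
    xs = [x for _, x in region]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    if min(height, width) < min_size:
        return False
    aspect = max(height, width) / min(height, width)
    fill = len(region) / (height * width)
    return aspect <= max_aspect and fill_range[0] <= fill <= fill_range[1]


def count_circular_white_regions(image):
    """Count enclosed white regions (not touching the border) that look round."""
    rows, cols = len(image), len(image[0])
    count = 0
    for region in white_regions(image):
        touches_border = any(
            y in (0, rows - 1) or x in (0, cols - 1) for y, x in region
        )
        if not touches_border and looks_circular(region):
            count += 1
    return count


def is_probably_korean(image, min_circles=1):
    """Assumed decision rule: enough round white regions suggests Korean."""
    return count_circular_white_regions(image) >= min_circles
```

In this sketch a white region enclosed by a black ring counts as circular, while a white rectangle enclosed by a black frame does not, because it fills its bounding box completely. A practical system would need thresholds tuned on real document blocks.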

Description

Technical field

[0001] The present invention relates to a method and system for language discrimination of document images in East Asian languages, such as Korean, Japanese, Simplified Chinese and Traditional Chinese.

Background technique

[0002] Optical character recognition (OCR) systems are language-dependent systems. An OCR device is trained to recognize a specific language. If the OCR device is run in an inappropriate language, the OCR device will not be able to process the document correctly and will not be able to achieve high accuracy. Therefore, language family or language discrimination is a very important preprocessing step in automatic document recognition.

[0003] Various techniques for classifying language families (Latin-based language families and Chinese-based language families) and discriminating languages have been developed.

[0004] A. L. Spitz discloses a prior art in an article entitled "Determination of the Script and Language Content of Document...


Application Information

IPC(8): G06K9/68, G06K9/72
Inventors: 陈刚, 罗兆海
Owner CANON KK