Unicode traditional Mongolian language normalization method based on glyph similarity

A normalization technology for traditional Mongolian based on glyph similarity, applied in electrical digital data processing, special data processing applications, instruments, etc., to improve the detection rate, reduce data sparsity, and improve translation quality

Inactive Publication Date: 2017-03-01
XIAMEN UNIV

AI Technical Summary

Problems solved by technology

[0004] The purpose of the invention is to solve the problems caused by Unicode traditional Mongolian homographs, overcome the deficiencies of existing methods, and provide a Unicode traditional Mongolian normalization method based on glyph similarity.

Examples

Embodiment 1

[0046] The method for generating the Unicode traditional Mongolian homograph table described in this embodiment comprises the following steps:

[0047] S1. Use a Unicode-encoded traditional Mongolian corpus to count the Unicode traditional Mongolian vocabulary.

[0048] This embodiment uses a Unicode-encoded traditional Mongolian Internet corpus (http://cloudtranslation.cc/corpus_minority.html) with a scale of 150 million words.

[0049] S2. Select a word from the Unicode traditional Mongolian vocabulary, use the Unicode traditional Mongolian homograph alphabet to generate all candidate spellings that may share the current word's written form, and use image matching to keep only the candidates that are true homographs, i.e. that render to the same word form.

[0050] The Unicode traditional Mongolian homograph alphabet includes 22 homograph replacement rules:

[0051] (1) U+1820(a) can be replaced by U+1821(e);

[0052] (2) U+1821(e) can be replaced by U+1820(a);

[0053] (3) U+1823(o) can be replac...
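Step S2 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it encodes only rules (1) and (2) as listed above (the remaining rules are cut off in this text and are deliberately not reconstructed), it applies each rule at most once per character position, and the image-matching step is represented by a hypothetical caller-supplied predicate `same_glyphs`, since the rendering and comparison procedure is not given here. The names `REPLACEMENTS`, `candidate_spellings`, and `homographs` are likewise assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, List, Set

# Only rules (1) and (2) from the text; the other rules are truncated in the
# source and are intentionally left out rather than guessed.
REPLACEMENTS: Dict[str, Set[str]] = {
    "\u1820": {"\u1821"},  # rule (1): U+1820 (a) can be replaced by U+1821 (e)
    "\u1821": {"\u1820"},  # rule (2): U+1821 (e) can be replaced by U+1820 (a)
}


def candidate_spellings(word: str,
                        rules: Dict[str, Set[str]] = REPLACEMENTS) -> List[str]:
    """Return every spelling obtained by optionally replacing each character.

    Assumption: each position is replaced at most once (no chained rules).
    The original word is always included in the result.
    """
    options = [sorted({ch} | rules.get(ch, set())) for ch in word]
    return ["".join(chars) for chars in product(*options)]


def homographs(word: str,
               same_glyphs: Callable[[str, str], bool],
               rules: Dict[str, Set[str]] = REPLACEMENTS) -> List[str]:
    """Keep only candidates that the (hypothetical) image matcher accepts."""
    return [
        cand
        for cand in candidate_spellings(word, rules)
        if cand != word and same_glyphs(word, cand)
    ]
```

In practice `same_glyphs` would render both spellings with a traditional Mongolian font and compare the resulting images; passing it in as a function keeps the candidate generation independent of that choice. Note that the number of candidates grows exponentially with the number of replaceable characters in a word.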

Embodiment 2

[0081] The Mongolian search engine system described in this embodiment adopts the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian, where the Unicode traditional Mongolian homograph table is the one described in Embodiment 1, containing 84,611 equivalence classes.

[0082] Specifically, the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian is used to normalize both the traditional Mongolian web pages fetched by the crawler and the query requests entered by users.
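The point of normalizing both sides is that a page and a query written with different code points for the same word form end up as the same string. A minimal sketch of that idea, assuming a hypothetical `table` dictionary that maps each recorded spelling to the normalized form of its equivalence class, a toy in-memory inverted index, and plain whitespace tokenization (none of which is specified in the text):

```python
from collections import defaultdict
from typing import Dict, Set


def normalize_text(text: str, table: Dict[str, str]) -> str:
    """Replace each whitespace-separated word by its normalized form.

    Words not recorded in the table are left unchanged. Simplification:
    str.split() collapses every run of whitespace into a single space.
    """
    return " ".join(table.get(word, word) for word in text.split())


def build_index(pages: Dict[str, str],
                table: Dict[str, str]) -> Dict[str, Set[str]]:
    """Index crawled pages by their normalized words (toy inverted index)."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in normalize_text(text, table).split():
            index[word].add(url)
    return index


def search(index: Dict[str, Set[str]], query: str,
           table: Dict[str, str]) -> Set[str]:
    """Return pages containing every normalized query word."""
    words = normalize_text(query, table).split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

With this arrangement a query finds a page regardless of which spelling either side used, which is the effect the comparative experiment below measures through the number of matching items.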

[0083] In order to verify the effectiveness of the technical solution of the present invention, relevant comparative experiments have been carried out. In the experiment, the "site:" command was used to limit the retrieval range of the search engine to two Mongolian websites, "www.mgyxw.net" and "mgl.nmg.gov.cn". The number of matching items detected by the search engine according to the query request is shown in Table 1...

Embodiment 3

[0088] The traditional Mongolian-to-Chinese statistical machine translation system described in this embodiment adopts the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian, where the Unicode traditional Mongolian homograph table is the one described in Embodiment 1, containing 84,611 equivalence classes.

[0089] Specifically, the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian is used to normalize both the training data and the text to be translated that the user inputs.

[0090] In order to verify the effectiveness of the technical solution of the present invention, relevant comparative experiments have been carried out. The experiment uses a phrase-based statistical machine translation system (http://cloudtranslation.cc/mt); the training corpus consists of Chinese legal texts and government work reports, comprising 59,000 parallel sentence pairs, and the test corpus includes government...

Abstract

The invention provides a Unicode traditional Mongolian normalization method based on glyph similarity and relates to the fields of text normalization, traditional Mongolian encoding and the like. Each word of an input Unicode-encoded traditional Mongolian text is traversed, and every word recorded in a Unicode traditional Mongolian homograph table is replaced by the normalized encoding form of its equivalence class, yielding a normalized traditional Mongolian text. In this way, Unicode-encoded traditional Mongolian homographs can be effectively normalized, and the data sparsity of statistical language models for traditional Mongolian is reduced. To build the table, a Unicode traditional Mongolian vocabulary is counted from a Unicode-encoded traditional Mongolian corpus, homographs of the words in the vocabulary are generated according to a Unicode traditional Mongolian homograph alphabet and an image matching algorithm, and the homograph table is obtained by merging equivalence classes. The Unicode traditional Mongolian homograph alphabet includes 22 homograph replacement rules.
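The "merging equivalence classes" step mentioned in the abstract can be illustrated with a union-find structure. This is a sketch under stated assumptions rather than the patented procedure: the input is taken to be a list of spelling pairs already confirmed as homographs, and the normalized form of each class is assumed to be its most frequent member in the corpus (ties broken by code point order). The text here does not say how the normalized form is chosen, and the function name `build_homograph_table` is made up for the example.

```python
from typing import Dict, Iterable, List, Tuple


def build_homograph_table(pairs: Iterable[Tuple[str, str]],
                          freq: Dict[str, int]) -> Dict[str, str]:
    """Merge homograph pairs into classes and map each word to one form.

    Assumption: the representative is the most frequent member of a class,
    with ties broken by plain string (code point) order.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union every confirmed homograph pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group words by the root of their class.
    classes: Dict[str, List[str]] = {}
    for word in list(parent):
        classes.setdefault(find(word), []).append(word)

    # Map every member to the chosen representative of its class.
    table: Dict[str, str] = {}
    for members in classes.values():
        rep = min(members, key=lambda w: (-freq.get(w, 0), w))
        for word in members:
            table[word] = rep
    return table
```

Because homography is transitive (if A looks like B and B looks like C, all three share one written form), merging pairwise matches this way yields one class per distinct word form, and normalization then becomes a single dictionary lookup per word.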

Description

Technical field

[0001] The invention relates to the fields of text normalization, traditional Mongolian encoding and the like, and in particular to a method for normalizing homographs, that is, words that share the same written form but have different internal codes in the Unicode encoding of traditional Mongolian.

Background

[0002] In the Unicode encoding of traditional Mongolian, characters are encoded according to their corresponding letters, so characters with the same glyph may correspond to different code points. For example, the letter a takes different glyphs at the beginning, in the middle, and at the end of a word, yet all of them correspond to the same code point (U+1820); conversely, the letters o and u have the same glyph at the end of a word but different code points (U+1823 and U+1824). As a result of this encoding principle, a single word form in Unicode-encoded traditional Mongolian text may have many different internal codes. For example, in traditional Mon...

Application Information

IPC(8): G06F17/22
CPC: G06F40/129
Inventors: 史晓东, 王博立
Owner: XIAMEN UNIV