Unicode traditional Mongolian language normalization method based on glyph similarity
A glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian, applied in electrical digital data processing and special data processing applications, that improves the retrieval detection rate, reduces data sparsity, and improves translation quality.
Examples
Embodiment 1
[0046] The method for generating the Unicode traditional Mongolian homograph table described in this embodiment comprises the following steps:
[0047] S1. Use a Unicode-encoded traditional Mongolian corpus to compile the Unicode traditional Mongolian vocabulary.
[0048] This embodiment uses a Unicode-encoded traditional Mongolian Internet corpus (http://cloudtranslation.cc/corpus_minority.html) of 150 million words.
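Step S1 can be illustrated with a minimal sketch. It assumes the corpus is available as an iterable of text lines and that word forms are separated by ordinary spaces; the text does not specify the tokenization, and the name `build_vocabulary` is hypothetical.

```python
from collections import Counter


def build_vocabulary(lines):
    """Count the word forms in an iterable of text lines.

    Assumption: words are delimited by the ASCII space only, so that
    Mongolian-specific separators inside a word are left untouched.
    """
    counts = Counter()
    for line in lines:
        # Drop the line terminator, then skip empty tokens from repeated spaces.
        tokens = [tok for tok in line.rstrip("\n").split(" ") if tok]
        counts.update(tokens)
    return counts


# Tiny stand-in corpus built from the code points named in the rules.
sample = ["\u1820\u1821 \u1823\n", "\u1820\u1821\n"]
vocab = build_vocabulary(sample)
```

The keys of `vocab` form the vocabulary from which step S2 draws its words; the counts are kept only because they come for free.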
[0049] S2. Select a word from the Unicode traditional Mongolian vocabulary, use the Unicode traditional Mongolian homograph alphabet to generate every candidate spelling that may share the current word's written form, and use image matching to retain only the candidates whose rendered form is identical to that of the current word; these are its homographs.
[0050] The Unicode traditional Mongolian homograph alphabet comprises 22 homograph replacement rules:
[0051] (1) U+1820(a) can be replaced by U+1821(e);
[0052] (2) U+1821(e) can be replaced by U+1820(a);
[0053] (3) U+1823(o) can be replac...
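Step S2 can be sketched as follows, under stated assumptions. Only rules (1) and (2), which are fully legible above, are encoded; the remaining rules are omitted rather than guessed. The `render` argument is a hypothetical callable that returns a comparable representation of a word's rendered image (for example, the bytes of a rasterized bitmap); a real one would need a Mongolian-capable font and shaping engine, which is not shown here.

```python
from itertools import product

# Only the two rules that are fully legible in the text above.
REPLACEMENTS = {
    "\u1820": ["\u1821"],  # rule (1): U+1820 (a) can be replaced by U+1821 (e)
    "\u1821": ["\u1820"],  # rule (2): U+1821 (e) can be replaced by U+1820 (a)
}


def candidates(word, rules=REPLACEMENTS):
    """Yield every spelling obtained by keeping or replacing each letter."""
    options = [[ch] + rules.get(ch, []) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def homographs(word, render, rules=REPLACEMENTS):
    """Return the candidates whose rendered image equals that of `word`.

    `render` is an assumed, caller-supplied function: str -> comparable image.
    """
    target = render(word)
    return sorted(c for c in set(candidates(word, rules)) if render(c) == target)
```

The number of candidates grows exponentially with the number of replaceable letters in a word, so a practical implementation would likely prune or cache renderings; the text does not say how this is handled.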
Embodiment 2
[0081] The Mongolian search engine system described in this embodiment adopts the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian, in which the Unicode traditional Mongolian homograph table is the one described in Embodiment 1, containing 84,611 equivalence classes.
[0082] Specifically, the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian is used to normalize both the traditional Mongolian web pages fetched by the crawler and the query requests entered by users.
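The way a homograph table can make a page and a query match is sketched below. The equivalence class shown is a made-up placeholder, not an entry from the actual table, and choosing the smallest string by code point as the class representative is an assumption; the text does not say how the canonical form is selected. The names `build_canonical_map` and `normalize` are hypothetical.

```python
def build_canonical_map(equivalence_classes):
    """Map every member of each class to a single representative.

    Assumption: the representative is the smallest member by code point order.
    """
    canonical = {}
    for cls in equivalence_classes:
        representative = min(cls)
        for form in cls:
            canonical[form] = representative
    return canonical


def normalize(text, canonical):
    """Replace each space-separated word by its class representative, if any."""
    return " ".join(canonical.get(tok, tok) for tok in text.split(" "))


# Placeholder class: two spellings assumed (for illustration only) to look alike.
table = build_canonical_map([{"\u1820\u1821", "\u1821\u1821"}])

page = "\u1821\u1821 \u1823"   # crawled text uses one spelling
query = "\u1820\u1821"         # the user typed the other spelling
normalized_page = normalize(page, table)
normalized_query = normalize(query, table)
```

After normalization the query string occurs in the page text, whereas the raw query does not occur in the raw page.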
[0083] In order to verify the effectiveness of the technical solution of the present invention, relevant comparative experiments have been carried out. In the experiment, the "site:" command was used to limit the retrieval range of the search engine to two Mongolian websites, "www.mgyxw.net" and "mgl.nmg.gov.cn". The number of matching items detected by the search engine according to the query request is shown in Table 1...
Embodiment 3
[0088] The traditional Mongolian to Chinese statistical machine translation system described in this embodiment adopts the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian, in which the Unicode traditional Mongolian homograph table is the one described in Embodiment 1, containing 84,611 equivalence classes.
[0089] Specifically, the glyph-similarity-based normalization method for Unicode-encoded traditional Mongolian is used to normalize both the training data and the text to be translated that the user inputs.
[0090] In order to verify the effectiveness of the technical solution of the present invention, relevant comparative experiments have been carried out. The experiment uses a phrase-based statistical machine translation system (http://cloudtranslation.cc/mt). The training corpus consists of Chinese legal texts and government work reports, comprising 59,000 parallel sentence pairs, and the test corpus includes government...


