Word position tagging-based Tibetan word segmentation method

A Tibetan word segmentation method based on word position tagging. It addresses the poor segmentation and disambiguation performance of existing methods and simplifies the design of the segmentation process.

Status: Inactive. Publication date: 2011-07-27
INST OF SOFTWARE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0007] Existing Tibetan word segmentation methods handle two important problems poorly: ambiguity and unregistered words. The purpose of the present invention is to provide a Tibetan word segmentation method that achieves better overall segmentation results.


Examples

Embodiment 1

[0031] Example 1: Word segmentation of a typical Tibetan sentence

[0032] For the input Tibetan text 302:

[0033] Step 304 divides it into Tibetan sentences at the Tibetan single vertical stroke (shad, །);

[0034] Step 306 divides the Tibetan sentence into a series of Tibetan syllables (separated by slashes here). The result after segmentation is:

[0035] Step 308 attaches a word position tag to each syllable (the tag is written after a slash here). The result after tagging is:

[0036] Step 312 splits each syllable tagged J and restores it into two syllables. The result after processing is (the part affected by this step is underlined, the same below):

[0037] Step 314 merges each syllable tagged B with the syllable tagged E that follows it into one word. The result after processing is:

[0038] In step 316, all syllables tagged S and all unmerged syllables are taken as monosyllabic word...
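
Step 312 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the (syllable, tag) pair representation, the names `restore_contracted` and `split_by_suffix`, and the ending list `ASSUMED_ENDINGS` are all hypothetical, and the syllables are Latin placeholders rather than real Tibetan. The two restored syllables are tagged E and S in that order, on the assumption that these are the tags the abstract describes for them.

```python
from typing import Callable, List, Tuple

Tagged = Tuple[str, str]  # (syllable, word position tag)

# Hypothetical contracted endings in a Latin transliteration (illustrative only).
ASSUMED_ENDINGS = ("'i", "'o", "r", "s")


def split_by_suffix(syllable: str) -> Tuple[str, str]:
    """Split a contracted syllable into (stem, ending) using ASSUMED_ENDINGS."""
    for ending in ASSUMED_ENDINGS:
        if syllable.endswith(ending) and len(syllable) > len(ending):
            return syllable[: -len(ending)], ending
    raise ValueError(f"no known ending in {syllable!r}")


def restore_contracted(
    tagged: List[Tagged],
    split: Callable[[str], Tuple[str, str]] = split_by_suffix,
) -> List[Tagged]:
    """Replace every J-tagged syllable with two syllables tagged E and S."""
    restored: List[Tagged] = []
    for syllable, tag in tagged:
        if tag == "J":
            stem, ending = split(syllable)
            restored.append((stem, "E"))    # first part: assumed word-final tag
            restored.append((ending, "S"))  # second part: assumed independent tag
        else:
            restored.append((syllable, tag))
    return restored
```

If the restored first part has no preceding B syllable, it stays unmerged in step 314 and becomes a monosyllabic word in step 316, which is why step 316 also collects unmerged syllables.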

Embodiment 2

[0040] Example 2: Word segmentation of another typical Tibetan sentence

[0041] For the input Tibetan text 302:

[0042] Step 304 divides it into Tibetan sentences at the Tibetan single vertical stroke (shad, །);

[0043] Step 306 divides the Tibetan sentence into a series of Tibetan syllables (separated by slashes here). The result after segmentation is:

[0044] Step 308 attaches a word position tag to each syllable (the tag is written after a slash here). The result after tagging is:

[0045] Step 312 splits each syllable tagged J and restores it into two syllables. The result after processing is:

[0046] Step 314 merges each syllable tagged B, the syllable tagged E that follows it, and the one or more syllables tagged M between them into one word. The result after processing is:

[0047] In step 316, all syllables tagged S and all unmerged syllables are taken as monosyllabic words, and the resul...
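
Steps 314 and 316 amount to a single pass over the tagged syllables. The sketch below is one possible reading, with several assumptions: the function name `recombine` and the (syllable, tag) pair representation are hypothetical, merged syllables are rejoined with the Tibetan syllable dot, and a B syllable is merged only when it is followed by zero or more M syllables and then an E syllable. Any syllable that cannot be merged this way is emitted as a monosyllabic word.

```python
from typing import List, Tuple

TSHEG = "\u0f0b"  # Tibetan syllable dot, used here to rejoin merged syllables


def recombine(tagged: List[Tuple[str, str]]) -> List[str]:
    """Merge B (M)* E runs into words; everything else is a one-syllable word."""
    words: List[str] = []
    i = 0
    while i < len(tagged):
        syllable, tag = tagged[i]
        if tag == "B":
            # Skip over any M syllables that follow the B syllable.
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "M":
                j += 1
            # Merge only if the run is closed by an E syllable.
            if j < len(tagged) and tagged[j][1] == "E":
                words.append(TSHEG.join(s for s, _ in tagged[i : j + 1]))
                i = j + 1
                continue
        # S syllables and all unmerged syllables become monosyllabic words.
        words.append(syllable)
        i += 1
    return words
```

For example, the placeholder sequence `a/B b/M c/E d/S e/E` yields three words: the merged `a`, `b`, `c`, then `d`, then the unmerged `e`.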

Abstract

The invention discloses a Tibetan word segmentation method based on word position tagging, which belongs to the field of Tibetan information processing. The method comprises the following steps: 1) segmenting an input Tibetan text into a series of Tibetan sentences, using punctuation marks as delimiters; 2) segmenting each Tibetan sentence into a series of Tibetan syllables, using syllable dots as delimiters; 3) for each Tibetan syllable, looking up and selecting a word position tag from a knowledge base according to the context of the syllable, and assigning that tag to the syllable; 4) restoring every syllable tagged as contracted into two syllables, and tagging the two syllables as word-final and independent, in that order; 5) combining the syllables from each syllable tagged as word-initial through the first following syllable tagged as word-final into one word; and 6) taking all syllables tagged as independent and all uncombined syllables as monosyllabic words. All processing is performed on syllable-level units, without explicitly distinguishing in-vocabulary words from unregistered words, so word segmentation reduces to a simple syllable recombination process.
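
Steps 1) and 2) of the abstract can be illustrated with a short sketch. It assumes that the only sentence delimiter is the single vertical stroke (U+0F0D) and that the syllable dot is U+0F0B; the abstract speaks of punctuation marks in general, so a fuller implementation would likely handle more delimiters. The function names `split_sentences` and `split_syllables` are hypothetical.

```python
import re
from typing import List

SHAD = "\u0f0d"   # Tibetan single vertical stroke (assumed sentence delimiter)
TSHEG = "\u0f0b"  # Tibetan syllable dot (syllable delimiter)


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at one or more consecutive vertical strokes."""
    parts = re.split(f"{SHAD}+", text)
    return [p.strip() for p in parts if p.strip()]


def split_syllables(sentence: str) -> List[str]:
    """Split a sentence into syllables at syllable dots, dropping empty pieces."""
    return [s.strip() for s in sentence.split(TSHEG) if s.strip()]


def text_to_syllables(text: str) -> List[List[str]]:
    """Return one list of syllables per sentence."""
    return [split_syllables(sentence) for sentence in split_sentences(text)]
```

The output of `text_to_syllables` is the input that the tagging step 3) would consume, one syllable list per sentence.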

Description

Technical field

[0001] The invention relates to the field of computing and Tibetan information processing, more specifically to Tibetan word segmentation, and provides a Tibetan word segmentation method based on word position tagging.

Background

[0002] As computer support for the Tibetan language improves and informatization advances in China's ethnic minority areas, more and more Tibetan information is stored and disseminated by computer. Research on Tibetan information processing is gradually shifting from basic character-level work, such as operating system support, typesetting and printing, input methods, and fonts, to text-level work, such as text recognition, text-to-speech conversion, text proofreading, information retrieval, and machine translation. Tibetan is written in a phonetic script: its syllables are separated by syllable dots, but there is no separ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
Inventors: 刘汇丹 (Liu Huidan), 吴健 (Wu Jian), 诺明花 (Nuo Minghua), 马龙龙 (Ma Longlong)
Owner: INST OF SOFTWARE - CHINESE ACAD OF SCI