Word position tagging-based Tibetan word segmentation method

A Tibetan word segmentation method based on word position (lexeme) tagging. It addresses the poor handling of segmentation ambiguity in existing methods and simplifies the design of the segmenter.

Inactive Publication Date: 2011-07-27

INST OF SOFTWARE - CHINESE ACAD OF SCI

3 Cites, 12 Cited by

Problems solved by technology

[0007] Existing Tibetan word segmentation methods handle two important problems poorly: ambiguity and unregistered (out-of-vocabulary) words. The purpose of the present invention is to provide a Tibetan word segmentation method that achieves better overall segmentation results.

Examples

Embodiment 1

[0031] Example 1: The word segmentation process of a typical Tibetan sentence

[0032] For input Tibetan text 302:

[0033] Step 304 splits it into a Tibetan sentence at the Tibetan single vertical stroke (shad);

[0034] Step 306 divides the Tibetan sentence into a series of Tibetan syllables (separated by slashes here), and the result after the segmentation is:

[0035] Step 308 attaches a lexeme label to each syllable. The label is written after a slash, and the result after labeling is:

[0036] Step 312 splits each syllable marked J and restores it into two syllables. The result after processing is (the part affected by this step is underlined, the same below):

[0037] Step 314 merges each syllable marked B with the syllable marked E that follows it into one word, and the result after processing is:

[0038] In step 316, all syllables marked as S and all unmerged syllables are used as monosyllabic word...
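Steps 304 and 306 above amount to two delimiter-based splits. The following is a minimal sketch, assuming the single vertical stroke is the Unicode shad (U+0F0D) and the syllable separator is the tsheg (U+0F0B). The function names are hypothetical, and real text may need extra handling for other Tibetan punctuation that this sketch ignores.

```python
SHAD = "\u0f0d"   # TIBETAN MARK SHAD, the single vertical stroke
TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG, the syllable dot


def split_sentences(text):
    """Step 304 (sketch): cut the text at each shad and drop empty pieces."""
    return [piece.strip() for piece in text.split(SHAD) if piece.strip()]


def split_syllables(sentence):
    """Step 306 (sketch): cut a sentence at each tsheg and drop empty pieces."""
    return [piece.strip() for piece in sentence.split(TSHEG) if piece.strip()]
```

Dropping empty pieces means a tsheg directly before a shad does not produce a spurious empty syllable.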

Embodiment 2

[0040] Example 2: The word segmentation process of another typical Tibetan sentence

[0041] For input Tibetan text 302:

[0042] Step 304 splits it into a Tibetan sentence at the Tibetan single vertical stroke (shad);

[0043] Step 306 divides the Tibetan sentence into a series of Tibetan syllables (separated by slashes here), and the result after the segmentation is:

[0044] Step 308 attaches a lexeme label to each syllable. The label is written after a slash, and the result after labeling is:

[0045] Step 312 splits each syllable marked J and restores it into two syllables. The result after processing is:

[0046] Step 314 merges each syllable marked B, the first syllable marked E after it, and the one or more syllables marked M between them into one word, and the result after processing is:

[0047] In step 316, all syllables marked as S and all unmerged syllables are used as monosyllabic words, and the resul...
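Steps 312 to 316 can be sketched as a single pass over (syllable, label) pairs. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, the caller supplies the rule for splitting a contracted syllable (the patent's own rule is not shown in this excerpt), and the sketch assumes that a B which is not followed by M labels and then an E is simply left as a monosyllabic word.

```python
def split_at_apostrophe(syllable):
    """Hypothetical splitter for romanized placeholders such as "po'i"."""
    cut = syllable.index("'")
    return syllable[:cut], syllable[cut:]


def recombine(tagged, split_contracted):
    """Turn (syllable, label) pairs into words, each a list of syllables."""
    # Step 312: restore each J syllable into two syllables labeled E and S.
    restored = []
    for syllable, label in tagged:
        if label == "J":
            first, second = split_contracted(syllable)
            restored.append((first, "E"))
            restored.append((second, "S"))
        else:
            restored.append((syllable, label))

    words = []
    i = 0
    while i < len(restored):
        syllable, label = restored[i]
        if label == "B":
            # Step 314: skip over any M labels, then require an E to close.
            j = i + 1
            while j < len(restored) and restored[j][1] == "M":
                j += 1
            if j < len(restored) and restored[j][1] == "E":
                words.append([s for s, _ in restored[i:j + 1]])
                i = j + 1
                continue
        # Step 316: S syllables and anything left unmerged stand alone.
        words.append([syllable])
        i += 1
    return words
```

With made-up romanized placeholders (not taken from the patent), the input `[("rgyal", "B"), ("po'i", "J")]` and `split_at_apostrophe` yield `[["rgyal", "po"], ["'i"]]`: the restored first half closes the word opened by B, and the second half becomes its own word.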

Abstract

The invention discloses a Tibetan word segmentation method based on word position tagging, which belongs to the field of Tibetan information processing. The method comprises the following steps: 1) segmenting an input Tibetan text into a series of Tibetan sentences, using punctuation marks as delimiters; 2) segmenting each Tibetan sentence into a series of Tibetan syllables, using syllable dots as delimiters; 3) for each Tibetan syllable, looking up and selecting a word position tag from a knowledge base according to the context of the syllable, and assigning that tag to the syllable; 4) restoring every syllable tagged as contracted into two syllables, and tagging the two syllables, in order, as word-final and independent; 5) merging into one word the syllables from each syllable tagged as word-initial through the first following syllable tagged as word-final; and 6) taking all syllables tagged as independent and all unmerged syllables as monosyllabic words. All processing is performed at the syllable level, without explicitly distinguishing vocabulary words from unregistered words, so word segmentation reduces to a simple process of recombining syllables.
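Step 3 of the abstract selects a tag for each syllable from a knowledge base according to its context. The excerpt does not describe the structure of that knowledge base, so the following is only a sketch under stated assumptions: the knowledge base is modeled as two plain dictionaries, one keyed by (previous, current, next) syllable and a fallback keyed by the current syllable alone, and unknown syllables default to the independent tag S. All names and the back-off order are hypothetical.

```python
def tag_syllables(syllables, context_kb, syllable_kb, default="S"):
    """Assign one word position tag per syllable (assumed lookup scheme)."""
    # Pad with boundary markers so the first and last syllables have context.
    padded = ["<s>"] + list(syllables) + ["</s>"]
    tagged = []
    for i in range(1, len(padded) - 1):
        prev, cur, nxt = padded[i - 1], padded[i], padded[i + 1]
        # Prefer the entry that matches the full context.
        tag = context_kb.get((prev, cur, nxt))
        if tag is None:
            # Back off to a context-free entry, then to the default tag.
            tag = syllable_kb.get(cur, default)
        tagged.append((cur, tag))
    return tagged
```

The output is a list of (syllable, tag) pairs, which is the shape the later restoring and merging steps operate on.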

Description

Technical field

[0001] The invention relates to the field of computing and Tibetan information processing, more specifically to Tibetan word segmentation, and provides a Tibetan word segmentation method based on lexeme labeling.

Background

[0002] As computer support for the Tibetan language improves and informatization advances in China's ethnic minority areas, more and more Tibetan information is stored and disseminated by computer. Research on Tibetan information processing is gradually shifting from basic character-level support (operating system support, typesetting and printing, input methods, and fonts) to text-level tasks such as text recognition, text-to-speech conversion, text proofreading, information retrieval, and machine translation. However, Tibetan is written in a phonetic script in which syllables are separated by syllable dots, but there is no separ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/27

Inventor: 刘汇丹, 吴健, 诺明花, 马龙龙

Owner: INST OF SOFTWARE - CHINESE ACAD OF SCI
