Method and device for word segmentation of Thai texts

A text segmentation technology for the Thai language, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems that the text has no obvious word delimiters and that word segmentation is therefore complicated, and it improves recognition ability, usability and accuracy.

Active Publication Date: 2013-09-25
BEIJING BAIDU NETCOM SCI & TECH CO LTD
5 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

In texts written in some other languages, word segmentation is much more complicated.
For example, in Chinese, only characters, sentences and paragraphs can be easily demarcated by obvious delimiters (such as punctuation marks and newlines); there is no obvious delimiter at the "word" level.


Embodiment Construction

[0070] To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments shall fall within the protection scope of the present invention.

[0071] Thai has no natural segmentation markers such as spaces, and uses very few punctuation marks. For example, a Thai text string meaning "find address from phone number" is actually formed from the words (Search|Address|From|Number|Phone); however, under ordinary Thai writing conventions the string is written without separators, so this split is difficult to perform. Therefore, in order to reali...


Abstract

The invention discloses a method and device for word segmentation of Thai text. The method comprises the following steps: segmenting a text string to be segmented according to a dictionary matching algorithm; if unmatched parts exist, combining the characters of the unmatched parts into syllables using a preset syllable combination template; and obtaining a first word segmentation result from the matched parts and the combined syllables. In the technical scheme provided by the embodiment of the invention, basic word segmentation of Thai text is performed by the dictionary matching algorithm, and, because many irregular spellings occur in Thai in practice, segmentation also draws on phonological (syllable) rules. This improves the recognition of words not listed in the dictionary and improves the usability and accuracy of the word segmentation result.
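The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: forward maximum matching is assumed as the dictionary matching algorithm, Latin letters stand in for Thai characters, and the toy consonant-vowel pattern `SYLLABLE_TEMPLATE`, the helper names, and the sample lexicon are all hypothetical.

```python
import re

# Hypothetical stand-in for the "preset syllable combination template":
# optional consonants followed by vowels, or a leftover consonant run.
SYLLABLE_TEMPLATE = re.compile(r"[^aeiou]*[aeiou]+|[^aeiou]+")


def combine_syllables(chars):
    """Group the characters of an unmatched part into syllables."""
    return SYLLABLE_TEMPLATE.findall(chars)


def segment(text, dictionary):
    """Forward maximum matching with a syllable fallback for unmatched parts."""
    max_len = max(map(len, dictionary), default=0)
    result, unmatched, i = [], "", 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                if unmatched:
                    # Flush the pending unmatched part as syllables first.
                    result.extend(combine_syllables(unmatched))
                    unmatched = ""
                result.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: defer this character.
            unmatched += text[i]
            i += 1
    if unmatched:
        result.extend(combine_syllables(unmatched))
    return result


words = {"search", "address", "from"}  # toy lexicon (assumption)
print(segment("searchaddressfrombaidu", words))
# ['search', 'address', 'from', 'bai', 'du']
```

Here "baidu" is not in the toy lexicon, so its characters are grouped into syllables instead of being dropped or emitted one by one, which mirrors how the first word segmentation result combines matched parts with combined syllables.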

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, and in particular to a Thai text word segmentation method and device.

Background

[0002] Word segmentation, also known as word splitting, is the process of recombining a continuous text sequence into a sequence of words according to certain specifications. Word segmentation belongs to natural language processing technology and is mainly used in search engines, text mining and other fields.

[0003] In texts written in Latin-script languages such as English, spaces serve as natural delimiters between words, so word segmentation is relatively simple. In texts written in some other languages, word segmentation is much more complicated. For example, in Chinese, only characters, sentences and paragraphs can be simply demarcated by obvious delimiters (such as punctuation marks and newlines), but there is no obvious delimiter at...
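As a small illustration of the contrast drawn in the background, the sketch below splits on whitespace. The sample strings are assumptions chosen for illustration and do not come from the patent.

```python
# Whitespace splitting recovers words when spaces act as delimiters.
english = "find address from phone number"
english_tokens = english.split()
print(english_tokens)  # ['find', 'address', 'from', 'phone', 'number']

# Text written without spaces comes back as one undivided token,
# so a dedicated word segmentation step is needed.
chinese = "今天天气很好"
chinese_tokens = chinese.split()
print(len(chinese_tokens))  # 1
```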


Application Information

IPC(8): G06F17/27
Inventors: 何径舟, 张超
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD