Maximum entropy classification model and Thai grammar rule correction-based Thai sentence segmentation method

A classification model, a technique of maximum entropy

Pending Publication Date: 2018-09-04
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] The invention provides a Thai sentence segmentation method based on the maximum entropy classification model and Thai grammar rule cor



Examples


Embodiment 1

[0044] Embodiment 1: As shown in Figures 1-2, a Thai sentence segmentation method based on a maximum entropy classification model and Thai grammar rule correction comprises the following specific steps:

[0045] Step 1. Collect and preprocess raw material for Thai sentence segmentation to construct a Thai text corpus; then perform Thai word segmentation and part-of-speech tagging on the Thai text corpus to construct the structured Thai text corpus required for Thai sentence segmentation research;

[0046] Step 1.1. Use web crawler technology to collect Thai text from Thai news sites and e-books on the Internet, then filter, deduplicate and denoise the collected text to construct a Thai text corpus;

[0047] Step 1.2. Use a Thai word segmentation tool and a Thai part-of-speech tagging tool to segment and tag the Thai text corpus, then proofread the result manually to build a structured Thai text ...
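The preprocessing in Step 1.1 (filter, deduplicate, denoise) can be illustrated with a minimal sketch. The source does not say which noise types, filters or thresholds the inventors used, so everything below is an assumption made for illustration: the function names `denoise`, `thai_ratio` and `build_corpus`, the HTML-tag stripping, the 0.5 Thai-character threshold, and exact-match deduplication by hash. Runs of spaces are collapsed to a single space rather than deleted, because the later steps classify space characters.

```python
import hashlib
import re
import unicodedata

THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")   # Thai Unicode block
HTML_TAG = re.compile(r"<[^>]+>")            # assumed noise: leftover markup
SPACE_RUN = re.compile(r"[ \t\u00A0]+")      # spaces, tabs, no-break spaces


def denoise(text):
    """Strip markup remnants and normalise whitespace (assumed noise types)."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", "")        # drop zero-width spaces
    text = HTML_TAG.sub(" ", text)
    text = SPACE_RUN.sub(" ", text)          # keep one space: spaces matter later
    return text.strip()


def thai_ratio(text):
    """Fraction of non-whitespace characters that are Thai."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if THAI_CHAR.match(c)) / len(chars)


def build_corpus(raw_docs, min_thai_ratio=0.5):
    """Denoise, filter out mostly non-Thai documents, and drop exact duplicates.

    The threshold is a hypothetical value, not one taken from the patent.
    """
    seen = set()
    corpus = []
    for raw in raw_docs:
        doc = denoise(raw)
        if thai_ratio(doc) < min_thai_ratio:
            continue                          # filter: not predominantly Thai
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                          # deduplicate after denoising
        seen.add(digest)
        corpus.append(doc)
    return corpus
```

Deduplicating after denoising means two crawled copies that differ only in markup or spacing collapse to one entry. Near-duplicate detection would need a different technique and is not attempted here.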


Abstract

The invention relates to a Thai sentence segmentation method based on a maximum entropy classification model and Thai grammar rule correction, and belongs to the technical field of natural language processing. The method classifies space characters in Thai text effectively, which advances research on Thai sentence segmentation and Thai sentence boundary identification. It also segments Thai sentences well, and thereby supports research on machine translation, named entity recognition, sentence similarity calculation, rapid construction of large corpora, information extraction, information retrieval and the like.
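To make the idea of classifying space characters concrete, here is a minimal sketch of a binary maximum entropy classifier, which for two classes is equivalent to logistic regression. It is not the patented method: the feature templates in `space_features`, the `<space>` token convention, the POS tag names, the toy training sentences and the hyperparameters are all assumptions chosen for illustration. The grammar rule correction step is not shown because the available text does not describe the rules.

```python
import math
from collections import defaultdict


def space_features(tokens, tags, i):
    """Context features for the space token at index i (hypothetical templates)."""
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    next_w = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    prev_t = tags[i - 1] if i > 0 else "<s>"
    next_t = tags[i + 1] if i + 1 < len(tags) else "</s>"
    return [
        "bias",
        "prev_w=" + prev_w,
        "next_w=" + next_w,
        "prev_t=" + prev_t,
        "next_t=" + next_t,
        "tags=" + prev_t + "|" + next_t,
    ]


def boundary_prob(weights, feats):
    """P(space is a sentence boundary | features) under the log-linear model."""
    z = sum(weights[f] for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=200, lr=0.1, l2=0.01):
    """Stochastic gradient ascent on the L2-regularised log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in samples:
            p = boundary_prob(weights, feats)
            for f in feats:
                weights[f] += lr * ((label - p) - l2 * weights[f])
    return weights


# Toy structured corpus: (tokens, POS tags, {space index: 1 if boundary else 0}).
TOY_CORPUS = [
    (["ฉัน", "ไป", "ตลาด", "<space>", "เขา", "อยู่", "บ้าน"],
     ["PRON", "VERB", "NOUN", "SPACE", "PRON", "VERB", "NOUN"],
     {3: 1}),                      # space between two clauses: boundary
    (["ราคา", "<space>", "100", "<space>", "บาท"],
     ["NOUN", "SPACE", "NUM", "SPACE", "NOUN"],
     {1: 0, 3: 0}),                # spaces around a number: not boundaries
]

SAMPLES = [
    (space_features(tokens, tags, i), label)
    for tokens, tags, labels in TOY_CORPUS
    for i, label in labels.items()
]
WEIGHTS = train(SAMPLES)
```

With the trained weights, `boundary_prob(WEIGHTS, space_features(tokens, tags, i))` scores any space token in a segmented, tagged sentence, and a threshold such as 0.5 turns the score into a boundary decision. A real system would train on a manually proofread corpus rather than three toy examples.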

Description

Technical field

[0001] The invention relates to a Thai sentence segmentation method based on a maximum entropy classification model and Thai grammar rule correction, belonging to the technical field of natural language processing.

Background

[0002] Thai sentence segmentation is the basis of Thai natural language processing research. Most natural language processing tasks require their input or output to be sentences rather than entire paragraphs, for example machine translation, named entity recognition, sentence similarity calculation, and rapid construction of large corpora. Sentence segmentation research in natural language processing can be divided into two aspects: one is to identify sentence-end boundaries in languages that lack sentence-end marks or have only weak sentence-end marks, such as Uyghur, Tibetan and Thai; the other is to disambiguate the sentence-end boundary re...


Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F40/211, G06F40/289
Inventor: 王红斌, 沈强, 线岩团, 余正涛, 郭剑毅, 文永华
Owner: KUNMING UNIV OF SCI & TECH