High-precision Thai sentence segmentation method

A high-precision Thai sentence segmentation technology, applied in the field of machine translation, that addresses the limited precision and segmentation errors of existing approaches. It has a small number of parameters and a simple model structure, and it segments accurately.

Active Publication Date: 2019-06-11
沈阳雅译网络技术有限公司

AI Technical Summary

Problems solved by technology

Most Thai sentences are separated by a space, but relying on spaces to split them has drawbacks.
In practice, not all Thai sentences have a space at their boundaries, so segmenting Thai sentences on spaces cannot reach very high precision and produces some errors.


Examples

Detailed Description of Embodiments

[0028] The present invention will be further elaborated below in conjunction with the accompanying drawings of the description.

[0029] The present invention proposes a high-precision Thai sentence segmentation method that uses a bidirectional RNN to segment Thai sentences accurately. The invention also implements the method with Thai as the baseline language. The model's training period is short and segmentation is fast, making it a lightweight and quick Thai sentence segmentation method.
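To make the role of the bidirectional RNN concrete, the following is a minimal forward-pass sketch in plain Python. It is an illustration under assumptions, not the patent's model: the excerpt does not state the cell type, layer sizes, or output layer, so the plain tanh recurrent cell, the sigmoid output, the omission of bias terms, and all names (`rnn_pass`, `boundary_probs`, `init_params`) are hypothetical. The weights are random and untrained, so the probabilities carry no meaning beyond showing the data flow.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_pass(inputs, w_in, w_rec):
    """Run a plain tanh RNN over inputs; return one hidden state per step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states

def boundary_probs(char_ids, params):
    """Per-character probability that the character ends a sentence."""
    embedded = [params["embed"][i] for i in char_ids]
    # Forward direction reads left to right.
    fwd = rnn_pass(embedded, params["fwd_in"], params["fwd_rec"])
    # Backward direction reads right to left; reverse again to realign.
    bwd = rnn_pass(embedded[::-1], params["bwd_in"], params["bwd_rec"])[::-1]
    probs = []
    for f, b in zip(fwd, bwd):
        # Concatenate both directions, then apply a sigmoid output unit.
        score = sum(w * h for w, h in zip(params["out"], f + b))
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs

def init_params(vocab_size, embed_dim=4, hidden_dim=3, seed=0):
    """Assumed sizes; the source does not give the model dimensions."""
    rng = random.Random(seed)
    return {
        "embed": make_matrix(vocab_size, embed_dim, rng),
        "fwd_in": make_matrix(hidden_dim, embed_dim, rng),
        "fwd_rec": make_matrix(hidden_dim, hidden_dim, rng),
        "bwd_in": make_matrix(hidden_dim, embed_dim, rng),
        "bwd_rec": make_matrix(hidden_dim, hidden_dim, rng),
        "out": [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_dim)],
    }

params = init_params(vocab_size=10)
probs = boundary_probs([2, 3, 4, 5, 6], params)
predicted = [1 if p > 0.5 else 0 for p in probs]
print(predicted)
```

Because each position sees a state from the left context and a state from the right context, the boundary decision for a character can depend on what follows it as well as what precedes it, which is the property a bidirectional model offers over a one-directional one.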

[0030] As shown in Figure 1, the high-precision Thai sentence segmentation method of the present invention comprises the following steps:

[0031] 1) Sentences in the Thai corpus are manually annotated with sentence boundary information to obtain Thai sentences in the news, spoken-language, and encyclopedia domains. Before the Thai sentences are encoded, they are combined to obtain more sentences with a...
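One way to picture the combination step is sketched below. It is a guess at the idea rather than the patent's procedure: it assumes consecutive annotated sentences are joined without a separator and that the last character of each sentence is labelled 1 and every other character 0. The function name `combine_sentences`, the `max_group` parameter, and the Latin placeholder strings standing in for Thai text are all hypothetical.

```python
def combine_sentences(sentences, max_group=3):
    """Build extra samples by joining runs of consecutive sentences.

    Returns (text, labels) pairs. labels[i] is '1' when character i ends a
    sentence and '0' otherwise (an assumed labelling convention).
    Sentences are assumed to be non-empty.
    """
    samples = []
    for start in range(len(sentences)):
        for size in range(1, max_group + 1):
            group = sentences[start:start + size]
            if len(group) < size:
                break  # ran past the end of the corpus
            text = "".join(group)
            labels = "".join("0" * (len(s) - 1) + "1" for s in group)
            samples.append((text, labels))
    return samples

corpus = ["abc", "de", "fghi"]  # placeholders for annotated Thai sentences
samples = combine_sentences(corpus, max_group=2)
for text, labels in samples:
    print(text, labels)
```

Each combined sample keeps the boundary marks of its parts, so one annotated corpus yields many inputs whose internal boundaries are already known.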


Abstract

The invention discloses a high-precision Thai sentence segmentation method comprising the following steps: manually annotating sentences in a Thai corpus with sentence boundary information to obtain the required Thai sentences, and combining those sentences to obtain more Thai sentences carrying manually annotated boundary marks; encoding each single Thai sentence at the character level, generating a binary output string from the sentence boundary information, and dividing the Thai corpus into a training set and a validation set at a ratio of 9:1; building the Thai sentence segmentation model with a bidirectional RNN and training it in batches, where each batch contains several encoded Thai sentences and the shorter sentences are padded at the end with a pad mark; and training the Thai sentence segmentation model to obtain the final model. The method locates Thai sentence boundaries accurately, is easy to implement, practical, and fast to train.
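The data preparation steps in the abstract (character-level encoding, a binary output string, a 9:1 split, and end padding within a batch) can be sketched as follows. This is a minimal sketch under assumptions: the pad and unknown ids, the shuffling before the split, the padding of label rows with 0, and the extra mask are choices made for illustration, and every function name is hypothetical. Latin placeholders stand in for Thai text.

```python
import random

PAD_ID = 0  # assumed id reserved for the pad mark
UNK_ID = 1  # assumed id for characters missing from the vocabulary

def build_vocab(texts):
    """Map every character to an integer id, starting after PAD and UNK."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}

def encode(text, vocab):
    """Character-level encoding: one id per character."""
    return [vocab.get(ch, UNK_ID) for ch in text]

def split_9_to_1(samples, seed=0):
    """Shuffle, then split into training and validation sets at 9:1."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]

def pad_batch(batch):
    """Right-pad id and label sequences to the longest sequence in the batch."""
    width = max(len(ids) for ids, _ in batch)
    ids = [seq + [PAD_ID] * (width - len(seq)) for seq, _ in batch]
    labels = [lab + [0] * (width - len(lab)) for _, lab in batch]
    # The mask marks real positions so padded ones can be ignored in a loss.
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq, _ in batch]
    return ids, labels, mask

# Ten toy samples: text plus a binary string with 1 at each sentence end.
raw = [("abcde", "00101"), ("de", "01")] * 5
vocab = build_vocab(text for text, _ in raw)
encoded = [(encode(t, vocab), [int(b) for b in l]) for t, l in raw]

train, valid = split_9_to_1(encoded)
ids, labels, mask = pad_batch(encoded[:2])
print(len(train), len(valid))
print(ids)
print(labels)
```

Padding only to the longest sentence in each batch, rather than to a global maximum, keeps batches compact when sentence lengths vary widely.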

Description

Technical field

[0001] The invention relates to the field of machine translation, and in particular to a high-precision Thai sentence segmentation method.

Background

[0002] Machine translation now takes the sentence as its basic unit of translation, which makes automatic detection of sentence boundaries an important task. In most corpora that have not been through a data filtering step, the sentence pairs come from varied sources: some are produced by manual translation, while others are parallel sentence pairs collected from the Internet by web crawlers. The quality of sentence pairs in such a corpus is therefore often uneven, and paragraph-length "sentences" are common, which can leave too many over-long sentences in the corpus. However, if such a situation occurs in the corpus, the training of the model and even the...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/28, G06F16/35, G06F16/33, G06N3/04
Inventor: 杜权, 李自荐, 朱靖波, 肖桐
Owner: 沈阳雅译网络技术有限公司