
Korean word segmentation restoration method based on language model

A language-model and word-segmentation technology, applied in natural language translation, natural language data processing, instruments, etc. It addresses the problem that existing methods cannot restore segmented text in languages with large word granularity, and achieves fluent sentences, high accuracy, and fast processing.

Pending Publication Date: 2020-05-19
沈阳雅译网络技术有限公司

AI Technical Summary

Problems solved by technology

[0018] Existing word segmentation restoration methods cannot restore translated Korean text, which has relatively large word granularity. To address this deficiency, the present invention provides a language-model-based word segmentation restoration method that can restore text after word segmentation for coarse-grained languages such as Korean.


Examples


Embodiment Construction

[0051] The present invention will be further elaborated below in conjunction with the accompanying drawings of the description.

[0052] In the present invention, a language model trained on Korean data is used to perform word segmentation restoration on the translations output by a machine translation system. Figure 1 is the overall flow chart. The specific steps are as follows:

[0053] 1) Language model training: use the Unigram method to perform language model training on Korean monolingual data, and obtain a Korean language model for subsequent word segmentation and restoration operations;

[0054] 2) Word segmentation of bilingual data: train a translation system and use word segmentation tools to segment bilingual training data;

[0055] 3) Translation model training: use the language model generated in step 1) to perform word segmentation on the data, and input the data after word segmentation into the neural network model to sta...
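The text above does not spell out how the Unigram language model from step 1) segments data in step 3). As a minimal sketch, assuming the model is a table of subword log-probabilities and that segmentation picks the highest-probability split by Viterbi search (as common unigram subword tokenizers do), the idea might look like the following. The vocabulary `TOY_LOG_PROBS`, the function `viterbi_segment`, and the `unk_log_prob` penalty are hypothetical names and values chosen for illustration; none of them come from the patent.

```python
import math

# Hypothetical toy unigram model: subword -> log-probability.
# A real model would be estimated from Korean monolingual data.
TOY_LOG_PROBS = {
    "학교": math.log(0.05),
    "에서": math.log(0.05),
    "에": math.log(0.04),
    "서": math.log(0.01),
    "학": math.log(0.005),
    "교": math.log(0.005),
}


def viterbi_segment(word, log_probs, unk_log_prob=-20.0):
    """Return the most probable split of `word` into subwords.

    best[i] holds the best log-probability of any segmentation of word[:i];
    back[i] holds the start index of the last piece in that segmentation.
    Single characters missing from the model get a fixed penalty so that
    every word can be segmented.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob  # unknown single character
            else:
                continue  # unknown multi-character piece: not allowed
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


if __name__ == "__main__":
    print(viterbi_segment("학교에서", TOY_LOG_PROBS))
```

With this toy table, the two-piece split of "학교에서" scores higher than any split into smaller pieces, which illustrates how a unigram model can break a coarse-grained Korean word into more frequent subunits.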


Abstract

The invention discloses a Korean word segmentation restoration method based on a language model. The method comprises the following steps: 1) language model training: train a language model on Korean monolingual data using the Unigram method, obtaining a Korean language model for the subsequent word segmentation restoration operation; 2) bilingual data word segmentation: to train a translation system, segment the bilingual training data with a word segmentation tool; 3) translation model training: segment the data using the language model generated in step 1), then input the segmented data into a neural network model and train until the model converges; 4) translation word segmentation restoration: translate test sentences with the translation system trained in step 3), and merge the resulting Korean translation into the standard Korean written form. The method alleviates the large word granularity and data sparsity of Korean data, and effectively improves translation quality for machine translation whose target language is Korean.
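The abstract does not say how the segmented Korean output is merged back into standard written form in step 4). One common way to make segmentation reversible, offered here only as a sketch under that assumption, is to mark the first piece of every original space-delimited word and then rejoin pieces on that marker. The marker `WORD_START`, the functions `mark_pieces` and `restore`, and the lookup-table segmenter `toy_segment` are hypothetical and stand in for whatever the trained language model would provide.

```python
WORD_START = "\u2581"  # hypothetical marker ("▁") for the start of an original word

# Hypothetical stand-in for a trained segmenter: a fixed lookup table.
_TOY_SPLITS = {
    "학교에서": ["학교", "에서"],
    "공부합니다": ["공부", "합니다"],
}


def toy_segment(word):
    """Split a word using the toy table; unknown words stay whole."""
    return list(_TOY_SPLITS.get(word, [word]))


def mark_pieces(sentence, segment=toy_segment):
    """Segment each space-delimited word and mark where each word begins."""
    pieces = []
    for word in sentence.split():
        sub = list(segment(word))
        sub[0] = WORD_START + sub[0]
        pieces.extend(sub)
    return pieces


def restore(pieces):
    """Merge pieces back into normally spaced text using the word-start marker."""
    text = "".join(pieces).replace(WORD_START, " ")
    return " ".join(text.split())


if __name__ == "__main__":
    pieces = mark_pieces("학교에서 공부합니다")
    print(pieces)
    print(restore(pieces))
```

Because the marker records every original word boundary, restoration does not depend on the segmenter itself: any sequence of marked pieces, including one produced by a translation system, can be merged the same way.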

Description

Technical field

[0001] The invention relates to word segmentation restoration technology in language processing, and in particular to a Korean word segmentation restoration method based on a language model.

Background technique

[0002] A "word" is the smallest language unit that can be used independently in a language. Words are usually the basic unit in machine translation training. For any language, the data must be segmented into words before a machine translation system is trained: complete sentences are divided into sequences of consecutive words, which are then used for training.

[0003] There are two common word segmentation methods. The first applies to inflectional languages such as English, in which the words in a sentence are separated by spaces. Such data can be segmented with punctuation-based word segmentation: it is only necessary to separate the punctuation marks in the sentence from the words. Another word segmentation method is sui...


Application Information

IPC(8): G06F40/284, G06F40/42
Inventor 杜权, 徐萍, 朱靖波, 肖桐, 张春良
Owner 沈阳雅译网络技术有限公司