Word frequency based skip language model training method

A language model and training method technology, applied in the field of Chinese statistical language model training, can solve the problem of lack of statistical language model OOV and so on

Active Publication Date: 2016-12-28
UNIV OF ELECTRONICS SCI & TECH OF CHINA
View PDF9 Cites 15 Cited by
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0010] The present invention provides a kind of training method of skip language model based on word frequency in view of ab

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
View more

Image

Smart Image Click on the blue labels to locate them in the text.
Viewing Examples
Smart Image
  • Word frequency based skip language model training method
  • Word frequency based skip language model training method
  • Word frequency based skip language model training method

Examples

Experimental program
Comparison scheme
Effect test

example 1

[0077] Example 1, the number of words in the longest vocabulary in the dictionary is L = 5, and the positive maximum matching word segmentation is performed for the "Planning Developer Circle of the Ministry of Land and Resources":

[0078] The first round of search: "Ministry of Land and Resources", scan the dictionary, there is this word;

[0079] Round 2 search: "plan developer", scan dictionary, no such word;

[0080] "planned development", scan dictionary, no such word;

[0081] "Scheduled to open", scan the dictionary, there is no such word;

[0082] "plan", scan the dictionary, there is the word;

[0083] The third round of search: "developer circle", scan the dictionary, no such word;

[0084] "Developer", scan the dictionary, there is the word;

[0085] The fourth round of search: "circle", scan the dictionary, there is the word;

[0086] The positive maximum matching word segmentation result Rf is "Ministry of Land and Resources Planning Developer Circle".

[0...

example 2

[0097] Example 2: Take a sentence in the corpus as an example, the sentence after the word segmentation in step S2: "Ili, located in the northwest border, is an important window for my country's opening to the outside world. Since the reform and opening up, Yili's economy has developed rapidly, and people's living standards have rapidly improved. "

[0098] The example sentence is processed in step S3, and the result obtained is: " Yili, located on the northwestern border, is an important window for my country's opening to the outside world. Since the reform and opening up, Yili's economy has achieved rapid development and people's living standards have improved rapidly. ".

[0099] S4. Count the vocabulary and word frequency in the learning set corpus, and generate the Chinese vocabulary wt. Traversing the sentence processed in step S3, every time a new word a is encountered, write a into the Chinese vocabulary, and initialize the number of occurrences c to 1; when traversin...

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to view more

PUM

No PUM Login to view more

Abstract

The invention discloses a word frequency based skip language model training method, relates to the technical field of machine translation and aims at solving the OOV problem of a statistical language model caused by linguistic data shortage in the prior art. The word frequency based skip language model training method comprises the steps that Chinese sentences are collected; the Chinese sentences are segmented; a learning set corpus is generated; statistics is conducted on vocabulary and word frequency in the learning set corpus to generate a Chinese vocabulary wt; statistics is conducted on phrases and the emerging times of phrases in the corpus to generate a 1-n Chinese phrase table pt0; a selective skip standard k is set, and k judgment is performed according to the statistical results of the word frequency in the Chinese vocabulary wt, and when the sum of the number of all the vocabulary with the emerging times k not greater than i accounts for above 60% of the number of all the vocabulary, k = i; linguistic model training is performed according to a Chinese sentence table pt2 to obtain a skip-ngram linguistic model. The word frequency based skip language model training method is used for obtaining a linguistic model probability table.

Description

technical field [0001] A word frequency-based skip language model training method for obtaining a language model probability table relates to the technical field of machine translation, in particular to the technical field of a corpus-based Chinese statistical language model training method. Background technique [0002] Language model is one of the basic probability models in statistical machine translation technology. It is used to evaluate the possibility of a certain word sequence in the translated sentence, which is related to the fluency of the translated sentence (translated word sequence). The higher the probability of the language model, the more legal, fluent and consistent the translation is. Assuming that w1 and w2 represent two words, p(w2|w1) represents the probability of w2 appearing behind w1, that is, the probability that the phrase w1w2 may appear; according to linguistic knowledge, the possibility of the phrase "eat noodles" appearing will be significantly...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to view more

Application Information

Patent Timeline
no application Login to view more
IPC IPC(8): G06F17/27
CPCG06F40/216G06F40/242G06F40/284
Inventor 秦科刘贵松段贵多罗光春
Owner UNIV OF ELECTRONICS SCI & TECH OF CHINA
Who we serve
  • R&D Engineer
  • R&D Manager
  • IP Professional
Why Eureka
  • Industry Leading Data Capabilities
  • Powerful AI technology
  • Patent DNA Extraction
Social media
Try Eureka
PatSnap group products