Method for extracting translation unit table in machine translation

A method for extracting translation unit tables in machine translation, applied to the distributed extraction of phrase tables, hierarchical phrase tables and lexical ordering models. It addresses the high running time of centralized extraction and the lack of implementation detail in earlier work.

Active Publication Date: 2014-01-01
NANJING UNIV

AI Technical Summary

Problems solved by technology

Existing inventions and technologies, such as the "Statistical Machine Translation Phrase Extraction Method" patent filed by the Institute of Computing Technology, Chinese Academy of Sciences in 2009, focus on the algorithms for extraction and probability calculation and do not describe how to implement this work. With centralized extraction, a single computer extracts the three files on its own. As the training corpus grows, the running time of the program keeps increasing, and the translation rules must be re-extracted every time a new word alignment method is tested. This makes the inefficiency of centralized extraction apparent, so a faster way of extracting these translation rules is needed.

Examples

Embodiment 1

[0139] This embodiment extracts the phrase table as follows:

[0140] 11. Input the bilingual aligned corpus and the corresponding word alignment file, set the maximum length of source-language phrases to 3 and the maximum length of target-language phrases to 5, and allow extracted phrase pairs to contain empty phrases. For each aligned sentence pair in the corpus, use the word alignment information in the word alignment file to first extract all aligned phrase pairs and record their word alignment information and number of occurrences. Then merge the information of identical phrase pairs, sum their occurrence counts, and keep the word alignment that occurs most often. The merged result contains 44,018,003 phrase pairs in total.
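The extract-then-merge step in [0140] can be sketched on a single machine as follows. This is a minimal illustration, not the patent's Hadoop implementation: the function names `extract_phrase_pairs` and `merge_phrase_pairs` are hypothetical, the extraction uses the standard consistency criterion from phrase-based translation, and the sketch neither extends spans over unaligned words nor emits empty phrases.

```python
from collections import Counter, defaultdict

def extract_phrase_pairs(src, tgt, alignment, max_src_len=3, max_tgt_len=5):
    """Yield (source phrase, target phrase, links) consistent with the alignment.

    alignment is a set of (source index, target index) links, 0-based.
    Simplification (assumption): only the tightest target span is used.
    """
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_src_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_tgt_len:
                continue
            # Consistency: no target word in [j1, j2] may link outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            # Store the links relative to the start of the phrase pair.
            links = tuple(sorted((i - i1, j - j1)
                                 for (i, j) in alignment if i1 <= i <= i2))
            yield " ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]), links

def merge_phrase_pairs(records):
    """Sum occurrences per phrase pair and keep its most frequent alignment."""
    counts = Counter()
    align_counts = defaultdict(Counter)
    for s, t, links in records:
        counts[(s, t)] += 1
        align_counts[(s, t)][links] += 1
    return {pair: (n, align_counts[pair].most_common(1)[0][0])
            for pair, n in counts.items()}
```

In a MapReduce setting, the extraction would naturally play the role of the map phase (keyed by phrase pair) and the merge the role of the reduce phase, although the visible text does not spell out that division.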

[0141] 12. Take the result of step 1 as input and apply Good-Turing smoothing: count the (c, n_c) pairs and output the result to a file, c and n in th...
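The (c, n_c) statistics used by Good-Turing smoothing can be illustrated with a small sketch. It assumes the usual definitions: n_c is the number of distinct phrase pairs that occur exactly c times, and the adjusted count is c* = (c + 1) · n_{c+1} / n_c. The function names are hypothetical, and the fallback to the raw count when n_c or n_{c+1} is zero is an assumption of this sketch, not something stated in the source.

```python
from collections import Counter

def count_of_counts(pair_counts):
    """Return {c: n_c}: how many distinct phrase pairs occur exactly c times."""
    return Counter(pair_counts.values())

def good_turing_adjusted(c, n):
    """Adjusted count c* = (c + 1) * n_{c+1} / n_c.

    Falls back to the raw count c when n_c or n_{c+1} is zero
    (an assumption made here to keep the sketch well defined).
    """
    if n.get(c, 0) == 0 or n.get(c + 1, 0) == 0:
        return float(c)
    return (c + 1) * n[c + 1] / n[c]

# Toy counts: three pairs seen once, two seen twice, one seen three times.
toy_counts = {"p1": 1, "p2": 1, "p3": 1, "p4": 2, "p5": 2, "p6": 3}
toy_n = count_of_counts(toy_counts)
```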

Embodiment 2

[0144] This embodiment extracts the hierarchical phrase table as follows:

[0145] 21. Input the bilingual aligned corpus and the corresponding word alignment file, set the maximum length of source-language phrases to 3 and the maximum length of target-language phrases to 5, and allow extracted phrase pairs to contain empty phrases. For each aligned sentence pair in the corpus, use the word alignment information in the word alignment file to first extract all aligned hierarchical phrase pairs and record the corresponding word alignment information and number of occurrences. Then merge the information of identical hierarchical phrase pairs, sum their occurrence counts, and keep the word alignment that occurs most often. The merged result contains 430,252,258 hierarchical phrase pairs in total.

[0146] 22. Take the result of step 1 as input, use the Good-Turing method for smoothing, and count (...

Embodiment 3

[0149] This embodiment extracts the lexical ordering model as follows:

[0150] 31. For the input bilingual aligned corpus and the corresponding word alignment file, set the maximum length of both source-language and target-language phrases to 7, and do not allow empty phrases in extracted phrase pairs. For each aligned sentence pair in the corpus, use the word alignment information in the word alignment file to extract all aligned phrase pairs and their corresponding ordering rules, and output them to a file. The result file contains 228,514,143 unmerged phrase pairs.
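The visible text does not define the ordering rules, so the following is only a sketch under an assumption: it uses the common word-based definition from lexicalized reordering models, where a phrase pair is classified relative to the preceding target word. The function name `orientation` and its parameters are hypothetical.

```python
def orientation(alignment, src_start, src_end, tgt_start):
    """Classify a phrase pair as mono, swap or discontinuous.

    alignment: set of (source index, target index) links, 0-based.
    src_start, src_end: inclusive source span of the phrase pair.
    tgt_start: first target position of the phrase pair.
    A virtual link (-1, -1) stands for the sentence start (assumption).
    """
    links = set(alignment) | {(-1, -1)}
    # Mono: the word just before the phrase on both sides is aligned.
    if (src_start - 1, tgt_start - 1) in links:
        return "mono"
    # Swap: the preceding target word aligns to the word right after the source span.
    if (src_end + 1, tgt_start - 1) in links:
        return "swap"
    return "discontinuous"
```

With this definition every extracted phrase pair receives exactly one of the three labels for a given direction, so the three label counts for that direction add up to the number of extracted pairs.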

[0151] 32. From the result of step 1, count the total number of occurrences of each ordering rule. In the upper direction, the mono rule occurs 150,367,615 times, the swap rule 14,918,685 times, and the discontinuous rule 63,227,843 times; the number of oc...
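As a quick check on the figures in [0151], the three counts for this direction add up to the 228,514,143 unmerged phrase pairs reported in [0150]. The sketch below also turns the counts into relative frequencies; using plain relative frequency here is an assumption for illustration, since the patent's own probability step is not visible in this excerpt. The function name `rule_probabilities` is hypothetical.

```python
def rule_probabilities(counts):
    """Return (total, {rule: count / total}) for a dict of rule counts."""
    total = sum(counts.values())
    return total, {rule: n / total for rule, n in counts.items()}

# Occurrence counts quoted in [0151] for one direction.
upper_counts = {
    "mono": 150_367_615,
    "swap": 14_918_685,
    "discontinuous": 63_227_843,
}
total, probs = rule_probabilities(upper_counts)
```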

Abstract

The invention provides a method for extracting a translation unit table in machine translation. The method runs on a Hadoop parallel computing platform and includes the following steps: selecting the content to be extracted according to the input bilingual aligned corpus and word alignment files, and merging the relevant information; choosing whether to apply smoothing and which smoothing method to use, then performing the corresponding smoothing counts and merging; and calculating the corresponding probabilities and outputting the final result file. Compared with the existing centralized extraction method, the method greatly speeds up program execution. Various smoothing techniques can be added selectively during probability calculation to handle the overfitting caused by data sparseness and by the empirical distribution of the training data set, so that the probabilities better reflect real-world conditions and the performance of a machine translation system improves in practice.

Description

Technical field

[0001] The invention relates to the field of computer statistical machine translation and parallel computing, in particular to a method for distributed extraction of phrases, hierarchical phrase tables and lexical ordering models.

Background art

[0002] Statistical machine translation has developed very rapidly since the 1990s and has made great progress, and has gradually become a research hotspot in the field of machine translation. Compared with rule-based machine translation systems, the biggest advantage of statistical methods is that there is no need to manually write rules, and machine translation systems can be obtained directly through training using corpora. The statistical machine translation system based on phrases or hierarchical phrases can better grasp the dependence of local contexts, and its performance is better than that of word-based statistical machine translation; compared with syntax-based statistical machine translation, it has ...

Application Information

IPC(8): G06F17/28
Inventors: 黄书剑, 孙辉丰, 戴新宇, 陈家骏
Owner: NANJING UNIV