Word alignment training method, machine translation method and system

A word alignment training method and system, applied in the field of machine translation, that addresses problems such as heavy consumption of network resources, long training times, and reduced word alignment training efficiency.

Active Publication Date: 2017-12-05
阿里巴巴(中国)网络技术有限公司

AI Technical Summary

Problems solved by technology

[0003] Word alignment belongs to the offline training stage. For a good statistical machine translation system, the training corpus generally contains tens of millions of sentences. In the prior art, word alignment training can be run on a single machine, but because the training corpus is so large, training occupies a great deal of memory and takes a long time. For example, on a server with 128 GB of memory, word alignment training on a corpus of tens of millions of sentences takes about 60 hours. Upgrading a translation system usually involves multiple rounds of word alignment training and experiments, so offline word alignment training has become a bottleneck that seriously slows the iterative upgrading of the machine translation system.

[0004] To speed up word alignment training and reduce the load on a single machine, the prior art can also use a distributed cluster, that is, perform word alignment training on multiple machines. However, whichever approach is used, prior-art word alignment training must maintain the vocabulary as one large matrix: a two-dimensional matrix from the source-language vocabulary to the target-language vocabulary. This matrix can generally reach 20 GB or more, and maintaining it poses a considerable technical challenge. In stand-alone mode it easily exhausts memory, and the training process takes a long time. In a distributed cluster, each machine in the cluster must load the matrix, which consumes cluster resources; distributing such a large matrix across the cluster also consumes the network resources of the whole cluster and reduces the efficiency of word alignment training.
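The scale of such a matrix can be illustrated with a rough size estimate. The sketch below is only an illustration: the vocabulary sizes and the 4-byte entry width are hypothetical values chosen to land near the "more than 20 GB" order of magnitude mentioned above, and the patent text does not state them.

```python
def dense_table_bytes(source_vocab_size, target_vocab_size, bytes_per_entry=4):
    """Bytes needed for a dense source-by-target translation probability matrix."""
    return source_vocab_size * target_vocab_size * bytes_per_entry


# Hypothetical vocabulary sizes (assumptions, not figures from the patent).
SOURCE_VOCAB = 100_000
TARGET_VOCAB = 60_000

size_in_bytes = dense_table_bytes(SOURCE_VOCAB, TARGET_VOCAB)
print(f"{size_in_bytes / 1e9:.0f} GB")  # 100,000 x 60,000 x 4 bytes = 24 GB
```

Because the size grows with the product of the two vocabulary sizes, every machine that must hold the full matrix pays this cost, whether it is a single server or each machine of a cluster.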

Embodiment Construction

[0063] The embodiment of the present application provides a word alignment training method and system to improve word alignment training efficiency.

[0064] To address the shortcomings of the prior art, the embodiments of the present application propose an efficient distributed word alignment training method. An inverted index is used to compute the vocabulary required by each sentence pair (that is, the vocabulary translation sub-table covering the words of the source sentence and the words of the target sentence of each parallel corpus entry). Each sub-table can then be distributed together with its bilingual sentence pair to the processing nodes of the parallel cluster. This avoids dynamically loading the vocabulary of the entire parallel corpus (that is, the vocabulary translation general table covering the words of the source sentences and the words of the target sentences) and reduces the resource consumption of each processing node in the parallel cluster ....
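The splitting step described in [0064] can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: the function names, the dictionary layout of the general table, and the choice to index on source words are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(sentence_pairs):
    """Map each source word to the ids of the sentence pairs that contain it."""
    index = defaultdict(set)
    for pair_id, (source_words, _target_words) in enumerate(sentence_pairs):
        for word in source_words:
            index[word].add(pair_id)
    return index


def split_into_subtables(general_table, sentence_pairs):
    """Give every sentence pair only the general-table entries it needs.

    general_table: dict mapping (source_word, target_word) -> probability
                   (hypothetical layout, assumed for this sketch).
    Returns a list with one sub-table (dict) per sentence pair.
    """
    index = build_inverted_index(sentence_pairs)
    target_sets = [set(target_words) for _source, target_words in sentence_pairs]
    subtables = [dict() for _ in sentence_pairs]
    # One pass over the general table; the index routes each entry to
    # the sentence pairs whose source sentence contains the source word.
    for (src_word, tgt_word), prob in general_table.items():
        for pair_id in index.get(src_word, ()):
            if tgt_word in target_sets[pair_id]:
                subtables[pair_id][(src_word, tgt_word)] = prob
    return subtables
```

Each sub-table holds at most (source sentence length) x (target sentence length) entries, so it is small enough to travel with its sentence pair instead of requiring every node to load the whole table.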

Abstract

The application discloses a word alignment training method, a machine translation method and a system for improving the efficiency of word alignment training. The word alignment training method comprises the following steps: determining a vocabulary translation general table for multiple parallel corpora, wherein the general table contains the translation probabilities from words of the source sentences to words of the target sentences in the parallel corpora; splitting the general table to obtain multiple vocabulary translation sub-tables, wherein each sub-table contains the translation probability from at least one word of a source sentence to a word of a target sentence in the parallel corpora; and, on the basis of the sub-tables, determining the alignment relationship between the words of the source sentence and the words of the target sentence in the parallel corpora.
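The last step of the abstract, deriving an alignment from a sub-table, might look like the sketch below. The abstract does not say which alignment model is used, so this example assumes a simple rule in the style of IBM Model 1 decoding: each target word is linked to the source word with the highest translation probability. The function name and the `floor` parameter are hypothetical.

```python
def align_sentence_pair(source_words, target_words, subtable, floor=1e-9):
    """Link each target word to its most probable source word.

    subtable: dict mapping (source_word, target_word) -> probability,
              restricted to the words of this sentence pair.
    floor:    probability assumed for pairs missing from the sub-table.
    Returns a list of (target_index, source_index) links.
    Assumes source_words is non-empty.
    """
    links = []
    for j, tgt in enumerate(target_words):
        best_i = max(
            range(len(source_words)),
            key=lambda i: subtable.get((source_words[i], tgt), floor),
        )
        links.append((j, best_i))
    return links


subtable = {
    ("das", "the"): 0.7, ("das", "house"): 0.1,
    ("haus", "the"): 0.2, ("haus", "house"): 0.8,
}
print(align_sentence_pair(["das", "haus"], ["the", "house"], subtable))
```

Because the function reads only the sub-table of its own sentence pair, it can run on any processing node that has received that pair, without access to the general table.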

Description

Technical field

[0001] The present application relates to the technical field of information processing, and in particular to a word alignment training method, a machine translation method and a system.

Background

[0002] Statistical machine translation is currently the mainstream machine translation technology, and word alignment is the core of machine translation training. Word alignment computes, through statistics and analysis of bilingual sentence pairs, the alignment between the words in each sentence pair. The quality of word alignment directly affects subsequent translation accuracy.

[0003] Word alignment belongs to the offline training stage. For a good statistical machine translation system, the training corpus generally contains tens of millions of sentences. In the prior art, word alignment training can be run on a single machine, but because the training corpus is...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/28
CPC: G06F40/44, G06F40/58
Inventor: 张海波, 朱长峰, 傅春霖, 黄瑞, 赵宇, 骆卫华, 林锋
Owner: 阿里巴巴(中国)网络技术有限公司