Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training

A parallel-corpus and phrase-pair technique, applied to chapter-level phrase translation pair extraction, that addresses problems such as scarce data resources and dependence on bilingual dictionaries.

Active Publication Date: 2015-03-04
哈尔滨工业大学高新技术开发总公司
Cites: 5; Cited by: 25

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to address the problem that parallel data resources for statistical machine translation systems are scarce or even nonexistent. If a parallel corpus is to be obtained, it will co



Examples


Specific Embodiment 1

[0030] Specific Embodiment 1: the method of this embodiment for extracting chapter-level parallel phrase pairs from a comparable corpus based on parallel corpus training is carried out according to the following steps:

[0031] Step 1, set the source language sentence set S and the target language sentence set T in the corpus, wherein the corpus includes a parallel corpus and a comparable corpus;

[0032] Step 2, divide S and T separately into phrases of a specified length, 2 to 7 words per phrase, and combine the resulting phrases pairwise to obtain the set of all phrase pairs of the parallel corpus, wherein each phrase pair must contain one phrase from S and one phrase from T;

[0033] Step 3, use the GIZA++ tool to extract the two-way word translation table from the parallel corpus, and use the parallel corpus to establish a phrase-based statistical machine translation system in the Moses system to obtain that most of the phrases contained...
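The phrase division and pairing described in Step 2 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names `enumerate_phrases` and `candidate_phrase_pairs` are hypothetical, sentences are assumed to be already tokenized by whitespace, and phrases are assumed to be paired within one source/target sentence pair.

```python
from itertools import product

# Phrase length bounds taken from Step 2 (2 to 7 words).
MIN_LEN, MAX_LEN = 2, 7


def enumerate_phrases(words, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return every contiguous word span whose length lies in [min_len, max_len]."""
    phrases = []
    for length in range(min_len, max_len + 1):
        # range() is empty when the sentence is shorter than `length`.
        for start in range(len(words) - length + 1):
            phrases.append(tuple(words[start:start + length]))
    return phrases


def candidate_phrase_pairs(src_sentence, tgt_sentence):
    """Pair every source phrase with every target phrase.

    Assumption: whitespace tokenization; a real system would use a
    language-specific tokenizer or word segmenter.
    """
    src_phrases = enumerate_phrases(src_sentence.split())
    tgt_phrases = enumerate_phrases(tgt_sentence.split())
    # Each pair holds exactly one phrase from S and one phrase from T.
    return list(product(src_phrases, tgt_phrases))


pairs = candidate_phrase_pairs("the red house is big",
                               "la maison rouge est grande")
print(len(pairs))
```

Because every source phrase is combined with every target phrase, the candidate set grows as the product of the two phrase counts, which is why the later steps need a classifier to filter it.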

Specific Embodiment 2

[0054] Specific Embodiment 2: this embodiment differs from Specific Embodiment 1 in that, in step 3, the positive examples of the training data are extracted (positive-example labeling) as follows:

[0055] (1) Let $S_k$ be the word at position $k$ in the source language sentence set S, and $S_i^j$ the word sequence from position $i$ to position $j$ in S; let $T_{k'}$ be the word at position $k'$ in the target language sentence set T, and $T_{i'}^{j'}$ the word sequence from position $i'$ to position $j'$ in T; assume a threshold $\varepsilon$, $\varepsilon \in (0,1)$;

[0056] (2) The threshold is chosen based on experience and actual conditions. If the translation probability of two words in the two-way word translation table is greater than the threshold $\varepsilon$, the two words $S_k$ and $T_{k'}$ are mutual translations;

[0057] (3) If and only if $S_k$ and $T_{k'}$ are mutual translations or aligned, $k \in [i,j]$ and $k' \in [i',j']$;

[0058] When $S_k$ and $T_{k'}$ are not mutual translations or aligned...
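The thresholded mutual-translation test and the span condition above can be sketched in Python. This is one possible reading, not the patent's definitive procedure: it assumes that "greater than the threshold" must hold in both directions of the two-way table, and it interprets condition (3) as the consistency criterion commonly used in phrase extraction (a linked word pair must lie either inside both spans or outside both). The names `mutual_translation` and `is_positive_example`, the dictionary layout of the tables, and the toy probabilities are all hypothetical.

```python
def mutual_translation(s_word, t_word, s2t, t2s, eps):
    """True if the word pair exceeds the threshold eps in both directions.

    Assumption: s2t maps (source word, target word) to a probability and
    t2s maps (target word, source word) to a probability.
    """
    return (s2t.get((s_word, t_word), 0.0) > eps
            and t2s.get((t_word, s_word), 0.0) > eps)


def is_positive_example(src, tgt, i, j, i2, j2, s2t, t2s, eps):
    """Check whether spans src[i..j] and tgt[i2..j2] (0-based, inclusive)
    form a positive example under the assumed consistency reading."""
    has_inside_link = False
    for k, s_word in enumerate(src):
        for k2, t_word in enumerate(tgt):
            if not mutual_translation(s_word, t_word, s2t, t2s, eps):
                continue
            inside_src = i <= k <= j
            inside_tgt = i2 <= k2 <= j2
            # A link that crosses the span boundary breaks consistency.
            if inside_src != inside_tgt:
                return False
            if inside_src:
                has_inside_link = True
    # Require at least one linked word pair inside the spans.
    return has_inside_link


# Toy data (invented for illustration only).
src = "the red house is big".split()
tgt = "la maison rouge est grande".split()
s2t = {("the", "la"): 0.9, ("red", "rouge"): 0.8, ("house", "maison"): 0.85,
       ("is", "est"): 0.9, ("big", "grande"): 0.7}
t2s = {(t, s): p for (s, t), p in s2t.items()}

print(is_positive_example(src, tgt, 0, 2, 0, 2, s2t, t2s, eps=0.5))
print(is_positive_example(src, tgt, 1, 2, 0, 1, s2t, t2s, eps=0.5))
```

In the toy data the first call pairs "the red house" with "la maison rouge", where every linked word stays inside both spans; the second pairs "red house" with "la maison", where the link between "the" and "la" crosses the span boundary.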

Specific Embodiment 3

[0060] Specific Embodiment 3: this embodiment differs from Specific Embodiment 1 or 2 in that, in step 5, the following classification features are extracted from the parallel phrase pairs and from the non-parallel phrase pairs of the parallel corpus, respectively:

[0061] (1) Phrase length difference: the absolute value of the difference between the lengths of the source language phrase and the target language phrase;

[0062] (2) Same start: if the beginning of the source language phrase and the beginning of the target language phrase are translations of each other, the value is 1; otherwise the value is 0;

[0063] (3) Same ending: if the ending of the source language phrase and the ending of the target language phrase are translations of each other, the value is 1; otherwise the value is 0;

[0064] (4) Number of words in the phrase: the number of words contained in the source language phrase and in the target language phrase, respectively;

[0065] (5...
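Features (1) to (4) above can be computed with a short Python sketch. It covers only the four features whose definitions are visible here; the remaining features are not reproduced. The function name `extract_features`, the feature keys, and the `lexicon` set standing in for the word translation table are hypothetical, and phrase length is assumed to be measured in words.

```python
def extract_features(src_phrase, tgt_phrase, is_translation):
    """Compute features (1)-(4) for one phrase pair.

    src_phrase, tgt_phrase: non-empty lists of words.
    is_translation: callable (source word, target word) -> bool; assumed
    to wrap a lookup in the word translation table.
    """
    return {
        # (1) absolute difference of the phrase lengths, in words (assumed unit)
        "length_diff": abs(len(src_phrase) - len(tgt_phrase)),
        # (2) 1 if the first words translate each other, else 0
        "same_start": int(is_translation(src_phrase[0], tgt_phrase[0])),
        # (3) 1 if the last words translate each other, else 0
        "same_end": int(is_translation(src_phrase[-1], tgt_phrase[-1])),
        # (4) word counts of each side
        "src_words": len(src_phrase),
        "tgt_words": len(tgt_phrase),
    }


# Toy lexicon (invented for illustration only).
lexicon = {("the", "la"), ("house", "maison"), ("red", "rouge")}


def in_lexicon(s_word, t_word):
    return (s_word, t_word) in lexicon


features = extract_features(["the", "house"], ["la", "maison"], in_lexicon)
print(features)
```

A feature dictionary of this kind can be turned into a numeric vector in a fixed key order before being passed to a binary classifier such as the support vector machine named in the abstract.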


Abstract

The invention discloses a method for extracting chapter-level parallel phrase pairs from a comparable corpus based on parallel corpus training, and relates to methods for extracting parallel phrase pairs from comparable corpora. The method addresses two problems: acquiring a parallel corpus is costly, and approaches that treat the two words or fragments with the most similar contexts as mutual translations depend heavily on a bilingual dictionary when applied to a comparable corpus. The method comprises the following steps: 1, providing a source language sentence set S and a target language sentence set T; 2, obtaining the phrase pair set of the parallel corpus; 3, obtaining the parallel phrase pairs of the parallel corpus; 4, obtaining the non-parallel phrase pairs of the parallel corpus; 5, obtaining a support vector machine binary classifier; 6, extracting candidate parallel phrase pairs <s, t>; 7, obtaining the noisy parallel phrase pairs in the comparable corpus; 8, obtaining the parallel phrase pairs of the comparable corpus; 9, obtaining an extension decoder. The method is applied to the field of parallel phrase pair extraction from comparable corpora.

Description

Technical field

[0001] The invention relates to a method for extracting phrase translation pairs, in particular to a method for extracting chapter-level phrase translation pairs.

Background art

[0002] With the emergence of high-coverage media such as radio, television, and the Internet, the distance between people in space and time has shrunk sharply, and international exchanges have become more frequent and convenient. The entire earth is like a small village in the vast universe. To let people communicate without barriers, machine translation, the automatic translation from one language into another, has huge market demand and broad application prospects.

[0003] In recent years, computing power has improved by leaps and bounds, the Internet has developed and become widespread, and bilingual countries and the multilingual archives of the United Nations have provided tens of millions of bilingual parallel corpora, which have laid the necessary founda...


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F40/58
Inventors: 曹海龙, 张捷鑫, 赵铁军
Owner: 哈尔滨工业大学高新技术开发总公司