Training corpus quality evaluation and selection method for statistical machine translation

A technology for statistical machine translation and quality evaluation, applied to the quality evaluation and selection of training corpora for statistical machine translation. It addresses the problem that manual methods are time-consuming, labor-intensive, and become unusable as the number of features grows.

Active Publication Date: 2013-02-27
沈阳雅译网络技术有限公司

AI Technical Summary

Problems solved by technology

Manual methods require extensive experimental support, are time-consuming and labor-intensive, and become impractical as the number of features increases.

Embodiment Construction

[0070] The present invention will be further elaborated below in conjunction with the accompanying drawings of the description.

[0071] The training corpus quality evaluation and selection method for statistical machine translation of the present invention comprises the following steps:

[0072] Automatic weight acquisition: a small-scale corpus is used to train the automatic weight acquisition model, which yields the weight of each feature in the linear quality evaluation model together with the classification threshold;

[0073] Sentence pair quality evaluation: the weights and classification threshold obtained above are taken as input together with the original large-scale parallel corpus, and the linear sentence pair quality evaluation model classifies the large-scale parallel corpus into corpus subsets;

[0074] Selection of high-quality corpus subsets: from the corpus subsets above, high-quality corpus is selected as the training data of the statisti...
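
The first two steps can be illustrated with a minimal Python sketch. It is not the patent's implementation: this excerpt names neither the learning algorithm nor the features, so a simple perceptron, a single toy feature (`length_ratio`), and a two-way split into "high" and "low" subsets are all assumptions, and every function name is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

SentencePair = Tuple[str, str]            # (source sentence, target sentence)
Feature = Callable[[SentencePair], float]


def length_ratio(pair: SentencePair) -> float:
    """Hypothetical feature: shorter token count divided by longer token count."""
    src_len, tgt_len = len(pair[0].split()), len(pair[1].split())
    longest = max(src_len, tgt_len)
    return min(src_len, tgt_len) / longest if longest else 0.0


def score(pair: SentencePair, features: Sequence[Feature],
          weights: Sequence[float]) -> float:
    """Linear quality model: weighted sum of the feature values."""
    return sum(w * f(pair) for w, f in zip(weights, features))


def train_weights(labeled: Sequence[Tuple[SentencePair, int]],
                  features: Sequence[Feature],
                  epochs: int = 20, lr: float = 0.1) -> Tuple[List[float], float]:
    """Assumed learner (perceptron) run on a small labeled corpus.

    Labels are 1 for a good sentence pair and 0 for a bad one.
    Returns the feature weights and the classification threshold.
    """
    weights = [0.0] * len(features)
    bias = 0.0
    for _ in range(epochs):
        for pair, label in labeled:
            values = [f(pair) for f in features]
            activation = sum(w * v for w, v in zip(weights, values)) + bias
            error = label - (1 if activation > 0 else 0)
            if error:
                weights = [w + lr * error * v for w, v in zip(weights, values)]
                bias += lr * error
    # "score + bias > 0" is the same test as "score > -bias".
    return weights, -bias


def classify(corpus: Sequence[SentencePair], features: Sequence[Feature],
             weights: Sequence[float],
             threshold: float) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Split the large corpus into a high-quality and a low-quality subset."""
    high: List[SentencePair] = []
    low: List[SentencePair] = []
    for pair in corpus:
        target = high if score(pair, features, weights) > threshold else low
        target.append(pair)
    return high, low
```

The weights and threshold are learned once on the small corpus and then reused unchanged on the large one, which is what lets the method avoid manual tuning as features are added.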

Abstract

The invention relates to a training corpus quality evaluation and selection method for statistical machine translation. The method comprises the following steps: automatic weight acquisition, in which a small-scale corpus is used to train an automatic weight acquisition model so as to obtain the feature weights and a classification threshold; sentence pair quality evaluation, in which the weights and the classification threshold are taken as input together with the original large-scale parallel corpus, and a linear sentence pair quality evaluation model classifies the large-scale parallel corpus into corpus subsets; and high-quality corpus subset selection, in which, on the basis of the corpus subsets and taking the influence of coverage into account, high-quality corpus is selected as training data for the statistical machine translation system. The method provides a richer set of sentence pair quality evaluation features and learns the feature weights automatically; when the selected subset reaches 30% of the corpus, performance can reach 100% of, or even exceed, that obtained with the full corpus. Any input sentence pair can be assigned a class, which helps tasks such as the selection of high-quality corpus data.
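
The coverage-aware selection step can be sketched as follows. The abstract says only that coverage is taken into account, so the formulation here is an assumption: a greedy loop that adds to each pair's quality score a bonus for the share of source words not yet seen. The name `select_subset` and the `coverage_bonus` parameter are hypothetical.

```python
from typing import List, Set, Tuple

ScoredPair = Tuple[float, str, str]   # (quality score, source, target)


def select_subset(scored: List[ScoredPair], fraction: float,
                  coverage_bonus: float = 1.0) -> List[ScoredPair]:
    """Greedily pick a fraction of the corpus, trading quality against coverage.

    At each step the pair with the highest
    quality + coverage_bonus * (share of its source words not yet covered)
    is chosen. This is quadratic in corpus size, so it is only an illustration.
    """
    if not scored:
        return []
    budget = max(1, int(len(scored) * fraction))
    remaining = list(scored)
    seen: Set[str] = set()            # source words covered so far
    chosen: List[ScoredPair] = []

    def gain(item: ScoredPair) -> float:
        quality, source, _ = item
        words = set(source.split())
        novelty = len(words - seen) / len(words) if words else 0.0
        return quality + coverage_bonus * novelty

    while remaining and len(chosen) < budget:
        best = max(remaining, key=gain)
        remaining.remove(best)
        chosen.append(best)
        seen.update(best[1].split())
    return chosen
```

With `fraction=0.3` the function returns a subset of the size the abstract reports results for. Setting `coverage_bonus=0.0` reduces it to plain top-scoring selection, which makes the effect of the coverage term easy to compare.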

Description

Technical field

[0001] The invention relates to statistical machine translation technology, in particular to a training corpus quality evaluation and selection method for statistical machine translation.

Background

[0002] Training a Statistical Machine Translation (SMT) system requires a large-scale bilingual parallel corpus, and both the quality and the quantity of the corpus strongly affect the performance of the machine translation system. In general, increasing the size of the training corpus helps to obtain stable model parameters and improves the translation performance of SMT systems. However, the larger the corpus, the higher the execution cost of the system and the longer training and decoding take. In addition, a larger corpus may contain more noisy data, which affects the reliability of system training to a certain extent.

[0003] Yao Shujie et al. (2010) proposed a method for selecting statistical m...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/28
Inventor: 朱靖波, 张浩, 肖桐, 李强
Owner: 沈阳雅译网络技术有限公司