
Method and system for screening parallel sentence pairs

A screening technology for parallel sentence pairs, applied in the field of machine translation, that addresses the problem that existing screening methods are unsuitable for noisy input corpora.

Active Publication Date: 2018-06-15
TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

[0003] Current parallel sentence pair screening methods rely on word alignment techniques. These techniques were not developed for screening parallel sentence pairs: they assume that the input corpora are mutual translations and are therefore unsuitable for noisy input corpora.




Embodiment Construction

[0057] To make the above purposes, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with each other.

[0058] Many specific details are set forth in the following description to provide a full understanding of the present invention. However, the present invention can also be implemented in ways other than those described here. Therefore, the protection scope of the present invention is not limited by the specific embodiments disclosed below.

[0059] The invention provides a screening method for parallel sentence pairs. As shown in Figure 1, the method includes:

[0060] Step S1, segmenting the source language sentence and the targe...



Abstract

The invention relates to a method and system for screening parallel sentence pairs. The method comprises the following steps: segmenting the source language sentence and the target language sentence of each sentence pair to be screened into words; determining the word vector of each resulting word with a bilingual word vector model; calculating the weight of each word in the source language sentence; calculating the weight of each word in the target language sentence; establishing an objective function, computing its optimal solution, and determining the minimum earth mover's distance of each sentence pair to be screened from that solution; and determining the screening standard from the minimum earth mover's distances of a plurality of sentence pairs to be screened, then screening parallel sentence pairs according to that standard. The method is designed specifically for screening parallel sentence pairs and does not assume that all corpora are mutual translations, so it can screen the large quantity of rough bilingual corpora available on the Internet and yield high-quality, reliable bilingual corpora.
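The steps in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the abstract does not say how word weights, the word-to-word cost, or the screening standard are computed, so the sketch assumes normalized term frequency as the weight, Euclidean distance between word vectors as the cost, and the mean distance over all pairs as a hypothetical screening threshold. The earth mover's distance is taken in its standard transportation-problem form and solved here by successive shortest paths on a small flow network. The names `VECTORS`, `PAIRS`, `word_weights`, `earth_movers_distance`, `sentence_distance` and `screen_pairs` are illustrative, and the toy vectors stand in for a trained bilingual word vector model.

```python
import math
from collections import Counter
from fractions import Fraction
from statistics import mean


def word_weights(tokens):
    # Assumption: weight = normalized term frequency (exact, via Fraction).
    counts = Counter(tokens)
    words = list(counts)
    return words, [Fraction(counts[w], len(tokens)) for w in words]


def earth_movers_distance(supply, demand, cost):
    """Minimum transport cost between two weight distributions that each
    sum to 1, solved as min-cost flow by successive shortest paths."""
    n, m = len(supply), len(demand)
    s, t, size = n + m, n + m + 1, n + m + 2
    graph = [[] for _ in range(size)]
    edges = []  # edge k is [head, residual capacity, cost]; k ^ 1 is its reverse

    def add_edge(u, v, cap, c):
        graph[u].append(len(edges))
        edges.append([v, cap, c])
        graph[v].append(len(edges))
        edges.append([u, Fraction(0), -c])

    for i in range(n):
        add_edge(s, i, supply[i], 0.0)
        for j in range(m):
            add_edge(i, n + j, Fraction(1), cost[i][j])
    for j in range(m):
        add_edge(n + j, t, demand[j], 0.0)

    total = 0.0
    while True:
        # Bellman-Ford, because residual edges carry negative costs.
        dist = [math.inf] * size
        prev = [None] * size
        dist[s] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k in graph[u]:
                    v, cap, c = edges[k]
                    if cap > 0 and dist[u] + c < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + c, k, True
            if not changed:
                break
        if prev[t] is None:  # no augmenting path left: all mass is moved
            return total
        # Find the bottleneck along the path, then push that much flow.
        push, v = None, t
        while v != s:
            k = prev[v]
            push = edges[k][1] if push is None else min(push, edges[k][1])
            v = edges[k ^ 1][0]
        v = t
        while v != s:
            k = prev[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            total += float(push) * edges[k][2]
            v = edges[k ^ 1][0]


def sentence_distance(src_tokens, tgt_tokens, vectors):
    src_words, supply = word_weights(src_tokens)
    tgt_words, demand = word_weights(tgt_tokens)
    # Assumption: cost = Euclidean distance in the shared vector space.
    cost = [[math.dist(vectors[a], vectors[b]) for b in tgt_words]
            for a in src_words]
    return earth_movers_distance(supply, demand, cost)


def screen_pairs(pairs, vectors):
    distances = [sentence_distance(src, tgt, vectors) for src, tgt in pairs]
    threshold = mean(distances)  # hypothetical screening standard
    kept = [p for p, d in zip(pairs, distances) if d <= threshold]
    return kept, distances, threshold


# Toy bilingual vectors (illustrative only, not from a trained model).
VECTORS = {"cat": (1.0, 0.0), "dog": (0.0, 1.0), "chat": (1.0, 0.0),
           "chien": (0.0, 1.0), "voiture": (3.0, 4.0)}
PAIRS = [(["cat", "dog"], ["chien", "chat"]),
         (["cat", "dog"], ["voiture"])]

if __name__ == "__main__":
    kept, distances, threshold = screen_pairs(PAIRS, VECTORS)
    print(distances, threshold, kept)
```

In this toy run the first pair has a distance of zero because every source word has an identically placed target word, while the second pair is far from its supposed translation and falls above the mean, so only the first pair is kept. For corpora of realistic size, a dedicated linear programming or optimal transport solver would likely be preferable to this pure-Python solver.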

Description

Technical field

[0001] The invention relates to the technical field of machine translation, in particular to a screening method and system for parallel sentence pairs.

Background

[0002] With the deepening of international exchanges, the demand for language translation is increasing day by day. However, the world has many languages, each with its own characteristics and flexible forms, which makes machine translation an unsolved problem. To realize automatic machine translation, current translation technology is generally based on statistical models, and building reliable statistical models requires large-scale, high-quality parallel corpora. However, high-quality parallel corpora often exist only for a small number of languages and are often restricted to specific domains, such as government documents and news. With the rise of the Internet, the exchange of international informatio...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/28
CPC: G06F40/58
Inventor: 孙茂松, 张檬, 刘洋, 栾焕博
Owner: TSINGHUA UNIV