Parallel corpus construction method and device

A parallel corpus construction method, applied in the field of machine translation, that addresses the small size and low domain coverage of existing parallel corpora, which restrict the effectiveness of machine translation models, and achieves the effect of expanding the scale of the parallel corpus.

Active Publication Date: 2015-11-18
TSINGHUA UNIV
Cites: 6; Cited by: 18

AI Technical Summary

Problems solved by technology

[0004] Existing parallel corpora are mostly obtained from parallel websites. Such corpora are small and have low domain coverage, which restricts further improvement of machine translation models.

Embodiment Construction

[0021] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the embodiments and accompanying drawings.

[0022] Existing parallel corpora are mostly obtained from parallel websites. Such corpora are small and have low domain coverage, which restricts further improvement of machine translation models. In view of this problem, the inventors have found in practice that bilingual non-parallel corpora are large and cover a rich range of domains; however, a non-parallel corpus is simply a pair of monolingual corpora in two languages, with no alignment relationship between the two languages. If more parallel phrase pairs can be trained from non-parallel corpora, the scale of the parallel corpus can be further expanded. Therefore, this application provides figure 1 The constructi...

Abstract

The present invention discloses a parallel corpus construction method and device. The method includes: determining a translation probability for each translation word pair in a parallel corpus, where each translation word pair comprises a source-language word and a corresponding target-language word; adding the translation word pairs and their translation probabilities to a translation probability table; matching phrases in a non-parallel corpus according to the translation probabilities, and determining the matched phrases as new parallel phrase pairs; and adding the new parallel phrase pairs to the parallel corpus. With this scheme, parallel phrase pairs can be trained from the non-parallel corpus, and the scale of the parallel corpus can be increased.
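The four steps in the abstract can be pictured with a minimal Python sketch. The abstract does not say how the translation probabilities are estimated or how phrases are scored, so everything concrete here is an assumption made for illustration: the relative co-occurrence frequency estimate, the averaged best-match phrase score, the `threshold` value, and all function names are hypothetical and are not taken from the patent.

```python
from collections import Counter, defaultdict
from itertools import product


def build_translation_table(parallel_pairs):
    """Steps 1-2: estimate p(target_word | source_word) and store it in a table.

    Assumption: probabilities are relative co-occurrence frequencies over
    aligned pairs; the patent may use a different estimator.
    """
    cooccurrence = defaultdict(Counter)
    for source_text, target_text in parallel_pairs:
        for s, t in product(source_text.split(), target_text.split()):
            cooccurrence[s][t] += 1
    table = {}
    for s, counts in cooccurrence.items():
        total = sum(counts.values())
        for t, count in counts.items():
            table[(s, t)] = count / total
    return table


def phrase_score(source_phrase, target_phrase, table):
    """Hypothetical score: for each source word take the best translation
    probability among the target words, then average over source words."""
    source_words = source_phrase.split()
    target_words = target_phrase.split()
    if not source_words or not target_words:
        return 0.0
    best = [max(table.get((s, t), 0.0) for t in target_words) for s in source_words]
    return sum(best) / len(source_words)


def match_phrases(source_phrases, target_phrases, table, threshold=0.5):
    """Step 3: pair each source phrase with its best-scoring target phrase,
    keeping the pair only if the score reaches the (assumed) threshold."""
    matches = []
    for sp in source_phrases:
        scored = [(phrase_score(sp, tp, table), tp) for tp in target_phrases]
        if not scored:
            continue
        score, best_target = max(scored)
        if score >= threshold:
            matches.append((sp, best_target))
    return matches


def expand_parallel_corpus(parallel_pairs, source_phrases, target_phrases, threshold=0.5):
    """Step 4: add the newly matched phrase pairs to the parallel corpus."""
    table = build_translation_table(parallel_pairs)
    new_pairs = match_phrases(source_phrases, target_phrases, table, threshold)
    return list(parallel_pairs) + [p for p in new_pairs if p not in parallel_pairs]
```

Called with a small seed corpus such as `[("red", "rouge"), ("house", "maison")]` and candidate phrases drawn from two unaligned monolingual texts, `expand_parallel_corpus` returns the seed pairs plus any candidate pairs whose score clears the threshold. Comparing every source phrase with every target phrase is quadratic, so a real system would need some form of candidate pruning.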

Description

Technical field

[0001] The invention relates to machine translation technology, and in particular to a method and device for constructing a parallel corpus.

Background technique

[0002] With the deepening of international exchanges, the demand for language translation is increasing. As the most convenient platform for obtaining information today, the Internet has an increasingly urgent need for online translation, and providing users with high-quality translation services has become a difficult problem. There are many languages on the Internet, each language contains many polysemous words, and languages change constantly, all of which places higher requirements on translation services.

[0003] Machine translation based on bilingual parallel corpora is currently the mainstream approach. A bilingual parallel corpus consists of two texts that are translations of each other, generally aligned with the sentence as the alignment unit.

[0004] Existing parallel corpora are bas...
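The sentence-aligned bilingual parallel corpus described in paragraph [0003] can be pictured as a list of aligned sentence pairs. The following is a minimal Python sketch under that reading; the names `AlignedSentence` and `build_parallel_corpus` are hypothetical and do not come from the patent.

```python
from typing import List, NamedTuple


class AlignedSentence(NamedTuple):
    """One alignment unit: a source sentence and its translation."""
    source: str
    target: str


def build_parallel_corpus(source_sentences: List[str],
                          target_sentences: List[str]) -> List[AlignedSentence]:
    """Pair sentences by position; both texts must have the same length,
    since sentence i of one text is the translation of sentence i of the other."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("a sentence-aligned corpus needs equally many sentences on both sides")
    return [AlignedSentence(s, t) for s, t in zip(source_sentences, target_sentences)]
```

A non-parallel corpus, by contrast, would simply be two independent lists of sentences with no positional correspondence between them, so this constructor would not apply to it.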

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/28, G06F17/27
Inventors: 刘洋, 董梅平, 孙茂松
Owner: TSINGHUA UNIV