Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system

A parallel corpus acquisition technology, applied in the fields of instruments, computing, and electrical digital data processing, that addresses problems such as the scarcity of corpus resources.

Status: Inactive
Publication date: 2012-07-18
FUJITSU LTD
Cites: 3; cited by: 25

AI Technical Summary

Problems solved by technology

With the solution provided by the embodiment of the present invention, a parallel corpus between two languages is obtained by way of a third language, thereby addressing the scarcity of corpus resources between the two languages.

Examples

Detailed Description of the Embodiments

[0028] Embodiments of the present invention will be described below with reference to the drawings.

[0029] When parallel corpus resources between two languages are insufficient, translation rules between them can be obtained indirectly by merging translation rules through an intermediate language. For example, suppose two translation models, M1 and M2, are known, where:

[0030] M1 is the translation model between the first language and the intermediate language;

[0031] M2 is the translation model between the intermediate language and the second language.

[0032] Both translation models, M1 and M2, contain a certain number of translation rules. A translation model in statistical machine translation consists mainly of four parts: first-language rules, second-language rules, alignment information, and rule probabilities. Figure 1 shows a sch...
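The merging idea in paragraphs [0029] to [0032] can be illustrated with a minimal sketch. The text available here does not give the merging formula, so the sketch assumes a common pivot-style approach: rules from M1 and M2 that share the same intermediate-language side are joined, their probabilities are multiplied and summed over the shared intermediate strings, and their alignments are composed. The names `Rule`, `compose`, and `merge_via_pivot` are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One translation rule with the four parts named in [0032]."""
    source: tuple          # source-side words
    target: tuple          # target-side words
    alignment: frozenset   # set of (source_index, target_index) links
    prob: float            # rule probability


def compose(a1, a2):
    """Compose source-to-pivot links with pivot-to-target links."""
    return frozenset((s, t) for s, p in a1 for q, t in a2 if p == q)


def merge_via_pivot(m1, m2):
    """Join M1 (first language -> pivot) with M2 (pivot -> second language).

    Assumption: probabilities are multiplied and summed over shared pivots.
    """
    # Index M2 by its pivot (source) side for fast lookup.
    by_pivot = defaultdict(list)
    for r2 in m2:
        by_pivot[r2.source].append(r2)

    probs = defaultdict(float)
    aligns = defaultdict(set)
    for r1 in m1:
        for r2 in by_pivot.get(r1.target, []):
            key = (r1.source, r2.target)
            probs[key] += r1.prob * r2.prob
            aligns[key] |= compose(r1.alignment, r2.alignment)

    return [Rule(s, t, frozenset(aligns[(s, t)]), p)
            for (s, t), p in probs.items()]
```

Indexing M2 by its intermediate-language side keeps the join close to linear in the number of rules, rather than comparing every rule in M1 against every rule in M2.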


Abstract

An embodiment of the invention discloses a method and a system for acquiring bilingual corpus resources. The method includes the steps of: acquiring a matching intermediate-language word string common to a first language database and a second language database; and forming a mutually translated text pair of a first language and a second language, where the text pair is used to build bilingual corpus resources for the first language and the second language. The first language database contains bilingual corpora of the first language and an intermediate language, and the second language database contains bilingual corpora of the second language and the intermediate language. With the scheme provided by the embodiment, bilingual corpora for the two languages are acquired by way of a third language, which addresses the scarcity of corpus resources between the two languages and allows high-quality translation rules to be acquired for building a statistical machine translation system.
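The two steps named in the abstract can be sketched as follows. This is only an illustration under stated assumptions: each language database is modeled as a list of (sentence, intermediate-language sentence) pairs, and "matching" is taken to mean exact equality after lowercasing and whitespace normalization. The abstract does not specify the matching criterion, and the names `normalize` and `build_text_pairs` are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (assumed matching criterion)."""
    return " ".join(text.lower().split())


def build_text_pairs(first_db, second_db):
    """Pair first- and second-language sentences that share a pivot string.

    first_db:  iterable of (first_language_sentence, pivot_sentence)
    second_db: iterable of (second_language_sentence, pivot_sentence)
    """
    # Index the second database by its normalized pivot sentence.
    index = defaultdict(list)
    for sent2, pivot in second_db:
        index[normalize(pivot)].append(sent2)

    # Every first-language sentence whose pivot matches yields a text pair.
    pairs = []
    for sent1, pivot in first_db:
        for sent2 in index.get(normalize(pivot), []):
            pairs.append((sent1, sent2))
    return pairs


first_db = [("你好", "Hello"), ("谢谢", "Thank you")]
second_db = [("こんにちは", "hello"), ("さようなら", "Goodbye")]
pairs = build_text_pairs(first_db, second_db)
```

In this toy input only the intermediate-language string "hello" occurs in both databases, so only the sentences attached to it are paired. A looser criterion, such as matching substrings rather than whole sentences, would require a different index.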

Description

Technical field

[0001] The present invention generally relates to the technical field of computer applications, and in particular to a method and system for acquiring parallel corpus resources.

Background

[0002] Machine translation, also known as automatic translation, is the process of using a computer to convert text in one natural language (the source language) into another natural language (the target language); it generally refers to the translation of sentences and full texts between natural languages. Statistical machine translation (SMT) is one type of machine translation and is among the better-performing methods for translation in unrestricted domains. Its basic idea is to perform statistical analysis on a certain amount of parallel corpus data (a bilingual corpus, also known as a bilingual translation corpus), train a statistical translation model on it, and then use that model for translation. At present, machine translation has grad...


Application Information

IPC(8): G06F17/28, G06F17/30
Inventors: 郑仲光, 何中军, 孟遥, 于浩
Owner: FUJITSU LTD