Method and device for aligning sentences in bilingual corpus

A sentence alignment technique for bilingual corpora, applied in the field of data processing, that addresses the high complexity, long running time, and low efficiency of existing approaches, with the aims of improving accuracy, simplifying the process, and improving alignment efficiency.

Inactive Publication Date: 2013-01-02
FUJITSU LTD
4 Cites, 41 Cited by

AI Technical Summary

Problems solved by technology

[0005] In short, prior-art approaches to aligning sentences in a bilingual corpus are complex and time-consuming to implement, which makes them very inefficient.



Embodiment Construction

[0029] Embodiments of the present invention will be described below with reference to the drawings.

[0030] Referring to Figure 1, the first method for sentence alignment of a bilingual corpus provided in an embodiment of the present invention may include:

[0031] S101: For each aligned block of the source language and the target language, generate a candidate translation pair list from the source keyword list and the target keyword list extracted from the source block and the target block, respectively. Each entry in the candidate translation pair list is a translation pair consisting of a source keyword and a target keyword.

[0032] In practical applications, the original corpus is often aligned with paragraphs or chapters as the smallest unit; these smallest alignment units are called "blocks" in the present invention. For example, within a block B, if a word a is a keyword in the source language F, then its translation b is likely to be a keyword in the target language E; keywords to gene...
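Step S101 can be illustrated with a minimal sketch. The text available here does not say how keywords are paired, so the sketch assumes the simplest reading: inside one aligned block, every source keyword is paired with every target keyword, and the pairs are then counted across blocks. The function names `candidate_pairs` and `count_candidates` and the toy keyword lists are hypothetical and are not taken from the patent.

```python
from collections import Counter
from itertools import product


def candidate_pairs(source_keywords, target_keywords):
    """Pair every source keyword with every target keyword of one aligned block.

    Assumption: the candidate list is the full cross product of the two
    keyword lists; the patent text shown here does not confirm this.
    """
    return list(product(source_keywords, target_keywords))


def count_candidates(blocks):
    """Count in how many blocks each (source, target) pair is proposed."""
    counts = Counter()
    for source_keywords, target_keywords in blocks:
        # set() so that a pair is counted at most once per block
        counts.update(set(candidate_pairs(source_keywords, target_keywords)))
    return counts


# Toy data (made up): each tuple is one aligned block,
# given as (source keyword list, target keyword list).
blocks = [
    (["chat", "noir"], ["cat", "black"]),
    (["chat", "dort"], ["cat", "sleeps"]),
]
counts = count_candidates(blocks)
print(counts[("chat", "cat")])    # proposed in both blocks
print(counts[("chat", "black")])  # proposed in one block only
```

Pairs that recur across many blocks, such as ("chat", "cat") in the toy data, are the ones a later probability-based filtering step would be expected to favor.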


Abstract

The embodiment of the invention discloses a method and a device for aligning sentences in a bilingual corpus, in which the source language corpus and the target language corpus are aligned at the block level. The method comprises the following steps: for each aligned block of the source language and the target language, generating a candidate translation pair list from a source keyword list and a target keyword list extracted from the source block and the target block, respectively; generating a bilingual dictionary according to the translation probability of each translation pair in the candidate translation pair list; expanding the bilingual dictionary by taking the source-target keyword pair in each entry of the bilingual dictionary as a seed translation pair and referring to the textual context of that seed translation pair; translating a source sentence in the source block into the target language and calculating the similarity between the translation result and a target sentence in the target block; and aligning the source sentence with the target sentence according to the similarity. The embodiment simplifies the sentence alignment flow and improves sentence alignment efficiency.
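The last two steps of the abstract (translate each source sentence, score it against the target sentences, align by similarity) can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: translation is a word-by-word dictionary lookup, similarity is Jaccard overlap of word sets, and each source sentence is greedily matched to its best-scoring target sentence above a threshold. The names `translate`, `similarity`, `align_block`, the `threshold` parameter, and the toy dictionary are all hypothetical.

```python
def translate(sentence, dictionary):
    """Word-by-word lookup; words missing from the dictionary are dropped."""
    return {dictionary[word] for word in sentence.split() if word in dictionary}


def similarity(translated_words, target_sentence):
    """Jaccard overlap between translated words and target-sentence words.

    Assumption: the patent text shown here does not name a similarity
    measure; Jaccard overlap is used only as a simple stand-in.
    """
    target_words = set(target_sentence.split())
    union = translated_words | target_words
    return len(translated_words & target_words) / len(union) if union else 0.0


def align_block(source_sentences, target_sentences, dictionary, threshold=0.2):
    """Match each source sentence to its most similar target sentence.

    Returns (source index, target index) pairs; a source sentence whose
    best score is below the threshold is left unaligned.
    """
    alignment = []
    for i, source in enumerate(source_sentences):
        translated = translate(source, dictionary)
        scores = [similarity(translated, target) for target in target_sentences]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            alignment.append((i, best))
    return alignment


# Toy data (made up) for one aligned block.
dictionary = {"chat": "cat", "noir": "black", "dort": "sleeps", "chien": "dog"}
source_sentences = ["le chat noir", "le chien dort"]
target_sentences = ["the dog sleeps", "the black cat"]
alignment = align_block(source_sentences, target_sentences, dictionary)
print(alignment)
```

Because the search is confined to one block, the number of sentence comparisons stays small, which is consistent with the efficiency goal the abstract states.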

Description

Technical field

[0001] The present invention generally relates to the technical field of data processing, and in particular to a method and device for aligning sentences in a bilingual corpus.

Background

[0002] At present, more and more statistical methods are applied in the field of natural language processing, so the role of corpora is becoming more and more important. Among them, a bilingual parallel corpus (abbreviated as bilingual corpus) is a corpus composed of two languages (referred to as source language F and target language E, respectively) in which the two sides are translations of each other at the sentence level. In many natural language processing tasks, such as statistical machine translation and cross-language retrieval, the bilingual corpus is an important source of knowledge. Therefore, the quantity and quality of the bilingual corpus largely affect or even determine the final results of related tasks.

[0003] In many cases, a large amount o...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/28
Inventor: 郑仲光, 孟遥, 于浩
Owner: FUJITSU LTD