Method for improving bilingual corpus, device for improving bilingual corpus, machine translation method and machine translation device

A bilingual corpus improvement and machine translation technology, applied in the field of natural language processing. It addresses the problems that existing bilingual corpus segmentation methods are complicated and do not improve word alignment quality, and that long sentences are segmented without considering context information. The effects are improved word alignment quality, fewer word alignment errors, and easy extensibility.

Active Publication Date: 2015-07-01
KK TOSHIBA

AI Technical Summary

Problems solved by technology

[0015] To address the problems in the prior art that, in the training stage, the bilingual corpus segmentation method is complicated and does not improve word alignment quality, the present invention proposes a new segmentation algorithm.
[0016] In addition, to address the problem in the prior art that context information is not considered when long sentences are segmented in the decoding stage, the present invention proposes using a conditional random field (CRF) model combined with sentence similarity to jointly segment long sentences, dividing them into shorter, relatively independent clauses that are easier to translate and understand.



Embodiment Construction

[0080] The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0081] Method for Improving a Bilingual Corpus

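[0082] This embodiment provides a method for improving a bilingual corpus, wherein the bilingual corpus includes a plurality of sentence pairs of a first language and a second language and word alignment information for each sentence pair. The method includes the following steps: extracting a split candidate from the word alignment information of a given sentence pair; calculating the split confidence of the split candidate; comparing the split confidence with a predetermined threshold; and, if the split confidence is greater than or equal to the threshold, splitting the given sentence pair at the split candidate.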
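The four steps above can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the function name `split_at_confident_candidates`, the `confidence` callable, and the choice to keep the splitting token with the preceding clause are all assumptions, since this extract does not describe how the split confidence is computed.

```python
from typing import Callable, List, Tuple

# A split candidate is an alignment link (source position, target position),
# using 1-based positions. This representation is an assumption.
Candidate = Tuple[int, int]
SentencePair = Tuple[List[str], List[str]]


def split_at_confident_candidates(
    source: List[str],
    target: List[str],
    candidates: List[Candidate],
    confidence: Callable[[List[str], List[str], Candidate], float],
    threshold: float,
) -> List[SentencePair]:
    """Split a sentence pair at each candidate whose confidence >= threshold."""
    pieces: List[SentencePair] = []
    src_start, tgt_start = 0, 0
    for s_pos, t_pos in sorted(candidates):
        # Assumption: ignore candidates that do not move both sides forward.
        if s_pos <= src_start or t_pos <= tgt_start:
            continue
        # Compare the (hypothetical) confidence score with the threshold.
        if confidence(source, target, (s_pos, t_pos)) >= threshold:
            # The splitting token stays with the clause it terminates.
            pieces.append((source[src_start:s_pos], target[tgt_start:t_pos]))
            src_start, tgt_start = s_pos, t_pos
    # Keep whatever remains after the last accepted split.
    if src_start < len(source) or tgt_start < len(target):
        pieces.append((source[src_start:], target[tgt_start:]))
    return pieces
```

Passing the scoring function in as a parameter keeps the sketch independent of any particular confidence measure, so a pair with no sufficiently confident candidate is simply returned unsplit.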

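[0083] A detailed description is given below with reference to Figure 1. Figure 1 is a flowchart of the method for improving a bilingual corpus according to this embodiment.

[0084] As shown in Figure 1, first, in step S101, a bilingual sentence pair is selected from the aligned bilingual corpus 10 that is to be improved. In this embodiment, the aligned bilingual corpus 10 includes a plurality of sentence pairs of a first language (source language) and a second language (target language), together with word alignment information for each sentence pair given by an automatic word alignment tool. The aligned bilingual corpus 10 is the word alignment result obtained by aligning a bilingual corpus with any word alignment tool known to those skilled in the art, for example the GIZA++ tool. The bilingual corpus is any bilingual corpus for SMT systems known to those skilled in the art. This embodiment places no restriction on the aligned bilingual corpus 10.

[0085] Next, in step S105, split candidates are extracted from the word alignment information of the selected bilingual sentence pair. The specific process is as follows.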

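[0086] Assume that the source-language sentence of the bilingual sentence pair has m words and the target-language sentence has l words, where m and l are natural numbers (the sentence formulas are missing from this extract).

[0087] The bidirectional word alignment result obtained from GIZA++ is:

[0088] $a_j = \langle s_j, t_j \rangle,\quad s_j \in [0, 1, \dots, m],\quad t_j \in [0, 1, \dots, l]$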

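[0089] In step S105, possible split candidates $a_j = \langle s_j, t_j \rangle$ are extracted. In this embodiment, a split candidate preferably satisfies the following conditions:

[0090] (1) the aligned pair is a one-to-one alignment, and

[0091] (2) the aligned items are words and/or symbols with a sentence-breaking function.

[0092] The symbols with a sentence-breaking function are preferably punctuation marks, preferably but not limited to: comma, period, semicolon, question mark, exclamation mark, and so on.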
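The two candidate conditions can be illustrated with a short sketch. The names `extract_split_candidates` and `BREAK_SYMBOLS` are hypothetical. The sketch also assumes that alignment links are 1-based position pairs, that position 0 denotes an unaligned (NULL) word, and that both sides of a link must be sentence-breaking symbols; none of these details is confirmed by this extract.

```python
from collections import Counter
from typing import List, Tuple

Link = Tuple[int, int]  # (source position, target position), 1-based

# Illustrative set of sentence-breaking punctuation (ASCII and full-width).
BREAK_SYMBOLS = {",", ".", ";", "?", "!", ",", "。", ";", "?", "!"}


def extract_split_candidates(
    source: List[str], target: List[str], links: List[Link]
) -> List[Link]:
    """Return links that are one-to-one and join two sentence-breaking symbols."""
    src_count = Counter(s for s, _ in links)
    tgt_count = Counter(t for _, t in links)
    candidates: List[Link] = []
    for s, t in links:
        # Assumption: position 0 marks a NULL alignment and is never a candidate.
        if s == 0 or t == 0:
            continue
        # Condition (1): each side takes part in exactly one link.
        one_to_one = src_count[s] == 1 and tgt_count[t] == 1
        # Condition (2): both aligned tokens have a sentence-breaking function.
        both_break = (
            source[s - 1] in BREAK_SYMBOLS and target[t - 1] in BREAK_SYMBOLS
        )
        if one_to_one and both_break:
            candidates.append((s, t))
    return candidates
```

Sentence-breaking words (as opposed to symbols) could be supported by extending the set, which is why it is kept as a separate constant.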

[0...


Abstract

According to one aspect, there is provided an apparatus for improving a bilingual corpus that includes a plurality of sentence pairs of a first language and a second language and word alignment information for each of the sentence pairs. The apparatus comprises: an extracting unit for extracting a split candidate from the word alignment information of a given sentence pair; a calculating unit for calculating the split confidence of said split candidate; a comparing unit for comparing said split confidence with a pre-set threshold; and a splitting unit for splitting said given sentence pair at said split candidate in a case that said split confidence is larger than said pre-set threshold.

Description

Technical field

[0001] This embodiment relates to natural language processing technology, and specifically to a method for improving a bilingual corpus, a device for improving a bilingual corpus, a machine translation method, and a machine translation device.

Background art

[0002] The translation of long sentences has long been a difficult problem in Statistical Machine Translation (SMT). When a sentence is too long, an SMT system usually has difficulty producing a correct translation, or cannot handle the sentence at all.

[0003] To avoid the difficulty of translating long sentences, long sentences are usually divided into shorter clauses before processing. Previous research shows that this is an effective approach, especially for spoken sentences with relatively simple structures: even simply concatenating the translations of the segmented clauses in order often gives better results...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/28
CPC: G06F17/289, G06F17/27, G06F40/45
Inventor: 苏韬, 张大鲲, 郝杰
Owner: KK TOSHIBA