Method and device for optimizing corpus

A corpus optimization technique, applied in the field of natural language processing, that addresses the problem of corpus noise degrading the accuracy of translation model and language model estimation, and achieves the effect of eliminating noise and improving corpus quality.

Active Publication Date: 2015-09-30
KK TOSHIBA

AI Technical Summary

Problems solved by technology

The larger a corpus is, the more noise it contains, and this noise reduces the estimation accuracy of the translation model and the language model.

Detailed Description of the Embodiments

[0074] Various preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0075] Method for Optimizing a Corpus

[0076] A detailed description is given below with reference to Figures 1 to 5.

[0077] Figure 1 is a flowchart of a method for optimizing a corpus according to an embodiment of the present invention.

[0078] As shown in Figure 1, this embodiment provides a method for optimizing a corpus, including: step S101, filtering the sentences in the above-mentioned corpus 10 based on optimization parameters to obtain sentence pairs to be optimized; step S105, replacing at least a part of the sentence pairs to be optimized; and step S110, calculating the perplexity of the sentence pair after replacement and, when the perplexity of the sentence pair after replacement is less than the perplexity of the sentence pair to be optimized, use the sentence pair after r...
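The three steps S101, S105 and S110 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `UnigramLM` class, the `optimize_corpus` function, the use of a perplexity threshold as the optimization parameter, the token lookup table used for replacement, and the choice to score a pair as the sum of its source and target perplexities are all hypothetical choices made for the example.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (a stand-in; the source does not name a model)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def perplexity(self, sentence):
        tokens = sentence.split()
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return math.exp(-log_prob / len(tokens))


def pair_perplexity(src, tgt, src_lm, tgt_lm):
    # Assumption: a pair is scored as the sum of both sides' perplexities.
    return src_lm.perplexity(src) + tgt_lm.perplexity(tgt)


def optimize_corpus(pairs, src_lm, tgt_lm, threshold, substitutions):
    """Filter (S101), replace (S105), keep the replacement only if perplexity drops (S110)."""
    result = []
    for src, tgt in pairs:
        before = pair_perplexity(src, tgt, src_lm, tgt_lm)
        # S101: only pairs above the (assumed) perplexity threshold are optimized.
        if before <= threshold:
            result.append((src, tgt))
            continue
        # S105: replace part of the pair; here, target tokens via a lookup table.
        new_tgt = " ".join(substitutions.get(t, t) for t in tgt.split())
        after = pair_perplexity(src, new_tgt, src_lm, tgt_lm)
        # S110: accept the replaced pair only when its perplexity is lower.
        result.append((src, new_tgt) if after < before else (src, tgt))
    return result


if __name__ == "__main__":
    src_lm = UnigramLM(["the cat sat", "the dog sat"])
    tgt_lm = UnigramLM(["le chat dort", "le chien dort"])
    noisy = [("the cat sat", "le chat dort"), ("the dog sat", "le chein dort")]
    print(optimize_corpus(noisy, src_lm, tgt_lm, 9.0, {"chein": "chien"}))
```

A replacement that raises perplexity is discarded and the original pair is kept, so in this sketch the procedure never makes a pair worse by its own measure.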

Abstract

The invention provides a method for optimizing a corpus and a device for optimizing the corpus. The device for optimizing the corpus comprises a filter unit, a replacement unit and a perplexity calculation unit; the filter unit performs filtration on sentences in the corpus on the basis of an optimization parameter to obtain sentence pairs to be optimized; the replacement unit performs replacement on at least part of the sentence pairs to be optimized; the perplexity calculation unit calculates the perplexity of the replaced sentence pairs, and the replaced sentence pairs can serve as optimization results of the sentence pairs to be optimized on the condition that the perplexity of the replaced sentence pairs is smaller than that of the sentence pairs to be optimized.
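For reference, the acceptance condition applied by the perplexity calculation unit can be written using the standard definition of perplexity. The symbols here are assumptions introduced for illustration and do not come from the patent text: $W = w_1 \dots w_N$ is a sentence of $N$ tokens, $P$ is the language model, $s$ is the sentence pair to be optimized and $s'$ is the pair after replacement.

```latex
\mathrm{PP}(W) = P(w_1, \dots, w_N)^{-\frac{1}{N}}
             = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_1, \dots, w_{i-1}) \right)

\text{accept } s' \text{ as the optimization result of } s
\quad \Longleftrightarrow \quad
\mathrm{PP}(s') < \mathrm{PP}(s)
```

Lower perplexity means the language model finds the text more predictable, which is why a drop in perplexity is treated as evidence that the replacement removed noise.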

Description

Technical field

[0001] The present invention relates to natural language processing technology and, in particular, to a method and a device for optimizing a corpus.

Background art

[0002] The performance of statistical machine translation depends largely on the quantity and quality of parallel corpora. On the one hand, the large-scale training data that is collected needs to be managed efficiently for different purposes. On the other hand, the larger the corpus, the more noise it contains, and this noise affects the estimation accuracy of the translation model and the language model. Filtering noise out of the training corpus is therefore a basic and important task. In this regard, the following methods exist in the prior art.

[0003] (1) An English-Chinese bilingual corpus filtering method, including the following steps: A. Determine the sentence length ratio feature value of the English-Chinese bilingual sentence pair; The quantity of the corresponding word matching in the bilin...
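Step A of the prior-art method (1) mentions a sentence length ratio feature. The source does not say how that value is computed, so the following is only one plausible sketch: it assumes English length is counted in whitespace-separated words, Chinese length in non-space characters, and the feature is the shorter length divided by the longer one. The function names `length_ratio` and `filter_by_length_ratio` and the `min_ratio` parameter are hypothetical.

```python
def length_ratio(english: str, chinese: str) -> float:
    """Symmetric length ratio in [0, 1]; 1.0 means equal lengths (assumed definition)."""
    en_len = len(english.split())                          # English words
    zh_len = sum(1 for ch in chinese if not ch.isspace())  # Chinese characters
    if en_len == 0 or zh_len == 0:
        return 0.0
    return min(en_len, zh_len) / max(en_len, zh_len)


def filter_by_length_ratio(pairs, min_ratio):
    """Keep only the pairs whose length ratio reaches the (assumed) minimum."""
    return [(en, zh) for en, zh in pairs if length_ratio(en, zh) >= min_ratio]


if __name__ == "__main__":
    pairs = [("I like tea", "我喜欢茶"), ("Hello", "这是一个非常长的句子")]
    print(filter_by_length_ratio(pairs, 0.5))
```

A pair whose two sides differ greatly in length is more likely to be misaligned, which is the intuition behind using this ratio as a noise filter.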

Application Information

IPC(8): G06F17/30, G06F17/28
Inventor: 狄慧, 张大鲲, 郝杰
Owner: KK TOSHIBA