Method and system for cleaning parallel corpus based on language model and translation model

Parallel-corpus and language-model technology, applied to natural language translation, natural language data processing, and special data processing applications, addresses problems such as reduced translation quality in machine translation models, inaccurate translations, and high time costs. It saves time and labor costs, improves corpus quality, and improves machine translation quality.

Inactive Publication Date: 2018-11-23
GLOBAL TONE COMM TECH

AI Technical Summary

Problems solved by technology

[0004] (1) Existing corpus-processing methods require a great deal of time to find problems manually.
[0005] (2) Existing corpus-processing methods cannot fix disfluent word order or inaccurate translation. These problems are common in most corpora and reduce the translation quality of machine translation models.



Examples


Embodiment Construction

[0072] To make the object, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the examples. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

[0073] The present invention uses a language model and a translation model to clean the corpus: the models score each sentence pair, low-scoring pairs are deleted, and the higher-quality parallel corpus is retained.
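The score-and-delete idea in [0073] can be sketched as a simple threshold filter. This is a minimal illustration, not the patented implementation: the function name `clean_corpus`, the `score_pair` callable, the threshold value, and the toy scores are all assumptions made for the example. In practice `score_pair` would wrap a trained language model or translation model.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]


def clean_corpus(
    pairs: Iterable[SentencePair],
    score_pair: Callable[[str, str], float],
    threshold: float,
) -> List[SentencePair]:
    """Keep only the sentence pairs whose model score reaches the threshold."""
    return [(src, tgt) for src, tgt in pairs if score_pair(src, tgt) >= threshold]


if __name__ == "__main__":
    # Hypothetical scores standing in for real model output.
    toy_scores = {
        ("guten Morgen", "good morning"): -0.4,
        ("guten Morgen", "click here to subscribe"): -7.9,
    }
    kept = clean_corpus(toy_scores, lambda s, t: toy_scores[(s, t)], threshold=-2.0)
    print(kept)
```

Passing the scorer as a callable keeps the filter independent of the model type, so the same function could serve both the language-model stage and the translation-model stage.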

[0074] The application principle of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0075] As shown in Figure 1, the method for cleaning a parallel corpus based on a language model and a translation model provided by the embodiment of the present invention includes the following steps:

[0076] S101: Corpus preprocessing mainly deals with bilingual par...


Abstract

The present invention belongs to the technical field of computer software and discloses a method and a system for cleaning parallel corpora based on a language model and a translation model. Corpus preprocessing mainly processes bilingual parallel corpora in multiple directions within the same language family; the parallel corpus is then screened using language models of the source language and the target language, and screened again using the translation model. The method and system use the language model and the translation model to clean large-scale bilingual corpora. Cleaning a parallel corpus with heuristic rules incurs high time and labor costs: a rule can be written only after a specific problem has been found, it handles only that problem, and disfluent word order and inaccurate translation cannot be addressed at scale. The language model and the translation model, by contrast, can address in a short time the problems that rules cannot, saving time and labor costs, cleaning the corpus, improving corpus quality, and effectively improving machine translation quality.
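The language-model screening stage described in the abstract can be sketched as follows. The source does not specify the language model type here, so this example assumes a toy add-one-smoothed unigram model over whitespace tokens; the names `UnigramLM`, `avg_logprob`, `screen_by_language_models`, and the threshold parameters are hypothetical. A real system would use a much stronger model trained on trusted monolingual data.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class UnigramLM:
    """Add-one-smoothed unigram model over whitespace tokens (toy stand-in)."""

    def __init__(self, sentences: Iterable[str]) -> None:
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        # One extra vocabulary slot is reserved for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def avg_logprob(self, sentence: str) -> float:
        """Average per-token log-probability; empty sentences score -inf."""
        tokens = sentence.split()
        if not tokens:
            return float("-inf")
        denom = self.total + self.vocab_size
        log_sum = sum(math.log((self.counts[t] + 1) / denom) for t in tokens)
        return log_sum / len(tokens)


def screen_by_language_models(
    pairs: Iterable[Tuple[str, str]],
    src_lm: UnigramLM,
    tgt_lm: UnigramLM,
    src_min: float,
    tgt_min: float,
) -> List[Tuple[str, str]]:
    """Keep pairs whose source AND target sides both look fluent enough."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if src_lm.avg_logprob(src) >= src_min and tgt_lm.avg_logprob(tgt) >= tgt_min
    ]


if __name__ == "__main__":
    src_lm = UnigramLM(["das ist gut", "das ist ein Haus"])
    tgt_lm = UnigramLM(["this is good", "this is a house"])
    candidates = [("das ist gut", "this is good"), ("das ist gut", "zzz qqq xxx")]
    print(screen_by_language_models(candidates, src_lm, tgt_lm, -2.0, -2.0))
```

Normalizing by sentence length keeps long sentences from being penalized simply for having more tokens. The translation-model stage would follow the same pattern, with a scorer that conditions the target sentence on the source sentence.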

Description

Technical field

[0001] The invention belongs to the technical field of computer software, and in particular relates to a method and system for cleaning parallel corpus based on a language model and a translation model.

Background

[0002] At present, the existing technologies commonly used in the industry are as follows: machine translation is a process of using machine learning technology to translate one natural language into another natural language. As an important branch of computational linguistics, it involves cognitive science, linguistics and other disciplines, and is one of the ultimate goals of artificial intelligence. The existing mainstream machine translation model is a neural network model using an encoding-decoding structure based on a self-attention mechanism, consisting of an encoder and a decoder. Both are dominated by the self-attention layer. The translation process mainly includes: first, map the input word to a high-dimensional vector space...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/28, G06F17/27
CPC: G06F40/289, G06F40/51, G06F40/58
Inventor: 贝超, 程国艮
Owner: GLOBAL TONE COMM TECH