Domain parallel corpus generation method and translation model training method

A parallel corpus and translation model technology, applied to domain parallel corpus generation and translation model training. It addresses problems such as low efficiency, scarce corpora, and difficult collection and processing, and it improves content quality, ensures correct translation of domain terms, and improves efficiency.

Pending Publication Date: 2022-07-12
10TH RES INST OF CETC

AI Technical Summary

Problems solved by technology

[0006] Previous studies on parallel corpus generation focused mainly on the scale and quality of the corpora and rarely addressed the generation of domain-specific parallel corpora. At the same time, because domain corpora are relatively difficult to collect and process, generating a parallel corpus for a specific domain often relies on manual translation. As a result, domain corpora are currently extremely scarce, and some domains have no corpus at all that meets the requirements of machine translation model training.
[0007] The existing technology has the following technical problems: 1) domain parallel corpora are scarce and cannot meet the needs of machine translation models; 2) existing domain parallel corpora have poor versatility; 3) the existing parallel corpus generation process does not guarantee correct translation of domain terms; 4) manually producing domain parallel corpora is costly and inefficient.


Examples


Embodiment 1

[0066] A method for generating a domain parallel corpus, comprising the following step:

[0067] A machine translation model is used to align the chapter-level corpus and the sentence-level corpus in the parallel corpus material library. After alignment, a chapter-level parallel corpus and a sentence-level parallel corpus are generated, and together they form the domain parallel corpus.

Embodiment 2

[0069] On the basis of Embodiment 1, the method includes the following sub-steps:

[0070] Use an open parallel corpus to initialize and train a supervised machine translation model;

[0071] Collect bilingual website content, parse each material's title, content, and reporting time to generate corpus material, and store it in the parallel corpus material library;

[0072] Chapter-level parallel corpus alignment sub-step: calculate the reporting time difference between an original-text material and a translated-text material in the parallel corpus material library, match the domain terms in the material titles, and compare the reporting time difference with a preset time-difference threshold. If the difference is less than the preset threshold, the initialized supervised machine translation model is used to compare the similarity of the two materials' title content; if the similarity is greater than the preset title-content similarity threshold, a judgment is made on the pair. It i...
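
The chapter-level alignment sub-step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Material` record, the `translate` callable (standing in for the initialized supervised machine translation model), both threshold values, and the use of `difflib.SequenceMatcher` as the title-similarity measure are all hypothetical. Because the source text is truncated, the sketch further assumes that a pair whose reporting times are too far apart is rejected and that a pair passing both checks is kept as a chapter-level candidate. Domain-term matching is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class Material:
    """One entry of the parallel corpus material library (hypothetical layout)."""
    title: str
    content: str
    reported_at: datetime


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the real measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_chapter_candidate(
    source: Material,
    target: Material,
    translate: Callable[[str], str],
    max_time_diff: timedelta = timedelta(days=2),  # assumed threshold
    min_title_sim: float = 0.6,                    # assumed threshold
) -> bool:
    # Step 1: compare reporting times. Assumption: pairs reported too far
    # apart are rejected without consulting the translation model.
    if abs(source.reported_at - target.reported_at) >= max_time_diff:
        return False
    # Step 2: translate the source title with the (stubbed) MT model and
    # compare it with the target title.
    translated_title = translate(source.title)
    return title_similarity(translated_title, target.title) > min_title_sim


# Toy usage: a dictionary lookup plays the role of the MT model.
toy_translate = lambda s: {"卫星发射成功": "Satellite launched successfully"}.get(s, s)
src = Material("卫星发射成功", "...", datetime(2022, 1, 1, 8, 0))
tgt = Material("Satellite launched successfully", "...", datetime(2022, 1, 1, 20, 0))
print(is_chapter_candidate(src, tgt, toy_translate))
```

Checking the time difference first keeps the translation model out of the loop for most non-matching pairs, which is one plausible reason for the ordering given in the text.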

Embodiment 3

[0075] On the basis of Embodiment 2, a method for training a translation model comprises the steps of: updating the machine translation model with the sentence-level parallel corpus generated by the method described in Embodiment 1, and then using the updated machine translation model to generate the domain parallel corpus. The generation of the domain parallel corpus and the updating of the machine translation model are repeated in a cycle, each feeding the other.
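
The cycle of corpus generation and model updating can be sketched as below. The names `generate_corpus` and `update_model`, the fixed number of rounds, and the treatment of the model as an opaque object are assumptions made for illustration; the source does not state a stopping criterion or how the update is performed.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
SentencePair = Tuple[str, str]


def self_training_loop(
    model: Model,
    generate_corpus: Callable[[Model], List[SentencePair]],
    update_model: Callable[[Model, List[SentencePair]], Model],
    rounds: int = 3,  # assumed: the source gives no stopping rule
) -> Tuple[Model, List[SentencePair]]:
    """Alternate corpus generation and model updating for a fixed number of rounds."""
    corpus: List[SentencePair] = []
    for _ in range(rounds):
        # Generate sentence-level parallel pairs with the current model.
        corpus = generate_corpus(model)
        # Update the model with the freshly generated pairs.
        model = update_model(model, corpus)
    return model, corpus


# Toy usage: the "model" is just a version counter.
final_model, final_corpus = self_training_loop(
    0,
    generate_corpus=lambda m: [("src", f"tgt-v{m}")],
    update_model=lambda m, pairs: m + 1,
)
print(final_model, final_corpus)
```

In a real system `generate_corpus` would run the alignment sub-steps of Embodiment 2 and `update_model` would fine-tune the supervised translation model; both are left as callables here so the loop itself stays independent of any particular toolkit.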


Abstract

The invention discloses a domain parallel corpus generation method and a translation model training method, and belongs to the field of machine translation in natural language processing. In the method, a machine translation model is used to align the chapter-level corpus and the sentence-level corpus in a parallel corpus material library; after alignment, a chapter-level parallel corpus and a sentence-level parallel corpus are generated, and together they form the domain parallel corpus. The method can generate domain parallel corpora and lets the supervised machine translation model update itself. It is general-purpose, improves the content quality of the domain parallel corpus, ensures correct translation of domain terms during translation, reduces cost, and is suitable for wide application. Domain parallel corpus generation and supervised machine translation form a self-sustaining cycle, and efficiency is improved.

Description

Technical field

[0001] The invention relates to the field of machine translation in natural language processing, and more particularly to a method for generating domain parallel corpora and a method for training translation models.

Background technique

[0002] Machine translation belongs to computational linguistics. It studies techniques for translating text from one natural language into another by means of computer programs, namely machine translation models. There are two types of machine translation models: supervised and unsupervised. With parallel corpus generation technology, a supervised translation model can perform more complex automatic text translation and can handle differing grammatical structures, vocabulary recognition, and idiom correspondence.

[0003] A parallel corpus is a collection of texts placed side by side with their translations. Parallel text alignment technology refers to the technology of determining t...


Application Information

IPC(8): G06F40/58, G06F40/30
CPC: G06F40/58, G06F40/30
Inventor: 杨露, 黄细凤, 代翔
Owner 10TH RES INST OF CETC