Experimental method for verifying influence of common sub-words on XLM translation model effect

An experimental method for translation models, applied in the field of natural language processing, that addresses poor translation results and aims to improve machine translation performance.

Active Publication Date: 2021-05-28
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

Low-resource neural machine translation works well for cognate language pairs such as English-French and English-German, but performs poorly for non-cognate language pairs such as Chinese-English.



Examples


Embodiment 1

[0026] Embodiment 1: As shown in Figures 1-3, the experimental method for verifying the influence of common subwords on the effect of the XLM translation model includes:

[0027] Step 1. Preprocess the corpus used to pre-train the XLM translation model;

[0028] Step 2. Verify whether the performance of the XLM translation model degrades: pre-train the XLM model on the preprocessed corpus, initialize the translation model with the pre-trained model, and observe the BLEU value of the new translation model.
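The BLEU value observed in Step 2 can be illustrated with a minimal sketch. This is a simplified corpus-level BLEU (one reference per sentence, whitespace tokenization, uniform weights up to 4-grams, no smoothing); the function names `ngrams` and `corpus_bleu` are hypothetical, and the source does not say which scorer was used. In practice an established scoring tool would normally be used instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (assumption: one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Toy sentences, made up for illustration only.
score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
```

Comparing this score between a model pre-trained on the original corpus and one pre-trained on the corpus with separated subwords is what indicates whether performance has degraded.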

[0029] The preprocessing in Step 1 includes the following:

[0030] First, obtain the subwords common to the English and French subword vocabularies, together with the frequencies of all subwords; then randomly separate the common subwords according to the separation ratio; then read the vocabularies of all English and French subwords and save them in a dictionary for subsequently generating the separated-subword file; use the generated separated-subword file to initialize the dictionary, and ...
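A minimal sketch of the first preprocessing steps, under stated assumptions: the corpora are already segmented into subwords separated by spaces, and "separating" a common subword is interpreted here as giving it a language-specific form in one language so that the two languages no longer share it. The source does not specify the mechanism; the helper names, the `fr_` prefix, and the toy sentences are all hypothetical.

```python
import random
from collections import Counter


def count_subwords(lines):
    """Count whitespace-separated subword tokens in a segmented corpus."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def choose_separated(common, ratio, seed=0):
    """Randomly pick a `ratio` fraction of the common subwords to separate."""
    rng = random.Random(seed)
    ordered = sorted(common)  # sort first so the sample is reproducible
    k = round(len(ordered) * ratio)
    return set(rng.sample(ordered, k))


def separate_corpus(lines, separated, prefix):
    """Rewrite a corpus so the selected subwords get a language-specific form.

    Assumption: prefixing is only one possible way to make a subword unshared.
    """
    return [
        " ".join(prefix + tok if tok in separated else tok for tok in line.split())
        for line in lines
    ]


# Toy segmented data, made up for illustration only.
en_lines = ["the na@@ tion is large", "a table"]
fr_lines = ["la na@@ tion est grande", "une table"]

en_counts = count_subwords(en_lines)
fr_counts = count_subwords(fr_lines)
common = set(en_counts) & set(fr_counts)  # subwords shared by both languages

separated = choose_separated(common, ratio=0.5, seed=0)
fr_separated = separate_corpus(fr_lines, separated, prefix="fr_")
```

Varying `ratio` from 0 to 1 would produce corpora with progressively fewer shared subwords, which is the variable this experiment controls.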


Abstract

The invention relates to an experimental method for verifying the influence of common subwords on the effect of an XLM translation model. The method comprises the following steps: preprocessing the corpus used to pre-train the XLM translation model; and verifying whether the performance of the XLM translation model degrades, by pre-training the XLM model on the preprocessed corpus, initializing the translation model with the pre-trained model, and observing the BLEU value of the new translation model. The preprocessing comprises the following steps: first, obtaining the subwords common to the English and French subword vocabularies and the frequencies of all subwords; then randomly separating the common subwords according to the separation ratio; reading the vocabularies of all English and French subwords and storing them in a dictionary for subsequently generating the separated-subword file; initializing a dictionary with the generated separated-subword file; and finally constructing the model corpus file with the initialized dictionary. The method verifies the influence of common subwords on the BLEU value and supports research on low-resource neural machine translation of non-cognate languages.

Description

Technical field

[0001] The invention relates to an experimental method for verifying the influence of common subwords on the effect of an XLM translation model, and belongs to the technical field of natural language processing.

Background art

[0002] Machine translation is one of the tasks in the field of natural language processing. It is widely used and has great research and commercial value. The emergence of neural machine translation has greatly advanced the field. Because neural machine translation requires a large amount of parallel corpus, the development of low-resource neural machine translation is particularly important. Low-resource neural machine translation works well for cognate language pairs such as English-French and English-German, but performs poorly for non-cognate language pairs such as Chinese-English. In order to analyze the reasons for the degradation of Chinese-English pairs in translation models...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/226, G06F40/242, G06F40/284, G06F40/58
CPC: G06F40/226, G06F40/242, G06F40/284, G06F40/58, Y02D10/00
Inventor: 余正涛, 杨晓霞, 吴霖, 朱俊国, 王振晗, 文永华
Owner: KUNMING UNIV OF SCI & TECH