A method for constructing a test set for text-level English-to-Chinese machine translation

A test-set construction method for machine translation, applied to building text-level (discourse-level) English-to-Chinese machine translation test sets. It addresses the lack of evaluation metrics designed for discourse-level linguistic phenomena.

Active Publication Date: 2022-07-19
TIANJIN UNIV

AI Technical Summary

Problems solved by technology

[0004] Most existing evaluation metrics are automatic metrics. When computing their scores, most of them consider only linguistic phenomena inside a sentence, so they are better suited to evaluating sentence-internal phenomena. No metrics are designed specifically for discourse-level linguistic phenomena.

Examples

Example 1

[0088] Source language data:

[0089] Previous sentence: You rich guys think that money can buy anything.

[0090] Current sentence: How right you are.

[0091] Target language data:

[0092] Previous sentence: You rich people always think that money can buy everything.

[0093] Current sentence: You are so right.

[0094] The discourse-level connective test set requires that the current sentence in the source language data contain one of five discourse-level connectives, "as", "or", "while", "since", and "though", and that the part-of-speech tag of the connective be one of "CC", "IN", and "WRB". Because Chinese expresses discourse-level connectives in diverse ways, sentence pairs whose target side contains the corresponding meaning are first filtered automatically. Sentence pairs that satisfy every condition on the source language data but whose target language data does not contain the corresponding connective are then checked manually, to check whether the informat...
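The source-side condition in [0094] can be sketched as a small filter. This is a minimal illustration rather than the patent's implementation: it assumes the sentence has already been tokenized and tagged with Penn Treebank style part-of-speech tags, and the function names `find_connectives` and `is_connective_candidate` as well as the hand-assigned tags in the sample are hypothetical.

```python
# Connectives and part-of-speech tags listed in paragraph [0094].
CONNECTIVES = {"as", "or", "while", "since", "though"}
CONNECTIVE_TAGS = {"CC", "IN", "WRB"}


def find_connectives(tagged_sentence):
    """Return the (token, tag) pairs that qualify as discourse-level connectives.

    tagged_sentence is a list of (token, tag) pairs; tagging is assumed to
    have been done beforehand by whatever tagger the pipeline uses.
    """
    return [
        (token, tag)
        for token, tag in tagged_sentence
        if token.lower() in CONNECTIVES and tag in CONNECTIVE_TAGS
    ]


def is_connective_candidate(tagged_current_sentence):
    """True if the current source sentence meets the source-side condition."""
    return bool(find_connectives(tagged_current_sentence))


# Hand-tagged sample for illustration only; the tags are an assumption.
sample = [
    ("While", "IN"), ("for", "IN"), ("others", "NNS"), ("it", "PRP"),
    ("is", "VBZ"), ("all", "DT"), ("child's", "NN"), ("play", "NN"), (".", "."),
]

if __name__ == "__main__":
    print(find_connectives(sample))
```

Checking the tag as well as the surface form keeps out non-connective uses of the same word, for example "as" tagged as an adverb. The automatic and manual checks on the Chinese side are not covered by this sketch.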

Example 2

[0097] Source language data:

[0098] Previous sentence: Everything is so difficult in life, for me.

[0099] Current sentence: While for others it’s all child’s play.

[0100] Target language data:

[0101] Previous sentence: For me, everything in life is very difficult.

[0102] Current sentence: To others, it is like child's play.

[0103] The omission test set first uses character matching to select the sentence pairs whose current sentence in the source language data contains an auxiliary verb, namely "do", "does", "can", "could", "should", "is", "am", "are", or "may". The previous sentence of the source language data must then contain a verb, that is, a word whose part of speech is "VC", "VE", or "VV". Next, the current sentence in the target language data is checked to confirm that its verb is consistent with the verb in the previous sentence. Finally, a certain number of test cases are selected to form the omission test set.
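The first two screening conditions in [0103] can be sketched as follows. This is a hedged illustration under stated assumptions: auxiliaries are matched on whole lowercase words with a regular expression (one plausible reading of "character matching", chosen so that "is" does not match inside "this"), the previous sentence is assumed to arrive already tagged, and the function names and the sample tagging are hypothetical. The tag names resemble the verb tags of the Chinese Penn Treebank, so the sample uses a Chinese sentence.

```python
import re

# Auxiliary verbs and verb tags listed in paragraph [0103].
AUXILIARIES = {"do", "does", "can", "could", "should", "is", "am", "are", "may"}
VERB_TAGS = {"VC", "VE", "VV"}


def contains_auxiliary(sentence):
    """True if the sentence contains one of the listed auxiliaries as a word."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(word in AUXILIARIES for word in words)


def previous_has_verb(tagged_previous_sentence):
    """True if any (token, tag) pair in the previous sentence carries a verb tag."""
    return any(tag in VERB_TAGS for _, tag in tagged_previous_sentence)


def is_omission_candidate(current_sentence, tagged_previous_sentence):
    """Combine the auxiliary condition and the previous-sentence verb condition."""
    return contains_auxiliary(current_sentence) and previous_has_verb(
        tagged_previous_sentence
    )


# Illustrative input; the segmentation and tags are assumptions.
tagged_previous = [("她", "PN"), ("不", "AD"), ("知道", "VV")]

if __name__ == "__main__":
    print(is_omission_candidate("Neither do I.", tagged_previous))
```

The verb-consistency check between the current and previous target sentences is left out of the sketch.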

[0104] Then check the v...

Example 3

[0106] Source language data:

[0107] Previous sentence: You see, she doesn't know.

[0108] Current sentence: Neither do I.

[0109] Target language data:

[0110] Previous sentence: Look, she doesn't know.

[0111] Current sentence: I don't know either.

[0112] Step 4: Manually check the selected test cases to correct translation errors.

[0113] Table 1: BLEU automatic scoring results

[0114]
Model      Pronoun   Discourse-level connectives   Omission
thumb      12.4      9.8                           18.2
CADec      19.1      15.3                          25.5
bert-nmt   13.9      12.7                          19.1
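The scores in Table 1 can be compared programmatically. The short sketch below only re-enters the figures from the table and sorts the models per phenomenon; the dictionary layout, the key names, and the helper `rank_models` are assumptions made for illustration.

```python
# BLEU scores copied from Table 1; keys are shorthand labels for the columns.
BLEU_SCORES = {
    "thumb": {"pronoun": 12.4, "connective": 9.8, "omission": 18.2},
    "CADec": {"pronoun": 19.1, "connective": 15.3, "omission": 25.5},
    "bert-nmt": {"pronoun": 13.9, "connective": 12.7, "omission": 19.1},
}


def rank_models(scores, phenomenon):
    """Return model names sorted from highest to lowest BLEU for one phenomenon."""
    return sorted(scores, key=lambda model: scores[model][phenomenon], reverse=True)


if __name__ == "__main__":
    for phenomenon in ("pronoun", "connective", "omission"):
        print(phenomenon, rank_models(BLEU_SCORES, phenomenon))
```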

[0115] Table 1 shows that, in terms of BLEU (Bilingual Evaluation Understudy), the CADec (context-aware decoder) model scores highest on all three linguistic phenomena, indicating that it translates the three discourse-level phenomena best. The BLEU value of the bert-nmt (neural machine translation with BERT fusion) model ran...

Abstract

The invention discloses a method for constructing a discourse-level English-to-Chinese machine translation test set. The method acquires discourse-level English text data and the corresponding Chinese text data that exhibit the cohesive devices of reference, conjunction, and omission; filters the data so that the text contains only English and Chinese words; takes the English text data as the source language data and the Chinese text data as the target language data; uses ambiguous pronouns, polysemous conjunctions, and auxiliary verbs as search parameters to search the source language data, and checks and corrects the target language data; performs word segmentation and part-of-speech tagging on the checked and corrected data in both languages to produce a candidate data set; and sets screening parameters for each phenomenon, selecting from the candidate data set the source language data and its corresponding target language data to form a reference test set, a discourse-level connective test set, and an omission test set. The invention can be used to test and evaluate the discourse-level translation capabilities of different machine translation models.
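The examples on this page pair a previous sentence and a current sentence in each language, and the abstract describes three resulting test sets. One possible way to hold such data is sketched below. The class `DiscourseTestCase`, its field names, the phenomenon labels, and the helper `group_by_phenomenon` are all hypothetical; the abstract does not prescribe a storage format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscourseTestCase:
    """One test case: a context sentence and a current sentence in both languages."""

    phenomenon: str  # assumed labels: "reference", "connective", "omission"
    src_previous: str
    src_current: str
    tgt_previous: str
    tgt_current: str


def group_by_phenomenon(cases):
    """Split selected cases into one test set (list) per linguistic phenomenon."""
    test_sets = defaultdict(list)
    for case in cases:
        test_sets[case.phenomenon].append(case)
    return dict(test_sets)


# Example 3 as listed on this page; the target side is shown there in English gloss.
example_3 = DiscourseTestCase(
    phenomenon="omission",
    src_previous="You see, she doesn't know.",
    src_current="Neither do I.",
    tgt_previous="Look, she doesn't know.",
    tgt_current="I don't know either.",
)

if __name__ == "__main__":
    print(group_by_phenomenon([example_3]))
```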

Description

Technical field

[0001] The invention relates to the field of machine translation, in particular to a method for constructing a machine translation test set for discourse-level English-to-Chinese translation.

Background

[0002] As machine translation technology gradually improves, more and more research addresses machine translation in practical applications, and the research focus of the field is gradually shifting from the sentence level to the discourse level. Compared with sentence-level machine translation, discourse-level machine translation covers a wider span of text and has to account for more problems and phenomena, so it is more difficult.

[0003] Alongside the study of how machine translation models can further improve their translation capabilities, how to evaluate those capabilities more reasonably has also become a concern of researchers. The text-level machine translation mo...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/58, G06F40/44, G06F40/289
CPC: G06F40/58, G06F40/44, G06F40/289
Inventor: 蔡心怡, 熊德意
Owner: TIANJIN UNIV