Synthetic corpus generation method and device

A synthetic corpus generation technology, applied in the fields of natural language processing and artificial intelligence. It addresses the low text quality and lexical diversity of existing corpora, and the resulting small improvement in the translation quality of the NMT model, by producing a corpus with high text quality that improves translation quality.

Pending Publication Date: 2022-06-21
INDUSTRIAL AND COMMERCIAL BANK OF CHINA

AI Technical Summary

Problems solved by technology

The text quality and lexical diversity of corpora obtained through existing back-translation methods are low, so NMT models trained on them show only a small improvement in translation quality.



Embodiment Construction

[0039] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

[0040] It should be noted that the method and device for generating a synthetic corpus disclosed in this application can be used in the field of artificial intelligence technology, and can also be used in any other field. Their field of application is not limited.

[0041] In order to facilitate the unde...


Abstract

An embodiment of the invention provides a synthetic corpus generation method and device that can be used in the technical field of artificial intelligence. The method comprises the following steps: evaluating the quality of the candidate statements in a training data set against the seed statements in a preset seed data set, and generating a quality score for each candidate statement; generating a comprehensive score for each candidate statement from its quality score and a pre-calculated relevancy score between the candidate statement and the seed statement; and generating the synthetic corpus according to the comprehensive scores of the candidate statements corresponding to each seed statement. The resulting synthetic corpus has relatively high text quality and relatively high vocabulary diversity, so that the translation quality of the NMT model is greatly improved.
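The three steps in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not say how the quality score is computed or how the two scores are combined, so the token-length-ratio quality measure, the weighted sum with weight `alpha`, the top-`k` selection, and all function names here are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def length_ratio_quality(candidate: str, seed: str) -> float:
    """Hypothetical quality measure: token-length ratio between candidate and seed.

    The source only says quality is evaluated against the seed statement;
    this particular measure is an assumption for illustration.
    """
    c, s = len(candidate.split()), len(seed.split())
    if c == 0 or s == 0:
        return 0.0
    return min(c, s) / max(c, s)


def comprehensive_score(quality: float, relevancy: float, alpha: float = 0.5) -> float:
    """Combine quality and relevancy. The weighted sum is an assumption."""
    return alpha * quality + (1.0 - alpha) * relevancy


def build_synthetic_corpus(
    seeds: List[str],
    candidates_by_seed: Dict[str, List[Tuple[str, float]]],
    top_k: int = 1,
    alpha: float = 0.5,
) -> List[str]:
    """Keep the top_k candidates per seed, ranked by comprehensive score.

    candidates_by_seed maps each seed statement to (candidate, relevancy)
    pairs, where relevancy is the pre-calculated relevancy score.
    """
    corpus: List[str] = []
    for seed in seeds:
        scored = []
        for text, relevancy in candidates_by_seed.get(seed, []):
            quality = length_ratio_quality(text, seed)
            scored.append((comprehensive_score(quality, relevancy, alpha), text))
        # Highest comprehensive score first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        corpus.extend(text for _, text in scored[:top_k])
    return corpus


if __name__ == "__main__":
    seed = "the bank approved the loan"
    candidates = {
        seed: [
            ("the bank approved a loan", 0.9),
            ("loan", 0.9),
            ("the bank approved the new loan today", 0.2),
        ]
    }
    print(build_synthetic_corpus([seed], candidates, top_k=2))
```

Keeping the scoring functions separate from the selection loop makes it straightforward to swap in a different quality measure or combination rule without touching the ranking logic.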

Description

Technical field

[0001] The present invention relates to the technical field of natural language, in particular to the technical field of artificial intelligence, and in particular to a method and device for generating a synthetic corpus.

Background technique

[0002] At present, machine translation technology is widely used in many fields such as finance. The source language can be translated into the target language using a neural machine translation (NMT) model. The NMT model needs to be trained through a large number of data sets, namely parallel corpora, but it is difficult to obtain a large-scale parallel corpus in some fields, so that the NMT model cannot meet the training requirements. In the related art, a back-translation method is usually used to generate a pseudo-parallel corpus for training an NMT model. However, the text quality and lexical diversity of the corpus obtained by the existing back-translation methods are not high, so the improvement of the transla...


Application Information

IPC(8): G06F40/51, G06F40/58
CPC: G06F40/51, G06F40/58
Inventor 张磊
Owner INDUSTRIAL AND COMMERCIAL BANK OF CHINA