
Parallel sentence pair extraction method based on pre-training language model and bidirectional interactive attention

A pre-trained language model and parallel sentence pair extraction technology, applied to neural learning methods, biological neural network models, semantic analysis, etc.; it addresses problems such as the lack of training data and achieves the effect of improving prediction results.

Pending Publication Date: 2022-01-07
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] The invention provides a parallel sentence pair extraction method based on a pre-trained language model and bidirectional interactive attention. It extracts bilingual parallel sentences with consistent deep semantics from a comparable corpus in order to expand the bilingual parallel corpus, thereby alleviating the lack of training data for low-resource language pairs and improving the prediction of parallel sentence pairs.

Method used



Examples


Embodiment 1

[0046] Embodiment 1: As shown in Figure 1, the parallel sentence pair extraction method based on a pre-trained language model and bidirectional interactive attention comprises the following specific steps:

[0047] Step1. Collect and construct Chinese-Vietnamese parallel data through web crawler technology, obtain non-parallel data through negative sampling, and manually annotate the data to obtain a Chinese-Vietnamese comparable corpus data set. The main sources of Chinese-Vietnamese parallel data include Wikipedia, bilingual news websites, movie subtitles, and more.

[0048] As a further solution of the present invention, the specific steps of Step1 are as follows (an illustrative negative-sampling sketch follows these sub-steps):

[0049] Step1.1. Obtain Chinese-Vietnamese parallel data through web crawler technology; data sources include Wikipedia, bilingual news websites, movie subtitles, etc.;

[0050] Step1.2. After cleaning and aligning the crawled data, they are used as positive samples for model training. In order to main...
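To make the negative sampling in Step1 concrete, the following is a minimal sketch of building a labeled comparable-corpus data set from aligned Chinese-Vietnamese sentence lists by pairing each Chinese sentence with randomly chosen non-corresponding Vietnamese sentences. The function name, the 3:1 negative-to-positive ratio, and the random-misalignment strategy are illustrative assumptions, not details taken from the patent.

```python
import random

def build_comparable_dataset(zh_sents, vi_sents, neg_per_pos=3, seed=13):
    """Return (zh, vi, label) examples from aligned sentence lists.

    zh_sents[i] and vi_sents[i] are assumed to be translations of each other
    (positive pairs, label 1). Negative pairs (label 0) are produced by
    negative sampling: pairing a Chinese sentence with a randomly chosen
    non-corresponding Vietnamese sentence.
    """
    assert len(zh_sents) == len(vi_sents) > 1
    rng = random.Random(seed)
    n = len(zh_sents)
    examples = []
    for i, (zh, vi) in enumerate(zip(zh_sents, vi_sents)):
        examples.append((zh, vi, 1))              # parallel (positive) pair
        for _ in range(neg_per_pos):              # sample non-parallel pairs
            j = rng.randrange(n)
            while j == i:
                j = rng.randrange(n)
            examples.append((zh, vi_sents[j], 0))
    rng.shuffle(examples)
    return examples
```

The resulting labeled pairs would then be split for training and evaluation of the classifier described in the abstract.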



Abstract

The invention relates to a parallel sentence pair extraction method based on a pre-trained language model and bidirectional interactive attention, and belongs to the field of natural language processing. The method comprises the following steps: constructing a Chinese-Vietnamese comparable corpus data set; obtaining bilingual representations of the source language and the target language through a pre-trained language model, and then achieving spatial semantic alignment of cross-language features with a bidirectional interactive attention mechanism; and finally judging the relation between cross-language sentence pairs from the semantic representation obtained after multi-view feature fusion, and extracting parallel sentence pairs according to deep semantic consistency. Experimental results show that the method can effectively recognize semantically consistent bilingual parallel sentences in noisy data, and that the extracted bilingual parallel sentences provide support for subsequent machine translation.
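As a reading aid for the abstract, here is a minimal PyTorch sketch of the kind of bidirectional interactive attention and multi-view feature fusion described above. It assumes the source and target sentences have already been encoded into contextual token embeddings by a pre-trained multilingual language model; the hidden size, mean pooling, and the particular fusion views (concatenation, absolute difference, element-wise product) are assumptions for illustration, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiInteractiveAttentionClassifier(nn.Module):
    """Illustrative sketch: source/target token embeddings attend to each
    other in both directions, several fused views are pooled, and a
    classifier scores whether the sentence pair is parallel."""

    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, src, tgt):  # src: (B, Ls, H), tgt: (B, Lt, H)
        # Bidirectional interactive attention via a shared alignment matrix.
        scores = torch.bmm(src, tgt.transpose(1, 2))                    # (B, Ls, Lt)
        src_aligned = torch.bmm(F.softmax(scores, dim=-1), tgt)         # target-informed source view
        tgt_aligned = torch.bmm(F.softmax(scores.transpose(1, 2), dim=-1), src)  # source-informed target view
        # Multi-view fusion: mean-pool each view, then combine several views.
        s = src_aligned.mean(dim=1)
        t = tgt_aligned.mean(dim=1)
        fused = torch.cat([s, t, torch.abs(s - t), s * t], dim=-1)
        return self.classifier(fused)                                   # (B, 2) logits
```

In the described pipeline, `src` and `tgt` would be the pre-trained language model's token-level representations of the Chinese and Vietnamese sentences, and the two-class output corresponds to the parallel / non-parallel relation judgment.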

Description

Technical field

[0001] The invention relates to a parallel sentence pair extraction method based on a pre-trained language model and bidirectional interactive attention, and belongs to the field of natural language processing.

Background technique

[0002] The performance of neural machine translation relies on large amounts of high-quality parallel data. For mainstream language pairs such as German-English and French-English, resource-rich parallel corpora already exist to support academic research, and the performance of machine translation for these languages has recently improved remarkably, approaching the quality of human translation. However, for a large number of non-mainstream language pairs, the performance of machine translation is severely restricted by the lack of large-scale, high-quality parallel sentence pair resources.

[0003] Parallel sentence pair extraction is based on semantic similarity to achieve the matching of two la...
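Since paragraph [0003] frames parallel sentence pair extraction as matching by semantic similarity, the following is a minimal, generic sketch (not the patent's method) that scores cross-language candidate pairs by cosine similarity of sentence embeddings. The embedding source is left abstract and the 0.8 threshold is an illustrative assumption.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sentence embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def extract_candidates(zh_embs, vi_embs, threshold=0.8):
    """Return (i, j, score) for cross-language pairs whose embeddings are
    similar enough to be considered candidate parallel sentences.

    zh_embs, vi_embs: lists of vectors from any multilingual sentence
    encoder; the 0.8 threshold is an illustrative assumption.
    """
    pairs = []
    for i, zh in enumerate(zh_embs):
        for j, vi in enumerate(vi_embs):
            score = cosine(zh, vi)
            if score >= threshold:
                pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```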

Claims


Application Information

IPC (IPC8): G06F40/30; G06K9/62; G06N3/04; G06N3/08
CPC: G06F40/30; G06N3/08; G06N3/047; G06N3/044; G06N3/045; G06F18/2414; G06F18/22; G06F18/2415; G06F18/253
Inventors: 余正涛, 张乐乐, 郭军军
Owner: KUNMING UNIV OF SCI & TECH