Chinese-Vietnamese parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM

Parallel sentence pair extraction with pre-training, applied in natural language translation, neural learning methods, natural language data processing, and related fields, addresses data scarcity, poor Chinese-Vietnamese bilingual parallel sentence pair extraction, and poor Chinese-Vietnamese machine translation, among other issues.

Pending Publication Date: 2021-01-29
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] The present invention provides a Chinese-Vietnamese parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM to solve the problem of scarcity of Chinese-Vietnames



Examples


Embodiment 1

[0061] Embodiment 1: As shown in Figures 1-2, the specific steps of the Chinese-Vietnamese parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM are as follows:

[0062] Step 1, build the corpus: build a Chinese-Vietnamese comparable corpus, crawl Chinese and Vietnamese monolingual text, and build a Chinese-Vietnamese seed dictionary;

[0063] Step 2, Chinese-Vietnamese cross-language word vector pre-training: learn Chinese-Vietnamese bilingual word vector representations, using the Chinese-Vietnamese seed dictionary to align words in the same semantic space for cross-language bilingual pre-training;

[0064] Step 3, Bi-LSTM and CNN unified spatial encoding: input the Chinese and Vietnamese sentences obtained after pre-training into a Siamese ("twin") neural network composed of Bi-LSTM and CNN, and extract th...
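The seed-dictionary alignment in Step 2 can be illustrated with a small sketch. The visible text does not say which mapping algorithm the method uses, so the least-squares linear map below is an assumption chosen for brevity, as are the toy 2-D vectors, the three word pairs, and every function name. It fits a map from Chinese vectors to Vietnamese vectors on two seed pairs, then looks up the nearest Vietnamese word for a Chinese word that is not in the seed dictionary.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain row-by-column matrix product on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("seed vectors do not span the 2-D space")
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_linear_map(src_rows, tgt_rows):
    # Least-squares solution of X W = Y via the normal equations (2-D toy case only).
    xt = transpose(src_rows)
    return matmul(inverse_2x2(matmul(xt, src_rows)), matmul(xt, tgt_rows))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest_vietnamese(word, zh_vecs, vi_vecs, w):
    # Map the Chinese vector into the Vietnamese space, then pick the closest word.
    mapped = matmul([zh_vecs[word]], w)[0]
    return max(vi_vecs, key=lambda cand: cosine(mapped, vi_vecs[cand]))

# Hypothetical monolingual embeddings; the Chinese space is a rotated copy of the Vietnamese one.
zh_vecs = {"水": [0.0, 1.0], "火": [-1.0, 0.0], "山": [-0.6, -0.8]}
vi_vecs = {"nước": [1.0, 0.0], "lửa": [0.0, 1.0], "núi": [-0.8, 0.6]}
seed_dictionary = [("水", "nước"), ("火", "lửa")]  # "山" is deliberately left out

w = fit_linear_map([zh_vecs[z] for z, _ in seed_dictionary],
                   [vi_vecs[v] for _, v in seed_dictionary])
print(nearest_vietnamese("山", zh_vecs, vi_vecs, w))
```

In a self-learning setup, word pairs found this way could be added to the dictionary and the map refitted; that loop is omitted here.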


Abstract

The invention relates to a Chinese-Vietnamese parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM, and belongs to the technical field of natural language processing. The method comprises the following steps: first, Chinese-Vietnamese comparable corpora are collected, from which Chinese-Vietnamese parallel sentence pairs are to be extracted; a Chinese-Vietnamese bilingual dictionary and a large amount of Chinese and Vietnamese monolingual text are added in pre-training, and word alignment is performed by mapping the Chinese-Vietnamese bilingual dictionary into a common semantic space; a new dictionary is then generated iteratively from the Chinese-Vietnamese seed dictionary in a self-learning manner, so that the semantic similarity between Chinese and Vietnamese sentences is represented to the maximum extent; next, the Chinese and Vietnamese sentences obtained after pre-training are input into a Siamese ("twin") neural network composed of Bi-LSTM and CNN, which extracts global and local features of the sentences; finally, a fully connected layer judges whether the input sentence pair is a Chinese-Vietnamese bilingual parallel sentence pair. The method achieves good results in an experiment on extracting parallel sentence pairs from comparable corpora.
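The encoder and classifier stages described in the abstract can be sketched roughly as follows. This is not the patent's network: a plain tanh recurrent cell stands in for the LSTM cell to keep the example short, the weights are random and untrained, and the way the two sentence vectors are combined (absolute difference and element-wise product) is an assumption, as are all names and dimensions. What the sketch does show is the structure: one encoder shared by both languages, a bidirectional recurrent pass for global features, a 1-D convolution with max-over-time pooling for local features, and a single fully connected unit with a sigmoid on top.

```python
import math
import random

def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def affine(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

class SiameseEncoder:
    """One encoder applied to both sentences, so the two branches share weights."""

    def __init__(self, emb_dim, hidden, filters, width=2, seed=0):
        rng = random.Random(seed)
        self.hidden, self.width = hidden, width
        self.w_fwd = make_matrix(hidden, emb_dim + hidden, rng)
        self.w_bwd = make_matrix(hidden, emb_dim + hidden, rng)
        self.w_conv = make_matrix(filters, emb_dim * width, rng)

    def _rnn(self, seq, weights):
        # Simplified tanh recurrence (assumption: stands in for an LSTM cell).
        state = [0.0] * self.hidden
        for vec in seq:
            state = [math.tanh(v) for v in affine(weights, vec + state)]
        return state

    def encode(self, seq):
        # seq is a list of word vectors (lists); it needs at least `width` words.
        # Global features: final states of a forward and a backward pass.
        global_feat = self._rnn(seq, self.w_fwd) + self._rnn(seq[::-1], self.w_bwd)
        # Local features: convolution over word windows, ReLU, max over time.
        windows = [sum(seq[i:i + self.width], [])
                   for i in range(len(seq) - self.width + 1)]
        conv = [[max(0.0, v) for v in affine(self.w_conv, win)] for win in windows]
        local_feat = [max(col) for col in zip(*conv)]
        return global_feat + local_feat

def parallel_probability(encoder, fc_weights, fc_bias, zh_seq, vi_seq):
    a, b = encoder.encode(zh_seq), encoder.encode(vi_seq)
    # Assumed pair features: absolute difference and element-wise product.
    features = [abs(x - y) for x, y in zip(a, b)] + [x * y for x, y in zip(a, b)]
    logit = sum(w * f for w, f in zip(fc_weights, features)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))

encoder = SiameseEncoder(emb_dim=2, hidden=3, filters=4)
zh_sentence = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]  # hypothetical word vectors
vi_sentence = [[0.2, 0.1], [0.25, 0.0], [0.1, 0.4]]
fc_weights = [0.1] * 20  # 2 * (2 * hidden + filters) pair features
print(parallel_probability(encoder, fc_weights, 0.0, zh_sentence, vi_sentence))
```

With untrained weights the printed probability carries no meaning; in practice the encoder and the fully connected layer would be trained jointly on labelled parallel and non-parallel pairs.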

Description

Technical field

[0001] The invention relates to a Chinese-Vietnamese parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM, and belongs to the technical field of natural language processing.

Background

[0002] Parallel sentence pair extraction is an important way to alleviate the scarcity of machine translation data in natural language processing; here it aims to expand the Chinese-Vietnamese translation corpus. At present, parallel sentence pair extraction can be cast as a sentence similarity classification task in a shared semantic space, and its core lies in bilingual semantic space alignment. Traditional semantic space alignment methods rely on large-scale bilingual parallel corpora, but such corpora are hard to obtain for Vietnamese, a low-resource language, whereas Chinese and Vietnamese monolingual text is relatively easy to obtain. Therefore, how to use Chinese-Vietnamese monolin...
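To make the "sentence similarity classification in a shared semantic space" framing concrete, here is a minimal baseline sketch. It is not the method of the invention: it assumes word vectors that are already aligned across the two languages, represents a sentence by the average of its word vectors, and accepts a pair when the cosine similarity of the best candidate reaches a threshold. The vectors, the threshold of 0.8, and the function names are all illustrative assumptions.

```python
import math

def sentence_vector(tokens, vectors):
    # Average the vectors of known tokens; None if no token is in the vocabulary.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def extract_pairs(zh_sents, vi_sents, vectors, threshold=0.8):
    # For each Chinese sentence, keep its best Vietnamese candidate if similar enough.
    pairs = []
    for zh in zh_sents:
        zh_vec = sentence_vector(zh, vectors)
        if zh_vec is None:
            continue
        scored = []
        for vi in vi_sents:
            vi_vec = sentence_vector(vi, vectors)
            if vi_vec is not None:
                scored.append((cosine(zh_vec, vi_vec), vi))
        if scored:
            score, best = max(scored, key=lambda item: item[0])
            if score >= threshold:
                pairs.append((zh, best))
    return pairs

# Hypothetical vectors in an already shared Chinese-Vietnamese space.
shared_vectors = {
    "水": [1.0, 0.0], "nước": [0.9, 0.1],
    "山": [0.0, 1.0], "núi": [0.1, 0.9],
    "火": [-1.0, 0.0],  # has no close Vietnamese counterpart in this toy corpus
}
zh_corpus = [["水"], ["山"], ["火"]]
vi_corpus = [["nước"], ["núi"]]
extracted = extract_pairs(zh_corpus, vi_corpus, shared_vectors)
print(extracted)
```

A fixed threshold on averaged vectors ignores word order and context, which is one reason to replace it with a trained sentence encoder and classifier.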


Application Information

IPC(8): G06F40/58, G06F40/30, G06F40/284, G06F40/211, G06F16/35, G06F16/951, G06N3/04, G06N3/08
CPC: G06F40/58, G06F40/30, G06F40/211, G06F40/284, G06F16/355, G06F16/951, G06N3/049, G06N3/084, G06N3/045
Inventor: 高盛祥, 刘畅, 余正涛, 毛存礼, 黄于欣, 王振晗
Owner: KUNMING UNIV OF SCI & TECH