Multi-granularity Chinese-Vietnamese parallel sentence pair extraction method based on graph attention network

A technique for parallel sentence pair extraction with attention, applied in neural learning methods, biological neural network models, natural language translation, and related fields. It addresses the scarcity of parallel corpora, the difficulty of obtaining them, and the resulting poor translation quality, with the effect of reducing language differences and improving accuracy.

Pending Publication Date: 2022-01-07
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

However, neural machine translation performs poorly for existing low-resource languages, especially Chinese-Vietnamese, where parallel corpora are very scarce and difficult to obtain directly from the Internet, resulting in poor model performance.

Examples

Embodiment 1

[0036] Embodiment 1: As shown in Figures 1-4, the specific steps of the multi-granularity Chinese-Vietnamese parallel sentence pair extraction method based on a graph attention network are as follows:

[0037] Step 1. Multi-granularity document modeling: Comparable-corpus texts have a hierarchical structure in which documents are composed of paragraphs, paragraphs are composed of sentences, and sentences are composed of subwords. Following this structure, the text is divided into four granularities (subwords, sentences, paragraphs, and documents) and represented as a tree with four types of nodes: subword-level nodes, sentence-level nodes, paragraph-level nodes, and document-level nodes, as shown in Figure 3.
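The tree described in Step 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the names `MultiGranularityGraph` and `build_graph`, the nested-list input format, and the example subword segmentation are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MultiGranularityGraph:
    """Hypothetical container: typed nodes plus parent-child edges."""
    nodes: list = field(default_factory=list)  # (level, text) tuples
    edges: list = field(default_factory=list)  # (parent_id, child_id) tuples

    def add_node(self, level, text, parent=None):
        node_id = len(self.nodes)
        self.nodes.append((level, text))
        if parent is not None:
            self.edges.append((parent, node_id))
        return node_id


def build_graph(document):
    """Build the four-level tree.

    `document` is assumed to be a list of paragraphs, each paragraph a list
    of sentences, each sentence a list of already-segmented subwords.
    """
    graph = MultiGranularityGraph()
    doc_id = graph.add_node("document", "")
    for paragraph in document:
        par_id = graph.add_node("paragraph", "", parent=doc_id)
        for sentence in paragraph:
            sent_id = graph.add_node("sentence", " ".join(sentence), parent=par_id)
            for subword in sentence:
                graph.add_node("subword", subword, parent=sent_id)
    return graph


# Toy input: 2 paragraphs, 3 sentences, 6 subwords (segmentation is made up).
doc = [[["par", "##allel", "corpus"], ["hello"]], [["xin", "chào"]]]
g = build_graph(doc)
levels = [level for level, _ in g.nodes]
print({name: levels.count(name) for name in ("document", "paragraph", "sentence", "subword")})
```

Because every node except the document root has exactly one parent, the edge list forms a tree; a graph network layer could then treat these edges as the neighborhood structure.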

[0038] Since the four different granularities of subwords, sentences, paragraphs, and documents have constraint relatio...

Abstract

The invention relates to a multi-granularity Chinese-Vietnamese parallel sentence pair extraction method based on a graph attention network, and belongs to the technical field of natural language processing. The method uses a Siamese neural network as its main framework so that Chinese and Vietnamese are represented in the same semantic space. First, following the structure of a document, a Chinese-Vietnamese document is divided into four levels of granularity (subwords, sentences, paragraphs, and documents), and graph network structures are constructed at each level to realize multi-granularity document modeling. Next, the text is encoded with BERT, feature representations of the nodes at each granularity are obtained through a graph attention network layer, and these representations are fused in a graph information integration layer. Finally, a classifier is trained with a fully connected layer to calculate the probability that two sentences are parallel, and Chinese-Vietnamese parallel sentence pairs are extracted. The method reduces the language difference between Chinese and Vietnamese sentence pairs, extracts Chinese-Vietnamese parallel sentence pairs effectively, and provides strong support for the development of neural machine translation.
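The pipeline in the abstract (shared encoder, graph attention, fully connected classifier) can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: random vectors stand in for BERT outputs, a standard single-head graph attention layer stands in for the patent's graph attention network layer, mean pooling stands in for the graph information integration layer, and the feature combination `[u, v, |u - v|]` fed to the classifier is a common choice rather than something the source specifies.

```python
import numpy as np


def gat_layer(h, adj, W, a, slope=0.2):
    """Single-head graph attention.

    h: (N, F) node features, adj: (N, N) 0/1 adjacency with self-loops,
    W: (F, D) projection, a: (2*D,) attention vector.
    """
    z = h @ W
    d = z.shape[1]
    # e[i, j] = LeakyReLU(a[:d] . z_i + a[d:] . z_j)
    e = (z @ a[:d])[:, None] + (z @ a[d:])[None, :]
    e = np.where(e > 0, e, slope * e)
    e = np.where(adj > 0, e, -np.inf)        # mask non-neighbours
    e = e - e.max(axis=1, keepdims=True)     # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


def star_adjacency(n):
    """Node 0 (a sentence node) linked to n-1 subword nodes, plus self-loops."""
    adj = np.eye(n)
    adj[0, :] = 1
    adj[:, 0] = 1
    return adj


def encode(h, adj, W, a):
    """Shared (Siamese) encoder: one GAT layer, then mean pooling (assumed)."""
    out, _ = gat_layer(h, adj, W, a)
    return out.mean(axis=0)


def parallel_probability(u, v, w_fc, b_fc):
    """Fully connected layer plus sigmoid over an assumed feature combination."""
    features = np.concatenate([u, v, np.abs(u - v)])
    return 1.0 / (1.0 + np.exp(-(features @ w_fc + b_fc)))


rng = np.random.default_rng(0)
F, D = 8, 4
W, a = rng.normal(size=(F, D)), rng.normal(size=2 * D)   # shared by both sides
w_fc, b_fc = rng.normal(size=3 * D), 0.0

zh_h = rng.normal(size=(5, F))   # placeholder for Chinese node encodings
vi_h = rng.normal(size=(6, F))   # placeholder for Vietnamese node encodings
u = encode(zh_h, star_adjacency(5), W, a)
v = encode(vi_h, star_adjacency(6), W, a)
p = parallel_probability(u, v, w_fc, b_fc)
print(f"parallel probability (untrained weights): {p:.3f}")
```

The Siamese property is the reuse of the same `W` and `a` for both languages. With untrained weights the printed probability is meaningless; in practice the parameters would be learned from labeled parallel and non-parallel pairs.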

Description

Technical field

[0001] The invention relates to a multi-granularity Chinese-Vietnamese parallel sentence pair extraction method based on a graph attention network, and belongs to the technical field of natural language processing.

Background

[0002] Neural machine translation is data-driven and relies on massive bilingual data. Generally, the larger the data size, the better the translation model. However, existing neural machine translation for low-resource languages performs poorly, especially Chinese-Vietnamese neural machine translation, where parallel corpora are scarce and difficult to obtain directly from the Internet, resulting in poor model performance. Parallel sentence pair extraction is one of the important methods for improving the quality and scale of parallel corpora. Many studies have shown that large-scale, high-quality parallel corpora can effectively improve the quality of neural machine translation for low-resource languages. Th...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F16/36, G06F40/289, G06F40/58, G06K9/62, G06N3/04, G06N3/08
CPC: G06F16/353, G06F16/367, G06F40/289, G06N3/08, G06F40/58, G06N3/047, G06N3/044, G06N3/045, G06F18/253
Inventor: 高盛祥, 杨玉倩
Owner: KUNMING UNIV OF SCI & TECH