
Chinese-Vietnamese cross-language word embedding method fusing word cluster constraints

A cross-language word embedding and word cluster technology, applied in natural language translation, natural language data processing, instruments, etc. It addresses problems such as the inability to accurately align bilingual word embedding spaces and the weak generalization of mapping matrices, with the effect of improving poor alignment results, improving generalization, and improving mapping accuracy.

Active Publication Date: 2022-06-07
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] The present invention provides a Chinese-Vietnamese cross-language word embedding method that integrates word cluster constraints. It addresses the problem that, for lack of a large-scale bilingual dictionary, the learned mapping matrix generalizes poorly to unlabeled words outside the dictionary and cannot accurately align the bilingual word embedding spaces.



Examples


Embodiment 1

[0046] Example 1: As shown in Figures 1-3, the specific steps of the Chinese-Vietnamese cross-language word embedding method fused with word cluster constraints are as follows:

[0047] Step1. Use the large-scale open-source news datasets brightmart (Chinese) and binhvq (Vietnamese) as monolingual training corpora to obtain Chinese and Vietnamese monolingual word embeddings;

[0048] Step1.1. Remove numbers, special characters and punctuation marks in the Chinese-Vietnamese monolingual news corpus;

[0049] Step1.2. Convert the uppercase letters in the Vietnamese corpus to lowercase letters;

[0050] Step1.3. Perform word segmentation on the corpora, using the jieba tool for the Chinese corpus and the VnCoreNLP tool for the Vietnamese corpus, and remove sentences whose length after word segmentation is less than 20;

[0051] Step1.4. Input the preprocessed Chinese-Vietnamese monolingual corpus into the monolingual word...
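Steps 1.1 to 1.3 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `clean` and `preprocess` are hypothetical, "length" is assumed to mean the number of tokens after segmentation, and the tokenizer is passed in as a callable (whitespace splitting here as a placeholder; in practice a jieba or VnCoreNLP segmenter would be supplied instead).

```python
import re
import unicodedata
from typing import Callable, Iterable, List

# Matches anything that is not a letter/underscore/whitespace, plus digits.
# Used to strip numbers, special characters and punctuation (Step1.1).
_NON_WORD = re.compile(r"[^\w\s]|\d")


def clean(text: str, lowercase: bool = False) -> str:
    """Remove digits and punctuation; optionally lowercase (Step1.2)."""
    # NFC keeps Vietnamese letters with diacritics as single code points,
    # so the regex does not strip combining marks.
    text = unicodedata.normalize("NFC", text)
    text = _NON_WORD.sub(" ", text)
    if lowercase:
        text = text.lower()
    return " ".join(text.split())


def preprocess(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # placeholder tokenizer (assumption)
    lowercase: bool = False,
    min_len: int = 20,  # assumed to count tokens, not characters
) -> List[List[str]]:
    """Clean, segment, and drop short sentences (Step1.3)."""
    kept = []
    for sentence in sentences:
        tokens = [t for t in tokenize(clean(sentence, lowercase)) if t.strip()]
        if len(tokens) >= min_len:
            kept.append(tokens)
    return kept
```

For the Vietnamese corpus one would call `preprocess(lines, tokenize=..., lowercase=True)`; for Chinese, lowercasing is left off because the step applies only to Vietnamese.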


Abstract

The invention relates to a Chinese-Vietnamese cross-language word embedding method fusing word cluster constraints. The method comprises the following steps: first, preprocessing Chinese and Vietnamese monolingual corpora and training Chinese and Vietnamese monolingual word embeddings; then, constructing a Chinese-Vietnamese bilingual dictionary and a word cluster dictionary using an online dictionary and an open-source word library, and fusing alignment information at two granularities, words and word clusters, into the training of the mapping matrix; and finally, obtaining a shared Chinese-Vietnamese word embedding space through a cross-language mapping framework, so that Chinese and Vietnamese words with the same meaning lie close to each other in that space. The method extracts word cluster alignment information from the Chinese-Vietnamese bilingual dictionary using different types of association relationships, so that the mapping matrix learns a multi-granularity mapping relationship. This improves the generalization of the mapping matrix to unlabeled words and addresses the poor bilingual space alignment in the low-resource Chinese-Vietnamese setting. Experimental results show that the alignment accuracy of the model on the Chinese-Vietnamese dictionary induction task at @1 and @5 is improved by 2.2 percentage points compared with the VecMap model.
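The abstract reports dictionary induction accuracy at ranks 1 and 5. Assuming these are the usual precision-at-k scores for bilingual dictionary induction, the following sketch shows how such a score could be computed once a mapping matrix is available. Everything here is illustrative: the function names, the cosine nearest-neighbour retrieval, the two-dimensional toy vectors, and the rotation matrix `W_ROT` are assumptions, not values or procedures taken from the patent.

```python
import math
from typing import Dict, List, Set

Vector = List[float]
Matrix = List[List[float]]


def apply_mapping(x: Vector, W: Matrix) -> Vector:
    """Map a source embedding into the target space: row vector x times W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def precision_at_k(
    src_emb: Dict[str, Vector],
    tgt_emb: Dict[str, Vector],
    W: Matrix,
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of source words with a gold translation among the top-k neighbours."""
    hits = 0
    for src_word, translations in gold.items():
        mapped = apply_mapping(src_emb[src_word], W)
        ranked = sorted(tgt_emb, key=lambda w: cosine(mapped, tgt_emb[w]), reverse=True)
        if translations & set(ranked[:k]):
            hits += 1
    return hits / len(gold)


# Toy data (hypothetical): two Chinese words, three Vietnamese candidates.
SRC = {"猫": [1.0, 0.0], "狗": [0.0, 1.0]}
TGT = {"mèo": [0.0, 1.0], "chó": [-1.0, 0.0], "cá": [1.0, 0.0]}
GOLD = {"猫": {"mèo"}, "狗": {"chó"}}
W_ROT = [[0.0, 1.0], [-1.0, 0.0]]       # a 90-degree rotation that aligns the toy spaces
W_IDENTITY = [[1.0, 0.0], [0.0, 1.0]]   # no mapping, for comparison

p1_mapped = precision_at_k(SRC, TGT, W_ROT, GOLD, k=1)
p1_unmapped = precision_at_k(SRC, TGT, W_IDENTITY, GOLD, k=1)
```

In the toy setup the rotation places each Chinese word on top of its Vietnamese translation, so `p1_mapped` is higher than `p1_unmapped`; a real evaluation would use the trained mapping matrix and a held-out test dictionary.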

Description

Technical field

[0001] The invention relates to a Chinese-Vietnamese cross-language word embedding method fused with word cluster constraints, and belongs to the technical field of natural language processing.

Background

[0002] Cross-language word embedding maps words with the same meaning in different languages into a shared space for alignment. It is the basis for tasks such as cross-language text classification, cross-language sentiment analysis, machine translation, and cross-language entity linking, and has important application value.

[0003] Chinese-Vietnamese cross-language word embedding is a bilingual word embedding task for low-resource languages. At present, low-resource cross-language word embedding methods fall into three main categories: unsupervised, semi-supervised, and supervised. Unsupervised methods exploit the similarity of monolingual embedding spaces in different languages, and can learn mapping matrices to achieve alignment without labeling da...


Application Information

IPC(8): G06F40/289, G06F40/242, G06F40/216, G06F40/40
CPC: G06F40/289, G06F40/242, G06F40/216, G06F40/40, Y02D10/00
Inventors: 余正涛, 武照渊, 黄于欣
Owner: KUNMING UNIV OF SCI & TECH