A bilingual word embedding-based cross-language text similarity assessment technique

A cross-language text similarity technology, applied in natural language data processing, semantic analysis, instrumentation, and related fields. It addresses the difficulty of multi-task, multi-label specification, and multi-language learning, and the inability of existing representations to fully express semantic associations.

Active Publication Date: 2019-01-15

HARBIN ENG UNIV

Cites: 3. Cited by: 55.

AI Technical Summary

Problems solved by technology

Symbolic representations and shallow models do not capture the semantic information contained in the data, so they cannot fully express the semantic relationships between data in different languages. As a result, it is difficult to apply a unified and effective method to multi-task, multi-label specification, and multi-language learning.

Embodiment Construction

[0041] The main processing procedure of the present invention will be described in more detail below in conjunction with the accompanying drawings.

[0042] The invention describes a cross-language text similarity evaluation technology based on bilingual word embedding. Natural language processing techniques are used to preprocess the text, for example by word segmentation and stop-word removal, and words are used as the text units for learning word vector representations and building a bilingual word embedding model. This model generates word embedding representations shared by both languages, so the spatial distance between words can be used to measure their semantic similarity. First, based on word vector correlation theory and the Skip-Gram model, word vectors are trained on an artificially constructed pseudo-bilingual corpus. Second, in order to make the generated word embedding space as complete as possible, a monolingual corpus is also used as a suppleme...
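The preprocessing and corpus-construction steps in [0042] can be illustrated with a minimal Python sketch. The excerpt does not say how the pseudo-bilingual corpus is built, so the random interleaving of aligned sentence pairs used here is an assumption, as are the tiny stop-word lists and the function names `preprocess`, `merge_pair`, and `build_pseudo_bilingual_corpus`. Word segmentation is assumed to have been done upstream.

```python
import random

# Hypothetical, deliberately tiny stop-word lists (assumption for illustration).
STOP_WORDS = {
    "en": {"the", "a", "is", "on"},
    "zh": {"的", "了", "在"},
}


def preprocess(tokens, lang):
    """Lowercase already-segmented tokens and drop stop words."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS[lang]]


def merge_pair(src, tgt, rng):
    """Randomly interleave two token lists, keeping each list's own order.

    This is one possible way to build a pseudo-bilingual sentence; it is an
    assumption, not the construction described in the patent.
    """
    merged, i, j = [], 0, 0
    while i < len(src) or j < len(tgt):
        # Take from src if tgt is exhausted, or on a coin flip while src remains.
        pick_src = j == len(tgt) or (i < len(src) and rng.random() < 0.5)
        if pick_src:
            merged.append(src[i])
            i += 1
        else:
            merged.append(tgt[j])
            j += 1
    return merged


def build_pseudo_bilingual_corpus(pairs, seed=0):
    """Turn aligned (English tokens, Chinese tokens) pairs into mixed sentences."""
    rng = random.Random(seed)
    return [
        merge_pair(preprocess(en, "en"), preprocess(zh, "zh"), rng)
        for en, zh in pairs
    ]


if __name__ == "__main__":
    pairs = [(["The", "cat", "is", "on", "the", "mat"], ["猫", "在", "垫子", "上"])]
    for sentence in build_pseudo_bilingual_corpus(pairs):
        print(sentence)
```

Because words from both languages share context windows in the mixed sentences, a Skip-Gram trainer run over this corpus would place them in one vector space. One option, not confirmed by the source, is gensim's `Word2Vec` with `sg=1`.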

Abstract

The invention belongs to the field of language processing, and in particular relates to a cross-language text similarity evaluation technology based on bilingual word embedding. The technical route and workflow can be divided into three stages: construction of the bilingual word embedding model, construction of a text similarity calculation framework based on multiple neural networks, and cross-language similarity calculation. The model generates a word embedding representation shared by both languages: based on word vector correlation theory, the Skip-Gram model is used to train word vectors on an artificially constructed pseudo-bilingual corpus. In addition, to make the generated word embedding space as complete as possible, a monolingual corpus is used as a supplement to learn additional word embedding knowledge. The similarity score of sentences is obtained by combining several neural network structures to learn the semantic representation of sentences. By dividing a short text into paragraphs and treating each paragraph as a long sentence for sequence input, the similarity calculation can be iterated on a larger scale.
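The third stage, cross-language similarity calculation over paragraphs, can be sketched in Python under strong simplifying assumptions. The abstract says sentence representations are learned by combining several neural networks; this sketch swaps that encoder for a plain average of word vectors and scores paragraphs by cosine similarity, so it is a baseline stand-in rather than the patented method. The toy embedding table, the best-match aggregation rule, and the names `cosine`, `mean_vector`, `split_paragraphs`, and `text_similarity` are all hypothetical. Texts are assumed to be pre-segmented with spaces, with paragraphs separated by blank lines.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(tokens, emb):
    """Average the vectors of known tokens; None if no token is in emb."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def split_paragraphs(text):
    """Split on blank lines and tokenize each paragraph on whitespace."""
    return [p.split() for p in text.split("\n\n") if p.strip()]


def text_similarity(text_a, text_b, emb):
    """Average, over paragraphs of A, of the best cosine match in B.

    Each paragraph is treated as one long sentence. Note that this
    aggregation rule is asymmetric and is an assumption of the sketch.
    """
    vecs_a = [v for v in (mean_vector(p, emb) for p in split_paragraphs(text_a)) if v]
    vecs_b = [v for v in (mean_vector(p, emb) for p in split_paragraphs(text_b)) if v]
    if not vecs_a or not vecs_b:
        return 0.0
    best = [max(cosine(a, b) for b in vecs_b) for a in vecs_a]
    return sum(best) / len(best)


if __name__ == "__main__":
    # Toy shared bilingual embedding space (made-up values for illustration).
    emb = {"cat": [1.0, 0.0], "猫": [0.9, 0.1], "dog": [0.0, 1.0], "狗": [0.1, 0.9]}
    print(text_similarity("cat", "猫", emb))
    print(text_similarity("cat", "狗", emb))
```

In the toy space, translation pairs sit close together, so an English paragraph scores higher against its Chinese counterpart than against an unrelated one.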

Description

Technical field

[0001] The invention belongs to the field of language processing, and in particular relates to a cross-language text similarity evaluation technology based on bilingual word embedding.

Background technique

[0002] Methods based on statistical machine learning are currently the mainstream of research in the field of natural language processing. These methods usually obtain statistical knowledge of language from training data automatically or semi-automatically, and can effectively establish language representation models. However, they depend to a large extent on the scale, representativeness, correctness, and processing depth of the training data. The more language data is used for training and the more domain-specific it is, the better the language model fits. It can be said that the quality of the training data largely determines the effect of statistical machine learning methods. Therefore, by expanding the corpus to continu...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/27

CPC: G06F40/279, G06F40/30, G06F40/289

Inventor: 刘刚, 张翰墨, 左权

Owner: HARBIN ENG UNIV
