A bilingual word embedding-based cross-language text similarity assessment technique

A text-similarity and cross-language technology, applied in natural language data processing, semantic analysis, instrumentation, etc., that addresses problems such as the difficulty of multi-task, multi-label specification, and multi-language learning, and the inability to fully express semantic associations.

Active Publication Date: 2019-01-15
HARBIN ENG UNIV

AI Technical Summary

Problems solved by technology

Symbolic representations and shallow models do not capture the semantic information contained in the data, so they cannot fully express the semantic relationships between data in different languages. This makes it difficult to apply a unified and effective method to multi-task, multi-label specification, and multi-language learning.


Examples


Embodiment Construction

[0041] The main processing procedure of the present invention will be described in more detail below in conjunction with the accompanying drawings.

[0042] The invention describes a cross-language text similarity evaluation technology based on bilingual word embedding. Natural language processing techniques are first used to preprocess the text (word segmentation, stop-word removal, and so on); words are then taken as the text units for learning word vector representations and building a bilingual word embedding model. This model generates word embedding representations shared by the two languages, so the spatial distance between words can be used to measure the semantic similarity between them. First, based on word vector correlation theory and the Skip-Gram model, word vectors are trained on an artificially constructed pseudo-bilingual corpus. Second, in order to make the generated word embedding space as complete as possible, a monolingual corpus is also used as a suppleme...
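The preprocessing and pseudo-bilingual training-data steps described above can be illustrated with a minimal sketch. The visible text does not say how the pseudo-bilingual corpus is constructed, so this example assumes a common approach: randomly replacing words with their translations from a bilingual lexicon. The names `preprocess`, `make_pseudo_bilingual`, `skipgram_pairs`, `STOP_WORDS`, and the toy lexicon are all hypothetical, and whitespace splitting stands in for a real word segmenter.

```python
import random

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = {"the", "a", "on"}


def preprocess(text):
    """Lowercase, split on whitespace (stand-in for word segmentation),
    and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def make_pseudo_bilingual(sentences, lexicon, swap_prob=0.5, seed=0):
    """Assumed construction: replace each token that has a translation in
    `lexicon` with that translation, with probability `swap_prob`."""
    rng = random.Random(seed)
    mixed = []
    for tokens in sentences:
        mixed.append([
            lexicon[t] if t in lexicon and rng.random() < swap_prob else t
            for t in tokens
        ])
    return mixed


def skipgram_pairs(tokens, window=2):
    """Return the (center, context) pairs a Skip-Gram model is trained on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    lexicon = {"cat": "猫", "mat": "垫子"}  # toy bilingual lexicon
    corpus = [preprocess("The cat sat on a mat")]
    for sent in make_pseudo_bilingual(corpus, lexicon, swap_prob=0.5):
        print(sent, skipgram_pairs(sent, window=1))
```

Because words from both languages end up sharing contexts in the mixed sentences, a Skip-Gram model trained on such pairs would tend to place translation pairs near each other in one space, which is the property the distance-based similarity relies on.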


Abstract

The invention belongs to the field of language processing, and in particular relates to a cross-language text similarity evaluation technology based on bilingual word embedding. Its technical route and workflow can be divided into three stages: construction of a bilingual word embedding model, construction of a text similarity calculation framework based on multiple neural networks, and cross-language similarity calculation. The bilingual word embedding model generates a word embedding representation shared by both languages: based on word vector correlation theory, the Skip-Gram model is used to train word vectors on an artificially constructed pseudo-bilingual corpus. In addition, to make the generated word embedding space as complete as possible, a monolingual corpus is used as a supplement to learn additional word embedding knowledge. The similarity score of sentences is obtained by combining several neural network structures to learn the semantic representation of sentences. By dividing short text into paragraphs and treating each paragraph as a long sentence for sequence input, similarity can be computed iteratively on a larger scale.
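The third stage, cross-language similarity calculation over paragraphs, can be sketched as follows. This is not the patent's multi-neural-network framework: as a deliberately simple stand-in, each paragraph is represented by the average of its word vectors, and paragraphs are compared by cosine similarity in the shared embedding space. The aggregation rule (best match per paragraph, then the mean), the function names, and the two-dimensional toy embeddings in `EMB` are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_vector(tokens, emb):
    """Average the embeddings of known tokens; None if no token is known.
    Stand-in for a learned neural sentence representation."""
    known = [emb[t] for t in tokens if t in emb]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def text_similarity(paras_a, paras_b, emb):
    """Treat each paragraph as one long sentence. For every paragraph of A,
    take its best cosine match in B (negative scores are clipped to 0),
    then average those best scores."""
    scores = []
    for pa in paras_a:
        va = sentence_vector(pa, emb)
        if va is None:
            continue
        best = 0.0
        for pb in paras_b:
            vb = sentence_vector(pb, emb)
            if vb is not None:
                best = max(best, cosine(va, vb))
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Toy shared bilingual embedding space (hypothetical values).
EMB = {
    "cat": [1.0, 0.0], "猫": [0.9, 0.1],
    "car": [0.0, 1.0], "车": [0.1, 0.9],
}

if __name__ == "__main__":
    print(text_similarity([["cat"]], [["猫"]], EMB))  # translation pair
    print(text_similarity([["cat"]], [["车"]], EMB))  # unrelated pair
```

With these toy vectors, the translation pair scores much higher than the unrelated pair, which is the behaviour a shared bilingual space is meant to provide.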

Description

Technical field

[0001] The invention belongs to the field of language processing, and in particular relates to a cross-language text similarity evaluation technology based on bilingual word embedding.

Background technique

[0002] Methods based on statistical machine learning are currently the mainstream of research in the field of natural language processing. These methods usually obtain statistical knowledge of language from training data automatically or semi-automatically, and can effectively establish language representation models. However, methods based on statistical machine learning depend to a large extent on the scale, representativeness, correctness, and processing depth of the training data. The more language data is used for training and the more domain-specific it is, the better the language model fits. It can be said that the quality of the training data largely determines the effect of statistical machine learning methods. Therefore, by expanding the corpus to continu...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
CPC: G06F40/279, G06F40/30, G06F40/289
Inventor: 刘刚, 张翰墨, 左权
Owner: HARBIN ENG UNIV