A Word Vector Generation Method Based on Gaussian Distribution

The method applies Gaussian distributions to word vector technology in the field of natural language processing. It addresses the inability of existing word vectors to represent probability distributions and the fixed number of word senses they assume. The stated effects are a faster text clustering process, less computation, and less communication.

Active Publication Date: 2022-03-25
SUN YAT SEN UNIV
6 Cites, 0 Cited by
AI Technical Summary

Problems solved by technology

[0010] The purpose of the present invention is to overcome the shortcomings of the prior art by providing a method for generating word vectors based on Gaussian distribution. Representing words as Gaussian distributions overcomes the inability of traditional word vector models, which rely on point estimation, to represent probability distributions. It also removes the prior-art assumption that each word has a fixed number of word senses.



Examples

Embodiment

[0047] The invention provides a word vector generation method based on Gaussian distribution. First, the corpus is preprocessed. Second, the corpus is divided into contexts using punctuation marks. Third, word senses are inferred by combining local and global information, and the mapping between words and word senses is determined. Finally, word vectors are obtained by optimizing the objective function. The specific process is described in detail below with reference to the accompanying drawings and specific embodiments.
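The context-division step can be illustrated with a minimal sketch. This excerpt does not say which punctuation marks the method uses or how tokens are formed, so the punctuation set, the whitespace tokenization, and the name `split_contexts` are all assumptions made for illustration.

```python
import re

# Assumed punctuation set (ASCII plus common full-width marks); the patent
# excerpt does not list the marks it actually uses.
_PUNCTUATION = re.compile(r"[.,;:!?。，；：！？]+")


def split_contexts(text):
    """Split raw text into contexts at punctuation marks.

    Each context is returned as a list of whitespace-separated tokens.
    Empty segments (for example after a trailing full stop) are dropped.
    """
    segments = _PUNCTUATION.split(text)
    contexts = [segment.split() for segment in segments]
    return [context for context in contexts if context]


if __name__ == "__main__":
    for context in split_contexts("He sat by the bank, fishing. The bank closed!"):
        print(context)
```

Under this sketch, every word in the same punctuation-delimited segment shares one context, which is the unit the later word sense inference step would operate on.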

[0048] Referring to Figure 1, the method for generating word vectors based on Gaussian distribution includes the following steps:

[0049] S1. Obtain the training corpus and preprocess it. Specifically, preprocessing removes stop words and low-frequency words, lemmatizes words (restores their base forms), and normalizes case to form an effective corpus; in addition, corpus preprocess...
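A minimal sketch of the preprocessing described in S1, under stated assumptions: the stop-word list, the `min_count` threshold, and the function name `preprocess` are hypothetical, and lemmatization is left as a pluggable callable (identity by default) because the excerpt does not name a lemmatizer.

```python
from collections import Counter

# Hypothetical stop-word list; the patent excerpt does not specify one.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "on"})


def preprocess(token_lists, min_count=2, stop_words=STOP_WORDS,
               lemmatize=lambda word: word):
    """Normalize case, lemmatize, and drop stop words and rare words.

    token_lists: list of token lists (one list per sentence or context).
    min_count:   assumed frequency threshold below which a word is removed.
    lemmatize:   assumed hook for a real lemmatizer; identity by default.
    """
    # Lowercase first, then lemmatize, so frequency counts are taken
    # over normalized forms.
    normalized = [[lemmatize(word.lower()) for word in tokens]
                  for tokens in token_lists]
    counts = Counter(word for tokens in normalized for word in tokens)
    cleaned = [
        [word for word in tokens
         if word not in stop_words and counts[word] >= min_count]
        for tokens in normalized
    ]
    # Drop units that became empty after filtering.
    return [tokens for tokens in cleaned if tokens]


if __name__ == "__main__":
    corpus = [["The", "bank", "is", "open"], ["A", "river", "bank"]]
    print(preprocess(corpus))
```

Counting frequencies after case normalization means "Bank" and "bank" are treated as one word when the low-frequency filter is applied.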


Abstract

The invention discloses a method for generating word vectors based on Gaussian distribution. First, the corpus is preprocessed. Second, punctuation marks are used to divide the corpus into contexts. Third, word senses are inferred by combining local and global information, and the mapping between words and word senses is determined. Finally, the objective function is optimized to obtain word vectors. The innovations and beneficial effects of the technical solution are as follows:
1. Words are represented as Gaussian distributions, which avoids the point-estimate nature of traditional word vectors, gives word vectors a probabilistic character, and lets them carry richer information such as entailment and inclusion relations between word meanings.
2. Representing a word with multiple Gaussian distributions handles the polysemy of words in natural language.
3. The similarity between Gaussian distributions is defined using the Hellinger distance, and parameter updates are combined with word sense discrimination to infer the number of word senses adaptively, which removes the prior-art assumption that the number of word senses is fixed.
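The abstract names the Hellinger distance but this excerpt does not give the formula the patent uses. As an illustration only, the sketch below uses the standard closed form for two Gaussians N(mu1, Sigma1) and N(mu2, Sigma2):

H^2 = 1 - [det(Sigma1)^(1/4) det(Sigma2)^(1/4) / det(Sigma)^(1/2)] * exp(-(1/8) (mu1 - mu2)^T Sigma^(-1) (mu1 - mu2)), with Sigma = (Sigma1 + Sigma2) / 2.

The diagonal-covariance restriction and the name `hellinger_distance` are assumptions; how the patent converts the distance into a similarity score is not shown here.

```python
import math


def hellinger_distance(mu1, var1, mu2, var2):
    """Hellinger distance between two diagonal-covariance Gaussians.

    mu1, mu2:   sequences of per-dimension means.
    var1, var2: sequences of per-dimension variances (must be positive).
    Diagonal covariance is an assumption made for this sketch.
    """
    # Accumulate the log of the Bhattacharyya coefficient per dimension
    # to avoid underflow in high dimensions.
    log_bc = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        avg_var = 0.5 * (v1 + v2)
        log_bc += 0.25 * math.log(v1) + 0.25 * math.log(v2) - 0.5 * math.log(avg_var)
        log_bc -= 0.125 * (m1 - m2) ** 2 / avg_var
    bc = math.exp(log_bc)
    # Clamp guards against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


if __name__ == "__main__":
    same = hellinger_distance([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
    apart = hellinger_distance([0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [1.0, 1.0])
    print(same, apart)
```

The distance lies in [0, 1]: it is 0 for identical distributions and approaches 1 as they stop overlapping, so one natural (assumed) similarity is `1 - hellinger_distance(...)`.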

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a word vector generation method based on Gaussian distribution.

Background

[0002] A word vector is a mathematical representation of a word used in Natural Language Processing (NLP). The earliest word vector representation is one-hot encoding (One-Hot), which represents each word as a high-dimensional vector in which the position of the single 1 is the index of the word in the dictionary. One-hot encoding has disadvantages such as high dimensionality, sparseness, and ignoring semantic and syntactic information. With the development of deep learning, methods that use neural networks to train word vectors have emerged. Neural network-based word vector models capture syntactic and semantic information from the contexts in which words co-occur, and represent each word as a low-dimensional, dense real-valued vector. Word vectors are often used as f...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/30, G06K9/62
CPC: G06F40/30, G06F18/24
Inventor: 沈鸿, 曹渝
Owner: SUN YAT SEN UNIV