Word vector generation method based on Gaussian distribution

A word vector technique based on the Gaussian distribution, applied in the field of natural language processing. It addresses two problems of existing models: point-estimate word vectors cannot represent probability distributions, and the number of word senses is assumed to be fixed. The stated effects are accelerating the text clustering process, a good classification effect, and reducing the amount of calculation and communication.

Active Publication Date: 2018-11-02
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0010] The purpose of the present invention is to overcome the shortcomings of the prior art by providing a method for generating word vectors based on the Gaussian distribution. Representing words as Gaussian distributions overcomes the inability of traditional word vector models, which rely on point estimation, to represent probability distributions. It also solves the problem that prior-art models assume a fixed number of word senses.
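To make the contrast with point-estimate vectors concrete, the following is a minimal sketch of one possible data structure, assuming each word sense is a Gaussian with a diagonal covariance. The class names `GaussianSense` and `WordEntry` and the diagonal-covariance choice are hypothetical and are not taken from the patent text.

```python
class GaussianSense:
    """One word sense as a Gaussian: a mean vector plus per-dimension variances."""

    def __init__(self, mean, var):
        if len(mean) != len(var):
            raise ValueError("mean and var must have the same dimension")
        if any(v <= 0 for v in var):
            raise ValueError("variances must be positive")
        self.mean = list(mean)
        self.var = list(var)  # diagonal covariance (an assumption of this sketch)


class WordEntry:
    """A word mapped to a variable number of Gaussian senses."""

    def __init__(self, word):
        self.word = word
        self.senses = []  # grows as new senses are inferred; not fixed in advance

    def add_sense(self, mean, var):
        """Append a new sense and return its index."""
        self.senses.append(GaussianSense(mean, var))
        return len(self.senses) - 1


entry = WordEntry("bank")
first = entry.add_sense([0.0, 1.0], [1.0, 1.0])    # e.g. a financial sense
second = entry.add_sense([3.0, -2.0], [0.5, 2.0])  # e.g. a riverside sense
```

A point-estimate model would store only a single mean vector per word; here the variances carry uncertainty, and the sense list can grow per word.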



Examples


Embodiment

[0047] The invention provides a method for generating word vectors based on the Gaussian distribution. The method first preprocesses the corpus; second, it divides the corpus into contexts by using punctuation marks; third, it infers word senses by combining local and global information and determines the mapping between words and word senses; finally, it obtains the word vectors by optimizing the objective function. The specific process of the present invention is described in detail below through specific embodiments in conjunction with the accompanying drawings.
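The context-division step can be illustrated with a short sketch. It assumes whitespace tokenization and a hand-picked set of delimiter characters; the function name `split_contexts` and the delimiter set are hypothetical, since this excerpt does not list which punctuation marks the method uses.

```python
import re

# Hypothetical delimiter set: common Western and CJK punctuation marks.
_DELIMITERS = re.compile(r"[.!?;,:。！？；，：]+")


def split_contexts(text):
    """Split raw text into contexts (token lists) at punctuation marks."""
    contexts = []
    for chunk in _DELIMITERS.split(text):
        tokens = chunk.split()  # whitespace tokenization (an assumption)
        if tokens:              # skip empty chunks, e.g. after a final period
            contexts.append(tokens)
    return contexts


contexts = split_contexts("The bank raised rates. We sat on the river bank, fishing!")
```

Each resulting token list serves as one context; here the two occurrences of "bank" fall into different contexts, which is the kind of local information a sense-inference step could use.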

[0048] Referring to Figure 1, the method for generating word vectors based on the Gaussian distribution includes the following steps:

[0049] S1. Obtain the training corpus and preprocess it. The preprocessing consists of removing stop words and low-frequency words, lemmatizing (restoring word forms), and converting case, which yields an effective corpus; in addition, use Python for c...
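The preprocessing in S1 can be sketched as below. This is an illustrative sketch rather than the patent's implementation: the stop-word list, the `min_count` threshold, and the function name `preprocess` are hypothetical, and lemmatization is omitted because it would require an external language resource.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"the", "a", "an", "of", "on", "and", "to", "in"})


def preprocess(tokens, min_count=2, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop stop words and low-frequency words."""
    lowered = [tok.lower() for tok in tokens]  # case conversion
    counts = Counter(lowered)                  # frequencies after lowercasing
    return [
        tok for tok in lowered
        if tok not in stop_words and counts[tok] >= min_count
    ]


tokens = "The bank of the river and the Bank of England".split()
cleaned = preprocess(tokens, min_count=2)
```

Counting after lowercasing means "bank" and "Bank" are treated as the same word when the frequency threshold is applied.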


Abstract

The present invention discloses a word vector generation method based on the Gaussian distribution. The method comprises: first, preprocessing the corpus; second, using punctuation to divide the corpus into text segments; then combining local and global information to infer word senses and determining the mapping between words and word senses; and finally, obtaining the word vectors by optimizing the objective function. The innovations and beneficial effects of the technical scheme are as follows: 1, words are represented by Gaussian distributions, which avoids the point-estimate character of traditional word vectors and brings richer information to the word vectors, such as probability mass, semantic connotation, and inclusion relationships; 2, multiple Gaussian distributions are used to represent each word, so that a word having several meanings in natural language can be accommodated; and 3, the similarity between Gaussian distributions is defined based on the Hellinger distance, and by combining parameter updating with word sense discrimination, the number of word senses can be inferred adaptively, which solves the problem that prior-art models assume a fixed number of word senses.
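The abstract names the Hellinger distance as the basis for similarity between Gaussian distributions but does not give a formula in this excerpt. As a minimal sketch, the standard closed form for two Gaussians can be computed as follows, assuming diagonal covariances; that assumption and the function name `hellinger_diag_gaussians` are illustrative and not taken from the patent.

```python
import math


def hellinger_diag_gaussians(mu1, var1, mu2, var2):
    """Hellinger distance between two Gaussians with diagonal covariances.

    Uses H^2 = 1 - BC, where BC is the Bhattacharyya coefficient, computed
    in log space per dimension for numerical stability.
    """
    log_bc = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        avg = 0.5 * (v1 + v2)  # averaged variance for this dimension
        # Normalization term: v1^(1/4) * v2^(1/4) / avg^(1/2)
        log_bc += 0.25 * math.log(v1) + 0.25 * math.log(v2) - 0.5 * math.log(avg)
        # Exponent term: -(1/8) * (m1 - m2)^2 / avg
        log_bc -= 0.125 * (m1 - m2) ** 2 / avg
    bc = math.exp(log_bc)
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp against rounding error


same = hellinger_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
far = hellinger_diag_gaussians([0.0], [1.0], [10.0], [1.0])
```

The distance is bounded between 0 (identical distributions) and 1 (essentially no overlap) and is symmetric in its arguments. A similarity score could be taken as one minus the distance, although how the patent maps distance to similarity is not stated here.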

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a method for generating word vectors based on the Gaussian distribution.

Background technique

[0002] A word vector is a mathematical model of a word used in Natural Language Processing (NLP). The earliest word vector representation is the one-hot code (One-Hot), which represents each word as a high-dimensional vector in which the position of the 1 is the index of the word in the dictionary. One-hot encoding has the disadvantages of high dimensionality, sparseness, and neglect of semantic and syntactic information. With the development of deep learning, methods have emerged that use neural networks to train word vectors. Neural-network-based word vector models capture the syntactic and semantic information of the context from the co-occurrence of words, and represent words as low-dimensional, dense, real-valued vectors. Word vectors are often used as ...
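The one-hot representation described in the background can be shown with a small sketch; the three-word vocabulary and the function name `one_hot` are hypothetical and chosen only for illustration.

```python
def one_hot(word, vocab):
    """Return a one-hot vector whose single 1 sits at the word's vocabulary index."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0] * len(vocab)  # dimensionality equals the vocabulary size
    vec[index[word]] = 1
    return vec


vocab = ["river", "bank", "money"]
bank = one_hot("bank", vocab)
money = one_hot("money", vocab)

# The dot product of any two distinct one-hot vectors is zero, so the
# encoding carries no information about how related two words are.
dot = sum(x * y for x, y in zip(bank, money))
```

The vector length grows with the vocabulary and all but one entry is zero, which is the high dimensionality and sparseness noted in the background.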


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/62
CPC: G06F40/30, G06F18/24
Inventor: 沈鸿, 曹渝
Owner: SUN YAT SEN UNIV