
Coding method for word vectorization using pointwise mutual information between words

A coding method based on mutual information, applied to the vectorized coding of words, that addresses the poor interpretability of existing coding models and achieves good interpretability and good descriptive power.

Pending Publication Date: 2022-04-01
UNIV OF SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

Traditional methods for obtaining word embeddings, represented by Word2Vec, estimate word vectors mainly from a co-occurrence matrix that describes how often words appear together. The resulting coding models are poorly interpretable, and their performance still leaves room for improvement.
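To make the co-occurrence statistics concrete, the following is a minimal sketch of counting word-context pairs in a fixed window and deriving pointwise mutual information (PMI) from those counts. The function names, the window size, and the choice to estimate marginals from the pair counts are illustrative assumptions, not details taken from the patent.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs that fall within a symmetric window."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def pmi_table(pair_counts):
    """PMI(w, c) = log(P(w, c) / (P(w) * P(c))) for every observed pair.

    Marginals are estimated from the pair counts themselves (an assumption
    of this sketch); unobserved pairs are simply absent from the result.
    """
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    return {
        (word, context): math.log(n * total / (word_totals[word] * context_totals[context]))
        for (word, context), n in pair_counts.items()
    }


# Toy corpus, already segmented into words.
toy_sentences = [["a", "b"], ["a", "b"], ["c", "d"]]
toy_pmi = pmi_table(cooccurrence_counts(toy_sentences, window=1))
```

A high PMI value means two words co-occur more often than their individual frequencies would predict, which is the quantity the method's title refers to.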



Embodiment Construction

[0028] The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the specific content of the invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained from them by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention. Content not described in detail in the embodiments belongs to the prior art known to those skilled in the art.

[0029] As shown in Figure 1, an embodiment of the present invention provides a coding method for word vectorization using pointwise mutual information between words, which fits the pointwise mutual information between words more accurately. The method includes:

[0030] Step S0, traversing the corpus, performing word segme...


Abstract

The invention discloses a coding method for word vectorization using pointwise mutual information between words, comprising the following steps: S0, traversing the corpus and performing word segmentation on all texts to obtain a word set; S1, initializing the word-embedding parameters and counting the frequency of each word; S2, traversing the corpus and extracting positive and negative sample pairs for training the word-embedding parameters; S3, calculating the loss function over the positive and negative sample pairs, computing the gradient with respect to the corresponding word-embedding parameters, and updating those parameters; and S4, after traversing the corpus several times, storing the context word vector of each word and taking it as the resulting word embedding. The invention fits the pointwise mutual information between words more accurately.
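The flow of steps S0 to S4 can be sketched roughly as below. The abstract does not state the loss function, so this sketch substitutes the standard skip-gram negative-sampling logistic loss as a stand-in. Whitespace segmentation, frequency-proportional negative sampling, the function name `train_embeddings`, and all hyperparameters are likewise assumptions made for illustration, not the patent's actual procedure.

```python
import math
import random
from collections import Counter


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def train_embeddings(texts, dim=8, window=2, negatives=2, epochs=5, lr=0.05, seed=0):
    rng = random.Random(seed)

    # S0: segment every text (whitespace split, an assumption) and build the word set.
    sentences = [text.split() for text in texts]
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # S1: initialise the embedding parameters and count word frequencies.
    freq = Counter(w for s in sentences for w in s)
    weights = [freq[w] for w in vocab]
    ids = list(range(len(vocab)))
    word_vecs = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
    ctx_vecs = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]

    for _ in range(epochs):  # S4: traverse the corpus several times.
        for s in sentences:
            for i, word in enumerate(s):
                wv = word_vecs[index[word]]
                for j in range(max(0, i - window), min(len(s), i + window + 1)):
                    if j == i:
                        continue
                    # S2: one positive pair plus frequency-sampled negative pairs.
                    samples = [(index[s[j]], 1.0)]
                    samples += [(k, 0.0) for k in rng.choices(ids, weights=weights, k=negatives)]
                    # S3: gradient of the logistic loss, then an SGD update.
                    for c, label in samples:
                        cv = ctx_vecs[c]
                        score = sum(a * b for a, b in zip(wv, cv))
                        step = lr * (label - sigmoid(score))
                        for d in range(dim):
                            old_w = wv[d]
                            wv[d] += step * cv[d]
                            cv[d] += step * old_w

    # S4: keep the context vector of each word as its embedding.
    return {w: ctx_vecs[index[w]] for w in vocab}


embeddings = train_embeddings(["the cat sat", "the dog sat"], dim=4, epochs=2)
```

For simplicity the sketch does not filter out negative samples that happen to equal the positive context word; a fuller implementation would likely handle that case.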

Description

Technical field

[0001] The invention relates to the field of artificial intelligence, in particular to a method for the vectorized encoding of words (English or Chinese words).

Background art

[0002] Word vectorization is a critical step in natural language processing. Its goal is to convert symbolic information such as natural language into numerical information in vector form, representing each word by a continuous vector to facilitate subsequent processing. This process is also known as word encoding or word embedding. Common word vectorization methods currently fall into two categories: one-hot representation models and distributed representation models. The former represents a word by a sparse vector whose length is the dictionary size N. Exactly one dimension of each vector is 1, indicating the position of the word in the dictionary, and all other dimensions are 0. ...
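As a small illustration of the one-hot representation described above, the sketch below builds a length-N vector with a single 1 at the word's dictionary position. The toy dictionary and the helper name `one_hot` are hypothetical.

```python
def one_hot(word, dictionary):
    """Return a length-N vector with a single 1 at the word's dictionary index."""
    vec = [0] * len(dictionary)
    vec[dictionary.index(word)] = 1
    return vec


dictionary = ["apple", "banana", "cherry", "date"]  # toy dictionary, N = 4
print(one_hot("cherry", dictionary))  # [0, 0, 1, 0]
```

Any two distinct one-hot vectors are orthogonal, so this representation carries no information about how similar two words are.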


Application Information

IPC(8): G06F40/126, G06F40/284, G06F40/289
Inventor 庄连生, 姚命宏, 李厚强
Owner UNIV OF SCI & TECH OF CHINA