Polysemous-word word vector disambiguation method

A word-vector technology for polysemy, at the intersection of text mining and machine learning, addressing the problems of sparse corpus features, low accuracy, and heavy corpus dependence.

Inactive Publication Date: 2018-11-23
TAIYUAN UNIV OF TECH


Problems solved by technology

[0003] 1) Knowledge-based methods rely on manually constructed text databases and corpora. Their advantage is relatively high accuracy, since the results come from verified corpora; however, in some fields the available corpus is small, so the problem of sparse corpus features is unavoidable, and the method is limited by the completeness of knowledge construction.
[0004] 2) Supervised methods rely on manually labeled corpus data and depend strongly on already-annotated corpora. For languages without such labels, these methods are unsuitable for word sense disambiguation; their dependence on labeled data is too strong.
[0005] 3) Unsupervised methods do not fully require corpora or linguistic annotation, so they have better applicability, but their accuracy is relatively low.
[0006] Existing word sense disambiguation methods thus face high corpus dependence, cumbersome manual annotation, and low accuracy, so word sense disambiguation needs to be explored further.

Method used



Examples


Embodiment 1

[0020] Embodiment 1: Dataset composition

[0021] The data set of this experiment has two parts: e-commerce review data combined with the Simplified Chinese Wikipedia corpus, and the Sogou news text classification corpus. The former is used mainly for qualitative evaluation, the latter for quantitative evaluation.

[0022] First, to better evaluate polysemous words qualitatively, the e-commerce review data is combined with the Simplified Chinese Wikipedia corpus: a text corpus from a single field contains relatively few polysemous words, making it difficult to describe the polysemy phenomenon qualitatively, whereas Wikipedia covers a wide range of fields and is well suited to mining polysemy. The e-commerce review data contains 4,904,600 comments in total; after processing the Wikipedia corpus and extracting the text content, 361,668 entries and 1,576,564 lines were obtained, c...

Embodiment 2

[0028] Embodiment 2: Polysemous word vector training

[0029] Figure 1 shows the polysemous word vector algorithm. The word vector and the topic vector are computed separately; after BTM topic model training produces the result word:(topic number), the word vector and topic vector are joined by the formula

[0030] $w^z = w \oplus z$

[0031] where $\oplus$ denotes concatenation: $w^z$ is the connection of word vector $w$ and topic vector $z$, and the length of $w^z$ is the sum of the lengths of $w$ and $z$. The word vector and the topic vector need not have the same length.
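The concatenation above, with the topic vector weighted by the topic probability P(z|w) as described in the abstract, can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and the example vector lengths are chosen for illustration only.

```python
import numpy as np

def polysemous_vector(word_vec, topic_vec, p_topic_given_word):
    """Build a polysemous word vector by concatenating the word vector
    with its topic vector weighted by P(z|w).  The two vectors may
    have different lengths; the result has length |w| + |z|."""
    weighted_topic = p_topic_given_word * topic_vec
    return np.concatenate([word_vec, weighted_topic])

# Illustrative values: a length-3 word vector and a length-2 topic vector.
w = np.array([0.2, -0.1, 0.5])
z = np.array([0.4, 0.3])
wz = polysemous_vector(w, z, 0.8)
assert wz.shape == (5,)  # |w| + |z| = 3 + 2
```

The same word paired with different topic numbers yields different concatenated vectors, which is what separates the senses of a polysemous word.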

[0032] BTM topic annotation. Parameter inference for the BTM topic model uses Gibbs sampling to obtain the hidden topic corresponding to each word in each document of the text set. Since Skip-gram trains better than the CBOW model on larger corpora, the Skip-gram training model is used here. The topic sampling formula ...
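The Gibbs sampling step for BTM can be sketched in a few lines. This is a generic collapsed Gibbs sampler over biterms (word pairs co-occurring in a short document), a sketch under standard BTM assumptions rather than the patent's exact sampling formula, which is truncated above; all names and hyperparameter defaults are illustrative.

```python
import numpy as np
from itertools import combinations

def btm_gibbs(docs, K, V, alpha=1.0, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler sketch for the Biterm Topic Model.
    docs: list of documents, each a list of word ids in [0, V).
    Returns the topic assignment per biterm and the topic-word counts,
    from which P(z|w) can be estimated."""
    rng = np.random.default_rng(seed)
    # Extract biterms: every unordered word pair within a short document.
    biterms = [b for doc in docs for b in combinations(doc, 2)]
    n_z = np.zeros(K)                      # number of biterms per topic
    n_wz = np.zeros((K, V))                # word counts per topic
    z_assign = rng.integers(K, size=len(biterms))
    for i, (w1, w2) in enumerate(biterms): # initialise counts randomly
        k = z_assign[i]
        n_z[k] += 1
        n_wz[k, w1] += 1
        n_wz[k, w2] += 1
    for _ in range(iters):
        for i, (w1, w2) in enumerate(biterms):
            k = z_assign[i]                # remove current assignment
            n_z[k] -= 1; n_wz[k, w1] -= 1; n_wz[k, w2] -= 1
            # P(z|b) ∝ (n_z + α)(n_{w1|z} + β)(n_{w2|z} + β) / (2 n_z + Vβ)²
            denom = (2 * n_z + V * beta) ** 2
            p = (n_z + alpha) * (n_wz[:, w1] + beta) * (n_wz[:, w2] + beta) / denom
            k = rng.choice(K, p=p / p.sum())
            z_assign[i] = k                # record new assignment
            n_z[k] += 1; n_wz[k, w1] += 1; n_wz[k, w2] += 1
    return z_assign, n_wz
```

Given the counts, P(w|z) and P(z) follow by normalisation, and P(z|w) ∝ P(z)P(w|z) provides the weight used when concatenating topic vectors.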

Embodiment 3

[0058] Embodiment 3: Experimental verification process

[0059] (1) Quantitative verification

[0060] 1) Word vector similarity

[0061] Table 4 word2vec word vector similarity results

[0062]

[0063] Table 5 Polysemous word vector similarity results

[0064]

[0065] First, the e-commerce reviews and the Wikipedia corpus are preprocessed, including stop-word removal and word segmentation. To filter internet slang out of the corpus, the stop-word list is expanded, reducing corpus noise. At the same time, because low-frequency words carry little information and degrade the word vectors, words with a frequency lower than 5 are deleted. The word vector similarity experiment compares the polysemous word vector algorithm against the word2vec algorithm; the results are shown in Table 4 and Table 5.
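The preprocessing described above (stop-word removal followed by a minimum-frequency filter) can be sketched as below. The function name is illustrative; the example uses a lowered `min_freq` only so the tiny corpus shows the effect.

```python
from collections import Counter

def preprocess(docs, stopwords, min_freq=5):
    """Drop stop words, then remove words whose total corpus frequency
    is below min_freq (low-frequency words carry little information)."""
    docs = [[w for w in doc if w not in stopwords] for doc in docs]
    freq = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if freq[w] >= min_freq] for doc in docs]

# Tiny illustrative corpus (already word-segmented):
docs = [["good", "phone", "the"], ["good", "screen", "the"]]
cleaned = preprocess(docs, stopwords={"the"}, min_freq=2)
# "the" is a stop word; "phone" and "screen" fall below min_freq.
```

In the experiment itself the threshold is 5, applied after segmenting the Chinese text and expanding the stop-word list with internet slang.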

[0066] Analysis of Table 4 and Table 5 shows that the word2vec algorithm has only a single unique ...



Abstract

The invention discloses a polysemous-word word vector disambiguation method, belonging to the cross-technical field of text mining and machine learning. The method comprises: (1) text corpus acquisition and preprocessing, wherein the Sogou news text classification corpus is adopted, followed by the preprocessing steps of word segmentation and stop-word removal; (2) BTM topic model modeling; (3) calculation of the probability of words corresponding to topics; (4) calculation of a vector model over word vectors and topic vectors; and (5) polysemous-word word vector construction, wherein the topic vectors are weighted during concatenation according to the topic probability P(z|w), so that different meanings of the same word in different contexts are distinguished and correct polysemous-word word vectors are obtained. The method facilitates extending Chinese word sense disambiguation to the short-text field; by combining a topic model with word vectors it requires no manual data labeling, facilitates mining of massive short-text data, saves time and effort, and supports personalized product recommendation on e-commerce websites.

Description

technical field [0001] The invention belongs to the cross-technical field of text mining and machine learning, and specifically relates to a polysemous word vector model, in particular to a word sense disambiguation method for the polysemous word vector model and its application to disambiguation in short texts. Background technique [0002] Chinese has varied forms of expression, and the same word can have different meanings in different contexts. Experts have proposed many solutions for obtaining the correct semantics of words, but many problems remain, mainly: [0003] 1) Knowledge-based methods rely on manually constructed text databases and corpora. Their advantage is relatively high accuracy, since the results come from verified corpora; however, in some fields the available corpus is small, so the problem of sparse corpus features is unavoidable, and the method is limited by the completeness of knowledge construction. [0004] 2) T...

Claims


Application Information

IPC(8): G06F17/27, G06F17/30, G06K9/62, G06Q30/06
CPC: G06Q30/0631, G06F40/284, G06F40/30, G06F18/2411
Inventors: 谢珺, 李思宇, 梁凤梅, 刘建霞
Owner TAIYUAN UNIV OF TECH