Topic modeling method based on word co-occurrence network

A topic modeling technology, applied in the fields of instruments, electrical digital data processing, and computing, that addresses the overly strong assumptions of existing models and their limited ability to alleviate short-text sparsity. It avoids the effort of collecting external knowledge and enhances word co-occurrence information.

Active Publication Date: 2020-09-29
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

The disadvantage of this approach is that not all data has such corresponding information, and where it exists it is often scarce. For example, some users publish only a single Weibo post, so the ability of this approach to alleviate the sparsity problem of short text is limited. The second category uses external knowledge to raise the probability that two words which rarely or never co-occur, but which are semantically similar, are assigned to the same topic. For example, the WordNet knowledge base can be used to extract the synsets of words, and the synset information is encoded into the model to improve its results. The disadvantage of this method is that a suitable external knowledge base must be found from which the required word-to-word relationship information can be extracted. The third category directly changes the assumptions of the model according to the characteristics of the data set itself. For example, the BTM model was proposed to model the co-occurrence between words directly: it converts the original data set into a set of biterms, where each biterm consists of two different words from the same document, the two words in each biterm share a topic, and each short text is assumed to have only one topic.
The defect of this method is that the assumption that a short text has only one topic is too strong, and it does not distinguish the importance of the two words in a biterm to the topic.
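
The biterm construction described for BTM can be illustrated with a minimal Python sketch. It only shows how a short document is turned into unordered pairs of distinct words; it is not the BTM inference procedure, and the function name `extract_biterms` and the sample documents are hypothetical.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of distinct words in one short document."""
    # Deduplicate first so a biterm always holds two different words.
    unique_words = sorted(set(tokens))
    return list(combinations(unique_words, 2))


# Hypothetical toy corpus of two tokenized short texts.
docs = [["apple", "phone", "screen"], ["apple", "fruit"]]

# The whole corpus becomes one flat collection of biterms.
biterms = [pair for doc in docs for pair in extract_biterms(doc)]
print(biterms)
```

Both words of a pair are treated identically here, which mirrors the limitation noted above: nothing in a biterm records which of its two words matters more to the topic.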



Examples


Embodiment

[0015] Figure 1 shows a flow chart of a topic modeling method based on a word co-occurrence network. The method comprises the following steps:

[0016] (1) Construct a word co-occurrence network according to a given corpus or text collection, including steps:

[0017] Let the collection of documents be $D = \{d_1, d_2, \ldots, d_n\}$, where each document consists of multiple words.

[0018] (1-1) Preprocess the text collection D: convert all words to lowercase, remove stop words and punctuation marks, and remove words with fewer than 3 characters and documents with fewer than 3 words. The document collection obtained after preprocessing is still denoted D;

[0019] (1-2) Deduplicate all the words that appear in the text set D to form a set V, called the vocabulary. Each word in V corresponds to a node in the word co-occurrence network. If two words in the vocabulary co-occur in a document, an undirected edge is connected between ...
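
Steps (1-1) and (1-2) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the stop-word list, the whitespace tokenization, and the function names are hypothetical, and because the excerpt is truncated before any edge weighting is described, the per-edge document count kept here is an assumption made only for illustration.

```python
import string
from collections import defaultdict
from itertools import combinations

# Assumption: a tiny illustrative stop-word list; the source does not name one.
STOP_WORDS = {"the", "and", "for", "are", "was", "with", "this", "that"}


def preprocess(documents, min_word_len=3, min_doc_len=3):
    """Step (1-1): lowercase, strip punctuation and stop words, drop short items."""
    punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = []
    for text in documents:
        # Assumption: simple whitespace tokenization after punctuation removal.
        tokens = text.lower().translate(punct_to_space).split()
        tokens = [t for t in tokens if t not in STOP_WORDS and len(t) >= min_word_len]
        if len(tokens) >= min_doc_len:
            cleaned.append(tokens)
    return cleaned


def build_cooccurrence_network(docs):
    """Step (1-2): one node per vocabulary word, an undirected edge per co-occurring pair."""
    vocabulary = sorted({word for doc in docs for word in doc})
    neighbors = {word: set() for word in vocabulary}
    # Assumption: count of documents in which each pair co-occurs (not stated in the source).
    doc_counts = defaultdict(int)
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            neighbors[u].add(v)
            neighbors[v].add(u)
            doc_counts[(u, v)] += 1
    return vocabulary, neighbors, doc_counts


raw = [
    "The topic model, for SHORT texts!",
    "Short texts and topic sparsity.",
    "Hi to me",
]
docs = preprocess(raw)
vocabulary, neighbors, doc_counts = build_cooccurrence_network(docs)
print(vocabulary)
print(sorted(neighbors["model"]))
```

The third sample text is dropped entirely because none of its words reaches 3 characters. An adjacency-set representation keeps the edges undirected without needing a graph library.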


Abstract

The invention discloses a topic modeling method based on a word co-occurrence network. The topic modeling method comprises the following steps: constructing the word co-occurrence network according to a given corpus or text set; constructing a new document set according to the obtained word co-occurrence network; and inputting the obtained new document set into a Gibbs sampling algorithm of a standard topic model LDA to obtain a document-topic matrix and a topic-word matrix corresponding to the new document set. The method does not depend on any external knowledge, avoids the effort of collecting additional knowledge, and improves the result of the topic model using only the information contained in the data set.
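
The final step feeds a document set into the Gibbs sampling algorithm of standard LDA. The sketch below is a minimal, pure-Python collapsed Gibbs sampler for standard LDA, shown only to make the two output matrices concrete. How the new document set is built from the network is not described in this excerpt, so the input here is simply a list of documents encoded as integer word ids; the function name `lda_gibbs`, the hyperparameter values, and the toy data are assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n_kw
    topic_total = [0] * num_topics                               # n_k
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates of the document-topic and topic-word matrices.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy input: two documents over a 4-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4, iters=50)
```

Here `theta` plays the role of the document-topic matrix and `phi` the topic-word matrix; each row of either is a probability distribution.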

Description

Technical field

[0001] The invention relates to the fields of natural language processing and text mining, in particular to a topic modeling method based on a word co-occurrence network.

Background

[0002] How to obtain the needed information from a large amount of text is an important issue in text mining. With the rapid development of the Internet in particular, there is now a large number of short texts online, such as Weibo posts and online comments. These texts are characterized by huge quantity, short length, and a certain amount of noise. Faced with this much text, manually judging and identifying the content of each document one by one is time-consuming and labor-intensive. How can computers help people better absorb the information contained in a large amount of text? Topic models provide a solution. A topic model is an effective method for discovering hidden structural information in text. Since it was proposed,...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/237, G06F40/30
CPC: G06F40/237, G06F40/30
Inventors: 蔡毅, 朱冰山
Owner: SOUTH CHINA UNIV OF TECH