Text categorization method based on probability word selection and supervision subject model

A topic-model and text-classification technology, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses problems such as the inability to exploit word polysemy and factors that degrade the performance of topic models.

Active Publication Date: 2013-12-25
ZHEJIANG UNIV

Problems solved by technology

Without preprocessing, or with improper preprocessing, the text data fed to the model will contain redundant data, which degrades the performance of the topic model.
On the other hand, ignoring the differing importance (or discrimination) of the words in a topic with respect to the identification information also degrades the performance of the topic model.
Finally, supervised models built directly on words rather than on topic structure cannot cope with the widespread polysemy of words.



Examples


Embodiment

[0172] From http://web.ist.utl.pt/~acardoso/datasets/ download the training text 20ng-train-all-terms and the test text 20ng-test-all-terms, and remove the texts that contain no more than 3 words, obtaining D_tr = 11285 training texts and D_te = 8571 test texts. In the experiment, the number of topics K is set to 20, and the other experimental parameters are chosen as shown in Table 1:

[0173] Table 1

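As an illustration of the data preparation in [0172], the following minimal Python sketch loads and filters the two files; it assumes one document per line with the category label first, separated by a tab, which may need adjusting to the actual layout of the downloaded files.

    def load_and_filter(path, min_words=4):
        # Keep only the texts containing more than 3 words, as in [0172].
        docs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                label, _, text = line.strip().partition("\t")
                if len(text.split()) >= min_words:
                    docs.append((label, text))
        return docs

    train_docs = load_and_filter("20ng-train-all-terms.txt")  # expect D_tr = 11285
    test_docs = load_and_filter("20ng-test-all-terms.txt")    # expect D_te = 8571
    K = 20  # number of topics used in the experiment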

[0175] For the training text, perform the following steps (a sketch of steps 1 and 2 is given after the list):

[0176] 1) Remove punctuation marks, count word frequency information and category information, and form a word list with a size of 73712 and a category list with a size of 20;

[0177] 2) Initialize the topic proportion vector α, the topic word matrix β, the topic word discrimination matrix ψ and the regression coefficient matrix η:

[0178] (2.1) For α, ψ and η, set α_k = 0.1, ψ_kv = 0.5 and η_cv = 0, for k = 1,...,K, c = 1,...,C, v = 1,...,V;

[0179] (2.2) For β, let β_kv be initialized at random for k = 1,...,K, v = 1,...,V, where the rand function randomly genera...
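The two steps above admit a direct implementation. Below is a minimal Python sketch of steps 1) and 2), assuming the training texts are available as (label, text) pairs as loaded earlier; normalizing each row of β to sum to 1 is an assumption, since the initialization formula is truncated in this extract.

    import string
    from collections import Counter

    import numpy as np

    def build_lists(train_docs):
        # Step 1): strip punctuation, count word frequencies and
        # categories, and form the word list and the category list.
        table = str.maketrans("", "", string.punctuation)
        word_freq, categories = Counter(), set()
        for label, text in train_docs:
            categories.add(label)
            for w in text.translate(table).lower().split():
                word_freq[w] += 1
        word_list = sorted(word_freq)       # size V (73712 in the embodiment)
        category_list = sorted(categories)  # size C (20 in the embodiment)
        return word_list, category_list

    def init_params(K, C, V, rng=np.random.default_rng(0)):
        # Step 2): initialize the four parameter sets as in (2.1)-(2.2).
        alpha = np.full(K, 0.1)      # topic proportion vector, alpha_k = 0.1
        psi = np.full((K, V), 0.5)   # topic-word discrimination, psi_kv = 0.5
        eta = np.zeros((C, V))       # regression coefficients, eta_cv = 0
        beta = rng.random((K, V))    # rand: random value per topic and word
        beta /= beta.sum(axis=1, keepdims=True)  # assumed normalization
        return alpha, beta, psi, eta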



Abstract

The invention discloses a text categorization method based on probability word selection and a supervision subject model. The method includes the following steps: (1) punctuation marks in a training text are removed, word frequency information and category information are collected, and a word list and a category list are formed; (2) a subject proportion vector, a subject word matrix, a subject word distinguishing degree matrix and a regression coefficient matrix are initialized; (3) the subject proportion vector, the subject word matrix, the subject word distinguishing degree matrix and the regression coefficient matrix are iteratively updated according to the word list and the category information of the training text; (4) for a testing text, the word frequency information is collected, and categorization is conducted by using the subject proportion vector, the subject word matrix, the subject word distinguishing degree matrix and the regression coefficient matrix. By means of the method, the complex pre-processing otherwise needed for text categorization can be reduced to the greatest degree, and the testing text can be categorized more accurately. The method can also mine the distinguishing degree of words within a subject and display the significance of the words in the text intuitively.
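Read as a pipeline, the four steps of the abstract have the following shape in Python, reusing the helpers sketched in the embodiment above. This is a structural skeleton only: the update equations of step (3) and the scoring rule of step (4) are not reproduced in this extract, so the corresponding bodies are placeholders rather than the patented method.

    def train_model(train_docs, K, n_iter=100):
        # Steps (1)-(2): preprocess and initialize (helpers sketched above).
        word_list, category_list = build_lists(train_docs)
        params = init_params(K, len(category_list), len(word_list))
        for _ in range(n_iter):
            pass  # step (3): update alpha, beta, psi, eta (equations omitted)
        return params, word_list, category_list

    def classify(text, params, word_list, category_list):
        # Step (4): collect the word frequencies of the test text and score
        # each category with the learned parameters (rule omitted).
        raise NotImplementedError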

Description

Technical field

[0001] The invention relates to probabilistic word selection and supervised topic models, and in particular to a text classification method based on probabilistic word selection and a supervised topic model.

Background technique

[0002] The emergence of the Internet has made it easier for people to obtain information. However, the massive data generated by the rapid development of the Internet also brings great difficulties to the analysis and utilization of data. It has therefore become more and more important to organize, manage and mine data automatically. Because of the interpretability of their underlying structure, topic models such as PLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) are widely used to mine low-dimensional representations of text. Topic models assume that all the words in a text are generated from multinomial distributions called "topics", and that the text is a mixture of these topics. [00...
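For reference, the mixture assumption stated in [0002] can be written out explicitly; the notation below is illustrative and not the patent's own. Each word w of a document d is drawn from one of K topic multinomials, mixed in document-specific proportions θ_d:

    \[
      p(w \mid d) \;=\; \sum_{k=1}^{K} p(z = k \mid d)\, p(w \mid z = k)
                  \;=\; \sum_{k=1}^{K} \theta_{dk}\, \beta_{kw}
    \]

Here β_k is the multinomial distribution over the vocabulary for topic k, and z is the latent topic assignment of the word.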


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F17/30
Inventors: 庄越挺 (Zhuang Yueting), 吴飞 (Wu Fei), 高海东 (Gao Haidong)
Owner: ZHEJIANG UNIV