The invention discloses a text categorization method based on probabilistic word selection and a supervised topic model. The method includes the following steps: (1) punctuation marks are removed from a training text, word frequency information and category information are collected, and a word list and a category list are formed; (2) a topic proportion vector, a topic-word matrix, a topic-word discrimination matrix and a regression coefficient matrix are initialized; (3) the topic proportion vector, the topic-word matrix, the topic-word discrimination matrix and the regression coefficient matrix are iteratively updated according to the word list and the category list of the training text; (4) for a test text, the word frequency information is collected, and categorization is performed using the topic proportion vector, the topic-word matrix, the topic-word discrimination matrix and the regression coefficient matrix. The method minimizes the complex preprocessing otherwise required for text categorization and categorizes the test text more accurately. It also mines the discriminative power of words within each topic and allows the importance of words in a text to be displayed visually.
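
Steps (1) and (2) can be illustrated with a minimal Python sketch. The abstract does not specify tokenization rules, matrix shapes, or the initialization scheme, so everything below is an assumption made for illustration: the function names `tokenize`, `build_vocab` and `init_params`, the parameter `n_topics` (number of latent topics), lowercasing and whitespace splitting, and the random or constant starting values. The iterative update of step (3) and the categorization of step (4) are not shown because the abstract gives no formulas for them.

```python
import random
import string
from collections import Counter


def tokenize(text):
    """Remove ASCII punctuation, lowercase, and split on whitespace (assumed rules)."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()


def build_vocab(docs, labels):
    """Step (1): collect word frequencies and form the word list and category list."""
    counts = [Counter(tokenize(d)) for d in docs]      # word frequencies per text
    word_list = sorted(set().union(*counts))           # all distinct words
    category_list = sorted(set(labels))                # all distinct categories
    return counts, word_list, category_list


def init_params(n_docs, n_words, n_categories, n_topics, seed=0):
    """Step (2): initialize the four parameter groups (shapes and values are assumptions)."""
    rng = random.Random(seed)

    def normalized_row(n):
        # Random positive row rescaled to sum to 1.
        row = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(row)
        return [v / total for v in row]

    proportions = [normalized_row(n_topics) for _ in range(n_docs)]  # one vector per text
    word_matrix = [normalized_row(n_words) for _ in range(n_topics)]  # one row per topic
    discrimination = [[0.5] * n_words for _ in range(n_topics)]       # neutral starting value
    regression = [[0.0] * n_topics for _ in range(n_categories)]      # one row per category
    return proportions, word_matrix, discrimination, regression


docs = ["Stocks fell, bonds rose.", "The team won the match!"]
labels = ["finance", "sports"]
counts, word_list, category_list = build_vocab(docs, labels)
params = init_params(len(docs), len(word_list), len(category_list), n_topics=2)
```

Only the standard library is used so the sketch stays self-contained; a practical implementation would more likely hold these parameters in array types from a numerical library.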