The present invention provides a mutual information based parallel feature selection method for document classification, which comprises: (a) selecting samples and classifying them; (b) computing the TF-IDF values of words; (c) generating an initialized data set D = {x1, x2, ..., xN}; (d) performing distributed calculation by evenly distributing all sub data sets to m computing nodes; (e) establishing sets, wherein S = ∅ and V = {X1, X2, ..., XM}; (f) calculating the joint probability distribution and the conditional probability distribution; (g) calculating the mutual information; (h) selecting a feature variable; (i) determining whether the number of selected feature variables is sufficient; and (j) performing
document classification. In the parallel feature selection method for document classification provided by the present invention, Rayleigh entropy based mutual information is used to measure the correlation between each feature variable and the class variable, so that the finally selected feature variables better represent the features relevant to document classification, the classification is more accurate, and the classification result is better than that obtained with a common feature selection method. The selection method is therefore advantageous and suitable for wide application.
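
The set-building, probability, mutual-information and selection steps of the method can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: features are already discretized, ordinary Shannon mutual information is swapped in for the entropy-based variant named above, each candidate is scored jointly with the already selected set S, and the m computing nodes are simulated by counting over m slices in a single process and merging the partial counts. The names `split_evenly`, `joint_counts`, `mutual_information` and `select_features` are hypothetical.

```python
import math
from collections import Counter

def split_evenly(n_samples, m):
    """Index ranges that spread n_samples across m nodes as evenly as possible."""
    base, extra = divmod(n_samples, m)
    bounds, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def joint_counts(keys, labels, m):
    """Count (key, label) pairs per simulated node, then merge the partial counts."""
    total = Counter()
    for start, end in split_evenly(len(labels), m):
        # Each slice stands in for the work of one computing node (assumption).
        total += Counter(zip(keys[start:end], labels[start:end]))
    return total

def mutual_information(keys, labels, m=2):
    """Shannon mutual information (in nats) estimated from the merged joint counts."""
    n = len(labels)
    joint = joint_counts(keys, labels, m)
    key_counts, label_counts = Counter(), Counter()
    for (key, label), c in joint.items():
        key_counts[key] += c
        label_counts[label] += c
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts.
    return sum(
        (c / n) * math.log(c * n / (key_counts[key] * label_counts[label]))
        for (key, label), c in joint.items()
    )

def select_features(features, labels, k, m=2):
    """Greedily move feature names from V to S until S holds k of them."""
    selected = []                 # S starts as the empty set
    remaining = sorted(features)  # V holds every candidate feature name
    while remaining and len(selected) < k:
        def score(name):
            columns = [features[f] for f in selected + [name]]
            keys = list(zip(*columns))  # joint value of S plus the candidate, per sample
            return mutual_information(keys, labels, m)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "x" determines the class, "y" is independent of it, "z" is partly informative.
toy_features = {"x": [1, 1, 0, 0], "y": [0, 1, 0, 1], "z": [1, 1, 1, 0]}
toy_labels = ["a", "a", "b", "b"]
print(select_features(toy_features, toy_labels, k=1))  # -> ['x']
```

Because the per-node results are plain counts, merging them is a simple addition, so the estimated mutual information does not depend on how many nodes the samples are split across.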