The invention provides a parallel random label subset multi-label text classification method for a Spark-based
big data platform. First, large-scale text data sets and configuration files are read, a distributed data set (RDD) is created, and the training data set and the prediction data set are cached in memory to complete initialization. Second, the
required number of label subsets is randomly generated in parallel, and a new training set is derived from the original training set for each label subset. Next, the label power set transformation converts the multiple labels of each new training set into a single label, turning the data sets into multiple single-label data sets, and a base classifier is trained in parallel on each of these data sets. Then the
single-label prediction results produced by the base classifiers are converted back into multi-label results. Finally, all predicted results are collected and combined by voting to obtain the final multi-label prediction results for the test set. The multi-label text classification method improves classification accuracy and greatly reduces the learning time on large-scale multi-label data.
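
The steps above can be illustrated with a minimal single-machine sketch in plain Python. It is not the claimed Spark implementation: the function names, the toy token-count base classifier, the sampling of subsets with possible repeats, and the 0.5 voting threshold are all assumptions made for illustration. On Spark, the per-subset training step would presumably be distributed, for example by mapping over an RDD of label subsets.

```python
import random
from collections import Counter, defaultdict

def random_label_subsets(labels, k, m, seed=0):
    """Draw m random label subsets of size k (repeats possible; an assumption)."""
    rng = random.Random(seed)
    pool = sorted(labels)
    return [tuple(sorted(rng.sample(pool, k))) for _ in range(m)]

def train_base(docs, subset):
    """Toy base classifier (hypothetical stand-in for a real learner).

    Each document's label set is restricted to `subset`; the resulting
    combination of labels is treated as ONE class (label power set).
    The model stores per-class token counts.
    """
    model = defaultdict(Counter)
    for tokens, labels in docs:
        single_label = frozenset(labels) & frozenset(subset)
        model[single_label].update(tokens)
    return model

def predict_base(model, tokens):
    """Pick the power-set class whose token counts best match the document."""
    return max(model, key=lambda cls: sum(model[cls][t] for t in tokens))

def train_ensemble(docs, subsets):
    """One base classifier per label subset (sequential here, not parallel)."""
    return [(subset, train_base(docs, subset)) for subset in subsets]

def predict(models, tokens, threshold=0.5):
    """Convert each single-label prediction back to labels and vote per label."""
    votes, seen = Counter(), Counter()
    for subset, model in models:
        predicted = predict_base(model, tokens)
        for label in subset:
            seen[label] += 1                  # classifiers that cover this label
            votes[label] += label in predicted  # classifiers that voted for it
    return {label for label in seen if votes[label] / seen[label] > threshold}

# Tiny made-up corpus: (tokens, label set)
docs = [
    (["goal", "match"], {"sports"}),
    (["election", "vote"], {"politics"}),
    (["stadium", "vote", "match"], {"sports", "politics"}),
    (["stock", "market"], {"finance"}),
]
# Fixed subsets keep the demo deterministic; random_label_subsets would
# normally supply them.
subsets = [("finance", "politics"), ("finance", "sports"), ("politics", "sports")]
models = train_ensemble(docs, subsets)
print(predict(models, ["goal", "match"]))     # {'sports'}
print(predict(models, ["election", "vote"]))  # {'politics'}
```

Because each base classifier only sees a small label subset, the number of power-set classes per classifier stays bounded by the subset size rather than by the full label set, which is what makes training the classifiers independently (and therefore in parallel) practical.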