Chinese text classification method based on pre-trained word vector model and random forest algorithm
The method applies random forest and word vector techniques to text classification (fields: text database clustering/classification, neural learning methods, biological neural network models, etc.). It addresses problems in existing methods such as redundant features, insufficient model generalization ability, and semantic information that cannot be fully expressed.
Pending Publication Date: 2021-02-26
Suzhou Research Institute, Institute of Electronics, Chinese Academy of Sciences (中国科学院电子学研究所苏州研究院)
Abstract
The invention provides a method for automatically classifying Chinese texts by using a pre-trained word vector model and a random forest classifier. The method comprises the following steps: training a Word2Vec pre-trained model on an external corpus; extracting a keyword feature set from each training sample with the TextRank algorithm, generating a word vector for each keyword with the Word2Vec pre-trained model, and thereby obtaining the text features of the whole training set; reducing the dimension of the text features of the whole training set with principal component analysis (PCA); training a random forest classifier on the dimension-reduced text features and labels of the whole training set; and extracting the text features of the whole test set, reducing their dimension, and inputting them into the trained random forest classifier to obtain the category of each test sample. The invention addresses the problems that existing machine-learning-based classification methods have excessive redundant features, extract a limited amount of semantic information, have difficulty fully expressing text semantics, generalize insufficiently, and achieve low classification accuracy.
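The feature-extraction step described in the abstract can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper names `textrank_keywords` and `text_feature`, the toy two-dimensional word vectors, the co-occurrence window, and the choice to concatenate the keyword vectors into one text feature are all assumptions made for the example. The text is assumed to be already segmented into words, and a real system would look the vectors up in a trained Word2Vec model rather than in a hand-written dictionary.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Synchronous PageRank updates: each word shares its score equally
    # among its neighbors, damped toward a uniform baseline.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def text_feature(tokens, word_vectors, top_k=3):
    """Concatenate the vectors of the top-k keywords (assumed pooling scheme)."""
    dim = len(next(iter(word_vectors.values())))
    feature = []
    for word in textrank_keywords(tokens, top_k=top_k):
        # Keywords missing from the vocabulary contribute a zero vector.
        feature.extend(word_vectors.get(word, [0.0] * dim))
    # Zero-pad so every text yields a feature of the same length.
    feature.extend([0.0] * (top_k * dim - len(feature)))
    return feature


# Toy, pre-segmented sports sentence and hypothetical 2-d word vectors.
tokens = ["球队", "足球", "比赛", "足球", "球员", "足球", "进球"]
word_vectors = {
    "足球": [0.9, 0.1],
    "比赛": [0.7, 0.3],
    "球员": [0.8, 0.2],
    "球队": [0.6, 0.4],
    "进球": [0.5, 0.5],
}

keywords = textrank_keywords(tokens, top_k=3)
feature = text_feature(tokens, word_vectors, top_k=3)
print(keywords)
print(feature)
```

In this toy sentence "足球" co-occurs with every other word, so it ranks first. The remaining steps of the abstract would stack such features into a matrix for the training set, reduce its dimension, and fit the classifier, for example with scikit-learn's `PCA` and `RandomForestClassifier`; that library choice is likewise an assumption, since the source names only the algorithms.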
Application Domain
Character and pattern recognition; Natural language data processing (and 4 more)
Technology Topic
Principal component analysis algorithm; Feature (machine learning) (and 13 more)