Chinese text classification method based on pre-trained word vector model and random forest algorithm

A Chinese text classification method combining random forest and word vector technology, applied to text database clustering/classification, neural learning methods, biological neural network models, etc. It addresses problems such as redundant features, insufficient model generalization ability, and incomplete expression of semantic information.

Pending Publication Date: 2021-02-26
Suzhou Institute of the Institute of Electronics, Chinese Academy of Sciences (中国科学院电子学研究所苏州研究院)
Cites: 0 · Cited by: 1

AI-Extracted Technical Summary

Problems solved by technology

[0003] However, traditional text classification methods usually take every word remaining after text segmentation into account, which fails to highlight the key words that best represent the text and results in too many redundant features.
In addition, traditional methods usually extract features from a single training set, without the help of the huge amount of word-vector semantic information c...

Abstract

The invention provides a method for automatically classifying Chinese texts using a pre-trained word vector model and a random forest classifier. The method comprises the following steps: training a Word2Vec pre-training model on an external corpus; extracting a keyword feature set for each training sample with the TextRank algorithm, generating the word vector corresponding to each keyword via the Word2Vec pre-training model, and thereby obtaining text features for the whole training set; reducing the dimension of the training-set text features with the PCA principal component analysis algorithm; training a random forest classifier on the dimension-reduced text features and labels of the whole training set; and extracting text features of the whole test set, reducing their dimension, and inputting them into the trained random forest classifier to obtain the category of each test sample. The invention addresses the problems of existing machine-learning-based classification methods: excessive redundant features, limited extracted semantic information, difficulty in fully expressing text semantics, insufficient generalization ability, and low classification accuracy.

Application Domain

Character and pattern recognition; Natural language data processing; +4

Technology Topic

Principal component analysis algorithm; Feature (machine learning); +13


Examples

  • Experimental program (1)

Example Embodiment

[0097] Example
[0098] To verify the effectiveness of the present invention, a specific example of the proposed Chinese text classification method based on the pre-trained word vector model and random forest algorithm is given below:
[0099] Input: external corpus Ω; the word-vector dimension q in Word2Vec; the training sample set of Chinese texts with known categories Φ = [T_1, T_2, ..., T_n] and the training label set L = [l_1, l_2, ..., l_n] (n is the number of training samples); the test sample set of Chinese texts with unknown categories Ψ = [T_1, T_2, ..., T_m] (m is the number of test samples); the number of decision trees U in the random forest; the PCA target dimension e; the number of attributes r in each random attribute subset; and the number of keywords k extracted by TextRank.
[0100] Step 1: Based on the external corpus Ω, use the method described in Stage I to generate a pre-trained Word2Vec word vector model M_word2vec.
[0101] Step 2: Based on the training sample set Φ = [T_1, T_2, ..., T_n], use the TextRank algorithm in Stage II to obtain the keyword feature set of each training sample. With the generated pre-trained word vector model M_word2vec, obtain, for each text T_i, the word vector wv_{i,j} corresponding to each of its keywords w_{i,j}, where j = 1, 2, ..., k.
[0102] Step 3: For each text T_i, concatenate the word vectors of all its keywords into the feature vector f_i = [wv_{i,1}, wv_{i,2}, ..., wv_{i,k}], then assemble the feature matrix F = [f_1, f_2, ..., f_n].
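Steps 2 and 3 can be sketched with a minimal pure-NumPy TextRank (a simplified co-occurrence-graph version, not necessarily the patent's exact Stage II procedure) and a random stand-in for the M_word2vec lookup; the sample text, window size, and dimensions are hypothetical:

```python
import numpy as np

def textrank_keywords(words, k, window=2, d=0.85, iters=50):
    """Minimal TextRank: words co-occurring within `window` positions are
    linked; keyword scores come from power iteration on the word graph."""
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    A = np.zeros((n, n))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            u, v = idx[w], idx[words[j]]
            if u != v:
                A[u, v] = A[v, u] = 1.0
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    M = A / deg                       # row-normalized transition weights
    score = np.ones(n) / n
    for _ in range(iters):
        score = (1 - d) / n + d * (M.T @ score)
    return [vocab[i] for i in np.argsort(-score)[:k]]

q, k = 8, 3                           # hypothetical dimensions
rng = np.random.default_rng(0)
text = "文本 分类 方法 基于 词 向量 与 文本 特征 词 向量".split()
keywords = textrank_keywords(text, k)
wv = {w: rng.standard_normal(q) for w in set(text)}  # stand-in for M_word2vec
f_i = np.concatenate([wv[w] for w in keywords])      # f_i = [wv_1, ..., wv_k]
print(len(keywords), f_i.shape)  # 3 (24,)
```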
[0103] Step 4: Use the PCA dimensionality-reduction method in Stage III to obtain the reduced text feature matrix H ∈ R^{n×e}.
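Step 4 can be illustrated with scikit-learn's PCA (an assumed implementation); the sizes n = 40, k·q = 24, and e = 10 are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA

n, kq, e = 40, 24, 10                 # hypothetical sizes; e = target dimension
rng = np.random.default_rng(0)
F = rng.standard_normal((n, kq))      # stands in for the feature matrix F
pca = PCA(n_components=e)
H = pca.fit_transform(F)              # reduced feature matrix H ∈ R^{n×e}
print(H.shape)  # (40, 10)
```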
[0104] Step 5: Based on the training feature matrix H ∈ R^{n×e} and the training label set L = [l_1, l_2, ..., l_n], train the random forest classifier in Stage IV to generate the text classification model M_RF.
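Step 5 can be sketched with scikit-learn's random forest (an assumed implementation); the values of U, r, and the random features and labels are illustrative stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n, e = 40, 10          # hypothetical training-set size and reduced dimension
U, r = 50, 3           # U trees; r attributes tried at each split (illustrative)
rng = np.random.default_rng(0)
H = rng.standard_normal((n, e))      # stands in for the reduced matrix H
L = rng.integers(0, 3, size=n)       # stands in for labels l_1, ..., l_n
M_RF = RandomForestClassifier(n_estimators=U, max_features=r, random_state=0)
M_RF.fit(H, L)
print(len(M_RF.estimators_))  # 50
```

Here `max_features=r` mirrors the random attribute subset of size r sampled at each tree split.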
[0105] Step 6: Based on the test sample set Ψ = [T_1, T_2, ..., T_m], apply the same procedure as Steps 2 to 4 to obtain the feature matrix of the test set, denoted H_T ∈ R^{m×e}.
[0106] Step 7: Based on the test feature matrix H_T ∈ R^{m×e} and the generated text classification model M_RF, use the prediction method in Stage V to obtain the category labels of the test sample set.
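Steps 6 and 7 can be sketched as follows, again with scikit-learn as an assumed implementation and random matrices standing in for the real reduced features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

e, m = 10, 5                          # hypothetical dimensions; m test samples
rng = np.random.default_rng(1)
# Train a stand-in model M_RF on random reduced features (as in Step 5).
H = rng.standard_normal((40, e))
L = rng.integers(0, 3, size=40)
M_RF = RandomForestClassifier(n_estimators=50, random_state=0).fit(H, L)
# Step 7: predict category labels for the test feature matrix H_T ∈ R^{m×e}.
H_T = rng.standard_normal((m, e))
labels = M_RF.predict(H_T)
print(labels.shape)  # (5,)
```

In a full pipeline, H_T would come from applying the already-fitted TextRank/Word2Vec feature extraction and the same PCA projection to the test texts, not from refitting on the test set.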

