Chinese text classification method based on pre-trained word vector model and random forest algorithm

A technique combining word vectors and a random forest, applied in text database clustering/classification, neural learning methods, biological neural network models, etc. It addresses problems such as redundant features, insufficient model generalization ability, and semantic information that is too limited to fully express the text.

Pending Publication Date: 2021-02-26
中国科学院电子学研究所苏州研究院
0 Cites, 1 Cited by

AI Technical Summary

Problems solved by technology

[0003] However, traditional text classification methods usually take every word produced by text segmentation into account, which fails to highlight the key words that best represent the text and results in too many redundant features.
In addition, traditional methods usually extract features from a fixed training set only, without drawing on the rich word vector semantic information contained in an external corpus, so the amount of information that can be extracted is limited and is not enough to fully express the semantic information of the text.
Moreover, most traditional methods use a single machine learning algorithm and do not benefit from ensemble learning algorithms, so the model generalization ability is insufficient and the accuracy of Chinese text classification is low.


Examples


Embodiment

[0098] To verify the effectiveness of the scheme of the present invention, the proposed Chinese text classification method based on a pre-trained word vector model and the random forest algorithm is illustrated with a specific example below:

[0099] Input: external corpus Ω; word vector dimension q in Word2Vec; Chinese text classification training sample set of known categories Φ = [T_1, T_2, ..., T_n] and training label set L = [l_1, l_2, ..., l_n] (n is the number of training samples); Chinese text classification test sample set of unknown category Ψ = [T_1, T_2, ..., T_m] (m is the number of test samples); the number U of decision trees in the random forest; the dimension e of PCA dimensionality reduction; the number r of attributes in the random attribute subset; and the number k of keywords extracted by TextRank.
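The inputs listed in [0099] could be gathered into a single configuration object, as in the minimal sketch below. The class name `ClassificationInputs`, the field `corpus_path`, and the `validate` method are hypothetical and do not come from the patent. The check `r <= e` assumes that the random forest draws its random attribute subsets from the e-dimensional features left after PCA, which is how the abstract describes the training step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationInputs:
    """Hypothetical container for the inputs of paragraph [0099]."""
    corpus_path: str          # location of the external corpus Ω (assumed to be a file path)
    train_texts: List[str]    # training sample set Φ = [T_1, ..., T_n]
    train_labels: List[str]   # training label set L = [l_1, ..., l_n]
    test_texts: List[str]     # test sample set Ψ = [T_1, ..., T_m]
    q: int                    # Word2Vec word vector dimension
    U: int                    # number of decision trees in the random forest
    e: int                    # dimension kept after PCA
    r: int                    # attributes in each random attribute subset
    k: int                    # keywords extracted per text by TextRank

    def validate(self) -> None:
        """Raise ValueError if the inputs are inconsistent."""
        if len(self.train_texts) != len(self.train_labels):
            raise ValueError("each training sample needs exactly one label")
        for name in ("q", "U", "e", "r", "k"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be a positive integer")
        # Assumption: trees are trained on the e-dimensional reduced features,
        # so a random attribute subset cannot be larger than e.
        if self.r > self.e:
            raise ValueError("r cannot exceed the PCA dimension e")
```

If scikit-learn were used for the later stages, `e` would correspond to `PCA(n_components=e)`, and `U` and `r` to `RandomForestClassifier(n_estimators=U, max_features=r)`. That mapping is one plausible choice and is not stated in the source.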

[0100] Step 1: Based on the external corpus Ω, use the method described in stage I to generate a pre-trained Word...


Abstract

The invention provides a method for automatically classifying Chinese texts by using a pre-trained word vector model and a random forest classifier. The method comprises the following steps: training a Word2Vec pre-trained model based on an external corpus; extracting a keyword feature set for each training sample based on the TextRank algorithm, generating the word vector corresponding to each keyword with the Word2Vec pre-trained model, and thereby obtaining the text features of the whole training set; reducing the dimension of the text features of the whole training set with the PCA (principal component analysis) algorithm; training a random forest classifier on the dimension-reduced text features and labels of the whole training set; and extracting the text features of the whole test set, reducing their dimension, and inputting them into the trained random forest classifier to obtain the category information of the test samples. The invention thereby addresses the problems of existing machine-learning-based classification methods: excessive redundant features, a limited amount of extracted semantic information, difficulty in fully expressing text semantics, insufficient generalization ability, and low classification accuracy.
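The keyword and feature steps in the abstract can be illustrated with a small pure-Python sketch. It is not the patent's implementation: it assumes the text has already been segmented into tokens, uses a basic co-occurrence TextRank, stands in a plain dictionary for the Word2Vec model, and assumes that the k keyword vectors are concatenated (zero-padded when a keyword is missing) to form the text feature. The function names `textrank_keywords` and `keyword_features`, the `window`, `damping`, and `iterations` parameters are all hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, k, window=3, damping=0.85, iterations=50):
    """Return up to k tokens ranked by a basic TextRank score."""
    # Undirected weighted graph: two distinct tokens are linked each time
    # they appear within the same sliding window of `window` tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            # Each neighbour passes on its score in proportion to edge weight.
            rank = sum(
                weights[nb][word] / sum(weights[nb].values()) * scores[nb]
                for nb in weights[word]
            )
            new_scores[word] = (1.0 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


def keyword_features(keywords, embeddings, k, q):
    """Concatenate k keyword vectors of dimension q into one k*q feature.

    Assumption: concatenation with zero padding; `embeddings` is a plain
    dict standing in for a pre-trained Word2Vec lookup.
    """
    feature = []
    for i in range(k):
        word = keywords[i] if i < len(keywords) else None
        feature.extend(embeddings.get(word, [0.0] * q))
    return feature
```

Under these assumptions each text yields a vector of length k*q, which would then be reduced to e dimensions by PCA before the random forest is trained.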

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and relates to a method for automatically classifying Chinese texts by using a pre-trained word vector model and a random forest classifier.

Background art

[0002] In the era of big data, a large amount of information is constantly emerging on the Internet. Faced with such a huge amount of data, text classification and intelligent recommendation are the usual methods for letting users quickly and effectively locate the information they need. Text classification technology can effectively organize and classify massive amounts of information, using algorithmic models to assign unlabeled texts to predefined categories. Previously, texts were generally read and then classified manually, which often required a great deal of human time. Therefore, it is of great significance to use machine learning algorithms to classify texts automatically. Common algo...

Application Information

IPC(8): G06F40/289, G06F40/216, G06F16/35, G06K9/62, G06N3/04, G06N3/08
CPC: G06F40/289, G06F40/216, G06F16/35, G06N3/08, G06N3/045, G06F18/2135, G06F18/24323, G06F18/214
Inventor: 金康荣, 胡岩峰, 顾爽, 付啟明, 潘月浩, 陈尚, 胡惊涛
Owner: 中国科学院电子学研究所苏州研究院