Chinese text classification method based on pre-trained word vector model and random forest algorithm

A Chinese text classification method combining random forest and word vector technology, applied to text database clustering/classification, neural learning methods, biological neural network models, etc. It addresses problems such as redundant features, insufficient model generalization ability, and incomplete expression of semantic information.

Pending Publication Date: 2021-02-26
Suzhou Institute of the Institute of Electronics, Chinese Academy of Sciences (中国科学院电子学研究所苏州研究院)
Cites: 0 · Cited by: 1

AI-Extracted Technical Summary

Problems solved by technology

[0003] However, traditional text classification methods usually take every word remaining after text segmentation into account, which fails to highlight the key words that best represent the text and results in too many redundant features.
In addition, traditional methods usually extract features from a single training set, without the help of the huge amount of word-vector semantic information c...

Abstract

The invention provides a method for automatically classifying Chinese texts using a pre-trained word vector model and a random forest classifier. The method comprises the following steps: training a Word2Vec pre-training model on an external corpus; extracting a keyword feature set for each training sample with the TextRank algorithm, generating the word vector corresponding to each keyword via the Word2Vec pre-training model, and thereby obtaining text features for the whole training set; reducing the dimension of the training-set text features with the PCA principal component analysis algorithm; training a random forest classifier on the dimension-reduced text features and labels of the whole training set; and extracting text features of the whole test set, reducing their dimension, and inputting them into the trained random forest classifier to obtain the category of each test sample. The invention addresses the problems of existing machine-learning-based classification methods: excessive redundant features, limited extracted semantic information, difficulty in fully expressing text semantics, insufficient generalization ability, and low classification accuracy.

Application Domain

Character and pattern recognition; Natural language data processing; +4

Technology Topic

Principal component analysis algorithm; Feature (machine learning); +13


Examples

  • Experimental program (1)

Example Embodiment

[0097] Example
[0098] To verify the effectiveness of the present invention, a specific example of the proposed Chinese text classification method based on the pre-trained word vector model and random forest algorithm is given below:
[0099] Input: external corpus Ω; the word-vector dimension q in Word2Vec; the training sample set of Chinese texts with known categories Φ = [T_1, T_2, ..., T_n] and the training label set L = [l_1, l_2, ..., l_n] (n is the number of training samples); the test sample set of Chinese texts with unknown categories Ψ = [T_1, T_2, ..., T_m] (m is the number of test samples); the number of decision trees U in the random forest; the PCA target dimension e; the number of attributes r in each random attribute subset; and the number of keywords k extracted by TextRank.
[0100] Step 1: Based on the external corpus Ω, use the method described in Stage I to generate a pre-trained Word2Vec word vector model M_word2vec.
[0101] Step 2: Based on the training sample set Φ = [T_1, T_2, ..., T_n], use the TextRank algorithm in Stage II to obtain the keyword feature set of each training sample. With the generated pre-trained word vector model M_word2vec, obtain, for each text T_i, the word vector wv_{i,j} corresponding to each of its keywords w_{i,j}, where j = 1, 2, ..., k.
[0102] Step 3: For each text T_i, concatenate the word vectors of all its keywords into the feature vector f_i = [wv_{i,1}, wv_{i,2}, ..., wv_{i,k}], then assemble the feature matrix F = [f_1, f_2, ..., f_n].
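Steps 2 and 3 can be sketched with a minimal pure-NumPy TextRank (a simplified co-occurrence-graph version, not necessarily the patent's exact Stage II procedure) and a random stand-in for the M_word2vec lookup; the sample text, window size, and dimensions are hypothetical:

```python
import numpy as np

def textrank_keywords(words, k, window=2, d=0.85, iters=50):
    """Minimal TextRank: words co-occurring within `window` positions are
    linked; keyword scores come from power iteration on the word graph."""
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    A = np.zeros((n, n))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            u, v = idx[w], idx[words[j]]
            if u != v:
                A[u, v] = A[v, u] = 1.0
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    M = A / deg                       # row-normalized transition weights
    score = np.ones(n) / n
    for _ in range(iters):
        score = (1 - d) / n + d * (M.T @ score)
    return [vocab[i] for i in np.argsort(-score)[:k]]

q, k = 8, 3                           # hypothetical dimensions
rng = np.random.default_rng(0)
text = "文本 分类 方法 基于 词 向量 与 文本 特征 词 向量".split()
keywords = textrank_keywords(text, k)
wv = {w: rng.standard_normal(q) for w in set(text)}  # stand-in for M_word2vec
f_i = np.concatenate([wv[w] for w in keywords])      # f_i = [wv_1, ..., wv_k]
print(len(keywords), f_i.shape)  # 3 (24,)
```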
[0103] Step 4: Use the PCA dimensionality-reduction method in Stage III to obtain the reduced text feature matrix H ∈ R^{n×e}.
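Step 4 can be illustrated with scikit-learn's PCA (an assumed implementation); the sizes n = 40, k·q = 24, and e = 10 are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA

n, kq, e = 40, 24, 10                 # hypothetical sizes; e = target dimension
rng = np.random.default_rng(0)
F = rng.standard_normal((n, kq))      # stands in for the feature matrix F
pca = PCA(n_components=e)
H = pca.fit_transform(F)              # reduced feature matrix H ∈ R^{n×e}
print(H.shape)  # (40, 10)
```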
[0104] Step 5: Based on the training feature matrix H ∈ R^{n×e} and the training label set L = [l_1, l_2, ..., l_n], train the random forest classifier in Stage IV to generate the text classification model M_RF.
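Step 5 can be sketched with scikit-learn's random forest (an assumed implementation); the values of U, r, and the random features and labels are illustrative stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n, e = 40, 10          # hypothetical training-set size and reduced dimension
U, r = 50, 3           # U trees; r attributes tried at each split (illustrative)
rng = np.random.default_rng(0)
H = rng.standard_normal((n, e))      # stands in for the reduced matrix H
L = rng.integers(0, 3, size=n)       # stands in for labels l_1, ..., l_n
M_RF = RandomForestClassifier(n_estimators=U, max_features=r, random_state=0)
M_RF.fit(H, L)
print(len(M_RF.estimators_))  # 50
```

Here `max_features=r` mirrors the random attribute subset of size r sampled at each tree split.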
[0105] Step 6: Based on the test sample set Ψ = [T_1, T_2, ..., T_m], apply the same procedure as Steps 2 to 4 to obtain the feature matrix of the test set, denoted H_T ∈ R^{m×e}.
[0106] Step 7: Based on the test feature matrix H_T ∈ R^{m×e} and the generated text classification model M_RF, use the prediction method in Stage V to obtain the category labels of the test sample set.
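Steps 6 and 7 can be sketched as follows, again with scikit-learn as an assumed implementation and random matrices standing in for the real reduced features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

e, m = 10, 5                          # hypothetical dimensions; m test samples
rng = np.random.default_rng(1)
# Train a stand-in model M_RF on random reduced features (as in Step 5).
H = rng.standard_normal((40, e))
L = rng.integers(0, 3, size=40)
M_RF = RandomForestClassifier(n_estimators=50, random_state=0).fit(H, L)
# Step 7: predict category labels for the test feature matrix H_T ∈ R^{m×e}.
H_T = rng.standard_normal((m, e))
labels = M_RF.predict(H_T)
print(labels.shape)  # (5,)
```

In a full pipeline, H_T would come from applying the already-fitted TextRank/Word2Vec feature extraction and the same PCA projection to the test texts, not from refitting on the test set.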

