Text classification method based on CNN-SVM-KNN combined model

A text classification technique based on a combined model. It is applied in text database clustering/classification, unstructured text data retrieval, and special data processing applications, and addresses the problem of low text classification accuracy.

Inactive Publication Date: 2019-11-05
HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to solve the problem of low text classification accuracy in existing methods by proposing a text classification method based on a CNN-SVM-KNN combined model.



Examples


Specific Embodiment 1

[0037] Specific Embodiment 1: This embodiment is described with reference to Figure 1. The specific process of the text classification method based on the CNN-SVM-KNN combined model in this embodiment is as follows:

[0038] Text classification can generally be divided into the following stages: text preprocessing, feature selection, training and testing, and metric evaluation. First, the training set is used to build a classifier model. The model is then applied to the test set for classification. Finally, the predicted category labels are compared with the true labels, and the quality of the classifier is judged through evaluation metrics.
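The final stage, comparing predicted labels with true labels, can be sketched as follows. The source does not say which indicators it uses, so accuracy, precision, recall, and F1 are assumed here as common choices. The function name `evaluate` and its parameters are hypothetical.

```python
def evaluate(true_labels, predicted_labels, positive):
    """Compare predicted labels with true labels for one class of interest.

    Assumption: accuracy, precision, recall and F1 stand in for the
    unspecified "indicators" of the source text.
    """
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    # Counts for the class treated as "positive".
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For a multi-class task, one option is to call this once per class and average the results.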

[0039] Step 1: Text preprocessing;

[0040] Step 2: Perform feature extraction on the text preprocessed in step 1 to obtain the feature-extracted text;

[0041] Step 3: Establish a CNN model based on step 2;

[0042] Step 4: Establish a CNN-SVM model;

[0043] Step 5: Establish a CNN-KNN model;

[0044] Step 6: Manually set the...

Specific Embodiment 2

[0054] Specific Embodiment 2: This embodiment differs from Specific Embodiment 1 in how the text is preprocessed in step 1. The specific process is as follows:

[0055] Text is usually composed of words and sentences, which computers cannot process directly. The text must therefore be preprocessed to remove useless information and convert it into a form the computer can handle. Because Chinese and English require different preprocessing, they are handled separately.

[0056] Words in English text are separated by spaces, so word segmentation can be done by splitting on spaces, as shown in Figure 9.

[0057] The English text preprocessing process is:

[0058] (1) Convert uppercase letters to lowercase;

[0059] (2) Remove stop words, such as "a", "an", and "the", which carry no practical meaning;

[0060] (3) Lemmatization; all English...
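A minimal sketch of the first two English preprocessing operations, lowercasing and stop-word removal, combined with the space-based word segmentation described earlier. The third operation is left out because the source text is cut off at that point. The name `preprocess_english` and the tiny `STOP_WORDS` set are illustrative assumptions, and punctuation handling is not covered.

```python
# Illustrative subset only; a real stop-word list would be much longer.
STOP_WORDS = {"a", "an", "the"}


def preprocess_english(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()  # (1) lowercase, then segment on spaces
    return [tok for tok in tokens if tok not in stop_words]  # (2) stop words


if __name__ == "__main__":
    print(preprocess_english("The cat sat on a Mat"))
```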

Specific Embodiment 3

[0066] Specific Embodiment 3: This embodiment differs from Specific Embodiment 1 or 2 in how step 2 performs feature extraction on the text preprocessed in step 1 to obtain the feature-extracted text. The process is as follows:

[0067] Feature selection is to select b(b

[0068] In this case, the new features are simply a subset of the original features. The discarded features are considered unimportant and unable to represent the theme of the article. After preprocessing, the feature matrix is usually very large and high-dimensional, which leads to excessive computation, long training times, and low classification accuracy. Feature selection eliminates the unimportant noise and retains the features that highlight the theme of the article, thereby achieving the purpose ...
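The idea of keeping only a subset of b features can be illustrated with a small sketch. The visible text does not name the selection criterion, so document frequency (the number of documents a term appears in) is used here purely as a stand-in. The functions `select_features` and `project` are hypothetical names.

```python
from collections import Counter


def select_features(documents, b):
    """Return the b terms with the highest document frequency.

    Assumption: document frequency is only a placeholder criterion;
    the source does not state which score it ranks features by.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    # Highest frequency first; ties broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    return ranked[:b]


def project(tokens, selected):
    """Represent one document as term counts over the selected features."""
    return [tokens.count(term) for term in selected]
```

Every discarded term simply disappears from the vector returned by `project`, so the reduced representation has exactly b dimensions.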


Abstract

The invention discloses a text classification method based on a CNN-SVM-KNN combined model, and relates to text classification methods based on combined models. The objective of the invention is to solve the problem of low text classification accuracy in existing methods. The method comprises the following steps: 1, text preprocessing; 2, performing feature extraction on the text preprocessed in step 1 to obtain the feature-extracted text; 3, establishing a CNN model based on step 2; 4, establishing a CNN-SVM model; 5, establishing a CNN-KNN model; 6, setting a distinguishing threshold d; 7, calculating the distance: computing the distance tmp from the sample point to be classified to the optimal classification surface of the CNN-SVM classifier; 8, comparing distances: when tmp is greater than d, selecting the CNN-SVM classifier, and otherwise selecting the CNN-KNN classifier; and 9, repeatedly executing steps 6 to 9 to search for the value of d that optimizes the evaluation metric. The method is applied to the field of text classification.
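The threshold rule of steps 6 to 9 can be sketched as follows. This is a simplified illustration and not the patented implementation. It assumes a binary task with labels +1 and -1, a linear SVM given by a hypothetical weight vector `w` and offset `bias`, feature vectors that have already been produced by the CNN, and accuracy as the evaluation metric. All function names are made up for the example.

```python
import math
from collections import Counter


def svm_distance(x, w, bias):
    """Signed distance from x to the hyperplane w.x + bias = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return score / math.sqrt(sum(wi * wi for wi in w))


def knn_predict(x, train_x, train_y, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(x, train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def combined_predict(x, w, bias, train_x, train_y, d, k=3):
    """Use the SVM far from the boundary, KNN close to it."""
    signed = svm_distance(x, w, bias)
    tmp = abs(signed)                      # step 7: distance to the surface
    if tmp > d:                            # step 8: compare with threshold d
        return 1 if signed > 0 else -1     # SVM decision
    return knn_predict(x, train_x, train_y, k)


def search_threshold(candidates, val_x, val_y, w, bias,
                     train_x, train_y, k=3):
    """Step 9: try each candidate d and keep the most accurate one."""
    best_d, best_acc = None, -1.0
    for d in candidates:
        preds = [combined_predict(x, w, bias, train_x, train_y, d, k)
                 for x in val_x]
        acc = sum(p == y for p, y in zip(preds, val_y)) / len(val_y)
        if acc > best_acc:
            best_d, best_acc = d, acc
    return best_d, best_acc
```

The design rests on the observation that an SVM is least reliable for points near its separating surface, so those points are handed to the nearest-neighbour vote instead.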

Description

Technical field

[0001] The invention relates to a text classification method based on a combined model. The invention is used in the field of text classification.

Background

[0002] With the rapid development of network technology, information on the Internet keeps appearing in vast quantities. Relying on manual classification of this mass of information is impractical: manual classification consumes a great deal of time and resources, and differences between annotators make it difficult to reach a consistent classification result. Since the 1990s, automatic classification based on statistics and machine learning has therefore been a focus of attention and the main technique applied in practice. However, as text resources keep growing, it has become increasingly difficult to meet people's actual needs, which poses a severe test for text classification technology....


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35
CPC: G06F16/353
Inventor: 郑文斌, 凤雷, 刘冰, 付平, 孙媛媛, 石金龙, 叶俊涛, 王天城, 魏明晨, 徐明珠, 吴瑞东
Owner: HARBIN INST OF TECH