Text classification method based on TF-IDF matrix and capsule network

A text classification technique based on a TF-IDF matrix and a capsule network, applied in the field of text classification. It addresses problems such as heavy computation and poor interpretability, and it improves efficiency by reducing the number of text features.

Active Publication Date: 2019-08-06
TIANJIN UNIV

AI Technical Summary

Problems solved by technology

Existing text classification methods such as the KNN (K-Nearest Neighbor) algorithm rely mainly on a limited number of surrounding samples, but their output is hard to interpret and the amount of computation is large. When the samples are unbalanced, the K neighbors of a new input sample may be dominated by samples from the larger class.
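
The imbalance problem described above can be illustrated with a small sketch. The one-dimensional data, the labels "A" and "B", and the helper name `knn_predict` are made up for illustration and do not come from the patent.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest 1-D training points."""
    # Sort training samples by absolute distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy, made-up data: class "A" has six samples, class "B" only two.
train = [(1.0, "A"), (1.2, "A"), (1.4, "A"), (1.6, "A"), (1.8, "A"), (2.0, "A"),
         (3.0, "B"), (3.1, "B")]
query = 2.9  # lies right next to the two "B" samples

pred_k1 = knn_predict(train, query, 1)  # only the closest sample votes
pred_k5 = knn_predict(train, query, 5)  # three "A" samples outvote the two "B" samples
```

With K = 5 the two nearby minority samples are outvoted by three more distant majority samples, which is the failure mode the text refers to.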

Examples

Embodiment 1

[0037] To achieve the above purpose, the embodiment of the present invention proposes a text classification method based on a TF-IDF matrix and a capsule network (see Figure 1). The method includes the following steps:

[0038] 101: Perform word segmentation processing on the input text data;

[0039] 102: Use the weakly related vocabulary removal algorithm based on the TF-IDF matrix to remove stop words from the text data: some words are deleted from the text data set D to obtain a processed text data set D' with more distinct features, which serves as the classifier input;

[0040] 103: Obtain the text vector embedding through the doc2vec algorithm;

[0041] 104: Use the obtained text vector embedding as the input of the capsule network-based text classification, and train the capsule network text classification model.
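
Steps 101 and 102 both depend on a TF-IDF matrix over the segmented texts. The visible text does not give the weighting formula, so the sketch below assumes one common variant (term frequency normalised by text length, multiplied by log(N / document frequency)); the function name `tfidf_matrix` and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, M) where M[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Assumed weighting (not taken from the patent):
    tf = count / text length, idf = log(N / document frequency).
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: the number of texts that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [
            (counts[w] / len(doc)) * math.log(n_docs / df[w]) if counts[w] else 0.0
            for w in vocab
        ]
        matrix.append(row)
    return vocab, matrix

# Step 101 stand-in: whitespace segmentation of made-up English sentences.
docs = [text.split() for text in ["the cat sat", "the dog barked", "the cat purred"]]
vocab, M = tfidf_matrix(docs)
```

Under this weighting a word that occurs in every text, such as "the" here, receives weight 0, which is the kind of low-information word a weakly related vocabulary removal step would target.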

[0042] In one embodiment, step 101 performs word segmentation processing on the text data, and the specific steps are as follows:

[0043] For text data, when performin...

Embodiment 2

[0056] The feasibility of the scheme in Embodiment 1 is verified below with specific calculation formulas and examples:

[0057] 201: Before the text is classified, the text data must first be segmented into words separated by spaces, and the dictionary Dic corresponding to the text data set is constructed, with each word that appears in the text counted only once. The constructed dictionary contains the Dic_n distinct words that appear in the text data;

[0058] 202: Stop words are removed from the segmented data using the weakly related vocabulary removal algorithm based on the TF-IDF matrix, which reduces the storage space of the text data and improves the operating efficiency of the algorithm. The TF-IDF matrix M is analyzed as a whole to obtain the global threshold α that satisfies the condition;

[0059] Specifically, all the values of the TF-IDF matrix M are sorted to obtain the threshold α that sa...
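
The dictionary construction in step 201 can be sketched as follows. The function name `build_dictionary`, the zero-based indices, the first-appearance ordering, and the sample sentences are assumptions made for illustration; only the names Dic and Dic_n come from the text.

```python
def build_dictionary(segmented_texts):
    """Map each distinct word to an index, in order of first appearance (assumed ordering)."""
    dic = {}
    for words in segmented_texts:
        for word in words:
            if word not in dic:  # each word is counted only once
                dic[word] = len(dic)
    return dic

# Made-up texts, already segmented by spaces as step 201 requires.
texts = ["the cat sat", "the dog sat down"]
segmented = [t.split(" ") for t in texts]

Dic = build_dictionary(segmented)
Dic_n = len(Dic)  # number of distinct words in the data set

# Label each text with the dictionary index of every word it contains.
encoded = [[Dic[w] for w in words] for words in segmented]
```

Here `encoded` holds one list of integer labels per text, a simple stand-in for labeling the data set by dictionary position.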

Embodiment 3

[0089] The feasibility of the schemes in Embodiments 1 and 2 is verified below with a concrete example and data:

[0090] In the experiment on the weakly related vocabulary removal algorithm based on the TF-IDF matrix, the final threshold of each text is calculated, and the weakly related vocabulary set of each text is determined from that final threshold. All words in a piece of text data that belong to its weakly related vocabulary set are deleted and the processed text data is retained. Finally, all the processed text data are combined to generate a new text data set.
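
The deletion step in [0090] can be sketched as below, taking the per-text final thresholds as given inputs because their derivation is not visible here. The sketch assumes that a word is "weakly related" when its TF-IDF weight in that text is strictly below the text's final threshold; the function name `remove_weak_words`, the weights, and the thresholds are made up.

```python
def remove_weak_words(segmented_texts, tfidf_rows, final_thresholds):
    """Delete from each text the words whose weight is below that text's final threshold.

    tfidf_rows[i] maps word -> TF-IDF weight in text i.
    final_thresholds[i] is the already-computed final threshold of text i.
    Assumption: "weakly related" means weight < threshold.
    """
    new_dataset = []
    for words, weights, beta in zip(segmented_texts, tfidf_rows, final_thresholds):
        # Weakly related vocabulary set of this text.
        weak = {w for w in set(words) if weights.get(w, 0.0) < beta}
        # Keep the remaining words in their original order.
        new_dataset.append([w for w in words if w not in weak])
    return new_dataset

# Made-up example values.
texts = [["the", "cat", "sat"], ["the", "dog", "barked"]]
weights = [
    {"the": 0.0, "cat": 0.14, "sat": 0.37},
    {"the": 0.0, "dog": 0.37, "barked": 0.37},
]
betas = [0.1, 0.2]

cleaned = remove_weak_words(texts, weights, betas)
```

The returned list plays the role of the new text data set built from the processed texts.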

Abstract

The invention discloses a text classification method based on a TF-IDF matrix and a capsule network. The method comprises: analyzing the TF-IDF matrix of the text data obtained after word segmentation to obtain a global threshold α that meets a preset condition; performing a personalized analysis of each piece of text data to obtain a threshold tj corresponding to each piece of text data and a set Sα formed by the thresholds tj; comparing the global threshold α with the threshold αi obtained by the personalized analysis of each text to obtain a final threshold βi corresponding to each piece of text data and a set Sβ formed by the final thresholds βi; processing the text data set according to the resulting set Sβ; carrying out word frequency analysis and labeling the text data set according to the order in which words appear in the dictionary, so as to realize text vector embedding; and, through the doc2vec algorithm, expressing the embedded text vectors as a text matrix and training a capsule network text classification model with the embedded text vectors as the input of capsule network-based text classification. The method effectively removes words that have little influence on text classification and thereby reduces the number of text features.

Description

Technical field

[0001] The invention relates to the fields of natural language processing and information retrieval, and in particular to a text classification method based on a TF-IDF (term frequency-inverse document frequency) matrix and a capsule network.

Background

[0002] Text classification first requires text preprocessing and text feature processing to obtain the feature vector of each text, which lays the foundation for the subsequent steps of the classification process. During feature processing, traditional machine learning methods require the specific form of the features used to represent the original data to be specified manually.

[0003] Traditional text classification methods have high text feature dimensions, sparse data, and weak representation capabilities. Existing text classification methods such as the KNN (K-Nearest Neighbor) algorithm mainly rely on the limited surrounding samples, but the interpretability of the o...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F17/27
CPC: G06F40/20, G06F18/2411, G06F18/24, G06F18/214, Y02D10/00
Inventor: 喻梅, 胡悦, 刘志强, 于健, 赵满坤, 于瑞国, 王建荣, 张功
Owner TIANJIN UNIV