TF-IDF feature extraction based short text classification method

A short text classification technique based on TF-IDF feature extraction, applied in the field of data processing. It addresses cases where existing technical solutions cannot be carried out, and it improves algorithm performance and strengthens feature weights.

Active Publication Date: 2017-03-22
广东广业开元科技有限公司

AI Technical Summary

Problems solved by technology

If these external resources cannot be obtained, and there are not enough internal resources to pre-establi

Embodiment Construction

[0028] Specific embodiments of the invention are further described below with reference to the accompanying drawings:

[0029] Referring to Figure 3, a short text classification method based on TF-IDF feature extraction includes the following steps:

[0030] Step A: Dataset annotation and preprocessing

[0031] Extract short text data from the overall data set as the training data of the SVM classifier, classify and label the extracted data according to the classification requirements, and then perform word segmentation to divide the short text data into multiple words;

[0032] Further, as a preferred embodiment, in step A the Jieba (结巴) word segmentation method is used for word segmentation.
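Step A can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names `preprocess` and `whitespace_tokenize`, the sample texts, and the labels are all hypothetical. A whitespace splitter stands in for the segmenter so the sketch runs with the standard library alone; for Chinese text, a segmenter such as Jieba's `jieba.lcut` could be passed as `tokenize` instead.

```python
from typing import Callable, List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in segmenter (an assumption): split on whitespace."""
    return text.split()


def preprocess(
    labelled: List[Tuple[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[Tuple[List[str], str]]:
    """Turn (text, label) pairs into (tokens, label) pairs.

    Blank tokens are dropped, because some segmenters return
    whitespace as separate tokens.
    """
    segmented = []
    for text, label in labelled:
        tokens = [t.strip() for t in tokenize(text) if t.strip()]
        segmented.append((tokens, label))
    return segmented


# Hypothetical labelled short texts; categories are illustrative only.
samples = [
    ("road repair request", "transport"),
    ("tax refund query", "finance"),
]
segmented = preprocess(samples)
```

Passing the segmenter as a parameter keeps the labelling and segmentation steps independent of any particular tokenizer.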

[0033] Step B: Compute the TF-IDF vector for classification enhancement

[0034] Extract data according to the classification and labeling of the above steps, and randomly divide the data in each category into two groups according to the proportion, as the training set a...
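The per-category random split described in this step might look like the sketch below. The source text is cut off after naming the first group as the training set, so the role of the second group and the actual proportion are not stated here. The 80/20 ratio, the fixed seed, and the name `split_by_class` are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[List[str], str]  # (tokens, label)


def split_by_class(
    samples: List[Sample], first_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Randomly split each category into two groups by a fixed proportion."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    by_label: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    first, second = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        cut = int(round(len(items) * first_ratio))
        first.extend(items[:cut])
        second.extend(items[cut:])
    return first, second


# Hypothetical data: 10 samples in class "a", 5 in class "b".
data = [([f"w{i}"], "a") for i in range(10)] + [([f"v{i}"], "b") for i in range(5)]
group_1, group_2 = split_by_class(data, first_ratio=0.8)
```

Splitting inside each category, rather than over the whole data set, keeps the class proportions the same in both groups.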

Abstract

The invention discloses a short text classification method based on TF-IDF feature extraction. In the method, short texts are merged into a long text to strengthen their TF-IDF features, and dimension reduction is then performed to generate a feature word list and a feature word dictionary. While the feature word list is built, a compensation mechanism is applied to classes whose features are relatively indistinct, which increases the weights of the text feature vectors. No additional word banks or word vector dictionaries need to be constructed or trained, so algorithm performance improves while the feature representation of the texts is preserved. The method can be widely applied in the field of data processing.
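The idea in the abstract can be illustrated with a small sketch. Several details are assumptions, because the abstract does not give them: the short texts are merged per class, the IDF uses a smoothed form `log((1 + N) / (1 + df)) + 1`, dimension reduction is a simple top-N cut, and the compensation rule (topping up each class with its own best words) is a hypothetical stand-in for the patent's mechanism. All function names and sample data are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Sample = Tuple[List[str], str]  # (tokens, label)


def merge_by_class(samples: List[Sample]) -> Dict[str, List[str]]:
    """Concatenate all short texts of a class into one long token list."""
    merged: Dict[str, List[str]] = {}
    for tokens, label in samples:
        merged.setdefault(label, []).extend(tokens)
    return merged


def class_tfidf(merged: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """TF-IDF per class, treating each merged long text as one document."""
    n_docs = len(merged)
    df = Counter()
    for tokens in merged.values():
        df.update(set(tokens))
    scores = {}
    for label, tokens in merged.items():
        tf = Counter(tokens)
        scores[label] = {
            # Smoothed IDF (an assumption, not taken from the source).
            w: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, c in tf.items()
        }
    return scores


def build_feature_list(
    scores: Dict[str, Dict[str, float]], global_top_n: int, min_per_class: int
) -> List[str]:
    """Keep the globally strongest words, then compensate weaker classes."""
    best: Dict[str, float] = {}
    for weights in scores.values():
        for w, s in weights.items():
            best[w] = max(best.get(w, 0.0), s)
    features = sorted(best, key=lambda w: (-best[w], w))[:global_top_n]
    # Hypothetical compensation rule: make sure every class contributes
    # at least its own top `min_per_class` words to the list.
    for label in sorted(scores):
        weights = scores[label]
        own = sorted(weights, key=lambda w: (-weights[w], w))[:min_per_class]
        features.extend(w for w in own if w not in features)
    return features


samples = [
    (["tax", "refund", "tax"], "finance"),
    (["tax", "invoice"], "finance"),
    (["road", "repair"], "transport"),
    (["bus", "road", "tax"], "transport"),
]
merged = merge_by_class(samples)
scores = class_tfidf(merged)
features = build_feature_list(scores, global_top_n=2, min_per_class=2)
feature_dict = {word: index for index, word in enumerate(features)}
```

Merging gives each class a document long enough for term frequencies to be meaningful, which a single short text rarely provides. The resulting `feature_dict` maps each feature word to a vector position and could feed the SVM classifier mentioned in step A.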

Description

Technical field

[0001] The invention relates to the field of data processing, and in particular to a short text classification method based on TF-IDF feature extraction.

Background

[0002] With the rise of social media, short texts such as mobile SMS messages, tweets, and Weibo posts have proliferated. Because of the large number of participants and the high posting frequency, the volume of short text has grown rapidly. Short text also plays an important role in fields such as search engines, automatic question answering, and topic tracking. Moreover, as e-government construction is implemented and deepened, government departments also face the problem of processing large numbers of short texts. However, because a short text contains little content and its features are not distinctive, achieving simple and effective classification of large volumes of short text data is of great significance.

[0003...


Application Information

IPC(8): G06F17/30
CPC: G06F16/355, G06F16/36
Inventor 纪晓阳孔祥明林成创蔡斯凯蔡禹贾义动
Owner 广东广业开元科技有限公司