IG TF-IDF text feature vector generation and text classification methods

A text feature vector and text classification technology, applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc., with the effect of a more reasonable weight design.

Active Publication Date: 2019-01-25
NORTHEASTERN UNIV

AI Technical Summary

Problems solved by technology

However, the term weights calculated by supervised weighting methods such as TFATF depend on the category of each specific text, whereas the categories of the new texts to be classified are unknown. Either use the TFATF algorithm to calculate the weights for all ...



Examples


Embodiment Construction

[0061] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and implementation examples. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0062] The present invention proposes an IG TF-IDF text feature vector generation and text classification method which, as shown in Figure 1, includes the following steps:

[0063] Step 1: Generate text feature vectors:

[0064] Input a text set. Each text set includes several texts, and the texts are grouped into several data sets according to their text categories. Based on the IG TF-IDF method, the following steps 1.1 to 1.4 are performed in order to generate the feature vector of each text. IG TF-IDF stands for information gain term frequency-inverse document frequency, that is, Information Gain,...
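
The idea named in paragraph [0064] can be illustrated with a minimal Python sketch. It is not the patented procedure: the text above is cut off before steps 1.1 to 1.4 are given, so the weighting used here (TF × IDF × IG, with IG computed on the presence or absence of a term in the labeled training texts) is an assumption, and the names `entropy`, `information_gain`, `ig_tfidf_vectors` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs, labels):
    """IG of the binary feature 'term occurs in the text' w.r.t. the labels."""
    n = len(docs)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    h_class = entropy(list(Counter(labels).values()))
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + ((n - n_with) / n) * entropy(list(without_term.values()))
    return h_class - h_cond


def ig_tfidf_vectors(docs, labels):
    """Return (vocab, vectors) with weight = TF * IDF * IG (assumed combination)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    ig = {t: information_gain(t, docs, labels) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        length = max(len(d), 1)  # guard against empty texts
        vectors.append([(tf[t] / length) * math.log(n / df[t]) * ig[t]
                        for t in vocab])
    return vocab, vectors


# Toy labeled text set: each text is a list of already-tokenized terms.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "bank", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

vocab, vectors = ig_tfidf_vectors(docs, labels)
for term in ("goal", "team"):
    print(term, round(information_gain(term, docs, labels), 4))
```

In this toy set "goal" occurs only in one category, so it receives a higher information gain than "team", which occurs in both. Because each term gets a single IG value for the whole training set rather than one value per category, the same weights can be applied to a text whose category is not yet known.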



Abstract

The invention particularly relates to an IG TF-IDF text feature vector generation and text classification method, and belongs to the field of text mining and machine learning. The method comprises the following steps: 1) generating text feature vectors; 2) training the classifier; 3) evaluating the classification performance; 4) classifying the target text set. The weights calculated by the invention more truly reflect the importance of different terms to text classification, so that terms with strong class discrimination ability are assigned larger weights, the weight calculation is more reasonable, and the accuracy of text classification is improved. Moreover, calculating the term weights does not require knowing the specific category of a text, which overcomes the shortcomings of supervised methods such as TFADF in multi-category text classification.
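
Steps 2) to 4) of the abstract can be sketched in Python as follows. The abstract does not say which classifier or evaluation metric is used, so the nearest-centroid classifier with cosine similarity and the accuracy metric below are stand-ins chosen for brevity, and every name (`train_centroids`, `classify`, `accuracy`) and every feature vector is hypothetical. The vectors are assumed to have been produced already by step 1).

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(vectors, labels):
    """Step 2 (assumed classifier): average the feature vectors of each category."""
    sums = {}
    counts = defaultdict(int)
    for v, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(v)
        sums[label] = [s + x for s, x in zip(sums[label], v)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(centroids, vector):
    """Step 4: assign the category whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


def accuracy(centroids, vectors, labels):
    """Step 3 (assumed metric): fraction of held-out texts classified correctly."""
    hits = sum(classify(centroids, v) == l for v, l in zip(vectors, labels))
    return hits / len(labels)


# Hypothetical feature vectors, e.g. term weights over a three-term vocabulary.
train_vectors = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 0.2, 1.0], [0.1, 0.0, 0.8]]
train_labels = ["sport", "sport", "finance", "finance"]
test_vectors = [[0.7, 0.2, 0.1], [0.0, 0.1, 0.9]]
test_labels = ["sport", "finance"]

centroids = train_centroids(train_vectors, train_labels)
print("accuracy:", accuracy(centroids, test_vectors, test_labels))
print("target text:", classify(centroids, [0.8, 0.1, 0.0]))
```

Any classifier that accepts weighted term vectors could take the place of the centroid model; the point of the sketch is only the order of the steps: train on labeled vectors, evaluate on held-out labeled vectors, then classify the unlabeled target set.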

Description

Technical field

[0001] The invention belongs to the field of text mining and machine learning, and in particular relates to an IG TF-IDF text feature vector generation and text classification method.

Background art

[0002] With the advent of the Internet era, texts are increasingly presented in electronic form, resulting in a sharp increase in the number of electronic documents. How to effectively organize and mine massive text data has therefore become more and more important. Automatic classification is one of the most widely used technical means. Classification is the division of texts into predefined categories, and it is a research hotspot in the fields of information retrieval and data mining. In general, texts with category labels are used as training data, a classifier is obtained through a machine learning algorithm, and the classifier then judges the category of a text according to its content. Before classifying the text, it needs to be expressed in a form that can be ...


Application Information

IPC(8): G06F16/35, G06F17/27
CPC: G06F40/279
Inventor: 朱志良, 梁洁, 李德洋, 刘国奇, 于海
Owner: NORTHEASTERN UNIV