Feature selection and weight calculation method of imbalance text set

A feature selection and weight calculation technology, applied in computing, unstructured text data retrieval, text database clustering/classification, etc.

Active Publication Date: 2014-06-25
GOONIE INT SOFTWARE BEIJING

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to propose a feature selection and



Examples


Embodiment Construction

[0043] Specific implementations of the present invention are described in further detail below with reference to the accompanying drawings and examples. As shown in Figure 1, the proposed method is carried out through the following steps in sequence:

[0044] Step 1: Perform text preprocessing on the unbalanced text set to extract words containing semantic information.

[0045] Step 1.1: Use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file collection.

[0046] The experimental word segmentation process uses the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).

[0047] Step 1.2: After word segmentation, filter out stop words such as modal particles, prepositions, and adverbs.

[0048] A large number of stop words in a text adds noise that obscures its useful information. After...
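Steps 1.1 and 1.2 can be sketched as a filter over tokens that have already been segmented and tagged. This is a minimal illustration only: it does not call ICTCLAS, the input is assumed to be a list of (word, tag) pairs such as a lexical analyzer might produce, and the tag prefixes and the stop-word list (`STOP_TAG_PREFIXES`, `STOP_WORDS`) are hypothetical values chosen for the example.

```python
from typing import Iterable, List, Tuple

# Assumed PKU-style tag prefixes: y = modal particle, p = preposition, d = adverb.
STOP_TAG_PREFIXES = ("y", "p", "d")
# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"的", "了", "是"}


def filter_tokens(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep words that are neither listed stop words nor tagged with a stop category."""
    kept = []
    for word, tag in tagged:
        if word in STOP_WORDS:
            continue  # explicit stop word
        if tag.startswith(STOP_TAG_PREFIXES):
            continue  # part of speech treated as a stop category
        kept.append(word)
    return kept


# Hand-tagged sample standing in for the output of a segmenter and tagger.
tagged_sentence = [
    ("文本", "n"), ("分类", "vn"), ("的", "ude1"),
    ("效果", "n"), ("很", "d"), ("好", "a"), ("吗", "y"),
]
print(filter_tokens(tagged_sentence))
```

Filtering by both a word list and part-of-speech tags is one possible design; the source only states that stop words such as modal particles, prepositions, and adverbs are removed.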


Abstract

The invention provides a feature selection and weight calculation method for an imbalanced text set and belongs to the field of text information processing. To address the classification of imbalanced text data, a feature selection and weight calculation method and system are provided. The chi-squared statistic is improved by combining it with the category discrimination degree and an average word frequency factor, and the improved statistic is used for feature selection. A commonly used feature weight calculation method is also improved, and a TF-IDF weight calculation method is provided on the basis of that improvement. On imbalanced data sets the method outperforms traditional feature selection methods and is an effective, feasible way to improve classification accuracy.
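For context, the traditional chi-squared statistic that the abstract takes as its starting point can be sketched as follows. This shows only the standard baseline, not the improved statistic: the category discrimination degree and the average word frequency factor are not defined in this excerpt, so they are left out. The function names, the max-over-classes scoring, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-squared statistic for a term and a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def score_terms(docs: List[Tuple[Set[str], str]]) -> Dict[str, float]:
    """Score each term by its largest chi-squared value over all classes."""
    class_sizes = Counter(label for _, label in docs)
    term_class = Counter()  # (term, label) -> number of documents
    term_total = Counter()  # term -> number of documents
    for terms, label in docs:
        for t in terms:
            term_class[(t, label)] += 1
            term_total[t] += 1
    n = len(docs)
    scores = {}
    for t, df in term_total.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = term_class[(t, label)]
            b = df - a
            c = size - a
            d = n - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[t] = best
    return scores


# Toy imbalanced corpus (3 documents versus 1), invented for this example.
docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"match", "team"}, "sports"),
    ({"stock", "market"}, "finance"),
]
scores = score_terms(docs)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_terms)
```

Feature selection then keeps the highest-scoring terms; how many to keep is a tuning choice that the excerpt does not specify.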

Description

Technical field

[0001] The invention belongs to the field of text information processing, and in particular relates to a feature selection and weight calculation method for an unbalanced text set.

Background

[0002] With the rapid development of information technology and the spread of the Internet, text information resources have expanded rapidly. These resources enrich people's knowledge and provide convenience, but they also contain a large amount of junk information. As one of the main information retrieval technologies, text classification has high application value in improving the performance of information retrieval and filtering systems.

[0003] Text sources usually include not only web pages and emails but also text messages, Weibo, and forum posts. In text classification, if the text is expressed in vector form, the training set may contain tens of thousands of features. ...
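To make the vector form mentioned in [0003] concrete, the sketch below builds one vector per document with the classic TF-IDF weight, tf × log(N / df). It is the commonly used baseline weighting, not the improved weighting that the invention proposes (which this excerpt does not define). The function name `tfidf_vectors` and the three sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return the vocabulary and one classic TF-IDF vector per document."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


# Tiny invented corpus of already-tokenized documents.
sample_docs = [
    ["text", "classification", "text"],
    ["text", "retrieval"],
    ["spam", "filtering"],
]
vocab, vectors = tfidf_vectors(sample_docs)
print(len(vocab), "dimensions per document")
```

Each vector has one dimension per distinct term, which is why a realistic training set reaches tens of thousands of features and why feature selection is applied before classification.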


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F40/205
Inventor: 刘磊
Owner: GOONIE INT SOFTWARE BEIJING