
Term selection method for filtering harmful text information

A feature selection method in the field of text information technology, applicable to special data processing applications, instruments, and electrical digital data processing. It addresses the problem that inflated χ² values make weakly representative feature items hard to screen out, so that they are retained by mistake, and it thereby improves the filtering of harmful text information.

Status: Inactive. Publication date: 2018-08-07
CHANGAN UNIV
Cites: 8. Cited by: 0.

AI Technical Summary

Problems solved by technology

Because BC can be much larger than AD during the calculation, the χ² statistic comes out high. Such a term is then not easily filtered out during screening, so feature items that are not strongly representative are retained by mistake.
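The effect can be seen with the classic χ² statistic. The sketch below assumes the usual contingency-table convention, which this excerpt does not spell out: A and B count documents containing the term inside and outside the category, and C and D count documents lacking the term inside and outside the category. The function name and the counts are illustrative only.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Classic chi-square statistic for one term and one category.

    ASSUMED convention (not stated in the excerpt):
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the statistic is undefined, treat as 0
    # (a*d - b*c) is squared, so the sign of the correlation is lost.
    return n * (a * d - b * c) ** 2 / denom


# Term concentrated inside the category: AD much larger than BC.
positive = chi_square(a=40, b=10, c=10, d=40)

# Term concentrated outside the category: BC much larger than AD.
negative = chi_square(a=10, b=40, c=40, d=10)

print(positive, negative)
```

Both calls return the same score, because the difference AD - BC is squared. A term that mostly appears outside the category therefore ranks as high as one that genuinely represents it.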




Embodiment Construction

[0047] The invention provides a feature selection method for filtering harmful text information. It is applied in the filtering process to extract the feature items of harmful categories when those categories are classified. The method builds on the traditional χ² statistical feature selection method and uses the category term weight (CTW) value as the basis for feature selection. In addition to the traditional χ² statistic, the CTW calculation includes three further factors: the improved inverse document frequency (IDF) value, the inverse category frequency (ICF) value, and the inverse harmful document frequency (IHDF) value. After the CTW values are calculated, the feature items are sorted by CTW value in descending order, and the optimal number of feature items is then selected to form a new feature item set. At this ...
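This excerpt does not give the improved IDF, the ICF, the IHDF, or the way they are combined with χ². To show what such factors measure, the sketch below uses the textbook (unimproved) logarithmic forms of IDF and ICF and assumes a plain product with χ². IHDF is left out because the excerpt does not define it. The names `idf`, `icf` and `weight_sketch`, and all counts, are hypothetical and are not the patent's formulas.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    # Textbook inverse document frequency, NOT the patent's improved form.
    return math.log(total_docs / docs_with_term)


def icf(total_categories: int, categories_with_term: int) -> float:
    # Textbook inverse category frequency: 0 when the term occurs in every category.
    return math.log(total_categories / categories_with_term)


def weight_sketch(chi2: float, idf_value: float, icf_value: float) -> float:
    # ASSUMPTION: the factors are multiplied; the excerpt does not say how they combine.
    return chi2 * idf_value * icf_value


# Two hypothetical terms with the same chi-square score of 36.0.
focused = weight_sketch(36.0, idf(1000, 10), icf(4, 1))    # rare, in one category
diffuse = weight_sketch(36.0, idf(1000, 500), icf(4, 4))   # common, in all categories

print(focused, diffuse)
```

Under these assumptions the term spread across every category receives a weight of zero, while the term confined to one category keeps a large weight, even though χ² alone cannot tell them apart.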



Abstract

The invention discloses a term selection method for filtering harmful text information. The method comprises the following steps. First, all term items are extracted from a category corpus to construct an initial term item set. Next, for any category Ci among the harmful categories that contains a term item tj, the χ² statistic χ²(tj, Ci), the modified inverse document frequency (IDF), the inverse category frequency (ICF) and the inverse harmful document frequency (IHDF) are calculated to obtain category term weight (CTW) values, and the CTW values are used as the basis of term selection to screen the term items. Finally, the term items of the initial term item set that were screened in step S2 are sorted by CTW value in descending order, and the top a term items are chosen to form the final term item set. The method addresses the failure of χ² statistic term selection to consider how term items are distributed within and between categories, and it also addresses the skew of the data set across categories, thereby improving the filtering of harmful text information.
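The first and last steps of the abstract can be sketched as follows. The scoring step is not implemented here: the CTW values are made-up placeholders, the corpus is assumed to be already tokenized, and `build_initial_terms` and `select_top_terms` are hypothetical helper names.

```python
from __future__ import annotations


def build_initial_terms(corpus: dict[str, list[list[str]]]) -> set[str]:
    """Collect every distinct term from tokenized documents grouped by category."""
    return {term for docs in corpus.values() for doc in docs for term in doc}


def select_top_terms(ctw: dict[str, float], a: int) -> list[str]:
    """Sort terms by CTW value in descending order and keep the first a."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(ctw, key=lambda term: (-ctw[term], term))
    return ranked[:a]


# Toy corpus: category name -> list of tokenized documents.
corpus = {
    "harmful": [["x", "y"], ["y", "z"]],
    "normal": [["z", "w"]],
}
terms = build_initial_terms(corpus)

# Placeholder CTW values; a real run would compute them in the screening step.
ctw_values = {"x": 3.0, "y": 5.0, "z": 1.0, "w": 0.5}
selected = select_top_terms(ctw_values, a=2)

print(sorted(terms), selected)
```

The value of a controls the dimensionality of the final term item set, so in practice it would be tuned against filtering performance rather than fixed in advance.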

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, in particular text content filtering, and specifically relates to a feature selection method for filtering harmful text information.

Background

[0002] In filtering harmful text information, the "curse of dimensionality" is a major problem that must be solved. Text processed by Chinese word segmentation yields a huge number of feature items. Because the corpus is large, the training text set can reach tens of thousands to hundreds of thousands of dimensions. Such dimensionality places a heavy computational load on the computer, and the resulting increase in computing time directly degrades the filtering of harmful text information. A feature item set of this size also inevitably contains information noise, that is, content with a negative effect on classification. F...


Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F16/353, G06F40/284
Inventor: 闫茂德, 赵文, 柯伟, 陈宇, 李超飞, 田野, 林海
Owner CHANGAN UNIV