
Feature dimension reduction method for automatic classification of Chinese text

A technology for automatic classification and feature dimensionality reduction, applied to instruments, character and pattern recognition, computer components, etc. It addresses problems such as the obstacle posed by high dimensionality and the difficulty of the dimensionality reduction task.

Inactive Publication Date: 2006-04-19
TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

The problem now is how to organize and manage this massive amount of information effectively so that users can access the information they want conveniently and efficiently.
Therefore, in the vector space model (VSM), high dimensionality is a major obstacle.
For a given text set, however, the feature set of binary (two-character) strings (millions of features) is much larger than the feature set of words (hundreds of thousands of features), so in Chinese text classification that uses binary strings as features, dimensionality reduction is both more important and more arduous.



Examples


Embodiment Construction

[0092] A feature dimensionality reduction method for automatic classification of Chinese text, including the following steps:

[0093] In the learning phase, the following steps are involved:

[0094] (1). Determine the feature selection method (statistic), the feature vector weight calculation method, and the values of the related parameters;

[0095] (2). Preprocess the learning text set;

[0096] (3). Index the learning text set by one-character, two-character, and three-character strings (indexing) to obtain the original feature sets of unary strings, binary strings, and ternary strings, respectively. From the original feature set of binary strings, generate the feature frequency vector of each learning text, as shown in formula 1.

[0097] $d = (\mathrm{tf}(T_{1d}), \mathrm{tf}(T_{2d}), \ldots, \mathrm{tf}(T_{nd}))$ (1)

[0098] d is any learning text; n is the total number of features contained in the binary string ori...
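
The indexing step and formula 1 can be illustrated with a minimal Python sketch. It assumes that the texts are already preprocessed and that indexing means taking every overlapping run of n characters; the function names `char_ngrams`, `index_corpus`, and `tf_vector` and the two sample texts are hypothetical and do not come from the patent.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_corpus(texts, n):
    """Build an original feature set: the distinct n-character strings
    found anywhere in the learning texts, in a fixed (sorted) order."""
    features = set()
    for text in texts:
        features.update(char_ngrams(text, n))
    return sorted(features)


def tf_vector(text, features, n):
    """Feature frequency vector of one text: the count of each feature."""
    counts = Counter(char_ngrams(text, n))
    return [counts[f] for f in features]


# Hypothetical two-document learning set, assumed already preprocessed.
texts = ["中文文本分类", "文本分类方法"]
bigram_features = index_corpus(texts, 2)
vec = tf_vector(texts[0], bigram_features, 2)
```

Calling `index_corpus(texts, 1)` and `index_corpus(texts, 3)` would give the unary and ternary feature sets in the same way. Sorting the feature set is only a convenience so that every text's vector uses the same component order.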



Abstract

In the present invention, a feature selection method is first applied to reduce the dimensionality of the original feature set, yielding an intermediate feature set. The intermediate feature set is then analyzed to find "highly overlapping binary strings" and "highly deviated binary strings". The highly overlapping binary strings are merged into their corresponding ternary strings and the highly deviated binary strings are deleted, yielding the learning feature set used for machine learning; finally, a classifier is obtained for use in the classification stage. The invention makes full use of the characteristics of the language and greatly reduces dimensionality on the basis of the intermediate feature set, while ensuring that the selected features retain strong classification and descriptive capacity. This is superior to feature selection based on statistics alone.
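
The merging step might look like the following Python sketch. The text available here does not define when a binary string counts as "high superposed" (highly overlapping), so the criterion below is purely an assumption: a ternary string absorbs its two component binary strings when it accounts for at least a `threshold` share of the occurrences of each. The function name, the threshold value, and the sample frequencies are all hypothetical, and the deletion of highly deviated binary strings is not sketched because no criterion for it is given.

```python
def merge_overlapping_bigrams(bigram_tf, trigram_tf, threshold=0.9):
    """Replace pairs of binary strings with the ternary string covering them.

    bigram_tf, trigram_tf: dicts mapping a string to its corpus frequency.
    ASSUMPTION (not from the patent): a ternary string "covers" a binary
    string when tf(ternary) / tf(binary) >= threshold; the merge happens
    only when both component binary strings are covered.
    """
    features = dict(bigram_tf)
    for tri, tri_count in trigram_tf.items():
        parts = (tri[:2], tri[1:])  # the two binary strings inside tri
        covered = all(
            bigram_tf.get(b, 0) > 0 and tri_count / bigram_tf[b] >= threshold
            for b in parts
        )
        if covered:
            for b in parts:
                features.pop(b, None)  # drop the merged binary strings
            features[tri] = tri_count  # keep the ternary string instead
    return features


# Hypothetical corpus frequencies for illustration only.
bigram_tf = {"清华": 10, "华大": 10, "大学": 40}
trigram_tf = {"清华大": 10, "华大学": 10}
merged = merge_overlapping_bigrams(bigram_tf, trigram_tf)
```

With these made-up counts, "清华" and "华大" occur only inside "清华大" and are merged into it, while "大学" occurs mostly elsewhere and stays a feature of its own, so three features shrink to two.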

Description

Technical field

[0001] The feature dimension reduction method for automatic classification of Chinese text belongs to the technical field of automatic classification of Chinese text, and in particular to automatic classification of Chinese text that uses various Chinese character strings as features.

Background technique

[0002] The development of computer networks and electronic technology has completely changed the way people work, live, and obtain information. The vast majority of human information has been placed online. The problem now is how to organize and manage this massive amount of information effectively so that users can access the information they want conveniently and efficiently. Text Classification (TC) technology provides an effective way to solve these problems. It uses the computer as a tool and applies machine learning technology to enable the computer to automatically classify natural language electronic texts according to a ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06K9/80
Inventors: 孙茂松, 薛德军
Owner: TSINGHUA UNIV