Statistical text classification system and method based on the term frequency-inverse document frequency (TF*IDF) algorithm

A statistical text classification technology, applied in the field of computer science. It addresses the low accuracy of TF*IDF-based text classifiers, which results from ignoring the semantic similarity between words, and thereby improves classification accuracy.

Inactive Publication Date: 2012-08-01
INST OF ACOUSTICS CHINESE ACAD OF SCI
1 Cites, 36 Cited by

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to overcome a shortcoming of the current TF*IDF algorithm: it does not consider the semantic similarity between words when calculating the weights of feature items, which lowers the accuracy of text classifiers based on it. To this end, the invention provides a statistical text classification system and method based on the TF*IDF algorithm.

Embodiment Construction

[0040] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0041] The invention provides a new method for calculating vector weights so that knowledge from knowledge engineering can be used effectively in statistical text classification. The method introduces a CIV variable into the TF*IDF method, yielding the improved TF*IDF*CIV (Term Frequency, Inverse Document Frequency, Concept Information Value) method. Experiments show that this method improves text classification evaluation metrics such as accuracy, recall, and the F1 measure.
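The idea of multiplying a third factor into the TF*IDF weight can be sketched as follows. This is a minimal illustration, not the patented formula: the excerpt does not show how CIV is computed, so the `civ` dictionary is a hypothetical input with made-up values, and the IDF variant `log(N / df)` is one common textbook choice assumed here.

```python
import math
from collections import Counter

def tf_idf_civ_weights(doc_tokens, corpus, civ):
    """Weight each term of one document by TF * IDF * CIV.

    doc_tokens: list of tokens of the document being weighted.
    corpus: list of token lists (the whole document collection).
    civ: dict term -> concept information value. Hypothetical input;
         the excerpt does not give the CIV formula. Missing terms
         default to 1.0, which reduces to plain TF*IDF.
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    weights = {}
    for term, count in counts.items():
        tf = count / total
        # Document frequency: number of documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Assumed IDF variant: log(N / df).
        idf = math.log(n_docs / df) if df else 0.0
        weights[term] = tf * idf * civ.get(term, 1.0)
    return weights

corpus = [
    ["stock", "market", "bank"],
    ["bank", "loan", "interest"],
    ["football", "match", "goal"],
]
civ = {"stock": 2.0, "market": 1.5}  # made-up values for illustration only
weights = tf_idf_civ_weights(corpus[0], corpus, civ)
```

With these illustrative values, "stock" ends up weighted above "market", and both above "bank", which appears in two documents and has no CIV boost.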

[0042] Figure 1 is a flow chart of the steps of the statistical text classification method based on the TF*IDF algorithm. The specific steps are as follows:

[0043] Step 1: a corpus is collected from the Internet; part of it is used as the training corpus, and the other part is used as the test corpu...
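A corpus split of the kind Step 1 describes might look like the sketch below. The 80/20 ratio, the fixed seed, and the function name `split_corpus` are assumptions for illustration; the excerpt does not state how the corpus was divided.

```python
import random

def split_corpus(documents, train_fraction=0.8, seed=0):
    """Shuffle documents reproducibly and split them into train and test parts."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # local RNG, global random state untouched
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

all_docs = [f"doc{i}" for i in range(10)]
train, test = split_corpus(all_docs)
```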

Abstract

The invention relates to a statistical text classification method based on the term frequency-inverse document frequency (TF*IDF) algorithm. It proposes a novel feature vector weighting method, TF*IDF*CIV, which introduces a variable, the concept information value (CIV), into the TF*IDF algorithm, so that the concept information value of a feature item is taken into account when its weight is calculated. In the formula of the algorithm, the shared concept number sim(ci, C) is the number of concepts in the concept set ci corresponding to a feature item ti that match, that is, are equal to, concepts in the category concept set C. The TF*IDF algorithm is widely used for calculating feature vector weights, but it cannot express the relevance between feature items and ignores the influence of that relevance on the weights; the new method overcomes this shortcoming. Experimental results show that the new method effectively improves the accuracy of the whole text classification system.
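Read literally, the shared concept number sim(ci, C) is the size of the overlap between two concept sets. A minimal sketch under that reading is shown below; the concept labels are invented for illustration, and how sim(ci, C) enters the full CIV formula is not shown in this excerpt.

```python
def shared_concept_count(term_concepts, category_concepts):
    """Return sim(ci, C): how many concepts of ci also occur in C."""
    return len(set(term_concepts) & set(category_concepts))

# Invented example data: concept set C of a category, and the
# concept set ci of one feature item ti.
category_C = {"finance", "economy", "trade", "currency"}
concepts_for_term = {"finance", "currency", "river"}

sim = shared_concept_count(concepts_for_term, category_C)
```

Here two of the three concepts of the feature item also belong to the category, so `sim` is 2.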

Description

Technical field

[0001] The invention relates to the field of computer science and technology, in particular to a new method and device for calculating feature vector weights for text classification.

Background technique

[0002] With the rapid development and popularization of Internet and computer technology, a large amount of text information now exists in computer-readable form, and automatic text classification technology has emerged in response. At present, text classification technology is widely used in research fields such as document indexing, harmful information detection, topic recognition, automatic summarization, and intelligent information retrieval.

[0003] Automatic classification research began in the late 1950s, when H.P. Luhn conducted pioneering research in this field. In 1961, Maron published the first paper on automatic classification, and then many famous information scientists such as Sparck and Salton con...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 缪建明, 丁泽亚, 张全
Owner: INST OF ACOUSTICS CHINESE ACAD OF SCI