New method of characteristic vector weighting for text classification and its device

A feature vector weighting technique for text classification, applied in the field of computer science, that addresses the problem of low classifier accuracy.

Status: Inactive
Publication date: 2006-01-11
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI
Cites: 0; cited by: 45

AI Technical Summary

Problems solved by technology

However, from the reported experimental results, the accuracy of the classifier using this method is not high, and the best F1 measure is 85%.
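For reference, the F1 measure quoted above is conventionally the harmonic mean of precision and recall. The snippet below is a minimal sketch of that standard definition; the function name `f1_measure` is hypothetical, and the text here does not say whether the reported figure is averaged per category or over all test texts.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        # Both values are zero, so F1 is defined here as zero to avoid dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

When precision and recall are equal, the F1 measure equals that shared value; when they differ, it is pulled toward the smaller of the two.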

Embodiment Construction

[0068] As shown in Figure 1, the method of feature vector weighting for text classification comprises the following specific steps:

[0069] Step S1: collection of the training corpus and the test corpus. First, the training corpus is downloaded from the Internet according to six fields (consumer information, entertainment and games, finance and economics, news, personal communication, and sports). "Garbage" is removed from the web page text, followed by word segmentation and part-of-speech tagging, which yields a training corpus of 30.87 million words in total. Second, the test corpus is downloaded from the Internet according to the same principle and sorted, which yields 1119 test texts in total. Word segmentation is performed after the corpus is collected.
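The "garbage" removal in step S1 is not specified further. As a rough sketch, assuming it means stripping markup, scripts, and style sheets from downloaded pages, the standard-library HTML parser is enough to recover the visible text. The names `TextExtractor` and `strip_markup` are hypothetical, and word segmentation and part-of-speech tagging are not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_markup(html):
    """Return the visible text of an HTML page (assumed meaning of 'garbage' removal)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real pipeline would likely also drop navigation bars, advertisements, and other page furniture, which this sketch does not attempt.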

[0070] Step S2,

[0071] 1) Compile the total vocabulary of each category and remove the words whose frequency is below 0.0001%, because words that occur too infrequently in a category are of little importance to that category.

[0072]...
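The vocabulary pruning in item 1) of step S2 can be sketched as follows. This is a minimal illustration, assuming that "frequency" means a word's share of all tokens in its category; the function name `build_category_vocabulary` and the constant `MIN_RELATIVE_FREQUENCY` are hypothetical, and the remaining items of step S2 are not shown in the text above.

```python
from collections import Counter

# 0.0001% expressed as a fraction of the category's total token count (assumed reading).
MIN_RELATIVE_FREQUENCY = 0.0001 / 100


def build_category_vocabulary(tokens, min_rel_freq=MIN_RELATIVE_FREQUENCY):
    """Count the words of one category and drop those below the frequency threshold."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Keep words whose relative frequency is at or above the threshold.
    return {word: n for word, n in counts.items() if n / total >= min_rel_freq}
```

With the default threshold, a word must account for at least one token in a million to stay in a category's vocabulary, so the filter only has an effect on categories with more than a million tokens.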

Abstract

The present invention belongs to the field of computer science and technology and relates in particular to a new feature vector weighting method for text classification. The invention provides a new weighting method (TF*IWF*DBV), characterized in that the n-th root of DBV and TF is introduced into the TF*IWF method. Tests show that adopting the new method raises the F1 measure by 11.8%, which demonstrates its effectiveness.
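The abstract does not define DBV or state exactly where the n-th root applies, so the following is only one possible reading, offered as a hedged sketch rather than the patented formula. It assumes that IWF is the logarithm of the total token count divided by the word's count, that the root is applied to TF alone, and that DBV is a precomputed factor supplied by the caller. The function names and the default `n = 2` are hypothetical.

```python
import math


def iwf(word_count, total_count):
    """Inverse word frequency (assumed form): log(total tokens / tokens of this word)."""
    return math.log(total_count / word_count)


def weight(tf, word_count, total_count, dbv, n=2.0):
    """One reading of TF*IWF*DBV: n-th root of TF, times IWF, times DBV.

    dbv is taken as given because its definition is not stated in this text.
    """
    return tf ** (1.0 / n) * iwf(word_count, total_count) * dbv
```

Taking a root of TF dampens the influence of very frequent words in a document, which is a common motivation for sublinear term-frequency scaling; whether that is the inventors' stated rationale is not shown here.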

Description

Technical field

[0001] The invention relates to the field of computer science and technology, in particular to a new method and device for feature vector weighting oriented to text classification.

Background

[0002] With the continuous development of science and technology, especially information technology, communication between people has moved from simple face-to-face exchange toward ever greater use of text as the information carrier. The most obvious examples are digital libraries and web texts. Effective management of these language resources clearly makes it much easier for users to obtain information. However, with the development of network communication, the amount of text available on the Internet has expanded rapidly and can even be said to have grown exponentially. If we manually classify these texts as before, it will not only take time and effort, but also the accuracy canno...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/21
Inventors: Zong Chengqing (宗成庆), Chen Keli (陈克利)
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI