Text categorization feature selection and weight computation method based on field knowledge

A feature selection and weight calculation technique for text classification, applied in the field of artificial intelligence. It addresses the problems that statistically selected features can reduce classification accuracy and that some features receive low weights, and so improves text classification performance.

Inactive Publication Date: 2008-10-22
KUNMING UNIV OF SCI & TECH
0 Cites, 87 Cited by

AI Technical Summary

Problems solved by technology

These feature selection methods may select statistical features that contribute little to classification, which reduces classification accuracy. In domain text classification, moreover, domain terms often appear in the text, and these domain terms have great influence on the d...


Examples

Embodiment Construction

[0029] The proposed method was verified experimentally in the Yunnan tourism domain. The specific steps, shown in Figure 1, are as follows:

[0030] Step a1: The training corpus consists of 700 documents from the Yunnan tourism domain as domain training texts, and 700 documents from the Fudan University corpus (70 documents each in environment, computer, transportation, education, economy, military, sports, medicine, art, and politics) as non-domain training texts. The test corpus consists of 200 documents from the Yunnan tourism domain as domain test texts, and 200 documents from the Fudan University corpus (20 documents each in environment, computer, transportation, education, economy, military, sports, medicine, art, and politics) as non-domain test texts.

[0031] Step a2: Text preprocessing, including word segmentation, stop-word removal, term frequency statistics, document frequency statistics, etc. First, the Chinese...
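
The preprocessing in Step a2 can be illustrated with a minimal sketch. It is not the patent's implementation: whitespace splitting stands in for Chinese word segmentation, and the function name `preprocess`, the sample documents, and the stop-word list are all hypothetical.

```python
from collections import Counter


def preprocess(docs, stop_words):
    """Return per-document term frequencies and corpus-wide document frequencies.

    Assumption: whitespace splitting is a stand-in for a real Chinese
    word segmenter, which the source text does not name.
    """
    term_freqs = []          # one Counter of term counts per document
    doc_freq = Counter()     # number of documents containing each term
    for text in docs:
        tokens = [t for t in text.lower().split() if t not in stop_words]
        tf = Counter(tokens)
        term_freqs.append(tf)
        doc_freq.update(tf.keys())   # count each term once per document
    return term_freqs, doc_freq


# Hypothetical sample data, for illustration only.
docs = [
    "the lake in dali is a scenic spot",
    "the stock market fell",
    "lijiang is a scenic old town",
]
stop_words = {"the", "in", "is", "a"}
term_freqs, doc_freq = preprocess(docs, stop_words)
```

Here `term_freqs[i]` holds the word counts of document `i`, and `doc_freq[t]` is the number of documents in which term `t` occurs; both are inputs to the usual statistical feature selection and TFIDF weighting steps.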

Abstract

The invention relates to the technical field of artificial intelligence, in particular to a text classification feature selection and weight calculation method based on domain knowledge. The method combines sample statistics with domain terms to construct a domain classification feature space, uses the internal knowledge relations of the domain to calculate the similarity between terms, and then adjusts the corresponding feature weights of the classification feature vectors. It uses a support vector machine learning algorithm to build a domain text classification model and so classify domain text. Text classification experiments on the Yunnan tourism domain and non-tourism domains show that the method improves classification accuracy by 4 percent compared with the improved TFIDF feature weighting method.
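
The idea of adjusting feature weights by term similarity can be sketched as below. The excerpt does not give the patent's actual formula, so the boosting rule, the `alpha` parameter, the similarity table, and all names here are assumptions chosen for illustration only.

```python
def make_similarity(table):
    """Build a symmetric similarity lookup from a dict keyed by term pairs.

    Assumption: the similarity values would come from domain knowledge
    relations; here they are a hand-written, hypothetical table.
    """
    def similarity(a, b):
        return table.get(frozenset((a, b)), 0.0)
    return similarity


def adjust_weights(weights, domain_terms, similarity, alpha=0.5):
    """Boost the weight of each domain term found in a document.

    Hypothetical rule: multiply the base weight by
    1 + alpha * (highest similarity to another domain term in the document).
    Non-domain features keep their base weight.
    """
    adjusted = dict(weights)
    present = [t for t in weights if t in domain_terms]
    for t in present:
        best = max((similarity(t, u) for u in present if u != t), default=0.0)
        adjusted[t] = weights[t] * (1.0 + alpha * best)
    return adjusted


# Hypothetical base weights (for example TFIDF values) for one document.
weights = {"lake": 1.0, "scenic": 2.0, "market": 1.5}
domain_terms = {"lake", "scenic"}
sim = make_similarity({frozenset(("lake", "scenic")): 0.8})
adjusted = adjust_weights(weights, domain_terms, sim)
```

In this sketch the two related domain terms reinforce each other while the non-domain feature `market` is left unchanged. The adjusted vectors would then be passed to a support vector machine classifier, as the abstract describes.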

Description

Technical field

[0001] The invention relates to the technical field of artificial intelligence, in particular to a text classification feature selection and weight calculation method based on domain knowledge.

Background

[0002] Text classification is an active topic in current natural language processing research. Determining whether a text belongs to a specific domain is a key problem in current research on vertical search engines and question answering systems. In text classification, feature selection is usually the most important step, and it directly affects classification accuracy. Conventional feature selection methods mostly use evaluation functions such as document frequency (DF), information gain (IG), mutual information (MI), and the chi-square statistic (CHI) for feature extraction. These feature selection methods are all based on statistical algorithms. A large amount of c...
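
As an illustration of one of the conventional statistical evaluation functions named in the background, the following sketch computes the standard chi-square (CHI) score of a term for a category from a 2x2 contingency table. The function name and the example counts are hypothetical; only the corpus sizes (700 domain and 700 non-domain training documents) come from the embodiment.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term with respect to a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0   # degenerate table: the term or category is empty or universal
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: a term found in 350 of 700 domain documents
# and in 70 of 700 non-domain documents.
score_skewed = chi_square(350, 70, 350, 630)

# A term spread evenly across both classes carries no class information.
score_even = chi_square(100, 100, 600, 600)
```

A higher score means the term's occurrence depends more strongly on the category, so purely statistical selection keeps the highest-scoring terms regardless of whether they are domain terms.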

Application Information

IPC(8): G06F17/30, G06F17/27, G06N1/00, G06N99/00
Inventor 余正涛, 韩露, 向凤红, 万舟, 熊新
Owner KUNMING UNIV OF SCI & TECH