Webpage text classification method based on feature selection

A feature selection and text classification technology, applied in special data processing applications, instruments, and electronic digital data processing, that addresses the problems of slow classification speed and low accuracy and achieves higher accuracy, higher recall, and shorter execution time.

Inactive Publication Date: 2014-05-21
XIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a web page text classification method based on feature selection that solves the problems of slow classification speed and low accuracy in the prior art.



Embodiment Construction

[0036] The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0037] When calculating the weight of a feature word, the classification method of the present invention combines the position of the feature word with its inter-class and intra-class distribution. This avoids the deficiency that feature words contributing nothing to classification are given a large weight, and ultimately improves classification accuracy.

[0038] Relevant definitions in the present invention are as follows:

[0039] Definition 1 (Term Frequency). Term frequency (TF) is the number of times the feature word $t_k$ occurs in document $d_i$, denoted $tf_{ik}(d_i)$. Provided that stop words and individual high-frequency words are excluded, the more often the feature word $t_k$ appears in document $d_i$, the more it represen...
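As a rough illustration of Definition 1, the sketch below counts raw term frequency for one document after dropping stop words. The stop-word list, the `term_frequency` helper, and the sample sentence are assumptions made for illustration; the source does not specify a tokenizer or a stop-word list.

```python
from collections import Counter

# Hypothetical stop-word list; the source does not say which words are excluded.
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}

def term_frequency(tokens, stop_words=STOP_WORDS):
    """Count how often each feature word occurs in one document, skipping stop words."""
    return Counter(t.lower() for t in tokens if t.lower() not in stop_words)

# Toy document d_i, already split into tokens.
doc = "the price of the stock and the stock market".split()
tf = term_frequency(doc)
```

Here `tf["stock"]` plays the role of $tf_{ik}(d_i)$ for the feature word "stock".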


Abstract

Provided is a webpage text classification method based on feature selection. First, a data set formed from a large number of webpages is divided into a training set and a testing set. Second, different weights are assigned to tags according to how well the information in each webpage tag field expresses the page content, and the weight of each feature word in each training webpage is calculated as the product of a normalized word frequency and an inverse document frequency. Based on these weights, and by combining the intra-class distribution law with the inter-class deviation, the feature vectors of all webpages in the training set are calculated, and from them the feature vectors of all classes in the training set. Finally, the word frequency of the feature words in each webpage in the testing set is calculated, the similarity between the webpage to be classified and each class in the training set is computed, and the class with the maximum similarity is taken as the class of the webpage, which gives the classification result.
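The steps in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented method: the tag weights, the smoothed IDF variant, the plain per-class averaging (standing in for the intra-class distribution and inter-class deviation terms, whose formulas are not given in this text), and the use of cosine similarity are all assumptions, as are every function name and the toy data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tag weights; the source only says tags are weighted by expressiveness.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}

def weighted_tf(page):
    """page maps a tag name to its token list; returns tag-weighted TF scaled to [0, 1]."""
    counts = Counter()
    for tag, tokens in page.items():
        for tok in tokens:
            counts[tok] += TAG_WEIGHTS.get(tag, 1.0)
    top = max(counts.values(), default=1.0)
    return {t: c / top for t, c in counts.items()}

def idf_table(pages):
    """Smoothed inverse document frequency (an assumed variant): log(N / df) + 1."""
    n = len(pages)
    df = Counter(tok for p in pages for tok in {t for toks in p.values() for t in toks})
    return {t: math.log(n / d) + 1.0 for t, d in df.items()}

def class_vectors(pages, labels, idf):
    """Average the TF-IDF vectors of each class (a plain centroid, assumed here)."""
    sums, sizes = defaultdict(Counter), Counter(labels)
    for page, label in zip(pages, labels):
        for t, w in weighted_tf(page).items():
            sums[label][t] += w * idf.get(t, 0.0)
    return {c: {t: v / sizes[c] for t, v in vec.items()} for c, vec in sums.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(page, centroids, idf):
    """Assign the page to the class whose vector is most similar to the page vector."""
    vec = {t: w * idf.get(t, 0.0) for t, w in weighted_tf(page).items()}
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Toy training set: one page per class.
train = [
    {"title": ["football", "match"], "body": ["team", "goal", "match"]},
    {"title": ["stock", "market"], "body": ["price", "stock", "bank"]},
]
labels = ["sports", "finance"]
idf = idf_table(train)
centroids = class_vectors(train, labels, idf)
prediction = classify({"title": ["stock"], "body": ["price", "rises"]}, centroids, idf)
```

Words that never appear in the training set (such as "rises" above) receive zero weight in this sketch, so they do not affect the similarity.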

Description

Technical field

[0001] The invention belongs to the technical field of data mining methods and relates to a web page text classification method based on feature selection.

Background

[0002] With the rapid development of computer and communication technology and the rapid spread of the Internet, the number of web pages on the network is increasing exponentially. Faced with this explosive growth of network information, quickly and effectively obtaining useful and interesting information is becoming more and more important. Effectively organizing and managing webpage resources and shortening the time users need to obtain the information they require has therefore become an urgent problem. Webpage classification technology emerged in response, and has gradually become a research hotspot in the field of machine learning, following text classification.

[0003] Traditionally, the web page classification is first jud...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/353
Inventor: 周红芳, 郭杰, 王鹏, 张国荣, 段文聪, 王心怡, 何馨依
Owner XIAN UNIV OF TECH