Web page text classification algorithm research based on web page link analysis and support vector machine

A web page text classification technique that combines web page link analysis with a support vector machine. It addresses problems such as inconsistent classification results, slow classification speed, and reduced classification accuracy, and it achieves lower memory requirements, shorter classification time, and faster learning.

Status: Inactive
Publication Date: 2015-12-30
HUNAN UNIV

AI Technical Summary

Problems solved by technology

However, manual classification has many disadvantages. First, when the volume of web page text grows rapidly, classifying it by hand becomes impractical and consumes a great deal of human resources. Second, manual text classification cannot guarantee high accuracy: people differ in subjective factors such as experience and knowledge, so the classification results may be inconsistent.
There are still many problems in these web page text automatic classification technolog

Examples

Embodiment

[0033] The SVM topic classification method based on text similarity feedback according to the present invention is implemented as follows. The mmseg4j word segmentation system is used, and training and testing of the SVM model are implemented with the e1071 package for R. The kernel function is RBF (radial basis function). Web pages from Changsha Dianwei.com are classified: gourmet food is the top-level category, and Hunan cuisine, farm cuisine, home cooking, hot pot, Sichuan cuisine, Cantonese cuisine, snacks, seafood, and private kitchen are its 9 subcategories. 5,000 web pages are used as the training set, and 11,500 web page texts are used as the test set. Preprocessing consists mainly of segmenting each web page, removing noise information that is irrelevant to classification, and removing stop words. For example, the content of the webpage text is "This ...
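The two generic building blocks named in this embodiment, stop-word removal and the RBF kernel, can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation (which uses mmseg4j and the e1071 package for R). The stop-word list, the sample tokens, and the `gamma` value are all assumptions chosen for the example, and the input is assumed to be already segmented into words.

```python
import math

# Hypothetical stop-word list; the patent does not say which list it uses.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words from an already segmented page (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).

    gamma=0.5 is an assumed value; the source does not state the
    kernel parameters used in the experiments.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Assumed sample input: a page that has already been segmented into words.
tokens = remove_stop_words(["The", "hot", "pot", "is", "spicy"])

# Identical vectors give the maximum kernel value; distant ones decay toward 0.
k_same = rbf_kernel([1.0, 0.0], [1.0, 0.0])
k_far = rbf_kernel([1.0, 0.0], [0.0, 1.0])
```

The kernel value acts as a similarity score between two page feature vectors, which is what lets the SVM separate classes that are not linearly separable in the original feature space.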

Abstract

The invention discloses a web page text classification algorithm based on web page link analysis and a support vector machine, and relates to the technical field of web page classification. The method comprises the following steps: 1, a large number of web pages are divided into a training set and a test set; 2, the web pages (in both the training set and the test set) are preprocessed; 3, the word frequencies of the feature words in each web page in the training set are calculated; 4, the weights of the feature words in each web page in the training set are calculated; 5, the feature vectors of each class in the test set are calculated; 6, the text feature vectors of each web page in the training set are calculated; 7, the minimum similarity value is taken as the threshold; 8, the number of feature words is reduced as far as possible; 9, the text feature vectors of the web pages in the test set are classified; 10, the similarity between the classified web pages and the feature vectors is calculated and tested at the same time. By combining a vector space model with a support vector machine, the method offers short classification time, a high recall rate, low memory requirements, and fast learning.
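Steps 3 to 7 of the abstract (word frequencies, feature-word weights, class feature vectors, similarity, and the minimum-similarity threshold) can be sketched as below. This is a rough illustration under stated assumptions, not the patented method: the abstract does not give the weighting formula or the similarity measure, so TF-IDF and cosine similarity are assumed here; the class vectors are computed as centroids of labeled training pages, which is also an assumption; and the sample documents and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_frequencies(tokens):
    """Step 3: relative frequency of each feature word in one page."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def tfidf_vectors(docs):
    """Step 4: weight each feature word by TF * IDF (assumed scheme)."""
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    idf = {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}
    return [
        {word: tf * idf[word] for word, tf in term_frequencies(tokens).items()}
        for tokens in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centroids(vectors, labels):
    """Step 5 (assumed form): mean weight vector of the pages in each class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(labels)
    for vec, label in zip(vectors, labels):
        for word, x in vec.items():
            sums[label][word] += x
    return {
        label: {word: s / sizes[label] for word, s in words.items()}
        for label, words in sums.items()
    }


def similarity_threshold(vectors, labels, centroids):
    """Step 7: smallest similarity between a page and its own class vector."""
    return min(cosine(v, centroids[l]) for v, l in zip(vectors, labels))


# Hypothetical, already segmented training pages for two subcategories.
train_docs = [
    ["hot", "pot", "spicy", "beef"],
    ["hot", "pot", "lamb", "broth"],
    ["seafood", "crab", "shrimp", "fresh"],
    ["seafood", "fish", "shrimp", "steamed"],
]
train_labels = ["hot pot", "hot pot", "seafood", "seafood"]

vectors = tfidf_vectors(train_docs)
centroids = class_centroids(vectors, train_labels)
threshold = similarity_threshold(vectors, train_labels, centroids)
```

A page whose similarity to its predicted class vector falls below `threshold` could then be flagged for re-checking, which is one plausible reading of the similarity feedback in step 10.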

Description

Technical field:

[0001] The invention relates to a web page text classification algorithm based on web page link analysis and a support vector machine, and belongs to the technical field of web page classification.

Background technique:

[0002] With the rapid development of computer and communication technology, the Internet has been rapidly popularized, and the number of web pages on the network is growing geometrically. Facing this explosive growth of network information, obtaining useful and interesting information quickly and effectively is becoming more and more important. Effectively organizing and managing web resources, and shortening the time users need to obtain the required information, has therefore become an urgent problem. Web page classification technology emerged in response, and it has gradually become a research hotspot in the field of machine learning, following text classification.

[0003] The traditional cla...

Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 王冰, 陈浩
Owner HUNAN UNIV