Web page classification method based on training set

A training-set-based webpage classification technology, applied in fields such as special-purpose data processing, instruments, and electrical digital data processing, that addresses problems such as webpages that cannot be classified and declining classification accuracy.

Inactive
Publication Date: 2009-12-23
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

The knowledge system of the Internet is developing extremely rapidly, and various new knowledge structures are constantly emerging. If the tra



Embodiment Construction

[0052] The present invention proposes a technical framework for effective automatic webpage classification and designs a classification algorithm in detail, as shown in Figure 1. As the figure shows, the system is divided into three parts: webpage content processing, webpage vector representation, and webpage vector comparison.
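The three parts named above can be sketched as a small pipeline. This is only an illustrative sketch, not the patented algorithm: the function names (`process_content`, `to_vector`, `cosine`, `classify`) are hypothetical, raw term counts stand in for the feature weights, cosine similarity is assumed as the vector comparison, and the regex tokenizer is a placeholder (Chinese text would need a word segmenter instead).

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def process_content(html):
    """Part 1, webpage content processing: reduce page source to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_vector(text):
    """Part 2, vector representation: term counts as placeholder weights."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Part 3, vector comparison: cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(html, category_vectors):
    """Return the category whose vector is most similar to the page vector."""
    page = to_vector(process_content(html))
    return max(category_vectors, key=lambda c: cosine(page, category_vectors[c]))


# Usage with two toy category vectors (made-up data for illustration).
categories = {
    "sports": to_vector("football match score team"),
    "finance": to_vector("stock market bank interest"),
}
page = "<html><script>var x=1;</script><body><p>The team won the match</p></body></html>"
print(classify(page, categories))  # prints "sports"
```

Keeping the three parts as separate functions mirrors the division in the text and lets the weighting or the comparison be swapped without touching the content processing.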

[0053] Two terms need to be defined here. The training set is a large collection of webpage source code with known classifications. The source code is stored as text files placed in different folders according to category, and these texts are ultimately processed and converted into corresponding vectors. Feature extraction is the process of determining each element of the webpage vector: each element is a keyword entry that reflects the content of the webpage, and the value of the element is the calculated weight value ...
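A minimal sketch of how such a training set might be turned into vectors is given here. The passage above is cut off before the weight formula, so the weighting used below (term frequency times a smoothed inverse document frequency) is an assumption swapped in for illustration, not the weighting the patent defines. The folder layout `root/<category>/*.txt`, the function names, and the regex tokenizer are likewise hypothetical; Chinese pages would need a word segmenter in place of `tokenize`.

```python
import math
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Placeholder tokenizer; not suitable for unsegmented Chinese text."""
    return re.findall(r"\w+", text.lower())


def load_training_set(root):
    """Read root/<category>/*.txt into {category: [token list per document]}."""
    training = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            training[folder.name] = [
                tokenize(p.read_text(encoding="utf-8"))
                for p in sorted(folder.glob("*.txt"))
            ]
    return training


def tfidf_vectors(training):
    """Build one weighted keyword vector per category.

    Assumed weighting: weight = tf * log((1 + N) / (1 + df)), where tf is the
    term count within the category, N the total number of training documents,
    and df the number of documents containing the term.
    """
    all_docs = [doc for docs in training.values() for doc in docs]
    n_docs = len(all_docs)
    doc_freq = Counter(term for doc in all_docs for term in set(doc))
    vectors = {}
    for category, docs in training.items():
        tf = Counter(term for doc in docs for term in doc)
        vectors[category] = {
            term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in tf.items()
        }
    return vectors
```

With this weighting, a keyword that appears in every training document receives weight 0, so only terms that help separate the categories contribute to the comparison.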


Abstract

The invention relates to an automatic web page classification method based on a training set. The classification process combines feature selection, feature weight determination, text vector comparison, and related methods. Automatic classification based on a classification system assigns a document to be classified to the corresponding category according to a category model established beforehand, namely the training set. With the development of multimedia technology, the content of web pages has become rich and varied: it includes not only text but also a great deal of structural information and information in other forms, such as sound, figures, and images. However, because text-based web pages still account for a large proportion, classification based on web page text remains the primary approach. The method has reliable theoretical support, good extensibility and accuracy, and is easy to integrate with operator-related application interfaces.

Description

Technical field

[0001] The present invention concerns automatic content classification for arbitrary Chinese webpages. It mainly studies how to construct a training set and use vector comparison to accurately classify unknown webpages, designs an automatic webpage classification model and algorithm, and involves technical fields such as document feature extraction and feature weight calculation.

Background technique

[0002] With the rapid development and popularization of Internet technology, the amount of web page information on the Web has increased rapidly, and people have entered an era of abundant information. Faced with such a wealth of Web information, people often feel at a loss, and how to find the required resources effectively has become a common concern. Keyword search engines, the most commonly used online information retrieval tools (such as Baidu and Google), have disadvantages such as low accuracy and large information redundanc...


Application Information

IPC(8): G06F17/30
Inventor: 王攀, 张顺颐, 汤琛, 于伟涛
Owner: NANJING UNIV OF POSTS & TELECOMM