Method for classifying Chinese webpages based on keyword frequency analysis

A web page classification technology based on keyword frequency analysis, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the lack of standardization in web page authoring and the high time cost and complexity of web page classification, and it has broad significance and application value.

Inactive Publication Date: 2009-12-02
HUAIHAI INST OF TECH
0 Cites, 107 Cited by

AI Technical Summary

Problems solved by technology

[0005] Second, Chinese webpages contain a lot of "noise". Many webpages do not follow a standard format and contain large amounts of advertisements, annotations, and other such information.
[0007] Third, most of the current research on the classification of Chin

Method used


Examples

Embodiment Construction

[0029] A method for classifying Chinese webpages based on keyword frequency analysis, in which the keywords of the analyzed Chinese webpage are fuzzy-matched against a Chinese classification thesaurus to determine the webpage's category. The steps are as follows:

[0030] 1) Obtain the HTML source code of the Chinese webpage according to the website URL input by the user, filter and denoise the acquired source code, and extract the Chinese text in the webpage;

[0031] The purpose of this step is to preprocess Chinese webpages in any character encoding and to remove noise information irrelevant to the subject, including redundant information such as tags, script code, advertisement and picture links, designer comments, function declarations, and copyright information. Noise unrelated to the topic greatly reduces both the speed and the accuracy of extracting the webpage's text content, so it must be removed.
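The preprocessing described in this step can be sketched in Python as follows. This is a minimal illustration, not the patent's own filter: the function names `decode_page` and `extract_chinese_text`, the specific regular expressions, and the choice of fallback encoding are all assumptions, and downloading the page from the URL is left out. Noise written in Chinese inside ordinary page text (for example advertisement copy or a copyright line) would survive this simple filter and needs additional rules.

```python
import html
import re

# Hypothetical noise patterns; the source does not list its actual expressions.
_NOISE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.S),                       # designer comments
    re.compile(r"<script\b.*?</script\s*>", re.S | re.I),  # script code, function declarations
    re.compile(r"<style\b.*?</style\s*>", re.S | re.I),    # style sheets
    re.compile(r"<[^>]+>"),                                # remaining tags, incl. picture and ad links
]

# Runs of CJK Unified Ideographs (basic block only).
_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def decode_page(raw: bytes) -> str:
    """Decode page bytes, trying UTF-8 first and GB18030 as a fallback (assumed order)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("gb18030", errors="replace")


def extract_chinese_text(source: str) -> str:
    """Strip markup noise from HTML source and return the Chinese text runs."""
    for pattern in _NOISE_PATTERNS:
        source = pattern.sub(" ", source)
    source = html.unescape(source)  # turn entities such as &nbsp; into characters
    return " ".join(_CHINESE_RUN.findall(source))
```

The patterns are applied in order, so comments and script bodies are dropped before the generic tag pattern runs; otherwise Chinese strings inside scripts would leak into the extracted text.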

[0...

Abstract

The invention relates to a method for classifying Chinese webpages based on keyword frequency analysis. The keywords extracted from a Chinese webpage are fuzzy-matched against a Chinese classification subject thesaurus to determine the webpage's category. The HTML source code of the webpage is first obtained and preprocessed: a regular expression filter, developed through testing and analysis, removes noise information, and the Chinese text of the webpage is extracted. A word segmenter and a keyword frequency analyzer then segment the extracted Chinese text into words. From the weighted ranking of the words in the text and a fuzzy classification algorithm for webpages, a ranking of the categories to which the webpage's keywords belong is obtained. The top-ranked keywords are selected and their membership rates are calculated, yielding the fuzzy matching result for the category to which the webpage belongs. The method helps organize the mass of information on the Internet efficiently and can be used for interest analysis of Internet users, catalogue updating of search engines, Web content mining, online document management, and digital library construction.
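The ranking and membership steps summarized in the abstract can be illustrated with a short Python sketch. The text here does not give the patent's weighting or membership formulas, so everything below is an assumption made for illustration: raw frequency stands in for the keyword weight, the thesaurus is a plain word-to-category dictionary, and the membership rate is each category's share of the total weight of the matched top keywords. The names `rank_keywords` and `membership_rates` are hypothetical, and word segmentation is assumed to have been done already.

```python
from collections import Counter


def rank_keywords(words, top_k=5, min_length=2):
    """Return the top_k (word, frequency) pairs, skipping single-character words.

    Raw frequency is used as the weight here; this is an assumption,
    not the patent's weighting scheme.
    """
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(top_k)


def membership_rates(ranked_keywords, thesaurus):
    """Map each category to its share of the matched keyword weight.

    `thesaurus` is a hypothetical dict from keyword to category name.
    Keywords missing from the thesaurus are ignored.
    """
    totals = {}
    for word, weight in ranked_keywords:
        category = thesaurus.get(word)
        if category is not None:
            totals[category] = totals.get(category, 0) + weight
    total = sum(totals.values())
    if total == 0:
        return {}
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {category: weight / total for category, weight in ordered}


# Usage with an already-segmented page and a toy thesaurus.
words = ["足球", "比赛", "足球", "股票", "的", "比赛", "足球", "球员"]
thesaurus = {"足球": "sports", "比赛": "sports", "球员": "sports", "股票": "finance"}
rates = membership_rates(rank_keywords(words), thesaurus)
```

Because the result is a set of rates rather than a single label, a page can belong partly to several categories, which is the sense in which the match is fuzzy.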

Description

Technical field

[0001] The present invention concerns keyword frequency analysis of Chinese webpages and a webpage classification method based on that analysis. It mainly addresses how to filter and extract the content of a Chinese webpage by technical means, and how to perform word segmentation and frequency analysis on the webpage's keywords. It also addresses how to classify webpages using weighted Chinese webpage keywords. The technical fields involved include automatic webpage acquisition, Chinese webpage preprocessing, Chinese word segmentation and keyword frequency analysis, and fuzzy classification of Chinese webpages.

Background

[0002] With the rapid development of Internet and Web technology, the number of web pages on the Internet is constantly increasing. The growth of network information makes it much easier for people to obtain information, but the excessive amount of information also brings a lot of difficulties for people to ...

Application Information

IPC(8): G06F17/30
Inventor 掌明垄成龙卢艳宏冯源杨瑞王攀
Owner HUAIHAI INST OF TECH