Content-based web page classification method and system

A web page classification technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems of high labor costs, coarse classification granularity, and pages left unclassified, with the effect of reducing labor costs and improving classification accuracy.

Active Publication Date: 2012-12-12
BEIJINGNETENTSEC
View PDF6 Cites 27 Cited by

AI Technical Summary

Problems solved by technology

Misclassifying a web page may lead to customer complaints.
The language of most web pages is not standardized, which makes classification with the related methods more complex.
In addition, building and maintaining the thesaurus and classifiers is complicated and costly.
[0006] 2) The classification granularity is relatively coarse.
[0007] 3) Classification is not sufficiently real-time.
Because websites change rapidly, a huge number of sites become outdated and new ones appear every day, so maintaining the database is very time-consuming and labor-intensive.
In addition, for websites encountered in user scenarios that are not in the database, usually only an "unclassified" result can be given.
[0008] 4) Most current classification methods are either automatic classifier plus manual check or purely manual classification, so the database grows slowly and labor costs are high.



Examples


Embodiment

[0112] Figure 6 is a schematic flow chart of a content-based web page classification method according to an embodiment of the present invention. As shown in Figure 6, the method includes:

[0113] 601. Send the URL to be classified to the cache for lookup. If a record is hit, return the classification result directly; otherwise, go to step 602.

[0114] 602. Query the URL in the database module. If a record is hit, return the classification result directly and, at the same time, record the URL and its classification result in the cache; otherwise, go to step 603.

[0115] 603. The query in step 602 also returns whether the domain name corresponding to the URL supports subdivision. If the website is marked as supporting subdivision, go to step 604; otherwise, return the unclassified result directly.

[0116] 604. Classify according to the URL features. If a classification result is obtained, then dir...
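The lookup cascade in steps 601 to 604 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `CascadeClassifier`, the in-memory dictionaries standing in for the cache and database module, the set of subdividable domains, and the `url_feature_classifier` callback are all hypothetical. The source text is truncated after step 604, so the sketch simply returns an unclassified result where the later steps would run.

```python
from typing import Callable, Dict, Optional, Set
from urllib.parse import urlsplit

UNCLASSIFIED = "unclassified"


class CascadeClassifier:
    """Hypothetical sketch of steps 601-604: cache, database, subdivision check, URL features."""

    def __init__(
        self,
        database: Dict[str, str],
        subdividable_domains: Set[str],
        url_feature_classifier: Callable[[str], Optional[str]],
    ) -> None:
        self.cache: Dict[str, str] = {}                       # stands in for the cache (step 601)
        self.database = database                              # stands in for the database module (step 602)
        self.subdividable_domains = subdividable_domains      # assumed form of the "supports subdivision" flag
        self.url_feature_classifier = url_feature_classifier  # assumed callback for step 604

    def classify(self, url: str) -> str:
        # Step 601: a cache hit returns the stored result directly.
        if url in self.cache:
            return self.cache[url]

        # Step 602: a database hit is returned and also recorded in the cache.
        category = self.database.get(url)
        if category is not None:
            self.cache[url] = category
            return category

        # Step 603: only domains marked as supporting subdivision go further.
        domain = urlsplit(url).hostname or ""
        if domain not in self.subdividable_domains:
            return UNCLASSIFIED

        # Step 604: classify according to URL features.
        category = self.url_feature_classifier(url)
        if category is not None:
            return category

        # The steps after 604 are not available in the source; this sketch stops here.
        return UNCLASSIFIED
```

A caller would construct the classifier once and reuse it, so that database hits from step 602 populate the cache and are served by step 601 on later requests.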


Abstract

The invention discloses a content-based web page classification method comprising the following steps: user equipment acquires a characteristic keyword from the uniform resource locator (URL) of a web page to be accessed by a user, and queries a local URL characteristic library with that keyword to obtain the corresponding web page classification information; when the user equipment finds no corresponding classification information in the URL characteristic library, it further acquires the content of the web page to be accessed and queries a local web page template library according to that content to obtain the corresponding classification information. The invention also discloses a corresponding content-based web page classification system. With this method and system, classification at web page granularity can be realized, classification accuracy and real-time performance are improved, and labor cost is reduced.
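The two-stage lookup described in the abstract might look like the sketch below. Everything concrete here is an assumption made for illustration: the library contents, the tokenization of the URL into keywords, and the rule that a template matches when all of its marker strings appear in the page content. The source does not say how the URL characteristic library or the web page template library are actually structured or matched.

```python
import re
from typing import Callable, List, Optional

# Hypothetical local URL characteristic library: keyword -> category.
URL_FEATURE_LIBRARY = {"sports": "Sports", "finance": "Finance", "mail": "Webmail"}

# Hypothetical local web page template library: (required markers, category).
TEMPLATE_LIBRARY = [
    (("cart", "checkout"), "Shopping"),
    (("reply", "thread"), "Forum"),
]


def url_keywords(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric tokens (an assumed keyword extraction)."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]


def classify_by_url(url: str) -> Optional[str]:
    """Stage 1: look up each URL keyword in the URL characteristic library."""
    for token in url_keywords(url):
        if token in URL_FEATURE_LIBRARY:
            return URL_FEATURE_LIBRARY[token]
    return None


def classify_by_template(page_content: str) -> Optional[str]:
    """Stage 2: return the first template whose markers all occur in the content."""
    text = page_content.lower()
    for markers, category in TEMPLATE_LIBRARY:
        if all(marker in text for marker in markers):
            return category
    return None


def classify(url: str, fetch_content: Callable[[str], str]) -> Optional[str]:
    """Try the URL library first; fetch the page content only when that lookup misses."""
    category = classify_by_url(url)
    if category is not None:
        return category
    return classify_by_template(fetch_content(url))
```

Passing `fetch_content` as a callback keeps the page download out of the fast path: a URL that matches the characteristic library is classified without fetching any content.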

Description

Technical field

[0001] The invention relates to the field of network security and monitoring, in particular to a content-based web page classification method and system.

Background

[0002] In the field of network security and monitoring, certain types of websites need to be blocked according to actual policy requirements. In addition, to prevent leaks of important information and to support review, enterprises need to record user access records and traffic information. Controlling, auditing, and logging what customers actually access is therefore the main purpose of current online behavior management products. In this context, the real-time performance and accuracy of content recognition for websites, and even individual web pages, together with the related implementation technologies, are core to this field.

[0003] At present, website classification is mostly done offline. That is, a large number of pages are obtained through web crawler technology in adva...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 贾晋康, 吕烨, 张永臣
Owner: BEIJINGNETENTSEC