Bad webpage recognition method based on URL

A method for identifying bad webpages from their URLs, applied in data exchange networks, special data processing applications, instruments, etc. It addresses problems of existing approaches such as inability to cope with new sites, large delays, and high method complexity.

Inactive Publication Date: 2010-04-07
XI AN JIAOTONG UNIV
0 Cites, 66 Cited by

AI Technical Summary

Problems solved by technology

[0014] 2. Methods based on image and streaming-media recognition apply widely, but they process large amounts of data, are highly complex, incur large delays, consume substantial bandwidth, and are unsuitable for real-time recognition and processing in a network environment.
The disadvantage of this method is that it has poor flexibility and cannot cope with new sites.
[0016] 4. At present, there is no literature on identifying bad web pages through URL analysis and semantic understanding. This invention fills that gap and provides a new approach to quickly identifying bad web pages.

Embodiment Construction

[0066] In order to understand the present invention more clearly, the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0067] As shown in Figure 1, URL identification begins in the preprocessing module, which filters out special characters and extracts the parts that actually affect identification: the suffix, the main domain name, the host name, and so on. The suffix is then checked against the exclusive suffixes (.gov, .edu): if it matches, the URL is judged normal directly; otherwise it proceeds to the next step. In the main judgment process, the domain name part is segmented and features are extracted from it, and features are also extracted from the host name part. A combined classifier then classifies the extracted features. If the result is a normal URL, it is passed to subsequent tools for further confirmation; if it is bad, the user is directly prohibited from accessing ...
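The flow in [0067] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the suffix list `SUFFIX_LABELS`, the function names `preprocess` and `judge`, the returned labels, and the `classifier` callable are all assumptions introduced here, and the suffix peeling is deliberately naive.

```python
import re
from urllib.parse import urlsplit

# Assumption: a tiny illustrative suffix list; a real system would need a fuller one.
SUFFIX_LABELS = {"com", "net", "org", "gov", "edu", "cn", "uk"}
TRUSTED_LABELS = {"gov", "edu"}  # the exclusive suffixes named in the text


def preprocess(url):
    """Filter special characters and split the host into suffix, main domain, host name."""
    if "://" not in url:
        url = "http://" + url                      # urlsplit needs a scheme to locate the host
    host = (urlsplit(url).hostname or "").lower()
    host = re.sub(r"[^a-z0-9.-]", "", host)        # keep only letters, digits, dots, hyphens
    labels = [p for p in host.split(".") if p]
    suffix = []
    # Peel known suffix labels from the right, always leaving at least one label.
    while len(labels) > 1 and labels[-1] in SUFFIX_LABELS:
        suffix.insert(0, labels.pop())
    main_domain = labels.pop() if labels else ""
    return {"suffix": suffix, "main_domain": main_domain, "host_name": ".".join(labels)}


def judge(url, classifier):
    """Whitelist check first, then defer to a classifier (hypothetical callable)."""
    parts = preprocess(url)
    if TRUSTED_LABELS & set(parts["suffix"]):
        return "normal"                            # exclusive suffix: judged normal directly
    verdict = classifier(parts["main_domain"], parts["host_name"])
    # A "bad" verdict blocks access; a "normal" one still goes to later tools.
    return "block" if verdict == "bad" else "confirm_with_other_tools"
```

Here `classifier` stands in for the combined classifier; any function that takes the main domain and host name and returns `"bad"` or `"normal"` can be plugged in to try the flow.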

Abstract

The invention discloses a URL-based method for recognizing bad webpages. The method judges whether a URL belongs to a pornographic website through semantic analysis of the URL's primary domain and structural analysis of the whole URL. Two kinds of features contained in the URL, sensitive-string features and structural features, are extracted as the basis for the judgment, and a discriminator that combines these features with the SVM algorithm performs two-class (binary) classification to obtain the result. The method can assist other recognition methods in quickly recognizing bad websites and so help provide a healthy Internet environment. Because the judgment requires no webpage content, it offers an efficient new approach to recognizing pornographic websites.
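To make the idea of combining sensitive-string and structural features in an SVM concrete, here is a small sketch. Everything specific in it is an assumption for illustration: the term list `SENSITIVE_TERMS`, the three chosen features, the toy training URLs, and the use of a linear SVM trained by plain subgradient descent on the hinge loss, which stands in for whatever SVM setup the patent actually uses.

```python
from urllib.parse import urlsplit

# Assumption: illustrative sensitive terms; the patent's actual term list is not given here.
SENSITIVE_TERMS = ("sex", "porn", "adult", "xxx")


def features(url):
    """Return [sensitive-string hits, digit ratio, hyphen count] for the URL's host."""
    if "://" not in url:
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    main = labels[-2] if len(labels) >= 2 else host    # naive main-domain guess
    hits = sum(term in main for term in SENSITIVE_TERMS)
    digit_ratio = sum(c.isdigit() for c in host) / max(len(host), 1)
    return [float(hits), digit_ratio, float(host.count("-"))]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Batch subgradient descent on the L2-regularised hinge loss; labels are +1 / -1."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                             # hinge loss is active for this sample
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, url):
    """Classify a URL as "bad" or "normal" from the sign of the linear score."""
    w, b = model
    score = sum(wj * xj for wj, xj in zip(w, features(url))) + b
    return "bad" if score > 0 else "normal"


# Hypothetical toy training set, for illustration only (+1 = bad, -1 = normal).
TRAIN = [("sexvideo123.com", 1), ("free-porn-xxx.net", 1), ("adult-sex88.com", 1),
         ("example.com", -1), ("news-site.org", -1), ("weather24.net", -1)]
model = train_linear_svm([features(u) for u, _ in TRAIN], [lab for _, lab in TRAIN])
```

Note that the classifier only ever sees the URL string, which mirrors the abstract's point that no webpage content has to be fetched. Substring matching on sensitive terms is crude and will produce false positives on innocent domains, which is one reason a normal verdict is still confirmed by other tools.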

Description

Technical field

[0001] The invention relates to a method for filtering bad information on the Internet, in particular to a method for identifying bad webpages based on URLs. The method belongs to the field of machine learning: the final discrimination is accomplished by applying feature extraction and classification techniques from that field.

Background technique

[0002] With the rapid development of the Internet, harmful online content has spread as well, and the emergence of a large number of pornographic web pages has seriously affected the healthy development of young people. In recent years, research on the automatic identification of pornographic content has made notable progress. Through a novelty search, the applicant retrieved two patents related to the present invention on the automatic identification of pornographic content, which are:

[0003] 1. Multifunctional management system for network pornography and bad information detection [000...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L12/24, G06F17/30
Inventor: 郑庆华, 骞雅楠, 刘均, 常晓, 吴朝晖, 蒋路
Owner: XI AN JIAOTONG UNIV