A URL-based webpage classifier construction method and its classification method

The method, applied in the field of information security, classifies web pages from their URLs, addressing the problem that content-based detection is undermined by changed page content, encryption, or equivalent replacement. Its stated effects are low false positive and false negative rates, improved classification accuracy, and simple operation.

Active Publication Date: 2021-03-19
JINAN UNIVERSITY

AI Technical Summary

Problems solved by technology

[0003] In the prior art, a malicious webpage is generally identified by detecting the content and behavior of the page. Content-based detection has to examine the page's text and image content, malicious code fragments, behavior records in logs, and so on. As a result, identifying malicious webpages through their content cannot avoid problems such as changed webpage content, encryption, or equivalent replacement.


Examples

Embodiment 1

[0039] The invention discloses a method for constructing a URL-based webpage classifier. As shown in Figure 1, the steps are as follows:

[0040] Step S1: obtain the URLs of a plurality of webpages, mark each URL with a webpage attribute, and use each marked URL as a training sample to form a training sample set. In this embodiment, the training sample set includes a certain number of URLs whose webpage attribute is malicious and a certain number of URLs whose webpage attribute is benign.
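Step S1 can be pictured as pairing each URL with its marked attribute. The following is only a minimal sketch: the `Sample` type, the `build_training_set` helper, the integer label values, and the example URLs are all hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass

# Label encoding is an assumption; the source only names the two attributes.
MALICIOUS = 1
BENIGN = 0


@dataclass(frozen=True)
class Sample:
    """One training sample: a URL plus its marked webpage attribute."""
    url: str
    label: int


def build_training_set(malicious_urls, benign_urls):
    """Combine marked URLs of both attributes into one training sample set."""
    samples = [Sample(u, MALICIOUS) for u in malicious_urls]
    samples += [Sample(u, BENIGN) for u in benign_urls]
    return samples


# Placeholder URLs for illustration only.
training_set = build_training_set(
    ["http://example.test/login.php?id=1"],
    ["http://example.org/index.html"],
)
```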

[0041] Step S2: for each training sample in the training sample set, perform word segmentation on the sample using the selected characters, and then convert the result into a word vector.

[0042] In this embodiment, the characters selected in this step are "?", "=", ".", "&", "-" and "#"; that is, each training sample is segmented at these characters. For ex...
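The segmentation on the six selected characters could be sketched as below. This is an illustration under assumptions, not the patent's own implementation: the function name `segment_url` is hypothetical, and dropping empty tokens is a choice made here that the excerpt does not confirm.

```python
import re

# The six separator characters named in paragraph [0042].
SPLIT_CHARS = "?=.&-#"
_SPLIT_RE = re.compile("[" + re.escape(SPLIT_CHARS) + "]")


def segment_url(url):
    """Split a URL at the selected characters.

    Empty tokens (from adjacent separators) are dropped; that is an
    assumption, not something the source states.
    """
    return [tok for tok in _SPLIT_RE.split(url) if tok]


tokens = segment_url("http://example.com/a-b.php?id=1&x=2#top")
# tokens == ["http://example", "com/a", "b", "php", "id", "1", "x", "2", "top"]
```

Note that ":" and "/" are not in the selected set, so "http://example" and "com/a" each remain a single token in this sketch.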

Embodiment 2

[0062] This embodiment discloses a method for constructing a URL-based webpage classifier, which differs from the method for constructing a URL-based webpage classifier in Embodiment 1 only as follows:

[0063] In this embodiment, after the training sample set is obtained in step S1, the method further includes a step of deduplicating the training sample set. As shown in Figure 3, the details are as follows: first select an initial value for N and obtain the first N characters of each training sample in the training sample set; among URLs in the training sample set that share the same first N characters, only one is retained after deduplication. Then judge whether the total number of training samples in the training sample set is less than or equal to the threshold. If not, reduce the value of N and repeat the same processing until the total number of training samples is reduced to less than or equal to the threshold; for deduplication pro...
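The prefix-based deduplication loop might look like the following sketch. Several details are assumptions because the excerpt does not specify them: which URL of a group is retained (here, the first one seen), how much N is reduced per round (the hypothetical `step` parameter), and what happens when N cannot shrink further (here, the loop simply stops). The function names are also hypothetical.

```python
def dedup_by_prefix(urls, n):
    """Keep one URL per distinct first-n-character prefix (first seen wins)."""
    seen = set()
    kept = []
    for url in urls:
        key = url[:n]
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


def shrink_training_set(urls, initial_n, threshold, step=1):
    """Deduplicate with decreasing n until at most `threshold` URLs remain.

    Returns the retained URLs and the final n. The loop also stops once n
    cannot be reduced without dropping below 1, even if the threshold has
    not been reached (an assumed safeguard).
    """
    n = initial_n
    kept = dedup_by_prefix(urls, n)
    while len(kept) > threshold and n > step:
        n -= step
        kept = dedup_by_prefix(kept, n)
    return kept, n


urls = [
    "http://a.com/x1",
    "http://a.com/x2",
    "http://b.com/y1",
    "http://b.com/y2",
    "http://c.com/z",
]
kept, final_n = shrink_training_set(urls, initial_n=15, threshold=3)
```

With these placeholder URLs, N = 15 keeps all five samples, so N is reduced to 14, at which point the two `a.com` URLs and the two `b.com` URLs each collapse to one.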


Abstract

The invention discloses a method for constructing a URL-based web page classifier and a classification method using it. First, the URLs of a plurality of web pages are obtained, each URL is marked with a web page attribute, and each marked URL is used as a training sample to form a training sample set. Each training sample in the set is segmented at the selected characters and then converted into a word vector. The word vectors of the marked training samples are used as input to train a convolutional neural network, which yields the web page classifier. For a web page to be classified, its URL is first obtained as a test sample; the test sample is then segmented at the selected characters and converted into a word vector. The word vector of the test sample is input into the web page classifier, which outputs the classification result. The invention greatly improves the classification accuracy for malicious web pages.
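The classification flow in the abstract (obtain URL, segment, convert, feed to the classifier) can be outlined as follows. This is a sketch under stated assumptions: the excerpt does not say how word vectors are produced, so integer token indices with padding are used here as a stand-in that an embedding layer could later map to vectors. The names `build_vocab`, `encode`, `classify`, `PAD`, `UNK`, and `max_len` are hypothetical, and `model` is any callable standing in for the trained convolutional neural network, which is not implemented here.

```python
import re

# Separators listed in Embodiment 1: ? = . & - #
_SPLIT_RE = re.compile(r"[?=.&\-#]")
PAD, UNK = 0, 1  # reserved indices (assumed convention)


def segment_url(url):
    """Split a URL at the selected characters, dropping empty tokens."""
    return [t for t in _SPLIT_RE.split(url) if t]


def build_vocab(training_urls):
    """Assign each distinct training token an integer index starting at 2."""
    vocab = {}
    for url in training_urls:
        for tok in segment_url(url):
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab


def encode(url, vocab, max_len):
    """Map a URL to a fixed-length index sequence (truncate, then pad)."""
    ids = [vocab.get(t, UNK) for t in segment_url(url)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def classify(url, vocab, model, max_len=8):
    """Encode a test URL and pass it to `model`, a stand-in for the classifier."""
    return model(encode(url, vocab, max_len))


vocab = build_vocab(["http://example.org/index.html"])
encoded = encode("http://example.org/index.html?x=1", vocab, max_len=6)
# encoded == [2, 3, 4, 1, 1, 0]
```

Tokens unseen during training ("x" and "1" above) map to `UNK`, and the sequence is padded to a fixed length so that every input to the network has the same shape.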

Description

Technical field

[0001] The invention relates to the technical field of information security, in particular to a method for constructing a URL-based web page classifier and a classification method using it.

Background

[0002] The openness and virtualization of the Internet pose serious challenges to privacy, data, and transaction security. In recent years, the use of malicious webpages for cybercrime has been rampant. According to statistics, nearly one-third of web pages are potentially malicious. Malicious web pages attack users through spam, phishing, and similar means, causing users who lack security awareness to suffer various kinds of damage, including loss of funds and theft of private information, and seriously threatening the security of users' property and information. For this reason, identifying malicious webpages in a timely and effective manner has become an important and urgent problem.

[0003] In the prior art, it is generally recognized whether a webpage...

Application Information

Patent Type & Authority Patents (China)
IPC(8): G06F16/958, G06F16/955, G06F16/35, G06F40/289
CPC G06F40/289
Inventor 孙玉霞赵晶晶仇之
Owner JINAN UNIVERSITY