Method for establishing machine learning model to check webpage hidden links through domain identification and theme identification

Machine learning model and web page technology, applied in the field of network security, that addresses the problems of coarse feature extraction and reduced recognition accuracy.

Active Publication Date: 2018-01-09
上海斗象信息科技有限公司

AI Technical Summary

Problems solved by technology

Only one of the currently published dark link detection patents uses a machine learning algorithm. That patent (application number 201410452221.2, publication number CN104239485A) builds a machine learning model that identifies dark links using all the anchor text extracted from the page as features. Extracting all the anchor text on a page generates a lot of noisy data and yields coarse features, which reduces the recognition effect. In addition, because only anchor text is used as a feature, a page whose content has been tampered with but which contains no hidden links can be misidentified as containing hidden links.
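To make the noise problem concrete, the following is a minimal sketch of the "all anchor text" extraction attributed to the prior-art approach, written with Python's standard `html.parser`. The class name `AnchorTextExtractor`, the helper `extract_anchor_texts`, and the sample page are assumptions made for illustration; they are not taken from either patent.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collects the text of every <a> element on a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0          # greater than 0 while inside an <a> element
        self._buffer = []        # text fragments of the anchor being read
        self.anchor_texts = []   # one entry per completed anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.anchor_texts.append(text)
                self._buffer = []


def extract_anchor_texts(html):
    """Return the anchor texts of a page in document order."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.anchor_texts


# Illustrative page: two ordinary navigation links and one hidden link.
page = (
    '<nav><a href="/">Home</a><a href="/about">About us</a></nav>'
    '<div style="display:none"><a href="http://example.invalid">cheap pills</a></div>'
)
anchors = extract_anchor_texts(page)
print(anchors)
```

In this sketch the ordinary navigation anchors end up in the same feature list as the hidden one, which illustrates why features built from every anchor on the page are noisy.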

Embodiment Construction

[0064] To make the technical means, inventive features, goals and effects of the present invention easy to understand, the following embodiments describe, with reference to the accompanying drawings, the method of building a machine learning model through domain identification plus topic identification to detect hidden links in web pages.

[0065] As shown in Figure 1 and Figure 2, the method of building a machine learning model through domain identification plus topic identification to detect dark links on web pages includes the following steps:

[0066] Step S1, collecting a large number of webpage source codes as a training set, which includes webpages marked as containing hidden links and webpages marked as normal.

[0067] Step S2, extracting the feature data used to build the machine learning model from the source code of the webpages in the training set and from the source code of the webpage to be predicted. The feature data includes risk degree, theme abnormality deg...
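As an illustration of what steps S1 and S2 produce, the sketch below models one labelled training sample as a plain Python record. The field list follows the abstract of this publication (risk text, risk degree, theme difference degree, theme, risk text vector, risk text abnormal probability, risk text length); the class names, field types, value ranges and the `to_numeric_vector` helper are assumptions, since the text available here does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageFeatures:
    """Feature data for one webpage source code (types are assumed)."""
    risk_text: str
    risk_degree: float
    theme_difference_degree: float
    theme: str
    risk_text_vector: List[float]
    risk_text_abnormal_probability: float
    risk_text_length: int

    def to_numeric_vector(self) -> List[float]:
        # Hypothetical flattening: scalar features first, then the text vector.
        # The textual fields (risk_text, theme) are left out of this sketch.
        return [
            self.risk_degree,
            self.theme_difference_degree,
            self.risk_text_abnormal_probability,
            float(self.risk_text_length),
        ] + list(self.risk_text_vector)


@dataclass
class LabeledPage:
    """One training-set entry from step S1: features plus the manual label."""
    features: PageFeatures
    has_hidden_links: bool


# Made-up values, for illustration only.
sample = LabeledPage(
    features=PageFeatures(
        risk_text="cheap pills",
        risk_degree=0.9,
        theme_difference_degree=0.8,
        theme="municipal news",
        risk_text_vector=[0.1, 0.7],
        risk_text_abnormal_probability=0.95,
        risk_text_length=11,
    ),
    has_hidden_links=True,
)
vector = sample.features.to_numeric_vector()
```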

Abstract

The invention provides a method for establishing a machine learning model to check webpage hidden links through domain identification and theme identification. The method comprises the steps of: collecting a large number of webpage source codes, some marked as containing hidden links and some marked as normal, and forming a training set from them; extracting a risk text, risk degree, theme difference degree, a theme, a risk text vector, a risk text abnormal probability and the risk text length from each webpage source code through suspicious domain identification, sensitive domain identification, secure domain identification, all-domain analysis and theme identification; carrying out model training on the feature data of all webpage source codes in the training set using a machine learning algorithm, thereby obtaining a classification model; and importing the feature data of the webpage source code to be predicted into the classification model, thereby obtaining a result indicating whether that webpage source code contains hidden links. The method identifies highly obfuscated hidden link code well, its feature extraction is relatively complete, and it largely solves the problem that traditional methods cannot accurately distinguish hidden links from page tampering.
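The train-then-predict flow described in the abstract can be sketched as follows. The abstract only says "a machine learning algorithm", so logistic regression trained by batch gradient descent is an assumption chosen for brevity, not the patent's stated method. The three-element feature vectors (risk degree, theme difference degree, risk text abnormal probability), the function names and all numeric values are likewise made up for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with batch gradient descent (assumed algorithm)."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict_has_hidden_links(model, x):
    """Return True when the model scores the page as containing hidden links."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Toy training set: [risk degree, theme difference degree, abnormal probability].
X_train = [
    [0.90, 0.80, 0.95], [0.80, 0.90, 0.85], [0.70, 0.75, 0.90],  # hidden links
    [0.10, 0.20, 0.05], [0.20, 0.10, 0.10], [0.15, 0.05, 0.20],  # normal pages
]
y_train = [1, 1, 1, 0, 0, 0]

model = train_logistic_regression(X_train, y_train)
suspicious = predict_has_hidden_links(model, [0.85, 0.90, 0.90])
normal = predict_has_hidden_links(model, [0.10, 0.10, 0.10])
```

Any classifier that accepts fixed-length numeric feature vectors could take the place of `train_logistic_regression` here; the sketch shows the data flow, not a recommended model.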

Description

Technical field

[0001] The invention belongs to the technical field of network security, and in particular relates to a method for detecting dark links in webpages by constructing a machine learning model based on domain identification and topic identification.

Background art

[0002] In recent years, the Internet industry has developed vigorously, and the Internet has become the main way for people to obtain information. With the emergence of various new websites, Internet information has grown exponentially. With this massive amount of information, search engines have become the main information search tools. Search engines crawl website information and calculate the weight of web page content in order to rank it and display it in the search results. Since the websites displayed at the top of the search results have a higher probability of being visited by users, some website managers resort to various cheating methods in order to obtain more visits. "Dark link" is a ch...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/06, G06F17/30, G06N99/00
Inventor: 孟雷
Owner: 上海斗象信息科技有限公司