Statistical machine learning-based internet hidden link detection method

A hidden link (dark link) detection technology based on statistical machine learning, applied in the field of network and search technology. It addresses weak identification of hiding techniques, missed detections, and the inability to respond automatically to hiding techniques, and achieves effective detection.

Active Publication Date: 2014-12-24
CHINA INTERNET NETWORK INFORMATION CENTER

AI Technical Summary

Problems solved by technology

This detection method is weak at identifying one of the hiding techniques used by dark links (defining invisible code in JavaScript). At prese...



Embodiment Construction

[0025] In order to make the above objects, features and advantages of the present invention more obvious and understandable, the present invention will be further described below through specific embodiments and accompanying drawings.

[0026] Figure 1 is an overall flowchart of the dark link detection method based on statistical machine learning of the present invention. It includes the data preparation and preprocessing flow (collecting and classifying webpage source code samples, extracting anchor text, word segmentation, and vectorization), training the classification model, and using the classification model to detect unknown webpages.
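The preprocessing flow described in [0026] can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the class `AnchorTextExtractor`, the helper names, and the sample HTML are hypothetical, and the regex tokenizer is a simple stand-in for a real word segmentation step (Chinese text would need a dedicated segmenter).

```python
import re
from collections import Counter
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collects the text found inside <a>...</a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <a> element
        self._buffer = []    # text pieces of the anchor being read
        self.anchors = []    # finished anchor texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.anchors.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_anchor_texts(html):
    """Return the anchor texts of a page, in document order."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.anchors


def tokenize(text):
    # Assumption: lowercase alphanumeric runs stand in for word segmentation.
    return re.findall(r"[a-z0-9]+", text.lower())


def to_count_vector(tokens, vocabulary):
    """Bag-of-words vector: one term count per vocabulary entry."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Made-up page: one visible link plus two links inside a hidden block.
sample_html = (
    '<p><a href="/news">Site news</a></p>'
    '<div style="display:none">'
    '<a href="http://example.com/a">cheap pills</a>'
    '<a href="http://example.com/b">online casino</a>'
    '</div>'
)

anchors = extract_anchor_texts(sample_html)
tokens = [tok for text in anchors for tok in tokenize(text)]
vocabulary = sorted(set(tokens))
vector = to_count_vector(tokens, vocabulary)
```

Note that the sketch looks only at anchor text, not at how a link is hidden, which matches the flow's focus on anchor text taken from the page source.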

[0027] Figure 2 shows the data preparation and preprocessing flow of the present invention. The steps are as follows:

[0028] 1) Collect source code files containing dark links and HTML source code files without hidden links. The former are selected by manual screening and identification; the latter selects ...


Abstract

The invention relates to a hidden link detection method based on statistical machine learning. The method comprises the following steps: (1) collecting real webpage source code data as a training set for a classification model, and dividing the data into a category containing hidden links and a category containing no hidden links; (2) extracting anchor texts, i.e., the character contents of link fields, from the HTML source code files of all collected webpages in each of the two categories, then segmenting the anchor texts into single words; (3) vectorizing the two categories of segmented texts; (4) performing dimension reduction on the vector corresponding to each text; (5) training a classifier on the two categories of data obtained in step (4) to obtain a classification model; (6) applying the obtained classification model to an unknown webpage to be detected to obtain a hidden link detection result. The method uses a webpage's source code to detect effectively and automatically whether the webpage contains hidden links, and can thereby provide theoretical and practical support for search engines in cracking down on web cheating.
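Steps (3) to (6) of the abstract can be illustrated with a small standard-library sketch. The visible text does not name the dimension reduction method or the classifier, so both choices here are assumptions made only for illustration: columns are kept by the gap in document frequency between the two categories, and the classifier is a multinomial Naive Bayes with add-one smoothing. All function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def vectorize(tokens, vocabulary):
    """Step (3): bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def select_columns(vectors, labels, k):
    """Step (4), assumed method: keep the k columns whose document
    frequency differs most between the two categories."""
    n_terms = len(vectors[0])
    gaps = []
    for j in range(n_terms):
        df_hidden = sum(1 for v, y in zip(vectors, labels) if y == 1 and v[j] > 0)
        df_clean = sum(1 for v, y in zip(vectors, labels) if y == 0 and v[j] > 0)
        gaps.append(abs(df_hidden - df_clean))
    ranked = sorted(range(n_terms), key=lambda j: (-gaps[j], j))
    return sorted(ranked[:k])


def reduce_vector(vector, keep):
    return [vector[j] for j in keep]


def train_naive_bayes(vectors, labels):
    """Step (5), assumed classifier: multinomial Naive Bayes, add-one smoothing."""
    n_terms = len(vectors[0])
    model = {}
    for cls in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        term_totals = [sum(col) for col in zip(*rows)]
        total = sum(term_totals)
        model[cls] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [
                math.log((c + 1) / (total + n_terms)) for c in term_totals
            ],
        }
    return model


def predict(model, vector):
    """Step (6): return the category with the highest log score."""
    scores = {
        cls: p["log_prior"]
        + sum(x * w for x, w in zip(vector, p["log_likelihood"]))
        for cls, p in model.items()
    }
    return max(scores, key=scores.get)


# Toy, made-up training set of segmented anchor texts.
# Label 1 = page contains hidden links, label 0 = page does not.
train_docs = [
    ["cheap", "pills", "online", "casino"],
    ["casino", "bonus", "cheap", "loans"],
    ["home", "news", "contact", "about"],
    ["news", "products", "about", "support"],
]
train_labels = [1, 1, 0, 0]

vocabulary = sorted({tok for doc in train_docs for tok in doc})
full_vectors = [vectorize(doc, vocabulary) for doc in train_docs]
keep = select_columns(full_vectors, train_labels, k=4)
kept_terms = [vocabulary[j] for j in keep]
model = train_naive_bayes(
    [reduce_vector(v, keep) for v in full_vectors], train_labels
)


def classify(tokens):
    return predict(model, reduce_vector(vectorize(tokens, vocabulary), keep))


suspicious = classify(["cheap", "casino", "news"])
clean = classify(["about", "news", "team"])
```

A real training set would be far larger, and the choice of `k`, the reduction criterion, and the classifier would need to be tuned on held-out data.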

Description

Technical field

[0001] The invention belongs to the field of network technology and search technology, and in particular relates to a method for detecting dark links on the Internet based on statistical machine learning.

Background technique

[0002] As an important entrance to the Internet, search engines have become an indispensable everyday tool for Internet users, and the ranking of search results is central to how those results are presented. Search engines use special algorithms (such as Google's PageRank) to measure the relative importance of webpages and use this to determine the ranking of search results. Because search engines use "crawlers" to fetch webpage content by following the links between webpages, the external links of a webpage are an important factor in most algorithms that measure webpage importance: the more links from external websites point to the target webpage, the higher the weight of the target webpage, the easier it i...


Application Information

IPC(8): G06F17/30
CPC: G06F16/951, G06F18/24, G06F21/60
Inventor: 孟池洁, 王伟, 耿光刚, 隋鹏宇
Owner: CHINA INTERNET NETWORK INFORMATION CENTER