Method and system for discovering mirror website based on visual similarity

A visual-similarity-based mirror website discovery technology, applied in the field of network information, that addresses problems such as the inability to achieve both high recall and high accuracy, the inapplicability of source-code-based mirror website identification algorithms, and large differences in the source code of mirror websites.

Active Publication Date: 2022-06-17
INST OF INFORMATION ENG CHINESE ACAD OF SCI +1

AI Technical Summary

Problems solved by technology

However, this method also has limitations. Many websites today are built from templates, so the source code of their webpages is highly similar. As a result, the simhash algorithm cannot achieve a high recall rate and a high accuracy rate at the same time.
[0006] On the other hand, with the emergence of JavaScript, many webpages import JS. The source code of many mirror sites therefore differs greatly from the original even though the displayed pages are similar, so mirror site identification algorithms based on source code are no longer applicable.
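
For context, here is a minimal sketch of the kind of source-code fingerprinting discussed above: a 64-bit simhash over the tokens of a page's HTML, compared by Hamming distance. The tokenizer, the MD5-based token hash, and the function names are illustrative assumptions, not details taken from the patent.

```python
import hashlib
import re

BITS = 64  # fingerprint width; matches the 8 bytes taken from each token hash


def simhash(source: str) -> int:
    """Return a 64-bit simhash fingerprint of the given page source."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", source.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages built from the same template share most of their markup tokens, so their fingerprints tend to differ in only a few bits even when the visible content is unrelated. Conversely, a mirror whose markup is assembled by JavaScript can produce a very different fingerprint while rendering the same page.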

Examples

Embodiment Construction

[0051] To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

[0052] One obvious feature of a mirror website is that it is visually very similar to the original webpage. Mirror websites can therefore be identified efficiently by combining a webpage segmentation algorithm with an image similarity detection algorithm. Page segmentation divides a webpage into semantic blocks according to differences in its content; each semantic block is then converted into an image and compared with the reference webpage in order to identify the mirror website. Because the comparison is made on what the page displays rather than on its source code, this method effectively addresses the problems of code differences and encryption in mirror websites.
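
The first stage of such a page segmentation, as summarized in the abstract, repeatedly splits divisible blocks and collects the indivisible ones in a page block pool. The sketch below illustrates only that traversal. The `Block` class, the `collect_page_blocks` function, and the divisibility rule passed in by the caller are hypothetical, since the source does not state how divisibility is decided.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Block:
    """A page region; `children` are the sub-blocks it could be split into."""
    name: str
    children: List["Block"] = field(default_factory=list)


def collect_page_blocks(
    root: Block, is_divisible: Callable[[Block], bool]
) -> List[Block]:
    """Split divisible blocks until only indivisible page blocks remain.

    Returns the page block pool in document order.
    """
    pool: List[Block] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children and is_divisible(node):
            # Push children in reverse so they are visited left to right.
            stack.extend(reversed(node.children))
        else:
            pool.append(node)
    return pool


page = Block("page", [
    Block("header"),
    Block("main", [Block("article"), Block("sidebar")]),
    Block("footer"),
])
# Assumed rule for illustration: any block that has sub-blocks is divisible.
pool = collect_page_blocks(page, lambda b: bool(b.children))
print([b.name for b in pool])  # ['header', 'article', 'sidebar', 'footer']
```

A real divisibility test would look at rendered properties of each block. The later steps (dividing-bar detection and weighting, and reconstruction into semantic blocks) are not covered by this sketch.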

[0053] The present...

Abstract

The present invention provides a method and system for discovering mirror websites based on visual similarity. The method comprises the following steps: initially dividing a webpage into blocks and using the resulting blocks as nodes of a DOM tree; further dividing each divisible node and using the new blocks as child nodes of that node; for each indivisible node, storing its block as a page block in a page block pool, and iterating in this way until all page blocks are obtained; detecting the dividing bars in the page and determining the weight of each dividing bar; performing reconstruction based on the dividing-bar weights to obtain semantic blocks; converting each semantic block into an image and extracting the signature feature of the image; and extracting the signature features of each semantic block of the target webpage and of the reference webpage according to the above steps, then calculating the distance between the target webpage and the reference webpage from these features using the EMD (Earth Mover's Distance) algorithm. If the distance is less than a set threshold, the website of the target webpage is determined to be a mirror website.
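
The final comparison step can be illustrated with a simplified sketch. The source does not specify the signature feature, so the code below assumes each semantic block is reduced to a single scalar feature with a weight (for example, its share of the page area). Under that assumption the Earth Mover's Distance reduces to the area between the two cumulative distributions. With multi-dimensional features, EMD instead requires solving a transportation problem. The names `emd_1d` and `is_mirror` and the threshold value are hypothetical.

```python
from typing import List, Tuple

Signature = List[Tuple[float, float]]  # (scalar feature, weight) per semantic block


def _normalize(sig: Signature) -> Signature:
    total = sum(w for _, w in sig)
    if total <= 0:
        raise ValueError("signature must have positive total weight")
    return [(x, w / total) for x, w in sig]


def emd_1d(sig_a: Signature, sig_b: Signature) -> float:
    """Earth Mover's Distance between two weighted 1-D point sets.

    Both signatures are normalized to unit mass, so the distance equals
    the integral of |CDF_a - CDF_b| along the feature axis.
    """
    events = sorted(
        [(x, w) for x, w in _normalize(sig_a)]
        + [(x, -w) for x, w in _normalize(sig_b)]
    )
    distance = 0.0
    cdf_diff = 0.0
    prev_x = events[0][0]
    for x, w in events:
        distance += abs(cdf_diff) * (x - prev_x)
        cdf_diff += w
        prev_x = x
    return distance


def is_mirror(target: Signature, reference: Signature, threshold: float) -> bool:
    """Flag the target page as a mirror when its distance is below the threshold."""
    return emd_1d(target, reference) < threshold


reference = [(0.0, 0.5), (1.0, 0.5)]
print(emd_1d(reference, reference))                # 0.0
print(emd_1d([(0.0, 1.0)], [(1.0, 1.0)]))          # 1.0
print(is_mirror(reference, reference, threshold=0.1))  # True
```

The threshold would have to be tuned on known mirror and non-mirror pairs; the value used here is only a placeholder.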

Description

Technical field

[0001] The invention relates to the technical field of network information, and in particular to a method and system for discovering mirror websites based on visual similarity.

Background

[0002] A mirror site is a copy of a site's content. Mirror sites are often used to provide different sources for the same content, especially to provide a reliable network connection during heavy downloads. A mirror website differs little from the main website and can be regarded as a backup for it: if the main website cannot be accessed normally (for example, because a server goes down or another incident occurs), the content can still be browsed through other servers. Relatively speaking, the main site is slightly better than the mirror site in speed and other respects.

[0003] The most common are mirror sites. By duplicating the content of a website or webpage and assigning different domain name...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F11/14, G06F11/20, G06K9/62, H04L67/02
CPC: H04L67/02, G06F11/1464, G06F11/2056, G06F18/22
Inventor: 李睿, 杜翠兰, 李鹏霄, 张鹏, 陈志鹏, 杨兴东
Owner: INST OF INFORMATION ENG CHINESE ACAD OF SCI