
A method for detecting spam

A spam web page detection method, applied in the fields of natural language processing, information retrieval, and data mining. It addresses problems of clustering-based detection such as high time complexity and the strong influence of noise points on clustering.

Active Publication Date: 2020-07-03
TIANJIN UNIV

AI Technical Summary

Problems solved by technology

First, the number of clusters often cannot be determined in advance.
Second, the random selection of centers during initial clustering may lead to a polarized clustering result.
Third, noise points have a serious impact on clustering.
Fourth, repeated calculations give the method a higher time complexity.



Embodiment Construction

[0025] The present invention is further described below with reference to the accompanying drawings:

[0026] The invention provides a method for detecting spam web pages. Figure 1 shows the overall flow of the method, which includes:

[0027] Step S101: Apply the K-Means algorithm to the data set. All n objects are stored in the data set D, whose form is given by formula (1).

[0028] $$D = \{\, x_i \mid x_i = (x_{i1}, x_{i2}, \ldots, x_{id}),\ i = 1, 2, \ldots, n \,\} \tag{1}$$

[0029] In formula (1), $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ is a d-dimensional vector representing d different attributes of the i-th data object, and n is the sample size. The data set D used in this embodiment comes from the WEBSPAM-UK2007 data set, and the feature attributes are provided by the WebSpam Challenge platform at http://webspam.lip6.fr/wiki/pmwiki.php.
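As an illustration of formula (1), the data set D can be held as a list of n feature vectors of equal length d. The values below are made-up toy numbers, not WEBSPAM-UK2007 features, and the helper name `attribute` is hypothetical.

```python
# Toy stand-in for D in formula (1): n = 4 objects, d = 3 attributes each.
# The numbers are invented for illustration only.
D = [
    (0.12, 3.0, 0.40),   # x_1
    (0.80, 1.0, 0.05),   # x_2
    (0.33, 7.0, 0.90),   # x_3
    (0.51, 2.0, 0.25),   # x_4
]

n = len(D)      # sample size
d = len(D[0])   # number of attributes per object

# Every object must have the same dimensionality d.
assert all(len(x) == d for x in D)


def attribute(D, i, k):
    """Return x_{ik}, using the 1-based indices of formula (1)."""
    return D[i - 1][k - 1]
```

With this layout, `attribute(D, 2, 3)` reads the third attribute of the second object.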

[0030] Step S201: Perform IPR calculation on the data set D, and sort the IPR values from high t...


Abstract

The invention discloses a spam web page detection method. The method comprises: step 1, applying the K-Means algorithm to a data set and storing all n objects in a data set D; step 2, performing the IPR calculation on the data set D and sorting the IPR values from high to low; step 3, selecting the web pages with the maximum and minimum IPR values in the data set as the initial cluster centers C; step 4, calculating the distance dist(x_i, c_j) between each x_i in the data set D and each center c_j, and assigning x_i to the cluster whose center is closest to it; step 5, examining the cluster centers once assignment is finished and obtaining a new expression for each c_j; and step 6, repeating steps 4-6 with the SSE as the objective function, stopping the algorithm when the SSE reaches its minimum value, and obtaining the final clustering result, thereby identifying spam web pages. The method overcomes the shortcoming of conventional recommendation technology, which ignores web page importance when allocating link weights; combined with personalized web page ranking, it detects spam web pages by clustering.
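The six steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the IPR scores are taken as a precomputed input because their formula is not given in this excerpt, the distance is assumed to be Euclidean, the new center is assumed to be the cluster mean as in standard K-Means, and the loop stops when the SSE no longer decreases. The function name `kmeans_ipr_seeded` and its parameters are hypothetical.

```python
import math


def dist(x, c):
    """Euclidean distance between object x and center c (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def sse(D, centers, labels):
    """Sum of squared errors of each object to its assigned center."""
    return sum(dist(x, centers[j]) ** 2 for x, j in zip(D, labels))


def kmeans_ipr_seeded(D, ipr, max_iter=100, tol=1e-9):
    # Step 2: order object indices by IPR value, high to low.
    order = sorted(range(len(D)), key=lambda i: ipr[i], reverse=True)
    # Step 3: the pages with maximum and minimum IPR seed the two clusters.
    centers = [list(D[order[0]]), list(D[order[-1]])]
    prev = float("inf")
    for _ in range(max_iter):
        # Step 4: assign each object to the cluster with the nearest center.
        labels = [min(range(len(centers)), key=lambda j: dist(x, centers[j]))
                  for x in D]
        # Step 5: recompute each center as the mean of its members (assumption).
        for j in range(len(centers)):
            members = [x for x, lab in zip(D, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 6: stop once the SSE objective no longer decreases.
        cur = sse(D, centers, labels)
        if prev - cur <= tol:
            break
        prev = cur
    return labels, centers, cur


# Toy data: two well-separated groups with invented IPR scores.
toy_D = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
toy_ipr = [0.9, 0.8, 0.85, 0.1, 0.05, 0.2]
toy_labels, toy_centers, toy_sse = kmeans_ipr_seeded(toy_D, toy_ipr)
```

Cluster 0 is the one seeded by the highest-IPR page and cluster 1 the one seeded by the lowest-IPR page. Which cluster is reported as spam is not stated in this excerpt, so the sketch returns only the labels. Seeding from the two IPR extremes fixes the number of clusters at two and removes the random center selection that the problem statement identifies as a weakness of plain K-Means.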

Description

Technical field

[0001] The invention relates to the fields of data mining, natural language processing, and information retrieval, and to spam web page detection and web page clustering technology, in particular to a spam web page detection method based on web page authority.

Background

[0002] In the related art, recommendation technologies are mainly divided into two types. The first type is link-based recommendation, such as the PageRank algorithm. Its advantage is that authority is expressed numerically, so pages can be ranked from high to low. Because it quantifies web page quality, it is widely used to find spam web pages and provides a good criterion for judging web page authority.

[0003] The defects of the PageRank algorithm are mainly manifested in two aspects. On the one hand, it ignores the relevance of web page content. For example: if sp...
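For context, the link-based scoring that the background refers to can be sketched as the textbook PageRank power iteration. This is a minimal sketch of standard PageRank, not the IPR variant of the invention. The damping factor of 0.85, the iteration count, the even redistribution of rank from pages without outgoing links, and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iters=100):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives the same baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


# Hypothetical four-page link graph.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = pagerank(toy_links)
# Authority expressed numerically, then arranged from high to low.
toy_order = sorted(toy_rank, key=toy_rank.get, reverse=True)
```

Here `toy_order` lists the pages from most to least authoritative. Page "c" collects links from three other pages and ranks first, while "d" has no incoming links and ranks last.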


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9535, G06F16/35
CPC: G06F16/35
Inventor: 张亚平, 马舒婕, 于瑞国, 喻梅, 王建荣, 孟莹
Owner: TIANJIN UNIV