Method and device for identifying cheat web-pages

A method for recognizing cheating web pages, in the field of web page technology, applicable to special data processing applications, instruments, electrical digital data processing, and the like. It addresses the poor recognition of new kinds of cheating web pages and improves the recognition of such pages.

Status: Inactive. Publication date: 2013-06-12
人民搜索网络股份公司
Cites: 3. Cited by: 22.

AI Technical Summary

Problems solved by technology

[0010] The main purpose of the present invention is to provide a cheating web page identification method and device, to at least solve the problem in


Examples


Example Embodiment

[0055] Embodiment one

[0056] This preferred embodiment provides a method for identifying cheating webpages based on active learning and semi-supervised learning. Figure 6 is a flow chart of the steps of the cheating webpage identification method based on semi-supervised learning and active learning according to Embodiment 1 of the present invention. As shown in Figure 6, the method may include the following steps:

[0057] Step S602: Specify the webpage feature set F used. This step is mainly used to determine the features to be extracted from the webpage, including content features, structural features, link relationship features, and the like.
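As an illustration of what a feature set F of this kind might contain, the following is a minimal sketch that extracts a few content, structural and link features from raw HTML using only the Python standard library. The specific features (word count, most-frequent-word ratio, hidden element count, external link ratio) and the names `PageFeatureParser` and `extract_features` are assumptions chosen for the example; the text above does not enumerate the actual features.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageFeatureParser(HTMLParser):
    """Collects the raw counts needed for a few illustrative features."""

    def __init__(self):
        super().__init__()
        self.words = []        # text tokens outside <script>/<style>
        self.tag_count = 0     # number of opening tags (structural)
        self.hidden_count = 0  # elements with inline style display:none
        self.links = []        # href values of <a> tags (link relationship)
        self._skip = 0         # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style:
            self.hidden_count += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.lower().split())


def extract_features(html, page_host):
    """Return a dict of hypothetical features for one page."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    n_words = len(parser.words)
    # Frequency of the single most repeated token (a crude keyword-stuffing signal).
    top = max((parser.words.count(w) for w in set(parser.words)), default=0)
    external = sum(
        1 for href in parser.links
        if urlparse(href).netloc not in ("", page_host)
    )
    n_links = len(parser.links)
    return {
        "word_count": n_words,
        "top_word_ratio": top / n_words if n_words else 0.0,
        "tag_count": parser.tag_count,
        "hidden_element_count": parser.hidden_count,
        "link_count": n_links,
        "external_link_ratio": external / n_links if n_links else 0.0,
    }
```

Returning a dict keyed by feature name, rather than a bare list, keeps the extractor independent of the order in which features are later arranged into a vector.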

[0058] Step S604: Preprocessing the known webpage sample set S. The goal of this step is to convert each known webpage sample into a feature vector according to the feature set F determined in step S602, and divide the sample set S into two parts for model training and testing. It should be pointed out that the "known webp...

Example Embodiment

[0076] Embodiment two

[0077] The overall flow of the cheating webpage identification method based on active learning and semi-supervised learning proposed in this preferred embodiment is shown in Figure 6. Step S602 determines the webpage feature set F to be used. Step S604 preprocesses each webpage in the known webpage sample set S according to the feature set determined in step S602. Step S606 obtains a number of unlabeled webpage samples (denoted as set U). Step S608 trains a support vector machine model on sets S and U and uses the model to identify cheating webpages. Step S610 adds new features to the webpage feature set F. Steps S612 and S614 use active learning to add new samples to the webpage sample set S. The main steps are described in detail below.
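The preprocessing of step S604 (turning each known sample into a feature vector ordered by F, then splitting S into training and testing parts) might be sketched as follows. The feature names in `FEATURE_SET`, the 25% test fraction, the fixed random seed and the +1/-1 label encoding are all assumptions made for this example and are not taken from the text.

```python
import random

# Hypothetical feature set F; the order of names fixes the vector layout.
FEATURE_SET = ["word_count", "top_word_ratio", "external_link_ratio"]


def vectorize(page_features, feature_set):
    """Arrange one page's feature dict into a fixed-length vector following F.

    Features missing from the dict default to 0.0, so a vector can still be
    built after new features are appended to F (as in step S610).
    """
    return [float(page_features.get(name, 0.0)) for name in feature_set]


def preprocess(samples, feature_set, test_fraction=0.25, seed=0):
    """Vectorize labeled samples and split them into (train, test).

    samples: list of (feature_dict, label), label +1 = cheating, -1 = normal.
    """
    rows = [(vectorize(features, feature_set), label) for features, label in samples]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]
```

For example, `preprocess(samples, FEATURE_SET)` on eight labeled samples yields six training rows and two test rows, each a `(vector, label)` pair.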

[0078] Step S602: Determine the webpage feature set F used.

[0079] This step will clearly characterize the feature set of the webpage fro...

Example Embodiment

[0111] Embodiment three

[0112] This preferred embodiment provides a method for identifying cheating webpages, comprising the following steps:

Step S2: Determine the set of webpage features to be used, including the content features, structural features, link relationship features, etc. of a webpage.

Step S4: Preprocess the known webpage sample set, which includes vectorizing each webpage according to the webpage features determined in step S2 and dividing the sample set into a training part and a testing part.

Step S6: Obtain the unknown webpage sample set.

Step S8: Generate a recognition model from the known and unknown webpage samples using semi-supervised learning.

Step S10: Judge whether a given webpage is cheating according to the model, and handle it accordingly.

Step S12: Add new webpage features.

Step S14: Use an active learning method to add new known webpage samples.
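The active learning step (S14) is not specified in detail in the text above. One common strategy, assumed here purely for illustration, is uncertainty sampling: the unlabeled pages whose decision value lies closest to the separating boundary of the current linear model are sent to a human reviewer, and the labeled results are added to the known sample set. The function names and the `oracle` callback (standing in for the reviewer) are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of vector x under a linear model with weights w and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_for_labeling(w, b, unlabeled, k):
    """Return the indices of the k unlabeled vectors closest to the boundary."""
    ranked = sorted(
        range(len(unlabeled)),
        key=lambda i: abs(decision_value(w, b, unlabeled[i])),
    )
    return ranked[:k]


def add_labeled(known, unlabeled, indices, oracle):
    """Label the selected vectors via oracle, append them to known,
    and return the unlabeled vectors that remain."""
    chosen = set(indices)
    for i in indices:
        known.append((unlabeled[i], oracle(unlabeled[i])))
    return [x for i, x in enumerate(unlabeled) if i not in chosen]
```

Querying only the most uncertain pages keeps manual labeling effort small while targeting the samples most likely to change the model.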

[0113] Preferably, the above-mentioned step of preprocessing the set of known web page sampl...


Abstract

The invention discloses a method and a device for identifying cheating web pages. The method comprises the following steps: obtaining a set of known webpage samples, a known webpage sample being one for which it is known whether or not the page is a cheating web page; generating, from the set of known webpage samples, an initial support vector machine for judging cheating web pages; obtaining a set of a first preset number of unknown webpage samples, an unknown webpage sample being one for which it is not known whether or not the page is a cheating web page; adjusting the model parameters of the initial support vector machine according to the set of unknown webpage samples; and using the adjusted support vector machine to judge whether a web page to be detected is a cheating web page. The method addresses the problem that machine-learning-based cheating web page identification methods in the related art perform poorly on new kinds of cheating web pages, and it improves the identification of such pages.
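The sequence described in the abstract (train an initial support vector machine on known samples, adjust it with unknown samples, then classify) can be illustrated with a small pure-Python sketch. Two substitutions are assumed here and are not confirmed by the text: the SVM is a linear one trained by sub-gradient descent on the hinge loss, and the adjustment uses self-training (pseudo-labeling confident unknown samples and retraining), which is only one of several semi-supervised schemes. All function names, learning rates and thresholds are hypothetical.

```python
def decision(w, b, x):
    """Signed score of vector x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples, epochs=200, lr=0.1, reg=0.01, w=None, b=0.0):
    """Sub-gradient descent on the L2-regularised hinge loss.

    samples: list of (vector, label) with label +1 (cheating) or -1 (normal).
    Passing w and b continues training from an existing model.
    """
    dim = len(samples[0][0])
    w = list(w) if w is not None else [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            margin = y * decision(w, b, x)
            # Regulariser: shrink the weights on every step.
            w = [wi * (1 - lr * reg) for wi in w]
            # Hinge loss: update only when the sample violates the margin.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def adjust_with_unlabeled(w, b, known, unknown, threshold=1.0, rounds=3):
    """Self-training: pseudo-label confidently scored unknown samples,
    then retrain on known plus pseudo-labeled samples."""
    for _ in range(rounds):
        pseudo = [
            (x, 1 if decision(w, b, x) > 0 else -1)
            for x in unknown
            if abs(decision(w, b, x)) >= threshold
        ]
        if not pseudo:
            break
        w, b = train_linear_svm(known + pseudo, w=w, b=b)
    return w, b


def is_cheating(w, b, x):
    """Judge a page vector with the (adjusted) model."""
    return decision(w, b, x) > 0
```

A typical use would be `w, b = train_linear_svm(known)` followed by `w, b = adjust_with_unlabeled(w, b, known, unknown)` and then `is_cheating(w, b, vector)` for each page to be detected. Unknown samples scored near the boundary are left out of the pseudo-labeled set, which limits the risk of reinforcing the initial model's mistakes.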

Description

Technical field

[0001] The invention relates to the field of computer information retrieval, and in particular to a method and device for identifying cheating web pages.

Background

[0002] With the current explosive growth of Internet information, search engines have become one of the main entry points through which people reach the content they need on the Internet. The ranking of a web page in a search engine therefore affects the number of visits to that page to a considerable extent. To obtain more visits, and with them more economic benefit, websites want their pages to appear near the top of the results returned by search engines. The conventional way to improve a page's ranking is to improve the quality of the page, making its content more relevant to the user's query and better matched to the user's needs. However, some webpages adopt targeted deception methods according to the charac...


Application Information

IPC(8): G06F17/30
Inventor: 杨甲东
Owner: 人民搜索网络股份公司