Method and device for identifying cheat web-pages
A method and device for identifying cheating web pages, applied in special data processing applications, instruments, electrical digital data processing, and similar fields. It addresses the poor recognition of new types of cheating web pages and improves recognition performance.
Examples
Embodiment 1
[0056] This preferred embodiment provides a method for identifying cheating webpages based on active learning and semi-supervised learning. Figure 6 is a flow chart of the steps of this method according to Embodiment 1 of the present invention. As shown in Figure 6, the method may include the following steps:
[0057] Step S602: Specify the webpage feature set F used. This step is mainly used to determine the features to be extracted from the webpage, including content features, structural features, link relationship features, and the like.
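As a minimal sketch of what a feature set F might look like in practice, the following Python snippet extracts one content feature, one structural feature, and two link-related features from raw HTML. The feature names (`word_count`, `tag_count`, `outlink_count`, `links_per_word`) and the helper names are hypothetical illustrations and are not taken from the patent, which does not list its concrete features in this excerpt.

```python
from html.parser import HTMLParser

# Hypothetical feature set F: the names below are illustrative assumptions.
FEATURE_SET_F = ["word_count", "tag_count", "outlink_count", "links_per_word"]


class PageStats(HTMLParser):
    """Collects simple counts while parsing a page."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0    # structural feature: number of start tags
        self.link_count = 0   # link feature: number of <a href=...> tags
        self.words = []       # content feature: visible text tokens

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1

    def handle_data(self, data):
        self.words.extend(data.split())


def extract_features(html):
    """Convert one page into a feature vector ordered by FEATURE_SET_F."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    word_count = len(stats.words)
    values = {
        "word_count": float(word_count),
        "tag_count": float(stats.tag_count),
        "outlink_count": float(stats.link_count),
        "links_per_word": stats.link_count / word_count if word_count else 0.0,
    }
    return [values[name] for name in FEATURE_SET_F]


page = ('<html><body><p>cheap pills buy now</p>'
        '<a href="http://a.example">x</a>'
        '<a href="http://b.example">y</a></body></html>')
print(extract_features(page))
```

Keeping the feature names in a single ordered list means that adding a new feature later only requires extending that list and the `values` dictionary, so every page is vectorized in the same column order.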
[0058] Step S604: Preprocess the known webpage sample set S. This step converts each known webpage sample into a feature vector according to the feature set F determined in step S602, and divides the sample set S into two parts, one for model training and one for testing. It should be pointed out that the "known webpage" in this article refe...
Embodiment 2
[0077] The overall flow of the cheating webpage identification method based on active learning and semi-supervised learning proposed in this preferred embodiment is shown in Figure 6. Step S602 determines the webpage feature set F to be used. Step S604 preprocesses each webpage in the known webpage sample set S according to the feature set determined in step S602. Step S606 obtains a number of unlabeled webpage samples (denoted as set U). Step S608 trains a support vector machine model on sets S and U and uses the model to identify cheating webpages. Step S610 adds new features to the webpage feature set F. Steps S612 and S614 use active learning to add new samples to the webpage sample set S. The main steps are described in detail below.
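The idea of training one model from both the labeled set S and the unlabeled set U can be sketched as follows. This excerpt does not say which semi-supervised SVM variant the patent uses, so the sketch substitutes plain self-training around a small linear SVM fitted by subgradient descent. The function names, the two-dimensional toy vectors, and the confidence threshold are all assumptions made for illustration.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit a linear SVM by batch subgradient descent on the hinge loss.

    Labels are +1 (cheating) or -1 (normal). Returns (weights, bias).
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]   # gradient of the L2 regularizer
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # sample violates the margin
                for j in range(dim):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def decision(model, x):
    """Signed score: positive means 'cheating' under this sketch."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def self_train(S_X, S_y, U_X, rounds=3, threshold=0.5):
    """Self-training: pseudo-label confident samples from U, then retrain."""
    X, y, pool = list(S_X), list(S_y), list(U_X)
    model = train_linear_svm(X, y)
    for _ in range(rounds):
        confident = [x for x in pool if abs(decision(model, x)) >= threshold]
        if not confident:
            break
        for x in confident:
            X.append(x)
            y.append(1 if decision(model, x) > 0 else -1)
        pool = [x for x in pool if abs(decision(model, x)) < threshold]
        model = train_linear_svm(X, y)
    return model


# Toy feature vectors (assumed): [link density, keyword ratio].
S_X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
S_y = [1, 1, -1, -1]
U_X = [[0.85, 0.85], [0.15, 0.15], [0.5, 0.55]]
model = self_train(S_X, S_y, U_X)
print(decision(model, [0.95, 0.9]) > 0)
```

Self-training is only one way to use unlabeled data. Its main risk is that a wrong pseudo-label reinforces itself in later rounds, which is why the sketch only accepts samples whose score is well away from the decision boundary.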
[0078] Step S602: Determine the webpage feature set F used.
[0079] This step specifies the feature set of the webpage in terms of webpage ...
Embodiment 3
[0112] In this preferred embodiment, a method for identifying cheating webpages is provided, including the following steps. Step S2: specify the set of webpage features to be used, including content features, structural features, link relationship features, and the like. Step S4: preprocess the known webpage sample set, which includes vectorizing each webpage according to the webpage features specified in step S2 and dividing the sample set into a training part and a testing part. Step S6: obtain a set of unknown webpage samples. Step S8: generate a recognition model from the known and unknown webpage samples using semi-supervised learning. Step S10: use the model to judge whether a given webpage is cheating, and process it accordingly. Step S12: add new webpage features. Step S14: use active learning to add new known webpage samples.
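The active learning step (S14) can be illustrated with a short sketch. This excerpt does not specify how samples are chosen, so the code assumes uncertainty sampling: the unlabeled pages whose scores lie closest to the decision boundary of a linear model are sent for labeling and then added to the known sample set. The function names, the fixed model weights, and the `oracle` callback that stands in for a human annotator are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score of a linear model for feature vector x."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def select_queries(weights, bias, unlabeled, k=2):
    """Pick the k unlabeled vectors closest to the decision boundary."""
    ranked = sorted(unlabeled,
                    key=lambda x: abs(decision_value(weights, bias, x)))
    return ranked[:k]


def active_learning_round(weights, bias, S, U, oracle, k=2):
    """Query labels for the most uncertain samples and move them from U to S.

    S is a list of (vector, label) pairs, U a list of vectors, and
    oracle a callable that returns +1 or -1 for a vector (for example,
    a human reviewer in a real system).
    """
    queries = select_queries(weights, bias, U, k)
    new_S = S + [(x, oracle(x)) for x in queries]
    new_U = [x for x in U if x not in queries]
    return new_S, new_U


# Assumed toy model and data, for illustration only.
weights, bias = [1.0, 1.0], -1.0
S = [([0.9, 0.9], 1), ([0.1, 0.1], -1)]
U = [[0.9, 0.8], [0.5, 0.52], [0.2, 0.1], [0.45, 0.5]]


def oracle(x):
    # Stand-in for a human annotator: labels by a simple known rule.
    return 1 if x[0] + x[1] > 1 else -1


S, U = active_learning_round(weights, bias, S, U, oracle)
print(len(S), U)
```

Labeling only the samples the model is least sure about keeps annotation cost low, since pages that are already far from the boundary would add little new information.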
[0113] Preferably, the above-mentioned step of preprocessing the set of known web page samples may include: converting ...