A website classification method based on the comprehensive characteristics of darknet websites

The method applies comprehensive website features to network data analysis. It addresses the rising cost of manual maintenance and the difficulty of meeting users' needs for darknet website classification, and it achieves high classification accuracy at reduced cost.

Active Publication Date: 2021-06-22
INST OF INFORMATION ENG CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0005] At present, darknet classification relies mostly on manual maintenance, which ensures classification accuracy. However, as the number of darknet websites grows, the cost of manual maintenance rises sharply, making it difficult to meet users' needs for darknet website classification.


Examples


Embodiment Construction

[0029] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0030] The processing method of the present invention is as follows:

[0031] Step 1: Crawl the labeled websites (as shown in Figure 1):

[0032] (1) Crawl the labeled websites with Scrapy, checking the current crawl depth while crawling, and crawl only webpages at a depth less than or equal to 2.

[0033] (2) Manually review the labels and remove incorrectly labeled samples.
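
The depth check in (1) can be illustrated with a minimal standard-library sketch. It is not the patent's crawler: the in-memory link graph `LINKS`, the example `.onion` paths, and the function name `crawl_with_depth_limit` are hypothetical, and the start page is assumed to be at depth 0. In Scrapy itself, a similar limit is usually configured through the `DEPTH_LIMIT` setting rather than coded by hand.

```python
from collections import deque


def crawl_with_depth_limit(start_url, get_links, max_depth=2):
    """Breadth-first crawl that records only pages at depth <= max_depth.

    get_links is any callable returning the outgoing links of a URL;
    here it stands in for fetching and parsing a real page.
    """
    depth_of = {start_url: 0}          # assumption: the start page is depth 0
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth == max_depth:
            continue                   # do not follow links past the limit
        for link in get_links(url):
            if link not in depth_of:   # skip pages already scheduled
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of


# Hypothetical link graph used instead of real network requests.
LINKS = {
    "a.onion/": ["a.onion/market", "a.onion/forum"],
    "a.onion/market": ["a.onion/market/item1"],
    "a.onion/market/item1": ["a.onion/market/item1/reviews"],
    "a.onion/forum": [],
}

pages = crawl_with_depth_limit("a.onion/", lambda u: LINKS.get(u, []))
```

With this graph, the reviews page sits at depth 3 and is therefore never scheduled.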

[0034] Step 2: Obtain the comprehensive features of the website (as shown in Figure 2):

[0035] (1) Use the bag-of-words model to construct the word space vector model of the website, and use the TfidfVectorizer class in Python's scikit-learn library to calculate the TF-IDF value of each word.

[0036] (2) Extract the words in the Keyword (keywords in the HTML meta tag), Description (webpage description in the HTML meta tag), and Title (HTML title) tags and give them a weight of 0.6; all other words have a weight of 0.4, ...
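
The two sub-steps above can be sketched together in plain Python. This is an illustration under stated assumptions, not the patent's implementation: the source text is truncated before it says how the 0.6 and 0.4 weights are applied, so the sketch assumes each word's TF-IDF score is simply multiplied by 0.6 if the word occurs in the Keyword, Description, or Title fields and by 0.4 otherwise. The TF-IDF here is a hand-written variant with a smoothed IDF, used so the example runs without scikit-learn; its values will differ from those of `TfidfVectorizer`. The names `tf_idf`, `text_features`, and the sample tokens are hypothetical.

```python
import math
from collections import Counter

META_WEIGHT = 0.6  # words from Keyword/Description/Title (value from the text)
BODY_WEIGHT = 0.4  # all other words (value from the text)


def tf_idf(docs):
    """docs: list of token lists -> list of {term: score} dicts.

    Uses relative term frequency and a smoothed IDF; this is an
    illustrative variant, not scikit-learn's exact formula.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def text_features(sites):
    """sites: list of (meta_tokens, body_tokens) pairs, one per website."""
    scores = tf_idf([meta + body for meta, body in sites])
    features = []
    for (meta, _), vec in zip(sites, scores):
        meta_terms = set(meta)
        # Assumed weighting rule: scale each score by its field weight.
        features.append({
            term: score * (META_WEIGHT if term in meta_terms else BODY_WEIGHT)
            for term, score in vec.items()
        })
    return features


# Hypothetical, already-tokenized sample sites.
sites = [
    (["drug", "market"], ["buy", "drug", "bitcoin"]),
    (["hacker", "forum"], ["exploit", "forum", "bitcoin"]),
]
features = text_features(sites)
```

In this sample, "drug" appears in the meta fields of the first site and only in that site, so it ends up with a larger feature value than "bitcoin", which appears in the body of both sites.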


Abstract

The invention discloses a website classification method based on the comprehensive characteristics of darknet websites. The method is as follows: 1) crawl the target darknet websites to obtain a labeled darknet website training set; 2) extract the information of each website in the set and segment it into words, construct the word space vector of the website, and calculate the weight of each word; the space vector obtained by multiplying each word by its weight is used as the text feature of the website; 3) extract the tags of each website in the darknet website training set, construct the tag space vector of the website, and calculate the weight of each tag; the space vector obtained by multiplying each tag by its weight is used as the structural feature of the website; 4) combine the text feature and the structural feature of each website to obtain the comprehensive feature of the website; 5) train on the comprehensive features of the websites to obtain a classification model, then use the classification model to predict the category of a website to be classified. The invention improves website classification efficiency.
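
Steps 3 and 4 can be sketched as follows, assuming that the structural feature is built from HTML tag frequencies and that "combining" means concatenating the two vectors; the abstract does not spell out either detail. The tag vocabulary `TAG_VOCAB`, the relative-frequency weighting, the sample page, and the helper names are all hypothetical. Classifier training in step 5 is left out because the abstract does not name the algorithm.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def structural_feature(html, tag_vocab):
    """Return one value per tag in tag_vocab.

    Assumed weighting: the tag's share of all start tags in the page.
    """
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values()) or 1
    return [parser.counts[tag] / total for tag in tag_vocab]


def combine(text_vec, struct_vec):
    """Assumed combination rule: concatenate text and structural vectors."""
    return list(text_vec) + list(struct_vec)


TAG_VOCAB = ["a", "form", "img", "input"]  # hypothetical tag vocabulary
page = (
    '<html><body><a href="/x">x</a><a href="/y">y</a>'
    '<form><input name="q"></form></body></html>'
)
struct = structural_feature(page, TAG_VOCAB)
combined = combine([0.34, 0.08], struct)   # text values are placeholders
```

The sample page contains six start tags, two of which are links, so the first structural component is 2/6. A fixed tag vocabulary keeps every site's vector the same length, which any downstream classifier would need.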

Description

Technical field

[0001] The invention belongs to the field of network data analysis and relates to a website classification method based on the comprehensive characteristics of darknet websites.

Background art

[0002] A darknet is a private network in which connections are made between trusted nodes using non-standard protocols and ports, and data transmission on a darknet is anonymous (Wikipedia). Typical darknet technologies today include Tor, I2P, Freenet, OneSwarm, etc.

[0003] The defining feature of the darknet is anonymous data transmission for privacy protection. Because of this anonymity, the darknet is often used to transmit various kinds of sensitive information. For example, the darknet contains a large amount of information on extremism, drugs, and gun transactions. It is also a gathering place for hacker-related information: the darknet hosts many hacker forums and hacker markets, where hacker information such ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/35
CPC: G06F16/35, G06F16/951
Inventor: 谭庆丰, 时金桥, 王学宾, 尹泽林, 李抗, 蒋晓明, 陈牧谦, 高悦
Owner: INST OF INFORMATION ENG CHINESE ACAD OF SCI