
Dark network data collection and extraction system and method

A data acquisition and extraction system technology, applied in the field of Internet information

Active Publication Date: 2018-03-16
HARBIN INST OF TECH AT WEIHAI +1
12 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

However, that patent mainly targets dynamic data acquisition from the Deep Web described above. Pages in the Deep Web have no fixed links and can only be reached by constructing a dynamic query request; once such a request has been constructed, a conventional crawler can crawl them directly.



Examples


Embodiment 1

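[0089] A dark web data collection and extraction system, as shown in Figure 1, comprises a dark web site discovery module, a dark web data crawling module, a page parsing and content extraction module, and a data fusion and storage module, connected in sequence;

[0090] The dark web site discovery module obtains dark web URLs from multi-source data and sends them to the dark web data crawling module;

[0091] The dark web data crawling module configures the Tor service and modifies the Nutch configuration so that Nutch communicates over the SOCKS protocol; further development on top of Nutch addresses form-based login and cookie-based login;

[0092] The page parsing and content extraction module performs page parsing, page vectorization, signature generation, similarity calculation and template set updating, and page content extraction;

[0093] The content extracted by the page parsing and content extraction module is at data-item granularity. The data fusion and storage module reorganizes the extracted content using a data alignment strategy and fuses data records with similar content before storing them in the database.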
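The similarity calculation and template set update in [0092] can be illustrated with a small Python sketch. The excerpt does not say how pages are vectorized or compared, so everything here is an assumption made for illustration: pages are represented by counts of their opening HTML tags, similarity is the cosine of two such vectors, and the 0.9 threshold and the names `vectorize`, `cosine`, and `match_or_add` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: opening-tag counts stand in for the page vectorization step.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def vectorize(html):
    """Count opening tags by lower-cased name."""
    return Counter(tag.lower() for tag in TAG_RE.findall(html))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_or_add(templates, html, threshold=0.9):
    """Return the index of the most similar template.

    If no stored template reaches the (assumed) threshold, the page's
    vector is appended as a new template and its index is returned.
    """
    vec = vectorize(html)
    best_idx, best_sim = -1, 0.0
    for i, tmpl in enumerate(templates):
        sim = cosine(vec, tmpl)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx >= 0 and best_sim >= threshold:
        return best_idx
    templates.append(vec)
    return len(templates) - 1
```

With this sketch, two listing pages that share the same tag structure map to the same template index, while a structurally different page starts a new template. A real system would likely use richer structural features than raw tag counts.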

Embodiment 2

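[0095] A method for dark web data collection and extraction, as shown in Figures 2 and 3. This embodiment takes the crawling of drug-related data on the dark web as an example. The method comprises:

[0096] (1) Obtaining dark web URLs manually or automatically;

[0097] Automatic acquisition of dark web URLs means searching the surface web and the dark web for links containing the ".onion" domain; it is carried out by sensitive-word query or by site monitoring;

[0098] Sensitive-word query comprises:

[0099] A. Building a sensitive-word lexicon from sensitive information found on dark web marketplace sites; for drug data on the dark web, the lexicon contains drug types such as cannabis, methamphetamine, pethidine (Dolantin), cocaine, and coca;

[0100] B. Using the keywords in the sensitive-word lexicon as queries, crawling the result pages returned by a search engine, and taking the first 10 pages of the result list as the pages to be examined;

[0101] C. Designing regular expressions to extract the URLs from the pages to be examined, and storing the extracted URLs, after deduplication, in the URL storage list.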

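[0102] The regular expressions look for links in the HTML whose domain suffix is ".onion". The procedure is to first extract all links in the HTML and then keep only those whose domain suffix is ".onion". The two simple regular expressions below only illustrate the approach used to extract URLs.

[0103] For example, to obtain the links in the HTML:

[0104] Pattern="

[0105] To filter the links that meet the requirement:

[0106] Pattern="(.*\.onion)|(.*\.onion/.*)"

[0107] The URL storage list is a simple database table with two columns: an index number and a URL.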
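The two-stage extraction in [0102] to [0107] can be sketched in Python as follows. The link-extraction pattern in [0104] is truncated in this copy, so the `href` pattern below is an assumption, not the patent's own expression. The filter pattern is the one given in [0106], written without stray spaces. The function names and the in-memory list of `(index, url)` rows are hypothetical stand-ins for the two-column URL storage list.

```python
import re

# Assumption: a simple href matcher replaces the truncated pattern in [0104].
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Filter pattern from [0106]: a link ending in .onion, or .onion followed by a path.
ONION_RE = re.compile(r"(.*\.onion)|(.*\.onion/.*)")


def extract_onion_urls(html):
    """Extract all links, then keep the .onion ones, deduplicated in order."""
    found = []
    for link in HREF_RE.findall(html):
        if ONION_RE.fullmatch(link) and link not in found:
            found.append(link)
    return found


def store_urls(url_table, urls):
    """Append unseen URLs to a list of (index, url) rows and return it."""
    seen = {url for _, url in url_table}
    for url in urls:
        if url not in seen:
            url_table.append((len(url_table) + 1, url))
            seen.add(url)
    return url_table
```

Note that the filter pattern is applied to the whole link string, so a surface web link that merely contains ".onion/" in its path or query would also pass. Checking the parsed host name would be stricter, at the cost of departing from the pattern shown in the text.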

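[0108] Site monitoring comprises:

[0109] a. Defining a list of monitored websites, for example certain religious websites, social networking sites, and forums. In addition, some surface web sites, such as https://www.deepdotweb.com, publish dark web sites that have already been discovered; these sites are also crawl targets and are added to the monitored list.

[0110] b. Setting a crawl interval T and crawling the websites on the monitored list once every interval T; T is one week;

[0111] c. Parsing the content of all pages, extracting all links that meet the requirement using regular expressions, and storing the URLs after deduplication;

[0112] d. Some surface web sites, such as https://www.deepdotweb.com, publish dark web sites that have already been discovered; these sites are also crawl targets, and all URLs are deduplicated and stored after crawling.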
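The interval-based monitoring in steps a to d can be sketched as a small scheduling check in Python. Only the one-week interval T comes from the text; the function name `sites_due`, the dictionary of last-crawl timestamps, and the rule that never-crawled sites are always due are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# T = one week, as stated in [0110].
CRAWL_INTERVAL = timedelta(weeks=1)


def sites_due(last_crawled, now, interval=CRAWL_INTERVAL):
    """Return the monitored sites that should be crawled now.

    last_crawled maps each monitored site to the datetime of its last
    crawl, or None if it has never been crawled (assumed convention).
    """
    due = []
    for site, last in last_crawled.items():
        if last is None or now - last >= interval:
            due.append(site)
    return due


# Example usage with one site from the text and one hypothetical site.
monitored = {
    "https://www.deepdotweb.com": None,
    "https://forum.example": datetime(2018, 3, 10),
}
print(sites_due(monitored, datetime(2018, 3, 16)))
```

Each site returned here would then be fetched, parsed, and passed through the same link extraction and deduplication used for sensitive-word query results.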

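[0113] Manual acquisition of dark web URLs comprises:

[0114] D. Building a sensitive-word lexicon from sensitive information found on dark web marketplace sites;

[0115] E. On the dark web, based on the sensitive-word lexicon, using...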


Abstract

The present invention relates to a dark network data collection and extraction system and method. The system comprises a dark web site discovery module, a dark web data crawling module, a page parsing and content extraction module, and a data fusion and storage module, connected in sequence. The dark web site discovery module obtains dark web URLs from multi-source data; the dark web data crawling module configures the Tor service, modifies the Nutch configuration so that Nutch communicates over the SOCKS protocol, and crawls data; the page parsing and content extraction module carries out page parsing, page vectorization, signature generation, similarity calculation, template set updating, and page content extraction; and the data fusion and storage module reorganizes the extracted content using a data alignment strategy and fuses data records with similar content into a database. The system and method implement a complete workflow from page saving to content extraction, and provide data support for discovering illegal trading activity on the dark web and for building a dark web knowledge graph.

Description

Technical field

[0001] The invention relates to a dark network data collection and extraction system and method, and belongs to the technical field of Internet information.

Background technique

[0002] The Tor (The Onion Router) network provides users with anonymity services. While protecting privacy, it also facilitates criminal activity: many websites openly advertise drugs, guns and ammunition, and other prohibited items for sale. The Dark Web differs from the Surface Web and the Deep Web. The Surface Web consists of pages that search engines can crawl directly, and the Deep Web consists of pages that must be accessed through dynamic requests. Tor achieves complete anonymity through a three-hop routing mechanism. When the Tor browser is used to access the dark web, traffic passes through three relay nodes before reaching the final destination server. The entry node knows the user's IP address, and the exit node knows the destination server's IP address and the transmitted d...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F17/30
CPC: G06F16/2228, G06F16/2471, G06F16/258, G06F16/283, G06F16/951
Inventors: 程国标, 王佰玲, 刘扬, 王巍, 孙云霄, 辛国栋
Owner: HARBIN INST OF TECH AT WEIHAI