Distributed web crawler performance optimization system for mass data acquisition

A distributed-network and mass-data technology, applied in the field of distributed web crawler performance optimization systems. It addresses the problems of low deduplication efficiency, excessive consumption of server memory resources, and junk links that cannot be effectively eliminated, with the effect of improving deduplication efficiency, breaking through the crawling performance bottleneck, and improving crawling performance.

Status: Inactive; Publication Date: 2020-03-06
北京京航计算通讯研究所


Problems solved by technology

[0007] The technical problem to be solved by the present invention is how to overcome the shortcomings of existing distributed web crawlers based on the Redis in-memory database when they face massive data collection: deduplication efficiency is low, server memory resources are consumed excessively, and junk links cannot be effectively eliminated.



Detailed Description of the Embodiments

[0049] In order to make the purpose, content, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings and examples.

[0050] In order to solve the above technical problems, the present invention provides a distributed web crawler performance optimization system for mass data collection. The distributed web crawler performance optimization system includes an initialization module and a crawling module, wherein:

[0051] The initialization module is used to create a deduplication character string and a junk link feature character string;
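
The patent text does not spell out the data structures behind these two strings. The following is a minimal sketch, assuming the deduplication character string is a Redis string used as a bitmap (Bloom-filter style, addressed with SETBIT/GETBIT) and the junk link feature string is a single delimiter-separated Redis string of URL patterns; key names, bitmap size, and patterns are illustrative, not taken from the patent.

```python
# Hypothetical initialization-module sketch (key names, bitmap size and
# junk patterns are assumptions, not from the patent text).
import redis

DEDUP_KEY = "crawler:dedup_bitmap"    # Redis string used as a bitmap
JUNK_KEY = "crawler:junk_features"    # Redis string of junk-link patterns
BITMAP_BITS = 1 << 27                 # ~16 MB of bits, an assumed size


def init_crawler_state(r: redis.Redis) -> None:
    """Create the deduplication string and the junk link feature string."""
    if not r.exists(DEDUP_KEY):
        # Touching the highest bit makes Redis allocate the whole
        # zero-filled string up front.
        r.setbit(DEDUP_KEY, BITMAP_BITS - 1, 0)
    if not r.exists(JUNK_KEY):
        # Example junk-link features; a real deployment would tune these.
        r.set(JUNK_KEY, "|".join(["logout", "javascript:void", ".css", ".ico"]))


if __name__ == "__main__":
    init_crawler_state(redis.Redis(host="localhost", port=6379, db=0))
```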

[0052] The crawling module is used, after the master node crawler reads the initial URL address, to crawl the initial URL address and generate a URL task queue;
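
Continuing the sketch above, the master-node side of the crawling module might fetch the initial URL, extract links, drop junk links, test each candidate against the bitmap with a few hash probes, and push survivors onto a Redis list serving as the URL task queue. The requests/BeautifulSoup calls and the hashing scheme are assumptions for illustration only.

```python
# Hypothetical master-node sketch: seed the URL task queue from the
# initial URL (key names, hash scheme and libraries are assumptions).
import hashlib
import redis
import requests
from bs4 import BeautifulSoup

DEDUP_KEY = "crawler:dedup_bitmap"
JUNK_KEY = "crawler:junk_features"
TASK_QUEUE = "crawler:url_tasks"
BITMAP_BITS = 1 << 27


def _bit_offsets(url: str, k: int = 3):
    # Derive k bit positions from the URL's MD5 digest.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return [int(digest[i * 8:(i + 1) * 8], 16) % BITMAP_BITS for i in range(k)]


def seen_before(r: redis.Redis, url: str) -> bool:
    offsets = _bit_offsets(url)
    if all(r.getbit(DEDUP_KEY, off) for off in offsets):
        return True               # probably seen already (Bloom-style check)
    for off in offsets:
        r.setbit(DEDUP_KEY, off, 1)
    return False


def is_junk(r: redis.Redis, url: str) -> bool:
    features = (r.get(JUNK_KEY) or b"").decode("utf-8").split("|")
    return any(f and f in url for f in features)


def seed_task_queue(r: redis.Redis, initial_url: str) -> None:
    html = requests.get(initial_url, timeout=10).text
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = a["href"]
        if url.startswith("http") and not is_junk(r, url) and not seen_before(r, url):
            r.lpush(TASK_QUEUE, url)
```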

[0053] The crawling module is also used to crawl web pages according to the URL task queue to complete the crawling work.
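
A corresponding worker loop for the slave-node crawlers, under the same assumed key names, could simply block-pop URLs from the shared Redis list; page parsing, data storage, and re-enqueueing of newly found links are elided here.

```python
# Hypothetical slave-node worker sketch (key name, timeout and stop
# condition are illustrative assumptions).
import redis
import requests

TASK_QUEUE = "crawler:url_tasks"


def crawl_worker(r: redis.Redis) -> None:
    while True:
        item = r.brpop(TASK_QUEUE, timeout=30)
        if item is None:          # queue stayed empty: crawling work is done
            break
        _, raw_url = item
        try:
            page = requests.get(raw_url.decode("utf-8"), timeout=10)
        except requests.RequestException:
            continue
        # ... parse page.text, store the data, and enqueue new URLs after
        # junk filtering and deduplication (see the sketches above) ...


if __name__ == "__main__":
    crawl_worker(redis.Redis(host="localhost", port=6379, db=0))
```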

[0054] Wherein...



Abstract

The invention belongs to the technical field of software engineering, and particularly relates to a distributed web crawler performance optimization system for mass data acquisition. In the system, an initialization module is used for newly establishing a deduplication character string and a junk link feature character string. The master node crawler is used for reading the initial URL address, and the crawling module crawls the initial URL address to generate a URL task queue. The crawling module is used for crawling webpages according to the URL task queue to finish the crawling work. Compared with the prior art, the crawling performance bottleneck of the distributed web crawler is broken through, and the crawling performance is improved by 50% or more. The deduplication efficiency of the URL task queue is improved, and the efficiency requirement of mass data collection is met. The storage space of the URL task queue is optimized, and server memory resources are greatly saved. A junk link filtering step is added, so that server memory resources are saved and crawler efficiency is remarkably improved.

Description

Technical Field

[0001] The invention belongs to the technical field of software engineering, and in particular relates to a distributed web crawler performance optimization system for mass data collection.

Background Art

[0002] Web crawlers, also known as web spiders, web ants, or web robots, can automatically obtain data from the web according to set rules. Distributed web crawlers can efficiently obtain large-scale data sets, are widely used in search engines and big data analysis, and have become an important tool for massive data collection.

[0003] Distributed web crawlers usually include a master node crawler and multiple slave node crawlers, and use the Redis in-memory database to persist the URL task queue and the deduplication queue. The master node crawler crawls webpages according to the initial URL (Uniform Resource Locator), obtains data as well as new URLs, deduplicates the new URLs, and puts them into the URL task queue; the slave node cra...
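
For contrast with the invention, the conventional deduplication step described in this background section can be sketched as follows: every newly discovered URL is checked against a Redis set before being pushed onto the task queue, so memory use grows with the full text of every URL ever seen, which is the cost the invention aims to cut. Key names are illustrative assumptions.

```python
# Sketch of the prior-art style deduplication described above: a Redis
# set of full URLs (key names are illustrative assumptions).
import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def enqueue_if_new(url: str) -> bool:
    # SADD returns 1 only when the URL was not already in the set.
    if r.sadd("crawler:seen_urls", url):
        r.lpush("crawler:url_tasks", url)
        return True
    return False
```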


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/955
Inventor: 王维纲张郭秋晨张凯云吴志成吴艳林纪纲孙鹏陈卓
Owner: 北京京航计算通讯研究所