Distributed web crawler performance optimization system for mass data acquisition

A distributed web crawler performance optimization system for mass data acquisition, in the field of distributed network and mass data technology. It addresses low URL deduplication efficiency, excessive consumption of server memory resources, and junk links that cannot be effectively eliminated, with the effect of improving deduplication efficiency, breaking through performance bottlenecks, and improving crawling performance.
CN110866166A · Inactive · Publication Date: 2020-03-06 · 北京京航计算通讯研究所

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
北京京航计算通讯研究所
Publication Date
2020-03-06
Estimated Expiration
Not applicable · inactive patent

Abstract

The invention belongs to the technical field of software engineering, and particularly relates to a distributed web crawler performance optimization system for mass data acquisition. In the system, an initialization module creates a deduplication string and a junk-link feature string. The master node crawler reads the initial URL address, and the crawling module crawls the initial URL address to generate a URL task queue. The crawling module then crawls webpages according to the URL task queue to complete the crawling work. Compared with the prior art, the crawling performance bottleneck of the distributed web crawler is broken through, and crawling performance is improved by 50% or more. The deduplication efficiency of the URL task queue is improved, meeting the efficiency requirement of mass data collection. The storage space of the URL task queue is optimized, greatly saving server memory resources. A junk-link filtering step is added, which saves further server memory resources and markedly improves crawler efficiency.

Description

technical field

[0001] The invention belongs to the technical field of software engineering, and in particular relates to a distributed web crawler performance optimization system for mass data collection.

Background technique

[0002] Web crawlers, also known as web spiders, web ants, or web robots, automatically obtain data from the web according to set rules. Distributed web crawlers can efficiently obtain large-scale data sets, are widely used in search engines and big data analysis, and have become an important tool for massive data collection.

[0003] Distributed web crawlers usually include a master node crawler and multiple slave node crawlers, and use the Redis in-memory database to persist the URL task queue and the deduplication queue. The master node crawler crawls the webpage at the initial URL (Uniform Resource Locator), obtains data along with new URLs, deduplicates the new URLs, and puts them into the URL task queue; the slave node cra...
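As a rough illustration of the master-node behaviour described in [0003], the following minimal Python sketch deduplicates newly found URLs before placing them on a task queue. It is not the patented method: the class name `UrlFrontier`, its methods, and the example URLs are hypothetical, a plain `set` and `deque` stand in for the Redis-backed deduplication queue and URL task queue, and stripping the `#fragment` before comparison is an added assumption.

```python
from collections import deque
from urllib.parse import urldefrag


class UrlFrontier:
    """Hypothetical in-memory stand-in for the Redis-backed queues."""

    def __init__(self):
        self.seen = set()     # plays the role of the deduplication queue
        self.tasks = deque()  # plays the role of the URL task queue

    def add(self, url):
        """Enqueue url if it has not been seen; return True when enqueued."""
        # Assumption: URLs differing only by "#fragment" point to the same page.
        key = urldefrag(url).url
        if key in self.seen:
            return False
        self.seen.add(key)
        self.tasks.append(key)
        return True

    def next_task(self):
        """Pop the oldest pending URL, or None when the queue is empty."""
        return self.tasks.popleft() if self.tasks else None


frontier = UrlFrontier()
for link in [
    "http://example.com/a",
    "http://example.com/a#top",  # duplicate of the first once the fragment is dropped
    "http://example.com/b",
]:
    frontier.add(link)
```

In a distributed deployment the set and queue would live in a shared store so that slave node crawlers can take tasks from the same queue; the sketch keeps everything in one process only to stay self-contained.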

Claims
