Distributed web crawler performance optimization system for mass data acquisition
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- 北京京航计算通讯研究所
- Publication Date
- 2020-03-06
- Estimated Expiration
- Not applicable · inactive patent
Figures
- Figure 1
- Figure 2
- Figure 3
Abstract
Description
Technical Field
[0001] The invention belongs to the technical field of software engineering, and in particular relates to a distributed web crawler performance optimization system for mass data collection.
Background Art
[0002] Web crawlers, also known as web spiders, web ants, or web robots, automatically obtain data from the web according to preset rules. Distributed web crawlers can efficiently collect large-scale data sets; they are widely used in search engines and big data analysis and have become an important tool for massive data collection.
[0003] A distributed web crawler usually comprises a master node crawler and multiple slave node crawlers, and uses the Redis in-memory database to persist the URL task queue and the deduplication queue. The master node crawler crawls web pages starting from the initial URL (Uniform Resource Locator), obtains data together with new URLs, deduplicates the new URLs, and puts them into the URL task queue; the slave node cra...
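The master-side flow described in [0003] (crawl a page, collect new URLs, deduplicate them, put them into the task queue) can be illustrated with a minimal sketch. This is not the patented implementation: the class `UrlFrontier`, the function `crawl_step`, and the `FAKE_WEB` link table are hypothetical names introduced here, and a plain in-memory queue and set stand in for the Redis-backed URL task queue and deduplication queue so that the example runs without a server.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


class UrlFrontier:
    """In-memory stand-in (assumption) for the shared URL task queue and dedup queue."""

    def __init__(self):
        self.tasks = deque()  # plays the role of the URL task queue
        self.seen = set()     # plays the role of the deduplication queue

    def push(self, url):
        """Enqueue url only if it was never seen; return True if it was enqueued."""
        url, _fragment = urldefrag(url)  # page#a and page#b count as the same page
        if url in self.seen:
            return False
        self.seen.add(url)
        self.tasks.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self.tasks.popleft() if self.tasks else None


def crawl_step(frontier, fetch_links):
    """Take one URL, 'crawl' it, and feed the discovered links back into the frontier."""
    url = frontier.pop()
    if url is None:
        return None
    for link in fetch_links(url):
        frontier.push(urljoin(url, link))  # resolve relative links before dedup
    return url


# Hypothetical link graph standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": ["/a", "/b", "/a#top"],
    "http://example.com/a": ["/", "/b"],
    "http://example.com/b": [],
}

frontier = UrlFrontier()
frontier.push("http://example.com/")  # the initial URL
visited = []
while (url := crawl_step(frontier, lambda u: FAKE_WEB.get(u, []))) is not None:
    visited.append(url)
```

Each page is visited once even though `/a` and `/b` are linked from several places, because the seen-set check happens before a URL enters the task queue. In a deployment like the one the paragraph describes, the queue and the set would live in Redis so that several crawler nodes could share them; that wiring is omitted here.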