Distributed web crawler performance optimization system for mass data acquisition

A distributed-network and mass-data technology, applied in the field of distributed web crawler performance optimization systems. It addresses the problems of low deduplication efficiency, excessive consumption of server memory resources, and junk links that cannot be effectively eliminated, with the effect of improving deduplication efficiency, breaking through the crawling performance bottleneck, and improving crawling performance.

Status: Inactive; Publication Date: 2020-03-06
北京京航计算通讯研究所


Problems solved by technology

[0007] The technical problem to be solved by the present invention is how to overcome the shortcomings of existing distributed web crawlers based on the Redis in-memory database when they face massive data collection: deduplication efficiency is low, server memory resources are consumed excessively, and junk links cannot be effectively eliminated.



Detailed Description of the Embodiments

[0049] In order to make the purpose, content, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings and examples.

[0050] In order to solve the above technical problems, the present invention provides a distributed web crawler performance optimization system for mass data collection. The distributed web crawler performance optimization system includes an initialization module and a crawling module, wherein:

[0051] The initialization module is used to create a deduplication character string and a junk link feature character string;
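
The patent text does not spell out the data structures behind these two strings. The following is a minimal sketch, assuming the deduplication character string is a Redis string used as a bitmap (Bloom-filter style, addressed with SETBIT/GETBIT) and the junk link feature string is a single delimiter-separated Redis string of URL patterns; key names, bitmap size, and patterns are illustrative, not taken from the patent.

```python
# Hypothetical initialization-module sketch (key names, bitmap size and
# junk patterns are assumptions, not from the patent text).
import redis

DEDUP_KEY = "crawler:dedup_bitmap"    # Redis string used as a bitmap
JUNK_KEY = "crawler:junk_features"    # Redis string of junk-link patterns
BITMAP_BITS = 1 << 27                 # ~16 MB of bits, an assumed size


def init_crawler_state(r: redis.Redis) -> None:
    """Create the deduplication string and the junk link feature string."""
    if not r.exists(DEDUP_KEY):
        # Touching the highest bit makes Redis allocate the whole
        # zero-filled string up front.
        r.setbit(DEDUP_KEY, BITMAP_BITS - 1, 0)
    if not r.exists(JUNK_KEY):
        # Example junk-link features; a real deployment would tune these.
        r.set(JUNK_KEY, "|".join(["logout", "javascript:void", ".css", ".ico"]))


if __name__ == "__main__":
    init_crawler_state(redis.Redis(host="localhost", port=6379, db=0))
```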

[0052] The crawling module is used, after the master node crawler reads the initial URL address, to crawl the initial URL address and generate a URL task queue;
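
Continuing the sketch above, the master-node side of the crawling module might fetch the initial URL, extract links, drop junk links, test each candidate against the bitmap with a few hash probes, and push survivors onto a Redis list serving as the URL task queue. The requests/BeautifulSoup calls and the hashing scheme are assumptions for illustration only.

```python
# Hypothetical master-node sketch: seed the URL task queue from the
# initial URL (key names, hash scheme and libraries are assumptions).
import hashlib
import redis
import requests
from bs4 import BeautifulSoup

DEDUP_KEY = "crawler:dedup_bitmap"
JUNK_KEY = "crawler:junk_features"
TASK_QUEUE = "crawler:url_tasks"
BITMAP_BITS = 1 << 27


def _bit_offsets(url: str, k: int = 3):
    # Derive k bit positions from the URL's MD5 digest.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return [int(digest[i * 8:(i + 1) * 8], 16) % BITMAP_BITS for i in range(k)]


def seen_before(r: redis.Redis, url: str) -> bool:
    offsets = _bit_offsets(url)
    if all(r.getbit(DEDUP_KEY, off) for off in offsets):
        return True               # probably seen already (Bloom-style check)
    for off in offsets:
        r.setbit(DEDUP_KEY, off, 1)
    return False


def is_junk(r: redis.Redis, url: str) -> bool:
    features = (r.get(JUNK_KEY) or b"").decode("utf-8").split("|")
    return any(f and f in url for f in features)


def seed_task_queue(r: redis.Redis, initial_url: str) -> None:
    html = requests.get(initial_url, timeout=10).text
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = a["href"]
        if url.startswith("http") and not is_junk(r, url) and not seen_before(r, url):
            r.lpush(TASK_QUEUE, url)
```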

[0053] The crawling module is also used to crawl web pages according to the URL task queue to complete the crawling work.
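
A corresponding worker loop for the slave-node crawlers, under the same assumed key names, could simply block-pop URLs from the shared Redis list; page parsing, data storage, and re-enqueueing of newly found links are elided here.

```python
# Hypothetical slave-node worker sketch (key name, timeout and stop
# condition are illustrative assumptions).
import redis
import requests

TASK_QUEUE = "crawler:url_tasks"


def crawl_worker(r: redis.Redis) -> None:
    while True:
        item = r.brpop(TASK_QUEUE, timeout=30)
        if item is None:          # queue stayed empty: crawling work is done
            break
        _, raw_url = item
        try:
            page = requests.get(raw_url.decode("utf-8"), timeout=10)
        except requests.RequestException:
            continue
        # ... parse page.text, store the data, and enqueue new URLs after
        # junk filtering and deduplication (see the sketches above) ...


if __name__ == "__main__":
    crawl_worker(redis.Redis(host="localhost", port=6379, db=0))
```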

[0054] Wherein...



Abstract

The invention belongs to the technical field of software engineering, and particularly relates to a distributed web crawler performance optimization system for mass data acquisition. In the system, an initialization module is used for newly establishing a deduplication character string and a junk link feature character string. The master node crawler is used for reading the initial URL address, and the crawling module crawls the initial URL address to generate a URL task queue. The crawling module is used for crawling webpages according to the URL task queue to finish the crawling work. Compared with the prior art, the crawling performance bottleneck of the distributed web crawler is broken through, and the crawling performance is improved by 50% or more. The deduplication efficiency of the URL task queue is improved, and the efficiency requirement of mass data collection is met. The storage space of the URL task queue is optimized, and server memory resources are greatly saved. A junk link filtering step is added, so that server memory resources are saved and crawler efficiency is remarkably improved.

Description

Technical Field

[0001] The invention belongs to the technical field of software engineering, and in particular relates to a distributed web crawler performance optimization system for mass data collection.

Background Art

[0002] Web crawlers, also known as web spiders, web ants, or web robots, can automatically obtain data from the web according to set rules. Distributed web crawlers can efficiently obtain large-scale data sets, are widely used in search engines and big data analysis, and have become an important tool for massive data collection.

[0003] Distributed web crawlers usually include a master node crawler and multiple slave node crawlers, and use the Redis in-memory database to persist the URL task queue and the deduplication queue. The master node crawler crawls webpages according to the initial URL (Uniform Resource Locator), obtains data as well as new URLs, deduplicates the new URLs, and puts them into the URL task queue; the slave node cra...
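
For contrast with the invention, the conventional deduplication step described in this background section can be sketched as follows: every newly discovered URL is checked against a Redis set before being pushed onto the task queue, so memory use grows with the full text of every URL ever seen, which is the cost the invention aims to cut. Key names are illustrative assumptions.

```python
# Sketch of the prior-art style deduplication described above: a Redis
# set of full URLs (key names are illustrative assumptions).
import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def enqueue_if_new(url: str) -> bool:
    # SADD returns 1 only when the URL was not already in the set.
    if r.sadd("crawler:seen_urls", url):
        r.lpush("crawler:url_tasks", url)
        return True
    return False
```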


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/955
Inventor: 王维纲张郭秋晨张凯云吴志成吴艳林纪纲孙鹏陈卓
Owner: 北京京航计算通讯研究所