Web concurrent crawling method and system

A concurrent web crawling technique that addresses slow website responses, crashes, and poor web content analysis caused by crawlers, helping to preserve and improve website response speed.

Active Publication Date: 2015-05-27
ALIBABA GRP HLDG LTD
Cites: 4 · Cited by: 5

AI Technical Summary

Problems solved by technology

[0005] At present, web crawlers in the prior art have a poor ability to analyze web page content. They can only fetch website information mechanically and continuously, often sending dozens or hundreds of requests for repeated crawling. Because most websites have limited processing capacity, a large number of concurrent requests can easily cause a website to respond slowly or even crash.



Embodiment Construction

[0076] To make the above objects, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.

[0077] Referring to Figure 1, which shows a flow chart of Embodiment 1 of a concurrent web page crawling method of the present application, the method may specifically include:

[0078] Step 101: process the pending crawl requests concurrently, and monitor the processing event messages corresponding to the processed crawl requests;

[0079] In the embodiments of this application, a pending crawl request represents a crawl request that has not yet been processed. During the concurrent crawling of web pages, a pending crawl request can be generated from a new URL extracted from the current page and placed in the request queue. Obtain the pending crawl requests in t...
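The request queue described in [0078] and [0079] might be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: `SITE`, `fake_fetch`, `crawl`, and the event dictionary layout are hypothetical names, and an in-memory link map stands in for real HTTP fetching so the example runs without network access.

```python
import asyncio

# Hypothetical in-memory "website": each URL maps to the links found on that page.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}

async def fake_fetch(url):
    """Stand-in for a real HTTP fetch (assumption); returns the page's links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return SITE.get(url, [])

async def crawl(seed, concurrency=3):
    queue = asyncio.Queue()   # pending (unprocessed) crawl requests
    seen = {seed}             # URLs already queued, to avoid repeated crawling
    events = []               # processing event messages, one per handled request
    await queue.put(seed)

    async def worker():
        while True:
            url = await queue.get()
            try:
                links = await fake_fetch(url)
                events.append({"url": url, "ok": True})
                # New URLs extracted from the current page become pending requests.
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        await queue.put(link)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()        # wait until every pending request is processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return events
```

Calling `asyncio.run(crawl("/"))` processes each reachable URL once and returns the collected event messages, which a monitoring component could then analyze.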



Abstract

The invention provides a concurrent web crawling method and system. The method comprises: processing pending crawl requests concurrently and monitoring the processing event messages corresponding to the processed crawl requests; analyzing the processing event messages to obtain the current crawling index parameters; and reducing the concurrency of the web crawl when the current crawling index parameters exceed a preset safety range. The method and system can improve the response speed of websites during concurrent web crawling.
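The control loop in the abstract could look roughly like the sketch below. The abstract does not say which index parameters are used or how far the concurrency is reduced, so everything specific here is an assumption: average latency and error rate serve as the index parameters, the thresholds are arbitrary, and the concurrency is halved when either one leaves its safe range. `ProcessingEvent` and `ConcurrencyController` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ProcessingEvent:
    """Hypothetical processing event message for one handled crawl request."""
    url: str
    latency_s: float
    ok: bool

class ConcurrencyController:
    def __init__(self, concurrency=16, min_concurrency=1,
                 max_latency_s=2.0, max_error_rate=0.2, window=20):
        self.concurrency = concurrency
        self.min_concurrency = min_concurrency
        self.max_latency_s = max_latency_s      # assumed safety range for latency
        self.max_error_rate = max_error_rate    # assumed safety range for errors
        self.events = deque(maxlen=window)      # most recent event messages only

    def record(self, event):
        self.events.append(event)

    def index_parameters(self):
        """Analyze recent events into (average latency, error rate)."""
        n = len(self.events)
        if n == 0:
            return 0.0, 0.0
        avg_latency = sum(e.latency_s for e in self.events) / n
        error_rate = sum(1 for e in self.events if not e.ok) / n
        return avg_latency, error_rate

    def adjust(self):
        """Halve the concurrency (an assumed policy) when outside the safe range."""
        avg_latency, error_rate = self.index_parameters()
        if avg_latency > self.max_latency_s or error_rate > self.max_error_rate:
            self.concurrency = max(self.min_concurrency, self.concurrency // 2)
        return self.concurrency
```

A crawler would call `record` for each event message and `adjust` periodically, then limit its number of in-flight requests to the returned value.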

Description

Technical field

[0001] The present application relates to the field of network technology, and in particular to a method and system for concurrent web page crawling.

Background

[0002] A search engine is a system that uses specific computer programs to collect information from the Internet according to a certain strategy, organizes and processes that information, provides retrieval services to users, and displays relevant information to them. To collect information from the Internet, the search engine relies on web crawlers that crawl the relevant websites.

[0003] A web crawler is a program that automatically obtains web page content and is an important part of a search engine.

[0004] In the prior art, for ordinary search engines, a traditional crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs on the initial web pages. Extract new URLs from t...
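The seed-URL approach in [0004] depends on pulling links out of each fetched page. A minimal sketch of that one step, using only the Python standard library, is shown here; `LinkExtractor` and `extract_links` are hypothetical names, and the sketch assumes links appear as `href` attributes on `<a>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the current page.
                self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Each returned URL would then become a candidate for a new crawl request, subject to whatever deduplication the crawler applies.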


Application Information

IPC(8): G06F17/30, H04L29/08
Inventor: 金伟, 孟凡光
Owner: ALIBABA GRP HLDG LTD