Web concurrent crawling method and system

A web page crawling technique, applied in the field of concurrent web crawling, that addresses slow website responses, crashes, and poor web content analysis ability, so that the website's response speed is maintained during crawling.

Active Publication Date: 2015-05-27

ALIBABA GRP HLDG LTD

4 Cites, 5 Cited by


Problems solved by technology

[0005] At present, web crawlers in the prior art have a poor ability to analyze web page content. They can only fetch website information mechanically and continuously, and often send dozens or hundreds of requests that crawl the same content repeatedly. Because most websites have limited processing capacity, a large number of concurrent requests can easily cause a website to respond slowly or even crash.



Embodiment Construction

[0076] To make the above objects, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.

[0077] Refer to Figure 1, which shows a flow chart of Embodiment 1 of a concurrent web page crawling method of the present application. The method may specifically include:

[0078] Step 101: process the pending crawl requests concurrently, and monitor the processing event messages corresponding to the processed crawl requests;

[0079] In the embodiment of this application, a pending crawl request represents a crawl request that has not yet been processed. During concurrent crawling of web pages, a pending crawl request can be generated from a new URL extracted from the current page and placed in the request queue, obtain the pending crawl requests in t...
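The request queue described in Step 101 can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the in-memory `FAKE_SITE` table, the `fetch_links` stand-in, the batch-based scheduling, and the fields of each event message are all assumptions made for the example.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "website": URL -> links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}


def fetch_links(url):
    """Stand-in for a real HTTP fetch followed by link extraction."""
    return FAKE_SITE.get(url, [])


def crawl(seed, concurrency=2):
    pending = deque([seed])  # request queue of pending crawl requests
    seen = {seed}            # URLs already queued, to avoid repeated crawling
    events = []              # one processing event message per handled request
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while pending:
            # Take up to `concurrency` pending requests and process them together.
            batch = [pending.popleft()
                     for _ in range(min(concurrency, len(pending)))]
            for url, links in zip(batch, pool.map(fetch_links, batch)):
                events.append({"url": url, "new_links": len(links)})
                # Each newly extracted URL becomes a pending crawl request.
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        pending.append(link)
    return events


if __name__ == "__main__":
    for event in crawl("/"):
        print(event)
```

The `seen` set is what keeps the sketch from sending repeated requests for the same page, which is the behavior paragraph [0005] criticizes in prior-art crawlers.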


Abstract

The invention provides a method and system for concurrent web page crawling. The method comprises: processing pending crawl requests concurrently and monitoring the processing event messages corresponding to the processed crawl requests; analyzing the processing event messages to obtain current crawl index parameters; and reducing the concurrency of the web page crawl when the current crawl index parameters exceed a preset safe range. The method and system can improve the response speed of websites during concurrent web page crawling.
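The control loop in the abstract (monitor event messages, derive index parameters, reduce concurrency when they leave the safe range) might look like the sketch below. The visible text does not say which index parameters are used or how far the concurrency is reduced, so the `Event` fields, the average-latency and error-rate indices, the `SAFE_LIMITS` values, and the halving rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    latency_s: float  # assumed field: response time of one handled request
    ok: bool          # assumed field: whether the request succeeded


def crawl_indices(events):
    """Aggregate processing event messages into current crawl index parameters."""
    n = len(events)
    if n == 0:
        return {"avg_latency_s": 0.0, "error_rate": 0.0}
    return {
        "avg_latency_s": sum(e.latency_s for e in events) / n,
        "error_rate": sum(1 for e in events if not e.ok) / n,
    }


# Assumed thresholds; the source only speaks of a "preset safe range".
SAFE_LIMITS = {"avg_latency_s": 2.0, "error_rate": 0.1}


def adjust_concurrency(current, events, limits=SAFE_LIMITS, floor=1):
    """Return a reduced concurrency if any index exceeds its limit."""
    indices = crawl_indices(events)
    if any(indices[name] > limit for name, limit in limits.items()):
        # Halving is an illustrative choice, not taken from the source.
        return max(floor, current // 2)
    return current


if __name__ == "__main__":
    healthy = [Event(0.5, True)] * 4
    slow = [Event(3.0, True), Event(4.0, True)]
    print(adjust_concurrency(8, healthy))
    print(adjust_concurrency(8, slow))
```

A crawler would call `adjust_concurrency` periodically with the events collected since the last check and resize its worker pool to the returned value.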

Description

Technical field

[0001] The present application relates to the field of network technology, in particular to a method and system for concurrent web page crawling.

Background technique

[0002] A search engine is a system that uses specific computer programs to collect information from the Internet according to a certain strategy, organizes and processes that information, provides users with retrieval services, and displays relevant information to users. The search engine collects information from the Internet by having web crawlers crawl the relevant websites.

[0003] A web crawler is a program that automatically obtains web page content, and is an important part of a search engine.

[0004] In the prior art, for ordinary search engines, traditional crawlers start from the URL (Uniform Resource Locator) of one or several initial web pages, and obtain the URLs on the initial web page. Extract new URLs from t...
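The URL extraction step that paragraph [0004] attributes to traditional crawlers can be illustrated with Python's standard library. This is a generic sketch and not the method of the patent; the `LinkExtractor` class, the `extract_links` helper, and the example URLs are hypothetical names introduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the current page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<a href="/x">x</a><a href="y.html">y</a><p>no link</p>'
    print(extract_links("http://example.com/dir/page.html", page))
```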


Application Information


IPC(8): G06F17/30, H04L29/08

Inventor: 金伟, 孟凡光

Owner: ALIBABA GRP HLDG LTD
