Method and system for concurrent crawling of web pages

A web page and website technology, applied to the concurrent crawling of web pages, that addresses slow website responses, website crashes, and crawlers' poor ability to analyze web page content, and helps maintain and improve website response speed.

Active Publication Date: 2018-10-23

ALIBABA GRP HLDG LTD

4 Cites, 0 Cited by



Problems solved by technology

[0005] At present, web crawlers in the prior art have a poor ability to analyze web page content. They can only grab website information mechanically and continuously, and often send dozens or hundreds of requests for repeated crawling. Because most websites have limited processing capacity, a large number of concurrent requests can easily cause a website to respond slowly or even crash.

Embodiment Construction

[0076] To make the above objectives, features, and advantages of the application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific implementations.

[0077] Referring to Figure 1, a flowchart of Embodiment 1 of a method for concurrently crawling web pages according to the present application is shown. The method may specifically include:

[0078] Step 101: Process the pending crawl requests concurrently, and monitor the processing event messages corresponding to the processed crawl requests;

[0079] In this embodiment of the application, a pending crawl request denotes a crawl request that has not yet been processed. During concurrent crawling of web pages, pending crawl requests can be generated from the new URLs extracted from the current page and placed in the request queue; a pending crawl request is then taken from the request queue, and it is judged before proces...
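Step 101 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the application: the `ProcessingEvent` fields, the `process_pending` helper, and the `fake_fetch` stand-in are all hypothetical names, and the check that the source says is performed before processing is left out because the text is cut off at that point.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ProcessingEvent:
    """Hypothetical event message recorded for one processed crawl request."""
    url: str
    ok: bool
    elapsed_s: float


def process_pending(request_queue, fetch, concurrency):
    """Drain the request queue with `concurrency` workers and collect events."""
    events = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = request_queue.get_nowait()
            except queue.Empty:
                return  # no pending crawl requests left
            start = time.monotonic()
            try:
                fetch(url)
                ok = True
            except Exception:
                ok = False
            event = ProcessingEvent(url, ok, time.monotonic() - start)
            with lock:
                events.append(event)

    # Leaving the `with` block waits for every worker to finish.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            pool.submit(worker)
    return events


def fake_fetch(url):
    """Assumed stand-in for a real HTTP fetch; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise RuntimeError("simulated failure")


pending = queue.Queue()
for u in ["http://example.com/a", "http://example.com/bad", "http://example.com/c"]:
    pending.put(u)

events = process_pending(pending, fake_fetch, concurrency=2)
```

Each worker takes pending crawl requests from the shared queue until it is empty, and every processed request leaves one event behind for later analysis.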

Abstract

The invention provides a method and system for concurrent crawling of web pages. The method comprises: processing pending crawl requests concurrently and monitoring the processing event messages corresponding to the processed crawl requests; analyzing the processing event messages to obtain current crawl index parameters; and reducing the number of concurrent crawl requests when the current crawl index parameters exceed a preset safety range. The method and system can improve the response speed of websites while their pages are being crawled concurrently.
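The analysis and throttling steps in the abstract can be sketched as follows. This is a hedged illustration only: the abstract does not say which crawl index parameters are used or how far the concurrency is reduced, so the error rate, the average response time, the upper limits, and the halving rule below are all assumptions, as are the function names.

```python
def crawl_index(events):
    """Compute assumed index parameters from (ok, elapsed_s) event tuples."""
    total = len(events)
    if total == 0:
        return {"error_rate": 0.0, "avg_response_s": 0.0}
    failures = sum(1 for ok, _ in events if not ok)
    total_elapsed = sum(elapsed for _, elapsed in events)
    return {
        "error_rate": failures / total,
        "avg_response_s": total_elapsed / total,
    }


def adjust_concurrency(current, index, limits, min_concurrency=1):
    """Halve the concurrency (an assumed policy) if any parameter exceeds its limit."""
    exceeded = any(index[name] > limit for name, limit in limits.items())
    if exceeded:
        return max(min_concurrency, current // 2)
    return current


# Assumed upper bounds standing in for the "preset safety range".
limits = {"error_rate": 0.2, "avg_response_s": 1.2}

stressed = [(True, 0.2), (False, 1.5), (False, 2.0), (True, 0.3)]
healthy = [(True, 0.1), (True, 0.2), (True, 0.1), (True, 0.2)]

stressed_concurrency = adjust_concurrency(8, crawl_index(stressed), limits)
healthy_concurrency = adjust_concurrency(8, crawl_index(healthy), limits)
```

With the sample data, half of the stressed requests fail, which exceeds the assumed error-rate limit and halves the concurrency from 8 to 4, while the healthy sample leaves it unchanged.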

Description

Technical field

[0001] This application relates to the field of network technology, and in particular to a method and system for concurrently crawling web pages.

Background technique

[0002] A search engine is a system that collects information from the Internet according to certain strategies using specific computer programs, organizes and processes that information, provides search services to users, and displays the information relevant to each user's search. The search engine's collection of information from the Internet relies on web crawlers crawling the relevant website information.

[0003] A web crawler is a program that automatically obtains web content and is an important part of a search engine.

[0004] In the prior art, for ordinary search engines, a traditional crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs on the initial web pages. Extract the new URL from the current pag...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F17/30, H04L29/08

Inventor: 金伟, 孟凡光

Owner: ALIBABA GRP HLDG LTD
