Method and system for concurrent crawling of web pages

This technology, applied to the concurrent crawling of web pages, addresses problems such as slow website response, website crashes, and crawlers' poor ability to analyze web page content, and helps preserve and improve the response speed of the crawled website.

Active Publication Date: 2018-10-23
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0005] At present, web crawlers in the prior art have a poor ability to analyze web page content. They can only fetch website information mechanically and continuously, and often send dozens or even hundreds of requests that crawl the same content repeatedly. Because most websites have limited processing capacity, a large number of concurrent requests can easily cause a website to respond slowly or even crash.

Examples

Detailed Description of the Embodiments

[0076] To make the above objectives, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.

[0077] Referring to Figure 1, a flowchart of Embodiment 1 of a method for concurrently crawling web pages according to the present application is shown. The method may specifically include:

[0078] Step 101: Process the pending crawl requests concurrently, and monitor the processing event messages corresponding to the processed crawl requests;

[0079] In this embodiment, a pending crawl request denotes a crawl request that has not yet been processed. During concurrent crawling of web pages, pending crawl requests can be generated from the new URLs extracted from the current page and placed in a request queue. Pending crawl requests are then taken from the request queue, and it is judged before proces...
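As an illustration of Step 101, the following is a minimal Python sketch, not the patented implementation. It takes pending crawl requests from a request queue, processes them concurrently, and records one processing event message per processed request. The names `ProcessingEvent`, `fake_fetch`, `process_request`, and `crawl_concurrently` are hypothetical, `fake_fetch` stands in for a real HTTP fetch so that no network access occurs, and the check performed before processing (cut off in the text above) is not modeled.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ProcessingEvent:
    """Hypothetical event message emitted once a crawl request is processed."""
    url: str
    ok: bool
    elapsed_s: float


def fake_fetch(url: str) -> bool:
    """Stand-in for a real HTTP fetch (assumption); performs no network access."""
    return not url.endswith("/broken")


def process_request(url: str, events: "queue.Queue[ProcessingEvent]") -> None:
    """Process one crawl request and publish its processing event message."""
    start = time.monotonic()
    ok = fake_fetch(url)
    events.put(ProcessingEvent(url, ok, time.monotonic() - start))


def crawl_concurrently(urls, concurrency: int = 4):
    """Drain the request queue with `concurrency` workers; return all events."""
    pending: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        pending.put(url)

    events: "queue.Queue[ProcessingEvent]" = queue.Queue()
    # Leaving the `with` block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while not pending.empty():
            pool.submit(process_request, pending.get(), events)

    collected = []
    while not events.empty():
        collected.append(events.get())
    return collected
```

The event queue is what a monitoring component would read in this sketch. A real crawler would typically consume events while crawling is still in progress rather than after the pool has drained.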

Abstract

The present invention provides a method and system for concurrently crawling web pages. The method comprises: processing pending crawl requests concurrently and monitoring the processing event messages corresponding to the processed crawl requests; analyzing the processing event messages to obtain current crawl index parameters; and reducing the concurrency level of the web crawl when the current crawl index parameters exceed a preset safety range. The method and system can improve the response speed of websites during concurrent web crawling.
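The control loop summarized in the abstract could be sketched roughly as follows in Python. This is an assumption-laden illustration: the abstract does not say which crawl index parameters are used or how far the concurrency is reduced. Average response time and error rate are chosen here only as plausible examples, halving is an arbitrary reduction policy, and the names `Event`, `crawl_indices`, and `adjust_concurrency` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Event:
    """Hypothetical processing event message for one processed crawl request."""
    elapsed_s: float
    ok: bool


def crawl_indices(events: Iterable[Event]) -> Tuple[float, float]:
    """Derive example index parameters: (average latency, error rate)."""
    events = list(events)
    if not events:
        return 0.0, 0.0
    avg_latency = sum(e.elapsed_s for e in events) / len(events)
    error_rate = sum(1 for e in events if not e.ok) / len(events)
    return avg_latency, error_rate


def adjust_concurrency(
    current: int,
    events: Iterable[Event],
    max_avg_latency_s: float = 2.0,   # assumed upper bound of the safety range
    max_error_rate: float = 0.1,      # assumed upper bound of the safety range
    min_concurrency: int = 1,
) -> int:
    """Return a reduced concurrency level when an index leaves the safety range."""
    avg_latency, error_rate = crawl_indices(events)
    if avg_latency > max_avg_latency_s or error_rate > max_error_rate:
        # Halving is an illustrative policy, not one stated in the source.
        return max(min_concurrency, current // 2)
    return current
```

For example, `adjust_concurrency(8, [Event(3.0, True), Event(4.0, True)])` returns 4 because the average latency of 3.5 s exceeds the assumed 2.0 s bound, while a batch of fast, successful events leaves the level unchanged.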

Description

Technical Field

[0001] This application relates to the field of network technology, and in particular to a method and system for concurrently crawling web pages.

Background Art

[0002] A search engine is a system that uses specific computer programs to collect information from the Internet according to certain strategies, organizes and processes that information, provides search services to users, and displays the information relevant to their searches. The search engine's collection of information from the Internet relies on web crawlers crawling the relevant website information.

[0003] A web crawler is a program that automatically obtains web content, and is an important component of a search engine.

[0004] In the prior art, for ordinary search engines, a traditional crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs on those initial pages. Extract the new URL from the current pag...
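The traditional crawler outlined in [0004] can be illustrated with a short Python sketch. It starts from seed URLs, extracts links from each page, and queues only URLs it has not seen before. Pages are supplied as an in-memory dictionary instead of being fetched over the network, and the names `LinkExtractor` and `crawl` are hypothetical. The `seen` set is an addition for illustration: it shows one simple way to avoid the repeated crawling mentioned in [0005], and is not taken from the source.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, pages):
    """Breadth-first crawl over `pages` (url -> HTML); return the visit order."""
    frontier = deque(seeds)
    seen = set(seeds)          # URLs already queued, to avoid repeat crawling
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        parser = LinkExtractor(url)
        parser.feed(pages.get(url, ""))  # unknown pages are treated as empty
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```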

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, H04L29/08
Inventor: 金伟孟凡光
Owner: ALIBABA GRP HLDG LTD