Computer robot crawling task distribution method and device, and computer robot data crawling method and device

Web crawler task-allocation technology, applied in the network field, that addresses repeated page crawling and inefficient use of network resources.

Active Publication Date: 2016-12-07
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0005] In order to overcome the problems of unreasonable use of network resources in web crawler crawling tasks and repeated page crawling in re

Examples

Embodiment Construction

[0137] Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present application as recited in the appended claims.

[0138] In the following detailed description, numerous specific details are set forth in order to provide a comprehensive understanding of the application, but those skilled in the art will understand that the application may be practiced without these specific details. In other embodiments, well-known methods, procedures, components and circuits have not been described in detail so as n...

Abstract

The embodiments of the invention disclose a method and device for distributing computer robot (web crawler) crawling tasks, and a method and device for computer robot data crawling. In the data crawling method, each crawled page is stored in the historical crawling data. When a crawling task arrives, the method checks whether the historical crawling data contains a historical page corresponding to the task's URL (Uniform Resource Locator) cluster. If such a page exists and is available, the first target page data is extracted directly from it, so that page does not need to be crawled again and system resources are saved. In the task distribution method, the network bandwidth usage rate of every computer robot is sampled at fixed times, and the mean E(w) and variance D(w) of each robot's usage rate are calculated and stored. The availability of each computer robot is then calculated from E(w) and D(w). Among the available computer robots, those that execute tasks are selected in descending order of availability probability, so that computer robot resources are distributed reasonably.
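The two ideas in the abstract can be sketched as follows. This is a minimal illustration, not the patented procedure: the abstract does not state how availability is derived from E(w) and D(w), so the sketch assumes the usage rate is roughly normally distributed and takes availability to be the probability that usage stays below a threshold. The function names, the 0.8 threshold, the 0.5 cut-off, and the dictionary-based history store are all hypothetical.

```python
from statistics import NormalDist, fmean, pvariance


def bandwidth_stats(samples):
    """Return (E(w), D(w)) for one robot's sampled bandwidth usage rates."""
    return fmean(samples), pvariance(samples)


def availability(mean, variance, threshold=0.8):
    # ASSUMPTION (not from the source): usage is modelled as a normal
    # distribution, and availability is P(usage < threshold).
    if variance == 0:
        # NormalDist.cdf is undefined for sigma == 0, so decide directly.
        return 1.0 if mean < threshold else 0.0
    return NormalDist(mean, variance ** 0.5).cdf(threshold)


def rank_robots(usage_history, threshold=0.8, min_availability=0.5):
    """Return (robot, availability) pairs for available robots,
    sorted in descending order of availability probability."""
    scored = []
    for robot, samples in usage_history.items():
        e_w, d_w = bandwidth_stats(samples)
        p = availability(e_w, d_w, threshold)
        if p >= min_availability:  # hypothetical cut-off for "available"
            scored.append((robot, p))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def lookup_history(history, url_cluster, is_available):
    """Return a stored page for the URL cluster if one exists and is still
    usable; otherwise return None so the caller crawls the page."""
    page = history.get(url_cluster)
    if page is not None and is_available(page):
        return page
    return None
```

A caller would first try lookup_history for the task's URL cluster and, only when it returns None, hand the task to the first robot returned by rank_robots. What makes a stored page "available" (for example, its age) is left to the is_available callback, because the abstract does not define it.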

Description

Technical field

[0001] The present invention relates to the field of network technologies, in particular to a method and device for assigning web crawler crawling tasks, and a method and device for data crawling.

Background

[0002] A web crawler (computer robot, also known as a web spider or web robot) is a program that automatically fetches web page data from the Internet according to certain rules, and is an important component of search engines. Usually, web crawlers download web pages from the Internet according to configured crawling tasks, then parse and filter the pages to obtain the target web page data. All target web page data captured by web crawlers is stored in the crawler system and indexed for subsequent query and retrieval.

[0003] The development of network and information technology has led to rapid growth in the number of websites, web pages, and the volume of web page data. A crawler system needs many web crawlers to fetch this large amount of web page data. These web crawlers ...

Application Information

IPC(8): G06F17/30
Inventor: 刘庆, 张美德, 殷贤君, 邹启蒙
Owner: ALIBABA GRP HLDG LTD