
Web crawler crawling rule replacement method, scheduling end and crawling end

A web crawler and scheduling-end technology, applied in the field of web crawlers. It addresses the frequent restarts of crawling nodes and the chaotic management that result when crawling rules are modified on individual crawling nodes, thereby avoiding management confusion.

Active Publication Date: 2016-11-09
BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0007] In the large-scale crawler systems of the prior art, replacing crawling rules tends to cause frequent restarts of crawling nodes, and modifying the crawling rules of individual crawling nodes tends to cause management confusion. To address these technical problems, it is necessary to provide a web crawler crawling rule replacement method, a scheduling end and a crawling end.




Detailed Description of Embodiments

[0036] The present invention will be described in further detail below in conjunction with the accompanying drawings and specific embodiments.

[0037] Figure 1 shows the workflow of a web crawler crawling rule replacement method of the present invention, which includes:

[0038] Step S101: sending a crawl task to a crawling end that crawls network information, where the crawl task includes a website to be crawled and the scheduling-end version number of the scheduling-end crawling rule file corresponding to that website;

[0039] Step S102: receiving, from the crawling end, a request for a new rule file that includes the website of the rule to be switched and the version number of the rule to be switched, and sending the rule file to be switched and the website of the rule to be switched to the crawling end. The rule file to be switched is a scheduling-end crawling rule file that is stored in the rule file library and is jointly identified by the website of the rule to be...
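Steps S101 and S102 can be illustrated from the scheduling end's point of view with a minimal Python sketch. It is not the patent's implementation: the names `SchedulingEnd`, `CrawlTask`, `publish_rule`, `make_task` and `handle_rule_request` are hypothetical, version numbers are assumed to be integers, and the rule file library is assumed to be an in-memory mapping keyed by website and version number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlTask:
    website: str       # website to be crawled
    rule_version: int  # scheduling-end version number of its rule file (assumed int)


class SchedulingEnd:
    """Hypothetical scheduling end that owns the rule file library."""

    def __init__(self):
        # Rule file library: a rule file is jointly identified by
        # (website, version number). In-memory storage is an assumption.
        self._library = {}
        self._latest = {}  # website -> newest published version number

    def publish_rule(self, website, version, content):
        """Store a rule file centrally instead of uploading it to each crawling end."""
        self._library[(website, version)] = content
        self._latest[website] = max(version, self._latest.get(website, version))

    def make_task(self, website):
        """Step S101: build a crawl task carrying the website and its rule version."""
        return CrawlTask(website, self._latest[website])

    def handle_rule_request(self, website, version):
        """Step S102: answer a crawling end's request for a new rule file."""
        return website, self._library[(website, version)]
```

In this sketch the task itself carries the version number, so a crawling end can notice a stale rule file without any separate notification channel.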


Abstract

The invention discloses a web crawler crawling rule replacement method, a scheduling end and a crawling end. The method includes sending a crawl task to a crawling end that crawls network information, and sending a rule file to be switched, together with the website of the rule to be switched, to the crawling end. The crawl task includes a website to be crawled and the scheduling-end version number of the scheduling-end crawling rule file corresponding to that website. The rule file to be switched is used by the crawling end to replace the crawling-end rule file that it stores for the website of the rule to be switched. Because the crawling end stores its crawling rules independently in rule files, a replacement only requires swapping the rule file, and the crawling end as a whole does not need to be restarted. In addition, the scheduling end manages and stores all rule files centrally, so rule files do not need to be uploaded separately to each crawling end, which avoids management confusion.
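The crawling-end side of this scheme can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the class `CrawlingEnd`, the `fetch_rule` callback (standing in for the request to the scheduling end), the `.rule` file naming, and the trigger "local version differs from the version in the task" are all hypothetical. The sketch also assumes website names are plain host names that are safe to use in file names.

```python
import os
import tempfile


class CrawlingEnd:
    """Hypothetical crawling end that keeps one rule file per website."""

    def __init__(self, rule_dir, fetch_rule):
        self.rule_dir = rule_dir
        self.fetch_rule = fetch_rule  # callable(website, version) -> rule file content
        self.versions = {}            # website -> version of the local rule file

    def _path(self, website):
        # Assumes the website name is a plain host name (no path separators).
        return os.path.join(self.rule_dir, website + ".rule")

    def handle_task(self, website, scheduler_version):
        # Assumption: a version mismatch is what triggers the rule request.
        if self.versions.get(website) != scheduler_version:
            content = self.fetch_rule(website, scheduler_version)
            self._replace_rule_file(website, content)
            self.versions[website] = scheduler_version
        # Rules are read from the file for each task, so a swapped file
        # takes effect without restarting the process.
        with open(self._path(website), encoding="utf-8") as f:
            return f.read()

    def _replace_rule_file(self, website, content):
        # Write to a temporary file in the same directory, then swap it in
        # with os.replace so readers never see a half-written rule file.
        fd, tmp = tempfile.mkstemp(dir=self.rule_dir)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmp, self._path(website))
```

A usage example would pass a function that calls the scheduling end as `fetch_rule`; repeated tasks with the same version number then reuse the local file, and only a version change causes a new download.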

Description

Technical field

[0001] The invention relates to the technical field of web crawlers, and in particular to a web crawler crawling rule replacement method, a scheduling end and a crawling end.

Background

[0002] Web crawlers are the basis for obtaining website information. To obtain website information, corresponding crawling rules must be configured for each website. However, a website's page style is not static; once it changes, the original crawling rules inevitably become invalid.

[0003] The existing practice is to reconfigure the rules for the revised page and then restart the crawling node. This approach is feasible when crawling a single website that is revised infrequently. When there are a large number of websites, however, it causes frequent restarts of crawling nodes. The disadvantages are obvious:

[0004]...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/958
Inventors: 廖耀华, 黎小为
Owner: BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD