
Web crawler scheduling method and web crawler system applying same

A web crawler scheduling technology, applied in the field of web page crawling, that addresses problems such as unreasonable refresh interval settings, failure to crawl important pages in time, and the lack of a reasonable crawling order. It sets a reasonable preset refresh time, improves index quality and timeliness, and ensures the user's retrieval effect.

Active Publication Date: 2015-12-02
ALIBABA (CHINA) CO LTD
4 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

[0005] However, because of factors such as unreasonable refresh interval settings and a lack of reasonable control over the crawling order, the scheduling methods of existing web crawler systems often crawl excessively while failing to crawl some important pages in time, which degrades index quality and the user's search results.



Examples


Embodiment Construction

[0059] First, an embodiment of the web crawler scheduling method provided by this application is described. Figure 1 is a flowchart of a web crawler scheduling method provided by an embodiment of the present application. Referring to Figure 1, the web crawler scheduling method includes the following steps.

[0060] S11. Crawl the content page data related to the seed page.

[0061] S12. Analyze the content page data to obtain multiple sets of link information related to the seed page.

[0062] S13. Calculate, according to each set of link information, the link quality of the corresponding content page on the seed page.

[0063] S14. Crawl the content pages in descending order of their link quality.

[0064] According to the steps of the above method, for a seed page that needs to be maintained, first perform a crawling operation on it to obtain the content page data related to the seed page; and then obtain multiple sets ...
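Steps S12 to S14 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the visible text does not say what the link information contains or how link quality is computed, so the `LinkInfo` fields (anchor text and position on the page) and the `toy_link_quality` formula are assumptions chosen only for illustration. The seed page is passed in as an already crawled HTML string (S11), and the function returns the order in which content pages would be crawled (S14).

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, List, Optional


@dataclass
class LinkInfo:
    """One set of link information (fields are hypothetical)."""
    url: str
    anchor_text: str
    position: int  # order of appearance on the seed page, 0 = first


class LinkExtractor(HTMLParser):
    """S12: collect one LinkInfo per <a href="..."> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[LinkInfo] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append(LinkInfo(self._href, text, len(self.links)))
            self._href = None


def toy_link_quality(info: LinkInfo) -> float:
    """S13 placeholder (assumed formula): longer anchor text and an
    earlier position on the seed page give a higher score."""
    return len(info.anchor_text) / (1.0 + info.position)


def crawl_order(
    seed_html: str,
    quality: Callable[[LinkInfo], float] = toy_link_quality,
) -> List[str]:
    """Return content page URLs in descending order of link quality (S14)."""
    parser = LinkExtractor()
    parser.feed(seed_html)  # S12: analyze the crawled seed page data
    ranked = sorted(parser.links, key=quality, reverse=True)  # S13 scores
    return [info.url for info in ranked]


seed_html = (
    '<a href="/a">Hi</a>'
    '<a href="/b">A much longer headline</a>'
    '<a href="/c">Mid title</a>'
)
print(crawl_order(seed_html))
```

Passing the scoring function as a parameter keeps the ordering logic separate from whatever link quality measure is actually used.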



Abstract

The invention discloses a web crawler scheduling method and a web crawler system applying the same. First, the crawled content page data related to a seed page is analyzed to obtain multiple sets of link information. The link quality of the corresponding content page on the seed page is then calculated from each set of link information, and the crawling order of the content pages and the preset refresh interval of the seed page are each determined according to the link quality. Important content pages with high link quality, and their seed page, are crawled preferentially, which improves index quality and timeliness and ensures the user's retrieval effect.
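The abstract states that the seed page's preset refresh interval is determined from link quality but the visible text gives no formula. The sketch below shows one possible mapping under stated assumptions: link quality values are normalized to the range 0 to 1, a higher mean quality yields a shorter interval, and the bounds `min_s` and `max_s` are hypothetical defaults. None of these choices comes from the patent.

```python
from typing import Sequence


def refresh_interval(
    link_qualities: Sequence[float],
    min_s: float = 60.0,     # assumed shortest interval, in seconds
    max_s: float = 3600.0,   # assumed longest interval, in seconds
) -> float:
    """Map mean link quality (assumed to lie in [0, 1]) to a refresh
    interval: higher quality means the seed page is refreshed sooner."""
    if not link_qualities:
        return max_s  # no links observed: fall back to the longest interval
    mean_q = sum(link_qualities) / len(link_qualities)
    mean_q = min(1.0, max(0.0, mean_q))  # clamp out-of-range scores
    return max_s - mean_q * (max_s - min_s)


print(refresh_interval([0.5]))
```

A linear mapping is used only because it is the simplest to read; any monotonically decreasing function of link quality would fit the same description.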

Description

[0001] The present invention claims the priority of the Chinese patent application submitted to the China Patent Office on October 9, 2015, with the application number 201510649129.X and the title "Web crawler scheduling method and web crawler system using the same", the entire content of which is incorporated herein by reference.

Technical field

[0002] The invention relates to the technical field of web page crawling, in particular to a web crawler scheduling method and a web crawler system using the same.

Background

[0003] Search engines usually provide minute-level real-time indexing so that time-sensitive web page information, such as news on news websites, video updates on video websites, and popular posts on forums, is displayed to users in a timely manner. In order to obtain this highly time-sensitive web page information in time, search engines need to maintain a batch of seed pages (also known as list page...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/00, G06F16/951
Inventor: 周海建
Owner: ALIBABA (CHINA) CO LTD