
Web crawler scheduling method and web crawler system using same

A web crawler scheduling method, applied in the field of web crawling, addresses the problems of unreasonable refresh interval settings, important pages not being crawled in time, and the lack of a reasonable crawling sequence. It improves index quality and timeliness, sets a reasonable preset refresh interval, and safeguards the retrieval effect.

Active Publication Date: 2017-07-11
ALIBABA (CHINA) CO LTD

AI Technical Summary

Problems solved by technology

[0005] However, the scheduling methods of existing web crawler systems often suffer from unreasonable refresh interval settings and a lack of reasonable control over the crawling sequence. This results in excessive crawling, while some important pages cannot be crawled in time, which degrades index quality and the user's search results.



Examples


Embodiment Construction

[0059] First, an embodiment of the web crawler scheduling method provided by this application is described. Figure 1 is a flowchart of a web crawler scheduling method provided by an embodiment of the present application. Referring to Figure 1, the web crawler scheduling method includes the following steps.

[0060] S11. Crawl the content page data related to the seed page.

[0061] S12. Parse the content page data to obtain multiple sets of link information related to the seed page.

[0062] S13. For each set of link information, calculate the link quality of the corresponding content page on the seed page.

[0063] S14. Crawl the content pages in descending order of their link quality.

[0064] Following the steps of the above method, for a seed page that needs to be maintained, first crawl it to obtain the content page data related to the seed page; and then obtain multiple sets ...
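Steps S12 to S14 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `LinkInfo` fields, the `link_quality` scoring rule (longer anchor text and earlier position score higher), and the `fetch` callback are all hypothetical, since the text shown here does not state what the link information contains or how link quality is computed. The seed page HTML is assumed to have already been crawled (S11).

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, List, Optional


@dataclass
class LinkInfo:
    """One set of link information from a seed page (fields are assumptions)."""
    url: str
    anchor_text: str
    position: int  # order of appearance on the seed page


class _LinkCollector(HTMLParser):
    """Collects every <a href=...> element together with its anchor text."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[LinkInfo] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append(LinkInfo(self._href, text, len(self.links)))
            self._href = None


def parse_links(seed_html: str) -> List[LinkInfo]:
    """S12: parse the crawled seed page data into sets of link information."""
    collector = _LinkCollector()
    collector.feed(seed_html)
    return collector.links


def link_quality(info: LinkInfo) -> float:
    """S13: placeholder score; the real formula is not given in the source."""
    return len(info.anchor_text) / (1 + info.position)


def schedule(seed_html: str, fetch: Callable[[str], None]) -> List[str]:
    """S14: crawl content pages in descending order of link quality."""
    ordered = sorted(parse_links(seed_html), key=link_quality, reverse=True)
    for info in ordered:
        fetch(info.url)
    return [info.url for info in ordered]
```

Passing `fetch` as a callback keeps the ordering logic separate from network access, so the schedule can be checked with a stub before a real downloader is plugged in.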


Abstract

A web crawler scheduling method and a web crawler system applying the same. The method first parses the crawled content page data related to a seed page to obtain multiple groups of link information, then calculates, from each group of link information, the link quality of the corresponding content page on the seed page. The crawling sequence of the content pages and the preset refresh interval of the seed page are then determined according to the link quality. The method ensures that important content pages and seed pages with high link quality are crawled preferentially, which improves index quality and timeliness and safeguards the search effect for users.
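One way the preset refresh interval of a seed page might be derived from link quality is sketched below. The abstract does not give a formula, so everything here is an assumption: link quality is taken to be normalized to the range 0 to 1, the seed page is summarized by the mean quality of its content pages, and the interval shrinks linearly from a hypothetical `max_seconds` to a hypothetical `min_seconds` as that mean rises.

```python
from typing import Sequence


def preset_refresh_interval(
    qualities: Sequence[float],
    min_seconds: float = 60.0,    # assumed lower bound, not from the source
    max_seconds: float = 3600.0,  # assumed upper bound, not from the source
) -> float:
    """Map the link qualities found on a seed page to a refresh interval.

    Higher average quality gives a shorter interval, so seed pages that
    list important content pages are revisited sooner.
    """
    if not qualities:
        # No links found: fall back to the slowest refresh rate.
        return max_seconds
    average = sum(qualities) / len(qualities)
    average = min(1.0, max(0.0, average))  # clamp to the assumed [0, 1] range
    return max_seconds - average * (max_seconds - min_seconds)
```

For example, `preset_refresh_interval([0.9, 0.8])` yields a shorter interval than `preset_refresh_interval([0.1, 0.2])`, which matches the stated goal of crawling seed pages with high link quality preferentially.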

Description

[0001] The present invention claims the priority of the Chinese patent application submitted to the China Patent Office on October 9, 2015, with the application number 201510649129.X and the title "Web crawler scheduling method and web crawler system using the same", the entire content of which is incorporated herein by reference.

Technical field

[0002] The invention relates to the technical field of webpage crawling, in particular to a web crawler scheduling method and a web crawler system using the same.

Background

[0003] Search engines usually provide minute-level real-time indexing to display time-sensitive webpage information to users in a timely manner, such as news on news websites, video updates on video websites, and popular forum posts. To obtain this highly time-sensitive webpage information in time, search engines need to maintain a batch of seed pages (also known as list page...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/00
Inventor: 周海建
Owner: ALIBABA (CHINA) CO LTD