Method and equipment for crawling page

A page-crawling method and device, applied in the Internet field, that addresses problems such as incomplete data crawling, the inability to record the current data crawling status, and slow diffusion speed, so as to improve crawling efficiency, make crawl scheduling configurable, and ensure the completeness of the crawled data.
CN103226568A · Inactive · Publication Date: 2013-07-31 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Publication Date: 2013-07-31
Estimated Expiration: Not applicable · inactive patent

Abstract

The invention provides a method and equipment for crawling pages. In the method, a crawling device obtains, from a crawled page, candidate page identification information corresponding to a candidate page to be crawled; adds the candidate page identification information to a corresponding to-be-crawled page set according to relevant information about the candidate page and the crawled page, where the to-be-crawled page set contains page identification information for one or more pages to be crawled; determines, according to the to-be-crawled page set, the target crawling identification information of the page to be crawled; and crawls the target page corresponding to the target crawling identification information. Compared with the prior art, the method and equipment use the relevant information to control crawl scheduling effectively, so that crawl scheduling becomes configurable and crawling can diffuse purposefully. The direction and speed at which the crawler moves through web pages are also controlled, which improves the crawling efficiency of a vertical crawler and ensures the completeness of the crawled data.
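The four steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say what the "relevant information" is or how the to-be-crawled page sets are organized, so the URL-based `classify` rule, the set names in `SET_PRIORITY`, the helper `pick_target`, and the in-memory `SITE` used in place of real fetching are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Optional

# Hypothetical to-be-crawled page sets, listed in the order they are served.
SET_PRIORITY = ["detail", "list", "other"]


def classify(crawled_url: str, candidate_url: str) -> str:
    """Assumed stand-in for the 'relevant information' of the two pages.

    Only the candidate URL is inspected here; a real rule could also use
    the crawled page (crawled_url is accepted but unused in this sketch).
    """
    if "/item/" in candidate_url:
        return "detail"
    if "/list/" in candidate_url:
        return "list"
    return "other"


def pick_target(pending: Dict[str, Deque[str]]) -> Optional[str]:
    """Determine the target: oldest entry of the first non-empty set."""
    for name in SET_PRIORITY:
        if pending[name]:
            return pending[name].popleft()
    return None


def crawl(seed: str,
          fetch_links: Callable[[str], Iterable[str]],
          limit: int = 100) -> List[str]:
    """Crawl from seed and return the pages in the order they were crawled."""
    pending: Dict[str, Deque[str]] = {name: deque() for name in SET_PRIORITY}
    pending["other"].append(seed)
    seen = {seed}
    order: List[str] = []
    while len(order) < limit:
        target = pick_target(pending)
        if target is None:
            break  # every to-be-crawled set is empty
        order.append(target)  # "crawl" the target page
        for candidate in fetch_links(target):  # candidates found on the crawled page
            if candidate not in seen:
                seen.add(candidate)
                pending[classify(target, candidate)].append(candidate)
    return order


# Hypothetical in-memory site standing in for real HTTP fetching.
SITE = {
    "/home": ["/list/1", "/about"],
    "/list/1": ["/item/a", "/item/b", "/list/2"],
    "/list/2": ["/item/c"],
}
order = crawl("/home", lambda url: SITE.get(url, []))
print(order)
```

In this sketch, changing `SET_PRIORITY` or the `classify` rule changes which pages are reached first, which is one simple way to read the abstract's claim that crawl scheduling is configurable and that crawling can diffuse purposefully.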
Description

Technical Field

[0001] The invention relates to the technical field of the Internet, and in particular to a technology for crawling pages.

Background Art

[0002] Current methods for crawling web pages use a random breadth-first strategy. For directional crawling, this leads to problems such as slow diffusion, difficulty controlling the direction and speed of diffusion, and difficulty reaching the desired pages within the desired time. For example, when crawling data in a vertical site, if the various dimensions of the data are distributed across different pages, the crawled data can be seriously incomplete. At the same time, because the current data crawling status cannot be recorded during the crawling process, it is impossible to judge, for data that is incomplete after crawling, whether the data itself is incomplete or the crawling of the pages has not yet finished.

Contents of the ...
