Method and equipment for crawling page

A page crawling method and device, applied in the Internet field, addresses problems such as incomplete data crawling, the inability to record the status of the current data crawl, and slow diffusion speed. It improves crawling efficiency, makes crawl scheduling configurable, and ensures data integrity.

Inactive Publication Date: 2013-07-31
BEIJING BAIDU NETCOM SCI & TECH CO LTD
2 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

In directional crawling, diffusion is slow, the direction and speed of diffusion are hard to control, and the crawler often fails to reach the desired pages within the desired time.
For example, when crawling a vertical site whose data dimensions are spread across different pages, the crawled data can be seriously incomplete. Moreover, because the status of the current data crawl cannot be recorded during crawling, it is impossible to tell afterwards whether incomplete data reflects gaps in the source data itself or a page crawl that never finished.

Examples

Embodiment Construction

[0027] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0028] Figure 1 shows a schematic diagram of crawling equipment for crawling pages according to one aspect of the present invention. The crawling equipment includes a candidate identification obtaining device 11, a grouping device 12, a crawling identification obtaining device 13, and a crawling device 14. Specifically, the candidate identification obtaining device 11 obtains, from the crawled pages, the candidate page identification information corresponding to the candidate pages to crawl; the grouping device 12 adds the candidate page identification information to the corresponding set of pages to be crawled, according to relevant information about the candidate pages and the crawled pages, where the set of pages to be crawled includes the page identification information of one or more pages to be crawled; the crawling identification obtaining device 13, according to the set of pages to be crawled, determines the target crawling identific...
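The division of labour among devices 11 to 14 can be pictured with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the patent text: the class and method names, the in-memory `links` mapping that stands in for real pages, and the grouping rule based on the first path segment of a page identifier.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlingEquipment:
    # Hypothetical stand-in for the web: page id -> ids of the pages it links to.
    links: dict
    # Sets of pages to be crawled, keyed by an assumed set name.
    to_crawl_sets: dict = field(default_factory=dict)

    def obtain_candidates(self, crawled_page):
        """Role of device 11: candidate page ids found on a crawled page."""
        return list(self.links.get(crawled_page, []))

    def group(self, crawled_page, candidate):
        """Role of device 12: file the candidate into a to-be-crawled set.

        Assumed rule: candidates sharing the crawled page's first path
        segment go to "same_section", all others go to "other".
        """
        same = candidate.split("/")[0] == crawled_page.split("/")[0]
        name = "same_section" if same else "other"
        self.to_crawl_sets.setdefault(name, []).append(candidate)

    def obtain_target(self):
        """Role of device 13: choose the next target from the sets."""
        for name in ("same_section", "other"):  # assumed priority order
            pages = self.to_crawl_sets.get(name)
            if pages:
                return pages.pop(0)
        return None

    def crawl(self, target):
        """Role of device 14: fetch the target page (stubbed out here)."""
        return f"<content of {target}>"

    def step(self, crawled_page):
        """Run one cycle through the four roles for a single crawled page."""
        for candidate in self.obtain_candidates(crawled_page):
            self.group(crawled_page, candidate)
        target = self.obtain_target()
        content = self.crawl(target) if target is not None else None
        return target, content


equipment = CrawlingEquipment(links={"news/1": ["sports/9", "news/2"]})
target, content = equipment.step("news/1")
print(target, content)
```

In this toy run the page in the same section is selected before the off-section page, although it was discovered later; the off-section page stays queued in its own set for a later cycle.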

Abstract

The invention provides a method and equipment for crawling pages. In the method, crawling equipment obtains, from a crawled page, candidate page identification information corresponding to a candidate page to crawl; adds the candidate page identification information to a corresponding to-be-crawled page set according to relevant information about the candidate page and the crawled page, where the to-be-crawled page set contains page identification information of one or more pages to be crawled; determines target crawling identification information of a page to be crawled according to the to-be-crawled page set; and crawls the target page corresponding to the target crawling identification information. Compared with the prior art, the method and equipment use the relevant information to control crawl scheduling effectively, which makes crawl scheduling configurable and lets the crawl diffuse purposefully. They also control the direction and speed at which the crawler moves through web pages, improve the crawling efficiency of a vertical crawler, and ensure the completeness of the crawled data.
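To make the configurability claim concrete, the following Python sketch runs the same four-step loop under two configurations. It is an illustration under stated assumptions, not the patented implementation: the toy `SITE` graph, the `relation` labels ("detail", "record", "listing"), the priority order of the sets, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical vertical site: page id -> ids of the pages it links to.
SITE = {
    "list/1": ["item/a", "item/b", "list/2"],
    "list/2": ["item/c"],
    "item/a": ["item/a/reviews"],
    "item/b": ["item/b/reviews"],
    "item/c": [],
    "item/a/reviews": [],
    "item/b/reviews": [],
}


def relation(crawled_id, candidate_id):
    """Assumed 'relevant information' linking a candidate to the crawled page."""
    if candidate_id.startswith(crawled_id + "/"):
        return "detail"   # another dimension of the same record
    if candidate_id.startswith("item/"):
        return "record"
    return "listing"


def next_target(to_crawl, set_order):
    """Pick the next page id from the first non-empty set, in priority order."""
    for name in set_order:
        if to_crawl[name]:
            return to_crawl[name].popleft()
    return None


def crawl(seed, classify, set_order, budget):
    """Crawl at most `budget` pages and return their ids in crawl order."""
    to_crawl = {name: deque() for name in set_order}
    seen = {seed}
    crawled = []
    target = seed
    while target is not None and len(crawled) < budget:
        crawled.append(target)                      # crawl the target page
        for candidate in SITE.get(target, []):      # obtain candidate ids
            if candidate not in seen:
                seen.add(candidate)
                # File the candidate into the set chosen by the relation.
                to_crawl[classify(target, candidate)].append(candidate)
        target = next_target(to_crawl, set_order)   # determine next target
    return crawled


# Configuration 1: a single set, which behaves like plain breadth-first order.
breadth_first = crawl("list/1", lambda crawled, cand: "all", ["all"], budget=5)

# Configuration 2: three sets, drained so that each record is finished first.
directed = crawl("list/1", relation, ["detail", "record", "listing"], budget=5)

print(breadth_first)
print(directed)
```

With a budget of five pages, the single-set configuration visits both items but reaches the reviews page of only one of them, while the three-set configuration fetches each item together with its reviews page before moving on. Changing only `classify` and `set_order` changes the direction of diffusion, which is the kind of configurable scheduling the abstract describes.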

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a technology for crawling pages.

Background

[0002] The current method for crawling web pages uses a random breadth-first strategy. As a result, directional crawling suffers from slow diffusion, poor control over the direction and speed of diffusion, and difficulty in reaching the desired pages within the desired time. For example, when crawling a vertical site whose data dimensions are spread across different pages, the crawled data can be seriously incomplete. Moreover, because the status of the current data crawl cannot be recorded during crawling, it is impossible to tell afterwards whether incomplete data reflects gaps in the source data itself or a page crawl that never finished.

Contents of the ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventors: 王江, 刘浩
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD