Dynamic webpage crawling method and device

A technology for webpage crawling and webpage applications in the Internet field, which addresses the problems of repeated webpage crawling and high crawling time complexity, and improves both the de-queuing and en-queuing efficiency of webpages and the overall crawling efficiency.

Inactive Publication Date: 2016-10-26
DATAGRAND TECH INC

AI Technical Summary

Problems solved by technology

[0004] To solve the technical problems faced by the existing webpage crawling process, namely repeated webpage crawling, high crawling time complexity, and the inability to crawl according to the priority of the webpages obtained during parsing, the pre

Embodiment Construction

[0046] The present invention will be described in further detail below through specific embodiments and in conjunction with the accompanying drawings.

[0047] Existing webpage crawling generally crawls the link library through a scheduling strategy, with crawling carried out according to the priority of the webpages in the webpage database. However, when the number of webpages reaches the million level, the crawler can only wait during each step of selecting the URL list to fetch, which wastes its crawling capacity.

[0048] To solve the above problems, the present invention provides a dynamic webpage crawling method. As shown in Figure 1, the method includes the following steps:

[0049] S101. Set up at least two queues, obtain the URLs and priorities of the webpages to be crawled and store them in the at least two queues, and perform scheduling according to the priorities of the URLs stored in the at least two queues....
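
Step S101 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `MultiQueueScheduler`, the `queue_index` parameter, and the use of binary heaps are assumptions made for illustration only. Each queue is a heap, so an enqueue or dequeue costs O(log n) in the size of that queue, and scheduling compares only the heads of the queues.

```python
import heapq
import itertools


class MultiQueueScheduler:
    """Hypothetical scheduler: stores (priority, URL) pairs in several heaps."""

    def __init__(self, num_queues=2):
        if num_queues < 2:
            raise ValueError("the method calls for at least two queues")
        self._heaps = [[] for _ in range(num_queues)]
        # Tie-breaker: equal priorities leave in insertion order.
        self._counter = itertools.count()

    def push(self, url, priority, queue_index=0):
        # Assumption: the caller decides which queue a URL belongs to.
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        entry = (-priority, next(self._counter), url)
        heapq.heappush(self._heaps[queue_index], entry)

    def pop(self):
        # Look only at the head of each non-empty queue, then pop the best one.
        candidates = [heap for heap in self._heaps if heap]
        if not candidates:
            return None
        best = min(candidates, key=lambda heap: heap[0])
        neg_priority, _, url = heapq.heappop(best)
        return url, -neg_priority

    def __len__(self):
        return sum(len(heap) for heap in self._heaps)
```

Under these assumptions, a caller pushes URLs into either queue and `pop()` always returns the highest-priority URL across all queues, or `None` when every queue is empty.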

Abstract

The invention discloses a dynamic webpage crawling method. The method comprises the following steps: setting up at least two queues, obtaining the URLs and priorities of the webpages to be crawled, storing them in the at least two queues, and scheduling according to the priorities of the URLs stored in the at least two queues; receiving the elements called from the at least two queues in order, to obtain the URLs of the elements to be parsed; and obtaining webpage content by parsing the URLs of the queue elements. The method has the following beneficial effects: the crawl-parsing process and the URLs of the link library can be scheduled simultaneously according to priority, so that webpages of higher priority are crawled first; scheduling at least two queues improves the de-queuing and en-queuing efficiency of webpages; and the time complexity is O(log N), so webpage crawling efficiency can be greatly improved.
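
The dequeue, fetch, and parse cycle summarized above might look roughly like the following Python sketch. Several things here are assumptions rather than details from the patent: one queue holds link-library URLs and the other holds URLs found during parsing, `fetch` and `extract_links` are hypothetical callables supplied by the caller, and a `seen` set is used to avoid crawling a URL twice.

```python
import heapq
import itertools


def crawl(library_urls, fetch, extract_links, max_pages=100):
    """Crawl higher-priority URLs first; library_urls yields (url, priority)."""
    counter = itertools.count()  # tie-breaker for equal priorities
    library_q, discovered_q = [], []  # assumed roles of the two queues
    seen = set()  # URLs already queued, so no page is crawled twice

    def enqueue(queue, url, priority):
        if url not in seen:
            seen.add(url)
            # Negate the priority because heapq pops the smallest entry.
            heapq.heappush(queue, (-priority, next(counter), url))

    for url, priority in library_urls:
        enqueue(library_q, url, priority)

    pages = {}
    while (library_q or discovered_q) and len(pages) < max_pages:
        # Pick whichever non-empty queue has the better head element.
        non_empty = [q for q in (library_q, discovered_q) if q]
        queue = min(non_empty, key=lambda q: q[0])
        _, _, url = heapq.heappop(queue)
        content = fetch(url)
        pages[url] = content
        # Links found while parsing join the schedule with their own priority.
        for link, priority in extract_links(content):
            enqueue(discovered_q, link, priority)
    return pages
```

Because links discovered during parsing are pushed onto a heap as soon as they are found, a high-priority link can be crawled ahead of lower-priority link-library URLs without rebuilding a sorted URL list.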

Description

Technical field

[0001] The invention belongs to the technical field of the Internet, and in particular relates to a dynamic webpage crawling method and device.

Background technique

[0002] An important problem faced by web crawlers when crawling massive numbers of Internet webpages is the repeated crawling of webpages. To avoid repeatedly crawling webpages with different targets, while still handling webpages that do need to be crawled repeatedly (such as fast-updating news and information webpages, regularly updated webpages, and webpages with real-time crawling requirements), webpages must be crawled according to their priority. For example, a focused crawler is a crawler system oriented to a specific topic. During the crawling process, URLs irrelevant to the topic are filtered out, and the URLs still to be crawled are scheduled. Because it is for a specific topic (video, news), there may be a large...

Application Information

IPC IPC(8): G06F17/30
CPC G06F16/951
Inventor 文辉
Owner DATAGRAND TECH INC