Webpage crawling method and device

A webpage crawling technology applied in the field of search engines. It addresses problems such as website response timeouts, website server crashes, and the inability to update or download web resources.

Active Publication Date: 2014-01-22
BEIJING QIHOO TECH CO LTD
Cites: 4; Cited by: 10

AI Technical Summary

Problems solved by technology

If crawling tasks exceed what the site host can tolerate, they interfere with normal access by the website's users. The crawler's behavior then becomes unfriendly to the website and, in severe cases, causes response timeouts or even server crashes.
Moreover, to protect their stability, websites often monitor crawler access and restrict or even block crawlers that behave in an unfriendly way.
Once a crawler is restricted or blocked, the search engine's crawling efficiency drops, and the website's page resources may not be updated or downloaded at all, which ultimately degrades the search service.
[0004] At the same time, in the prior art, the traffic or frequency at which a crawler may crawl a website is generally set manually. Although this reduces the conflict between the search engine's crawler and the crawled website, it does not fully take into account the need to update web page data, so the crawling behavior of the crawler and the demand for website data updates are not reasonably balanced.

Examples

Embodiment 1

[0119] Figure 1 is a flowchart of the webpage crawling method provided by this embodiment of the present invention. As shown in the figure, the method may include the following steps:

[0120] S110: Obtain a dynamic traffic quota value for web crawling on the target website;

[0121] When a crawler crawls the web pages of a target website, unlimited crawling of the same website would interfere with normal visits to it, so the crawler's traffic or frequency on the target website usually has to be limited to some extent. The dynamic traffic quota value is such a limit on the crawler's crawling traffic on the target website. It can be understood as the traffic limit for crawling the same website within a unit of ti...
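The idea of a per-unit-time traffic quota can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `QuotaWindow`, `update_quota`, and `crawl_within_quota` are hypothetical, the quota is assumed to be counted in page fetches, and the unit of time is assumed to be a fixed window that resets once it has elapsed.

```python
import time
from typing import Callable, Iterable, List, Tuple


class QuotaWindow:
    """Counts fetches against a quota that applies per fixed unit of time.

    Assumption: the quota is a number of page fetches per window. The
    excerpt does not say how traffic is measured.
    """

    def __init__(self, quota: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.quota = quota
        self.window_seconds = window_seconds
        self.clock = clock
        self._window_start = clock()
        self._used = 0

    def update_quota(self, quota: int) -> None:
        # The quota is "dynamic": it may be replaced while crawling runs.
        self.quota = quota

    def try_acquire(self) -> bool:
        """Return True if one more fetch fits in the current window."""
        now = self.clock()
        if now - self._window_start >= self.window_seconds:
            # A new unit of time has started, so the counter resets.
            self._window_start = now
            self._used = 0
        if self._used < self.quota:
            self._used += 1
            return True
        return False


def crawl_within_quota(urls: Iterable[str], window: QuotaWindow,
                       fetch: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Fetch URLs while quota remains; defer the rest to a later window."""
    fetched: List[str] = []
    deferred: List[str] = []
    for url in urls:
        if window.try_acquire():
            fetch(url)
            fetched.append(url)
        else:
            deferred.append(url)
    return fetched, deferred
```

Deferring rather than dropping the over-quota URLs is a design choice of this sketch; it lets the caller retry them once the next window opens or the quota is raised.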

Embodiment 2

[0136] Figure 2 is a flowchart of the method for determining a website crawling traffic quota provided in Embodiment 2 of the present invention. As shown in the figure, the method may include the following steps:

[0137] S210: Obtain the visit data of the target website to be crawled;

[0138] First, the visit data of the target website to be crawled is obtained. This data can be the website's click volume for a given day, such as parameter C in Table 1. Once the visit data has been obtained, the visit tolerance of the target website can be deduced from it.
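The step from daily click volume to a visit tolerance can be sketched as follows. The excerpt does not give the formula, so everything here is an assumption made for illustration: the function names `visit_tolerance` and `quota_per_window` are hypothetical, the tolerance is assumed to be a fixed fraction (`crawler_share`) of the daily click volume C, and the daily tolerance is assumed to be spread evenly across equal time windows.

```python
import math


def visit_tolerance(daily_clicks: int, crawler_share: float = 0.05) -> int:
    """Estimate how many crawler visits per day a site can tolerate.

    Assumption: tolerance is a fixed share of the daily click volume C.
    The 5% default is illustrative only and does not come from the patent.
    """
    if daily_clicks < 0:
        raise ValueError("daily_clicks must be non-negative")
    if not 0.0 <= crawler_share <= 1.0:
        raise ValueError("crawler_share must be between 0 and 1")
    return math.floor(daily_clicks * crawler_share)


def quota_per_window(daily_tolerance: int, windows_per_day: int = 24) -> int:
    """Spread the daily tolerance evenly over equal windows (assumed hourly)."""
    if windows_per_day <= 0:
        raise ValueError("windows_per_day must be positive")
    return daily_tolerance // windows_per_day
```

For example, under these assumptions a site with 1,000 clicks per day and a 25% share would tolerate 250 crawler visits per day.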

[0139] The visit data of the target website can be obtained from various sources; for example, it can be obtained from published...

Embodiment 3

[0193] Figure 3 is a flowchart of the method for determining crawling traffic provided in Embodiment 3 of the present invention. As shown in the figure, the method may include the following steps:

[0194] S310: Obtain a task scaling factor according to the attribute characteristics of the target website;

[0195] S320: Based on the task scaling factor and the sum of webpage quality distributions in the target website, determine the task traffic of crawling the target website.

[0196] The task scaling factor can be the ratio of the number of webpages to be crawled to the total number of webpages in the target website, and/or the ratio of the number of non-duplicate webpages in the target website to the total number of webpages. Obtaining the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target w...
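Steps S310 and S320 can be illustrated with a short Python sketch. The excerpt states which inputs determine the task traffic but not how they are combined, so multiplying the scaling factor by the sum of the quality distribution is an assumption of this sketch. The function names are hypothetical, and the quality distribution is assumed to be a list of page counts per quality tier.

```python
from typing import Sequence


def task_scaling_factor(pages_to_crawl: int, total_pages: int) -> float:
    """S310: ratio of webpages to be crawled to all webpages on the site."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if not 0 <= pages_to_crawl <= total_pages:
        raise ValueError("pages_to_crawl must lie between 0 and total_pages")
    return pages_to_crawl / total_pages


def task_traffic(scaling_factor: float,
                 quality_distribution: Sequence[float]) -> float:
    """S320: combine the factor with the summed quality distribution.

    Assumption: the combination is a simple product. The excerpt names
    the two inputs but does not show the formula.
    """
    return scaling_factor * sum(quality_distribution)
```

The same ratio function could be applied to the number of non-duplicate webpages, which is the other scaling factor the paragraph mentions.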

Abstract

The invention discloses a webpage crawling method and device. The method comprises obtaining a dynamic traffic quota value for webpage crawling on a target website and crawling the webpages of the website according to that value. With this method, when a search engine's crawler program crawls the webpages of a website, the conflict between the crawler program and the crawled website is reduced, and the crawling behavior of the crawler program and the update needs of the search engine are reasonably balanced.

Description

Technical field

[0001] The invention relates to the technical field of search engines, in particular to a method and device for webpage crawling.

Background

[0002] A search engine is an Internet information platform. A search engine can collect a large amount of webpage information from the Internet and, after processing it, build an information database and an index database. Users can enter query terms in the entry provided by the search engine and thereby obtain the search results it returns for those terms. As search engine technology continues to develop and mature, the services it provides are becoming more and more complete. Search engines have become a very common and convenient tool for obtaining information from the large-scale Internet.

[0003] In order to be able to download webpages on the Internet for analyzing webpage data and building indexes, search engines often need ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/957
Inventor: 魏少俊
Owner: BEIJING QIHOO TECH CO LTD