
Method and device for determining crawl traffic

A crawl traffic determination technology, applied in the field of search engines, that addresses problems such as website server crashes, crawler behavior that is unfriendly to websites, and the lack of a reasonable balance between crawler behavior and website data update requirements, with the effect of reducing conflicts between the crawler and the crawled site.

Inactive Publication Date: 2018-04-24
BEIJING QIHOO TECH CO LTD +1
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

If a webpage crawling task exceeds what the site's host can tolerate, it interferes with normal access by the website's users. The crawler's behavior then becomes unfriendly to the website and, in severe cases, causes response timeouts or even crashes the website's server.
Moreover, to protect their stability, websites often monitor crawler access and restrict or even block crawlers that behave in an unfriendly way.
Once a crawler is restricted or blocked, the search engine's crawling efficiency drops, and the website's pages may not be updated or downloaded at all, which ultimately degrades the search service.
[0004] At the same time, in the prior art, the traffic or frequency at which a crawler may crawl a website is generally set manually. Although this reduces the conflict between the search engine's crawler and the crawled website, it does not fully take into account the need to update webpage data, so the crawler's behavior and the demand for website data updates are not reasonably balanced.



Examples


Embodiment 1

[0071] Figure 1 is a flowchart of the webpage crawling method provided by an embodiment of the present invention. As shown in the figure, the method may include the following steps:

[0072] S110: Obtain a dynamic traffic quota value for web crawling on the target website;

[0073] While a crawler crawls the webpages of the target website, its crawling traffic or frequency on that site usually has to be limited to some extent, so that unrestricted crawling of the same website does not interfere with normal access to it. The dynamic traffic quota value is such a limit on the crawler's traffic on the target website. It can be understood as the traffic quota for crawling the same website within a unit of t...
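As an illustration of how a per-site quota per unit of time could be enforced, here is a minimal Python sketch. It is not the patent's implementation: the class name `CrawlQuota`, the fixed-window counting scheme, and the `set_quota` hook for dynamic updates are all assumptions made for this example.

```python
import time
from typing import Callable


class CrawlQuota:
    """Fixed-window request quota for one target website (hypothetical helper)."""

    def __init__(
        self,
        quota_per_window: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._quota = quota_per_window
        self._window = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._used = 0

    def set_quota(self, quota_per_window: int) -> None:
        # The quota is "dynamic": it can be replaced while the crawler runs.
        self._quota = quota_per_window

    def try_acquire(self, requests: int = 1) -> bool:
        """Return True if `requests` more fetches fit in the current window."""
        now = self._clock()
        if now - self._window_start >= self._window:
            # A new unit of time has started, so the counter is reset.
            self._window_start = now
            self._used = 0
        if self._used + requests > self._quota:
            return False
        self._used += requests
        return True
```

A crawler loop would call `try_acquire()` before each fetch of the target website and defer the fetch when it returns `False`. A fixed window is the simplest scheme; a sliding window or token bucket would smooth bursts at window boundaries.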

Embodiment 2

[0088] Figure 2 is a flowchart of the method for determining a website crawling traffic quota provided in Embodiment 2 of the present invention. As shown in the figure, the method may include the following steps:

[0089] S210: Obtain the visit data of the target website to be crawled;

[0090] First, the visit data of the target website to be crawled is obtained. This can be the website's click volume for a given day, such as parameter C in Table 1. Once the visit data has been obtained, the visit tolerance of the target website can be derived from it.

[0091] The visit data of the target website can be obtained from various sources; for example, it can be obtained from published...
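The excerpt does not give the formula that turns visit data into a visit tolerance, so the Python sketch below only illustrates the general idea under a loudly stated assumption: that the tolerance is a fixed share of the daily click volume. The function name `visit_tolerance` and the `crawler_share` parameter are hypothetical and do not come from the patent.

```python
def visit_tolerance(daily_clicks: int, crawler_share: float = 0.05) -> int:
    """Estimate how many crawler requests per day a site can tolerate.

    ASSUMPTION: tolerance = daily click volume * crawler_share.
    The patent text shown here does not specify the actual relationship.
    """
    if daily_clicks < 0:
        raise ValueError("daily_clicks must be non-negative")
    if not 0.0 <= crawler_share <= 1.0:
        raise ValueError("crawler_share must be between 0 and 1")
    return int(daily_clicks * crawler_share)
```

For example, `visit_tolerance(1000, crawler_share=0.25)` yields a tolerance of 250 requests per day under this assumed rule.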

Embodiment 3

[0145] Figure 3 is a flowchart of the method for determining crawl traffic provided in Embodiment 3 of the present invention. As shown in the figure, the method may include the following steps:

[0146] S310: Obtain a task scaling factor according to the attribute characteristics of the target website;

[0147] S320: Based on the task scaling factor and the sum of webpage quality distributions in the target website, determine the task traffic of crawling the target website.

[0148] Here, the task scaling factor can be the ratio of the number of webpages to be crawled to the total number of webpages in the target website, and/or the ratio of the number of non-duplicate webpages in the target website to the total number of webpages. Obtaining the ratio of the number of webpages to be crawled in the target website to the total number of webpages in the target w...
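Steps S310 and S320 can be sketched in Python as follows. This is a hedged illustration, not the patent's formula: the excerpt names the inputs but not how they are combined, so multiplying the scaling factor by the sum of per-page quality scores is an assumption, as is multiplying the two ratios when both are used. The names `SiteStats`, `task_scaling_factor`, and `task_traffic` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SiteStats:
    """Attribute characteristics of a target website (hypothetical structure)."""
    total_pages: int
    pages_to_crawl: int
    unique_pages: int  # non-duplicate pages


def task_scaling_factor(
    stats: SiteStats, use_pending: bool = True, use_unique: bool = True
) -> float:
    """S310: derive a scaling factor from one or both page ratios."""
    if stats.total_pages <= 0:
        raise ValueError("total_pages must be positive")
    factor = 1.0
    if use_pending:
        # Ratio of pages still to be crawled to all pages on the site.
        factor *= stats.pages_to_crawl / stats.total_pages
    if use_unique:
        # Ratio of non-duplicate pages to all pages on the site.
        factor *= stats.unique_pages / stats.total_pages
    return factor


def task_traffic(
    stats: SiteStats,
    quality_scores: Sequence[float],
    use_pending: bool = True,
    use_unique: bool = True,
) -> float:
    """S320: ASSUMED combination, scaling factor times the quality sum."""
    factor = task_scaling_factor(stats, use_pending, use_unique)
    return factor * sum(quality_scores)
```

With `SiteStats(total_pages=1000, pages_to_crawl=250, unique_pages=800)`, the two ratios are 0.25 and 0.8. Exposing each ratio behind a flag mirrors the "and/or" wording of paragraph [0148].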


Abstract

The invention discloses a method and device for determining crawl traffic. The method comprises obtaining a task scaling factor according to the attribute characteristics of a target website, and determining the task traffic for crawling the target website according to the task scaling factor and the sum of the webpage quality distribution in the target website. With this method, while a search engine's crawler crawls the webpages of a website, the crawl traffic that a task needs on the target website can be better determined, the conflict between the crawler and the crawled site is reduced, and a reasonable balance is achieved between crawler behavior and the search engine's update requirements.

Description

Technical field

[0001] The invention relates to the technical field of search engines, and in particular to a method and device for determining crawl traffic.

Background

[0002] A search engine is an Internet information platform. It collects a large amount of webpage information from the Internet and, after processing, builds an information database and an index database. Users enter query terms in the entry point provided by the search engine and obtain the search results it returns for those terms. As search engine technology has developed and matured, the services it provides have steadily improved. Search engines have become a very common and convenient tool for finding the required information on the large-scale Internet.

[0003] In order to be able to download webpages on the Internet for analyzing webpage data and building indexes, sear...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/957
Inventor: 魏少俊
Owner: BEIJING QIHOO TECH CO LTD