Webpage data crawling method, device, equipment and medium

A webpage data crawling technology, applied in the computer field. It addresses the problems of blocked IP addresses, the difficulty of ensuring the reliability and efficiency of webpage data crawling, and the waste of IP resources in the IP address pool. Its effects are to help ensure overall efficiency and reliability and to avoid that waste.

Pending Publication Date: 2019-06-28
SANGFOR TECH INC

AI Technical Summary

Problems solved by technology

[0004] However, because websites currently have relatively strict security mechanisms, a server that crawls a target website too frequently from the same IP address will often trigger the target website's risk control mechanism. The IP address is then blocked and can no longer be used to crawl data from the target website. As a result, IP resources in the IP address pool may be wasted during webpage data crawling, and it is difficult to guarantee the overall reliability and efficiency of the crawling process.
[0005] Providing a webpage data crawling method that reduces the waste of IP resources in the IP address pool and helps ensure the overall reliability and efficiency of the crawling process is therefore a problem to be solved by those skilled in the art.



Embodiment Construction

[0045] The following will clearly and completely describe the technical solutions in the embodiments of the present application in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

[0046] Currently, websites have relatively strict security mechanisms. When a server crawls a target website too frequently from the same IP address, it will often trigger the target website's risk control mechanism, which causes the IP address to be blocked so that it can no longer be used to crawl data on the target website. Therefore, in the current web page data crawli...


Abstract

The invention discloses a webpage data crawling method, device, equipment and medium. The method comprises the following steps: a crawling frequency threshold that triggers a target website to initiate an IP blocking operation is obtained; the crawler programs in a plurality of agent (proxy) nodes are each controlled to crawl the webpage data in the target website collaboratively, at a crawling frequency below that threshold, wherein the IP addresses of the agent nodes are different from one another. Because no node reaches the blocking threshold, the method helps ensure the reliability of the webpage data crawling process and avoids wasting IP resources. In addition, because the agent nodes crawl the target website cooperatively, the overall efficiency of the crawling process is relatively ensured. The invention further provides a webpage data crawling device, webpage data crawling equipment and a medium, with the same beneficial effects as described above.
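The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `AgentNode`, `plan_crawl`, `schedule` and the `safety` factor are hypothetical, the threshold is assumed to be expressed in requests per minute per IP, and URLs are assumed to be split round-robin. No network requests are made; the code only plans which node fetches which URL and when.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AgentNode:
    """Hypothetical agent (proxy) node identified by its own IP address."""
    ip: str
    urls: List[str] = field(default_factory=list)


def plan_crawl(urls: List[str], node_ips: List[str],
               threshold_per_min: float,
               safety: float = 0.8) -> Tuple[List[AgentNode], float]:
    """Split URLs across nodes and pick a per-node rate below the threshold.

    `safety` is an assumed margin (0 < safety < 1) that keeps each node's
    crawling frequency strictly under the blocking threshold.
    """
    if not node_ips:
        raise ValueError("at least one agent node is required")
    if len(set(node_ips)) != len(node_ips):
        raise ValueError("agent nodes must have distinct IP addresses")
    if threshold_per_min <= 0 or not 0 < safety < 1:
        raise ValueError("threshold must be positive and 0 < safety < 1")

    rate_per_min = threshold_per_min * safety
    nodes = [AgentNode(ip) for ip in node_ips]
    # Round-robin assignment: an assumed, simple way to share the work.
    for i, url in enumerate(urls):
        nodes[i % len(nodes)].urls.append(url)
    return nodes, rate_per_min


def schedule(nodes: List[AgentNode],
             rate_per_min: float) -> List[Tuple[float, str, str]]:
    """Return (seconds_from_start, node_ip, url) tuples, ordered by time."""
    interval = 60.0 / rate_per_min  # minimum spacing between requests per node
    plan = []
    for node in nodes:
        for k, url in enumerate(node.urls):
            plan.append((k * interval, node.ip, url))
    return sorted(plan)


if __name__ == "__main__":
    pages = [f"https://example.com/page/{i}" for i in range(5)]
    nodes, rate = plan_crawl(pages, ["10.0.0.1", "10.0.0.2"],
                             threshold_per_min=60)
    for when, ip, url in schedule(nodes, rate):
        print(f"t={when:6.2f}s  {ip}  {url}")
```

Each node stays under the per-IP threshold on its own, while the combined throughput grows with the number of nodes, which is the trade-off the abstract describes.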

Description

Technical field

[0001] This application relates to the field of computer technology, and in particular to a method, device, equipment and medium for crawling webpage data.

Background technique

[0002] With the advent of the era of big data, companies need to analyze user behavior, deficiencies in their products, or competitors' information based on data. The primary condition for all of these is data collection, so the crawler technology responsible for data collection has become an indispensable technical means in this era.

[0003] At present, a large amount of valuable information on the Internet needs to be crawled from website pages to a local server for analysis by means of crawlers. For example, a crawler script can be run on the server to log in to the website to be crawled by entering an account and password, and then crawl the data stored in the relevant pages of that website.

[0004] However, due to the relatively high security mechanisms...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/951
Inventor: 王立明
Owner: SANGFOR TECH INC