Distributed crawler method, electronic equipment and server

A distributed crawler method and server technology, applied in the field of web crawlers, that addresses the high cost, low efficiency, and persistent risk of being blocked and intercepted in existing approaches, and achieves efficient crawling that avoids interception.

Active Publication Date: 2018-05-15
LENOVO (BEIJING) LTD

AI Technical Summary

Problems solved by technology

[0003] A traditional crawler is a program written to continuously traverse a specified website, search for relevant pages, and record or store the data in its own database. Such a crawler is easily detected: the website's operations staff and administrators can analyze the volume of requests and the associated user agent (useragent), then block and intercept it directly.
Building on traditional crawlers, crawling through IP proxies while constantly changing the IP address and disguising the useragent only reduces the probability of being blocked to a certain extent. Website operations staff and administrators can still discover and block such a crawler by limiting the request frequency of a given IP within a given period and by checking through the hostname whether the IP address is disguised. Setting an IP proxy therefore cannot effectively avoid the risk of being blocked and intercepted, and it is inefficient and costly.
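
The per-IP request frequency limit mentioned above can be illustrated with a minimal sketch. It assumes a sliding-window policy; the class name `IpRateLimiter`, its parameters, and the example addresses are hypothetical and are not taken from the patent, which does not specify how a website implements the limit.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical per-IP sliding-window limiter, as a website might deploy."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        """Return True if a request from `ip` at time `now` is within the limit."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this IP has exhausted its quota for the window
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = IpRateLimiter(max_requests=3, window_seconds=60.0)
    # One crawler IP sending five requests in quick succession.
    print([limiter.allow("10.0.0.1", float(t)) for t in range(5)])
```

Under such a policy a crawler that sends all of its requests from one address is cut off quickly, and rotating through a small pool of proxy addresses only raises the threshold by the size of the pool.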


Examples

Embodiment 1

[0049] This embodiment of the present invention provides a distributed crawler method. A crawler is a program or script that automatically captures information on the World Wide Web according to certain rules; the crawler itself may be a terminal running a crawling program, or the crawling program itself, and no limitation is made here. The crawler in this embodiment can prevent malicious blocking programs from blocking its page-crawling operations. As shown in Figure 1 and in combination with Figure 3, the method includes the following steps:

[0050] S1. When browsing page 1 is accessed, trigger access to the crawling page corresponding to the crawling page address configured in browsing page 1. The user can use a terminal such as a computer to access the browsing page 1 to be viewed, for example by using a computer to access a first site and view browsing page 1 in that site, such as a news page or entertainment page in t...
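
The flow of this embodiment (trigger access to the crawling page, acquire its target data, upload the data to the server, as summarized in the abstract) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the names `BrowsingPage` and `on_browsing_page_accessed`, and the injected `fetch`, `extract` and `upload` callables, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BrowsingPage:
    """A page the user visits normally; it carries a configured crawl address."""
    url: str
    crawl_page_address: str


def on_browsing_page_accessed(
    page: BrowsingPage,
    fetch: Callable[[str], str],
    extract: Callable[[str], dict],
    upload: Callable[[dict], None],
) -> dict:
    """Run the three steps once for a single visit to `page`."""
    # Trigger access to the crawling page configured in the browsing page.
    crawled_content = fetch(page.crawl_page_address)
    # Acquire the target data of the crawling page.
    target_data = extract(crawled_content)
    # Upload the target data to the server.
    upload(target_data)
    return target_data
```

Passing `fetch`, `extract` and `upload` in as parameters keeps the sketch independent of any particular HTTP client or server API, none of which the visible text specifies.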

Embodiment 2

[0057] This embodiment of the present invention provides a distributed crawler method. A crawler is a program or script that automatically captures information on the World Wide Web according to certain rules; the crawler itself may be a terminal running a crawling program, or the crawling program itself, and no limitation is made here. As shown in Figure 2 and in combination with Figure 3, the method includes the following steps:

[0058] S4. Configure the crawling page address in browsing page 1, wherein, when browsing page 1 is accessed by a terminal, the terminal accesses the crawling page corresponding to the crawling page address and obtains the target data of the crawling page. In one embodiment, a server (which may be server 3) can configure the crawling page address in browsing page 1 over the network, so as to control which pages the distributed crawlers need to crawl, such as modifying the preset program set in the crawling page modif...
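
A minimal sketch of the server-side configuration in S4, assuming the server simply keeps a mapping from each browsing page to the crawling page address it currently embeds. The class `CrawlConfigServer` and its methods are hypothetical; the patent text shown here does not describe the server's data model.

```python
from typing import Dict, Optional


class CrawlConfigServer:
    """Hypothetical server-side store of the crawl address for each browsing page."""

    def __init__(self) -> None:
        self._addresses: Dict[str, str] = {}

    def configure(self, browsing_page_url: str, crawl_page_address: str) -> None:
        """Set or replace the crawl address embedded in a browsing page."""
        self._addresses[browsing_page_url] = crawl_page_address

    def address_for(self, browsing_page_url: str) -> Optional[str]:
        """Return the crawl address a visiting terminal should access, if any."""
        return self._addresses.get(browsing_page_url)
```

Reconfiguring the address for a browsing page changes what every subsequent visitor crawls, which is how the server steers the distributed crawl without contacting the terminals directly.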

Embodiment 3

[0064] The present invention provides a distributed crawler device. A crawler is a program or script that automatically captures information on the World Wide Web according to certain rules; the crawler itself may be a terminal running a crawling program, or the crawling program itself, and no limitation is made here. The device includes a triggering module, a first acquisition module and a communication module.

[0065] The triggering module is configured to trigger access to the crawling page corresponding to the crawling page address configured in browsing page 1 when browsing page 1 is accessed. The user can use a terminal such as a computer to access the browsing page 1 to be viewed, for example by using a computer to access a first site and view browsing page 1 in that site, such as a news page or entertainment page in the site. In one embodiment, when the user visits browsing page 1, the triggering module can automatically trigger ...
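
The three-module decomposition can be sketched as cooperating classes. Only the triggering module's role is described in the text above; the responsibilities given here to the first acquisition module (obtaining the target data) and the communication module (uploading it) are assumptions inferred from the abstract, and all class and method names are hypothetical.

```python
from typing import Callable, List


class TriggerModule:
    """Yields the configured crawl address when the browsing page is accessed."""

    def __init__(self, crawl_page_address: str) -> None:
        self.crawl_page_address = crawl_page_address

    def on_browsing_page_accessed(self) -> str:
        return self.crawl_page_address


class FirstAcquisitionModule:
    """Assumed role: obtain the target data of the crawling page."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch

    def acquire(self, address: str) -> str:
        return self._fetch(address)


class CommunicationModule:
    """Assumed role: upload target data; here it only records what was sent."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def upload(self, target_data: str) -> None:
        self.sent.append(target_data)


class CrawlerDevice:
    """Wires the three modules together for one visit to the browsing page."""

    def __init__(self, trigger: TriggerModule,
                 acquisition: FirstAcquisitionModule,
                 communication: CommunicationModule) -> None:
        self.trigger = trigger
        self.acquisition = acquisition
        self.communication = communication

    def handle_visit(self) -> str:
        address = self.trigger.on_browsing_page_accessed()
        target_data = self.acquisition.acquire(address)
        self.communication.upload(target_data)
        return target_data
```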

Abstract

The invention discloses a distributed crawler method, electronic equipment and a server. The method comprises the steps of: when a browsing page is accessed, triggering access to the crawling page corresponding to the crawling page address configured in that browsing page; acquiring target data of the crawling page; and uploading the target data to the server. With this distributed data crawling method, a large number of ordinary users capture data from another web page simply by normally accessing an ordinary web page. Because each ordinary user has an independent and different IP address, interception of the crawling behavior by anti-crawler strategies is effectively avoided, and crawling is more efficient and convenient.

Description

Technical field

[0001] The invention relates to a crawler method, in particular to a distributed crawler method, electronic equipment and a server.

Background

[0002] At present, with the development of networks and the advent of the big-data era, searching, applying and collecting the large amount of information on the Internet has become an important technology and challenge. Web crawlers have emerged in response; a web crawler is a program or method for automatically extracting web pages and is an important part of downloading data from the Internet.

[0003] A traditional crawler is a program written to continuously traverse a specified website, search for relevant pages, and record or store the data in its own database, but such a crawler is easily analyzed by the website's operations staff and administrators, who discover the amount of website requests and relate...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 李栋
Owner: LENOVO (BEIJING) LTD