Method and system for grabbing web pages from servers with different IP addresses in a website

A server and web crawling technology, applied in transmission systems, instruments, and computing. It addresses problems such as low collection efficiency, crawling failure, and denial of service, and improves fault tolerance and crawling efficiency.

Active Publication Date: 2012-05-23
NEW FOUNDER HLDG DEV LLC +2

AI Technical Summary

Problems solved by technology

For a client, if multiple access requests it sends are assigned to the same server of the website, the server's restrictions may cause the requests to be denied, or the client's IP may even be blocked.
[0003] Existing crawler systems generally control the access strategy at the website level. Because the number of concurrent visits to a website is limited, collection efficiency is low. If the number of worker threads crawling web pages is increased, the website's access restrictions are easily triggered, resulting in failed crawls or a blocked IP.


Examples


Embodiment Construction

[0016] The present invention will be described in detail below in conjunction with specific embodiments and accompanying drawings.

[0017] Figure 1 shows the structure of the system for grabbing web pages from multiple servers with different IPs in a website according to the present invention. As shown in Figure 1, the system includes a distributing device 11, a judging device 12 connected to the distributing device 11, and a grabbing device 13 connected to the judging device 12.
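
The three connected devices can be pictured as a small pipeline. The following is a minimal sketch under stated assumptions, not the patent's implementation: the names CrawlTask and run_pipeline are hypothetical, and each device is modelled as a plain callable supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CrawlTask:
    url: str                   # address of the web page to be grabbed
    ip: Optional[str] = None   # filled in by the distributing stage


def run_pipeline(
    task: CrawlTask,
    allocate: Callable[[CrawlTask], str],
    is_polite: Callable[[CrawlTask], bool],
    grab: Callable[[CrawlTask], str],
) -> Optional[str]:
    """Pass one task through distribute -> judge -> grab."""
    task.ip = allocate(task)      # role of distributing device 11
    if not is_polite(task):       # role of judging device 12
        return None               # hypothetical choice: caller may retry later
    return grab(task)             # role of grabbing device 13
```

Passing the stages in as callables keeps the sketch independent of how IPs are resolved or pages are downloaded; what happens to a task that fails the check is not stated in the text above, so returning None is only a placeholder.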

[0018] The distributing device 11 allocates an IP of a target website server to the client's web page crawling task. The crawling task includes the URL (web page address) of the web page to be grabbed; the target website is the website where that web page is located.
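
As an illustration of allocating a server IP to a crawling task, the sketch below extracts the host from the task's URL and hands out that host's IPs in turn. The class name IpAllocator, the injected resolve function, and the round-robin order are all assumptions made for the example; the text above does not say how the device chooses among IPs.

```python
from itertools import cycle
from urllib.parse import urlsplit


class IpAllocator:
    """Hands out server IPs for a URL's host, one after another."""

    def __init__(self, resolve):
        self._resolve = resolve   # assumed callable: hostname -> list of IP strings
        self._cycles = {}         # hostname -> endless iterator over its IPs

    def allocate(self, url):
        host = urlsplit(url).hostname
        if host is None:
            raise ValueError(f"no host in URL: {url!r}")
        if host not in self._cycles:
            ips = list(self._resolve(host))
            if not ips:
                raise LookupError(f"no IPs found for {host}")
            self._cycles[host] = cycle(ips)
        # Round-robin is an illustrative policy, not the patent's stated one.
        return next(self._cycles[host])
```

In practice the resolve callable could be built on the standard library's socket.getaddrinfo; it is injected here so the example runs without network access.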

[0019] The judging device 12 judges whether the web page crawling task meets the polite access conditions of the server. The polite access conditions include the followin...
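
The list of polite access conditions is cut off in this copy, so the following sketch does not reproduce it. It only illustrates what a per-IP check might look like, using two assumed conditions: a cap on concurrent requests per server IP (the background section mentions concurrency limits) and a minimum interval between requests, which is purely hypothetical. The name PoliteGate and all parameter values are invented for the example.

```python
import threading
import time


class PoliteGate:
    """Per-IP admission check with two ASSUMED conditions (see lead-in)."""

    def __init__(self, max_concurrent=2, min_interval=1.0, clock=time.monotonic):
        self.max_concurrent = max_concurrent   # assumed limit per server IP
        self.min_interval = min_interval       # assumed seconds between starts
        self._clock = clock
        self._lock = threading.Lock()
        self._active = {}       # ip -> number of requests in flight
        self._last_start = {}   # ip -> time the last request started

    def try_acquire(self, ip):
        """Return True and record the request if it may start now."""
        with self._lock:
            now = self._clock()
            if self._active.get(ip, 0) >= self.max_concurrent:
                return False
            last = self._last_start.get(ip)
            if last is not None and now - last < self.min_interval:
                return False
            self._active[ip] = self._active.get(ip, 0) + 1
            self._last_start[ip] = now
            return True

    def release(self, ip):
        """Mark one request to this IP as finished."""
        with self._lock:
            if self._active.get(ip, 0) > 0:
                self._active[ip] -= 1
```

Because the state is keyed by IP rather than by website, a worker thread refused for one server can still be admitted for another server of the same site.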


Abstract

The invention discloses a method and a system for grabbing web pages from servers with different IPs in a website. The method comprises the following steps: allocating an IP of a target website server to a client's web page grabbing task, where the task includes the address of the web page to be grabbed; then judging whether the task meets the polite access condition of the server; and, if so, using the IP to establish a connection with the server and grabbing the web page at that address from the server. Because the access strategy operates at the IP level, the collection worker threads can be controlled more conveniently to access the website politely. By caching DNS (Domain Name System) results, using several IPs at the same time, and preferentially allocating the fastest IP, the efficiency of grabbing web pages is greatly improved. When individual servers of the target website cannot be accessed, the crawler can switch to servers with other IPs in time, which improves fault tolerance.
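
The three efficiency measures named in the abstract (cached DNS results, several IPs in use, fastest IP preferred, with switching when a server fails) can be sketched together. This is an illustration under assumptions, not the patented method: IpPool, fetch_with_failover, the 300-second cache lifetime, and the use of the last observed response time as the speed measure are all hypothetical.

```python
import time


class IpPool:
    def __init__(self, resolve, ttl=300.0, clock=time.monotonic):
        self._resolve = resolve   # assumed callable: hostname -> list of IPs
        self._ttl = ttl           # assumed cache lifetime in seconds
        self._clock = clock
        self._cache = {}          # host -> (expiry time, list of IPs)
        self._latency = {}        # ip -> last observed response time (seconds)
        self._down = set()        # IPs that failed most recently

    def ips_for(self, host):
        """Return the host's IPs, resolving only when the cache is stale."""
        now = self._clock()
        entry = self._cache.get(host)
        if entry is None or entry[0] <= now:
            entry = (now + self._ttl, list(self._resolve(host)))
            self._cache[host] = entry
        return entry[1]

    def record_success(self, ip, seconds):
        self._latency[ip] = seconds
        self._down.discard(ip)

    def record_failure(self, ip):
        self._down.add(ip)

    def candidates(self, host):
        """Reachable IPs, fastest first; unmeasured IPs come last."""
        live = [ip for ip in self.ips_for(host) if ip not in self._down]
        return sorted(live, key=lambda ip: self._latency.get(ip, float("inf")))


def fetch_with_failover(pool, host, fetch):
    """Try each candidate IP in order and switch to the next on failure."""
    for ip in pool.candidates(host):
        start = time.monotonic()
        try:
            body = fetch(ip)          # assumed callable: ip -> page content
        except OSError:
            pool.record_failure(ip)
            continue
        pool.record_success(ip, time.monotonic() - start)
        return ip, body
    raise ConnectionError(f"no reachable server for {host}")
```

One simplification to note: an IP marked as down stays excluded until record_success is called for it again, so a real crawler would need some way to re-probe failed servers.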

Description

Technical field

[0001] The invention relates to a method and system for grabbing web pages from a website, in particular to a method and system for grabbing web pages from multiple servers with different IPs in the website.

Background

[0002] With the rapid development of the Internet, the scale of information on the Internet is increasing, and the number of website visits is also increasing. To meet current access demand, most websites with a large amount of information or a large number of visits provide multiple servers with different IPs (Internet Protocol, a protocol for interconnection between networks). An intelligent DNS (Domain Name System) server returns the list of server IPs in a different order according to its load-balancing strategy, and the client accesses the first server in the list, so that users' access requests are distributed to different servers. In order to prevent the server from being under too much pres...
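
The load-balancing behaviour described in the background, where the DNS server varies the order of the IP list and each client uses the first entry, can be simulated in a few lines. This is a toy model: the class name RotatingDns is hypothetical, and simple rotation stands in for whatever load-balancing strategy a real intelligent DNS server applies.

```python
from collections import deque


class RotatingDns:
    """Toy DNS server that rotates each host's IP list after every query."""

    def __init__(self, records):
        self._records = {host: deque(ips) for host, ips in records.items()}

    def query(self, host):
        ips = self._records[host]
        answer = list(ips)   # the order seen by this client
        ips.rotate(-1)       # the next client sees the next server first
        return answer


dns = RotatingDns({"www.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]})
# Each simulated client connects to the first IP in the answer it receives.
first_choices = [dns.query("www.example.com")[0] for _ in range(4)]
```

Successive clients are spread across the three servers in turn, while any single answer still contains every IP of the website.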


Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/08, H04L29/12, G06F17/30
Inventor: 李湘军, 于晓明, 杨建武, 吴新丽
Owner: NEW FOUNDER HLDG DEV LLC