Network crawling method, terminal and storage medium

A web crawling technology, applied in the field of web crawlers, that addresses the limit on how many times or how often the same proxy IP can be used for crawling, and achieves the effect of avoiding waste.

Active Publication Date: 2018-09-18
PING AN TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0005] In view of the above, it is necessary to propose a web crawler method, terminal and storage medium that combine depth information, construct a proxy IP pool, and select proxy IPs from the proxy IP pool for crawling according to preset selection rules or strategies. This effectively solves the problem that the same proxy IP is limited in the number or frequency of crawls.



Examples


Embodiment 1

[0053] Figure 1 is a flow chart of the web crawler method provided by Embodiment 1 of the present invention. Depending on requirements, the order of the steps in the flow chart can be changed, and some steps can be omitted.

[0054] 101: Store multiple proxy IPs acquired at preset time intervals in a preset proxy IP pool.

[0055] In this embodiment, a proxy IP pool is preset in the local database, and the acquired proxy IPs are added to the proxy IP pool for use by crawlers. Proxy IPs can be found on proxy IP websites on the Internet, and the list can be collected manually or automatically by another small crawler. Proxy IPs can also be purchased from a third-party service organization and added to the preset proxy IP pool.

[0056] In this embodiment, the proxy information of a proxy IP may include, but is not limited to: IP address, name and port.
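The storage step described in [0054] to [0056] can be illustrated with a minimal sketch. It assumes a SQLite database as the "local database" and an in-memory file for the demonstration; the names `ProxyInfo`, `ProxyPool` and `store` are hypothetical and do not come from the patent. The periodic acquisition itself (scraping a proxy website or buying a list) is not shown.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyInfo:
    """Proxy information named in the text: IP address, name and port."""
    ip: str
    name: str
    port: int


class ProxyPool:
    """Hypothetical proxy IP pool kept in a local SQLite database."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" is an assumption for the demo; a file path would persist the pool.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS proxies ("
            "ip TEXT NOT NULL, name TEXT NOT NULL, port INTEGER NOT NULL, "
            "PRIMARY KEY (ip, port))"
        )

    def _count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]

    def store(self, proxies) -> int:
        """Add newly acquired proxies, skipping duplicates; return how many were added."""
        before = self._count()
        self.conn.executemany(
            "INSERT OR IGNORE INTO proxies (ip, name, port) VALUES (?, ?, ?)",
            [(p.ip, p.name, p.port) for p in proxies],
        )
        self.conn.commit()
        return self._count() - before

    def all(self):
        """Return every stored proxy, ordered by IP address and port."""
        rows = self.conn.execute(
            "SELECT ip, name, port FROM proxies ORDER BY ip, port"
        )
        return [ProxyInfo(*row) for row in rows]
```

In such a sketch, `store` would be called once per preset time interval with the freshly acquired list; keying the table on IP address and port keeps repeated acquisitions from duplicating entries.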

[0057] In this embodiment, it is possi...

Embodiment 2

[0075] Figure 2 is a flow chart of the web crawler method provided by Embodiment 2 of the present invention. Depending on requirements, the order of the steps in the flow chart can be changed, and some steps can be omitted.

[0076] 201: Store multiple proxy IPs acquired at preset time intervals in a preset proxy IP pool.

[0077] Step 201 in this embodiment is the same as step 101 in Embodiment 1, and will not be described in detail here.

[0078] 202: Verify each proxy IP in the proxy IP pool one by one, and judge whether the obtained proxy IP has the first validity.

[0079] In this embodiment, a proxy IP undergoing the first validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (e.g., Google or Baidu) to check whether a response from the search engine is obtained. If a response from the search engine is obtained, it indicates that the proxy IP to be verified has the fi...
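The check described in [0079] can be sketched as follows, using only the Python standard library. The function names `fetch_via_proxy` and `has_first_validity`, the default test URL, the timeout, and the rule that a 2xx or 3xx status counts as "a response" are all assumptions made for illustration; the source text is truncated before it states the exact criterion.

```python
import http.client
import urllib.request


def fetch_via_proxy(proxy_ip: str, port: int, url: str, timeout: float = 5.0) -> int:
    """Send one GET request through the proxy and return the HTTP status code."""
    proxy_url = f"http://{proxy_ip}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.status


def has_first_validity(proxy_ip: str, port: int,
                       test_url: str = "https://www.baidu.com",
                       fetch=fetch_via_proxy) -> bool:
    """Return True if the search engine answers through the proxy.

    `fetch` is injectable so the check can be exercised without network access.
    """
    try:
        status = fetch(proxy_ip, port, test_url)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and connection failures derive from OSError.
        return False
    # Assumption: only a successful or redirect status is treated as a response.
    return 200 <= status < 400
```

Passing a stub for `fetch` keeps the decision logic testable offline, for example `has_first_validity("192.0.2.10", 8080, fetch=lambda ip, port, url: 200)`.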

Embodiment 3

[0137] Figure 3 is a functional block diagram of a preferred embodiment of the web crawler device of the present invention.

[0138] In some embodiments, the web crawler device 30 runs in a terminal. The web crawler device 30 may include a plurality of functional modules composed of program code segments. The program code of each segment in the web crawler device 30 can be stored in a memory and executed by at least one processor to perform the web crawler method (see Figure 1 and its related description for details).

[0139] In this embodiment, the web crawler device 30 of the terminal can be divided into multiple functional modules according to the functions it performs. The functional modules may include: a storage module 301, a judging module 302, a recording module 303, a selection module 304 and a crawling module 305. A module, as referred to in the present invention, is a series of computer program segments that can be executed by at least...


Abstract

The invention discloses a network crawling method comprising the following steps: storing a plurality of proxy IPs, acquired at preset time intervals, in a preset proxy IP pool; verifying each proxy IP in the proxy IP pool one by one and judging the validity of the acquired proxy IPs; recording the proxy IPs determined to be valid in a white list in the proxy IP pool, and recording the proxy IPs determined to be invalid in a black list in the proxy IP pool; when it is detected that the current proxy IP satisfies a preset proxy substitution condition, selecting one proxy IP from the white list in the proxy IP pool; and taking the selected proxy IP as the new proxy IP and performing data crawling. The invention also provides a terminal and a storage medium. The method, terminal and storage medium effectively solve the problem of an IP being restricted when the same proxy IP is used to crawl large amounts of data quickly, repeatedly and over a long period.
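The white list, black list and substitution steps of the abstract can be sketched as below. This is only an illustration under stated assumptions: the class name `RotatingProxySelector` is hypothetical, and the "preset proxy substitution condition" is assumed here to be a maximum number of requests per proxy, which the abstract does not specify.

```python
import random


class RotatingProxySelector:
    """Hypothetical selector that sorts proxies into lists and rotates on demand."""

    def __init__(self, candidates, is_valid, max_requests_per_proxy=100, rng=None):
        self.whitelist = []  # proxies judged valid
        self.blacklist = []  # proxies judged invalid
        for proxy in candidates:
            (self.whitelist if is_valid(proxy) else self.blacklist).append(proxy)
        # Assumed substitution condition: a per-proxy request budget.
        self.max_requests = max_requests_per_proxy
        self.rng = rng or random.Random()
        self.current = None
        self.used = 0

    def needs_substitution(self) -> bool:
        """True when no proxy is active or the current one has used up its budget."""
        return self.current is None or self.used >= self.max_requests

    def proxy_for_next_request(self):
        """Return the proxy to use for the next request, replacing it if needed."""
        if self.needs_substitution():
            if not self.whitelist:
                raise RuntimeError("white list is empty")
            # Prefer a proxy other than the current one when one is available.
            others = [p for p in self.whitelist if p != self.current]
            self.current = self.rng.choice(others or self.whitelist)
            self.used = 0
        self.used += 1
        return self.current
```

A crawler loop would call `proxy_for_next_request()` before each fetch. Other substitution conditions, such as elapsed time or an error response from the target site, could replace the request budget without changing the structure.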

Description

Technical field

[0001] The invention relates to the technical field of web crawlers, in particular to a web crawler method, a terminal and a storage medium.

Background

[0002] The web crawler is a very important part of a search engine system. It is responsible for collecting web pages and information from the Internet. This web page information is used to build the index that supports the search engine, so the crawler's performance directly affects the quality of the search engine. As the amount of network information increases geometrically, the requirements on the performance and efficiency of web crawler page collection keep rising.

[0003] We always hope to obtain more data in a shorter period of time, but this places a very high load on the website and also brings problems such as increased network traffic and leakage of private data. Many websites use crawler detection technology. Analyze the web access...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/06, H04L12/24, G06F17/30
CPC: H04L41/5009, H04L63/0876, H04L63/101
Inventor: 阮晓雯, 徐亮, 肖京
Owner: PING AN TECH (SHENZHEN) CO LTD