Data crawling method and system and data crawling equipment

A data crawling technology, applied to data crawling methods, systems and equipment, that addresses the problem of IP addresses being blocked and becoming inaccessible. Its stated effects are preserving data continuity, reducing the complexity of later data processing, and increasing data crawling efficiency.

Pending Publication Date: 2020-08-14
北京市科学技术情报研究所
Cites: 4; Cited by: 5

AI Technical Summary

Problems solved by technology

[0004] To this end, the embodiments of the present invention provide a data crawling method, system and equipment that solve the prior-art problem of the crawler's IP address being blocked, and the target page becoming inaccessible, when crawling web page data.



Examples

Embodiment 1

[0024] The embodiment of the present invention mainly addresses the situation in which, when web page data is crawled through many frequent visits to the same web page address, the IP address is banned and cannot continue to access the page. It proposes a data crawling system that automatically replaces the IP address of the data crawling agent. The system comprises several basic functional modules, each of which performs a predetermined supporting function. Together they replace the IP address of the data crawling agent during the crawling process so that each access is correctly returned and output. Applied at the front end of crawler data collection, the embodiments of the present invention can ensure the access success rate and thereby reduce the difficulty of later data processing.
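The replacement behaviour described in [0024] can be illustrated with a minimal sketch. The source does not say how a ban is detected or where replacement addresses come from, so everything here is an assumption made for illustration: the `fetch` callable, the status codes treated as a ban (403 and 429), the round-robin pool, and the placeholder addresses from the 203.0.113.0/24 documentation range. It is not the patented implementation.

```python
from itertools import cycle
from typing import Callable, Iterable, Tuple

# Assumption: these HTTP status codes mean the current agent IP has been banned.
BLOCKED_STATUSES = {403, 429}


def crawl_with_ip_rotation(
    url: str,
    fetch: Callable[[str, str], Tuple[int, str]],
    agent_ips: Iterable[str],
    max_attempts: int = 5,
) -> str:
    """Fetch `url`, moving to the next agent IP whenever a ban is detected.

    `fetch(url, agent_ip)` is a hypothetical callable that performs the
    request through the given agent IP and returns (status_code, body).
    """
    ips = list(agent_ips)
    if not ips:
        raise ValueError("at least one agent IP is required")
    pool = cycle(ips)  # round-robin over the available agent IPs
    for _ in range(max_attempts):
        agent_ip = next(pool)
        status, body = fetch(url, agent_ip)
        if status not in BLOCKED_STATUSES:
            return body  # access returned correctly
    raise RuntimeError(f"all {max_attempts} attempts to fetch {url} were blocked")


def fake_fetch(url: str, agent_ip: str) -> Tuple[int, str]:
    """Stand-in for a real HTTP request: one placeholder IP is 'banned'."""
    if agent_ip == "203.0.113.1":
        return 403, ""
    return 200, f"content of {url} via {agent_ip}"
```

Calling `crawl_with_ip_rotation("http://example.com/page", fake_fetch, ["203.0.113.1", "203.0.113.2"])` first tries the banned address, then succeeds through the second one. In a real crawler, `fetch` would wrap an HTTP client configured to use the chosen agent.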

[0025] Specifically, referring to Figure 1, the embodiment of the present invention provides a data crawling system, the basic functional ...

Abstract

The embodiment of the invention discloses a data crawling method and system and data crawling equipment, and relates to the technical field of network information processing. Data crawling is carried out by using a process to control IP address replacement for the data crawling agent. Compared with conventional proxy IP crawling technology, the method is advantageous in that the IP address of the data crawling agent can be replaced frequently and on a large scale to access the target webpage. The method is better suited to target webpages that place high demands on the continuity of data acquisition after login. It solves the problem of data acquisition being interrupted when the IP address of the crawling agent is replaced, preserves data continuity to the maximum extent, greatly reduces the complexity of later data processing, and improves data crawling efficiency.

Description

Technical field

[0001] Embodiments of the present invention relate to the technical field of network information processing, and in particular to a data crawling method, system and equipment.

Background

[0002] A web crawler is a program that automatically extracts web pages. It downloads web pages from the World Wide Web for search engines and is an important component of search engines. Traditional crawlers start from the URL of one or several initial webpages, obtain the URLs on the initial webpage, and continuously extract new URLs from the current page and put them into the queue during the process of crawling webpages until a certain stop condition of the system is met. The workflow of the focused crawler is relatively complicated. It needs to filter links that have nothing to do with the topic according to a certain webpage analysis algorithm, keep useful links and put them into the URL queue waiting to be crawled. Then, it will select the URL of the web ...
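The traditional crawl loop described in [0002] (seed URLs, a queue of newly extracted URLs, a stop condition) can be sketched as below. This is an illustrative sketch only: `get_links` is a hypothetical callable standing in for downloading and parsing a page, `max_pages` is one possible stop condition, and `is_relevant` is a placeholder for the focused crawler's link filter, whose actual analysis algorithm the source does not specify.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(
    seed_urls: Iterable[str],
    get_links: Callable[[str], List[str]],
    max_pages: int = 100,
    is_relevant: Callable[[str], bool] = lambda url: True,
) -> List[str]:
    """Visit pages breadth-first from the seeds and return them in visit order.

    `get_links(url)` is a hypothetical callable returning the URLs found on a page.
    `is_relevant(url)` is a placeholder for a focused crawler's topic filter.
    """
    queue = deque(dict.fromkeys(seed_urls))  # de-duplicated seeds, order kept
    seen = set(queue)
    visited: List[str] = []
    # Stop condition (assumed): the queue is empty or max_pages have been visited.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Keep only unseen links that pass the relevance filter.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph used in place of real web pages.
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}


def toy_get_links(url: str) -> List[str]:
    return LINKS.get(url, [])
```

With the toy graph, `crawl(["a"], toy_get_links)` visits every page once, while passing `is_relevant=lambda u: u != "c"` mimics a focused crawler discarding an off-topic link.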

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566
Inventor: 毛卫南, 苗润莲, 毛维娜, 张敏, 向宁, 张洪元
Owner 北京市科学技术情报研究所