High-efficiency and high-precision network data capturing method and device

A network data crawling technology and device, applied in network data indexing, network data retrieval, other database retrieval, and related fields. It addresses the poor scalability and low performance of existing approaches and their inability to capture network data with high efficiency and high precision.

Pending Publication Date: 2022-07-15
XINJIANG UNIVERSITY

AI Technical Summary

Problems solved by technology

[0005] Existing single-machine multi-thread and multi-process web crawler methods scale poorly, perform poorly when obtaining network data, and cannot capture network data with high efficiency and high precision. To address these problems, the present invention proposes a high-efficiency, high-precision network data capture method and device. The invention captures network data efficiently through distributed crawlers and adopts a parallel improved Bloom Filter method to remove duplicate URLs during data capture. This improves the efficiency and accuracy of obtaining network data and saves memory space; details are given in the description.
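The source does not spell out how the parallel deduplication is organized, so the following is only a minimal sketch under stated assumptions: URLs are partitioned into shards by a stable hash, each shard is deduplicated independently in a thread pool, and the results are merged. The names `shard_of`, `dedup_shard`, and `parallel_dedup` are hypothetical, and a plain Python set stands in for the per-shard Bloom Filter.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def shard_of(url: str, num_shards: int) -> int:
    """Map a URL to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def dedup_shard(urls: List[str]) -> List[str]:
    """Remove duplicates inside one shard, keeping first-seen order.

    Assumption: a plain set stands in for the per-shard Bloom Filter.
    """
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def parallel_dedup(urls: List[str], num_shards: int = 4) -> List[str]:
    """Partition URLs by hash, dedup each shard in parallel, merge the results."""
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    for url in urls:
        shards[shard_of(url, num_shards)].append(url)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = list(pool.map(dedup_shard, shards))
    return [url for shard in results for url in shard]
```

Because identical URLs always hash to the same shard, no cross-shard comparison is needed; the merged output is grouped by shard rather than kept in the original input order.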

Examples

Embodiment 1

[0046] This embodiment of the present invention provides a high-efficiency, high-precision network data capture method. Referring to Figure 1, the method includes the following steps:

[0047] Step 101: Using the Scrapy crawler framework and a master-slave distributed structure in which a Redis cluster handles scheduling and data caching, rewrite the scheduler and the crawler class so that the scheduler obtains deduplicated URLs from the Redis scheduling queue, completing the data interaction between the scheduler and Redis. The data captured by the crawler class then passes through the project pipeline (item pipeline) class and is cached in the Redis database, which realizes indirect interaction between the crawler class and Redis. On this basis, a new deduplication class is built for the Scrapy-Redis master-slave distributed crawler;
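As an illustration of the caching step in Step 101, the sketch below shows an item pipeline that serializes each captured item and appends it to a Redis list. It is not the patent's implementation. `InMemoryListStore` is a hypothetical stand-in for a Redis client so the example runs without a server; its method names only mirror the Redis RPUSH and LRANGE commands. The class name `RedisCachePipeline` and the key `crawler:items` are likewise assumptions. The `process_item(item, spider)` method name follows Scrapy's item pipeline interface.

```python
import json
from collections import defaultdict


class InMemoryListStore:
    """Hypothetical stand-in for a Redis client; supports list append and range only."""

    def __init__(self):
        self._lists = defaultdict(list)

    def rpush(self, key, value):
        # Like Redis RPUSH: append to the tail and return the new list length.
        self._lists[key].append(value)
        return len(self._lists[key])

    def lrange(self, key, start, end):
        # Like Redis LRANGE: end is inclusive; only end >= 0 or end == -1 is handled.
        items = self._lists[key]
        stop = len(items) if end == -1 else end + 1
        return items[start:stop]


class RedisCachePipeline:
    """Item pipeline sketch: serialize each item and cache it in a Redis list."""

    def __init__(self, client, key="crawler:items"):
        self.client = client
        self.key = key

    def process_item(self, item, spider):
        # Returning the item lets any later pipeline stage keep processing it.
        self.client.rpush(self.key, json.dumps(dict(item), ensure_ascii=False))
        return item
```

With a real Redis client the stand-in would be replaced by a connection object exposing the same two list operations.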

[0048] The Scrapy crawler framework is a set of components based on the Redis database and ru...

Embodiment 2

[0062] The scheme of Embodiment 1 is further described below with specific calculation formulas and examples:

[0063] Step 201: To obtain network data efficiently, the Scrapy framework is made to interact with the Redis database. By rewriting the scheduler and the crawler class, interaction among the scheduler, the crawler class, the project pipeline class, and Redis is realized, and a new deduplication class is constructed;

[0064] In the new deduplication class, the scheduler checks whether a crawler request URL is a duplicate by accessing the Redis database; if it is not a duplicate, the URL is added to the to-be-crawled queue in the Redis database. When the scheduler receives a request for a URL from the engine, the scheduling conditions have been met, and the scheduler will perform the encapsulation operation of the engine and the resource download operation o...
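The check-then-enqueue behaviour described in [0064] can be sketched as follows. This is a simplified illustration, not the patent's deduplication class: an exact set of seen URLs stands in for the improved Bloom Filter, and `InMemoryRedis` is a hypothetical stand-in whose method names mirror the Redis SADD, LPUSH, and RPOP commands. `DedupScheduler` and the key names are assumptions.

```python
from collections import deque


class InMemoryRedis:
    """Hypothetical stand-in for a Redis client (one set type, one FIFO list type)."""

    def __init__(self):
        self._sets = {}
        self._lists = {}

    def sadd(self, key, member):
        # Like Redis SADD: returns 1 if the member is new, 0 if already present.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        # Push onto the head of the list.
        self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        # Pop from the tail; with lpush this yields first-in, first-out order.
        queue = self._lists.get(key)
        return queue.pop() if queue else None


class DedupScheduler:
    """Scheduler sketch: deduplicate on enqueue, hand out URLs in FIFO order."""

    def __init__(self, client, seen_key="crawler:seen", queue_key="crawler:queue"):
        self.client = client
        self.seen_key = seen_key
        self.queue_key = queue_key

    def enqueue_request(self, url):
        """Queue the URL only if it is new; return whether it was queued."""
        if self.client.sadd(self.seen_key, url) == 0:
            return False  # duplicate, skip it
        self.client.lpush(self.queue_key, url)
        return True

    def next_request(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self.client.rpop(self.queue_key)
```

Keeping both the seen-set and the queue in the shared store is what lets several crawler nodes in a master-slave cluster avoid fetching the same URL twice.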

Embodiment 3

[0097] A high-efficiency, high-precision network data capture device. Referring to Figure 1, the device includes: the Scrapy Engine (engine), Scheduler (scheduler), Downloader (downloader), Spider (crawler), Item Pipeline (pipeline), Downloader Middlewares (download middleware), Spider Middlewares (crawler middleware), and a master-slave distributed crawler cluster.

[0098] The crawler initiates a crawler request to the engine through the crawler middleware to request the first URL to be crawled. After the engine receives the crawler request, it asks the scheduler for the URL. In the new deduplication class, the scheduler determines whether a crawler request URL is a duplicate by accessing the Redis database. An efficient URL deduplication algorithm deduplicates a large number of URLs, and the deduplicated results are added to the scheduler queue in the Redis database. The scheduler gets the deduplicated URL from the Redis scheduling queue and returns it to the...

Abstract

The invention discloses a high-efficiency, high-precision network data capture method and device. The method comprises: designing K high-availability hash functions and performing the insertion operation of an improved Bloom Filter using the hash values in the resulting hash value set; locating the bit-array position corresponding to each hash value in the set, reading the value set at each position, and performing an AND operation on those values, where a result of 1 means the URL is judged to be a repeated URL and otherwise the crawling operation is executed; deleting inserted URL mapping information through the improved Bloom Filter, and deduplicating URL information with a parallel Bloom Filter URL deduplication method; and storing the finally deduplicated batch of URLs in a Redis-based URL deduplication queue, from which the scheduler continuously obtains non-duplicate URLs and initiates crawler requests for them to obtain network data resources. The device comprises a processor and a memory. The invention improves the efficiency and accuracy of obtaining network data and saves memory space.
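The insert, query, and delete operations named in the abstract can be illustrated with a counting Bloom filter, which is one common way to make deletion possible. The excerpt does not say how the patent's improved Bloom Filter supports deletion, so this is an assumption, as are the class name `CountingBloomFilter`, the default sizes, and the use of SHA-256 double hashing in place of the patent's K hash functions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter sketch with per-position counters so entries can be deleted."""

    def __init__(self, size=1 << 16, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, url):
        # Assumption: derive K positions by double hashing, h1 + i * h2 (mod size).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def contains(self, url):
        # AND over the K positions: the result is 1 only if every position is set.
        result = 1
        for pos in self._positions(url):
            result &= 1 if self.counters[pos] > 0 else 0
        return result == 1

    def insert(self, url):
        for pos in self._positions(url):
            self.counters[pos] += 1

    def delete(self, url):
        # Refuse to delete a URL that tests as absent, so counters never go negative.
        if not self.contains(url):
            return False
        for pos in self._positions(url):
            self.counters[pos] -= 1
        return True
```

A result of 1 from the AND means the URL is probably a duplicate (false positives are possible), while a result of 0 means it is definitely new and can be crawled. Counters cost more memory than single bits, which is the usual trade-off for supporting deletion.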

Description

Technical field

[0001] The invention relates to the field of network data capture, in particular to a high-efficiency, high-precision network data capture method and device.

Background

[0002] With the rapid development of the Internet and the advent of the big data era, Internet technology is changing rapidly, and the data and information resources on the Internet are growing exponentially. Internet information is closely tied to people's lives, and demand for network information and data resources keeps increasing. Given this explosive growth in the amount of data on the Internet, how to extract and use the required information from such massive volumes of data is a key issue. However, for the acquisition of large-scale network data resources, the traditional single-machine-based web crawler methods and devices have faced challenges in scalability and performance, and c...

Application Information

IPC(8): G06F16/955, G06F16/951
CPC: G06F16/955, G06F16/951
Inventors: 贾振红, 冷正刚
Owner: XINJIANG UNIVERSITY