A method for distributed collection of public page data

A distributed page data collection technology, applied in the fields of network data indexing, network data retrieval, and other database retrieval. It addresses resource preemption, weak cluster scalability, and the inability to allocate resources dynamically, with the effect of improving scalability, saving cluster resources during allocation, and improving resource utilization.

Pending Publication Date: 2019-06-28

湖南衍金征信数据服务有限公司



AI Technical Summary

Problems solved by technology

[0011] To address the shortcomings of the prior art, the present invention proposes a distributed crawler process management method based on docker swarm. It solves the problems of resource preemption, inability to allocate resources dynamically, and weak cluster scalability that affect distributed crawlers in a cluster environment. The user does not need to handle resource scheduling across the machine nodes in the cluster, and does not need to manage the dependencies of the cluster programming environment. The user only sets the parameters of each task, such as the number of task instances to start in the cluster, the distribution of tasks across the cluster nodes, and the blacklisted machine nodes for the task, then packages the crawler code and configuration files and uploads them to the cluster. The task is then distributed automatically, which saves development cost and avoids cluster resource allocation problems caused by missing or conflicting dependencies in the cluster programming environment.



Embodiment Construction

[0031] Referring to the accompanying drawings, the following is a specific embodiment of a method for distributed collection of public page data of the present invention:

[0032] Multi-source news data collection: manually designate three news sites as A, B, and C. Adapt the corresponding crawler programs PA, PB, and PC to the different page structures and data transmission interfaces of sites A, B, and C. Package the programs as docker images and upload them to the image server. Publish the tasks and specify their parameters on the cluster host; each machine node receives its task and starts the crawler image configured with those parameters, so the crawler images of the three sites are distributed and started across the machine cluster. A distributed database receives the data crawled by all machines in the cluster and stores it.

[0033] An example is shown in Figure 4.
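The task publishing step in this embodiment might be sketched as follows. This is a minimal illustration, not the patent's implementation: the `CrawlerTask` fields, the service names, the image URLs under `registry.example.com`, and the host name `node-3` are all hypothetical. The sketch only builds argument lists for the standard `docker service create` and `docker service scale` commands and does not execute them.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerTask:
    """Hypothetical per-site task parameters (all names are illustrative)."""
    name: str        # swarm service name, e.g. one per news site
    image: str       # crawler image already pushed to the image server
    replicas: int    # number of task instances to start in the cluster
    blacklist: list[str] = field(default_factory=list)  # hosts to avoid


def build_create_command(task: CrawlerTask) -> list[str]:
    """Build the argv for `docker service create` for one crawler task."""
    cmd = ["docker", "service", "create",
           "--name", task.name,
           "--replicas", str(task.replicas)]
    # A placement constraint keeps the task off each blacklisted node.
    for host in task.blacklist:
        cmd += ["--constraint", f"node.hostname!={host}"]
    cmd.append(task.image)
    return cmd


def build_scale_command(task: CrawlerTask, replicas: int) -> list[str]:
    """Build the argv for `docker service scale` to grow or shrink a task."""
    return ["docker", "service", "scale", f"{task.name}={replicas}"]


# Assumed example: one task per site A, B, and C.
TASKS = [
    CrawlerTask("crawler-pa", "registry.example.com/crawler-pa:latest", 3),
    CrawlerTask("crawler-pb", "registry.example.com/crawler-pb:latest", 2,
                blacklist=["node-3"]),
    CrawlerTask("crawler-pc", "registry.example.com/crawler-pc:latest", 1),
]

if __name__ == "__main__":
    for t in TASKS:
        print(" ".join(build_create_command(t)))
```

On a swarm manager node, each list could be passed to `subprocess.run` to publish the task; building the arguments separately keeps the parameter handling easy to check without a running cluster.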

[0034] Referring to Figures 1 to 3, the execution steps are subdivided as follows:

[0035] 1...


Abstract

The invention discloses a method for collecting public page data in a distributed manner, comprising the following steps: packaging the crawler program as an image, with all programming environment dependencies and software environment dependencies bundled into the image; distributing the image to each machine node using a weighted round-robin (polling) algorithm to keep the overall load of the crawler cluster balanced; and managing the resource allocation of crawler tasks through docker swarm commands, so that the cluster resources of a crawler task can be increased or decreased dynamically. The method solves the problems of resource preemption, inability to allocate resources dynamically, and weak cluster scalability that affect distributed crawlers in a cluster environment. A user only needs to set the parameters of each task, such as the number of task instances to start in the cluster, the distribution of tasks across the cluster nodes, and the blacklisted machine nodes for the task, then package the crawler code and configuration files and upload them to the cluster. Task distribution then completes automatically, which saves development cost and avoids cluster resource allocation problems caused by missing or conflicting dependencies in the cluster programming environment.
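The weighted round-robin distribution named in the abstract could look like the sketch below. The abstract does not say which variant the invention uses, so this assumes the common "smooth" weighted round-robin, in which each node accumulates its weight every round and the node with the highest running score is picked. The node names and weights are hypothetical.

```python
def weighted_round_robin(weights: dict[str, int], count: int) -> list[str]:
    """Return `count` node picks using smooth weighted round-robin.

    `weights` maps a node name to a positive integer weight; a higher
    weight means the node receives proportionally more tasks.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and positive")

    total = sum(weights.values())
    current = {node: 0 for node in weights}  # running score per node
    picks = []
    for _ in range(count):
        # Every node gains its own weight each round.
        for node, w in weights.items():
            current[node] += w
        # The node with the highest score is chosen (first one wins ties),
        # then pays back the total weight so that others get their turn.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


if __name__ == "__main__":
    # Assumed weights: node-a has five times the capacity of the others.
    print(weighted_round_robin({"node-a": 5, "node-b": 1, "node-c": 1}, 7))
```

Over one full cycle of seven picks with these weights, `node-a` is chosen five times and the other two nodes once each, and the picks for `node-a` are spread through the cycle rather than issued back to back.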

Description

Technical field

[0001] The invention relates to office automation and electrical digital data processing, and in particular to a method for collecting public page data in a distributed manner.

Background

[0002] A web crawler (sometimes called a spider or spider bot, often shortened to crawler) is a web robot that systematically browses the World Wide Web, often for web indexing (web crawling).

[0003] 1. Most existing crawler technologies use the HTTP/HTTPS protocol to download data from the target website, then parse and extract the data after downloading.

[0004] 2. Most distributed crawler architectures use a redis cluster as the distribution queue: all tasks are imported into the distribution queue, and the list queue of the redis cluster is used to ensure the uniqueness of the tasks.

[0005] 3. A specified number of crawler processes is started on each machine node. Each crawler process pulls URLs from the distribution queue for crawling. Each crawler...


Application Information


IPC(8): G06F16/955, G06F16/951

Inventor 卜俊

Owner 湖南衍金征信数据服务有限公司
