Distributed crawler task scheduling method based on executor state control

By generating a proxy IP pool and a task hierarchy mechanism, combined with executor status monitoring and self-processing mechanisms, the problem of task queue blockage caused by executor anomalies in distributed crawler systems is solved, achieving efficient system operation and low-cost maintenance.

CN116339949B · Active · Publication Date: 2026-06-30 · XIDIAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIDIAN UNIV
Filing Date
2023-03-28
Publication Date
2026-06-30


Abstract

This invention discloses a data collection task scheduling method based on executor state control in a distributed crawler system. It primarily addresses two shortcomings of existing technologies: key tasks cannot be given priority response, and exceptions cannot be handled automatically. The implementation scheme includes: 1) generating a proxy IP pool; 2) constructing a distributed data collection executor system based on the proxy IPs and pre-setting an executor state transition mechanism; 3) introducing a task hierarchy mechanism, generating tasks with priority characteristics, and distributing tasks to each distributed data collection executor according to the real-time status of the executors; 4) each executor executing tasks sequentially according to priority; and 5) defining the exceptions that executors may encounter during task execution and pre-setting a corresponding self-handling mechanism for each exception. This invention effectively improves the system's execution efficiency, avoids task queue congestion caused by executor exceptions, enhances system availability, and can be applied to large-scale Internet data collection systems with multiple data collection executors.

Description

Technical Field

[0001] This invention belongs to the field of computer technology and specifically relates to a distributed crawler task scheduling method, which can be applied to large-scale Internet data collection systems with multiple collection executors.

Background Technology

[0002] The rapid development of the internet has led to the increasingly widespread application of big data, making distributed web crawler technology increasingly important. The common execution logic of a web crawler system is as follows: the user sets a data collection task, and according to a preset execution time, the system configures the execution script in real time and submits it to the executor. The executor performs data collection operations according to preset rules and returns the results. Current research on task scheduling mechanisms in distributed web crawler systems is not yet mature. Even with strong hardware resources, if they are not used efficiently, it will still result in wasted computing power and inefficient system operation. Existing distributed web crawler task scheduling methods mainly rely on preset allocation mechanisms to schedule and distribute tasks based on the hardware capabilities of the executor. While this method allows for macro-level resource allocation and invocation at the node level, it does not consider the specific characteristics of web crawler tasks. If a node, i.e., the executor, experiences an anomaly, such as network fluctuations or IP blocking, it can lead to task backlog in the queue, system malfunctions, and significant risks.

[0003] Tianjin University of Technology, in its published paper "Research on Active Acquisition Distributed Web Crawler Cluster Method" (Research on Active Acquisition Distributed Web Crawler Cluster Method [J]. Computer Science, 2018, 45(S1): 428-432.), disclosed an active acquisition task-based distributed web crawler method. The specific steps of this method are: first, selecting a high-configuration machine as the central control machine; second, introducing a sub-node control module to each sub-node of the cluster; third, allocating and scheduling tasks according to a dynamic bidirectional priority task allocation and scheduling algorithm; and fourth, the corresponding node executing the task and returning the result. While this method allows nodes to actively identify anomalies, the lack of anomaly handling procedures necessitates manual troubleshooting and repair of node anomalies, significantly increasing labor costs and requiring substantial maintenance, resulting in low sustainability.

[0004] Shanghai Jiao Tong University disclosed a "Distributed Web Crawler Scheduling System and Method Based on Computational Resources" in its patent application CN202010464671.9. The implementation steps are as follows: First, the client obtains a user-defined crawler scheme to generate a navigation page crawler seed and stores it in a Redis database. Second, each node in the crawler producer node cluster calculates its own resources to obtain the navigation page crawler seed and stores the generated content page crawler seed in the client's Redis database. Third, each node in the crawler consumer node cluster calculates its own resources to obtain the content page crawler seed and operates on the content page crawler seed to collect target data. This method only considers system computational resources for task scheduling and does not account for potential uncertainties such as IP blocking and network fluctuations during actual operation. This can lead to task backlog for executors, potentially causing system crashes. Furthermore, this method does not differentiate task levels, preventing the system from prioritizing important tasks, resulting in low availability and unsuitability for task scheduling in large-scale distributed web crawler systems.

Summary of the Invention

[0005] The purpose of this invention is to address the shortcomings of the prior art by proposing a distributed crawler task scheduling method based on executor state control. By pre-setting a self-handling mechanism for executor exceptions, this method reduces manual maintenance costs and avoids task queue congestion caused by executor exceptions. Furthermore, it introduces a task hierarchy mechanism to classify data collection tasks by priority.

[0006] The technical solution for achieving the objective of this invention includes the following steps:

[0007] (1) Generate a proxy IP pool:

[0008] 1a) Generate a Secure Shell (SSH) key pair in the data acquisition server;

[0009] 1b) Purchase n VPS proxy servers and add the public key from the SSH key pair to the authorized_keys file under the /root/.ssh path on each VPS proxy server;

[0010] 1c) Send SSH commands on the data collection server side, establish remote connections between the data collection server and each proxy server using the private key in the SSH key pair, store the IP addresses of these proxy servers to form a proxy IP pool, and then set the status of each proxy IP to Free, waiting for subsequent calls from the executor.

[0011] (2) Construct a distributed data acquisition executor for the system:

[0012] 2a) Based on the WebDriver function library provided by the Selenium tool, create a browser page as a new data collection executor, and set the executor status to New;

[0013] 2b) The executor selects the proxy IP in the Free state from step 1c), establishes an SSH connection between the executor and the proxy server, updates the executor status to Initializing, updates the IP status to Busy, sets the executor's outgoing traffic address to the proxy IP, and updates the executor status to Idle.

[0014] A data collection executor based on a proxy IP is thereby obtained;

[0015] 2c) Repeat steps 2a) and 2b) to add m data collection executors to complete the construction of the distributed data collection executor system, where m is less than n, to ensure that the number of proxy pool IPs is greater than the number of executors;

[0016] (3) Configure and generate data acquisition tasks:

[0017] 3a) Based on the characteristics of the target data crawled from the Internet, write a script to drive the executor to execute automatically, and set the priority of task execution, including three levels: high, medium and low;

[0018] 3b) Set the task execution cycle and generate a token that records relevant information about the task according to the Cron expression at regular intervals. This information includes the execution script and the running priority.

[0019] (4) Distributor schedules tasks:

[0020] 4a) The generated tokens are archived and registered. After deduplication based on the id field of the tokens, they are submitted to the distributor in the collection system for processing.

[0021] 4b) The distributor distributes the received tokens to the distributed collection executors of the system constructed in step (2);

[0022] 4c) After obtaining the Token, the executor adds it to the high, medium, or low priority queues corresponding to this executor according to the priority information of the Token.

[0023] (5) The executor executes the Token;

[0024] (6) Define the exceptions that may occur while the executor executes a Token, including: minor exception, serious exception, suspension exception and corruption exception;

[0025] (7) The executor automatically handles the exceptions that occur during the data collection process:

[0026] If a minor exception occurs, the executor resubmits the running Token to the system distributor, increments the minor exception count by one, and executes step (8).

[0027] If a serious exception occurs, the executor changes its own state to Closing, disconnects the SSH connection, changes the associated proxy IP state from Busy to Free, returns it to the proxy pool, and executes step (10).

[0028] If a suspension exception occurs, the executor enters hibernation, and the system issues a notification and waits for the user to manually operate the system or browser, then executes step (11).

[0029] If a corruption exception occurs, the executor changes to the Broken state, the Tokens in the executor's queue are checked out and resubmitted to the system distributor, and a notification is issued. The administrator then deletes the executor process at the collection server level.

[0030] (8) Determine whether the minor exception count has reached ten:

[0031] If the minor exception count is less than ten, return to step (5);

[0032] If the minor exception count has reached ten, change the IP associated with the executor, that is, disconnect the executor's SSH connection, change the proxy IP status to Frozen, and execute step (9);

[0033] (9) Determine whether there is a Free IP in the proxy pool:

[0034] If none exists, the executor enters a suspension exception;

[0035] If it exists, select a Free IP from the proxy pool, reset the minor anomaly count of the executor to zero, and return to step (5);

[0036] (10) The executor enters the Terminated state, deletes the executor driver, returns to step 2a), and retains the execution queue;

[0037] (11) The user triggers the executor, which then executes the subsequent steps of the current token.

[0038] Compared with the prior art, the present invention has the following advantages:

[0039] First, since this invention defines various exceptions that occur during the operation of executors and pre-sets self-processing mechanisms for executors according to different exception types, compared with traditional distributed crawler systems, this invention can automatically trigger repairs for various exceptions, effectively avoiding task queue blockage caused by executor exceptions, thereby ensuring the long-term effective execution of the system, improving the robustness of the crawler system, and effectively reducing the operation and maintenance costs of the crawler system.

[0040] Secondly, since this invention defines the running status of the crawler system executors, the system can learn about the real-time running status of each executor. Therefore, compared with the traditional distributed crawler system's scheduling method of setting up a public directory and allocating tasks according to computing resources, it can ensure the real-time optimal solution for task scheduling, thereby improving the execution efficiency of the crawler system.

[0041] Third, since this invention defines the execution priority of tasks and sets up three levels of execution queues (high, medium, and low) in each executor, compared with the traditional distributed crawler system that executes tasks in the order of task generation time, it can prioritize generated tasks and execute high-priority tasks first, thereby ensuring the timeliness of important task execution and improving the availability of the crawler system.

Attached Figure Description

[0042] Figure 1 is a flowchart illustrating the implementation of the present invention;

[0043] Figure 2 is a flowchart of the executor exception handling process in this invention;

[0044] Figure 3 is the executor state transition diagram in this invention;

[0045] Figure 4 is the proxy state transition diagram in this invention.

Detailed Implementation

[0046] The embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

[0047] Referring to Figure 1, the implementation steps for this example are as follows:

[0048] Step 1: Generate a system proxy IP pool.

[0049] 1.1) Run the command "ssh-keygen -t rsa" in the /root/.ssh path of the data collection server to generate a key pair, where id_rsa is the private key and id_rsa.pub is the public key. Copy the string in the generated id_rsa.pub file;

[0050] 1.2) Access n VPS proxy servers running the Ubuntu operating system, and paste the string copied in step 1.1) into the authorized_keys file in the /root/.ssh path of each server;

[0051] 1.3) Execute the command "ssh -i id_rsa -l root ${IP}" on the data collection server to establish remote connections between the data collection server and each proxy server, where IP refers to the external IPv4 address of the VPS proxy server;

[0052] 1.4) Store the IP addresses of these proxy servers to form a proxy IP pool, and then set the status of each proxy IP to Free, waiting for subsequent calls from the executor.
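The bookkeeping in step 1.4) can be illustrated with a minimal Python sketch. It only models the proxy states named in this document (Free, Busy, Frozen); the class name ProxyPool, its method names, and the example addresses are assumptions made for illustration, and the SSH connections themselves are not modeled.

```python
from enum import Enum


class ProxyState(Enum):
    FREE = "Free"
    BUSY = "Busy"
    FROZEN = "Frozen"


class ProxyPool:
    """Hypothetical in-memory proxy IP pool keyed by IP address."""

    def __init__(self, ips):
        # Every proxy starts in the Free state, as in step 1.4).
        self._states = {ip: ProxyState.FREE for ip in ips}

    def acquire(self):
        """Return the first Free IP and mark it Busy, or None if none is Free."""
        for ip, state in self._states.items():
            if state is ProxyState.FREE:
                self._states[ip] = ProxyState.BUSY
                return ip
        return None

    def release(self, ip):
        """Mark an IP as Free again (Busy to Free)."""
        self._states[ip] = ProxyState.FREE

    def freeze(self, ip):
        """Mark an IP as Frozen (Busy to Frozen)."""
        self._states[ip] = ProxyState.FROZEN

    def state(self, ip):
        return self._states[ip]


# Example addresses come from the documentation range 203.0.113.0/24.
pool = ProxyPool(["203.0.113.10", "203.0.113.11"])
first = pool.acquire()    # first Free IP becomes Busy
pool.freeze(first)        # Busy to Frozen
second = pool.acquire()   # the remaining Free IP becomes Busy
```

Keeping the state next to each address lets later steps ask for a Free IP without opening a connection first.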

[0053] Step 2: Construct a distributed data acquisition executor for the system.

[0054] 2.1) Construct an Agent object in the system backend code. This object contains state attributes, a high-priority queue, a medium-priority queue, a low-priority queue, associated browser attributes, and associated proxy IP attributes, where:

[0055] The state attribute is an enum value representing the executor status. Its initial value is New. Other assignable values include: Initializing, Idle, Busy, Hangup, Closing, Terminated, and Broken.

[0056] The executor has high, medium, and low priority queues to store tokens of different priorities.

[0057] The associated browser attribute is used to associate the browser driver used by this executor;

[0058] The associated proxy IP attribute is used to associate the proxy IP used by this executor;

[0059] 2.2) Import the Selenium.Webdriver function library into the system backend code, create a new Chrome browser driver object named driver, call the object's get function to access the URL "https://www.baidu.com" to complete the initialization of the browser, and assign driver to the associated browser attribute from step 2.1) to complete the association between the executor and the browser;

[0060] 2.3) The executor selects a proxy IP in the Free state from the system proxy pool generated in step 1, assigns this IP to the associated proxy IP attribute from step 2.1), updates the executor state to Initializing, and changes the IP state to Busy. As shown in Figure 4, the proxy state changes from Free to Busy.

[0061] 2.4) Set the executor's traffic exit address to the proxy IP and update the executor status to Idle, obtaining a data collection executor based on the proxy IP;

[0062] 2.5) Repeat steps 2.1) to 2.4) to add m collection executors to complete the construction of the system's distributed collection executor, where m is less than n to ensure that the number of proxy pool IPs is greater than the number of executors.
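The Agent object described in step 2.1) might look roughly like the following Python sketch. The state names are taken from this document; the field names, the snake_case set_status (the text mentions a setStatus function), and the attach_proxy helper are assumptions for illustration. The browser driver is left as an opaque attribute so that the sketch runs without Selenium installed.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class AgentState(Enum):
    NEW = "New"
    INITIALIZING = "Initializing"
    IDLE = "Idle"
    BUSY = "Busy"
    HANGUP = "Hangup"
    CLOSING = "Closing"
    TERMINATED = "Terminated"
    BROKEN = "Broken"


@dataclass
class Agent:
    agent_id: int
    state: AgentState = AgentState.NEW           # executors start as New
    high: deque = field(default_factory=deque)   # high-priority queue
    medium: deque = field(default_factory=deque)
    low: deque = field(default_factory=deque)
    driver: Optional[Any] = None                 # associated browser driver
    proxy_ip: Optional[str] = None               # associated proxy IP

    def set_status(self, new_state):
        self.state = new_state

    def attach_proxy(self, ip):
        """Steps 2.3) and 2.4): bind a proxy IP, then become Idle."""
        self.set_status(AgentState.INITIALIZING)
        self.proxy_ip = ip
        self.set_status(AgentState.IDLE)

    def queue_size(self):
        """Total number of waiting Tokens across the three queues."""
        return len(self.high) + len(self.medium) + len(self.low)


agent = Agent(agent_id=1)
agent.attach_proxy("203.0.113.10")   # example address, not from the source
agent.high.append("token-a")
```

Each Agent gets its own three deques through default_factory, so queues are never shared between executors.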

[0063] Step 3: Configure and generate the data collection task.

[0064] 3.1) Based on the characteristics of the target data crawled from the Internet, write a script to drive the executor to execute automatically, and set the priority of task execution, including three levels: high, medium, and low;

[0065] 3.2) Set the task execution cycle and generate a token that records relevant information about the task according to the Cron expression. This information includes the execution script and the running priority.
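A Token as described in step 3.2) can be sketched as a small immutable record. The document only states that a Token carries the execution script and the running priority and has an id field; how the id is built is not specified, so the scheme below (task name plus scheduled fire time) is purely an assumption, as are the names Priority, Token, and make_token. Evaluating the Cron expression is left out because the Python standard library has no Cron parser.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 0     # lower value means served first
    MEDIUM = 1
    LOW = 2


@dataclass(frozen=True)
class Token:
    id: str
    script: str
    priority: Priority


def make_token(task_name, script, priority, fire_time):
    # Assumption: the id combines the task name with the scheduled fire time,
    # so the same scheduled run always yields the same id.
    token_id = f"{task_name}@{fire_time.isoformat()}"
    return Token(id=token_id, script=script, priority=priority)


# "scripts/news.py" is a hypothetical script reference.
token = make_token("news", "scripts/news.py", Priority.HIGH,
                   datetime(2023, 3, 28, 8, 0))
```

A deterministic id of this kind would make the deduplication in step 4.1) a simple set lookup.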

[0066] Step 4: The dispatcher schedules the task.

[0067] 4.1) The system archives and registers the generated tokens, performs deduplication based on the token's id field, and then submits them to the distributor in the data collection system for processing.

[0068] 4.2) The distributor selects the distribution target for the received Token based on the real-time running status of each executor. Specifically, it first uses a for loop to determine whether any executor is in the Idle state.

[0069] If so, select the highest-numbered executor in the Idle state as the distribution target and push the Token to it;

[0070] If not, use an if statement to determine whether exactly one executor in the Busy state has the smallest queue size:

[0071] If so, select that executor as the distribution target and push the Token to it;

[0072] If not, select the highest-numbered executor among those tied for the smallest queue size as the distribution target and push the Token to it;

[0073] 4.3) After obtaining the Token, the executor parses the Token's priority information and adds the Token to its own high, medium, or low priority queue accordingly, where it waits for execution.
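The deduplication and target selection of step 4 can be sketched in Python as follows. This is one reading of the rule in the text: prefer an Idle executor, otherwise take the Busy executor with the smallest queue, and break ties by the highest executor number. ExecutorView, deduplicate, and choose_target are hypothetical names, and Tokens are represented as plain dictionaries for brevity.

```python
from dataclasses import dataclass


@dataclass
class ExecutorView:
    number: int       # executor number used for tie-breaking
    state: str        # "Idle", "Busy", ...
    queue_size: int   # total Tokens waiting in its three queues


def deduplicate(tokens):
    """Keep the first Token seen for each id, preserving order."""
    seen = set()
    unique = []
    for token in tokens:
        if token["id"] not in seen:
            seen.add(token["id"])
            unique.append(token)
    return unique


def choose_target(executors):
    """Pick the executor that should receive the next Token."""
    idle = [e for e in executors if e.state == "Idle"]
    if idle:
        # Highest-numbered Idle executor.
        return max(idle, key=lambda e: e.number)
    busy = [e for e in executors if e.state == "Busy"]
    if not busy:
        return None   # no executor can accept work right now
    smallest = min(e.queue_size for e in busy)
    candidates = [e for e in busy if e.queue_size == smallest]
    # A single candidate is returned as is; ties go to the highest number.
    return max(candidates, key=lambda e: e.number)


executors = [
    ExecutorView(1, "Busy", 2),
    ExecutorView(2, "Busy", 1),
    ExecutorView(3, "Busy", 1),
]
target = choose_target(executors)   # executors 2 and 3 tie; 3 is chosen
```

Executors in other states (for example Hangup or Broken) are never selected in this sketch, which matches the intent of scheduling on real-time status.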

[0074] Step 5: The executor executes the Token.

[0075] 5.1) The executor retrieves tokens from the corresponding queues in high, medium, and low order, and changes its state to Busy.

[0076] 5.2) The executor parses the token, obtains the execution script information, and the browser runs the script to automatically execute the preset operations and complete the execution of a single token;

[0077] 5.3) The executor repeats steps 5.1) and 5.2) according to the order of the tokens in the queues until all priority queues are empty, completing the execution of all tokens, and then changes its own state to Idle.
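The retrieval order in step 5 can be illustrated with a short Python sketch. The function names and the dictionary layout of the queues are assumptions; run_script stands in for the browser running the Token's script.

```python
from collections import deque


def next_token(queues):
    """Return the next Token in high, medium, low order, or None if all are empty."""
    for level in ("high", "medium", "low"):
        if queues[level]:
            return queues[level].popleft()
    return None


def run_until_empty(queues, run_script):
    """Execute Tokens until every queue is empty and return the execution order."""
    order = []
    while True:
        token = next_token(queues)
        if token is None:
            break                 # all queues empty: the executor would go Idle
        run_script(token)         # the executor is Busy while this runs
        order.append(token)
    return order


queues = {
    "high": deque(["h1"]),
    "medium": deque(["m1", "m2"]),
    "low": deque(["l1"]),
}
order = run_until_empty(queues, run_script=lambda token: None)
```

Because the high queue is checked again after every Token, a high-priority Token that arrives mid-run is served before any remaining medium or low Tokens.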

[0078] Step 6: Define executor exceptions.

[0079] The system pre-defines the exceptions that may occur while the executor executes a Token, including: minor exceptions, serious exceptions, suspension exceptions, and corruption exceptions;

[0080] A minor exception is an exception caused by network fluctuations of the proxy IP associated with the executor, including access timeout exceptions and page loading exceptions.

[0081] A serious exception is an exception caused by damage to the executor kernel driver, including runtime exceptions and unexpected exceptions;

[0082] A suspension exception is an exception raised when the executor matches a preset situation, the specific situation being preset by the user;

[0083] A corruption exception is an exception caused by the executor's inability to respond to system communications.
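The four exception categories of step 6 could be mapped onto concrete exception types along the following lines. The mapping is an assumption for illustration: Python's built-in TimeoutError and ConnectionError stand in for network-related failures, the two custom exception classes are hypothetical, and anything unrecognized is treated as serious, in line with the statement that serious exceptions include runtime and unexpected exceptions.

```python
from enum import Enum


class ExceptionLevel(Enum):
    MINOR = "minor"
    SERIOUS = "serious"
    SUSPENSION = "suspension"
    CORRUPTION = "corruption"


class SuspendRequested(Exception):
    """Hypothetical: raised when a user-preset suspension condition matches."""


class ExecutorUnresponsive(Exception):
    """Hypothetical: raised when the executor stops answering the system."""


def classify(exc):
    """Map an exception instance to one of the four categories."""
    if isinstance(exc, SuspendRequested):
        return ExceptionLevel.SUSPENSION
    if isinstance(exc, ExecutorUnresponsive):
        return ExceptionLevel.CORRUPTION
    if isinstance(exc, (TimeoutError, ConnectionError)):
        # Network fluctuation on the proxy: timeouts, dropped connections.
        return ExceptionLevel.MINOR
    # Runtime and any unexpected exceptions are treated as serious.
    return ExceptionLevel.SERIOUS


level = classify(TimeoutError("page load timed out"))
```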

[0084] Step 7: The executor automatically handles various exceptions that occur during the execution of the Token.

[0085] As shown in Figure 2, the specific implementation of this step is as follows:

[0086] If a minor exception occurs, the executor resubmits the running Token to the system distributor, increments the minor exception count by one, and proceeds to step 8.

[0087] If a serious exception occurs, the executor changes its own state to Closing, disconnects the SSH connection, changes the associated proxy IP state from Busy to Free, returns it to the proxy pool, and proceeds to step 10.

[0088] If a suspension exception occurs, the executor enters hibernation, and the system issues a notification and waits for the user to manually operate the system or browser before proceeding to step 11.

[0089] If a corruption exception occurs, the executor changes to the Broken state, the Tokens in the executor's queue are checked out and resubmitted to the system distributor, and a notification is issued. The administrator then deletes the executor process at the collection server level.

[0090] Step 8: Determine whether the minor exception count has reached the preset value.

[0091] An if statement is executed to determine whether the minor exception count has reached ten:

[0092] If the minor exception count is less than ten, return to step 5;

[0093] If the minor exception count has reached ten, change the executor's associated IP address; that is, disconnect the executor's SSH connection and change the proxy IP status to Frozen. As shown in Figure 4, the proxy state changes from Busy to Frozen. Then proceed to step 9.

[0094] Step 9: Determine if there are any free IPs in the proxy pool.

[0095] Set a for loop to poll the status of each IP in the proxy IP pool and verify if there are any IPs in a free state:

[0096] If none exists, the executor enters a suspension exception;

[0097] If it exists, execute the setProxy function to select a Free IP in the proxy pool, reset the minor anomaly count of the executor to zero, and return to step 5;
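The minor exception path through steps 7 to 9 can be sketched in Python as below. The executor and the proxy pool are modeled as plain dictionaries, and the function name, the return labels, and the use of the Hangup state for an executor that entered a suspension exception are assumptions. Marking the newly selected IP as Busy follows step 2.3) rather than an explicit statement in step 9.

```python
MINOR_LIMIT = 10   # threshold from step 8


def handle_minor(executor, proxy_states, resubmit):
    """Handle one minor exception and report what happens next."""
    # Step 7: hand the running Token back to the distributor and count the failure.
    resubmit(executor["current_token"])
    executor["minor_count"] += 1
    if executor["minor_count"] < MINOR_LIMIT:
        return "retry"                     # back to step 5
    # Step 8: the count reached ten, so freeze the current proxy IP.
    proxy_states[executor["proxy_ip"]] = "Frozen"
    executor["proxy_ip"] = None
    # Step 9: poll the pool for a Free IP.
    for ip, state in proxy_states.items():
        if state == "Free":
            proxy_states[ip] = "Busy"      # assumption, mirroring step 2.3)
            executor["proxy_ip"] = ip
            executor["minor_count"] = 0
            return "rotated"               # back to step 5 with a new IP
    executor["state"] = "Hangup"           # no Free IP: suspension exception
    return "suspended"


executor = {"proxy_ip": "203.0.113.10", "minor_count": 0,
            "state": "Busy", "current_token": "t1"}
proxy_states = {"203.0.113.10": "Busy", "203.0.113.11": "Free"}
resubmitted = []
outcomes = [handle_minor(executor, proxy_states, resubmitted.append)
            for _ in range(10)]
```

In this run the first nine calls ask for a retry and the tenth freezes the first IP and rotates to the second one.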

[0098] Step 10: The executor enters the termination process.

[0099] As shown in Figure 3, the executor calls the setStatus function of the Agent object, changes the executor status to Terminated, deletes the executor driver, returns to step 2.2), and retains the execution queue.
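The serious exception path (step 7 followed by step 10) can be sketched in Python as follows. The dictionary layout, the function name, and the make_driver callback are assumptions; make_driver stands in for creating a new browser driver as in step 2.2). The point of the sketch is that the queues survive the rebuild.

```python
def handle_serious(executor, proxy_states, make_driver):
    """Tear down a damaged executor and rebuild it, keeping its queues."""
    # Step 7, serious branch: Closing, then return the proxy IP to the pool.
    executor["state"] = "Closing"
    proxy_states[executor["proxy_ip"]] = "Free"
    executor["proxy_ip"] = None
    # Step 10: Terminated, drop the damaged driver; queues are left untouched.
    executor["state"] = "Terminated"
    executor["driver"] = None
    # Back to step 2.2): new driver, then a Free proxy as in steps 2.3) and 2.4).
    executor["driver"] = make_driver()
    executor["state"] = "Initializing"
    for ip, state in proxy_states.items():
        if state == "Free":                # at least the released IP is Free
            proxy_states[ip] = "Busy"
            executor["proxy_ip"] = ip
            break
    executor["state"] = "Idle"
    return executor


executor = {
    "state": "Busy",
    "proxy_ip": "203.0.113.10",
    "driver": "old-driver",                # placeholder for a driver object
    "queues": {"high": ["t1"], "medium": [], "low": ["t2"]},
}
proxy_states = {"203.0.113.10": "Busy", "203.0.113.11": "Free"}
handle_serious(executor, proxy_states, make_driver=lambda: "new-driver")
```

Since the pool is scanned in insertion order, this sketch may pick the IP that was just released; the source does not say whether a different IP should be preferred.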

[0100] Step 11: Trigger the executor.

[0101] When a user clicks the Trigger button in the software, the code executes the trigger function, which re-triggers the corresponding suspended executor, allowing that executor to continue executing the subsequent steps in its thread.
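One way to model the suspend and trigger behavior of step 11 is a thread that blocks on an event until the user triggers it. The sketch below uses Python's threading.Event; the class name, the suspend_before parameter, and the example step labels are assumptions, and the notification to the user is omitted.

```python
import threading


class SuspendableExecutor:
    def __init__(self):
        self._resume = threading.Event()    # set by trigger()
        self.suspended = threading.Event()  # set once the executor is waiting
        self.state = "Busy"
        self.log = []

    def run_token(self, steps, suspend_before):
        """Run the steps of one Token, pausing before index suspend_before."""
        for index, step in enumerate(steps):
            if index == suspend_before:
                self._resume.clear()
                self.state = "Hangup"
                self.suspended.set()
                self._resume.wait()         # block until the user triggers
                self.state = "Busy"
            self.log.append(step)
        self.state = "Idle"

    def trigger(self):
        """Called when the user clicks Trigger: resume the suspended thread."""
        self._resume.set()


executor = SuspendableExecutor()
worker = threading.Thread(
    target=executor.run_token,
    args=(["open page", "log in", "collect data"], 1),
    daemon=True,
)
worker.start()
executor.suspended.wait(timeout=5)          # wait until the executor is paused
state_while_suspended = executor.state
executor.trigger()
worker.join(timeout=5)
```

The executor resumes exactly where it stopped, so the remaining steps of the current Token run in the same thread after the trigger.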

[0102] Terminology Explanation:

[0103] IP: Abbreviation for Internet Protocol, a network interconnection protocol;

[0104] Redis: A non-relational database based on key-value pairs;

[0105] SSH: Abbreviation for Secure Shell, a secure shell protocol;

[0106] VPS: Abbreviation for Virtual Private Server;

[0107] Selenium: A tool for testing browser applications;

[0108] Webdriver: Browser driver;

[0109] Cron: A scheduled execution tool;

[0110] Token: A token that serves as a carrier of task information;

[0111] Free: Indicates that the proxy is available;

[0112] Idle: Indicates that the executor is in an idle state;

[0113] Busy: Indicates that the proxy or executor is busy;

[0114] New: Indicates that the executor has been newly created;

[0115] Initializing: Indicates that the executor is initializing;

[0116] Closing: Indicates that the executor is closing;

[0117] Terminated: Indicates that the executor has terminated;

[0118] Hangup: Indicates that the executor is in a suspended state;

[0119] Broken: Indicates that the executor is corrupted;

[0120] Frozen: Indicates that the proxy is frozen;

[0121] Trigger: Indicates triggering the executor.

Claims

1. A distributed crawler task scheduling method based on executor state control, characterized in that it includes the following steps:

(1) Generate a proxy IP pool:

1a) Generate a Secure Shell (SSH) key pair in the data acquisition server;

1b) Purchase n VPS proxy servers and add the public key from the SSH key pair to the authorized_keys file under the /root/.ssh path on each VPS proxy server;

1c) Send SSH commands on the data collection server to establish remote connections between the data collection server and each proxy server using the private key in the SSH key pair, store the IP addresses of these proxy servers to form a proxy IP pool, and then set the status of each proxy IP to Free, waiting for subsequent calls from the executor.

(2) Construct a distributed data acquisition executor for the system:

2a) Based on the WebDriver function library provided by the Selenium tool, create a browser page as a new data collection executor, and set the executor status to New;

2b) The executor selects the proxy IP in the Free state in step 1c), establishes an SSH connection between the executor and the proxy server, updates the executor status to Initializing, updates the IP status to Busy, sets the executor's traffic exit address to the proxy IP, and updates the executor status to Idle, thus obtaining a data collection executor based on the proxy IP.

2c) Repeat steps 2a) and 2b) to add m data collection executors to complete the construction of the distributed data collection executor system, where m is less than n, to ensure that the number of proxy pool IPs is greater than the number of executors;

(3) Configure and generate data acquisition tasks:

3a) Based on the characteristics of the target data crawled from the Internet, write a script to drive the executor to execute automatically, and set the priority of task execution, including three levels: high, medium and low;

3b) Set the task execution cycle and generate a token that records relevant information about the task according to the Cron expression at regular intervals. This information includes the execution script and the running priority.

(4) Distributor scheduling tasks:

4a) The generated tokens are archived and registered. After deduplication based on the id field of the tokens, they are submitted to the distributor in the collection system for processing.

4b) The distributor distributes the received tokens to the distributed collection executors of the system constructed in step (2);

4c) After obtaining the Token, the executor adds it to the high, medium, or low priority queue corresponding to this executor based on the Token's priority information.

(5) The executor executes the Token;

(6) Define the exceptions that may occur while the executor executes the Token, including: minor exception, serious exception, suspension exception and corruption exception;

(7) The executor automatically handles the exceptions that occur during the data acquisition process:

If a minor exception occurs, the executor resubmits the running Token to the system distributor, increments the minor exception count by one, and executes step (8).

If a serious exception occurs, the executor changes its own state to Closing, disconnects the SSH connection, changes the associated proxy IP state from Busy to Free, returns it to the proxy pool, and executes step (10).

If a suspension exception occurs, the executor enters hibernation, and the system issues a notification and waits for the user to manually operate the system or browser, then executes step (11).

If a corruption exception occurs, the executor changes to the Broken state, the Tokens in the executor's queue are checked out and resubmitted to the system distributor, and a notification is issued. The administrator then deletes the executor process at the collection server level.

(8) Determine whether the minor exception count has reached ten:

If the minor exception count is less than ten, return to step (5);

If the minor exception count has reached ten, change the IP associated with the executor, that is, disconnect the executor's SSH connection, change the proxy IP status to Frozen, and execute step (9).

(9) Determine whether there is a Free IP in the proxy pool:

If none exists, the executor enters a suspension exception;

If one exists, select a Free IP from the proxy pool, reset the minor exception count of the executor to zero, and return to step (5).

(10) The executor enters the Terminated state, deletes the executor driver, returns to step 2a), and retains the execution queue;

(11) The user triggers the executor, which then executes the subsequent steps of the current Token.

2. The method according to claim 1, characterized in that, in step 1a), the Secure Shell (SSH) key pair is generated as follows: run the command "ssh-keygen -t rsa" in the /root/.ssh path of the data collection server to generate two files, where id_rsa is the private key and id_rsa.pub is the public key.

3. The method according to claim 1, characterized in that, in step 2a), the browser page is created based on the WebDriver function library provided by the Selenium tool by importing the Selenium.Webdriver function library into the code, creating a new browser driver object named driver, and calling the driver.get function to access the target URL, thus completing the initialization and creation of the browser.

4. The method according to claim 1, characterized in that, in step 4b), the distributor distributes the received Tokens depending on whether any executor is currently in the Idle state: if so, the received Token is distributed to an executor in the Idle state; if not, the Token is distributed to the executor in the Busy state with the smallest queue size.

5. The method according to claim 1, characterized in that the executor in step (5) executes the Token as follows: 5a) the executor retrieves Tokens from the corresponding queues in high, medium, and low order, and changes its state to Busy; 5b) the executor parses the Token and obtains the execution script information, and the browser runs the script and automatically performs the preset operations to complete the execution of the Token; 5c) the executor repeats 5a) and 5b) according to the order of the Tokens in the queue until the queue to be executed is empty, completing the execution of all Tokens, and then changes its own state to Idle.

6. The method according to claim 1, characterized in that, in step (6), the exceptions that occur during the data collection process are defined according to the following classification criteria: a minor exception is an exception caused by network fluctuations of the proxy IP associated with the executor, including access timeout exceptions and page loading exceptions; a serious exception is an exception caused by damage to the executor kernel driver, including runtime exceptions and unexpected exceptions; a suspension exception is an exception raised when the executor matches a preset situation, the specific situation being preset by the user; a corruption exception is an exception caused by the executor's inability to respond to system communications.