Intelligent scheduling method, apparatus, device, and medium for multi-protocol proxy resources
By constructing a proxy resource pool and conducting multi-dimensional proactive detection and dynamic scoring, combined with the attributes of crawler tasks, a weighted random algorithm is used to select proxy IPs, thus solving the problem of rigid proxy IP scheduling and achieving efficient and accurate data collection and resource utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING LIANCHI SYSTEM TECHNOLOGY CO LTD
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies employ rigid proxy IP scheduling methods, which cannot adapt to the varying anti-scraping strengths of different target websites. This results in low data collection success rates, easy triggering of anti-scraping mechanisms, and difficulty in improving proxy resource utilization.
A proxy resource pool is constructed to store the metadata of proxy IPs. Multi-dimensional active probing and a dynamic comprehensive scoring model generate a global comprehensive score and a contextualized score for each proxy IP. Combined with the crawler task attribute information, a weighted random algorithm then selects the target proxy IP for scheduling.
It improves the data collection success rate, reduces the probability of triggering the target website's anti-scraping mechanism, enhances the utilization rate of proxy resources, and meets the collection needs of large-scale and multi-scenario applications.
Figure CN122248057A_ABST
Description
Technical Field
[0001] This application relates to the field of proxy service technology, specifically to an intelligent scheduling method, apparatus, device, and medium for multi-protocol proxy resources.
Background Technology
[0002] In the field of large-scale network data collection, multi-protocol proxy IPs are a crucial resource for achieving efficient data collection. However, existing technologies for scheduling and managing proxy IPs have significant shortcomings, making it difficult to adapt to the needs of large-scale, multi-scenario collection. Current technologies often employ simple polling or fixed-rule switching for proxy IP scheduling, failing to consider proxy quality or selecting proxy IPs based solely on a single metric. This makes it difficult to adapt to the varying anti-scraping strengths of different target websites. Furthermore, the rigid scheduling strategies for proxy IPs are prone to mismatches between proxy IPs and collection tasks, resulting in low data collection success rates, increased susceptibility to triggering anti-scraping mechanisms on target websites, and difficulty in improving proxy resource utilization. Consequently, these technologies fail to meet the demands for refined and efficient proxy resource scheduling.
Summary of the Invention
[0003] To address the aforementioned technical problems, this application provides an intelligent scheduling method, apparatus, device, and medium for multi-protocol proxy resources.
[0004] In a first aspect, this application provides an intelligent scheduling method for multi-protocol proxy resources, comprising: accessing proxy IPs of multiple protocol types and constructing a proxy resource pool; configuring and storing metadata for each proxy IP, the metadata including at least protocol type, IP attribute type, geographical location, and provider information; performing multi-dimensional active probing on the proxy IPs in the proxy resource pool at a preset frequency to obtain the real-time probing results of each proxy IP; combining the real-time probing results of each proxy IP with the corresponding metadata and determining the global comprehensive score of each proxy IP through a dynamic comprehensive scoring model; simultaneously combining the historical access data of each proxy IP to different websites to generate a contextualized score for each proxy IP for each website, the contextualized score representing the access adaptability of the proxy IP to that website; receiving a proxy call request from a crawler task, the request carrying the task attribute information of the crawler task; filtering a set of candidate proxy IPs from the proxy resource pool based on the task attribute information, where the task attribute information includes at least the target website, allowed IP attribute types, and geographic location preferences; determining a decision score for each candidate proxy IP based on the global comprehensive score and the contextualized score for the target website; using the decision score of each candidate proxy IP as a scheduling weight; selecting a target proxy IP from the candidate proxy IP set using a weighted random algorithm; and assigning the target proxy IP to the crawler task so that the crawler task can initiate data collection requests to the target website based on the target proxy IP.
[0005] By adopting the above technical solutions, a proxy resource pool is constructed and the metadata of proxy IPs is stored, providing comprehensive information for subsequent scheduling. Multi-dimensional proactive detection and dynamic comprehensive scoring models can accurately evaluate the overall performance of proxy IPs. Contextualized scoring reflects the adaptability of proxy IPs to different websites. Filtering the candidate proxy IP set based on task attribute information improves the matching degree between proxy IPs and crawler tasks. Determining the decision score based on global comprehensive scoring and contextualized scoring, and using a weighted random algorithm to select target proxy IPs, enables intelligent scheduling, improves data collection success rate, reduces the probability of triggering anti-crawling mechanisms on target websites, increases proxy resource utilization, and meets the needs of large-scale, multi-scenario data collection.
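As a rough illustration of the pooling step described above, the following Python sketch stores the four metadata items named in the method (protocol type, IP attribute type, geographical location, provider) for each proxy IP. The class names, field names, and the `host:port` key are assumptions made for this example; the application does not specify a storage layout.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProxyMetadata:
    """Static attributes stored per proxy IP (field names are illustrative)."""
    address: str    # hypothetical "host:port" identifier
    protocol: str   # e.g. "http", "https", "socks5"
    ip_type: str    # e.g. "residential", "datacenter", "mobile"
    country: str    # geographical location, reduced to a country code here
    provider: str   # supplier information


@dataclass
class ProxyPool:
    """In-memory resource pool keyed by proxy address."""
    proxies: dict = field(default_factory=dict)

    def add(self, meta: ProxyMetadata) -> None:
        self.proxies[meta.address] = meta

    def count_by_protocol(self) -> Counter:
        # Simple classification view over the pooled proxies.
        return Counter(m.protocol for m in self.proxies.values())


pool = ProxyPool()
pool.add(ProxyMetadata("203.0.113.10:8080", "http", "datacenter", "US", "provider-a"))
pool.add(ProxyMetadata("198.51.100.7:1080", "socks5", "residential", "DE", "provider-b"))
```

A production pool would likely live in a database or cache rather than a dictionary; the dictionary only keeps the sketch self-contained.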
[0006] Optionally, multi-dimensional proactive detection includes at least one of the following: connectivity detection, used to detect whether the proxy IP can successfully establish a network connection; latency and speed detection, used to measure the response time and data transmission rate of the proxy IP; anonymity detection, used to detect whether the proxy IP will leak the real client IP; stability detection, used to statistically analyze the historical success rate of the proxy IP within a preset statistical period; and website reachability detection, used to detect the access availability of the proxy IP to the collected websites.
[0007] By adopting the above technical solutions, the connectivity detection in the multi-dimensional active detection can detect whether the proxy IP can successfully establish a network connection; latency and speed detection can measure response time and data transmission rate; anonymity detection can detect whether the real client IP will be leaked; stability detection can statistically analyze historical success rates; and website reachability detection can detect the access availability of the target website. This allows for a more comprehensive evaluation of the proxy IP quality, meeting the differences in anti-scraping strength of different target websites, improving the data collection success rate, avoiding triggering the target website's anti-scraping mechanism, enhancing proxy resource utilization, and meeting the needs for refined and efficient proxy resource scheduling.
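Of the probing dimensions listed above, stability detection is the easiest to show without network access. The sketch below assumes that probe outcomes are recorded with timestamps in non-decreasing order and that the "preset statistical period" is a sliding window in seconds; `StabilityTracker` and its methods are hypothetical names, not part of the application.

```python
from collections import deque


class StabilityTracker:
    """Historical success rate of one proxy IP over a sliding window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp: float, succeeded: bool) -> None:
        self._events.append((timestamp, succeeded))

    def success_rate(self, now: float):
        # Drop events that have fallen out of the statistical period.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        if not self._events:
            return None  # no data in the window; the caller picks a default
        successes = sum(1 for _, ok in self._events if ok)
        return successes / len(self._events)


tracker = StabilityTracker(window_seconds=60)
tracker.record(0, False)   # too old to count at now=100
tracker.record(50, True)
tracker.record(70, True)
tracker.record(80, False)
rate = tracker.success_rate(now=100)
```

Connectivity, latency, anonymity, and reachability probes would each produce their own indicator in a similar per-proxy record.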
[0008] Optionally, a global comprehensive score for each proxy IP can be determined through a dynamic comprehensive scoring model, including: assigning basic weights to various static attributes in the metadata and assigning performance weights to various dynamic performance indicators in the real-time detection results; and determining the global comprehensive score of the proxy IP based on the weighted calculation results of the basic weights and performance weights.
[0009] By adopting the above technical solution, basic weights are assigned to static attributes in metadata, performance weights are assigned to dynamic performance indicators in real-time detection results, and a global comprehensive score for proxy IPs is determined based on the weighted calculation results. This allows for a comprehensive evaluation of proxy IPs by combining their static attributes and dynamic performance, thereby more accurately measuring the overall quality of proxy IPs and providing a more reliable basis for subsequent scheduling decisions. This helps improve the matching degree between proxy IPs and crawler tasks, thereby increasing the data collection success rate, avoiding triggering the anti-crawling mechanism of the target website, improving the utilization rate of proxy resources, and meeting the needs of refined and efficient proxy resource scheduling.
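The weighted calculation described above can be sketched as a normalized weighted average. This assumes every static attribute and dynamic indicator has already been mapped to a value in [0, 1]; the attribute names and weight values are invented for illustration, and the application also allows a machine-learning model in place of fixed weights.

```python
def global_score(static_scores, dynamic_scores, base_weights, perf_weights):
    """Weighted average of static attributes and dynamic indicators."""
    weighted = sum(base_weights[k] * static_scores[k] for k in base_weights)
    weighted += sum(perf_weights[k] * dynamic_scores[k] for k in perf_weights)
    total_weight = sum(base_weights.values()) + sum(perf_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return weighted / total_weight


score = global_score(
    static_scores={"ip_type": 1.0, "provider": 0.5},        # from metadata
    dynamic_scores={"success_rate": 0.9, "latency": 0.6},   # from probing
    base_weights={"ip_type": 1.0, "provider": 1.0},         # assumed values
    perf_weights={"success_rate": 3.0, "latency": 1.0},     # assumed values
)
```

Dividing by the total weight keeps the result in [0, 1], which makes it directly comparable with a contextualized score on the same scale.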
[0010] Optionally, generating a contextualized score for each proxy IP for each website includes: for each proxy IP, obtaining the historical access data of the proxy IP for each website, wherein the historical access data includes at least the number of successful visits, the number of failures, and the number of times the proxy IP triggered the anti-scraping mechanism on each website; and determining the contextualized score of the proxy IP for each website based on the historical access data.
[0011] By adopting the above technical solution, historical access data such as the number of successful visits, failures, and anti-scraping mechanisms triggered by the proxy IP for each website can be obtained. Based on this data, a contextualized score for the proxy IP for each website can be determined. This score can represent the access adaptability for each website, which helps to more accurately assess the applicability of the proxy IP to different websites, thereby improving the matching degree between the proxy IP and the data collection task, thus increasing the data collection success rate, reducing the probability of triggering the anti-scraping mechanism of the target website, and improving the utilization rate of proxy resources.
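The application does not give a formula for turning the three counters into a score. One possible choice, shown here purely as an assumption, is a smoothed success ratio in which each anti-scraping trigger counts as several ordinary failures. The `block_penalty` parameter and the +1/+2 smoothing terms are illustrative.

```python
def contextual_score(successes, failures, blocks, block_penalty=3.0):
    """Per-website score in (0, 1) from historical access counters.

    Assumes `failures` and `blocks` are counted separately. A proxy with
    no history on a website scores 0.5.
    """
    effective_failures = failures + block_penalty * blocks
    return (successes + 1.0) / (successes + effective_failures + 2.0)


# The same proxy IP can score very differently on two websites.
score_site_a = contextual_score(successes=95, failures=5, blocks=0)
score_site_b = contextual_score(successes=20, failures=30, blocks=10)
```

The smoothing keeps a proxy with only one or two recorded requests from jumping straight to an extreme score.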
[0012] Optionally, a decision score for each candidate proxy IP is determined based on the global comprehensive score and the contextual score for the target website. This includes: weighting and summing the global comprehensive score and the contextual score according to a preset weight allocation rule, and using the weighted sum as the decision score, wherein the weight coefficient of the contextual score is greater than the weight coefficient of the global comprehensive score.
[0013] By adopting the above technical solution, the decision score is determined by weighted summation of the global comprehensive score and the contextual score, with a larger weight coefficient for the contextual score. This can more accurately measure the adaptability and performance of candidate proxy IPs to the target website, improve the matching degree between proxy IPs and crawler tasks, thereby increasing the data collection success rate, reducing the probability of triggering the target website's anti-crawling mechanism, improving the utilization rate of proxy resources, and meeting the needs of refined and efficient proxy resource scheduling.
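The weighted summation above, including the constraint that the contextual weight exceeds the global weight, can be sketched as follows. The default coefficients 0.6 and 0.4 are assumptions; the application only requires the first to be greater than the second.

```python
def decision_score(global_score, contextual_score,
                   global_weight=0.4, contextual_weight=0.6):
    """Weighted sum of the two scores used as the scheduling weight."""
    if contextual_weight <= global_weight:
        raise ValueError("contextual weight must exceed the global weight")
    return global_weight * global_score + contextual_weight * contextual_score


# A proxy that is average overall but works well on the target website.
fits_site = decision_score(global_score=0.5, contextual_score=1.0)
# A proxy that is excellent overall but performs poorly on the target website.
fits_generally = decision_score(global_score=1.0, contextual_score=0.5)
```

With these weights the website-specific proxy ends up with the higher decision score, which is the effect the larger contextual coefficient is meant to produce.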
[0014] Optionally, after assigning the target proxy IP to the crawler task, the above method further includes: monitoring the execution results of the crawler task using the target proxy IP to collect data; and dynamically adjusting the contextualized score of the target proxy IP for the target website based on the execution results.
[0015] By adopting the above technical solution, monitoring the execution results of crawler tasks using target proxy IPs for data collection allows for timely understanding of the target proxy IP usage. Dynamically adjusting the contextualized scores of target proxy IPs for target websites based on the execution results makes the scores more relevant to actual usage. By monitoring the execution results in real time and dynamically adjusting the scores, the contextualized scores accurately reflect the latest access performance of the target proxy IPs on the target websites, avoiding scheduling based on outdated scores and significantly improving the accuracy of subsequent scheduling decisions, thus making subsequent proxy IP scheduling more precise.
[0016] Optionally, the contextual score of the target proxy IP for the target website can be dynamically adjusted based on the execution result, including: lowering the contextual score of the target proxy IP for the target website in response to execution failure or triggering an anti-scraping mechanism; and raising the contextual score of the target proxy IP for the target website in response to execution success.
[0017] By adopting the above technical solution, the contextualized score of the target proxy IP for the target website is dynamically adjusted based on the execution result. The contextualized score is lowered when the execution result is failure or triggers an anti-scraping mechanism, and raised when the execution result is success. This makes the target proxy IP score more closely reflect the actual situation, allowing the scheduling of proxy IPs to more accurately adapt to the differences in anti-scraping strength across different websites. Through this clear rule of "success adds points, failure / triggering anti-scraping deducts points," the contextualized score can reflect the latest availability status of the proxy IP on the current website in real time, objectively and quantitatively, and directly affect the decision score and selection probability in subsequent scheduling, forming a fully automatic adaptive closed loop of "scheduling → usage → feedback → score correction → rescheduling."
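A minimal sketch of the "success adds points, failure or anti-scraping trigger deducts points" rule described above. The step sizes, the three outcome labels, and the clamping to [0, 1] are assumptions; the application states only the direction of each adjustment.

```python
SUCCESS_STEP = 0.02   # assumed reward per successful execution
FAILURE_STEP = 0.05   # assumed deduction per ordinary failure
BLOCK_STEP = 0.20     # assumed deduction per anti-scraping trigger


def adjust_contextual_score(score, outcome):
    """Return the updated contextual score, clamped to [0, 1]."""
    if outcome == "success":
        delta = SUCCESS_STEP
    elif outcome == "failure":
        delta = -FAILURE_STEP
    elif outcome == "blocked":
        delta = -BLOCK_STEP
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return min(1.0, max(0.0, score + delta))


after_success = adjust_contextual_score(0.5, "success")
after_failure = adjust_contextual_score(0.5, "failure")
after_block = adjust_contextual_score(0.5, "blocked")
```

Because the adjusted score feeds the next decision score, each outcome changes the selection probability of that proxy for that website on later requests.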
[0018] Optionally, after lowering the contextual score of the target proxy IP for the target website, the above method further includes: marking the target proxy IP as in a cool-down state for the target website, in which the target proxy IP is not assigned to crawling tasks for the target website.
[0019] By adopting the above technical solution, the target proxy IP is marked as being in a cool-down state after its contextual score is lowered, and a forced isolation strategy prevents it from being scheduled for tasks on that website for a certain period of time. This lets the system respond quickly to risky behavior and serves as a key protection mechanism for a highly available, hard-to-detect crawler system.
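The cool-down state is scoped to a (proxy IP, website) pair, so a proxy resting for one website can still serve others. The sketch below assumes a fixed cool-down duration and caller-supplied timestamps; `CooldownRegistry` is a hypothetical name.

```python
class CooldownRegistry:
    """Tracks which proxy IPs are cooling down for which websites."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self._until = {}  # (proxy, site) -> timestamp when cool-down ends

    def mark(self, proxy: str, site: str, now: float) -> None:
        self._until[(proxy, site)] = now + self.cooldown_seconds

    def is_available(self, proxy: str, site: str, now: float) -> bool:
        expiry = self._until.get((proxy, site))
        return expiry is None or now >= expiry


registry = CooldownRegistry(cooldown_seconds=300)
registry.mark("203.0.113.10:8080", "site-a.example", now=0)
```

The candidate filter would call `is_available` before a proxy is allowed into the candidate set for a given target website.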
[0020] Optionally, the above method further includes: in response to the crawler task failing more than a first preset threshold number of times on the target website or triggering the anti-crawling mechanism more than a second preset threshold number of times, adjusting at least one of the following parameters of the crawler task: request frequency, request interval range, and number of consecutive accesses per IP.
[0021] By adopting the above technical solutions, when the number of consecutive failures of the crawler task on the target website exceeds the first preset threshold or the number of times the anti-crawling mechanism is triggered exceeds the second preset threshold, adjusting the request frequency can avoid triggering the target website's anti-crawling mechanism due to excessively frequent requests; adjusting the request interval range can make the requests more dispersed, reducing the possibility of the target website detecting patterns; adjusting the number of consecutive accesses from a single IP can reduce the risk caused by continuous access from the same IP, improve the data collection success rate, further enhance the utilization rate of proxy resources, and better adapt to the anti-crawling strength of different target websites.
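The parameter adjustment above can be sketched as a function that leaves the crawler parameters alone until either threshold is exceeded and then backs off on all three. Halving the rate, doubling the interval bounds, and halving the per-IP run length are assumed policies; the application only says that at least one of the parameters is adjusted.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CrawlParams:
    requests_per_minute: float
    interval_range: tuple        # (min_seconds, max_seconds) between requests
    max_consecutive_per_ip: int


def throttle_if_needed(params, failures, blocks,
                       failure_threshold, block_threshold):
    """Return backed-off parameters once either threshold is exceeded."""
    if failures <= failure_threshold and blocks <= block_threshold:
        return params
    low, high = params.interval_range
    return replace(
        params,
        requests_per_minute=params.requests_per_minute / 2,
        interval_range=(low * 2, high * 2),
        max_consecutive_per_ip=max(1, params.max_consecutive_per_ip // 2),
    )


params = CrawlParams(requests_per_minute=60, interval_range=(1.0, 3.0),
                     max_consecutive_per_ip=10)
unchanged = throttle_if_needed(params, failures=3, blocks=0,
                               failure_threshold=5, block_threshold=2)
slowed = throttle_if_needed(params, failures=3, blocks=3,
                            failure_threshold=5, block_threshold=2)
```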
[0022] Optionally, a weighted random algorithm is used to select the target proxy IP from the candidate proxy IP set, including: calculating the sum of the decision scores of all proxy IPs in the candidate proxy IP set as the total weight; generating a random number between 0 and the total weight; determining the weight range corresponding to each proxy IP by accumulating the decision scores of the candidate proxy IPs in sequence; and determining the corresponding proxy IP as the target proxy IP based on the weight range that the random number falls into.
[0023] By adopting the above technical solution, calculating the total weight, generating a random number, determining the weight ranges, and selecting the target proxy IP according to the range into which the random number falls achieves intelligent scheduling of multi-protocol proxy resources. The probability of each proxy being selected is proportional to its decision score: proxies with high scores get more opportunities, while proxies with low scores still retain a chance of being selected, which is both fair and well-founded. Because randomness is introduced, even if the set of high-scoring proxies is relatively fixed, the specific proxy selected differs from request to request. This avoids the fixed access patterns that result from always using the same batch of top-ranked proxies and reduces the risk of being blocked by target websites through pattern recognition.
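The four steps above (total weight, random number, cumulative ranges, range lookup) map directly onto a short routine. The function name and the injectable `rng` argument are choices made for this sketch; `rng` only needs a `uniform(a, b)` method, which Python's `random` module and `random.Random` instances both provide.

```python
import random


def pick_weighted(candidates, rng=random):
    """Select one proxy from (proxy, decision_score) pairs.

    Each proxy is chosen with probability score / total.
    """
    total = sum(score for _, score in candidates)
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    r = rng.uniform(0, total)
    upper = 0.0
    for proxy, score in candidates:
        upper += score          # upper bound of this proxy's weight range
        if r < upper:
            return proxy
    return candidates[-1][0]    # guard for r == total after float rounding


candidates = [("proxy-a", 3.0), ("proxy-b", 1.0)]
rng = random.Random(0)          # seeded only to make the example repeatable
draws = [pick_weighted(candidates, rng) for _ in range(10_000)]
share_a = draws.count("proxy-a") / len(draws)
```

With scores 3.0 and 1.0, `share_a` should come out close to 0.75. The standard library's `random.choices(population, weights=...)` performs the same kind of selection, so the explicit loop is mainly useful for showing the weight ranges.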
[0024] Optionally, the above method further includes: configuring different scheduling strategies for different crawler tasks, wherein the scheduling strategy includes at least one of the following: allowed proxy types, geographical location preference, cost tolerance, maximum number of consecutive uses per IP, and request interval range; wherein, when receiving a proxy call request for a crawler task, a set of candidate proxy IPs is selected from the proxy resource pool according to the scheduling strategy corresponding to the crawler task.
[0025] By adopting the above technical solutions and configuring different scheduling strategies for different crawling tasks, the proxy scheduling can be more tailored to the characteristics of each task. It can select a suitable set of candidate proxy IPs from the proxy resource pool according to the task requirements, improve the matching degree between proxy IPs and crawling tasks, and avoid problems such as low data collection success rate and easy triggering of anti-crawling mechanisms of target websites due to mismatch between proxy IPs and collection tasks.
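A per-task scheduling strategy can be represented as a small configuration object applied when the candidate set is built. In this sketch the allowed IP types and the cost tolerance are hard constraints, while the geographical preference is soft: it falls back to all eligible proxies when no preferred location is available. That fallback, the dictionary layout of a proxy record, and the `cost_per_request` field are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingPolicy:
    allowed_ip_types: frozenset
    preferred_countries: frozenset
    max_cost_per_request: float   # stands in for "cost tolerance"


def select_candidates(proxies, policy):
    """Filter proxy records (dicts) according to a task's policy."""
    eligible = [
        p for p in proxies
        if p["ip_type"] in policy.allowed_ip_types
        and p["cost_per_request"] <= policy.max_cost_per_request
    ]
    preferred = [p for p in eligible if p["country"] in policy.preferred_countries]
    return preferred or eligible   # soft geographic preference


proxies = [
    {"address": "p1", "ip_type": "residential", "country": "US", "cost_per_request": 0.004},
    {"address": "p2", "ip_type": "datacenter", "country": "US", "cost_per_request": 0.001},
    {"address": "p3", "ip_type": "residential", "country": "DE", "cost_per_request": 0.003},
    {"address": "p4", "ip_type": "residential", "country": "US", "cost_per_request": 0.020},
]
policy = SchedulingPolicy(frozenset({"residential"}), frozenset({"US"}), 0.01)
selected = [p["address"] for p in select_candidates(proxies, policy)]
```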
[0026] Optionally, the above method also includes: automatically performing isolation or offline operations on proxy IPs that remain below a preset score threshold within a preset time window.
[0027] By adopting the above technical solution, proxy IPs that remain below the preset scoring threshold within a preset time window can be automatically isolated or taken offline, avoiding the use of low-quality proxy IPs for data collection. This achieves intelligent identification and proactive isolation of low-quality proxy IPs, improves the utilization rate of proxy resources, increases the success rate of data collection, and reduces the risk of triggering the anti-scraping mechanism of the target website.
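The isolation rule above can be sketched as a check over timestamped score samples: a proxy is flagged only if every sample inside the time window is below the threshold. The `min_samples` guard, which avoids isolating a proxy on too little evidence, is an added assumption.

```python
def should_isolate(samples, threshold, window_seconds, now, min_samples=3):
    """Decide whether a proxy stayed below the threshold for the whole window.

    `samples` is a list of (timestamp, score) pairs.
    """
    recent = [score for ts, score in samples
              if now - window_seconds <= ts <= now]
    if len(recent) < min_samples:
        return False
    return all(score < threshold for score in recent)


history = [(0, 0.9), (100, 0.2), (200, 0.1), (300, 0.25)]
isolate_short_window = should_isolate(history, threshold=0.3,
                                      window_seconds=250, now=300)
isolate_long_window = should_isolate(history, threshold=0.3,
                                     window_seconds=300, now=300)
```

In the longer window the early sample of 0.9 is still visible, so the proxy is not yet considered persistently low.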
[0028] Optionally, the above method also includes: calculating the cost-effectiveness ratio of proxy IPs from different suppliers, with different IP attribute types, or different geographical locations, the cost-effectiveness ratio being determined based on the ratio of successful request counts to procurement costs; and generating proxy procurement optimization suggestions or automatically adjusting the scheduling priority of different proxy IPs based on the cost-effectiveness ratio.
[0029] By adopting the above technical solutions, the cost-effectiveness ratio of proxy IPs from different suppliers, with different IP attribute types, or in different geographical locations can be statistically analyzed. This provides a clear understanding of the cost-effectiveness of each proxy IP, and based on this ratio, optimization suggestions for proxy procurement can be generated, which helps to reduce procurement costs and improve resource utilization efficiency. Automatically adjusting the scheduling priority of different proxy IPs enables more reasonable allocation of proxy resources, improves data collection success rate and efficiency, and adapts to different collection needs.
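The cost-effectiveness ratio defined above (successful request count divided by procurement cost) can be aggregated per supplier, IP attribute type, or location with the same routine. The record layout and field names in this sketch are hypothetical.

```python
from collections import defaultdict


def cost_effectiveness(records, key="provider"):
    """Successful requests per unit of cost, grouped by `key`, best first."""
    successes = defaultdict(int)
    cost = defaultdict(float)
    for rec in records:
        successes[rec[key]] += rec["successful_requests"]
        cost[rec[key]] += rec["cost"]
    ratios = {g: successes[g] / cost[g] for g in successes if cost[g] > 0}
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))


records = [
    {"provider": "provider-a", "ip_type": "residential", "successful_requests": 900, "cost": 30.0},
    {"provider": "provider-a", "ip_type": "datacenter", "successful_requests": 100, "cost": 20.0},
    {"provider": "provider-b", "ip_type": "datacenter", "successful_requests": 500, "cost": 10.0},
]
by_provider = cost_effectiveness(records)
by_ip_type = cost_effectiveness(records, key="ip_type")
```

The resulting ranking could feed a procurement report or be folded into scheduling priority, as the paragraph above describes.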
[0030] In a second aspect of this application, an intelligent scheduling device for multi-protocol proxy resources is also provided, used to execute the intelligent scheduling method for multi-protocol proxy resources of any of the preceding claims, comprising: an access unit, used to access proxy IPs of multiple protocol types and construct a proxy resource pool, configuring and storing metadata for each proxy IP, the metadata including at least protocol type, IP attribute type, geographical location, and provider information; an evaluation unit, used to perform multi-dimensional active probing on the proxy IPs in the proxy resource pool at a preset frequency, obtain the real-time probing results of each proxy IP, combine the real-time probing results of each proxy IP with the corresponding metadata, determine the global comprehensive score of each proxy IP through a dynamic comprehensive scoring model, and simultaneously combine the historical access data of each proxy IP to different websites to generate a contextualized score for each proxy IP for each website, the contextualized score representing the access adaptability of the proxy IP to that website; a filtering unit, used to receive proxy call requests from crawler tasks, the proxy call requests carrying the task attribute information of the crawler tasks, and to filter out a set of candidate proxy IPs that meet the requirements from the proxy resource pool based on the task attribute information, the task attribute information including at least the target website, allowed IP attribute types, and geographical location preferences; and a scheduling unit, used to determine the decision score of each candidate proxy IP based on the global comprehensive score and the contextualized score for the target website, use the decision score of each candidate proxy IP as the scheduling weight, select the target proxy IP from the set of candidate proxy IPs using a weighted random algorithm, and assign the target proxy IP to the crawler task so that the crawler task can initiate a data collection request to the target website based on the target proxy IP.
[0031] In a third aspect of this application, an electronic device is also provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor executes the program to implement the method steps of any of the above claims.
[0032] In a fourth aspect of this application, a computer-readable storage medium is also provided, which stores instructions that, when executed, perform the method steps of any of the above claims.
[0033] In summary, one or more technical solutions provided in this application have at least the following technical effects or advantages:
1. Building a proxy resource pool and storing proxy IP metadata provides comprehensive information for subsequent scheduling; multi-dimensional proactive detection and dynamic comprehensive scoring models can accurately evaluate the overall performance of proxy IPs; generating contextualized scores reflects the adaptability of proxy IPs to different websites; filtering candidate proxy IP sets based on task attribute information can improve the matching degree between proxy IPs and crawling tasks; determining decision scores based on global comprehensive scores and contextualized scores, and using a weighted random algorithm to select target proxy IPs, can achieve intelligent scheduling, improve the data collection success rate, reduce the probability of triggering the target website's anti-crawling mechanism, improve proxy resource utilization, and meet the needs of large-scale, multi-scenario collection.
2. Historical access data such as the number of successful visits, the number of failures, and the number of times anti-scraping mechanisms were triggered can be obtained for each proxy IP on each website, and the contextual score of the proxy IP for each website is determined from this data. This score indicates the access adaptability for each website, which helps to more accurately evaluate the applicability of the proxy IP to different websites, thereby improving the matching degree between the proxy IP and the data collection task, and thus improving the data collection success rate.
3. By monitoring and collecting execution results in real time and dynamically adjusting scores, contextualized scoring can accurately reflect the latest access performance of the target proxy IP on the target website, avoiding scheduling based on outdated scores, greatly improving the accuracy of subsequent scheduling decisions, and thus making subsequent proxy IP scheduling more precise.
Attached Figure Description
[0034] Figure 1 is a flowchart of an intelligent scheduling method for multi-protocol proxy resources provided in an embodiment of this application; Figure 2 is a system architecture diagram of multi-protocol proxy pool management and IP rotation optimization provided in the embodiments of this application; Figure 3 is a structural block diagram of an intelligent scheduling device for multi-protocol proxy resources provided in an embodiment of this application; Figure 4 is a schematic diagram of the structure of an electronic device disclosed in an embodiment of this application.
[0035] Explanation of reference numerals in the attached drawings: 400 - Electronic device; 401 - Processor; 402 - Communication bus; 403 - User interface; 404 - Network interface; 405 - Memory.
Detailed Implementation
[0036] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments.
[0037] In the description of the embodiments of this application, the words "for example" or "for instance" are used to indicate examples, illustrations, or explanations. Any embodiment or design that is described as "for example" or "for instance" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design options. Rather, the use of the words "for example" or "for instance" is intended to present the relevant concepts in a specific manner.
[0038] In the description of the embodiments of this application, the term "multiple" means two or more. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof all mean "including but not limited to," unless otherwise specifically emphasized.
[0039] The embodiments of this application are described below with reference to Figures 1 to 4.
[0040] This application provides an intelligent scheduling method for multi-protocol proxy resources. Referring to Figure 1, which is a flowchart of an intelligent scheduling method for multi-protocol proxy resources provided in an embodiment of this application, the method includes the following steps:
Step S101: Connect to proxy IPs of multiple protocol types and build a proxy resource pool. Configure and store metadata for each proxy IP. The metadata shall include at least the protocol type, IP attribute type, geographical location and supplier information.
Step S102: Perform multi-dimensional active probing on the proxy IPs in the proxy resource pool at a preset frequency, obtain the real-time probing results of each proxy IP, combine the real-time probing results of each proxy IP with the corresponding metadata, determine the global comprehensive score of each proxy IP through a dynamic comprehensive scoring model, and at the same time combine the historical access data of each proxy IP to different websites to generate a contextual score for each proxy IP for each website. The contextual score is used to represent the access adaptability score for each website.
Step S103: Receive the proxy call request of the crawler task. The proxy call request carries the task attribute information of the crawler task. Based on the task attribute information, select a set of candidate proxy IPs that meet the requirements from the proxy resource pool. The task attribute information includes at least the target website, allowed IP attribute types and geographical location preferences.
Step S104: Based on the global comprehensive score of each candidate proxy IP and the contextual score for the target website, determine the decision score of each candidate proxy IP. Use the decision score of each candidate proxy IP as the scheduling weight, and use a weighted random algorithm to select the target proxy IP from the candidate proxy IP set. Assign the target proxy IP to the crawler task so that the crawler task can initiate a data collection request to the target website based on the target proxy IP.
[0041] By constructing a proxy resource pool and storing the metadata of proxy IPs through the above steps, comprehensive information can be provided for subsequent scheduling. Multi-dimensional active detection and dynamic comprehensive scoring models can accurately evaluate the overall performance of proxy IPs. Contextualized scoring can reflect the adaptability of proxy IPs to different websites. Filtering the candidate proxy IP set based on task attribute information can improve the matching degree between proxy IPs and crawling tasks. Determining the decision score based on global comprehensive scoring and contextualized scoring, and using a weighted random algorithm to select target proxy IPs, can achieve intelligent scheduling, improve the data collection success rate, reduce the probability of triggering the anti-crawling mechanism of the target website, improve the utilization rate of proxy resources, and meet the needs of large-scale, multi-scenario collection.
[0042] This embodiment constructs a complete technical system for intelligent scheduling of multi-protocol proxy IPs, consisting of "unified pooling - multi-dimensional dual-scoring evaluation - strategy filtering - weighted random scheduling." The core is to achieve precise matching and intelligent scheduling of proxy IPs and data collection tasks by combining a general global quality assessment of proxy IPs with a contextualized assessment of website-specific adaptability, and considering the personalized needs of crawler tasks. Specifically: First, a resource pool is built by accessing multi-protocol (e.g., HTTP, HTTPS, SOCKS5) proxy IPs and storing the basic metadata of each IP. The metadata includes at least the protocol type, IP attribute type (e.g., residential IP, data center IP, mobile IP), geographical location (country, city), and supplier information. This step achieves structured description and classification management of proxy resources, laying the foundation for proxy quality evaluation. Second, real-time performance data of proxy IPs is obtained through multi-dimensional proactive probing. Combined with the metadata, a global comprehensive score reflecting the general performance of the proxy is calculated. Optionally, the system can perform multi-dimensional proactive probing on the IPs in the proxy resource pool at a preset frequency (e.g., every 5 minutes, every 10 minutes, or other frequencies). The process involves several steps: First, detection is performed. The detection results are combined with the proxy IP's metadata and input into a dynamic comprehensive scoring model (which can be built based on machine learning or weighted algorithms) to generate a global comprehensive score for each proxy IP. This score reflects the overall quality level of the proxy IP at the current moment and is dynamically updated over time, enabling real-time perception of proxy quality. 
Simultaneously, based on the proxy IP's historical access data to different websites (such as request success rate, response speed, and whether it was blocked by anti-crawling measures), a contextual score reflecting the proxy's adaptability to specific websites is generated, forming a two-dimensional proxy quality evaluation system. For example, if a proxy IP has a high success rate accessing website A but is frequently blocked on website B, its contextual score for website A is high, and its contextual score for website B is low. Then, based on the task attribute information carried by the crawling task, such as the target website and IP attribute requirements, a set of candidate proxy IPs that meet the requirements is selected from the resource pool. Finally, the global comprehensive score of the candidate proxy IPs and the contextual score of the corresponding target website are merged to obtain a decision score. This decision score is used as a scheduling weight, and a weighted random algorithm is used to select target proxy IPs and allocate them to the crawling task, achieving intelligent scheduling of proxy IPs. Existing technologies often rely solely on simple polling or switching proxy IPs according to fixed rules, failing to assess the actual quality of the proxy or judging its merits based on a single dimension. This solution achieves multi-dimensional quality evaluation of proxy IPs through proactive multi-dimensional probing combined with metadata. It also constructs a dual scoring system—global and contextual—to evaluate both the general performance of the proxy and its specific adaptability for different websites, enabling comprehensive and accurate perception of proxy quality. Furthermore, existing technologies lack website-specific proxy evaluation systems, and uniform scheduling methods struggle to address the anti-scraping requirements of different websites. 
This solution's contextualized scoring, generated based on the proxy's historical access data for various websites, accurately reflects the proxy's adaptability to specific websites and adapts to the varying anti-scraping strengths of different target websites. This embodiment uses a dual-scoring system to accurately assess the suitability of proxies for specific websites, and combines task attributes to filter candidate IPs, ensuring that data collection tasks use more suitable proxy IPs. This reduces data collection failures caused by proxy mismatch and lowers the probability of triggering anti-scraping mechanisms on target websites. The global and contextualized dual-scoring system makes proxy quality assessment more comprehensive. Weighted random scheduling ensures high utilization of high-scoring, high-quality proxies while still allowing other candidate proxies to be selected, so that traffic is not concentrated on a few proxies while the rest sit idle, achieving intelligent allocation of proxy resources. Through precise matching of task attributes and proxy IPs, proxy IPs of different types and suitability can be matched with the corresponding data collection tasks, maximizing the value of proxy resources and adapting to the needs of large-scale network data collection across multiple target websites and multiple task types.
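As a minimal illustrative sketch of the last two stages described above (strategy filtering, then weighted random scheduling), the following Python code filters a pool by task attributes and then draws one proxy with probability proportional to its decision score. The `Proxy` fields, the function names, the example addresses, and the small positive weight floor are assumptions made for illustration; they are not specified by this application.

```python
import random
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str           # hypothetical identifier, e.g. "203.0.113.5:8080"
    protocol: str          # "HTTP", "HTTPS" or "SOCKS5"
    ip_type: str           # "residential", "datacenter" or "mobile"
    country: str
    decision_score: float  # assumed to be already merged from both scores


def filter_candidates(pool, required_ip_type=None, required_country=None):
    """Keep only proxies that satisfy the task attribute requirements."""
    return [
        p for p in pool
        if (required_ip_type is None or p.ip_type == required_ip_type)
        and (required_country is None or p.country == required_country)
    ]


def pick_proxy(candidates, rng=random):
    """Weighted random choice: probability is proportional to the score."""
    if not candidates:
        return None
    # Assumption: clamp to a small positive floor so that every candidate
    # keeps a nonzero chance and the weights never sum to zero.
    weights = [max(p.decision_score, 1e-6) for p in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


pool = [
    Proxy("203.0.113.5:8080", "HTTP", "residential", "CN", 0.9),
    Proxy("203.0.113.6:8080", "HTTP", "residential", "CN", 0.1),
    Proxy("198.51.100.7:1080", "SOCKS5", "datacenter", "US", 0.8),
]
candidates = filter_candidates(pool, required_ip_type="residential")
chosen = pick_proxy(candidates)  # usually the 0.9 proxy, sometimes the 0.1 one
```

Because the draw is random rather than a strict "take the highest score" rule, the low-scoring residential proxy is still selected occasionally, which matches the stated goal of keeping other candidates in rotation.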
[0043] In an optional embodiment, the multi-dimensional active detection includes at least one of the following: connectivity detection, used to detect whether the proxy IP can successfully establish a network connection; latency and speed detection, used to measure the response time and data transmission rate of the proxy IP; anonymity detection, used to detect whether the proxy IP will leak the real client IP; stability detection, used to statistically analyze the historical success rate of the proxy IP within a preset statistical period; and website reachability detection, used to detect the access availability of the proxy IP to the collected website.
[0044] In the above embodiments, connectivity detection in multi-dimensional active probing can detect whether the proxy IP can successfully establish a network connection; latency and speed detection can measure response time and data transmission rate; anonymity detection can detect whether the real client IP will be leaked; stability detection can statistically analyze historical success rates; and website reachability detection can detect the access availability of the collected website. This allows for a more comprehensive evaluation of the proxy IP quality to meet the differences in anti-scraping strength of different target websites, improve the data collection success rate, avoid triggering the anti-scraping mechanism of the target website, improve the utilization rate of proxy resources, and meet the needs of refined and efficient proxy resource scheduling.
[0045] This embodiment performs multi-dimensional proactive network probing on proxy IPs within the proxy resource pool at a preset frequency. Each probing dimension has a clear division of labor and focus, acquiring real-time performance data of the proxy IPs from five core aspects: basic network connectivity capabilities, transmission performance, anonymity and security attributes, reliability, and business scenario adaptability. The probing results from each dimension form a quantified real-time probing data set, which serves as the core data input for the dynamic comprehensive scoring model to calculate the global comprehensive score. This ensures that the generation of the global comprehensive score is supported by specific and comprehensive measured data, rather than abstract quality judgments, thus guaranteeing the scientificity and accuracy of the general quality assessment of proxy IPs. By actively probing instead of the passive discovery of existing technologies, the core performance status of proxy IPs, such as connectivity, speed, and anonymity, can be monitored in real time. This allows for timely detection of changes in proxy IP quality, preventing the assignment of invalid or low-quality proxy IPs to data collection tasks. It effectively compensates for the shortcomings of related proxy scheduling systems, such as a single dimension of quality perception, one-sided evaluation, and a lack of security and scenario adaptability considerations. Precise probing data makes the overall comprehensive score of proxy IPs more valuable, thereby making the generation of decision scores and the execution of weighted random scheduling more scientific, significantly improving the accuracy of proxy scheduling and reducing data collection failures caused by proxy quality issues.
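As one hedged example of how a probing dimension might be quantified, the stability dimension (historical success rate within a preset statistical period) could be computed from recent probe records as sketched below. The record format of `(timestamp, succeeded)` pairs, the one-hour default window, and the function name are assumptions for illustration only.

```python
def stability(records, now, window_s=3600):
    """Success rate of the probes that fall inside the statistical period.

    records: iterable of (timestamp_s, succeeded) pairs (assumed format).
    Returns None when no probe falls inside the window.
    """
    recent = [ok for ts, ok in records if now - window_s <= ts <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


probe_log = [(0, False), (3000, True), (3500, True), (3900, False)]
rate = stability(probe_log, now=4000)  # only the last three probes count
```

The probe at timestamp 0 lies outside the window and is ignored, so the result reflects two successes out of three recent probes.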
[0046] In an optional embodiment, a global comprehensive score for each proxy IP is determined by a dynamic comprehensive scoring model, including: assigning basic weights to various static attributes in the metadata and assigning performance weights to various dynamic performance indicators in the real-time detection results; and determining the global comprehensive score of the proxy IP based on the weighted calculation result of the basic weights and performance weights.
[0047] In the above embodiments, basic weights are assigned to static attributes in metadata, performance weights are assigned to dynamic performance indicators in real-time detection results, and a global comprehensive score for proxy IPs is determined based on the weighted calculation results. This allows for a comprehensive evaluation of proxy IPs by combining their static attributes and dynamic performance, thereby more accurately measuring the overall quality of proxy IPs and providing a more reliable basis for subsequent scheduling decisions. This helps improve the matching degree between proxy IPs and crawler tasks, thereby increasing the data collection success rate, avoiding triggering the anti-crawling mechanism of the target website, improving the utilization rate of proxy resources, and meeting the needs of refined and efficient proxy resource scheduling.
[0048] This embodiment provides a weighted fusion evaluation mechanism that integrates static attributes and dynamic performance to scientifically and quantitatively determine the global comprehensive score of each proxy IP. Differentiated weight allocation designs are implemented for the two core evaluation criteria of the proxy IP: static attributes of metadata and dynamic performance indicators from multi-dimensional proactive probing. A corresponding basic weight is assigned to each static attribute in the metadata, such as protocol type, IP attribute type, geographical location, and vendor information. A corresponding performance weight is assigned to each dynamic performance indicator in the real-time probing results, such as connectivity, latency and speed, and anonymity. Based on preset weighted calculation rules, the attribute values of each static attribute are combined with the basic weight, and the quantitative values of each dynamic performance indicator are combined with the performance weight for weighted calculation. The comprehensive calculation result yields the global comprehensive score of the proxy IP. This ensures that the generation of the global comprehensive score takes into account both the inherent static attribute characteristics of the proxy IP and its real-time dynamic performance, achieving an objective and quantitative evaluation of the general quality of the proxy IP. This design not only improves the objectivity and interpretability of the score but also reflects a synergistic consideration of the "long-term value" and "instantaneous state" of the proxy IP, providing core algorithmic support for intelligent scheduling decisions. Optionally, the global comprehensive score is calculated as:
$$\text{Global comprehensive score} = \alpha \sum_{i} \left(\text{static attribute}_i \times \text{basic weight}_i\right) + \beta \sum_{j} \left(\text{dynamic indicator}_j \times \text{performance weight}_j\right)$$
where $\text{static attribute}_i$ represents the i-th static attribute and $\text{basic weight}_i$ represents the basic weight corresponding to the i-th static attribute. There can be multiple static attributes. Similarly, $\text{dynamic indicator}_j$ represents the j-th dynamic indicator and $\text{performance weight}_j$ represents the performance weight corresponding to the j-th dynamic indicator. There can be multiple dynamic indicators. α and β are adjustable proportional coefficients used to balance the contributions of static and dynamic indicators.
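The weighted calculation above can be illustrated with a short Python sketch. The attribute names, the normalization of every attribute and indicator value to the range 0 to 1, the individual weights, and the coefficients α = 0.4 and β = 0.6 are all assumptions chosen for the example; this application does not fix any of them.

```python
def global_score(static_values, base_weights, dynamic_values, perf_weights,
                 alpha=0.4, beta=0.6):
    """alpha * sum(static_i * basic_weight_i) + beta * sum(dynamic_j * perf_weight_j)."""
    static_part = sum(static_values[k] * base_weights[k] for k in base_weights)
    dynamic_part = sum(dynamic_values[k] * perf_weights[k] for k in perf_weights)
    return alpha * static_part + beta * dynamic_part


# Hypothetical proxy: residential IP over HTTPS with good probe results.
static_values = {"ip_type": 1.0, "protocol": 0.8}
base_weights = {"ip_type": 0.6, "protocol": 0.4}
dynamic_values = {"connectivity": 1.0, "latency": 0.7, "anonymity": 0.9}
perf_weights = {"connectivity": 0.3, "latency": 0.3, "anonymity": 0.4}

score = global_score(static_values, base_weights, dynamic_values, perf_weights)
# static part = 0.92, dynamic part = 0.87, so the score is about 0.89
```

Raising β relative to α makes the score react more strongly to the latest probing results, while raising α favors the inherent attributes of the proxy IP.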
[0049] In an optional embodiment, generating a contextualized score for each proxy IP for each website includes: for each proxy IP, obtaining historical access data of the proxy IP for each website, wherein the historical access data includes at least the number of successful visits, the number of failed visits, and the number of times the anti-scraping mechanism was triggered by the proxy IP on each website; and determining the contextualized score of the proxy IP for each website based on the historical access data.
[0050] In the above embodiments, historical access data such as the number of successful visits, the number of failed visits, and the number of times anti-scraping mechanisms were triggered for each website by the proxy IP can be obtained. Based on this, a contextual score for the proxy IP for each website can be determined. This score can represent the access adaptability for each website, which helps to more accurately evaluate the applicability of the proxy IP to different websites, thereby improving the matching degree between the proxy IP and the data collection task, thus improving the data collection success rate, reducing the probability of triggering the anti-scraping mechanism of the target website, and improving the utilization rate of proxy resources.
[0051] For each proxy IP in the resource pool, the system establishes an independent behavior log for different websites. Whenever this proxy IP is used to access a website, regardless of success or failure, the system records the result of the access in detail, including at least three types of core data: the number of successes, the number of failures, and the number of times anti-scraping mechanisms were triggered. Historical access data may also include latency or other data. After accumulating a sufficient amount of historical access data, the system begins to process and model this data in a statistical sense. For each "proxy IP-website" combination, the system takes the above three types of data (success, failure, and anti-scraping trigger) as input and calculates a specific score through a preset mathematical model or algorithm. The system maintains contextualized scores for each proxy IP for different websites. First, specific historical access data for each website is collected for each proxy IP in the pool, with the number of successful visits, the number of failed visits, and the number of times anti-scraping mechanisms are triggered as the core data dimensions. Then, based on preset quantitative calculation rules, the above core historical access data is statistically analyzed and weighted to generate a specific contextual score for each proxy IP corresponding to different websites. This score only reflects the access performance and adaptability of the proxy IP on a specific website, complementing the global comprehensive score. This allows the proxy IP evaluation to include both general quality dimensions and website-specific adaptability dimensions, providing a contextual quality basis for subsequent scheduling for specific target websites. 
For example, the contextual score = (number of successful visits × weight_success - number of failed visits × weight_fail - number of times anti-scraping mechanisms are triggered × weight_block) / total number of visits. For example, weight_success = 0.6, weight_fail = 0.2, weight_block = 0.2. Of course, the weights can also be other combinations. This embodiment, through clearly defined historical access data dimensions and scoring generation logic, makes contextualized scoring a quantifiable and comparable website-specific evaluation indicator, filling the gap in existing technology for contextualized evaluation of proxy IPs. It can accurately reflect the actual access performance and anti-scraping evasion capabilities of proxy IPs on different websites. Different websites have different anti-scraping rules and intensities, and the access performance of each proxy IP on each website also varies. Contextualized scoring can accurately match this characteristic, allowing websites with high anti-scraping requirements to prioritize proxy IPs with high contextualized scores, fundamentally improving the targeting of data collection. Proxy IPs selected based on contextualized scoring are proxies with better access performance and stronger anti-scraping capabilities for the target website, effectively reducing data collection failures and anti-scraping triggers caused by proxy-website mismatch, and improving data collection efficiency. Contextualized scoring and global comprehensive scoring form a two-dimensional evaluation system of "general quality + scenario adaptation," allowing the generation of decision scores in the aforementioned embodiment to consider both the overall quality of the proxy IP and its specific performance for the target website, significantly improving the scientific nature and accuracy of scheduling decisions.
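The example formula above can be worked through in a few lines of Python. This is a sketch under two assumptions that the text does not state: the three counters are disjoint (so that the total number of visits is their sum), and a proxy with no recorded visits has no valid contextual score, represented here by `None`.

```python
def contextual_score(successes, failures, blocks,
                     w_success=0.6, w_fail=0.2, w_block=0.2):
    """(successes*w_success - failures*w_fail - blocks*w_block) / total visits."""
    total = successes + failures + blocks  # assumption: outcomes are disjoint
    if total == 0:
        return None  # assumption: no history means no valid score
    return (successes * w_success - failures * w_fail - blocks * w_block) / total


# Same proxy IP, two websites with very different histories.
score_site_a = contextual_score(successes=90, failures=8, blocks=2)    # about 0.52
score_site_b = contextual_score(successes=20, failures=50, blocks=30)  # about -0.04
```

With these example weights the score ranges from -0.2 (every visit failed or was blocked) to 0.6 (every visit succeeded), so a negative value indicates a proxy that does more harm than good on that website.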
[0052] In an optional embodiment, the decision score for each candidate proxy IP is determined based on the global comprehensive score and the contextual score for the target website. This includes: weighting and summing the global comprehensive score and the contextual score according to a preset weight allocation rule, and using the weighted sum as the decision score, wherein the weight coefficient of the contextual score is greater than the weight coefficient of the global comprehensive score.
[0053] In the above embodiments, the decision score is determined by weighted summation of the global comprehensive score and the contextual score, with a larger weight coefficient for the contextual score. This can more accurately measure the adaptability and performance of candidate proxy IPs to the target website, improve the matching degree between proxy IPs and crawler tasks, thereby increasing the data collection success rate, reducing the probability of triggering the target website's anti-crawling mechanism, improving the utilization rate of proxy resources, and meeting the needs of refined and efficient proxy resource scheduling.
[0054] After a set of candidate proxy IPs has been selected, the global comprehensive score and the contextual score for the target website are fused by weighting to determine which IP best suits the current crawling task. The global comprehensive score represents the general basic quality of the proxy IP (speed, stability, anonymity, etc.); the contextual score represents the proxy IP's specific adaptability to the current target website (likelihood of being blocked, historical success rate, anti-crawling circumvention effect). A weighted summation is used to calculate the decision score, and the contextual score has a greater weight than the global comprehensive score. This means that how well the proxy IP works on the current website matters more than how good the proxy IP is in general. The decision score prioritizes "website adaptability" and secondarily considers "general quality." This decision score is ultimately used as the scheduling weight in a weighted random algorithm to select the target proxy IP. For example, the decision score = k1 × contextual score + k2 × global comprehensive score, where the contextual score weight k1 is 0.7 (or 0.8, or other values), and the global comprehensive score weight k2 is 0.3 (or 0.2, or other values). In this embodiment, the weights k1 and k2 can be dynamically adjusted. This fusion model combines a proxy's "basic hardware" (global score) and "practical experience" (contextual score), avoiding biased decision-making. When deciding whether to assign a proxy to a task, the system prioritizes the proxy's historical performance on that task's target website over its performance in general tests. Even if a proxy has a high global score (e.g., extremely fast speed), if its historical performance on the current target website is poor (low contextual score), its final decision score will not be high. 
Conversely, even if a proxy's general qualities are average (e.g., slightly slower speed), as long as it has an outstanding track record on this website (high contextual score), its probability of being assigned will greatly increase. This embodiment prioritizes proxies with a successful history on the target website, fundamentally ensuring the success rate of data collection tasks, which is the core business objective of the entire scheduling system. Proxies that perform well on certain specific websites but have average general capabilities are no longer overlooked but become "experts" in handling specific tasks, maximizing the utilization of resource value. When the anti-scraping strategy of the target website changes, the change shows up first in the proxy IP's contextualized score (the number of failures and the number of anti-scraping triggers increase). Because the contextualized score has a higher decision weight, the system can quickly reduce the allocation of failing proxies and instead allocate tasks to proxies that still work, thus adapting to changes in anti-scraping strategies more quickly.
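The effect of weighting the contextual score more heavily can be seen in a small numeric sketch. It assumes that both scores are expressed on the same 0 to 1 scale (the text does not state the scale), and the two proxies and their score values are hypothetical.

```python
def decision_score(contextual, global_, k1=0.7, k2=0.3):
    """k1 * contextual score + k2 * global comprehensive score, with k1 > k2."""
    if k1 <= k2:
        raise ValueError("the contextual weight k1 must exceed the global weight k2")
    return k1 * contextual + k2 * global_


# Fast proxy with a poor history on the target website.
fast_but_blocked = decision_score(contextual=0.2, global_=0.95)  # about 0.425
# Average proxy with an outstanding history on the target website.
slow_but_proven = decision_score(contextual=0.9, global_=0.6)    # about 0.81
```

Under these example values the proven proxy receives roughly twice the scheduling weight of the faster one, which is the behavior the paragraph above describes.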
[0055] As an optional implementation, a decision score for each candidate proxy IP is determined based on its global comprehensive score and contextualized score for the target website. This includes: if a candidate proxy IP has a valid contextualized score for the target website, the contextualized score is used as the decision score; if a candidate proxy IP does not have a valid contextualized score for the target website, the global comprehensive score is used as the decision score. In this embodiment, when determining the decision score of a candidate proxy IP, the system flexibly handles the situation based on whether a valid contextualized score for the target website exists. If such a score exists, the contextualized score is used as the decision score, which more accurately reflects the suitability of the proxy IP for the target website. If not, the global comprehensive score is used as the decision score, ensuring that a reasonable decision score can be determined even in the absence of a contextualized score. This allows for more scientific selection of target proxy IPs, improves the data collection success rate, avoids triggering the target website's anti-scraping mechanism, enhances proxy resource utilization, and meets the needs of refined and efficient proxy resource scheduling.
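This fallback variant could be sketched as follows. The text does not define what makes a contextualized score "valid"; the sketch assumes that a score is valid when an entry exists for the (proxy, website) pair, and the dictionary layout and identifiers are hypothetical.

```python
def fallback_decision_score(proxy_id, site, contextual_scores, global_scores):
    """Use the contextual score when one exists, else the global score."""
    ctx = contextual_scores.get((proxy_id, site))  # None when no valid score
    return ctx if ctx is not None else global_scores[proxy_id]


global_scores = {"p1": 0.8, "p2": 0.6}
contextual_scores = {("p1", "site-a.example"): 0.3}

fallback_decision_score("p1", "site-a.example", contextual_scores, global_scores)  # 0.3
fallback_decision_score("p2", "site-a.example", contextual_scores, global_scores)  # 0.6
```

A newly added proxy therefore starts with its global comprehensive score on every website and switches to website-specific scoring once history has accumulated.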
[0056] In an optional embodiment, after assigning the target proxy IP to the crawler task, the method further includes: monitoring the execution results of the crawler task using the target proxy IP to collect data; and dynamically adjusting the contextualized score of the target proxy IP for the target website based on the execution results.
[0057] In the above embodiments, monitoring the execution results of the crawler task using the target proxy IP for data collection allows for timely understanding of the target proxy IP's usage. Dynamically adjusting the contextualized score of the target proxy IP for the target website based on the execution results makes the score more relevant to actual usage. By monitoring the execution results in real time and dynamically adjusting the score, the contextualized score accurately reflects the latest access performance of the target proxy IP on the target website, avoiding scheduling based on outdated scores and significantly improving the accuracy of subsequent scheduling decisions, thereby making subsequent proxy IP scheduling more precise.
[0058] This embodiment refines the dynamic optimization mechanism for contextualized scoring after target proxy IP scheduling. This is a core technical step in achieving closed-loop management of "scoring-scheduling-feedback-optimization" and is crucial for ensuring the real-time performance and accuracy of contextualized scoring. The principle is as follows: After assigning the target proxy IP to the crawler task and completing proxy scheduling, the scheduling process does not end. Instead, a new step of real-time monitoring and dynamic scoring adjustment is added. First, the execution results of the crawler task using the target proxy IP for data collection are continuously monitored (including core scenarios such as successful collection, failed collection, and triggering anti-crawling mechanisms). Then, based on preset scoring adjustment rules, the contextualized score of the target proxy IP for the current target website is dynamically updated according to different execution results. For example, if collection is successful, the contextualized score is appropriately increased; if collection fails or anti-crawling is triggered, the contextualized score is correspondingly decreased. This ensures that the contextualized score always remains consistent with the latest access performance of the target proxy IP on the target website, breaking the limitation of "one-time score unchanged for life." This ensures that subsequent proxy scheduling for the target website can be based on the latest and most accurate contextualized evaluation data, achieving dynamic iterative optimization of the scheduling strategy. Even in related technologies where simple proxy evaluation exists, it is often a one-time assessment. Once the score is generated, it is not updated, failing to reflect changes in the quality of the proxy IP during use (such as being blocked by the target website or experiencing performance degradation). 
This results in subsequent scheduling being based on outdated scores, leading to scheduling errors. Furthermore, these technologies lack feedback and adjustment mechanisms; the proxy IP scores and scheduling strategies remain fixed, making optimization impossible based on actual conditions during data collection. This makes it difficult to adapt to dynamic changes in the target website's anti-scraping rules (e.g., a previously highly compatible proxy IP may become ineffective after the website upgrades its anti-scraping strategy). This embodiment further improves the data collection success rate and reduces the probability of anti-scraping triggers. When a target proxy IP experiences collection failure or triggers anti-scraping, its score is promptly reduced, decreasing the probability of it being subsequently assigned to that target website. This avoids the risks of collection failure and anti-scraping caused by repeatedly using ineffective proxies, while simultaneously allowing more compatible proxy IPs to receive more scheduling opportunities. By dynamically adjusting scores, the scheduling frequency of incompatible proxy IPs can be identified and reduced in a timely manner. This prevents high-quality proxy IPs from being quickly blocked due to frequent use in unsuitable scenarios, while allowing recoverable proxy IPs (such as those that temporarily trigger anti-scraping measures and then return to normal) to regain scheduling opportunities through score recovery, thereby improving the overall utilization rate of proxy resources. When the target website upgrades its anti-scraping policy, the collection execution results of the proxy IPs will change synchronously. The dynamic adjustment of scores can quickly adapt to this change, allowing the scheduling strategy to achieve self-adaptation without manual intervention, reducing manual maintenance costs. 
Optionally, the global comprehensive score of the target proxy IP can also be dynamically adjusted based on the execution results. For example, when collection fails or anti-scraping is triggered, the global comprehensive score is reduced accordingly, with the reduction being slightly smaller than that of contextualized scoring.
[0059] In an optional embodiment, the contextual score of the target proxy IP for the target website is dynamically adjusted based on the execution result, including: lowering the contextual score of the target proxy IP for the target website in response to an execution failure or triggering an anti-crawling mechanism; and raising the contextual score of the target proxy IP for the target website in response to a successful execution result.
[0060] In the above embodiments, the contextualized score of the target proxy IP for the target website is dynamically adjusted based on the execution result. The contextualized score is lowered when the execution result is failure or triggers an anti-scraping mechanism, and raised when the execution result is success. This makes the target proxy IP score more closely reflect the actual situation, allowing the scheduling of proxy IPs to more accurately adapt to the differences in anti-scraping strength across different websites. Through this explicit rule of "success adds points, failure / triggering anti-scraping deducts points," the contextualized score can reflect the latest availability status of the proxy IP on the current website in real time, objectively and quantitatively, and directly affect the decision score and selection probability in subsequent scheduling, forming a fully automatic adaptive closed loop of "scheduling → usage → feedback → score correction → rescheduling."
[0061] Depending on whether the crawler task succeeds, fails, or triggers an anti-crawling mechanism, the score of the target proxy IP on the target website is adjusted in the corresponding direction and by a differentiated amount. This embodiment not only reflects the closed-loop control concept of the technical solution, but also constructs a result-oriented credit evaluation mechanism for proxy resources, which is a key link in realizing the "self-learning, self-adaptation, and self-optimization" of the intelligent scheduling system. If the data collection is successful, it indicates that the proxy IP has good adaptability to the current website and strong anti-crawling avoidance capabilities, so the contextual score is increased; if the data collection fails or anti-crawling is triggered, it indicates that the usability of the proxy IP on the current website has decreased and the risk has increased, so the contextual score is decreased. Usually, the deduction is greater than the addition: for example, 0.05 points (or other values) are added for each success, 0.1 points (or other values) are deducted for each failure, and 1 point (or other values) is deducted for each anti-crawling trigger. The score adjustment is specific to each website. For example, if the proxy IP fails on "JD.com", only its contextual score on the "JD.com" website is reduced; its scores on other platforms such as "Taobao" or "Zhihu" remain unchanged. This embodiment enhances the system's adaptability to anti-scraping strategies, enabling it to promptly detect website risk control upgrades and adjust scheduling strategies accordingly. The basis for decisions changes from "static tags" to "dynamic behavioral data," making scheduling decisions more objective, scientific, and fair. Without manual intervention, the system can automatically and accurately adjust the contextualized score of each proxy IP for each website based on the actual execution results of each task.
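A minimal sketch of this per-website feedback rule, using the example increments given above (+0.05, -0.1, -1), might look as follows. The class name, the outcome labels, the starting score of 5.0, and the proxy identifier are assumptions for illustration.

```python
# Example increments taken from the paragraph above; other values are possible.
DELTAS = {"success": 0.05, "failure": -0.1, "blocked": -1.0}


class ContextualScores:
    """Keeps one score per (proxy, website) pair."""

    def __init__(self, initial=5.0):  # assumption: arbitrary starting score
        self._initial = initial
        self._scores = {}

    def get(self, proxy_id, site):
        return self._scores.get((proxy_id, site), self._initial)

    def record(self, proxy_id, site, outcome):
        """Apply the increment for one execution result and return the new score."""
        new_score = self.get(proxy_id, site) + DELTAS[outcome]
        self._scores[(proxy_id, site)] = new_score
        return new_score


scores = ContextualScores()
scores.record("p1", "jd.com", "failure")  # only the jd.com score drops
scores.get("p1", "taobao.com")            # still the initial 5.0
```

Because the key is the (proxy, website) pair, a failure on one website leaves the proxy's standing on every other website untouched.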
[0062] In an optional embodiment, after lowering the contextual score of the target proxy IP for the target website, the method further includes: marking the target proxy IP as in a cooling-off state for the target website, in which the target proxy IP is not assigned to crawling tasks for the target website.
[0063] In the above embodiments, after lowering the contextual score, the target proxy IP is marked as a cooling-off state and a forced isolation strategy is implemented, so that it will no longer participate in the task scheduling of the website for a certain period of time. This reflects the system's ability to respond quickly to risky behaviors and is a key protection mechanism for achieving a highly available and highly covert crawler system.
[0064] When a target proxy IP fails to collect data or triggers anti-crawling mechanisms during the data collection process, the system not only lowers the contextual score of that target proxy IP for the target website, but also marks it as being in a "cooling-off" state for that website. During this cooling-off period, even if the target proxy IP remains in the candidate proxy IP set, the system will not assign it to crawling tasks targeting the same website. This prevents the target proxy IP from being repeatedly invoked and continuously triggering the target website's risk control policies within a short period, providing a "cooling-off recovery time" for the target proxy IP. In subsequent task allocation, the scheduler checks the cooling-off state during the candidate proxy selection phase. If the target proxy is in a cooling-off period for that website, it is skipped and does not participate in the scheduling decision, even if its base score is high. Normal scheduling resumes once the target website lifts access restrictions on the target proxy IP or the risk decreases. The cooling-off state can last for a preset duration, such as 30 minutes (or 1 hour, or other durations). This mechanism achieves dual risk control through score downgrading and temporary disabling, forming a safer and more stable proxy scheduling closed loop. This embodiment avoids the repeated scheduling of proxies known to have failed, reduces invalid requests, and lowers the failure rate; the cooling-off mechanism also lets IPs rest until they are outside the website's monitoring window, which increases the probability that an IP recovers and extends its usable lifespan.
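The cooling-off check in the candidate selection phase could be sketched as below, using the 30-minute example duration. The class and method names are hypothetical, and the injectable `clock` parameter is an assumption added so that the behavior can be demonstrated without waiting.

```python
import time


class CooldownRegistry:
    """Tracks which (proxy, website) pairs are currently cooling off."""

    def __init__(self, duration_s=30 * 60, clock=time.monotonic):
        self._duration = duration_s
        self._clock = clock
        self._until = {}  # (proxy_id, site) -> time at which cooling off ends

    def mark(self, proxy_id, site):
        self._until[(proxy_id, site)] = self._clock() + self._duration

    def is_cooling(self, proxy_id, site):
        until = self._until.get((proxy_id, site))
        if until is None:
            return False
        if self._clock() >= until:
            del self._until[(proxy_id, site)]  # period over: resume scheduling
            return False
        return True

    def eligible(self, proxy_ids, site):
        """Skip proxies that are cooling off for this website."""
        return [p for p in proxy_ids if not self.is_cooling(p, site)]


now = [0.0]  # fake clock for the demonstration
registry = CooldownRegistry(clock=lambda: now[0])
registry.mark("p1", "jd.com")
registry.eligible(["p1", "p2"], "jd.com")      # ["p2"]
registry.eligible(["p1", "p2"], "taobao.com")  # ["p1", "p2"]
now[0] = 1800.0
registry.eligible(["p1", "p2"], "jd.com")      # ["p1", "p2"] again
```

The state is keyed by website, so a proxy that is cooling off for one website remains available for tasks that target other websites.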
[0065] In an optional embodiment, the method further includes: in response to the crawler task failing more than a first preset threshold number of times on the target website or triggering an anti-crawling mechanism more than a second preset threshold number of times, adjusting at least one of the following parameters of the crawler task: request frequency, request interval range, and number of consecutive accesses per IP.
[0066] In the above embodiments, when the number of consecutive failures of the crawler task on the target website exceeds the first preset threshold or the number of times the anti-crawling mechanism is triggered exceeds the second preset threshold, adjusting the request frequency can avoid triggering the target website's anti-crawling mechanism due to excessively frequent requests; adjusting the request interval range can make the requests more dispersed, reducing the possibility of the target website detecting a pattern; adjusting the number of consecutive accesses from a single IP can reduce the risk caused by continuous access from the same IP, improve the data collection success rate, further improve the utilization rate of proxy resources, and better adapt to the anti-crawling strength of different target websites.
[0067] This embodiment, based on the monitoring and scoring adjustment of collection execution results, further extends to a dynamic adaptive adjustment mechanism for crawler task parameters. It belongs to the collaborative protection technology of "proxy scheduling + task optimization". The core is to avoid the anti-crawling risks of target websites and improve the stability of collection by taking a two-pronged approach from the two dimensions of "proxy adaptation" and "task execution". Specifically, when monitoring the execution results of data collection using target proxy IPs in web crawling tasks, the system not only focuses on the performance of individual proxy IPs and adjusts their contextualized scores, but also performs global statistics on the execution status of the web crawling task itself. When it detects that the number of consecutive failed collection attempts on the same target website exceeds the first preset threshold, or the number of times the anti-crawling mechanism is triggered exceeds the second preset threshold, it determines that the execution parameters of the current web crawling task (request frequency, request interval, number of consecutive accesses per IP, etc.) have been identified as abnormal access by the target website's anti-crawling mechanism. At this time, the system automatically adjusts at least one core execution parameter of the web crawling task, such as by reducing the request frequency, widening the request interval range, and reducing the number of consecutive accesses per IP, to simulate the access behavior of normal users, reduce the anti-crawling vigilance of the target website, and avoid the web crawling task being continuously restricted due to abnormal access behavior. At the same time, in conjunction with the adjustment of the proxy IP scores, it achieves dual protection of "proxy adaptation optimization + task behavior optimization" to ensure that the data collection task can continue to progress stably. 
This mechanism removes the limitation of "only optimizing the proxy without adjusting the task" by tightly coupling proxy scheduling with task execution. This gives the entire data collection system stronger anti-scraping resistance and adaptive capability, further improving the "scoring-scheduling-feedback" closed loop of the aforementioned embodiment. Through this embodiment, adaptive adjustment of crawler task parameters is achieved, reducing the probability of triggering anti-scraping mechanisms on target websites. Specifically, by monitoring task execution results in real time, core parameters such as request frequency and request interval are automatically adjusted to simulate normal user access behavior. This reduces abnormal access characteristics and lowers, at the source, the risk of being identified by anti-scraping mechanisms, making it harder for target websites to restrict data collection tasks. When a task experiences consecutive failures or frequently triggers anti-scraping mechanisms, its parameters are quickly adjusted to adapt to the website's anti-scraping rules, preventing long-term task stagnation, ensuring continuous data collection, and improving overall data collection efficiency.
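The threshold-triggered parameter adjustment can be illustrated with the sketch below. The names `TaskParams` and `adjust_params`, the default thresholds, and the specific scaling factors (halving the frequency, widening the interval, halving the per-IP limit) are assumptions chosen for illustration; the text only states that at least one of these parameters is adjusted.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TaskParams:
    requests_per_minute: float
    interval_range_s: Tuple[float, float]  # (min, max) random delay between requests
    max_consecutive_per_ip: int


def adjust_params(params: TaskParams, consecutive_failures: int, anti_crawl_hits: int,
                  failure_threshold: int = 3, anti_crawl_threshold: int = 2) -> TaskParams:
    """Return more conservative parameters once either threshold is exceeded."""
    if (consecutive_failures <= failure_threshold
            and anti_crawl_hits <= anti_crawl_threshold):
        return params  # neither threshold exceeded: leave the task untouched
    low, high = params.interval_range_s
    return replace(
        params,
        requests_per_minute=params.requests_per_minute * 0.5,               # assumed factor
        interval_range_s=(low * 1.5, high * 2.0),                           # assumed factors
        max_consecutive_per_ip=max(1, params.max_consecutive_per_ip // 2),  # assumed factor
    )
```

Returning a new immutable object rather than mutating the task in place keeps the previous parameters available, which would make it straightforward to restore them once the website stops rejecting requests.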
[0068] In an optional embodiment, a weighted random algorithm is used to select a target proxy IP from the candidate proxy IP set, including: calculating the sum of the decision scores of all proxy IPs in the candidate proxy IP set as the total weight; generating a random number between 0 and the total weight; determining the weight range corresponding to each proxy IP by accumulating the decision scores of the candidate proxy IPs in sequence; and determining the corresponding proxy IP as the target proxy IP based on the weight range in which the random number falls.
[0069] In the above embodiments, by calculating the total weight, generating a random number, determining the weight ranges, and selecting the target proxy IP based on the range in which the random number falls, intelligent scheduling of multi-protocol proxy resources can be achieved. The probability of each proxy being selected is proportional to its decision score. A proxy with a high score gets more opportunities, but a proxy with a low score also retains the possibility of being selected, which balances quality against fairness. Due to the introduction of randomness, even if the set of high-scoring proxies is relatively fixed, the specific proxy selected varies from request to request, avoiding the solidification of access patterns caused by always using the same batch of optimal proxies, and reducing the risk of being blocked by the target website through pattern recognition.
[0070] First, the decision scores of all proxy IPs in the candidate proxy IP set are summed to obtain the total weight. A random number between 0 and the total weight is generated. The decision scores are then accumulated sequentially (for example, from highest to lowest, or in any fixed order), assigning each proxy IP a contiguous weight range whose width equals its decision score. Finally, the proxy IP whose weight range contains the random number is selected as the target proxy IP for this task. The essence of this mechanism is that the higher the decision score, the wider the weight range and the greater the probability of being randomly selected. It is neither a fixed selection of the highest score nor completely random, but rather an intelligent scheduling method where "high-quality IPs have a higher probability, while ordinary IPs still have a chance." Each proxy IP is assigned a decision score; the higher the score, the better the IP is in the current task context. This score directly serves as its selection "weight," forming a positive correlation between weight and quality. Assuming the decision scores for IP A, IP B, and IP C are 30, 20, and 10 respectively, the total weight is 60. The weight ranges are: IP A [0, 30), IP B [30, 50), and IP C [50, 60). The probability of IP A being selected is 30 / 60 = 50%, of IP B is 20 / 60 ≈ 33%, and of IP C is 10 / 60 ≈ 17%. A higher score increases the probability of selection, but the highest-scoring proxy IP is not the only possible choice.
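The selection procedure can be sketched in a few lines, using the three-proxy example with decision scores 30, 20, and 10. The function names are hypothetical, the ranges are treated as half-open so that every point belongs to exactly one proxy, and non-negative decision scores are assumed.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, float]  # (proxy identifier, decision score)


def cumulative_bounds(candidates: Sequence[Candidate]) -> List[float]:
    """Upper bound of each proxy's weight range; the last bound is the total weight."""
    bounds = list(accumulate(score for _, score in candidates))
    if not bounds or bounds[-1] <= 0:
        raise ValueError("need at least one candidate with a positive decision score")
    return bounds


def select_at(candidates: Sequence[Candidate], point: float) -> str:
    """Return the proxy whose half-open weight range contains `point`."""
    index = bisect_right(cumulative_bounds(candidates), point)
    # Guard against a point that lands exactly on the total weight after rounding.
    return candidates[min(index, len(candidates) - 1)][0]


def pick_weighted(candidates: Sequence[Candidate],
                  rng: Optional[random.Random] = None) -> str:
    """Draw a random point in [0, total weight) and map it to a proxy."""
    rng = rng or random.Random()
    total = cumulative_bounds(candidates)[-1]
    return select_at(candidates, rng.random() * total)


candidates = [("IP A", 30.0), ("IP B", 20.0), ("IP C", 10.0)]
chosen = pick_weighted(candidates)  # "IP A" about half of the time
```

A proxy with a score of zero receives an empty range and is never chosen, so proxies that must stay selectable need a small positive score.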
[0071] In an optional embodiment, the method further includes: configuring different scheduling strategies for different crawler tasks, wherein the scheduling strategy includes at least one of the following: allowed proxy types, geographic location preference, cost tolerance, maximum number of consecutive uses per IP, and request interval range; wherein, when receiving a proxy call request for a crawler task, a set of candidate proxy IPs is selected from the proxy resource pool according to the scheduling strategy corresponding to the crawler task.
[0072] In the above embodiments, configuring different scheduling strategies for different crawling tasks can make proxy scheduling more suitable for the characteristics of each task. It can select a suitable set of candidate proxy IPs from the proxy resource pool according to the task requirements, improve the matching degree between proxy IPs and crawling tasks, and avoid problems such as low data collection success rate and easy triggering of anti-crawling mechanisms of target websites due to mismatch between proxy IPs and collection tasks.
[0073] This embodiment introduces task-specific scheduling strategies to achieve "precise matching between proxy resource scheduling and crawler task requirements," removing the limitation of "a single scheduling rule adapting to all tasks." This is a key technical aspect of refined and personalized proxy scheduling. The principle is as follows: the system pre-configures dedicated scheduling strategies for crawler tasks of different types and with different needs. These strategies cover the core constraints and preferences related to proxy IPs during crawler task execution, including key parameters such as allowed proxy types, geographic location preference, cost tolerance, maximum number of consecutive uses per IP, and request interval range. When the system receives a proxy call request from a crawler task, it no longer filters candidate proxy IPs solely based on the basic requirements in the task attribute information. Instead, it first looks up the dedicated scheduling strategy corresponding to the crawler task, and then, applying the constraints and preferences in that strategy, filters out from the proxy resource pool the proxy IPs that satisfy both the "basic task attributes" and the "dedicated scheduling strategy," forming the candidate proxy IP set. This ensures that the selected candidate IPs not only meet the basic requirements of the task but also fit the task's personalized needs, execution scenario, and cost control objectives, achieving refined scheduling with "one set of scheduling rules for each type of task." This mechanism tightly couples "task requirements" with "scheduling strategies," upgrading proxy scheduling from "general adaptation" to "personalized customization." It considers both the core requirements of the task and differentiated scenarios (such as high-priority tasks vs. low-cost tasks, or domestic websites vs. cross-border websites). 
This refines and optimizes the filtering logic in the aforementioned embodiments, making the entire scheduling system more flexible and adaptable. Through strategy elements such as cost tolerance and allowed proxy types, high-cost, high-quality proxies can be prioritized for high-priority, high-demand tasks, while low-cost proxies can be assigned to ordinary tasks, avoiding the waste of high-quality resources and unnecessary cost, and achieving a balance between resource utilization and cost control. New scheduling strategies can be flexibly configured for new crawler task types and changing requirements without modifying the core scheduling logic, adapting to diverse and complex data collection scenarios and reducing system upgrade and maintenance costs.
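Strategy-based candidate filtering might look like the sketch below. The `Proxy` and `Strategy` types and their field names are hypothetical, and treating the geographic preference as a hard filter is a simplifying assumption; a real system could instead express it as a weight bonus.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Proxy:
    address: str
    proxy_type: str          # e.g. "residential", "datacenter", "mobile"
    country: str
    cost_per_request: float


@dataclass(frozen=True)
class Strategy:
    allowed_types: FrozenSet[str]
    preferred_countries: FrozenSet[str]  # empty set means no geographic constraint
    max_cost_per_request: float          # stands in for "cost tolerance"


def select_candidates(pool: Iterable[Proxy], strategy: Strategy) -> List[Proxy]:
    """Keep only the proxies that satisfy every constraint of the task's strategy."""
    return [
        p for p in pool
        if p.proxy_type in strategy.allowed_types
        and (not strategy.preferred_countries or p.country in strategy.preferred_countries)
        and p.cost_per_request <= strategy.max_cost_per_request
    ]
```

Because each strategy is plain data, supporting a new task type means adding a new `Strategy` value rather than changing the filtering function, which matches the point made above about leaving the core scheduling logic unmodified.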
[0074] In an optional embodiment, the method further includes: automatically performing isolation or offline operations on proxy IPs that remain below a preset score threshold within a preset time window.
[0075] In the above embodiments, proxy IPs that remain below a preset scoring threshold within a preset time window can be automatically isolated or taken offline, avoiding the use of low-quality proxy IPs for data collection. This achieves intelligent identification and proactive isolation of low-quality proxy IPs, improves proxy resource utilization, increases data collection success rate, and reduces the risk of triggering the target website's anti-scraping mechanism.
[0076] This embodiment specifies the dynamic maintenance and low-quality resource cleanup mechanism for the proxy resource pool, which is a core technical aspect for ensuring the overall quality of the proxy resource pool and avoiding occupation by invalid resources. The system presets two core parameters: a preset time window (used to assess the long-term performance of proxy IPs and avoid misjudgment due to short-term fluctuations) and a preset score threshold (used to determine whether a proxy IP has normal usability; the score usually refers to the global comprehensive score, but can also be combined with the contextualized score for a comprehensive judgment). The system continuously monitors the scores of all proxy IPs in the proxy resource pool and, for each proxy IP, tracks how its score changes within the preset time window. If the score of a proxy IP remains below the preset score threshold throughout the time window, it is considered a low-quality resource (e.g., long-term instability, poor connectivity, extremely low adaptability, and no obvious recovery trend). For such low-quality proxy IPs, the system automatically performs isolation or offline operations. Isolation temporarily removes a proxy IP from the proxy resource pool, preventing it from participating in candidate IP screening for any crawler task, while retaining its data for subsequent monitoring. Taking a proxy IP offline deletes it from the resource pool completely, freeing up system storage and management resources. This ensures that all proxy IPs in the pool have basic usability, preventing low-quality resources from affecting scheduling efficiency and collection results. The core of this mechanism is "long-term monitoring, threshold determination, and automatic cleanup," which differs from the extreme approach of "cleaning up as soon as the score drops too low once." 
By using the preset time window to filter out short-term performance fluctuations of proxy IPs (such as score drops caused by temporary network fluctuations), it ensures the accuracy of cleanup while avoiding the accidental deletion of recoverable proxy IPs, achieving dynamic optimization and a virtuous cycle for the proxy resource pool. For example, the preset time window is 1 hour (or another duration), and the preset score threshold is 60 points (or another value): if the score remains below 60 points for 1 hour, an isolation operation is automatically performed; if the score does not rise back above 60 points within 2 hours after isolation, an offline operation is automatically performed.
[0077] As an optional implementation, the global comprehensive score of each proxy IP in the proxy resource pool is monitored in real time. If the global comprehensive score of a proxy IP remains below a first preset threshold for a preset duration, the proxy IP is isolated from the proxy resource pool. If the global comprehensive score of the isolated proxy IP recovers to above a second preset threshold, the isolated proxy IP is reinstated into the proxy resource pool. The second preset threshold is greater than or equal to the first preset threshold. Through this embodiment, real-time monitoring of the global comprehensive score of proxy IPs can promptly identify and isolate proxy IPs with persistently low scores, preventing low-quality proxy IPs from affecting data collection tasks and improving data collection success rate and proxy resource utilization. When the isolated proxy IP's score recovers and it is reinstated into the resource pool, proxy resources can be fully utilized, achieving dynamic management and optimization of proxy resources.
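The isolate-and-reinstate logic with two thresholds can be sketched as a small state machine. The class name `QuarantineMonitor`, the sample-driven `observe` interface, and the default values are assumptions for illustration; reinstatement uses "greater than or equal to" the second threshold so that a proxy cannot get stuck when both thresholds are equal.

```python
from typing import Dict, Set


class QuarantineMonitor:
    """Isolates proxies whose score stays low; reinstates them on recovery."""

    def __init__(self, isolate_below: float = 60.0, reinstate_at: float = 70.0,
                 hold_s: float = 3600.0) -> None:
        if reinstate_at < isolate_below:
            raise ValueError("second threshold must be >= first threshold")
        self.isolate_below = isolate_below  # first preset threshold
        self.reinstate_at = reinstate_at    # second preset threshold
        self.hold_s = hold_s                # preset duration in seconds
        self._low_since: Dict[str, float] = {}
        self.isolated: Set[str] = set()

    def observe(self, proxy: str, score: float, now: float) -> bool:
        """Record one score sample; return True if the proxy is isolated afterwards."""
        if proxy in self.isolated:
            if score >= self.reinstate_at:
                self.isolated.discard(proxy)       # recovered: back into the pool
                self._low_since.pop(proxy, None)
        elif score < self.isolate_below:
            start = self._low_since.setdefault(proxy, now)
            if now - start >= self.hold_s:
                self.isolated.add(proxy)           # persistently low: isolate
        else:
            self._low_since.pop(proxy, None)       # healthy sample resets the timer
        return proxy in self.isolated
```

Setting the second threshold above the first gives hysteresis: a proxy hovering around a single cutoff would otherwise flip in and out of the pool on every sample.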
[0078] In an optional embodiment, the method further includes: calculating the cost-effectiveness ratio of proxy IPs from different suppliers, with different IP attribute types, or different geographical locations, the cost-effectiveness ratio being determined based on the ratio of successful request counts to procurement costs; and generating proxy procurement optimization suggestions or automatically adjusting the scheduling priority of different proxy IPs based on the cost-effectiveness ratio.
[0079] In the above embodiments, by statistically analyzing the cost-effectiveness ratios of proxy IPs from different suppliers, with different IP attribute types, or in different geographical locations, the cost-effectiveness of each proxy IP can be clearly understood. Based on this ratio, proxy procurement optimization suggestions can be generated, which helps to reduce procurement costs and improve resource utilization efficiency. Automatically adjusting the scheduling priority of different proxy IPs enables more reasonable allocation of proxy resources, improves data collection success rate and efficiency, and adapts to different collection needs.
[0080] This embodiment further introduces a cost-benefit analysis-based mechanism for evaluating the economic efficiency of proxy resources and for intelligent decision-making, expanding the management of proxy IPs from a "technical performance-oriented" approach to a "technology-economic dual-dimensional collaborative optimization" level. Its core lies in quantifying the key indicator of "cost-benefit ratio" to achieve value assessment, procurement optimization, and dynamic adjustment of scheduling strategies for proxy resources, thereby constructing a proxy resource governance system with business intelligence decision-making capabilities. The system needs to record the proxy source, success status, and supplier / type / region of each request; and aggregate the total number of successful requests and total cost within a statistical period by dimension. The dynamic calculation of the cost-benefit ratio supports sliding calculations over time windows (such as the last 7 days, 30 days, or other time windows). Based on this cost-benefit ratio, the system implements two types of automated actions: for example, generating proxy procurement optimization suggestions to guide subsequent procurement of which suppliers, types, and regions of proxies are more cost-effective; or directly and automatically adjusting scheduling priorities, prioritizing the scheduling of proxy IPs with higher cost-benefit ratios while meeting task requirements.
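The aggregation by dimension can be illustrated as follows. The record layout (`RequestRecord` with a single `group` field standing for supplier, type, or region) and the function names are hypothetical; the ratio itself follows the definition given above, successful request count divided by cost.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class RequestRecord(NamedTuple):
    group: str     # the dimension being compared, e.g. supplier, type, or region
    success: bool
    cost: float    # cost attributed to this request


def cost_effectiveness(records: Iterable[RequestRecord]) -> Dict[str, float]:
    """Successful requests per unit of cost for each group in the period."""
    successes: Dict[str, int] = defaultdict(int)
    costs: Dict[str, float] = defaultdict(float)
    for record in records:
        successes[record.group] += int(record.success)
        costs[record.group] += record.cost
    # Groups with zero recorded cost are skipped to avoid dividing by zero.
    return {g: successes[g] / costs[g] for g in costs if costs[g] > 0}


def rank_groups(ratios: Dict[str, float]) -> List[str]:
    """Groups ordered from most to least cost-effective."""
    return sorted(ratios, key=ratios.get, reverse=True)
```

A sliding window (such as the last 7 or 30 days) can be obtained by passing in only the records whose timestamps fall inside the window; the timestamp field is omitted here for brevity.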
[0081] The present application will be described below with reference to specific embodiments. This application provides a method and system for multi-protocol proxy pool management and IP rotation optimization. It involves efficient and intelligent unified management, quality assessment, and dynamic scheduling of massive, heterogeneous proxy IP resources (HTTP / HTTPS / SOCKS4 / SOCKS5, from different vendors, data centers, and residential networks) in scenarios such as large-scale web crawling, data mining, and market monitoring. This maximizes data collection success rate and efficiency while circumventing anti-crawling mechanisms of target websites. The core of this application's embodiments is to construct an intelligent proxy brain that integrates "unified management - real-time assessment - policy scheduling," allowing each proxy IP resource to maximize its value in the most suitable scenario.
[0082] I. Overall System Architecture. As shown in Figure 2, this system consists of three main modules: a multi-protocol proxy resource pool, an intelligent assessment and health engine, and a policy-driven intelligent scheduling center. Each of these three modules is described in detail below.
[0083] 1. Multi-protocol proxy resource pool: Heterogeneous Proxy Access Layer: Provides a unified interface to support access to HTTP(S), SOCKS4, and SOCKS5 proxies from different vendors, self-built, and free sources. Each proxy IP's metadata is recorded upon entry into the database: protocol, type (data center / residential / mobile), geographic location (country / city), vendor, procurement cost, etc.
[0084] Unified configuration and protocol adaptation: Internally, proxy calls for different protocols are encapsulated into a unified interface, which is transparent to upper-layer crawler tasks.
[0085] 2. Intelligent Assessment and Health Engine: Multi-dimensional proactive detection: Regularly (e.g., every minute) perform health checks on all online proxies. Check items include: connectivity (accessibility), latency and speed, anonymity (whether the real client IP is leaked), stability (historical success rate), and target website reachability (for specific data collection targets).
[0086] Dynamic comprehensive scoring model: Based on the detection results and metadata, a real-time dynamic "comprehensive quality score" (0-100) is calculated for each proxy, which corresponds to the aforementioned global comprehensive score. The scoring weights are configurable; for example, latency has a high weight for speed-sensitive tasks, while anonymity and stability have high weights for tasks with high stealth requirements.
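A configurable weighted score on a 0-100 scale could be computed as in the sketch below. The text does not give a formula, so the weighted average used here, the assumption that each metric is already normalized to the range 0 to 1 (higher is better), and the metric names are all illustrative.

```python
from typing import Dict


def composite_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of normalized metrics, scaled to 0-100."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        weight * min(1.0, max(0.0, metrics.get(name, 0.0)))  # clamp; missing metric = 0
        for name, weight in weights.items()
    )
    return 100.0 * weighted / total_weight


probe = {"connectivity": 1.0, "latency": 0.5, "anonymity": 1.0, "stability": 0.5}
speed_weights = {"connectivity": 1.0, "latency": 2.0, "anonymity": 0.5, "stability": 0.5}
stealth_weights = {"connectivity": 1.0, "latency": 0.5, "anonymity": 2.0, "stability": 0.5}

speed_score = composite_score(probe, speed_weights)      # latency dominates
stealth_score = composite_score(probe, stealth_weights)  # anonymity dominates
```

With these sample values the same proxy scores lower under the speed-sensitive weights than under the stealth-oriented weights, because its latency metric is mediocre while its anonymity metric is perfect.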
[0087] 3. Strategy-driven intelligent scheduling center: Task strategy configuration: Allows configuring different scheduling strategies for different crawler tasks. Strategy elements include: target website, request frequency, allowed proxy types, geographic location preferences, cost cap, etc.
[0088] Weighted Random Rotation Algorithm: The core scheduling algorithm. When a crawler requests a proxy, the scheduling center does not simply select the one with the highest score, but instead performs a weighted random selection based on the comprehensive quality score. High-scoring proxies have a high probability of being selected, but low-scoring proxies still have a small probability of being selected, which makes it possible to verify whether their state has recovered. This avoids both overworking the best proxies ("whipping the fast ox") and leaving recovered proxies idle. Optionally, the final score of the proxy IP (corresponding to the aforementioned decision score) can be determined based on the comprehensive quality score and the contextualized score, and the target proxy IP can then be selected using a weighted random algorithm.
[0089] Adaptive Feedback Regulator: Monitors the performance of each proxy in real-world tasks (success, failure, blocking). If a high-scoring proxy fails repeatedly on a specific target website, the system dynamically reduces its weight for that website and may mark it as "suspected of being blocked on that website".
[0090] II. Methodology and Principles: Take, for example, an e-commerce price monitoring crawler that needs to frequently crawl a large e-commerce website: Phase 1: Resource Preparation and Initial Assessment (1) The system connects to 1,000 proxy IPs, including 500 residential proxies (expensive, high quality) and 500 data center proxies (cheap, average quality). (2) The health engine initiates a full detection and scores each IP. Residential proxies averaged a score of 85, while data center proxies averaged a score of 60.
[0091] Phase 2: Strategy Configuration and Intelligent Scheduling (1) Create the "E-commerce Website X Price Monitoring" task and configure the strategy: Preferred proxy type: residential proxy (because it is less likely to be identified as a web crawler); Target country: United States (as the website primarily serves users in the United States); Request interval: random 2-5 seconds (simulating a real person); Maximum number of consecutive uses per IP address: 10 (to prevent excessive use); (2) The crawler initiates a crawling request and calls the scheduling center; (3) The scheduling center selects a proxy IP from the "US residential proxy" sub-pool according to the strategy and the weighted random algorithm; assume that IP-A (score 92) is selected; (4) The crawler successfully retrieves the page using IP-A; the system records "success +1" for IP-A for this task.
[0092] Phase 3: Exception Handling and Dynamic Optimization (1) After IP-A has been used successfully 8 times in a row, the 9th request returns a CAPTCHA page (anti-crawling has been triggered); (2) The crawler reports the result ("CAPTCHA encountered") back to the scheduling center; (3) The adaptive feedback regulator acts immediately: it significantly reduces the "temporary weight" of IP-A for e-commerce website X, making it unlikely to be selected for this task in the near future; it slightly lowers the "overall score" of IP-A; and it immediately switches this task to another backup proxy, IP-B; (4) At the same time, the system may automatically trigger a CAPTCHA handling process (such as calling a CAPTCHA solving service) to try to recover this request.
[0093] Phase 4: Pool Maintenance and Cost Optimization. For proxies with scores consistently below the threshold (e.g., 20 points), the system automatically isolates them or takes them offline to avoid wasting request resources. The system provides cost reports showing the cost-effectiveness ratio (number of successful requests / cost) of proxies from different suppliers, to guide procurement decisions. For low-priority tasks that are not sensitive to proxy type (such as fetching public news), the system can automatically allocate more low-cost data center proxies, saving costs.
[0094] The main concepts of the embodiments of this application include the following: A unified, abstract, and automated quality monitoring system for multi-protocol heterogeneous proxies is developed. A "proxy description protocol" is designed to standardize the description and storage of the static attributes (protocol, type, geographic location, vendor, cost) and dynamic ports of various proxies. Based on this, a "distributed proactive health monitoring network" is constructed. This network performs multi-level health checks on all proxies in the pool from multiple probe points worldwide, at configurable frequency and depth: from basic TCP port connectivity, to the protocol correctness of simulated HTTP / HTTPS / SOCKS requests, and then to the end-to-end availability, latency, and anonymity measured against specific target websites. The health monitoring results update the proxy status in real time, providing a millisecond-level, up-to-date basis for intelligent scheduling decisions.
[0095] A dynamic comprehensive scoring model integrating multi-dimensional indicators and task context feedback is proposed: an "adaptive weighted scoring algorithm" is introduced. This model not only considers general performance indicators of proxies (latency, speed, anonymity, historical success rate) but also introduces a "task context feedback factor." For example, a proxy might perform well when crawling "website A" but be frequently blocked on "website B." The model maintains "contextualized scores" for this proxy for different websites or website categories. When the scheduler selects a proxy for a specific task, it prioritizes the proxy's contextualized score within that task context, rather than a single global comprehensive score. As an optional implementation, the contextualized score and the global comprehensive score can be weighted and summed, with the contextualized score having the larger weight. Alternatively, if a contextualized score exists for the proxy IP, it is used directly as the decision score. This makes scoring and scheduling highly targeted, significantly improving resource utilization efficiency.
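The two optional ways of deriving a decision score can be sketched as follows. The function name, the default weight of 0.7, and the fallback to the global score when no contextualized score exists are assumptions; the text only requires that the contextualized score carry the larger weight.

```python
from typing import Optional


def decision_score(global_score: float, contextual_score: Optional[float],
                   context_weight: float = 0.7) -> float:
    """Blend the contextualized and global scores into one decision score.

    Setting context_weight to 1.0 reproduces the alternative in which the
    contextualized score alone is used whenever it exists.
    """
    if not 0.5 < context_weight <= 1.0:
        raise ValueError("the contextualized score must carry the larger weight")
    if contextual_score is None:
        return global_score  # assumed fallback: no history for this website yet
    return context_weight * contextual_score + (1.0 - context_weight) * global_score


blended = decision_score(global_score=80.0, contextual_score=50.0, context_weight=0.75)
fresh = decision_score(global_score=80.0, contextual_score=None)
```

Here a proxy with a good global score of 80 but a poor record of 50 on the target website ends up with a blended decision score well below 80, so it is selected less often for that website only.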
[0096] A strategy-driven, weighted random intelligent scheduling algorithm and anti-crawling adaptive adjustment mechanism were developed: a "strategy-driven two-layer scheduling framework" was invented. The first layer is the "strategy matching layer," which filters a subset of qualified proxies (corresponding to the aforementioned candidate proxy IP set) based on the requirements of the crawling task (target website, frequency, cost sensitivity). The second layer is the "weighted random decision layer," which performs weighted random selection based on the "contextualized score" of each proxy within this subset. High-scoring proxies have a higher probability of being selected, but low-scoring proxies still have a chance, which ensures overall efficiency and provides low-scoring proxies with a "test recovery" opportunity. At the same time, the scheduler closely monitors the result of each request, and through a "negative feedback regulator," quickly demotes or temporarily isolates proxies that fail consecutively or trigger anti-crawling measures, and may dynamically adjust the request frequency and User-Agent rotation strategy of the entire task, forming a closed-loop adaptive capability to resist anti-crawling.
[0097] The embodiments of this application have at least the following technical effects: High success rate and efficiency in data collection: by intelligently selecting high-quality proxies and quickly switching between them, the task success rate and the overall speed are greatly improved; Effective avoidance of blocking: strategy-based weighted random rotation and behavior simulation make the crawler behave more like a real person, making it more difficult to identify and block; Maximized resource utilization: good proxies take on key tasks, while poor proxies still get the opportunity to be re-tested or to take on peripheral tasks, making the best use of resources; Unified and transparent management: one-stop management of all proxy resources, providing a clear view of quality and cost analysis; Flexible strategy configuration: different proxy usage strategies can be customized according to the characteristics of different target websites.
[0098] The present application will now be described in conjunction with specific embodiments. The following scenario illustrates the application of the method of the present invention in detail through a scenario in which a market intelligence company, “XX,” monitors prices and inventory on five major global e-commerce platforms (Amazon.com, eBay, Taobao, JD.com, and MercadoLibre).
[0099] Background: "XX" needs to crawl millions of product pages from the above five websites daily. Each website has different anti-crawling strategies: Amazon is extremely strict, Taobao / JD.com has verification for Chinese IP addresses, and MercadoLibre (Latin America) requires a local IP address.
[0100] Step 1: Build a global proxy resource pool. The company purchases and connects four types of proxies: US residential proxies: 1,000, high cost, used for Amazon and eBay; China data center proxies: 2,000, moderate cost, used for Taobao and JD.com; Latin American (Brazil, Mexico) residential proxies: 500, high cost, used for MercadoLibre; Globally usable data center proxies: 5,000, low cost, used as backups and for crawling secondary websites. All proxies are entered into the system and configured for automatic liveness detection. The detection targets are specifically set to the homepage of each e-commerce website or a specific API to verify their "availability".
[0101] Step 2: Configure refined data collection strategies Create a separate data collection task for each e-commerce platform and configure the strategy accordingly: Task A (AmazonUS): Strategy: A US residential proxy must be used. No more than 5 consecutive requests from a single IP address; 30-minute cooldown. Request intervals are randomized from 3 to 8 seconds; full browser fingerprinting (User-Agent, Accept-Language, etc.) is enabled. Cost tolerance: High (allows the use of high-priced agents).
[0102] Task B (Taobao / JD.com): Strategy: Prioritize using proxies located in Chinese data centers. Require the proxy IP's ASN to belong to a major Chinese ISP. Enable specific request headers for Chinese websites; Cost tolerance: Medium.
[0103] Task C (MercadoLibre): Strategy: A Brazilian or Mexican residential proxy must be used; Cost tolerance: High.
[0104] Step 3: Operation of Intelligent Scheduling in Complex Scenarios. Scenario 1: Scraping popular products from Amazon.com (with extremely strong anti-scraping measures) (1) The crawler worker requests a proxy for Task A; (2) The scheduling center matches the strategy and selects from the "US residential proxy" sub-pool; (3) In the sub-pool, IP-X has a "contextualized score (for Amazon)" of 95 because its recent success rate in crawling Amazon is as high as 98%, while IP-Y has a score of 80; (4) The scheduling center performs weighted random selection, and IP-X has a high probability of being selected; (5) The crawler successfully crawls 4 pages using IP-X. The 5th request returns "503 Service Temporarily Unavailable" (the proxy may be temporarily restricted); (6) The result is fed back immediately and the negative feedback regulator is activated: the contextualized score of IP-X for Amazon drops sharply from 95 to 50; IP-X is marked as being in the "Amazon cooling-off state" for 30 minutes; an alarm is sent to the scheduler: "Amazon task is suspected of triggering risk control; it is recommended to reduce the overall request frequency by 10%"; (7) The scheduler immediately switches the current request to the backup IP-Z (score 85) and adjusts the overall request interval of Task A to a random 4-10 seconds.
[0105] Scenario 2: Scraping Taobao products (with preferences for IP ownership and geographic location). (1) The dispatch center selects from the "China data center proxies" sub-pool; (2) the system finds that although the overall scores of IP-P and IP-Q are similar, the ASN of IP-P belongs to "China Telecom", while that of IP-Q belongs to "a Chinese node of a certain US cloud service provider". Historical data shows that IPs with China Telecom ASNs have a higher success rate in accessing Taobao; (3) therefore, in the weighted random selection, IP-P is given an additional weight and is more likely to be selected, which yields a higher crawling success rate.
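The effect of the additional weight in Scenario 2 can be shown with a small calculation. The text does not state how large the bonus is or how it is combined with the score, so the multiplicative factor of 1.5 and the equal base scores of 80 below are purely illustrative assumptions, as is the helper name `selection_probabilities`.

```python
from typing import Dict


def selection_probabilities(scores: Dict[str, float],
                            bonus: Dict[str, float]) -> Dict[str, float]:
    """Turn scores (times an optional per-IP bonus factor) into probabilities."""
    weights = {ip: score * bonus.get(ip, 1.0) for ip, score in scores.items()}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one proxy needs a positive weight")
    return {ip: w / total for ip, w in weights.items()}


# Hypothetical numbers: similar base scores, bonus for the China Telecom ASN.
scores = {"IP-P": 80.0, "IP-Q": 80.0}
asn_bonus = {"IP-P": 1.5}

without_bonus = selection_probabilities(scores, {})
with_bonus = selection_probabilities(scores, asn_bonus)
```

Under these assumed numbers the two proxies are equally likely without the bonus, while with it IP-P is drawn about 60% of the time; IP-Q still receives traffic, which is the point of weighted random selection as opposed to always taking the top-ranked proxy.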
[0106] Scenario 3: Low-cost scraping of product description (non-price) information. (1) The company has a secondary task that only requires scraping product description text from Amazon (the update frequency is low and the IP requirements are relatively relaxed); (2) a new policy is configured for this task: general-purpose data center proxies from any region are allowed, with a low cost tolerance; (3) the dispatch center allocates a large number of cheap, average-rated general-purpose proxies to this task, thereby saving expensive residential proxy resources for the core price monitoring tasks.
[0107] Step 4: System self-learning and cost optimization. Performance reports: the system generates and displays a weekly performance report.
[0108] US residential proxy group: Amazon task success rate 92%, cost per successful request $0.001; China data center proxy group: Taobao task success rate 88%, cost per successful request $0.0002; Latin American residential proxy group: MercadoLibre task success rate 85%, cost per successful request $0.002.
[0109] Cost optimization decisions: The data shows that the cost-effectiveness of the Latin American proxies is relatively low. Operations personnel can: lower the procurement priority of Latin American proxies in the system; try configuring Task C to allow the use of data center proxies located in Latin America when residential proxies are insufficient (risk assessment required); automatically reschedule more Latin American tasks to Mexican proxies, which are slightly more cost-effective, instead of Brazilian proxies.
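The comparison behind this decision is simple arithmetic on the weekly report figures. The sketch below uses only the success rates and per-success costs quoted in the report; the dictionary layout, the function names, and the volume of one million successful requests are assumptions made for illustration. Because each group serves a different target website, the ranking is only a rough indicator and does not by itself show which group is poorly priced.

```python
from typing import Dict, List, Tuple

# (success rate, cost per successful request in USD), as quoted in the report
REPORT: Dict[str, Tuple[float, float]] = {
    "US residential": (0.92, 0.001),
    "China data center": (0.88, 0.0002),
    "Latin America residential": (0.85, 0.002),
}


def rank_by_cost(report: Dict[str, Tuple[float, float]]) -> List[str]:
    """Proxy groups ordered from most to least expensive per successful request."""
    return sorted(report, key=lambda group: report[group][1], reverse=True)


def spend_for(group: str, successes: int) -> float:
    """Spend in USD needed to obtain `successes` successful requests."""
    return round(REPORT[group][1] * successes, 2)


def attempts_per_success(group: str) -> float:
    """Expected number of requests per successful one (1 / success rate)."""
    return 1.0 / REPORT[group][0]


ranking = rank_by_cost(REPORT)
spend = {group: spend_for(group, 1_000_000) for group in REPORT}
```

With these figures the Latin American residential group ranks as the most expensive per successful request, at twice the US residential cost, while also having the lowest success rate of the three.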
[0110] Proxy supplier evaluation: The system compares the actual performance (success rate / cost) on Amazon tasks of the different suppliers that provide "US residential proxy" services, and automatically recommends the best-performing supplier resources for procurement.
[0111] The embodiments of this application achieve the following effects. Data capture success rate: compared with simple polling through a proxy list, the daily average data capture success rate on major e-commerce platforms increased from approximately 65% to over 90%. Resource utilization: the utilization rate (effective successful requests / total lifecycle) of high-quality residential proxies increased from less than 50% to over 85%, and low-cost proxies also achieved 70% utilization by undertaking edge tasks. Anti-scraping resistance: the frequency of task interruptions caused by an IP being completely blocked by the target website was reduced from several times a week to less than once a month. Operating costs: for the same data collection goals, the overall procurement cost of proxy resources can be reduced by 30-40% through intelligent scheduling and resource optimization. Operational efficiency: bringing proxies online and offline, health status monitoring, and policy adjustments are all automated, reducing the time operations and maintenance personnel spend on intervention by 80%.
[0112] This application also provides an intelligent scheduling device for multi-protocol proxy resources, used to execute the intelligent scheduling method for multi-protocol proxy resources in any of the foregoing embodiments. As shown in Figure 3, the device includes: an access unit, used to access proxy IPs of multiple protocol types, build a proxy resource pool, and configure and store metadata for each proxy IP, wherein the metadata includes at least the protocol type, IP attribute type, geographical location, and supplier information; an evaluation unit, used to perform multi-dimensional active probing on the proxy IPs in the proxy resource pool at a preset frequency to obtain a real-time probing result for each proxy IP, to combine the real-time probing result of each proxy IP with the corresponding metadata and determine a global comprehensive score for each proxy IP through a dynamic comprehensive scoring model, and to combine the historical access data of each proxy IP for different websites to generate a contextual score of each proxy IP for each website, wherein the contextual score represents the access adaptability of the proxy IP to that website; a filtering unit, used to receive a proxy call request for a crawler task, the proxy call request carrying task attribute information of the crawler task, and to filter out from the proxy resource pool, based on the task attribute information, a set of candidate proxy IPs that meet the requirements, wherein the task attribute information includes at least the target website, allowed IP attribute types, and geographical location preferences; and a scheduling unit, used to determine a decision score for each candidate proxy IP based on its global comprehensive score and its contextual score for the target website, to use the decision score of each candidate proxy IP as a scheduling weight, to select a target proxy IP from the candidate proxy IP set using a weighted random algorithm, and to assign the target proxy IP to the crawler task so that the crawler task can initiate a data collection request to the target website based on the target proxy IP.
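The cooperation of the filtering unit and the scheduling unit can be sketched in a few functions. This is an illustrative sketch only, with several assumptions that do not come from this application: the class and function names are hypothetical; the weights 0.7 and 0.3 are placeholders (the application only requires the contextual weight to exceed the global weight); a proxy with no history for a website falls back to its global score; the geographical preference is treated as a hard filter; and the IP addresses are taken from the address ranges reserved for documentation.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

# Assumed weights; only W_CONTEXT > W_GLOBAL is taken from the text.
W_CONTEXT, W_GLOBAL = 0.7, 0.3


@dataclass
class Proxy:
    ip: str
    protocol: str                 # metadata: protocol type
    ip_type: str                  # metadata: IP attribute type
    country: str                  # metadata: geographical location
    supplier: str                 # metadata: supplier information
    global_score: float
    context_scores: Dict[str, float] = field(default_factory=dict)


@dataclass
class Task:
    target_site: str
    allowed_ip_types: FrozenSet[str]
    countries: FrozenSet[str]


def filter_candidates(pool: List[Proxy], task: Task) -> List[Proxy]:
    """Filtering unit: keep proxies that match the task attributes."""
    return [p for p in pool
            if p.ip_type in task.allowed_ip_types
            and p.country in task.countries]


def decision_score(proxy: Proxy, site: str) -> float:
    """Weighted sum of the contextual score and the global score."""
    context = proxy.context_scores.get(site, proxy.global_score)
    return W_CONTEXT * context + W_GLOBAL * proxy.global_score


def schedule(pool: List[Proxy], task: Task,
             rng: random.Random) -> Optional[Proxy]:
    """Scheduling unit: weighted random pick using decision scores as weights."""
    candidates = filter_candidates(pool, task)
    if not candidates:
        return None
    weights = [decision_score(p, task.target_site) for p in candidates]
    if sum(weights) <= 0:
        return rng.choice(candidates)   # all scores zero: fall back to uniform
    return rng.choices(candidates, weights=weights, k=1)[0]


pool = [
    Proxy("203.0.113.10", "http", "residential", "US", "vendor-1", 80.0,
          {"amazon": 95.0}),
    Proxy("203.0.113.11", "socks5", "residential", "US", "vendor-2", 75.0,
          {"amazon": 80.0}),
    Proxy("198.51.100.20", "http", "datacenter", "CN", "vendor-3", 90.0),
]
task = Task("amazon", frozenset({"residential"}), frozenset({"US"}))
chosen = schedule(pool, task, random.Random(7))
```

Using the decision score as a sampling weight rather than sorting and taking the maximum spreads requests across all qualifying proxies, so a single top-rated IP is not exhausted.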
[0113] It should be noted that the devices or systems provided in the above embodiments are described using the above division of functional modules only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the device and method embodiments provided above belong to the same concept; the device or system embodiments correspond to the aforementioned method embodiments, and their other technical features are described in those embodiments and are not repeated here.
[0114] This application also provides a computer-readable storage medium storing instructions that, when executed, perform the steps of any of the methods described above.
[0115] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0116] This application also discloses an electronic device. Figure 4 is a schematic diagram of the structure of an electronic device disclosed in an embodiment of this application. As shown in Figure 4, the electronic device 400 may include: at least one processor 401, at least one network interface 404, a user interface 403, a memory 405, and at least one communication bus 402.
[0117] The communication bus 402 is used to enable communication between these components.
[0118] The user interface 403 may include a display screen and a camera. Optionally, the user interface 403 may also include a standard wired interface and a wireless interface.
[0119] The network interface 404 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).
[0120] The processor 401 may include one or more processing cores. The processor 401 connects to the various parts of the electronic device (such as a server) using various interfaces and lines, and performs the various functions of the server and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 405 and by calling data stored in the memory 405. Optionally, the processor 401 may be implemented in at least one of the following hardware forms: Digital Signal Processor (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 401 may integrate one or a combination of several of the following: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content to be displayed; and the modem handles wireless communication. It is understood that the modem may also be implemented as a separate chip without being integrated into the processor 401.
[0121] The memory 405 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 405 may include a non-transitory computer-readable storage medium. The memory 405 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 405 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the above-described method embodiments, etc.; the data storage area may store the data involved in the above-described method embodiments, etc. Optionally, the memory 405 may also be at least one storage device located remotely from the aforementioned processor 401. As shown in Figure 4, the memory 405, which serves as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an application program for the intelligent scheduling method for multi-protocol proxy resources.
[0122] In the electronic device 400 shown in Figure 4, the user interface 403 is mainly used to provide an input interface for the user and to acquire user input data, while the processor 401 can be used to call the application program for the intelligent scheduling method for multi-protocol proxy resources stored in the memory 405. When the application program is executed by one or more processors 401, the electronic device 400 performs one or more of the methods described in the above embodiments. It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.
[0123] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0124] In the various embodiments provided in this application, it should be understood that the disclosed apparatus or system can be implemented in other ways. The apparatus or system embodiments described above are merely illustrative; for instance, the division of units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be realized through some service interface, and the indirect coupling or communication connection between apparatuses or units may be electrical or take other forms.
[0125] The above description is merely an exemplary embodiment of this disclosure and should not be construed as limiting the scope of this disclosure. Any equivalent changes and modifications made in accordance with the teachings of this disclosure shall still fall within the scope of this disclosure. Other embodiments of this disclosure will be readily apparent to those skilled in the art upon consideration of the disclosure herein.
[0126] This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art that are not described in this disclosure.
Claims
1. An intelligent scheduling method for multi-protocol proxy resources, characterized in that the method comprises: accessing proxy IPs of multiple protocol types to build a proxy resource pool, and configuring and storing metadata for each proxy IP, wherein the metadata includes at least the protocol type, IP attribute type, geographical location, and supplier information; performing multi-dimensional active probing on the proxy IPs in the proxy resource pool at a preset frequency to obtain a real-time probing result for each proxy IP; combining the real-time probing result of each proxy IP with the corresponding metadata to determine a global comprehensive score for each proxy IP through a dynamic comprehensive scoring model, and combining the historical access data of each proxy IP for different websites to generate a contextual score of each proxy IP for each website, wherein the contextual score represents the access adaptability of the proxy IP to that website; receiving a proxy call request for a crawler task, the proxy call request carrying task attribute information of the crawler task, and filtering out from the proxy resource pool, based on the task attribute information, a set of candidate proxy IPs that meet the requirements, wherein the task attribute information includes at least the target website, allowed IP attribute types, and geographical location preferences; determining a decision score for each candidate proxy IP based on the global comprehensive score of the candidate proxy IP and its contextual score for the target website; and using the decision score of each candidate proxy IP as a scheduling weight, selecting a target proxy IP from the candidate proxy IP set using a weighted random algorithm, and assigning the target proxy IP to the crawler task so that the crawler task can initiate a data collection request to the target website based on the target proxy IP.
2. The method according to claim 1, characterized in that generating the contextual score of each proxy IP for each website comprises: for each proxy IP, obtaining the historical access data of the proxy IP for each website, wherein the historical access data includes at least the number of successful accesses, the number of failed accesses, and the number of times the proxy IP triggered the anti-scraping mechanism on each website; and determining, based on the historical access data, the contextual score of the proxy IP for each website.
3. The method according to claim 1, characterized in that determining the decision score for each candidate proxy IP based on the global comprehensive score of the candidate proxy IP and its contextual score for the target website comprises: computing a weighted sum of the global comprehensive score and the contextual score according to preset weight allocation rules, and using the weighted sum as the decision score, wherein the weight coefficient of the contextual score is greater than the weight coefficient of the global comprehensive score.
4. The method according to claim 1, characterized in that, after assigning the target proxy IP to the crawler task, the method further comprises: monitoring the execution result of the data collection performed by the crawler task using the target proxy IP; and dynamically adjusting, based on the execution result, the contextual score of the target proxy IP for the target website.
5. The method according to claim 4, characterized in that dynamically adjusting, based on the execution result, the contextual score of the target proxy IP for the target website comprises: in response to the execution result being a failure or the triggering of an anti-scraping mechanism, lowering the contextual score of the target proxy IP for the target website; and in response to the execution result being a success, raising the contextual score of the target proxy IP for the target website.
6. The method according to claim 5, characterized in that, after lowering the contextual score of the target proxy IP for the target website, the method further comprises: marking the target proxy IP as being in a "cooling-off" state for the target website, wherein, in the "cooling-off" state, the target proxy IP is not assigned to any crawler task for the target website.
7. The method according to claim 4, characterized in that the method further comprises: in response to the crawler task failing on the target website more than a first preset threshold number of times, or triggering the anti-scraping mechanism more than a second preset threshold number of times, adjusting at least one of the following parameters of the crawler task: request frequency, request interval range, and number of consecutive accesses per IP address.
8. An intelligent scheduling device for multi-protocol proxy resources, characterized in that the device is configured to execute the intelligent scheduling method for multi-protocol proxy resources according to any one of claims 1 to 7 and comprises: an access unit, used to access proxy IPs of multiple protocol types, build a proxy resource pool, and configure and store metadata for each proxy IP, wherein the metadata includes at least the protocol type, IP attribute type, geographical location, and supplier information; an evaluation unit, used to perform multi-dimensional active probing on the proxy IPs in the proxy resource pool at a preset frequency to obtain a real-time probing result for each proxy IP, to combine the real-time probing result of each proxy IP with the corresponding metadata and determine a global comprehensive score for each proxy IP through a dynamic comprehensive scoring model, and to combine the historical access data of each proxy IP for different websites to generate a contextual score of each proxy IP for each website, wherein the contextual score represents the access adaptability of the proxy IP to that website; a filtering unit, used to receive a proxy call request for a crawler task, the proxy call request carrying task attribute information of the crawler task, and to filter out from the proxy resource pool, based on the task attribute information, a set of candidate proxy IPs that meet the requirements, wherein the task attribute information includes at least the target website, allowed IP attribute types, and geographical location preferences; and a scheduling unit, used to determine a decision score for each candidate proxy IP based on the global comprehensive score of the candidate proxy IP and its contextual score for the target website, to use the decision score of each candidate proxy IP as a scheduling weight, to select a target proxy IP from the candidate proxy IP set using a weighted random algorithm, and to assign the target proxy IP to the crawler task so that the crawler task can initiate a data collection request to the target website based on the target proxy IP.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the program, the method according to any one of claims 1 to 7 is implemented.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when executed, perform the method according to any one of claims 1 to 7.