An industry data monitoring method and system based on crawler scheduling and network flow
By monitoring network data flow and dynamically adjusting the priority and access frequency of crawler tasks, the method addresses the untimely data retrieval and resource waste of existing technologies and enables efficient monitoring and processing of industry data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING ZHIYI SHUPU DATA SERVICE CO LTD
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing web crawler systems lack the ability to perceive and respond to environmental changes, resulting in untimely crawling of high-value data and repeated access to low-popularity data, wasting resources and affecting system stability.
By monitoring the network data flow of various data sources, the priority and access frequency of crawler task sets are dynamically adjusted, a dual scheduling mechanism is established, resources are intelligently allocated to hot data sources, and the best crawling strategy is set.
It improves the efficiency of capturing and processing industry data, ensures the timely acquisition of key and time-sensitive data, and provides reliable data support for enterprise decision-making.
Figure CN122240904A_ABST
Description
Technical Field
[0001] This application relates to the field of web data crawling and processing technology, and in particular to an industry data monitoring method and system based on crawler scheduling and network flow.
Background Technology
[0002] Currently, with the development of the industrial internet, efficient and accurate collection of publicly available online data has become a key support for enterprise decision-making and market analysis. In existing technologies, web crawler systems typically employ predefined fixed scheduling strategies (such as breadth-first, depth-first, or simple time interval polling) to execute data collection tasks.
[0003] This static scheduling mechanism lacks the ability to perceive and respond to environmental changes, and cannot dynamically adjust based on the real-time load of the target website server, network bandwidth fluctuations, and changes in the popularity of the data content itself. On the one hand, it fails to capture high-value, frequently updated hot information in a timely manner, resulting in a lack of data timeliness and low collection efficiency; on the other hand, it repeatedly and frequently accesses low-popularity or slowly updated resources, wasting valuable network bandwidth and computing resources, and may also trigger the target site's anti-scraping mechanism due to request overload, leading to IP blocking and affecting the stable operation of the system.
Summary of the Invention
[0004] The purpose of this application is to provide an industry data monitoring method and system based on crawler scheduling and network flow to solve the above-mentioned technical problems, aiming to improve the efficiency of industry data crawling and processing and provide data support for enterprise decision-making.
[0005] In some embodiments of this application, multiple data sources are selected according to the needs of industry data crawling, and a crawler task set is set for each data source to crawl industry data. By monitoring the network data flow of each data source, the priority of each crawler task set is dynamically adjusted so that critical and time-sensitive industry data are detected efficiently, thereby improving the efficiency of industry data crawling and processing and providing data support for enterprise decision-making.
[0006] In some embodiments of this application, a dual scheduling mechanism is established to dynamically adjust the execution priority and access frequency of each data source based on the popularity of network data streams in each data source. Computing resources and bandwidth resources are intelligently allocated to hot data sources to ensure the overall crawling efficiency of the system. At the same time, within each data source, the optimal data crawling strategy is set according to the real-time popularity changes of its data stream to improve the crawling efficiency of key data in the data source.
[0007] In some embodiments of this application, an industry data monitoring method based on crawler scheduling and network flow is provided, including:
[0008] Set multiple data sources according to the associated network structure of industrial data, and establish a crawler task set for each data source; Obtain the network data streams of each data source, and set a crawler scheduling strategy according to all the network data streams; Obtain the feedback data packets of each data source according to the crawler scheduling strategy, and set a primary storage strategy according to the preprocessing model and all the feedback data packets.
[0009] In some embodiments of the present application, the establishment of the crawler task set for each data source includes: Establish a data source sequence A, A = (a1, a2... ai... an), where ai is the i-th data source; n is the number of data sources; Set ai as the target data source in sequence according to the data source sequence A; Obtain the crawling requirements of the target data source; Set multiple sub-crawler tasks according to the crawling requirements, and generate mapping sub-requirements for each sub-crawler task; Set the crawler task set of the target data source according to all the sub-crawler tasks; Set the crawler task sets of each data source in sequence.
[0010] In some embodiments of the present application, the setting of the crawler scheduling strategy includes: Set an initial scheduling strategy and multiple popularity metrics according to historical record data; The initial scheduling strategy includes: the execution priority and access frequency of each crawler task set; Set ai as the data source to be evaluated in sequence according to the data source sequence A; Obtain the network data stream of the data source to be evaluated, and generate a popularity deviation value of the data source to be evaluated; Generate the popularity deviation values of each data source in sequence, and set the crawler scheduling strategy according to all the popularity deviation values.
[0011] In some embodiments of the present application, the setting of the crawler scheduling strategy according to all the popularity deviation values includes: Preset a popularity deviation value threshold B1; If B1 < bi, i = (1, 2... n), where bi is the popularity deviation value of the i-th data source, generate a primary scheduling instruction for the i-th data source, and set the i-th data source as a popular data source; Set the primary scheduling strategy according to all the primary scheduling instructions and the initial scheduling strategy; Set the secondary scheduling strategy for each popular data source according to the preset crawler correction model; Set the crawler scheduling strategy according to the primary scheduling strategy and all the secondary scheduling strategies.
[0012] In some embodiments of the present application, the preset crawler correction model includes: Based on the data source sequence A, set ai as the data source to be corrected in sequence; Retrieve the historical data package of the data source to be corrected; Generate multiple popularity scenarios based on the historical data package; Establish a popularity scenario sequence B, B = (b1, b2, ..., bi, ..., bm), where bi is the i-th popularity scenario of the data source to be corrected and m is the number of popularity scenarios of the data source to be corrected; Based on the popularity scenario sequence B, set bi as the target popularity scenario in sequence; Set the execution sub-strategy for the target popularity scenario; The execution sub-strategy includes: multiple sub-tasks to be executed and a first-level priority order; Set the execution sub-strategies for each popularity scenario in sequence, and build the correction sub-model for the data source to be corrected based on all execution sub-strategies; Generate the correction sub-models for each data source in sequence, and build the crawler correction model based on all the correction sub-models.
[0013] In some embodiments of this application, setting the execution sub-strategy for the target popularity scenario includes: Obtain the crawler task set of the data source to be corrected, and establish a crawler subtask sequence H, H = (h1, h2, ..., hi, ..., hr), where hi is the i-th crawler subtask of the data source to be corrected and r is the number of crawler subtasks of the data source to be corrected; Based on the crawler subtask sequence H, set hi as the target subtask in sequence; Obtain the mapping sub-requirement of the target subtask, and generate the association value d between the target subtask and the target popularity scenario based on the mapping sub-requirement; Preset an association value threshold D1; If d > D1, set the target subtask as a subtask to be executed in the target popularity scenario; Generate the association values between each crawler subtask and the target popularity scenario in sequence; Set the first-level priority order based on all association values.
[0014] In some embodiments of this application, the setting of the secondary scheduling strategy for each popular data source includes: Select the target popular data source in sequence from all popular data sources; Establish a running scenario based on the network data stream of the target popular data source; Set the correction sub-model of the target popular data source as the target correction model; Generate the similarity values between the running scenario and each popularity scenario in the target correction model; Set the execution sub-strategy of the popularity scenario corresponding to the maximum of all similarity values as the secondary scheduling strategy of the target popular data source; Set the secondary scheduling strategy for each popular data source in sequence.
[0015] In some embodiments of this application, the setting of the primary storage strategy includes: Build a distributed storage library based on all data sources; The distributed storage library includes a storage sub-library for each data source; Based on the data source sequence A, set ai as the data source to be monitored in sequence; Obtain the feedback data packet from the data source to be monitored; Generate an industry data packet based on the preprocessing model's processing result for the feedback data packet; Generate a timestamp and data tags for the industry data packet; Generate a packaged industry data packet based on the industry data packet, the timestamp, and the data tags; Send the packaged industry data packet to the storage sub-library of the data source to be monitored.
[0016] In some embodiments of this application, an industry data monitoring system based on crawler scheduling and network flow is provided, including: The central control unit is used to set up multiple data sources based on the network structure of industry data and establish a set of crawler tasks for each data source. The monitoring unit is used to acquire network data streams from various data sources; The central control unit is also used to set crawler scheduling strategies based on all network data streams. The storage unit is used to obtain feedback data packets from various data sources according to the crawler scheduling strategy, and to set a primary storage strategy based on the preprocessing model and all feedback data packets.
[0017] The central control unit includes: The first processing module is used to establish a data source sequence A, A=(a1, a2…ai…an), where ai is the i-th data source and n is the number of data sources; Based on the data source sequence A, set ai as the target data source in sequence; Obtain the crawling requirements of the target data source; Set up multiple crawler sub-tasks according to the crawling requirements, and generate mapping sub-requirements for each crawler sub-task; Set the crawler task set of the target data source based on all crawler sub-tasks; Set the crawler task sets for each data source in sequence.
[0018] In some embodiments of this application, the central control unit further includes: The second processing module is used to set the initial scheduling strategy and multiple popularity metrics based on historical record data; The initial scheduling strategy includes: the execution priority and access frequency of each crawler task set; Set ai as the data source to be evaluated in sequence according to the data source sequence A; Obtain the network data stream of the data source to be evaluated, and generate the popularity deviation value of the data source to be evaluated; Generate the popularity deviation values of each data source in sequence, and set the crawler scheduling strategy according to all the popularity deviation values; Setting the crawler scheduling strategy according to all the popularity deviation values includes: Preset the popularity deviation value threshold B1; If B1 < bi, i = (1, 2... n), where bi is the popularity deviation value of the i-th data source, generate the primary scheduling instruction for the i-th data source, and set the i-th data source as a popular data source; Set the primary scheduling strategy according to all the primary scheduling instructions and the initial scheduling strategy; Set the secondary scheduling strategy for each popular data source according to the preset crawler correction model; Set the crawler scheduling strategy according to the primary scheduling strategy and all the secondary scheduling strategies.
[0019] Compared with the prior art, the industry data monitoring method and system based on crawler scheduling and network flow in the embodiments of the present application have the following beneficial effects: Multiple data sources are selected according to the crawling requirements of industry data, a crawler task set is set for each data source to crawl industry data, and the priority of each crawler task set is dynamically corrected by monitoring the network data stream of each data source, so that key and time-sensitive industry data are detected efficiently, thereby improving the crawling and processing efficiency of industry data and providing data support for enterprise decision-making.
[0020] By establishing a dual scheduling mechanism, the execution priority and access frequency of each data source are dynamically adjusted according to the popularity of the network data stream in each data source, and computing resources and bandwidth resources are intelligently allocated to hot data sources, ensuring the overall crawling efficiency of the system. At the same time, within each data source, the optimal data crawling strategy is set according to the real-time popularity changes of its data stream, improving the crawling efficiency of key data in the data source.
Description of the Drawings
[0021] Figure 1 is a schematic flowchart of an industry data monitoring method based on crawler scheduling and network flow in a preferred embodiment of the present application.
Detailed Embodiment
[0022] The specific implementation of the present application is described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present application but do not limit its scope.
[0023] In the description of this application, it should be understood that the terms "center", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this application.
[0024] The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, unless otherwise stated, "a plurality of" means two or more.
[0025] In the description of this application, it should be noted that, unless otherwise expressly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection between two components. Those skilled in the art can understand the specific meaning of the above terms in this application based on the specific circumstances.
[0026] As shown in Figure 1, a preferred embodiment of this application provides an industry data monitoring method based on crawler scheduling and network flow, comprising:
S101: Based on the associated network structure of industry data, set up multiple data sources and establish a crawler task set for each data source;
S102: Obtain the network data streams from each data source and set the crawler scheduling strategy based on all network data streams;
S103: Obtain feedback data packets from each data source according to the crawler scheduling strategy, and set a primary storage strategy based on the preprocessing model and all feedback data packets.
[0027] Specifically, the associated network structure refers to all websites from which relevant industry data of an enterprise can be obtained (such as official enterprise websites, supply chain and customer data websites, competitor websites, industry-related macroeconomic policy websites, and market feedback and public opinion data websites). Multiple data sources are set up based on all websites in the associated network structure, where a single data source represents one website.
[0028] Specifically, establishing a crawler task set for each data source includes: Establish a data source sequence A, A=(a1, a2, ..., ai, ..., an), where ai is the i-th data source and n is the number of data sources; Based on the data source sequence A, set ai as the target data source in sequence; Obtain the crawling requirements of the target data source; Set up multiple crawler sub-tasks according to the crawling requirements, and generate mapping sub-requirements for each crawler sub-task; Set the crawler task set of the target data source based on all crawler sub-tasks; Set the crawler task sets for each data source in sequence.
[0029] Specifically, the industry-related data within the target data source is identified to determine the corresponding crawling scope, and the crawling requirements of the target data source are then generated based on that scope.
[0030] Specifically, by breaking down the crawling requirements (dividing them according to the website structure into various sub-requirements such as crawling newly added unknown URLs, crawling known specific URLs, and revisiting high-value pages), multiple crawling sub-requirements are generated, and multiple crawler sub-tasks are constructed based on the breakdown results. Each crawler sub-task handles a single crawling sub-requirement.
[0031] Specifically, a web crawling subtask includes specific tasks, namely the crawling path and the crawled content. Furthermore, the mapping sub-requirement of a web crawling subtask is the corresponding crawling sub-requirement.
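The task-set construction described in the preceding paragraphs can be sketched as a small data structure. This is only an illustrative sketch: the class names, field names, example hostnames, and the dictionary layout of the crawling requirements are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerSubTask:
    """One crawler sub-task: a crawl path, the content to crawl, and the
    crawling sub-requirement it maps to (its mapping sub-requirement)."""
    crawl_path: str
    crawl_content: str
    mapping_sub_requirement: str


@dataclass
class CrawlerTaskSet:
    """All crawler sub-tasks of a single data source (one website)."""
    data_source: str
    sub_tasks: list[CrawlerSubTask] = field(default_factory=list)


def build_task_sets(
    crawl_requirements: dict[str, list[dict[str, str]]],
) -> list[CrawlerTaskSet]:
    """Walk the data source sequence A in order and build one task set per source."""
    task_sets = []
    for source, sub_requirements in crawl_requirements.items():
        # Each sub-task handles exactly one crawling sub-requirement.
        sub_tasks = [
            CrawlerSubTask(req["path"], req["content"], req["name"])
            for req in sub_requirements
        ]
        task_sets.append(CrawlerTaskSet(source, sub_tasks))
    return task_sets


# Hypothetical crawling requirements, already broken down per data source.
example_requirements = {
    "supplier-site.example": [
        {"name": "crawl_new_urls", "path": "/news/", "content": "article links"},
        {"name": "revisit_high_value_pages", "path": "/prices/", "content": "price table"},
    ],
    "policy-site.example": [
        {"name": "crawl_known_urls", "path": "/notices/", "content": "notice text"},
    ],
}
task_sets = build_task_sets(example_requirements)
print([(ts.data_source, len(ts.sub_tasks)) for ts in task_sets])
```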
[0032] It is understandable that in the above embodiments, multiple data sources are selected according to the needs of industry data crawling, and multiple crawler subtasks are set by analyzing the network structure of each data source to improve the crawling efficiency of relevant industry data in each data source.
[0033] In a preferred embodiment of this application, setting the crawler scheduling strategy includes: Set the initial scheduling strategy and multiple popularity metrics based on historical record data; The initial scheduling strategy includes: the execution priority and access frequency of each crawler task set; Based on the data source sequence A, set ai as the data source to be evaluated in sequence; Obtain the network data stream of the data source to be evaluated and generate the popularity deviation value of the data source to be evaluated; Generate the popularity deviation value for each data source in sequence, and set the crawler scheduling strategy based on all popularity deviation values.
[0034] Specifically, popularity metrics include parameters that reflect the popularity of data sources, such as page update frequency, page views, and number of external citations.
[0035] Specifically, the historical record data is the historical content data captured from each data source. By analyzing this data, a priority evaluation value is set for each data source according to its data update frequency and data importance (i.e., the degree of impact on the enterprise's industry decision-making; the greater the impact, the higher the data importance). The greater the priority evaluation value, the higher the execution priority of the crawler task set of the corresponding data source, the higher its access frequency, and the more computing resources and bandwidth resources it is allocated.
[0036] Specifically, a first reference value is set according to the data update frequency: the greater the update frequency, the greater the first reference value. A second reference value is set according to the data importance: the higher the data importance, the greater the second reference value. The first reference value and the second reference value have the same value range. The priority evaluation value is generated based on the weighted processing result of the first reference value and the second reference value. The weight factors of the two reference values can be set according to historical parameters, and the sum of the two weight factors is 1.
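As a minimal sketch of this weighting, the example below assumes that the "weighted processing result" is a plain weighted sum and that both reference values are normalized to [0, 1]. The function name, the weight 0.6, and the sample values are hypothetical.

```python
def priority_evaluation(first_ref: float, second_ref: float, w_update: float) -> float:
    """Weighted combination of the update-frequency reference value (first_ref)
    and the data-importance reference value (second_ref).

    The two weight factors sum to 1, so only one of them is passed in.
    """
    if not 0.0 <= w_update <= 1.0:
        raise ValueError("weight factor must lie in [0, 1]")
    return w_update * first_ref + (1.0 - w_update) * second_ref


# Hypothetical (first reference value, second reference value) per data source.
reference_values = {"a1": (0.9, 0.4), "a2": (0.3, 0.8), "a3": (0.6, 0.6)}
scores = {
    source: priority_evaluation(first, second, w_update=0.6)
    for source, (first, second) in reference_values.items()
}

# A larger priority evaluation value means a higher execution priority.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['a1', 'a3', 'a2']
```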
[0037] Specifically, each popularity metric is quantified so that the reference values of all popularity metrics fall within the same value range. The expected reference value of each popularity metric is generated by analyzing the historical content data of the data source to be evaluated, and the real-time reference value of each popularity metric is generated from the network data stream of the data source to be evaluated. The popularity deviation value is set according to the differences between the expected reference value and the real-time reference value of each popularity metric: the greater the sum of the differences, the greater the popularity deviation value. The mapping relationship between the two can be set according to historical parameters.
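The deviation computation can be illustrated as follows. The sketch makes two assumptions that the application leaves open: the per-metric difference is taken as an absolute difference, and the mapping from the sum of differences to the popularity (heat) deviation value is the identity. The metric names and numbers are hypothetical.

```python
def popularity_deviation(expected: dict[str, float], realtime: dict[str, float]) -> float:
    """Sum of per-metric gaps between real-time and expected reference values.

    Assumption: absolute differences, mapped to the deviation value unchanged.
    Any monotonically increasing mapping would preserve the stated property
    that a larger sum of differences gives a larger deviation value.
    """
    return sum(abs(realtime[metric] - expected[metric]) for metric in expected)


# Reference values quantified into the same range, here [0, 1].
expected = {"update_frequency": 0.4, "page_views": 0.5, "external_citations": 0.2}
realtime = {"update_frequency": 0.9, "page_views": 0.7, "external_citations": 0.2}

deviation = popularity_deviation(expected, realtime)
print(round(deviation, 3))  # 0.7
```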
[0038] Specifically, setting the crawler scheduling strategy according to all popularity deviation values includes: Preset a popularity deviation value threshold B1; If B1 < bi, i = (1, 2…n), where bi is the popularity deviation value of the i-th data source, generate a primary scheduling instruction for the i-th data source and set the i-th data source as a popular data source; Set the primary scheduling strategy according to all primary scheduling instructions and the initial scheduling strategy; Set a secondary scheduling strategy for each popular data source according to the preset crawler correction model; Set the crawler scheduling strategy according to the primary scheduling strategy and all secondary scheduling strategies.
[0039] Specifically, the popularity deviation value threshold can be set according to historical parameters. If the popularity deviation value of the current data source is greater than the preset threshold, the relevant data in the current data source is fluctuating abnormally, and the crawler execution parameters of the current data source need to be adjusted promptly according to the primary scheduling instruction.
[0040] Specifically, the primary scheduling strategy refers to updating the priority evaluation value of each popular data source based on real-time data popularity, and dynamically adjusting the execution priority and access frequency of the corresponding crawler task set based on the update results. The higher the data popularity, the higher the execution priority of the corresponding crawler task set.
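The threshold check and the priority update can be sketched as below. The application does not state how the priority evaluation value is updated from real-time popularity, so the rule used here (adding the deviation value to the initial priority) is purely an assumption for illustration, as are the function name and sample numbers.

```python
def primary_schedule(
    initial_priority: dict[str, float],
    deviations: dict[str, float],
    b1: float,
) -> tuple[list[str], list[str]]:
    """Return (popular data sources, execution order of all data sources)."""
    # A data source is marked popular when its deviation value exceeds threshold B1.
    popular = [source for source, b in deviations.items() if b > b1]

    # Assumed update rule: boost the priority evaluation value by the deviation.
    updated = dict(initial_priority)
    for source in popular:
        updated[source] += deviations[source]

    # A higher priority evaluation value means earlier execution.
    order = sorted(updated, key=updated.get, reverse=True)
    return popular, order


initial_priority = {"a1": 0.70, "a2": 0.50, "a3": 0.60}
deviations = {"a1": 0.1, "a2": 0.7, "a3": 0.2}
popular, order = primary_schedule(initial_priority, deviations, b1=0.5)
print(popular)  # ['a2']
print(order)    # ['a2', 'a1', 'a3']
```

Data sources whose deviation stays at or below B1 keep their initial priority, which matches the idea that the initial scheduling strategy covers routine crawling.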
[0041] It is understandable that in the above embodiments, an initial scheduling strategy is set through routine analysis of each data source to complete the routine crawling of industry-related content within each data source, and the priority of each crawler task set is dynamically adjusted by monitoring the network data flow of each data source, so that key and time-sensitive industry data are detected efficiently, thereby improving the efficiency of crawling and processing industry data and providing data support for enterprise decision-making.
[0042] In a preferred embodiment of this application, the preset crawler correction model includes: Based on the data source sequence A, set ai as the data source to be corrected in sequence; Retrieve the historical data package of the data source to be corrected; Generate multiple popularity scenarios based on the historical data package; Establish a popularity scenario sequence B, B = (b1, b2, ..., bi, ..., bm), where bi is the i-th popularity scenario of the data source to be corrected and m is the number of popularity scenarios of the data source to be corrected; Based on the popularity scenario sequence B, set bi as the target popularity scenario in sequence; Set the execution sub-strategy for the target popularity scenario; The execution sub-strategy includes: multiple sub-tasks to be executed and a first-level priority order; Set the execution sub-strategies for each popularity scenario in sequence, and build the correction sub-model for the data source to be corrected based on all execution sub-strategies; Generate the correction sub-models for each data source in sequence, and build the crawler correction model based on all the correction sub-models.
[0043] Specifically, the historical data package contains historical content data captured from the data source to be corrected.
[0044] Specifically, based on the quantitative results of each popularity index and the historical data package, multiple value ranges for each popularity index are set sequentially. Multiple popularity scenarios of the data source to be corrected are generated based on the random combination of all value ranges. Among them, the value ranges of each popularity index corresponding to any two popularity scenarios are not exactly the same.
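One way to picture the scenario generation is to enumerate combinations of value ranges, one range per popularity metric. The sketch below enumerates the full Cartesian product instead of drawing random combinations, which is an assumption; it does guarantee that no two scenarios share exactly the same ranges. The metric names and range boundaries are hypothetical.

```python
from itertools import product

ValueRange = tuple[float, float]


def build_scenarios(
    ranges_per_metric: dict[str, list[ValueRange]],
) -> list[dict[str, ValueRange]]:
    """Combine one value range per popularity metric into a popularity scenario."""
    metrics = list(ranges_per_metric)
    combos = product(*(ranges_per_metric[metric] for metric in metrics))
    return [dict(zip(metrics, combo)) for combo in combos]


# A frequently updated data source gets a finer split of update_frequency.
ranges_per_metric = {
    "update_frequency": [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0)],
    "page_views": [(0.0, 0.5), (0.5, 1.0)],
}
scenarios = build_scenarios(ranges_per_metric)
print(len(scenarios))  # 6
print(scenarios[0])    # {'update_frequency': (0.0, 0.3), 'page_views': (0.0, 0.5)}
```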
[0045] Specifically, the division of value ranges for various popularity metrics varies across different data sources. For example, in data sources with high page update frequency, the interval division for that popularity metric (i.e., page update frequency) is more precise (i.e., the range of a single value interval is smaller). Conversely, in data sources with lower data importance, the interval division precision for each popularity metric is lower. The specific criteria for this division can be set based on the analysis results of the historical data package of that data source. By dynamically adjusting the value range division of each popularity metric, the scheduling efficiency of the correction sub-model for various crawler subtasks within the data source can be improved.
[0046] Specifically, setting the execution sub-strategy for the target popularity scenario includes: Obtain the crawler task set of the data source to be corrected, and establish a crawler subtask sequence H, H = (h1, h2, ..., hi, ..., hr), where hi is the i-th crawler subtask of the data source to be corrected and r is the number of crawler subtasks of the data source to be corrected; Based on the crawler subtask sequence H, set hi as the target subtask in sequence; Obtain the mapping sub-requirement of the target subtask, and generate the association value d between the target subtask and the target popularity scenario based on the mapping sub-requirement; Preset an association value threshold D1; If d > D1, set the target subtask as a subtask to be executed in the target popularity scenario; Generate the association values between each crawler subtask and the target popularity scenario in sequence; Set the first-level priority order based on all association values.
[0047] Specifically, an influence value is set for the influence of each popularity metric on the mapping sub-requirement. The greater the influence, the greater the corresponding influence value. For example, the page update frequency has a greater influence on the sub-requirement of crawling newly added unknown URLs, so its influence value for that sub-requirement is greater. Likewise, the number of external citations has a greater influence on the sub-requirement of revisiting high-value pages.
[0048] Specifically, the variation value of each popularity metric in the target popularity scenario is generated (i.e., the difference between the expected reference value of that popularity metric for the data source to be corrected and the median value of the metric's value range in the target popularity scenario).
[0049] Specifically, a weight sequence is set according to the mapping sub-requirement of the target subtask (i.e., a weight coefficient is set for each popularity metric according to its influence value; the larger the influence value, the larger the weight coefficient, and the mapping relationship between the two can be set according to historical parameters). The weight sequence is used to weight the variation values of each popularity metric in the target popularity scenario, and the weighted result is set as the association value d between the target subtask and the target popularity scenario.
[0050] Specifically, an association value threshold is preset. If the association value of the target subtask is lower than the threshold, the target subtask does not need to be executed at the current time node.
[0051] Specifically, the larger the association value, the higher the position of the corresponding subtask to be executed in the first-level priority order, the higher the execution priority of the subtask, and the more computing resources and bandwidth resources it is allocated.
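The construction of an execution sub-strategy can be sketched as follows. Several details are assumptions: the variation value is taken as an absolute difference to the midpoint of the range, the weight coefficients are the influence values normalized to sum to 1, and the association (correlation) value d is their weighted sum. All names and numbers are hypothetical.

```python
ValueRange = tuple[float, float]


def association_value(
    influence: dict[str, float],
    expected: dict[str, float],
    scenario: dict[str, ValueRange],
) -> float:
    """Weighted sum of per-metric variation values for one sub-task and one scenario."""
    total_influence = sum(influence.values())
    d = 0.0
    for metric, (low, high) in scenario.items():
        midpoint = (low + high) / 2
        variation = abs(midpoint - expected[metric])  # assumed absolute difference
        d += influence[metric] / total_influence * variation
    return d


def execution_sub_strategy(
    sub_task_influence: dict[str, dict[str, float]],
    expected: dict[str, float],
    scenario: dict[str, ValueRange],
    d1: float,
) -> list[str]:
    """Keep sub-tasks with d > D1, ordered by descending d (first-level priority order)."""
    scored = {
        name: association_value(influence, expected, scenario)
        for name, influence in sub_task_influence.items()
    }
    selected = {name: d for name, d in scored.items() if d > d1}
    return sorted(selected, key=selected.get, reverse=True)


expected = {"update_frequency": 0.4, "external_citations": 0.2}
scenario = {"update_frequency": (0.8, 1.0), "external_citations": (0.2, 0.4)}
sub_task_influence = {
    "crawl_new_urls": {"update_frequency": 3, "external_citations": 1},
    "crawl_known_urls": {"update_frequency": 1, "external_citations": 1},
    "revisit_high_value_pages": {"update_frequency": 1, "external_citations": 3},
}
strategy = execution_sub_strategy(sub_task_influence, expected, scenario, d1=0.25)
print(strategy)  # ['crawl_new_urls', 'crawl_known_urls']
```

In this hypothetical scenario the update frequency is far above its expected value, so the sub-task most influenced by update frequency ranks first, and the revisit sub-task falls below D1 and is skipped.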
[0052] Specifically, setting the secondary scheduling strategy for each popular data source includes: Select the target popular data source in sequence from all popular data sources; Establish a running scenario based on the network data stream of the target popular data source; Set the correction sub-model of the target popular data source as the target correction model; Generate the similarity values between the running scenario and each popularity scenario in the target correction model; Set the execution sub-strategy of the popularity scenario corresponding to the maximum of all similarity values as the secondary scheduling strategy of the target popular data source; Set the secondary scheduling strategy for each popular data source in sequence.
[0053] Specifically, the reference value of each popularity metric in the running scenario is obtained, and the difference between the running scenario and each popularity scenario is generated (i.e., the sum of the absolute values of the differences between the reference value of each popularity metric in the running scenario and the value range of that metric in the popularity scenario). The greater the difference, the smaller the corresponding similarity value. The mapping relationship between the two can be set according to historical parameters.
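Selecting the secondary scheduling strategy then amounts to a nearest-scenario lookup. The sketch assumes that the difference between a value and a value range is the distance to the nearest edge of the range (zero when the value lies inside it), and that similarity is 1 / (1 + difference). Both are illustrative choices; the application only requires that a larger difference give a smaller similarity.

```python
ValueRange = tuple[float, float]


def distance_to_range(value: float, low: float, high: float) -> float:
    """Distance from a value to a closed range; 0.0 when the value lies inside it."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def similarity(running: dict[str, float], scenario: dict[str, ValueRange]) -> float:
    """Assumed mapping: similarity decreases as the summed difference grows."""
    difference = sum(
        distance_to_range(running[metric], low, high)
        for metric, (low, high) in scenario.items()
    )
    return 1.0 / (1.0 + difference)


def select_secondary_strategy(
    running: dict[str, float],
    correction_sub_model: list[tuple[dict[str, ValueRange], list[str]]],
) -> list[str]:
    """Return the execution sub-strategy of the most similar popularity scenario."""
    _, best_strategy = max(
        correction_sub_model, key=lambda entry: similarity(running, entry[0])
    )
    return best_strategy


# Hypothetical correction sub-model: (popularity scenario, execution sub-strategy).
correction_sub_model = [
    ({"update_frequency": (0.0, 0.5), "page_views": (0.0, 0.5)}, ["crawl_known_urls"]),
    ({"update_frequency": (0.5, 1.0), "page_views": (0.0, 0.5)},
     ["crawl_new_urls", "crawl_known_urls"]),
]
running = {"update_frequency": 0.85, "page_views": 0.3}
print(select_secondary_strategy(running, correction_sub_model))
# ['crawl_new_urls', 'crawl_known_urls']
```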
[0054] It is understandable that in the above embodiments, by establishing a dual scheduling mechanism, the execution priority and access frequency of each data source are dynamically adjusted according to the popularity of network data streams in each data source, and computing resources and bandwidth resources are intelligently allocated to hot data sources to ensure the overall crawling efficiency of the system. At the same time, within each data source, the optimal data crawling strategy is set according to the real-time popularity change status of its data stream to improve the crawling efficiency of key data in the data source.
[0055] In a preferred embodiment of this application, setting the primary storage strategy includes: building a distributed storage library based on all data sources, the distributed storage library including a storage sub-library for each data source; setting ai as the data source to be monitored in sequence according to the data source sequence A; obtaining the feedback data packet of the data source to be monitored; generating an industry data packet from the preprocessing model's processing result for the feedback data packet; generating a timestamp and data tags for the industry data packet; generating a packaged industry packet from the industry data packet, the timestamp, and the data tags; and sending the packaged industry packet to the storage sub-library of the data source to be monitored.
[0056] Specifically, a corresponding preprocessing sub-model is constructed based on the data format of the content crawled from each data source. The industry data in the crawled content can be extracted through the preprocessing sub-model and transformed into the same fixed data format.
[0057] Specifically, a preprocessing model is constructed based on all preprocessing sub-models.
[0058] Specifically, the feedback data packet is content data captured from the data source to be monitored.
[0059] Specifically, the industry data packet contains processed, uniformly formatted industry data. A timestamp is generated from the capture time, and data tags are set according to the category to which the industry data belongs. The timestamps establish a time-based relationship among industry data from different data sources, while the data tags establish a content-based relationship among industry data within the same data source, facilitating subsequent multi-source data querying and analysis.
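The storage steps above can be sketched as follows. This is a simplified, in-memory illustration, not the application's implementation: `preprocess` stands in for a preprocessing sub-model, `DistributedStore` stands in for the distributed storage library, and the field names, source identifier, and sample record are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def preprocess(raw):
    """Hypothetical preprocessing sub-model: normalise a raw record to a fixed format."""
    return {
        "title": str(raw.get("title", "")).strip(),
        "value": float(raw.get("value", 0)),
    }


def package(source_id, raw, category, captured_at):
    """Build a packaged industry packet: industry data + timestamp + data tag."""
    return {
        "source": source_id,
        "data": preprocess(raw),
        "timestamp": captured_at.astimezone(timezone.utc).isoformat(),
        "tag": category,
    }


class DistributedStore:
    """In-memory stand-in for the storage library, one sub-library per data source."""

    def __init__(self):
        self.sub_libraries = defaultdict(list)

    def save(self, packet):
        self.sub_libraries[packet["source"]].append(packet)


store = DistributedStore()
captured = datetime(2026, 2, 13, 8, 0, tzinfo=timezone.utc)
packet = package("a1", {"title": " Steel price ", "value": "3512.5"}, "prices", captured)
store.save(packet)
print(packet["timestamp"])  # 2026-02-13T08:00:00+00:00
```

Storing the timestamp in UTC is a design choice of this sketch; it keeps packets from different data sources comparable on one time axis.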
[0060] In another preferred embodiment, based on any of the above preferred embodiments of the industry data monitoring method based on crawler scheduling and network flow, an industry data monitoring system based on crawler scheduling and network flow is provided, comprising: a central control unit, used to set multiple data sources based on the association network structure of industry data and to establish a crawler task set for each data source; a monitoring unit, used to acquire the network data stream of each data source, the central control unit also being used to set a crawler scheduling strategy based on all network data streams; and a storage unit, used to obtain the feedback data packets of each data source according to the crawler scheduling strategy and to set a primary storage strategy based on the preprocessing model and all feedback data packets.
[0061] The central control unit includes a first processing module, used to: establish a data source sequence A, A=(a1, a2…ai…an), where ai is the i-th data source and n is the number of data sources; set ai as the target data source in sequence according to the data source sequence A; obtain the crawling requirements of the target data source; set multiple crawler sub-tasks according to the crawling requirements, and generate mapping sub-requirements for each crawler sub-task; set the crawler task set of the target data source according to all the crawler sub-tasks; and set the crawler task sets of each data source in sequence.
[0062] In a preferred embodiment of the present application, the central control unit further includes a second processing module, used to: set an initial scheduling policy and multiple heat metrics according to historical record data, the initial scheduling policy including the execution priority and access frequency of each crawler task set; set ai as the data source to be evaluated in sequence according to the data source sequence A; obtain the network data stream of the data source to be evaluated and generate the heat deviation value of the data source to be evaluated; and generate the heat deviation values of each data source in sequence and set the crawler scheduling policy according to all the heat deviation values. Setting the crawler scheduling policy according to all the heat deviation values includes: presetting a heat deviation value threshold B1; if B1 < bi, i=(1, 2…n), where bi is the heat deviation value of the i-th data source, generating a first-level scheduling instruction for the i-th data source and setting the i-th data source as a heat data source; setting the first-level scheduling policy according to all the first-level scheduling instructions and the initial scheduling policy; setting the second-level scheduling policy for each heat data source according to the preset crawler correction model; and setting the crawler scheduling policy according to the first-level scheduling policy and all the second-level scheduling policies.
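The threshold test on the heat deviation values can be sketched as follows. The heat deviation values are taken as given inputs. How the first-level policy modifies the initial policy is not specified here, so this sketch assumes that heat data sources are moved to the front (ordered by descending deviation) and that their access frequency is multiplied by a hypothetical `boost` factor. All names and numbers are illustrative.

```python
def first_level_schedule(initial, deviations, threshold_b1, boost=2.0):
    """Return (heat_sources, policy).

    initial:    {source: {"priority": int, "frequency": float}}, lower priority runs first
    deviations: {source: heat deviation value b_i}
    boost:      hypothetical factor applied to the access frequency of heat data sources
    """
    heat = {s for s, b in deviations.items() if b > threshold_b1}
    # Heat data sources come first, ordered by descending deviation value.
    hot = sorted(heat, key=lambda s: deviations[s], reverse=True)
    # Remaining sources keep their relative order from the initial policy.
    cold = sorted((s for s in initial if s not in heat), key=lambda s: initial[s]["priority"])
    policy = {}
    for rank, source in enumerate(hot + cold, start=1):
        freq = initial[source]["frequency"]
        policy[source] = {
            "priority": rank,
            "frequency": freq * boost if source in heat else freq,
        }
    return hot, policy


initial = {
    "a1": {"priority": 1, "frequency": 1.0},
    "a2": {"priority": 2, "frequency": 1.0},
    "a3": {"priority": 3, "frequency": 0.5},
}
deviations = {"a1": 0.1, "a2": 0.9, "a3": 0.6}
hot, policy = first_level_schedule(initial, deviations, threshold_b1=0.5)
print(hot)           # ['a2', 'a3']
print(policy["a2"])  # {'priority': 1, 'frequency': 2.0}
```

Each source returned in `hot` would then receive its own secondary scheduling strategy from the crawler correction model.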
[0063] According to the first concept of the present application, multiple data sources are selected according to the crawling requirements of industry data, a crawler task set is established for each data source to crawl industry data, and the priority of each crawler task set is dynamically corrected by monitoring the network data stream of each data source. Key and time-sensitive industry data can thus be perceived efficiently, which improves the efficiency of crawling and processing industry data and provides data support for enterprise decision-making.
[0064] According to the second concept of the present application, a dual scheduling mechanism is established: the execution priority and access frequency of each data source are dynamically adjusted according to the heat of its network data stream, and computing and bandwidth resources are intelligently allocated to the heat data sources to ensure the overall crawling efficiency of the system. At the same time, within each data source, the best crawling strategy is set according to the real-time heat changes of its data stream, improving the crawling efficiency of key data in that data source.
[0065] The above are only the preferred embodiments of the present application. It should be noted that those of ordinary skill in the art can make several improvements and replacements without departing from the technical principle of the present application, and these improvements and replacements should also be regarded as falling within the protection scope of the present application.
Claims
1. An industry data monitoring method based on crawler scheduling and network flow, characterized in that, Including: Set multiple data sources according to the association network structure of industrial data, and establish a crawler task set for each data source; Obtain the network data streams of each data source, and set a crawler scheduling strategy according to all the network data streams; Obtain the feedback data packets of each data source according to the crawler scheduling strategy, and set a primary storage strategy according to the preprocessing model and all the feedback data packets.
2. The industry data monitoring method based on crawler scheduling and network flow as described in claim 1, characterized in that, The establishment of the crawler task set for each data source includes: Establish a data source sequence A, A=(a1, a2…ai…an), where ai is the i-th data source; n is the number of data sources; Set ai as the target data source in sequence according to the data source sequence A; Obtain the scraping requirements of the target data source; Set multiple crawler sub-tasks according to the scraping requirements, and generate mapping sub-requirements for each crawler sub-task; Set the crawler task set of the target data source according to all the crawler sub-tasks; Set the crawler task sets of each data source in sequence.
3. The industry data monitoring method based on crawler scheduling and network flow as described in claim 2, characterized in that, The setting of the crawler scheduling strategy includes: Set an initial scheduling strategy and multiple heat metrics according to historical record data; The initial scheduling strategy includes: the execution priority and access frequency of each crawler task set; Set ai as the data source to be evaluated in sequence according to the data source sequence A; Obtain the network data stream of the data source to be evaluated, and generate a heat deviation value of the data source to be evaluated; Generate the heat deviation values of each data source in sequence, and set a crawler scheduling strategy according to all the heat deviation values.
4. The industry data monitoring method based on crawler scheduling and network flow as described in claim 3, characterized in that, The setting of the crawler scheduling strategy according to all the heat deviation values includes: Preset a heat deviation value threshold B1; If B1 < bi, i=(1, 2…n), generate a primary scheduling instruction for the i-th data source, and set the i-th data source as a heat data source; Set a primary scheduling strategy according to all the primary scheduling instructions and the initial scheduling strategy; Set a secondary scheduling strategy for each heat data source according to the preset crawler correction model; Set a crawler scheduling strategy according to the primary scheduling strategy and all the secondary scheduling strategies.
5. The industry data monitoring method based on crawler scheduling and network flow as described in claim 4, characterized in that, The preset crawler correction model includes: Set ai as the data source to be corrected in sequence according to the data source sequence A; Obtain the historical record packet of the data source to be corrected; Generate multiple heat scenarios according to the historical record packet; Establish a heat scenario sequence B, B=(b1, b2…bi…bm), where bi is the i-th heat scenario of the data source to be corrected; m is the number of heat scenarios of the data source to be corrected; Set bi as the target heat scenario in sequence according to the heat scenario sequence B; Set the execution sub-strategy of the target heat scenario; The execution sub-strategy includes: multiple sub-tasks to be executed and a primary priority order; Set the execution sub-strategies of each heat scenario in sequence, and establish a correction sub-model of the data source to be corrected according to all the execution sub-strategies; Generate the correction sub-models of each data source in sequence, and establish a crawler correction model according to all the correction sub-models.
6. The industry data monitoring method based on crawler scheduling and network flow as described in claim 5, characterized in that, The setting of the execution sub-strategy of the target heat scenario includes: Obtain the crawler task set of the data source to be corrected, and establish a crawler sub-task sequence H, H=(h1, h2…hi…hr), where hi is the i-th crawler sub-task of the data source to be corrected; r is the number of crawler sub-tasks of the data source to be corrected; Set hi as the target sub-task in sequence according to the crawler sub-task sequence H; Obtain the mapping sub-requirements of the target sub-task, and generate an association value d between the target sub-task and the target heat scenario according to the mapping sub-requirements; Preset an association value threshold D1; If d > D1, set the target sub-task as a sub-task to be executed in the target heat scenario; Generate the association values between each crawler sub-task and the target heat scenario in sequence; Set the primary priority order according to all the association values.
7. The industry data monitoring method based on crawler scheduling and network flow as described in claim 5, characterized in that, The setting of the secondary scheduling strategy for each heat data source includes: Select a target heat source in sequence from all the heat data sources; Establish a running scenario according to the network data stream of the target heat source; Set the correction sub-model of the target heat source as the target correction model; Generate similarity values between the running scenario and each heat scenario in the target correction model; Set the execution sub-strategy of the heat scenario corresponding to the maximum of all the similarity values as the secondary scheduling strategy of the target heat source; Set the secondary scheduling strategy of each heat data source in sequence.
8. The industry data monitoring method based on crawler scheduling and network flow as described in claim 3, characterized in that, The setting of the primary storage strategy includes: Build a distributed storage library according to all the data sources; The distributed storage library includes a storage sub-library for each data source; Set ai as the data source to be monitored in sequence according to the data source sequence A; Obtain the feedback data packet of the data source to be monitored; Generate an industry data packet according to the processing result of the preprocessing model on the feedback data packet; Generate a timestamp and data tags for the industry data packet; Generate a packaged industry packet according to the industry data packet, the timestamp, and the data tags; Send the packaged industry packet to the storage sub-library of the data source to be monitored.
9. An industry data monitoring system based on crawler scheduling and network flow, employing the industry data monitoring method based on crawler scheduling and network flow as described in any one of claims 1-8, characterized in that, Including: A central control unit, used to set multiple data sources according to the association network structure of industry data, and establish a crawler task set for each data source; A monitoring unit, used to obtain the network data streams of each data source; The central control unit is also used to set a crawler scheduling strategy according to all the network data streams; A storage unit, used to obtain the feedback data packets of each data source according to the crawler scheduling strategy, and set a primary storage strategy according to the preprocessing model and all the feedback data packets; The central control unit includes: A first processing module, used to establish a data source sequence A, A=(a1, a2…ai…an), where ai is the i-th data source; n is the number of data sources; Set ai as the target data source in sequence according to the data source sequence A; Obtain the crawling requirements of the target data source; Set multiple crawler sub-tasks according to the crawling requirements, and generate mapping sub-requirements for each crawler sub-task; Set the crawler task set of the target data source according to all the crawler sub-tasks; Set the crawler task sets of each data source in sequence.
10. The industry data monitoring system based on crawler scheduling and network flow as described in claim 9, characterized in that, The central control unit further includes: A second processing module, used to set an initial scheduling strategy and multiple heat metrics according to historical record data; The initial scheduling strategy includes: the execution priority and access frequency of each crawler task set; Set ai as the data source to be evaluated in sequence according to the data source sequence A; Obtain the network data stream of the data source to be evaluated, and generate a heat deviation value of the data source to be evaluated; Generate the heat deviation values of each data source in sequence, and set a crawler scheduling strategy according to all the heat deviation values; The setting of the crawler scheduling strategy according to all the heat deviation values includes: Preset a heat deviation value threshold B1; If B1 < bi, i=(1, 2…n), generate a primary scheduling instruction for the i-th data source, and set the i-th data source as a heat data source; Set a primary scheduling strategy according to all the primary scheduling instructions and the initial scheduling strategy; Set a secondary scheduling strategy for each heat data source according to the preset crawler correction model; Set a crawler scheduling strategy according to the primary scheduling strategy and all the secondary scheduling strategies.