A cluster data adaptive sampling method and apparatus
By collecting real-time pressure values of cluster load indicators, dynamically determining the pressure inflection point of each indicator, nonlinearly normalizing the pressure values, fusing them into a cluster load value, and predicting future load, the method solves the problems of load quantization distortion and frequency adjustment lag in cluster data collection. The adaptive sampling strategy is upgraded accordingly, improving collection accuracy and resource utilization efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANQING (TIANJIN) COMPUTER CO LTD
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies cannot adaptively adjust to dynamic changes in cluster load during cluster data acquisition, resulting in load quantization distortion and lag in sampling frequency response. They also cannot adjust the sampling frequency in advance before load changes, affecting acquisition accuracy and resource utilization efficiency.
By collecting real-time stress values of multiple load indicators in the cluster, the stress inflection point of each load indicator is dynamically determined, the real-time stress values are nonlinearly normalized, and the cluster load value is obtained by fusion. Based on future load prediction, the sampling frequency is adjusted so that the frequency change is earlier than the actual load change.
The method achieves accurate load quantization and forward-looking adjustment of the sampling frequency, solves the problems of load quantization distortion and frequency adjustment lag, and improves acquisition accuracy and resource utilization efficiency.
Figure CN122309289A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of server data acquisition technology, and in particular to a cluster data adaptive sampling method and apparatus.
Background Technology
[0002] With the rapid development of AI technology, the application of ultra-large-scale AI server clusters is becoming increasingly widespread. Cluster sizes are constantly expanding, and node types are becoming more diverse and heterogeneous, encompassing various types such as compute nodes, storage nodes, and switching nodes. To ensure the stable operation and efficient maintenance of the cluster, continuous data collection on the operational status of each node is necessary. The quality and efficiency of the collected data directly affect the accuracy of operational decisions and AI predictions.
[0003] Currently, the industry primarily employs fixed-frequency sampling or simple threshold-triggered sampling for cluster data acquisition. Fixed-frequency sampling involves collecting operational data from each node in the cluster at preset fixed time intervals, maintaining a constant sampling frequency regardless of cluster load. Simple threshold-triggered sampling involves setting one or more fixed load thresholds; when the monitored load metric exceeds or falls below the threshold, an adjustment to the sampling frequency is triggered.
[0004] However, fixed-frequency sampling cannot adaptively adjust to dynamic changes in cluster load. In high-load scenarios, a fixed sampling frequency may be insufficient to capture instantaneous changes in critical performance data such as computing power and latency, leading to missing key data. In idle scenarios, a fixed sampling frequency results in redundant waste of acquisition and transmission resources, failing to balance the contradiction between acquisition accuracy and resource consumption. Although simple threshold-triggered sampling introduces a response to load changes, the inherent information transmission and processing delay between detecting a change and adjusting the sampling frequency means that the adjustment of the sampling frequency always lags behind the actual load change. When the cluster load suddenly increases, the sampling frequency cannot keep up immediately, resulting in missing key data in the initial stage of the load surge; when the cluster load decreases, the sampling frequency remains high during the transition period, causing resource waste. In addition, existing solutions typically use a uniform and fixed pressure inflection point for normalization when quantifying cluster load. However, different clusters have different workload characteristics and hardware configurations, and the load level of the same cluster varies significantly at different times. A fixed static inflection point cannot adapt to these differences, leading to distorted load value calculations and affecting the accuracy of sampling frequency adjustment.
[0005] In summary, existing technologies cannot simultaneously solve the problems of load quantization distortion and sampling frequency response lag in cluster data acquisition. There is an urgent need for an adaptive sampling method that can dynamically determine the load quantization benchmark based on the actual load characteristics of the cluster and adjust the sampling frequency in advance based on the load change trend.
Summary of the Invention
[0006] In view of this, this application provides a cluster data adaptive sampling method and apparatus to solve the problem that the existing cluster data acquisition uses a fixed pressure inflection point and passive response frequency adjustment, which leads to distortion or lag in load quantization and sampling frequency adjustment, and makes it impossible to adjust the sampling frequency in advance before the load changes.
[0007] Specifically, this application is implemented through the following technical solution:
[0008] A first aspect of this application provides a cluster data adaptive sampling method, the method comprising:
[0009] Collect real-time pressure values corresponding to multiple load indicators of the cluster;
[0010] Determine the pressure inflection point corresponding to each load indicator based on historical real-time pressure values;
[0011] Nonlinearly normalize the real-time pressure value based on the pressure inflection point of each load indicator to obtain the standard pressure value corresponding to each load indicator;
[0012] Fuse the standard pressure values of the load indicators to obtain the cluster load value;
[0013] Predict, based on the cluster load value, the future load of the cluster at the first possible time;
[0014] Adjust the sampling frequency of the cluster's data based on the future load, wherein the adjusted sampling frequency changes earlier than the actual load of the cluster changes.
[0015] A second aspect of this application provides a cluster data adaptive sampling apparatus, the apparatus comprising an acquisition module, a determination module, a processing module, a fusion module, a prediction module, and an adjustment module;
[0016] The acquisition module is used to collect real-time pressure values corresponding to multiple load indicators of the cluster;
[0017] The determining module is used to determine the pressure inflection point corresponding to each load indicator based on historical real-time pressure values.
[0018] The processing module is used to nonlinearly normalize the real-time pressure value according to the pressure inflection point of each load index to obtain the standard pressure value corresponding to each load index.
[0019] The fusion module is used to fuse the standard stress values of various load indicators to obtain the cluster load value;
[0020] The prediction module is used to predict, based on the cluster load value, the future load of the cluster at the first possible time;
[0021] The adjustment module is used to adjust the sampling frequency of the cluster's data based on the future load, wherein the adjusted sampling frequency change time is earlier than the actual load change time of the cluster.
[0022] The cluster data adaptive sampling method and apparatus provided in this application differ from traditional fixed-threshold sampling and from variable sampling driven by manually defined rules. The load indicators are normalized against pressure inflection points so that the current load level is assessed accurately, and load changes are predicted from this accurate starting point. Because the pressure inflection point is no longer a fixed preset value but is determined dynamically from historical real-time data, the normalization benchmark of the load indicators adapts to changes in the cluster's load characteristics. Furthermore, the sampling frequency is adjusted in advance by a prediction algorithm, so that the change in sampling frequency occurs earlier than the change in the actual cluster load. Overall, the sampling strategy is upgraded from passive response to proactive prediction, which solves the problems of load quantization distortion and frequency adjustment lag in existing solutions. Specifically, real-time pressure values of multiple load indicators are collected from the cluster, and the pressure inflection point of each load indicator is determined dynamically from historical real-time pressure values, so that differences in load characteristics between clusters and between time periods are reflected in the normalization benchmark and the normalization distortion caused by a fixed static inflection point is avoided. The real-time pressure value of each load indicator is nonlinearly normalized based on its pressure inflection point, so that the standard pressure value grows at an accelerating rate in high-load states approaching or exceeding the inflection point, and load indicators with different physical characteristics become comparable on a unified scale. The standard pressure values of the load indicators are fused into the cluster load value, a single evaluation value that integrates the pressure information of multiple indicators and provides the basis for subsequent prediction and adjustment. The cluster load value is used to predict the future load of the cluster at the first possible time, so that load change trends are perceived early and forward-looking information is available for adjusting the sampling frequency. The sampling frequency is adjusted based on the future load, and the adjusted sampling frequency changes earlier than the actual load of the cluster changes, ensuring that the sampling frequency is increased before the actual load rises and is gradually decreased after the actual load falls, which solves the problem of adjustment lag.
Attached Figure Description
[0023] Figure 1 is a flowchart of an embodiment of the cluster data adaptive sampling method provided in this application;
[0024] Figure 2 is a schematic structural diagram of Embodiment 2 of the cluster data adaptive sampling apparatus provided in this application.
Detailed Implementation
[0025] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
[0026] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms "a," "an," and "the" used herein are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
[0027] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."
[0028] Embodiment 1
[0029] The following specific embodiments are given to illustrate the technical solution of this application in detail.
[0030] Figure 1 is a flowchart of an embodiment of the cluster data adaptive sampling method provided in this application. Referring to Figure 1, the method provided in this embodiment may include:
[0031] S101: Collect real-time pressure values corresponding to multiple load indicators of the cluster.
[0032] The method provided in this embodiment is applied to a six-level collaborative acquisition architecture covering the hardware layer, firmware layer, system layer, application layer, network layer, and heat dissipation layer. This six-level collaborative acquisition architecture covers the entire operation chain of a large-scale AI server cluster. The data collected at each level is as follows: The hardware layer collects parameters such as CPU utilization, GPU utilization, memory usage, hard disk read / write speed, and power consumption of each node through intelligent acquisition modules; the firmware layer collects data such as the version number, running status, fault codes, and upgrade logs of the hardware firmware of each node; the system layer collects data such as the number of processes, resource usage, system logs, and boot time of the operating system of each node, supporting Linux and Windows Server operating systems; the application layer collects data such as the training speed, latency, accuracy, and task progress of AI model training tasks, as well as the throughput and response time of data processing applications; the network layer collects data such as network bandwidth usage, transmission latency, packet loss rate, and port status between each node; and the heat dissipation layer collects data such as the flow rate of liquid cooling pipes, coolant inlet and outlet temperatures, cooling fan speed, and heat dissipation system fault warnings. Data interaction between different levels is achieved through data linkage interfaces to ensure the consistency and integrity of the collected data in terms of timing and to avoid data disconnection across levels.
[0033] It should be noted that the real-time pressure values for multiple load metrics of the cluster are collected collaboratively by the BMC (baseboard management controller) and the host-side agent (OS Agent). The BMC is a management controller that runs independently of the server's main processor. It can monitor and manage server hardware through an out-of-band management channel, and it continues to function normally even when the main processor is powered off. The OS Agent is an agent program deployed on the host side and used to collect host-side operational data.
[0034] In the six-level collaborative acquisition architecture, data from the hardware layer, firmware layer, and heat dissipation layer can be collected through the BMC. This is because the BMC has out-of-band management capabilities and provides standard IPMI and Redfish interfaces for obtaining server hardware resources, logs, and operational status information. It is the primary data source in the acquisition architecture and directly accesses hardware sensors, firmware status registers, and heat dissipation system monitoring interfaces. However, data from the system layer, application layer, and network layer is host-side operational data and cannot be obtained directly by the BMC. Therefore, an OS Agent is installed on the host to periodically collect data from these three layers and transmit it to the BMC. The BMC aggregates the data collected from each layer and provides a unified way of acquiring it, which ensures the temporal consistency of the collected data and avoids the inconsistency caused by independent sampling at multiple layers. The clustered deployment of acquisition nodes reduces coordination errors between independent nodes, unifies data processing and management, and establishes a six-layer data model covering all aspects of server operation, providing a comprehensive and reliable basis for data analysis.
[0035] When collecting real-time pressure values corresponding to multiple load metrics in a cluster, the first step is to determine the communication protocol for each device in the cluster. Large-scale clusters may simultaneously deploy server devices from different vendors whose communication interface specifications differ, so using a single fixed communication protocol may prevent data from being collected correctly from some devices. Therefore, historical communication statistics are first used to predict the communication protocols that a device is likely to use. Port probing is then used for real-time verification and priority adjustment. Finally, the matching communication protocol is determined by sending handshakes in priority order using time-division interleaving.
[0036] Specifically, this includes:
[0037] (1) Obtain historical communication statistics information corresponding to the device to be collected, and sort the communication collection protocols of the device to be collected according to the historical communication statistics information to obtain an initial priority list.
[0038] Historical communication statistics are records kept by the system during each data acquisition process, showing the types of protocols used in successful communication with each device and the corresponding success rates. For example, for a server of a certain model from a certain manufacturer, the system records the types of protocols used in its historical successful communication, including the communication protocols defined by the Intelligent Platform Management Interface (IPMI) specification and the Redfish protocol, and provides statistics on the historical communication success rates for each protocol.
[0039] When determining the initial priority ranking, the system looks up the historical protocol distribution for the device model in the historical communication statistics, sorts the communication acquisition protocols by historical communication success rate from high to low, and generates an initial priority list. For device models with successful communication records, the initial priority list includes only protocol types that have communicated successfully in the past; for devices that are joining the cluster for the first time and have no historical communication records, the initial priority list is filled with all supported protocol types in the industry-standard default priority order.
[0040] Pre-sorting using historical communication statistics allows for the rapid identification of the most likely successful communication protocols based on historical experience, avoiding the need to try all protocols from scratch on each startup and reducing the time required for protocol detection.
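As a non-limiting illustration, the pre-sorting of step (1) could be sketched in Python as follows. The function name `initial_priority_list`, the protocol labels, and the default order `DEFAULT_PROTOCOL_ORDER` are hypothetical and are not specified in this application.

```python
from typing import Dict, List

# Hypothetical default order for devices with no history (an assumption,
# not a value given in the text).
DEFAULT_PROTOCOL_ORDER: List[str] = ["redfish", "ipmi"]


def initial_priority_list(success_rates: Dict[str, float]) -> List[str]:
    """Order protocols by historical success rate, highest first.

    success_rates maps a protocol name to its historical success rate in
    [0, 1] and holds only protocols that have succeeded before. An empty
    mapping means the device has no history, so the default order is used.
    """
    if not success_rates:
        return list(DEFAULT_PROTOCOL_ORDER)
    # Sort by rate descending; ties are broken by name for determinism.
    return sorted(success_rates, key=lambda p: (-success_rates[p], p))
```

Under these assumptions, a device model with recorded rates of 0.95 for Redfish and 0.70 for IPMI would be probed with Redfish first.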
[0041] (2) Send a probe request to the common port of the device to be collected, and correct the initial priority list according to the port connectivity to obtain the corrected priority list.
[0042] Since historical communication statistics reflect past communication status, and the port configuration of the device to be collected may change due to firmware upgrades, security policy adjustments, etc., it is necessary to modify the initial priority list in conjunction with real-time port connectivity.
[0043] Specifically, the system sends probe requests to the commonly used ports corresponding to each communication acquisition protocol in the initial priority list. For example, the IPMI protocol typically uses UDP port 623, and the Redfish protocol typically uses HTTPS port 443. The system determines whether a port is reachable based on the response status returned by the port. If the port responds normally, it confirms that the communication acquisition protocol is available at the current time; if the port times out and does not respond or refuses to connect, the corresponding communication acquisition protocol is removed from the initial priority list or its priority is reduced.
[0044] The revised priority list removes protocols that were available in historical statistics but are currently unreachable on the port. The protocols that are retained are all those that have been confirmed to be reachable by port probe at the current moment, thereby ensuring the success rate of subsequent handshake probes.
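The correction of step (2) might look like the following minimal Python sketch. The reachability check is passed in as a callable (`port_reachable`, a hypothetical name) so that the actual probing mechanism, which this application does not detail, stays outside the sketch; the port table uses the two ports named above.

```python
from typing import Callable, Dict, List

# Common ports named in the text: IPMI on UDP 623, Redfish on HTTPS 443.
COMMON_PORTS: Dict[str, int] = {"ipmi": 623, "redfish": 443}


def corrected_priority_list(
    initial: List[str],
    port_reachable: Callable[[str, int], bool],
    drop_unreachable: bool = True,
) -> List[str]:
    """Remove (or demote) protocols whose common port is not reachable.

    port_reachable(protocol, port) is a caller-supplied probe that returns
    True when the port answers normally. The relative order of the
    reachable protocols is preserved.
    """
    reachable: List[str] = []
    unreachable: List[str] = []
    for protocol in initial:
        port = COMMON_PORTS[protocol]
        if port_reachable(protocol, port):
            reachable.append(protocol)
        else:
            unreachable.append(protocol)
    # Either drop unreachable protocols or move them to the end of the list.
    return reachable if drop_unreachable else reachable + unreachable
```

The `drop_unreachable` flag reflects the two options described above: removing the protocol from the list or lowering its priority.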
[0045] (3) Send protocol handshake packets to the devices to be collected one by one according to the order of the revised priority list, and determine the communication collection protocol that matches the devices to be collected based on the first returned handshake response.
[0046] To avoid the excessive time overhead of serial polling, this step uses time-division interleaving to send protocol handshake packets. Time-division interleaving means initiating multiple protocol handshake requests one after another within a short period. The handshake packet of the highest-priority protocol is sent first; without waiting for its response or timeout, the handshake packet of the next-priority protocol is sent immediately, and so on. When any handshake packet returns a normal handshake response, the communication acquisition protocol corresponding to that packet is confirmed to match the device to be collected; handshake requests that have not yet been sent are cancelled, and responses still outstanding are neither waited for nor processed.
[0047] Using time-division interleaving to send handshake packets significantly shortens the total protocol probing time compared to traditional serial polling, which involves sending a handshake request, waiting for a timeout, and then sending the next one. In serial polling, if the highest-priority protocol cannot connect, the system must wait for the full timeout period before trying the next protocol; however, in time-division interleaving, multiple protocol handshake requests are sent consecutively within a very short time window, allowing the first protocol to connect to be quickly identified and used.
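One possible way to realize the time-division interleaving of step (3) is sketched below with Python's `asyncio`. The function `staggered_handshake`, the `handshake` callable, and the `stagger` and `timeout` values are assumptions made for illustration; the actual handshake packets for IPMI or Redfish are not modeled.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def staggered_handshake(
    protocols: List[str],
    handshake: Callable[[str], Awaitable[bool]],
    stagger: float = 0.05,
    timeout: float = 2.0,
) -> Optional[str]:
    """Return the first protocol whose handshake succeeds, or None."""

    async def attempt(index: int, protocol: str) -> Optional[str]:
        # Delay each attempt by its priority rank so that higher-priority
        # handshakes go out first, without waiting for earlier timeouts.
        await asyncio.sleep(index * stagger)
        try:
            ok = await asyncio.wait_for(handshake(protocol), timeout)
        except (asyncio.TimeoutError, OSError):
            return None
        return protocol if ok else None

    tasks = [asyncio.create_task(attempt(i, p)) for i, p in enumerate(protocols)]
    try:
        # Take results in completion order; the first success wins.
        for finished in asyncio.as_completed(tasks):
            winner = await finished
            if winner is not None:
                return winner
        return None
    finally:
        # Handshakes not yet sent or still pending are abandoned.
        for task in tasks:
            task.cancel()
```

With serial polling the worst case grows with the number of protocols times the timeout; in this sketch it is bounded by roughly one timeout plus the total stagger delay.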
[0048] After determining the appropriate communication acquisition protocol for the device to be sampled, the system uses this protocol to collect real-time stress values corresponding to multiple load metrics, such as CPU utilization, GPU utilization, memory utilization, and task concurrency, from different types of nodes in the cluster, including compute nodes, storage nodes, and switching nodes. These values are used for subsequent load value calculations and sampling frequency adjustments. Specifically, CPU utilization is obtained from the CPU performance counters of the compute nodes, GPU utilization from the GPU driver interface of the compute nodes, memory utilization from the memory controllers of the storage nodes, and task concurrency is obtained by calculating the ratio of the number of currently executing tasks to the maximum task capacity in the cluster.
[0049] It should be noted that the data collected in the cluster is diverse, and the importance of different data to cluster operation and maintenance and AI prediction varies. For example, computing power data and latency data directly reflect the cluster's performance bottlenecks and business experience, and are considered high-importance data; while static information such as hardware firmware version numbers are relatively less important. In this embodiment, when adjusting the sampling frequency in subsequent steps, a uniform frequency adjustment strategy is not applied to all data. Instead, a differentiated correction coefficient is applied to the sampling frequency based on the importance level of the data. The higher the importance of the data, the greater the adjustment range of its sampling frequency with load changes, ensuring that critical data is not lost under high-load scenarios; the sampling frequency adjustment range of the lower importance data is relatively gradual, in order to save collection and transmission resources.
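The effect of an importance-based correction coefficient can be illustrated with the following Python sketch. The linear formula, the coefficient table `IMPORTANCE_COEFF`, and the parameter names are purely illustrative assumptions; this passage does not specify how the coefficient enters the frequency calculation.

```python
from typing import Dict

# Hypothetical correction coefficients per importance level (assumed values).
IMPORTANCE_COEFF: Dict[str, float] = {"high": 1.0, "medium": 0.5, "low": 0.1}


def adjusted_frequency(
    base_hz: float, max_hz: float, predicted_load: float, importance: str
) -> float:
    """Scale the load-driven frequency increase by the importance level.

    predicted_load is clamped to [0, 1]. A higher coefficient gives a wider
    adjustment range between base_hz and max_hz as the load changes.
    """
    load = min(max(predicted_load, 0.0), 1.0)
    coeff = IMPORTANCE_COEFF[importance]
    return base_hz + coeff * load * (max_hz - base_hz)
```

Under these assumed values, at a predicted load of 0.5 with a 1 Hz base and a 10 Hz ceiling, high-importance data would be sampled at 5.5 Hz while low-importance data would stay near 1.45 Hz.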
[0050] S102. Determine the pressure inflection point corresponding to each load indicator based on the historical real-time pressure value.
[0051] It should be clarified that the physical meaning of the pressure inflection point is as follows: when the actual value of a load metric reaches this inflection point, the system pressure enters a high-load range, and the normalized standard pressure value of that metric begins to grow at an accelerating rate. The pressure inflection point differs between load metrics. For example, the pressure inflection point for CPU utilization is typically around 80%, while that for memory utilization might be around 95%. This is because system response latency has already increased significantly by the time CPU utilization reaches 80%, whereas memory can still maintain relatively stable performance at 90% utilization. If a uniform, fixed pressure inflection point were used for all load metrics, it would not accurately reflect the true pressure state of each metric, leading to distortion in subsequent load value calculations.
[0052] This embodiment adopts a method of dynamically determining the pressure inflection point based on historical real-time data. Different clusters have different workload characteristics, hardware configurations and business models, and the load level of the same cluster is also significantly different at different time periods. A fixed static inflection point cannot adapt to these differences.
[0053] It should be noted that the granularity of determining the pressure inflection point in this embodiment is as follows: each load metric has a uniform pressure inflection point determined for each node type, rather than each node having an independent pressure inflection point for each metric. Specifically, for all nodes of the same type in the cluster, the system aggregates the historical real-time pressure values of the same load metric for all nodes of that type, performs statistical processing on the aggregated data, and uses the processing result as the uniform pressure inflection point for that load metric under that node type. This uniform pressure inflection point applies to all nodes of that type. For example, the CPU utilization of all compute nodes in the cluster shares a single pressure inflection point, which is obtained by aggregating the historical pressure values of CPU utilization of all compute nodes; similarly, the memory utilization of all storage nodes shares a single pressure inflection point, which is obtained by aggregating the historical pressure values of memory utilization of all storage nodes.
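The per-type granularity described above can be sketched in Python as follows. The tuple layout of `Sample` and the function name are hypothetical, and the median is used here only as a placeholder statistic.

```python
from collections import defaultdict
from statistics import median
from typing import DefaultDict, Dict, Iterable, List, Tuple

# (node_id, node_type, metric, value): a hypothetical record layout.
Sample = Tuple[str, str, str, float]


def inflection_points_by_type(
    samples: Iterable[Sample],
) -> Dict[Tuple[str, str], float]:
    """Pool history per (node type, metric) and derive one point per pool."""
    pooled: DefaultDict[Tuple[str, str], List[float]] = defaultdict(list)
    for _node_id, node_type, metric, value in samples:
        # The node identity is deliberately ignored: all nodes of the same
        # type contribute to one shared sequence for each metric.
        pooled[(node_type, metric)].append(value)
    # Placeholder statistic; any of the described methods could be used here.
    return {key: median(values) for key, values in pooled.items()}
```

A key such as `("compute", "cpu")` then holds the single inflection point shared by every compute node.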
[0054] The purpose of using a unified approach to determine the pressure inflection point based on node type, rather than determining it independently for each node, is twofold. First, nodes of the same type typically have the same hardware configuration and run the same type of workload, resulting in highly consistent load characteristics across their load metrics. Determining the inflection point by type can reflect the overall load characteristics of that type of node and reduce inflection point deviations caused by local fluctuations in a single node. Second, aggregating and processing historical data from multiple nodes of the same type is more economical in terms of storage requirements, provides more sufficient data volume, and simplifies the solution process compared to calculating the inflection point independently for each node.
[0055] For the same load metric across different node types, the pressure inflection point is determined independently. Because different node types bear different load tasks and have different hardware configurations, the pressure characteristics of the same load metric vary. Therefore, each node type independently calculates its corresponding pressure inflection point based on the aggregated historical pressure values of its own type. When a new node of a certain type is added to the cluster or a node of a certain type is removed, the system automatically includes the historical data of the newly added node in the corresponding type's aggregated statistics in the next inflection point update cycle, or removes the historical data of the removed node from the aggregated statistics, and recalculates the unified pressure inflection point for that type. For nodes newly connected to the cluster without historical data, the system temporarily uses the current unified pressure inflection point of the same type of node as the initial value until sufficient historical data is collected.
Specifically, determining the pressure inflection point corresponding to each load metric based on historical real-time pressure values includes:
[0056] (1) Obtain multiple real-time pressure values for each load indicator within a preset historical period.
[0057] The system maintains a historical stress value sequence for each load metric. This sequence records the real-time stress value of the metric obtained from each sampling within a preset historical period (e.g., the most recent 24 hours or the most recent week). The historical stress value sequence is updated chronologically after each new data collection, with new data appended to the end of the sequence and old data exceeding the preset historical period automatically removed to ensure that the sequence reflects recent load characteristics.
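A minimal Python sketch of such a time-bounded history sequence is given below. The class name `PressureHistory` and the use of numeric timestamps in seconds are assumptions for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple


class PressureHistory:
    """Keep only the samples that fall inside the preset historical period."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()

    def append(self, timestamp: float, value: float) -> None:
        # New data goes to the end of the sequence.
        self._samples.append((timestamp, value))
        # Data older than the window is removed from the front.
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def values(self) -> List[float]:
        return [value for _, value in self._samples]
```

For a 24-hour period, `window_seconds` would be 86400; samples are assumed to arrive in chronological order.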
[0058] (2) Perform statistical processing on the multiple real-time pressure values and use the processing result as the pressure inflection point corresponding to the load index.
[0059] Statistical processing is performed on the multiple real-time pressure values in the historical pressure value sequence, and a statistic that reflects the high-load critical level of the load indicator is selected as the pressure inflection point.
[0060] Optional statistical processing methods include the median method and the rolling average method. The median method sorts all values in the historical pressure value sequence by size and takes the middle value as the pressure inflection point. The advantage of the median is that it is insensitive to extreme values; even if the historical sequence contains a few abnormal peaks, the median stably reflects the typical high-load level of the indicator. The rolling average method calculates the arithmetic mean of all values in the historical pressure value sequence, multiplies the mean by a preset empirical coefficient, and uses the product as the pressure inflection point. The empirical coefficient raises the historical average moderately to the high-load critical level, so that the pressure inflection point characterizes the load warning line of the indicator. The empirical coefficient can be predefined based on the ratio of the cluster's historical peak pressure to its average pressure, for example as a constant between 1.1 and 1.3; for clusters that are newly connected or have no historical data, a default value (such as 1.2) can be used.
[0061] It is important to note that the system does not employ a fixed statistical processing method. Instead, it adaptively selects the most suitable method based on the historical fluctuation characteristics of the load indicator. When determining a pressure inflection point, the system first calculates the coefficient of variation (CV) of the historical pressure value sequence. The CV, calculated as the ratio of the standard deviation to the mean, measures the relative dispersion of the data. When the CV is below a preset first CV threshold, the historical pressure values are concentrated and fluctuate relatively little. In this case, the system uses the rolling average method to determine the pressure inflection point, which reflects the typical load level of the indicator more smoothly. When the CV exceeds the first CV threshold but is below the second CV threshold, the distribution of historical pressure values has some dispersion. In this case, the system uses the median method to avoid interference from a few abnormal peaks. When the CV exceeds the second CV threshold, the historical pressure values fluctuate drastically. In this case, the system switches to the truncated mean method, which removes the highest and lowest values from the sequence according to a preset proportion before calculating the mean, balancing resistance to extreme values against the representativeness of the mean. Selecting the statistical method adaptively from the data distribution allows the determination of the pressure inflection point to match the actual fluctuation characteristics of the load indicator and avoids the bias that a fixed method introduces under specific data distributions.
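The adaptive selection can be sketched in Python as follows. The threshold values `cv_low` and `cv_high`, the trimming proportion, and the use of the population standard deviation are illustrative assumptions; only the default empirical coefficient of 1.2 comes from the description above.

```python
from statistics import mean, median, pstdev
from typing import Sequence


def truncated_mean(values: Sequence[float], trim_ratio: float) -> float:
    """Mean after dropping the lowest and highest trim_ratio share of values."""
    if not 0.0 <= trim_ratio < 0.5:
        raise ValueError("trim_ratio must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_ratio)
    return mean(ordered[k:len(ordered) - k])


def pressure_inflection_point(
    values: Sequence[float],
    cv_low: float = 0.1,    # assumed first CV threshold
    cv_high: float = 0.3,   # assumed second CV threshold
    coeff: float = 1.2,     # default empirical coefficient from the text
    trim_ratio: float = 0.1,
) -> float:
    avg = mean(values)
    # Coefficient of variation: standard deviation divided by the mean.
    cv = pstdev(values) / avg if avg else 0.0
    if cv < cv_low:
        return coeff * avg                      # rolling average method
    if cv < cv_high:
        return median(values)                   # median method
    return truncated_mean(values, trim_ratio)   # truncated mean method
```

How the boundary cases (a CV exactly equal to a threshold) are handled is not stated in the description; the sketch assigns them to the higher band.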
[0062] (3) When the fluctuation range of the historical real-time pressure value exceeds the preset threshold, the pressure inflection point is re-determined.
[0063] The load characteristics of a cluster may change due to business changes, node expansion, or seasonal traffic variations. When the fluctuation range of the historical pressure value sequence exceeds a preset threshold, it indicates that the current load characteristics have deviated from historical statistical patterns, and the previously determined pressure inflection point may no longer be applicable. At this time, the system triggers the inflection point recalculation process, re-acquires the real-time pressure values within the most recent preset historical period, re-executes statistical processing, and updates the newly calculated results as the current pressure inflection point.
[0064] The fluctuation range can be measured using dispersion indicators such as the standard deviation, variance, or range of historical pressure value sequences. The threshold is set based on the cluster's historical operating data and operational experience, and can be adjusted according to actual circumstances.
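As a rough sketch of the recalculation trigger, the check below computes one of the three dispersion measures named above and compares it with a threshold. The function name and the `measure` parameter are hypothetical, and the population forms of the standard deviation and variance are assumed.

```python
import statistics

def needs_recalculation(history, threshold, measure="std"):
    """Return True when the dispersion of the history exceeds the threshold."""
    if measure == "std":
        spread = statistics.pstdev(history)
    elif measure == "variance":
        spread = statistics.pvariance(history)
    elif measure == "range":
        spread = max(history) - min(history)
    else:
        raise ValueError(f"unknown dispersion measure: {measure}")
    return spread > threshold
```

When the function returns True, the caller would re-acquire the recent pressure values and recompute the pressure inflection point.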
[0065] Furthermore, it should be noted that when fusing the standard pressure values of the various load metrics into a cluster load value, the fusion weights for different load metrics are not fixed. Cluster load itself can be divided into different load ranges, and the impact of each load metric on the overall cluster operation varies across these ranges. For example, in high load ranges, CPU utilization and GPU utilization have a more significant impact on cluster performance, and their fusion weights increase accordingly; in low load ranges, metrics such as task concurrency reflect the cluster's state more sensitively. In this embodiment, the fusion weights are dynamically adjusted based on the load range in which the cluster load value generated in the previous cycle falls.
[0066] S103. Based on the pressure inflection point of each load index, the real-time pressure value is nonlinearly normalized to obtain the standard pressure value corresponding to each load index.
[0067] It is important to note that directly using the raw real-time pressure values of the various load metrics (such as 85% CPU utilization and 90% memory utilization) for subsequent load value fusion and sampling frequency adjustment leads to numerical distortion. This is because the utilization characteristics of different component resources vary significantly. For example, when CPU utilization exceeds 70%, system response latency has already increased significantly, while memory utilization may still be healthy even at 90%. If the 85% CPU utilization and 90% memory utilization are directly weighted and summed, the memory value masks the already high CPU pressure. Therefore, it is necessary to first apply nonlinear mapping normalization to each load metric, mapping all metrics to a uniform standard pressure value range so that the values of different metrics become comparable.
[0068] It should be noted that when the actual value of the load index approaches or exceeds its pressure inflection point, its standard pressure value shows an accelerated growth trend, rapidly approaching the upper limit of the standard pressure value; when the actual value of the load index is far below the pressure inflection point, the standard pressure value remains at a low level. This design allows the normalized standard pressure value to more sensitively reflect the high load level of the load index.
[0069] Specifically, the step of nonlinearly normalizing the real-time pressure value based on the pressure inflection point of each load index to obtain the standard pressure value corresponding to each load index includes:
[0070] (1) Calculate the ratio of the real-time pressure value of each load index to the corresponding pressure inflection point.
[0071] For any load index, the real-time pressure value of the index collected by S101 is divided by the pressure inflection point corresponding to the index determined by S102 to obtain the ratio R. The ratio R reflects the position of the current real-time pressure value relative to the high load critical level of the index. When R<1, it means that the current pressure value has not yet reached the high load range; when R≥1, it means that the current pressure value has entered or exceeded the high load range.
[0072] Taking CPU utilization as an example, if the current CPU utilization is 70% and the pressure inflection point is 80%, then the ratio R = 70% / 80% = 0.875, indicating that the current CPU pressure is below the inflection point. If the current CPU utilization is 90% and the pressure inflection point is 80%, then the ratio R = 90% / 80% = 1.125, indicating that the current CPU pressure has exceeded the inflection point.
[0073] (2) Perform an exponential operation on the ratio. When the result exceeds the preset upper limit of the standard pressure value, limit the result to the upper limit of the standard pressure value and use it as the standard pressure value of the corresponding load index.
[0074] After obtaining the ratio R, perform an exponential operation on R, for example calculating e^R with the natural constant e as the base, or 10^R with base 10. The exponential coefficient is determined based on the physical characteristics of the load index. Specifically, if an exponential calculation with base e is used, the formula for calculating the standard pressure value can be expressed as: Standard pressure value = min(e^(λ·R), 1), where λ is the exponential coefficient and the min function limits the result to the upper limit of the standard pressure value, which is 1. The role of the exponential operation is that when R is close to or exceeds 1, the calculation result increases rapidly, making the standard pressure value quickly approach the upper limit, thus giving this indicator more attention when fusing load values.
[0075] The exponential coefficients for different load metrics can be set according to their physical characteristics. Memory utilization uses a higher exponential coefficient because system performance deteriorates sharply when memory is near full load, requiring a stronger amplification during normalization. CPU utilization uses a medium exponential coefficient, and task concurrency uses a lower exponential coefficient because the impact of increased task queuing on system performance is relatively gradual. For example, the exponential coefficient for memory utilization can be set to 3-5, for CPU utilization to 2-3, and for task concurrency to 1.5-2.
[0076] After the exponential calculation, the result is compared with the upper limit of the standard pressure value. In this embodiment, the upper limit of the standard pressure value is set to 1, representing the full-load pressure state. When the calculation result exceeds the upper limit of the standard pressure value, the calculation result is restricted to the upper limit of the standard pressure value, that is, the final standard pressure value does not exceed the upper limit value; when the calculation result does not exceed the upper limit, the calculation result is retained as the standard pressure value of the corresponding load index.
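The normalization step can be illustrated with a short sketch. One caveat applies: taken literally, e raised to λ·R is at least 1 for any non-negative R and positive λ, so the capped result would always equal the upper limit. To reproduce the behavior described in the text (low values far below the inflection point, rapid growth near it, saturation at or above it), the sketch assumes the exponent is shifted to λ·(R - 1). That shift is an assumption of this example and is not stated in the source; the function name is also hypothetical.

```python
import math

def standard_pressure(value, inflection, lam, upper=1.0):
    """Nonlinearly normalize a real-time pressure value.

    value      -- real-time pressure value of one load index
    inflection -- pressure inflection point of that index
    lam        -- exponential coefficient (e.g. 2-3 for CPU utilization)
    """
    r = value / inflection  # ratio R from step (1)
    # Assumption: exponent shifted by 1 so that R = 1 maps to the upper limit.
    return min(math.exp(lam * (r - 1.0)), upper)
```

Under this assumption and with λ = 2.5, a CPU utilization of 70% against an inflection point of 80% gives roughly 0.73, while 80% and 90% both saturate at 1.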
[0077] S104. Fuse the standard pressure values of each load indicator to obtain the cluster load value.
[0078] It's important to note that the cluster load fusion strategy incorporates a mechanism that links load partitioning with weights. Cluster load itself can be divided into different load zones, and the impact of various load metrics on the overall cluster performance varies across these zones. In high-load zones, the increase in CPU and GPU utilization directly affects the processing power and response latency of computing tasks, significantly impacting the cluster's user experience. Conversely, in low-load zones, the cluster's computing resources are relatively abundant, and changes in task concurrency better reflect user business needs and cluster activity. Therefore, the fusion weights should not be fixed but dynamically adjusted based on the cluster's current load zone.
[0079] Specifically, the process of fusing the standard pressure values of the various load metrics to obtain the cluster load value includes:
[0080] (1) Obtain the cluster load value generated in the previous cycle, and adjust the fusion weight corresponding to each load index according to the load range in which the cluster load value generated in the previous cycle is located.
[0081] The fusion weight adjustment is based on the cluster load value of the previous period and its load range. The cluster load range can be preset to multiple levels; for example, the standard pressure value range [0, 1] can be divided into three ranges: low load, medium load, and high load. The system obtains the cluster load value generated in the previous period, determines which load range the value falls into, and then adjusts the fusion weight of each load indicator accordingly based on the load range. Upon first execution, if there is no cluster load value generated in the previous period, the cluster load value is initialized to 0 by default (considered a low load range), or equal initial fusion weights are assigned to each load indicator. For example, if there are four load indicators, each has a weight of 0.25.
[0082] It's important to note that the load interval division is not fixed. The system can adaptively adjust the boundaries of each interval based on the distribution characteristics of the cluster's historical load values. During operation, the system continuously collects generated cluster load values, constructs a load value distribution histogram, and determines the boundary values of each interval according to a preset cumulative distribution ratio. For example, the boundary between the low-load and medium-load areas is set at the 33rd percentile of the historical cumulative load value distribution, and the boundary between the medium-load and high-load areas is set at the 67th percentile. The system recalculates the load value distribution at a preset update cycle (e.g., weekly or monthly) and dynamically updates the interval boundaries. When the cluster's overall load level shifts due to business expansion or changes, the updated interval boundaries automatically follow the load distribution changes, ensuring that the load interval division always matches the current actual operating state of the cluster. This adaptive interval division based on historical distribution solves the problem of improper weight allocation caused by fixed intervals under different load characteristics.
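A minimal sketch of the percentile-based interval boundaries follows, using the 33rd and 67th percentiles given as examples above. The function names are hypothetical, and the use of the "inclusive" quantile method is an assumption of this example.

```python
import statistics

def load_range_boundaries(history, low_pct=33, high_pct=67):
    """Derive the low/medium and medium/high boundaries from history."""
    # With n=100, quantiles() returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(history, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]

def load_range(value, boundaries):
    """Classify a cluster load value into 'low', 'medium' or 'high'."""
    low, high = boundaries
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"
```

Re-running `load_range_boundaries` on the accumulated history at each update cycle would let the boundaries follow shifts in the overall load distribution.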
[0083] The specific adjustment strategy is as follows: when the cluster load value of the previous period is in the high load range, increase the fusion weight of CPU utilization and GPU utilization, because these indicators have the most direct impact on cluster performance under high load; when the cluster load value of the previous period is in the medium load range, keep the fusion weight of each indicator balanced; when the cluster load value of the previous period is in the low load range, increase the fusion weight of indicators such as task concurrency, because the change in the number of tasks under low load is an important reference for judging the cluster activity.
[0084] The adjusted fusion weights meet the normalization condition, meaning that the sum of the fusion weights of all indicators remains unchanged, and the adjustment only redistributes the weight ratios among the indicators.
[0085] (2) Based on the adjusted fusion weight, the standard pressure values of each load index are weighted and summed to obtain the cluster load value for the current period.
[0086] The standard pressure values corresponding to CPU utilization, GPU utilization, memory utilization, and task concurrency obtained in step S103 are multiplied by the corresponding fusion weights adjusted in step (1), and the products are summed to obtain the cluster load value for the current period. The weighted summation fuses standardized pressure information from multiple dimensions into a single comprehensive evaluation value according to the importance of each dimension.
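The weighted fusion can be sketched as below. The weight tables are illustrative assumptions, not values from the embodiment; they only follow the stated direction of adjustment (more weight on CPU and GPU under high load, more on task concurrency under low load) and each table sums to 1.

```python
# Hypothetical fusion weights per load range of the previous cycle.
WEIGHTS_BY_RANGE = {
    "low":    {"cpu": 0.20, "gpu": 0.20, "memory": 0.20, "tasks": 0.40},
    "medium": {"cpu": 0.25, "gpu": 0.25, "memory": 0.25, "tasks": 0.25},
    "high":   {"cpu": 0.35, "gpu": 0.35, "memory": 0.20, "tasks": 0.10},
}

def fuse(standard_values, previous_range):
    """Weighted sum of standard pressure values for the current period.

    standard_values -- dict of standard pressure values in [0, 1]
    previous_range  -- 'low', 'medium' or 'high', from the previous cycle
    """
    weights = WEIGHTS_BY_RANGE[previous_range]
    return sum(weights[name] * standard_values[name] for name in weights)
```

Because each weight table sums to 1 and every standard pressure value lies in [0, 1], the fused cluster load value also stays within [0, 1].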
[0087] The cluster load value generated by the above steps is a single value within the range [0, 1]. When the cluster load value is close to 0, it indicates that the cluster is generally idle and the utilization rate of each resource is low; when the cluster load value is close to 1, it indicates that the cluster is generally fully loaded and many resources are approaching their capacity limits. This value will serve as the input for future load prediction. Specifically, when the cluster load value is high, the system predicts that the load will further increase and increases the sampling frequency in advance to capture key data; when the cluster load value is low, the system predicts that the load will not increase significantly in the short term and appropriately reduces the sampling frequency to save collection resources.
[0088] S105. Based on the cluster load value, predict the future load of the cluster at a first time in the future.
[0089] It's important to note that if the sampling frequency is adjusted solely based on the currently monitored load value, the system's adjustments will always lag behind actual load changes. When the cluster load suddenly increases, the sampling frequency cannot keep up immediately, resulting in missing critical data in the initial stages of the load surge; conversely, when the cluster load decreases, the sampling frequency remains high during the transition period, leading to resource waste. The information transmission and processing delay between detecting a change and adjusting the sampling frequency is a problem that neither fixed-frequency sampling nor passive response adjustments can solve.
[0090] This embodiment introduces a prediction mechanism into the adaptive sampling link, so that the change time of the sampling frequency is earlier than the change time of the actual load of the cluster, thereby realizing the transformation from passive response to active prediction.
[0091] Specifically, predicting the future load of the cluster based on the cluster load value includes:
[0092] (1) Obtain the cluster load values at the current time and multiple previous times to form a load value sequence.
[0093] After generating a cluster load value each time, the system arranges the cluster load values of the current time and a preset number of previous time points in chronological order to form a load value sequence. Each element in this sequence corresponds to the cluster load value at a sampling time, and the time interval between adjacent elements is the period of load value fusion in S104. The load value sequence is maintained in a sliding window manner. When a new cluster load value is generated, it is appended to the end of the sequence, while the oldest load value in the sequence is removed to keep the sequence length constant and ensure that the prediction basis reflects the recent load change trend.
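The sliding-window maintenance described above maps naturally onto a bounded double-ended queue. The class name is hypothetical; the sketch only shows the append-and-evict behavior.

```python
from collections import deque

class LoadWindow:
    """Fixed-length sequence of the most recent cluster load values."""

    def __init__(self, size):
        # A deque with maxlen drops its oldest element on overflow.
        self._values = deque(maxlen=size)

    def push(self, load_value):
        """Append the newest cluster load value to the end of the window."""
        self._values.append(load_value)

    def sequence(self):
        """Return the load values in chronological order."""
        return list(self._values)
```

For example, pushing four values into a window of size three leaves only the three most recent values, oldest first.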
[0094] Here, the "first time in the future" refers to a predicted time span extending into the future from the current moment. The length of this time span determines the prediction lead time. The first time can be a pre-set fixed time length, such as 5 seconds, 10 seconds, or 30 seconds, and the specific value can be calibrated according to the response speed of cluster load changes; alternatively, the first time can also be dynamically adjusted, for example, inversely proportional to the rate of change of the cluster load value, that is, the faster the load changes, the shorter the predicted first time, in order to ensure prediction accuracy. In this embodiment, for the sake of simplification, a fixed time length, such as 10 seconds, is preferred.
[0095] (2) The load value sequence is calculated using a preset time-series prediction algorithm to obtain the future load at the first time.
[0096] It should be noted that in many real-world server clusters, load metrics such as CPU utilization, task queue length, and event arrival rate usually exhibit a certain degree of continuity over a short period of time, without irregular or drastic fluctuations. Load changes are therefore predictable over short horizons, and historical data from a recent period can be used to make short-term predictions.
[0097] Furthermore, the prediction algorithm maintains an estimate of the load at the current moment. When a new actual load value arrives, it calculates the deviation between the predicted value and the actual value, i.e., the prediction error, and uses this error to correct the prediction for the next moment. The prediction formula is: Predicted value at the first future moment = Estimated value at the current moment + Adaptive gain × Prediction error at the current moment.
[0098] The adaptive gain determines the speed at which the predicted value responds to errors. The adaptive gain automatically adjusts its magnitude according to changes in error. When the prediction error is large, the system load state may have changed abruptly, and the previous estimate deviates significantly from the actual situation. In this case, the adaptive gain automatically increases, allowing the predicted value to quickly catch up with the changes in the actual measured value. When the prediction error is small, it indicates that the current estimate matches the actual load well, and the observed small deviation is likely just instantaneous noise rather than a true change in load trend. In this case, the adaptive gain automatically decreases, keeping the predicted value smooth and stable, avoiding unnecessary fluctuations caused by noise interference.
[0099] The adaptive gain is limited to a preset range, such as between 0 and 1. Its update rule is positively correlated with the magnitude of the prediction error. In one implementation, let the prediction error at the current time t be e(t) = |actual load value(t) - predicted load value(t)|, and the adaptive gain at the current time be K(t). Then the adaptive gain K(t+1) at the next time is updated by the following formula: K(t+1) = min(max(K(t) + α × e(t), 0), 1).
[0100] Here, α is a preset step size coefficient, and the min and max functions are used to restrict K(t+1) within the interval [0,1]. This update rule ensures that the larger the prediction error, the larger the gain increment, thus correcting the predicted value more quickly; the smaller the prediction error, the more stable the gain tends to be.
[0101] Alternatively, a nonlinear mapping K(t+1)=1-exp(-β×e(t)) can be used, where β is a preset sensitivity coefficient. It should be noted that the above is an example; in actual implementation, a suitable update rule can be selected based on the cluster load characteristics.
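The prediction formula and the two gain update rules can be sketched as follows. The function names are hypothetical. One assumption is made explicit: the gain updates use the absolute error e(t) as defined in the text, while the correction of the estimate uses the signed difference between the actual and estimated values, so that the prediction can move in either direction.

```python
import math

def update_gain_linear(gain, abs_error, alpha):
    """K(t+1) = min(max(K(t) + alpha * e(t), 0), 1)."""
    return min(max(gain + alpha * abs_error, 0.0), 1.0)

def update_gain_exponential(abs_error, beta):
    """Alternative rule: K(t+1) = 1 - exp(-beta * e(t))."""
    return 1.0 - math.exp(-beta * abs_error)

def predict_next(estimate, actual, gain):
    """Predicted value = current estimate + gain * prediction error.

    Assumption: the signed error (actual - estimate) is used here.
    """
    error = actual - estimate
    return estimate + gain * error
```

With a gain of 0.5, an estimate of 0.5 and an actual load of 0.7, the predicted value moves halfway toward the measurement, to about 0.6.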
[0102] It should be noted that the step size coefficient α is not fixed; the system can adaptively adjust its value based on the historical performance of the prediction results. The system records the prediction error sequence over the most recent sampling periods. When the standard deviation of the prediction error sequence continues to increase, it indicates that the current step size coefficient is insufficient to keep up with the rate of load change, and the system automatically increases α by a preset increment step size. When the prediction error sequence continues to fall below the preset error tolerance limit, it indicates that the current step size coefficient may be too large, causing the predicted value to be overly sensitive to instantaneous noise, and the system automatically decreases α by a preset decrement step size. The adjustment range of α is limited to the preset minimum and maximum values to prevent over-adjustment. This adaptive step size adjustment based on prediction performance feedback enables the prediction algorithm to automatically adjust its response sensitivity at different load change stages, maintaining a lower step size during periods of stable load to suppress noise, and increasing the step size during periods of sudden load changes to accelerate the response, achieving a dynamic balance between response speed and smooth stability.
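One possible reading of this step-size adaptation is sketched below. The text does not define how "continues to increase" is detected, so the sketch assumes a simple comparison of the standard deviation of the older and newer halves of the recorded error sequence; the function name, parameter names, and the priority given to the tolerance check are likewise assumptions.

```python
import statistics

def adapt_alpha(alpha, errors, tolerance, step_up, step_down,
                alpha_min, alpha_max):
    """Adjust the step size coefficient from recent prediction errors."""
    half = len(errors) // 2
    older, recent = errors[:half], errors[half:]
    if all(e < tolerance for e in errors):
        # Errors stay within tolerance: alpha may be too sensitive to noise.
        alpha -= step_down
    elif len(older) >= 2 and statistics.pstdev(recent) > statistics.pstdev(older):
        # Error spread is growing: alpha cannot keep up with load changes.
        alpha += step_up
    # Keep alpha inside its preset minimum and maximum.
    return min(max(alpha, alpha_min), alpha_max)
```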
[0103] S106. Adjust the sampling frequency of the cluster data based on the future load, wherein the time of change of the adjusted sampling frequency is earlier than the time of change of the actual load of the cluster.
[0104] It should be noted that adjusting the sampling frequency involves three levels of decision-making. First, the basic sampling frequency is determined based on the future load, establishing a mapping relationship between the load and the frequency. Second, differentiated correction coefficients are applied according to the importance level of the data to be collected, so that data of different importance levels obtain different response amplitudes during frequency adjustment. Finally, during data collection, the collection tasks are grouped and scheduled based on the importance level of the data.
[0105] The following is a detailed explanation. Specifically, adjusting the data sampling frequency of the cluster based on the future load includes:
[0106] (1) Determine the basic sampling frequency based on the future load and the preset load-frequency mapping relationship.
[0107] The load-frequency mapping defines the correspondence between the future load and the base sampling frequency. The design of the mapping follows these principles: in the low load range, the sampling frequency is kept at a low level and rises only gradually as the load changes, in order to save acquisition and transmission resources; when the future load enters the high load range, the sampling frequency rises at an accelerating rate and approaches the highest sampling frequency, to ensure that critical performance data is not lost in high load scenarios.
[0108] Specifically, when the future load increases relative to the current load, the base sampling frequency is increased from the current frequency value towards the preset maximum sampling frequency. The increase is positively correlated with the increase in future load; the greater the increase in future load, the greater the frequency increase. When the future load decreases relative to the current load, the base sampling frequency is decreased from the current frequency value towards the preset minimum sampling frequency. The decrease is positively correlated with the decrease in future load.
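A load-frequency mapping with the stated shape (gradual at low load, accelerating at high load) can be sketched with a power curve. The power-law form and the exponent of 2 are assumptions of this example; the text only fixes the qualitative shape. The default limits of 0.1 Hz and 10 Hz are the example values given in this embodiment.

```python
def base_frequency(future_load, f_min=0.1, f_max=10.0, exponent=2.0):
    """Map a predicted load in [0, 1] to a base sampling frequency in Hz.

    The convex power curve (exponent > 1) is a hypothetical choice.
    """
    load = min(max(future_load, 0.0), 1.0)  # clamp to the valid range
    return f_min + (f_max - f_min) * load ** exponent
```

With this curve, a predicted load of 0.5 yields a frequency well below the midpoint between the two limits, while loads near 1 approach the maximum quickly.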
[0109] It should be noted that the sampling frequency adjustment is not a one-time adjustment to the target value after predicting future load, but rather a gradual, smooth transition within the first time period. Specifically, the system first determines the target frequency change within the first time period, i.e., the difference between the target sampling frequency corresponding to the future load and the current actual sampling frequency; then, it calculates the acceleration of the sampling frequency change. The magnitude of the acceleration is determined based on the target frequency change and the length of the first time period, ensuring that the sampling frequency smoothly reaches the target value at the end of the first time period, rather than abruptly changing after the adjustment command is issued.
[0110] During the smooth transition process, the system starts from the current moment and determines the difference between the actual load and the predicted load at each time point. When the difference does not exceed the preset acceleration adjustment threshold, it indicates that the current acceleration matches the actual load change trend well, and the system maintains the current acceleration and continues to adjust the sampling frequency according to the original acceleration. When the difference exceeds the acceleration adjustment threshold, it indicates that the actual rate of load change has deviated from the predicted rate of change, and the system dynamically adjusts the acceleration. If the actual load change is greater than the predicted value, the acceleration is increased to speed up the frequency adjustment; if the actual load change is less than the predicted value, the acceleration is decreased to slow down the frequency adjustment. The acceleration adjustment range is limited to the preset minimum and maximum acceleration to prevent over-adjustment.
[0111] This smooth transition and dynamic acceleration correction mechanism avoids sudden, large jumps in the sampling frequency during adjustment, making the frequency change smoother and reducing the impact of frequency changes on the acquisition link and equipment. At the same time, by dynamically calibrating the acceleration through real-time difference feedback, it ensures that the frequency adjustment accurately reaches the target value matching the future load at the end of the first time period.
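The transition described in the preceding paragraphs can be sketched in two parts: a ramp that spreads the target frequency change over the first time period, and a correction of the rate of change when the actual load change deviates from the prediction. What the text calls acceleration is treated here as the rate of frequency change. The function names, the multiplicative correction factor, and the treatment of the rate and load changes as magnitudes are assumptions of this example.

```python
def ramp(current_freq, target_freq, first_time, dt):
    """Frequencies at each step of a linear ramp over the first time period."""
    rate = (target_freq - current_freq) / first_time
    steps = round(first_time / dt)
    return [current_freq + rate * dt * i for i in range(1, steps + 1)]

def corrected_rate(rate, actual_change, predicted_change, threshold,
                   factor, rate_min, rate_max):
    """Adjust the rate of frequency change from load feedback.

    rate and both load changes are magnitudes; the ramp direction is
    assumed to be handled by the caller.
    """
    diff = actual_change - predicted_change
    if abs(diff) > threshold:
        # Load moves faster than predicted: speed up; slower: slow down.
        rate = rate * factor if diff > 0 else rate / factor
    # Keep the rate within its preset minimum and maximum.
    return min(max(rate, rate_min), rate_max)
```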
[0112] The minimum and maximum sampling frequencies are preset based on the cluster's acquisition hardware capabilities and data transmission bandwidth. The minimum frequency ensures basic operational status monitoring needs, while the maximum frequency does not exceed the physical limits of the acquisition link.
[0113] As a feasible approach, the minimum sampling frequency Fmin can be set to the lowest sampling frequency required for the cluster to maintain basic operation and maintenance monitoring, such as 0.1 Hz (i.e., sampling once every 10 seconds); the maximum sampling frequency Fmax can be set to the highest frequency that can be supported without congesting the sampling link or overloading the sampling equipment, such as 10 Hz (i.e., sampling once every 0.1 seconds). The specific values of Fmin and Fmax can be calibrated according to actual conditions such as cluster size, network bandwidth, and storage capacity.
[0114] It should be noted that during cluster operation, the number of active nodes may change as nodes become inactive, fail, or are added through expansion. When the number of active nodes increases, the total amount of data to be collected per unit time increases accordingly, raising the total load on the acquisition links; conversely, when the number of active nodes decreases, the total load on the acquisition links decreases. The BMC periodically counts the number of active nodes in the current cluster. When the change in the number of active nodes exceeds a preset threshold, the BMC adjusts the values of Fmin and Fmax in proportion to the change. When the number of active nodes increases, Fmax is appropriately reduced to prevent congestion on the acquisition links; when the number of active nodes decreases, Fmax is appropriately increased to make full use of idle acquisition bandwidth. Fmin is adjusted by the same proportion to maintain the ratio between Fmin and Fmax. This adaptive frequency boundary adjustment based on cluster size allows the sampling frequency constraints to adapt dynamically to changes in the cluster topology, avoiding overload of the acquisition links after cluster expansion and idle acquisition bandwidth after cluster downsizing.
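The boundary adjustment can be sketched as a rescaling by the ratio of node counts. Scaling both limits by the ratio of the old to the new node count is an assumption of this example; the text only states that the limits move in the opposite direction to the node count and keep their ratio. The function name is hypothetical, and any hard physical cap on Fmax is not modeled.

```python
def rescale_bounds(f_min, f_max, old_nodes, new_nodes, change_threshold):
    """Rescale Fmin and Fmax when the active node count changes enough."""
    if abs(new_nodes - old_nodes) <= change_threshold:
        return f_min, f_max  # change too small to act on
    # Assumption: limits scale inversely with the number of active nodes.
    scale = old_nodes / new_nodes
    return f_min * scale, f_max * scale
```

Doubling the number of active nodes from 100 to 200 would halve both limits under this assumption, leaving the ratio between Fmin and Fmax unchanged.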
[0115] Through the above load-frequency mapping relationship, the basic sampling frequency anticipates future load changes. When S105 predicts that the load will increase, the basic sampling frequency has already started to increase before the load actually increases; when it predicts that the load will decrease, the basic sampling frequency is gradually reduced only after the load actually decreases.
[0116] (2) Obtain the correction coefficient corresponding to the data to be collected, use the correction coefficient to adjust the basic sampling frequency, and output the adjusted sampling frequency.
[0117] The data collected in the cluster is diverse, and the importance of different data to cluster operation and maintenance and AI prediction varies significantly. For example, computing power data and latency data directly reflect the cluster's computing performance bottlenecks and business response experience, and are considered high-importance data. Their absence in high-load scenarios will severely impact the accuracy of AI predictions. On the other hand, static or slowly changing information such as hardware firmware version numbers and boot times have relatively lower real-time requirements for sampling frequency. Using a uniform basic sampling frequency for all data will result in insufficient collection of high-importance data, while excessive resources will be allocated to collecting low-importance data, failing to achieve optimal resource allocation.
[0118] To address the aforementioned issues, this embodiment assigns different correction coefficients to the data to be collected at different importance levels. Specifically, the determination of the correction coefficients includes:
[0119] (i) Calculate the prediction error between the predicted load value at the current time and the actual load value at the current time.
[0120] It should be noted that the correction coefficient is affected not only by the importance level of the data but also by the accuracy of the prediction algorithm. Prediction accuracy can be measured by prediction error. Specifically, prediction error is calculated by subtracting the predicted load value from the actual load value at the current moment, and taking the absolute value of the difference as the prediction error. Prediction error reflects the degree of prediction deviation of the algorithm at the current moment. A larger error value indicates a more severe deviation between the predicted and actual values, and a lower reliability of the prediction algorithm; a smaller error value indicates a good match between the predicted and actual values, and a stable and reliable prediction algorithm.
[0121] By using prediction errors, the correction coefficients gain the ability to sense prediction quality, thus enabling them to make compensatory responses to prediction deviations in subsequent frequency adjustments.
[0122] (ii) Adjust the correction coefficient according to the magnitude of the prediction error, wherein the adjusted correction coefficient is positively correlated with the magnitude of the prediction error.
[0123] After obtaining the prediction error value, the correction coefficient is adjusted based on this value. The correction coefficient is positively correlated with the prediction error value; the larger the prediction error value, the larger the correction coefficient; the smaller the prediction error value, the smaller the correction coefficient. For example, the correction coefficient can be calculated using the following formula: C = 1 + γ × e(t), where C is the correction coefficient, e(t) is the prediction error at the current time (absolute value), and γ is a preset proportional coefficient. The proportional coefficient γ can be adaptively adjusted according to the current load range. When the cluster is under low load, the absolute change in load is usually small, and the prediction error is correspondingly small. In this case, the system sets γ to a higher value, making the correction coefficient more sensitive to the prediction error and ensuring that even under low load levels, subtle load changes can be captured promptly. When the cluster is under high load, the absolute change in load is usually large, and the prediction error is correspondingly large. In this case, the system sets γ to a lower value to prevent the correction coefficient from being too large, causing the sampling frequency to exceed the reasonable range or resulting in over-adjustment. When the cluster is under medium load, the system sets γ to an intermediate value to balance response speed and stability. Simultaneously, to prevent the correction coefficient from being too large and causing the frequency to exceed the reasonable range, an upper limit for the correction coefficient can be set, for example, Cmax=2.
[0124] A large prediction error indicates that the accuracy of the load prediction has decreased, and the actual load may be rising or fluctuating faster than expected. In this case, the correction must be strengthened so that the sampling frequency responds to actual changes more quickly and compensates for the risk of data loss caused by the prediction deviation. When the prediction error is small, the predicted value closely matches the actual load, and the current sampling frequency adjustment strategy tracks actual load changes well. In this case, the correction coefficient is kept small so that the sampling frequency remains smooth and stable, avoiding unnecessary frequency fluctuations caused by instantaneous noise.
[0125] After adjusting the correction coefficient, it is combined with the base sampling frequency to obtain the final adjusted sampling frequency. In this process, the adjustment of the sampling frequency integrates the predicted value of future load (which determines the direction and magnitude of the rise and fall of the base frequency), the importance level of the data (which determines the differential magnitude of the frequency adjustment), and the magnitude of the prediction error (which determines the strength of compensation for prediction deviation). These three factors work together to enable the sampling frequency to respond to load changes in advance and to be finely configured according to the data value and prediction reliability.
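One possible way to combine the three factors is sketched below in Python. The text does not specify how the correction coefficient is combined with the base sampling frequency, so the multiplication, the proportional step toward the frequency limits, the importance weight as a separate multiplier, the frequency bounds in Hz, and all names are assumptions for illustration only.

```python
def base_frequency(current_hz: float, current_load: float, future_load: float,
                   f_min: float = 0.1, f_max: float = 10.0, step: float = 0.5) -> float:
    """Move the base frequency toward f_max or f_min depending on the predicted trend."""
    if future_load > current_load:   # load predicted to rise: move toward the highest frequency
        return current_hz + step * (f_max - current_hz)
    if future_load < current_load:   # load predicted to fall: move toward the lowest frequency
        return current_hz - step * (current_hz - f_min)
    return current_hz                # no predicted change: keep the current frequency


def adjusted_frequency(base_hz: float, importance_weight: float, correction: float,
                       f_min: float = 0.1, f_max: float = 10.0) -> float:
    """Scale the base frequency (assumed multiplicative) and clamp it to [f_min, f_max]."""
    raw = base_hz * importance_weight * correction
    return max(f_min, min(raw, f_max))
```

For example, under these assumptions a current frequency of 2 Hz with a predicted load rise gives a base frequency of 6 Hz, and a correction coefficient of 2 would push the result to the 10 Hz ceiling.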
[0126] After the adjusted sampling frequency is determined, the data to be collected must also be scheduled. The number of data items to be collected in the cluster is very large. If all items are collected serially, the acquisition cycle becomes too long to meet the time constraints of high-frequency acquisition. If all items are collected fully in parallel, acquisition requests may become congested because of the limited network bandwidth and the limited processing capacity of the acquisition equipment.
[0127] Therefore, a grouping scheduling strategy based on data importance levels can be adopted, specifically including:
[0128] (1) According to the importance level of the data to be collected, the data to be collected is divided into collection groups of different priorities.
[0129] The system classifies all data items to be collected by importance level according to a preset data importance classification table. Data items at the same importance level are placed in the same collection group, while data items at different importance levels are placed in different collection groups. This creates multiple collection groups with different priorities, in which data with a higher importance level belongs to a higher-priority collection group. For example, computing power data and latency data belong to the highest importance level and are assigned to the first-priority collection group; static information such as hardware firmware version numbers and boot times belongs to a lower importance level and is assigned to a lower-priority collection group.
[0130] (2) Data within the same acquisition group is acquired in parallel.
[0131] Data items within the same collection group have the same importance level and consistent requirements for collection timeliness. During the collection of such data items, the system simultaneously sends collection requests to multiple data sources, and multiple collection operations are executed concurrently to shorten the completion time for collecting all data within the same priority group. The number of concurrent collection operations can be preset to an upper limit based on network bandwidth and the processing capacity of the collection equipment to avoid excessive concurrency leading to network congestion or equipment overload.
[0132] (3) For data between different acquisition groups, the data acquisition instructions of the higher priority acquisition group are sent first, and the data acquisition instructions of the next priority acquisition group are sent after the response is received.
[0133] Scheduling between different acquisition groups is serial. The system processes the highest-priority acquisition group first, sending acquisition instructions to each data source in that group. Only after all data acquisition tasks within that group are completed does the system send acquisition instructions to the group with the next-highest priority. This serial design between groups ensures that, with limited acquisition resources, the most important data is acquired and processed first, and that critical data acquisition is not delayed by resources consumed by low-priority data acquisition. Thus, even under heavy load, highly important data still receives the most timely acquisition response.
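The grouping scheduling strategy of steps (1) to (3) can be sketched in Python as follows. This is only an illustrative sketch: the function name, the representation of priority as an integer in which a smaller number means a higher priority, the `fetch` callable standing in for a real acquisition request, and the default concurrency limit are all assumptions that do not appear in the text.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def collect_by_priority(items, fetch, max_concurrency=4):
    """Collect items group by group.

    items: iterable of (name, priority) pairs; a smaller number is a higher priority.
    fetch: callable that takes an item name and returns its collected value.
    """
    # Step (1): divide the items into collection groups by priority.
    groups = defaultdict(list)
    for name, priority in items:
        groups[priority].append(name)

    results = {}
    order = []
    # Step (3): groups are handled serially, highest priority first.
    for priority in sorted(groups):
        names = groups[priority]
        # Step (2): items in one group are fetched in parallel, with a concurrency cap.
        with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
            # pool.map yields results in input order and waits for each request.
            for name, value in zip(names, pool.map(fetch, names)):
                results[name] = value
        order.append(priority)  # this group is complete before the next one starts
    return results, order
```

The concurrency cap corresponds to the preset upper limit on concurrent collection operations; in practice it would be chosen from the available network bandwidth and the processing capacity of the collection equipment.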
[0134] The method provided in this embodiment replaces the fixed, static pressure inflection point of existing solutions with a mechanism that dynamically determines pressure inflection points from historical real-time data. This allows the normalization benchmark of each load indicator to adapt to changes in cluster load characteristics, solving the problem of distorted load value calculation caused by differences in load characteristics across clusters and time periods. In the load value fusion stage, a load-partitioning and weight-linkage mechanism is introduced: the fusion weights of the load indicators are dynamically adjusted according to the current load range of the cluster, so that the indicator most sensitive to the current load range contributes more to the fusion, further improving how accurately the cluster load value represents the actual operating state. An adaptive-gain load prediction mechanism driven by prediction error uses the prediction error at the current moment to dynamically correct the predicted value for the next moment, so that the prediction algorithm follows load changes quickly and remains smooth when the load is stable. Furthermore, the sampling frequency is adjusted in advance based on the predicted value, so that the sampling frequency changes earlier than the actual cluster load, solving the lag of existing adjustment schemes: by the time the load increases, the sampling frequency has already been raised, and key data is no longer missed. For the adjustment of the sampling frequency, the data importance level and the prediction error value are integrated into a differentiated correction coefficient, so that the sampling frequency of high-importance data changes more strongly with load while that of low-importance data changes relatively gradually. At the same time, the scheduling strategy based on data importance level, parallel within a group and serial between groups, tilts the limited acquisition resources towards high-value data, achieving a dynamic balance between acquisition accuracy and resource consumption.
[0135] Example 2
[0136] Corresponding to the aforementioned embodiment of the adaptive sampling method for cluster data, this application also provides an embodiment of an adaptive sampling device for cluster data.
[0137] Figure 2 is a schematic diagram of the structure of Embodiment 2 of the cluster data adaptive sampling device provided in this application. Referring to Figure 2, the device provided in this embodiment includes an acquisition module 210, a determination module 220, a processing module 230, a fusion module 240, a prediction module 250, and an adjustment module 260.
[0138] The acquisition module 210 is used to collect real-time pressure values corresponding to multiple load indicators of the cluster.
[0139] The determination module 220 is used to determine the pressure inflection point corresponding to each load indicator based on historical real-time pressure values.
[0140] The processing module 230 is used to nonlinearly normalize the real-time pressure values according to the pressure inflection point of each load indicator to obtain the standard pressure value corresponding to each load indicator.
[0141] The fusion module 240 is used to fuse the standard pressure values of the load indicators to obtain the cluster load value.
[0142] The prediction module 250 is used to predict the future load of the cluster at a first future time based on the cluster load value.
[0143] The adjustment module 260 is used to adjust the sampling frequency of the cluster data based on the future load, wherein the time at which the adjusted sampling frequency changes is earlier than the actual load change time of the cluster.
[0144] The apparatus of this embodiment can be used to perform the steps of the method embodiment shown in Figure 1. Its implementation principle and process are similar and will not be repeated here.
[0145] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0146] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0147] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A cluster data adaptive sampling method, characterized in that the method includes: collecting real-time pressure values corresponding to multiple load indicators of the cluster; determining the pressure inflection point corresponding to each load indicator based on historical real-time pressure values; nonlinearly normalizing the real-time pressure values based on the pressure inflection point of each load indicator to obtain the standard pressure value corresponding to each load indicator; fusing the standard pressure values of the load indicators to obtain the cluster load value; predicting the future load of the cluster at a first future time based on the cluster load value; and adjusting the sampling frequency of the cluster's data based on the future load, wherein the time at which the adjusted sampling frequency changes is earlier than the actual load change time of the cluster.
2. The method according to claim 1, characterized in that, The step of nonlinearly normalizing the real-time pressure value based on the pressure inflection point of each load index to obtain the standard pressure value corresponding to each load index includes: Calculate the ratio of the real-time pressure value to the corresponding pressure inflection point for each load indicator; The ratio is subjected to an exponential operation. When the result exceeds the preset upper limit of the standard pressure value, the result is restricted to the upper limit of the standard pressure value and used as the standard pressure value of the corresponding load index.
3. The method according to claim 1, characterized in that the step of determining the pressure inflection point corresponding to each load indicator based on historical real-time pressure values includes: obtaining multiple real-time pressure values of each load indicator within a preset historical time period; statistically processing the multiple real-time pressure values and using the processing result as the pressure inflection point corresponding to the load indicator; and re-determining the pressure inflection point when the fluctuation range of the historical real-time pressure values exceeds a preset threshold.
4. The method according to claim 1, characterized in that the step of fusing the standard pressure values of the load indicators to obtain the cluster load value includes: obtaining the cluster load value generated in the previous cycle, and adjusting the fusion weight corresponding to each load indicator based on the load range in which that cluster load value lies; and, based on the adjusted fusion weights, computing a weighted sum of the standard pressure values of the load indicators to obtain the cluster load value for the current cycle.
5. The method according to claim 1, characterized in that, The prediction of the future load of the cluster based on the cluster load value includes: The cluster load values at the current time and multiple previous times are obtained to form a load value sequence; The load value sequence is calculated using a preset time-series prediction algorithm to obtain the future load at the first time point.
6. The method according to claim 1, characterized in that, Adjusting the sampling frequency of the cluster's data based on the future load includes: Based on the future load and the preset load-frequency mapping relationship, a basic sampling frequency is determined, wherein the load-frequency mapping relationship is as follows: when the future load increases, the basic sampling frequency is increased from the current frequency toward the highest sampling frequency; when the future load decreases, the basic sampling frequency is decreased from the current frequency toward the lowest sampling frequency. Obtain the correction coefficient corresponding to the data to be collected, use the correction coefficient to adjust the basic sampling frequency, and output the adjusted sampling frequency.
7. The method according to claim 6, characterized in that the determination of the correction coefficient includes: calculating the prediction error between the predicted load value at the current moment and the actual load value at the current moment; and adjusting the correction coefficient according to the magnitude of the prediction error, wherein the adjusted correction coefficient is positively correlated with the magnitude of the prediction error.
8. The method according to claim 1, characterized in that, After adjusting the sampling frequency of the cluster data, the method further includes: Based on the importance level of the data to be collected, the data to be collected is divided into collection groups of different priorities; For data within the same acquisition group, a parallel acquisition method is used; For data from different acquisition groups, the data acquisition command of the highest priority acquisition group is sent first, and the data acquisition command of the next priority acquisition group is sent after the response is received.
9. The method according to claim 1, characterized in that the step of collecting real-time pressure values corresponding to multiple load indicators of the cluster includes: obtaining historical communication statistics corresponding to the device to be collected, and sorting the communication collection protocols of the device to be collected according to the historical communication statistics to obtain an initial priority list; sending a probe request to the commonly used ports of the device to be collected, and correcting the initial priority list according to port connectivity to obtain a corrected priority list; and sending protocol handshake packets to the device to be collected one by one in the order of the corrected priority list, and determining the communication collection protocol matching the device to be collected based on the first returned handshake response.
10. A cluster data adaptive sampling device, characterized in that the device includes an acquisition module, a determination module, a processing module, a fusion module, a prediction module, and an adjustment module; the acquisition module is used to collect real-time pressure values corresponding to multiple load indicators of the cluster; the determination module is used to determine the pressure inflection point corresponding to each load indicator based on historical real-time pressure values; the processing module is used to nonlinearly normalize the real-time pressure values according to the pressure inflection point of each load indicator to obtain the standard pressure value corresponding to each load indicator; the fusion module is used to fuse the standard pressure values of the load indicators to obtain the cluster load value; the prediction module is used to predict the future load of the cluster at a first future time based on the cluster load value; and the adjustment module is used to adjust the sampling frequency of the cluster's data based on the future load, wherein the time at which the adjusted sampling frequency changes is earlier than the actual load change time of the cluster.