Method and device for judging air conditioning delivery failure of machine room

By combining the heat recirculation coefficient and weighting function with clustering methods to eliminate the influence of hot air recirculation, the fault point of cold air delivery can be accurately located, solving the problem of inaccurate location in existing technologies and realizing precise early warning and energy-saving control.

CN122241401A · Pending · Publication Date: 2026-06-19 · CHINA MOBILE GROUP DESIGN INST +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA MOBILE GROUP DESIGN INST
Filing Date
2026-01-23
Publication Date
2026-06-19


Abstract

This application provides a method for diagnosing air conditioning supply faults in a data center, comprising: identifying servers with abnormal temperature rise based on the operating parameters of each server, and obtaining the temperature sequence corresponding to the servers with abnormal temperature rise; normalizing the temperature sequence using a heat recirculation coefficient to obtain a temperature sequence matrix, wherein the heat recirculation coefficient is related to the heat of the hot air discharged from the server and the flow rate of the hot air returning to the server inlet; constructing a weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results. This method can identify the location of the abnormal air outlet of the servers with abnormal temperature rise without the need for additional sensors.

Description

Technical Field

[0001] This application relates to the field of data center technology, and in particular to a method and device for diagnosing faults in the cooling air supply system of a computer room.

Background Technology

[0002] With the rapid development of information technology, data centers, as the core carriers of information processing and storage, have seen their operational efficiency and energy consumption control become key research areas. Existing technologies mainly achieve stable operation and energy-saving goals for data centers from two dimensions: temperature monitoring and thermal management optimization. On the one hand, local hotspots are identified through multi-point temperature information clustering: a three-dimensional temperature data matrix is constructed, and clustering algorithms are used to discover potential high-temperature areas within the space, providing early warning information for maintenance personnel. On the other hand, based on server energy consumption prediction models, hot and cold aisle status assessment models, and temperature change prediction models, overall energy consumption optimization and cooling intervention strategies for data centers are designed to improve cooling efficiency and reduce operating costs.

[0003] In terms of temperature monitoring, multiple temperature sensors are typically deployed to collect temperature data from key locations within the data center. Software analysis generates temperature distribution maps, identifies potential localized hotspots, correlates them with potentially affected business systems, and proposes corresponding cooling recommendations. For aisle closure assessment, cold aisle temperature data is acquired and combined with hot aisle temperature prediction models for regression analysis to determine aisle sealing conditions, thereby identifying potential sealing issues early. Simultaneously, temperature models are used to predict cold aisle temperature trends, assisting in implementing cooling interventions. These intervention strategies are adjusted based on subsequent temperature feedback to maintain the data center temperature within a reasonable range. These technologies collectively constitute the current technical framework for data center thermal management and fault early warning, providing crucial support for the safe operation of data centers.

[0004] In existing technologies, methods for identifying local hotspots through multi-point temperature information clustering struggle to accurately reflect temperature changes in the cooling airflow channel, making it impossible to precisely pinpoint cooling air delivery faults. Furthermore, they fail to single out the normal server temperature rise caused by hot air recirculation, which degrades clustering results. Additionally, they do not consider how the temperature difference between the central and edge server inlets after a cooling air anomaly affects the clustering results, leading to inaccurate location. Therefore, there is an urgent need for a method that can accurately identify the location of abnormal cooling air outlets for servers with abnormal temperature rises without requiring additional sensors, enabling precise early warning and energy-saving control.

Summary of the Invention

[0005] This application provides a method for diagnosing air conditioning delivery faults in data centers. This method overcomes the limitations of existing technologies that rely on clustering multi-point temperature information to identify local hotspots. Such existing methods struggle to accurately reflect temperature changes in the air conditioning aisles, making it difficult to pinpoint the fault location. Furthermore, they fail to single out the normal server temperature rise caused by hot air recirculation, which degrades clustering results. Additionally, they do not consider how the temperature difference between the central and peripheral server inlets after an air conditioning anomaly affects the clustering results, leading to inaccurate location. Other methods, such as energy consumption prediction, aisle closure assessment, and cooling fault intervention, do not address detection of server inlet temperature rise caused by air conditioning delivery anomalies, nor can they pinpoint specific abnormal air outlets. Therefore, there is an urgent need for a method that requires no additional sensors, effectively eliminates hot air recirculation interference, accurately identifies the location of air conditioning delivery faults, and classifies servers with abnormal temperature rises, enabling precise early warning and energy-saving control.

[0006] In a first aspect, a method for diagnosing air conditioning supply faults in a data center is provided. The method includes: identifying servers with abnormal temperature rise based on their operating parameters and obtaining their corresponding temperature sequences; normalizing the temperature sequences using a heat recirculation coefficient to obtain a temperature sequence matrix, where the heat recirculation coefficient is related to the heat output of the server's exhaust air and the flow rate of the hot air recirculating back to the server's inlet; constructing a weighting function based on the temperature sequence matrix, and clustering the servers with abnormal temperature rise using the weighting function and a clustering method; and determining the location of the abnormal air outlet of the server with abnormal temperature rise based on the clustering results.

[0007] Based on the methods described above, this invention normalizes the server inlet temperature sequence by introducing a heat recirculation coefficient, effectively eliminating the influence of hot air recirculation on temperature rise. This accurately reflects the actual impact of air conditioning supply failures on the server inlet temperature, significantly improving the accuracy of fault location. Simultaneously, by constructing a weighted function based on the normalized temperature sequence and combining it with clustering methods to classify servers with abnormal temperature rises, the location of abnormal air conditioning outlets can be accurately identified, improving clustering efficiency and result reliability. Furthermore, this method fully utilizes existing server operating parameters, eliminating the need for additional sensor installations, reducing implementation costs and complexity, and achieving efficient early warning and classification of air conditioning supply failures. This contributes to improving data center operational stability and optimizing energy utilization efficiency.

[0008] In conjunction with the first aspect, in some possible implementations of the first aspect, before determining the servers with abnormal temperature rise based on the operating parameters of each server and obtaining the temperature sequence corresponding to the servers with abnormal temperature rise, the method further includes: determining the minimum number of servers affected by an abnormal cold air delivery from a cold air outlet based on the relative position of each server to other servers, the air conditioning temperature, the server's rated power consumption and / or rated speed.

[0009] In conjunction with the first aspect, in some possible implementations of the first aspect, before constructing a weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and clustering methods, and determining the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results, the method further includes: taking the last column of the temperature sequence matrix, sorting it from largest to smallest according to its numerical value, removing the last preset number of servers with abnormal temperature rise in the sorted sequence using data standardization, and removing the preset number of servers with abnormal temperature rise from the temperature sequence matrix.
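
The filtering step in this implementation can be illustrated with a minimal Python sketch. It covers only the sort-and-remove part: the "data standardization" mentioned above is not detailed in the text and is not modelled here. The function name `drop_lowest_final_values`, the server identifiers, and the sample values are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence, Tuple


def drop_lowest_final_values(
    matrix: Sequence[Sequence[float]],
    server_ids: Sequence[str],
    drop_count: int,
) -> Tuple[List[List[float]], List[str]]:
    """Remove the drop_count rows whose last-column value is smallest."""
    if len(matrix) != len(server_ids):
        raise ValueError("matrix and server_ids must have the same length")
    if not 0 <= drop_count <= len(matrix):
        raise ValueError("drop_count must be between 0 and the number of rows")

    # Sort row indices by the last column, largest value first.
    order = sorted(range(len(matrix)), key=lambda i: matrix[i][-1], reverse=True)
    # Keep everything except the tail of the sorted order, then restore
    # the original row order so rows and ids stay aligned.
    kept = sorted(order[: len(order) - drop_count])
    return [list(matrix[i]) for i in kept], [server_ids[i] for i in kept]


# Hypothetical normalized temperature sequences, one row per server.
matrix = [[0.2, 0.9], [0.1, 0.3], [0.3, 1.1], [0.1, 0.2]]
ids = ["s1", "s2", "s3", "s4"]
trimmed, kept_ids = drop_lowest_final_values(matrix, ids, drop_count=2)
print(kept_ids, trimmed)
```

The preset number (`drop_count` here) is an input; the text does not state how it is chosen.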

[0010] In conjunction with the first aspect, in some possible implementations of the first aspect, determining the server with abnormal temperature rise based on the operating parameters of each server and obtaining the temperature sequence corresponding to the server with abnormal temperature rise includes: obtaining the temperature difference between the temperature at the air inlet of each server and the temperature of the cold air output by the air conditioner as the temperature rise value; obtaining the temperature difference between the temperature of the return hot air and the cold air at the air inlet of each server; calculating the proportion of the return hot air flow rate in the air inlet flow rate of each server based on the temperature rise value and the temperature difference; and determining the server with abnormal temperature rise due to heat return based on the proportion.
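
One way to read the proportion calculation above is as a simple mixing model: if the inlet air is a blend of supply cold air and recirculated hot air, the hot-air share equals the temperature rise divided by the hot/cold temperature difference. The sketch below assumes that model (equal specific heat for both air streams, which the application does not state explicitly); the function name and the sample temperatures are hypothetical.

```python
def recirculation_fraction(
    inlet_temp: float, supply_temp: float, return_temp: float
) -> float:
    """Estimate the share of recirculated hot air in a server's inlet airflow.

    Assumed mixing model (not taken verbatim from the application):
        inlet = (1 - p) * supply + p * return
        =>  p = (inlet - supply) / (return - supply)
    """
    rise = inlet_temp - supply_temp   # temperature rise value
    diff = return_temp - supply_temp  # hot/cold temperature difference
    if diff <= 0:
        raise ValueError("return air must be warmer than supply air")
    # Clamp to [0, 1] so measurement noise cannot yield an impossible share.
    return min(max(rise / diff, 0.0), 1.0)


# Hypothetical readings in degrees Celsius.
fraction = recirculation_fraction(inlet_temp=24.0, supply_temp=18.0, return_temp=38.0)
print(fraction)
```

How the resulting proportion is turned into a decision (for example, a threshold) is not specified in the text and is left out of the sketch.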

[0011] In conjunction with the first aspect, in some possible implementations of the first aspect, a weighting function is constructed based on the temperature sequence matrix, and servers with abnormal temperature rise are clustered using the weighting function and a clustering method. The location of the abnormal air outlet of the servers with abnormal temperature rise is determined based on the clustering results, including: constructing a weighting function based on the Euclidean distance between the servers with abnormal temperature rise and the ratio of the temperature rise value at the air inlet of each server to the temperature difference between the return hot air and the cold air at the air inlet of each server; clustering servers with abnormal temperature rise due to cold air delivery failure, so that servers with abnormal temperature rise caused by the same air outlet are clustered as servers with abnormal cold air delivery failure; and determining the location of the abnormal air outlet based on the clustering results.

[0012] In conjunction with the first aspect, in some possible implementations of the first aspect, before normalizing the temperature sequence using the heat recirculation coefficient to obtain the temperature sequence matrix, the method further includes: using the heat of the server exhaust hot air based on distance weighting to characterize the temperature difference between the recirculated hot air and the cold air at the air inlet of each server as the heat recirculation coefficient of each server.

[0013] In a second aspect, a device for diagnosing air conditioning supply faults in a data center is provided. It includes: an acquisition unit for identifying servers with abnormal temperature rise based on the operating parameters of each server and acquiring the corresponding temperature sequence; a processing unit for normalizing the temperature sequence using a heat recirculation coefficient to obtain a temperature sequence matrix, wherein the heat recirculation coefficient is related to the heat of the hot air exhausted from the server and the flow rate of the hot air recirculated to the server inlet; and a determination unit for constructing a weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlet of the server with abnormal temperature rise based on the clustering results.

[0014] Based on the methods described above, this invention identifies servers with abnormal temperature rises by acquiring server operating parameters and normalizes the temperature sequence using a heat recirculation coefficient, effectively eliminating the impact of hot air recirculation on temperature rise and improving the accuracy of cold air delivery fault diagnosis. The processing unit constructs a weighted function and combines it with clustering methods to accurately locate abnormal cold air outlets, improving fault location efficiency and reliability. Furthermore, this device requires no additional sensors, relying solely on existing data to classify and warn of abnormal servers, reducing maintenance costs and energy waste. By sorting and standardizing the normalized temperature values, the influence of hot air recirculation is further eliminated, temperature rise differences are reduced, and the clustering calculation process is optimized, significantly improving the accuracy and practicality of the clustering results and providing strong support for efficient operation and maintenance of data center server rooms.

[0015] In a third aspect, an electronic device is provided that includes at least one processor and is capable of performing the method described above.

[0016] In a fourth aspect, a computer-readable storage medium is provided, having a computer program or instructions stored thereon which, when run on a computer, execute the method according to the first aspect.

[0017] In a fifth aspect, a computer program product is provided which, when run on a computer, executes the method according to the first aspect.

Attached Figure Description

[0018] Figure 1 is a schematic diagram of the cooling mechanism of a data center air conditioning system in the prior art; Figure 2 is a schematic diagram illustrating the area of servers covered by the air vents in the prior art; Figure 3 is a schematic diagram of a method for determining a fault in the cooling supply of a computer room according to an embodiment of this application; Figure 4 is a flowchart illustrating a method for determining a fault in the cooling supply of a computer room according to an embodiment of this application; Figure 5 is a schematic diagram of the structure of a device for determining a fault in the cooling air supply of a computer room according to an embodiment of this application; Figure 6 is a structural block diagram of an electronic device according to an embodiment of this application.

Detailed Implementation

[0019] The technical solutions in this application will now be described with reference to the accompanying drawings.

[0020] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The specific operating methods in the method embodiments can also be applied to the device embodiments or system embodiments. In the description of this application, unless otherwise stated, "multiple" means two or more.

[0021] In the various embodiments of this application, unless otherwise specified or in case of logical conflict, the terminology and / or descriptions of different embodiments are consistent and can be referenced by each other. The technical features of different embodiments can be combined to form new embodiments according to their inherent logical relationship.

[0022] It is understood that the various numerical designations used in this application are merely for descriptive convenience and are not intended to limit the scope of this application. The order of the process numbers does not imply the order of execution; the execution order of each process should be determined by its function and internal logic.

[0023] The terms "first," "second," "third," "fourth," and other various terminology (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0024] To facilitate understanding of the embodiments of this application, the terminology involved in the embodiments of this application will be briefly explained below.

[0025] Heat recirculation coefficient: A normalized parameter used to characterize the temperature difference between the recirculated hot air and the cold air at the server air inlet (or server entrance). Its calculation is based on the heat of the hot air discharged from the server and the weight related to its distance, reflecting the degree of influence of abnormal cold air delivery on the rise in server entrance temperature.

[0026] Temperature sequence matrix: A matrix consisting of normalized temperature sequences from multiple servers with abnormal temperature rises, used for subsequent cluster analysis to identify temperature rise patterns caused by cold air delivery failures.

[0027] Weighting function: A function introduced during the clustering process that combines the Euclidean distance between servers with the correlation of temperature sequences and the difference in temperature rise values, in order to more accurately cluster servers affected by the same abnormal air outlet into one class.

[0028] Euclidean distance: The geometric distance used to measure the difference between the normalized temperature sequences of two servers, serving as a basic metric for judging server similarity in clustering algorithms.
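
As a small illustration of this definition, the Euclidean distance between two normalized temperature sequences can be computed with Python's standard `math.dist`. The two sequences below are hypothetical values, not data from the application.

```python
import math

# Hypothetical normalized temperature sequences for two servers,
# sampled at the same three time points.
seq_a = [1.0, 2.0, 4.0]
seq_b = [1.0, 5.0, 8.0]

# Square root of the summed squared differences: sqrt(0 + 9 + 16).
distance = math.dist(seq_a, seq_b)
print(distance)
```

A smaller distance means the two servers show more similar temperature-rise behaviour over the observation window.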

[0029] Clustering method: A data grouping technique (such as K-medoids) used to classify servers with abnormal temperature rise, identify server groups that share the same cold air delivery failure characteristics, and thus locate the abnormal cold air outlet.
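
For readers unfamiliar with K-medoids, the following is a minimal sketch of the alternating (assign, then update) variant in plain Python. It is not the application's implementation: the initialization with the first `k` points, the default distance `math.dist`, and the sample points are all assumptions made for illustration. The `distance` parameter is left pluggable so that a weighted distance of the kind described in [0027] could be substituted.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def k_medoids(
    points: Sequence[Point],
    k: int,
    distance: Callable[[Point, Point], float] = math.dist,
    max_iter: int = 100,
) -> Tuple[List[int], List[int]]:
    """Return (medoid indices, cluster label for each point)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Precompute all pairwise distances once.
    dist = [[distance(points[i], points[j]) for j in range(n)] for i in range(n)]

    def assign(medoids: List[int]) -> List[int]:
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # assumption: start from the first k points
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to all members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Two well-separated hypothetical groups of servers.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
medoids, labels = k_medoids(points, k=2)
print(medoids, labels)
```

Unlike K-means, each cluster centre is an actual data point (here, an actual server), which makes the result less sensitive to outliers.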

[0030] The CH index (Calinski-Harabasz) is an evaluation criterion used to assess the effectiveness of clustering. It determines the optimal number of clusters by comparing the ratio of inter-cluster dispersion to intra-cluster dispersion.
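
The standard Calinski-Harabasz definition is CH = [B / (k - 1)] / [W / (n - k)], where B is the between-cluster dispersion, W the within-cluster dispersion, k the number of clusters, and n the number of points. The sketch below implements that textbook formula in plain Python; the function name and the one-dimensional sample data are hypothetical.

```python
import math
from typing import List, Sequence


def calinski_harabasz(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    dim = len(points[0])

    def mean(rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)
    if within == 0.0:
        return math.inf  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical one-dimensional data: two obvious groups.
data = [[0.0], [2.0], [10.0], [12.0]]
good = calinski_harabasz(data, [0, 0, 1, 1])  # natural grouping
bad = calinski_harabasz(data, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

To pick the number of clusters, one would typically run the clustering for several candidate values of k and keep the one with the highest score.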

[0031] Based on this, this application provides an early warning method and device for air conditioning delivery faults in data center computer rooms. This overcomes the limitations of existing methods that rely on multi-point temperature information clustering to identify local hotspots, which struggle to accurately reflect temperature changes in the air conditioning channel and to pinpoint the fault location. Furthermore, these methods fail to single out the normal server temperature rise caused by hot air recirculation, which degrades clustering results. They also neglect how the temperature difference between the central and edge server inlets after an air conditioning anomaly affects the clustering results, leading to inaccurate location. Additionally, other methods, such as energy consumption prediction, channel closure status assessment, and cooling fault intervention, do not address detection of server inlet temperature rise caused by air conditioning delivery anomalies, nor can they pinpoint specific abnormal air outlets. Therefore, there is an urgent need for a method that requires no additional sensors, effectively eliminates hot air recirculation interference, accurately identifies the location of air conditioning delivery faults, and classifies servers with abnormal temperature rises, in order to achieve precise early warning and energy-saving control.

[0032] Figure 1 is used to illustrate how servers affected by hot air recirculation arise. Figure 1 is a schematic diagram of the cooling mechanism of a data center air conditioning system in existing technology. In such a system, the air conditioner sends cool air into the static pressure chamber under the raised floor, and then into the cold aisle through the perforated floor. The server racks draw in cool air from the cold aisle to cool the server equipment, while simultaneously exhausting an equal amount of hot air from the back of the rack into the hot aisle. Finally, the hot air returns to the air return vents of the air conditioning system. Ideally, all server racks in the server room should receive sufficient cool air without hot air entering the cold aisle, and the air drawn in by the servers should be the cool air output from the air conditioning system. However, in reality, the hot air exhausted by the server equipment can intrude into the cold aisle from the edges and tops of each row of server racks, causing hot air recirculation.

[0033] Under normal circumstances, the airflow from the perforated floor is large enough that even with some hot air recirculation, the impact on the cold aisle is minor and will not cause significant changes in the server inlet temperature. There are two main factors contributing to the increase in server inlet temperature:
1) The air supply is normal, but the surrounding servers are under heavy load and have high operating power. The temperature of the exhaust hot air rises significantly, and after this air flows back into the cold aisle, the temperature at the server inlet rises.

[0034] 2) Abnormal cold air delivery leads to a decrease in the cold air velocity at the air outlet, a decrease in the proportion of cold air in the air within the cold aisle, and an increase in the proportion of hot air returning, causing the server inlet temperature to rise.

[0035] Existing technologies typically employ unsupervised clustering algorithms such as K-means or K-medoids to cluster servers based on their inlet temperature and relative location, aiming to identify air conditioning malfunctions within server rooms and pinpoint the faulty air vents. However, methods that use multi-point temperature information to cluster and identify local hotspots struggle to accurately reflect temperature changes within the air conditioning aisles, failing to precisely locate the air conditioning malfunction point. Furthermore, these methods do not single out the normal server temperature rise caused by hot air recirculation, which impairs clustering effectiveness. Additionally, they do not consider how the temperature difference between the central and peripheral server inlets after an air conditioning malfunction affects the clustering results, leading to inaccurate location.

[0036] Referring to Figure 2, a diagram of the area of servers covered by the air conditioning vents, the problems of existing technologies in locating servers with abnormal air conditioning are further illustrated.

[0037] Because data center servers have high density, if the server itself or surrounding servers operate at high power, the exhaust hot air temperature is high. Even if the cooling system is functioning normally, the hot air recirculation can still cause the server inlet temperature to rise. Such servers are defined as "hot air recirculation temperature rise servers" and are shown in red in Figure 2. Although there are few of these servers, they act as outliers during clustering and can cause problems such as increased clustering computation and inaccurate clustering results.

[0038] On the other hand, if the air conditioning supply is abnormal, the airflow at the air outlets will decrease significantly. Servers in the central area of the outlets will be most affected, and these are defined as servers in the central area of the abnormal air conditioning supply. Servers in the peripheral areas covered by the air conditioning supply will be less affected due to their proximity to other outlets, and these are defined as servers in the peripheral areas of the abnormal air conditioning supply. In Figure 2, the air supply at the central air outlet 2 is malfunctioning, with a significant decrease in airflow. This results in a substantial reduction in the proportion of cold air entering the air intake of the blue servers in the central area, while the proportion of hot air returning to these servers increases significantly, leading to a rise in inlet temperature. Simultaneously, due to the increased inlet temperature of the blue servers, their power output increases, causing a significant rise in the temperature of the exhaust hot air. This hot air recirculation further elevates the server inlet temperature.

[0039] For the orange servers located at the edge of the air conditioning vents, the proportion of hot air in the airflow at the server inlet is low due to the influence of the cold air output from air conditioning outlets 1 and 3. Furthermore, the increase in server power is not significant, resulting in a small temperature rise in the exhaust hot air. Consequently, the server inlet temperature is significantly lower than that of servers in the central area. Since temperature is an important clustering factor, a large temperature difference can lead to inaccurate clustering results, affecting the identification of the nature of the temperature rise and the location of abnormal air conditioning vents.

[0040] In view of this, this application proposes a method for diagnosing faults in the cooling supply of computer rooms. Referring to Figure 3, the method includes steps S301, S302, and S303.

S301: Identify servers with abnormal temperature rise based on the operating parameters of each server, and obtain the temperature sequence corresponding to the servers with abnormal temperature rise. Specifically, at regular intervals, the server inlet temperature, server exhaust fan speed, and server power information within the cooling range of a duct system are collected. The first and last temperature samples are compared to calculate the server inlet temperature rise value. Servers with temperature rise values exceeding a certain value are counted to obtain the number, location, and corresponding temperature rise value of servers with abnormal temperature rise.

S302: Normalize the temperature sequence using the heat recirculation coefficient to obtain the temperature sequence matrix, where the heat recirculation coefficient is related to the heat of the hot air discharged from the server and the flow rate of the hot air returning to the server inlet. Specifically, based on the server's relative position to other servers, air conditioning temperature, server rated power consumption, and / or rated rotation speed, the heat recirculation coefficient of each server is calculated. This coefficient is expressed as the ratio of the heat contained in the hot air exhausted by the server to the weighted sum of its distances to other servers. The temperature sequences of servers with abnormal temperature rises are then normalized according to the heat recirculation coefficient, forming an N×R temperature sequence matrix.

S303: Construct a weighting function based on the temperature sequence matrix, cluster the servers with abnormal temperature rise using the weighting function and a clustering method, and determine the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results. Specifically, a weighting function is constructed based on the Euclidean distance between servers with abnormal temperature rise and the ratio of the temperature rise value at the air inlet of each server to the temperature difference between the return hot air and the cold air at the air inlet. The K-medoids clustering method is used to cluster the servers with abnormal temperature rise, and the optimal number of clusters is selected through the CH index to finally determine the location of the abnormal cold air outlet.

In some implementations, the method shown in Figure 3, before identifying servers with abnormal temperature rises based on their operating parameters and obtaining their corresponding temperature sequences, further includes: determining the minimum number of servers affected by an abnormal airflow from an air vent, based on each server's relative position to other servers, air conditioning temperature, server rated power consumption, and / or rated rotation speed. This allows for a more accurate identification of the impact range of the abnormal airflow, improving the accuracy of subsequent cluster analysis.

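
As an illustration of step S301, the first-versus-last comparison and the threshold check can be sketched as follows. This is a minimal sketch, not the application's implementation: the function name, the server identifiers, the sample readings, and the 3.0 degree threshold are hypothetical, since the text only says the rise must exceed "a certain value".

```python
from typing import Dict, List, Sequence, Tuple


def find_abnormal_servers(
    inlet_series: Dict[str, Sequence[float]],
    rise_threshold: float,
) -> List[Tuple[str, float]]:
    """Return (server id, temperature rise) for servers whose rise exceeds the threshold."""
    abnormal = []
    for server_id, temps in inlet_series.items():
        if len(temps) < 2:
            continue  # need at least a first and a last sample
        # Temperature rise: last sample minus first sample in the window.
        rise = temps[-1] - temps[0]
        if rise > rise_threshold:
            abnormal.append((server_id, rise))
    return abnormal


# Hypothetical inlet temperatures (degrees Celsius) sampled at regular intervals.
readings = {
    "rack1-u10": [22.0, 22.5, 23.0],
    "rack1-u20": [22.0, 25.0, 28.5],
    "rack2-u05": [21.5, 21.4, 21.6],
}
abnormal = find_abnormal_servers(readings, rise_threshold=3.0)
print(abnormal)
```

The full inlet temperature series of each flagged server would then serve as its temperature sequence for step S302.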
In some implementations, the method shown in Figure 3, before constructing a weighting function based on the temperature sequence matrix, clustering servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlets of the servers with abnormal temperature rise based on the clustering results, further includes: taking the last column of the temperature sequence matrix, sorting it from largest to smallest value, removing the last predetermined number of servers with abnormal temperature rise in the sorted sequence using data standardization, and removing the predetermined number of servers with abnormal temperature rise from the temperature sequence matrix. This effectively removes servers whose temperature rise is caused by hot air recirculation due to their own or surrounding servers' high operating power, avoiding interference with the clustering analysis.

In some implementations, in the method shown in Figure 3, identifying servers with abnormal temperature rises based on their operating parameters and obtaining the corresponding temperature sequences for these servers includes: obtaining the temperature difference between the air inlet temperature of each server and the air conditioner output temperature as the temperature rise value; obtaining the temperature difference between the recirculated hot air and the cold air at the air inlet of each server; calculating the proportion of the recirculated hot air flow rate in the inlet air flow rate of each server based on the temperature rise value and the temperature difference; and determining the servers experiencing temperature rise due to hot air recirculation based on the proportion. This distinguishes temperature rises caused by abnormal cold air delivery from those caused by hot air recirculation, improving the reliability of the assessment.

In some implementations, constructing the weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlet based on the clustering results in the method shown in Figure 3 includes: constructing the weighting function based on the Euclidean distance between the servers and on the ratio of the temperature rise at each server's air inlet to the temperature difference between the return hot air and the cold air at that air inlet; clustering the servers whose inlet temperature increase is caused by an air supply failure, so that servers affected by the same abnormal air outlet fall into a single cluster; and determining the location of the abnormal air outlet based on the clustering results. Cluster analysis can therefore identify areas of abnormal air supply and quickly locate the fault point. In some implementations, before normalizing the temperature sequence using the heat recirculation coefficient to obtain the temperature sequence matrix, the method shown in Figure 3 further includes: using the distance-weighted heat of the servers' exhaust hot air to characterize the temperature difference between the recirculated hot air and the cold air at the air inlet of each server, and taking the result as the heat recirculation coefficient of that server. This reflects the impact of the recirculated hot air on the server inlet temperature more accurately, improving the accuracy of the normalization.
In this way, the method shown in Figure 3 eliminates the impact of recirculated hot air on the server inlet temperature through normalization, and combines a weighting function with a clustering algorithm to identify areas of abnormal cold air delivery, thereby providing early warning before large numbers of servers in the data center overheat or suffer hardware failure.

[0041] Figure 4 is a flowchart illustrating a method for determining a fault in the cooling supply of a computer room according to an embodiment of this application. As shown in Figure 4, the flowchart includes the following 7 steps: Step 1: Calculate the relative-position value and the heat value of all servers. For any server within a duct system, a coefficient is calculated based on the server's position relative to the other servers. The heat contained in the hot air discharged by the server at its rated power is calculated from the air conditioner's cooling temperature, the server's rated power consumption, and the rated fan speed. The results are calculated and stored for all M servers. The minimum number of servers, P, affected by a cold air supply malfunction at a single air vent is then determined.

[0042] Step 2: Collect temperature sequence points for all servers and identify servers with abnormal temperature rise. At regular intervals (e.g., 30 seconds), the server inlet temperature, server exhaust fan speed, and server power information are collected within the cooling range of the air duct system. After R collections, the inlet temperature rise of each server is calculated from the first and the R-th temperature samples, and servers whose temperature rise exceeds a set threshold are recorded, together with their number, location, and corresponding temperature rise values. A sliding window of length R can be used to collect server information continuously.
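
The detection in Step 2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the server identifiers, the sample values, and the `rise_threshold` value are hypothetical, since the source only refers to "a certain value".

```python
def inlet_temperature_rise(samples):
    """Rise between the first and the R-th (last) inlet temperature sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return samples[-1] - samples[0]


def find_abnormal_servers(inlet_temps, rise_threshold):
    """Return {server_id: rise} for servers whose rise exceeds the threshold.

    inlet_temps maps a server id to its R inlet temperature samples (deg C).
    rise_threshold is a hypothetical tuning value, not taken from the source.
    """
    rises = {sid: inlet_temperature_rise(s) for sid, s in inlet_temps.items()}
    return {sid: rise for sid, rise in rises.items() if rise > rise_threshold}


# Hypothetical readings, R = 4 samples per server.
readings = {
    "rack1-u01": [22.0, 22.5, 23.5, 25.0],
    "rack1-u02": [22.0, 22.0, 22.5, 22.5],
    "rack2-u01": [21.5, 23.0, 24.5, 26.0],
}
abnormal = find_abnormal_servers(readings, rise_threshold=2.0)
```

The number N compared with P in the next step is then simply `len(abnormal)`.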

[0043] Step 3: Determine whether there is a cooling supply failure and calculate the temperature sequences of all abnormal servers. If the number N of servers whose temperature rise exceeds the threshold satisfies N > P, a potential cooling supply failure can be assumed. For each server with abnormal temperature rise, the amount by which each of the R collected inlet temperature values exceeds the air conditioning temperature is calculated as a temperature rise value, and these values form the server's temperature sequence. Doing this for all N servers yields the N corresponding temperature sequences.

[0044] Step 4: Calculate the heat recirculation coefficient of each server with abnormal temperature rise and normalize the temperature sequences. For each server with abnormal temperature rise, the heat recirculation coefficient is calculated according to equation (4), and the server's temperature sequence is then normalized with that coefficient according to equation (1). For the aforementioned N servers, N normalized temperature sequences are obtained, forming an N×R temperature sequence matrix.
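
The normalization in Step 4 might look like the sketch below. Equation (1) is not reproduced in this text, so the code assumes that each temperature rise value is divided by the server's heat recirculation coefficient, which matches the ratio described earlier (temperature rise over the hot-air/cold-air temperature difference). The function name and the sample values are hypothetical.

```python
def normalize_sequences(rise_sequences, coefficients):
    """Divide each server's temperature-rise sequence by its coefficient.

    ASSUMPTION: normalization is a plain division; the source's equation (1)
    is not available, so this form is illustrative only.
    Returns (server_ids, matrix), where matrix has N rows and R columns.
    """
    ids = sorted(rise_sequences)
    matrix = []
    for sid in ids:
        coefficient = coefficients[sid]
        if coefficient <= 0:
            raise ValueError(f"non-positive coefficient for {sid}")
        matrix.append([rise / coefficient for rise in rise_sequences[sid]])
    return ids, matrix


# Hypothetical data: N = 2 servers, R = 2 samples.
rises = {"s1": [2.0, 4.0], "s2": [1.0, 3.0]}
coeffs = {"s1": 4.0, "s2": 2.0}
ids, matrix = normalize_sequences(rises, coeffs)
```

Each row of `matrix` is one server's normalized sequence, and the last column is what the removal step sorts.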

[0045]

[0046] Step 5: Remove servers whose abnormal temperature rise is caused by heat recirculation. Take the last column of the temperature sequence matrix. The normalized temperature sequence reflects the proportion of recirculated hot air in the server inlet airflow. A heat recirculation temperature rise server has a higher inlet temperature because its own or surrounding servers' higher operating power raises the exhaust hot air temperature, not because the proportion of recirculated hot air has increased, so its value in this column is significantly smaller than that of a server affected by a cold air supply failure. Sort the column in descending order. Considering that the minimum number of servers affected by an abnormal cold air output from one air outlet is P, take the last P values of the sorted sequence. The values at the very end of the sequence belong to heat recirculation temperature rise servers (assume there are L such servers, with P >> L), while the other values in the sequence are large, so the L heat recirculation temperature rise servers can be removed using a data standardization method such as Z-Score. Removing these servers through data standardization is highly robust: even if a small number of servers with smaller values in the edge area of the cold air anomaly are removed as well, the subsequent clustering and the location of the abnormal air supply vent are not affected. For the heat recirculation temperature rise servers identified in this step, the load of these servers or of surrounding servers can be adjusted to reduce hot air output, or the flow rate of the cold air channels near these servers can be increased to raise the proportion of cold air entering them, lowering the inlet temperature and ensuring stable server operation.
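
One possible reading of the removal in Step 5 is sketched below. The source names Z-Score but does not say how it is applied, so two details are assumptions: the Z-Score is computed over the whole last column, and a candidate among the last P sorted values is dropped when its Z-Score falls below a hypothetical cut-off `z_cut`.

```python
import statistics


def drop_heat_recirculation_servers(ids, matrix, p, z_cut=-1.5):
    """Drop low outliers of the last column, looking only at the P smallest.

    ASSUMPTIONS: z-scores use the mean and population standard deviation of
    the whole column; z_cut is a hypothetical threshold.
    """
    last = [row[-1] for row in matrix]
    mean = statistics.fmean(last)
    std = statistics.pstdev(last)
    if std == 0 or p <= 0:
        return list(ids), [list(row) for row in matrix]
    # Indices sorted by last-column value, largest first.
    order = sorted(range(len(ids)), key=lambda i: last[i], reverse=True)
    candidates = order[-p:]  # the P smallest values
    removed = {i for i in candidates if (last[i] - mean) / std < z_cut}
    kept = [i for i in range(len(ids)) if i not in removed]
    return [ids[i] for i in kept], [matrix[i] for i in kept]


# Hypothetical matrix: six supply-fault servers and one low outlier (s7).
server_ids = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
last_values = [0.80, 0.82, 0.78, 0.85, 0.79, 0.81, 0.20]
norm_matrix = [[0.4, v] for v in last_values]
kept_ids, kept_matrix = drop_heat_recirculation_servers(server_ids, norm_matrix, p=3)
```

Restricting the check to the last P values means that at most P servers can be removed, however the cut-off is tuned.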

[0047] Step 6: Construct a weighting function using the server temperature sequences. After the L heat recirculation temperature rise servers are removed, the original temperature sequence matrix becomes an (N-L)×R matrix.

[0048] All (N-L) remaining servers experienced temperature increases due to abnormal airflow. Because several air outlets within the cooling range of a single duct system may deliver air abnormally, and to different degrees, clustering the servers can identify the cause and extent of the temperature rise in each server cluster and associate each cluster with the location of an abnormal air outlet. A weighted Euclidean distance is therefore defined by equation (2), in which the Euclidean distance between any two of the (N-L) servers with abnormal temperature rise is computed from the normalized temperature sequences of the two servers. Considering that, after an air vent becomes abnormal, the servers in the central and peripheral areas of its cooling range show the same temperature rise trend and differ little in normalized temperature, a weighting function is constructed by equation (3) from two quantities. The first is the Pearson coefficient of the two sequences, which is closer to 1 the more strongly the changes of the two sequences are correlated and smaller the weaker the correlation. The second is the absolute value of the difference between the normalized temperature rise values of the R-th samples of the two servers. This difference is small when the temperature rise of both servers is caused by abnormal cooling output from the same vent, and larger otherwise. Using this weighting function therefore makes it easier to cluster servers whose inlet temperature rises because of the same abnormal vent into one group.
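
The sketch below illustrates the idea behind the weighted distance. Equations (2) and (3) are not reproduced in this text, so the weight used here, (1 - Pearson coefficient) plus the absolute last-sample difference, and its multiplication with the Euclidean distance are assumed forms chosen only to show the stated behavior: strongly correlated sequences with similar final values end up close together.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def weighted_distance(x, y):
    """Euclidean distance scaled by an ASSUMED weight (not the source's eq. 3)."""
    euclid = math.dist(x, y)
    # Clamp at zero to absorb floating-point overshoot of the correlation.
    weight = max(0.0, 1.0 - pearson(x, y)) + abs(x[-1] - y[-1])
    return weight * euclid


# Hypothetical normalized sequences (R = 3).
same_vent_a = [0.10, 0.30, 0.60]
same_vent_b = [0.12, 0.33, 0.62]
other_trend = [0.60, 0.40, 0.10]
near = weighted_distance(same_vent_a, same_vent_b)
far = weighted_distance(same_vent_a, other_trend)
```

With these inputs `near` is far smaller than `far`, which is the property the clustering step relies on.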

[0049] Step 7: Calculate the weighted Euclidean distance between servers, and determine the nature of the temperature rise and the location of the air conditioning failure point using a clustering method and an evaluation metric. The weighted Euclidean distance between the (N-L) servers is calculated, and an upper limit on the possible number of clusters is estimated. The K-medoids clustering method is used to cluster the servers, each candidate number of clusters is evaluated with the CH (Calinski-Harabasz) index, and the clustering of the temperature rise servers with the best CH index is selected. The relative location of the cold air delivery failure point is then output and an early warning is given.
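
Step 7 can be illustrated with a small standard-library sketch. Several choices here are assumptions rather than details from the source: the simple alternating K-medoids variant, the farthest-first seeding, computing the CH index on the sequences themselves, and using the plain Euclidean distance as a stand-in for the weighted distance. The sample points and the upper limit `k_max` are hypothetical.

```python
import math


def farthest_first(dist, k):
    """Deterministic seeding: start at point 0, then add the farthest point."""
    chosen = [0]
    while len(chosen) < k:
        rest = [i for i in range(len(dist)) if i not in chosen]
        chosen.append(max(rest, key=lambda i: min(dist[i][m] for m in chosen)))
    return chosen


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = farthest_first(dist, k)
    labels = [0] * n
    for _ in range(max_iter):
        # Assign every point to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Move each medoid to the member with the smallest total distance.
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members))
                           if members else medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return medoids, labels


def ch_index(points, labels):
    """Calinski-Harabasz index: between-cluster over within-cluster dispersion."""
    n, dim = len(points), len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        return float("-inf")
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centre = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * math.dist(centre, overall) ** 2
        within += sum(math.dist(p, centre) ** 2 for p in members)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def best_clustering(points, dist, k_max):
    """Try k = 2..k_max and keep the clustering with the highest CH index."""
    best = None
    for k in range(2, k_max + 1):
        medoids, labels = k_medoids(dist, k)
        score = ch_index(points, labels)
        if best is None or score > best[0]:
            best = (score, k, medoids, labels)
    return best


# Hypothetical normalized sequences (R = 2) from two abnormal-vent regions.
points = [(0.10, 0.30), (0.12, 0.32), (0.11, 0.29),
          (0.60, 0.90), (0.62, 0.88), (0.61, 0.91)]
dist = [[math.dist(p, q) for q in points] for p in points]  # stand-in distance
score, best_k, medoids, labels = best_clustering(points, dist, k_max=3)
```

Each resulting cluster would then be mapped to the air outlet that serves its servers; that mapping depends on room layout data that this text does not describe.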

[0050] This invention therefore analyzes the airflow at the server inlet and proposes a server heat recirculation coefficient. It uses a normalized temperature rise value at the server inlet to characterize the proportion of cold air, eliminating the influence of recirculated hot air on the inlet airflow. The proportion of cold air in the server inlet air is used as the key clustering factor, and the correlation and temperature difference of the temperature sequences are used as weighting factors to calculate the weighted Euclidean distance between servers. This is effective for removing cluster outliers and for grouping the central and peripheral servers of an abnormal cold air region together, and it lays the foundation for reducing the computational load and increasing the accuracy of clustering. The method can determine the nature of an abnormal server inlet temperature rise and quickly classify the abnormal servers. For servers whose abnormal temperature rise is caused by a cold air delivery failure, it can identify the point of failure, guiding service personnel to troubleshoot the problem as quickly as possible. For inlet temperature rises caused by recirculated hot air, the load of the server or surrounding servers can be adjusted, or the cold air output of nearby air ducts can be increased. These measures allow precise control, significantly reducing the probability of server downtime and avoiding the energy waste caused by blindly lowering the overall temperature of the data center or increasing the cold air supply.

[0051] Figure 5 is a schematic diagram of a device for diagnosing air conditioning supply faults in a computer room, provided in an embodiment of this application. As shown in Figure 5, the device includes an acquisition unit 501, a processing unit 502, and a determination unit 503. The acquisition unit 501 is used to identify servers with abnormal temperature rise based on the operating parameters of each server, and to acquire the temperature sequences corresponding to those servers. Specifically, the acquisition unit 501 can collect information such as each server's air inlet temperature, exhaust fan speed, and power, calculate its inlet temperature rise value, and select the servers whose temperature rise value exceeds a set threshold, thereby obtaining the servers with abnormal temperature rise and their temperature sequences. The processing unit 502 is used to normalize the temperature sequences using a heat recirculation coefficient to obtain a temperature sequence matrix. Specifically, the processing unit 502 calculates the heat recirculation coefficient based on the heat contained in the servers' exhaust air and the distances between servers, and normalizes the temperature sequences with this coefficient to eliminate the influence of recirculated hot air on the server inlet temperature, thereby constructing an N×R temperature sequence matrix. The heat recirculation coefficient is related to the heat content of the server's exhaust air and the flow rate of the recirculated hot air at the server inlet. The determination unit 503 is used to construct a weighting function based on the temperature sequence matrix, to cluster the servers with abnormal temperature rise using the weighting function and a clustering method, and to determine, based on the clustering results, the location of the abnormal air outlet affecting those servers.
Specifically, the determination unit 503 can define a weighted Euclidean distance, construct a weighting function by combining the Pearson coefficient and the absolute value of the normalized temperature rise difference, cluster the servers using the K-medoids clustering method, and determine the optimal number of clusters using the CH index, thereby identifying the clusters of servers whose abnormal temperature rise is caused by an air supply failure and locating the abnormal air outlet accordingly. In some implementations, the K-medoids clustering method is used with the Calinski-Harabasz (CH) index as the evaluation standard for the number of clusters, and the clustering with the highest CH index is selected as the optimal result. This implementation improves clustering accuracy while removing outliers that harm clustering, such as heat recirculation temperature rise servers. In some implementations, the heat recirculation coefficient is calculated by formula (4), whose terms are: the heat contained in a server's exhaust hot air; the heat contained in that server's exhaust hot air at rated power; the distance between two servers; a constant; and M, the total number of servers within a duct system. This implementation characterizes the proportion of recirculated hot airflow between servers more accurately, improving the accuracy of the subsequent normalization.

[0052] In some implementations, the normalized temperature sequence reflects the proportion of recirculated hot air in the server inlet airflow. For a heat recirculation temperature rise server, whose inlet temperature rises because its own or surrounding servers' high operating power raises the exhaust hot air temperature, the normalized temperature rise value is significantly smaller. This implementation can remove the L heat recirculation temperature rise servers using a data standardization method such as Z-Score, reducing interference with the subsequent cluster analysis. In some implementations, the structure shown in Figure 5 may also include a weighting function module, which constructs a weighted Euclidean distance based on the temperature sequences. The weighting function is given by expression (5), in which one term is the Pearson coefficient of the two temperature sequences and the other is the absolute value of the difference between the normalized temperature rise values of the R-th samples of the two servers. This implementation strengthens the clustering of servers whose temperature rise is caused by the same abnormal air outlet, improving the accuracy of locating air conditioning fault points. The method of this application is described below with reference to specific embodiments.

[0053] Example 1 In some embodiments, the present invention provides an early warning method for air conditioning supply failures in data center computer rooms, specifically including the following steps: First, calculate the relative position and heat values of all servers. For any server within the air duct system, calculate a distance weighting coefficient based on its position relative to other servers; at the same time, based on the air conditioning output temperature, the server's rated power consumption, and the rated speed of the cooling fan, calculate the heat contained in the hot air discharged by the server at its rated power, and store this data. Furthermore, determine the minimum number of servers, P, that may be affected if an air conditioning vent malfunctions.

[0054] Next, temperature sequence points for all servers are collected to determine whether any servers exhibit abnormal temperature rise. At regular intervals (e.g., 30 seconds), the inlet temperature, exhaust fan speed, and power consumption of all servers within the air duct system are collected. After R data points have been collected, the temperature rise value at the inlet of each server is calculated by comparing the first and R-th temperature data points. Servers with temperature rise values exceeding a set threshold are identified, and their number, location, and corresponding temperature rise values are recorded. A sliding window approach can be used to collect server information continuously and monitor temperature rise trends dynamically.

[0055] Subsequently, it is determined whether there is a possible cooling supply failure, and the temperature sequence of all servers with abnormal temperature rise is calculated. If the number N of servers with temperature rise exceeding a set threshold is greater than the preset minimum number of affected servers P, it is preliminarily determined that there may be a cooling supply failure. For these servers with abnormal temperature rise, the temperature rise value of their inlet temperature exceeding the cooling temperature in R samplings is further calculated to obtain the temperature sequence of each server.
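
The check and the sequence construction described above can be sketched as follows. The supply temperature, the value of P, and the sample data are hypothetical; the source gives no concrete figures.

```python
def temperature_rise_sequence(inlet_samples, supply_temp):
    """Per-sample excess of the inlet temperature over the AC supply temperature."""
    return [t - supply_temp for t in inlet_samples]


def supply_fault_suspected(abnormal_ids, min_affected):
    """True when the count N of abnormal servers exceeds P (min_affected)."""
    return len(abnormal_ids) > min_affected


supply_temp = 20.0  # hypothetical AC output temperature, deg C
abnormal_samples = {  # hypothetical inlet samples of the abnormal servers, R = 3
    "s1": [22.0, 23.0, 25.0],
    "s2": [21.0, 23.5, 26.0],
    "s3": [22.5, 24.0, 25.5],
}
P = 2  # hypothetical minimum number of servers affected by one faulty vent

sequences = {}
if supply_fault_suspected(abnormal_samples, P):
    sequences = {
        sid: temperature_rise_sequence(samples, supply_temp)
        for sid, samples in abnormal_samples.items()
    }
```

Here N = 3 exceeds P = 2, so a sequence of R temperature rise values is built for every abnormal server.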

[0056] Next, the heat recirculation coefficient of the server with abnormal temperature rise is calculated, and the temperature series is normalized. The heat recirculation coefficient is determined by the ratio of the heat output of the server's exhaust air to the heat output of its exhaust air at its rated power, combined with a distance weighting factor with other servers. By normalizing the temperature series, the influence of inlet temperature rise caused by differences in hot air recirculation temperature is eliminated, retaining only the variation characteristics of the proportion of cold air, thereby improving the accuracy of subsequent cluster analysis.
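
A heavily hedged sketch of such a coefficient is shown below. Formula (4) is not reproduced in this text, so the code assumes one plausible form: an inverse-distance-weighted mean of each neighboring server's ratio of current exhaust heat to rated exhaust heat. The function name, the inverse-distance weights, and the sample values are all illustrative assumptions.

```python
import math


def heat_recirculation_coefficient(i, positions, heat, rated_heat):
    """Inverse-distance-weighted mean of heat[j] / rated_heat[j] over j != i.

    ASSUMED form for illustration only; it is not the source's formula (4).
    Positions must be distinct, otherwise a distance of zero is divided by.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for j, position in enumerate(positions):
        if j == i:
            continue
        weight = 1.0 / math.dist(positions[i], position)
        weighted_sum += weight * heat[j] / rated_heat[j]
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical layout: three servers on a line, positions in metres.
positions = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
heat = [100.0, 200.0, 300.0]        # current exhaust heat, W
rated_heat = [400.0, 400.0, 400.0]  # exhaust heat at rated power, W
coefficient_0 = heat_recirculation_coefficient(0, positions, heat, rated_heat)
```

Under this assumed form, nearby servers running close to rated power raise the coefficient more than distant or lightly loaded ones.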

[0057] Next, servers with abnormal heat recirculation temperature rise are removed. The last column of the normalized temperature sequence matrix is taken and sorted from largest to smallest. Considering that abnormal cold air delivery leads to larger normalized temperature rise values for many servers, while the normalized temperature rise values of heat recirculation servers are smaller, the last P values of the sorted sequence are selected as potential heat recirculation servers. Heat recirculation servers among them are then removed using a standardization method such as Z-Score, so that subsequent clustering is not affected.

[0058] Subsequently, a weighted Euclidean distance function is constructed using the server temperature sequences. After the servers experiencing heat recirculation-induced temperature rise are removed, the remaining (N-L) servers all exhibit temperature rises caused by abnormal airflow. To identify the server clusters in different abnormal airflow regions, a weighted Euclidean distance is defined that incorporates the Pearson correlation coefficient and the normalized temperature difference as weighting factors. As a result, servers in the same abnormal airflow region have smaller distance values and are more easily clustered into one class.

[0059] Finally, the weighted Euclidean distance between all servers is calculated, and the servers are clustered using the K-medoids clustering method. The effectiveness of different numbers of clusters is evaluated using the CH (Calinski-Harabasz) index, and the clustering with the highest CH value is selected as the optimal result. This allows the relative location of the air conditioning delivery fault point to be determined and an early warning signal to be issued. This method distinguishes air conditioning delivery anomalies from heat recirculation temperature rise, improving the accuracy and efficiency of air conditioning delivery fault detection.

[0060] Example 2 In some embodiments, this invention provides an early warning method for cooling air supply failures in data center computer rooms. The core of this method lies in normalizing the server inlet temperature to eliminate the impact of heat recirculation on temperature rise, and then performing cluster analysis on the normalized temperature to identify temperature rise problems caused by abnormal cooling air supply. The specific implementation steps are as follows: First, calculate the relative position and heat values of all servers. For any server within an air duct system, calculate its distance weighting coefficient based on its distance from other servers. At the same time, calculate the heat contained in the exhaust air from the server at its rated power, based on the air conditioner's output temperature, the server's rated power consumption, and the cooling fan's rated speed. Perform these calculations for all M servers and store the results, then determine the minimum number of servers, P, affected by an abnormality in the cooling air delivery from a single air outlet.

[0061] Subsequently, temperature sequence points for all servers are collected to determine whether any temperature rise anomalies exist. At regular intervals (e.g., 30 seconds), the inlet temperature, exhaust fan speed, and power information of the servers within the cooling range of the air duct system are collected. After R data sets have been collected, the inlet temperature rise value of each server is calculated by comparing the first and R-th temperature data sets. Servers with temperature rise values exceeding a set threshold are identified, giving the number, location, and corresponding temperature rise values of the servers with temperature rise anomalies. A sliding window approach can be used to collect server information continuously for dynamic monitoring of temperature rise changes.

[0062] If the number N of servers with temperature rise exceeding the threshold is greater than the preset minimum number of affected servers P, then a cooling supply failure is considered possible. For these servers with abnormal temperature rise, the temperature rise value of their inlet temperature exceeding the cooling temperature at each moment during R data collection processes is calculated to obtain the temperature sequence of each server.

[0063] Next, the heat recirculation coefficient of the servers with abnormal temperature rise is calculated, and the temperature sequence is normalized. Based on the heat recirculation coefficient formula, and considering factors such as the heat output of the server's exhaust air and the distance between servers, the heat recirculation coefficient for each server with abnormal temperature rise is calculated. Then, the temperature sequence of each server is normalized according to the heat recirculation coefficient, thereby eliminating inlet temperature fluctuations caused by differences in recirculated hot air temperature, retaining only information reflecting the proportion of cold air flow, forming an N×R normalized temperature sequence matrix.

[0064] Next, servers with abnormal temperature rise due to heat recirculation are removed. The last column of the normalized temperature sequence matrix is taken and sorted from largest to smallest. Considering that the number of servers affected by cold air delivery anomalies is P, while the number L of servers with temperature rise due to heat recirculation is much smaller than P, the last L values after sorting are taken as the servers with temperature rise due to heat recirculation. These servers are then removed using a standardization method such as Z-Score to avoid interfering with the subsequent cluster analysis.

[0065] After the servers experiencing temperature rise due to heat recirculation are removed, a weighted Euclidean distance function is constructed. For the remaining (N-L) servers, a weighted Euclidean distance is constructed from their normalized temperature sequences, with the weights determined by the Pearson correlation coefficient and the absolute value of the normalized temperature difference. This weighting function makes servers whose temperature rise is caused by an abnormality at the same air outlet stand out, so that they are grouped into the same cluster during clustering.

[0066] Finally, the weighted Euclidean distance between servers is calculated, and the K-medoids clustering method is used to cluster the servers. The CH (Calinski-Harabasz) index is used to evaluate different cluster numbers, and the clustering result with the largest CH index is selected as the optimal clustering scheme. This determines the category of each server with abnormal temperature rise and the location of the corresponding air conditioning supply failure point, thus enabling early warning of air conditioning supply failures.

[0067] Example 3 In some embodiments, this invention provides a method for identifying servers with abnormal temperature rise, comprising the following steps: first, collecting the server inlet temperature sequence and calculating its temperature rise value; then, analyzing the temperature rise value based on a sliding window mechanism to determine whether an abnormal temperature rise exists. Specifically, within an air duct system, the server's inlet temperature, exhaust fan speed, and server power information are collected at regular intervals (e.g., 30 seconds). After collecting R data sets, the server inlet temperature rise value is calculated by comparing the first and Rth temperature data sets. If the temperature rise value of a server exceeds a preset threshold, it is marked as a server with abnormal temperature rise.

[0068] Furthermore, for all servers marked as having abnormal temperature rise, a sliding window mechanism is used to continuously collect their temperature data. For example, the sliding window length is set to R. Each time new temperature data is collected, the data within the window is updated and the temperature rise value is recalculated, thereby dynamically determining whether the server is still in an abnormal temperature rise state. This sliding window mechanism can effectively capture temperature change trends and improve the real-time performance and accuracy of anomaly identification.
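
The sliding window described here maps naturally onto a fixed-length queue. The sketch below is one possible realization; the class name `RiseMonitor`, the window length, and the threshold are hypothetical.

```python
from collections import deque


class RiseMonitor:
    """Keeps the last R inlet temperatures of one server."""

    def __init__(self, window_len, rise_threshold):
        # A deque with maxlen discards the oldest sample automatically.
        self.samples = deque(maxlen=window_len)
        self.rise_threshold = rise_threshold

    def add(self, temperature):
        """Record one sample; return True while the rise is abnormal."""
        self.samples.append(temperature)
        if len(self.samples) < self.samples.maxlen:
            return False  # window not yet full
        return self.samples[-1] - self.samples[0] > self.rise_threshold


# Hypothetical stream of inlet temperatures for one server, R = 3.
monitor = RiseMonitor(window_len=3, rise_threshold=2.0)
states = [monitor.add(t) for t in [22.0, 22.5, 23.0, 25.5, 25.5, 25.5]]
```

In this run the server is flagged while the jump to 25.5 is inside the window and unflagged once the window holds only the new, stable temperature, which shows that the first-versus-last comparison tracks the rise rather than the absolute level.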

[0069] Subsequently, the number N of servers with abnormal temperature rises was compared with a preset threshold P. When N was greater than P, a preliminary judgment was made that there might be a cooling air delivery failure. For these servers with abnormal temperature rises, R temperature data points were collected, and the corresponding temperature sequences were calculated for subsequent analysis.

[0070] To distinguish more accurately between temperature rise anomalies caused by cold air delivery failures and those caused by hot air recirculation, the temperature sequences of the servers with abnormal temperature rise are normalized. The normalized temperature rise value reflects the proportion of recirculated hot airflow; a smaller normalized temperature rise value indicates a server experiencing hot air recirculation. Specifically, the heat recirculation coefficient of each server is calculated and used to normalize the temperature sequence, eliminating the influence of recirculated hot air on the server inlet temperature and retaining only the temperature rise attributable to cold air delivery anomalies.

[0071] Next, servers with abnormal temperature rises due to heat reflow are removed. By sorting the normalized temperature sequences and using standardization methods such as Z-Score, servers whose temperature rises are caused by high operating power of themselves or surrounding servers are identified and removed to avoid interference with subsequent clustering analysis.

[0072] After outlier removal, a weighted Euclidean distance function was constructed using the temperature sequences of the remaining servers with abnormal temperature rises. This function combines the correlation and temperature difference of the temperature sequences as weighting factors to improve the clustering effect. Subsequently, the K-medoids clustering algorithm was used to cluster the servers, and the CH index was used to evaluate the clustering effect under different numbers of clusters. The clustering result with the largest CH index was selected as the optimal clustering scheme.

[0073] Finally, based on the clustering results, the distribution characteristics of servers with abnormal temperature rise were determined, the possible locations of air conditioning supply failure points were identified, and early warning signals were issued to guide maintenance personnel to promptly investigate and handle air conditioning supply failures, thereby achieving efficient operation and maintenance and fault prevention of data center computer rooms.

[0074] Example 4 In some embodiments, the present invention provides a method for eliminating servers with abnormal heat recirculation temperature rise, specifically including the following steps: First, the relative positions and heat values of all servers within the air duct system are calculated to determine the heat recirculation coefficient of each server. This heat recirculation coefficient is calculated based on factors such as the distance between the server and other servers, the heat contained in the exhaust hot air, and the server's rated power consumption, thereby characterizing the proportion of recirculated hot air in the server's inlet airflow.

[0075] Next, temperature sequence points of all servers are collected, and servers with abnormal temperature rise are determined based on the temperature rise value. At regular intervals (e.g., 30 seconds), the air inlet temperature, exhaust fan speed, and power information of the servers are collected. Data changes are continuously monitored using a sliding window method, and servers whose temperature rise value exceeds the set threshold are counted, and their number, location, and temperature rise value are recorded.

[0076] When the number of servers with temperature rise exceeding the threshold is greater than the preset minimum number of affected servers P, a cold air delivery failure is considered possible. In this case, the temperature sequences of all servers with abnormal temperature rise are processed to calculate their normalized temperature sequences in order to eliminate the impact of inlet temperature rise caused by changes in return hot air temperature.

[0077] Subsequently, the normalized temperature sequences are analyzed, and the last column of data is extracted and sorted from largest to smallest. Because the normalized temperature rise values of servers with abnormal heat recirculation temperature rise are small, their values are usually located at the end of the sorted sequence. The last P values of the sorted sequence are taken, and the few values corresponding to servers with abnormal heat recirculation temperature rise are removed using Z-Score standardization.

[0078] After removing servers with abnormal heat return temperature rise, a weighted Euclidean distance function is constructed using the temperature sequences of the remaining servers. This function is then combined with the Pearson correlation coefficient and the normalized absolute value of the temperature rise difference as weighting factors to further improve clustering accuracy. The K-medoids clustering method is used to cluster the servers, and the optimal number of clusters is selected using the CH index. Finally, the location of the air conditioning supply failure point is output, enabling early warning and location of air conditioning supply failures in the data center.

[0079] Example 5 In some embodiments, this invention proposes a data center air conditioning delivery fault early warning method based on a weighted Euclidean distance construction method. This method first constructs a normalized temperature sequence matrix, then defines a weighted Euclidean distance and designs a corresponding weighting function to more accurately identify server inlet temperature rises caused by abnormal air conditioning delivery.

[0080] Step 1: Construct a normalized temperature sequence matrix. For every server in a duct system, calculate the heat contained in the hot air that the server discharges at rated power, using the air conditioning cooling temperature, the server's rated power consumption, and the cooling fan speed. Then calculate the heat recirculation coefficient of each server from its position relative to the other servers using formula (2). Normalize the server inlet temperature, that is, the temperature rise value, by the heat recirculation coefficient to obtain N normalized temperature sequences. These form an N×R temperature sequence matrix, where N is the number of servers with abnormal temperature rise and R is the number of data collection time points.
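
Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heat recirculation coefficient is taken here as the exhaust heat of the other servers weighted by inverse distance, which is one plausible reading of the distance-weighted exhaust heat mentioned in claim 6 and not necessarily formula (2). The function names, coordinates, and heat values are hypothetical.

```python
import math


def heat_recirculation_coefficients(positions, exhaust_heat):
    """Distance-weighted exhaust heat of the other servers (assumed 1/distance weighting)."""
    coefficients = []
    for i, pos_i in enumerate(positions):
        total = 0.0
        for j, pos_j in enumerate(positions):
            if i != j:
                total += exhaust_heat[j] / math.dist(pos_i, pos_j)
        coefficients.append(total)
    return coefficients


def normalize_sequences(rise_sequences, coefficients):
    """Divide every temperature rise sample of a server by that server's coefficient."""
    return [[rise / c for rise in seq] for seq, c in zip(rise_sequences, coefficients)]


positions = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # hypothetical rack coordinates
exhaust_heat = [100.0, 200.0, 300.0]              # hypothetical exhaust heat per server
coeffs = heat_recirculation_coefficients(positions, exhaust_heat)
# Each row is one server's temperature rise sequence; the result is the N x R matrix.
matrix = normalize_sequences([[3.0, 6.0], [2.5, 5.0], [1.0, 2.0]], coeffs)
```

The sketch assumes that no two servers share the same coordinates; otherwise the inverse-distance weight would divide by zero.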

[0081] Step 2: Eliminate servers with heat recirculation temperature rise. Take the last column of the temperature sequence matrix. This column reflects the normalized temperature rise, which indicates the proportion of recirculated hot air in the server inlet airflow. Sort this column from largest to smallest. Because the minimum number of servers affected by a cold air delivery anomaly is P, select the last P values in the sorted sequence as candidates. Since the values of servers with heat recirculation temperature rise are relatively small, these servers can be eliminated using standardization methods such as Z-Score, retaining only the servers whose temperature rise is caused solely by cold air delivery anomalies.
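
The elimination step might look like the following sketch. The cut-off `z_cut=-1.5` and the use of the population standard deviation are assumptions made for illustration; the text only says that a standardization method such as Z-Score is applied to the last P sorted values.

```python
from statistics import mean, pstdev


def remove_recirculation_servers(matrix, p, z_cut=-1.5):
    """Drop rows whose final normalized rise is an unusually low outlier
    among the last p values of the descending sort (z_cut is an assumed threshold)."""
    ranked = sorted(((row[-1], idx) for idx, row in enumerate(matrix)), reverse=True)
    candidates = ranked[-p:]                      # the p smallest final values
    values = [value for value, _ in candidates]
    mu, sigma = mean(values), pstdev(values)
    removed = set()
    if sigma > 0:
        removed = {idx for value, idx in candidates if (value - mu) / sigma < z_cut}
    kept = [row for idx, row in enumerate(matrix) if idx not in removed]
    return kept, sorted(removed)


# Hypothetical normalized matrix: only the last column matters for this step.
example = [[0.5, 1.0], [0.5, 1.1], [0.5, 0.9], [0.5, 1.05],
           [0.5, 0.95], [0.5, 1.0], [0.5, 1.02], [0.5, 0.1]]
kept_rows, removed_ids = remove_recirculation_servers(example, p=6)
```

Here the server at index 7 has a much smaller final value than the other candidates, so it is the only row removed.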

[0082] Step 3: Construct the weighted Euclidean distance. After the servers with heat recirculation temperature rise are removed, a temperature sequence matrix of order (N−L)×R is obtained, where L is the number of servers removed. For the remaining (N−L) servers, calculate the Euclidean distance between their normalized temperature sequences. To improve clustering accuracy, a weighting function is introduced to weight the Euclidean distance, giving the weighted Euclidean distance of formula (2). That formula is expressed in terms of the Euclidean distance between the normalized temperature sequences of the i-th and j-th servers, and of the normalized temperature sequences of those two servers.

[0083] Step 4: Design the weighting function. The weighting function, formula (3), is built from two quantities. One is the Pearson correlation coefficient of sequence i and sequence j: the more strongly the changes in the two sequences are correlated, the closer it is to 1, and the weaker the correlation, the smaller it is. The other is the absolute value of the difference between the normalized temperature rise values of the R-th samples of the two servers: servers whose temperature rise is caused by abnormal cooling output from the same vent show a small difference in normalized temperature rise, while servers affected by different vents show a larger one. Using this weighting function therefore makes it easier to cluster servers whose inlet temperature rises due to abnormal cooling output from the same vent into one group.
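
To make Steps 3 and 4 concrete, the sketch below computes a weighted distance from the two quantities just described. The specific weight, `(2 - rho) * (1 + delta)`, is an assumption chosen only because it shrinks the distance when the Pearson coefficient is high and the final-sample difference is small; it is a stand-in, not the application's formula (3). All function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant sequence."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def weighted_distance(a, b):
    """Euclidean distance scaled by an assumed weight built from rho and delta."""
    d = math.dist(a, b)                 # plain Euclidean distance
    rho = pearson(a, b)                 # trend similarity, in [-1, 1]
    delta = abs(a[-1] - b[-1])          # difference of the R-th (last) samples
    weight = (2.0 - rho) * (1.0 + delta)
    return weight * d


def distance_matrix(sequences):
    """Symmetric matrix of weighted distances between all sequence pairs."""
    return [[weighted_distance(a, b) for b in sequences] for a in sequences]


seq_a = [0.0, 1.0, 2.0, 3.0]
seq_b = [0.1, 1.1, 2.1, 3.1]    # same trend as seq_a, nearly the same final rise
seq_c = [3.0, 2.0, 1.0, 0.0]    # opposite trend, different final rise
dist = distance_matrix([seq_a, seq_b, seq_c])
```

With these sample sequences, the pair with the same trend ends up far closer than the pair with opposite trends, which is the behavior the weighting function is meant to encourage.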

[0084] Step 5: Cluster Analysis and Fault Location. Based on the weighted Euclidean distance described above, the K-medoids clustering algorithm is used to cluster the remaining (N−L) servers. The CH (Calinski-Harabasz) index is used to evaluate the clustering results under different numbers of clusters, and the number of clusters with the largest CH index is selected as the optimal clustering scheme. Finally, the clustering results are output to determine the relative location of the air conditioning delivery fault point, realizing early warning of air conditioning delivery anomalies.
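
A compact sketch of Step 5 follows. It uses a simple alternating K-medoids procedure with a deterministic start, and it evaluates the CH index in its standard centroid-based form on the sequences themselves; both choices are assumptions, since the text does not say which K-medoids variant is used or how the CH index is adapted to the weighted distance. The example passes plain Euclidean distances where the weighted distance would normally go.

```python
import math


def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix."""
    n = len(dist)
    # Deterministic start: most central point, then the farthest remaining points.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:                 # keep the old medoid for an empty cluster
                updated.append(medoids[c])
                continue
            updated.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids


def calinski_harabasz(points, labels):
    """Standard CH index: between-cluster over within-cluster dispersion."""
    n, dim, clusters = len(points), len(points[0]), set(labels)
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for c in clusters:
        members = [p for p, label in zip(points, labels) if label == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    if k < 2 or k >= n:
        return 0.0                          # index is undefined for these k
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))


def best_clustering(points, dist, k_max):
    """Try k = 2..k_max and keep the clustering with the largest CH index."""
    best = None
    for k in range(2, min(k_max, len(points) - 1) + 1):
        labels, medoids = k_medoids(dist, k)
        score = calinski_harabasz(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels, medoids)
    return best


points = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1],
          [10, 0], [10.1, 0], [10, 0.1]]
dist = [[math.dist(a, b) for b in points] for a in points]
score, k, labels, medoids = best_clustering(points, dist, k_max=5)
```

Each resulting cluster would correspond to one suspected air outlet, and the positions of the servers in that cluster indicate where to look.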

[0085] Example 6. In some embodiments, the present invention provides an early warning method for air conditioning delivery failures in data center computer rooms, comprising the following steps. First, calculate the relative position and heat values of all servers. For any server within an air duct system, calculate its heat recirculation coefficient based on its position relative to the other servers; also calculate the heat contained in the hot air discharged by the server at its rated power from the air conditioning temperature, the server's rated power consumption, and its rated fan speed. Perform these calculations for all M servers and store the results. Determine the minimum number of servers, P, affected by an air conditioning delivery failure at a given air outlet.

[0086] Next, collect temperature sequence points for all servers and identify servers with abnormal temperature rise. At regular intervals (e.g., every 30 seconds), collect the inlet temperature, exhaust fan speed, and power of the servers within the cooling range of the air duct system. After R data points have been collected, compare the first and R-th temperature data points to calculate each server's inlet temperature rise value, and identify the servers whose temperature rise exceeds a set threshold. Record the number, location, and temperature rise value of each such server. A sliding window of length R can be used to collect server information continuously.
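
The sliding-window collection and threshold check can be sketched as follows. The class name `InletMonitor`, its parameters, and the sample readings are hypothetical; the only behavior taken from the text is that the rise is the difference between the R-th and the first sample in a window of length R.

```python
from collections import deque


class InletMonitor:
    """Keeps the last R inlet temperatures per server and flags abnormal rises."""

    def __init__(self, window_size, rise_threshold):
        self.window_size = window_size        # R samples per window
        self.rise_threshold = rise_threshold  # rise above this is abnormal
        self.windows = {}

    def add_sample(self, server_id, inlet_temp):
        # deque(maxlen=R) discards the oldest sample automatically.
        window = self.windows.setdefault(server_id, deque(maxlen=self.window_size))
        window.append(inlet_temp)

    def abnormal_servers(self):
        """Map server id to rise (R-th sample minus first) for full windows only."""
        result = {}
        for server_id, window in self.windows.items():
            if len(window) == self.window_size:
                rise = window[-1] - window[0]
                if rise > self.rise_threshold:
                    result[server_id] = rise
        return result


monitor = InletMonitor(window_size=3, rise_threshold=2.0)
for temp_a, temp_b in [(20.0, 20.0), (21.0, 20.5), (23.5, 21.0)]:
    monitor.add_sample("a", temp_a)
    monitor.add_sample("b", temp_b)
first_check = monitor.abnormal_servers()
monitor.add_sample("a", 24.0)                 # window for "a" slides forward
second_check = monitor.abnormal_servers()
```

In a deployment, `add_sample` would be called once per collection interval for every server in the duct system, and the location of each server would be looked up from its identifier.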

[0087] Next, determine whether a cooling supply failure may exist and calculate the temperature sequences of all abnormal servers. If the number N of servers whose temperature rise exceeds the set threshold is greater than P, a cooling supply failure is considered possible. For each server with abnormal temperature rise, calculate the amount by which each of the R collected inlet temperature values exceeds the cooling supply temperature to obtain its temperature sequence; doing this for all N servers yields N temperature sequences.
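
As a minimal sketch of this check, assuming the inlet readings are held in a dictionary keyed by a hypothetical server identifier:

```python
def rise_sequences(inlet_windows, supply_temp):
    """Temperature sequence per server: each inlet sample minus the supply temperature."""
    return {server_id: [temp - supply_temp for temp in temps]
            for server_id, temps in inlet_windows.items()}


def supply_failure_suspected(abnormal_ids, min_affected):
    """A failure is considered possible only when N is strictly greater than P."""
    return len(abnormal_ids) > min_affected


# Hypothetical readings for three servers already flagged as abnormal.
windows = {"s1": [20.0, 22.5], "s2": [19.0, 21.0], "s3": [18.5, 20.5]}
sequences = rise_sequences(windows, supply_temp=18.0)
suspected = supply_failure_suspected(sequences, min_affected=2)
```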

[0088] Subsequently, the heat recirculation coefficient of each server with abnormal temperature rise is calculated according to the heat recirculation coefficient formula, and its temperature sequence is normalized. This yields N normalized temperature sequences, which constitute an N×R temperature sequence matrix.

[0089] Next, servers with heat recirculation temperature rise are removed. The last column of the temperature sequence matrix is taken and sorted from largest to smallest. Because the minimum number of servers affected by abnormal cold air output from an air outlet is P, the last P values in the sorted sequence are taken. A small number of values at the end of this sequence represent servers with heat recirculation temperature rise (say L of them, with P much larger than L). These L servers are removed using data standardization methods such as Z-Score. For the servers identified in this step, the load on these servers or on surrounding servers can be adjusted to reduce hot air output; alternatively, the flow rate of the cold air aisles near the servers can be increased to raise the proportion of cold air entering these servers, lowering the inlet temperature and ensuring stable server operation.

[0090] Next, a weighting function is constructed using the server temperature sequences. After the L servers with heat recirculation temperature rise are removed, the temperature sequence matrix becomes (N−L)×R, and the remaining (N−L) servers all belong to the category of servers whose inlet temperature rise is caused by abnormal cold air delivery. A weighted Euclidean distance is defined from the Euclidean distance between the normalized temperature sequences of server i and server j among the (N−L) servers with abnormal temperature rise. The weighting function is built from the Pearson coefficient of sequence i and sequence j and the absolute value of the difference between the normalized temperature rise values of the R-th samples of the two servers. Using this weighting function makes it easier to cluster servers whose inlet temperature rise is caused by abnormal cold air delivery from the same outlet into one category.

[0091] Finally, the weighted Euclidean distance between servers is calculated, and the nature of the temperature rise and the location of the air conditioning failure points are determined through clustering and an evaluation metric. The weighted Euclidean distances among the (N−L) servers are calculated and used to estimate the upper limit of the possible number of clusters. The K-medoids clustering method is used to cluster the servers, and each candidate number of clusters is evaluated using the CH (Calinski-Harabasz) metric. The clustering of the servers with abnormal temperature rise that has the best CH metric is selected, and the relative location of the air conditioning delivery failure point is output together with an early warning.

[0092] Example 7. In some implementations, the overall process for early warning of air conditioning delivery faults includes a complete series of steps from data acquisition to cluster analysis, as well as fault classification and maintenance guidance strategies. First, real-time data on server inlet temperature, exhaust fan speed, and server power are acquired at regular intervals (e.g., every 30 seconds). The data collected R times are processed to calculate the inlet temperature rise value for each server. When the temperature rise value exceeds a preset threshold, the server is identified as having an abnormal temperature rise, and its number, location, and corresponding temperature rise value are recorded.

[0093] Subsequently, a heat recirculation coefficient is calculated for each server based on its relative position and the heat contained in its exhaust air. This coefficient characterizes the proportion of recirculated hot air in the server's inlet airflow, thus reflecting the likelihood of abnormal airflow. Based on this heat recirculation coefficient, the temperature sequences of servers with abnormal temperature rises are normalized to eliminate the temperature rise caused by hot air recirculation and to reduce the temperature difference between servers in the central and peripheral areas of abnormal airflow, thereby improving the accuracy of subsequent clustering.
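
The relationship between the recirculated share and the measured temperatures, which claim 4 states as a ratio of the temperature rise to a temperature difference, follows from a simple mixing balance. The derivation below is a sketch that assumes the inlet air is an ideal mixture of supply air and recirculated exhaust with equal specific heat. The symbols are introduced here for illustration and are not the application's notation: $\varphi$ is the share of recirculated air in the inlet flow, $T_{\mathrm{in}}$ the inlet temperature, $T_{\mathrm{c}}$ the cold supply temperature, and $T_{\mathrm{h}}$ the temperature of the recirculated hot air.

```latex
T_{\mathrm{in}} = (1-\varphi)\,T_{\mathrm{c}} + \varphi\,T_{\mathrm{h}}
\quad\Longrightarrow\quad
\varphi = \frac{T_{\mathrm{in}} - T_{\mathrm{c}}}{T_{\mathrm{h}} - T_{\mathrm{c}}}
```

The numerator is the inlet temperature rise over the supply temperature, and the denominator is the temperature difference between the recirculated hot air and the cold air.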

[0094] After normalization, servers whose abnormal temperature rise is caused by heat recirculation are removed by analyzing the distribution of the normalized temperature sequences. Specifically, the last column of the normalized temperature sequence matrix is taken and sorted from largest to smallest. The last P values may include servers with heat recirculation temperature rise, where P is the minimum number of servers affected by an abnormality at a cold air outlet. These servers are removed using standardization methods such as Z-Score so that the subsequent clustering is not affected by factors other than cold air delivery failures.

[0095] Next, a weighted Euclidean distance function is constructed using the remaining server temperature sequences. The trend of the normalized temperature sequences and the difference in normalized temperature rise between servers are used as weighting factors to bring together servers affected by the same air conditioning delivery failure point. The remaining servers are then clustered with the K-medoids clustering algorithm, and the results for different numbers of clusters are evaluated using the CH (Calinski-Harabasz) index. The clustering with the largest CH index is selected as the optimal result, thereby identifying the locations of multiple air conditioning delivery failure points.

[0096] Regarding fault classification and operational guidance strategies, the nature of temperature rise is determined based on clustering results: if the temperature rise is caused by abnormal air supply, the corresponding air supply fault location is identified, guiding operations personnel to quickly investigate; if the temperature rise is caused by heat recirculation, it is recommended to adjust the load configuration of relevant servers or peripheral equipment, or optimize the airflow distribution in nearby air ducts to reduce inlet temperature and ensure stable server operation. This method achieves accurate early warning and classification of air supply faults, effectively improving the operational efficiency and energy utilization of data centers.

[0097] The sequence number of each process does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0098] In the various embodiments of this application, unless otherwise specified or in case of logical conflict, the terminology and / or descriptions of different embodiments are consistent and can be referenced by each other. The technical features of different embodiments can be combined to form new embodiments according to their inherent logical relationship.

[0099] Another embodiment of this application provides an electronic device, which is a task controller or a node resource controller. As shown in Figure 6, it includes a transceiver 610, a processor 600, a memory 620, and a program or instructions stored in the memory 620 and executable on the processor 600; when the processor 600 executes the program or instructions, it implements the various processes of the above-described method embodiments on the task controller side or node resource controller side, and can achieve the same technical effect. To avoid repetition, it will not be described again here.

[0100] The transceiver 610 is used to receive and send data under the control of the processor 600.

[0101] In Figure 6, the bus architecture can include any number of interconnected buses and bridges, specifically linking together various circuits of one or more processors represented by processor 600 and memory represented by memory 620. The bus architecture can also link various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 610 can be multiple elements, including a transmitter and a receiver, providing a unit for communicating with various other devices over a transmission medium. For different user equipment, the user interface 630 can also be an interface capable of connecting external or internal devices, including but not limited to keypads, displays, speakers, microphones, joysticks, etc.

[0102] The processor 600 is responsible for managing the bus architecture and general processing, while the memory 620 can store the data used by the processor 600 when performing operations.

[0103] This application also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the various processes of the above-described task processing method or resource processing method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here. The computer-readable storage medium may include, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0104] This application also provides a computer program product, including computer instructions. When the computer instructions are executed by a processor, they implement the various processes of the above-described task processing method or resource processing method embodiments and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0105] This application embodiment also provides a chip, which includes a processor and a communication interface. The communication interface is coupled to the processor. The processor is used to run programs or instructions to implement the various processes of the above-described task processing method or resource processing method embodiments, and can achieve the same technical effect. To avoid repetition, it will not be described again here.

[0106] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-a-chip, etc.

[0107] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0108] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0109] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

Claims

1. A method for diagnosing faults in the air conditioning supply system of a computer room, characterized in that it comprises: identifying servers with abnormal temperature rise based on the operating parameters of each server, and obtaining the temperature sequence corresponding to the servers with abnormal temperature rise; normalizing the temperature sequence using a heat recirculation coefficient to obtain a temperature sequence matrix, wherein the heat recirculation coefficient is related to the heat of the hot air discharged from the server and the flow rate of the hot air returning to the server inlet; and constructing a weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results.

2. The method according to claim 1, characterized in that, before identifying the servers with abnormal temperature rise based on the operating parameters of each server and obtaining the temperature sequence corresponding to the servers with abnormal temperature rise, the method further comprises: determining, based on the position of each server relative to the other servers, the air conditioning temperature, and the server's rated power consumption and/or rated speed, the minimum number of servers affected by an abnormality in the air conditioning supply from an air outlet.

3. The method according to claim 2, characterized in that, before constructing a weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results, the method further comprises: taking the last column of the temperature sequence matrix, sorting it from largest to smallest, and eliminating, by data standardization, a preset number of servers with abnormal temperature rise at the end of the sorted sequence; and removing the preset number of servers with abnormal temperature rise from the temperature sequence matrix.

4. The method according to claim 1, characterized in that the step of identifying servers with abnormal temperature rise based on the operating parameters of each server and obtaining the temperature sequence corresponding to the servers with abnormal temperature rise comprises: obtaining, as the temperature rise value, the temperature difference between the air inlet temperature of each server and the temperature of the cooled air output by the air conditioner; obtaining the temperature difference between the recirculated hot air and the cooled air at the air inlet of each server; calculating, based on the temperature rise value and the temperature difference, the proportion of the recirculated hot air flow rate in the inlet air flow rate of each server; and determining the server's temperature rise due to heat recirculation based on that proportion.

5. The method according to claim 1, characterized in that the step of constructing a weighting function based on the temperature sequence matrix, clustering the servers with abnormal temperature rise using the weighting function and a clustering method, and determining the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results comprises: constructing a weighting function based on the Euclidean distance between the servers with abnormal temperature rise and on the ratio of the temperature rise value at the air inlet of each server to the temperature difference between the recirculated hot air and the cold air at the air inlet of each server; clustering the servers whose temperature rise is caused by cold air delivery failure, so that servers with abnormal temperature rise caused by abnormal cold air from the same outlet are clustered together; and determining the location of the abnormal air outlet of the air conditioner based on the clustering results.

6. The method according to claim 1, characterized in that, before normalizing the temperature sequence using the heat recirculation coefficient to obtain the temperature sequence matrix, the method further comprises: using the heat of the server exhaust air, weighted by distance, to characterize the temperature difference between the recirculated hot air and the cold air at the air inlet of each server, and taking it as the heat recirculation coefficient of each server.

7. A device for diagnosing faults in the air conditioning supply system of a computer room, characterized in that it comprises: an acquisition unit, configured to identify servers with abnormal temperature rise based on the operating parameters of each server, and to acquire the temperature sequence corresponding to the servers with abnormal temperature rise; a processing unit, configured to normalize the temperature sequence using a heat recirculation coefficient to obtain a temperature sequence matrix, wherein the heat recirculation coefficient is related to the heat of the hot air discharged from the server and the flow rate of the hot air returning to the server inlet; and a determining unit, configured to construct a weighting function based on the temperature sequence matrix, to cluster the servers with abnormal temperature rise using the weighting function and a clustering method, and to determine the location of the abnormal air outlet of the servers with abnormal temperature rise based on the clustering results.

8. An electronic device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program or instructions that, when executed on a computer, perform the method according to any one of claims 1 to 6.

10. A computer program product, characterized in that, when the computer program product is run on a computer, it performs the method according to any one of claims 1 to 6.