Adaptive thermal control for data centers and IT equipment
By adjusting the operating parameters and cooling strategies of the data center thermal control system, the problem of low cooling efficiency under seasonal changes was solved, achieving efficient cooling and extended hardware lifespan in different seasons, and reducing energy consumption and failure risk.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2023-01-18
- Publication Date
- 2026-06-30
AI Technical Summary
Existing data center thermal management systems are unable to effectively adapt to seasonal changes, resulting in low cooling efficiency during cold seasons, increased energy consumption and hardware failure risks, and an inability to efficiently utilize cooling resources during hot seasons.
Adaptive thermal control is achieved by adjusting the operating parameters of the data center thermal control system, including reducing the temperature and flow rate of the cooling medium in cold seasons, improving cooling efficiency in hot seasons, optimizing the cooling path using external heat exchangers and local coolers, and optimizing hardware cooling strategies by combining piecewise functions and historical fault data.
It improves the cooling efficiency of data centers in different seasons, extends hardware lifespan, reduces energy consumption and failure risk, and improves overall operational efficiency.
Figure CN115942720B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to adaptive thermal control for data centers and IT equipment.
Background Technology
[0002] Electronic hardware (such as information technology (“IT”) equipment, computer hardware, and servers) generates heat during operation. Such hardware operates more efficiently at lower temperatures and tends to fail more quickly when operating at higher temperatures. For these reasons, a wide variety of electronic hardware cooling solutions have been developed.
[0003] Data centers house vast amounts of electronic hardware, often used for remote applications such as cloud computing or internet hosting. To manage the significant heat generated by the housed hardware, data centers typically employ complex thermal management systems. These systems can comprise multiple heat exchange chains between various cooling media to remove heat from the hardware and out of the data center. Each link in the chain operates according to a set of parameters, such as cooling medium temperature and flow rate. These parameters are often chosen to balance multiple considerations. For example, cooling medium temperature and flow rate might be selected to balance the overall preference for lower operating temperatures of the cooled electronic hardware with the costs associated with removing heat from these components out of the building. The considerations being balanced can change over time, so the parameters are set to accommodate the most challenging conditions anticipated.
Summary of the Invention
[0004] By adjusting the operating parameters of a data center's thermal control system as conditions change, efficiency and hardware lifespan can be improved. In certain specific examples, parameters can be adjusted when favorable conditions are anticipated to take advantage of them.
[0005] In some aspects of this disclosure, the operating parameters of the data center thermal control system can vary with the seasons. Where an acceptable balance has been found between the desire for greater cooling of the hardware or media and the difficulty of removing heat from the building during the hottest time of the year, these parameters can be varied during periods of expected colder weather throughout the year. In some examples, outdoor heat exchangers (such as cooling towers) can be used to transfer heat from the cooling medium used by the data center to the outside air. Since colder weather tends to result in lower temperatures of the cooling medium immediately downstream of the tower without additional cost, any heat exchange in the data center that transfers heat to the cooling medium will tend to be more efficient during colder weather. Therefore, the thermal control operations of the entire data center (such as any refrigerant cycle, and in some examples, the refrigerant cycle used to cool process water) can be adjusted to take advantage of the temporary increase in cooling efficiency by cooling to lower temperatures.
[0006] In other aspects of this disclosure, operating parameters can be adjusted to suit conditions within a data center. The storage and cooling areas for hardware can be divided into multiple zones, and the cooling medium circulating through zones containing less than the full capacity of the hardware can circulate at a higher rate, a lower temperature, or both, with minimal cost due to the lower cooling load provided by the underfilled zones. Cooling systems (such as fan systems) for individual hardware components or containers for components can be adjusted to lower coolant inlet temperatures by reducing the thermal margin for the components or contained components. After the installation of several new components, the thermal margin for these new components can be temporarily increased to mitigate component failure during the anticipated early failure phase, thereby reducing the probability of equipment shortages; or temporarily decreased to shorten the anticipated early failure phase, resulting in a stable operating phase during which minimal premature failures are expected.
[0007] In accordance with some aspects of any of the foregoing, a data center thermal control system may include: a local cooler configured to cool a local coolant used for cooling electronic hardware; an external heat exchanger configured to transfer heat from a fluid to outside air; and a fluid circulation system configured to transfer heat from the local cooler to the external heat exchanger by circulating at least one fluid cooling medium, the fluid circulation system including a cold section leading to an air cooler. The thermal control system may also include one or more processors and a non-transitory computer-readable medium storing instructions. When executed by the one or more processors, the instructions may cause the one or more processors to control the external heat exchanger to cool the fluid in the cold section to a first target temperature during hot seasons and to cool the fluid in the cold section to a second target temperature, which is lower than the first target temperature, during cold seasons.
[0008] In some examples according to any of the foregoing, the cold season may include all months in the geographic area where the control system is located where the average annual temperature is below a threshold temperature.
[0009] In some examples according to any of the foregoing, the fluid circulation system may include: an internal heat exchanger; an external loop that circulates fluid between the internal heat exchanger and the external heat exchanger; and an internal loop that includes the cold section and circulates fluid between the local cooler and the internal heat exchanger.
[0010] In some examples according to any of the foregoing, the local coolant may be recirculated air.
[0011] In some examples according to any of the foregoing, the second target temperature may vary during the cold season.
[0012] In some examples according to any of the foregoing, the cold season may include multiple intervals, each having an annual average temperature, and at each transition from an earlier interval to a later interval, if the annual average temperature of the later interval is lower than the annual average temperature of the earlier interval, the difference between the first target temperature and the second target temperature increases; and if the annual average temperature of the later interval is higher than the annual average temperature of the earlier interval, the difference between the first target temperature and the second target temperature decreases.
[0013] In some examples according to any of the foregoing, throughout the cold season, the second target temperature can be a non-piecewise function of the difference between the threshold temperature and the annual average temperature of the current one of the intervals.
[0014] In some examples according to any of the foregoing, a data center may have the control system of any of the foregoing installed. The data center may also include multiple cooling zones in which local coolant circulates, each cooling zone having an electronic hardware storage capacity. When executed by the one or more processors, the instructions may cause the processors to manage the control system to increase the flow rate of local coolant in any cooling zone that is known to contain a predetermined proportion of the hardware storage capacity of the cooling zone.
[0015] In some examples according to any of the foregoing, the predetermined ratio may be equal for each cooling zone.
[0016] In some examples according to any of the foregoing, for each cooling zone containing a predetermined proportion of the hardware storage capacity of the cooling zone, the instruction, when executed by one or more processors, can increase the airflow rate by a certain amount, which decreases as the difference between the hardware storage capacity of the cooling zone and the known amount of hardware stored in the cooling zone decreases.
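The zone-level rule of paragraphs [0014] to [0016] can be illustrated with a short Python sketch. It is only one possible reading, not the patented implementation: the function name `flow_rate_boost`, the linear scaling, the default `fill_threshold` of 0.8, and `max_boost` are all assumptions introduced here. The sketch also assumes, following paragraph [0006], that the boost applies to zones filled below the predetermined proportion.

```python
def flow_rate_boost(capacity: int, stored: int,
                    fill_threshold: float = 0.8,
                    max_boost: float = 0.5) -> float:
    """Return a fractional increase in local coolant flow rate for one zone.

    Hypothetical rule: a zone filled below `fill_threshold` gets a boost
    that shrinks linearly as its unused capacity (capacity - stored) shrinks.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if not 0 <= stored <= capacity:
        raise ValueError("stored must be between 0 and capacity")
    fill = stored / capacity
    if fill >= fill_threshold:
        return 0.0  # zone is full enough: keep the baseline flow rate
    unused_fraction = (capacity - stored) / capacity
    return max_boost * unused_fraction


# A half-empty zone gets a larger boost than a mostly full one.
print(flow_rate_boost(100, 50), flow_rate_boost(100, 70))
```

Any monotone relationship would satisfy the description equally well; the linear form is chosen only because it is the simplest one to read.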
[0017] In some examples according to any of the foregoing, the control system may include multiple cooling zones in which local coolant circulates, each cooling zone having an electronic hardware storage capacity. When executed by one or more processors, the instruction may cause the processors to manage the control system to reduce the temperature of the local coolant circulating in any of the cooling zones, which is known to contain a predetermined proportion of the hardware storage capacity of the cooling zones.
[0018] In another aspect, a container for electronic hardware may include a cooling system comprising one or more processors and a non-transitory computer-readable medium storing instructions. When executed by the one or more processors, the instructions may cause the processors to control the cooling system to maintain a thermal margin above a minimum for electronic hardware components operating within the container, wherein the minimum is a piecewise function of the supply temperature of the cooling medium, and the thermal margin is the difference between a predetermined temperature and an actual temperature of the component.
[0019] In some examples according to any of the foregoing, the piecewise function may include a first subfunction applied to a first domain and a second subfunction applied to a second domain on the side of the threshold temperature opposite to the first domain, the first subfunction being a constant and the second subfunction being a function of the supply temperature.
[0020] In some examples according to any of the foregoing, the first domain may be above the threshold temperature.
[0021] In some examples according to any of the foregoing, the second subfunction can create a positive relationship between the minimum value and the absolute value of the difference between the threshold temperature and the supply temperature.
[0022] In some examples according to any of the foregoing, the piecewise function may include a first subfunction applied to a first domain and a second subfunction applied to a second domain on the side of the threshold temperature opposite to the first domain, wherein the first and second subfunctions are different functions of the supply temperature.
[0023] In some examples according to any of the foregoing, the first and second sub-functions can both create a positive relationship between the minimum and the absolute value of the difference between the threshold temperature and the supply temperature.
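The piecewise minimum described in paragraphs [0018] to [0023] can be sketched in Python as follows. The numeric defaults (`threshold_c`, `base_margin_c`, and the two slopes) and the linear form of each subfunction are assumptions chosen for illustration; the disclosure only requires that the subfunctions differ on either side of the threshold temperature. With `slope_above=0.0` the sketch matches the variant in which the subfunction above the threshold is a constant, and with both slopes positive it matches the variant in which both subfunctions grow with the distance from the threshold.

```python
def min_thermal_margin(supply_temp_c: float,
                       threshold_c: float = 25.0,
                       base_margin_c: float = 5.0,
                       slope_below: float = 0.5,
                       slope_above: float = 0.0) -> float:
    """Minimum thermal margin (deg C) as a piecewise function of supply temperature.

    All defaults are hypothetical. Each branch adds a term proportional to
    |threshold - supply temperature|, using a different slope per domain.
    """
    distance = abs(threshold_c - supply_temp_c)
    if supply_temp_c >= threshold_c:
        return base_margin_c + slope_above * distance  # first domain
    return base_margin_c + slope_below * distance      # second domain


def thermal_margin(limit_temp_c: float, actual_temp_c: float) -> float:
    """Thermal margin: predetermined temperature minus actual component temperature."""
    return limit_temp_c - actual_temp_c


# Cooler supply air raises the required margin in this variant.
print(min_thermal_margin(30.0), min_thermal_margin(15.0))
```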
[0024] In some examples according to any of the foregoing, the container may include a fan. The cooling medium may be air, and the control system may be configured to maintain a margin above a minimum by varying the fan's operating speed as needed.
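One simple way a fan-based container could hold the margin above its minimum is a step controller, sketched below in Python. The disclosure says only that fan speed is varied as needed; the step size, the speed limits, and the name `next_fan_speed` are assumptions made here, and a real controller would likely add hysteresis or a PID loop to avoid oscillation.

```python
def next_fan_speed(current_speed: float, margin_c: float, min_margin_c: float,
                   step: float = 0.05, lowest: float = 0.2,
                   highest: float = 1.0) -> float:
    """Return the next fan speed as a fraction of full speed.

    Hypothetical rule: speed up by one step when the measured margin is
    below the minimum, otherwise slow down by one step to save fan power.
    The result is clamped to the range [lowest, highest].
    """
    if margin_c < min_margin_c:
        target = current_speed + step
    else:
        target = current_speed - step
    return min(highest, max(lowest, target))


# Margin of 3 deg C against a 5 deg C minimum: the fan speeds up.
print(next_fan_speed(0.5, margin_c=3.0, min_margin_c=5.0))
```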
[0025] In another aspect, a data center thermal control system for cooling a group of electronic components may include one or more processors and a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the processors to control the system to cool the group of electronic components to a different thermal margin lower limit during at least a portion of an early failure phase than during a stable phase. The early failure phase may be a window following component installation during which the expected failure rate of the components decreases at least at a first rate based on historical failure data. The stable phase may be a window following the early failure phase during which the failure rate of the components decreases at a rate lower than the first rate and increases at a rate lower than a second rate based on historical failure data. Thermal margin may be the difference between a predetermined temperature and an actual operating temperature of the components.
[0026] In some examples according to any of the foregoing, the instructions, when executed by the one or more processors, may cause the processors to manage the control system to cool the group of electronic components to a stable margin lower limit during the stable phase and to an early margin lower limit during the early failure phase, wherein the early margin lower limit is less than the stable margin lower limit.
[0027] In some examples according to any of the foregoing, the instructions, when executed by the one or more processors, may cause the processors to manage the control system to cool the group of electronic components to the early margin lower limit from the installation of the group until an adjusted stabilization transition time, and to cool the group to at least the stable margin lower limit from the adjusted stabilization transition time onward, wherein the adjusted stabilization transition time is the earliest time after installation at which the actual failure rate of the components in the group is expected to decrease at a rate less than a predetermined rate.
[0028] In some examples according to any of the foregoing, historical fault data may be derived from observed faults in electronic devices of the same type as the components before installation.
Attached Figure Description
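The phase logic of paragraphs [0025] to [0027] can be illustrated with the following Python sketch. It assumes that historical failure data is available as one failure rate per fixed-length interval after installation. The function names, the sample rates, and the two margin values are hypothetical and are not taken from the disclosure.

```python
from typing import Sequence


def stabilization_index(failure_rates: Sequence[float], first_rate: float) -> int:
    """Return the first interval in which the failure rate no longer drops
    by at least `first_rate` per interval.

    `failure_rates[i]` is the historical failure rate in interval i after
    installation. If the rate never slows down, len(failure_rates) is returned.
    """
    for i in range(1, len(failure_rates)):
        decrease = failure_rates[i - 1] - failure_rates[i]
        if decrease < first_rate:
            return i
    return len(failure_rates)


def margin_floor(age: int, transition: int,
                 early_floor_c: float, stable_floor_c: float) -> float:
    """Pick the thermal margin lower limit for components of a given age (in intervals)."""
    return early_floor_c if age < transition else stable_floor_c


# Hypothetical history: steep early decline, then a plateau.
rates = [0.10, 0.06, 0.03, 0.025, 0.024]
transition = stabilization_index(rates, first_rate=0.01)
print(transition, margin_floor(1, transition, early_floor_c=3.0, stable_floor_c=5.0))
```

Here the early floor is lower than the stable floor, as in paragraph [0026]; swapping the two values gives the alternative in which new components receive extra margin during the early failure phase.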
[0029] Figure 1 is a schematic diagram of a thermal control system used in a data center.
[0030] Figures 2A and 2B are graphs showing seasonal adjustment of the target temperature for the coolant according to various aspects of this disclosure.
[0031] Figure 3 is a graph showing possible adjustments to coolant temperature and flow rate depending on the filling ratio of the area used to store electronic hardware.
[0032] Figures 4A to 4C are graphs showing the adjustment of thermal margin for electronic hardware depending on coolant temperature, according to various aspects of this disclosure.
[0033] Figures 5A to 5C are graphs of the functions represented by Figures 4A to 4C.
[0034] Figure 6 is a schematic diagram of the container cooling system.
[0035] Figures 7A to 7C are graphs showing the adjustment of thermal margin for a group of electronic hardware depending on its age.
Detailed Implementation
[0036] Figure 1 illustrates a thermal control system 100 for a data center. The thermal control system 100 is managed by a controller 110, which communicates electronically with other components of the thermal control system 100. The controller 110 includes a memory 115 in the form of a non-transitory computer-readable medium that stores data readable by electronic devices. The medium may be, for example, a hard disk drive, a memory card, a read-only memory (“ROM”), a random access memory (“RAM”), an optical disk, or any other type of writable or read-only memory. The memory 115 stores instructions 117 that, when read by one or more processors 119 of the controller 110, cause the one or more processors 119 to perform the adaptive thermal control operations described herein by controlling other hardware in the thermal control system 100. Instructions 117 may be instructions that perform any of the thermal control operations described herein, individually or in any combination. Although controller 110 is shown as a single unit, it can be multiple devices distributed across thermal control system 100, each with different functions and controlling different hardware. In various examples, these distributed devices may or may not communicate electronically with each other. Thus, in other examples, any element of thermal control system 100 shown in the illustrated example as communicating electronically with controller 110 may instead communicate electronically with discrete controllers that do not control other elements of the system.
[0037] The thermal control system 100 includes multiple cooling loops. The example shown will be described using water as the medium circulating in some of these loops (such as the "plant water" loop and the "process water" loop). Thus, water is presented as an example of a cooling medium that can be used to implement the concepts of this disclosure, but the same concepts can be applied in the same way to any other fluid cooling medium. Therefore, water should also be considered interchangeable with other fluid cooling media whenever it is mentioned in this specification.
[0038] The plant water loop 101 is used to transfer heat collected from the entire data center out of the building. The plant water loop 101 includes a hot side 112A that carries water heated by other parts of the thermal control system 100 toward the cooling tower 114; and a cold side 112B that carries water from the cooling tower 114 back to the other parts of the thermal control system 100. The cooling tower 114 is an external heat exchanger that transfers heat from the hot plant water to the environment outside the data center. In a specific example, the plant water flows through passages in the cooling tower 114, and ambient air 116 carries the heat away. The cooling tower 114 is configured to conduct heat from the plant water within the cooling tower 114 to its outer surfaces, so that the flow of ambient air 116 across these outer surfaces cools the plant water. In various examples, the cooling tower 114 can be operated entirely passively, or it can be operated using fans or other impellers to force ambient air 116 through the cooling tower 114. Cooling tower 114 is one example of an external heat exchanger, and in other examples, other external heat exchangers may be used to replace or supplement cooling tower 114 to transfer heat from plant water loop 101 to the environment outside the data center.
[0039] The process water loop 102 similarly includes a hot side 122A carrying water heated by other components in the data center and a cold side 122B carrying water that can be used as a coolant. Between the hot side 122A and the cold side 122B of the process water loop 102, heat is transferred from the process water to the plant water. Therefore, the heat collected by the process water loop 102 can be transferred to the plant water loop 101 to carry heat out of the building. Thus, the plant water loop 101 is an external loop because it is closer to the point where the heat leaves the building along the path of heat through the data center, while the process water loop 102 is an internal loop because it is further away from the point where the heat leaves the building along the path of heat through the data center. The illustrated example shows only one process water loop 102 and one plant water loop 101, but other examples of thermal control systems 100 may include multiple process water loops 102 that transfer heat to a single plant water loop 101, or multiple plant water loops 101 that collect heat from a single process water loop 102.
[0040] Heat is transferred from the process water loop 102 to the plant water loop 101 via an internal heat exchanger 118 and a chiller 120. Both the internal heat exchanger 118 and the chiller 120 are located downstream of the hot side 122A and upstream of the cold side 122B of the process water loop 102, and downstream of the cold side 112B and upstream of the hot side 112A of the plant water loop 101. The heat exchanger 118 can be any type of heat exchanger, such as a shell-and-tube heat exchanger, a plate heat exchanger, or any other type of structure that allows process water and plant water to flow across opposite sides of a heat conduction barrier. The chiller 120 is located downstream of the heat exchanger 118 and can be any device that uses a refrigerant loop to transport heat from the process water to the plant water. The heat exchanger 118 is optional, and therefore some embodiments other than the illustrated example omit the heat exchanger.
[0041] The process water loop 102 collects heat from one or more zone coolers 130. Although only one zone cooler 130 is shown, multiple zone coolers 130 can be connected to the process water loop 102 between the hot side 122A and the cold side 122B.
[0042] Each zone cooler 130 includes a cooling element 134, which may be, for example, a heat exchanger, a refrigerant-cycle-based chiller, or both, for transferring heat from the local medium loop 103 to the process water loop 102. The zone cooler 130 of the illustrated example also includes a driver 136 that forces the local cooling medium along the local medium loop 103. Thus, the illustrated zone cooler 130 may be, for example, a cooling fan, in which case the cooling element 134 may be a cooling fan coil, and the driver 136 may be an impeller for driving air as the local cooling medium, or it may be a cooling distribution unit (“CDU”), where the cooling element 134 may be any device suitable for cooling the fluid as the local cooling medium, and the driver 136 may be a pump. In other arrangements, if the local medium loop 103 is of a type that does not require pressure, the driver 136 may be omitted, for example, in an embodiment where the local medium loop 103 is an evaporative cooling loop and the cooling element 134 is a condenser. In a data center that houses multiple zone coolers 130 and local media loops 103, the zone coolers 130 and local media loops 103 can be of different types.
[0043] The local medium loop 103 includes: a cold side 132B that delivers a local cooling medium (such as air, water, dielectric fluid, or any other coolant suitable for the type of electronic hardware being cooled) to one or more housings 140 at a relatively low temperature; and a hot side 132A that returns the local cooling medium to the zone cooler 130 after the local cooling medium has been used to cool the electronic hardware. Although only one housing 140 is shown, each zone may contain several housings.
[0044] Each housing 140 contains one or more units 141. Each unit 141 may be a separate electronic hardware component, or it may be a container or housing for electronic hardware. In the example shown, unit 141 is a tray-type container for electronic hardware in the form of a server, but the concept of this disclosure is applicable to any type of heat-generating electronic hardware, and housing 140 is a server rack.
[0045] As illustrated in the example, each unit 141 may optionally include an onboard cooling system. An example of an onboard cooling system includes: a driver 148 (such as a fan or pump) for driving a localized cooling medium through a heat load 150; an inlet thermometer 146B for measuring the temperature of an inlet flow 142B of the localized cooling medium; and an outlet thermometer 146A for measuring the temperature of an outlet flow 142A of the localized cooling medium. The onboard cooling system may also include or communicate with one or more thermometers that measure the operating temperature of cooled electronic hardware stored in unit 141.
[0046] Load 150 is an object cooled by the local cooling medium flowing through or across unit 141. Load 150 can be a standalone electronic hardware component, or, as shown in the example, can be a heat-generating element 154, a heat sink 158, and a thermal interface 156 between the heat-generating element 154 and the heat sink 158. The heat-generating element 154 can be any electronic hardware that generates heat and will benefit from cooling, such as a processor die. The heat sink 158 can be any structure that facilitates the transfer of heat to the portion 152 of the local cooling medium passing over the load, such as fins, pins, or a cold plate. Although only one load 150 is shown in unit 141 in the example, each unit 141 may contain multiple loads 150.
[0047] The portion 152 of the local cooling medium that has passed over the load exits unit 141 as an outlet flow 142A. Outlet flow 142A connects to the hot side 132A of the local medium loop 103 and returns to the zone cooler 130. The zone cooler 130 uses cooling elements 134 to transfer the heat from the heated local cooling medium returned from the hot side 132A of the local medium loop 103 to the process water. The process water heated by the cooling elements 134 of the zone cooler 130 travels along the hot side 122A of the process water loop 102 to the heat exchanger 118 and the chiller 120. Both the heat exchanger 118 and the chiller 120 transfer heat from the hot process water received from the hot side 122A of the process water loop 102 to the plant water. The plant water heated by the heat exchanger 118 and the chiller 120 travels along the hot side 112A of the plant water loop 101 to an external heat exchanger, such as a cooling tower 114, which transfers the heat from the plant water out of the building and into the environment surrounding the data center. Therefore, the thermal control system 100 uses a chain of cooling operations to collect heat from several individual loads 150 and transport the heat out of the building. This chain of operations ultimately relies on an external heat exchanger (i.e., cooling tower 114 in the example shown) to remove the heat from the building and thereby create the capacity to remove heat from the loads 150.
[0048] Because cooling tower 114, or any other external heat exchanger that can be used to transfer heat from the plant water out of the building, relies on the external environment as the cooling medium, the temperature difference between the hot side 112A and the cold side 112B of plant water loop 101 will vary with the weather. Therefore, if the heat generated by the data center remains approximately constant, the temperature of the cold side 112B of plant water loop 101 will decrease as the weather gets colder. In turn, if the heat generated by the data center remains approximately constant as the cold side 112B of plant water loop 101 cools, heat can be transferred more efficiently from process water loop 102 to plant water loop 101. That is, as the cold side 112B of plant water loop 101 cools, more heat will be transferred from process water to plant water at heat exchanger 118 with almost no energy cost, so the process water downstream of heat exchanger 118 and upstream of chiller 120 will also cool. The temperatures of both the process water and the plant water flowing into the chiller 120 affect the amount of power the chiller 120 requires to bring the cold side 122B of the process water loop 102 down to a given target temperature. As the temperature difference between the process water immediately upstream of the chiller 120 and the target temperature decreases, the chiller 120 will require less power to reduce the process water to the target temperature, and the chiller 120 will operate more efficiently as the plant water immediately upstream of the chiller 120 cools down. Due to the aforementioned interactions, when the weather outside the data center is cold, the cold side 122B of the process water loop 102 can be brought to a lower temperature with little or no additional energy cost. Lowering the temperature of the cold side 122B of the process water loop 102 can enable the zone cooler 130 to bring the cold side 132B of its corresponding local medium loop 103 to its normal temperature more efficiently, or to bring that cold side 132B below its normal temperature, thereby improving the operating efficiency and lifespan of the cooled hardware represented by the load 150.
[0049] As shown in Figures 2A and 2B, the controller 110 can be given instructions 117 or operated manually to take advantage of colder weather by adjusting the target temperature 210 over time, wherein the controller 110 is configured to operate the chiller 120 as needed to maintain the cold side 122B of the process water loop 102 at or below the target temperature 210. In Figures 2A and 2B, a year is divided into intervals 202, and these intervals 202 are grouped into hot seasons 204 and cold seasons 208, which together constitute a full year. Hot season 204 includes intervals 202 during which the average external temperature 207 exceeds a threshold temperature 206, while cold season 208 includes intervals 202 during which the average external temperature 207 is below the threshold temperature 206. The threshold temperature 206 represents a temperature at which the target temperature 210 can be reduced to below the upper limit temperature 216 without unacceptable energy costs.
[0050] During the hot season 204, the target temperature 210 remains constant at an upper limit temperature 216, which is the temperature at which an acceptable compromise is reached between the data center's cooling needs and the cost of cooling process water during the expected hottest weather at the data center location. The average external temperature 207 can be derived from historical weather data from previous years' intervals 202. While the hot season 204 and the cold season 208 consist of continuous intervals 202 in the illustrated example, seasons 204 and 208 can include non-continuous intervals, depending on the region. Intervals 202 can have any length, such as months, days, or the time between samples at the sampling rate of a real-time weather monitoring system. In various embodiments, temperature data 207 and any other temperature measurements of conditions outside the data center can be pure temperature measurements, i.e., dry-bulb temperature (“DBT”); DBT measurements combined with any one or any combination of humidity, wind chill, and cloud cover measurements; wet-bulb temperature (“WBT”); or wet-bulb globe temperature (“WBGT”).
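The grouping of intervals 202 into hot season 204 and cold season 208 amounts to comparing each interval's average external temperature 207 against the threshold temperature 206. A minimal Python sketch follows. The monthly averages and the 15 °C threshold are made-up illustration values, and treating an average exactly equal to the threshold as "cold" is an assumption, since the text does not address that case.

```python
from typing import List, Sequence


def classify_intervals(avg_temps_c: Sequence[float], threshold_c: float) -> List[str]:
    """Label each interval "hot" if its historical average external
    temperature exceeds the threshold, otherwise "cold"."""
    return ["hot" if t > threshold_c else "cold" for t in avg_temps_c]


# Hypothetical monthly averages (deg C) for one site, January to December.
monthly_avg = [2, 4, 8, 13, 18, 23, 26, 25, 20, 14, 8, 3]
print(classify_intervals(monthly_avg, threshold_c=15))
```

Because each interval is labeled independently, the same function also covers regions where a season is made up of non-continuous intervals.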
[0051] exist Figure 2A and Figure 2B Of the two, the target temperature 210 varies in response to weather changes during the cold season 208, such as by making the target temperature 210 a non-piecewise function of a selected form of weather temperature measurement. Figure 2A In the example, based on historical weather data from the data center location, the target temperature 210 varies over sub-interval 212 during the cold season 208. While sub-interval 212 is half the length of interval 202 in the illustrated example, sub-interval 212 in other examples can have any length, including lengths greater than interval 202. Figure 2B In the example, the target temperature 210 changes continuously throughout the cold season 208 in response to real-time temperature measurements or weather forecasts. In any case, when the target temperature 210 is made to change in response to current, predicted, or historical weather, rather than remaining constant at the upper limit temperature 216, the target temperature 210 can be set in various examples to maintain a constant difference between the external temperature and the target temperature 210, or to be set as the lowest temperature of the cold side 122B of the process water loop 102 that is expected to be maintained without exceeding acceptable energy costs.
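The "constant difference" option for setting target temperature 210 can be sketched in Python as below. All numeric defaults are hypothetical. The lower bound `lower_limit_c` is an added assumption (for example, to stay clear of freezing or condensation limits) that the disclosure does not mention.

```python
def target_temperature(outside_c: float,
                       threshold_c: float = 15.0,
                       upper_limit_c: float = 18.0,
                       offset_c: float = 6.0,
                       lower_limit_c: float = 8.0) -> float:
    """Process water target temperature (deg C) for a given external temperature.

    At or above the threshold, the target stays at the upper limit. Below
    it, the target tracks the external temperature plus a constant offset,
    clamped between `lower_limit_c` and `upper_limit_c`.
    """
    if outside_c >= threshold_c:
        return upper_limit_c
    candidate = outside_c + offset_c
    return max(lower_limit_c, min(upper_limit_c, candidate))


# Warm day, mild day, and very cold day.
print(target_temperature(25.0), target_temperature(10.0), target_temperature(-5.0))
```

Calling the function with historical interval averages gives the stepped behavior of Figure 2A, while calling it with live measurements or forecasts gives the continuous behavior of Figure 2B.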
[0052] As an alternative to the predetermined, discrete hot season 204 and cold season 208 of Figure 2A and Figure 2B, whether the target temperature 210 remains at the upper limit temperature 216 or varies with the weather can be determined on an ongoing basis in response to weather forecasts or real-time temperature measurements. This determination can be made continuously or at any repeating time interval. Therefore, in the absence of predefined seasons 204 and 208, the target temperature 210 can be allowed to rise to the upper limit temperature 216 or fall below the upper limit temperature 216 daily, or even at different times of the day, in response to weather changes that cause the external temperature to be above or below the threshold temperature 206.
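The weather-responsive determination described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name, the parameter names, and the choice of the constant-offset rule for temperatures below the threshold are hypothetical, and the numeric values in the usage lines are arbitrary.

```python
def target_temperature(external_temp, threshold_temp, upper_limit_temp, offset):
    """Return a process-water target temperature (hypothetical helper).

    At or above the threshold temperature the target stays at the upper
    limit. Below it, the target tracks the external temperature at a
    constant offset (an assumed rule), capped at the upper limit.
    """
    if external_temp >= threshold_temp:
        return upper_limit_temp
    return min(upper_limit_temp, external_temp + offset)


# Arbitrary illustrative values in degrees Celsius.
hot_day = target_temperature(30.0, threshold_temp=15.0, upper_limit_temp=24.0, offset=8.0)
cold_day = target_temperature(5.0, threshold_temp=15.0, upper_limit_temp=24.0, offset=8.0)
```

Calling such a function at each sampling instant, rather than once per season, corresponds to the case in which no predefined seasons are used.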
[0053] Because more heat leaves the building through cooling tower 114 during colder weather, the cold side 122B of the process water loop 102 can occasionally become colder with little or no additional energy cost by lowering the target temperature 210 below the upper limit temperature 216 when colder weather is predicted or detected as described in any of the examples above. The cooler process water, in turn, makes zone cooling more efficient, so seasonal or real-time weather-based adjustments to the target temperature 210 can be used to improve the overall energy efficiency of the thermal control system 100, to cool electronic hardware in the data center to lower temperatures during periods of cold weather, or both. Cooling the electronic hardware to lower temperatures (even occasionally) will lower the hardware's average operating temperature over the year, which tends to increase the hardware's lifespan and operational efficiency.
[0054] The weather data 207, measurements, and weather forecasts described above with respect to Figure 2A and Figure 2B may have any degree of geographic specificity. For example, the weather data 207, measurements, and weather forecasts may be derived individually or in any combination from measurements taken on-site at a data center, local weather records, or regional weather reports.
[0055] Alternatively or additionally, commands 117 may be provided to controller 110, or manual operation may be performed, to adjust zone cooling according to how fully a zone is filled, as shown in Figure 3. A zone can be, for example, any space used to store a set of electronic hardware or hardware containers, such as rows of housings 140 in a server hall, where the stored hardware is cooled by a local cooling medium supplied by a single zone cooler 130 or a set of discrete zone coolers 130. For example, a server hall can be filled with rows of housings 140, all in the form of racks or server cabinets, configured to draw in a cooling medium in the form of cold air on one side and exhaust hot air from the opposite side. The housings 140 in each row can face the same direction, and the rows can face alternating directions. Thus, except for the rows at the ends of the server hall, each row will have an intake side facing the adjacent row across an aisle of relatively cold air, and an exhaust side facing the adjacent row across an aisle of relatively hot air. The aisle of relatively cold air can be referred to as a cold aisle, and the aisle of relatively hot air can be referred to as a hot aisle. In a server hall configured in this manner, air in each hot aisle is considered at least part of the hot side 132A of the local medium loop 103 and is therefore drawn from the hall and directed to one or more fan-type zone coolers 130 with cooling coils. The fan-type zone coolers 130 with cooling coils then create the cold side 132B of the local medium loop 103 by cooling the air and blowing the cooled air into one or more cold aisles. A server hall configured in this manner can have multiple zone coolers 130 distributed around the hall, and thus zones within the hall can be identified by dividing the hall into groups of rows, or groups of sections of rows, that primarily receive cool air from shared zone coolers 130.
[0056] Cooling of zones that are not fully filled with electronic hardware can be adjusted to take advantage of the lower heat load generated in the zone by reducing the temperature of the local cooling medium supplied to that zone, increasing the flow rate of the local cooling medium circulating through the zone, or both. Because of the lower heat load, any of these measures can be implemented at relatively low additional energy cost. For example, as shown in Figure 3, when a zone of the cold-aisle type described above is filled to less than a threshold amount 306 (which can be any amount less than 100% of the maximum amount of electronic hardware the zone is designed to accommodate), the zone cooler 130 responsible for that zone can increase its fan speed 310 and decrease its local cooling medium supply temperature 320 from their normal setpoints by an amount that is positively correlated with the difference between the threshold amount 306 and the actual fill amount of the zone. Supplying the local cooling medium at a higher flow rate and a lower temperature allows the electronic hardware to be effectively cooled to a lower temperature. The same principle can be applied to the storage and cooling of other types of electronic hardware. For example, where the local cooling medium is a fluid that is pumped, the speed at which the fluid is pumped through the local medium loop 103 of the zone can be increased when the zone is filled to less than the threshold amount 306. These adjustments can be applied as an initially empty zone is gradually filled with hardware, meaning that the flow rate and temperature of the local cooling medium in that zone will approach their final setpoints as the zone gets closer to being filled.
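The fill-based adjustment described above can be illustrated with a short Python sketch. The text only requires the adjustment to be positively correlated with the shortfall below the threshold amount 306; the linear relationship, the function name, the gain parameters, and the numeric values below are assumptions chosen for illustration.

```python
def zone_setpoints(fill_fraction, threshold_fill, normal_fan_speed,
                   normal_supply_temp, fan_gain, temp_gain):
    """Return (fan_speed, supply_temp) for a zone cooler (hypothetical helper).

    fill_fraction and threshold_fill are fractions of the zone's design
    capacity. Below the threshold, fan speed rises and supply temperature
    falls in proportion to the shortfall (an assumed linear rule). At or
    above the threshold, the normal setpoints apply.
    """
    shortfall = max(0.0, threshold_fill - fill_fraction)
    fan_speed = normal_fan_speed + fan_gain * shortfall
    supply_temp = normal_supply_temp - temp_gain * shortfall
    return fan_speed, supply_temp


# Arbitrary illustrative values: fan speed in percent, temperature in degrees Celsius.
half_full = zone_setpoints(0.5, 0.75, 60.0, 22.0, fan_gain=40.0, temp_gain=8.0)
nearly_full = zone_setpoints(0.9, 0.75, 60.0, 22.0, fan_gain=40.0, temp_gain=8.0)
```

Because the shortfall shrinks as hardware is added, both values returned by the sketch converge on the normal setpoints as the zone fills.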
[0057] Figures 4A to 4C are each graphs showing the relationship between the thermal margin TM and the inlet temperature TInlet for one of the piecewise functions shown in Figures 5A to 5C. Figure 4A shows the result of the function of Figure 5A, Figure 4B shows the result of the function of Figure 5B, and Figure 4C shows the result of the function of Figure 5C. Each of the functions of Figures 5A to 5C can be provided as instructions to a controller (such as instructions 117 for controller 110) to govern the thermal margin TM that a container (such as unit 141) will maintain for the electronic hardware (such as element 154) stored therein.
[0058] In each of the functions of Figures 5A to 5C, the thermal margin TM is defined as the difference between the hardware's predetermined maximum tolerable operating temperature and the hardware's actual operating temperature. Therefore, as the thermal margin TM of a given hardware component increases, the actual operating temperature of that component decreases. The maximum tolerable operating temperature can be set, for example, by the hardware manufacturer, an independent testing organization, or the owner of a specific unit of the hardware. The inlet temperature TInlet is the temperature of the inlet flow (such as inlet flow 142B) of the local cooling medium into the container, which can be measured using a thermometer at the container inlet (such as inlet thermometer 146B).
[0059] Each piecewise function is divided into a cold region 405 below a threshold temperature 406 and a hot region 407 above the threshold temperature 406. The threshold temperature 406 can be any temperature chosen as the point at which the treatment of the thermal margin TM changes.
[0060] In the piecewise function of Figure 5A, the thermal margin TM remains constant at a lower limit margin in the hot region 407. The lower limit margin is a margin that strikes an acceptable balance between providing favorable operating conditions for the cooled hardware and the energy required to cool the hardware at the highest inlet temperature TInlet expected in the relevant region (which, for the function of Figure 5A, is the hot region 407). The function of Figure 5A therefore applies a constant subfunction, equal to the lower limit margin, in the hot region 407. In the cold region 405, the function of Figure 5A applies a subfunction that creates a direct correlation between the thermal margin TM and the absolute value of the difference between the threshold temperature 406 and the inlet temperature TInlet. In the example of Figure 5A, the subfunction applied in the cold region 405 is a geometric function in which the absolute value of the difference between the threshold temperature 406 and the inlet temperature TInlet is multiplied by a constant C and the resulting product is added to the lower limit margin. However, piecewise functions according to other examples can keep the thermal margin TM constant at the lower limit margin in the hot region 407 while applying any type of function, including non-geometric functions, to create a direct correlation between the thermal margin TM and the absolute value of the difference between the threshold temperature 406 and the inlet temperature TInlet in the cold region 405.
[0061] When the inlet temperature TInlet exceeds the threshold temperature 406, the thermal margin TM is kept constant at the lower limit margin. As the inlet temperature TInlet falls further below the threshold temperature 406, the thermal margin TM increases, allowing the cooled hardware to have a lower operating temperature at times when the local cooling medium supply becomes cold (such as during winter or when local weather is cold), while maintaining energy efficiency when the local cooling medium supply is relatively warm. Therefore, compared to hardware that is always held at the lower limit margin, a piecewise function with this characteristic (such as the piecewise function of Figure 5A) can reduce the lifetime average operating temperature of adaptively cooled hardware at very little energy cost.
[0062] Figure 5B and Figure 5C show additional piecewise functions that create a direct correlation between the thermal margin TM and the absolute value of the difference between the threshold temperature 406 and the inlet temperature TInlet in the hot region 407. The piecewise function of Figure 5B holds the thermal margin TM at the lower limit margin in the cold region 405, while the piecewise function of Figure 5C also creates a direct correlation between the thermal margin TM and the absolute value of the difference between the threshold temperature 406 and the inlet temperature TInlet in the cold region 405. Therefore, the function of Figure 5C can also be expressed as the non-piecewise function TM = |Threshold - TInlet| * C + Floor, where Threshold is the threshold temperature 406 and Floor is the lower limit margin. Increasing the thermal margin TM in the hot region 407 can offset the detrimental effects on cooled electronic hardware of higher ambient temperatures, which tend to occur together with higher inlet temperatures TInlet. The piecewise functions of Figure 5B and Figure 5C use geometric subfunctions to create their direct correlations, but in other examples, non-geometric subfunctions can be used to create a direct correlation between the thermal margin TM and the absolute value of the difference between the threshold temperature 406 and the inlet temperature TInlet on one or both sides of the threshold temperature 406.
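For illustration, the three functions of Figures 5A to 5C can be written as a single Python sketch built on the expression TM = |Threshold - TInlet| * C + Floor given above. The function name, the `variant` argument, and the numeric values are hypothetical, and the handling of an inlet temperature exactly at the threshold is immaterial because all three variants return the lower limit margin there.

```python
def thermal_margin(t_inlet, threshold, floor, c, variant):
    """Target thermal margin TM for the functions of Figures 5A to 5C.

    variant "5A": TM rises above the floor only in the cold region.
    variant "5B": TM rises above the floor only in the hot region.
    variant "5C": TM rises above the floor on both sides of the threshold.
    """
    distance = abs(threshold - t_inlet)
    in_cold_region = t_inlet < threshold
    if variant == "5A":
        return floor + c * distance if in_cold_region else floor
    if variant == "5B":
        return floor if in_cold_region else floor + c * distance
    if variant == "5C":
        return floor + c * distance
    raise ValueError(f"unknown variant: {variant}")


# Arbitrary illustrative values: threshold 25, floor 10, C = 0.5.
cold_5a = thermal_margin(20.0, 25.0, 10.0, 0.5, "5A")
hot_5a = thermal_margin(30.0, 25.0, 10.0, 0.5, "5A")
```

With these values, an inlet temperature five degrees below the threshold yields a margin above the floor under variant 5A, while an inlet temperature five degrees above the threshold leaves the margin at the floor.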
[0063] Maintaining at least a minimum temperature difference between the hot and cold sides of a local medium loop can benefit the efficiency of cooling and circulating the local cooling medium. Therefore, in some examples, the controller responsible for the container cooling system can be instructed to execute a function that seeks the highest possible thermal margin TM while holding the temperature difference DT between the hot and cold sides at, or at least at, a target difference. Depending on the example, such a function can be used in place of any of the piecewise functions described above, or in addition to them, in which case the piecewise function is overridden once the temperature difference DT falls to the target difference.
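One way to picture a function that seeks the highest thermal margin TM subject to a floor on the temperature difference DT is an incremental search over flow rate, sketched below in Python. The step-based approach, the function name, and all parameters are assumptions made for illustration; the sketch also assumes that a higher flow rate raises the thermal margin TM and lowers the temperature difference DT.

```python
def next_flow_rate(current_flow, delta_t, target_delta_t, step, min_flow, max_flow):
    """Return the flow rate for the next control step (hypothetical helper).

    Flow is raised while the measured hot-side/cold-side difference DT is
    above the target (room remains to gain thermal margin), lowered when DT
    has fallen below the target, and held when DT equals the target.
    """
    if delta_t > target_delta_t:
        proposed = current_flow + step
    elif delta_t < target_delta_t:
        proposed = current_flow - step
    else:
        proposed = current_flow
    # Keep the command within the driver's assumed operating range.
    return max(min_flow, min(max_flow, proposed))


# Arbitrary illustrative values: flow in percent, temperatures in kelvin.
raised = next_flow_rate(50.0, delta_t=12.0, target_delta_t=10.0, step=5.0,
                        min_flow=20.0, max_flow=100.0)
```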
[0064] The thermal margin TM and the temperature difference DT both depend on the rate of flow of the local cooling medium through the container. As the temperature difference between the hardware and the medium increases, the rate of heat transfer from the cooled hardware to the local cooling medium increases. Therefore, as the inlet temperature TInlet decreases, the ratio of the temperature difference DT to the rate of flow of the local cooling medium through the container increases. Likewise, as the inlet temperature TInlet decreases, the ratio of the thermal margin TM to the rate of flow of the local cooling medium through the container also increases. Consequently, when the inlet temperature TInlet is relatively low, the flow rate of the local cooling medium through the container can be driven higher without causing the temperature difference DT to fall below the target difference. Whenever the actual inlet temperature TInlet drops below an inlet threshold temperature, the fan or other hardware responsible for driving the local cooling medium through the container can therefore be controlled to increase the flow rate of the local cooling medium in proportion to the difference between the inlet threshold temperature and the actual inlet temperature TInlet, in order to maintain a constant temperature difference DT while creating a larger thermal margin TM.
[0065] Figure 6 shows a container cooling system 400 that can be configured to create and maintain a thermal margin TM according to any of the functions described above with respect to Figures 4A to 5C. The container cooling system 400 is configured to monitor the thermal margins of multiple electronic hardware components stored in a container (such as unit 141). The container cooling system 400 includes multiple margin proportional-integral-derivative (PID) controllers 460, each of which receives from a digital controller 410 a corresponding thermal margin TM to target for a specific component of the electronic hardware stored in the container. If the container cooling system 400 is integrated into a thermal control system similar to the thermal control system 100, the digital controller 410 may be the controller 110 in electronic communication 4111 with the margin PIDs 460. The margin PIDs 460 receive input from thermometers 446 that measure at least the operating temperatures of the electronic hardware in the container and the inlet temperature TInlet of the local cooling medium flowing into the container. Each margin PID 460 receives a temperature from a thermometer that measures the operating temperature of a different corresponding electronic hardware component in the container. The margin PIDs 460 may optionally receive the inlet temperature TInlet from a shared inlet thermometer (such as inlet thermometer 146B) or from different corresponding inlet thermometers. The margin PIDs 460 may also optionally receive an outlet temperature from one or more outlet thermometers (such as outlet thermometer 146A) on the outlet flow (such as outlet flow 142A) of the local cooling medium from the container.
[0066] Each margin PID 460 uses temperature measurements received from the thermometers 446 to determine whether the flow rate of the local cooling medium through the container should be increased or decreased to achieve the corresponding desired thermal margin TM in the respective electronic hardware component within the container. The resulting output 462 from each margin PID 460 is sent to the decision controller 464. The decision controller 464 considers the outputs 462 and sends a single speed command to the driver PID 461. The driver PID 461 controls the driver 448 according to the speed command received from the decision controller 464. The driver 448 is any mechanism used to drive the local cooling medium through the container, such as a fan, pump, or any other type of mechanism that can drive the cooling medium through a space. The driver 448 can be the same as the driver 448.
[0067] According to various examples, the decision controller 464 can be configured to select the lowest received output 462 or the highest received output 462 to send as a single speed command to the driver PID 461. In other examples, the decision controller 464 can be configured to find a compromise among the received outputs 462 (e.g., the arithmetic mean of the fan speeds requested by the received outputs 462) and send that compromise speed as a command to the driver PID 461. The decision controller 464 can be any device capable of performing whichever of the foregoing logical functions is configured in a given embodiment, such as an integrated circuit or a programmable logic controller (PLC).
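The selection logic of the decision controller 464 can be sketched in Python as follows. This is an illustrative reduction of several requested speeds to one command; the function name and the `policy` strings are hypothetical, and the requested speeds are arbitrary values.

```python
from statistics import mean


def decide_speed(outputs, policy):
    """Reduce margin PID outputs to a single speed command (hypothetical helper).

    outputs is a non-empty sequence of requested driver speeds, one per
    margin PID. policy selects the lowest request, the highest request,
    or the arithmetic mean of all requests.
    """
    if not outputs:
        raise ValueError("at least one margin PID output is required")
    if policy == "lowest":
        return min(outputs)
    if policy == "highest":
        return max(outputs)
    if policy == "mean":
        return mean(outputs)
    raise ValueError(f"unknown policy: {policy}")


# Arbitrary requested fan speeds, in percent, from three margin PIDs.
requests = [40.0, 55.0, 70.0]
command = decide_speed(requests, "highest")
```

Selecting the highest request satisfies every component's target margin at the cost of overcooling some components, whereas the lowest or mean request trades margin on the hottest component for lower driver energy.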
[0068] Although the illustrated example shows two margin PIDs 460, the container cooling system 400 can be configured with any number of margin PIDs 460 to receive measurements from the thermometers 446 and instructions from the digital controller 410, and to send outputs 462 to the decision controller 464. In a given embodiment of the container cooling system 400, the number of margin PIDs 460 can be equal to the number of electronic hardware components in the container that are to be independently monitored in determining the speed of the driver 448. Therefore, the container cooling system 400 can include any plurality of margin PIDs 460. In other examples, the container cooling system 400 may include only a single margin PID 460. In such examples, the single margin PID 460 can send its output 462 directly to the driver PID 461 without passing the output 462 through the decision controller 464, or the margin PID 460 and the driver PID 461 can be integrated into a single PID.
[0069] Figures 7A to 7C show adaptive thermal margin profiles 517, 527, 537, and 547 for an installation of multiple electronic hardware components. In each of Figures 7A to 7C, the profiles 517, 527, 537, and 547 are shown against a typical early failure phase 501, stable phase 502, and wear-out phase 503. These phases can be derived from historical data for hardware of the same type as the adaptively cooled hardware, or predicted for hardware without such historical data. The early failure phase 501, stable phase 502, and wear-out phase 503 are parts of the “bathtub curve” frequently observed in the failure rates of mass-produced items. The early failure phase 501 is the phase following hardware installation, during which failures due to manufacturing defects are expected. The stable phase 502 follows the early failure phase 501. During the stable phase 502, few failures are expected because most defective parts have already failed, while the inevitable use-related wear-out failures have not yet begun. The wear-out phase 503 follows the stable phase 502. Throughout the wear-out phase 503, the failure rate climbs until all parts have failed due to use-related degradation. The failure rate of a set of hardware will typically decrease during the early failure phase 501, remain relatively stable during the stable phase 502, and then increase throughout the wear-out phase 503. Therefore, the transition from the early failure phase 501 to the stable phase 502 can be defined as the earliest time after installation at which the hardware failure rate decreases at less than a predetermined rate. Similarly, the transition from the stable phase 502 to the wear-out phase 503 can be defined as the earliest time after installation at which the hardware failure rate increases at more than a predetermined rate. Any values can be used for the predetermined decrease rate and the predetermined increase rate that mark these transitions.
Specific examples of predetermined increase rates or predetermined decrease rates that can be used to mark transitions between phases include 1% per day, 2% per day, 3% per day, 4% per day, 5% per day, 10% per day, 15% per day, 20% per day, and 25% per day.
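The transition definitions above can be illustrated with a short Python sketch that scans a series of observed failure rates. The function name, the use of a relative change per interval as the "rate", and the sample data are assumptions made for illustration only.

```python
def phase_transitions(failure_rates, decrease_rate, increase_rate):
    """Return (stable_start, wear_out_start) as indices into failure_rates.

    failure_rates[i] is the failure rate observed in interval i after
    installation. The stable phase starts at the first interval whose
    relative decrease is smaller than decrease_rate. The wear-out phase
    starts at the first later interval whose relative increase exceeds
    increase_rate. Either index is None if the transition is not found.
    """
    stable_start = None
    wear_out_start = None
    for i in range(1, len(failure_rates)):
        previous = failure_rates[i - 1]
        # Relative change per interval; treated as zero if the previous rate is zero.
        change = (failure_rates[i] - previous) / previous if previous else 0.0
        if stable_start is None:
            if -change < decrease_rate:
                stable_start = i
        elif wear_out_start is None and change > increase_rate:
            wear_out_start = i
    return stable_start, wear_out_start


# Made-up failure rates per interval, with 5% thresholds for both transitions.
rates = [10.0, 8.0, 6.4, 6.3, 6.3, 6.4, 7.2, 8.5]
transitions = phase_transitions(rates, decrease_rate=0.05, increase_rate=0.05)
```

In this made-up series, the rate falls by 20% in each of the first two intervals, flattens at the fourth sample, and begins climbing by more than 5% at the seventh sample.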
[0070] Each of Figures 7A to 7C illustrates an example of how the thermal margin lower limit for a group of components in a single installation can be adjusted over time to alter the failure rate profile of that group of components. In Figure 7A, the thermal margin lower limit 517 for a group of components is lower during a portion of the early failure phase 501 than at any point during the stable phase 502. The resulting accelerated early failure rate is visible in the relative height of the leftmost portion of the adjusted failure rate curve 515. Reducing the thermal margin lower limit 517 for a group of hardware causes the anticipated early failures to occur sooner, resulting in an adjusted early failure phase 511 that is shorter than the early failure phase 501 that would be observed if the thermal margin lower limit 517 remained constant throughout the lifetime of the group of hardware. Therefore, the adjusted stable phase 512 following the adjusted early failure phase 511 arrives earlier than the typical stable phase 502, and the adjusted wear-out phase 513 following the adjusted stable phase 512 arrives earlier than the typical wear-out phase 503. Thus, in applications where reliable operation of the entire group of hardware is required, reducing the thermal margin lower limit 517 early in the lifetime of the group of hardware can be useful, as defective parts can be identified and replaced early in that lifetime.
[0071] In various examples, the amount of time for which the thermal margin lower limit 517 is kept low after the installation of the group of hardware can be a predetermined amount of time, or it can be an amount of time corresponding to the actual or predicted transition from the typical early failure phase 501 to the typical stable phase 502, or from the adjusted early failure phase 511 to the adjusted stable phase 512. As noted above, these transitions can be marked by the earliest time after the installation of the group of hardware at which the typical, actual, or predicted failure rate decreases at less than a predetermined rate.
[0072] As shown in Figure 7B, the thermal margin lower limit 527 can instead be increased during the adjusted early failure phase 521 to moderate the failure rate 525 during that phase, resulting in the adjusted early failure phase 521 lasting longer than the typical early failure phase 501. Consequently, the adjusted stable phase 522 arrives later than the typical stable phase 502, and the adjusted wear-out phase 523 arrives later than the typical wear-out phase 503. Therefore, increasing the thermal margin lower limit 527 early in the lifetime of the group of components can help reduce the peak demand for hardware replacement expected during the typical early failure phase 501.
[0073] In various examples, the amount of time for which the thermal margin lower limit 527 is kept high after the installation of the group of hardware can be a predetermined amount of time, or it can be an amount of time corresponding to the actual or predicted transition from the typical early failure phase 501 to the typical stable phase 502, or from the adjusted early failure phase 521 to the adjusted stable phase 522. As noted above, these transitions can be marked by the earliest time after the installation of the group of hardware at which the typical, actual, or predicted failure rate decreases at less than a predetermined rate.
[0074] As shown in Figure 7C, by increasing the thermal margin lower limit 537 before the expected arrival of the typical wear-out phase 503, the adjusted stable phase 532 can be extended beyond the typical stable phase 502. By maintaining the elevated thermal margin lower limit 537 during the adjusted wear-out phase 533, the failure rate 535 during the adjusted wear-out phase 533 can be kept relatively low. Therefore, raising the thermal margin lower limit 537 before the expected arrival of the typical wear-out phase 503 can postpone the need for hardware replacement and slow the rate at which hardware must be replaced after wear-out begins. The adjustment of Figure 7C can be applied together with the adjustment of Figure 7A or Figure 7B. According to various examples, after a relatively high or relatively low thermal margin lower limit is applied during at least a portion of the typical early failure phase 501, the thermal margin lower limit may be increased before the expected arrival of the typical wear-out phase 503, the earlier adjusted wear-out phase 513, or the later adjusted wear-out phase 523.
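The lifetime adjustments of Figures 7A to 7C can be summarized as a schedule for the thermal margin lower limit, sketched below in Python. The function name, the parameters, the `lead_time` mechanism for raising the limit ahead of wear-out, and the numeric values are hypothetical. Ages and transition times are in arbitrary but consistent units.

```python
def margin_floor(age, stable_start, wear_out_start,
                 early_floor, stable_floor, wear_out_floor, lead_time=0):
    """Return the thermal margin lower limit for hardware of a given age.

    early_floor below stable_floor corresponds to Figure 7A, and above it
    to Figure 7B. wear_out_floor above stable_floor, applied lead_time
    before the expected wear-out transition, corresponds to Figure 7C.
    """
    if age < stable_start:
        return early_floor
    if age >= wear_out_start - lead_time:
        return wear_out_floor
    return stable_floor


# Arbitrary example combining Figures 7A and 7C: ages in days, margins in kelvin.
new_hardware = margin_floor(10, 90, 1500, 5.0, 10.0, 15.0, lead_time=100)
aging_hardware = margin_floor(1450, 90, 1500, 5.0, 10.0, 15.0, lead_time=100)
```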
[0075] Although the concepts described herein are illustrated with reference to specific examples, it should be understood that these examples are merely illustrative of the principles and applications of the concepts. Therefore, it should be understood that many modifications can be made to the illustrative examples, and other arrangements can be designed, without departing from the spirit and scope of the concepts as defined by the appended claims.
Claims
1. A data center thermal control system for cooling a group of electronic components, the data center thermal control system comprising one or more processors and a non-transitory computer-readable medium storing instructions, the instructions being configured, when executed by the one or more processors, to cause the one or more processors to control the data center thermal control system to cool the group of electronic components to a different thermal margin lower limit during at least a portion of an early failure phase than during a stable phase, wherein: the early failure phase is a window after installation of the electronic components during which, based on historical failure data, the expected failure rate of the electronic components decreases at at least a first rate; the stable phase is a window following the early failure phase during which, based on the historical failure data, the failure rate of the electronic components decreases at less than the first rate and increases at less than a second rate; the thermal margin is the difference between a predetermined temperature and the actual operating temperature of the electronic components; and the historical failure data is derived from failures observed, prior to the installation, in electronic devices of the same type as the electronic components.
2. The data center thermal control system according to claim 1, wherein the instructions, when executed by the one or more processors, cause the one or more processors to control the data center thermal control system to cool the group of electronic components to a stable thermal margin lower limit during the stable phase, and to cool the group of electronic components to an early thermal margin lower limit during the early failure phase, wherein the early thermal margin lower limit is less than the stable thermal margin lower limit.
3. The data center thermal control system according to claim 2, wherein the instructions, when executed by the one or more processors, cause the one or more processors to control the data center thermal control system to cool the group of electronic components to the early thermal margin lower limit from the installation of the group of electronic components until an adjusted stabilization transition time, and to cool the group of electronic components to at least the stable thermal margin lower limit starting at the adjusted stabilization transition time, wherein the adjusted stabilization transition time is the earliest time after the installation of the group of electronic components at which the actual failure rate of the electronic components in the group is expected to decrease at less than a predetermined rate.
4. The data center thermal control system according to claim 1, wherein the thermal margin lower limit is lower during at least a portion of the early failure phase than at any point during the stable phase.
5. The data center thermal control system according to claim 4, wherein the amount of time for which the thermal margin lower limit is kept low during at least a portion of the early failure phase is a predetermined amount of time, or an amount of time corresponding to the actual or predicted transition from a typical early failure phase to a typical stable phase or from an adjusted early failure phase to an adjusted stable phase.
6. The data center thermal control system according to claim 1, wherein the thermal margin lower limit is higher during at least a portion of the early failure phase than at any point during the stable phase.
7. The data center thermal control system according to claim 6, wherein the amount of time for which the thermal margin lower limit is kept elevated during at least a portion of the early failure phase is a predetermined amount of time, or an amount of time corresponding to the actual or predicted transition from a typical early failure phase to a typical stable phase or from an adjusted early failure phase to an adjusted stable phase.
8. The data center thermal control system according to claim 1, wherein the thermal margin lower limit is higher during at least a portion of a wear-out phase than at any point during the stable phase.
9. The data center thermal control system according to claim 3, wherein the predetermined rate is a predetermined increase rate or a predetermined decrease rate that marks a transition between the early failure phase, the stable phase, and a wear-out phase.
10. The data center thermal control system according to claim 9, wherein the predetermined rates include: 1% per day, 2% per day, 3% per day, 4% per day, 5% per day, 10% per day, 15% per day, 20% per day, and 25% per day.