What is Liquid Cooling and How Can It Help Data Centers?
AI and machine learning technologies are placing increasing demands on data centers. These technologies, which have now infiltrated nearly every aspect of the digital ecosystem, are energy-intensive and thus generate enormous amounts of heat. It is crucial that this heat be removed from server equipment: heat is the primary cause of equipment failure.
While air cooling has been the standard due to its relative flexibility and ease of use, it is failing to keep up with the heat output in modern server farms. Prior to the advent of more computation-intensive technologies, most server racks peaked at 20kW. Now, many approach or exceed 30kW, and racks built around graphics processing units (GPUs), which power AI and machine learning workloads, can exceed 40kW.
Air cooling simply cannot keep up with the heat generated by that level of power use. Whereas previously racks containing denser, more power-intensive servers could be dispersed throughout the data center so that heat could be efficiently distributed and removed, now these power-intensive servers are becoming the norm.
Thus, liquid cooling solutions increasingly appear to be a necessity in data centers that support more energy-intensive technologies. A recent survey found that while racks exceeding 30kW were relatively uncommon, around one in nine providers was using some form of liquid cooling in these cases.
Liquids have a higher heat capacity and can thus absorb heat more efficiently than air does. Water, for example, can absorb heat some 30 times faster than air due to its greater thermal conductivity. Because of this, effective cooling can be maintained using fluids that are closer in temperature to the components that produce the heat. Air, on the other hand, must be chilled before passing over the components to remove heat, creating an additional energy cost. Removing air conditioning components can improve energy efficiency by up to 90%.
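To put rough numbers on that comparison, the short Python sketch below estimates how much heat a liter of water versus a liter of air can carry per degree of temperature rise. The property values are typical room-temperature textbook figures, assumed here purely for illustration rather than taken from any particular facility.

```python
# Back-of-the-envelope comparison of how much heat a given volume of
# water vs. air can carry per degree of temperature rise.
# Property values are typical room-temperature figures, used only for illustration.

WATER_DENSITY = 997.0   # kg/m^3
WATER_CP = 4184.0       # J/(kg*K), specific heat capacity
AIR_DENSITY = 1.2       # kg/m^3
AIR_CP = 1005.0         # J/(kg*K)

def heat_per_kelvin(volume_m3: float, density: float, cp: float) -> float:
    """Heat (joules) absorbed by `volume_m3` of fluid for a 1 K temperature rise."""
    return volume_m3 * density * cp

one_liter = 0.001  # m^3
water_j = heat_per_kelvin(one_liter, WATER_DENSITY, WATER_CP)
air_j = heat_per_kelvin(one_liter, AIR_DENSITY, AIR_CP)

print(f"1 L of water absorbs {water_j:,.0f} J per kelvin")
print(f"1 L of air absorbs   {air_j:,.1f} J per kelvin")
print(f"Ratio: roughly {water_j / air_j:,.0f}x in favor of water")
```

By volume, water comes out thousands of times more capable, which is why even modest coolant flow rates can handle rack-level heat loads that would require enormous volumes of chilled air.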
Reductions in the energy used to cool equipment may result in both cost savings and more sustainable operations. Some 40-50% of data center costs are attributable to cooling. Equipment usually needs to be kept below around 85°C to avoid failure, though certain components have lower thresholds, and failure rates start to increase at around 75°C. Given the growing pressure on edge-computing systems, liquid cooling may even be an appropriate solution outside of data center environments.
Liquid cooling is not actually a new innovation. Immersion cooling was first used to remove heat from transformers in the 1800s. And other forms of liquid cooling were deployed in computing contexts in the 1960s. Other forms of liquid cooling have everyday applications, such as car radiators. Many desktop PCs employ liquid cooling for some of their components.
Continuing refinements have made liquid cooling more feasible for IT equipment. Three main liquid cooling technologies have emerged: cold-plate cooling, immersion cooling, and spray cooling. Their approaches vary: cold-plate and spray cooling target specific components, while immersion cooling places all components in contact with fluid. The coolant may be water, a combination of water and other fluids such as glycol, or oil.
All three technologies allow for greater density of servers due to their more efficient removal of heat.
Here, InformationWeek investigates the complexities of liquid cooling, with insights from two technologists: Joe Capes, CEO of Liquid Stack, a provider of liquid cooling technologies; and Arno van Gennip, VP of global IBX operations engineering at Equinix, a data center management provider.
“Nobody could have forecasted that liquid cooling is scaling this quickly or at this magnitude,” Capes says.
Types of Liquid Cooling
“At the moment there are several technologies available to provide liquid cooling. An enormous amount of development is going on in the industry and there is no standard available now,” van Gennip says. “Liquid cooling can mean different things to different organizations using one or a combination of cooling technologies.”
1. Cold-plate liquid cooling
Cold-plate technology is currently the most popular due to its compact nature, which is conducive to the creation of hybrid air and liquid-cooled systems.
In this setup, devices containing circulating liquid are attached directly to heat-generating components such as dual in-line memory modules (DIMMs), CPUs, and GPUs. This is known as direct-to-chip cooling.
“There’s been a huge momentum toward liquid cooling — specifically direct to chip,” Capes says.
The liquid may remain a liquid throughout the process (single phase) or become a gas (two phase) and then recondense. These plates, which are constructed from materials with high thermal conductivity, circulate the liquid to and from the components, carrying heat away, allowing it to dissipate, and then recirculating the cooled liquid.
Cold-plate technologies may use water alone or a combination of water and glycol. The heated liquid sheds its heat either through a liquid-to-liquid exchange or through a liquid-to-air exchange. These components link to an external cooling distribution unit (CDU), which may be local to a single rack or linked to multiple server racks.
The thermal resistance, roughly the temperature difference between the chip and the coolant for each watt of heat removed, can be reduced by increasing the rate of liquid flowing through the cold plate. Raising the fluid’s temperature also lowers its viscosity, which substantially reduces pressure drop and leads to cost savings because less power is required to pump the liquid.
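As a rough illustration of how these quantities interact, the sketch below sizes the coolant flow for a single cold plate and computes a simple thermal resistance figure. The chip power, temperatures, and water-glycol properties are hypothetical values chosen for the example, not specifications of any real product.

```python
# Minimal sketch of cold-plate sizing: how much water-glycol flow is needed to
# carry away a given chip power for a chosen coolant temperature rise.
# All numbers are illustrative assumptions, not figures from any vendor.

CP_COOLANT = 3800.0   # J/(kg*K), roughly a water/glycol mix
DENSITY = 1030.0      # kg/m^3

def required_flow_lpm(chip_power_w: float, coolant_delta_t: float) -> float:
    """Volumetric flow (liters per minute) needed so the coolant warms by `coolant_delta_t` K."""
    mass_flow = chip_power_w / (CP_COOLANT * coolant_delta_t)   # kg/s, from Q = m_dot * cp * dT
    return mass_flow / DENSITY * 1000.0 * 60.0                  # convert m^3/s -> L/min

def thermal_resistance(chip_temp_c: float, inlet_temp_c: float, chip_power_w: float) -> float:
    """Overall thermal resistance in K/W: temperature difference per watt removed."""
    return (chip_temp_c - inlet_temp_c) / chip_power_w

# A hypothetical 700 W accelerator, with the coolant allowed to warm by 10 K:
print(f"Flow needed: {required_flow_lpm(700, 10):.2f} L/min")
# Doubling the flow halves the coolant temperature rise for the same power:
print(f"Flow for a 5 K rise: {required_flow_lpm(700, 5):.2f} L/min")
print(f"Resistance at 85 C chip / 45 C inlet: {thermal_resistance(85, 45, 700):.3f} K/W")
```

The trade-off the paragraph describes falls out of the arithmetic: more flow or a hotter allowable coolant means either lower chip temperatures or less pumping power for the same heat load.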
As much as 75% of the heat may be removed by these systems, but air cooling is still required to remove the remaining portion.
“It’s really a hybrid liquid cooling approach, using a combination of liquid cooling at the chip but also removing the rest of the heat with air,” Capes says. “Part of the reason it’s being adopted now is that it’s probably the easiest solution to retrofit in an existing data center.”
Backups must be installed to address potential failure of the CDUs or other system components.
This technology can also be implemented on a larger scale with rear-door heat exchangers. The doors of the server racks contain liquid cooling elements, and the racks’ hot exhaust air is blown over them, allowing the liquid to absorb heat and carry it out of the structure.
“Technically, a rear-door heat exchanger (RDHx) isn’t true liquid cooling because the chip at the server level is still air cooled, but this approach does bring liquid closer to the rack to harness greater air-cooling effectiveness and is a great first step for many companies,” van Gennip says.
2. Immersion cooling
Immersion cooling is a far more dramatic innovation than cold-plate cooling. While these systems are receiving renewed attention, the technology is actually rather old.
Immersion cooling systems date to the 1880s, following discoveries by Michael Faraday. A patent for immersion cooling of electronic components was filed in 1966, and IBM further developed the technology with the patenting of an “immersion cooling system for modularly packaged components” in 1968. They were first commercialized on a broad scale in 2009-2010.
In immersion cooling systems, IT components are partially or fully immersed in hydrocarbon or fluorocarbon fluids that absorb heat. The extensive contact between the fluid and the components improves the efficiency of heat removal: thermal resistance is reduced.
Fluids used in immersion cooling are typically non-toxic and non-flammable, though they may be combustible, meaning that they can catch fire under certain conditions but typically not at ambient temperatures. Mineral oil and coconut oil are often used.
Immersion cooling may be single-phase or two-phase, as in cold-plate cooling. In single-phase immersion cooling, the liquid in which servers are immersed is circulated, removing heat and then returning cooled liquid to the bath. The fluid may be mechanically circulated or rely on convection created by the heat.
In two-phase immersion cooling, the liquid in which the servers are immersed is boiled as the temperature increases.
“The hardware is immersed in a dielectric bath. The heat boils off as a vapor. We simply recondense the vapor back to a liquid and the whole process just keeps happening over and over and over again without consuming any energy because it’s just a passive system,” Capes explains.
A condensing element, typically cooled by water, is suspended above the bath containing the servers. This element absorbs the heat, allowing the vaporized coolant to condense and thus drip back into the bath. Fluids used in two-phase immersion cooling, such as fluorocarbons, are more expensive. New fluids with lower boiling points are being developed, making for a more efficient process.
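For a sense of scale, the sketch below estimates how much dielectric fluid must boil off, and be recondensed, each second to absorb a given rack load. The latent heat figure is an assumed order-of-magnitude value, not the specification of any particular coolant.

```python
# Rough sketch of two-phase immersion heat removal: for a given rack power,
# how much dielectric fluid must boil off (and be recondensed) per second.
# The latent heat value is an assumed ballpark figure, not a specific product spec.

LATENT_HEAT_J_PER_KG = 100_000.0   # ~100 kJ/kg, an assumed order of magnitude for dielectric coolants

def vapor_mass_flow_kg_s(rack_power_w: float) -> float:
    """Mass of fluid vaporized per second to absorb `rack_power_w` watts as latent heat."""
    return rack_power_w / LATENT_HEAT_J_PER_KG

for rack_kw in (30, 50, 100):
    flow = vapor_mass_flow_kg_s(rack_kw * 1000)
    print(f"{rack_kw} kW rack -> about {flow:.2f} kg of vapor per second to recondense")
```

Because the heat is absorbed as latent heat of vaporization rather than by pumping a large volume of liquid, the bath itself stays near the fluid's boiling point and the condenser does the work of rejecting the heat.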
“While immersion cooling can allow organizations to achieve high power densities within the data center, it also requires the most substantial changes to server technology and data center architecture. Because it’s such a radical departure from traditional methods of deploying IT equipment, immersion cooling can often have substantial upfront costs and considerations,” van Gennip claims.
Nonetheless, immersion cooling is likely to become a $1.6 billion market by 2027.
3. Spray cooling
Of the three liquid cooling types, spray cooling has been the least explored. In this system, a coolant is sprayed onto heated server components or onto a plate that is in contact with those components. Jet impingement is a related form of cooling in which fluids are ejected at high velocity onto heated components. As the liquid evaporates, heat is removed. Liquids used may be hydrofluorocarbons, water or methanol.
In direct systems, the components must be sealed inside a chamber so that vaporized coolants can be recovered and reused. In indirect systems, the spray functions occur in smaller sealed chambers where the atomized liquid contacts cooling plates.
Lower levels of fluid are required than in immersion cooling. However, the potential for nozzles to clog and the challenges of swapping out equipment without interruption to overall cooling have so far prevented the wide adoption of spray cooling technology.
Energy Savings
“As we see the amount of power used by new processors steadily increasing, air cooling will not be sufficient for certain applications. Hence there is a re-introduction of liquid cooling in the industry. From a sustainability point of view this is excellent news as it opens the way to designs with cooling installations without chillers,” says van Gennip.
Because they often require less mechanical circulation, liquid cooling systems offer substantial cost savings. A comparison of a liquid immersion cooling system and a hybrid air-liquid cooling system found that the liquid system was 88% more energy efficient. Another study indicates that implementation of liquid cooling may improve energy efficiency by as much as 45%.
Other research on liquid cooling suggests the potential for energy savings as well. An investigation of spray cooling, for example, found that spray cooling could reduce energy consumption by more than a quarter. And direct-to-chip cooling can take on 75% of the cooling load in hybrid air-liquid systems.
NVIDIA and Vertiv concluded that liquid cooling resulted in a 15% reduction in total usage effectiveness (TUE) and a 10.2% reduction in total data center power use.
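Percentages like these are easier to interpret alongside a simple power usage effectiveness (PUE) calculation. The sketch below uses hypothetical round numbers, not figures from the NVIDIA and Vertiv work or any of the studies above, to show how shifting cooling load from chillers and fans to a liquid loop lowers both PUE and total facility power.

```python
# Illustrative power usage effectiveness (PUE) calculation showing how shifting
# cooling load from chillers and air handlers to a liquid loop reduces total power.
# All wattages are hypothetical round numbers, not measurements from any study.

def pue(it_kw: float, cooling_kw: float, overhead_kw: float) -> float:
    """PUE = total facility power / IT power."""
    return (it_kw + cooling_kw + overhead_kw) / it_kw

IT_LOAD = 1000.0          # kW of IT equipment
OVERHEAD = 80.0           # kW of lighting, UPS and distribution losses, etc. (assumed)
AIR_COOLING = 400.0       # kW for chillers and air handlers (assumed)
LIQUID_COOLING = 150.0    # kW for pumps, CDUs and dry coolers (assumed)

air_total = IT_LOAD + AIR_COOLING + OVERHEAD
liquid_total = IT_LOAD + LIQUID_COOLING + OVERHEAD

print(f"Air-cooled PUE:    {pue(IT_LOAD, AIR_COOLING, OVERHEAD):.2f}")
print(f"Liquid-cooled PUE: {pue(IT_LOAD, LIQUID_COOLING, OVERHEAD):.2f}")
print(f"Total facility power reduction: about {(1 - liquid_total / air_total) * 100:.1f}%")
```

With these assumed loads, PUE drops from roughly 1.5 to the low 1.2s, which is the same kind of shift the industry figures describe: the IT work done stays constant while the overhead spent moving and chilling air largely disappears.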
Additionally, the fact that fans are not needed in liquid cooling systems makes them far quieter than air-cooled systems, reducing noise pollution. Servers can also be more densely packed than in air-cooled systems, making for more efficient use of space and further reducing energy costs associated with operations in larger buildings. Implementation of immersion cooling can reduce required space by 75%, for example.
In immersion systems in particular, risks posed by dust, humidity, and vibration are also reduced substantially, though fluids may need to be replenished or filtered occasionally depending on their composition. Most systems do not require toxic refrigerants. Increasing the temperature of systems that use water can actually make them more energy efficient: minor losses in heat absorption are more than offset by the reduced need to chill the water.
And heat removal is far more consistent and uniform, leading to lower levels of equipment failure and consequent maintenance and replacement. Hotspots in server equipment are a persistent problem and all forms of liquid cooling appear to be superior to air cooling in mitigating or eliminating them.
Liquid cooling also expands the possibilities for locating data centers. While air cooling systems are often best suited to temperate climates, where ambient air temperatures reduce the cost of cooling, liquid-cooled centers can be located almost anywhere. Thus, countries where data centers have not historically been an option now see significant opportunity for economic advancement by soliciting the establishment of new liquid cooling-based operations.
Liquid cooling systems also offer greater opportunities for heat recovery than air-cooled systems. Air-cooled systems operate at lower temperatures and thus the waste heat produced is of lower value for other uses. It typically ranges between 30-45°C. Liquid cooling systems produce waste heat that may be up to 80°C.
“We can use much higher entering and leaving water temperatures. The waste heat from the liquid loop is a much higher value than the waste heat of an air-cooled data center,” Capes says.
This heat could be used for heating buildings or local communities or for powering biomass digestion facilities, for example.
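One way to see why the temperature of the waste heat matters is to compare the ideal heat-pump performance needed to lift it to a district-heating loop. The sketch below uses illustrative temperatures and the Carnot limit, which real heat pumps only approach; it is a reasoning aid, not a design calculation.

```python
# Sketch of why hotter waste heat is more valuable: the ideal (Carnot) coefficient
# of performance for a heat pump boosting waste heat up to a 70 C district-heating loop.
# Temperatures are illustrative; real heat pumps achieve only a fraction of the Carnot limit.

def carnot_cop_heating(source_c: float, sink_c: float) -> float:
    """Ideal heating COP when pumping heat from `source_c` up to `sink_c` (degrees Celsius)."""
    source_k = source_c + 273.15
    sink_k = sink_c + 273.15
    return sink_k / (sink_k - source_k)

for source in (35.0, 60.0):
    print(f"Waste heat at {source:.0f} C -> ideal COP {carnot_cop_heating(source, 70.0):.1f}")
print("Waste heat already at 80 C can feed a 70 C loop directly, with no heat pump needed.")
```

The closer the waste heat starts to the temperature a district-heating network needs, the less additional energy must be spent upgrading it, which is the advantage Capes and van Gennip describe.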
“Liquid cooling allows us to preserve heat more effectively in the facility loop or export heat to the community where it is feasible to be used for other purposes where heat is an asset,” van Gennip adds.
Challenges to Liquid Cooling
Conversion of existing data centers to support liquid cooling technology may present one of the most exigent challenges to its wider implementation. The pipe systems that supply server rooms will need to be retrofitted in some cases, depending on the technology employed. And especially in cases where data centers are converted to immersion cooling systems, the load bearing capabilities of the infrastructure will need to be assessed.
“Your weights are different, and they’re spread out over different dimensions,” Capes says.
Assessing whether existing server components are compatible with liquid cooling, particularly in immersion and spray cooling, is also essential.
“Though direct-to-chip fits in a standard footprint, it still requires architectural changes and additional equipment to deliver liquid to the cabinet and distribute it to the individual servers — typically more so than with RDHx but less than immersion cooling,” van Gennip says.
“What you’re going to see is a new build microcosm within an existing data center, because the pods that are running generative AI workloads are going to have to have higher-powered power distribution units (PDUs), higher-power breakers, and higher-power low- and medium-voltage switchgear,” Capes says. “Everything has to be upsized. That still presents a challenge.”
Monitoring components for corrosion and other potential deleterious effects from contact with cooling liquids is essential. Fluids must be tested for contaminants that may interfere with their functioning or the functioning of the equipment. Incursion of particulate matter may make some coolants more viscous and thus negatively impact their ability to transfer heat, for example. Materials from the equipment itself, including metals and sealants, may also leach into the fluid. And bacterial growth may be a concern especially in water-based systems.
Filtration systems that address some of these issues have been developed. Leakage must also be intensively monitored due to its potential for damaging sensitive equipment components.
And in some cases, especially in two-phase cooling, fluids must be periodically replenished. Access to properly formulated fluids and specific parts for these systems may become an issue until demand increases.
Ease of access to regularly repaired server components must be taken into consideration, too. Cold plates that serve those components and nearby ones must be easily detachable without interrupting the entire system.
All of the processes involved require specially trained personnel, which constitutes an additional operating cost.
The conversion to liquid cooling could have happened more quickly, but Capes thinks that the proprietary concerns of chip manufacturers have been an impediment.
“Chip manufacturers have always kept things really close to the vest. Even they have held on to air cooling as long as they could. When you have a cash cow, you don’t want to disrupt it,” he says. “The inflection point is that the application is driving the scale up of liquid cooling.”