SC24 – Bursting at the Seams

SC24 took place in Atlanta, GA, November 17-22. As I write this, 17,959 attendees had registered, more than 3,000 over last year. More than 500 companies filled the exhibition floor, a net increase of 136 companies over last year. Both attendance and corporate presence set records.

I can’t possibly present a complete round-up or summary of SC24, but you can find other recaps – written, audio, and video – that cover a lot more detail than I can here. What I can do is present some things I found interesting. If you feel I’ve missed something important, please let me know. Perhaps you can write something about your thoughts.

One of my favorite things to do is attend the Beowulf Bash (BeoBash). It is the only open party during the conference, so a wide range of people attend. Alas, my stamina gave out this year, but about 1,300 people attended, which is also a new record. For an “open” party, the list of sponsors on the home page is amazing. The power of community is noticeable. Thanks, Lara, Michael, and Doug. They, and others, are the unsung heroes who put this party together.

SCinet and Power

One thing I learned this year is that SCinet, which provides Internet access during the conference, uses quite a bit of power. I don’t know how SCinet started, but it demonstrates what it takes to build and run the most powerful and advanced network anywhere, including powering and cooling it, monitoring it, fixing issues, and supporting users. A large group of volunteers built and ran this global collaboration, along with a respectable number of companies that donated hardware and allowed their employees to volunteer.

While attending SC24, I learned that SCinet draws roughly 1MW of power. After a little Google search, I discovered that’s enough power for about 1,000 American homes. This number led me to think about liquid-cooled networking. The power used by a single node today is quite high, forcing systems to be actively cooled, which much of the time means cooling with some sort of liquid, although other options are possible. Large multirail systems need quite a bit of networking equipment, especially for artificial intelligence (AI) and deep learning (DL) applications, where network operation is a key driver of performance. Higher clock speed network chips, with the associated increase in power, point toward a need for liquid cooling.
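
As a rough sanity check on the homes comparison, here is the back-of-the-envelope arithmetic. The average household draw of about 1.2kW is my own assumed round number, not a figure from SCinet, so treat the result as an order-of-magnitude estimate:

```python
# Back-of-the-envelope check: how many average US homes does ~1MW cover?
# The ~1.2 kW average household draw is an assumed round number, not an
# official figure, so the answer is only an order-of-magnitude estimate.
scinet_power_kw = 1_000.0   # ~1 MW drawn by SCinet
avg_home_draw_kw = 1.2      # assumed average continuous draw per US home

homes_supported = scinet_power_kw / avg_home_draw_kw
print(f"~{homes_supported:.0f} homes")  # roughly 800-1,000 homes
```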

What’s going to make this “fun” is that you will probably have networking cables mixed with cooling tubes coming out of the front or back of the racks. How these will be organized will be interesting and probably the key to making them usable.

Overall, given the SCinet example of 1MW, I am led to wonder when liquid-cooled network equipment will be needed. In my mind, the answer is “soon.” I think in the next year you will see an experiment or two around liquid-cooled networking.

HPCG and Fugaku

The TOP500 project is built around a single test: High Performance Linpack (HPL). The High Performance Conjugate Gradient (HPCG), while also a single benchmark, has several components that stress different aspects of the system than HPL does. The HPCG conjugate gradient code produces several measures of performance that are combined into the final HPCG result (reported in trillions of floating-point operations per second; TFLOPS). Its design focuses on the data access patterns of algorithms (e.g., sparse matrix computations) whose performance is driven by memory bandwidth (irregular memory access) and interconnect performance.
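
To make the memory-bandwidth point concrete, here is a minimal sketch, written by me and not taken from the HPCG reference code, of the sparse matrix-vector product at the heart of any conjugate gradient solver. Each nonzero requires index lookups and irregular reads for very little arithmetic, which is why this kernel is limited by memory bandwidth rather than peak FLOPS:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Minimal sketch of the sparse matrix-vector product (SpMV) that dominates
# conjugate gradient solvers such as HPCG. This is NOT the HPCG reference
# code, just an illustration of its access pattern.
n = 100_000
A = sparse_random(n, n, density=1e-4, format="csr", dtype=np.float64)
x = np.random.rand(n)

# Each nonzero contributes one multiply-add but needs an index lookup and an
# irregular read of x, so many bytes move per FLOP: the operation is bound by
# memory bandwidth, not by the processor's peak floating-point rate.
y = A @ x
```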

This year’s number 1 for the HPCG Benchmark is Fugaku, which has been at the top of the list since June 2020. In the high-performance computing (HPC) world, being number 1 leads to the use of terms such as “dominant” when discussing the system. Fugaku was number 1 on the HPL list for a while and is number 6 today, but it has been number 1 on the HPCG list for four years. Let me illustrate why it is so dominant.

Fugaku debuted at number 1 on the June 2020 HPCG list at 13,400 TFLOPS; the previous number 1 system, Summit, had achieved only 2,925.75 TFLOPS. Fugaku’s result went up a little on the next list (November 2020), to almost 5.5 times Summit’s score, and Fugaku kept the top spot on the TOP500 until 2022, when Frontier became the number 1 system on the HPL list. However, Fugaku was still number 1 on the HPCG list, with Frontier close behind: Frontier achieved 14,054 TFLOPS, whereas Fugaku recorded an HPCG performance of 16,004.5 TFLOPS.
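
For the record, here is the arithmetic behind those comparisons, using only the HPCG figures quoted above:

```python
# Ratios from the HPCG figures quoted above (all in TFLOPS).
fugaku_hpcg = 16_004.5    # Fugaku
summit_hpcg = 2_925.75    # Summit
frontier_hpcg = 14_054.0  # Frontier

print(fugaku_hpcg / summit_hpcg)    # ~5.5x Summit
print(fugaku_hpcg / frontier_hpcg)  # ~1.14x Frontier
```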

On the latest TOP500 list (November 2024), El Capitan became number 1, Frontier moved to number 2, and Fugaku slipped to number 6 but remained number 1 on the HPCG list. Note that El Capitan did not appear on the HPCG list (perhaps not yet).

Taking all this into account, especially considering the large increases in performance for the number 1 system on the TOP500, Fugaku remains a very strong HPC system and dominant in HPCG. Although I am not discussing the Green500, Fugaku is currently number 86 on the list and started at number 9 in June 2020 when it joined the lists. A ranking of 86 sounds low, but this is a four-year-old system that is still doing amazing research.

Although you can argue about the usefulness of HPCG as a system benchmark, when you see such great HPCG performance alongside very good HPL performance, it is difficult to fault Fugaku. Perhaps new systems are aimed at the HPL benchmark and not HPCG, which could explain why Fugaku is still number 1 on HPCG. However, these new systems also might simply not be as capable as Fugaku on the HPCG test. Inquiring minds want to know. I think it would be advantageous to study the design of Fugaku relative to current designs, because HPC and AI algorithms love memory bandwidth.

System Power and Cooling

It bears repeating that power and cooling is no longer a secondary consideration; it is now a primary one, along with processing capability, memory, and networking. The exhibit floor at SC24 proved this beyond a shadow of a doubt. The show floor had a very large number of companies in this space, showing and discussing what they could do to help the HPC and AI industries reach their targets: a massive increase in computing power (resources) with the best possible energy efficiency and as little additional cost as possible.

Several companies showed a variety of liquid cooling options. In the past, companies offered liquid-cooled doors, but there has been a shift to direct cooling of the processors and perhaps other components in a server (e.g., memory). Typically, targeted components have heat transfer plates directly attached, through which a liquid runs to remove the heat. The heat is then transferred to a heat exchanger and dumped somewhere (usually outside the data center). Although it sounds simple, in practice it is not.
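
As a rough illustration of the plumbing behind those cold plates, the coolant flow you need follows from the heat load, the fluid’s specific heat, and the temperature rise you allow across the plate. The numbers below (a 1kW device, water coolant, a 10°C rise) are assumptions I picked for illustration, not any vendor’s specification:

```python
# Rough sizing of coolant flow for a direct liquid-cooled component.
# Q = m_dot * cp * delta_T  =>  m_dot = Q / (cp * delta_T)
# The heat load and temperature rise below are illustrative assumptions.
heat_load_w = 1_000.0  # assume a 1 kW device (e.g., a high-end GPU module)
cp_water = 4186.0      # specific heat of water, J/(kg*K)
delta_t = 10.0         # assumed coolant temperature rise across the plate, K

m_dot = heat_load_w / (cp_water * delta_t)  # water mass flow, kg/s
liters_per_min = m_dot * 60.0               # ~1 kg of water per liter
print(f"{liters_per_min:.1f} L/min")        # roughly 1.4 L/min per kW
```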

Several companies showed direct liquid-cooled server designs that are available now. The key is to make sure that your total data center design, including not just servers but also networking and the physical building, is ready for these servers. Therefore, power and cooling experts are a huge part of any new system discussion.

Other companies displayed immersion cooling. These solutions submerge the entire server, without a case, in a vat of liquid. This cooling technique has been around for a while. The Texas Advanced Computing Center (TACC) experimented with this method and called it the “chicken fryer” because the immersion fluid smelled like fried chicken. I think today’s solutions have gone beyond this example, with several being promoted on the floor.

I have to mention one vendor whose name shocked many people: Valvoline. They’ve been dealing with automotive oil and cooling for many years, including high-performance racing. Now they are bringing that knowledge to data center cooling, which tells you how important cooling has become. They also won my most-annoying-marketing award for the most email received before I blocked them.

Reduced Precision

On Bluesky, the new social media hangout for HPC, Si Hammond from the National Energy Research Scientific Computing Center (NERSC) posted an image (Figure 1) he took during a talk at the 15th International Workshop on Performance Modeling, Benchmarking, and Simulation (PMBS24) titled “System-Wide Roofline Profiling – a Case Study on NERSC’s Perlmutter Supercomputer.”

Figure 1: FLOP types at NERSC (cropped image from PMBS24 presentation by Si Hammond; used with permission).

Notice that although FP64 and FP64 Tensor dominate the overall floating-point operations on Perlmutter GPUs, FP32 usage is about 36% of the total. This result surprises me because the broad domain of simulation-based HPC applications still seems to be heavily based on FP64. The presenter points out that about half of the FP64 operations are Tensor Core based, but no other precisions use the Tensor Cores.

Overall, I think this result shows that researchers are using lower precision, perhaps by itself or in combination with higher precision in portions of the application. Moreover, they are using Tensor Cores, which indicates they are taking advantage of the hardware (which they should).
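
A common pattern behind numbers like these is mixed precision with iterative refinement: do the heavy work in a lower precision (where Tensor Cores shine) and recover accuracy with a cheap higher-precision correction. The sketch below is only a CPU/NumPy illustration of that idea, not a claim about how the Perlmutter codes in the study are written:

```python
import numpy as np

# Sketch of mixed-precision iterative refinement: solve in FP32, then
# correct the answer using an FP64 residual. CPU/NumPy illustration only.
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned system
b = rng.standard_normal(n)

# Initial solve entirely in FP32.
x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)

for _ in range(3):
    r = b - A @ x  # residual computed in FP64
    dx = np.linalg.solve(A.astype(np.float32), r.astype(np.float32))
    x += dx.astype(np.float64)

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # small relative residual
```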

Other Topics

Almost everyone has picked up on the big topics from SC24, such as a new number 1 on the TOP500, the rise of AI, and the increasing importance of power and cooling, and a great deal has already been written about them. I also want to look at topics I think are important but that slip under the radar, such as Open OnDemand and MATLAB. Open OnDemand is a rapidly growing HPC and AI tool: a web-based portal that allows users to access HPC resources, which means you can get a compute node in a cluster, get a desktop running on that node, and then run interactive applications such as Jupyter notebooks or MATLAB.

I started noticing Open OnDemand a few years ago and found it to be an amazing tool. It can be argued that using a cluster node with high-speed networking and storage for interactive applications underutilizes the resource, but the counterargument is that sometimes users need a very performant single node for interactive work, and creating a heterogeneous cluster to accommodate them leads to complications and a reduced number of high-performance compute nodes. Moreover, if a user needs a single high-performance node, a high-end workstation might not have enough compute resources, and they might not have a place to plug it in at their office or lab (too much power).

Caution is still needed when, for example, a user gets an entire node with eight GPUs, lots of CPU cores, terabytes of memory, and lots of local solid-state drive (SSD) storage and then uses just one GPU with maybe 16 cores (or worse, one core). This configuration might be important to the user, but that much underutilization might be too much to accept. Of course, you could allow multiple users to share the same node, so resources don’t go to waste. The moral is to consider the use case carefully, because it is becoming very common.

In one of the more recent presentations I’ve seen on Open OnDemand, the presenter explained how they used the Tesla parked in their home driveway for a remote login to the cluster and then used Open OnDemand to get an interactive desktop on a compute node, displayed on the screen in their Tesla. Of course, it was a stunt, but it showed that Open OnDemand is very flexible and easy to use once configured.

Mike Croucher of MathWorks posted during SC24 that many HPC sites are using MATLAB through Open OnDemand; on Bluesky, he posted a picture of MATLAB running on a University of Utah cluster. MATLAB is an extraordinarily popular tool in science and engineering. I’ve run Open OnDemand on my small home lab clusters, and it works just great. It’s particularly popular with MATLAB users because they can get access to a very powerful single node when their laptop or desktop doesn’t have the resources they need.

The combination of MATLAB, one of the most powerful science and engineering tools, with single servers that have many GPUs (more than four), terabytes of memory, and lots of local SSD storage, all made available by Open OnDemand, can take researchers to the next level beyond a small laptop or even a desktop. You really can’t take such a server and plug it in where a user sits, or even in a lab. With Open OnDemand and powerful, easy-to-use tools such as MATLAB, you can use multiple CPUs and GPUs and move up the food chain without being condemned to the end of the “long tail of science.”
