Update on Containers in HPC

Observations on the recent HPC Containers Survey.

In this article, I review the results of the HPC Container Community Survey, which comes from the Containers Working Group (CWG), whose charter describes the members and mission as: “container enthusiasts that want to discuss the needs of the HPC community and how they can be integrated into existing or new Open Container Initiative (OCI) projects, or perhaps CNCF [Cloud Native Computing Foundation] projects.”

The CWG is apparently part of the OCI, which was established primarily by Docker around 2015. However, the CWG, and the HPC container community in particular, isn’t limited to Docker solutions.

To review, a container is an isolated userspace that “contains” an application and all of its dependencies, which usually include system tools, libraries, and settings, but not a complete operating system (that would be a virtual machine, VM). Containers require a host operating system.

Containers have two parts: the container image and the container runtime. The image is what people think of when discussing containers, whereas the runtime is the key to turning an image into a running container. You need both.
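As a minimal illustration of that split (my own example, not from the survey), the snippet below assumes the Docker SDK for Python and a local Docker daemon: pulling the image is one step; asking the runtime to start a container from that image is another:

# Minimal sketch, assuming the Docker SDK for Python (pip install docker)
# and a running Docker daemon on the host.
import docker

client = docker.from_env()

# The image: a static bundle of the application and its dependencies
client.images.pull("ubuntu:22.04")

# The runtime: turns that image into a running, isolated process
output = client.containers.run("ubuntu:22.04", "cat /etc/os-release", remove=True)
print(output.decode())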

The most common question asked about containers was what you might think: “Why Containers?” I have my reasons, but I looked around for what others had to say, and I found an article that highlighted some generic points not geared just to high-performance computing (HPC) but that I think still apply to HPC to varying degrees:

Scalability: You can use system resources more efficiently because containers are lightweight, quick to deploy, and quick to scale.

Agility: Because of their isolated nature, updates and troubleshooting performed on one container won’t affect applications running in different containers.

Portability: You can run the same application on different operating systems without the need to rewrite portions of code or change the application’s configuration.

Fault tolerance: Containerized applications are resilient, meaning that a sudden failure in one container won’t have an effect on others, ensuring higher availability.

Cloud migration: You can encapsulate your company’s legacy applications in a container and move the applications to the cloud almost effortlessly.

The survey, or questionnaire, that the CWG created was intended to collect user input on the state of containers in HPC. The survey was conducted online, and a summary of the results was published. In this article, I go over the results, adding my observations and opinions to the comments in the survey summary.

Container Technologies

The study considered a number of key container technologies, including:

  • Charliecloud
  • Shifter
  • Apptainer (formerly Singularity)
  • Sarus
  • Podman
  • Docker
  • Enroot/enroot
  • Kubernetes
  • CRI-O, containerd, Kata
  • NVIDIA/enroot, Pyxis
  • Containers/runc
  • Fuzzball
  • K3s/containerd
  • Enroot/Pyxis
  • Rocker in R
  • containerd
  • uDocker
  • podman-hpc
  • pcocc
  • crictl
  • and others

The list is exhaustive, but I think it’s very reasonable.

Study Results

The study asked a series of questions that could be answered with a choice of radio buttons and ended with two questions that had free-entry fields. The results were published online with some brief commentary. I've gone through the input of the 189 respondents and pulled the top three or so from each question. Usually, the top three answers constituted the majority of responses, but not always.

I don’t know the details of how the study was conducted; I assume that a respondent got to “vote” for a specific option, but it looks like they could choose more than one because many of the respondents used more than one container technology.

Even before examining the answers, you need to understand the distribution of respondents. More than 50% were from academia. The next largest group was from national labs. The combination of academia and national labs made up just about 75% of the people responding. The other categories were:

  • Commercial
  • Consulting
  • Private research institutions
  • Government
  • Other

The commercial sector was a bit smaller than the national laboratory group, but after that, the collective number of contributors from the other sectors was small in comparison.

In the question about experience with containers, around 60% said they had intermediate experience, and about 30% said they had expert experience. “Beginner” was relatively small, and the “no experience” sector was very small.

Combining these two questions, the majority of people who responded were from academia, national laboratories, or the commercial sector and had intermediate to expert experience with containers (self-rated). Keep this in mind when reviewing the answers in the next section.

About Container Technologies

One section of the survey asked some detailed questions about the specific container technologies listed previously. As I mentioned earlier, I think the respondents were allowed to select more than one response. The questions are listed below along with the top three answers to each question. Of course, I’ll add my observations, as well.

 

Q: Which container technologies are supported on system(s) you are working on?

  1. Singularity/Apptainer
  2. Docker
  3. Podman

I’m not surprised with the top three results, because they are the top three technologies overall for containers (not just HPC). However, a fairly long “tail” of people use other technologies, such as Charliecloud, Sarus, Enroot/Pyxis, Shifter, uDocker, and Kubernetes. Note that the question is not about which containerization technology they use but which technologies are supported on the systems they are using.

Note that Singularity/Apptainer and Podman are “rootless” technologies, whereas generally, Docker requires root (yes, I’m aware of the rootless Docker capability). This may come into play in later questions.

The next question focused more on HPC container technologies.

 

Q: On those same systems and out of the set above (previous question), which HPC container technologies are you using?

  1. Singularity/Apptainer
  2. Docker
  3. Podman

Singularity/Apptainer got the largest number of votes by far, receiving more than twice as many votes as Docker. This result makes some sense because it was designed for HPC applications and workloads. Docker and Podman had a reasonable number of votes, with a long tail of responses.

The study summary notes something interesting: Only about 50% of the respondents who voted for Podman actually use it.

The next question in the study focuses more on the individual.

 

Q: What container technologies do you use on your local machine(s), personal or for work?

  1. Docker
  2. Singularity/Apptainer
  3. Podman

For this question, Docker ranks as the number 1 technology, with about twice the votes of Singularity/Apptainer, followed by Podman. I think it’s interesting that despite the use of Singularity/Apptainer on HPC systems, Docker is primarily used on the respondents’ local machines. I suppose this result can be attributed to users having root access to their local systems, so they don’t need rootless technologies. However, I’m curious what they do with the Docker containers beyond their local systems. Do they port them to Singularity/Apptainer or just use the Docker containers with Singularity/Apptainer on larger systems?

A related question was asked next.

 

Q: Which HPC container technologies have you not used that you would like to use?

  1. There are no new container runtimes that I want to use
  2. Podman
  3. Charliecloud
  4. Shifter

The responses are very interesting because the number 1 response was no new container runtime technologies. The number 3 and 4 answers received about the same number of votes, each roughly half as many as the number 2 answer, Podman.

I'm not sure what to make of the response to this question, but it seems like most people prefer to stay with what they have or try Podman.

About Images

The next set of questions is around building container images.

 

Q: What specification or recipe do you use to build containers?

  1. Dockerfile
  2. Singularity recipe
  3. From scratch or manually

Personally, I expected Docker and Singularity/Apptainer together to account for the majority of the votes, and this is the case. Dockerfile is the number 1 answer, with about 50% more votes than the Singularity recipe. The reasons behind this result would probably require a deeper dive, but Dockerfile being number 1 among HPC users is still interesting.

The next logical question after which recipe to use is about supporting tools.

 

Q: Do you use any supporting tools to build containers?

  1. Spack
  • EasyBuild
  3. other
  4. HPCCM (HPC Container Maker)
  5. repo2docker

I included all of the answers because, after Spack, the vote counts are fairly close. A number of HPC sites use Spack and EasyBuild to build and install their application packages, and this trend appears to carry over to building containers.

Once the respondents build containers, they need to put them somewhere that users can access. The next question required a simple Yes or No response.

 

Q: Once built, do you tend to push containers to a central registry?

  • Yes – a bit more than 50%
  • No – a bit less than 50%

The response to this question is interesting and a bit puzzling. If the containers are created, how do users access them? Are they left as TAR files in shared storage? Are they created and used by only one user, so they don’t need to be in a central repository? I would like to see a deeper dive – perhaps interviews with respondents – around this question.

The next question was for those respondents who use a central registry.

 

Q: What container registries are you pushing builds to?

  1. Docker Hub
  2. GitLab
  3. GitHub Packages (ghcr.io)

A total of 12 answers were given for this question, but the first three make up a majority of the votes, with a long tail of other options, such as self-hosted or major cloud service providers (CSPs).

Combining the responses to this and the previous question, you can speculate that the participants who don’t use a central registry simply don’t like any of the 12 options. Perhaps they find them difficult to use? Perhaps they don’t want their containers to be available outside of the system? (They could still self-host in some fashion.) Again, I would like more information on the combination of these two questions.

The next question is one of my favorites, so I provide all answers, although the last four responses make up just 13% of the votes.

 

Q: In what context(s) are you using containers?

  1. HPC applications or simulations
  2. Developer environments (local machine)
  3. Kubernetes
  4. Remote developer environments
  5. Cloud or stateless services (e.g., Cloud Run, App Engine, etc.)
  6. HPC Workload managers in Kubernetes
  7. Provisioning nodes
  8. Other
  9. Continuous Integration/Continuous Deployment

The number one answer is, of course, HPC simulations (this is a survey focused on HPC), but the other responses are fun to think about.

The number 2 answer was for developer environments. I like the approach of putting a developer environment in a container image and then customizing it to suit my needs without having the system administrator build something just for me in the system image or on shared storage. I can use that image whenever I want or need it, even on other systems (e.g., my local machine). If I need a container image that was created three years ago, I can most likely run that container on the current system. That’s one of the benefits of containers: I should be able to run that same container image in another three years if I need it.

Other interesting responses to the question were stateless services (why does HPC need these?), HPC Workload managers in Kubernetes, and provisioning nodes.

Free-Form Responses

I’m skipping the next two questions in the survey summary and jumping to the last two questions, which provided space for free-form answers that allowed for a wide range of responses. I went through the answers and created a summary of what I think are the top responses. Below is the list I gathered, but not in any specific order.

 

Q: What are your biggest challenges or pain points when using containers, or reasons that you don’t use them?

  • Better message passing interface (MPI) operations/instructions/integration
  • Better GPU/accelerator integration
  • More compatibility with container and host MPIs and GPU drivers
  • Better security (wider range of specific items)
  • Better ease of use, reduced complexity

First, I have many questions about the use of MPI and GPUs within a container and how they interact with host MPI libraries and GPU drivers. Some people run containers every day with MPI and GPU code. For example, I routinely run code that uses Horovod (a distributed deep learning training framework) with MPI and GPUs. I know these types of containers can work, so I suppose it’s more a question of explaining the details and giving lots of examples of how to create container images and configure the hosts for these two technologies.

Moreover, it’s not just a matter of putting MPI and GPU code into the container image, but of how to execute that image to create a container that can interact with the host(s), including the GPUs, over MPI. Of course, proper host configuration should also be covered, including TCP and InfiniBand protocols, as well as multiple interfaces for MPI functions.

From the responses, I think a solution lies along the lines of better documentation and education rather than the creation of new technology. MPI and GPUs are used in containers all the time, so it clearly works. In my opinion, the best practices for this integration need to be documented in fairly detailed fashion, with lots of examples and good explanations, and communicated to the broader community. Perhaps the starting point would be a table of MPI library version combinations for hosts and containers that work together, as well as documentation of combinations that don’t work. The same approach could be taken for GPUs, with a table of combinations that are known to work.
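As a sketch of what one of those examples might look like, the short recipe below uses HPCCM, a Python-based tool I describe later in this article, to build a container whose OpenMPI build is pinned to match an assumed host installation and that includes CUDA and InfiniBand support. The base image, versions, and building-block options are my assumptions, not recommendations from the survey:

"""Hypothetical HPCCM recipe sketch: match the container MPI/GPU stack to an
assumed host stack (OpenMPI 4.1.5, CUDA 11.8, Mellanox InfiniBand).

Usage:
$ hpccm --recipe mpi_gpu.py --format singularity
"""

# CUDA development base image so the container toolkit lines up with the
# (assumed) host driver generation
Stage0 += baseimage(image='nvidia/cuda:11.8.0-devel-ubuntu22.04')

# GNU compilers
Stage0 += gnu()

# Mellanox OFED user-space libraries so MPI in the container can use the
# host's InfiniBand hardware
Stage0 += mlnx_ofed()

# OpenMPI with CUDA and InfiniBand enabled, pinned to the assumed host version
Stage0 += openmpi(cuda=True, infiniband=True, version='4.1.5')

The point is not the specific versions, but that a recipe like this records, in one place, a combination that is known to work on a given system.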

The second challenge or pain point that I thought was important is security. I’m not enough of a security person to dive into the topic of container security, but I know issues exist in creating and running containers. Something along the lines of a security FAQ needs to be developed, with answers that address common questions about creating and executing containers and about which combinations of technologies are known to be secure (along with a definition of what is meant by “secure”). The FAQ could begin with how to create container images without users requiring elevated privileges or, if elevated privileges are required, how to create an environment in which such privileges would not be an issue (if possible).

The third main topic I took away from the long list of responses is that the building and execution of containers for HPC should be easier, with reduced complexity. Unfortunately, this type of question is very specific and needs a specific response. An aspect that one user thinks is difficult might not be a problem for another user. However, I believe it is possible to make both creating the container image and managing it, along with running the container image, easier to accomplish.

For some time, I have thought that you can use 20% of the features of a tool or library and get 80% of the benefit. An example of this is the OpenMP Common Core document, which describes a few core directives needed to parallelize your code and get reasonably good performance. The rest of the directives, although still very useful, deal with the underlying details that can be used for tuning specific use cases. The same is true for compilers, because you don’t use most of the options, and for libraries (e.g., MPI libraries), because you use just a small number of the MPI standard functions. [Note: At some point in the past, someone started a project to create an MPI subset, but I can’t remember where that was. Perhaps Ames National Laboratory at Iowa State? If you have any pointers or links, please post them somewhere like LinkedIn or Twitter.]

By applying this 80/20 principle, couldn't container developers create “defaults” that encompass 80% of the needed container commands and functions? Everything else would be an optional variable or function. Think of it as the Container Common Core Toolkit. No one really has such a beast.
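As a purely hypothetical sketch of what such a common core might look like, consider a single wrapper function with opinionated defaults (GPU support on, the current working directory bound into the container) that covers the everyday case, with everything else left as optional arguments. The function name and defaults below are my invention, and it simply shells out to Apptainer:

# Hypothetical "Container Common Core" wrapper with opinionated defaults;
# assumes Apptainer is installed on the host.
import os
import subprocess

def run_container(image, command, gpus=True, binds=None, extra_args=None):
    """Run a command in a container image with everyday HPC defaults."""
    args = ["apptainer", "exec"]
    if gpus:
        args.append("--nv")               # enable NVIDIA GPU support
    for path in binds or [os.getcwd()]:
        args += ["--bind", path]          # make host directories visible
    args += (extra_args or []) + [image] + list(command)
    return subprocess.run(args, check=True)

# Everyday case: the defaults handle GPUs and the working directory, e.g.,
# run_container("myapp.sif", ["python3", "train.py"])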

An alternative to this approach, and one that I really like, is the HPCCM toolkit, which allows you to create very short Python scripts from which you can write out Dockerfiles or Singularity recipes. For example, HPCCM can process this simple Python script that creates a container with a base OS, compilers, and OpenMPI:

"""This example demonstrates recipe basics.
 
Usage:
$ hpccm.py --recipe test2.py --format docker
$ hpccm.py --recipe test2.py --format singularity
"""
 
# Choose a base image
Stage0.baseimage('ubuntu:16.04')
 
ospackages = ['make', 'wget', 'bzip2', 'tar']
Stage0 += apt_get(ospackages=ospackages)
 
# Install GNU compilers (upstream)
Stage0 += apt_get(ospackages=['gcc', 'g++', 'gfortran'])
 
Stage0 += openmpi(cuda=False, infiniband=False,
                  prefix='/usr/local/openmpi', version='3.1.0')

Warning: I took this example from an older article on HPCCM, so the versions are all older. From this short snippet of code, HPCCM can create a Dockerfile or a Singularity recipe to build a container. Personally, I think maintaining this short snippet of code is easier than maintaining both a Dockerfile and a Singularity recipe.

 

Q: What features would you like to see in containers?

  1. Better MPI/Process Management Interface – Exascale (PMIx) support
  2. Better security (range of things)
  3. Out-of-the-box InfiniBand, better ID management, and better out-of-the-box GPU support
  4. Docker and identity

As with the previous question, I went through the responses and collected the top answers. However, this question was more difficult to tally because of the diversity of the responses.

If you scan through the responses, the most obvious one is that better (easier?) support for MPI is needed. You can couple this with better out-of-the-box InfiniBand, ID management, and GPU support. I believe this goes along with the issues and challenges listed in the previous question.

I can’t comment on the security aspects of the features. I’m hoping someone with experience in security and HPC can comment on that.

Summary

I thought the study was interesting because it acts like a checkpoint on the HPC container journey. The number and types of questions seemed reasonable and not overlong. From this survey, I think it is safe to say that containers are being used in HPC, which I believe is a good thing.

Generally, people tend to use three container image types: Docker, Singularity/Apptainer, and, perhaps to a lesser degree, Podman, along with their runtimes. That’s not to say that the other options aren’t useful, but the majority of responses seem to circle around those three image types.

Although people use containers every day, they still encounter some challenges or issues with HPC containers in general, such as problems with MPI and GPUs in the container and coordinating with the host.

Also, people have questions about the security of containers and how to manage them securely. I understand their pain because when containers started to be used for HPC, security was a question at the top of my mind, and in some cases, I could find no solutions to improve security.

Finally, people are looking for new tools or features that allow them to build, manage, store in repositories, and maintain containers.

All three categories of issues can be addressed with better and more documentation from the community, disseminated more broadly. These documents should include a number of examples with thorough explanations so that people can learn by doing. With these developments, I think 80% of the issues can be resolved with little to no new code, so anyone can contribute.

Contributions can come from writing general documentation, creating additional and better examples, proofreading the documents and examples, giving talks, perhaps posting those talks on social media, and creating pre-made containers – or at least instructions for how to make containers for common tasks, common applications, or both. Maintaining and updating container images and recipe files is very important as well, so that you’re not creating one-time solutions. Although it takes effort, keep updating the container. Above all, testing what people create is very important. Without testers, circumstances won’t improve.