Getting Data Into and Out of the Cluster

Recently a friend who was teaching a course around HPC systems (clusters) sent me an email with the top three storage questions he kept getting from his students and asked whether I had good answers for them. The number one question was “How do I get data into and out of the cluster?” My first reaction was something like, “doesn’t everyone know this?” However, I quickly caught myself and realized two things: (1) I’m glad people are not afraid to ask questions, and (2) everyone starts somewhere. For quite a while, I had not thought about people who are just getting started with Linux and HPC. I have spent so much time in HPC and Linux that I had grown to assume everyone knows them, which of course is not true. Fortunately, I mentally slapped myself in the face and then sent an email back to my friend with my answers.

Over the next couple of days, I thought more about the questions and the people taking the course. I felt bad that I had assumed everyone should know Linux and HPC. Like everyone reading this article, we all started with zero knowledge of Linux (or any *nix, for that matter) and HPC. This realization reoriented my thinking: I no longer assume people have the same or better background than I do; instead, I assume they might have no experience with whatever topic is being discussed. It has been a refreshing change in how I think about many topics in Linux and HPC.

But I digress. I wanted to go back to my friend’s top three questions and use them as a basis for discussing various storage topics in Linux, especially HPC. The three questions are:

  1. How do I get data into and out of the cluster?
  2. How do you know where data is located after a job is finished?
  3. How do you manage the storage?

In my opinion, these are three great questions. I want to start with the first.

Getting Data Into and Out of the Cluster

This question can have a very long answer if you want to include all options. To narrow it down a bit, it needs a little more background, such as: Is it a Windows machine? Is it a Linux machine? Is it a Mac? I’m choosing to answer the question assuming it is either a Linux or Windows machine. I’m also choosing to discuss two approaches for moving data: (1) with scp, sftp, or another command-line interface (CLI) tool and (2) by mounting remote storage from the cluster on the local machine.

CLI Tools on a Linux Machine

Assume that you have a local Linux laptop or desktop and either local data that you want to transfer to the cluster or data that is on the cluster that you want to copy to your machine. By “local” I mean something in your office, in your home, or in a hotel where you are working. You can even be on the proverbial beach, but if you’re doing work on the beach, well, I’m not sure what to tell you.

I will also assume you are connected to the corporate or university network by a virtual private network (VPN). VPNs allow you to connect to networks from virtually anywhere in the world if you have access to the Internet. (I'm ignoring the person doing work at the beach.) Once connected by VPN, you are part of your “work” network that allows access to the cluster. Please note that it is very rare for a cluster to be directly available from the Internet without a VPN.

Linux can have a desktop environment with a graphical user interface (GUI), but it is not focused on the GUI. Rather, Linux uses quite a number of CLI tools. In this section, I cover the CLI commands first.

One common CLI tool for securely transferring data into and out of the cluster is scp , which is based on the Secure Shell Protocol (SSH) and is implemented as OpenSSH on Linux. It is (was) commonly used for moving data from your Linux machine to the cluster (and back). In 2019, the OpenSSH developers stated that scp  is “… outdated, inflexible, and not readily fixed.”

The currently recommended CLI tool for securely copying files between your machine and the cluster is sftp, which can be run non-interactively, as shown below, or used interactively. The sftp command, which uses the SSH File Transfer Protocol, runs inside an SSH connection, so everything is encrypted. Some distributions have created symlinks or wrappers for other tools that emulate scp, but the SFTP protocol is underneath them all.

To transfer a file into the cluster, you can feed a put command to sftp on standard input:

$ echo 'put file1.hd5 /home/user1/data_files/file1.hd5' | sftp user1@<ip_address>

This command copies file1.hd5 from the current directory on your local system to a remote system with a specific <ip_address>, assumed to be the cluster, placing it at /home/user1/data_files/file1.hd5 under the user account user1. (The sftp command itself takes only a destination; the put command that performs the copy is read from standard input.) You need to be able to log in to the cluster system specified by <ip_address> as user1 and have write permission for the /home/user1/data_files/ directory.

Note that you don’t have to copy your data to your home directory on the cluster. You can copy it to any directory on the cluster where you have write permission and, one hopes, read permission. For example, with sftp  you can copy the same file from your local system to a different directory on the cluster:

$ echo 'put file1.hd5 /data/team1_data/file1.hd5' | sftp user1@<ip_address>

You just have to be able to write to the target directory (i.e., /data/team1_data  in this case).

So far, the sftp command has been used to copy a single file between systems. If you want to copy everything in a specific directory, including the directories that lie below it, you can tell put to copy everything recursively from that directory down:

$ echo 'put -R * /home/user1/data_files/' | sftp user1@<ip_address>

The -R flag to put means “recursive.” The asterisk (*) expands to everything in the current directory on your machine; because the pattern is quoted, sftp itself performs the expansion. This command copies everything from the current directory in which you run the command to /home/user1/data_files/ on the cluster.

I won’t be discussing the interactive use of sftp , mostly because I’ve never used it that way, but online tutorials can explain how to do that.

Because sftp either copies one file or recursively copies everything, two other tools you might want to use are tar and a compression tool (e.g., gzip or pigz), which give you more control over what you transfer. They let you collect a large number of files into one file (with tar) and then compress that TAR file with the compression tool. You can then use sftp to copy over the resulting file, which usually ends in .tar.gz.
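
As a minimal sketch of that workflow (the directory name results/ and the target path are placeholders), you might run:

$ tar czf results.tar.gz results/      # collect the directory into one compressed file
$ echo 'put results.tar.gz /home/user1/data_files/' | sftp user1@<ip_address>

On the cluster, tar xzf results.tar.gz unpacks everything again.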

You can reverse the data transfer process and copy data from the cluster to your Linux machine. The sftp command is the same except that you execute it from the cluster, the IP address is the address of your local machine, the user account is your account on your local machine, and you specify where you want to put the file(s). An example command might be:

$ echo 'put file1.hd5 /home/laptop-user/data_files/file1.hd5' | sftp laptop-user@<ip_address_laptop>

Note that you can be laptop-user on your machine and not necessarily user1, as you were on the cluster.
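
Copying in this direction assumes your laptop runs an SSH server and is reachable from the cluster, which is often not the case. A simpler alternative, sketched here, is to stay on your laptop and pull the file down: when the sftp destination names a file rather than a directory, sftp retrieves it into your current local directory.

$ sftp user1@<ip_address>:/home/user1/data_files/file1.hd5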

Overall, sftp  allows you to transfer files to and from the cluster and your local system. Any time you need a file copied to, or copied from, the cluster storage, you need to use the command line, as mentioned previously. However, what happens when you edit a file on your local machine, copy it to the cluster, execute code that uses that data with some output, and copy that output back to your machine to perhaps edit the data for another run of the code? You have this back and forth, requiring sftp  commands to transfer the file every time you need to edit or change the data. It would be much nicer to be able to exchange data between the cluster and your local system as if it were a directory on your local machine; then you could use all of the Linux commands for data management. The rest of the section on Linux CLI tools discusses two options for mounting remote file systems.

A filesystem client named SSHFS can be used to mount or attach storage from the cluster to your local system on the command line with one or two commands. Underneath, SSHFS uses SFTP as the protocol for data transmission to and from the cluster storage server. The data transfer is encrypted and secure (unless you turn off encryption). The really cool thing about SSHFS is that the cluster administrator does not have to do anything on the cluster except make SFTP accessible through the firewall, which very likely already is configured because SSH is configured, but you should check with your cluster administrator to be sure.

SSHFS is also what is called a “userspace tool,” meaning that an administrator does not need to be involved beyond opening a port on the firewall for SFTP. All data exchange is done by you, the user, running commands to mount the cluster storage on the local system. In essence, you are exchanging data with yourself between the cluster and the local machine without any intervention from the cluster admin. It is a 1:1 connection between your machine and your data storage on the cluster. You cannot access other users’ data, and they cannot access yours, allowing for more security.

To begin using sshfs, you’ll need to install it on your machine with either sudo or root or have an administrator install it for you.
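
Installation is usually a one-line package command. A minimal sketch, assuming a Debian/Ubuntu or Fedora/RHEL system (the package name varies by distribution):

$ sudo apt install sshfs         # Debian/Ubuntu
$ sudo dnf install fuse-sshfs    # Fedora/RHEL (the package may also be named sshfs)

After sshfs is installed, it is very easy to use. The commands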

$ mkdir /home/user1/sshfs_mnt   # Do this on your laptop/desktop
$ sshfs user1@<ip_address>:/home/user1/data_files/ /home/user1/sshfs_mnt

run on your local machine mount a filesystem from the cluster to your local machine, where <ip_address> is the address of the cluster server and /home/user1/sshfs_mnt is the mountpoint on your local machine. You can access your data in the directory /home/user1/sshfs_mnt for reading, writing, editing, moving, or deleting, just as you would for a local filesystem.

Interestingly, you can mount multiple directories from the cluster on your local machine over SSHFS; you just use a different mountpoint for each. For example, if you have data in /home/user1/data_files and /team1/data, you can mount them at two different directories on your local machine, as long as you have read and write permission to those directories. From your laptop or desktop you can run:

$ mkdir /home/user1/sshfs_mnt    # Do this on your laptop/desktop
$ mkdir -p /home/user1/team1/sshfs_mnt    # Do this on your laptop/desktop
$ sshfs user1@<ip_address>:/home/user1/data_files/data1 /home/user1/sshfs_mnt
$ sshfs user1@<ip_address>:/team1/data /home/user1/team1/sshfs_mnt
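
When you no longer need a mountpoint, you can detach it yourself, again without involving an administrator, because SSHFS is a FUSE (userspace) filesystem:

$ fusermount -u /home/user1/sshfs_mnt    # on newer systems the command may be fusermount3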

Another important way to share data with Linux machines is the Network File System (NFS), a longstanding and very important protocol in the Linux and HPC worlds. It has been around since 1984, is simple to configure, and is widely used. It allows data on an NFS server to be shared with many systems across the network and is even used as a shared filesystem inside clusters.

To use NFS, the admin of the cluster must make data available outside of the cluster (i.e., set up the NFS server). Assuming you’re on your work VPN, you should be able to access the server. To accomplish this, you need to know the IP address of the NFS server (ask your administrator).

Next, be sure the NFS client is installed on your Linux machine and that it is working. Several online tutorials cover the details for specific distributions.
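
On most distributions, installing the client is a single package command. A minimal sketch, assuming a Debian/Ubuntu or Fedora/RHEL system:

$ sudo apt install nfs-common   # Debian/Ubuntu
$ sudo dnf install nfs-utils    # Fedora/RHEL

After the client is installed, you are ready to mount the NFS filesystem, which means making it available on your local Linux system.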

Assuming that the client is installed, you next create the mountpoint on your local machine, which is the directory where the storage will be accessed. At this point you need sudo or root access; if you don’t have it, you will have to ask your IT support to mount the filesystem for you. Classically, people mount NFS filesystems under the /mnt directory, but you can mount them anywhere you want if you have the correct permissions. For this example, I stick with /mnt:

$ sudo mkdir -p /mnt/clusterdata

The third step requires your system administrator to give you the IP address of the NFS server, as well as the export directory, so you can run the command:

$ sudo mount 192.168.1.88:/home/user1 /mnt/clusterdata

This command assumes that the NFS server has IP address 192.168.1.88 (check with your system administrator on the actual IP address), and 192.168.1.88:/home/user1  means you are mounting the /home/user1  directory from the NFS server on your local system. Again, you might have to check with your system administrator for the full mount  command.

Various Linux commands let you check whether the filesystem is mounted. A simple command to check the used space on all mounted filesystems is df -h. You should see the NFS-mounted filesystem in the list: Its name begins with an IP address because it is a network filesystem rather than a locally attached device. I like to use the -h option because it produces human-readable output.
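
You can also limit df to a particular filesystem type with the -t option; the output below is hypothetical:

$ df -h -t nfs4
Filesystem                Size  Used Avail Use% Mounted on
192.168.1.88:/home/user1  500G  180G  320G  36% /mnt/clusterdata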

A second way to check that the NFS filesystem is mounted is to use the mount  command and specify the filesystem type:

$ mount -t nfs4

With this command, mount  displays all filesystems of type nfs4 . If you don’t see any NFS-mounted system(s), then try nfs  instead of nfs4 .

Now that the NFS-exported filesystem is mounted on your local machine as /mnt/clusterdata , you can copy files to or from that directory. You should be able to edit, delete, or move files there as well. With this simple test,

$ touch /mnt/clusterdata/testjunk
$ ls /mnt/clusterdata/testjunk

everything is good if the file testjunk  exists (i.e., you can list it).

It may get a little tedious having to mount your NFS filesystem manually every time your system boots. Linux allows you to accomplish this automatically by mounting the filesystems listed in /etc/fstab  on your local machine. You just add a line to the end of the file (you need sudo  or root  access) that tells Linux how to mount the filesystem:

192.168.1.88:/home/user1     /mnt/clusterdata    nfs    defaults   0   0

If you don’t have permission, or you have difficulty, contact your admin.

The first part of the line is the NFS server's IP address (192.168.1.88), a colon (:), and the filesystem or directory that is made available from the NFS server (/home/user1). The second part is where you want to mount it on your local machine (/mnt/clusterdata). The third part defines the filesystem type (i.e., nfs). The fourth part, defaults, lists the options for mounting the filesystem; this example just uses the default settings. Don’t worry about the last two numbers for now. You can always read about them later.

After editing /etc/fstab  you can either reboot your system or mount the added NFS filesystem:

$ sudo mount -a

This command mounts all filesystems listed in /etc/fstab that are not already mounted. In this example, the only filesystem not already mounted is the NFS filesystem you just added.

Be sure to check whether the filesystem mounts correctly, as previously described. If the NFS server (192.168.1.88) is down or the network is having problems, you won’t see the filesystem mounted. When this happens, contact your system administrator.

Because Linux relies on the command line much more than Windows, not many general GUI tools exist for handling SSHFS or NFS. One possibility is Simple NFS GUI; however, I haven’t tried it, so I can’t say how well it works.

Beyond NFS, a commercial GUI tool named ExpanDrive has a Linux client. Although it doesn’t do NFS, it can mount other filesystems such as SSHFS. This tool will be discussed a bit more in the next section on Windows machines.

Mounting Remote Storage on Windows

Because Windows is the most common operating system in the world, tools are needed to copy data to and from the cluster and a local Windows system. As with the previous section on Linux, I assume you are on your work network over a VPN.

Windows is more GUI oriented than Linux, so I focus on attaching storage from the cluster to your local machine with Windows tools or third-party tools rather than with a CLI tool like sftp. In the Linux world, this is referred to as “mounting” the storage, and I’ll still use that term here, although I may also use the phrase “mapping” (they aren’t quite the same, but for now, ignore those details and think of them as interchangeable). I will also use the word “remote” to indicate that the storage is not directly attached to your Windows machine.

How exactly you mount the remote storage depends on the details of the system that is managing the storage (i.e., the cluster). One method is to have the cluster storage exposed as a network drive to Windows systems by the Server Message Block (SMB) protocol. The cluster or storage admin will have taken some cluster storage and made it available on the network over SMB, which allows the storage to be “discovered” by Windows systems on the VPN network. You will likely need your login and password credentials on the cluster to proceed.

All Windows versions you are likely to use can discover the cluster drive and mount it. I found a reasonable video that explains how to map that storage so that it appears like another drive on your local computer. The first 2:30 of the video explains how to make storage on your computer available for others to mount on their computers, but starting at 2:30 it explains how to find and mount the network drive onto your local machine.

If you have issues discovering, mounting, and mapping SMB-based storage devices, I recommend contacting your cluster administrator or IT group for help.

Once the cluster storage is mounted, you can use the Windows Explorer tool to edit, copy, move, or delete the data as if it were a drive attached to your laptop or desktop. Think of it as plugging in a USB drive: The network-mapped storage behaves the exact same way. You can copy data to and from the drive, edit data on the drive, and delete data on the drive.
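
If you prefer a command prompt to the GUI, the long-standing net use command maps an SMB share to a drive letter; the server and share names below are placeholders for whatever your cluster admin provides:

C:\> net use Z: \\<cluster_server>\<share_name> /user:user1

You can remove the mapping later with net use Z: /delete.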

As mentioned earlier, the cluster storage server is making the data available by a Windows protocol (SMB). You might want to mount storage that uses other protocols, such as Amazon Web Services (AWS) Simple Storage Service (S3) buckets, NFS, Google Drive, Microsoft OneDrive, Backblaze, SFTP and SSHFS, Nextcloud, and others.

All of these filesystems and protocols share data in different ways, but third-party tools can mount them on your Windows machine. The first commercial tool I'm familiar with is named ExpanDrive. It is inexpensive and allows you to mount storage from:

  • Google Drive
  • Microsoft SharePoint
  • Microsoft OneDrive
  • Box
  • Dropbox
  • Backblaze
  • SFTP web server
  • SSHFS (SFTP protocol)
  • Nextcloud
  • Wasabi

ExpanDrive comes in versions for Windows, macOS, and Linux. The developers also have a pretty good blog to explain how it can be used. For example, a May 2023 article discussed mapping Google Drive as a network drive.

ExpanDrive makes it trivial to use SSHFS on your local Windows system by mounting storage from the cluster as previously discussed in the Linux section. You can even map it to a drive on your Windows machine.

Notice that ExpanDrive does not list NFS as a supported protocol, because Microsoft Windows has its own NFS client in Windows 10 and 11. However, not all editions of Windows 10 and 11 support NFS; for example, Windows 10 Home does not have this capability. For Windows 10 Pro and Windows 11, you can find several online articles that explain how to map an NFS “share” or drive to your local Windows system.
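
As a sketch of what that looks like, once the Client for NFS feature is enabled on a supported Windows edition, Windows provides a mount command that maps an NFS export to a drive letter (the server address and export path here reuse the assumptions from the Linux examples):

C:\> mount \\192.168.1.88\home\user1 Z: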

Summary

I admit that I had started to react to simple questions about HPC and Linux with “doesn’t everyone know this?” The email from my friend jolted me back to reality, reminding me that, in the beginning, not everyone knows the answers to seemingly “easy” questions.

I hope I’ve answered the question of how to copy data between your local system and the cluster, at least to some degree. These local systems can be Linux or Windows (I don’t own a machine with macOS, so I couldn’t write about that). Linux distributions typically ship commands that can copy data to and from the cluster, but they are not always convenient; mounting the cluster storage on your local system is better (IMHO).

The cluster storage can be attached in various ways with a variety of protocols. Which one is “best” really depends on what the cluster admin is using. I recommend SSHFS because it involves so little admin work to configure. However, you will have to buy an inexpensive third-party tool to use SSHFS on Windows.

Next, I would recommend NFS for Linux and SMB for Windows. Although I didn’t mention it, Linux can mount storage that uses SMB. Also, Windows can mount NFS storage if you have the correct version.
