Optimizing Your NFS Filesystem
NFS is the most widely used HPC filesystem. It is very easy to set up and performs reasonably well as primary storage for small to medium clusters. You can even use it for larger clusters if your applications do not use it for I/O (e.g., /home). One of the most common questions about NFS configuration is how to tune it for performance and management and which options are typically used. Tuning for performance is a loaded question, because performance depends on so many variables, starting with how you choose to measure it. However, you can at least identify options that tend to improve NFS performance. In this article, I'll go over some of the major options and illustrate their pluses and minuses.
In addition to tuning for performance, I will present a few useful options for managing and securing NFS. It's not an exhaustive list by any stretch of the imagination, but the options are typical for NFS. NFS tuning can occur on both the server and the clients, and you can also tune the TCP stacks on both. In this article, I've broken the list of tuning options into three groups: (1) NFS performance tuning options, (2) system tuning options, and (3) NFS management/policy options. The options for these three categories are listed below.
1. NFS performance tuning options
- Synchronous vs asynchronous
- Number of NFS daemons (nfsd)
- Block size setting
- Timeout and retransmission
- FS-Cache
- Filesystem-independent mount options
2. System tuning options
- System memory
- MTU
- TCP tuning on the server
3. NFS management/policy options
- Subtree checking
- Root squashing
In the sections that follow, these options are presented and discussed.
Synchronous vs Asynchronous
Most people use the synchronous option on the NFS server. For synchronous writes, the server replies to NFS clients only when the data has been written to stable storage. Many people prefer this option because they have little chance of losing data if the NFS server goes down or network connectivity is lost.
Asynchronous mode allows the server to reply to the NFS client as soon as it has processed the I/O request and sent it to the local filesystem; that is, it does not wait for the data to be written to stable storage before responding to the NFS client. This can save time for I/O requests and improve performance. However, if the NFS server crashes before the I/O request gets to disk, you could lose data.
The behavior described above is controlled on the server by the sync or async option for each export in /etc/exports. The NFS client also accepts sync or async as a mount option, either on the mount command line or in /etc/fstab, which controls whether the client itself delays writes; to change a mount option, you first have to unmount the NFS filesystem, change the option, then remount the filesystem.
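As a minimal sketch (the export path and client range here are hypothetical), a server-side /etc/exports entry choosing synchronous mode might look like:
/export/home  192.168.1.0/24(rw,sync)
After editing the file, re-export the filesystems with exportfs -ra for the change to take effect.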
The choice between the two modes of operation is up to you. If you have a copy of the data somewhere, you can perhaps run asynchronously for better performance. If you don't have copies or the data cannot be easily or quickly reproduced, then perhaps synchronous mode is the better option. No one can make this determination but you.
Number of NFS Daemons
NFS uses threads on the server to handle incoming and outgoing I/O requests. These show up in the process table as nfsd (NFS daemons). Using threads helps NFS scale to handle large numbers of clients and large numbers of I/O requests. By default, NFS only starts with eight nfsd processes (eight threads), which, given that CPUs today have very large core counts, is not really enough.
You can find the number of NFS daemons in two ways. The first is to look at the process table and count the number of NFS processes with
ps aux | grep nfsd
The second way is to look at the NFS config file (e.g., /etc/sysconfig/nfs) for an entry that says RPCNFSDCOUNT, which tells you the number of NFS daemons for the server.
If the NFS server has a large number of cores and a fair amount of memory, you can increase RPCNFSDCOUNT. I have seen 256 used on an NFS server with 16 cores and 128GB of memory, and it ran extremely well. Even for home clusters, eight NFS daemons is very small, and you might want to consider increasing the number. (I have 8GB on my NFS server with four cores, and I run with 64 NFS daemons.)
You should also increase RPCNFSDCOUNT when you have a large number of NFS clients performing I/O at the same time. For this situation, you should also increase the amount of memory on the NFS server to a large-ish number, such as 128 or 256GB. Don't forget that if you change the value of RPCNFSDCOUNT, you will have to restart NFS for the change to take effect.
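For example, on a distribution that uses /etc/sysconfig/nfs and systemd (paths and service names vary by distribution), you would set
RPCNFSDCOUNT=64
in the config file and then restart the server with
systemctl restart nfs-server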
One way to determine whether more NFS threads helps performance is to check the data in /proc/net/rpc/nfsd on the server for the load on the NFS daemons. The output line that starts with th lists the number of threads, and the last 10 numbers are a histogram of the number of seconds a given fraction of the threads were busy: the first bucket covers up to 10% of the threads, the second 10-20%, and so on, up to the last bucket, which counts seconds when nearly all of the threads were in use.
Ideally, you want the last two numbers to be zero or close to zero, indicating that the thread pool is rarely exhausted. If the last two numbers are fairly high, you should add NFS daemons, because the NFS server has become the bottleneck. If the last two, three, or four numbers are zero, then some threads are probably not being used. Personally, I don't mind this situation if I have enough memory in the system, because the load could grow to a point at which those extra NFS daemons are needed.
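A quick way to pull out just the thread statistics on the server is:
grep -w th /proc/net/rpc/nfsd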
Block Size Setting
Two NFS client options specify the size of data chunks for writing (wsize) and reading (rsize). If you don't specify the chunk sizes, the defaults are determined by the versions of NFS and the kernel being used. If you have NFS already running and configured, the best way to check the current chunk size is to run the command
cat /proc/mounts
on the NFS client and look for the wsize and rsize values.
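If many filesystems are mounted, you can narrow the output to the NFS mounts with:
grep nfs /proc/mounts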
Chunk size affects how many remote procedure calls (RPCs) and network packets are created and sent. For example, if you want to transmit 1MB of data using 32KB chunks, the data is sent in 32 chunks and a correspondingly large number of network packets. If you increase the chunk size to 64KB, the count drops to 16 chunks, reducing the number of RPCs and packets on the network.
The chunk size that works best for your NFS configuration depends primarily on the network configuration and the applications performing I/O. If your applications are doing lots of small I/O, then a large chunk size would not make sense. The opposite is true as well.
If the applications are using a mixture of I/O payload sizes, then selecting a block size might require some experimentation. You can experiment by changing the options on the clients; you have to unmount and remount the filesystem before rerunning the applications. Don't be afraid of testing a range of block sizes, starting with the default and moving into the megabyte range.
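If you want to pin the values explicitly, a client /etc/fstab entry (the server name, export, and mountpoint here are hypothetical) might look like:
server:/export  /mnt/nfs  nfs  rsize=65536,wsize=65536  0 0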
Timeout and Retransmission
Timeout and retransmission options, although they might not seem to affect performance at first glance, are very important for NFS, especially on the clients. The options determine how long the NFS client waits before retransmitting a packet (timeo) and how many times the NFS client attempts to resend the packet (retrans) before restarting the entire process.
The timeo (timeout) option is the amount of time the NFS client waits for the NFS server before retransmitting a packet (no ACK received). The value for timeo is given in tenths of a second, so if timeo is 5, the NFS client will wait 0.5 seconds before retransmitting. The traditional default for NFS over UDP is 7 (0.7 seconds); NFS over TCP uses a much larger default of 600 (60 seconds). You can adjust timeo on the mount command line or by editing the /etc/fstab file on the NFS client.
The other option, retrans, specifies the number of tries the NFS client will make to retransmit the packet. If the value is 5, the client resends the RPC packet five times, waiting timeo seconds between tries. If, after the last attempt, the NFS server does not respond, you get a message Server not responding. The NFS client then resets the RPC transmission attempts counter and tries again in the same fashion (same timeo and retrans values).
On congested networks, you often see retransmissions of RPC packets. A good way to tell is to run the
nfsstat -r
command and look for the column labeled retrans. If the number is large, the network is likely very congested. If that is the case, you might want to increase the values of timeo and retrans to increase the number of tries and the amount of time between RPC tries. Although taking this action will slow down NFS performance, it might help even out the network traffic so that congestion is reduced. In my experience, getting rid of congestion and dropped packets can result in better, more even performance.
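If you decide to raise these values, a client /etc/fstab entry with illustrative (hypothetical) settings might look like:
server:/export  /mnt/nfs  nfs  timeo=50,retrans=5  0 0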
FS-Cache
Another option you might want to consider for improving NFS client performance is FS-Cache, which caches NFS data on a local storage device, such as a hard drive or SSD, helping improve NFS read I/O: data that is already cached on the NFS client can be served locally, so the NFS server does not have to be contacted.
To use NFS caching you have to enable it explicitly by adding the option -o fsc to the mount command or in /etc/fstab:
# mount -t nfs -o fsc <nfs-share>:/ </mount/point>
Any data access to </mount/point> will go through the NFS cache unless the file is opened for direct I/O or if a write I/O is performed.
The important thing to remember is that FS-Cache only works if the I/O is a read. FS-Cache can't help with a direct I/O (read or write) or an I/O write request. However, there are plenty of cases in which FS-Cache can help. For example, if you have an application that needs to read from a database or file and you are running a large number of copies of the same application, FS-Cache might help, because each node could cache the database or file.
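For reference, the equivalent /etc/fstab entry (with a hypothetical server and mountpoint) simply adds fsc to the mount options; note that the cachefilesd service must also be running on the client for the cache to be used:
server:/export  /mnt/nfs  nfs  fsc  0 0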
Filesystem-Independent Mount Options
The mount command in Linux has a number of options that are independent of the filesystem and might be able to improve performance. Some key options are:
- noatime – Inode access times are not updated on the filesystem. This can help performance because the filesystem does not have to record an access-time update every time a file is merely read.
- nodiratime – Directory inode access times are not updated when directories are accessed. This can help performance in the same way as noatime.
- relatime – Inode access times are relative to the modify or change time for the file, so the access time is updated only if the previous atime (access time) was earlier than the modify or change time.
Before using these options, please read the mount man page. You also need to decide whether access time is worth tracking accurately. By not tracking it or not tracking it as accurately, you can increase performance.
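As a sketch (the names are hypothetical), these options are simply added to the mount options in the client's /etc/fstab:
server:/export  /mnt/nfs  nfs  noatime,nodiratime  0 0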
System Memory
System tuning options aren't really NFS tuning options, but a system change can result in a change in NFS performance. In general, Linux likes memory and will grab as much system memory as it can. Of course, this memory is returned if applications need it, but rather than letting idle memory go to waste, the kernel uses it for buffers and caches.
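If you want to see how much memory the system is currently devoting to buffers and cache, a quick check is:
free -h
and look at the buff/cache column.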
NFS, particularly on the server, is definitely one of the services that will use as much buffer space as possible. With these buffers, NFS can merge I/O requests to improve bandwidth. Therefore, the more physical memory you can add to the NFS server, the more likely performance will improve, particularly if you have lots of NFS clients hitting the server at the same time. The question is: How much memory does your NFS server need?
The answer is not easy to determine because of conflicting goals. Ideally, you should put as much memory in the server as you can afford. But if budgets are a little on the tight side, then you'll have to deal with trade-offs between buying the largest amount of memory for the NFS server or putting the money into other aspects of the system. Could you reduce memory on the NFS server from 512 to 256GB and perhaps buy an extra compute node? Is that worth the trade? The answer is up to you.
As a rule of thumb for production HPC systems, however, I tend to put in no less than 64GB on the NFS server, because memory is less expensive overall. You can always go with less memory, perhaps 16GB, but you might pay a performance penalty. However, if your applications don't do much I/O, then the trade-off might be worthwhile.
If you choose to use asynchronous NFS mode, you will need more memory to take advantage of async, because the NFS server first stores the I/O request in memory, responds to the NFS client, and then retires the I/O by having the filesystem write it to stable storage. Therefore, the more memory you can provide, the better the performance is likely to be.
A final word about system memory concerns memory speed and the number of memory channels. To wring every last bit of performance out of your NFS server, you will want the fastest possible memory, while recognizing the trade-off between memory capacity and memory performance. How you resolve that trade-off is up to you, but I like to see how much memory I can get using the fastest DIMMs possible. If that capacity is not large enough, you might want to step down to the next level in memory speed to increase capacity.
For the best possible performance, you also want an NFS server with the maximum number of memory channels to increase the overall memory bandwidth of the server. When populating memory, be sure to put:
- at least one DIMM in each channel,
- the same number of DIMMs in each channel, and
- the same DIMM size and speed in each channel.
Again, this is more likely to be critical if you are using asynchronous mode, but it's a good idea even for synchronous mode.
MTU
Changing the network MTU (maximum transmission unit) is also a good way to affect performance, but it is not an NFS tunable. Rather, it is a network option that you can tune on the system to improve NFS performance. The MTU is the maximum amount of data that can be sent via an Ethernet frame. The default MTU is typically 1500 (1,500 bytes per frame), but this can be changed fairly easily.
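As a quick sketch (eth0 is a placeholder for your actual interface name), the MTU can be changed on a running system with:
ip link set dev eth0 mtu 9000
To make the change persist across reboots, use your distribution's network configuration files or tools.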
For the greatest effect on NFS performance, you will have to change the MTU on both the NFS server and the NFS clients. You should check both of these systems before changing the value to determine the largest MTU you can use. You also need to check for the largest MTU the network switches and routers between the NFS server and the NFS clients can accommodate (refer to the hardware documentation). Most switches, even non-managed "home" switches, can accommodate an MTU of 9000 (commonly called "jumbo frames").
The MTU size can be very important because it determines packet fragments on the network. If your chunk size is 8KB and the MTU is 1500, it will take six Ethernet frames to transmit the 8KB. If you increase the MTU to 9000 (9,000 bytes), the number of Ethernet frames drops to one.
The most common recommendation for better NFS performance is to set the MTU on both the NFS server and the NFS client to 9000 if the underlying network can accommodate it. A study by Dell a few years back examined the effect of an MTU of 1500 compared with an MTU of 9000. Using Netperf, they found that the bandwidth increased by about 33% when an MTU of 9000 was used.
TCP Tuning on the Server
A great deal can be done to tune the TCP stack for both the NFS client and the NFS server. Many articles around the Internet discuss TCP tuning options for NFS and for network traffic in general. The exact values vary depending on your specific situation. Here, I want to discuss two options for better NFS performance: system input and output queues.
Increasing the size of the input and output queues allows more data to be transferred via NFS. Effectively, you are increasing the size of the buffers that can store data, and the more data that can be queued in memory, the faster NFS can process it. The NFS daemons on the server share the same socket input and output queues, so larger queues give all of the daemons more buffer space for sending and receiving data.
For the input queue, the two values you want to modify are /proc/sys/net/core/rmem_default (the default size of the read queue in bytes) and /proc/sys/net/core/rmem_max (the maximum size of the read queue in bytes). These values are fairly easy to modify:
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/rmem_max
These commands change the read buffer sizes to 256KiB (base 2), which the NFS daemons share. You can do the same thing for the write buffers that the NFS daemons share:
echo 262144 > /proc/sys/net/core/wmem_default
echo 262144 > /proc/sys/net/core/wmem_max
After changing these values, you need to restart the NFS server for them to take effect. However, if you reboot the system, these values will disappear and the defaults will be used. To make the values survive reboots, you need to enter them in the proper form in the /etc/sysctl.conf file.
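For reference, the equivalent /etc/sysctl.conf entries for the values above are:
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
They can be applied without a reboot by running sysctl -p.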
Just be aware that increasing the buffer sizes doesn't necessarily mean performance will improve. It just means the buffer sizes are larger. You will need to test your applications with various buffer sizes to determine whether increasing buffer size helps performance.
Subtree Checking
Assume the NFS server has exported a directory from the root filesystem (e.g., /usr/local). Also assume that it is part of the root disk for the system (i.e., it's not on a separate partition or drive). On a compromised NFS client, the cracker could guess the file handle for a file that is in the filesystem but not in /usr/local/ (the NFS-exported directory). Now your NFS server has been compromised.
Adding the option subtree_check to the exports on the NFS server checks that the file being accessed is contained within the exported directory. In the case here, it would force the NFS server to check that the requested file was located within /usr/local/. Alternatively, you can specify the option no_subtree_check on the NFS server, and it will not check that the requested file is in the exported directory. Many people have the opinion that subtree_check can have a big effect on performance, but the final determination is up to you. Is performance more important than security for the configuration and your situation?
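As an illustration (the client range is hypothetical), the option is set per export in /etc/exports:
/usr/local  192.168.1.0/24(rw,sync,subtree_check)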
One way to overcome the need for subtree_check is to put the exported directory on a separate partition or separate drive to prevent a rogue user from guessing a file handle to anything outside of the filesystem. You should partition your drive space and give a specific mountpoint to the directory that is to be exported. For example, if you want to export /usr/local/, it should have its own storage partition (or drive) and be mounted as /usr/local on the NFS server. By doing this, crackers can't guess file handles outside of the specific export.
Root Squashing
By default, user root on a client is “squashed” to the user nobody so that NFS access is compartmentalized. This point is important because if a rogue user boots a system from some sort of media (e.g., a USB stick), that user can be root on that system and could then change the IP address to gain access to the network, mount an exported filesystem, and copy data from the server. However, if root is squashed to user nobody, then root on the client has only the privileges of an ordinary unprivileged user on the export, thus preventing a compromised system from allowing root to pull data from your server.
On the other hand, if you want root to have access to an NFS-mounted filesystem, you can add the option no_root_squash to the file /etc/exports to allow root access. Just be aware that if someone reboots your system to gain root access, it's possible for them to copy (steal) data.
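For completeness, a hypothetical /etc/exports line that explicitly enables root squashing (the default behavior) looks like:
/export/home  192.168.1.0/24(rw,sync,root_squash)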
Summary
In this article, I presented various options you could use to improve performance on an NFS filesystem, although depending on your circumstances, they might not help or might even result in reduced performance. Some of these tuning parameters are NFS options, whereas others involve changes to the system that improve performance or are options for managing NFS filesystems. The best way to judge which options are useful is to run tests, particularly with the applications you plan on running.
This blog represents my own viewpoints and not those of my employer, Amazon Web Services.