Filesystem Murder Mystery
Hunting down a performance problem
As I was generating results for one of last Fall's columns, James Saryerwinnie of the Amazon AWS team was chasing an interesting performance problem and relaying his progress live on Twitter (Figure 1) [1], providing me yet another unnecessary opportunity to procrastinate. What followed was one of the most interesting performance-tuning Twitter threads in recent memory and a great practical demonstration of the investigative nature of performance problems.
Saryerwinnie was observing occasional unexplained delays of a few seconds while renaming some gigabyte-sized files [2] on a fast cloud instance [3]. I was able to reproduce his result on an AWS EC2 m5d.large instance running Ubuntu Server 18.04 (Figure 2). I used an 8GB gp2-class EBS volume, which showed remarkable performance in this particular test (including a throughput of 175MBps), but similar results are achievable at other scales with some lucky timing.
My Figure 2 result shows a file rename taking 4,707 times longer than it did originally on the same system when it happens to rename over an existing file that was recently manipulated; the slowdown disappeared if I waited between operations. Just like Saryerwinnie's original result, the slowdown is measured in whole seconds, spanning four orders of magnitude above the ideal performance level, which appears to indicate that the rename system call [4] is for some reason waiting for recent I/O to be written to disk before completing.
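For readers who want to reproduce the effect, the following is a minimal sketch of the kind of test I ran; the file names and sizes are illustrative, and the multi-second delay only shows up when the rename lands on an existing file shortly after the copy:

dd if=/dev/urandom of=3GB bs=1M count=3072   # create a large source file
cp 3GB 3GB.copy                              # dirty a few gigabytes of page cache
time mv 3GB.copy 3GB                         # renaming over an existing file right away: seconds
cp 3GB 3GB.copy
sleep 60                                     # let the dirty pages drain to disk first
time mv 3GB.copy 3GB                         # the same rename after waiting: milliseconds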
This observation seemed to point the finger at some unexpected queuing of operations at the filesystem level, as if some part of the previous operation, still latent somewhere in the kernel, had to complete before the rename could take place. Given the symbolic nature of file entries in a directory structure, this appears architecturally unnecessary: The completion of the file operation should not interfere with the renaming of a directory entry, even when the entry being renamed is the file's own.
Or should it? Perhaps an unexpected guarantee of in-order operation is in play, which is why I, and presumably Saryerwinnie himself before me, found this observation interesting. A similar slowdown is observed when the move command targets an existing empty file (Figure 3), seemingly indicating that the rename syscall itself, and not actual data movement through caches [5], is the culprit.
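The empty-target variant is equally simple to sketch, again with placeholder file names; the rename still stalls because the target exists and the source's data is freshly written:

cp 3GB 3GB.copy               # freshly written source file
touch empty                   # zero-length, pre-existing target
time mv 3GB.copy empty        # still slow: a rename over an existing (if empty) file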
AWS Storage Basics
AWS block storage comes in two types: ephemeral storage, which is locally attached and gets its name from being lost when an instance is de-provisioned, and network-based storage, as delivered by the Elastic Block Store (EBS) service. EBS volumes are highly available and come with a multitude of options too numerous to detail here [6]. In broad strokes, EBS volumes come in four types: General Purpose SSD (gp2), Provisioned IOPS SSD (io1), Throughput Optimized HDD (st1), and Cold HDD (sc1).
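For reference, a gp2 volume like the 8GB one used in these tests can be provisioned and attached from the AWS CLI; the Availability Zone, volume ID, and instance ID below are placeholders:

aws ec2 create-volume --availability-zone us-east-1a --size 8 --volume-type gp2
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf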
EBS volumes can be used to boot EC2 instances and can be as large as 16TiB. They support encryption, snapshots, and most other accoutrements generally associated with enterprise data storage but stand out for their built-in monitoring delivered via AWS CloudWatch. Figure 4 shows the spike in write bandwidth caused by my earlier tests.
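The same write spike visible in Figure 4 can also be pulled from the command line with a CloudWatch query; the volume ID and time window below are placeholders standing in for my test run:

aws cloudwatch get-metric-statistics --namespace AWS/EBS \
    --metric-name VolumeWriteBytes \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2018-12-03T18:00:00Z --end-time 2018-12-03T19:00:00Z \
    --period 300 --statistics Sum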
Strace to the Rescue
Unrivaled as a debugging tool, strace [7] is also a great way to unpack an executable's performance when used with the -t option to include the time of each call. As shown in Listing 1, nearly the entirety of the mv execution time was spent in an unusually long (nine seconds in this case!) rename system call, confirming the suspicion that it is the key to this enigma.
Listing 1
Timing the Execution of Calls in mv
ubuntu@aws:~/slow-mv$ strace -t mv 3GB.copy 3GB
19:00:09 execve("/bin/mv", ["mv", "3GB.copy", "3GB"], 0x7ffd0e7dddf8 /* 21 vars */) = 0
19:00:09 brk(NULL) = 0x55cd7d1ce000
19:00:09 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
19:00:09 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
...
19:00:09 getgid() = 1000
19:00:09 access("3GB", W_OK) = 0
19:00:09 rename("3GB.copy", "3GB") = 0
19:00:18 lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
19:00:18 close(0) = 0
19:00:18 close(1) = 0
19:00:18 close(2) = 0
19:00:18 exit_group(0) = ?
19:00:18 +++ exited with 0 +++
ubuntu@aws:~/slow-mv$
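When wall-clock timestamps are not granular enough, strace can also report the time spent inside each call or print a summary table; these standard flags would have told the same story:

strace -T mv 3GB.copy 3GB     # append the duration of each system call in seconds
strace -c mv 3GB.copy 3GB     # print a per-syscall summary of time, calls, and errors on exit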
The next step of the analysis would have taken me into the kernel, were it not for Matt Wilson delivering a timely resolution [8]: The combination of the data=ordered and auto_da_alloc ext4 options results in behavior akin to invoking sync [9] before renaming over an existing file. I validated this result by running a few tests on the test instance after remounting the root partition with modified options, as shown in Listing 2. With this configuration, I could not observe a delay longer than a third of a second, as opposed to a maximum of nine seconds with the original settings (Figure 5). Of course, this setting exists for a reason, and disabling it comes with a trade-off.
Listing 2
Disabling auto_da_alloc
ubuntu@aws$ grep ext4 < /proc/mounts
/dev/nvme1n1p1 / ext4 rw,relatime,discard,data=ordered 0 0
ubuntu@aws$ sudo mount -o remount,rw,relatime,discard,noauto_da_alloc,data=ordered /dev/nvme1n1p1 /
ubuntu@aws$ grep ext4 < /proc/mounts
/dev/nvme1n1p1 / ext4 rw,relatime,discard,noauto_da_alloc,data=ordered 0 0
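A remount only lasts until the next boot; to keep experimenting across reboots, the same option can be carried into /etc/fstab. The entry below is a sketch matching the root filesystem of my test instance, and the device name (or label) will differ on other systems:

# /etc/fstab
/dev/nvme1n1p1  /  ext4  rw,relatime,discard,noauto_da_alloc,data=ordered  0  1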
Puzzling the Riddle
As explained in the kernel docs [10], setting auto_da_alloc in the default data=ordered mode results in the replace-via-rename and replace-via-truncate file access patterns having their data blocks flushed to disk before the rename is persisted, consistent with what was observed. The solution to the puzzle lies with filesystem designers attempting to prevent data loss during system crashes for applications designed around the assumptions of the older ext3 filesystem, as detailed by Ted Ts'o in a vintage 2009 blog post [11]. As the distribution was moving from ext3 to ext4, some Ubuntu users encountered unexpected data loss because of buggy applications in the Jaunty Jackalope (v9.04) alpha installs. In Ted's own words, "Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds."
Infos
- mv performance problem: https://twitter.com/jsaryer/status/1069775822880333824
- Unexplained rename() delay: https://twitter.com/jsaryer/status/1069831718389960704
- GitHub, James Saryerwinnie: https://gist.github.com/jamesls/9d52c7b3a19491f4d14c2d487515115f
- rename(2) man page: https://linux.die.net/man/2/rename
- "Assess USB performance while exploring storage caching" by Federico Lucifredi, ADMIN , issue 48, 2018, pg. 94, http://www.admin-magazine.com/Archive/2018/48/Assess-USB-performance-while-exploring-storage-caching
- Amazon EBS user guide: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html
- strace(1) man page: https://linux.die.net/man/1/strace
- ext4 options problem: https://twitter.com/_msw_/status/1069853156534669312
- sync(8) man page: https://linux.die.net/man/8/sync
- auto_da_alloc in kernel ext4 documentation: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
- "Delayed allocation and the zero-length file problem" by Theodore Ts'o, March 12, 2009: http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/