Defining measures
What is an IOPS Really?
Many articles have explored the performance aspects of filesystems, storage systems, and storage devices. Classically, performance results are reported with statements such as "Peak throughput is x MBps" or "Peak IOPS is x." However, what does "IOPS" really mean, and how is it defined?
Typically, an IOP is an I/O operation in which data is read from or written to the filesystem and, subsequently, the storage device. Other IOPs exist that don't strictly involve a read or write operation (more on that later).
The number of input/output operations per second (IOPS) sounds simple enough, but the term has no hard standard definition. For example, what I/O functions are used during IOPS testing? If the I/O functions involve reading or writing data, how much data is used for a read or write?
Despite not having a precise definition, IOPS is a very important storage performance measure for applications. Think about the serial portion of Amdahl's Law, which typically includes I/O. Getting data to and from the storage device as quickly as possible affects application performance and scalability. With the large number of cores in today's systems, either several applications run at the same time – all possibly performing I/O – or a running application uses a large number of processes, all possibly performing I/O. Storage performance is under even more pressure.
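To put a rough number on the Amdahl's Law argument, the sketch below computes the upper bound on speedup when a fraction s of the run time is serial and the rest is spread over n cores: speedup = 1 / (s + (1 - s)/n). Treating the serial fraction as time spent on I/O, and the 5% and 10% values themselves, are assumptions for illustration only.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup from Amdahl's Law.

    serial_fraction is the share of run time that cannot be parallelized
    (here, hypothetically, time spent waiting on I/O).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # Hypothetical workloads: 5% and 10% of the run time is serial I/O.
    for s in (0.05, 0.10):
        for n in (8, 64, 1024):
            print(f"serial I/O = {s:.0%}, cores = {n:4d}: speedup <= {amdahl_speedup(s, n):6.2f}")
        print(f"serial I/O = {s:.0%}, unlimited cores: speedup <= {1.0 / s:6.2f}")
```

Even with unlimited cores, a 10% serial I/O share caps the speedup at 10x, so shrinking the time spent on I/O pays off directly.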
IOPS Specifics
The IOPS metric implies that more I/O operations per second is better than fewer: The larger the IOPS, the better the storage performance. However, one important aspect of measuring IOPS is the size of the data used in the I/O function. (I use the terminology of the networking world and refer to this as the "payload size.") Does the I/O operation involve just a single byte, or does it involve 1MiB, 1GiB, or 1TiB?
Most of the time, IOPS are reported as a plain number (e.g., 100,000). Because IOPS has no standard definition and the number does not state the payload size, it is essentially meaningless on its own. However, over time an "accepted" payload size has emerged for measuring IOPS: 4KB.
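A quick way to see why the payload size matters: The data rate implied by an IOPS figure is simply IOPS multiplied by payload size. The sketch below applies that arithmetic to a hypothetical 100,000 IOPS at 1-byte, 4KiB, and 1MiB payloads; the numbers are illustrative, not measurements.

```python
KiB = 1024
MiB = 1024 * KiB


def throughput_bytes_per_s(iops: float, payload_bytes: int) -> float:
    """Data rate (bytes/s) implied by an IOPS figure at a given payload size."""
    return iops * payload_bytes


if __name__ == "__main__":
    iops = 100_000  # hypothetical figure reported without a payload size
    for payload in (1, 4 * KiB, 1 * MiB):
        rate = throughput_bytes_per_s(iops, payload) / MiB
        print(f"{iops:,} IOPS x {payload:>9,} bytes = {rate:12,.2f} MiB/s")
```

The same 100,000 IOPS ranges from roughly 0.1MiB/s to 100,000MiB/s depending on the payload, which is why the payload size has to be stated.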
The kilobyte [1] is defined as 1,000 bytes and is grounded in base 10 (10^3). Over time, kilobyte has also been used, incorrectly, to mean 1,024 bytes, a number grounded in base 2 (2^10). The correct unit for 1,024 bytes is the kibibyte (KiB). In this article I use the kibibyte notation for 1,024 bytes, so 4KiB is 4,096 bytes. Therefore, the commonly used read or write payload is 4,096 bytes, or 4KiB.
Just to emphasize the point: Although 4KiB is commonly used for IOPS measurements, IOPS has no real definition, particularly for the payload size. If IOPS numbers are stated without a payload size, you cannot be sure what the numbers really mean. The payload could be 4KiB, but without a clear statement, you don't know. I view this as a criminal storage offense (pay the bailiff on the way out).
4KiB is so commonly used because it corresponds to the memory page size on most Linux systems and usually (but not always) produces the best IOPS result. However, in the absence of a formal definition, it is safe to say that not all reported IOPS numbers use this size. Sometimes the results are reported with 1KiB payloads, or even 1-byte payloads.
Applications don't always do I/O in 4KiB increments, so why should IOPS be restricted to a single payload size? In the absence of a definition, any payload size can be used. Personally, because applications use various payload sizes, I want to see IOPS measures for a range of I/O operation sizes: 1KiB (in case really small payloads have some exceptional performance); 4KiB, 32KiB, or 64KiB; and maybe even 128KiB, 256KiB, or 1MiB. A range of payload sizes lets me compare the results with the spectrum of payload sizes my applications use. However, if I must pick a single payload size, it should be 4KiB. Most important, I want the publisher of any IOPS numbers to tell me the payload size they used.
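The decimal/binary distinction is small at 4KB but grows with every prefix step. This sketch simply does the arithmetic; the dictionary names are illustrative, and "kB" (lowercase k) is the SI symbol for the decimal kilobyte.

```python
# Decimal (SI) prefixes are powers of 1000; binary (IEC) prefixes are powers of 1024.
SI = {"kB": 1000**1, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}
IEC = {"KiB": 1024**1, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

if __name__ == "__main__":
    print("4 kB  =", 4 * SI["kB"], "bytes")
    print("4 KiB =", 4 * IEC["KiB"], "bytes")
    # The gap between the two conventions widens with each prefix step.
    for si, iec in zip(SI, IEC):
        gap = (IEC[iec] / SI[si] - 1) * 100
        print(f"1 {iec} is {gap:.2f}% larger than 1 {si}")
```

At the kilo level the difference is 2.4%; at the tera level it is almost 10%.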
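If you want to confirm the page size on a particular Linux system (x86-64 uses 4KiB pages, whereas some arm64 kernels are built with 16KiB or 64KiB pages), the standard library can report it; `getconf PAGESIZE` gives the same answer from the shell. A minimal sketch:

```python
import mmap
import os

# Two standard-library ways to query the virtual-memory page size on Linux.
page_size = os.sysconf("SC_PAGE_SIZE")   # wraps sysconf(_SC_PAGE_SIZE)
print(f"sysconf page size: {page_size} bytes ({page_size // 1024} KiB)")
print(f"mmap.PAGESIZE:     {mmap.PAGESIZE} bytes")
```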
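A payload-size sweep can be sketched in a few lines. The example below times `os.pread()` calls of 1KiB to 1MiB against a temporary 8MiB file. It reads through the page cache and includes Python's per-call overhead, so it illustrates the shape of the test, not device performance; a real measurement would bypass the cache (e.g., with `O_DIRECT`) and use a dedicated benchmark tool. The file size and operation count are arbitrary assumptions.

```python
import os
import tempfile
import time

KiB = 1024
MiB = 1024 * KiB


def read_iops(path: str, payload: int, n_ops: int = 1000) -> float:
    """Time n_ops sequential os.pread() calls of `payload` bytes; return ops/s.

    The file is read through the page cache, so this measures cached reads,
    not the storage device.
    """
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        start = time.perf_counter()
        for _ in range(n_ops):
            if offset + payload > size:
                offset = 0                      # wrap around to the start of the file
            os.pread(fd, payload, offset)
            offset += payload
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_ops / elapsed


if __name__ == "__main__":
    payloads = [1 * KiB, 4 * KiB, 32 * KiB, 64 * KiB, 128 * KiB, 256 * KiB, 1 * MiB]
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(8 * MiB))            # hypothetical 8MiB test file
        path = f.name
    try:
        for p in payloads:
            iops = read_iops(path, p)
            print(f"{p // KiB:5d} KiB read: {iops:12,.0f} IOPS {iops * p / MiB:10,.1f} MiB/s")
    finally:
        os.remove(path)
```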
IOPS Categories
Up to this point, the type of I/O operation has not been specified. Traditionally, IOPS are used to measure data throughput. Accordingly, IOPS are measured by read or write I/O functions, and many times two IOPS results are reported: one for only read operations and one for only write operations. Think of these as guardrail measurements that define the limit of read and write IOPS performance.
Again, because IOPS is not a defined standard, IOPS could be reported with a mixture of read and write operations. For example, it could be defined as a read/write mixture of 50/50 or 25/75. Because of the lack of a definition, whoever is reporting the IOPS results should define whether the number is all-read, all-write, or a mixture of operations. Personally, I would like to see all-read IOPS, all-write IOPS, and at least one read/write IOPS mixture (more is better).
IOPS reported as a mixture of read and write operations is a step forward, in my opinion, but reporting a mixed number can also be ambiguous. For example, does reporting the read/write IOPS as 25/75 mean the test did three write operations followed by a read operation? It could mean one read operation followed by three write operations, or two write operations followed by one read operation and then another write operation. Without a specified pattern, the mixed read/write IOPS measurement becomes uncertain, which reduces the value of reporting the number.
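The ambiguity is easy to enumerate. This sketch lists every distinct ordering of one "cycle" of a read/write mix; writing a cycle as a string of R and W characters is just an illustrative convention. Each ordering is a different test that could honestly be reported as "25/75."

```python
from itertools import combinations


def mix_patterns(reads: int, writes: int) -> list[str]:
    """All distinct orderings of one cycle with the given read and write counts."""
    n = reads + writes
    patterns = []
    for read_positions in combinations(range(n), reads):
        pattern = ["W"] * n
        for i in read_positions:
            pattern[i] = "R"
        patterns.append("".join(pattern))
    return sorted(patterns)


if __name__ == "__main__":
    # A 25/75 mix expressed as one read per three writes.
    for pattern in mix_patterns(1, 3):
        print(pattern)
    # The same ratio over an 8-operation cycle.
    print(len(mix_patterns(2, 6)), "orderings for 2 reads + 6 writes")
```

A four-operation cycle already gives four candidates (RWWW, WRWW, WWRW, WWWR), and an eight-operation cycle with the same ratio gives 28.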
Although some applications perform all reads or all writes during portions of the execution, many applications use a mixture of reads and writes, so I want that mixed read/write IOPS result reported.
Personally, at a very minimum, I would like to see IOPS reported as:
- 4KiB read IOPS = x
- 4KiB write IOPS = y
- 4KiB (p % read/q % write) IOPS = w
The third IOPS measure should report the read/write pattern.
Beyond these three numbers, as I mentioned earlier, I want to see IOPS for different I/O function payload sizes. Although 4KiB is mandatory because it is the closest thing to a standard, I know of applications that perform a great number of I/O functions in the range of 1 to 100 bytes. Knowing the same three IOPS measures (read, write, and a read/write mix) for a range of I/O payloads can be key to relating the IOPS numbers to application performance. The same is true for larger payloads, perhaps in the 32KiB to 1MiB range.
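The resulting test matrix is just the cross product of payload sizes and operation types. The payload list and the mix label below are hypothetical choices drawn from the ranges mentioned in the text.

```python
from itertools import product

# Hypothetical test matrix: payload sizes (bytes) x operation types.
PAYLOADS = (1, 100, 4096, 32 * 1024, 1024 * 1024)
OPERATIONS = ("read", "write", "25% read/75% write (RWWW)")


def build_matrix(payloads=PAYLOADS, operations=OPERATIONS) -> list[tuple[int, str]]:
    """Every (payload, operation) combination that should get its own IOPS number."""
    return list(product(payloads, operations))


if __name__ == "__main__":
    for payload, op in build_matrix():
        print(f"{payload:>8} bytes  {op:<26} IOPS = ?")
```

Five payload sizes and three operation types already mean 15 separate IOPS numbers.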
Diving deeper, a commonly overlooked aspect of measuring IOPS is whether the I/O operations are sequential or random. With sequential IOPS, the I/O operations happen with sequential blocks of data. For example, block 233 is used for the first I/O operation, followed by block 234, followed by block 235, and so on. (This discussion assumes the payload is 4KiB aligning with the blocks.) With random IOPS, the first I/O operation could be on block 233 and the second could be on block 8675309, or something like that.
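The difference between the two access patterns comes down to how block numbers are generated. In this sketch the device size (10 million blocks) and the seed are assumptions, and the byte offset passed to a read or write is block number times the 4KiB block size.

```python
import random

BLOCK = 4096  # 4KiB payload aligned to 4KiB blocks, as in the text


def sequential_blocks(start: int, count: int) -> list[int]:
    """Consecutive block numbers starting at `start`."""
    return [start + i for i in range(count)]


def random_blocks(count: int, n_blocks: int, seed: int = 0) -> list[int]:
    """Uniformly random block numbers in [0, n_blocks)."""
    rng = random.Random(seed)           # seeded so a run is repeatable
    return [rng.randrange(n_blocks) for _ in range(count)]


if __name__ == "__main__":
    print("sequential:", sequential_blocks(233, 3))
    rnd = random_blocks(3, n_blocks=10_000_000)
    print("random:    ", rnd)
    # Byte offsets for pread()/pwrite() are block number * block size.
    print("byte offsets:", [b * BLOCK for b in rnd])
```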
With random I/O access, the tests can invalidate data caches because the blocks are not in the cache, resulting in IOPS measures that are more realistic. If you run several applications at the same time, the blocks needed by each application may be widely separated on the storage device. One application might need something from one range of blocks, and another might use a different block range. Depending on the sequence of the I/O operations, to the filesystem and storage devices, this looks like random I/O access.
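To see how two well-behaved applications can still produce random-looking I/O, the sketch below interleaves two purely sequential block streams round-robin (a simplifying assumption about how their requests arrive) and measures what fraction of requests directly follows the previous one.

```python
def interleave(*streams: list[int]) -> list[int]:
    """Round-robin merge of per-application block streams (stops at the shortest)."""
    merged = []
    for group in zip(*streams):
        merged.extend(group)
    return merged


def sequential_fraction(blocks: list[int]) -> float:
    """Share of requests whose block immediately follows the previous request's block."""
    if len(blocks) < 2:
        return 1.0
    hits = sum(1 for a, b in zip(blocks, blocks[1:]) if b == a + 1)
    return hits / (len(blocks) - 1)


if __name__ == "__main__":
    app_a = list(range(1_000, 1_008))       # hypothetical application A: sequential
    app_b = list(range(500_000, 500_008))   # hypothetical application B: sequential
    mixed = interleave(app_a, app_b)
    print(mixed)
    print("A alone:", sequential_fraction(app_a), " A and B mixed:", sequential_fraction(mixed))
```

Each stream on its own is 100% sequential, but the merged stream the device sees contains no sequential pairs at all.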
Keeping this in mind, the list of IOPS that should be reported is now:
- 4KiB random read IOPS = x
- 4KiB random write IOPS = y
- 4KiB sequential read IOPS = u
- 4KiB sequential write IOPS = v
- (Optional) 4KiB (p % read/q % write) (sequential or random) IOPS = w
Now at least four IOPS numbers should be reported, with a highly recommended fifth number that uses a mixture of read and write operations.
Beyond I/O function payload sizes and sequential or random data access, further options can be used for the best possible IOPS results (e.g., a read-ahead cache in the operating system or a storage device). Reporting random access IOPS results can, one hopes, provide insight into what happens without cache effects.
Another option for tuning IOPS performance is queue depth, which is a measure of the number of I/O operations queued together before execution. A queue provides the opportunity for the operating system to order the I/O operations to make data access more sequential, but it does so by using more memory, a few more CPU cycles, and a delay before executing the I/O operations. Holding I/O operations in system memory can be a little dangerous in that, if the system loses power, those operations can be lost. However, the queues are flushed very quickly, so you might not be affected by the loss of power.
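Why a queue helps can be shown with a deliberately simplified model: Hold up to queue_depth requests, sort each batch by block number, dispatch it, and total the distance the "head" travels. This batch-sort elevator is an illustrative assumption, not how the Linux schedulers actually work, and the request stream is randomly generated.

```python
import random


def seek_distance(head: int, requests: list[int]) -> int:
    """Total block distance travelled servicing requests in the given order."""
    total = 0
    for block in requests:
        total += abs(block - head)
        head = block
    return total


def dispatch(requests: list[int], queue_depth: int) -> list[int]:
    """Collect up to queue_depth requests, sort each batch by block number, dispatch."""
    order = []
    for i in range(0, len(requests), queue_depth):
        order.extend(sorted(requests[i:i + queue_depth]))
    return order


if __name__ == "__main__":
    rng = random.Random(42)
    arrivals = [rng.randrange(1_000_000) for _ in range(512)]  # hypothetical random requests
    for depth in (1, 8, 32, 128):          # depth 1 = dispatch in arrival order
        dist = seek_distance(0, dispatch(arrivals, depth))
        print(f"queue depth {depth:4d}: total seek distance {dist:,} blocks")
```

The trade-off from the text shows up here too: Deeper queues reorder more effectively, but each request may wait for its batch to fill.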
I commonly see varying queue depths in Windows IOPS testing. Linux does a pretty good job of choosing queue depths, so there is much less need to change the default of 128, which provides good overall performance. However, depending on the workload or benchmark, adjusting the queue depth can possibly produce better performance. Be warned that if you change the queue depth for a particular benchmark or workload, the performance of other applications could be affected.
You can check which I/O scheduler is in use and the corresponding queue depth by querying the /sys filesystem. For a mechanical disk (e.g., sdx), the output might be:

```
$ cat /sys/block/sd*/queue/scheduler
[mq-deadline] none
$ cat /sys/block/sd*/queue/nr_requests
64
```

or, for an NVMe drive:

```
$ cat /sys/block/nvme1n1/queue/scheduler
[none] mq-deadline
$ cat /sys/block/nvme1n1/queue/nr_requests
1023
```
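The same two files can be collected for every block device at once. In `queue/scheduler` the active scheduler is the name in square brackets. This sketch assumes the standard /sys/block layout shown above and simply skips any device that lacks those files.

```python
from pathlib import Path


def active_scheduler(text: str) -> str:
    """Return the active scheduler from a line such as '[mq-deadline] none'."""
    names = text.split()
    for name in names:
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    if len(names) == 1:          # a file containing a single, unbracketed name
        return names[0]
    raise ValueError(f"no active scheduler in {text!r}")


def queue_settings(sys_block: Path = Path("/sys/block")) -> dict[str, tuple[str, int]]:
    """Map each block device to (active scheduler, nr_requests)."""
    settings = {}
    for dev in sorted(sys_block.iterdir()):
        queue = dev / "queue"
        try:
            sched = active_scheduler((queue / "scheduler").read_text())
            nr = int((queue / "nr_requests").read_text())
        except (OSError, ValueError):
            continue                 # missing or unreadable files: skip this device
        settings[dev.name] = (sched, nr)
    return settings


if __name__ == "__main__":
    for name, (sched, nr) in queue_settings().items():
        print(f"{name:10s} scheduler={sched:12s} nr_requests={nr}")
```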
Because queue depth can be varied, the IOPS performance results could (should) be published with the queue depth:
- 4KiB random read IOPS = x (queue depth = z)
- 4KiB random write IOPS = y (queue depth = z)
- 4KiB sequential read IOPS = u (queue depth = z)
- 4KiB sequential write IOPS = v (queue depth = z)
- (Optional) 4KiB random (p % read/q % write) IOPS = w (queue depth = z)
Another example of selecting options for better performance is the Linux I/O scheduler, which can sort incoming I/O requests into request queues, where they are optimized for the best possible device access. Linux has had two types of I/O schedulers: multiqueue I/O schedulers and the older non-multiqueue I/O schedulers. Starting with kernel 5.0, the non-multiqueue schedulers were removed, leaving only the multiqueue schedulers.
The three non-multiqueue I/O schedulers are:
- Completely fair queuing (CFQ)
- Deadline
- NOOP (no-operation)
There are various reasons for choosing one over another, and it is worthwhile to try all three (on kernels that still provide them) to find the best IOPS performance. However, the best IOPS performance does not necessarily mean the best application performance, so be sure to test real applications when you change the scheduler.
With the advent of very fast storage devices (think solid-state storage), the storage bottleneck [2] moved from the storage device to the operating system. These devices can achieve highly parallel access, improving I/O performance to a degree that requires a new way to think about how I/O operations are queued and scheduled. Multiqueue I/O schedulers [3] were created for these cases.
More discussion about the specific multiqueue schedulers can be found online [4]. The current list is:
- BFQ (budget fair queuing)
- Kyber
- None (like NOOP)
- mq-deadline (a multiqueue version of the deadline scheduler)
Note that some of these schedulers can be tuned for the best performance.
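Schedulers that have tunables expose them as files under `/sys/block/<device>/queue/iosched/` (for mq-deadline, entries such as `read_expire`, `write_expire`, and `fifo_batch`). The sketch below only reads them; changing a value, or switching schedulers by writing a name to `queue/scheduler`, requires root. The device name `sda` is a hypothetical placeholder.

```python
from pathlib import Path


def scheduler_tunables(device: str, sys_block: Path = Path("/sys/block")) -> dict[str, str]:
    """Return {tunable: value} from /sys/block/<device>/queue/iosched/, if present."""
    iosched = sys_block / device / "queue" / "iosched"
    if not iosched.is_dir():
        return {}                  # e.g. "none" typically exposes no tunables
    tunables = {}
    for entry in sorted(iosched.iterdir()):
        try:
            tunables[entry.name] = entry.read_text().strip()
        except OSError:
            continue               # unreadable entry: skip it
    return tunables


if __name__ == "__main__":
    device = "sda"                 # hypothetical device name; adjust for your system
    for name, value in scheduler_tunables(device).items():
        print(f"{device}: {name} = {value}")
```

Recording these values next to the scheduler name makes a published IOPS result reproducible.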
Including the specific I/O scheduler as part of the reported IOPS results makes the number of possible variations grow very quickly, which is a consequence of not having a standard IOPS definition. Overall, I would personally like to see five IOPS numbers reported for a specific payload size, queue depth, and I/O scheduler:
- 4KiB random read IOPS = x (queue depth = z, scheduler = s)
- 4KiB random write IOPS = y (queue depth = z, scheduler = s)
- 4KiB sequential read IOPS = u (queue depth = z, scheduler = s)
- 4KiB sequential write IOPS = v (queue depth = z, scheduler = s)
- (Optional) 4KiB random (p % read/q % write) IOPS = w (queue depth = z, scheduler = s)
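One way to make sure no context is dropped is to carry it with every result. This sketch is a hypothetical record type (the field names are my own) that formats one line of the report above.

```python
from dataclasses import dataclass


@dataclass
class IopsResult:
    payload_bytes: int
    access: str        # "random" or "sequential"
    operation: str     # "read", "write", or a mix such as "25% read/75% write (RWWW)"
    iops: float
    queue_depth: int
    scheduler: str     # e.g. "mq-deadline", "none", "kyber", "bfq"

    def line(self) -> str:
        """Format the result with its payload size, queue depth, and scheduler."""
        for unit, factor in (("MiB", 1024**2), ("KiB", 1024)):
            if self.payload_bytes >= factor and self.payload_bytes % factor == 0:
                size = f"{self.payload_bytes // factor}{unit}"
                break
        else:
            size = f"{self.payload_bytes}B"
        return (f"{size} {self.access} {self.operation} IOPS = {self.iops:,.0f} "
                f"(queue depth = {self.queue_depth}, scheduler = {self.scheduler})")
```

For example, `IopsResult(4096, "random", "read", 12345, 32, "mq-deadline").line()` yields "4KiB random read IOPS = 12,345 (queue depth = 32, scheduler = mq-deadline)"; 12,345 is a placeholder, not a measurement.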
This report would give at least some upper bounds to IOPS performance. Furthermore, reporting IOPS for different I/O function payloads is important because it provides a wider view of IOPS results, and they can help in understanding application performance.
Total IOPS
Another IOPS measure I have come to value, which I refer to as "Total" IOPS, is the number of I/O operations per second for all I/O operations, including read, write, and metadata operations (non-read and non-write filesystem or storage operations). In other words, total IOPS counts every kind of I/O operation, not just reads and writes.
Metadata operations include I/O functions such as stat, fstat, getdents, or fsync that don't really have a "payload," as read or write functions do. However, they still affect performance because they can cause data to be read from or written to storage. I consider them important because applications might call them in rapid succession, which can slow the application down. Because the payloads for these metadata I/O functions are very small or nonexistent, to me this is something very much like IOPS, hence the name, total IOPS.
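A rough total IOPS number can be sketched by timing a mix of data and metadata calls and dividing the combined count by the elapsed time. The mix below (one 4KiB write, read, stat, fstat, and directory scan per round, plus an fsync every 10 rounds) is an arbitrary assumption. Python's `os.scandir()` uses readdir(), which on Linux is backed by the getdents64 system call; each scan is counted as one operation, even though it may issue more than one system call.

```python
import os
import tempfile
import time
from collections import Counter

PAYLOAD = 4096  # 4KiB payload for the data calls


def total_iops(directory: str, rounds: int = 200, fsync_every: int = 10) -> tuple[float, Counter]:
    """Run a hypothetical mix of data and metadata calls; return (total ops/s, per-call counts)."""
    counts: Counter = Counter()
    path = os.path.join(directory, "probe.dat")
    data = b"\0" * PAYLOAD
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        start = time.perf_counter()
        for i in range(rounds):
            os.pwrite(fd, data, i * PAYLOAD)
            counts["write"] += 1
            os.pread(fd, PAYLOAD, i * PAYLOAD)
            counts["read"] += 1
            os.stat(path)
            counts["stat"] += 1
            os.fstat(fd)
            counts["fstat"] += 1
            with os.scandir(directory) as entries:  # directory scan (getdents64 on Linux)
                for _ in entries:
                    pass
            counts["scandir"] += 1
            if i % fsync_every == 0:
                os.fsync(fd)
                counts["fsync"] += 1
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return sum(counts.values()) / elapsed, counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        rate, counts = total_iops(d)
        print(f"total IOPS: {rate:,.0f}")
        for op, n in sorted(counts.items()):
            print(f"  {op:8s} {n}")
```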
Currently, tests for metadata rates include MDTest [5], fdtree [6], Postmark [7], and MD-Workbench [8]. These benchmarks test metadata operations such as creating and deleting files, renaming, and gathering statistics. The tests are run in isolation rather than in combination with each other, sometimes in a single directory and sometimes in a directory tree with files at each level.
The metadata operations tested in these benchmarks cover only a few scenarios. However, applications use many more, sometimes repeatedly throughout execution and sometimes in rapid succession. Therefore, knowing how quickly these operations can be performed (i.e., their IOPS) is an important part of testing filesystems, storage devices, and operating system parameters. Which I/O functions are called varies with the application and even with the size of the data set. What is important is measuring the IOPS associated with these metadata operations.
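In the same spirit, though far simpler than any of those benchmarks, the sketch below runs create, stat, rename, and unlink phases over empty files in a single directory and reports a rate per phase. The phase list, file naming, and file count are assumptions for illustration; it is not a reimplementation of MDTest or the other tools.

```python
import os
import tempfile
import time
from typing import Callable, Sequence


def rate(fn: Callable, items: Sequence) -> float:
    """Apply fn to every item and return operations per second."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    return len(items) / (time.perf_counter() - start)


def create_empty(path: str) -> None:
    """Create an empty file, failing if it already exists."""
    os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))


def metadata_phases(directory: str, n_files: int = 1000) -> dict[str, float]:
    """Create, stat, rename, and unlink n_files empty files in one directory, one phase at a time."""
    names = [os.path.join(directory, f"file.{i}") for i in range(n_files)]
    pairs = [(name, name + ".renamed") for name in names]
    return {                                   # phases run in this order
        "create": rate(create_empty, names),
        "stat": rate(os.stat, names),
        "rename": rate(lambda pair: os.rename(*pair), pairs),
        "unlink": rate(os.unlink, [new for _, new in pairs]),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for phase, ops in metadata_phases(d).items():
            print(f"{phase:7s} {ops:12,.0f} ops/s")
```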