Linux I/O Schedulers
The Linux kernel is a very complex piece of software used on a variety of computers, including embedded devices that need real-time performance, hand-held devices, laptops, desktops, servers, database servers, video servers, and very large supercomputers, including all of those in the TOP500. All of these computers have very different requirements, some of which include responsiveness to user input (e.g., so streaming music, video, or other interactivity is not interrupted). At the same time, the devices require good I/O performance to make sure data is saved properly. Some workloads have very high I/O throughput, so to make sure these requirements are met, the kernel uses schedulers.
Schedulers do exactly what they say: schedule activities within the kernel so that system activities and resources are scheduled to achieve an overall goal for the system. This goal could be low latency for input (as embedded systems require), better interactivity, faster I/O, or even a combination of goals. Primarily, schedulers are concerned with CPU resources, but they could also consider other system resources (e.g., memory, input devices, networks, etc.).
The focus of this article is the I/O scheduler, including I/O scheduler concepts and the various options that are available for I/O tuning.
Intro to I/O Schedulers
Virtually all applications running on Linux do some sort of I/O. Even surfing the web writes a number of small files to disk. Without an I/O scheduler, every I/O request would send an interrupt to the kernel so the I/O operation could be performed, moving the disk head around different blocks to satisfy the read and write requests. Over time, the disparity between the performance of disk drives and the rest of the system grows very rapidly, making I/O more important to overall system performance. As you can imagine, when the kernel has to address an interrupt, any kind of processing or interactive work is paused. Consequently, the system may appear unresponsive or slow.
How do you schedule the I/O requests to preserve interactivity while also ensuring good I/O performance? The answer, as with most things, depends on the workload. In some cases, it would be nice to be able to do I/O while doing other things. In other cases, doing I/O as fast as possible is required, such as when a distributed application is creating a checkpoint. To balance these two very different workloads or to ensure that one workload is emphasized, an I/O scheduler is used.
The scheduling of I/O events must address many parts. For example, the scheduler might need to store events for some future execution. How it stores the events, the possible reordering of events, the length of time it stores the events, the execution of stored events when some condition is reached, the length of I/O execution, and so on are all crucial aspects of the scheduler. Exactly how these various aspects of the scheduler are implemented can have a huge effect on the overall I/O performance of the system and the perception of users when interacting with the system.
Defining the function or role of the system is probably the best place to start when considering scheduler design or the tuning of existing schedulers. For example, you should know whether the target system is an embedded device, a hand-held device, a laptop, desktop, server, supercomputer, and so on so that you can define what your goals are for the scheduler.
For example, suppose your target system is a desktop user doing some web surfing, perhaps playing a video or music, and maybe even running a game. Although this is a simple and common scenario, this mix of workloads has enormous implications. If you are watching a video or listening to music or playing a game, you don’t want it to be interrupted. There’s nothing like a video that pauses – plays – pauses – plays to stretch your patience in a hurry. When gaming, if the system pauses while you are about to blow the head off a mutant zombie, you might find that the zombie has removed your character’s head when the system returns control to your game. Although “stuttery” music might be a legitimate genre to some, in general, it’s quite annoying. Therefore, a desktop target system that requires as little interruption of interactive programs as possible has a great influence on the design of the scheduler.
Disk I/O can be much slower than other aspects of the system. Because I/O scheduling allows you to store events and possibly reorder them, it’s possible to produce contiguous I/O requests to improve performance. Newer filesystems are incorporating some of these concepts, and you can even extend these concepts to make the system better adapt to the properties of SSDs.
I/O schedulers typically use the following techniques:
- Request Merging. Adjacent requests are merged to reduce disk seeking and to increase the size of the I/O syscalls (usually resulting in higher performance).
- Elevator. Requests are ordered on the basis of physical location on the disk so that seeks are in one direction as much as possible. This technique is sometimes referred to as “sorting.”
- Prioritization. Requests are prioritized in some way. The details of the ordering are up to the I/O scheduler.
Almost all I/O schedulers also take resource starvation into account so that all requests are serviced eventually.
These techniques and others are combined to create an I/O scheduler with a few goals, with three of the top goals being:
- Minimize disk seeks
- Ensure optimum disk performance
- Provide fairness among I/O requests
Balancing these goals or trading them against one another is the essence of the art of creating an I/O scheduler.
The techniques used by I/O schedulers as they apply to SSDs are a bit different. SSDs are not spinning media, so merging requests and ordering them might not have much of an effect on I/O. Instead, I/O requests to the same block can be merged, and small I/O writes can either be merged or adjusted to reduce write amplification (i.e., the need for more physical space than the logical data would imply because of the way write operations take place on SSDs).
Linux I/O Schedulers
Because Linux is an open-source project, developers can submit additions to the kernel for inclusion, and over the years, several I/O schedulers have been proposed. Currently, four are included in the kernel:
- Completely Fair Queuing (CFQ)
- Deadline
- NOOP
- Anticipatory
NOOP
The NOOP I/O scheduler is fairly simple. All incoming I/O requests for all processes running on the system, regardless of the I/O request (e.g., read, write, lseek, etc.), go into a simple first in, first out (FIFO) queue. The scheduler also does request merging by taking adjacent requests and merging them into a single request to reduce seek time and improve throughput. NOOP assumes that some other device will optimize I/O performance, such as an external RAID controller or a SAN controller.
Potentially, the NOOP scheduler could work well with storage devices that don’t have a mechanical component (i.e., a drive head) to read data, because it does not make any attempts to reduce seek time beyond simple request merging (which helps throughput). Therefore, storage devices such as flash drives, SSD drives, USB sticks, and the like that have very little seek time could benefit from a NOOP I/O scheduler.
Anticipatory
The anticipatory I/O scheduler was the default scheduler a long time ago (in kernel years). As the name implies, it anticipates subsequent block requests and implements request merging, a one-way elevator (a simple elevator), and read and write request batching. After the scheduler services an I/O request, it anticipates that the next request will be for the subsequent block. If the request comes, the disk head is in the correct location, and the request is serviced very quickly. This approach does add a little latency to the system because it pauses slightly to see if the next request is for the subsequent block. However, this latency is possibly outweighed by the increased performance for neighboring requests.
Putting on your storage expert hat, you can see that the Anticipatory scheduler works really well for certain workloads. For example, one study observed that the Apache web server could achieve up to 71% more throughput using the anticipatory I/O scheduler. On the other hand, the anticipatory scheduler has been observed to result in a slowdown on a database run.
Deadline
The deadline I/O scheduler was written by well-known kernel developer Jens Axboe. The fundamental principle is to guarantee a start time for servicing an I/O request by combining request merging, a one-way elevator, and a deadline on all requests (hence the name). It maintains two deadline queues in addition to the sorted queues for reads and writes. The deadline queues are sorted by their deadline times (time to expiration), with shorter times moving to the head of the queue. The queues are sorted according to their sector number (the elevator approach).
By moving the I/O requests that have been in the queues the longest (the same as having the shortest deadline time), they will be executed before others, which guarantees that I/O requests won’t “starve” for various reasons, resulting in a very long time to execute the request.
A deadline scheduler really helps with distant reads (i.e., fairly far out on the disk or with a large sector number). Read I/O requests sometimes block applications because they have to be executed while the application waits. On the other hand, because writes are cached, execution can quickly return to the application – unless you have turned off the cache in the interest of making sure the data reaches the disk in the event of a power loss, in which case, writes would behave like read requests. Even worse, distant reads would be serviced very slowly because they are constantly moved to the back of the queue as requests for closer parts of the disk are serviced first. However, a deadline I/O scheduler makes sure that all I/O requests are serviced, even the distant read requests.
Diving into the scheduler a bit more, the concepts are surprisingly straightforward. The scheduler decides on the next request by first deciding which queue to use. It gives a higher priority to reads because, as mentioned, applications usually block on read requests. Next, it checks the first request to see if it has expired. If so, it is executed immediately; otherwise, the scheduler serves a batch of requests from the sorted queue.
The deadline scheduler is very useful for some applications. In particular, real-time systems use the deadline scheduler because, in most cases, it keeps latency low (all requests are serviced within a short time frame). It has also been suggested that it works well for database systems that have TCQ-aware disks.
CFQ
The completely fair queue (CFQ) I/O scheduler, is the current default scheduler in the Linux kernel. It uses both request merging and elevators and is a bit more complex that the NOOP or deadline schedulers. CFQ synchronously puts requests from processes into a number of per-process queues then allocates time slices for each of the queues to access the disk. The details of the length of the time slice and the number of requests a queue is allowed to submit are all dependent on the I/O priority of the given process. Asynchronous requests for all processes are batched together into fewer queues with one per priority.
Jens Axboe is the original author of the CFQ scheduler, incorporating elevator_linus, which adds features to prevent starvation for worst case situations, as could happen with distant reads. An article by Axboe has a good discussion on the design of the CFQ I/O scheduler (and others) and the intricacies of scheduler design.
CFQ gives all users (processes) of a particular device (storage) about the same number of I/O requests over a particular time interval, which can help multiuser systems see that all users get about the same level of responsiveness. Moreover, CFQ achieves some of the good throughput characteristics of the anticipatory scheduler because it allows a process queue to have some idle time at the end of a synchronous I/O request, creating some anticipatory time for I/O that might be close to the request just serviced.