Linux I/O Schedulers
The Linux kernel has several I/O schedulers that can greatly influence performance. We take a quick look at I/O scheduler concepts and the options that exist within Linux.
The Linux kernel is a very complex piece of software used on a variety of computers, including embedded devices that need real-time performance, hand-held devices, laptops, desktops, servers, database servers, video servers, and very large supercomputers, including all of those in the TOP500. All of these computers have very different requirements, some of which include responsiveness to user input (e.g., so streaming music, video, or other interactivity is not interrupted). At the same time, the devices require good I/O performance to make sure data is saved properly. Some workloads have very high I/O throughput, so to make sure these requirements are met, the kernel uses schedulers.
Schedulers do exactly what they say: schedule activities within the kernel so that system activities and resources are scheduled to achieve an overall goal for the system. This goal could be low latency for input (as embedded systems require), better interactivity, faster I/O, or even a combination of goals. Primarily, schedulers are concerned with CPU resources, but they could also consider other system resources (e.g., memory, input devices, networks, etc.).
The focus of this article is the I/O scheduler, including I/O scheduler concepts and the various options that are available for I/O tuning.
Intro to I/O Schedulers
Virtually all applications running on Linux do some sort of I/O. Even surfing the web writes a number of small files to disk. Without an I/O scheduler, every I/O request would send an interrupt to the kernel so the I/O operation could be performed, moving the disk head around different blocks to satisfy the read and write requests. Over time, the disparity between the performance of disk drives and the rest of the system grows very rapidly, making I/O more important to overall system performance. As you can imagine, when the kernel has to address an interrupt, any kind of processing or interactive work is paused. Consequently, the system may appear unresponsive or slow.
How do you schedule the I/O requests to preserve interactivity while also ensuring good I/O performance? The answer, as with most things, depends on the workload. In some cases, it would be nice to be able to do I/O while doing other things. In other cases, doing I/O as fast as possible is required, such as when a distributed application is creating a checkpoint. To balance these two very different workloads or to ensure that one workload is emphasized, an I/O scheduler is used.
The scheduling of I/O events must address many parts. For example, the scheduler might need to store events for some future execution. How it stores the events, the possible reordering of events, the length of time it stores the events, the execution of stored events when some condition is reached, the length of I/O execution, and so on are all crucial aspects of the scheduler. Exactly how these various aspects of the scheduler are implemented can have a huge effect on the overall I/O performance of the system and the perception of users when interacting with the system.
Defining the function or role of the system is probably the best place to start when considering scheduler design or the tuning of existing schedulers. For example, you should know whether the target system is an embedded device, a hand-held device, a laptop, desktop, server, supercomputer, and so on so that you can define what your goals are for the scheduler.
For example, suppose your target system is a desktop user doing some web surfing, perhaps playing a video or music, and maybe even running a game. Although this is a simple and common scenario, this mix of workloads has enormous implications. If you are watching a video or listening to music or playing a game, you don’t want it to be interrupted. There’s nothing like a video that pauses – plays – pauses – plays to stretch your patience in a hurry. When gaming, if the system pauses while you are about to blow the head off a mutant zombie, you might find that the zombie has removed your character’s head when the system returns control to your game. Although “stuttery” music might be a legitimate genre to some, in general, it’s quite annoying. Therefore, a desktop target system that requires as little interruption of interactive programs as possible has a great influence on the design of the scheduler.
Disk I/O can be much slower than other aspects of the system. Because I/O scheduling allows you to store events and possibly reorder them, it’s possible to produce contiguous I/O requests to improve performance. Newer filesystems are incorporating some of these concepts, and you can even extend these concepts to make the system better adapt to the properties of SSDs.
I/O schedulers typically use the following techniques:
- Request Merging. Adjacent requests are merged to reduce disk seeking and to increase the size of the I/O syscalls (usually resulting in higher performance).
- Elevator. Requests are ordered on the basis of physical location on the disk so that seeks are in one direction as much as possible. This technique is sometimes referred to as “sorting.”
- Prioritization. Requests are prioritized in some way. The details of the ordering are up to the I/O scheduler.
Almost all I/O schedulers also take resource starvation into account so that all requests are serviced eventually.
These techniques and others are combined to create an I/O scheduler with a few goals, with three of the top goals being:
- Minimize disk seeks
- Ensure optimum disk performance
- Provide fairness among I/O requests
Balancing these goals or trading them against one another is the essence of the art of creating an I/O scheduler.
The techniques used by I/O schedulers as they apply to SSDs are a bit different. SSDs are not spinning media, so merging requests and ordering them might not have much of an effect on I/O. Instead, I/O requests to the same block can be merged, and small I/O writes can either be merged or adjusted to reduce write amplification (i.e., the need for more physical space than the logical data would imply because of the way write operations take place on SSDs).
Linux I/O Schedulers
Because Linux is an open-source project, developers can submit additions to the kernel for inclusion, and over the years, several I/O schedulers have been proposed. Currently, four are included in the kernel:
- Completely Fair Queuing (CFQ)
- Deadline
- NOOP
- Anticipatory
NOOP
The NOOP I/O scheduler is fairly simple. All incoming I/O requests for all processes running on the system, regardless of the I/O request (e.g., read, write, lseek, etc.), go into a simple first in, first out (FIFO) queue. The scheduler also does request merging by taking adjacent requests and merging them into a single request to reduce seek time and improve throughput. NOOP assumes that some other device will optimize I/O performance, such as an external RAID controller or a SAN controller.
Potentially, the NOOP scheduler could work well with storage devices that don’t have a mechanical component (i.e., a drive head) to read data, because it does not make any attempts to reduce seek time beyond simple request merging (which helps throughput). Therefore, storage devices such as flash drives, SSD drives, USB sticks, and the like that have very little seek time could benefit from a NOOP I/O scheduler.
Anticipatory
The anticipatory I/O scheduler was the default scheduler a long time ago (in kernel years). As the name implies, it anticipates subsequent block requests and implements request merging, a one-way elevator (a simple elevator), and read and write request batching. After the scheduler services an I/O request, it anticipates that the next request will be for the subsequent block. If the request comes, the disk head is in the correct location, and the request is serviced very quickly. This approach does add a little latency to the system because it pauses slightly to see if the next request is for the subsequent block. However, this latency is possibly outweighed by the increased performance for neighboring requests.
Putting on your storage expert hat, you can see that the Anticipatory scheduler works really well for certain workloads. For example, one study observed that the Apache web server could achieve up to 71% more throughput using the anticipatory I/O scheduler. On the other hand, the anticipatory scheduler has been observed to result in a slowdown on a database run.
Deadline
The deadline I/O scheduler was written by well-known kernel developer Jens Axboe. The fundamental principle is to guarantee a start time for servicing an I/O request by combining request merging, a one-way elevator, and a deadline on all requests (hence the name). It maintains two deadline queues in addition to the sorted queues for reads and writes. The deadline queues are sorted by their deadline times (time to expiration), with shorter times moving to the head of the queue. The queues are sorted according to their sector number (the elevator approach).
By moving the I/O requests that have been in the queues the longest (the same as having the shortest deadline time), they will be executed before others, which guarantees that I/O requests won’t “starve” for various reasons, resulting in a very long time to execute the request.
A deadline scheduler really helps with distant reads (i.e., fairly far out on the disk or with a large sector number). Read I/O requests sometimes block applications because they have to be executed while the application waits. On the other hand, because writes are cached, execution can quickly return to the application – unless you have turned off the cache in the interest of making sure the data reaches the disk in the event of a power loss, in which case, writes would behave like read requests. Even worse, distant reads would be serviced very slowly because they are constantly moved to the back of the queue as requests for closer parts of the disk are serviced first. However, a deadline I/O scheduler makes sure that all I/O requests are serviced, even the distant read requests.
Diving into the scheduler a bit more, the concepts are surprisingly straightforward. The scheduler decides on the next request by first deciding which queue to use. It gives a higher priority to reads because, as mentioned, applications usually block on read requests. Next, it checks the first request to see if it has expired. If so, it is executed immediately; otherwise, the scheduler serves a batch of requests from the sorted queue.
The deadline scheduler is very useful for some applications. In particular, real-time systems use the deadline scheduler because, in most cases, it keeps latency low (all requests are serviced within a short time frame). It has also been suggested that it works well for database systems that have TCQ-aware disks.
CFQ
The completely fair queue (CFQ) I/O scheduler, is the current default scheduler in the Linux kernel. It uses both request merging and elevators and is a bit more complex that the NOOP or deadline schedulers. CFQ synchronously puts requests from processes into a number of per-process queues then allocates time slices for each of the queues to access the disk. The details of the length of the time slice and the number of requests a queue is allowed to submit are all dependent on the I/O priority of the given process. Asynchronous requests for all processes are batched together into fewer queues with one per priority.
Jens Axboe is the original author of the CFQ scheduler, incorporating elevator_linus, which adds features to prevent starvation for worst case situations, as could happen with distant reads. An article by Axboe has a good discussion on the design of the CFQ I/O scheduler (and others) and the intricacies of scheduler design.
CFQ gives all users (processes) of a particular device (storage) about the same number of I/O requests over a particular time interval, which can help multiuser systems see that all users get about the same level of responsiveness. Moreover, CFQ achieves some of the good throughput characteristics of the anticipatory scheduler because it allows a process queue to have some idle time at the end of a synchronous I/O request, creating some anticipatory time for I/O that might be close to the request just serviced.
Changing the Scheduler
You can easily change the I/O scheduler in use. By default, that is CFQ, but if you want to change it to something like NOOP or deadline, you can, for example, change the default scheduler with the elevator= option, either manually at the kernel line during the boot process or in the GRUB configuration file.
To change the default I/O scheduler in GRUB, edit the /boot/grub/menu.lst file by adding the elevator= option. For example, you could change it from cfq to deadline by adding elevator=deadline to the end of the line that begins with kernel. Be sure to use lowercase for the name of the scheduler. After a change, run the update-grub command.
A second way to change the I/O scheduler is on the fly. For example, you can determine which I/O scheduler is being used by looking at the /sys/block/[device]/queue/scheduler file, where [device] is the name of the device. For example, on my laptop, the command
root@laytonjb-laptop:~# cat /sys/block/sdb/queue/scheduler noop anticipatory deadline [cfq]
shows that the current I/O scheduler (in square brackets) is cfq. To change the scheduler, just echo the name of the desired scheduler:
root@laytonjb-laptop:~# echo deadline > /sys/block/sdb/queue/scheduler root@laytonjb-laptop:~# cat /sys/block/sdb/queue/scheduler noop anticipatory [deadline] cfq
Notice how the I/O scheduler was changed to deadline. When a change of scheduler is requested, the “old” scheduler completes all of its requests before control switches over to the new scheduler.
Summary
Today’s systems can have a large number of users, I/O-intensive workloads, requirements for high levels of interactivity, real-time demands, and a large number of disks and filesystems. Given the enormous strains that current systems impose on I/O subsystems, some way of controlling I/O requests is mandatory. This is where the I/O scheduler comes in.
The I/O scheduler is a very important tool for getting the best I/O performance. These schedulers can be designed to influence I/O and system behavior in whatever manner you desire. Currently, Linux includes four I/O schedulers: NOOP, anticipatory, deadline, and completely fair queuing (CFQ).
In this quick introduction to I/O schedulers, I did not discussed how to tune the various schedulers for your workload, which you can find in the documentation that comes with the source for your current kernel or online. Watching and measuring I/O performance can help you better understand your I/O workloads and how to get better performance.
The easiest way to change your I/O scheduler is to echo the name of the new scheduler to the appropriate device file in the /sys filesystem. The Phoronix website publishes test results for various workloads.