Tuning SSD RAID for optimal performance
Flash Stack
Conventional hard disks store their data on one or more magnetic disks, which are written to and read by read/write heads. In contrast, SSDs do not use mechanical components but store data in flash memory cells. Single-level cell (SLC) chips store 1 bit, multilevel cell (MLC) chips 2 bits, and triple-level cell (TLC) chips 3 bits per memory cell. Multiple memory cells are organized in a flash chip to form a page (e.g., 8KB). Several pages then form a block (~2MB).
At this level, the first peculiarity of flash memory comes to light: New data can be written to unused pages, but pages cannot simply be modified in place. Changes only work after the SSD controller has erased the entire associated block. A sufficient number of unused pages must therefore be available at all times. SSDs have additional memory cells (spare areas) for this purpose; depending on the SSD, the size of the spare area is between 7 and 78 percent of the rated capacity.
One way of telling the SSD which data areas are no longer in use, and can therefore be erased, is the Trim (or TRIM) function: The operating system tells the SSD controller which logical block ranges no longer contain valid data. Trim is easy to implement for a single SSD, but for parity RAID, the implementation would be quite complex; thus far, no hardware RAID controller supports Trim. This shortcoming is easy to work around, however: Most enterprise SSDs natively come with a comparatively large spare area, which is why Trim support hardly matters. And, if the performance is still not good enough, you can use overprovisioning – more on that later.
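On a single SSD under Linux, a few standard commands reveal whether Trim is available and let you trigger it manually. This is only a quick check; the device and mount point names are placeholders for your own system:

lsblk --discard /dev/sda        # non-zero DISC-GRAN/DISC-MAX values indicate Trim support
fstrim -v /mnt/ssd              # trim all unused blocks of a mounted filesystem once
systemctl enable fstrim.timer   # on systemd distributions, run fstrim periodically

Behind a hardware RAID controller, these commands typically have no effect, because the controller does not pass Trim through to its member SSDs.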
Metrics
When measuring SSD performance, three metrics are crucial: input/output operations per second (IOPS), latency, and throughput. The size of an I/O operation is 4KB unless otherwise stated; IOPS are typically "random" (i.e., measured with randomly distributed access to acquire the worst-case values). Whereas hard drives only manage around 100-300 IOPS, current enterprise SSDs achieve up to 36,000 write IOPS and 75,000 read IOPS (e.g., the 800GB DC S3700 SSD model by Intel).
Latency is the wait time in milliseconds (ms) until a single I/O operation has been carried out. The typical average latency for SSDs is between 0.1 and 0.3ms, and between 5 and 10ms for hard disks. It should be noted that hard disk manufacturers typically publish latency as the time for one-half revolution of the disk. For real latency, that is, the average access time, you need to add the track change time (seek time).
Finally, throughput is defined as the data transfer rate in megabytes per second (MBps) and is typically measured with larger, sequential I/O operations. SSDs achieve about two to three times the throughput of hard drives. For SSDs with only a few flash chips (lower capacity models), write performance is more limited and roughly at hard disk level.
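As a rough rule of thumb, the three metrics hang together: Throughput is approximately IOPS times I/O size and, with a full queue, average latency is approximately queue depth divided by IOPS. For the 75,000 read IOPS quoted above at 4KB, that works out to roughly 75,000 × 4KB ≈ 300MBps; at a queue depth of 16, the average latency would be around 16/75,000s ≈ 0.2ms, which matches the typical SSD latencies mentioned earlier.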
The following factors affect the performance of SSDs:
- Read/write mix: For SSDs, read and write operations differ considerably at the hardware level. Because of the higher controller overhead of write operations, SSDs typically achieve more read IOPS than write IOPS. The difference is particularly pronounced for consumer SSDs. With enterprise SSDs, the manufacturers improve write performance by using a larger spare area and optimizing the controller firmware.
- Random/sequential mix: The number of possible IOPS also depends on whether access is distributed randomly over the entire data area (logical block addressing [LBA] range) or occurs sequentially. In random access, the management overhead of the SSD controller increases, and the number of possible IOPS thus decreases.
- Queue depth: Queue depth refers to the length of the queue in the I/O path to the SSD. Given a larger queue (e.g., 8, 16, or 32), the operating system groups the configured number of I/O operations before sending them to the SSD controller. A larger queue depth increases the possible number of IOPS, because the SSD controller can distribute requests across the flash chips in parallel; however, it also increases the average latency, and thus the wait time for a single I/O operation, simply because each individual operation is not routed to the SSD immediately, but only when the queue is full (see the comparison after this list).
- Spare area: The size of the spare area has a direct effect on the random write performance of the SSD (and thus on the combination of read and write performance). The larger the spare area, the less frequently the SSD controller needs to restructure the internal data. The more time the SSD controller has for host requests, the more random write performance increases.
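To see the queue depth effect on your own hardware, you can run the same random read test twice with different queue depths and compare the IOPS and latency figures – for example, with the FIO tool presented in the next section. The job names are arbitrary, and the exact numbers will of course depend on your SSD:

fio --name=qd1 --rw=randread --size=1G --bs=4k --direct=1 --refill_buffers --ioengine=libaio --iodepth=1
fio --name=qd32 --rw=randread --size=1G --bs=4k --direct=1 --refill_buffers --ioengine=libaio --iodepth=32

On a typical enterprise SSD, the second run delivers several times the IOPS of the first, at the price of a noticeably higher average completion latency per I/O operation.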
Determining Baseline Performance
Given the SSD characteristics described above, performance testing needs to be adapted specifically to SSDs. In particular, the fresh-out-of-the-box state and the transition phases between workloads make it difficult to measure performance values reliably. The measured values thus depend on the following factors:
- Write access and preconditioning – the state of the SSD before the test (a minimal example follows this list).
- Workload pattern – the I/O pattern (read/write mix, block sizes) during the test.
- Data pattern – the data actually written.
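A minimal preconditioning sketch, assuming a dedicated test SSD at /dev/sdX (a placeholder) and the FIO tool introduced below, is to fill the entire device sequentially twice, which takes it out of the fresh-out-of-the-box state. Be warned that this irrevocably destroys all data on the device:

fio --name=precondition --filename=/dev/sdX --rw=write --bs=128k --direct=1 --ioengine=libaio --iodepth=16 --loops=2

The SNIA specification mentioned next goes considerably further and additionally requires the measured values to settle into a steady state before they count.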
The requirements for meaningful SSD tests even prompted the Storage Networking Industry Association (SNIA) to publish its own Enterprise Performance Test Specification (PTS) [1].
In most cases, however, tests on this scale are not an option. As a first step, it is often sufficient to use simple methods to determine the baseline performance of the SSD. This gives you metrics (MBps and IOPS) tailored to your system.
Table 1 looks at the Flexible I/O Tester (FIO) [2] performance tool. FIO is particularly common on Linux but is also available for Windows and VMware ESXi. Developed by the maintainer of the Linux block layer, Jens Axboe, the tool draws on some impressive knowledge and functionality. Use the table for a simple performance test of your SSD on Linux. Windows users will need to remove the libaio and iodepth parameters.
Table 1: FIO Performance Measurement

| Test | Command |
|---|---|
| Read throughput | fio --name=readTP --rw=read --size=5G --bs=1024k --direct=1 --refill_buffers --ioengine=libaio --iodepth=16 |
| Write throughput | fio --name=writeTP --rw=write --size=5G --bs=1024k --direct=1 --refill_buffers --ioengine=libaio --iodepth=16 |
| IOPS read | fio --name=readIOPS --rw=randread --size=1G --bs=4k --direct=1 --refill_buffers --ioengine=libaio --iodepth=16 |
| IOPS write | fio --name=writeIOPS --rw=randwrite --size=1G --bs=4k --direct=1 --refill_buffers --ioengine=libaio --iodepth=16 |
| IOPS mixed workload (50% read/50% write) | fio --name=mixedIOPS --rw=randrw --size=1G --bs=4k --direct=1 --refill_buffers --ioengine=libaio --iodepth=16 |
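Note that, as given in Table 1, the commands create a test file in the current working directory and therefore measure the SSD through the filesystem. To measure a raw, unmounted device instead, add a --filename parameter pointing at the device – keeping in mind that the write and mixed tests then overwrite data on it. The relevant results appear in the IOPS and bandwidth (bw) figures of FIO's summary output. For example:

fio --name=readIOPS --filename=/dev/sdX --rw=randread --size=1G --bs=4k --direct=1 --refill_buffers --ioengine=libaio --iodepth=16

Here, /dev/sdX is again only a placeholder for your test device.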
RAID with SSDs
An analysis of SSDs in a RAID array shows how the performance values develop as the number of SSDs increases. The tests also take the characteristics of the different RAID levels into account.
The hardware setup for the RAID tests consists of Intel DC S3500 series 80GB SSDs, an Avago (formerly LSI) MegaRAID 9365 RAID controller, and a Supermicro X9SCM-F motherboard. Note that the performance of an SSD within a series also depends on its capacity: A 240GB model has performance benefits over an 80GB model.
The performance software used in our lab was TKperf on Ubuntu 14.04. TKperf implements the SNIA PTS on the basis of FIO. Since version 2.0, it automatically creates Linux software RAID (SWR) arrays with mdadm and hardware RAID (HWR) arrays with storcli on Avago MegaRAID controllers. Support for Adaptec RAID controllers is planned.
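If you prefer to build the software RAID array by hand rather than letting TKperf do it, a basic mdadm call is enough; the RAID level, array name, and device names below are placeholders for your own setup:

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
cat /proc/mdstat   # wait for the initial sync to finish before measuring

Letting the initial synchronization complete before running any benchmarks keeps the background rebuild traffic from distorting the results.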