S.M.A.R.T., smartmontools, and drive monitoring
Working Smart
S.M.A.R.T. (self-monitoring, analysis, and reporting technology) [1] is a monitoring system for storage devices that provides some information about the status of the drive as well as the ability to run self-tests. The intent of S.M.A.R.T. is for the drive to collect presumably useful information on its state or condition, with the idea that the information can be used to predict impending drive failure.
The S.M.A.R.T. standard was based on previous work by various drive manufacturers to provide health information about their drives. The information was specific to the manufacturer, making life difficult for everyone, so the S.M.A.R.T. standard was developed to provide a set of specific metrics and methods that could be communicated to the host OS. The original standard was very comprehensive, with standard data for all devices, but the final standard is just a shadow of the original.
To be considered S.M.A.R.T., a drive just needs the ability to signal between the internal drive sensors and the host system. The standard provides no information about what sensors are in the drive or how this data is exposed to the user. At the lowest level, S.M.A.R.T. provides a simple binary bit of information – the drive is OK or the drive has failed. This bit of information is called the "S.M.A.R.T. status." The drive fail status doesn't necessarily indicate that the drive has failed; rather, the drive might not meet its specifications. Whether this means the drive is about to fail (i.e., not work) is defined by the drive manufacturer, so S.M.A.R.T. is not a panacea.
In addition to the S.M.A.R.T. status, virtually all drives provide additional details on the health of the drive via S.M.A.R.T. attributes. These attributes are completely up to the drive manufacturers because they are not part of the standard. This becomes even more obvious when comparing spinning drives to solid state drives, which have very different characteristics and health information. Thus, each type of drive must be scanned for various S.M.A.R.T. attributes and possible values. Along with S.M.A.R.T. attributes, the drives can also contain some self-tests with the results stored in the self-test log. These logs may be scanned or read to track the state of the drive. Moreover, you can also tell the drives to run self-tests.
Attributes have a threshold value beyond which the drive will not pass under ordinary conditions (sometimes lower values are better and sometimes larger values are better), so the attributes are difficult to read. How these thresholds are chosen is known only to the manufacturer and might not be published.
Additionally, each attribute returns a raw value, the measurement of which is up to the drive manufacturer, and a normalized value that spans from 1 to 253. A "normal" attribute value also is completely up to the manufacturer. It's not too difficult to get S.M.A.R.T. attributes from various drives, but the values can be hard to interpret. Wikipedia [1] has a pretty good list of common attributes and their meanings.
All of the features and capabilities of S.M.A.R.T. drives sound great, but how does one collect S.M.A.R.T. attribute information or control drive self-tests and obtain the logs?
smartmontools
Fortunately for Linux types, an open source project, smartmontools [2], allows you to interact with the S.M.A.R.T. data in drives on your systems. Using this tool, you can query the drives to gather information about them, or you can test the drives and gather logs. This kind of information can be an administrator's dream, because it can be used to keep track of the system configuration as well as the state of the drives (more on that later).
Smartmontools is compatible with all S.M.A.R.T. features and supports ATA/ATAPI/SATA-3 to -8 disks and SCSI disks and tape devices. It also supports the major Linux RAID cards, which can sometimes cause difficulties because they require vendor-specific I/O control commands. Check the smartmontools page for more details on your specific card.
For this article, I used a freshly installed CentOS 6.4 [3] distribution on a newly built system with smartmontools installed using yum. I also made sure the smartmontools daemon, smartd, starts with the system by using:
chkconfig smartd on
I will illustrate some smartmontools tricks by exploring two types of drives in my system. The first is a Samsung 840 SSD (120GB), and the second is a Seagate Barracuda 3TB SATA drive. I want to test both types of drives, so you can see how the output differs, particularly for SSDs that don't have rotating media.
First Steps
Before using smartmontools, you might want to read the man pages, but they are rather long, so I hope to give you a few quick tips on getting started; then, you can read the man pages as needed. The first thing I like to do is an "inquiry" on the drives. The output for the Samsung SSD is shown in Listing 1.
Listing 1
Samsung SSD Inquiry
[root@home4 ~]# smartctl -i /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 840 Series
Serial Number:    S19HNSAD620517N
LU WWN Device Id: 5 002538 5a005092e
Firmware Version: DXT08B0Q
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4c
Local Time is:    Sun Oct 13 09:11:24 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Note that the command for using smartmontools is smartctl. As this output shows, the drive is S.M.A.R.T. capable, but not in the smartmontools database. Recall that the S.M.A.R.T. attributes can vary from one vendor to another. The smartmontools community contributes information to the database for various drives so that other people can use the information.
The Seagate hard drive inquiry output is shown in Listing 2. Even though this device is in the database, using the command smartctl -P show doesn't show anything (there are no presets for the drives, i.e., no preset S.M.A.R.T. output information).
Listing 2
Seagate HD Inquiry
[root@home4 ~]# smartctl -i /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F35P0G
LU WWN Device Id: 5 000c50 050b954c3
Firmware Version: CC27
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ACS-2 (unknown minor revision code: 0x001f)
Local Time is:    Sun Oct 13 09:12:15 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
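Because the inquiry output is a simple list of "label: value" lines, it lends itself to scripted inventory. The following is only a sketch: smart_info_field is a hypothetical helper name, not something shipped with smartmontools, and it assumes the label is passed exactly as smartctl prints it.

```shell
# Print the value of one "Label: value" line from `smartctl -i` output on stdin.
# smart_info_field is a hypothetical helper, not part of smartmontools.
smart_info_field() {
    # $1 is the label exactly as smartctl prints it, e.g. "Device Model"
    awk -v key="$1" '
        index($0, key ":") == 1 {
            # Keep everything after "Label:" and strip the padding
            value = substr($0, length(key) + 2)
            sub(/^[ \t]+/, "", value)
            print value
            exit    # some labels repeat; report the first match only
        }'
}
```

With that function defined, a command such as smartctl -i /dev/sdd | smart_info_field 'Serial Number' prints just the serial number, which is convenient when recording which drive sits in which node.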
After performing an inquiry on a drive, I like to check its health. This is pretty simple using the -H option. Listings 3 and 4 show the output for the SSD and the hard drive. You can see that both drives pass their overall health self-assessment.
Listing 3
SSD Health Check
[root@home4 ~]# smartctl -H /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Listing 4
Hard Drive Health Check
[root@home4 ~]# smartctl -H /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
If the output says FAILED instead of PASSED, your best bet is to immediately copy all data from that drive and take it out of service. Although the drive might be able to function for some time, you can't be sure how long.
As an administrator of an HPC system, you could put this health status check in a cron job to run periodically. You could even create a metric as part of a monitoring tool, such as Ganglia [4], to show the status of all of the drives, either in a central storage system or in the compute nodes.
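A periodic check of this kind might look like the sketch below. The function names smart_health_passed and check_drives are made up for illustration, and the string being matched is the ATA wording shown in Listings 3 and 4; other device types may word the result differently, so treat the pattern as an assumption to verify on your own hardware.

```shell
# Read `smartctl -H` output on stdin; succeed only if the drive reports PASSED.
# smart_health_passed is a hypothetical helper name.
smart_health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Check every device given as an argument and print one line per drive.
# Assumes smartctl is on the PATH and the script runs with root privileges.
# Returns non-zero if any drive did not report PASSED.
check_drives() {
    rc=0
    for dev in "$@"; do
        if smartctl -H "$dev" | smart_health_passed; then
            echo "$dev OK"
        else
            echo "$dev NOT PASSED"
            rc=1
        fi
    done
    return "$rc"
}
```

A script run from cron could call check_drives /dev/sdb /dev/sdd and mail or log the output; the non-zero return value makes it easy to raise an alert or feed a monitoring metric.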
Potentially, a drive's S.M.A.R.T. system has a huge number of attributes. You can discover those attributes using the -a option. The output can be quite long, but an example for the Seagate drive is shown in Listing 5.
Listing 5
Attribute Discovery (abridged)
[root@home4 ~]# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F35P0G
LU WWN Device Id: 5 000c50 050b954c3
Firmware Version: CC27
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ACS-2 (unknown minor revision code: 0x001f)
Local Time is:    Sun Oct 13 09:48:13 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 344) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       130040264
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
[...]
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       170901043675205
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       11336138150
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       5864598746
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The Seagate has more attributes than the Samsung SSD. If you read the list, some items might look alarming. For example, under TYPE you see some attributes listed as Pre-fail. This does not mean the drive is getting ready to fail; TYPE only classifies the attribute. A Pre-fail attribute is one that would signal imminent failure if its normalized value dropped to or below the threshold, whereas an Old_age attribute reflects normal wear. Also note that even if an attribute is listed as "failed" at some point, that does not mean the drive really has failed. Whether the drive has failed is defined by the drive manufacturer.
The first attribute, Raw_Read_Error_Rate, is the rate of hardware read errors that occurred when reading data from the drive. The normalized value of the attribute is 117, its worst recorded value is 099, and the threshold is 006.
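According to the smartctl man page, an attribute counts as failed when its normalized VALUE is less than or equal to THRESH. A rough way to scan the attribute table for that condition is sketched here; smart_attrs_at_threshold is a hypothetical name, and the column positions are assumed to match the ATA table layout shown in Listing 5.

```shell
# List attributes whose normalized VALUE is at or below THRESH.
# Reads `smartctl -A` (or `-a`) output on stdin.
# smart_attrs_at_threshold is a hypothetical helper name.
smart_attrs_at_threshold() {
    # Attribute rows start with a numeric ID and carry a 0x... FLAG in column 3.
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ && NF >= 10 {
        if (($4 + 0) <= ($6 + 0))
            print $2, $7, "value=" ($4 + 0), "thresh=" ($6 + 0)
    }'
}
```

Running smartctl -A /dev/sdd | smart_attrs_at_threshold on a healthy drive should print nothing; any line that does appear names an attribute worth a closer look, with its TYPE telling you whether it is a Pre-fail or an Old_age attribute.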
Does this mean the read error rate is 117 when the threshold is 6? Not necessarily, because the absolute values being examined are meaningless without knowing their definitions. What you should do is track that attribute and see when or if it changes.
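Tracking an attribute over time can be as simple as appending its raw value to a log file at regular intervals. The sketch below assumes the ATA attribute table layout from Listing 5; smart_attr_raw and log_attr are hypothetical names, and only the first token of the RAW_VALUE column is captured, which is an assumption that may not suit attributes whose raw field contains extra text.

```shell
# Print the first token of RAW_VALUE for one attribute name.
# Reads `smartctl -A` output on stdin. smart_attr_raw is a hypothetical helper.
smart_attr_raw() {
    # Match on the ATTRIBUTE_NAME column; the 0x... FLAG confirms a table row
    awk -v name="$1" '$2 == name && $3 ~ /^0x/ { print $10; exit }'
}

# Append "timestamp attribute raw-value" to a log file.
# $1 = device, $2 = attribute name, $3 = log file (all chosen by the caller).
# Assumes smartctl is on the PATH and the caller has root privileges.
log_attr() {
    raw=$(smartctl -A "$1" | smart_attr_raw "$2")
    printf '%s %s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$2" "$raw" >> "$3"
}
```

Calling log_attr /dev/sdd Raw_Read_Error_Rate with a log file of your choice once a day from cron builds a history you can plot or compare, which is usually more telling than any single reading.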
Testing
One thing S.M.A.R.T. can do is run self-tests on drives. Even though the self-test details are manufacturer dependent, they can be useful, because at least they cause the drives to do some sort of testing to determine whether they are "OK" or not. S.M.A.R.T. performs two main tests – a short self-test and a long self-test – which you can schedule in a cron job.
Before I run self-tests, I want to set a couple of options on the drives to make sure I capture the data. To set the options, you use smartctl (Listing 6):
smartctl -s on -S on /dev/sdb
Listing 6
Setting Options
[root@home4 ~]# smartctl -s on -S on /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
The lowercase -s on option enables S.M.A.R.T. on a particular device (/dev/sdb in this case). The uppercase -S on option tells the device to save the S.M.A.R.T. attributes between power cycles, so the results survive a reboot.
If you look closely in the output of the S.M.A.R.T. attributes, you will see some estimated times for the short self-test and the long self-test. For the Seagate drive, the short self-test takes about one minute, and the long self-test takes about 344 minutes. These are only estimates, but in my experience, for a quiet system, they have proven to be pretty good estimates. Running the short self-test on the Seagate drive is simple (Listing 7):
smartctl -t short /dev/sdd
Listing 7
Short Self-Test
[root@home4 ~]# smartctl -t short /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sun Oct 13 09:57:27 2013
Use smartctl -X to abort test.
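If you script the self-tests, it helps to know how long to wait before checking the log. One possible approach is to pull the drive's own estimate out of the capabilities output (smartctl -c, or the same section of smartctl -a). In this sketch, smart_test_minutes is a hypothetical name, and the parsing assumes the "recommended polling time" wording shown in Listing 5.

```shell
# Print the recommended polling time in minutes for one self-test type.
# $1 is the type as smartctl prints it: "Short", "Extended", or "Conveyance".
# Reads `smartctl -c` output on stdin. smart_test_minutes is a hypothetical helper.
smart_test_minutes() {
    awk -v kind="$1" '
        # The label and the time may sit on one line or on two consecutive lines
        index($0, kind " self-test routine") == 1 { want = 1 }
        want && /recommended polling time/ {
            s = $0
            sub(/^[^(]*\(/, "", s)   # drop everything up to the opening parenthesis
            sub(/\).*$/, "", s)      # drop the closing parenthesis and the rest
            print s + 0              # numeric conversion strips the padding
            exit
        }'
}
```

A wrapper script could start a test with smartctl -t, sleep for the number of minutes reported by smartctl -c /dev/sdd | smart_test_minutes Short, and only then read the self-test log.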
You can check whether a test is done by issuing the command smartctl -l selftest /dev/sdd (Listing 8). Note that the output tells you how much of the test remains to finish (20% in this case). When the test is done, the output looks like Listing 9. You really want to see the message Completed without error in the output.
Listing 8
Check for Finished Test
[root@home4 ~]# smartctl -l selftest /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Self-test routine in progress 20%        61         -
Listing 9
Finished Short Self-Test
[root@home4 ~]# smartctl -l selftest /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        61         -
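Reading the self-test log by eye is fine for one drive, but a script needs a simpler answer. The following sketch classifies the most recent log entry (the line numbered # 1) using the status strings seen in Listings 8 and 9. The function name smart_last_selftest and the four labels it prints are my own invention, and any status other than the two shown above is lumped together as "problem".

```shell
# Classify the most recent entry of the self-test log.
# Reads `smartctl -l selftest` output on stdin and prints one of:
#   running | ok | problem | none
# smart_last_selftest and its labels are hypothetical, not smartmontools output.
smart_last_selftest() {
    awk '/^# *1 / {
        if ($0 ~ /in progress/)
            print "running"
        else if ($0 ~ /Completed without error/)
            print "ok"
        else
            print "problem"
        found = 1
        exit
    }
    END { if (!found) print "none" }'
}
```

For example, a cron script could run smartctl -l selftest /dev/sdd | smart_last_selftest and raise an alert on anything other than ok or running.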
Running the long self-test is very similar to running the short test (Listing 10):
smartctl -t long /dev/sdd
Listing 10
Long Self-Test
[root@home4 ~]# smartctl -t long /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 344 minutes for test to complete.
Test will complete after Sun Oct 13 16:04:31 2013
Use smartctl -X to abort test.
Although the test did take quite some time, when it completed, the output was as shown in Listing 11. Again, the test completed without error.
Listing 11
Finished Long Self-Test
[root@home4 laytonjb]# smartctl -l selftest /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        67         -
# 2  Short offline       Completed without error       00%        61         -
The short and long self-tests start as soon as the command is issued, whether you type it by hand or, more likely, run it from a cron job. An offline test, on the other hand, occurs at preset intervals, so the drive could be in use when the test is run, which could lead to performance degradation. The smartctl man page states:
This type of test can, in principle, degrade the device performance. The '-o on' option causes this offline testing to be carried out, automatically, on a regular scheduled basis. Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so in practice it has little effect.
To run offline tests, you use the -o on option with smartctl, and it is up to you whether or not to run the tests. My recommendation is that if the storage devices are likely to be busy, as they would be in a storage system, then you might not want to run the offline tests. However, if the drive is in a compute node or a workstation with idle device usage periods, then you might want to run the tests.