SMART storage device monitoring
Distress Signals
S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system for storage devices that provides information about the status of a device and allows for the running of self-tests. Administrators can use it to check on the status of their storage devices and periodically run self-tests to determine the state of the device.
IBM was the first company to add some monitoring and information capability to their drives in 1992. Other vendors followed suit, and Compaq led an effort to standardize the approach to monitoring drive health and reporting it. This push for standardization led to S.M.A.R.T. [1] (Although S.M.A.R.T. is the correct abbreviation, it's not nearly as easy to type, so I will be using SMART throughout the remainder of the article.)
Over time, SMART capability has been added to many drives, including PATA, SATA, and the many varieties of SCSI, SAS, and solid-state drives, as well as NVM Express (commonly referred to as NVMe) and even eMMC drives. The standard provides that the drives measure the appropriate health parameters and then make the results available for the operating system or other monitoring tools. However, each drive vendor is free to decide which parameters are to be monitored and their thresholds (i.e., the points at which the drive has "failed"). Note that I use "drive" as a generic term for a storage device in this article.
For a drive to be considered "SMART," all it has to have is the ability to signal between the internal drive sensors and the host computer. Nothing in the standard defines what sensors are in the drive or how the data is exposed to the user. However, at the lowest level, SMART provides a simple binary bit of information – the drive is OK or the drive has failed. This bit of information is called the SMART status. Many times the output DISK FAILING doesn't indicate that the drive has actually failed but that the drive might not meet its specifications.
It is fairly safe to assume that all modern drives have, in addition to the SMART status, SMART attributes. These attributes are completely up to the drive manufacturers and consequently are not standard. Therefore, each type of drive has to be scanned for various SMART attributes and possible values. In addition to SMART attributes, the drives can also contain some self-tests that store the results in the self-test log. These logs can be scanned or read to track the state of the drive, particularly over time. Moreover, you can also tell the drives to run self-tests that indicate whether the drive PASSED or FAILED the tests (more on this later).
For some SMART attributes, a lower value is better; for others, a higher value is better. You have to examine each attribute and decide which is the case (or consult the drive manufacturer's specifications). The difficulty in reading SMART attributes is that the threshold values beyond which the drive will not pass under ordinary conditions might not be published by the manufacturer. Moreover, each attribute returns a raw measurement value, in a format determined by the drive manufacturer, and a normalized value that ranges from 1 to 253. What counts as a "normal" attribute value is also completely up to the manufacturer, so it's not always easy to get SMART attributes from various drives or to interpret their values. Examples of some SMART attributes are listed in the article about SMART on Wikipedia [1], along with the typical meaning of their raw values.
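As a rough illustration of how the normalized value and the threshold relate, the following Python sketch classifies a single attribute. It assumes the convention described in the smartmontools documentation, that an attribute counts as failed when its normalized value is less than or equal to its threshold; the Attribute class and the attribute_state function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One SMART attribute row (hypothetical container for this sketch)."""
    name: str
    value: int    # current normalized value
    worst: int    # lowest normalized value recorded so far
    thresh: int   # vendor-defined failure threshold
    raw: int      # vendor-defined raw measurement


def attribute_state(attr: Attribute) -> str:
    """Classify an attribute, assuming 'value <= threshold' means failed."""
    if attr.value <= attr.thresh:
        return "failing now"
    if attr.worst <= attr.thresh:
        return "failed in the past"
    return "ok"


# Numbers taken from the Reallocated_Sector_Ct row of the drive
# examined later in this article (Listing 11).
realloc = Attribute("Reallocated_Sector_Ct", value=100, worst=100, thresh=10, raw=0)
print(attribute_state(realloc))
```

Note that the sketch looks only at the normalized columns; the raw value still has to be interpreted per vendor.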
SMART Attribute Drive Failure
One would think that you could predict failure with many of the SMART attributes. For example, if a drive was running too hot or if bad sectors were developing quickly, you might think the drive would be more susceptible to failure. Perhaps, then, you can use the attributes with some general models of drive failure to predict when drives might fail and then work to minimize the damage.
However, using SMART attributes to predict drive failure has proved difficult. In 2007, a Google study [2] examined more than 100,000 drives of various types for correlations between failure and SMART values. The disks were a combination of consumer-grade drives (SATA and PATA) with speeds from 5,400 to 7,200rpm and capacities ranging from 80 to 400GB. Several drive manufacturers were represented in the population of drives, with at least nine different models in total. The data in the study was collected over a nine-month window.
In the study, the authors monitored the SMART attributes of the population of drives and noted which drives failed. Google chose the word "fail" to mean that the drive was not suitable for use in production, even if the drive tests were good (sometimes the drive would test fine but immediately fail in production). From their study, the authors concluded:
Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
However, despite the overall message that they had difficulty developing correlations, they did find some interesting trends:
- In discussing the correlation between SMART attributes and failure rates, one of the best summaries in the paper stated, "Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives."
- Temperature effects are interesting, in that high temperatures start affecting older drives (3-4 years old or older), but lower temperatures can also increase the failure rate of drives, regardless of age.
- A section of the final paragraph of the paper bears repeating here: "We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors."
- Summing all observed factors that contributed to drive failure, such as errors or temperature, they still missed about 36% of drive failures.
The paper provides some good insight into the drive failure rate of a large population of drives. As mentioned previously, the authors did observe some correlation of drive failure with scan errors, but that didn't account for all failures, a large fraction of which did not show any SMART error signals. It's also important to note that the paper's last paragraph states that "… SMART models are more useful in predicting trends for large aggregate populations than for individual components." However, this should not deter you from watching the SMART error signals and attributes to track the history of drives in your systems. Again, there appears to be some correlation between scan errors and drive failure, which might be useful in your environment.
In a more recent 2016 study, Microsoft and Pennsylvania State University examined SSD failures in data centers [3]. Over nearly three years they examined about 500,000 SSDs from five very large data centers and several edge data centers. The drives were used in a variety of workloads, including big data analytics, content distribution caches, data center management software, and web search functions (indexing, multimedia, object store, advertisement, etc.). The big data analytics workload was more write than read heavy, and the other three workloads were more read than write heavy.
For all the drives, failure data was gathered, as well as other possible influencing factors, including design, provisioning and workload evolution data (read/write volumes, write amplification, etc.), fine spatial information (data center, rack, and server location), and temporal resolution. SMART attributes for the drives also were captured.
Some of the primary conclusions included:
- The annualized failure rate (AFR) for some drive models is much higher than quoted in SSD specifications – as much as 70% higher.
- Four SMART attributes are most important in determining drive failure:
  - Data errors (uncorrectable and cyclic redundancy check [CRC])
  - Sector reallocations
  - Program/erase failures
  - SATA downshift (a downgrade to a lower signaling rate with an increase in errors)
- Uncorrectable bit errors are at least an order of magnitude higher than the target rates.
- Symptoms captured by SMART are more likely to precede SSD failure, "with an intense manifestation preventing their survivability beyond a few months. However, our analysis shows that these symptoms are not a sufficient indicator for diagnosing failures."
- Drive symptoms (i.e., data errors and reallocated sectors) have a direct effect on failures.
- Design/provisioning factors (e.g., device model) can affect failure rates.
- Devices are more likely to fail in less than a month after their symptoms match failure signatures.
- The AFR increases two to four times with an increase in average writes per day for some drive types.
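To make the AFR figures in the list above concrete, here is a minimal Python sketch of one common way to compute an annualized failure rate: failures divided by drive-years of operation. This definition is an assumption for illustration (the study may compute AFR differently), and the fleet numbers are hypothetical.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR in percent, assuming AFR = failures per drive-year."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


# Hypothetical fleet: 1,000 drives observed for 365 days with 15 failures.
afr = annualized_failure_rate(failures=15, drive_days=1000 * 365)
print(f"{afr:.2f}%")
```

Counting drive-days rather than drives lets the same function handle fleets in which drives enter or leave service during the observation window.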
With the use of machine learning techniques, the researchers were able to rank the importance of SMART parameters (Table 1).
Table 1
SMART Parameter Ranking
Category | Feature | Importance |
---|---|---|
Symptom | DataErrors | 1 |
Symptom | ReallocSectors | 0.943 |
Device workload | TotalNANDWrite | 0.526 |
Device workload | HostWrite | 0.517 |
Device workload | TotalReads+Write | 0.516 |
Device workload | AvgMemory | 0.504 |
Device workload | AvgSSDSpace | 0.493 |
Device workload | UsagePerDay | 0.491 |
Getting to SMART Data and Self-Tests
Fortunately, Linux has a great tool, smartmontools [4], that takes advantage of the features and capabilities of SMART drives by allowing you to interact with storage devices that use the SMART protocol. Smartmontools lets you collect SMART attribute information, control self-tests on the drive, and obtain logs. Derived from and expanding on an earlier project, smartsuite, from the University of California at Santa Cruz, smartmontools incorporated later standards and additional features. The tool is compatible with all SMART features and supports ATA, ATAPI, and SATA-3 to -8 disks, as well as SCSI disks and tape, NVMe, solid-state, and eMMC devices. It also supports the major Linux RAID cards, which sometimes present difficulties because they require vendor-specific I/O control commands. Check the smartmontools web page for more details on your specific RAID card.
Smartmontools is easy to build and easy to use. In the interest of brevity, I just downloaded the latest 64-bit binary from the website. In the smartmontools package, the key binary is smartctl, which allows interaction with the SMART attributes of drives. For this article, I tested a Samsung 840 SSD on an office desktop running Ubuntu 18.04.
The first thing to do once smartmontools is installed is to scan each drive with the -i (info) option (Listing 1), which checks to see whether the drive is SMART capable.
Listing 1
Checking SMART Capability
    # smartctl -i /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Samsung based SSDs
    Device Model:     Samsung SSD 840 Series
    Serial Number:    S19HNSAD620520T
    LU WWN Device Id: 5 002538 5a0050931
    Firmware Version: DXT08B0Q
    User Capacity:    120,034,123,776 bytes [120 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    TRIM Command:     Available, deterministic, zeroed
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Sun Aug 2 10:41:21 2020 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
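If you want to check many drives, the two SMART support lines at the end of Listing 1 are the ones to look for. The following Python sketch extracts them from text that has already been captured from smartctl -i; the smart_support function name is hypothetical, and the sketch assumes the ATA-style wording shown in Listing 1.

```python
import re

# Excerpt of the Listing 1 output, used here as sample input.
SAMPLE_INFO = """\
Device Model:     Samsung SSD 840 Series
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""


def smart_support(info_text: str) -> dict:
    """Report whether SMART is available and enabled, per `smartctl -i` text."""
    states = re.findall(r"^SMART support is:\s*(.+)$", info_text, flags=re.MULTILINE)
    return {
        "available": any(state.startswith("Available") for state in states),
        "enabled": any(state.startswith("Enabled") for state in states),
    }


print(smart_support(SAMPLE_INFO))
```

In practice, the input text could come from running smartctl through Python's subprocess module; that step is left out here so the example runs without a drive or root privileges.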
Notice the Device is: In smartctl database … line, which indicates that this drive is in the smartctl database, so the SMART attributes and their corresponding values are known. To check on the details, use the -P show option (Listing 2).
Listing 2
Device Details
    # smartctl -P show /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    Drive found in smartmontools Database. Drive identity strings:
    MODEL:              Samsung SSD 840 Series
    FIRMWARE:           DXT08B0Q
    match smartmontools Drive Database entry:
    MODEL REGEXP:       SAMSUNG SSD PM800 .*GB|SAMSUNG SSD PM810 .*GB [cut due to length]
    FIRMWARE REGEXP:    .*
    MODEL FAMILY:       Samsung based SSDs
    ATTRIBUTE OPTIONS:  170 Unused_Rsvd_Blk_Ct_Chip
                        171 Program_Fail_Count_Chip
                        172 Erase_Fail_Count_Chip
                        173 Wear_Leveling_Count
                        174 Unexpect_Power_Loss_Ct
                        187 Uncorrectable_Error_Cnt
                        191 Unknown_Samsung_Attr
                        195 ECC_Error_Rate
                        199 CRC_Error_Count
                        201 Supercap_Status
                        202 Exception_Mode_Status
                        234 Unknown_Samsung_Attr
                        235 POR_Recovery_Count
                        236 Unknown_Samsung_Attr
                        237 Unknown_Samsung_Attr
                        238 Unknown_Samsung_Attr
                        243 SATA_Downshift_Ct
                        244 Thermal_Throttle_St
                        245 Timed_Workld_Media_Wear
                        246 Timed_Workld_RdWr_Ratio
                        247 Timed_Workld_Timer
                        249 Unknown_Samsung_Attr
                        250 SATA_Iface_Downshift
                        251 NAND_Writes
Another useful smartctl option is -c (Listing 3), which only prints the generic SMART capabilities. The somewhat long output lists what SMART features are implemented and how the device will respond to some of the various SMART commands.
Listing 3
SMART Capabilities
    # smartctl -c /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    General SMART Values:
    Offline data collection status:  (0x80) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (49272) seconds.
    Offline data collection
    capabilities:                    (0x53) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            No Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        (  30) minutes.
    SCT capabilities:              (0x003d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
Remember that most drives that are SMART capable have self-tests. In the -c output, notice that this drive is capable of both short (two-minute) and long (30-minute) tests.
Before running either the short or extended test, you should enable SMART on the drive, along with automatic offline testing and attribute autosave, with the smartctl command in Listing 4.
Listing 4
Enabling Self-Tests
    # smartctl -s on -o on -S on /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF ENABLE/DISABLE COMMANDS SECTION ===
    SMART Enabled.
    SMART Attribute Autosave Enabled.
    SMART Automatic Offline Testing Enabled every four hours.
Normally, before running self-tests, the health of the device is checked first with the -H option (Listing 5). The result is a simple response: in this case, PASSED.
Listing 5
Checking the Device Health
    # smartctl -H /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
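For scripted checks, the health result can be read either from the text in Listing 5 or from the exit status of smartctl, which is a bitmask. The Python sketch below does both. The bit descriptions paraphrase the exit-status section of the smartctl man page, the function names are hypothetical, and the text parser assumes the ATA wording shown in Listing 5 (other device types word the health line differently).

```python
import re

# Paraphrased from the exit-status bitmask described in the smartctl man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command failed or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold in the past",
    6: "device error log contains errors",
    7: "self-test log contains errors",
}

# Excerpt of the Listing 5 output, used here as sample input.
SAMPLE_HEALTH = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""


def decode_exit_status(status: int) -> list:
    """Return a description for every bit set in a smartctl exit status."""
    return [msg for bit, msg in EXIT_BITS.items() if status & (1 << bit)]


def overall_health(text: str):
    """Return the word after 'test result:' (e.g., PASSED), or None if absent."""
    match = re.search(r"self-assessment test result:\s*(\S+)", text)
    return match.group(1) if match else None


print(overall_health(SAMPLE_HEALTH))
print(decode_exit_status(8))
```

Checking the exit status avoids depending on the exact wording of the output, which is why it is often the more robust choice in monitoring scripts.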
The first self-test to try is the short test (Listing 6). Although the command kicks off the self-test, it does not display the results. To see the results, check the self-test output with the -l option (Listing 7).
Listing 6
Short Self-Test
    # smartctl -t short /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
    Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 2 minutes for test to complete.
    Test will complete after Sun Aug 2 10:57:54 2020 EDT
    Use smartctl -X to abort test.
Listing 7
Short Test Output
    # smartctl -l selftest /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%     12441         -
    # 2  Short offline       Completed without error       00%     12441         -
In this case, the output reports Completed without error. Also notice from the additional entry in the output that the test has been run previously.
Listings 8 and 9 show running the long, or extended, offline test and viewing its output. Remember, the command just kicks off the test; the long test takes much longer than the short test to complete (about 30 minutes on this drive, according to the -c output). To check whether the test has completed, use the same -l selftest option as before.
Listing 8
Long Self-Test
    # smartctl -t long /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
    Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 30 minutes for test to complete.
    Test will complete after Sun Aug 9 09:50:56 2020 EDT
    Use smartctl -X to abort test.
Listing 9
Long Test Output
    # smartctl -l selftest /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     12441         -
    # 2  Short offline       Completed without error       00%     12441         -
    # 3  Short offline       Completed without error       00%     12441         -
The output of the extended test is reported in the Extended offline line. It, too, was completed without error. If you run smartctl -l selftest and do not see any Extended offline output, the long test has not completed. Just wait a little longer and try the command again until you see the output.
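The check described above (look for a completed Extended offline entry in the self-test log) can be sketched in Python as follows. The function names are hypothetical, and the parser assumes the column layout of smartctl -l selftest output for ATA drives, in which fields are separated by two or more spaces; other device types may use a different layout.

```python
import re

# Sample input modeled on the self-test log in Listing 9.
SAMPLE_LOG = """\
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12441         -
# 2  Short offline       Completed without error       00%     12441         -
# 3  Short offline       Completed without error       00%     12441         -
"""

# One log row: number, description, status, remaining %, lifetime hours, LBA.
ROW = re.compile(
    r"^#[ \t]*(\d+)[ \t]+(.+?)[ \t]{2,}(.+?)[ \t]{2,}(\d+)%[ \t]+(\d+)[ \t]+(\S+)[ \t]*$",
    re.MULTILINE,
)


def parse_selftest_log(text: str) -> list:
    """Return the self-test log rows, in the order printed, as dicts."""
    entries = []
    for num, desc, status, remaining, hours, lba in ROW.findall(text):
        entries.append({
            "num": int(num),
            "test": desc,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return entries


def extended_test_finished(text: str) -> bool:
    """True once the log holds a completed Extended offline entry."""
    return any(
        entry["test"].startswith("Extended") and entry["status"].startswith("Completed")
        for entry in parse_selftest_log(text)
    )


print(extended_test_finished(SAMPLE_LOG))
```

A monitoring script could call extended_test_finished on fresh output every few minutes instead of rerunning the command by hand. Because the log keeps older entries, a real script would also compare the LifeTime(hours) value to make sure the entry belongs to the test it just started.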
In addition to self-tests, you can also search the SMART logs for errors with the simple command shown in Listing 10.
Listing 10
SMART Error Log
    # smartctl -l error -d sat /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART Error Log Version: 1
    No Errors Logged
This command is helpful if you run self-tests at periodic intervals. In this case, no errors are found on the drive.
Once you know that a drive is good (no errors and all self-tests are passed), you can start to probe the drive a little further. Earlier, I used the -c option to list the reporting capabilities of the drive. You can also use the -a option (Listing 11) to print all SMART information for the drive, including the vendor-specific SMART attributes. Although the output is rather long, it contains a great deal of information.
Listing 11
Vendor-Specific Attributes
    # /usr/local/sbin/smartctl -a /dev/sda
    smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Samsung based SSDs
    Device Model:     Samsung SSD 840 Series
    Serial Number:    S19HNSAD620520T
    LU WWN Device Id: 5 002538 5a0050931
    Firmware Version: DXT08B0Q
    User Capacity:    120,034,123,776 bytes [120 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    TRIM Command:     Available, deterministic, zeroed
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Sun Aug 2 10:57:50 2020 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x80) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      ( 249) Self-test routine in progress...
                                            90% of test remaining.
    Total time to complete Offline
    data collection:                (49272) seconds.
    Offline data collection
    capabilities:                    (0x53) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            No Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        (  30) minutes.
    SCT capabilities:              (0x003d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 1
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       12441
     12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1482
    177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       15
    179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
    181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
    182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
    183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
    187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0032   070   065   000    Old_age   Always       -       30
    195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
    199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
    235 POR_Recovery_Count      0x0012   099   001   000    Old_age   Always       -       69
    241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2909310200

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%     12441         -
    # 2  Short offline       Completed without error       00%     12441         -

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
      255 28175872 28241407  Read_scanning was never started
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
The first part of the output – the capabilities section – you've already seen with the -c option, but the second part is new. The table starting with ID# ATTRIBUTE_NAME FLAG contains the vendor-specific SMART attributes. The column to pay attention to is RAW_VALUE. For SSDs, a good first attribute to examine is in the first row, Reallocated_Sector_Ct, which is the count of reallocated sectors for the drive.
The drive has a pool of sectors held in reserve that can be used to swap out when it encounters bad sectors. If a bad sector is found, the drive allocates a sector from the reserve space, and the data is transferred to this new sector.
Notice that for this drive, the RAW_VALUE is 0. Recall that in the Microsoft study, the attribute is important (Table 1, ReallocSectors). Moreover, after the raw count increases, even by 1, the drive has a high probability of failing relatively soon.
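Because a change in the reallocated sector count is the event worth catching, it helps to record the attribute table periodically and compare snapshots. The Python sketch below parses attribute rows such as those in Listing 11 and compares two raw counts. The function names are hypothetical, the parser assumes the ten-column ATA attribute layout shown in Listing 11, and the second snapshot is invented purely to exercise the comparison.

```python
# Three rows from the attribute table in Listing 11, used as sample input.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       12441
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2909310200
"""


def parse_attributes(text: str) -> dict:
    """Map attribute name to its parsed columns for every attribute row."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(parts) == 10 and parts[0].isdigit():
            attrs[parts[1]] = {
                "id": int(parts[0]),
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "type": parts[6],
                "raw": parts[9].strip(),
            }
    return attrs


def reallocated_increase(previous: dict, current: dict) -> int:
    """Return how much the Reallocated_Sector_Ct raw count grew between snapshots."""
    def raw_count(attrs):
        return int(attrs["Reallocated_Sector_Ct"]["raw"].split()[0])
    return raw_count(current) - raw_count(previous)


before = parse_attributes(SAMPLE_ATTRS)
# Hypothetical later snapshot in which one sector has been reallocated.
after = parse_attributes(SAMPLE_ATTRS.replace("-       0\n", "-       1\n", 1))
print(reallocated_increase(before, after))
```

Keying the result by attribute name rather than ID keeps the comparison readable, although names come from the smartctl drive database and can differ between drive families.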
Summary
Pretty much all drives today come with SMART attributes, and the easy-to-configure Linux smartmontools utilities can query these SMART attributes and perform self-tests. However, because SMART attributes are not standard, smartmontools might not be useful for your particular drive (or RAID card), so you should check on the smartmontools mailing list whether your drive is in the database.
A couple of studies have shown that SMART attributes alone cannot predict the failure of drives. Nonetheless, these attributes can be used as indicators to help determine the probability of a drive failing.
Infos
[1] S.M.A.R.T.: https://en.wikipedia.org/wiki/S.M.A.R.T.
[2] Google study: https://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf
[3] SSD failures in datacenters: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf
[4] smartmontools: https://www.smartmontools.org/