SMART Devices

Jeff Layton

08/14/2020 06:08 pm

Most storage devices have SMART capability, but can it help you predict failure? We look at ways to take advantage of this built-in monitoring technology with the smartctl utility from the Linux smartmontools package.

S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system for storage devices that provides information about the status of a device and allows for the running of self-tests. Administrators can use it to check on the status of their storage devices and periodically run self-tests to determine the state of the device.

IBM was the first company to add some monitoring and information capability to their drives in 1992. Other vendors followed suit, and Compaq led an effort to standardize the approach to monitoring drive health and reporting it. This push for standardization led to S.M.A.R.T. (Although S.M.A.R.T. is the correct abbreviation, it’s not nearly as easy to type, so I will be using SMART throughout the remainder of the article.)

Over time, SMART capability has been added to many drives, including PATA, SATA, and the many varieties of SCSI, SAS, and solid-state drives, as well as NVM Express (commonly referred to as NVMe) and even eMMC drives. The standard provides that the drives measure the appropriate health parameters and then make the results available for the operating system or other monitoring tools. However, each drive vendor is free to decide which parameters are to be monitored and their thresholds (i.e., the points at which the drive has “failed”). Note that I use “drive” as a generic term for a storage device in this article.

For a drive to be considered “SMART,” all it has to have is the ability to signal between the internal drive sensors and the host computer. Nothing in the standard defines what sensors are in the drive or how the data is exposed to the user. However, at the lowest level, SMART provides a simple binary bit of information – the drive is OK or the drive has failed. This bit of information is called the SMART status. Many times the output DISK FAILING doesn’t indicate that the drive has actually failed but that the drive might not meet its specifications.

It is fairly safe to assume that all modern drives have, in addition to the SMART status, SMART attributes. These attributes are completely up to the drive manufacturers and consequently are not standard. Therefore, each type of drive has to be scanned for various SMART attributes and possible values. In addition to SMART attributes, the drives can also contain some self-tests that store the results in the self-test log. These logs can be scanned or read to track the state of the drive, particularly over time. Moreover, you can also tell the drives to run self-tests that indicate whether the drive PASSED or FAILED the tests (more on this later).

For some SMART attributes, lower values are better; for others, higher values are better. You have to examine each attribute and decide which is true (or consult the drive manufacturer’s specifications). The difficulty in reading SMART attributes is that the threshold values beyond which the drive will not pass under ordinary conditions might not be published by the manufacturer. Moreover, each attribute returns a raw measurement value, whose meaning is determined by the drive manufacturer, and a normalized value that ranges from 1 to 253. The “normal” attribute value is completely up to the manufacturer, as well, so you can see that it’s not always easy to get SMART attributes from various drives or to interpret the values. Examples of some SMART attributes are listed in the article about SMART on Wikipedia, along with the typical meaning for their raw values.
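To make the relationship between the normalized value and the threshold concrete, here is a minimal Python sketch. It assumes the convention that an attribute counts as failed once its normalized value is at or below the vendor threshold; the Attribute class, the attribute_status function, and the status strings are hypothetical names chosen for illustration and are not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One SMART attribute (illustrative structure, not a smartmontools type)."""
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # vendor-defined threshold


def attribute_status(attr: Attribute) -> str:
    """Classify an attribute by comparing normalized values with the threshold."""
    if attr.thresh == 0:
        # A normalized value of 1 or more can never reach a threshold of 0.
        return "no-threshold"
    if attr.value <= attr.thresh:
        return "failing-now"
    if attr.worst <= attr.thresh:
        return "failed-in-the-past"
    return "ok"
```

The raw value is deliberately left out of this check, because its meaning differs from vendor to vendor.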

SMART Attributes and Drive Failure

One would think that you could predict failure with many of the SMART attributes. For example, if a drive was running too hot or if bad sectors were developing quickly, you might think the drive would be more susceptible to failure. Perhaps, then, you can use the attributes with some general models of drive failure to predict when drives might fail and then work to minimize the damage.

However, the use of SMART attributes for predicting drive failure has been a difficult proposition. In 2007, a Google study examined more than 100,000 drives of various types for correlations between failure and SMART values. The disks were a combination of consumer-grade drives (SATA and PATA) with speeds from 5,400 to 7,200rpm and capacities ranging from 80 to 400GB. Several drive manufacturers were represented in the population of drives, with at least nine different models in total. The data in the study was collected over a nine-month window.

In the study, the authors monitored the SMART attributes of the population of drives and noted which drives failed. Google chose the word “fail” to mean that the drive was not suitable for use in production, even if the drive tests were good (sometimes the drive would test fine but immediately fail in production). From their study, the authors concluded:

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

However, despite the overall message that they had difficulty developing correlations, they did find some interesting trends:

- In discussing the correlation between SMART attributes and failure rates, one of the best summaries in the paper stated, “Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.”
- Temperature effects are interesting, in that high temperatures start affecting older drives (3–4 years old or older), but lower temperatures can also increase the failure rate of drives, regardless of age.
- A section of the final paragraph of the paper bears repeating here: “We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors.”
- Summing all observed factors that contributed to drive failure, such as errors or temperature, they still missed about 36% of drive failures.

The paper provides good insight into the failure rate of a large population of drives. As mentioned previously, the authors did observe some correlation of drive failure with scan errors, but that correlation didn’t account for all failures, a large fraction of which showed no SMART error signals. The paper’s final paragraph also notes that “... SMART models are more useful in predicting trends for large aggregate populations than for individual components.” However, this should not deter you from watching the SMART error signals and attributes to track the history of drives in your systems. Again, there appears to be some correlation between scan errors and drive failure, which might be useful in your environment.

In a more recent 2016 study, researchers from Microsoft and Pennsylvania State University examined SSD failures in data centers. Over nearly three years, they examined about 500,000 SSDs from five very large data centers and several edge data centers. The drives were used in a variety of workloads, including big data analytics, content distribution caches, data center management software, and web search functions (indexing, multimedia, object store, advertisement, etc.). The big data analytics workload was more write than read heavy, and the other three workloads were more read than write heavy.

For all the drives, failure data was gathered, as well as other possible influencing factors, including design, provisioning and workload evolution data (read/write volumes, write amplification, etc.), fine spatial information (data center, rack, and server location), and temporal resolution. SMART attributes for the drives also were captured.

Some of the primary conclusions included:

- The annualized failure rate (AFR) for some drive models is much higher than quoted in SSD specifications – as much as 70% higher.
- Four SMART attributes are most important in determining drive failure:
  - Data errors (uncorrectable and cyclic redundancy check [CRC])
  - Sector reallocations
  - Program/erase failures
  - SATA downshift (a downgrade to a lower signaling rate with an increase in errors)
- Uncorrectable bit errors are at least an order of magnitude higher than the target rates.
- Symptoms captured by SMART are more likely to precede SSD failure, “with an intense manifestation preventing their survivability beyond a few months. However, our analysis shows that these symptoms are not a sufficient indicator for diagnosing failures.”
- Drive symptoms (i.e., data errors and reallocated sectors) have a direct effect on failures.
- Design/provisioning factors (e.g., device model) can affect failure rates.
- Devices are more likely to fail in less than a month after their symptoms match failure signatures.
- The AFR increases two to four times with an increase in average writes per day for some drive types.

With the use of machine learning techniques, the researchers were able to rank the importance of SMART parameters (Table 1).

Table 1: SMART Parameter Ranking

Category	Feature	Importance
Symptom	DataErrors	1
Symptom	ReallocSectors	0.943
Device workload	TotalNANDWrite	0.526
Device workload	HostWrite	0.517
Device workload	TotalReads+Write	0.516
Device workload	AvgMemory	0.504
Device workload	AvgSSDSpace	0.493
Device workload	UsagePerDay	0.491
Device workload	TotalReads	0.475
Device workload	ReadsPerDay	0.469

Getting to SMART Data and Self-Tests

Fortunately, Linux has a great tool, smartmontools, that takes advantage of the features and capabilities of SMART drives by allowing you to interact with storage devices that use the SMART protocol. Smartmontools lets you collect SMART attribute information, control self-tests on the drive, and obtain logs. Derived from and expanding on an earlier project, smartsuite, from the University of California at Santa Cruz, smartmontools incorporated later standards and additional features. The tool is compatible with all SMART features and supports ATA, ATAPI, and SATA-3 to -8 disks, as well as SCSI disks and tape, NVMe, solid-state, and eMMC devices. It also supports the major Linux RAID cards, which sometimes present difficulties because they require vendor-specific I/O control commands. Check the smartmontools web page for more details on your specific RAID card.

Smartmontools is easy to build and easy to use. In the interest of brevity, I just downloaded the latest 64-bit binary from the website. In the smartmontools package, the key binary is smartctl, which allows interaction with the SMART attributes of drives. For this article, I tested a Samsung 840 SSD on an office desktop running Ubuntu 18.04.

The first thing to do once smartmontools is installed is to scan each drive with the -i (info) option (Listing 1), which checks to see whether the drive is SMART capable.

Listing 1: Checking SMART Capability

# smartctl -i /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 Series
Serial Number:    S19HNSAD620520T
LU WWN Device Id: 5 002538 5a0050931
Firmware Version: DXT08B0Q
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug  2 10:41:21 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Notice the Device is: In smartctl database … line, which indicates that this drive is in the smartctl database, so the SMART attributes and their corresponding values are known. To check on the details, use the -P show option (Listing 2).

Listing 2: Device Details

# smartctl -P show /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Drive found in smartmontools Database.  Drive identity strings:
MODEL:              Samsung SSD 840 Series
FIRMWARE:           DXT08B0Q
match smartmontools Drive Database entry:
MODEL REGEXP:       SAMSUNG SSD PM800 .*GB|SAMSUNG SSD PM810 .*GB

[cut due to length]

FIRMWARE REGEXP:    .*
MODEL FAMILY:       Samsung based SSDs
ATTRIBUTE OPTIONS:  170 Unused_Rsvd_Blk_Ct_Chip
                    171 Program_Fail_Count_Chip
                    172 Erase_Fail_Count_Chip
                    173 Wear_Leveling_Count
                    174 Unexpect_Power_Loss_Ct
                    187 Uncorrectable_Error_Cnt
                    191 Unknown_Samsung_Attr
                    195 ECC_Error_Rate
                    199 CRC_Error_Count
                    201 Supercap_Status
                    202 Exception_Mode_Status
                    234 Unknown_Samsung_Attr
                    235 POR_Recovery_Count
                    236 Unknown_Samsung_Attr
                    237 Unknown_Samsung_Attr
                    238 Unknown_Samsung_Attr
                    243 SATA_Downshift_Ct
                    244 Thermal_Throttle_St
                    245 Timed_Workld_Media_Wear
                    246 Timed_Workld_RdWr_Ratio
                    247 Timed_Workld_Timer
                    249 Unknown_Samsung_Attr
                    250 SATA_Iface_Downshift
                    251 NAND_Writes

Another useful smartctl option is -c (Listing 3), which only prints the generic SMART capabilities. The somewhat long output lists what SMART features are implemented and how the device will respond to some of the various SMART commands.

Listing 3: SMART Capabilities

# smartctl -c /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (49272) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

Remember that most drives that are SMART capable have self-tests. In the -c output, notice that this drive is capable of both short (two-minute) and long (30-minute) tests.
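If you want a script to know how long to wait for each test, the recommended polling times can be pulled out of the -c output. The sketch below is one possible approach in Python; polling_minutes is a hypothetical name, and the regular expression assumes the two-line layout shown in Listing 3.

```python
import re

# Matches, for example:
#   Short self-test routine
#   recommended polling time: ( 2) minutes.
POLL_RE = re.compile(
    r"(Short|Extended) self-test routine\s+"
    r"recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def polling_minutes(capabilities_text: str) -> dict:
    """Map 'short' and 'extended' to the recommended polling time in minutes."""
    return {
        kind.lower(): int(minutes)
        for kind, minutes in POLL_RE.findall(capabilities_text)
    }


SAMPLE_CAPABILITIES = """\
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
"""
```

For the drive in this article, polling_minutes(SAMPLE_CAPABILITIES) yields 2 minutes for the short test and 30 minutes for the extended test.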

Before running either the short or extended test, you should make sure SMART is enabled on the drive, along with attribute autosave and automatic offline testing, by using the smartctl command in Listing 4.

Listing 4: Enabling Self-Tests

# smartctl -s on -o on -S on /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.

Normally, before running self-tests, the health of the device is checked first with the -H option (Listing 5). The result is a simple response: in this case, PASSED.

Listing 5: Checking the Device Health

# smartctl -H /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
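For unattended checks, the single result line in Listing 5 is easy to extract. This Python sketch assumes the ATA-style wording shown in that listing (other device types might word the line differently); overall_health and is_healthy are hypothetical names.

```python
import re

HEALTH_RE = re.compile(
    r"^SMART overall-health self-assessment test result:\s*(\S+)",
    re.MULTILINE,
)


def overall_health(text: str):
    """Return the reported result (e.g., 'PASSED'), or None if the line is missing."""
    match = HEALTH_RE.search(text)
    return match.group(1) if match else None


def is_healthy(text: str) -> bool:
    """Treat anything other than an explicit PASSED as unhealthy."""
    return overall_health(text) == "PASSED"


SAMPLE_HEALTH = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""
```

Treating a missing line as unhealthy is a deliberately cautious design choice: a drive that cannot report its status deserves a closer look.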

The first self-test to try is the short test (Listing 6). Although the command kicks off the self-test, it does not display the results. To see the results, check the self-test output with the -l option (Listing 7).

Listing 6: Short Self-Test

# smartctl -t short /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sun Aug  2 10:57:54 2020 EDT
Use smartctl -X to abort test.

Listing 7: Short Test Output

# smartctl -l selftest /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12441         -
# 2  Short offline       Completed without error       00%     12441         -

In this case, the output reports Completed without error. Also notice from the additional entry in the output that the test has been run previously.
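To track self-test results over time, the log entries can be turned into records. The following Python sketch assumes the column layout of Listing 7, in which fields are separated by two or more spaces; parse_selftest_log and the dictionary keys are hypothetical names.

```python
import re

# One log entry, e.g.:
# "# 1  Short offline       Completed without error       00%     12441         -"
ENTRY_RE = re.compile(
    r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$",
    re.MULTILINE,
)


def parse_selftest_log(text: str) -> list:
    """Return self-test log entries as dicts, in the order smartctl lists them."""
    entries = []
    for num, test, status, remaining, hours, lba in ENTRY_RE.findall(text):
        entries.append({
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_percent": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return entries


SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12441         -
# 2  Short offline       Completed without error       00%     12441         -
"""
```

A monitoring script could then flag any entry whose status is not "Completed without error".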

Listings 8 and 9 show running the long, or extended, offline test and viewing its output. Remember, the command just kicks off the test; the long test takes considerably longer to complete (about 30 minutes on this drive, according to Listing 8). To check whether the test has completed, use the same -l selftest option as before.

Listing 8: Long Self-Test

# smartctl -t long /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 30 minutes for test to complete.
Test will complete after Sun Aug  9 09:50:56 2020 EDT
Use smartctl -X to abort test.

Listing 9: Long Test Output

# smartctl -l selftest /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12441         -
# 2  Short offline       Completed without error       00%     12441         -
# 3  Short offline       Completed without error       00%     12441         -

The output of the extended test is reported in the Extended offline line. It, too, was completed without error. If you run

smartctl -l selftest

and do not see any Extended offline output, the long test has not completed. Just wait a little longer and try the command again until you see the output.
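Instead of rerunning the command by hand, a script can wait until the drive stops reporting a test in progress. While a test runs, the capabilities section shows a "Self-test routine in progress..." status with the percentage remaining (visible in Listing 11). The Python sketch below polls for that text; the function names, the default device path, and the polling interval are assumptions for illustration, and the reader function is passed in so the loop can be tried without a real drive.

```python
import re
import subprocess
import time

REMAINING_RE = re.compile(
    r"Self-test routine in progress\.\.\.\s*(\d+)% of test remaining"
)


def remaining_percent(capabilities_text: str):
    """Return the percentage still to run, or None if no test is in progress."""
    match = REMAINING_RE.search(capabilities_text)
    return int(match.group(1)) if match else None


def read_capabilities(device: str) -> str:
    """Run `smartctl -c DEVICE` (normally requires root) and return its output."""
    result = subprocess.run(
        ["smartctl", "-c", device], capture_output=True, text=True
    )
    return result.stdout


def wait_for_self_test(read=read_capabilities, device="/dev/sda",
                       poll_seconds=60, max_polls=120, sleep=time.sleep) -> bool:
    """Poll until no self-test is reported in progress; False on timeout.

    Note: empty output also counts as "not in progress", so production
    code should check that smartctl itself ran successfully.
    """
    for _ in range(max_polls):
        if remaining_percent(read(device)) is None:
            return True
        sleep(poll_seconds)
    return False
```

Once wait_for_self_test returns True, the result can be read from the self-test log with -l selftest as described above.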

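In addition to self-tests, you can also check the drive’s SMART error log with the simple command shown in Listing 10. The -d sat option tells smartctl to treat the device as an ATA drive behind a SCSI-to-ATA translation layer.

Listing 10: Error Log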

# smartctl -l error -d sat /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

This command is helpful if you run self-tests at periodic intervals. In this case, no errors are found on the drive.
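For periodic checks driven by a script, the exit status of smartctl is also useful: It is a bit mask in which each bit flags a different condition, including errors recorded in the error log and the self-test log. The Python sketch below decodes that mask; the descriptions are my paraphrases of the smartctl man page, and decode_exit_status is a hypothetical name.

```python
# Bit positions of the smartctl exit status, paraphrased from its man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device is in a low-power mode",
    2: "a SMART or ATA command failed, or a checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> list:
    """Return a description for every bit set in the smartctl exit status."""
    return [text for bit, text in EXIT_BITS.items() if status & (1 << bit)]
```

An exit status of 0 produces an empty list, whereas a value of 64 (bit 6) indicates that the error log is no longer empty.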

Once you know that a drive is good (no errors and all self-tests are passed), you can start to probe the drive a little further. Earlier, I used the -c option to list the reporting capabilities of the drive. You can also use the -a option (Listing 11), which prints all SMART information for the drive, including the vendor-specific SMART attributes. Although the output is rather long, it contains a great deal of information.

Listing 11: Vendor-Specific Attributes

# /usr/local/sbin/smartctl -a /dev/sda
smartctl 7.2 2020-07-11 r5076 [x86_64-linux-5.4.0-42-generic] (CircleCI)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 Series
Serial Number:    S19HNSAD620520T
LU WWN Device Id: 5 002538 5a0050931
Firmware Version: DXT08B0Q
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug  2 10:57:50 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (49272) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       12441
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1482
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       15
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   065   000    Old_age   Always       -       30
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   001   000    Old_age   Always       -       69
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2909310200

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12441         -
# 2  Short offline       Completed without error       00%     12441         -

SMART Selective self-test log data structure revision number 1
 SPAN   MIN_LBA   MAX_LBA  CURRENT_TEST_STATUS
    1         0         0  Not_testing
    2         0         0  Not_testing
    3         0         0  Not_testing
    4         0         0  Not_testing
    5         0         0  Not_testing
  255  28175872  28241407  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The first part of the output (the information, health, and capabilities sections) you’ve already seen with the -i, -H, and -c options, but the attribute table that follows is new. The table starting with ID# ATTRIBUTE_NAME FLAG contains the vendor-specific SMART attributes. The column to pay attention to is RAW_VALUE. For SSDs, a good first attribute to examine is in the first row, Reallocated_Sector_Ct, which is the count of reallocated sectors for the drive.

The drive has a pool of sectors held in reserve that can be used to swap out when it encounters bad sectors. If a bad sector is found, the drive allocates a sector from the reserve space, and the data is transferred to this new sector.

Notice that for this drive, the RAW_VALUE is 0. Recall that in the Microsoft study, this attribute ranked as highly important (Table 1, ReallocSectors). Moreover, the study suggests that once such symptoms appear, a drive is more likely to fail relatively soon, so an increase in the raw count, even by 1, is worth watching.
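To watch these raw values automatically, the attribute table can be parsed into records. The Python sketch below assumes the 10-column layout of Listing 11; parse_attributes, nonzero_watched, and the default watch list are hypothetical. The watch list uses attribute IDs from Listing 11 that roughly correspond to the symptoms highlighted in the Microsoft study (reallocated sectors, program and erase failures, uncorrectable errors, and CRC errors), and the IDs might differ on other drives.

```python
def parse_attributes(text: str) -> dict:
    """Parse the vendor-specific attribute table into a dict keyed by ID."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # not an attribute row (header, blank line, other text)
        if not parts[2].startswith("0x"):
            continue
        if not all(p.isdigit() for p in parts[3:6]):
            continue  # skip rows without numeric VALUE/WORST/THRESH
        attrs[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "type": parts[6],
            "when_failed": parts[8],
            "raw": parts[9].strip(),
        }
    return attrs


def nonzero_watched(attrs: dict, watch=(5, 181, 182, 187, 199)) -> list:
    """Return names of watched attributes whose raw count is above zero."""
    flagged = []
    for attr_id in watch:
        attr = attrs.get(attr_id)
        if attr is None:
            continue  # this drive does not report the attribute
        first = attr["raw"].split()[0]
        if first.isdigit() and int(first) > 0:
            flagged.append(attr["name"])
    return flagged


SAMPLE_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
"""
```

For the drive in this article, nonzero_watched(parse_attributes(SAMPLE_ATTRIBUTES)) returns an empty list, because all of the watched raw counts are 0. Storing the parsed values with a timestamp would let you see trends rather than single snapshots.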

Summary

Pretty much all drives today come with SMART attributes, and the easy-to-configure Linux smartmontools utilities can query these SMART attributes and perform self-tests. However, because SMART attributes are not standard, smartmontools might not be able to interpret the attributes of your particular drive (or work with your RAID card), so you should check whether your drive is in the smartmontools database (e.g., with the -P show option) or ask on the smartmontools mailing list.

A couple of studies have shown that SMART attributes alone cannot predict the failure of drives. Nonetheless, these attributes can be used as indicators to help determine the probability of a drive failing.
