SMARTMonitoring

Hard Disk health monitoring

Status

Introduction

This is a specification on how we could make use of the SMART diagnostic facility of modern hard disks.

Rationale

Hard disks are one of the most failure-prone components of modern computers, especially laptops. Anything we can do to warn the user of impending failure before it happens is obviously a win.

Scope and Use Cases

There are some instances where SMART data is not available and, as a result, we can't help these users:

  • Older hard disks simply don't support SMART.
  • Some hard disks (e.g. those behind a hardware RAID array, in a SAN, etc.) are "inaccessible" in terms of SMART data. Usually there are proprietary monitoring tools for these users that work per device or array rather than per drive.
  • The SATA driver in the kernel does not currently support SMART.

Where SMART data is available, there are the following use cases:

  • Inform the user when their disk has a problem detected by SMART
  • Provide access to SMART historical data
  • Provide a UI for running SMART diagnostics

Implementation Plan

Informing the user when their disk has a problem detected by SMART

Brian Sutherland has already started implementing software for this use case. It's available from http://www.pakistanopensource.org/projects/cassandra/

This tool will need further testing and extension. Additional functionality should include:

  • a warning when a SMART attribute VALUE changes significantly in a short period
  • an alert when a SMART attribute VALUE approaches its THRESHOLD level
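
The two checks above could be sketched roughly as follows. This is only an illustration: the names AttributeReading and check_attribute, and the default limits margin=10 and max_drop=5, are assumptions made for the example and are not taken from the existing tool. It also assumes that a THRESHOLD of 0 marks an attribute that never trips, so the proximity check is skipped for it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttributeReading:
    """One SMART attribute as read at a single point in time (hypothetical type)."""
    name: str
    value: int      # normalised VALUE; higher is assumed to be healthier
    threshold: int  # THRESHOLD set by the drive vendor


def check_attribute(previous: Optional[AttributeReading],
                    current: AttributeReading,
                    margin: int = 10,
                    max_drop: int = 5) -> List[str]:
    """Return warning messages for one attribute (empty list if all is well)."""
    warnings = []

    # Alert when VALUE is within `margin` points of THRESHOLD.
    # Assumption: THRESHOLD 0 means the attribute can never fail, so skip it.
    if current.threshold > 0 and current.value - current.threshold <= margin:
        warnings.append(
            f"{current.name}: VALUE {current.value} is close to "
            f"THRESHOLD {current.threshold}"
        )

    # Warn when VALUE fell by at least `max_drop` since the previous reading.
    if previous is not None and previous.value - current.value >= max_drop:
        warnings.append(
            f"{current.name}: VALUE dropped from {previous.value} "
            f"to {current.value}"
        )

    return warnings
```

What counts as "a short period" is left to the caller here: it depends on how far apart the two readings passed in were taken.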

Provide access to SMART historical data

The UI will be complicated both by the variety of attributes available from SMART (e.g. 23 on an ATA drive) and by the fact that smartctl provides the information in two different output formats (one for ATA and one for SCSI).
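
As an illustration of what reading the ATA format involves, here is a minimal sketch that parses an attribute table of the kind printed by "smartctl -A". The column layout in SAMPLE is an assumption based on typical smartctl output for ATA drives, and parse_ata_attributes is a hypothetical helper name. The SCSI format is different and is not handled here.

```python
# Assumed layout of the ATA attribute table; real output may differ
# between smartmontools versions and drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   038   045   000    Old_age   Always       -       38 (Min/Max 20/45)
"""


def parse_ata_attributes(text):
    """Return one dict per attribute row found in `text`."""
    rows = []
    for line in text.splitlines():
        # Split into at most 10 fields so that a RAW_VALUE containing
        # spaces stays in one piece.
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID; headers and other
        # lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "threshold": int(fields[5]),
            "type": fields[6],
            "raw": fields[9].strip(),
        })
    return rows


attributes = parse_ata_attributes(SAMPLE)
```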

We suggest the UI takes the form of percentage bars, scaled against the SMART attribute VALUE, with markers indicating the attribute's THRESHOLD.
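
A text-only mock-up of such a bar might look like the sketch below. The real UI would be graphical; render_bar is a hypothetical name, and the scale of 100 is an assumption, since some drives report normalised values on a larger scale.

```python
def render_bar(value, threshold, width=40, scale=100):
    """Draw VALUE as a filled bar with a '|' marker at THRESHOLD."""
    # Clamp both numbers into the range 0..scale before scaling them.
    value = max(0, min(value, scale))
    threshold = max(0, min(threshold, scale))

    filled = value * width // scale
    marker = min(width - 1, threshold * width // scale)

    cells = ["#" if i < filled else "." for i in range(width)]
    cells[marker] = "|"  # the marker overwrites whatever cell it lands on
    return "[" + "".join(cells) + "]"


if __name__ == "__main__":
    print("Reallocated_Sector_Ct", render_bar(100, 36))
    print("Seek_Error_Rate      ", render_bar(60, 30))
```

When the filled part of the bar shrinks towards the marker, the attribute is approaching its THRESHOLD.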

The temperature attribute could be special-cased to be listed first in the UI, as there is evidence that even relatively small (+/- 5 degrees C) temperature shifts can have a significant impact on drive longevity.

SMART data as provided by the drive is a current snapshot only, not a useful history. As a result, to provide access to historical data, we would need to 'create' it, by recording the current snapshot data over time.
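
One way to build up that history is to append each snapshot to a small local database. The sketch below uses SQLite purely as an example; the table name smart_history, its columns, and the helper names are all assumptions, and nothing in this specification mandates a particular storage format.

```python
import sqlite3
import time


def open_history(path=":memory:"):
    """Open (or create) the history database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS smart_history (
               taken_at  REAL    NOT NULL,
               device    TEXT    NOT NULL,
               attribute TEXT    NOT NULL,
               value     INTEGER NOT NULL,
               threshold INTEGER NOT NULL
           )"""
    )
    return conn


def record_snapshot(conn, device, readings, taken_at=None):
    """Store one snapshot; `readings` is a list of (name, value, threshold)."""
    taken_at = time.time() if taken_at is None else taken_at
    conn.executemany(
        "INSERT INTO smart_history VALUES (?, ?, ?, ?, ?)",
        [(taken_at, device, name, value, threshold)
         for name, value, threshold in readings],
    )
    conn.commit()


def history_for(conn, device, attribute):
    """Return [(taken_at, value), ...] for one attribute, oldest first."""
    cursor = conn.execute(
        "SELECT taken_at, value FROM smart_history "
        "WHERE device = ? AND attribute = ? ORDER BY taken_at",
        (device, attribute),
    )
    return cursor.fetchall()
```

A periodic job would call record_snapshot with the current readings, and the history view would read them back with history_for.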

Provide a UI for running SMART diagnostics

This would need a UI that offers the following diagnostic modes:

  • immediate
  • short
  • long
  • conveyance

In addition, these diagnostics can be run in either 'background' mode or 'captive'/'foreground' mode. The former runs in the background and can be run at any time without interfering with system operation. The latter cannot be used on a disk with mounted partitions, as it demands exclusive use of the disk. The UI would need to check whether the disk has mounted partitions and only offer 'captive' mode if the disk is unused. If the user chooses 'captive' mode, it would also need to warn the user that SMART self-test diagnostics can take a long time to run (depending on the size of the disk) and that the disk will be unavailable for as long as they are running.
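
The mounted-partition check could be sketched as below, working from the text of /proc/mounts. This is a simplification under stated assumptions: it only recognises partitions named as the disk path followed by digits (e.g. /dev/sda1), and it does not see swap space, LVM or device-mapper volumes, or file systems mounted through a symlink. The function names are hypothetical.

```python
def mounted_partitions(disk, mounts_text):
    """Return [(device, mount_point), ...] for mounts that live on `disk`."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        source, mount_point = fields[0], fields[1]
        if not source.startswith(disk):
            continue
        # Accept the whole disk or "<disk><digits>"; this keeps /dev/sda
        # from matching an unrelated device such as /dev/sdab1.
        suffix = source[len(disk):]
        if suffix == "" or suffix.isdigit():
            found.append((source, mount_point))
    return found


def captive_mode_allowed(disk, mounts_text):
    """Offer 'captive' mode only when nothing on the disk is mounted."""
    return not mounted_partitions(disk, mounts_text)
```

On a running system, mounts_text would be the contents of /proc/mounts.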

[ATA uses the term 'offline' for 'background' mode - this is extremely confusing and counterintuitive and the GUI should avoid using the same terminology]

The tool will then have to poll the SMART data to determine when the test is complete and inform the user.
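
For ATA drives, one possible approach is to run "smartctl -c" periodically and read the "Self-test execution status" line. The sketch below assumes the status is printed as a decimal byte in parentheses, with the upper four bits equal to 15 while a test is running and the lower four bits giving the remaining work in tens of percent. The line format should be checked against the smartmontools version in use; self_test_progress is a hypothetical helper.

```python
import re

# Assumed format: "Self-test execution status:      ( 249)  Self-test ..."
STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def self_test_progress(capabilities_text):
    """Return (in_progress, percent_remaining), or None if no status line."""
    match = STATUS_RE.search(capabilities_text)
    if match is None:
        return None
    code = int(match.group(1))
    # Upper nibble 0xF: a self-test is running.
    in_progress = (code >> 4) == 0xF
    # Lower nibble: remaining work in units of 10 percent.
    percent_remaining = (code & 0x0F) * 10 if in_progress else 0
    return in_progress, percent_remaining
```

The polling loop would call this at intervals and notify the user once in_progress becomes false.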

It will also need to provide a method for a non-'captive' self-test to be aborted.
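
A sketch of how the tool might build the smartctl command lines for starting and aborting tests follows. It relies on the smartctl options -t (test type), -C (captive mode) and -X (abort a running test). Mapping this specification's 'immediate' mode to "-t offline" is an assumption, as is refusing 'captive' for that mode. The function names are hypothetical, and the commands are only built here, not executed.

```python
# Assumed mapping from the modes named in this specification to the
# test types accepted by "smartctl -t".
TEST_TYPES = {
    "immediate": "offline",
    "short": "short",
    "long": "long",
    "conveyance": "conveyance",
}


def start_test_command(device, mode, captive=False):
    """Build the argument list that would start a self-test on `device`."""
    if mode not in TEST_TYPES:
        raise ValueError(f"unknown test mode: {mode!r}")
    if captive and mode == "immediate":
        # Assumption: captive mode does not apply to the immediate test.
        raise ValueError("captive mode is not offered for 'immediate'")
    command = ["smartctl", "-t", TEST_TYPES[mode]]
    if captive:
        command.append("-C")
    command.append(device)
    return command


def abort_test_command(device):
    """Build the argument list that would abort a running self-test."""
    return ["smartctl", "-X", device]
```

The resulting lists can be passed to subprocess.run, which will normally require root privileges.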

The tool should have both 'test now' and 'enable periodic checks' functionality.

Data Preservation and Migration

N/A (This specification is designed to preventively aid in data preservation)

Packages Affected

  • smartmontools

User Interface Requirements

The User Interface must meet the GNOME HIG standards.

Outstanding Issues

  • Very modern laptops often come with a motion sensor that shuts down the hard drive in the event of a sudden drop. It would be nice if we could display an after-the-fact alert for this. If nothing else, this would allow users to become aware of what kind of fall or movement is enough to upset their laptop. The only problem is that the interface to the motion sensor is rarely open and/or documented. Some interfaces are, however: see hdaps.

  • There's a danger that implementing this feature will lead to a false sense of confidence for the user, i.e. "I don't need to do backups, Ubuntu will warn me before the hard disk dies."
  • Is there any evidence that SMART does actually warn about failure before it occurs? There is certainly a class of failures which SMART could not possibly predict (head crash etc.).
  • smartmontools does not provide a library, so much of the functionality of the above tools will require parsing the output of smartctl.

  • Migration to SATA drives is well under way, especially in servers. The limitation that the SATA drivers don't pass the SMART commands is a problem for this project. Perhaps another project should be added for the SATA driver update which might rate a medium priority.

UDU BOF Agenda

N/A (no one turned up)

UDU Pre-Work

N/A

UbuntuDownUnder/BOFs/SMARTMonitoring (last edited 2008-08-06 16:37:07 by localhost)