BootDegradedRaid

Revision 14 as of 2008-05-30 21:27:28

Summary

This specification defines a methodology for enhancing Ubuntu's boot procedures to configure and support booting a system dependent on a degraded RAID1 device.

Rationale

Ubuntu's installer currently supports installation to software RAID1 targets for /boot and /. When one of the mirrored disks fails and mdadm marks the RAID degraded, the system can no longer be rebooted unattended.

Booting Ubuntu with a failed device in a RAID1 array forces the system into a recovery console.

In some cases, this is the desired behavior, as a local system administrator would want to back up critical data, cleanly halt the system, and replace the faulty hardware immediately.

In other cases, this behavior is highly undesirable, particularly when the system administrator is remotely located and would prefer that a system with redundant disks tolerate a failed device even on reboot.

Use Cases

  • Kim uses software RAID because it is less expensive than hardware RAID and is available on any Ubuntu system with multiple disk devices. She particularly likes it on low-end and small-form-factor servers without built-in hardware RAID, such as her arrays of blades and 1U rack-mount systems. She uses RAID1 (mirroring) as a convenient mechanism for providing runtime failover of hard disks in Ubuntu. Kim administers her systems remotely and has taken great care to give them redundant disks in a RAID1 configuration. She absolutely wants to be able to tolerate a RAID degradation event and continue to boot such systems unattended until she is able to replace the faulty hardware. When she reboots a system following a primary drive failure, it automatically boots from the secondary drive and brings up the RAID in degraded mode in a fully functional (though unprotected) configuration, because she specified in her system configuration that it should boot even if the RAID is degraded.
  • Steph also uses software RAID1 on her /boot and / filesystems. Steph is always physically present at the console when she reboots systems, and she always has spare disks on hand. Steph never wants to boot into a system with a degraded MD array. Steph configures her system to intercept any boot from a degraded array and drop to a recovery console instead, thus protecting her data more conservatively.
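
The two use cases reduce to a single policy decision at boot time. The following is a minimal sketch of that decision, not the implementation: the names `BootAction`, `choose_boot_action`, and the `boot_degraded` setting are hypothetical and chosen only for illustration.

```python
from enum import Enum


class BootAction(Enum):
    """Possible outcomes when the root RAID1 array is checked at boot."""
    BOOT_NORMALLY = "boot normally"
    BOOT_DEGRADED = "start the degraded array and continue booting"
    RECOVERY_CONSOLE = "drop to a recovery console"


def choose_boot_action(array_is_degraded: bool, boot_degraded: bool) -> BootAction:
    """Pick the boot path given a hypothetical administrator setting.

    boot_degraded=True models Kim's preference (unattended boot),
    boot_degraded=False models Steph's preference (recovery console).
    """
    if not array_is_degraded:
        return BootAction.BOOT_NORMALLY
    if boot_degraded:
        return BootAction.BOOT_DEGRADED
    return BootAction.RECOVERY_CONSOLE
```

Under this sketch, Kim's systems would call `choose_boot_action(True, True)` after a disk failure and continue booting, while Steph's would call `choose_boot_action(True, False)` and stop at the recovery console.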

Scope

The scope of this specification is to solve this problem within Ubuntu's software RAID support and default bootloader during the Intrepid Ibex development cycle.

Design / Work Items

  • Bootloader

    • grub-install needs either to detect that /boot resides on an md device or to be configured to install GRUB to multiple devices, so that more than one disk is bootable
    • the documentation should probably also note that manual BIOS changes may be required for disk boot failover to work properly
  • MD Error Handling

    • make the error hook messages more verbose
    • teach the md error handler how to bring up an md array in degraded mode
  • Root Filesystem Wait

    • reduce rootfs wait timeout to 30 seconds (DONE)
    • add an option to abort the rootfs wait; this seems non-trivial, but would be quite handy
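
As a rough illustration of the detection steps in the work items above, the sketch below parses text in the format of /proc/mdstat to find degraded arrays and to derive the whole disks that would each need a bootloader. It rests on several assumptions: the sample text is made up for illustration rather than captured from a real system, the helper names `parse_mdstat` and `boot_disks` are hypothetical, and whole-disk names are derived by stripping trailing digits from partition names, which only suits sdX/hdX style naming.

```python
import re

# Illustrative sample in /proc/mdstat format (not captured from a real system).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      9767424 blocks [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : \S+")           # e.g. "md0 : active ..."
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\(F\))?")    # e.g. "sda1[0]" or "sdb1[1](F)"
COUNT_RE = re.compile(r"\[(\d+)/(\d+)\]")          # e.g. "[2/1]" = expected/active


def parse_mdstat(text):
    """Return {array: {"members": [non-failed members], "degraded": bool}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            healthy = [name for name, failed in MEMBER_RE.findall(line) if not failed]
            arrays[current] = {"members": healthy, "degraded": False}
            continue
        counts = COUNT_RE.search(line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            arrays[current]["degraded"] = active < expected
            current = None  # only the first status line belongs to this array
    return arrays


def boot_disks(members):
    """Map partition names to whole disks (assumes sdX/hdX style names)."""
    return sorted({"/dev/" + re.sub(r"\d+$", "", name) for name in members})
```

With the sample text, `parse_mdstat(SAMPLE_MDSTAT)` reports md1 as degraded and md0 as healthy, and `boot_disks(...)` applied to the members of md0 yields the two disks that would each be candidates for a grub-install run.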

Implementation

To be documented here as implementation proceeds.

Outstanding Issues

References