BootDegradedRaid

Overview

Summary

This specification defines a methodology for enhancing Ubuntu's boot procedures to configure and support booting a system dependent on a degraded RAID1 device.

Rationale

Ubuntu's installer currently supports installation to software RAID1 targets for /boot and /. When one of the mirrored disks fails, and mdadm marks the RAID degraded, it becomes impossible to reboot the system in an unattended manner.

Booting Ubuntu with a failed device in a RAID1 will force the system into a recovery console.

In some cases, this is the desired behavior, as a local system administrator would want to back up critical data, cleanly halt the system, and replace the faulty hardware immediately.

In other cases, this behavior is highly undesirable, particularly when the system administrator is remotely located and would prefer that a system with redundant disks tolerate a failed device even on reboot.

Use Cases

  • Kim uses software RAID since it is less expensive than hardware RAID, and since it is always available on any Ubuntu system with multiple disk devices. She particularly likes it on low-end and small-form-factor servers without built-in hardware RAID, such as her arrays of blades and 1U rack-mount systems. She uses RAID1 (mirroring) as a convenient mechanism for providing runtime failover of hard disks in Ubuntu. Kim remotely administers her systems, where she has taken great care to use redundant disks in a RAID1 configuration. She absolutely wants to be able to tolerate a RAID degradation event and continue to boot such systems in an unattended manner until she is able to replace the faulty hardware. After rebooting her system following a primary drive failure, the system automatically boots from the secondary drive and brings up the RAID in degraded mode in a fully functional (though unprotected) configuration, since she specified in her system configuration to boot even if the RAID is degraded.
  • Steph also uses software RAID1 on her /boot and / filesystems. Steph is always physically present at the console when she reboots systems, and she always has spare disks on hand. Steph never wants to boot into a system with a degraded MD array. Steph configures her system to trap booting to a degraded array and instead deliver the system to a recovery console, thus more conservatively protecting her data at the expense of rendering her system unbootable in an unattended manner.

  • HotplugRaid

Scope

The scope of this specification is to solve this problem within Ubuntu's software RAID support and default bootloader within the Intrepid Ibex development cycle.

Design / Work Items

  • Bootloader

    • grub-install needs to detect (/boot on an md?) or be configured to install grub to multiple devices, thus rendering multiple disks bootable
    • should probably also document that manual BIOS changes may be required for disk boot failover to occur properly
  • Switch to mdadm --incremental assembly, which is supported by the mdadm version shipped in 8.04.

  • MD Error Handling

    • more verbose error hook messages
    • teach md error handler how to bring up md in degraded mode
      •  /usr/share/initramfs-tools/scripts/init-premount/mdadm:mountroot_fail() 

  • Root Filesystem Wait

    • reduce rootfs wait timeout to 30 seconds (DONE)
    • option to abort rootfs wait, seems non-trivial, but quite handy
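
The bootloader work item (install grub to every disk that carries a RAID member) could look roughly like the following. This is only a dry-run sketch, not part of any existing patch: the function names member_disks and plan_grub_install are hypothetical, and the partition-to-disk mapping assumes classic names such as /dev/sda1 or /dev/hdb2, where stripping the trailing digits yields the disk.

```shell
#!/bin/sh
# Hypothetical helper: print the whole-disk device for each RAID member
# partition given as an argument, one per line, without duplicates.
# Assumption: names like /dev/sda1, where trailing digits are the partition.
member_disks() {
    for part in "$@"; do
        echo "$part" | sed 's/[0-9]*$//'
    done | sort -u
}

# Hypothetical helper: print (dry run, nothing is installed) one
# grub-install call per disk that holds a member of the array.
plan_grub_install() {
    for disk in $(member_disks "$@"); do
        echo "grub-install $disk"
    done
}
```

Calling plan_grub_install /dev/sda1 /dev/sdb1 prints one grub-install line per disk. As the work items note, BIOS boot order changes may still be needed for failover to actually happen.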

Implementation

Phase 1: mdadm/initramfs modifications

The first phase of this specification involves modifications to mdadm and initramfs-tools. As of 2008-07-25, patches are attached to wiki:120375, and available in [https://launchpad.net/~kirkland/+archive kirkland's PPA].

Testing Instructions

The following are instructions for safely testing this in a KVM:

  1. Create a pair of smallish disk images, disk1.img and disk2.img
    • HOST:  seq 1 2 | xargs -I{} dd if=/dev/zero of=disk{}.img bs=1M count=1000 

  2. Obtain an Ubuntu Server Intrepid ISO
    • HOST:  wget http://cdimage.ubuntu.com/ubuntu-server/daily/current/intrepid-server-amd64.iso 

  3. Install Intrepid into a KVM
    • HOST:  kvm -hda disk1.img -hdb disk2.img -m 256 -cdrom intrepid-server-amd64.iso 

      • Use manual partitioning to create a pair of RAID partitions
      • create a RAID1 md device
      • format that device ext3
      • set that device to mount on /
  4. After installation, reboot and check that the RAID1 is set up properly, clean, and sync'd
    • KVM:  cat /proc/mdstat 

    • KVM:  mdadm --detail /dev/md0 

  5. Obtain the updated test packages from my PPA
    • KVM:  echo "deb http://ppa.launchpad.net/kirkland/ubuntu intrepid main" | sudo tee -a /etc/apt/sources.list 

    • KVM:  sudo apt-get update 

    • KVM:  sudo apt-get upgrade 

  6. Check that you have my PPA packages
    • KVM:  dpkg -l | grep mdadm 

    • KVM:  dpkg -l | grep initramfs-tools 

  7. Reboot once to ensure that your working, standard, clean, sync'd RAID continues to boot
    • KVM:  sudo reboot 

    • KVM: ...
  8. Power down the virtual machine
    • KVM:  sudo poweroff 

  9. Start the KVM back up, without disk2.img
    • HOST:  kvm -hda disk1.img -m 256 

    • Allow it to boot to a busybox prompt, checking that the default behavior is preserved
  10. Start the KVM back up again without disk2.img, this time adding the bootdegraded kernel option
    • HOST:  kvm -hda disk1.img -m 256 

      1. press ESC to enter the grub menu
      2. press "e" to edit your kernel command options
      3. press DOWN to highlight your kernel line
      4. press "e" to edit your kernel line
      5. add one of
        • bootdegraded
        • bootdegraded=true
        • bootdegraded=1
        • bootdegraded=yes
      6. press ENTER
      7. press "b" to boot
    • After it times out looking for the root device (~30s), it should bring up the RAID in degraded mode. Check this with
      • KVM:  cat /proc/mdstat 

      • KVM:  mdadm --detail /dev/md0 

    • Power down the virtual machine
      • KVM:  sudo poweroff 

  11. Start the KVM back up, this time with both disks
    • HOST:  kvm -hda disk1.img -hdb disk2.img -m 256 

    • Ensure that it boots, and check the status of your RAID
      • KVM:  cat /proc/mdstat 

      • KVM:  mdadm --detail /dev/md0 

    • Note that it should only have the first of the 2 devices active. This is because the system does not have any reason to trust the reliability of this second device, since it previously disappeared.
    • Manually add it back to the RAID
      • KVM:  sudo mdadm /dev/md0 --add /dev/sdb1 

    • And wait a couple of minutes until it finishes re-syncing
      • KVM:  watch -n1 cat /proc/mdstat 

    • Once it's done, reboot the virtual machine, and check that it comes back up with both disks active and sync'd
      • KVM:  sudo reboot 

      • KVM: ...
      • KVM:  cat /proc/mdstat 

      • KVM:  mdadm --detail /dev/md0 

    • Now, test using the configuration file
      • KVM:  echo "BOOT_DEGRADED=true" | sudo tee /etc/initramfs-tools/conf.d/mdadm 

      • KVM:  sudo update-initramfs -u 

      • KVM:  sudo poweroff 

  12. Start the KVM back up, without disk2.img
    • HOST:  kvm -hda disk1.img -m 256 

    • ESC to the grub menu, and add "bootdegraded=false" to the kernel options
    • Test that the kernel option overrides the configuration file
    • It should boot to a busybox prompt
  13. Start the KVM up again without disk2.img
    • HOST:  kvm -hda disk1.img -m 256 

    • Do not edit the kernel options, let it find the BOOT_DEGRADED=true in the configuration file
    • It should boot with a degraded RAID again
      • KVM:  cat /proc/mdstat 

      • KVM:  mdadm --detail /dev/md0 
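
The precedence exercised in steps 10 through 13 (the kernel option overrides the configuration file, and the default is not to boot degraded) can be sketched as a small shell function. This is a minimal sketch and not the code from the attached patches: the function name resolve_boot_degraded is hypothetical, the kernel command line and the BOOT_DEGRADED value are passed in as arguments rather than read from /proc/cmdline and the configuration file, and treating any other bootdegraded=... value as "false" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a degraded array may be started.
#   $1 = kernel command line (e.g. the contents of /proc/cmdline)
#   $2 = BOOT_DEGRADED value from the configuration file (may be empty)
# Prints "true" or "false".
resolve_boot_degraded() {
    cmdline=$1
    result=${2:-false}              # configuration file value, default off

    # Assumption: the file accepts the same spellings as the kernel option.
    case "$result" in
        true|yes|1) result=true ;;
        *)          result=false ;;
    esac

    # Kernel options are examined last so that they override the file.
    for opt in $cmdline; do
        case "$opt" in
            bootdegraded|bootdegraded=true|bootdegraded=yes|bootdegraded=1)
                result=true ;;
            bootdegraded=*)
                result=false ;;     # assumption: anything else means "no"
        esac
    done
    echo "$result"
}
```

For example, resolve_boot_degraded "root=/dev/md0 bootdegraded=false" "true" corresponds to step 12, and resolve_boot_degraded "root=/dev/md0 ro" "true" corresponds to step 13.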

Phase 2: grub/installer modifications

No progress yet

Outstanding Issues

References

Comments

Howto boot degraded raids

In Ubuntu, md devices are set up by udev rules. The mdadm tool shipped with Ubuntu has supported the --incremental option since 8.04, but it is not used (causing errors wiki:157981 wiki:244792).

It will be necessary to adjust the following:

  1. package mdadm: Switch to mdadm --incremental (Degraded arrays hit many bugs in an environment that uses --no-degraded udev rules.)
    • Using the --incremental option in the udev rule will only start complete arrays, just as "--assemble --no-degraded" did before, but will more appropriately determine what to do with found array members. It will add devices to inactive or running arrays and will start the array in auto-readonly mode. This allows raid members that may come up later to join running arrays very smoothly without resyncing if nothing has been written to the array yet.
    • The mdadm --incremental option does not create device nodes (wiki:251663). Furthermore, --incremental by default expects the new device naming scheme for partitionable md devices. (As a workaround you can make your existing devices static with "cp -a /dev/mdX /lib/udev/devices/" and configure the auto=md scheme in /etc/mdadm/mdadm.conf.)

    • /etc/udev/rules.d/85-mdadm.rules:
      • The mdadm call needs to be changed to "mdadm --incremental /dev/%k"
        • The command "watershed" is not installed by default, at least not in Xubuntu 8.04. Is this a built-in of udev?
          • See:  /usr/lib/udev 

            • I can't find a watershed file there.
      • Rules for removal events are needed. wiki:244803

    • "mdadm --incremental" will save state in a map file under /var/run/mdadm/map, but in both initramfs and early boot this directory does not yet exist and the state is saved in /var/run/mdadm.map.new (the man page needs changing; it says /var/run/mdadm.map). For early boot, there is a danger of racing with udev (i.e. hotplug). This can be fixed with an initramfs-tools init-top/mdadm script that creates /var/run/mdadm before udevd is run in init-premount/udev. (The initramfs /var/run is later copied to the real /var/run as part of the normal boot process.)

      (With --incremental used in the udev rule, members appearing while an array is degraded and not runnable won't trigger wiki:244792. Once --incremental / "auto-readonly" can be used with --run for selected arrays (wish wiki:251646), members will even be able to join smoothly later. The --incremental option does not work with --run --scan though (wiki:244808).)

  2. To start arrays that are necessary to boot in degraded mode some extra attention is needed:
    • In general, do not just run any array that may be partially available (mdadm --assemble --scan), but only those that are actually needed at the particular point of time. (i.e. it is not a good idea to start for example the /home array in degraded mode, when we just need the rootfs to boot first. Some slower or external disks of a /home array may become available later.)
    • initramfs-tools/local: The mount script in the initramfs now checks if root is available, (wiki:83025) (Done):

              while [ ! -e "${ROOT}" ] || ! /lib/udev/vol_id "${ROOT}" >/dev/null 2>&1; do
                              /bin/sleep 0.1
                              slumber=$(( ${slumber} - 1 ))
                              [ ${slumber} -gt 0 ] || break
              done
      and failure hooks are called if the timeout has been reached. The failure hook and option parsing functions are implemented in current patches. The failure hooks provided by init-premount/mdadm will try to run even those arrays degraded, that are only available _after_ the non-coldpluggable local-top/cryptroot has opened new raid members and the full rootdelay has passed (no further looping in the failure hooks).
    • init-premount/mdadm: regular array degradation when bootdegraded=yes.
              if the rootfs depends on /dev/mdX and /dev/mdX exists; then  # (encrypted members may not be available yet; that case is handled by the failure hooks later)
                while the md device is inactive; do
                              /bin/sleep 0.1
                              slumber=$(( ${slumber} - 1 ))
                              [ ${slumber} -gt 0 ] || break
                done
                if the timeout was reached: mdadm --run /dev/mdX
              fi
    • init.d/mdadm-raid: Provide the same checking loop here for devices that are needed to set up the fstab. Let /etc/init.d/mdadm loop-check and try to start those inactive md devices that have timed out.

      Without the selective approach (i.e. with -ARs instead of --run /dev/mdX), any array with only some of its members plugged in (removable disks) will get started (and degraded!), leading to unsynced disks etc. (It is also more consistent when the hotplug event "new array available" always means that all (active) members that are plugged in are actually working (wiki:244810), and that the array has not been degraded as a side effect of another array that needs to be run degraded during boot.)

    • Finally, when a disk fails in a running system, the system doesn't stop by default, and probably shouldn't. This is the reason for the mdadm notification functionality. So changing the default to bootdegraded=yes seems reasonable once the patches have been tested with enough configurations, and will be safe if no md devices other than the ones necessary to set up the fstab are touched.
  3. Desktop integration:
    • Show unstarted lonely array member partitions as icons with a right-click "start degraded" option. Add rules to stop md arrays when their filesystem is being unmounted, so that members removed after unmounting the filesystem won't get set faulty.
    • A right-click "remove array member" option, to remove a (mirror) member from a running array.
    • Raid status monitoring frontend GUI to /proc/mdstat etc.
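
The selective approach from step 2 (wait for one specific array, then run only that array degraded instead of using -ARs) can be sketched in shell as follows. This is an illustration under stated assumptions, not the code from the patches: the function names md_is_inactive and run_if_still_inactive and the MDSTAT and MDADM variables are hypothetical, the wait is counted in whole seconds rather than the 0.1 s steps shown above, and an unstarted array is assumed to appear in /proc/mdstat as "mdX : inactive ...".

```shell
#!/bin/sh
# Hypothetical variable: lets the sketch read a saved copy of /proc/mdstat.
MDSTAT=${MDSTAT:-/proc/mdstat}

# Succeeds if array $1 (e.g. "md0") is listed as inactive in $MDSTAT.
md_is_inactive() {
    grep -q "^$1 : inactive" "$MDSTAT"
}

# Wait up to $2 seconds for array $1 to become active. If it is still
# inactive afterwards, start only that array in degraded mode.
run_if_still_inactive() {
    md=$1
    slumber=$2
    while md_is_inactive "$md" && [ "$slumber" -gt 0 ]; do
        sleep 1
        slumber=$(( slumber - 1 ))
    done
    if md_is_inactive "$md"; then
        # Only this array is touched; other partial arrays are left alone.
        # MDADM is a hypothetical override, useful for a dry run.
        ${MDADM:-mdadm} --run "/dev/$md"
    fi
}
```

Setting MDADM=echo turns the call into a dry run that prints the arguments instead of starting the array, for example MDADM=echo run_if_still_inactive md0 30.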

BootDegradedRaid (last edited 2010-04-21 10:02:37 by 188-194-18-172-dynip)