LinuxKernelCrashDumpSpec

Revision 5 as of 2006-06-27 21:15:31


Summary

This specification outlines a plan to write kernel crash dumps to disk whenever possible.

Rationale

When an ordinary user running Dapper experiences a kernel panic, there is currently no way to capture the crash dump. The user is locked into a frozen X session and can only reboot. The crash dump is then lost, which makes debugging the underlying kernel bug virtually impossible.

If we could write most crash dumps to disk, we could ask for the resulting file to be attached to a bug report in Malone. This debugging information can then be used to track down the bug, either by Ubuntu or by upstream kernel developers.

User stories

  • Alice is a kernel developer who is using her desktop when it suddenly freezes solid. She reboots and looks for the crash dump that's been deposited on disk.
  • Barry is an Ubuntu user whose personal desktop suddenly freezes. He reboots, goes online, and is instructed on how to file a bug report with the appropriate crash dump attached.
  • Cynthia is a malicious user who wants to extract passwords on a multi-user machine. She inserts a bad piece of hardware which causes the kernel to panic. She looks for the crash dump when the system is rebooted, but cannot access it because she is not authorized to do so.
  • Daryl is a power user who wants to see a Brown Screen Of Death whenever his Ubuntu box panics. He turns on an option and whenever the kernel panics, it drops to a TTY and spits out its crash message.

Scope

This specification only covers the mechanism for dumping kernel crashes to disk. It does not cover any integration with other systems, like BugReportingToolSpec or [Malone].

Design

The crash dump infrastructure should trigger whenever the kernel panics. Kernel oopses can be ignored, as either they are benign and will be logged to /var/log/kern.log, or they will cause a kernel panic.

All captured dumps must eventually end up in /var/crash/ in a format consistent with that defined by AutomatedProblemReports. This way, BugReportingTool can automatically pick up the file. Only administrators with sudo access are allowed to read captured crash dumps. Regular users must be unable to read these dumps, or even to identify their existence.
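The access requirement above can be illustrated with a minimal sketch. It assumes the dump is stored by a process already running as root, and the function name store_crash_dump and the exact mode bits are illustrative assumptions rather than part of this specification. The AutomatedProblemReports file format is not reproduced here.

```python
import os
import stat


def store_crash_dump(crash_dir: str, name: str, payload: bytes) -> str:
    """Write payload to crash_dir/name so that only the owner can see or read it.

    Hypothetical helper: the owner is whoever runs this code (root in practice).
    """
    # Mode 0700 on the directory means other users cannot list it, so they
    # cannot learn that a dump exists at all.
    os.makedirs(crash_dir, mode=0o700, exist_ok=True)
    os.chmod(crash_dir, 0o700)  # makedirs is subject to the umask; enforce the mode

    path = os.path.join(crash_dir, name)
    # O_EXCL fails if the name already exists, including a pre-planted symlink.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        os.fchmod(handle.fileno(), 0o600)  # owner read/write only
        handle.write(payload)
    return path
```

Restricting the directory as well as the file is what addresses the second half of the requirement: with a world-readable directory, a regular user could not read the dump but could still see that one exists.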

When a kernel panic occurs, the system must reset into a known good state without clearing the RAM. It must then verify that it can safely write to a blank region on disk, such as the swap partition. If it cannot safely identify both the disk and an empty region, it must abort. Once it is safe to write a crash dump, it must write out the kernel panic messages as they were output to the console. It may also write out a full kernel dump, with optional compression.
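The must/may rules in the paragraph above can be modelled as a small decision function. This is only a user-space sketch of the ordering of the checks, not kernel code: DumpTarget, plan_dump and the section names are hypothetical, and zlib stands in for whatever compression would actually be chosen.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DumpTarget:
    """Hypothetical description of where a dump may be written."""
    device_identified: bool  # the disk was positively identified
    free_bytes: int          # size of the region known to be empty


def plan_dump(target: Optional[DumpTarget],
              panic_log: bytes,
              memory_image: Optional[bytes] = None,
              compress: bool = True) -> Optional[List[Tuple[str, bytes]]]:
    """Return the sections to write, or None if the dump must be aborted."""
    # Abort unless both the disk and an empty region were safely identified.
    if target is None or not target.device_identified:
        return None
    # The panic messages are mandatory; if they do not fit, nothing is written.
    if len(panic_log) > target.free_bytes:
        return None
    sections = [("panic_log", panic_log)]

    # The full dump is optional and is included only if it fits in what is left.
    if memory_image is not None:
        image = zlib.compress(memory_image) if compress else memory_image
        if len(panic_log) + len(image) <= target.free_bytes:
            sections.append(("memory_image", image))
    return sections
```

The point of the sketch is the ordering: every check that can lead to an abort happens before anything is written, and the optional full dump never displaces the mandatory panic messages.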

If a configuration option is set, the kernel may drop to a normal TTY on a kernel panic, clear the display, and dump out its crash message. This screen may be white text on a brown background. This should be able to grab control from X.

Implementation

We should base our implementation around kdump. kdump supports i386, x86_64 and ppc64 architectures and is actively maintained upstream.

We will create a separate crash dump kernel for each supported architecture, with an initramfs that only knows about storage drivers. This minimizes the risk that a faulty driver in another subsystem would cause the emergency kernel to crash. In addition, because the crash dump kernel is modularized, the chance of driver conflicts is greatly reduced.

Outstanding issues

We have no kdump implementation for SPARC.

We also need to audit kdump to make sure it won't eat people's data.

See also

BoF agenda and discussion

ScottJamesRemnant: if there's no obvious and easy way, one hacky way occurs -- write it to the top of the swap disk with a simple to detect header and then pick it up on reboot and put it on the real filesystem -- I suspect there's far simpler ways though

SorenHansen: That's not quite safe, though. There's no way to know if the particular crash has messed up the kernel's perception of where the swap space starts. If it has, the dump could potentially overwrite actual data. If this could be done via a SysRq magic combo, the user could make an informed decision as to whether this is likely (based on stack dumps and whatnot) and based on that decide if he/she wants to write the dump.

SimonLaw: kdump has solved these problems, if you go and read the paper they presented at OLS 2005.
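For reference, the "simple to detect header" idea raised by ScottJamesRemnant could look roughly like the following sketch. The magic value, the header layout and the function names are all hypothetical, and a file-like object stands in for the device. It only shows how a dump could be recognized and validated on reboot. It does not address SorenHansen's concern about whether the write offset itself can be trusted after a crash.

```python
import io
import struct
import zlib
from typing import BinaryIO, Optional

MAGIC = b"KCRASHDUMP\x00\x01"     # hypothetical 12-byte marker
HEADER = struct.Struct("<12sII")  # magic, payload length, CRC32 of payload


def write_dump(dev: BinaryIO, offset: int, payload: bytes) -> None:
    """Write a header followed by the payload at the given offset."""
    dev.seek(offset)
    dev.write(HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)))
    dev.write(payload)


def read_dump(dev: BinaryIO, offset: int) -> Optional[bytes]:
    """Return the payload if a valid dump is found at offset, else None."""
    dev.seek(offset)
    raw = dev.read(HEADER.size)
    if len(raw) != HEADER.size:
        return None
    magic, length, crc = HEADER.unpack(raw)
    if magic != MAGIC:
        return None  # no dump here, only ordinary data
    payload = dev.read(length)
    # A short read or a checksum mismatch means the dump is truncated or corrupt.
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

The checksum lets the boot-time reader tell a complete dump from a partially written one, which matters because the writer may itself be interrupted.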


CategorySpec