CrashdumpRecipe

Differences between revisions 5 and 25 (spanning 20 versions)
Revision 5 as of 2012-07-26 09:35:09
Size: 5860
Editor: smb
Comment:
Revision 25 as of 2016-07-17 10:44:24
Size: 8090
Editor: localhost
Comment: Edited to point the reader to the actual file from where the crash kernel parameter is added when running update-grub.
Line 3: Line 3:
= {i} Update not finished yet... = ||<tablestyle="float:right; font-size: 0.9em; width:40%; background:#F1F1ED; margin: 0 0 1em 1em;" style="padding:0.5em;"><<TableOfContents>>||
Line 5: Line 5:
= Linux Kernel Crash Dump (LKCD) = = Introduction =
Line 7: Line 7:
[[http://lkcd.sourceforge.net/|LKCD]] is a project that tries to enable enterprise style post-mortem crash analysis in Linux operating systems. It uses a special mode of kexec which allows to automatically boot a secondary kernel whenever a crash (Oops/panic) occurs. This secondary kernel will then save the state and memory of the primary kernel to a certain location of the filesystem (''/var/crash'' on newer releases).
This file can then be used by '''crash''' to gather detailed information about the problem.
The Ubuntu Kernel Crash Dump is a mechanism that enables enterprise-style post-mortem crash analysis on Linux systems. It uses a special mode of kexec to automatically boot a secondary kernel whenever a crash (Oops/panic) occurs. This secondary kernel then saves the state and memory of the primary kernel to a location on the filesystem (''/var/crash'' on newer releases). The resulting dump file can then be analysed with '''crash''' to gather detailed information about the problem.
Line 10: Line 9:
For convenience, the kernel crash dump utility has been packaged in Ubuntu. It can be installed with the following command: = Installation =
Line 12: Line 11:
 {{{
 #> apt-get install linux-crashdump
 }}}
For convenience, the kernel crash dump utility has been packaged in Ubuntu. It can be installed with the following command: {{{
sudo apt-get install linux-crashdump }}}
Line 16: Line 14:
and reboot. This should automatically load the kernel used to boot as the secondary kernel used for crash dumps. Whether a kernel is loaded or not can be verified by: Newer versions of the package will automatically add an entry ''crashkernel=384M-2G:64M,2G-:128M'' to the kernel commandline in grub. However this may cause problems on systems with less than 2G of memory (see [[#Troubleshooting|troubleshooting]]).
Line 18: Line 16:
 {{{
 #> cat /sys/kernel/kexec_crash_loaded
 1
 }}}
== ppc64el installation ==
For those running ppc64el please follow the crash kernel recommendations found on the [[ ppc64el/Recommendations#Crash_Kernel_recommendations | ppc64el Recommendations page]].
Line 23: Line 19:
If this is ''0'', then something went wrong. = Verifying linux-crashdump installation =
Line 25: Line 21:
== Causing a test crash == For Trusty, please see [[https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html|here]].
Line 27: Line 23:
In order to test a crash, the simplest way is to use the sysrq mechanism. Causing a crash is done by either pressing ''<sysrq>+c'' or

 {{{
 #> echo c | sudo tee /proc/sysrq-trigger
 }}}

/!\ Note that this might be disabled in some releases. ''/proc/sys/kernel/sysrq'' needs to be set to 1 in order to let the sysrq keys all work.

If everything works, there should be some delay (depending on the memory size). Then the system reboots again into the normal mode. Usually ''apport'' kicks in and asks about reporting the issue. Alternatively the report file can be found under ''/var/crash'' and either placed somewhere else or be unpacked again by calling:

 {{{
 #> apport-unpack <report file> <target directory>
 }}}

== Inspecting the crash dump ==

=== Using crash ===
= Inspecting the crash dump using crash =
Line 49: Line 29:
/!\ Be aware that those packages are huge! {{{
sudo tee /etc/apt/sources.list.d/ddebs.list << EOF
deb http://ddebs.ubuntu.com/ $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-security main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-proposed main restricted universe multiverse
EOF
Line 51: Line 37:
When installed, the debug kernel can be found under ''/usr/lib/debug/boot/'' and '''crash''' is started by: sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01
sudo apt-get update
sudo apt-get install linux-image-$(uname -r)-dbgsym
}}}
Line 53: Line 42:
 {{{
 #> crash <debug kernel> <crash dump>
 }}}
/!\ Be aware that those packages are huge! (~600 MB)

When installed, the debug kernel can be found under ''/usr/lib/debug/boot/'' and '''crash''' is started by: {{{
crash <debug kernel> <crash dump> }}}
Line 59: Line 49:
=== Using apport-retrace === = Inspecting the crash dump using apport-retrace =
Line 61: Line 51:
To get a local retrace, you need apport-retrace and then run:

{{{
 #> apport-retrace --stdout --rebuild-package-info /var/crash/linux-image*.crash
}}}
To get a local retrace, you need apport-retrace and then run: {{{
apport-retrace --stdout --rebuild-package-info /var/crash/linux-image*.crash }}}
Line 69: Line 56:
== Enabling various types of panics ==
Line 70: Line 58:
== Troubleshooting == To make Linux kernel to panic on different situations please use:
Line 72: Line 60:
=== Allocated memory for the crash kernel === {{{
echo 1 > /proc/sys/kernel/hung_task_panic # panic when hung task is detected
echo 1 > /proc/sys/kernel/panic_on_io_nmi # panic on NMIs from I/O
echo 1 > /proc/sys/kernel/panic_on_oops # panic on oops or kernel bug detection
echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi # panic on NMIs from memory or unknown
echo 1 > /proc/sys/kernel/softlockup_panic # panic when soft lockups are detected
echo 1 > /proc/sys/vm/panic_on_oom # panic when out-of-memory happens
}}}
Line 74: Line 69:
When testing crash dump sometimes the system just seems to lock up. The main issue there is how much memory was assigned for the crash kernel. When kexec starts the crash kernel it requires enough memory to fit the unpacked kernel, the compressed initrd and the uncompressed initrd (at least while unpacking). If there is not enough memory allocated, things usually go wrong without any hint. = Troubleshooting =
Line 76: Line 71:
To solve this one can either increase the allocation by changing ''crashkernel='' on the grub command line or in ''/boot/grub/grub.cfg'' (for grub2) or ''/boot/grub/menu.lst'' (for old grub). == Allocated memory for the crash kernel ==
Line 78: Line 73:
Or one tries to reduce the size of the initrd. By default this is set to include all the modules and firmware ever needed. This allows using the same initrd on any system but increases its size a lot. In order to limit it to the modules really required to boot on the current hardware, change the following in ''/etc/initramfs-tools/initramfs.conf'': When testing crash dump sometimes the system just seems to lock up. The main issue there is how much memory was assigned for the crash kernel. When kexec starts the crash kernel it requires enough memory to fit the unpacked kernel, the compressed initrd and the uncompressed initrd (at least while unpacking). If there is not enough memory allocated, things usually go wrong without any hint. To solve this there are the following options:
Line 80: Line 75:
 {{{  1. Increase the allocation by changing ''crashkernel='' on the grub command line or in ''/boot/grub/grub.cfg'' (for grub2) or ''/boot/grub/menu.lst'' (for old grub). To avoid losing the setting when running '''update-grub''', make the change in ''/etc/default/grub.d/kexec-tools.cfg''.
 1. Reduce the size of the initrd. By default this is set to include all the modules and firmware ever needed. This allows using the same initrd on any system but increases its size a lot. In order to limit it to the modules really required to boot on the current hardware, change the following in ''/etc/initramfs-tools/initramfs.conf'': {{{
Line 83: Line 79:
 ...
 }}}
 ... }}}

== Crash kernel fails to load: Hang ==

This can be frustrating to debug, especially if you're unable to record the console messages from the new kexec kernel. A serial console attached to the system is best here to continue debugging. An easy troubleshooting step is to systematically eliminate the additional kernel parameters passed to the crash kernel and retrying. These arguments are kept in '''/etc/init.d/kdump''': {{{
...
        # Append kdump_needed for initramfs to know what to do, and add
        # maxcpus=1 to keep things sane.
        APPEND="$APPEND kdump_needed maxcpus=1 irqpoll reset_devices"

        # --elf32-core-headers is needed for 32-bit systems (ok
        # for 64-bit ones too).
        log_action_begin_msg "Loading crashkernel"
        kexec -p "$KERNEL_IMAGE" --initrd="$INITRD" --append="$APPEND"
        log_action_end_msg $?
... }}}

Leave '''$APPEND''' and '''kdump_needed'''. Start by removing '''reset_devices''' and then
install the new kexec crash kernel configuration: {{{
sudo service kdump start }}}

Then retest; if that doesn't work, remove the next argument, rinse and repeat.

== ACPI memory hotplug issues ==

If you see the following call trace from your serial console after kexecing into the crash kernel you may need to append 'acpi_no_memhotplug' to the crashdump kernel cmdline.

{{{
Call Trace:
 dump_stack+0x45/0x57
 warn_alloc_failed+0xf2/0x140
 __alloc_pages_nodemask+0x2e4/0xa10
 vmemmap_alloc_block+0xb5/0xbf
 vmemmap_alloc_block_buf+0x15/0x3b
 vmemmap_populate+0xb3/0x20c
 sparse_mem_map_populate+0x29/0x38
 sparse_add_one_section+0x71/0x16e
 __add_pages+0xb9/0x280
 arch_add_memory+0x71/0xf0
 add_memory+0xdf/0x210
 acpi_memory_device_add+0x1ab/0x282
 acpi_bus_attach+0xe3/0x196
 acpi_bus_scan+0x70/0x8f
 acpi_scan_init+0x89/0x1d3
 acpi_init+0x272/0x28f
 do_one_initcall+0xb3/0x200
 kernel_init_freeable+0x17b/0x21a
 kernel_init+0xe/0xe0
 ret_from_fork+0x3f/0x70
}}}

Edit KDUMP_CMDLINE_APPEND in /etc/default/kdump-tools such that it is un-commented and contains 'acpi_no_memhotplug' as well. Then restart the kdump service.
Line 90: Line 136:
 * [[https://bugs.launchpad.net/ubuntu/+source/kexec-tools/+bug/988512|Bug #988512: Missing /boot/vmcoreinfo-{version} file is breaking kdump]]<<BR>>
 Because of some kernel code changes, the vmcoreinfo file cannot be generated. However, the required information can now be obtained from the kernel on doing the dump. But ''/etc/init/kdump'' and ''/usr/???'' still require it. [TDB: add hotfix patch to bug]
 * [[https://bugs.launchpad.net/ubuntu/+source/kexec-tools/+bug/785394|Bug 785394: Hard-coded crashkernel=... memory reservation in /etc/grub.d/10_linux is insufficient]]<<BR>>
 The default allocation for systems below 2G is not enough for the current initrd size. Manually adapting the size allows to use the crash kernel.
 * The current (1.3.7-2) version of makedumpfile reports to be incompatible with the 3.2 kernel. The dumps created seem to be ok.
Line 93: Line 140:
== Ubuntu 10.04 "Lucid Lynx" ==

 * --([[https://bugs.launchpad.net/ubuntu/+source/apport/+bug/533565|Bug #533565 in apport (Ubuntu): "Strings missing from the apport template"]])--<<BR>>
 ''This bug was fixed in the package python-distutils-extra - 2.19'' (in lucid it's 2.18bzr1)
 * [[https://bugs.launchpad.net/ubuntu/+source/apport/+bug/592239|Bug #592239 in apport (Ubuntu): "apport-retrace - IndexError: list index out of range"]]

== Ubuntu 9.04 "Jaunty Jackalope" ==

This page describes a recipe for enabling crash dump vmcore analysis on your Jaunty x86/x86_64 platform. Much of the information was gleaned from the kernel source tree files in Documentation/kdump.

  * 'apt-get install linux-crashdump'
    This is a meta package that installs all of the tools necessary to acquire and analyse a crash-dump vmcore.

  * Add 'crashkernel=64M@16M' to the kernel command line in /boot/grub/menu.lst.
    You'll also probably want to remove 'quiet splash'.

  * Reboot the system (into the ordinary kernel). The section of RAM above will now be reserved for the crashkernel (and not available to the normal system).

  * Make note of your root partition, e.g., /dev/sda1
    '''kexec -p /boot/vmlinuz-{{{`uname -r`}}} --initrd=/boot/initrd.img-{{{`uname -r`}}} --append="root=<ROOT_PARTITION> irqpoll maxcpus=1"'''
    This loads the crash-dump kernel into the reserved memory, in preparation for a panic.

  Now your kernel is ready to acquire a post-crash vmcore.
== Ubuntu 15.10 "Wily Werewolf" and later ==
 * [[https://bugs.launchpad.net/ubuntu/+source/kexec-tools/+bug/1496317|Bug 1496317: kexec fails with OOM killer with the current crashkernel=128 value]]<<BR>>
The current allocation for the crashkernel value is too low to correctly load the default initrd.img. As a result, the OOM killer breaks the crash dump capture procedure. While the bug is being worked on, you can work around it by increasing the value of crashkernel to more than 150 MB.

Introduction

The Ubuntu Kernel Crash Dump is a mechanism that enables enterprise-style post-mortem crash analysis on Linux systems. It uses a special mode of kexec to automatically boot a secondary kernel whenever a crash (Oops/panic) occurs. This secondary kernel then saves the state and memory of the primary kernel to a location on the filesystem (/var/crash on newer releases). The resulting dump file can then be analysed with crash to gather detailed information about the problem.

Installation

For convenience, the kernel crash dump utility has been packaged in Ubuntu. It can be installed with the following command:

sudo apt-get install linux-crashdump 

Newer versions of the package automatically add the entry crashkernel=384M-2G:64M,2G-:128M to the kernel command line in grub. However, this may cause problems on systems with less than 2G of memory (see troubleshooting).
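The ranged crashkernel= value reads as a list of "RAM range:reservation" pairs: with the default above, a machine with at least 384M but less than 2G of RAM reserves 64M, a machine with 2G or more reserves 128M, and a smaller machine reserves nothing. The following is only a sketch of that selection rule, not the kernel's parser; the helper names to_mb and crashkernel_for are made up for illustration, and only the M and G suffixes are handled.

```shell
#!/bin/sh
# Convert a size such as 384M or 2G to whole megabytes.
to_mb() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# Print the reservation that a ranged crashkernel= value selects for a
# machine with the given amount of RAM (in MB), or "none".
crashkernel_for() {
    spec=$1
    ram_mb=$2
    old_ifs=$IFS
    IFS=,
    for entry in $spec; do
        range=${entry%%:*}          # e.g. 384M-2G or 2G-
        size=${entry#*:}            # e.g. 64M
        start=$(to_mb "${range%%-*}")
        end=${range#*-}             # empty means "no upper limit"
        # The start of a range is inclusive, the end is exclusive.
        if [ "$ram_mb" -ge "$start" ] &&
           { [ -z "$end" ] || [ "$ram_mb" -lt "$(to_mb "$end")" ]; }; then
            IFS=$old_ifs
            echo "$size"
            return 0
        fi
    done
    IFS=$old_ifs
    echo none
}

crashkernel_for '384M-2G:64M,2G-:128M' 1024
crashkernel_for '384M-2G:64M,2G-:128M' 4096
```

A machine that falls in the 64M band is the case the troubleshooting section is concerned with, since 64M may be too little for the default initrd.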

ppc64el installation

For those running ppc64el please follow the crash kernel recommendations found on the ppc64el Recommendations page.

Verifying linux-crashdump installation

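For Trusty, please see the kernel crash dump chapter of the Ubuntu Server Guide: https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html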
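On any release, two quick checks after the reboot are whether a crashkernel= argument reached the kernel command line and whether a crash kernel is loaded. Earlier revisions of this page used /sys/kernel/kexec_crash_loaded for the second check: 1 means a crash kernel is loaded, 0 means something went wrong. The helper name crashkernel_arg below is made up for this sketch.

```shell
#!/bin/sh
# Print the crashkernel= argument found in a kernel command line.
# With no argument, the running kernel's /proc/cmdline is inspected.
crashkernel_arg() {
    cmdline=${1:-$(cat /proc/cmdline 2>/dev/null)}
    for word in $cmdline; do
        case "$word" in
            crashkernel=*) echo "$word"; return 0 ;;
        esac
    done
    return 1
}

# 1. Was a memory reservation requested at boot?
if arg=$(crashkernel_arg); then
    echo "reservation requested: $arg"
else
    echo "no crashkernel= argument on the kernel command line"
fi

# 2. Is a crash kernel loaded? 1 means loaded, 0 means not loaded.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```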

Inspecting the crash dump using crash

In order to use the generated crash dump with crash one needs the vmlinux file which has the debugging information. This is part of the kernel ddeb package which can be found at:

http://ddebs.ubuntu.com/pool/main/l/linux/

sudo tee /etc/apt/sources.list.d/ddebs.list << EOF
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)          main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-security main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-updates  main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-proposed main restricted universe multiverse
EOF

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01
sudo apt-get update
sudo apt-get install linux-image-$(uname -r)-dbgsym

Warning: Be aware that those packages are huge (~600 MB)!

When installed, the debug kernel can be found under /usr/lib/debug/boot/ and crash is started by:

crash <debug kernel> <crash dump> 

Unfortunately, the tool cannot open a 32-bit dump on a 64-bit system, or vice versa. It also tends to be quite picky about the debug kernel matching the kernel that produced the dump.
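As a sketch of how the two arguments fit together: the debug kernel is assumed here to be installed as vmlinux-<release> under /usr/lib/debug/boot/ (the exact file name is an assumption, so check the directory), and the dump path is left as a placeholder because it depends on what the crash kernel wrote under /var/crash. The helper name debug_kernel_path is made up for illustration.

```shell
#!/bin/sh
# Path where the dbgsym package is assumed to install the debug kernel
# for a given kernel release (defaults to the running kernel).
debug_kernel_path() {
    echo "/usr/lib/debug/boot/vmlinux-${1:-$(uname -r)}"
}

vmlinux=$(debug_kernel_path)
if [ -r "$vmlinux" ]; then
    echo "debug kernel found; start crash with:"
    echo "  crash $vmlinux <crash dump>"
else
    echo "no debug kernel at $vmlinux; is the dbgsym package installed?"
fi
```

If the machine has been upgraded since the crash, pass the release of the crashed kernel to debug_kernel_path instead of relying on the running one, since crash needs the two to match.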

Inspecting the crash dump using apport-retrace

To get a local retrace, you need apport-retrace and then run:

apport-retrace --stdout --rebuild-package-info /var/crash/linux-image*.crash 

Warning: Again, this can take a while because it needs to download the kernel debug package.

Enabling various types of panics

To make the Linux kernel panic in different situations, run the following as root:

echo 1 > /proc/sys/kernel/hung_task_panic          # panic when hung task is detected
echo 1 > /proc/sys/kernel/panic_on_io_nmi          # panic on NMIs from I/O
echo 1 > /proc/sys/kernel/panic_on_oops            # panic on oops or kernel bug detection
echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi # panic on NMIs from memory or unknown
echo 1 > /proc/sys/kernel/softlockup_panic         # panic when soft lockups are detected
echo 1 > /proc/sys/vm/panic_on_oom                 # panic when out-of-memory happens
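Values written to /proc/sys are lost at the next reboot. Each path under /proc/sys corresponds to a dotted sysctl key, so the same switches can be kept in a sysctl configuration file. The sketch below only prints the lines; the helper name proc_to_sysctl is made up for illustration.

```shell
#!/bin/sh
# Translate a /proc/sys path into the dotted key used by sysctl.
proc_to_sysctl() {
    echo "${1#/proc/sys/}" | tr / .
}

# Emit "key = 1" lines for the panic switches listed above.
for path in \
    /proc/sys/kernel/hung_task_panic \
    /proc/sys/kernel/panic_on_io_nmi \
    /proc/sys/kernel/panic_on_oops \
    /proc/sys/kernel/panic_on_unrecovered_nmi \
    /proc/sys/kernel/softlockup_panic \
    /proc/sys/vm/panic_on_oom
do
    echo "$(proc_to_sysctl "$path") = 1"
done
```

The output can be saved as root into a file under /etc/sysctl.d/ (any name ending in .conf) and loaded with sudo sysctl --system. When working from a non-root shell, note that sudo echo 1 > /proc/sys/... does not work, because the redirection is performed by the unprivileged shell; echo 1 | sudo tee /proc/sys/kernel/panic_on_oops does.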

Troubleshooting

Allocated memory for the crash kernel

When testing crash dumps, the system sometimes just seems to lock up. The usual cause is the amount of memory assigned to the crash kernel. When kexec starts the crash kernel, it needs enough memory to hold the unpacked kernel, the compressed initrd and the uncompressed initrd (at least while unpacking). If too little memory is allocated, things usually go wrong without any hint. There are two ways to solve this:

  1. Increase the allocation by changing crashkernel= on the grub command line or in /boot/grub/grub.cfg (for grub2) or /boot/grub/menu.lst (for old grub). To avoid losing the setting when running update-grub, make the change in /etc/default/grub.d/kexec-tools.cfg.

  2. Reduce the size of the initrd. By default this is set to include all the modules and firmware ever needed. This allows using the same initrd on any system but increases its size a lot. In order to limit it to the modules really required to boot on the current hardware, change the following in /etc/initramfs-tools/initramfs.conf:

     ...
     MODULES=dep
     ... 
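For the second option, it helps to confirm which MODULES value is in effect and to rebuild the initrd afterwards, since editing the file alone changes nothing. The sketch below assumes the configuration file is read like a shell fragment, so the last MODULES= assignment wins; the helper name modules_setting is made up for illustration.

```shell
#!/bin/sh
# Print the effective MODULES= value from an initramfs.conf-style file.
modules_setting() {
    sed -n 's/^MODULES=//p' "$1" | tail -n 1
}

conf=/etc/initramfs-tools/initramfs.conf
if [ -r "$conf" ]; then
    echo "MODULES=$(modules_setting "$conf")"
fi

# After changing the setting, rebuild the initrd for the running kernel
# and compare its size with the crashkernel= reservation:
#   sudo update-initramfs -u
#   ls -lh /boot/initrd.img-$(uname -r)
```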

Crash kernel fails to load: Hang

This can be frustrating to debug, especially if you're unable to record the console messages from the new kexec kernel. A serial console attached to the system is the best way to continue debugging. An easy troubleshooting step is to systematically eliminate the additional kernel parameters passed to the crash kernel and retry. These arguments are kept in /etc/init.d/kdump:

...
        # Append kdump_needed for initramfs to know what to do, and add
        # maxcpus=1 to keep things sane.
        APPEND="$APPEND kdump_needed maxcpus=1 irqpoll reset_devices"

        # --elf32-core-headers is needed for 32-bit systems (ok
        # for 64-bit ones too).
        log_action_begin_msg "Loading crashkernel"
        kexec -p "$KERNEL_IMAGE" --initrd="$INITRD" --append="$APPEND"
        log_action_end_msg $?
... 

Leave $APPEND and kdump_needed. Start by removing reset_devices and then install the new kexec crash kernel configuration:

sudo service kdump start 

Then retest; if that doesn't work, remove the next argument, rinse and repeat.
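The elimination can be planned in advance. The sketch below starts from the argument list in the script excerpt above and prints the list to try at each step; the order after reset_devices is an arbitrary choice, and the helper name drop_arg is made up for illustration. It only prints the candidates and does not edit /etc/init.d/kdump.

```shell
#!/bin/sh
# Remove every occurrence of one word from a space-separated argument list.
drop_arg() {
    out=
    for word in $1; do
        [ "$word" = "$2" ] || out="${out:+$out }$word"
    done
    echo "$out"
}

args="kdump_needed maxcpus=1 irqpoll reset_devices"
# Eliminate one optional argument per test run, always keeping kdump_needed.
for candidate in reset_devices irqpoll maxcpus=1; do
    args=$(drop_arg "$args" "$candidate")
    echo "next try: $args"
done
```

Each printed list goes after $APPEND in the APPEND= line, followed by reloading the crash kernel and retesting as described above.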

ACPI memory hotplug issues

If you see the following call trace on your serial console after kexecing into the crash kernel, you may need to append 'acpi_no_memhotplug' to the crash dump kernel command line.

Call Trace:
 dump_stack+0x45/0x57
 warn_alloc_failed+0xf2/0x140
 __alloc_pages_nodemask+0x2e4/0xa10
 vmemmap_alloc_block+0xb5/0xbf
 vmemmap_alloc_block_buf+0x15/0x3b
 vmemmap_populate+0xb3/0x20c
 sparse_mem_map_populate+0x29/0x38
 sparse_add_one_section+0x71/0x16e
 __add_pages+0xb9/0x280
 arch_add_memory+0x71/0xf0
 add_memory+0xdf/0x210
 acpi_memory_device_add+0x1ab/0x282
 acpi_bus_attach+0xe3/0x196
 acpi_bus_scan+0x70/0x8f
 acpi_scan_init+0x89/0x1d3
 acpi_init+0x272/0x28f
 do_one_initcall+0xb3/0x200
 kernel_init_freeable+0x17b/0x21a
 kernel_init+0xe/0xe0
 ret_from_fork+0x3f/0x70

Edit KDUMP_CMDLINE_APPEND in /etc/default/kdump-tools such that it is un-commented and contains 'acpi_no_memhotplug' as well. Then restart the kdump service.
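The edit can be previewed with sed before touching the real file. This sketch un-comments the KDUMP_CMDLINE_APPEND line and appends acpi_no_memhotplug inside the quotes unless it is already there. The sample value in the demo is hypothetical, so check what your /etc/default/kdump-tools actually contains; the function name enable_no_memhotplug is made up for illustration.

```shell
#!/bin/sh
# Print the input (files given as arguments, or stdin) with
# KDUMP_CMDLINE_APPEND un-commented and acpi_no_memhotplug appended.
enable_no_memhotplug() {
    sed -E \
        -e 's/^#[[:space:]]*(KDUMP_CMDLINE_APPEND=)/\1/' \
        -e '/^KDUMP_CMDLINE_APPEND=.*acpi_no_memhotplug/!s/^(KDUMP_CMDLINE_APPEND="[^"]*)"/\1 acpi_no_memhotplug"/' \
        "$@"
}

# Demo on a hypothetical line; nothing on disk is modified.
printf '%s\n' '#KDUMP_CMDLINE_APPEND="irqpoll maxcpus=1"' | enable_no_memhotplug
```

Running enable_no_memhotplug /etc/default/kdump-tools prints the proposed file to standard output for review; once it looks right, make the same change in the file as root and restart the kdump service.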

Release specific notes

Ubuntu 12.04 "Precise Pangolin"

Ubuntu 15.10 "Wily Werewolf" and later

Bug 1496317 (kexec fails with OOM killer with the current crashkernel=128 value): The current allocation for the crashkernel value is too low to correctly load the default initrd.img. As a result, the OOM killer breaks the crash dump capture procedure. While the bug is being worked on, you can work around it by increasing the value of crashkernel to more than 150 MB.

Kernel/CrashdumpRecipe (last edited 2021-11-04 14:04:59 by tomreyn)