DebuggingKernelOops

Revision 1 as of 2007-11-14 03:05:01

Clear message

There are times when the code in the kernel will have errors and result in a system fault. This is known as a kernel Oops. The kernel Oops report is typically dumped to the console as well as output to a log file like /var/log/dmesg . Although a system fault may not render a machine useless, it may require a reboot as the machine may be left in an unusable/inconsistent state. The information contained here may be helpful with debugging a kernel Oops report. It will also help educate a bug reporter in gathering appropriate information to supply to the kernel team to help resolve system faults.

Kernel Oops Report

An example of a kernel Oops report looks like the following:

 BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008
  printing eip:
 c022a7b5
 *pde = 00000000
 Oops: 0000 [#1]
 SMP 
 Modules linked in: thinkpad_acpi ppdev speedstep_lib cpufreq_conservative cpufreq_userspace cpufreq_ondemand cpufreq_stats cpufreq_powersave freq_table video bay dock ac sbs button container battery lp irtty_sir sir_dev pcmcia parport_pc parport snd_cs46xx gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss nsc_ircc snd_pcm snd_seq_dummy irda crc_ccitt snd_seq_oss psmouse i2c_piix4 snd_seq_midi snd_rawmidi snd_seq_midi_event serio_raw pcspkr snd_seq i2c_core snd_timer snd_seq_device snd soundcore snd_page_alloc shpchp pci_hotplug intel_agp yenta_socket rsrc_nonstatic pcmcia_core agpgart evdev ext3 jbd mbcache sg sr_mod cdrom sd_mod uhci_hcd usbcore ata_piix ata_generic libata scsi_mod e100 mii thermal processor fan fuse apparmor commoncap
 CPU:    0
 EIP:    0060:[<c022a7b5>]    Not tainted VLI
 EFLAGS: 00010202   (2.6.22-12-generic #1)
 EIP is at acpi_ns_internalize_name+0xd/0x83
 eax: 00000008   ebx: 00000000   ecx: 00000000   edx: c7879e54
 esi: d0b980c0   edi: c7879e54   ebp: c7879e70   esp: c7879de8
 ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
 Process modprobe (pid: 4467, ti=c7878000 task=ce5c94c0 task.ti=c7878000)
 Stack: 00000000 00000000 d0b97e60 00008080 c01c4390 d0b97e60 00000000 00000000 
        d0b980c0 00000000 c7879e70 c022a85c d0b97e60 c795d030 c7c604e0 c01c44ef 
        00000004 d0b97e60 c7acea18 c01c3884 00008080 00000004 00000004 00000080 
 Call Trace:
  [<c01c4390>] __sysfs_new_dirent+0x20/0x50
  [<c022a85c>] acpi_ns_get_node+0x31/0x93
  [<c01c44ef>] sysfs_make_dirent+0x2f/0x50
  [<c01c3884>] sysfs_add_file+0x74/0x90
  [<d0b910b7>] drv_acpi_handle_init+0x37/0x90 [thinkpad_acpi]
  [<c0231aef>] acpi_ut_release_mutex+0x5b/0x63
  [<c0233ac0>] acpi_method_notify_enable+0x15/0x34
  [<d0b5ba32>] cmos_init+0x52/0x70 [thinkpad_acpi]
  [<d0b5c21f>] thinkpad_acpi_module_init+0x27f/0x69a [thinkpad_acpi]
  [<c014a811>] sys_init_module+0x151/0x1a00
  [<c01fb8cf>] prio_tree_insert+0x1f/0x250
  [<c01041d2>] sysenter_past_esp+0x6b/0xa9
  =======================
 Code: c7 44 24 14 01 00 00 00 8b 54 24 14 8d 04 96 e9 f2 fe ff ff 83 c4 18 89 d0 5b 5e 5f 5d c3 55 57 89 d7 56 53 83 ec 1c 85 c0 74 67 <80> 38 00 74 62 85 d2 74 5e 89 04 24 89 e0 e8 b5 fb ff ff 8b 4c 
 EIP: [<c022a7b5>] acpi_ns_internalize_name+0xd/0x83 SS:ESP 0068:c7879de8

Debugging a Kernel Oops

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008

The above type of Oops is one that is commonly encountered. This means that we've dereferenced an invalid pointer which resulted in the kernel Oops. The rest of the Oops report displays the status of the system at the time of the fault.

Oops: 0000 [#1]

Many times one Oops will trigger multiple Oopses. When debugging multiple Oops reports, be sure to start with the very first one.

EIP is at acpi_ns_internalize_name+0xd/0x83

The most relevant information we want to gather from the Oops in this situation will be the EIP and address of the bad call. The EIP usually indicates where the problem occurred. In this case we can see that at 0xd bytes into the function acpi_ns_internalize_name we hit the Oops. This will give a good indication of where to start looking for the bad code.

EIP:    0060:[<c022a7b5>]    Not tainted VLI

This is also a good time to mention Tainted Kernels. Some oops reports contain the string 'Tainted: ' after the EIP. This indicates that the kernel has been tainted by some mechanism. The string is followed by a series of position-sensitive characters, each representing a particular tainted value.

  1. 'G' if all modules loaded have a GPL or compatible license
  2. 'P' if any proprietary module has been loaded. Modules without a MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by insmod as GPL compatible are assumed to be proprietary.
  3. 'F' if any module was force loaded by "insmod -f", ' ' if all modules were loaded normally.
  4. 'S' if the oops occurred on an SMP kernel running on hardware that hasn't been certified as safe to run multiprocessor. Currently this occurs only on various Athlons that are not SMP capable.
  5. 'R' if a module was force unloaded by "rmmod -f", ' ' if all modules were unloaded normally.
  6. 'M' if any processor has reported a Machine Check Exception, ' ' if no Machine Check Exceptions have occurred.
  7. 'B' if a page-release function has found a bad page reference or some unexpected page flags.
  8. 'U' if a user or user application specifically requested that the Tainted flag be set, ' ' otherwise.
  9. 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.

The primary reason for the 'Tainted: ' string is to tell kernel debuggers if this is a clean kernel or if anything unusual has occurred. Tainting is permanent: even if an offending module is unloaded, the tainted value remains to indicate that the kernel is not trustworthy.

Stack: 00000000 00000000 d0b97e60 00008080 c01c4390 d0b97e60 00000000 00000000 
        d0b980c0 00000000 c7879e70 c022a85c d0b97e60 c795d030 c7c604e0 c01c44ef 
        00000004 d0b97e60 c7acea18 c01c3884 00008080 00000004 00000004 00000080 
 Call Trace: 
  [<c01c4390>] __sysfs_new_dirent+0x20/0x50 
  [<c022a85c>] acpi_ns_get_node+0x31/0x93 
  [<c01c44ef>] sysfs_make_dirent+0x2f/0x50 
  [<c01c3884>] sysfs_add_file+0x74/0x90 
  [<d0b910b7>] drv_acpi_handle_init+0x37/0x90 [thinkpad_acpi] 
  [<c0231aef>] acpi_ut_release_mutex+0x5b/0x63 
  [<c0233ac0>] acpi_method_notify_enable+0x15/0x34 
  [<d0b5ba32>] cmos_init+0x52/0x70 [thinkpad_acpi] 
  [<d0b5c21f>] thinkpad_acpi_module_init+0x27f/0x69a [thinkpad_acpi] 
  [<c014a811>] sys_init_module+0x151/0x1a00 
  [<c01fb8cf>] prio_tree_insert+0x1f/0x250 
  [<c01041d2>] sysenter_past_esp+0x6b/0xa9 

Sometimes you'll need to derive even more information from the Call Stack. This trace shows you how you got to where the code broke. You'll notice that [thinkpad_acpi] trails some of the Call Trace output. This indicates the module for which the code of the calling routine resides in. This is often helpful in narrowing down which module resulted in an issue.