ApportFreezeHooks

In Lucid we made a lot of progress on capturing debug data automatically when X froze with the -intel driver. For MM we want to build on this work and extend it to -ati and -nouveau as well.

Intel Freeze Hook Improvements

TODO: See discussion on ubuntu-x about this in Feb/Mar 2010. A few notes follow...

  • Consider adding a table of known issues for automatic duping purposes
    • 8xx -> fdo 26345

  • Detect if the currently executing batchbuffer is completely missing; don't file bug in these cases.
  • Set title more precisely:
    • Scan dmesg output for "Hangcheck timer elapsed... GPU hung" and if present indicate it in the title
    • Scan dmesg output for "page table error", and where present indicate it in the title as "GPU page table error"
    • In other cases, simply say "GPU error"
  • Possibly add a message to the apport script explaining that, while we record the logs of the incident, they don't tell us how the reporter experienced the problem. We get a lot of descriptions that only say things like "problem happened", and we don't know whether the computer hung and needed a reboot, or whether it recovered by itself and the only thing the user noticed was apport asking them to report a problem they were unaware of.
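
The title rules in the list above could be sketched roughly as follows. This is only an illustration: the function name, the "GPU hung" title string and the choice to check for a hang before a page table error are assumptions, and the exact kernel wording may differ between versions.

```python
import re

# Message fragments quoted in the notes above; the kernel wording may vary.
HANG_RE = re.compile(r"Hangcheck timer elapsed\.\.\. GPU hung")
PAGE_TABLE_RE = re.compile(r"page table error", re.IGNORECASE)


def freeze_title(dmesg_text):
    """Pick a bug title fragment from dmesg output (hypothetical helper)."""
    if HANG_RE.search(dmesg_text):
        # Assumed title string; the notes only say to indicate the hang.
        return "GPU hung"
    if PAGE_TABLE_RE.search(dmesg_text):
        return "GPU page table error"
    # Fallback for everything else, as suggested in the list above.
    return "GPU error"
```

A hook would pass the captured dmesg text to this helper and use the result when building the report title.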

One option would be to carry the record-GPU-error-state kernel patch http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=commit;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e until some time before release and capture the i915_error_state a little later. This would need some testing though, so our best option may be to simply leave it at the status quo and ask the reporters of the most promising automatic reports to test a drm-intel-next kernel and get a manual dump.

We need to detect cases where the GPU is reset before we can capture the data. i965, GM45 and newer chips get reset, and IntelGpuDump.txt is then often a dump of a freshly initialized GPU, which isn't helpful. The best sign is that the HEAD is right at the beginning of the ringbuffer, i.e. it just got started. The other sign is that ACTHD and IPEHR are different from the ones recorded in i915_error_state. With drm.debug=0x02 as a kernel parameter, we can also see in the dmesg output that the GPU is being reset (see [1] for an example from LP #516909). The code that triggers the reset is i915_error_work_func in drivers/gpu/drm/i915/i915_irq.c [2]. The actual reset happens in i965_reset in i915_drv.c [3].

[1]: https://bugs.freedesktop.org/attachment.cgi?id=34126&action=edit
[2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_irq.c;h=5388354da0d176df4ff2a3b7c33de069abff12da;hb=HEAD
[3]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_drv.c;h=1b2e95455c05d0cce04d17483c7bd4ff9f218fe0;hb=HEAD
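
A rough sketch of the ACTHD/IPEHR comparison described above. It assumes both i915_error_state and IntelGpuDump.txt contain lines of the form "NAME: 0x..." for these registers; that layout, and the function names, are assumptions for illustration rather than a documented format.

```python
import re

# Assumed line layout: "ACTHD: 0x01234560" (leading whitespace allowed).
REG_RE = re.compile(r"\b(ACTHD|IPEHR)\s*:\s*(0x[0-9a-fA-F]+)")


def parse_registers(text):
    """Return {name: int} for the first ACTHD and IPEHR values in text."""
    regs = {}
    for name, value in REG_RE.findall(text):
        regs.setdefault(name, int(value, 16))
    return regs


def looks_reset(error_state_text, gpu_dump_text):
    """True if the dump disagrees with i915_error_state, None if unknown."""
    before = parse_registers(error_state_text)
    after = parse_registers(gpu_dump_text)
    shared = set(before) & set(after)
    if not shared:
        return None  # nothing to compare, so we cannot tell
    return any(before[name] != after[name] for name in shared)
```

Returning None when no register appears in both files keeps "cannot tell" distinct from "not reset", so a hook would not wrongly discard a useful dump.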

On how the udev events are triggered:

The udev events are sent from i915_error_work_func mentioned above. When a GPU reset happens, three events are sent: one at the beginning of the function, when we know that an error has been detected, one right before the reset and one after it. The last two only happen on i965 and above, so we don't want to listen for them. The first happens whether the GPU is wedged or not (as defined by dev_priv->mm.wedged). There is no uevent that is triggered on all chipsets but only when the GPU is wedged, which may be what we want.

The i915_error_work_func is called from the end of i915_handle_error (also in i915_irq.c), which takes care of recording the error state to i915_error_state in debugfs first, so it's fine to grab this file on the first udev event, even in the cases where the GPU will be reset (I was worried about this in previous emails). i915_handle_error is called from two places. One is when a bit in the error register EIR gets set, which triggers an interrupt. The other is when the hangcheck timer elapses, i.e. EIR is not set, but the GPU makes no progress. In the latter case "Hangcheck timer elapsed... GPU hung\n" is logged. In both cases i915_handle_error prints "render error detected, EIR: 0x%08x\n" (i.e. the EIR register is printed), but this will probably change in drm-intel-next soon, so that it is only printed when a bit in EIR is set [4].

[4]: http://lists.freedesktop.org/archives/intel-gfx/2010-March/006150.html
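
To tell the two i915_handle_error call paths apart from the logs, something like the following could work. It relies only on the two message formats quoted above; the function name and the returned labels are hypothetical.

```python
import re

# Format string quoted above: "render error detected, EIR: 0x%08x"
EIR_RE = re.compile(r"render error detected, EIR: (0x[0-9a-fA-F]{8})")
HANG_MSG = "Hangcheck timer elapsed... GPU hung"


def classify_error(dmesg_text):
    """Return (cause, eir): cause is 'hangcheck', 'eir' or 'unknown'."""
    match = EIR_RE.search(dmesg_text)
    eir = int(match.group(1), 16) if match else None
    if HANG_MSG in dmesg_text:
        cause = "hangcheck"  # timer elapsed, GPU made no progress
    elif eir:
        cause = "eir"  # a bit in the error register was set
    else:
        cause = "unknown"
    return cause, eir
```

If drm-intel-next stops printing the EIR line for hangcheck cases, the hangcheck branch still works because it does not depend on that line being present.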

On what upstream wants:

Chris Wilson says that they would prefer dumps from kernels with the i915_error_state dumping patch [5]. IntelGpuDump.txt usually lacks some important information.

[5]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e

There is /sys/kernel/debug/dri/0/i915_wedged on Lucid now that the .33 drm is included [6]. Attaching this file automatically may sometimes help in deciphering what's going on.

[6]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f3cd474bb235f2331c1a6f579bdbf892386e5c7c
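
A minimal sketch of collecting the debugfs files mentioned above into a report. A plain dict stands in for the apport report object, the report keys are made up, and it is assumed that i915_error_state lives in the same directory as i915_wedged.

```python
import os

DEBUGFS_DIR = "/sys/kernel/debug/dri/0"

# Hypothetical report keys; the real hook may name them differently.
DEBUGFS_FILES = {
    "i915_error_state": "I915ErrorState",
    "i915_wedged": "I915Wedged",
}


def collect_debugfs(report, base_dir=DEBUGFS_DIR):
    """Copy each readable debugfs file into report; skip missing ones."""
    for filename, key in DEBUGFS_FILES.items():
        path = os.path.join(base_dir, filename)
        try:
            with open(path, "r", errors="replace") as handle:
                report[key] = handle.read()
        except OSError:
            # Older kernels lack these files, and debugfs needs root.
            continue
    return report
```

Missing files are skipped silently so the same hook can run on kernels without the .33 drm.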

Nouveau Freeze Hook

This should use mmiotrace

Ati Radeon Freeze Hook

This should use radeontool for R5xx and earlier, or avivotool for R600 and newer.

X/Blueprints/ApportFreezeHooks (last edited 2010-03-24 19:20:22 by pool-74-107-129-37)