Diff for "X/Troubleshooting/Freeze"

Freeze

Differences between revisions 30 and 31

Contents

How It Works
Reporting GPU lockup Bugs
Problem: Freezes occur when idle and screensaver is set to random settings or OpenGL
Problem: Freezes when screensaver or video player changes DPMS settings
Problem: Log shows "[mi] EQ overflowing" and X freezes
Problem: Log shows something about ring buffers and I830WaitLpRing (-intel only)
Problem: Freeze began after a system update
Stock Reply for "random freeze" bugs
Note: Intel 8xx Chipsets

Symptoms

X stops responding to input (sometimes mouse cursor can still move, but clicking has no effect)
The screen displays but does not update. Sometimes there is screen corruption too, but usually there isn't.
Often, X cannot be killed; only a reboot clears the state
The system operates fine over SSH but not on the graphical console
Error messages such as "GPU lockup" are (sometimes) present in your dmesg output

Non-Symptoms

A backtrace appears in Xorg.0.log - most of the time this indicates a crash, not a freeze. Collect a full backtrace (see https://wiki.ubuntu.com/X/Backtracing).
X seems to be working, but the monitor appears to just be "off" (See X/Troubleshooting/BlankScreen instead)
The caps lock key blinks - this indicates a kernel failure, not X
X CPU or memory load is high, making the system laggy or causing it to freeze up. This usually indicates a misbehaving client application.
Screen still updates (look at the clock), but can't be interacted with - this is probably an input bug, not a GPU freeze
System freezes for a period but then comes back. Real freezes never come back.

How It Works

In general, most freezes are due to Graphics Processing Unit (GPU) lockups. GPUs have registers that the driver interacts with to produce graphical effects; if the driver interacts with them incorrectly, the GPU can get stuck and require power cycling to recover.

Some GPU lockups are caused by triggerable conditions, and are easily reproducible. Others are situational and "tend" to occur with certain programs loaded, certain load levels, or certain periods of time passing. Still others are seemingly random and impossible to tie to any definite set of preconditions. Knowing which of these three classes your bug fits in can provide a clue, as different types of driver errors can lead to the different classes.

GPU lockups are always handled as driver-specific bugs. Typically the source of the error is the handling of memory or command registers, graphics state, or other parameters of the hardware that the driver is responsible for. Often with the open source drivers (-intel, -ati, and -nouveau) the bug needs to be fixed in the kernel's DRM driver, so many "X freeze" bugs are technically kernel bugs.

Reporting GPU lockup Bugs

Reproduce the freeze and, while the system is still frozen, SSH into it (over ethernet) and collect:

dmesg > dmesg.txt
/var/log/Xorg.0.log
/sys/kernel/debug/dri/0/i915_error_state [Intel graphics only]
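The collection step above can be sketched as a small shell function, intended to be run over SSH while the console is frozen. This is only a minimal sketch: the function name `collect_freeze_logs` and the `XORG_LOG` and `ERROR_STATE` override variables are hypothetical names introduced here for illustration, and reading the debugfs file is assumed to need root.

```shell
# Sketch: gather the files listed above into one directory for a bug report.
# XORG_LOG and ERROR_STATE are hypothetical overrides, not standard variables.
collect_freeze_logs() {
    outdir="${1:-freeze-report}"
    xorg_log="${XORG_LOG:-/var/log/Xorg.0.log}"
    error_state="${ERROR_STATE:-/sys/kernel/debug/dri/0/i915_error_state}"

    mkdir -p "$outdir" || return 1

    # dmesg can require root on some systems; note the failure instead of aborting.
    dmesg > "$outdir/dmesg.txt" 2>/dev/null \
        || echo "dmesg could not be read (try running as root)" > "$outdir/dmesg.txt"

    for src in "$xorg_log" "$error_state"; do
        if [ -r "$src" ]; then
            # cat reads until end-of-file, which suits debugfs files
            # whose reported size is zero.
            cat "$src" > "$outdir/$(basename "$src")"
        else
            echo "skipped, not readable: $src" >&2
        fi
    done
}
```

Calling `collect_freeze_logs my-report` (as root, so the Intel error state is readable) leaves everything in `my-report/`, ready to attach. Files that do not exist, such as the i915 file on non-Intel hardware, are skipped with a note on stderr.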

Important questions to answer in your report:

Have you experienced just one lockup, or have you had a series of these lockups?
- If you've had several, how often does it occur? Every few hours? Once or twice a day? Couple times a week?
When did you first notice it?
- Shortly after upgrading?
- After updating?
- After changing compositing (Desktop Effects) settings?
Under what conditions does it seem most likely to reproduce?
- Only at boot time?
- When resuming from suspend or hibernate?
- Only when using compositing (Unity, et al)
- When changing resolution or enabling/disabling monitors
- When the screensaver (or power saving mode) kicks in
- Visiting particular web pages or loading particular files
- Switching between desktops
- When performing a specific sequence of actions (List them!)

The drm-intel-next mainline kernel builds dated 2010-02-24 and later, available at http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/, include a kernel patch that automatically records the failed batchbuffer information in /sys/kernel/debug/dri/0/i915_error_state when the GPU resets.
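Before attaching the error state, it can help to check whether the file actually holds a recorded hang. The sketch below assumes that an empty record reads "no error state collected" (matched case-insensitively, since the exact wording may differ between kernel versions). The function name `has_error_state` is hypothetical.

```shell
# Sketch: report whether an i915 error state file holds a recorded GPU error.
# Returns 0 if an error state is present, 1 if none was recorded,
# and 2 if the file cannot be read (debugfs usually needs root).
has_error_state() {
    file="${1:-/sys/kernel/debug/dri/0/i915_error_state}"
    [ -r "$file" ] || return 2

    # An empty file carries no information.
    grep -q . "$file" || return 1

    # Assumed placeholder text for "nothing recorded yet".
    if grep -qi 'no error state collected' "$file"; then
        return 1
    fi
    return 0
}
```

For example, `sudo sh -c '. ./check.sh; has_error_state && echo "hang recorded"'` would print the message only when there is something worth attaching (`check.sh` being whatever file the function was saved in).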

See this page for more information about mainline kernels, including information about how to install them.

Sometimes adjusting settings (such as reducing video memory) can make a freeze easier (or harder) to reproduce. This can be instrumental in helping debug the problem.

Problem: Freezes occur when idle and screensaver is set to random settings or OpenGL

A lot of freezes occur in the 3D code, and go unnoticed by users that don't otherwise use 3D stuff, except when an OpenGL screensaver activates.

One common situation is when the screensavers are set to "random", and allowed to mix in OpenGL 3D screensavers with the regular 2D ones. Not all 3D screensavers will trigger the freeze, and some will trigger it only a portion of the time.

An obvious workaround is to set the screensaver to blank screen (or to avoid the problematic OpenGL screensaver). Alternatively, you can disable DRI support by setting Option "DRI" "Off" in your xorg.conf.
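After changing the DRI setting, one way to confirm that it took effect is to look at the "direct rendering" line that `glxinfo` prints. The sketch below assumes `glxinfo` is installed (on Ubuntu it is usually shipped in the mesa-utils package). The helper name `direct_rendering_state` is hypothetical.

```shell
# Sketch: read glxinfo output on stdin and print "Yes", "No", or "unknown".
direct_rendering_state() {
    state=$(sed -n 's/^direct rendering:[[:space:]]*\([A-Za-z]*\).*/\1/p' | head -n 1)
    echo "${state:-unknown}"
}

# Usage inside a running X session:
#   glxinfo | direct_rendering_state
```

If the freeze stops once this reports "No", that points toward the 3D code path rather than 2D rendering or DPMS.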

Problem: Freezes when screensaver or video player changes DPMS settings

Display Power Management Signaling (DPMS) allows controlling the standby, suspend, and off time for your video monitor. Various applications use this to do things like prevent the screensaver from turning on while watching a movie, or to power off the monitor after it has been idle a while.

Freezes that occur when the machine has been idle, and that aren't due to OpenGL screensavers, may indicate bugs in how DPMS is working. Freezes that seem associated with exiting applications or switching to or from full screen may suggest that the application changed DPMS settings and triggered a bug.

You can manually trigger the screensaver and DPMS using the xset command line tool. To activate the screensaver:

 sleep 1; xset s activate

Or to turn the monitor off:

 sleep 1; xset dpms force off

You can also use standby, suspend, or on in place of off.
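To try reproducing a DPMS-related freeze, the xset commands above can be strung together so that each power state is forced in turn. This is a minimal sketch, not an established test procedure: the function name `cycle_dpms` and the `XSET` override variable are hypothetical, and when run over SSH it assumes `DISPLAY` points at the affected X session (for example `DISPLAY=:0`).

```shell
# Sketch: force each DPMS state in turn, pausing between transitions.
# XSET is a hypothetical override so the loop can be dry-run with "echo".
cycle_dpms() {
    pause="${1:-5}"                 # seconds to stay in each state
    xset_cmd="${XSET:-xset}"
    for state in standby suspend off on; do
        echo "forcing DPMS state: $state"
        "$xset_cmd" dpms force "$state" || return 1
        sleep "$pause"
    done
}
```

Running `XSET=echo; cycle_dpms 0` prints the commands without touching the display. If the machine freezes partway through a real run, the last "forcing DPMS state" line shows which transition to mention in the bug report.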

Another workaround is to disable DPMS in xorg.conf, by adding an option to your Monitor section:

Section "Monitor"
...
        Option  "DPMS" "Off"
EndSection

Problem: Log shows "[mi] EQ overflowing" and X freezes

This message means the server's input event queue has filled up because the server has stopped processing events, most often because it is stuck waiting on a locked-up GPU.

This is a particularly common failure mode for the nouveau driver. The nouveau kernel module has minimal GPU hang checking, and a hang will often result in the X driver spinning until the server dies with the EQ overflow. These backtraces tend to look like:

[mi] EQ overflowing. The server is probably stuck in an infinite loop.

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x28) [0x4a3248]
1: /usr/bin/X (mieqEnqueue+0x1f4) [0x4a2ac4]
2: /usr/bin/X (xf86PostMotionEventP+0xc4) [0x47cea4]
3: /usr/bin/X (xf86PostMotionEvent+0xa9) [0x47d049]
4: /usr/lib/xorg/modules/input/synaptics_drv.so (0x7fc8e718a000+0x39d4) [0x7fc8e718d9d4]
5: /usr/lib/xorg/modules/input/synaptics_drv.so (0x7fc8e718a000+0x5f48) [0x7fc8e718ff48]
6: /usr/bin/X (0x400000+0x6fca7) [0x46fca7]
7: /usr/bin/X (0x400000+0x11d1f3) [0x51d1f3]
8: /lib/libpthread.so.0 (0x7fc8ec192000+0xf8f0) [0x7fc8ec1a18f0]
*** Begin section common to nouveau GPU hang traces ***
9: /lib/libc.so.6 (ioctl+0x7) [0x7fc8eaf4a197]
10: /lib/libdrm.so.2 (drmIoctl+0x23) [0x7fc8e94fb5b3]
11: /lib/libdrm.so.2 (drmCommandWrite+0x1b) [0x7fc8e94fb83b]
12: /lib/libdrm_nouveau.so.1 (0x7fc8e8ebd000+0x2fbd) [0x7fc8e8ebffbd]
13: /lib/libdrm_nouveau.so.1 (nouveau_bo_map_range+0xfc) [0x7fc8e8ec01bc]
14: /usr/lib/xorg/modules/drivers/nouveau_drv.so (0x7fc8e90c3000+0x577d) [0x7fc8e90c877d]
*** End section common to nouveau GPU hang traces ***
15: /usr/lib/xorg/modules/libexa.so (0x7fc8e824b000+0x44d7) [0x7fc8e824f4d7]
16: /usr/lib/xorg/modules/libexa.so (0x7fc8e824b000+0x71db) [0x7fc8e82521db]
17: /usr/lib/xorg/modules/libexa.so (0x7fc8e824b000+0x4b31) [0x7fc8e824fb31]
18: /usr/bin/X (0x400000+0xa5788) [0x4a5788]
19: /usr/bin/X (ChangeWindowAttributes+0x2fd) [0x45601d]
20: /usr/bin/X (0x400000+0x305e4) [0x4305e4]
21: /usr/bin/X (0x400000+0x30c3c) [0x430c3c]
22: /usr/bin/X (0x400000+0x261aa) [0x4261aa]
23: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fc8eae8ac4d]
24: /usr/bin/X (0x400000+0x25d59) [0x425d59]

dmesg is likely to have a message like

[drm] nouveau 0000:02:00.0: PFIFO_DMA_PUSHER - Ch 1

Neither of these is particularly helpful. Reporter-supplied details of when the freeze occurs are necessary for a useful bug report.

Problem: Log shows something about ring buffers and I830WaitLpRing (-intel only)

The ring buffer is the chunk of memory that contains commands we send down to the GPU. A WaitLpRing bug is generally a GPU hang, which can be caused by sending the GPU a bad instruction or address.
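The log signatures named in this section and the previous one can be searched for in one pass. The sketch below greps an Xorg log and a saved dmesg dump for the strings quoted on this page. The function name `scan_hang_signatures` is hypothetical, and the default file names are assumptions.

```shell
# Sketch: search saved logs for the GPU hang signatures quoted on this page.
# $1: Xorg log (default /var/log/Xorg.0.log), $2: saved dmesg output.
# Prints matching lines with line numbers; returns 0 if anything matched.
scan_hang_signatures() {
    xorg_log="${1:-/var/log/Xorg.0.log}"
    dmesg_file="${2:-dmesg.txt}"
    found=1

    if [ -r "$xorg_log" ] && \
       grep -n -e 'EQ overflowing' -e 'I830WaitLpRing' "$xorg_log"; then
        found=0
    fi

    if [ -r "$dmesg_file" ] && \
       grep -n -i -e 'GPU lockup' -e 'PFIFO_DMA_PUSHER' "$dmesg_file"; then
        found=0
    fi

    return "$found"
}
```

A match only confirms which family of hang this is. The conditions under which the freeze occurs are still needed for a useful report.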

Problem: Freeze began after a system update

Regression bugs that freeze the system reliably are ironically the most productive to solve.

If you updated your system and it started to freeze, revert the package updates one at a time, most recent first, until the freezes stop occurring.

The first thing to try is booting an earlier kernel. Hold down the left shift key during boot, so that the grub bootloader menu comes up. Look through the set of available kernels for one you used prior to the update; boot that and attempt to reproduce the freeze. If it does not freeze, then you now have a "Good" and "Bad" kernel and can proceed with a kernel bisection search (https://wiki.ubuntu.com/Kernel/KernelBisection) to isolate what patch caused the failure. (Yes, compiling kernels sounds intimidating and time consuming, but stick with it - the process is well documented and it has a *very* high likelihood of narrowing it to a specific cause!)
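To see which kernels are available to pick from the grub menu, and which one is currently running, something like the following can be used. It is a sketch that assumes the usual Ubuntu layout of one `/boot/vmlinuz-<version>` image per installed kernel. The function name `list_installed_kernels` is hypothetical.

```shell
# Sketch: list installed kernel versions and mark the running one.
list_installed_kernels() {
    bootdir="${1:-/boot}"
    running=$(uname -r)
    for image in "$bootdir"/vmlinuz-*; do
        [ -e "$image" ] || continue      # glob matched nothing
        version=${image##*/vmlinuz-}
        if [ "$version" = "$running" ]; then
            echo "$version (currently running)"
        else
            echo "$version"
        fi
    done
}
```

Noting the "Good" and "Bad" versions from this list gives the two endpoints for the bisection.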

If booting an earlier kernel didn't do the trick, /var/log/dpkg.log lists the other packages that were updated, in date order. Older versions of debs are often cached for a time in your /var/cache/apt/archives/ directory, or can be located on launchpad with a little digging. Go to https://launchpad.net/ubuntu/+source/<suspect-package>/, click on "View full publishing history", find and click on the version you want, under Builds click on the appropriate architecture for your system, then under "Built files" download and install any/all .deb files you need.

While a variety of packages *might* cause a freeze regression, the most likely culprits are: kernel, mesa, libdrm, xserver-xorg-video-[intel|ati|nouveau], xorg-server, compiz, unity, nux, and kwin. Generally you can rule out anything not graphical or X-related in nature.
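The search through dpkg.log can be narrowed to the likely culprits listed above. The sketch below assumes the usual dpkg.log line layout (date, time, action, package, old version, new version). The package-name prefixes are an assumption about how the culprits map onto Ubuntu binary package names, and the function name `list_graphics_upgrades` is hypothetical.

```shell
# Sketch: print upgrades of graphics-related packages from a dpkg log,
# in the order they happened, as "date time package old -> new".
list_graphics_upgrades() {
    log="${1:-/var/log/dpkg.log}"
    # Assumed prefixes for the culprits named above; adjust as needed.
    pattern='^(linux-image|libgl1-mesa|libdrm|xserver-xorg|compiz|unity|nux|libnux|kwin)'
    awk -v pat="$pattern" \
        '$3 == "upgrade" && $4 ~ pat { print $1, $2, $4, $5, "->", $6 }' "$log"
}
```

Each line of output names a candidate to revert, together with the old version to look for in /var/cache/apt/archives/ or on launchpad.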

Stock Reply for "random freeze" bugs

Thanks for reporting this bug and helping to make Ubuntu better. What you have described is a generic X freeze. It could be caused by any number of things, and you need to take some additional steps to provide a complete report.

When did you upgrade to this version of Ubuntu? When did you first notice the freezes occurring?

How frequently do the freezes occur? How many per day would you say you experience?

List the applications you typically have open at the time of the freeze.

Think back to the last few times it froze. What activities were you doing in each of those times?

Do you have compiz enabled? Does the issue go away if you disable it?

If your system is a laptop, do you suspend/resume it? Had you resumed at some point prior to the freezes?

Finally, and most importantly, please collect a GPU dump after reproducing the bug. Packages to install and directions for doing this are available at:

https://wiki.ubuntu.com/X/Troubleshooting/Freeze#How%20to%20Get%20a%20Batchbuffer%20Dump%20(-intel%20only)

With the GPU dump in hand, we will be able to upstream this bug.

For more tips on troubleshooting freeze bugs, please refer to this link:

https://wiki.ubuntu.com/X/Troubleshooting/Freeze

Note: Intel 8xx Chipsets

Please see this page for additional information about GPU freezes with Lucid on Intel 8xx chipsets.

X/Troubleshooting/Freeze (last edited 2017-11-18 18:07:07 by penalvch)

- Revision 30 as of 2010-09-12 23:40:18
  Size: 17050
  Editor: 2
  Comment: add SSH idea from sconklin
+ Revision 31 as of 2011-05-02 22:31:48
  Size: 12033
  Editor: static-50-53-79-63
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 9:
- * It can be hard to tie the problem to an exact test case; it seems to occur "randomly"
+ * Error messages such as "GPU lockup" are (sometimes) present in your `dmesg` output
 Line 12:
- * A backtrace appears in Xorg.0.log or elsewhere - most of the time this indicates a crash, not a freeze
 * Screen blanks to a solid color (See [[X/Troubleshooting/BlankScreen]] instead)
+ * A backtrace appears in Xorg.0.log - most of the time this indicates a crash, not a freeze.  Collect a [[https://wiki.ubuntu.com/X/Backtracing|full backtrace]]
 * X seems to be working, but the monitor appears to just be "off" (See [[X/Troubleshooting/BlankScreen]] instead)
 Line 15:
- * CPU load is high.  This indicates a performance issue rather than a freeze
+ * X CPU or memory load is high, making system laggy or freeze up.  This usually indicates a client application has lost its marbles.
 Line 21:
-X can freeze for a number of different reasons.  Unfortunately, the symptoms are extremely similar, so it can be hard to determine if your freeze is the same as someone else's.  Many bugs get incorrectly duped as a result.
+In general, most freezes are due to Graphical Processor Unit (GPU) lockups.  GPU's have registers that the driver interacts with to produce graphical effects; if the driver interacts incorrectly, the GPU can get stuck and require power cycling to come back.
 Line 23:
-It is also not unusual to have two or more different freeze bugs.  This makes debugging hard.  You may find a fix for the first freeze, but since the second freeze still happens, you can't easily tell that you fixed one.
+Some GPU lockups are caused by triggerable conditions, and are easily reproducible.  Others are situational and "tend" to occur with certain programs loaded, certain load levels, or certain periods of time passing.  Still others are seemingly random and impossible to tie to any definite set of preconditions.  Knowing which of these three classes your bug fits in can provide a clue, as different types of driver errors can lead to the different classes.
 Line 25:
-In general, most freezes are due to Graphical Processor Unit (GPU) lockups.  The GPU is the hardware chip that does graphics processing.  GPU's have registers that the driver interacts with to produce graphical effects; if the driver interacts incorrectly, the GPU can get stuck into an error state that it cannot escape except by power cycling.
+GPU lockups are always handled as driver-specific bugs.  Typically the source of the error is the handling of memory or command registers, graphics state, or other parameters of the hardware that the driver is responsible for.  Often with the open source drivers (-intel, -ati, and -nouveau) the bug requires fixed in the kernel's drm driver, thus many "X freeze" bugs technically are actually kernel bugs.
 Line 27:
-GPU layouts vary from model to model, and errors typically occur because a change was not adequately tested across a range of models before it was committed.  So for instance, if a fix solves a bug by poking the Frobnitz register at address 0x11111111 on chipset A, but on chipset B Frobnitz is at address 0x22222222, unless the fix is limited to only be done on chipset A it could cause a GPU lockup on B.
+== Reporting GPU lockup Bugs ==
 Line 29:
-Often, freezes seem to occur randomly, but in truth they're hardly even truly random.  Freezes represent a tangible bug that exists in some particular section of code; that code is executed under certain conditions.  Often there are methods to bypass the code in question (which can also give good clues for where to look for the bug) such as disabling DRI or Compiz, turning off DPMS or the screensaver, or avoiding use of Xv when playing video.
+Reproduce the freeze, and with your system frozen ssh into it (over ethernet) and collect:
 Line 31:
-== Reporting Freeze Bugs ==
+ * `dmesg > dmesg.txt`
 * /var/log/Xorg.0.log
 * /sys/kernel/debug/dri/0/i915_error_state [Intel graphics only]
-Line 33:
+Line 35:
- * When did you first notice it?  Did you change any settings (Desktop Effects?) or update your system prior to first noticing it?
+Important questions to answer in your report:
-Line 35:
+Line 37:
- * What frequency does it occur?  Just once?  Hourly?  Daily?
+ * Have you experienced just one lockup, or have you had a series of these lockups?
   * If you've had several, how often does it occur?  Every few hours?  Once or twice a day?  Couple times a week?
-Line 37:
+Line 40:
- * Try to determine actions which reproduce it or make it more/less likely to reproduce.
   * Open lots of applications?
   * Heavy switching between desktops or vts
   * Suspend/resume
+ * When did you first notice it?
   * Shortly after upgrading?
   * After updating?
   * After changing compositing (Desktop Effects) settings?

 * Under what conditions does it seem most likely to reproduce?
   * Only at boot time?
   * When resuming from suspend or hibernate?
   * Only when using compositing (Unity, et al)
   * When changing resolution or enabling/disabling monitors
   * When the screensaver (or power saving mode) kicks in
-Line 42:
+Line 52:
- * Include a Batchbuffer Dump.  On -intel this is REQUIRED.

=== Advanced: How do I tell if my GPU has really locked up? ===
A non-updating screen is not an infallible sign of an X freeze.  Problems in a compositing manager (compiz, gnome-shell, KWin, mutter, etc) can also result in a non-updating screen without being an X problem.  This is particularly true for OpenGL compositing managers like Compiz, KWin, or Mutter, since they can also rely on sync-to-vblank notifications to trigger screen updates.

If you are able to SSH into the machine when screen is frozen it is possible to gather some extra information to pin down the problem as a GPU hang.

==== Intel ====
The Intel drivers have the most comprehensive GPU hang detection, reporting, and recovery system.  If apport is enabled (see [[https://wiki.ubuntu.com/X/Troubleshooting/Freeze#How to Get a Batchbuffer Dump (-intel only)|this section]] below), then a GPU hang will be automatically detected and a bug-report created.

To manually discover whether an Intel GPU has hung, the files in {{{/sys/kernel/debug/dri/0/}}} contain this information.  Specifically:
 * {{{/sys/kernel/debug/dri/0/i915_error_state}}} contains the last GPU error detected.
 * {{{/sys/kernel/debug/dri/0/i915_wedged}}} indicates whether the driver has detected that the GPU is “wedged”, which means the GPU is in a state where it cannot make any progress.
 * {{{/sys/kernel/debug/dri/0/i915_ringbuffer_info}}} contains information about the buffer of commands waiting for the GPU to execute.  The ACTHD field contains the address of the most recently completed command - if this doesn't change it indicates that the GPU is not executing new instructions.

==== Radeon ====
The Radeon drivers also have a GPU hang detection and recovery system.  When the driver discovers that the GPU hasn't made progress on its instructions it attempts to reset the card and prints a “GPU lockup” message to {{{dmesg}}}.

==== Nouveau ====
The Nouveau drivers have only a basic GPU hang detection and recovery system.  In some cases the nouveau driver can detect a GPU hang and attempt to reset the card.  However, the kernel module will mostly either not detect GPU hangs or not recover from them, and X will notice (see the [[https://wiki.ubuntu.com/X/Troubleshooting/Freeze#Problem:  Log shows "[mi] EQ overflowing" and X freezes|[mi] EQ Overflowing]] section below).
+   * Switching between desktops
   * When performing a specific sequence of actions (List them!)
-Line 65:
+Line 56:
-=== How to Get a Batchbuffer Dump (-intel only) ===

In Lucid, apport will automatically detect and report freeze bugs and take care of collecting a batchbuffer dump and other assorted information that upstream needs.  By default Apport is disabled in stable Ubuntu releases so it has to be activated temporarily:
{{{
sudo service apport start force_start=1
}}}

Apport may not be able to capture a dump on every freeze, so try reproducing the freeze a few times and see if apport will collect it.

For kernels 2.6.32 (Lucid) and 2.6.33, the GPU resets itself after a hang and therefore a batchbuffer dump is not useful since it only captures the reset state.
The drm-intel-next mainline kernel builds beginning with 2010-02-24 available [[http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/|here]] have a kernel patch which will automatically record the failed batchbuffer information in `/sys/kernel/debug/dri/0/i915_error_state` when the GPU resets. Install and boot with one of these kernels, reproduce the error, and attach a copy of the `/sys/kernel/debug/dri/0/i915_error_state` file to the bug report. In order to have consistent logs, we would also like attached `/var/log/Xorg.0.log` and the output of `dmesg` from the same time. We hope to capture this information automatically soon.
+The drm-intel-next mainline kernel builds beginning with 2010-02-24 available [[http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/|here]] have a kernel patch which will automatically record the failed batchbuffer information in `/sys/kernel/debug/dri/0/i915_error_state` when the GPU resets.
-Line 79:
+Line 60:
-See [[http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=commit;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e|this page]] for more information about the batchbuffer dump patch included in the mainline kernels.

=== Narrow Subsystem it Occurs in ===

Often lockups occur due to code in a specific subsystem within the xserver or video driver.  You can sometimes narrow the problem down usefully by testing with different things turned off.  This is done via your xorg.conf.  Common things to test include (try each one at a time!):

 * Option "AccelMethod" "xxx" - Try "XAA", "EXA" (ignored on -intel > 2.8.0)
 * Option "Accel" "Off" - turns off the 2D acceleration (ignored on -intel except for i810 and i815 chipsets)
 * Option "DRI" "Off" - turns off the 3D acceleration
 * Option "AIGLX" "Off" - turns off OpenGL indirect rendering acceleration
 * Option "PM" "Off" - turns off power management events
 * Option "NoMTRR" - turns off Memory Type Range Register support, which greatly improves performance so is usually on, but some hardware has buggy support for it

Other options can be found in {{{man xorg.conf}}}, {{{man intel}}}, {{{man radeon}}}, et al.
-Line 96:
+Line 62:
-== Problem:  Freezes right after entering login credentials ==

By default, Ubuntu is typically configured to use Desktop Effects in your logged in session.  It does not enable Desktop Effects for the login screen itself, though.  Thus, if you never see freezes with the login screen itself, but always right after logging in, that can suggest that you're experiencing a freeze bug in the 3D system triggered by compiz or kwin coming on.

The standard way to disable Desktop Effects in the menu is via {{{System>Preferences>Appearance}}}, however if your freeze happens 100% of the time, it may be tricky to get to this menu!  Also, it only applies for that user account.

A crude but effective brute force method is to just make compiz non-executable:

{{{
 sudo chmod a-x /usr/bin/compiz
}}}

You can always re-enable it later like this:

{{{
 sudo chmod a+x /usr/bin/compiz
}}}

Note that if you get a compiz update when updating Ubuntu, the update may fix the permissions.  A better long-term work-around in such a case would be to uninstall compiz entirely.

Another approach is to leave compiz as is, and just disable compositing or DRI in xorg.conf (see below).
-Line 211:
+Line 156:
-If you updated your system and it started to freeze, start reverting package updates backwards until the freezes stop occurring.  {{{/var/log/dpkg.log}}} lists the packages it has updated in order.  Older versions of debs are often cached for a time in your /var/cache/apt/archives/ directory.
+If you updated your system and it started to freeze, start reverting package updates backwards until the freezes stop occurring.
-Line 213:
+Line 158:
+The first thing to try is booting an earlier kernel.  Hold down the left shift key during boot, so that the grub bootloader menu comes up.  Look through the set of available kernels for one you used prior to the update; boot that and attempt to reproduce the freeze.  If it does not freeze, then you now have a "Good" and "Bad" kernel and can proceed with a [[https://wiki.ubuntu.com/Kernel/KernelBisection|kernel bisection search]] to isolate what patch caused the failure.  (Yes, compiling kernels sounds intimidating and time consuming, but stick with it - the process is well documented and it has a *very* high likelihood of narrowing it to a specific cause!)
-Line 214:
+Line 160:
-== Problem:  Freeze began after upgrading from an older version of Ubuntu ==
+If booting an earlier kernel didn't do the trick, {{{/var/log/dpkg.log}}} lists other packages it has updated in date order.  Older versions of debs are often cached for a time in your /var/cache/apt/archives/ directory, or can be located on launchpad with a little digging.  Go to https://launchpad.net/ubuntu/+source/<suspect-package>/, click on "View full publishing history", find and click on the version you want, under Builds click on the appropriate architecture for your system, then under "Built files" download and install any/all .deb files you need.
-Line 216:
+Line 162:
-Due to the large number of packages updated in an upgrade, it's impractical to revert packages step by step as above.  Also, with these bugs it is more likely the error was introduced upstream.

If the freeze occurs whether 3D is enabled or not, then the problem may be in your video driver.  If it only occurs when 3D is used, then the bug may be in {{{mesa}}}.

These regressions can be analyzed through ''bisection''.  Build and install the older, working version of the video driver or mesa and verify that when downgrading the broken system to those versions that the problem goes away.  From there, use [[X/Bisecting]] techniques to narrow in on the specific change that caused the regression.
+While a variety of packages *might* cause a freeze regression, the most likely culprits are:  kernel, mesa, libdrm, xserver-xorg-video-[intel|ati|nouveau], xorg-server, compiz, unity, nux, and kwin.  Generally you can rule out anything not graphical or X-related in nature.
