FixingCrashes


Fixing an X crash is technically challenging, but once you've mastered the technique it can be a very time-efficient way to bring a lot of benefit to Ubuntu. X crashes are particularly troublesome for users, so your fix will relieve a lot of people's angst about Linux.

In this tutorial, I'll walk you through how to understand backtraces, the common kinds of pointer failures that cause crashes, and where to locate the crashing code in the X codebase. I'll go through how to code up a fix, how to prepare and submit patches to Ubuntu and upstream, and how to put your patch into a PPA for others to test.

What you need to know

  • Working knowledge of C (particularly how pointers work)
  • Creating patches using diff

Anatomy of a backtrace

The meat and potatoes of crash fixing is the backtrace, so let's take a look at a typical one:

Thread 1 (process 2152):
#0  miSpriteComputeSaved (pDev=0x8e84044, pScreen=<value optimized out>)
    at ../../mi/misprite.c:1061
        y = 544
        pCursorInfo = (miCursorInfoPtr) 0x8e84a80
#1  0x08123896 in miSpriteSaveUnderCursor (pDev=0x8e83f60, pScreen=0x8cfe428)
    at ../../mi/misprite.c:971
        pScreenPriv = (miSpriteScreenPtr) 0x8cde438
        pCursorInfo = <value optimized out>
#2  0x08123e64 in miSpriteBlockHandler (i=0, blockData=0x0, 
    pTimeout=0xbf858dac, pReadmask=0x81f7440) at ../../mi/misprite.c:511
        pScreen = (ScreenPtr) 0x8cfe428
        pDev = (DeviceIntPtr) 0x8e83f60
        pCursorInfo = <value optimized out>
#3  0x001deb9f in I810BlockHandler (i=0, blockData=0x0, pTimeout=0xbf858dac, 
    pReadmask=0x81f7440) at ../../src/i810_video.c:1163
        pScreen = <value optimized out>
        pI810 = (I810Ptr) 0x8cda588
        pPriv = (I810PortPrivPtr) 0x8d41544
#4  0x0817c3f3 in AnimCurScreenBlockHandler (screenNum=0, blockData=0x0, 
    pTimeout=0xbf858dac, pReadmask=0x81f7440) at ../../render/animcur.c:222
        pScreen = (ScreenPtr) 0x8cfe428
        as = (AnimCurScreenPtr) 0x8d47428
        dev = (DeviceIntPtr) 0x0
        now = 0
        soonest = 4294967295
#5  0x08144c2e in compBlockHandler (i=0, blockData=0x0, pTimeout=0xbf858dac, 
    pReadmask=0x81f7440) at ../../composite/compinit.c:158
        pScreen = (ScreenPtr) 0x8cfe428
        cs = (CompScreenPtr) 0x8d48c30
#6  0x080914e8 in BlockHandler (pTimeout=0xbf858dac, pReadmask=0x81f7440)
    at ../../dix/dixutils.c:384
        i = 1

More of the backtrace is available in bug 485460.

A backtrace is essentially a snapshot taken at the point of the crash. It lists the function in which the crash occurred at (or near) the top, then the function that called that function, then the function that called *that* function, and so on down to the outermost frame of the stack. Usually the outermost frame is the main() routine, but not always.
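To get a feel for this ordering, here is a minimal sketch that prints a backtrace of its own call chain. It assumes glibc's backtrace() and backtrace_symbols() from <execinfo.h>; the function names print_stack, inner, middle, and outer are made up for illustration and have nothing to do with the X server.

```c
#include <execinfo.h>   /* backtrace(), backtrace_symbols(): glibc-specific */
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 32

/* Captures the current call stack and prints one line per frame,
 * innermost call first. Returns the number of frames, or -1 on failure. */
int print_stack(void)
{
    void *frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    /* backtrace_symbols() returns one malloc'ed array; only the array
     * itself has to be freed, not the individual strings. */
    char **names = backtrace_symbols(frames, count);
    if (names == NULL)
        return -1;

    for (int i = 0; i < count; i++)
        printf("#%d  %s\n", i, names[i]);

    free(names);
    return count;
}

/* A hypothetical call chain: outer() -> middle() -> inner() -> print_stack(). */
int inner(void)  { return print_stack(); }
int middle(void) { return inner(); }
int outer(void)  { return middle(); }
```

Calling outer() from a main() routine should list print_stack first and work outward through its callers. Function names generally only show up if the binary exports its symbols (with GCC, linking with -rdynamic is the usual way), and an optimizing compiler may merge or drop frames, so build without optimization when experimenting.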

In this example stacktrace, taken from bug 485460, the crash happens in the first function listed, #0 miSpriteComputeSaved.

We often refer to the function names as "symbols". In some backtraces the function names aren't provided, and are shown as just question marks. For example:

StacktraceTop:
 ?? ()
 ?? ()
 ?? ()
 ?? () from /usr/lib/xorg/modules/drivers//intel_drv.so
 ?? ()

When that is all you see, it means we have to ask the user to obtain a full backtrace.
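If you triage a lot of reports, this check is easy to automate. The following is only a sketch under the assumption that each frame arrives as one line of text, as in the StacktraceTop example; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when a frame line names no function, i.e. it starts with
 * "??" after optional leading whitespace. */
static bool frame_is_unresolved(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    return strncmp(line, "??", 2) == 0;
}

/* Returns true when no frame in the trace has a symbol, which is the case
 * where the user has to be asked for a full backtrace. An empty trace
 * also counts, since there is nothing to work with. */
bool needs_full_backtrace(const char *const *frames, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!frame_is_unresolved(frames[i]))
            return false;
    }
    return true;
}
```

For the five-line example above this returns true; as soon as one frame carries a real function name it returns false, and the trace is worth a closer look.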

Fortunately that's not the case with our example. So let's take a closer look at the first symbol:

#0  miSpriteComputeSaved (pDev=0x8e84044, pScreen=<value optimized out>)
    at ../../mi/misprite.c:1061
        y = 544
        pCursorInfo = (miCursorInfoPtr) 0x8e84a80

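The first line gives the frame number (#0 is the innermost frame, where execution stopped), the function name, and the arguments it was called with. The "at ../../mi/misprite.c:1061" part is the source file and line number that was being executed. The indented lines below it are the function's local variables and their values at the time of the crash. "<value optimized out>" means the compiler optimized the variable away, so the debugger could not recover its value.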

Many times a crash handler will kick in, so the first few symbols in the stack trace relate to that. Basically, if you see symbols with "raise", "abort", "assert", "SigHandler", or "backtrace" in their names, those are typically calls related to the crash handler. Skip past them to the first symbol listed after these calls, which is the code that was running when things went wrong. For example,

StacktraceTop:__kernel_vsyscall ()
*__GI_raise (sig=6)
*__GI_abort () at abort.c:92
*__GI___assert_fail (
drm_mmInit () from /usr/lib/libdrm_intel.so.1

We would ignore the first four symbols and start by examining drm_mmInit().
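The same rule of thumb can be written down as a small helper. This is a rough sketch, not a real tool: the marker list comes from the names mentioned above, plus "vsyscall", which is my own addition so that __kernel_vsyscall gets skipped as in the example. The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Substrings that mark crash-handler frames. The first five come from the
 * text; "vsyscall" is an added assumption to cover __kernel_vsyscall. */
static const char *const handler_markers[] = {
    "raise", "abort", "assert", "SigHandler", "backtrace", "vsyscall"
};

/* Returns 1 if the frame text contains any crash-handler marker. */
static int is_handler_frame(const char *symbol)
{
    size_t n = sizeof handler_markers / sizeof handler_markers[0];
    for (size_t i = 0; i < n; i++) {
        if (strstr(symbol, handler_markers[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Returns the index of the first frame that does not look like part of the
 * crash handler, or count if every frame looks like handler code. */
size_t first_interesting_frame(const char *const *symbols, size_t count)
{
    size_t i = 0;
    while (i < count && is_handler_frame(symbols[i]))
        i++;
    return i;
}
```

Note that this matches against the whole line, so a library path that happens to contain one of the markers would be skipped too. A more careful version would compare only the function name.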

Where to find crash bugs

Types of X crashes

For the sake of our understanding, let's split all the different types of X crashes into four broad categories:

  1. Naughty pointers
  2. Bad/hard software configurations
  3. Bad/hard hardware configurations
  4. Corrupted X logic or other issue

That's roughly the order of "easiness to fix" too. We'll be mostly focusing on #1 in this guide, since those are the kind of bugs that can be fixed with hardly any X.org experience. But let's discuss the others briefly so you know what to avoid.

Bad software configuration crashes typically happen because the user has some weird combination of software package versions. For instance, maybe they are running a newish xserver with an ancient kernel. A really common case is where they are running VirtualBox and have installed video drivers compiled against the wrong version of X. Another common case is where they have previously installed the -nvidia or -fglrx driver, perhaps uninstalling the driver but not completely purging it; this can result in really weird crashes, since some proprietary bits are still floating around trying to handle function calls they aren't able to handle. In fact, if you see any evidence in the backtrace or bug report that the user has installed a proprietary driver, just move on to another bug report. Even in the best case, where you could find the cause of the crash, the proprietary drivers are closed source, so you wouldn't be able to make a patch to fix it.

Bad hardware configurations include running on extremely old hardware, extremely new hardware, or hardware being operated in extreme conditions like overheating. Often the work to solve these goes beyond a simple X patch. Generally, if the report describes something so old or so new that neither you nor Google recognizes it, it's probably going to be challenging to work on.

Some X crashes are caused by other unrelated problems in X that then propagate for a while with X in a sickly state, until eventually the gears lock up and a crash occurs. These types of problems are unfortunately quite intricate to sort out, usually requiring more info than just a crash dump. Often you also need to be able to reproduce the issue yourself so you can walk through it in gdb and see what's gone wrong, which means you need to be experienced with gdb.

Here's a checklist of things to watch out for:

  • Description indicates proprietary kernel module? (-fglrx, -nvidia, -psb video drivers)
  • Description shows old kernel version, or a non-standard kernel like the -rt, -pae, server kernel, etc.
  • Description shows VirtualBox or VMware in use

  • XorgConf.txt shows any driver in use other than {intel, radeon, ati, nv, nouveau} or any "weird" settings

  • Extremely old or new hardware you don't recognize and can't find via Google
  • Evidence of hardware issues (overheating, bad memory, etc.)

While bugs that match the above conditions are certainly legitimate issues, they're going to need some understanding of X.org beyond the scope of this tutorial, so just skip these bugs.
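Part of this checklist can be turned into a crude first-pass filter. The sketch below only covers the proprietary driver and virtualization items, and it assumes the bug description is available as a plain string; the keyword list and function names are my own, chosen for illustration. It will over-flag (for example, a report that merely mentions NVIDIA hardware running the nouveau driver), so treat a hit as a reason to read more closely, not as a verdict.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Case-insensitive substring search, written out by hand so the sketch
 * does not depend on the non-standard strcasestr(). */
static bool contains_nocase(const char *text, const char *word)
{
    for (; *text != '\0'; text++) {
        size_t i = 0;
        while (word[i] != '\0' &&
               tolower((unsigned char)text[i]) ==
               tolower((unsigned char)word[i]))
            i++;
        if (word[i] == '\0')
            return true;
    }
    return false;
}

/* Keywords taken from the checklist: proprietary drivers and virtual
 * machines. Kernel flavours and hardware age are not covered here. */
static const char *const skip_words[] = {
    "fglrx", "nvidia", "psb", "virtualbox", "vmware"
};

/* Returns true if the description mentions anything on the skip list. */
bool looks_out_of_scope(const char *description)
{
    size_t n = sizeof skip_words / sizeof skip_words[0];
    for (size_t i = 0; i < n; i++) {
        if (contains_nocase(description, skip_words[i]))
            return true;
    }
    return false;
}
```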

Understanding the crash

  • Null pointer dereference
  • Stack overflow
  • Assertion failed

Some issues to be aware of or look for

  • Were there changes to the code recently? Maybe they introduced a corner case that can lead to a crash
  • Are there any fixes to this section of code upstream but not included in the version of the package the user is on?
  • Looking through the function, does it appear that there are any sections of code which are using pointers before verifying they are not NULL?
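To make the last point concrete, here is a minimal sketch of the pattern to look for and the usual shape of the fix. The types and functions are hypothetical stand-ins, loosely modelled on the kind of pointers that appear in the backtrace above; they are not the real X server structures or code.

```c
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real X server types. */
typedef struct {
    int x;
    int y;
} CursorInfo;

typedef struct {
    CursorInfo *cursor;   /* may be NULL, e.g. for a device without a cursor */
} Device;

/* The kind of code to be suspicious of: both dev and dev->cursor are
 * dereferenced without any check, so a NULL in either place crashes. */
int cursor_y_unchecked(const Device *dev)
{
    return dev->cursor->y;
}

/* The guarded version: verify every pointer before using it and report
 * failure to the caller instead of crashing.
 * Returns 0 on success and stores the value in *y_out, or -1 on failure. */
int cursor_y_checked(const Device *dev, int *y_out)
{
    if (dev == NULL || dev->cursor == NULL || y_out == NULL)
        return -1;

    *y_out = dev->cursor->y;
    return 0;
}
```

Whether returning an error is the right response depends on the surrounding code. Sometimes the NULL is legitimate and the function should simply do nothing; sometimes it points to a bug further up the stack, and the check only hides it.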

Coding a fix

Preparing the patch

  • Creating the patch
  • Packaging the patch
  • Creating a PPA

Soliciting testing

  • Requesting user test PPA
  • Sending upstream for comment

Appendix A: Apport crash hooks


X/FixingCrashes (last edited 2010-01-04 19:34:38 by pool-74-107-129-37)