UnwindStacktraces

Ubuntu Open Week - Unwinding Stacktraces - Emmet Hikory - Sat, May 3, 2008

=== ompaul changed the topic of #ubuntu-classroom to: Ubuntu Open Week | Information and Logs: https://wiki.ubuntu.com/UbuntuOpenWeek | How to ask questions: https://wiki.ubuntu.com/UbuntuOpenWeek/Rules | Ask questions in #ubuntu-classroom-chat, prefaced with "QUESTION:" |See https://wiki.ubuntu.com/UbuntuOpenWeek/JoiningIn to filter out channel noise | Next up: Unwinding Stacktraces - Emmet Hikory (15:00, UTC)

[16:00] <jcastro> Ok, let's get started

[16:00] <jcastro> welcome back everyone to the last day of OpenWeek

[16:00] <jcastro> thanks for coming out on a Saturday! Smile :)

[16:00] <jcastro> The last set of sessions are listed here:

[16:00] <jcastro> https://wiki.ubuntu.com/UbuntuOpenWeek/

[16:00] <jcastro> First up we have Emmet Hikory with "Unwinding Stacktraces"

[16:00] <jcastro> persia: take it away!

[16:01] <persia> Welcome to Unwinding Stacktraces.

[16:01] <persia> The intent of this discussion is to describe a stacktrace, explain how to find bugs with stacktraces, and walk through the process of investigation to be able to find (and fix) the bug.

[16:01] <persia> Feel free to interrupt with specific questions or requests for clarification at any time, and there will be a more general question period at the end.

[16:02] <persia> As a program executes, it typically involved calling some sequence of functions, each of which will call further functions, and so on.

[16:03] <persia> The current state of the running program is stored in the stack, and a stack trace is simply a dump of all the information in the stack at the time the trace is taken.

[16:03] <persia> For the interesting stacktraces, this is usually the point at which the program crashes, but it can be any point.

[16:04] <persia> s an example, at the point where we print the second 'l' in the following simple program: int main() { printf("Hello World\n"); } the stack consists of three interesting frames (and some initialisation)

[16:04] <persia> At the top (#0), we have the actual character display, with a local variable holding 'l'.

[16:04] <persia> The next frame (#1) is printf(), with the local variable "Hello World".

[16:04] <persia> The third (#2) is main(), with no local variables, reprenting our program.

[16:05] <persia> A more complex example may be found from http://launchpadlibrarian.net/9409147/Stacktrace.txt, which is a crash in w3m: a text-mode browser.

[16:06] <persia> As you can see, this stack is a little longer, as w3m is a bit more complicated than the previous example.

[16:06] <persia> Also, this stacktrace represents a real crash, so instead of knowing we are at 'l', we have to discover our location.

[16:07] <jcastro> < wolfger> QUESTION: for us noobs, how is a stack trace taken?

[16:07] <persia> I find it easiest to read these from the bottom, as this goes from the general environment (starting the program), and becomes more specific.

[16:07] <persia> wolfger: The common way to take a stacktrace is within GDB, although in Ubuntu the apport tool helps generate them for bug reports.

[16:08] <persia> I'll talk a little about apport later, but https://wiki.ubuntu.com/Backtrace provides a quick guide to doing it manually.

[16:09] <jcastro> < BonesolTeraDyne> QUESTION: Is a stacktrace the same as the backtrace that the KDE crash handler gives?

[16:09] <persia> BonesolTeraDyne: I'm not actually familiar with the KDE crash handler, but I expect that if it's not the same, it's very close.

[16:10] <persia> So, looking at the w3m example, from the bottom, we can find out what is happening. I've not looked closely at the w3m code, but the following would be my guess:

[16:10] <persia> At frame 15, the program is trying to start

[16:10] <persia> At frame 14 glibc is processing the start request

[16:11] <persia> At frame 13, the main() function starts, and we're in the code

[16:11] <persia> At frame 12, it calls a function to load a file, so I'm guessing this is opening a local file on the machine

[16:11] <persia> At frame 11 is calls "loadSomething", which is probably a generic helper function for the file load

[16:12] <persia> At frame 10, it appears to have discovered this was an HTML file, and so is loading HTML

[16:12] <persia> At frame 9, it is processing the HTML data as a data stream

[16:12] <persia> At frame 8 it is processing a single line of the data

[16:13] <persia> At frame 7, it seems to be processing a table

[16:13] <persia> (At this point, we can say "Aha!", as while it's obvious a browser would be reading HTML, it being a table tells us more specifically where the problem is happening)

[16:14] <persia> Frames 6, 5, 4, and 3 appear to be more table calls. Maybe it's nested tables, or maybe the code is just highly recursive. Determining which usually means reading the code or the data.

[16:14] <persia> Frame 2 appears to be rendering an image within the table

[16:14] <persia> Frame 1 appears to be handling the description of the image

[16:15] <persia> And frame 0 crashes with the call to Strnew_size with the passed argument of (n=256).

[16:15] <persia> From this, we can safely say that w3m has crashed while processing an image nested in a table, which is a lot more information then we had before.

[16:17] <jcastro> < wolfger> QUESTION: how did we determine that about Frame #1? That step lost me

[16:18] <persia> wolfger: I guessed it based on the local variables. I could be completely mistaken: to check for sure, one should check the code of w3m around line 2917 of file.c.

[16:19] <persia> The part that makes me think it is related to handing the various attributes of the image is all the local variables (anchor, img, bold, etc.).

[16:21] <persia> if one is using a stacktrace to debug something, one will likely pass back and forth over two or three frames, most commonly #0 - #2, but sometimes as high as #50 (if the program is compiled against a library with extensive internal signal handling, such as GTK+2)

[16:22] <persia> From the point of view of getting an idea of what is happening, the specifics right near the problem aren't as important, as one will want to look at the code: it's mostly a matter of seeing how far one can get towards #0 before getting lost to save time in reviewing it.

[16:23] <persia> So, in Ubuntu, the majority of stacktraces available for analysis are provided in launchpad by Apport. The apport tool captures the results of every program crash on every user's workstation, and offers the opportunity for the user to report the crash back to Ubuntu.

[16:24] <persia> This is paired with the apport retracing service which processes these crash reports, and generates the nice stacktraces with the symbol names, and the line numbers of the corresponding code.

[16:24] <persia> At some point in the apport process, the apport-crash tag is typically added to the bug, so that looking through all the bugs with the apport-crash tag is often a good way to find stacktraces.

[16:25] <persia> This list is most easily accessible from the Tags list on https://launchpad.net/ubuntu/+bugs

[16:25] <persia> When reviewing this myself, I tend to skip down until I find a package a use a fair bit, as it's a lot easier to understand a problem with a program that one uses than with a program one has to learn at the same time one is debugging.

[16:26] <persia> The reports will have several different attachments that may be useful in understanding the problem, but the "Stacktrace.txt (retraced)" is typically the right one to review.

[16:27] <persia> In cases where there are multiple threads and may be timing issues or interthread communication issues, the "ThreadStacktrace.txt (retraced)" may also be useful, but most of the time even when one reviews this, the problem is actually in Stacktrace.txt.

[16:29] <persia> That about covers the definition, reading, and availability of stacktraces. Are there any questions about these before we look at a debugging example?

[16:30] <persia> Great then!

[16:31] <persia> So, I've picked one I did before to look at, specifically that the hydrogen frun machine crashed when a certain (broken) drumkit was used. This was bug #147476, and the Stacktrace is available from http://launchpadlibrarian.net/9612459/Stacktrace.txt

[16:31] <jcastro> < bran_damage> QUESTION: Are the stacktraces similar no matter what language the code was written in ?

[16:32] <persia> bran_damage: Typically, although they may look a little different.

[16:33] <persia> In an interpreted language like python, the symbols are typically available at runtime, and so the trace can be presented to the user directly. In a compiled language like C, the symbols are typically opimised out, so the stacktrace has to be retraced to make sense.

[16:33] <persia> An unretraced stacktrace looks like http://launchpadlibrarian.net/9032770/Stacktrace.txt

[16:34] <persia> While the codes are meaningful with the debug symbols, without them, the function names are unknown, and the locals are hidden.

[16:36] <persia> Going back to hydrogen, and the process of looking at a stacktrace for debugging.

[16:37] <persia> I usually start around frame #3 or #4, unless it's a lot of signal handlers, in which case, I'll keep going back until I get to some function that looks like it was written for the program that crashed.

[16:38] <persia> For this stacktrace, the mouse movement handling around frames 7&8 provide a nice boundary: anything earlier than this is the user doing something else, not likely related to the crash.

[16:39] <persia> Also, a stacktrace doesn't actually keep a history of what was done, but only the nesting of function calls, so there's not enough information about what else the user did with the mouse before clicking on the button at 0x107bd40 at frame 7.

[16:41] <persia> Frames 5 & 6 look like the callbacks to handle the mouse click, and frames 3&4 look like some QT function to call the actual handler for the button click (and here, I'm just guessing based on the name: if you're unsure, it's always safe to check the code)

[16:42] <persia> This takes us to Frame #2 which is pretty clearly the loader that handles the loadDrumkitBtn click.

[16:45] <persia> When debugging, you'll want to get an idea of the context for the local variables. In this case, we're looking around line 168 of DrumkitManager.cpp (available from http://paste.ubuntu.com/9766/ for those without source handy)

[16:46] <persia> Now since frame #1 is in the Hydrogen::loadDrumkit function, it's actually line 167 of the pastebin.

[16:46] <persia> Looking at what happens previously in this function, we can see that some name gets set (line 160)

[16:46] <persia> It does a loop over the list of drumkits

[16:47] <persia> Compares the selected name to each of the names

[16:47] <persia> When it matches, it sets a cursor to tell the user to wait, and calls loadDrumkit, passing the information about he drumkit from the drumkit list.

[16:49] <persia> Going back to the stacktrace, the intersting parts are drumkitInfo and sselectedDrumkitName, which are the defined local variables. Since this matches what we see in the code, and nothing is obviously wrong, we go to the next frame.

[16:49] <persia> (code for this is available from http://paste.ubuntu.com/9756/ )

[16:50] <persia> Here, we're looking at the function starting at line 2469

[16:51] <persia> Going down, it crashes at an assignment, which ttrial and error showed to be line 2494.

[16:51] <persia> Now, any questions about comparing the code and the stacktrace to debug a problem?

[16:52] <jcastro> no questions so far

[16:54] <persia> Well then, a few random notes, before we're out of time (but please interrupt if you have a question).

[16:54] <persia> I find it typically takes an hour or two to go through a stacktrace carefully, and figure out the problem in the code. The first few can take longer.

[16:56] <persia> Also, sometimes the code is in a language that you might not know well. Don't worry about this: it's better to get a rough idea of what is happening. Even if you can't figure out how to fix it, being able to pinpoint the problem (it crashes at this point, and I think it's because this variable didn't get set the way we wanted), can help someone who does know that language to fix it,

[16:57] <persia> Most of all, be confident, and make a lot of guesses. As the stacktrace shows each line of code, and each function called, it's hard to get lost, even if you guess wrong, but there's usually more detail than you need.

[16:58] <persia> Anyway, thanks for attending. Enjoy the other sessions today, and I'll hope to see you in #ubuntu-bugs or #ubuntu-motu looking at stacktraces and fixes for crashes.

[16:58] <jcastro> thanks persia!

MeetingLogs/openweekhardy/UnwindStacktraces (last edited 2008-08-06 16:41:36 by localhost)