Stacktracing

Dev Week -- Interpreting Stacktraces -- persia -- Fri, Jan 29

UTC

   1 [20:02] <persia> OK.  That means it's time for Interpreting Stacktraces :)
   2 [20:03] <persia> I get distracted with -classroom and -classroom-chat, and really prefer interactive discussions, so I encourage anyone to ask questions in this channel at any time (I prefer to be interrupted)
   3 [20:05] <persia> As an example, we're going to look at a crash in dot.  Anyone wanting to follow closely would benefit from getting a local copy of the cairo and graphviz sources (because we'll be looking at them later)
   4 [20:05] <persia> Either karmic or lucid sources should be OK.
   5 [20:07] <persia> A stacktrace is the overview of the current function call stack for a program, and is extremely useful when determining why a crash happens.
   6 [20:07] <persia> Thanks pleia2 :)
   7 [20:07] <pleia2> welcome :)
   8 [20:08] <persia> When a program executes, the various function calls and instructions in the program are run, typically in some sort of order, which we can see from the source code.
   9 [20:08] <persia> In most programs, there is a main() function, which controls the entire program. In executing it, other functions will be called to do various things.
  10 [20:09] <persia> These functions will then call other functions, and so on, in a nested fashion.
  11 [20:10] <persia> Each function called is placed on the top of the stack when it is called, and when it, in turn, calls another function, that function is placed on the top of the stack.
  12 [20:11] <persia> As a result, at any point, the function on the top of the stack is the currently executing code, and the rest of the stack contains all the nested functions needed to unwind to get back to the operating system starting the program.
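(For illustration only, not from the session: a minimal C program whose call stack, at its deepest point, has main() at the bottom and the innermost function at the top. If crash_here() segfaulted, a stacktrace would show it as frame #0, middle() as frame #1, and main() as frame #2.)

    /* Minimal illustration of a call stack: if crash_here() crashed,
     * frame #0 would be crash_here(), frame #1 middle(), frame #2 main(). */
    #include <stdio.h>

    static void crash_here(void)
    {
        int *p = NULL;
        printf("deepest frame, p is %p\n", (void *) p);
        /* *p = 1;   <- uncommenting this line would produce the SIGSEGV */
    }

    static void middle(void)
    {
        crash_here();
    }

    int main(void)
    {
        middle();
        return 0;
    }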
  13 [20:12] <persia> An example stacktrace (and the one we'll use for debugging today) is http://launchpadlibrarian.net/37459596/Stacktrace.txt
  14 [20:13] <persia> The stack is commonly described in terms of frames.  Frame #0 is the currently executing function, Frame #1 the function that called that function, etc.  In our example, Frame #8 is main() which was called by the operating system when starting the program (sometimes you can see this, depending on the stacktrace)
  15 [20:14] <persia> At each Frame, the stacktrace contains the address of the function, the name of the function, the values of (most of) the arguments to the function, and often some active variable values in the vicinity.
  16 [20:14] <Pendulum> < vishalrao_vbox> QUESTION: who/how generates a textual stacktrace from, say, a core dump?
  17 [20:15] <persia> Well, there's lots of ways to do that :)  The two most common ways to generate stacktraces as used for crashes in Ubuntu are with gdb and apport.
  18 [20:15] <Pendulum> < michae28> QUESTION : what is vicinity ?
  19 [20:16] <persia> "vicinity" is a state of being near in space or relationship.  Things in the same vicinity are close to each other.
  20 [20:16] <persia> Could also be nearby, surrounding, etc.
  21 [20:16] <Pendulum> < vishalrao_vbox> QUESTION: can you briefly mention the commandlines for gdb and apport to generate the stacktraces?
  22 [20:17] <persia> The process of generating a stacktrace with gdb is described at https://wiki.ubuntu.com/Backtrace
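(A rough sketch of the gdb approach described on that wiki page; the wiki page has the authoritative steps, and the program name and arguments below are just examples.)

    # Run the program under gdb, reproduce the crash, then ask for a full backtrace:
    $ gdb dot 2>&1 | tee gdb-dot.txt
    (gdb) run -Tpng input.dot -o out.png
    ...crash...
    (gdb) bt full
    (gdb) quit

    # Or, if a core dump was produced:
    $ gdb /usr/bin/dot core
    (gdb) bt full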
  23 [20:17] <persia> For apport, there are three ways to do it, all dependent on the apport-retracer infrastructure.
  24 [20:18] <persia> The first being to enable apport crash tracking, which automatically uploads the crash information to launchpad and files a bug when a crash happens.
  25 [20:18] <persia> The second being to use apport-bug to file a bug with a previously produced .crash report
  26 [20:19] <persia> The third being to use apport-retrace manually to collect information.
  27 [20:19] <persia> If the details in the crash report are uploaded to launchpad, the apport-retracers will typically use them to generate textual stacktraces with debugging information.
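(A hedged sketch of the manual apport-retrace route; the .crash filename below is just an example of what apport leaves in /var/crash, and apport-retrace(1) documents the full option list.)

    # Print a retraced stacktrace, with debug symbols resolved, to stdout:
    $ apport-retrace --stdout /var/crash/_usr_bin_dot.1000.crash

    # Or open the crash directly in gdb:
    $ apport-retrace --gdb /var/crash/_usr_bin_dot.1000.crash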
  28 [20:20] <persia> So, by using the stacktrace, we can see exactly what the program is doing at the time the stacktrace is taken.
  29 [20:21] <persia> In our example, the program started, did some stuff which resulted in a call to gvRenderJobs(), which did something and called emit_graph(), which did something and called emit_background(), and so forth.
  30 [20:21] <Pendulum> < vishalrao_vbox> QUESTION: if i try to generate a backtrace in gdb and it just shows numbers instead of symbols, how do i get it to show me the function names etc?
  31 [20:22] <persia> To determine the parts that aren't described, we'll need to review the source code along with the stacktrace.
  32 [20:23] <persia> For most programs, the debugging symbols have been stripped out into ddebs.  These contain mappings between the function call addresses and the code.
  33 [20:23] <persia> I can never find the right wiki page that explains how to install them, but it should be navigable from the links in the /topic of #ubuntu-bugs (or maybe someone else can dig it up and share)
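(The usual ddeb setup, sketched from memory; the wiki page linked from #ubuntu-bugs has the authoritative instructions, and the -dbgsym package names below are just the ones relevant to this example bug.)

    # Add the ddeb archive for the running release, then install debug symbols:
    $ echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" \
        | sudo tee /etc/apt/sources.list.d/ddebs.list
    $ sudo apt-get update
    $ sudo apt-get install libcairo2-dbgsym graphviz-dbgsym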
  34 [20:24] <persia> The example stacktrace I've shown comes from Bug #503503
  35 [20:24] <ubottu> Bug 503503 on http://launchpad.net/bugs/503503 is private
  36 [20:24]  * persia fixes that
  37 [20:24] <persia> Right.  Now bug #503503 should be public
  38 [20:24] <ubottu> Launchpad bug 503503 in graphviz "dot crashed with SIGSEGV in cairo_set_dash()" [Medium,New] https://launchpad.net/bugs/503503
  39 [20:25] <persia> This bug was filed using apport, which I find much easier to search.  If you're looking for other crash bugs, using the apport-crash tag is a great way to get a list from which to work.
  40 [20:27] <persia> The title of the bug tells us that there was a segmentation fault when calling the function cairo_set_dash().  These tend to be my favorite sort of crash bugs, because they are usually fairly easy to track down.
  41 [20:27] <persia> http://en.wikipedia.org/wiki/SIGSEGV has a little more information about segmentation faults.
  42 [20:28] <persia> Other than the title, the rest of the bug report is mostly uninteresting at this point.  Depending on what we find, we may want to investigate using some of the other provided information (for instance, if we're crashing in translations, the fact that it's reported with de_LU.UTF8 is important)
  43 [20:29] <persia> Well, there's one other interesting thing, actually, which is the package against which the bug is filed.  This tells us which package we need to get source from in order to interpret the stacktrace.
  44 [20:31] <persia> Also, reviewing the stacktrace, one might notice that the source file identified at frame #0 is /build/buildd/cairo-1.8.8/src/cairo.c rather than some local file.  This notation is used to identify code from other packages.  In this case, it's the cairo package (version 1.8.8), so that's why we wanted to download graphviz and cairo sources earlier (and I'll now assume that you've all done that)
  45 === vishalrao_web is now known as vishalrao
  46 [20:32] <persia> The first step when reviewing a stacktrace is to find somewhere interesting to start.  One could always just look at frame #0 where the crash is actually happening, but this is rarely enough context to actually understand the bug.
  47 [20:33] <persia> Alternately, one could start from the very bottom of the stack, but usually the first several calls are far too general, and it's easy to get frustrated or distracted reading that much source.
  48 [20:33] <persia> It's never possible to know the right point in advance, but I usually pick something near the middle, preferably something that has a lot of variable references defined.
  49 [20:35] <persia> So in this case, frame #4 looks like a sensible place to start.  From here, we'll read up to frame #0 to understand things.  If the problem has already happened at frame #4, we may end up backing up (perhaps even to frame #0) to fix it, but at least we'll understand the targets to search when we look at the more general code.
  50 [20:36] <persia> So, frame #4 is the gvrender_box call at line 851 of gvrender.c .  Since there are no funny pathnames, this is in the graphviz source.
  51 [20:36] <persia> I usually use `find . -name ${SOURCE} -print` from within the source directory to quickly identify the right file.
  52 [20:37] <persia> Running `find . -name gvrender.c -print` shows ./lib/gvc/gvrender.c
  53 [20:37] <persia> Please open that file in your favorite text editor
  54 [20:38] <persia> At line 851 is a gvrender_polygon call, which is giving us trouble.  Unless we've previously worked with this source, we probably don't know what this means, although the name may help us guess.
  55 [20:38] <persia> So, it's best to scroll up to the beginning of the function that is calling gvrender_polygon() and read down again to line 851 to get some understanding.
  56 [20:39] <persia> The top of the function is at line 840, and the entire function consists of setting the values of variable A (type pointf) based on values from variable B (type boxf).
  57 [20:40] <persia> So in our call, we are just passing through the job pointer and the filled boolean, along with A and a number.
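(A paraphrase of the shape of the function just described, not the literal graphviz source; type and member names are approximate.)

    /* gvrender_box(), roughly: fill the pointf array A from the corners of
     * the boxf B, then pass everything straight through to gvrender_polygon(). */
    void gvrender_box(GVJ_t *job, boxf B, int filled)
    {
        pointf A[4];

        A[0] = B.LL;                          /* lower-left  */
        A[1].x = B.UR.x;  A[1].y = B.LL.y;    /* lower-right */
        A[2] = B.UR;                          /* upper-right */
        A[3].x = B.LL.x;  A[3].y = B.UR.y;    /* upper-left  */

        gvrender_polygon(job, A, 4, filled);  /* the call at line 851 (frame #3) */
    }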
  58 [20:41] <persia> Next, we look at frame #3, which is at line 812 of gvrender.c
  59 [20:42] <persia> Again, we scroll up to the beginning of the function (at line 805), and read down.  We know the inputs from the last function we read, so we can understand this a bit better.
  60 [20:44] <persia> The function assigns an engine from the job, apparently succeeds (because of the if), apparently confirms the job has the GVRENDER_DOES_TRANSFORM flag set, and calls gvre_polygon using all the same values that were passed in previously.
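(Again a paraphrase of the shape described above, not the literal source; the engine lookup and member names are approximate, the flag name is as given in the stacktrace discussion.)

    /* gvrender_polygon(), roughly: fetch the render engine from the job,
     * check the GVRENDER_DOES_TRANSFORM flag, and hand the same arguments
     * on to the engine's polygon routine (cairogen_polygon in this crash). */
    void gvrender_polygon(GVJ_t *job, pointf *A, int n, int filled)
    {
        gvrender_engine_t *gvre = job->render.engine;   /* member names approximate */

        if (gvre && (job->flags & GVRENDER_DOES_TRANSFORM))
            gvre->polygon(job, A, n, filled);           /* the call at line 812 (frame #2) */
    }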
  61 [20:44] <persia> Moving to frame #2, we need to switch source files, to gvrender_pango.c
  62 [20:44] <persia> This is in ./plugin/pango/gvrender_pango.c
  63 [20:45] <persia> Line 277 is in the cairogen_polygon function, which isn't quite the name that was called, but seems to take the same inputs we had previously.
  64 [20:46] <persia> Between the start of the function and the call in the stacktrace, we see only that *obj, *cr, and i are defined (and *obj and *cr are initialised).
  65 [20:47] <persia> We then call cairogen_set_penstyle with job and cr.  Looking at the stacktrace, we can see that job is optimised out (but had been 0x90945d0 in previous calls, so is probably the same)
  66 [20:47] <persia> And cr is 0x0.
  67 [20:48] <persia> since we know that cr is a pointer to a variable of type cairo_t, it shouldn't have a value of 0 (this is a null pointer).
  68 [20:48] <persia> Which probably means that cairo_t *cr = (cairo_t *) job->context; failed.
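(A paraphrase of the code path just read in cairogen_polygon(); names are approximate. The point is that cr comes straight from job->context, so a job whose context was never set hands a NULL pointer onwards.)

    /* cairogen_polygon(), roughly: */
    static void cairogen_polygon(GVJ_t *job, pointf *A, int n, int filled)
    {
        obj_state_t *obj = job->obj;             /* member names approximate */
        cairo_t *cr = (cairo_t *) job->context;  /* 0x0 in the crashing run */
        int i;

        cairogen_set_penstyle(job, cr);          /* frame #1 -- cr is NULL here */
        /* ...drawing with cr would follow... */
    }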
  69 [20:49] <persia> Moving to frame #1 is just moving to line 235
  70 [20:50] <persia> Here the function sets *obj again, and based on the value of *obj, selects a way to call cairo_set_dash().  In our case, it's neither a dashed nor a dotted pen.
  71 [20:50] <persia> But we pass dashed anyway :)
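(A paraphrase of cairogen_set_penstyle() as described: whichever branch is taken, cairo_set_dash() gets called, so the NULL cr reaches cairo even for a plain pen; the constant and array names are approximate.)

    static void cairogen_set_penstyle(GVJ_t *job, cairo_t *cr)
    {
        obj_state_t *obj = job->obj;

        if (obj->pen == PEN_DASHED)
            cairo_set_dash(cr, dashed, 2, 0.0);   /* dashed pattern */
        else if (obj->pen == PEN_DOTTED)
            cairo_set_dash(cr, dotted, 2, 0.0);   /* dotted pattern */
        else
            cairo_set_dash(cr, dashed, 0, 0.0);   /* plain pen: "we pass dashed anyway" */
        /* ...pen width etc. would follow... */
    }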
  72 [20:51] <persia> For frame #0, we need to look at the cairo source code.  This is in src/cairo.c (it's easier to tell with foreign source files)
  73 [20:51] <persia> Looking at line 1033, and up to the start of the function, we can see that we're right at the beginning.
  74 [20:52] <persia> Before doing anything else real, the code checks the status of cr.  Since cr is a 0, the expected structure isn't present, so cr->status isn't valid, which causes the segfault.
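(The top of cairo_set_dash() as just described, paraphrased: the first real thing the function does is read cr->status, so with cr == NULL that read is the SIGSEGV. The signature matches cairo's public API; the body here is only a sketch.)

    void cairo_set_dash(cairo_t *cr, const double *dashes, int num_dashes, double offset)
    {
        if (cr->status)   /* cr is NULL: this dereference is the crash */
            return;
        /* ...the rest of the function never runs in the crashing case... */
    }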
  75 [20:52] <persia> This is another thing to fix, because it's best for a library not to crash, even when the wrong data is passed.
  76 [20:53] <Pendulum> < vocx> QUESTION: are stack traces really important nowadays? I feel like much development is done now in programming languages such as Python and C#/Mono, which use their own runtime, and so don't "crash" in the traditional way, and don't produce a stacktrace.
  77 [20:53] <persia> So, after reading the stacktrace, we've identified two things to fix, either one of which would solve the bug.
  78 [20:54] <persia> 1) test to make sure one has a valid value for cr in cairogen_polygon(), and 2) put an exception handler around the status check in cairo_set_dash()
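(A minimal sketch of fix (1), the check inside cairogen_polygon(): bail out, or report an error, instead of calling into cairo with a NULL context. Fix (2) would be a corresponding guard or error return inside cairo itself.)

    cairo_t *cr = (cairo_t *) job->context;
    if (cr == NULL)
        return;    /* or record an error; just don't call into cairo with NULL */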
  79 [20:54] <Pendulum> < vocx> QUESTION: in your experience, where are the crashes most often found? The program itself, or the supporting libraries, like Gtk+, Qt, glib, cairo, poppler, etc.?
  80 [20:55] <persia> Python produces something called a Traceback, which is essentially the same as a stacktrace, except presented in the opposite order (frame #0 is at the bottom).
  81 [20:55] <Pendulum> < michae28> QUESTION : <value optimized out> does it mean that the real value cannot be displayed ?
  82 [20:55] <persia> But the vast majority of system libraries end up being written in compiled languages, so if there is a crash in a library, a stacktrace is likely interesting (even if the call comes from python or mono bindings)
  83 [20:56] <persia> I've yet to find a crash that couldn't be worked around by type checking, value checking, return code checking, or exception handling in client code.
  84 [20:57] <persia> But many crashes *also* expose an issue with the underlying library (as we saw in this case, where both graphviz and cairo could use code improvements).
  85 [20:58] <persia> When the value is optimised out, it usually means that the value isn't relevant in some way, or is redundant.  In our example case, the value for job was passed a couple times, and then dropped.  We could still look at older uses, but it no longer mattered.
  86 [20:58] <Pendulum> < SevenMachines> [QUESTION] Often you see argument memory addresses cropping up in stacktraces, is there a good way to get more information in the stacktrace on the actual object on the heap that was passed?
  87 [20:58] <persia> Not really.  I tend to try to assemble a mental model of the object by passing through several frames.
  88 [20:59] <persia> If the initial guess for starting was wrong, one sometimes needs to go back further to understand the data being passed.
  89 [20:59] <persia> But most of the time, crashes seem to be from a miscommunication between the library author and the application author, where one is expecting something different than what the other is providing.
  90 [21:00] <persia> And these tend to be obvious (like the one we examined, where the code never checked to see if a valid value was being passed to cairo, and when cairo couldn't complete the request, didn't trap the return)
  91 [21:01] <persia> Whether SIGSEGV is the best thing for cairo to return is a different question, but that it wasn't checked doesn't help.
  92 [21:02] <persia> cairo should probably have trapped the SIGSEGV, and returned something with _cairo_set_error, and graphviz should have checked the value of _cairo_set_error (as well as having checked to make sure cr was a valid pointer before using it)
  93 [21:03] <persia> Are there more questions?  I seem to be out of time, but there's no following sessions, so I'm happy to keep answering questions for a bit.
  94 [21:04] <Pendulum> < michae28> QUESTION : to have all this stack information, does the program have to be compiled with the -g option?
  95 [21:05] <persia> I actually don't know.  I've gotten stacktraces just using `gcc foo.c -o foo`, but this may have been from some default options that were set.
  96 [21:07] <persia> Well, thanks everyone for attending.  Feel free to catch me in #ubuntu-bugs if you're digging through any traces and have questions in the future.
  97 [21:08] <persia> And thanks a lot for participating in Ubuntu Developer Week.  I hope some of you will be presenting next time :)
  98 [21:08] <persia> Pendulum: Thanks especially for forwarding questions.
  99 [21:09] <Pendulum> persia: you're welcome :) Thank you for presenting :)
