ErrorTracker

Revision 17 as of 2011-10-06 10:13:06

Rationale

To help Ubuntu reach a standard of quality similar to competing operating systems, developers should spend less time asking for information on individual bug reports, and more time fixing the bugs that affect users most often.

To determine which bugs those are, we should collect crash and hang reports from as many people as possible, before and after release. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. An automated system should then analyze which problems are caused by the same bug. If developers need more information about a particular kind of crash, they should be able to configure the system to automatically retrieve that information when the problem next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

Prior art

Windows Error Reporting is probably the most advanced crash reporting system. As described in “Debugging in the (very) large: Ten years of implementation and experience” (PDF), it uses progressive data collection, where developers can request more than the “minidump” if necessary to understand particular problems. It also automatically notifies users if a software update fixes their problem. And hardware vendors can see crash reports specific to their hardware.

Mac OS X has a CrashReporter system that submits crash data to Apple. As described in Technical Note TN2123, “There is currently no way for third party developers to access the reports submitted via CrashReporter”.

As a result, some Mac software developers have created their own crash tracking systems, such as Adobe’s and Adium’s.

Mozilla uses Breakpad to collect and submit minidumps on the client side, and Socorro to analyze and present them on the server side. Anyone can access crash data at crash-stats.mozilla.com. Laura Thomson has written some blog posts about it.

Android uses Google Feedback: http://android-developers.blogspot.com/2010/05/google-feedback-for-android.html

There is a Google project for cross-platform crash dump capturing.

There is a Django project called 'sentry' for web server error analysis (which has a Cassandra backend, I'm told).

The Launchpad SOA has an active discussion around their requirements at https://dev.launchpad.net/LEP/OopsDisplay. They plan to split out various crash report tools from Launchpad into reusable Python modules. It is unknown at this point whether they'll be generic enough to fit Ubuntu's needs. Launchpad suspects this could fulfill needs for: Ubuntu One, Landscape, Canonical ISD (SSO etc.), Ubuntu; possibly also Drizzle, OpenERP, OpenStack.

Client design

When there is an error, an alert should appear with text and buttons depending on the situation.

	The problem can be reported	Your admin has blocked problem reporting
Part of the OS crashes
An application thread crashes		(no alert shown)
An application hangs
An application crashes

The state of the “Send an error report to help fix this problem” checkbox should persist across errors and across Ubuntu sessions.

If you choose “Show Details”, it should change to “Hide Details” while a text field containing the error report appears below the primary text.

If you choose to send an error report, the alert should disappear immediately. Reports should be sent in the background, with no progress or success/failure feedback. If you are not connected to the Internet at the time, reports should be queued. Any queued reports should be sent when you next agree to send an error report while online.
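The queuing behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: the `ReportQueue` class, the `send` callable, and the `online` flag are hypothetical names, and a real client would also need to persist the queue to disk so that it survives across sessions.

```python
from collections import deque
from typing import Callable, Deque


class ReportQueue:
    """Hypothetical client-side queue for error reports awaiting upload."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        # Assumed uploader: any callable that raises OSError on failure.
        self._send = send
        self._pending: Deque[bytes] = deque()

    def pending_count(self) -> int:
        return len(self._pending)

    def user_agreed_to_send(self, report: bytes, online: bool) -> None:
        """Called when the user agrees to send a report.

        The alert has already been dismissed at this point, so nothing
        here reports progress, success, or failure back to the user.
        """
        self._pending.append(report)
        if online:
            self.flush()

    def flush(self) -> None:
        """Try to send queued reports, oldest first; keep any that fail."""
        while self._pending:
            report = self._pending[0]
            try:
                self._send(report)
            except OSError:
                return  # still queued; retried on the next agreed report
            self._pending.popleft()
```

Note that sending is only attempted when the user agrees to send a report while online, which matches the rule that queued reports go out "when you next agree to send an error report".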

Future work: If a software update is known to fix the problem, replace the primary alert with the software update alert (or progress window, depending on the update policy), with customized primary text. Or, if a workaround exists but no fix is available yet, point the user to a web page (not a wiki page!) with details.

Future work: Automate the communication with the user to facilitate things like leak detection in subsequent runs, without requiring additional interaction with the user. Our current process requires us to ask people who are subscribed to the bug to try a specially instrumented build, which traditionally means a very long feedback loop between the developer and the bug subscribers. We should make it entirely automatic: just wait for the next user who sees the bug to click one "yes, I'd like to help make this product better" button.

Server design

Robert Collins says he has a draft implementation of a core db server using Cassandra for Launchpad's own crash reporting requirements, scalable to a high volume of reports (e.g. 1M/day).

Developer client

The developer client program presents the data stored in the backend server, so that package maintainers, upstream developers, and other interested technical folk can interact with it. This should include:

Graphs
Tables
Detail views of particular error instances
Querying "Which crash reports are related to this bug?"
Statistics
- "Top Changers" for spotting issues early
- "Rate of crashes per user"
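One possible reading of the "Top Changers" statistic is a ranking of crash signatures by how much their report count changed between two periods. The sketch below assumes that interpretation; the function name `top_changers` and the input shape (one signature string per received report) are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_changers(previous: Iterable[str], current: Iterable[str],
                 limit: int = 10) -> List[Tuple[str, int]]:
    """Rank crash signatures by increase in report count between periods.

    `previous` and `current` hold the signature of every report received
    in each period (one entry per report). Returns (signature, change)
    pairs, largest increase first; ties are broken alphabetically.
    """
    before = Counter(previous)
    after = Counter(current)
    # Counter returns 0 for signatures absent from one of the periods.
    changes = {sig: after[sig] - before[sig]
               for sig in set(before) | set(after)}
    ranked = sorted(changes.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

A signature that first appears in the current period ranks by its full count, which is what makes this useful for spotting new issues early.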

The backend provides a strong API for retrieving data of interest. This permits ad hoc queries, custom analysis, and other work beyond the client program's capabilities (such as automated scripting).

Requirements

Must be attentive to privacy issues [Need to elaborate further on this]
A collection of files is gathered client-side and inserted into the crash database record.
Processed versions of files (e.g. retracer output) can be added subsequently.
Some files must be kept private (e.g. core dumps)
Traces from multiple crash reports are algorithmically compared to find exact-dupes and likely-dupes.
Crash reports can be grouped by package, by distro release, or by both.
Statistics are generated to show number of [exact|exact+likely] dupes for each type of crash. Statistics can be provided by package, by distro release, by date range, or a combination.
Bug report(s) can be associated with a given set of crashes.
The user should have some way to check back on the status of their crash report; e.g. have some report ID they can look at to see statistics and/or any associated bug #. E.g. provide a serial number at time of filing that they can load via a web page later on.
For X and kernel crashes (at least), these reports need to be indexable by hardware. That is, we want to be able to answer both "how prevalent are GPU hangs on Intel hardware?" and "on what hardware does this GPU hang appear?". Probably either DMI data or PCI IDs or both are needed for this.
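The exact-dupe and likely-dupe comparison in the requirements above could work along these lines. This is a sketch under stated assumptions, not a chosen algorithm: traces are assumed to be lists of function names with the innermost frame first, and `TOP_FRAMES`, `LIKELY_THRESHOLD`, `signature`, and `classify` are all hypothetical names and values.

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple

TOP_FRAMES = 5           # assumed: only the innermost frames are compared
LIKELY_THRESHOLD = 0.6   # assumed cut-off; would need tuning on real data


def signature(frames: Sequence[str]) -> Tuple[str, ...]:
    """Reduce a trace (function names, innermost first) to a signature."""
    return tuple(frames[:TOP_FRAMES])


def classify(trace_a: Sequence[str], trace_b: Sequence[str]) -> str:
    """Return 'exact', 'likely' or 'distinct' for a pair of traces."""
    sig_a, sig_b = signature(trace_a), signature(trace_b)
    if sig_a == sig_b:
        return "exact"
    # Similarity of the two frame sequences, between 0.0 and 1.0.
    ratio = SequenceMatcher(None, sig_a, sig_b).ratio()
    return "likely" if ratio >= LIKELY_THRESHOLD else "distinct"
```

Because the signature is hashable, exact dupes can be counted by using it as a dictionary key alongside the package and distro release, which would give the grouped statistics listed above.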

Types of errors to handle:

Actual C-style crashes, with core.
Unhandled exceptions, such as you'd get from Python et al
Kernel oops and panics
Intel GPU dump output
dmesg & Xorg.0.log, triggered by GPU hangs
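For the unhandled-exception case, a Python program could capture the error through the standard `sys.excepthook` mechanism. The sketch below is illustrative only: the report keys (`type`, `message`, `traceback`), the helper names, and the `submit` callable are assumptions, not a defined report format.

```python
import sys
import traceback
from typing import Any, Callable, Dict


def build_exception_report(exc_type, exc_value, exc_tb) -> Dict[str, Any]:
    """Collect fields a crash record might need from an exception."""
    return {
        "type": exc_type.__name__,
        "message": str(exc_value),
        "traceback": traceback.format_exception(exc_type, exc_value, exc_tb),
    }


def install_hook(submit: Callable[[Dict[str, Any]], None]) -> None:
    """Route unhandled exceptions to `submit`, then to the previous hook."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        submit(build_exception_report(exc_type, exc_value, exc_tb))
        previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
```

Chaining to the previous hook keeps the normal traceback output on stderr, so the program's visible behavior is unchanged.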

Tasks

Install Breakpad locally and experiment. Evaluate its capabilities and limitations. Does it meet a sufficient number of our requirements? What customizations/modifications would we need to make?
- svn checkout http://socorro.googlecode.com/svn/trunk/ socorro-read-only
- Needs to be packaged for Ubuntu (See http://code.google.com/p/socorro/wiki/Requirements)
- Breakpad uses minidump rather than core dump files, and it appears this requires linking a special breakpad library to each application we want to QA. Would it be feasible to convert it to produce/capture/use core dump files instead?
- Breakpad includes a tool for generating symbols files for applications. Are these consistent with the symbols files we already generate? If not, would this imply we'd need to generate and maintain yet another set of symbols for each app?
Brainstorm further about what info/capabilities should be provided by the developer interface
The Windows Error Reporting (WER) system can treat application hangs as bugs. Through the Windows Shell it detects if a program fails to respond for five seconds. We have hang detection for GPU lockups but not for application freezes. Compiz already has the ability to grey out an unresponsive UI. Investigate whether this is something we could hook into.
WER can detect rootkits and hardware failures such as corrupt memory. Could we do this as well?