ErrorTracker

Revision 12 as of 2011-08-29 21:18:15

Rationale

To help Ubuntu reach a standard of quality similar to competing operating systems, developers should spend less time asking for information on individual bug reports, and more time fixing the bugs that affect users most often.

To determine which bugs those are, we should collect crash and hang reports from as many people as possible, before and after release. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. An automated system should then analyze which problems are caused by the same bug. If developers need more information about a particular kind of crash, they should be able to configure the system to automatically retrieve that information when the problem next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

Prior art

Windows Error Reporting is probably the most advanced crash reporting system. As described in “Debugging in the (very) large: Ten years of implementation and experience” (PDF), it uses progressive data collection, where developers can request more than the “minidump” if necessary to understand particular problems. It also automatically notifies users if a software update fixes their problem. And hardware vendors can see crash reports specific to their hardware.

Mac OS X has a CrashReporter system that submits crash data to Apple. As described in Technical Note TN2123, “There is currently no way for third party developers to access the reports submitted via CrashReporter”.

As a result, some Mac software developers have created their own crash tracking systems, such as Adobe’s and Adium’s.

Mozilla uses Breakpad to collect and submit minidumps on the client side, and Socorro to analyze and present them on the server side. Anyone can access crash data at crash-stats.mozilla.com.

Android uses Google Feedback: http://android-developers.blogspot.com/2010/05/google-feedback-for-android.html

There is a Google project for cross-platform crash dump capturing.

There is a Django project called 'sentry' for web server error analysis (reportedly with a Cassandra backend).

The Launchpad SOA has an active discussion around their requirements at https://dev.launchpad.net/LEP/OopsDisplay. They plan to split out various crash report tools from Launchpad into reusable Python modules. It is not yet known whether these will be generic enough to fit Ubuntu's needs. Launchpad suspects this could fulfill needs for: Ubuntu One, Landscape, Canonical ISD (SSO etc.), Ubuntu; possibly also Drizzle, OpenERP, OpenStack.

Client design

When there is an error, an alert should appear with text and buttons depending on the situation.

	The problem can be reported	Your admin has blocked problem reporting
Part of the OS crashes
An application thread crashes		(no alert shown)
An application crashes
An application hangs

If you choose “Report…”, a secondary dialog should appear on top of the alert.

If you choose “Send”, the “Send more information automatically if requested” checkbox and “Send” button should become insensitive, and the “Privacy Policy” button should be replaced by a progress bar extending from the left margin to just before the “Cancel” button. If you have left “Send more information automatically if requested” checked, the progress bar should be divided among three subtasks: sending the initial report, waiting for analysis (advancing asymptotically, since the wait has no known duration), and sending any further information requested.
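One way to read the progress allocation above, as a rough sketch: give each subtask a fixed share of the bar, and let the waiting phase creep toward the end of its share as time passes. The share weights, the time constant, and the function names below are illustrative assumptions, not part of this specification. The sketch covers only the case where the checkbox was left checked.

```python
import math

# Hypothetical share of the bar given to each subtask (sums to 1.0).
SHARES = {"send_report": 0.4, "await_analysis": 0.2, "send_extra": 0.4}
ORDER = ["send_report", "await_analysis", "send_extra"]


def asymptotic_fraction(elapsed_seconds, time_constant=10.0):
    """Fraction that starts at 0 and approaches 1 as time passes."""
    return 1.0 - math.exp(-elapsed_seconds / time_constant)


def overall_progress(stage, stage_fraction):
    """Map progress within one subtask onto the whole progress bar."""
    if stage not in SHARES:
        raise ValueError(f"unknown stage: {stage}")
    # Clamp so a misbehaving caller cannot push the bar backwards or past its share.
    stage_fraction = min(max(stage_fraction, 0.0), 1.0)
    # Everything before this stage counts as fully done.
    done = sum(SHARES[s] for s in ORDER[:ORDER.index(stage)])
    return done + SHARES[stage] * stage_fraction
```

While waiting for analysis, the client would call `overall_progress("await_analysis", asymptotic_fraction(seconds_waited))`, so the bar keeps moving without ever claiming that the wait has finished.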

Once the problem report is cancelled or completed, the secondary dialog should close. In the primary alert, if the problem report was completed, “You can help fix the problem by submitting an error report.” should change to “Thanks for reporting this problem.”. If you subsequently click “Report…” again, the “Send” button and the “Send more information automatically if requested” checkbox should be insensitive.
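The state carried between the primary alert and the secondary dialog is small enough to model directly. The following is a minimal sketch, assuming a single `completed` flag is all that needs to be remembered; the class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReportDialogState:
    """Hypothetical model of the state shared by the alert and the dialog."""
    completed: bool = False

    @property
    def alert_text(self):
        # Text shown in the primary alert, taken from the description above.
        if self.completed:
            return "Thanks for reporting this problem."
        return "You can help fix the problem by submitting an error report."

    @property
    def send_sensitive(self):
        # After a completed report, "Send" and the checkbox stay insensitive.
        return not self.completed

    def finish(self, cancelled):
        """Record the outcome when the secondary dialog closes."""
        if not cancelled:
            self.completed = True
```

Cancelling leaves the state untouched, so reopening the dialog after a cancel still allows the report to be sent.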

Future work: If a software update is known to fix the problem, replace the primary alert with the software update alert (or progress window, depending on the update policy), with customized primary text.

Server design

Robert Collins says he has a draft implementation of a core database server using Cassandra for Launchpad's own crash reporting requirements, scalable to a high volume of reports (e.g. 1 million per day).

Developer client

The developer client program presents the data stored in the backend server. Package maintainers, upstream developers, and other interested technical people can use it to interact with that data. It should include:

Graphs
Tables
Detail views of particular error instances
Querying "Which crash reports are related to this bug?"

The backend provides an API for retrieving data of interest. This permits ad hoc queries, custom analysis, and other work beyond the client program's capabilities (such as automated scripting).
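As an illustration of the kind of ad hoc query such an API could support, the sketch below answers "Which crash reports are related to this bug?" over an in-memory list. The record fields (`report_id`, `package`, `bug_ids`), the function name, and the sample data are all hypothetical; this page does not define the backend's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class CrashReport:
    # Hypothetical fields; the real schema is not defined on this page.
    report_id: str
    package: str
    bug_ids: set = field(default_factory=set)


def reports_for_bug(reports, bug_id):
    """Return the reports that have been associated with the given bug."""
    return [r for r in reports if bug_id in r.bug_ids]


# Made-up sample data for illustration only.
reports = [
    CrashReport("r1", "xorg", {101}),
    CrashReport("r2", "xorg", {101, 202}),
    CrashReport("r3", "firefox"),
]
related = [r.report_id for r in reports_for_bug(reports, 101)]
```

A script built on the real API would presumably issue the same question to the server rather than filtering a local list.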

Requirements

Must be attentive to privacy issues [Need to elaborate further on this]
A collection of files is gathered client-side and inserted into the crash database record.
Processed versions of files (e.g. retracer output) can be added subsequently.
Some files must be kept private (e.g. core dumps).
Traces from multiple crash reports are algorithmically compared to find exact-dupes and likely-dupes.
Crash reports can be grouped by package, by distro release, or by both.
Statistics are generated to show number of [exact|exact+likely] dupes for each type of crash. Statistics can be provided by package, by distro release, by date range, or a combination.
Bug report(s) can be associated with a given set of crashes.
The user should have some way to check back on the status of their crash report, such as a report ID they can look up to see statistics and/or any associated bug number. For example, provide a serial number at time of filing that they can load via a web page later on.
For X and kernel crashes (at least), these reports need to be indexable by hardware. That is, we want to be able to answer both "how prevalent are GPU hangs on Intel hardware?" and "on what hardware does this GPU hang appear?". Probably DMI data, PCI IDs, or both are needed for this.
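The duplicate detection and grouping requirements above could be prototyped roughly as follows. This is a sketch only: treating a trace as a tuple of function names, calling two traces exact duplicates when every frame matches, calling them likely duplicates when their top few frames match, and the choice of five frames are all assumptions made for illustration, not an algorithm this specification mandates. The record fields and sample data are likewise hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

TOP_FRAMES = 5  # assumed number of leading frames compared for likely dupes


@dataclass(frozen=True)
class CrashRecord:
    # Hypothetical fields for illustration.
    package: str
    release: str
    frames: tuple  # function names, innermost frame first


def exact_key(record):
    """Bucket key for exact duplicates: same package, identical trace."""
    return (record.package, record.frames)


def likely_key(record):
    """Bucket key for likely duplicates: same package, same top frames."""
    return (record.package, record.frames[:TOP_FRAMES])


def bucket_counts(records, key):
    """Number of reports falling into each duplicate bucket."""
    return Counter(key(r) for r in records)


def counts_by_package_and_release(records):
    """Number of reports per (package, distro release) pair."""
    return Counter((r.package, r.release) for r in records)


# Made-up sample data: the third trace differs only below the top five frames.
records = [
    CrashRecord("xorg", "11.10", ("a", "b", "c", "d", "e", "main")),
    CrashRecord("xorg", "11.04", ("a", "b", "c", "d", "e", "main")),
    CrashRecord("xorg", "11.10", ("a", "b", "c", "d", "e", "start", "main")),
]
exact = bucket_counts(records, exact_key)
likely = bucket_counts(records, likely_key)
```

Keeping the two keys as separate functions means the exact and exact+likely statistics come from the same counting code, and a hardware key (for example one built from PCI IDs) could be added in the same way.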

Types of errors to handle:

Actual C-style crashes, with core.
Unhandled exceptions, such as those raised in Python and similar languages
Kernel oops and panics
Intel GPU dump output
dmesg & Xorg.0.log, triggered by GPU hangs
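For the unhandled-exception case, Python exposes `sys.excepthook`, which is called for exceptions that nothing catches. The sketch below shows how a hook could turn such an exception into a small report dictionary before deferring to the default handler. The report fields and the `pending_reports` list are hypothetical stand-ins for whatever the client would actually collect and submit.

```python
import sys
import traceback

pending_reports = []  # hypothetical stand-in for the client's report queue


def build_report(exc_type, exc_value, exc_tb):
    """Summarize an exception as a plain dictionary."""
    return {
        "type": exc_type.__name__,
        "message": str(exc_value),
        "traceback": "".join(
            traceback.format_exception(exc_type, exc_value, exc_tb)
        ),
    }


def reporting_excepthook(exc_type, exc_value, exc_tb):
    """Queue a report, then let Python print the traceback as usual."""
    pending_reports.append(build_report(exc_type, exc_value, exc_tb))
    sys.__excepthook__(exc_type, exc_value, exc_tb)


# Install the hook for the rest of the process.
sys.excepthook = reporting_excepthook
```

C-style crashes, kernel oopses, and GPU dumps cannot be caught this way and would need collection outside the crashing process.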

Tasks

Investigate if Breakpad could be installed and operated on our own systems. If so, does it sufficiently meet our requirements?
Brainstorm further about what info/capabilities should be provided by the developer interface
Identify owner(s) for the development efforts