ErrorTracker

Revision 24 as of 2011-11-24 14:47:48

Rationale

To help Ubuntu reach a standard of quality similar to competing operating systems, developers should spend less time asking for information on individual bug reports, and more time fixing the bugs that affect users most often.

To determine which bugs those are, we should collect crash and hang reports from as many people as possible, before and after release. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. An automated system should then analyze which problems are caused by the same bug. If developers need more information about a particular kind of crash, they should be able to configure the system to automatically retrieve that information when the problem next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

Prior art

Windows Error Reporting is probably the most advanced crash reporting system. As described in “Debugging in the (very) large: Ten years of implementation and experience” (PDF), it uses progressive data collection, where developers can request more than the “minidump” if necessary to understand particular problems. It also automatically notifies users if a software update fixes their problem. And hardware vendors can see crash reports specific to their hardware.

Mac OS X has a CrashReporter system that submits crash data to Apple. As described in Technical Note TN2123, “There is currently no way for third party developers to access the reports submitted via CrashReporter”.

As a result, some Mac software developers have created their own crash tracking systems, such as Adobe’s and Adium’s.

Mozilla uses Breakpad to collect and submit minidumps on the client side, and Socorro to analyze and present them on the server side. Anyone can access crash data at crash-stats.mozilla.com. Laura Thomson has written some blog posts about it.

Android uses Google Feedback: http://android-developers.blogspot.com/2010/05/google-feedback-for-android.html

There is a Google project for cross-platform crash dump capturing.

There is a Django project called 'sentry' for web server error analysis (reportedly with a Cassandra backend).

The Launchpad SOA has an active discussion around their requirements at https://dev.launchpad.net/LEP/OopsDisplay. They plan to split out various crash report tools from Launchpad into reusable Python modules. It is not yet known whether they will be generic enough to fit Ubuntu's needs. Launchpad suspects this could fulfill needs for: Ubuntu One, Landscape, Canonical ISD (SSO etc.), Ubuntu; possibly also Drizzle, OpenERP, OpenStack.

Client design

When there is an error, an alert should appear with text and buttons depending on the situation.

Situation	The problem can be reported	Your admin has blocked problem reporting
Part of the OS crashes		
An application thread crashes		(no alert shown)
An application hangs		
An application crashes		
The state of the “Send an error report to help fix this problem” checkbox should persist across errors and across Ubuntu sessions.
If you choose “Show Details”, it should change to “Hide Details” while a text field containing the error report appears below the primary text.
If you choose to send an error report, the alert should disappear immediately. Reports should be sent in the background, with no progress or success/failure feedback. If you are not connected to the Internet at the time, reports should be queued. Any queued reports should be sent when you next agree to send an error report while online.
If you are using a pre-release version of Ubuntu, and the error report matches an existing Launchpad bug report, a further alert box should appear explaining its status and letting you open the bug report.

Future work: If a software update is known to fix the problem, replace the primary alert with the software update alert (or progress window, depending on the update policy), with customized primary text. Or point them at a web page (not a wiki page!) with details if a workaround exists, but no fix is available yet.

Future work: Automate the communication with the user to facilitate things like leak detection in subsequent runs, without requiring additional interaction with the user. Our current process requires us to ask people who are subscribed to the bug to try a specially-instrumented build, with a traditionally very long feedback loop between the developer and the bug subscribers. We should make it entirely automatic. Just wait for the next user who sees the bug to click one "yes, I'd like to help make this product better" button.

Client implementation

The apport client will write a .upload file alongside a .crash file to indicate that the crash should be sent to the crash database. A small C daemon (currently reporterd) will set up an inotify watch on the /var/crash directory, and whenever one of these .upload files appears, it will upload the corresponding .crash file. It will do this if and only if there is an active Internet connection, as determined by watching the NetworkManager D-Bus API for connectivity events. Otherwise, it will add the report to a queue for later processing.
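The marker-file convention above can be illustrated with a minimal sketch. This is not the reporterd implementation: the real daemon is written in C and reacts to inotify events, whereas this sketch simply scans the directory. The function names, the `online` flag, and the in-memory `queue` list are all assumptions made for illustration.

```python
import os


def pending_uploads(crash_dir):
    """Return paths of .crash files that have a matching .upload marker."""
    pending = []
    for name in sorted(os.listdir(crash_dir)):
        base, ext = os.path.splitext(name)
        if ext != ".upload":
            continue
        crash_path = os.path.join(crash_dir, base + ".crash")
        if os.path.exists(crash_path):
            pending.append(crash_path)
    return pending


def process(crash_dir, online, upload, queue):
    """Upload marked reports when online; otherwise queue them for later.

    `upload` is a hypothetical callable that sends one crash file.
    `queue` is a plain list standing in for the daemon's pending queue.
    """
    for crash_path in pending_uploads(crash_dir):
        if online:
            upload(crash_path)
        elif crash_path not in queue:
            queue.append(crash_path)
```

In the real daemon, the `online` value would come from NetworkManager connectivity events rather than being passed in by the caller.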

We will ensure NetworkManager brings up the interfaces as early as possible, to enable us to file crash reports during boot.

This needs to be a daemon, rather than another path of the apport client code, to account for there not being an Internet connection at the time of the crash and for crashes during boot, when we cannot assume the user will get back to a known-good state to file the report.

The canonical example here is the scenario posed in Microsoft’s Windows Error Reporting paper, where a piece of malware was causing the core desktop application (explorer.exe) to crash. They were still able to receive crash reports, as their client software still submitted reports very early on in the boot process.

The apport crash file will be parsed into an intermediate data structure (currently a GHashTable), with the core dump stripped out, and then converted into BSON to be transmitted in an HTTP POST operation. The server will reply with a UUID for subsequent operations and, optionally, a command for further action. Initially, this will just be a command to upload the core dump.
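As a rough illustration of the parsing step, the sketch below reads a crash file into a dictionary and drops the core dump. It assumes a simplified view of the apport file format ("Key: value" lines, with continuation lines indented by one space) and is not the daemon's actual parser, which is written in C. BSON encoding is left out because it needs a third-party library; the function name and the `skip_keys` parameter are hypothetical.

```python
def parse_crash_report(text, skip_keys=("CoreDump",)):
    """Parse "Key: value" lines into a dict, omitting the keys in skip_keys."""
    report = {}
    key = None
    for line in text.splitlines():
        if line.startswith(" "):
            # Continuation line: it belongs to the most recent key.
            if key is not None and key not in skip_keys:
                value = report[key]
                report[key] = line[1:] if not value else value + "\n" + line[1:]
            continue
        key, sep, value = line.partition(":")
        if not sep:
            # Not a "Key: value" line; ignore it and any continuation lines.
            key = None
            continue
        if key not in skip_keys:
            report[key] = value.strip()
    return report
```

The resulting dictionary is the kind of intermediate structure that would then be serialized and posted to the server.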

A new field is being added to the apport crash file, StacktraceAddressSignature. The server will check for this field, and if it already has a retraced core dump generated from the same signature, it will reply with just the UUID of the crash report entry in the database, indicating that a core dump need not be submitted.

If, however, the server does reply with a request to upload the core dump, it will be sent as zlib-compressed data in an HTTP POST operation.
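The server-side decision described above might look roughly like the following sketch. The in-memory dict and set stand in for the crash database, and the reply layout and the "upload-core" command name are assumptions; the specification does not define them.

```python
import uuid


def handle_submission(report, retraced_signatures, crashes):
    """Store a report and decide whether to ask the client for the core dump.

    retraced_signatures: set of signatures that already have a retraced core.
    crashes: dict standing in for the crash table, keyed by UUID string.
    """
    crash_id = str(uuid.uuid4())
    crashes[crash_id] = report
    signature = report.get("StacktraceAddressSignature")
    if signature and signature in retraced_signatures:
        # A retraced core dump already exists for this signature.
        return {"uuid": crash_id}
    # Hypothetical command name asking the client to send the core dump.
    return {"uuid": crash_id, "command": "upload-core"}
```

A report without a StacktraceAddressSignature field is treated here as unknown, so the core dump is requested; that choice is also an assumption.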

The URLs for posting will be of the form:

- http://crashes.ubuntu.com/submit
- http://crashes.ubuntu.com/550e8400-e29b-41d4-a716-446655440000/submit-core
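A minimal client-side sketch of building these URLs and preparing the compressed core dump request follows. The helper names and the Content-Type header are assumptions, and the request is only constructed here, not sent.

```python
import urllib.request
import zlib

BASE_URL = "http://crashes.ubuntu.com"


def report_url():
    """URL for the initial crash report submission."""
    return BASE_URL + "/submit"


def core_url(crash_uuid):
    """URL for uploading the core dump of an already submitted report."""
    return "%s/%s/submit-core" % (BASE_URL, crash_uuid)


def build_core_request(crash_uuid, core_bytes):
    """Build (but do not send) an HTTP POST carrying the zlib-compressed core."""
    body = zlib.compress(core_bytes)
    return urllib.request.Request(
        core_url(crash_uuid),
        data=body,
        headers={"Content-Type": "application/octet-stream"},  # assumed header
        method="POST",
    )
```

The UUID passed to `core_url` is the one returned by the server in response to the first POST.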

Crash reports will be cleaned up after 14 days, as the system may never be connected to the Internet.
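The 14-day cleanup could be sketched as below. Measuring age by file modification time, and removing .upload markers together with .crash files, are assumptions; the specification only states the retention period.

```python
import os
import time

MAX_AGE_SECONDS = 14 * 24 * 60 * 60  # 14 days, as stated in the specification


def remove_stale_reports(crash_dir, now=None):
    """Delete .crash and .upload files older than 14 days; return their paths."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(crash_dir)):
        if not name.endswith((".crash", ".upload")):
            continue
        path = os.path.join(crash_dir, name)
        if not os.path.isfile(path):
            continue
        # Age is taken from the file's modification time (an assumption).
        if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            os.remove(path)
            removed.append(path)
    return removed
```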

If the reporter daemon crashes, it will write a crash file like any other application. Its upstart job will have the respawn flag set, with a respawn limit in place so that it does not restart endlessly.

If the reporter daemon moves to using apport-unpack to process the crash files, it should gracefully handle -ENOSPC.

Server design

We will use Robert Collins’ oops-repository as the foundation for our crash database. It has been suggested that this can meet Launchpad’s crash reporting requirements, scaling to a high volume of reports (e.g. 1M/day). We will also use the OOPS dictionary format for our crashes.

This will make it easier to integrate with Launchpad’s longer-term plan to offer this as a service for all projects. Launchpad’s offering may be implemented as one big Cassandra cluster in a multi-tenant fashion, or on a per-project basis, feeding to an API.

oops-repository will also provide the API for interacting with the database. This will include operations to post a new crash and potentially ask for more information, upload additional information (such as the core dump), retrieve the full data for a crash (a privileged operation), and update an existing crash report with the retraced data (a partially privileged operation).

We will build a small Django web user interface for management functions on top of this API. The initial implementation will not allow regular developers to access the crash data, as we will not have time in this cycle to address the security concerns around this. Canonical IS will be the interim arbiter of who is able to access this system, which will include at least the release manager.

We will also evaluate Mozilla’s Socorro, to see if it requires less work to meet our longer-term needs, but this will be done as time allows.

Retracing

When a new core dump is submitted to the crash database, it will be written to a column family and the UUID will be added to a RabbitMQ queue for the matching architecture. The queue will also be written to Cassandra, in case the RabbitMQ service fails.

Retracing daemons for each architecture will pull UUIDs off their respective RabbitMQ queues, get the core dump for the UUID from Cassandra, then feed it through apport-retrace.

When a complete trace is generated, it will be added as a row in the crash column family for the relevant UUID. It will also be added to an index column family where the key is the crash signature (StacktraceAddressSignature) and the value is the UUID in the crash column family. In the future, we may expand this to a more complex bucketing algorithm, as necessary.
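The retracing flow can be sketched with in-memory stand-ins: dicts in place of the Cassandra column families and `queue.Queue` in place of the per-architecture RabbitMQ queues. The class, its method names, and the `retrace` callable (standing in for apport-retrace) are hypothetical. Keeping the first UUID seen for a signature is also an assumption.

```python
import queue


class CrashStore:
    """In-memory stand-in for the column families and retracing queues."""

    def __init__(self, architectures):
        self.cores = {}            # UUID -> core dump bytes
        self.crashes = {}          # UUID -> dict of crash columns
        self.signature_index = {}  # crash signature -> UUID
        self.queues = {arch: queue.Queue() for arch in architectures}

    def submit_core(self, crash_id, arch, core):
        """Store the core dump and enqueue its UUID for the architecture."""
        self.cores[crash_id] = core
        self.queues[arch].put(crash_id)

    def retrace_next(self, arch, retrace):
        """Retrace one queued core; raises queue.Empty if nothing is waiting.

        `retrace` takes core bytes and returns (trace, signature).
        """
        crash_id = self.queues[arch].get_nowait()
        trace, signature = retrace(self.cores[crash_id])
        self.crashes.setdefault(crash_id, {})["Stacktrace"] = trace
        # Assumption: the index keeps the first UUID seen for a signature.
        self.signature_index.setdefault(signature, crash_id)
        return crash_id
```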

The retracing daemon systems will each keep a large cache of the debug symbol packages.

Future work

Upstart has inotify job support on its roadmap. If this is implemented, it may allow us to move from an always-running daemon to something spawned by upstart itself as-needed.

The system could be designed either with one single central server instance, to which all error collecting tools for all projects submit data, or could be distributed to separate server instances for each project. There are pros and cons to each approach and it's unclear which is best. Having multiple servers provides flexibility, which could be particularly important for private project use cases, and might make it easier to roll out project-specific customizations or configurations.

Eventually, retracing will be moved entirely into the crash database and provided as a web service for Launchpad to consume. This will remove the need for submitting core dumps to Launchpad at all.

Launchpad will be mined for bugs that share the same signature as crashes in the database. These will be linked into the crash. Once this is in place, oops-repository will be modified to provide an "update available that fixes this issue" response when the respective bug is closed by an upload.

We will investigate using Datastax's Brisk/Enterprise with Pig or Hive to query over existing crash reports.

Hardware information

Upon first successful connection to the Internet, the system will send a basic hardware profile, keyed on a SHA1 of the system UUID and a SHA1 of the DMI tables themselves.
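A minimal sketch of deriving such a key is shown below. How the system UUID and raw DMI tables are read from the machine is left out, and the function name and the return shape (a pair of hex digests) are assumptions.

```python
import hashlib


def hardware_profile_key(system_uuid, dmi_tables):
    """Return (SHA1 of the system UUID, SHA1 of the DMI tables) as hex strings.

    system_uuid: the system UUID as text.
    dmi_tables: the raw DMI table contents as bytes.
    """
    uuid_digest = hashlib.sha1(system_uuid.encode("utf-8")).hexdigest()
    dmi_digest = hashlib.sha1(dmi_tables).hexdigest()
    return uuid_digest, dmi_digest
```

Hashing the UUID rather than sending it directly means the server sees a stable identifier without receiving the raw value.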

This information will be submitted to one of the existing hardware databases. Queries will be possible across the crash database and hardware database. For example, it may be desirable to know what the top compiz crashes are for a particular piece of graphics hardware.

Constant measurement

We will follow the “if it moves, measure it” principle from Etsy, and will employ the Twisted port of their popular StatsD daemon for collecting metrics.

Some examples of data points we may want to capture:

- How long does it take to submit a crash?
- How long does it take to retrace a crash?
- The queue size of the retracer architecture pools.
- The number of rows in each ColumnFamily.
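Data points like these map onto the StatsD line format, where a timer is written as "name:value|ms" and a gauge as "name:value|g", usually sent over UDP. The sketch below formats and sends such packets; the metric names, host, and helper functions are hypothetical, and port 8125 is the conventional StatsD default.

```python
import socket


def timing(name, milliseconds):
    """Format a StatsD timer packet, e.g. for submit or retrace durations."""
    return "%s:%d|ms" % (name, milliseconds)


def gauge(name, value):
    """Format a StatsD gauge packet, e.g. for queue sizes or row counts."""
    return "%s:%d|g" % (name, value)


def send_metric(packet, host="127.0.0.1", port=8125):
    """Send one packet to a StatsD daemon over UDP (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()
```

For example, `send_metric(gauge("retracer.amd64.queue_size", 42))` would report the current depth of a hypothetical amd64 retracer queue.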

As many Canonical projects are moving from Tuolumne to Graphite, we will follow suit and implement the graphing of these statistics on Graphite.

Performance testing

A variety of performance tests will be constructed to validate the architecture of this service. We will answer questions like, “how long does it take to bring up 400 large core dumps and map/reduce over them?”

We will optimize for latency. We will ask Canonical IS’ load testing expert to review this system.

General testing

We will have a complete set of unit tests for every part of this system, as well as system tests, using the Canonicloud to bring up test copies of the components.

We will maintain a staging server, as Ubuntu One and Launchpad do.

Deployment

Core dump reporting will not be enabled when the service is first deployed, so that the scalability of the overall system can be tested first.

A fractional deployment strategy will be crafted, using a time-based, random, or machine fingerprint key to determine whether the reporting system should begin submitting crash reports.
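One way to sketch the machine-fingerprint variant is to hash the fingerprint into a stable number between 0 and 1 and compare it with the rollout fraction. This is an illustrative assumption, not the strategy the project has chosen; the function name and the use of SHA1 are hypothetical.

```python
import hashlib


def reporting_enabled(machine_fingerprint, fraction):
    """Return True if this machine falls inside the rollout fraction.

    The fingerprint is hashed to a stable value in [0, 1), so a machine that
    is enabled at a given fraction stays enabled as the fraction grows.
    """
    digest = hashlib.sha1(machine_fingerprint.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < fraction
```

Raising `fraction` from 0.01 towards 1.0 over time would widen the set of reporting machines without ever switching off one that has already started.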

Once the system is running effectively on a released version of Ubuntu, the client will be backported to the previous version of Ubuntu. If that undertaking is successful, it will then be backported to the previous LTS.

Developer client

The developer client program presents the data stored in the backend server. Package maintainers, upstream developers, and other interested technical people can use it to interact with the data. This should include:

Graphs
Tables
Detail views of particular error instances
Querying "Which crash reports are related to this bug?"
Statistics
- "Top Changers" for spotting issues early
- "Rate of crashes per user"
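As a hedged illustration of the two statistics above, the sketch below treats "Top Changers" as the crash signatures whose report count grew most between two periods, and the crash rate as total crashes divided by reporting users. Both definitions are assumptions; the specification does not define them.

```python
def top_changers(previous, current, limit=10):
    """Return (signature, increase) pairs with the largest growth in count.

    previous, current: dicts mapping crash signature -> report count for two
    consecutive periods.
    """
    signatures = set(previous) | set(current)
    changes = [(current.get(s, 0) - previous.get(s, 0), s) for s in signatures]
    # Largest increase first; ties are broken by signature for stable output.
    changes.sort(key=lambda item: (-item[0], item[1]))
    return [(s, delta) for delta, s in changes[:limit]]


def crashes_per_user(total_crashes, reporting_users):
    """Average number of crashes per reporting user (0.0 if there are none)."""
    return total_crashes / reporting_users if reporting_users else 0.0
```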

The backend provides a strong API for retrieving data of interest. This permits construction of ad hoc queries, custom analysis, etc. beyond the client program's capabilities (such as automated scripting).

Requirements

Must be attentive to privacy issues [Need to elaborate further on this]
A collection of files is gathered client-side and inserted into the crash database record.
Processed versions of files (e.g. retracer output) can be added subsequently.
Some files must be kept private (e.g. core dumps)
Traces from multiple crash reports are algorithmically compared to find exact-dupes and likely-dupes.
Crash reports can be grouped by package, by distro release, or by both.
Statistics are generated to show number of [exact|exact+likely] dupes for each type of crash. Statistics can be provided by package, by distro release, by date range, or a combination.
Bug report(s) can be associated with a given set of crashes.
The user should have some way to check back on the status of their crash report; e.g. have some report ID they can look at to see statistics and/or any associated bug #. E.g. provide a serial number at time of filing that they can load via a web page later on.
For X and kernel crashes (at least), these reports need to be indexable by hardware. That is, we want to be able to answer both "how prevalent are GPU hangs on Intel hardware?" and "on what hardware does this GPU hang appear?". Probably either DMI data or PCI IDs or both are needed for this.
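The duplicate-detection and grouping requirements above could be prototyped along the following lines. Treating two traces as likely duplicates when their top three frames match is purely an illustrative assumption, as are the function names and the report field names.

```python
from collections import Counter


def classify_duplicate(trace_a, trace_b, top_frames=3):
    """Compare two traces (lists of frame names).

    Returns "exact" if all frames match, "likely" if only the top frames
    match, and "distinct" otherwise.
    """
    if trace_a == trace_b:
        return "exact"
    if trace_a[:top_frames] and trace_a[:top_frames] == trace_b[:top_frames]:
        return "likely"
    return "distinct"


def counts_by(reports, *fields):
    """Count reports grouped by the given fields, e.g. package and release."""
    return Counter(tuple(r.get(f) for f in fields) for r in reports)
```

Calling `counts_by(reports, "package")`, `counts_by(reports, "release")`, or `counts_by(reports, "package", "release")` covers the three groupings named in the requirements.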

Types of errors to handle:

Actual C-style crashes, with core.
Unhandled exceptions, such as you'd get from Python et al
Kernel oops and panics
Intel GPU dump output
dmesg & Xorg.0.log, triggered by GPU hangs

Tasks

Install Breakpad locally and experiment. Evaluate its capabilities and limitations. Does it meet a sufficient number of our requirements? What customizations/modifications would we need to make?
- svn checkout http://socorro.googlecode.com/svn/trunk/ socorro-read-only
- Needs to be packaged for Ubuntu (See http://code.google.com/p/socorro/wiki/Requirements)
- Breakpad uses minidump rather than core dump files, and it appears this requires linking a special breakpad library to each application we want to QA. Would it be feasible to convert it to produce/capture/use core dump files instead?
- Breakpad includes a tool for generating symbols files for applications. Are these consistent with the symbols files we already generate? If not, would this imply we'd need to generate and maintain yet another set of symbols for each app?
Brainstorm further about what info/capabilities should be provided by the developer interface
The Windows Error Reporting (WER) system can treat application hangs as bugs. Through the Windows Shell, it detects whether a program fails to respond for five seconds. We have hang detection for GPU lockups but not for application freezes. Compiz already has the ability to grey out an unresponsive UI. Investigate whether this is something we could hook into.
WER can detect rootkits and hardware failures such as corrupt memory. Could we do this as well?