ErrorTracker

Differences between revisions 2 and 111 (spanning 109 versions)
Revision 2 as of 2011-07-05 10:31:57
Size: 3276
Editor: mpt
Comment: rationale and cases to sketch
Revision 111 as of 2012-11-13 18:41:30
Size: 31964
Editor: ev
Comment: Clean up the simple anatomy of a crash
## page was renamed from CrashTracker

Ubuntu’s error tracker explains crashes, hangs, and other severe errors to end users; lets them report an error with the click of a button; and collects these reports from millions of users to show Ubuntu developers [[https://errors.ubuntu.com/|which errors are most common]]. It’s all open source, and [[#contributing|you can help]].

<<TableOfContents()>>
To help Ubuntu reach a standard of quality similar to competing operating systems, developers need to know the answers to two questions:

 1. '''How reliable is Ubuntu right now?''' (Compared with yesterday, compared with the previous version, or compared with what it would be if everyone had installed every update.)

 2. '''What’s the best thing I can do right now to help improve its quality?'''

We can better answer both of those questions if we collect '''all the information we need''', for as '''many types of problems''' as we can, from '''a large representative sample''' of people. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. It means collecting problem reports both before and after release. And it means analyzing and bucketing problems automatically, with developers able to configure the system to automatically retrieve more information about a particular kind of problem when it next occurs.
=== Prior art ===

Windows Error Reporting is perhaps the most advanced crash reporting system. As described in [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.716&rep=rep1&type=pdf|K Glerum, K Kinshumann, S Greenberg, et al.: “Debugging in the (very) large: Ten years of implementation and experience”]] (PDF), it uses progressive data collection where developers can request more than the “minidump” if necessary to understand particular problems, and automatically notifies users if a software update fixes their problem. Hardware vendors can [[http://msdn.microsoft.com/en-us/windows/hardware/gg487440|see crash reports specific to their hardware]].

{{attachment:windows-app-progress.gif}} {{attachment:windows-app.gif}} {{attachment:windows-os.png}}

Mac OS X has a Crash``Reporter system that submits crash data to Apple. As described in [[http://developer.apple.com/library/mac/technotes/tn2004/tn2123.html|Technical Note TN2123]], “There is currently no way for third party developers to access the reports submitted via CrashReporter”.

{{attachment:mac-app.png}} {{attachment:mac-plugin.png}} {{attachment:mac-os.png}} {{attachment:mac-hang.png}}

As a result, some Mac software developers have created their own crash tracking systems, such as [[http://unexpectedlyquit.com/|Adobe’s]] and [[http://www.flickr.com/photos/jfpoole/143205824/|Adium’s]].

Mozilla uses [[https://wiki.mozilla.org/Breakpad/Design|Breakpad]] to collect and submit minidumps on the client side, and [[https://wiki.mozilla.org/Socorro|Socorro]] to analyze and present them on the server side. Anyone can access crash data at [[https://crash-stats.mozilla.com/|crash-stats.mozilla.com]].
The client interface for the error tracker also serves a purpose which is less important for developers, but ''more'' important for end users: '''explaining why something weird just happened'''. In previous Ubuntu release versions, when most programs crashed there was no explanation of why the window had disappeared.
<<Anchor(settings)>>
=== Privacy settings ===

In System Settings, the “Security & Privacy” panel should contain a “Diagnostics” tab for error and metrics collection.

{{attachment:settings-privacy-diagnostics.png}}

In a backport to Ubuntu 11.10 and earlier, a standalone “Privacy” window should contain equivalent controls for just the error collection.

{{attachment:settings-privacy-old-versions.png}}

In both cases, the “People using this computer can…” and following controls should be insensitive whenever you have not unlocked them as an administrator.

In a new Ubuntu installation (or an upgrade to a version that introduces these settings), “Send error reports to Canonical” should be checked by default. But “Send a report automatically if a problem prevents login” and “Send occasional system information to Canonical”, when present, should be unchecked by default.

(Error reports are also accessible to trusted Ubuntu developers who are not employed by Canonical. This is covered in the privacy policy.)

<<Anchor(error)>>
=== When there is an error ===

When there is an error that prevents login, and “Send a report automatically if a problem prevents login” is checked, the error should be sent automatically.

As soon as possible after any other type of error occurs, an alert should appear with text and buttons depending on the situation. The Esc and Enter keys should ''not'' do anything in these alerts, because you may have been just about to press one of those in the program that has the problem.

|| ||<v>'''You are an admin, or error reporting is allowed'''||<v>'''Your admin has blocked error reporting'''||<v>'''Implemented in Ubuntu'''||
||<^><<Anchor(os-crash)>>[[#os-crash|#]]'''An OS package crashes''' (including kernel oopses) for the first time this version<<BR>>,,Test case: sudo pkill -SEGV zeitgeist,,||<^>{{attachment:os-error-reportable.png}}||<^>{{attachment:os-error-unreportable.png}}||12.04||
||<^>'''An OS package crashes''' a subsequent time||<^>{{attachment:os-error-reportable-subsequent.png}}||<^>{{attachment:os-error-unreportable-subsequent.png}}||12.04||
||<-3 style="border:none;">“Ignore future problems of this type” means ignore future crashes of the same version of the same package.||
||<^><<Anchor(thread)>>[[#thread|#]]'''An application thread crashes''' for the first time this version||<^>{{attachment:app-thread-reportable.png}}||<^>{{attachment:app-thread-unreportable.png}}||<(|2>(in 12.04, shows “closed unexpectedly” error instead, bug Bug:1033902)||
||<^>'''An application thread crashes''' a subsequent time||<^>{{attachment:app-thread-reportable-subsequent.png}}||{{attachment:app-thread-unreportable-subsequent.png}}||
||<-3 style="border:none">For most other error types, the alert shouldn’t offer to be silent next time — because it still needs to appear to explain what’s happened, and (in the application hang case) to let you stop/relaunch the application:||
||<^><<Anchor(app-requested)>>[[#app-requested|#]]'''An application has a developer-specified error'''||<^>{{attachment:app-requested-reportable.png}}||<^>{{attachment:app-requested-unreportable.png}}||
||<^(|2><<Anchor(app-hang)>>[[#app-hang|#]]'''An application is hung for at least 30 seconds'''<<BR>>,,Test case: eog & sleep 5 && pkill -STOP eog && sleep 20 && pkill -CONT eog then wait for 30 seconds,,||<^>{{attachment:app-hang-reportable.png}}||<^>{{attachment:app-hang-unreportable.png}}||<|2>(targeted for 12.10)||
||<^-2>This alert should be modal to the unresponsive window. (It appears much later than, because it is more intrusive than, the greying-out of the window after 5 seconds.) If that window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert closes.||
||<^(|2><<Anchor(close-hang)>>[[#close-hang|#]]'''An application is hung for at least 5 seconds after you try to close its window'''<<BR>>,,Test case: eog & sleep 5 && pkill -STOP eog & pkill -TERM eog && sleep 5,,||<^>{{attachment:close-hang-reportable.png}}||<^>{{attachment:close-hang-unreportable.png}}||<|2>(targeted for 12.10)||
||<^-2>This alert should be modal to the unresponsive window, and should therefore have no title of its own. If that parent window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert and the window both close.||
||<^><<Anchor(app-crash)>>[[#app-crash|#]]'''An application crashes'''<<BR>>,,Test case: eog & pkill -SEGV eog,,||<^>{{attachment:app-crash-reportable.png}}||<^>{{attachment:app-crash-unreportable.png}}||12.04||
||<^><<Anchor(kernel-crash)>>[[#kernel-crash|#]]'''Ubuntu restarts after a kernel crash'''||<^>{{attachment:kernel-oops-reportable.png}}||<^>{{attachment:kernel-oops-unreportable.png}}||(targeted for 12.04 SRU)||
||<^><<Anchor(install-fails)>>[[#install-fails|#]]'''A package fails to install or update'''||<^>{{attachment:package-error-reportable.png}}||<^>{{attachment:package-error-unreportable.png}}||(targeted for 12.04 SRU)||
||<-3 style="border:none">With non-application software crashing, we can’t tell programmatically whether it’s something you need to care about or not. So if you aren’t going to report the errors, we might as well let you ignore future errors:||
||<^><<Anchor(non-app-crash)>>[[#non-app-crash|#]]'''Third-party non-application software crashes''' for the first time this version<<BR>>,,Test case: sh -c 'kill -SEGV $$',,||<^>{{attachment:nas-crash-reportable.png}}||<(|2>(no alert shown)||12.04||
||<^>'''Third-party non-application software crashes''' a subsequent time||<^>{{attachment:nas-crash-reportable-subsequent.png}}||12.04||
||<-3 style="border:none">For all cases where the “Send an error report to help fix this problem” checkbox is present, its state should persist across errors and across Ubuntu sessions.||
||<^><<Anchor(details)>>[[#details|#]]'''Any type of error, if you choose “Show Details”'''||{{attachment:app-crash-reportable-details.png}}|| ||12.04||
||<-3 style="border:none">If you choose “Show Details”, it should change to “Hide Details” while a text field containing the error report appears below the primary text. If necessary, a spinner and the text “Collecting information…” should appear centered inside the text field while the information is collected (other than the process name and version, which should appear instantly), pausing whenever the collection system is waiting for you to answer any questions. If no details are available (for example, the crash file is unreadable), below the process name should appear the paragraph “No details were recorded for this error.” Regardless, the field contents should end with the paragraph “Other information may be sent if Ubuntu developers request it.”||
||<-3 style="border:none">If you choose to send an error report, the alert should disappear immediately. Data should be collected (if it hasn’t been already), and reports should be sent in the background, with ''no'' progress or success/failure feedback. If you are not connected to the Internet at the time, reports should be queued. Any queued reports should be sent when you next agree to send an error report while online.||
||<^ style="border:none">If you are using a pre-release version of Ubuntu, and the error report matches an existing Launchpad bug report, a further alert box should appear explaining its status and letting you open the bug report.||{{attachment:bug-report.png}}<<BR>>,,Enter = “OK”,,|| ||(targeted for 12.10)||

'''''Future work:''' Ensure that if there is a delay in displaying a crash, we adjust the text of the dialog to reflect this. One example is when X crashes and the user has to log in again or reboot the computer.''

'''''Future work:''' If a software update is known to fix the problem, replace the primary alert with [[SoftwareUpdates#alert|the software update alert]] (or progress window, depending on the update policy), with customized primary text. Or point them at a web page (not a wiki page!) with details if a workaround exists, but no fix is available yet.''

'''''Future work:''' Automate the communication with the user to facilitate things like leak detection in subsequent runs, without requiring additional interaction with the user. Our current process requires us to ask people who are subscribed to the bug to try a specially-instrumented build, with a traditionally very long feedback loop between the developer and the bug subscribers. We should make it entirely automatic. Just wait for the next user who sees the bug to click one "yes, I'd like to help make this product better" button.''

<<Anchor(multiple)>>
=== When there are multiple simultaneous errors ===

To guard against the case where multiple errors of the same type cause a flood of alert boxes, there should be '''aggregate alert boxes''' for the two most likely cases, internal errors and application crashes.

If an alert box for a single error is open and unfocused, when another error of the same type happens, that alert box should morph into the aggregate version.

||<^>'''Multiple OS packages crash'''||<^>{{attachment:os-error-reportable-multiple.png}}|| ||(targeted for 12.04.1)||
||<^>'''Multiple applications crash'''||<^>{{attachment:app-crash-reportable-multiple.png}}|| ||(targeted for 12.04.1)||

In these cases, the “Show Details” box should show details of all the errors, with separators between them.

<<Anchor(memory)>>
=== When there is not enough memory for a core dump ===

||<^>'''When the kernel does not have enough memory'''||<^>An error report should still be sent (or queued for sending) as normal for accounting purposes, just without the core dump.|| ||

<<Anchor(updates)>>
=== When an update is available to fix a crash ===

When a crash (whether of an application or OS package) occurs, “Send error reports to Canonical” is checked, and Ubuntu hasn’t previously submitted this particular crash signature, it should send a basic crash signature to the server.

If it does not do this, ''or'' within five seconds it does not receive a response that the problem is fixed by a software update ''and'' it did not know from a previous submission that it is fixed by a software update, then the error alert should appear as normal.

Otherwise, the usual error alert should change:
 * The secondary text should be “The good news is, a software update is available to fix this problem.”
 * The checkbox label should be “Send an error report anyway”.
 * An extra “Install Updates…” button should be present on the leading side of the trailing group (even if you are not an admin).

{{attachment:app-crash-reportable-updateable.png}}

Choosing “Install Updates…” should [[SoftwareUpdates#launch-manual|launch Software Updater]] (leaving the application closed, if it was an application crash).

''We also considered changing the Software Updater UI to appear unprompted sooner, or to have custom text, when updates are known to fix problems users on your system have submitted. We decided against it because it would be less obvious.''

<<Anchor(debconf)>>
=== When there is a debconf prompt ===

debconf prompts for user-installed software in Ubuntu are, overwhelmingly, programming mistakes. Therefore, they should be presented as error alerts.

||<(^|2>[[#debconf|#]]'''A Debconf prompt'''||<^>{{attachment:debconf-reportable.png}}||<^>{{attachment:debconf-unreportable.png}}||(targeted for 12.10 — [[/Contributing/Debconf|how to contribute]])||

If the `TITLE` command is used, that string should be used as the title of the window instead of “Ubuntu”.

The primary text should depend on the type of package and the situation:

|| ||'''Application'''||'''Non-application package'''||
||'''During installation'''||{Application Name} needs your help to finish installing.||The package “{package name}” needs your help to finish installing.||
||'''During [[http://www.debian.org/doc/debian-policy/ch-maintainerscripts.html#s-removedetails|postinst abort-remove]]'''||{Application Name} needs your help to finish its removal.||The package “{package name}” needs your help to finish its removal.||

Controls should be included in the alert depending on the type of prompt.

||<tablestyle="float:left;"><<Anchor(debconf-string)>>[[#debconf-string|#]]'''Type “string”'''||
||{{attachment:debconf-string.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-boolean)>>[[#debconf-boolean|#]]'''Type “boolean”'''||
||{{attachment:debconf-boolean.png}}||

||<-2 tablestyle="float:left;"><<Anchor(debconf-select)>>[[#debconf-select|#]]'''Type “select”'''||
||If there are six or fewer choices:||If there are more than six choices:||
||{{attachment:debconf-select-few.png}}||{{attachment:debconf-select-many.png}}||

||<-2 tablestyle="float:left;"><<Anchor(debconf-multiselect)>>[[#debconf-multiselect|#]]'''Type “multiselect”'''||
||If there are six or fewer choices:||If there are more than six choices:||
||{{attachment:debconf-multiselect-few.png}}||{{attachment:debconf-multiselect-many.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-note)>>[[#debconf-note|#]]'''Type “note”'''||
||{{attachment:debconf-note.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-text)>>[[#debconf-text|#]]'''Type “text” or “error”'''||
||{{attachment:debconf-text.png}}||

||<tablestyle="float:left;"><<Anchor(debconf-password)>>[[#debconf-password|#]]'''Type “password”'''||
||{{attachment:debconf-password.png}}||

||<tablestyle="clear:both;" style="border:none;">||

<<Anchor(debconf-progress)>>
=== Presenting debconf progress ===

''This has nothing to do with error tracking, but is included here because the error tracker provides all the rest of debconf’s graphical UI.''

When a maintainer script requests progress presentation (`db_progress`), the progress text and proportion should be shown in a progress window. Ideally, this progress window should morph to and/or from any consecutive debconf prompts.

{{attachment:debconf-progress.png}}

The window title should be of the form ‘Installing “package-name”’, ‘Reinstalling “package-name”’, ‘Updating “package-name”’, or ‘Removing “package-name”’ as appropriate, or “Ubuntu” if the type of operation is unknown. The “Skip” button should be present if the operation is skippable.

<<Anchor(metrics)>>
=== Invitation for metrics collection ===

For any administrator, after the ''first'' time (and only the first time) that they respond to an error alert, a second alert should appear inviting them to opt in to metrics collection. (The “Esc” key should activate “Don’t Send” in this alert, but the “Enter” key should not do anything.)

{{attachment:privacy-settings-alert.png}}

The “Privacy…” button should open System Settings to the Privacy panel. Choosing “Send” should be equivalent to checking “Send occasional system information to Canonical” in the Privacy settings.

<<Anchor(previous)>>
== Accessing previous reports ==

Choosing “Show Previous Reports” in [[#settings|the settings interface]] should open a Web page listing those reports.

{{attachment:previous-reports.png}}

To avoid end users getting lost in developer material, the page should have no global navigation.

To avoid privacy problems, it should be impossible to share the URL of the page. ''How?''

Error reports should be listed in the order they were received, newest first, defaulting to the newest 50. The date received should link to the individual report.

If there are from 1 to 50 reports, the batch count should read only “Showing all {number}”, and there should be no batch navigation.

{{attachment:previous-reports-1-batch.png}}

If there are no reports at all, there should be no batch count, navigation, or table — just an explanatory sentence.

{{attachment:previous-reports-none.png}}
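
The batching rules above can be sketched as follows. This is only an illustration, not the site's code: the function name `first_batch`, the `(received, report_id)` tuple shape, and the returned triple are assumptions. The wording of the batch count for more than 50 reports is not specified on this page, so the sketch leaves it unset.

```python
BATCH_SIZE = 50  # "defaulting to the newest 50"


def first_batch(reports):
    """Return (batch, label, show_navigation) for the first page.

    `reports` is a list of (received, report_id) tuples (hypothetical shape).
    """
    # Newest first, in the order the reports were received.
    newest_first = sorted(reports, key=lambda r: r[0], reverse=True)
    batch = newest_first[:BATCH_SIZE]
    total = len(reports)
    if total == 0:
        # No batch count, navigation, or table: just an explanatory sentence.
        return batch, None, False
    if total <= BATCH_SIZE:
        return batch, f"Showing all {total}", False
    # More than one batch: navigation is needed; label wording is unspecified.
    return batch, None, True
```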

== Client implementation ==

The apport client will write a .upload file alongside a .crash file to indicate that the crash should be sent to the crash database. A small C daemon (currently "whoopsie", previously "reporterd") will set up an inotify watch on the /var/crash directory, and whenever one of these .upload files appears, it will upload the matching .crash file. It will do this if and only if there is an active Internet connection, as determined by watching the NetworkManager D-Bus API for connectivity events; otherwise it will add the report to a queue for later processing.
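
As a rough illustration of the marker-file convention, the sketch below does a one-shot scan of a directory and pairs each .upload marker with its .crash report. It is not how the daemon is written (the daemon is in C and uses inotify rather than polling); the function name `pending_reports` is an assumption made for this example.

```python
from pathlib import Path


def pending_reports(crash_dir):
    """Return the .crash files that have a matching .upload marker."""
    pending = []
    for marker in sorted(Path(crash_dir).glob("*.upload")):
        # Same file name, but with .crash in place of .upload.
        report = marker.with_suffix(".crash")
        if report.is_file():
            pending.append(report)
    return pending
```

A report without a marker is left alone, which matches the idea that the marker is the user's consent to send.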

We will ensure NetworkManager brings up the interfaces as early as possible, to enable us to file crash reports during boot.

This needs to be a daemon, rather than another path of the apport client code, to account for there not being an Internet connection at the time of the crash and for crashes during boot, when we cannot assume the user will get back to a known-good state to file the report.

The canonical example here is the scenario posed in Microsoft’s Windows Error Reporting paper, where a piece of malware was causing the core desktop application (explorer.exe) to crash. They were still able to receive crash reports, as their client software still submitted reports very early on in the boot process.

The apport crash file will be parsed into an intermediate data structure (currently a GHashTable), with the core dump stripped out, and then converted into BSON to be transmitted in an HTTP POST operation. The server will reply with a UUID for subsequent operations and, optionally, a command for further action. Initially, this will just be a command to upload the core dump.
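
The parse-and-strip step might look something like the sketch below. Several things here are assumptions: the report is treated as simplified "Key: value" text whose continuation lines begin with a single space, the core dump is assumed to live under a `CoreDump` key, and JSON from the standard library stands in for BSON so the example stays self-contained.

```python
import json


def parse_report(text):
    """Parse simplified 'Key: value' report text into a dict."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith(" ") and key is not None:
            # Continuation line: append to the value of the current key.
            previous = fields[key]
            fields[key] = line[1:] if not previous else previous + "\n" + line[1:]
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


def prepare_payload(text):
    """Drop the core dump and serialize the remaining fields."""
    fields = parse_report(text)
    fields.pop("CoreDump", None)  # sent later, and only if the server asks
    # JSON is a stand-in here; the design above calls for BSON.
    return json.dumps(fields, sort_keys=True).encode("utf-8")
```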

A new field is being added to the apport crash file, StacktraceAddressSignature. The server will check for this field, and if it already has a retraced core dump generated from the same signature, it will reply with just the UUID of the crash report entry in the database, indicating that a core dump need not be submitted.
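
A minimal sketch of that server-side decision follows. The function name, the `"CORE"` command string, and the use of an in-memory set for already-retraced signatures are all hypothetical; treating a report with no signature as needing a core dump is also an assumption.

```python
import uuid


def handle_submission(report, retraced_signatures):
    """Return (report_id, command) for a newly submitted report.

    `command` is None when no core dump is needed.
    """
    report_id = str(uuid.uuid4())
    signature = report.get("StacktraceAddressSignature")
    if signature is not None and signature in retraced_signatures:
        # A retraced core dump already exists for this signature.
        return report_id, None
    # Otherwise ask the client to upload the core dump.
    return report_id, "CORE"
```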

If, however, the server does reply with a request to upload the core dump, it will be sent as zlib-compressed data in an HTTP POST operation.
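
For illustration, the compression step can be expressed with Python's standard `zlib` module. The helper names are assumptions, and the daemon itself is written in C, so this only shows the shape of the round trip.

```python
import zlib


def compress_core(core_bytes):
    """Compress raw core dump bytes for the POST body."""
    return zlib.compress(core_bytes)


def decompress_core(payload):
    """Recover the raw core dump bytes on the receiving side."""
    return zlib.decompress(payload)
```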

The URLs for posting will be of the form:
 - http://crashes.ubuntu.com/submit
 - http://crashes.ubuntu.com/550e8400-e29b-41d4-a716-446655440000/submit-core
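
A small sketch of building these two URLs, assuming the server's UUID is passed back as a string. The function names are hypothetical; the identifier is validated with the standard `uuid` module before it is placed in the path.

```python
import uuid

BASE_URL = "http://crashes.ubuntu.com"


def submit_url():
    """URL for the initial report submission."""
    return BASE_URL + "/submit"


def submit_core_url(report_id):
    """URL for uploading the core dump of an existing report."""
    # uuid.UUID raises ValueError if the identifier is malformed.
    canonical = str(uuid.UUID(report_id))
    return f"{BASE_URL}/{canonical}/submit-core"
```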

Crash reports will be cleaned up after 14 days, as the system may never be connected to the Internet.
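
One way to find reports due for cleanup is sketched below. It assumes that age is measured from the file's modification time, which this page does not state; the function name is also an assumption.

```python
import os
import time

MAX_AGE_SECONDS = 14 * 24 * 60 * 60  # 14 days


def expired_reports(crash_dir, now=None):
    """Return paths of .crash files older than 14 days."""
    now = time.time() if now is None else now
    expired = []
    for name in sorted(os.listdir(crash_dir)):
        path = os.path.join(crash_dir, name)
        if name.endswith(".crash") and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            expired.append(path)
    return expired
```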

If the reporter daemon crashes, it will write a crash file like any other application. Its upstart job will have the respawn flag set, with a respawn limit in place so that it does not restart endlessly.

If the reporter daemon moves to using apport-unpack to process the crash files, it should gracefully handle -ENOSPC.

Crash reports for applications that are not themselves part of packages in Ubuntu will be handled. These will not be retraced, but they will be collected for statistical analysis. This removes the "the problem cannot be reported" dialog in Apport.

We will add an Origin and possibly a Site field to the apport reports, using the python-apt candidate.origins interface. This will allow us to answer questions like what percentage of crashes are coming from PPAs. More importantly, it will let us focus reports on packages from a particular PPA, like the unity-testing one.

<<Anchor(server)>>
== errors.ubuntu.com ==

{{attachment:site-front-page.png}}

By default, the front page should begin with a graph of “Errors per 24 hours” (bug Bug:1046269) for nearly all dates (bug Bug:1053410) and all Ubuntu versions from which errors were recorded.

Once at least six months of data has been recorded, the main graph should be followed by a thumbnail navigation graph for selecting a date range. If you do this, the filter controls for the table should change to the same date range, though the reverse should not happen (entering a date range by date should not change the main graph).

Next should be controls for changing the graph and the table. A spinner should appear at the trailing end of the first row whenever the graph and/or table are being updated.

{{attachment:site-front-page-filters.png}}

The OS version menu should begin with an item for “all” (the default), then “Ubuntu {development version}”, then tracked released versions from newest to oldest. For example, “all”, “Ubuntu S”, “Ubuntu 13.04”, “Ubuntu 12.10”, “Ubuntu 12.04”.

The package combo box should have menu items for “all packages” (the default), “-proposed”, “ubuntu-desktop”, and any other useful package sets. The text field should accept either any of these special names, or a package name. If the text (once space-stripped and lower-cased) does not match any of those when focus leaves the field, it should flash red and have no effect on the graph or table, but should retain its contents so you can correct typos. Whenever the package combo box value is a single package name, the spinner should appear until a menu appears listing package versions to filter on, with the default being “all versions”.
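
The matching rule for the text field could be sketched like this. The function name and the idea of passing in a set of known package names are assumptions; "space-stripped" is read here as trimming surrounding whitespace, since "all packages" itself contains a space.

```python
SPECIAL_NAMES = ("all packages", "-proposed", "ubuntu-desktop")


def resolve_package_filter(text, known_packages):
    """Return the normalized filter value, or None if nothing matches.

    None corresponds to the field flashing red and the graph and table
    staying unchanged.
    """
    candidate = text.strip().lower()
    if candidate in SPECIAL_NAMES or candidate in known_packages:
        return candidate
    return None
```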

{{attachment:site-front-page-filters-package.png}}

<<Anchor(date-range)>>
The date menu should control the contents of only the table, not the graph. It should have items for “the past day” (the default), “the past week”, “the past month”, “the past year”, and “the date range”. Whenever “the date range” is selected, date fields should appear alongside for specifying the date range. If you have never used these before, they should default to the past year. Otherwise, they should remember whichever dates you used last.

{{attachment:site-front-page-filters-date.png}}

Finally, the table should appear.

For both the graph and the table, whenever one of them is loading, is interactively updating, or failed to load, it should be semi-transparent. If the table fails to load or update, an error message should also appear alongside the table filter controls if there is room, or below them otherwise: an error icon, the text “Sorry, the table data didn’t load.”, and a “Retry” button (bug Bug:1060037).

{{attachment:site-data-error.png}}

== Server architecture ==

[[/ServerArchitecture]] has additional details.

<<Anchor(contributing)>>
== How you can help ==

There are a few components to the error reporting system. To understand where to make your contribution, first understand how all the pieces fit together.

=== Anatomy of a crash ===

When an application crashes in Ubuntu, apport is called and a basic crash report is written into the {{{/var/crash}}} directory. This initial report contains the information that can be gathered quickly, such as the date and path to the application.

{{attachment:crash-report-basic.png}}

Meanwhile, another program called update-notifier is watching the {{{/var/crash}}} directory for new files. It sees that a .crash file has been created and runs Apport with this file as an argument. The following window then appears:

{{attachment:initial-apport-window.png}}

This is the first contact the user has with the issue since the application that crashed disappeared from view. At this point additional information is collected for the report. This occurs either when they press “Show Details”, so that these details may be presented to them for review, or when they dismiss the dialog with the “Send an error report to help fix this problem” box checked.

The additional details collected will be ones that could not be calculated quickly when the report was first created. As one example, the packages that this application depends on and their versions will be determined and written into the report.

When the user dismisses the dialog with the “Send an error report to help fix this problem” box checked, another file is created in the {{{/var/crash}}} directory with the same name as the crash report, but with a .upload extension. This file has no contents. It is just used as a signal to the program responsible for uploading the crash report that the user wants this report sent.

The program responsible for uploading crash reports is called Whoopsie. It’s always running on Ubuntu systems, watching the {{{/var/crash}}} directory for files ending in .upload. When it sees one of these, it checks to see if there’s a high-speed internet connection. If it cannot find one, it waits to send the report until later. Otherwise, it opens the matching .crash file and converts it into binary JSON (BSON) data, then sends this information to http://daisy.ubuntu.com.

||<#F1F1DD> Along with the report, Whoopsie sends an obfuscated (SHA512) system identifier (DMI system UUID). This information is collected so that we can show a graph of the average errors per calendar day. It also lets us answer questions like, “is Ubuntu more stable in the first week of use or subsequent weeks?” ||
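
As an illustration of the obfuscation step, the sketch below hashes an identifier with SHA-512 from Python's standard library. How Whoopsie reads, normalizes, or encodes the DMI system UUID before hashing is not described here, so the UTF-8 encoding and the function name are assumptions.

```python
import hashlib


def obfuscate_system_id(system_uuid):
    """Return the SHA-512 hex digest of a system identifier string."""
    return hashlib.sha512(system_uuid.encode("utf-8")).hexdigest()
```

The same input always yields the same digest, which is what allows reports from one machine to be counted together without storing the identifier itself.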

The servers responsible for http://daisy.ubuntu.com receive these error reports, about 80,000 per day currently, and write them into a large Cassandra database.

Once the report is written into the database, it’s put through a process called “bucketing.” This takes the report and determines what makes it an instance of a larger problem. In its simplest form, this is a string that contains the path to the program that crashed, [[http://en.wikipedia.org/wiki/Unix_signal|the signal]] that occurred, and the top few frames of the stack trace, all separated by colon characters:

{{{
/usr/bin/update-notifier:11:g_object_ref:g_list_foreach:get_mounts:get
}}}
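
A signature of this simplest form can be assembled as in the sketch below. The function name and the `max_frames` cutoff are assumptions ("the top few frames" is not given a number here); the real signature code lives in apport and handles many more cases.

```python
def crash_signature(executable_path, signal_number, frames, max_frames=5):
    """Join the path, signal number, and top frames with colons."""
    parts = [executable_path, str(signal_number)] + list(frames[:max_frames])
    return ":".join(parts)
```

Calling it with `/usr/bin/update-notifier`, signal 11, and the frames `g_object_ref`, `g_list_foreach`, `get_mounts`, `get` reproduces the example string shown above.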

Every time a crash is received, it is grouped with the other crashes that produce this same “crash signature.” We call this grouping a problem, or bucket. Every time a new crash is added to one of these buckets, a counter for the bucket is incremented for the date, month, and year that the crash was seen. This is how we identify what the most important problems in Ubuntu are: we present a table on http://errors.ubuntu.com of the buckets with the most instances:

{{attachment:update-manager-table.png}}
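
The per-period counters can be pictured with the in-memory sketch below. The production system keeps these counts in Cassandra; the class name, the period key format, and the `top` method are assumptions made only to show the idea.

```python
from collections import Counter
from datetime import date


class BucketCounters:
    """Count crashes per bucket for each day, month, and year."""

    def __init__(self):
        self.counts = Counter()

    def record(self, signature, seen):
        # One increment each for the day, the month, and the year.
        for period in (seen.strftime("%Y-%m-%d"),
                       seen.strftime("%Y-%m"),
                       seen.strftime("%Y")):
            self.counts[(signature, period)] += 1

    def top(self, period, limit=10):
        """Return the buckets with the most instances in a period."""
        totals = Counter({sig: n for (sig, p), n in self.counts.items()
                          if p == period})
        return totals.most_common(limit)
```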

=== Anatomy of a crash, in detail ===

/* diagram needed */
{{{
apport -> whoopsie -> daisy.ubuntu.com -> errors.ubuntu.com
}}}

/* "Python exceptions are easier, so let's start with those..." */
/* Make this the "detailed explanation". */
 1. Apport is called by the kernel's [[http://www.kernel.org/doc/man-pages/online/pages/man5/core.5.html|core pattern handler]] and creates a [[http://people.canonical.com/~pitti/doc/apport-data-format.pdf|report file]] in {{{/var/crash}}}.
 2. The {{{update-notifier}}} application watches for changes in {{{/var/crash}}}. It sees this newly written report file and calls {{{/usr/share/apport/apport-gtk}}}.
 3. Apport displays a [[https://wiki.ubuntu.com/ErrorTracker#When_there_is_an_error|graphical dialog]] asking the user if they want to report the issue.
 4. If the user chooses to report the issue, a {{{.upload}}} file is created in {{{/var/crash}}}. The {{{whoopsie}}} application is watching {{{/var/crash}}} for these. It waits until there is an active Internet connection, then finds the matching report for the {{{.upload}}} file and uploads it to https://daisy.ubuntu.com.
 5. Crashes come in two parts. There's the metadata associated with the crash (date, environment variables, Ubuntu release, ...) and the crash itself. This latter part is called a [[http://en.wikipedia.org/wiki/Core_dump|core dump]] and is often very large. In order to avoid requiring everyone submitting a crash to also submit a core dump, we use a first pass signature called a [[http://bazaar.launchpad.net/~apport-hackers/apport/trunk/view/head:/apport/report.py#L1237|StacktraceAddressSignature]]. There may be a few {{{StacktraceAddressSignatures}}} for any {{{CrashSignature}}}.
 6. Daisy accepts these uploaded reports from users and writes them into a Cassandra database. If it has not yet received a core dump for this problem (checking the aforementioned {{{StacktraceAddressSignature}}}), then it replies to whoopsie asking for one.
 7. If requested, {{{whoopsie}}} sends up the core dump; Daisy writes it to disk (NFS), then puts the path to this file on a RabbitMQ queue for retracing.
 8. One of the retracers then picks this core dump off the queue and processes it into a [[http://en.wikipedia.org/wiki/Stack_trace|stack trace]]. A [[http://bazaar.launchpad.net/~apport-hackers/apport/trunk/view/head:/apport/report.py#L1151|crash signature]] is then computed from the stack trace; the database uses it as a unique identifier for this problem.
 9. Daisy then increments a counter for the bucket this crash signature belongs to, indicating the number of users who have experienced instances of this problem.
 10. The bucket counts are read and displayed on the Django http://errors.ubuntu.com website.

=== How can I test this? ===

To see this in action for yourself, simply send any process the SEGV signal:
{{{
eog & sleep 5 && pkill -SEGV eog
}}}

=== Get the code ===
/* Replace with a wget | sh shell script? */
{{{
mkdir -p ~/bzr
cd ~/bzr
bzr branch lp:daisy
bzr branch lp:errors
bzr branch lp:whoopsie
bzr branch lp:apport
}}}

=== Find bugs to fix ===

 * We track all bugs against the [[https://bugs.launchpad.net/ubuntu-error-tracker|Ubuntu Error Tracker project]].
 * Errors that occur within the server-side infrastructure are reported [[https://errors.ubuntu.com/oops-local/|here]].

=== Further reading ===

 * [[/Contributing/Debconf]]
 * [[/Contributing/Errors]]

Ubuntu’s error tracker explains crashes, hangs, and other severe errors to end users; lets them report an error with the click of a button; and collects these reports from millions of users to show Ubuntu developers which errors are most common. It’s all open source, and you can help.

Rationale

To help Ubuntu reach a standard of quality similar to competing operating systems, developers need to know the answers to two questions:

  1. How reliable is Ubuntu right now? (Compared with yesterday, compared with the previous version, or compared with what it would be if everyone had installed every update.)

  2. What’s the best thing I can do right now to help improve its quality?

We can better answer both of those questions if we collect all the information we need, for as many types of problems as we can, from a large representative sample of people. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. It means collecting problem reports both before and after release. And it means analyzing and bucketing problems automatically, with developers able to configure the system to automatically retrieve more information about a particular kind of problem when it next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

The client interface for the error tracker also serves a purpose which is less important for developers, but more important for end users: explaining why something weird just happened. In previous Ubuntu releases, when most programs crashed there was no explanation of why the window had disappeared.

Client design

Privacy settings

In System Settings, the “Security & Privacy” panel should contain a “Diagnostics” tab for error and metrics collection.

settings-privacy-diagnostics.png

In a backport to Ubuntu 11.10 and earlier, a standalone “Privacy” window should contain equivalent controls for just the error collection.

settings-privacy-old-versions.png

In both cases, the “People using this computer can…” and following controls should be insensitive whenever you have not unlocked them as an administrator.

In a new Ubuntu installation (or an upgrade to a version that introduces these settings), “Send error reports to Canonical” should be checked by default. But “Send a report automatically if a problem prevents login” and “Send occasional system information to Canonical”, when present, should be unchecked by default.

(Error reports are also accessible to trusted Ubuntu developers who are not employed by Canonical. This is covered in the privacy policy.)

When there is an error

When there is an error that prevents login, and “Send a report automatically if a problem prevents login” is checked, the error should be sent automatically.

As soon as possible after any other type of error occurs, an alert should appear with text and buttons depending on the situation. The Esc and Enter keys should not do anything in these alerts, because you may have been just about to press one of those in the program that has the problem.

You are an admin, or error reporting is allowed

Your admin has blocked error reporting

Implemented in Ubuntu

#An OS package crashes (including kernel oopses) for the first time this version
Test case: sudo pkill -SEGV zeitgeist

12.04

An OS package crashes a subsequent time

12.04

“Ignore future problems of this type” means ignore future crashes of the same version of the same package.

#An application thread crashes for the first time this version

(in 12.04, shows “closed unexpectedly” error instead, bug 1033902)

An application thread crashes a subsequent time

For most other error types, the alert shouldn’t offer to be silent next time — because it still needs to appear to explain what’s happened, and (in the application hang case) to let you stop/relaunch the application:

#An application has a developer-specified error

#An application is hung for at least 30 seconds
Test case: eog & sleep 5 && pkill -STOP eog && sleep 20 && pkill -CONT eog then wait for 30 seconds

(targeted for 12.10)

This alert should be modal to the unresponsive window. (Because it is more intrusive, it appears much later than the greying-out of the window, which happens after 5 seconds.) If that window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert closes.

#An application is hung for at least 5 seconds after you try to close its window
Test case: eog & sleep 5 && pkill -STOP eog && pkill -TERM eog && sleep 5

(targeted for 12.10)

This alert should be modal to the unresponsive window, and should therefore have no title of its own. If that parent window becomes responsive again, the alert’s contents should become insensitive for one second (to ignore misclicks) before the alert and the window both close.

#An application crashes
Test case: eog & sleep 5 && pkill -SEGV eog

12.04

#Ubuntu restarts after a kernel crash

(targeted for 12.04 SRU)

#A package fails to install or update

(targeted for 12.04 SRU)

With non-application software crashing, we can’t tell programmatically whether it’s something you need to care about or not. So if you aren’t going to report the errors, we might as well let you ignore future errors:

#Third-party non-application software crashes for the first time this version
Test case: sh -c 'kill -SEGV $$'

(no alert shown)

12.04

Third-party non-application software crashes a subsequent time

12.04

For all cases where the “Send an error report to help fix this problem” checkbox is present, its state should persist across errors and across Ubuntu sessions.

#Any type of error, if you choose “Show Details”

12.04

If you choose “Show Details”, it should change to “Hide Details” while a text field containing the error report appears below the primary text. If necessary, a spinner and the text “Collecting information…” should appear centered inside the text field while the information is collected (other than the process name and version, which should appear instantly), pausing whenever the collection system is waiting for you to answer any questions. If no details are available (for example, the crash file is unreadable), below the process name should appear the paragraph “No details were recorded for this error.” Regardless, the field contents should end with the paragraph “Other information may be sent if Ubuntu developers request it.”

If you choose to send an error report, the alert should disappear immediately. Data should be collected (if it hasn’t been already), and reports should be sent in the background, with no progress or success/failure feedback. If you are not connected to the Internet at the time, reports should be queued. Any queued reports should be sent when you next agree to send an error report while online.

If you are using a pre-release version of Ubuntu, and the error report matches an existing Launchpad bug report, a further alert box should appear explaining its status and letting you open the bug report.

bug-report.png
Enter = “OK”

(targeted for 12.10)

Future work: Ensure that if there is a delay in displaying a crash, we adjust the text of the dialog to reflect this; for example, when X crashes and the user has to log in again or reboot the computer.

Future work: If a software update is known to fix the problem, replace the primary alert with the software update alert (or progress window, depending on the update policy), with customized primary text. Or point them at a web page (not a wiki page!) with details if a workaround exists, but no fix is available yet.

Future work: Automate the communication with the user to facilitate things like leak detection in subsequent runs, without requiring additional interaction with the user. Our current process requires us to ask people who are subscribed to the bug to try a specially-instrumented build, with a traditionally very long feedback loop between the developer and the bug subscribers. We should make it entirely automatic. Just wait for the next user who sees the bug to click one "yes, I'd like to help make this product better" button.

When there are multiple simultaneous errors

To guard against the case where multiple errors of the same type cause a flood of alert boxes, there should be aggregate alert boxes for the two most likely cases, internal errors and application crashes.

If an alert box for a single error is open and unfocused, when another error of the same type happens, that alert box should morph into the aggregate version.

Multiple OS packages crash

(targeted for 12.04.1)

Multiple applications crash

(targeted for 12.04.1)

In these cases, the “Show Details” box should show details of all the errors, with separators between them.

When there is not enough memory for a core dump

When the kernel does not have enough memory

An error report should still be sent (or queued for sending) as normal for accounting purposes, just without the core dump.

When an update is available to fix a crash

When a crash (whether of an application or OS package) occurs, “Send error reports to Canonical” is checked, and Ubuntu hasn’t previously submitted this particular crash signature, it should send a basic crash signature to the server.

The error alert should appear as normal if Ubuntu does not send the signature, or if it neither receives a response within five seconds saying that a software update fixes the problem nor already knows this from a previous submission.

Otherwise, the usual error alert should change:

  • The secondary text should be “The good news is, a software update is available to fix this problem.”
  • The checkbox label should be “Send an error report anyway”.
  • An extra “Install Updates…” button should be present on the leading side of the trailing group (even if you are not an admin).

app-crash-reportable-updateable.png

Choosing “Install Updates…” should launch Software Updater (leaving the application closed, if it was an application crash).

We also considered changing the Software Updater UI to appear unprompted sooner, or to have custom text, when updates are known to fix problems users on your system have submitted. We decided against it because it would be less obvious.

When there is a debconf prompt

debconf prompts for user-installed software in Ubuntu are, overwhelmingly, programming mistakes. Therefore, they should be presented as error alerts.

#A Debconf prompt

debconf-reportable.png

debconf-unreportable.png

(targeted for 12.10 — how to contribute)

If the TITLE command is used, that string should be used as the title of the window instead of “Ubuntu”.

The primary text should depend on the type of package and the situation:

|| || Application || Non-application package ||
|| During installation || {Application Name} needs your help to finish installing. || The package “{package name}” needs your help to finish installing. ||
|| During postinst abort-remove || {Application Name} needs your help to finish its removal. || The package “{package name}” needs your help to finish its removal. ||

Controls should be included in the alert depending on the type of prompt.

#Type “string”

debconf-string.png

#Type “boolean”

debconf-boolean.png

#Type “select”

If there are six or fewer choices:

debconf-select-few.png

If there are more than six choices:

debconf-select-many.png

#Type “multiselect”

If there are six or fewer choices:

debconf-multiselect-few.png

If there are more than six choices:

debconf-multiselect-many.png

#Type “note”

debconf-note.png

#Type “text” or “error”

debconf-text.png

#Type “password”

debconf-password.png

Presenting debconf progress

This has nothing to do with error tracking, but is included here because the error tracker provides all the rest of debconf’s graphical UI.

When a maintainer script requests progress presentation (db_progress), the progress text and proportion should be shown in a progress window. Ideally, this progress window should morph to and/or from any consecutive debconf prompts.

debconf-progress.png

The window title should be of the form ‘Installing “package-name”’, ‘Reinstalling “package-name”’, ‘Updating “package-name”’, or ‘Removing “package-name”’ as appropriate, or “Ubuntu” if the type of operation is unknown. The “Skip” button should be present if the operation is skippable.

Invitation for metrics collection

The first time (and only the first time) that an administrator responds to an error alert, a second alert should appear afterwards to invite them to opt in to metrics collection. (The “Esc” key should activate “Don’t Send” in this alert, but the “Enter” key should not do anything.)

privacy-settings-alert.png

The “Privacy…” button should open System Settings to the Privacy panel. Choosing “Send” should be equivalent to checking “Send occasional system information to Canonical” in the Privacy settings.

Accessing previous reports

Choosing “Show Previous Reports” in the settings interface should open a Web page listing those reports.

previous-reports.png

To avoid end users getting lost in developer material, the page should have no global navigation.

To avoid privacy problems, it should be impossible to share the URL of the page. How?

Error reports should be listed in the order they were received, newest first, defaulting to the newest 50. The date received should link to the individual report.

If there are from 1 to 50 reports, the batch count should read only “Showing all {number}”, and there should be no batch navigation.

previous-reports-1-batch.png

If there are no reports at all, there should be no batch count, navigation, or table — just an explanatory sentence.

previous-reports-none.png

Client implementation

The apport client will write a .upload file alongside a .crash file to indicate that the crash should be sent to the crash database. A small C daemon (currently "whoopsie", previously "reporterd") will set up an inotify watch on the /var/crash directory, and any time one of these .upload files appears, it will upload the corresponding .crash file. It will do this if and only if there is an active Internet connection, as determined by watching the NetworkManager D-Bus API for connectivity events; otherwise, it will add the report to a queue for later processing.

We will ensure NetworkManager brings up the interfaces as early as possible, to enable us to file crash reports during boot.

This needs to be a daemon, rather than another path of the apport client code, to account for there not being an Internet connection at the time of the crash and for crashes during boot, when we cannot assume the user will get back to a known-good state to file the report.

The canonical example here is the scenario posed in Microsoft’s Windows Error Reporting paper, where a piece of malware was causing the core desktop application (explorer.exe) to crash. They were still able to receive crash reports, as their client software still submitted reports very early on in the boot process.

The apport crash file will be parsed into an intermediate data structure (currently a GHashTable), with the core dump stripped out, and then converted into BSON to be transmitted in an HTTP POST operation. The server will reply with a UUID for subsequent operations and, optionally, a command for further action. Initially, this will just be a command to upload the core dump.

A new field is being added to the apport crash file, StacktraceAddressSignature. The server will check for this field, and if it already has a retraced core dump generated from the same signature, it will reply with just the UUID of the crash report entry in the database, indicating that a core dump need not be submitted.

If, however, the server does reply with a request to upload the core dump, it will be sent as zlib compressed data in an HTTP POST operation.
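The split between metadata and core dump, and the zlib compression of the latter, can be sketched as follows. This is only an illustration in Python (the real daemon is written in C): the function names are hypothetical, the field names in the sample report are assumptions, and the BSON encoding and HTTP POST steps are left out.

```python
import zlib


def split_report(report):
    """Separate the metadata from the core dump (assumed to be the "CoreDump" field)."""
    metadata = {key: value for key, value in report.items() if key != "CoreDump"}
    core_dump = report.get("CoreDump")
    return metadata, core_dump


def compress_core_dump(core_dump):
    """zlib-compress the core dump bytes, as would be done before uploading them."""
    return zlib.compress(core_dump)


# Sample report with made-up contents; the CoreDump bytes are a placeholder,
# not a real core dump.
report = {
    "ExecutablePath": "/usr/bin/eog",
    "Signal": "11",
    "CoreDump": b"\x7fELF" + b"\x00" * 4096,
}
metadata, core_dump = split_report(report)
payload = compress_core_dump(core_dump)
print(len(core_dump), len(payload))
```

The metadata can be sent on its own first; the compressed payload is only needed if the server asks for the core dump.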

The URLs for posting will be of the form:

Crash reports will be cleaned up after 14 days, as the system may never be connected to the Internet.

If the reporter daemon crashes, it will write a crash file like any other application. Its upstart job will have the respawn flag set, with a limit in place so that it does not respawn endlessly.

If the reporter daemon moves to using apport-unpack to process the crash files, it should gracefully handle -ENOSPC.

Crash reports for applications that are not themselves part of Ubuntu packages will be handled. These will not be retraced, but they will be collected for statistical analysis. This removes the "the problem cannot be reported" dialog in Apport.

We will add an Origin and possibly a Site field to the apport reports, using the python-apt candidate.origins interface. This will allow us to answer questions like what percentage of crashes are coming from PPAs. More importantly, it will let us focus reports on packages from a particular PPA, like the unity-testing one.

errors.ubuntu.com

site-front-page.png

By default, the front page should begin with a graph of “Errors per 24 hours” (bug 1046269) for nearly all dates (bug 1053410) and all Ubuntu versions from which errors were recorded.

Once at least six months of data has been recorded, the main graph should be followed by a thumbnail navigation graph for selecting a date range. If you select a range there, the filter controls for the table should change to the same date range, though the reverse should not happen (entering a date range in the filter controls should not change the main graph).

Next should be controls for changing the graph and the table. A spinner should appear at the trailing end of the first row whenever the graph and/or table are being updated.

site-front-page-filters.png

The OS version menu should begin with an item for “all” (the default), then “Ubuntu {development version}”, then tracked released versions from newest to oldest. For example, “all”, “Ubuntu S”, “Ubuntu 13.04”, “Ubuntu 12.10”, “Ubuntu 12.04”.

The package combo box should have menu items for “all packages” (the default), “-proposed”, “ubuntu-desktop”, and any other useful package sets. The text field should accept either any of these special names, or a package name. If the text (once space-stripped and lower-cased) does not match any of those when focus leaves the field, it should flash red and have no effect on the graph or table, but should retain its contents so you can correct typos. Whenever the package combo box value is a single package name, the spinner should appear until a menu appears listing package versions to filter on, with the default being “all versions”.

site-front-page-filters-package.png
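The matching rule for the package text field can be sketched like this. It is a minimal illustration, assuming the set of valid package names is available to the page; `resolve_package_filter` and `SPECIAL_NAMES` are hypothetical names, not part of the errors.ubuntu.com code.

```python
# Special names taken from the description above; any other package sets
# would be added here.
SPECIAL_NAMES = {"all packages", "-proposed", "ubuntu-desktop"}


def resolve_package_filter(text, known_packages, special_names=SPECIAL_NAMES):
    """Return the normalized filter value, or None if the text matches nothing.

    The text is space-stripped and lower-cased before matching. A None result
    corresponds to the case where the field flashes red and keeps its contents.
    """
    candidate = text.strip().lower()
    if candidate in special_names or candidate in known_packages:
        return candidate
    return None


known = {"update-notifier", "eog"}
print(resolve_package_filter("  Ubuntu-Desktop ", known))
print(resolve_package_filter("update-notifer", known))  # typo, no match
```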

The date menu should control the contents of only the table, not the graph. It should have items for “the past day” (the default), “the past week”, “the past month”, “the past year”, and “the date range”. Whenever “the date range” is selected, date fields should appear alongside for specifying the date range. If you have never used these before, they should default to the past year. Otherwise, they should remember whichever dates you used last.

site-front-page-filters-date.png

Finally, the table should appear.

For both the graph and the table, whenever one of them is loading, is interactively updating, or failed to load, it should be semi-transparent. If the table fails to load or update, an error message should also appear alongside the table filter controls if there is room, or below them otherwise: an error icon, the text “Sorry, the table data didn’t load.”, and a “Retry” button (bug 1060037).

site-data-error.png

Server architecture

/ServerArchitecture has additional details.

How you can help

There are a few components to the error reporting system. To understand where to make your contribution, first understand how all the pieces fit together.

Anatomy of a crash

When an application crashes in Ubuntu, apport is called and a basic crash report is written into the /var/crash directory. This initial report contains the information that can be gathered quickly, such as the date and path to the application.

crash-report-basic.png

Meanwhile, another program called update-notifier is watching the /var/crash directory for new files. It sees that a .crash file has been created and runs Apport with this file as an argument. The following window then appears:

initial-apport-window.png

This is the first contact the user has with the issue since the application that crashed disappeared from view. At this point additional information is collected for the report. This occurs either when they press “Show Details”, so that these details may be presented to them for review, or when they dismiss the dialog with the “Send an error report to help fix this problem” box checked.

The additional details collected will be ones that could not be calculated quickly when the report was first created. As one example, the packages that this application depends on and their versions will be determined and written into the report.

When the user dismisses the dialog with the “Send an error report to help fix this problem” box checked, another file is created in the /var/crash directory with the same name as the crash report, but with a .upload extension. This file has no contents. It is just used as a signal to the program responsible for uploading the crash report that the user wants this report sent.

The program responsible for uploading crash reports is called Whoopsie. It’s always running on Ubuntu systems, watching the /var/crash directory for files ending in .upload. When it sees one of these, it checks whether there’s a high-speed Internet connection. If it cannot find one, it waits to send the report until later. Otherwise, it opens the matching .crash file, converts it into binary JSON (BSON) data, and sends this to http://daisy.ubuntu.com.
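The pairing of .upload markers with .crash reports can be illustrated with a one-off directory scan. This is a sketch in Python under stated assumptions: Whoopsie itself is a C daemon that watches the directory continuously, and the function name `reports_marked_for_upload` is hypothetical.

```python
from pathlib import Path


def reports_marked_for_upload(crash_dir):
    """Return the .crash files that have a matching .upload marker beside them."""
    crash_dir = Path(crash_dir)
    marked = []
    for marker in sorted(crash_dir.glob("*.upload")):
        # Same file name as the marker, but with a .crash extension.
        report = marker.with_suffix(".crash")
        if report.is_file():
            marked.append(report)
    return marked
```

Calling `reports_marked_for_upload("/var/crash")` would list the reports the user has agreed to send; reports without a marker are left alone.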

Along with the report, Whoopsie sends an obfuscated (SHA512) system identifier (DMI system UUID). This information is collected so that we can show a graph of the average errors per calendar day. It also lets us answer questions like, “is Ubuntu more stable in the first week of use or subsequent weeks?”
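Obfuscating the identifier amounts to sending a SHA-512 digest instead of the raw value. The following is a minimal sketch, assuming the identifier is available as a string; the function name and the sample UUID are made up for illustration, and this is not Whoopsie's actual code.

```python
import hashlib


def obfuscated_system_id(system_uuid):
    """Return the SHA-512 hex digest of a system identifier string."""
    # Stripping surrounding whitespace is an assumption made for this sketch.
    normalized = system_uuid.strip()
    return hashlib.sha512(normalized.encode("utf-8")).hexdigest()


# A made-up UUID, used purely for illustration.
example = obfuscated_system_id("12345678-1234-1234-1234-123456789abc")
print(len(example))
```

The same machine always produces the same digest, which is what allows reports to be counted per system without sending the raw identifier.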

The servers responsible for http://daisy.ubuntu.com receive these error reports, about 80,000 per day currently, and write them into a large Cassandra database.

Once the report is written into the database, it’s put through a process called “bucketing.” This takes the report and determines what makes it an instance of a larger problem. In its simplest form, this is a string that contains the path to the program that crashed, the signal that occurred, and the top few frames of the stack trace, all separated by colon characters:

/usr/bin/update-notifier:11:g_object_ref:g_list_foreach:get_mounts:get
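As a rough illustration of this simplest form, the signature above can be assembled by joining the fields with colons. This is a sketch, not Apport's actual implementation; the function name `crash_signature` and the `max_frames` cutoff are assumptions made for the example.

```python
def crash_signature(executable_path, signal_number, frames, max_frames=4):
    """Join the program path, signal number, and top stack frames with colons.

    max_frames is a hypothetical cutoff standing in for "the top few frames".
    """
    top_frames = list(frames)[:max_frames]
    return ":".join([executable_path, str(signal_number)] + top_frames)


signature = crash_signature(
    "/usr/bin/update-notifier",
    11,  # the signal number from the example above
    ["g_object_ref", "g_list_foreach", "get_mounts", "get"],
)
print(signature)
```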

Every time a crash is received, it is grouped with the other crashes that produce the same “crash signature.” We call this grouping a problem, or bucket. Every time a new crash is added to one of these buckets, a counter for the bucket is incremented for the date, month, and year that the crash was seen. This is how we identify the most important problems in Ubuntu: we present a table on http://errors.ubuntu.com of the buckets with the most instances:

update-manager-table.png
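The per-bucket counters and the resulting table can be sketched with an in-memory stand-in. The real system stores its counters in Cassandra; the class `BucketCounters`, its method names, and the period key formats below are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


class BucketCounters:
    """In-memory stand-in for the per-bucket counters (hypothetical)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, signature, seen_on):
        # One counter each for the day, month, and year the crash was seen.
        for period in (seen_on.strftime("%Y-%m-%d"),
                       seen_on.strftime("%Y-%m"),
                       seen_on.strftime("%Y")):
            self.counts[(signature, period)] += 1

    def top_buckets(self, period, limit=10):
        # Buckets with the most instances in the given period, largest first.
        in_period = Counter({sig: n for (sig, p), n in self.counts.items()
                             if p == period})
        return in_period.most_common(limit)


counters = BucketCounters()
counters.record("sig-a", date(2012, 11, 13))
counters.record("sig-a", date(2012, 11, 14))
counters.record("sig-b", date(2012, 11, 13))
print(counters.top_buckets("2012-11"))
```

Keeping a separate counter per period means the table for "the past month" or "the past year" can be read directly, without scanning individual reports.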

Anatomy of a crash, in detail

apport -> whoopsie -> daisy.ubuntu.com -> errors.ubuntu.com

  1. Apport is called by the kernel's core pattern handler and creates a report file in /var/crash.

  2. The update-notifier application watches for changes in /var/crash. It sees this newly written report file and calls /usr/share/apport/apport-gtk.

  3. Apport displays a graphical dialog asking the user if they want to report the issue.

  4. If the user chooses to report the issue, a .upload file is created in /var/crash. The whoopsie application is watching /var/crash for these. It waits until there is an active Internet connection, then finds the matching report for the .upload file and uploads it to https://daisy.ubuntu.com.

  5. Crashes come in two parts. There's the metadata associated with the crash (date, environment variables, Ubuntu release, ...) and the crash itself. This latter part is called a core dump and is often very large. In order to avoid requiring everyone submitting a crash to also submit a core dump, we use a first pass signature called a StacktraceAddressSignature. There may be a few StacktraceAddressSignatures for any CrashSignature.

  6. Daisy accepts these uploaded reports from users and writes them into a Cassandra database. If it has not yet received a core dump for this problem (checking the aforementioned StacktraceAddressSignature), then it replies to whoopsie asking for one.

  7. If requested, whoopsie sends up the core dump; Daisy writes it to disk (NFS), then puts the path to this file on a RabbitMQ queue for retracing.

  8. One of the retracers then picks this core dump off the queue and processes it into a stack trace. A crash signature is then computed from the stack trace; the database uses it as a unique identifier for this problem.

  9. Daisy then increments a counter for the bucket this crash signature belongs to, indicating the number of users who have experienced instances of this problem.
  10. The bucket counts are read and displayed on the Django http://errors.ubuntu.com website.

How can I test this?

To see this in action for yourself, simply send any process the SEGV signal:

eog & sleep 5 && pkill -SEGV eog

Get the code

mkdir -p ~/bzr
cd ~/bzr
bzr branch lp:daisy
bzr branch lp:errors
bzr branch lp:whoopsie
bzr branch lp:apport

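Find bugs to fix

  • We track all bugs against the Ubuntu Error Tracker project (https://bugs.launchpad.net/ubuntu-error-tracker).
  • Errors that occur within the server-side infrastructure are reported at https://errors.ubuntu.com/oops-local/.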

Further reading

  • /Contributing/Debconf
  • /Contributing/Errors

ErrorTracker (last edited 2022-11-09 21:58:30 by seth-arnold)