ErrorTracker

Differences between revisions 8 and 9
Revision 8 as of 2011-08-11 09:45:31
Size: 6246
Editor: mpt
Comment: Not just for crashes [as pointed out by Christopher James Halse Rogers]
Revision 9 as of 2011-08-11 09:49:35
Size: 6268
Editor: mpt
Comment: separate server design from requirements

Rationale

To help Ubuntu reach a standard of quality similar to competing operating systems, developers should spend less time asking for information on individual bug reports, and more time fixing the bugs that affect users most often.

To determine which bugs those are, we should collect crash and hang reports from as many people as possible, before and after release. This means not requiring people to sign in to any Web site, enter any text, submit hundreds of megabytes of data, receive e-mail, or do anything more complicated than clicking a button. An automated system should then analyze which problems are caused by the same bug. If developers need more information about a particular kind of crash, they should be able to configure the system to automatically retrieve that information when the problem next occurs.

Statistics collected by Microsoft show that a bug reported by their Windows Error Reporting system “is 4.5 to 5.1 times more likely to be fixed than a bug reported directly by a human”, that fixing the right 1 percent of bugs addresses 50 percent of customer issues, and that fixing 20 percent of bugs addresses 80 percent of customer issues.

Prior art

Windows Error Reporting is probably the most advanced crash reporting system. As described in “Debugging in the (very) large: Ten years of implementation and experience” (PDF), it uses progressive data collection, where developers can request more than the “minidump” if necessary to understand particular problems. It also automatically notifies users if a software update fixes their problem. And hardware vendors can see crash reports specific to their hardware.

(Screenshots: windows-app-progress.png, windows-app.png, windows-os.png)

Mac OS X has a CrashReporter system that submits crash data to Apple. As described in Technical Note TN2123, “There is currently no way for third party developers to access the reports submitted via CrashReporter”.

(Screenshots: mac-app.png, mac-plugin.png, mac-os.png, mac-hang.png)

As a result, some Mac software developers have created their own crash tracking systems, such as Adobe’s and Adium’s.

Mozilla uses Breakpad to collect and submit minidumps on the client side, and Socorro to analyze and present them on the server side. Anyone can access crash data at crash-stats.mozilla.com.

Client design

When there is an error, an alert should appear with text and buttons depending on the situation.

The problem can be reported

Your admin has blocked problem reporting

Part of the OS crashes

An application thread crashes

(no alert shown)

An application crashes

An application hangs

If you choose “Report…”, a secondary dialog should appear on top of the alert.

If you choose “Send”, the “Send more information automatically if requested” checkbox and “Send” button should become insensitive, and the “Privacy Policy” button should be replaced by a progress bar extending from the left margin to just before the “Cancel” button. If you have left “Send more information automatically if requested” checked, progress of the progress bar should be allocated amongst the subtasks of sending the initial report, asymptotically waiting for analysis, and sending any further information requested.
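One way to split the progress bar amongst the three subtasks is sketched below. The share given to each subtask, the exponential curve used for the "asymptotic" wait, and all function and parameter names are assumptions for illustration; the design above does not fix any of them.

```python
import math

# Hypothetical share of the bar for each subtask, in the order they run.
ORDER = ["send_report", "await_analysis", "send_extra"]
SHARES = {"send_report": 0.4, "await_analysis": 0.3, "send_extra": 0.3}


def waiting_fraction(elapsed_seconds, time_constant=10.0):
    """Fraction for the analysis wait: rises quickly, then creeps towards 1.0.

    The duration of the wait is unknown, so the bar keeps moving without
    ever claiming the subtask is finished.
    """
    return 1.0 - math.exp(-elapsed_seconds / time_constant)


def overall_progress(stage, stage_fraction, send_more=True):
    """Map progress within one subtask (0..1) to a position on the whole bar."""
    stage_fraction = min(max(stage_fraction, 0.0), 1.0)
    if not send_more:
        # Checkbox unchecked: sending the initial report is the only subtask.
        if stage != "send_report":
            raise ValueError("only the initial report is sent")
        return stage_fraction
    if stage not in SHARES:
        raise ValueError("unknown stage: %s" % stage)
    finished = sum(SHARES[s] for s in ORDER[:ORDER.index(stage)])
    return finished + SHARES[stage] * stage_fraction
```

For example, `overall_progress("await_analysis", waiting_fraction(elapsed))` would be recomputed on a timer while the client waits for the server's analysis, then jump to the start of the last segment once the server asks for more information.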

Once the problem report is cancelled or completed, the secondary dialog should close. In the primary alert, if the problem report was completed, “You can help fix the problem by submitting an error report.” should change to “Thanks for reporting this problem.” If you subsequently click “Report…” again, the “Send” button and the “Send more information automatically if requested” checkbox should be insensitive.
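The dialog behavior described above can be modelled as a small piece of state, independent of any toolkit. This is a minimal sketch; the class, method, and attribute names are hypothetical, and only the two quoted strings come from the design.

```python
from dataclasses import dataclass

INVITE_TEXT = "You can help fix the problem by submitting an error report."
THANKS_TEXT = "Thanks for reporting this problem."


@dataclass
class ReportAlert:
    """State of the primary alert and its secondary "Report…" dialog."""
    dialog_open: bool = False  # secondary dialog currently shown
    sending: bool = False      # a report is in progress
    reported: bool = False     # a report has been completed

    def click_report(self):
        self.dialog_open = True

    def click_send(self):
        self.sending = True

    def finish(self, completed):
        """Called when the report completes (True) or is cancelled (False)."""
        self.sending = False
        self.dialog_open = False
        if completed:
            self.reported = True

    @property
    def primary_text(self):
        return THANKS_TEXT if self.reported else INVITE_TEXT

    @property
    def send_controls_sensitive(self):
        # "Send" and the checkbox are insensitive while sending, and stay
        # insensitive once a report has been completed.
        return not (self.sending or self.reported)
```

Keeping this state separate from the widgets makes it straightforward to check the "click Report… again" case without a running user interface.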

Future work: If a software update is known to fix the problem, replace the primary alert with the software update alert (or progress window, depending on the update policy), with customized primary text.

Server design

Requirements

  • A collection of files is gathered client-side and inserted into the crash database record.
  • Processed versions of files (e.g. retracer output) can be added subsequently.
  • Some files must be kept private (e.g. core dumps).
  • Traces from multiple crash reports are algorithmically compared to find exact-dupes and likely-dupes.
  • Crash reports can be grouped by package, by distro release, or by both.
  • Statistics are generated to show the number of dupes (exact only, or exact plus likely) for each type of crash. Statistics can be provided by package, by distro release, by date range, or a combination.
  • Bug report(s) can be associated with a given set of crashes.
  • The user should have some way to check back on the status of their crash report; e.g. have some report ID they can look at to see statistics and/or any associated bug number.
  • For X and kernel crashes (at least), these reports need to be indexable by hardware. That is, we want to be able to answer both “how prevalent are GPU hangs on Intel hardware?” and “on what hardware does this GPU hang appear?”. Probably either DMI data or PCI IDs or both are needed for this.
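
Several of the requirements above (exact-dupe grouping, statistics by package and release, and indexing by hardware in both directions) can be illustrated with an in-memory sketch. Everything here is an assumption for illustration: the record fields, the idea of a precomputed `signature` string standing in for a compared trace, and the function names. Likely-dupe detection and private files are not covered.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashReport:
    report_id: str       # ID the user could use to check back later
    package: str
    release: str
    signature: str       # hypothetical: normalized trace; equal means exact dupe
    pci_ids: frozenset   # hypothetical: PCI IDs gathered client-side


def group_exact_dupes(reports):
    """Bucket reports whose signatures are identical."""
    buckets = defaultdict(list)
    for report in reports:
        buckets[report.signature].append(report)
    return dict(buckets)


def counts_by(reports, *fields):
    """Count reports per combination of fields, e.g. ("package", "release")."""
    return Counter(tuple(getattr(r, f) for f in fields) for r in reports)


def hardware_for_signature(reports, signature):
    """On what hardware does this crash appear?"""
    seen = Counter()
    for report in reports:
        if report.signature == signature:
            seen.update(report.pci_ids)
    return seen


def signatures_for_hardware(reports, pci_id):
    """Which crashes, and how often, occur on this hardware?"""
    return Counter(r.signature for r in reports if pci_id in r.pci_ids)
```

A real server would keep these indexes in a database rather than recomputing them per query, and would need a fuzzier comparison than string equality to find likely-dupes.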

ErrorTracker (last edited 2018-02-27 11:56:11 by mpt)